pdf-editing
Complete guide for reading and editing PDF documents with PyMuPDF.
| name | pdf-editing |
| description | Complete guide for reading and editing PDF documents with PyMuPDF. |
NEVER DO THESE:
TWO APPROACHES - CHOOSE THE RIGHT ONE:
For REPLACING text (e.g., updating name, email, DOB):
1. draw_rect() with white fill to cover old text
2. insert_text() at the SAME position
For TRUE REDACTION of sensitive data (e.g., student ID):
1. add_redact_annot(rect, fill=(1,1,1)) with WHITE fill
2. apply_redactions() to REMOVE text from PDF structure
3. insert_text() to add masked value (e.g., "****5678")
USE PYTHON WITH PyMuPDF (fitz) - it is pre-installed and produces the best results.
PyMuPDF preserves the text layer properly, making text extractable after editing. JavaScript libraries like pdf-lib may create text that tools like pypdf cannot extract.
# PyMuPDF is already installed - just use it
python3 -c "import fitz; print('PyMuPDF ready')"
import fitz
doc = fitz.open("input.pdf")
page = doc[0]
# Extract all text to understand the document
text = page.get_text()
print(text)
# search_for() returns list of rectangles where text is found
rects = page.search_for("Label Text")
if rects:
    rect = rects[0]
    # rect.x0, rect.y0 = top-left corner
    # rect.x1, rect.y1 = bottom-right corner
    print(f"Found at: ({rect.x0}, {rect.y0}) to ({rect.x1}, {rect.y1})")
# Insert text at a specific position
page.insert_text(
(x_position, y_position), # coordinates
"text to insert",
fontsize=11,
color=(0, 0, 0) # black
)
doc.save("output.pdf")
When a form field is empty (no existing value), insert text next to the label:
import fitz
doc = fitz.open("input.pdf")
page = doc[0]
# Find label, insert value to the right of it
label_rect = page.search_for("FIELD LABEL:")[0]
page.insert_text((label_rect.x1 + 5, label_rect.y1), "value", fontsize=11)
# For today's date
from datetime import datetime
date_rect = page.search_for("Date")[0]
today = datetime.now().strftime("%Y/%m/%d")
page.insert_text((date_rect.x1 + 5, date_rect.y1), today, fontsize=11)
# For signatures - insert the person's name
sig_rect = page.search_for("signature")[0]
page.insert_text((sig_rect.x1 + 5, sig_rect.y1), "Full Name", fontsize=12)
doc.save("output.pdf")
When you need to replace text that's already in the PDF with new/correct values:
WRONG - DO NOT DO THIS:
# WRONG: Adding text next to old value
page.insert_text((old_rect.x1 + 10, old_rect.y1), new_value) # NO!
# WRONG: Using strikethrough
page.draw_line(start, end, color=(0,0,0)) # NO!
# WRONG: Rasterizing to image
pix = page.get_pixmap() # NO! Rebuilding the page from an image destroys the text layer
RIGHT - DO THIS:
import fitz
doc = fitz.open("input.pdf")
page = doc[0]
# First, extract text to see what's in the PDF
current_text = page.get_text()
print(current_text) # Examine what values exist
# To replace a value you found that needs changing:
old_value = "..." # The value you found in the PDF that's wrong
new_value = "..." # The correct value from your input source
rects = page.search_for(old_value)
if rects:
    rect = rects[0]
    # STEP 1: Draw WHITE rectangle to COMPLETELY COVER old text
    page.draw_rect(rect, color=(1, 1, 1), fill=(1, 1, 1), width=0)
    # STEP 2: Insert new text at the SAME position (not offset!)
    page.insert_text((rect.x0, rect.y1), new_value, fontsize=11, color=(0, 0, 0))
doc.save("output.pdf")
Key points:
- draw_rect(rect, fill=(1,1,1), width=0) draws a WHITE filled rectangle
- (1, 1, 1) is white in RGB (0-1 scale)
- Insert the new text at rect.x0 (same X position), NOT rect.x1 + offset
For sensitive data like student IDs, you must use TRUE REDACTION that removes the original text from the PDF structure. A white rectangle only VISUALLY covers text - tools like pypdf can still extract the hidden text!
CRITICAL DISTINCTION:
- draw_rect() = Visual cover only (text still extractable by machines)
- add_redact_annot() + apply_redactions() = TRUE redaction (text removed from PDF)
Example: "A12345678" should become "****5678" (show only last 4 digits)
import fitz
doc = fitz.open("input.pdf")
page = doc[0]
# 1. First find what value is in the PDF by reading the text
pdf_text = page.get_text()
# Find the ID in the text (e.g., "A88888888")
original_in_pdf = "A88888888" # Value you found in the PDF
# Extract just the digits for masking
digits = ''.join(c for c in original_in_pdf if c.isdigit())
masked = "****" + digits[-4:] # Result: "****8888"
rects = page.search_for(original_in_pdf)
if rects:
    rect = rects[0]
    # STEP 1: Create TIGHT bounding box to avoid covering nearby labels
    tight_rect = fitz.Rect(rect.x0, rect.y0 + 8, rect.x1, rect.y1 - 2)
    # STEP 2: Add redaction annotation with WHITE fill (not black!)
    page.add_redact_annot(tight_rect, fill=(1, 1, 1))  # WHITE fill
    # STEP 3: Apply redactions - this REMOVES the text from PDF structure
    page.apply_redactions()
    # STEP 4: Insert masked value at the same position
    page.insert_text((rect.x0, rect.y1), masked, fontsize=11, color=(0, 0, 0))
doc.save("output.pdf")
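The masking step in the example above can be pulled into a small function so the rule (four asterisks plus the last four digits) is applied the same way everywhere. This is a sketch: the name mask_id and its visible parameter are hypothetical, and the handling of very short IDs is an added assumption that the original does not specify.

```python
def mask_id(value, visible=4):
    """Return "****" followed by the last `visible` digits of value."""
    digits = "".join(c for c in value if c.isdigit())
    # Assumption: if there are not more digits than we would reveal,
    # reveal nothing rather than exposing the whole ID.
    if visible <= 0 or len(digits) <= visible:
        return "****"
    return "****" + digits[-visible:]


print(mask_id("A12345678"))
print(mask_id("A88888888"))
```

The guard matters because slicing a short string with digits[-4:] would otherwise return every digit unmasked.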
Why this works:
- add_redact_annot(rect, fill=(1, 1, 1)) - marks area with WHITE rectangle
- apply_redactions() - REMOVES the underlying text from PDF structure
- insert_text() - adds the masked value
WRONG approaches:
# WRONG: Black box redaction (ugly and suspicious)
page.add_redact_annot(rect, fill=(0, 0, 0)) # NO! Use white fill
# WRONG: Only using draw_rect (text still extractable!)
page.draw_rect(rect, fill=(1, 1, 1)) # NO! This only covers visually
page.insert_text(...) # Original text is still in PDF structure!
# WRONG: Adding masked value NEXT TO original
page.insert_text((rect.x1 + 10, rect.y1), masked) # NO!
import fitz
doc = fitz.open("input.pdf")
page = doc[0]
# Step 1: Read PDF to see current content
pdf_text = page.get_text()
print(pdf_text)
# Step 2: Read your input source to get correct values
# (parse input file, extract the values you need)
# Step 3: For each value that differs, replace it
# Build a dict of {old_value_in_pdf: correct_value}
replacements = {} # Populate by comparing PDF content with correct values
for old_val, new_val in replacements.items():
    if old_val == new_val:
        continue  # Skip if already correct
    rects = page.search_for(old_val)
    if rects:
        rect = rects[0]
        # Cover old text with white rectangle
        page.draw_rect(rect, color=(1, 1, 1), fill=(1, 1, 1), width=0)
        # Insert new text
        page.insert_text((rect.x0, rect.y1), new_val, fontsize=11, color=(0, 0, 0))
doc.save("output.pdf")
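The workflow above leaves the replacements dict empty, to be populated by comparing the PDF with the input source. One way to build it, assuming both sides have already been parsed into dicts keyed by field name, is sketched here. The function build_replacements and the sample field names and values are hypothetical.

```python
def build_replacements(pdf_values, correct_values):
    """Map each outdated value found in the PDF to its correct value."""
    replacements = {}
    for field, correct in correct_values.items():
        current = pdf_values.get(field)
        # Only fields that exist in the PDF and differ need replacing.
        if current and current != correct:
            replacements[current] = correct
    return replacements


# Sample data (made up for illustration).
pdf_values = {"name": "Jon Smith", "email": "jon@example.com", "dob": "1990/01/01"}
correct_values = {"name": "John Smith", "email": "jon@example.com", "dob": "1990/01/01"}

replacements = build_replacements(pdf_values, correct_values)
print(replacements)
```

Fields that are missing from the PDF are skipped here. They are empty fields, so they belong to the fill-next-to-label approach, not to replacement.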
WARNING: pdf-lib may create text that cannot be extracted by pypdf. Use Python with PyMuPDF instead whenever possible.
If you must use Node.js/JavaScript, use pdf-lib with the same approach:
const { PDFDocument, rgb } = require('pdf-lib');
const fs = require('fs');
async function editPdf() {
  const pdfBytes = fs.readFileSync('input.pdf');
  const pdfDoc = await PDFDocument.load(pdfBytes);
  const page = pdfDoc.getPages()[0];
  const { height } = page.getSize();
  // To replace text at known coordinates:
  // STEP 1: Draw WHITE rectangle to cover old text
  page.drawRectangle({
    x: oldTextX,
    y: oldTextY,
    width: oldTextWidth,
    height: oldTextHeight,
    color: rgb(1, 1, 1), // WHITE
  });
  // STEP 2: Draw new text at the SAME position
  page.drawText('New Value', {
    x: oldTextX, // SAME X position, not offset!
    y: oldTextY,
    size: 11,
    color: rgb(0, 0, 0),
  });
  fs.writeFileSync('output.pdf', await pdfDoc.save());
}
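The pdf-lib example takes its coordinates as known. If they come from a PyMuPDF search_for() rectangle, note that PyMuPDF measures y downward from the top of the page, while pdf-lib uses the PDF convention of measuring upward from the bottom-left corner. A minimal conversion sketch follows, assuming an unrotated page whose origin is at (0, 0). The function name to_pdf_lib_box is hypothetical.

```python
def to_pdf_lib_box(x0, y0, x1, y1, page_height):
    """Convert a top-left-origin rect (PyMuPDF style) to pdf-lib drawRectangle options."""
    return {
        "x": x0,
        "y": page_height - y1,  # bottom edge of the box, measured from the page bottom
        "width": x1 - x0,
        "height": y1 - y0,
    }


# Example: a 100 x 12 pt box near the top of a US Letter page (792 pt tall).
box = to_pdf_lib_box(72, 100, 172, 112, page_height=792)
print(box)
```

The resulting x, y, width and height correspond to oldTextX, oldTextY, oldTextWidth and oldTextHeight in the JavaScript above.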
WRONG with pdf-lib:
// WRONG: Adding text to the right of old value
page.drawText('New Value', {
  x: oldTextX + oldTextWidth + 10, // NO! Don't offset
  ...
});
// WRONG: Drawing strikethrough line
page.drawLine({
  start: { x: x1, y: y },
  end: { x: x2, y: y },
  color: rgb(0, 0, 0), // NO strikethrough!
});