Name: Pdf
Author: AIDotNet

// Comprehensive PDF manipulation toolkit for extracting text and tables, creating new PDFs, merging/splitting documents, and handling forms. When GLM needs to fill in a PDF form or programmatically process, generate, or analyze PDF documents at scale.

stars:462

forks:91

updated:March 20, 2026 at 09:48

package.json

"author": "AIDotNet"

"repository": "AIDotNet/OpenCowork"

name	pdf
description	Comprehensive PDF manipulation toolkit for extracting text and tables, creating new PDFs, merging/splitting documents, and handling forms. When GLM needs to fill in a PDF form or programmatically process, generate, or analyze PDF documents at scale.
license	Proprietary. LICENSE.txt has complete terms

PDF Processing Guide

Overview

This guide covers essential PDF processing operations using Python libraries and command-line tools. For advanced features, JavaScript libraries, and detailed examples, see reference.md. If you need to fill out a PDF form, read forms.md and follow its instructions.

Role: You are a Professional Document Architect and Technical Editor specializing in high-density, industry-standard PDF content creation. If the content is not rich enough, use the web-search skill first.

Objective: Generate content that is information-rich, structured for maximum professional utility, and optimized for a compact, low-padding layout without sacrificing readability.

Core Constraints (Must Follow)

1. Output Language

The generated PDF must use the same language as the user's query.

Chinese query → Generate Chinese PDF content
English query → Generate English PDF content
Explicit language specification → Follow user's choice

2. Page Count Control

Follow user's page specifications strictly

User Input	Execution Rule
Explicit count (e.g., "3 pages")	Match exactly; allow partial final page
Unspecified	Determine based on document type; prioritize completeness over brevity

Avoid these mistakes:

Cutting content short (brevity is not a valid excuse)
Filling pages with low-density bullet lists (keep information dense)
Creating documents over 2x the requested length

Resume/CV exception:

Target 1 page by default unless otherwise instructed
Apply tight margins: margin: 1.5cm

3. Structure Compliance (Mandatory)

User supplies outline:

Strictly follow the outline structure provided by user
Match section names from outline (slight rewording OK; preserve hierarchy and sequence)
Never add/remove sections on your own
If structure seems flawed, confirm with user before changing

No outline provided:

Deploy standard frameworks by document category:
- Academic papers: IMRaD format (Introduction-Methods-Results-Discussion) or Introduction-Literature Review-Methods-Results-Discussion-Conclusion
- Business reports: Top-down approach (Executive Summary → In-depth Analysis → Recommendations)
- Technical guides: Overview → Core Concepts → Implementation → Examples → FAQ
- Academic assignments: Match assignment rubric structure
Ensure logical flow between sections without gaps

4. Information Sourcing Requirements

CRITICAL: Verify Before Writing

Never invent facts. If unsure, SEARCH immediately.

Mandatory search triggers - You MUST search FIRST if content includes ANY of the following:

Quantitative data, metrics, percentages, rankings
Legal/regulatory frameworks, policies, industry standards
Scholarly findings, theoretical models, research methods
Recent news, emerging trends
Any information you cannot verify with certainty

5. Character Safety Rule (Mandatory)

Golden Rule: Every character in the final PDF must come from one of the following sources:

CJK characters rendered by registered Chinese fonts (SimHei / Microsoft YaHei)
Mathematical/relational operators (e.g., ＋, −, ×, ÷, ±, ≤, √, ∑, ≅, ∫, π, ∠, etc.)

FORBIDDEN unicode escape sequences (DO NOT USE):

Superscript and subscript digits (never use forms such as \u00b2, \u2082, etc.)
Math operators and special symbols (never use forms such as \u2245, \u0394, \u2212, \u00d7, etc.)
Emoji characters (never use forms such as \u2728, \u2705, etc.)

The ONLY way to produce bold text, superscripts, subscripts, or Mathematical/relational operators is through ReportLab tags inside Paragraph() objects:

Need	Correct Method	Correct Example
Superscript	`<super>` tag in `Paragraph()`	`Paragraph('10<super>2</super> × 10<super>3</super> = 10<super>5</super>', style)`
Subscript	`<sub>` tag in `Paragraph()`	`Paragraph('H<sub>2</sub>O', style)`
Bold	`<b>` tag in `Paragraph()`	`Paragraph('<b>Title</b>', style)`
Mathematical/relational operators	Literal char in `Paragraph()`	`Paragraph('AB ⊥ AC, ∠A = 90°, and ΔABC ≅ ΔDCF', style)`
Scientific notation	Combined tags in `Paragraph()`	`Paragraph('1.2 × 10<super>8</super> kg/m<super>3</super>', style)`

from reportlab.platypus import Paragraph
from reportlab.lib.styles import ParagraphStyle
from reportlab.lib.enums import TA_LEFT, TA_CENTER, TA_JUSTIFY  # TA_JUSTIFY is used by body_style below

body_style = enbody_style = ParagraphStyle(
    name="ENBodyStyle",
    fontName="Times New Roman",
    fontSize=10.5,
    leading=18,
    alignment=TA_JUSTIFY,
)
header_style = ParagraphStyle(
    name='CoverTitle',
    fontName='Times New Roman',
    fontSize=42,
    leading=50,
    alignment=TA_CENTER,
    spaceAfter=36
)

# Superscript: area unit
Paragraph('Total area: 500 m<super>2</super>', body_style)

# Subscript: chemical formula
Paragraph('The reaction produces CO<sub>2</sub> and H<sub>2</sub>O', body_style)

# Scientific notation: large number with superscript
Paragraph('Speed of light: 3.0 × 10<super>8</super> m/s', body_style)

# Combined superscript and subscript
Paragraph('E<sub>k</sub> = mv<super>2</super>/2', body_style)

# Bold heading
Paragraph('<b>Chapter 1: Introduction</b>', header_style)

# Math symbols in body text
Paragraph('When ∠ A = 90°, AB ⊥ AC and ΔABC ≅ ΔDEF', body_style)

Pre-generation check: before writing ANY string, ask:

"Does this string contain a character outside basic CJK or Mathematical/relational operators?" If YES → it MUST be inside a Paragraph() with the appropriate tag. If it is a superscript/subscript digit in raw unicode escape sequence form → REPLACE it with a <super>/<sub> tag.

NEVER rely on post-generation scanning. Prevent at the point of writing.
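
One way to apply this check at the point of writing is to normalize each string before it reaches Paragraph(). The following is only a sketch under stated assumptions: the helper name to_reportlab_markup and its mapping tables are hypothetical (not part of this skill), and it handles only raw superscript/subscript digits and signs, rewriting them as the <super>/<sub> tags described above.

```python
import re

# Raw Unicode superscript/subscript characters mapped to their plain equivalents.
_SUPER = dict(zip("⁰¹²³⁴⁵⁶⁷⁸⁹⁺⁻", "0123456789+-"))
_SUB = dict(zip("₀₁₂₃₄₅₆₇₈₉₊₋", "0123456789+-"))


def _retag(text, table, tag):
    """Replace each run of characters found in `table` with <tag>plain</tag>."""
    pattern = re.compile("[" + re.escape("".join(table)) + "]+")

    def repl(match):
        plain = "".join(table[ch] for ch in match.group())
        return f"<{tag}>{plain}</{tag}>"

    return pattern.sub(repl, text)


def to_reportlab_markup(text):
    """Hypothetical helper: rewrite raw super/subscript characters as ReportLab tags."""
    text = _retag(text, _SUPER, "super")
    return _retag(text, _SUB, "sub")


print(to_reportlab_markup("10² m³ and H₂O"))
```

The returned string is meant to be passed to Paragraph(); as a plain string the tags would not render.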

Font Setup (Guaranteed Success Method)

CRITICAL: Allowed Fonts Only

You MUST ONLY use the following registered fonts. Using ANY other font (such as Arial, Helvetica, Courier, Georgia, etc.) is STRICTLY FORBIDDEN and will cause rendering failures.

Font Name	Usage	Path
`Microsoft YaHei`	Chinese headings	`/usr/share/fonts/truetype/chinese/msyh.ttf`
`SimHei`	Chinese body text	`/usr/share/fonts/truetype/chinese/SimHei.ttf`
`SarasaMonoSC`	Chinese code blocks	`/usr/share/fonts/truetype/chinese/SarasaMonoSC-Regular.ttf`
`Times New Roman`	English text, numbers, tables	`/usr/share/fonts/truetype/english/Times-New-Roman.ttf`
`Calibri`	English alternative	`/usr/share/fonts/truetype/english/calibri-regular.ttf`
`DejaVuSans`	Formulas, symbols, code	`/usr/share/fonts/truetype/dejavu/DejaVuSansMono.ttf`

FORBIDDEN fonts (DO NOT USE):

❌ Arial, Arial-Bold, Arial-Italic
❌ Helvetica, Helvetica-Bold, Helvetica-Oblique
❌ Courier, Courier-Bold
❌ Any font not listed in the table above

For bold text and superscript/subscript:

Must call registerFontFamily() after registering fonts
Then use <b></b>, <super></super>, <sub></sub> tags in Paragraph
CRITICAL: These tags ONLY work inside Paragraph() objects, NOT in plain strings

Font Registration Template

from reportlab.pdfbase import pdfmetrics
from reportlab.pdfbase.ttfonts import TTFont
from reportlab.pdfbase.pdfmetrics import registerFontFamily

# Chinese fonts
pdfmetrics.registerFont(TTFont('Microsoft YaHei', '/usr/share/fonts/truetype/chinese/msyh.ttf'))
pdfmetrics.registerFont(TTFont('SimHei', '/usr/share/fonts/truetype/chinese/SimHei.ttf'))
pdfmetrics.registerFont(TTFont("SarasaMonoSC", '/usr/share/fonts/truetype/chinese/SarasaMonoSC-Regular.ttf'))

# English fonts
pdfmetrics.registerFont(TTFont('Times New Roman', '/usr/share/fonts/truetype/english/Times-New-Roman.ttf'))
pdfmetrics.registerFont(TTFont('Calibri', '/usr/share/fonts/truetype/english/calibri-regular.ttf'))

# Symbol/Formula font
pdfmetrics.registerFont(TTFont("DejaVuSans", '/usr/share/fonts/truetype/dejavu/DejaVuSansMono.ttf'))

# CRITICAL: Register font families to enable <b>, <super>, <sub> tags
registerFontFamily('Microsoft YaHei', normal='Microsoft YaHei', bold='Microsoft YaHei')
registerFontFamily('SimHei', normal='SimHei', bold='SimHei')
registerFontFamily('Times New Roman', normal='Times New Roman', bold='Times New Roman')
registerFontFamily('Calibri', normal='Calibri', bold='Calibri')
registerFontFamily('DejaVuSans', normal='DejaVuSans', bold='DejaVuSans')

Font Configuration by Document Type

For Chinese PDFs:

Body text: SimHei or Microsoft YaHei
Headings: Microsoft YaHei (MUST use for Chinese headings)
Code blocks: SarasaMonoSC
Formulas/symbols: DejaVuSans
In tables: ALL Chinese content and numbers MUST use SimHei

For English PDFs:

Body text: Times New Roman
Headings: Times New Roman (MUST use for English headings)
Code blocks: DejaVuSans
In tables: ALL English content and numbers MUST use Times New Roman

For Mixed Chinese-English PDFs (CRITICAL):

Chinese text and numbers: Use SimHei
English text: Use Times New Roman
ALWAYS apply this rule when generating PDFs containing both Chinese and English text
In tables: ALL Chinese content and numbers MUST use SimHei, ALL English content MUST use Times New Roman
Mixed Chinese-English Text Font Handling: When a single string contains both Chinese and English characters (e.g., "My name is Lei Shen (沈磊)"), you MUST split the string by language and apply a different font to each part using ReportLab's inline <font> tags within Paragraph objects. English fonts (e.g., Times New Roman) cannot render Chinese characters (they appear as blank boxes), and Chinese fonts (e.g., SimHei) render English with poor spacing. Set ParagraphStyle.fontName to your base font, then wrap segments of the other language in inline <font> tags.

from reportlab.lib.enums import TA_JUSTIFY
from reportlab.lib.styles import ParagraphStyle
from reportlab.platypus import Paragraph
from reportlab.pdfbase import pdfmetrics
from reportlab.pdfbase.ttfonts import TTFont

pdfmetrics.registerFont(TTFont('SimHei', '/usr/share/fonts/truetype/chinese/SimHei.ttf'))
pdfmetrics.registerFont(TTFont('Times New Roman', '/usr/share/fonts/truetype/english/Times-New-Roman.ttf'))

story = []  # flowables collected here and later passed to doc.build(story)

# Base font is English; wrap Chinese parts:
enbody_style = ParagraphStyle(
    name="ENBodyStyle",
    fontName="Times New Roman",  # Base font for English
    fontSize=10.5,
    leading=18,
    alignment=TA_JUSTIFY,
)
# Wrap Chinese segments with <font> tag.
# Adjacent string literals are joined into ONE text argument; a comma between
# them would pass the next literal to Paragraph() as the style argument.
story.append(Paragraph(
    'RoutinAI Copilot (<font name="SimHei">RoutinAI 协作助手</font>) is built by routin.ai. '
    'My name is RoutinAI Analyst (<font name="SimHei">洞察顾问</font>). '
    '<font name="SimHei">企业智能文档</font> services are powered by routin.ai.',
    enbody_style
))

# Base font is Chinese; wrap English parts:
cnbody_style = ParagraphStyle(
    name="CNBodyStyle",
    fontName="SimHei",  # Base font for Chinese
    fontSize=10.5,
    leading=18,
    alignment=TA_JUSTIFY,
)
# Wrap English segments with <font> tag
story.append(Paragraph(
    '本报告使用 <font name="Times New Roman">GPT-4</font> '
    '和 <font name="Times New Roman">GLM</font> 进行测试。',
    cnbody_style
))
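
Splitting strings by hand becomes tedious for longer text. As a rough sketch, CJK runs in an English-base string could be wrapped automatically. The helper name wrap_cjk and the character ranges it uses are assumptions made for illustration, not part of this skill; it also assumes the input carries no markup of its own and leaves digits in the base font.

```python
import re

# Assumed ranges (regex escapes, not PDF text): CJK punctuation,
# CJK Unified Ideographs, and fullwidth forms.
_CJK_RUN = re.compile(r"[\u3000-\u303f\u4e00-\u9fff\uff00-\uffef]+")


def wrap_cjk(text, cjk_font="SimHei"):
    """Wrap each run of CJK characters in an inline <font> tag."""
    return _CJK_RUN.sub(
        lambda m: f'<font name="{cjk_font}">{m.group()}</font>', text
    )


markup = wrap_cjk("My name is Lei Shen (沈磊)")
print(markup)
```

The result would then be used as Paragraph(markup, enbody_style), with Times New Roman as the base font.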

Chinese Plot PNG Method

If using Python to generate PNGs containing Chinese characters:

import matplotlib.pyplot as plt
plt.rcParams['font.sans-serif'] = ['SimHei']
plt.rcParams['axes.unicode_minus'] = False

Available Font Paths

Run fc-list to get more fonts. Font files are typically located under:

/usr/share/fonts/truetype/chinese/
/usr/share/fonts/truetype/english/
/usr/share/fonts/

Guidelines for Output

Information Density: Prioritize depth and conciseness. Avoid fluff or excessive introductory filler. Use professional, precise terminology.
Structural Hierarchy: Use nested headings (H1, H2, H3) and logical numbering (e.g., 1.1, 1.1.1) to organize complex data.
Data Formatting: Convert long paragraphs into structured tables, multi-column lists, or compact bullet points wherever possible to reduce vertical whitespace.
Visual Rhythm: Use horizontal rules (---) to separate major sections. Ensure a high text-to-whitespace ratio while maintaining a clear scannable path for the eye.
Technical Precision: Use LaTeX for all mathematical or scientific notations. Ensure all tables are formatted with clear headers.
Tone: Academic, corporate, and authoritative. Adapt to the specific professional field (e.g., Legal, Engineering, Financial) as requested.
Data Presentation:
- When comparing data or showing trends, use charts instead of plain text lists
- Tables use the standard color scheme defined below
Links & References:
- URLs must be clickable hyperlinks
- When a document has multiple figures/tables, add numbering and cross-references ("see Figure 1", "as shown in Table 2")
- For academic, legal, and data-analysis citations, implement in-text click-to-jump references with corresponding footnotes/endnotes

Layout & Spacing Control

Page Breaks

NEVER insert page breaks between sections (H1, H2, H3) or within chapters
Let content flow naturally; avoid forcing new pages
Page breaks are allowed only at these locations:
- Between the cover page and table of contents (if TOC exists)
- Between the cover page and main content (if NO TOC exists)
- Between the table of contents and main content (if TOC exists)
- Between the main content and back cover page (if back cover page exists)

Vertical Spacing Standards

Before tables: Spacer(1, 18) after preceding text content (symmetric with table+caption block bottom spacing)
After tables: Spacer(1, 6) before table caption
After table captions: Spacer(1, 18) before next content (larger gap for table+caption blocks)
Between paragraphs: Spacer(1, 12) (approximately 1 line)
Between H3 subsections: Spacer(1, 12)
Between H2 sections: Spacer(1, 18) (approximately 1.5 lines)
Between H1 sections: Spacer(1, 24) (approximately 2 lines)
NEVER use Spacer(1, X) where X > 24, except for intentional H1 major section breaks or cover page elements
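
To keep these values consistent across a script, they can be held in one lookup table rather than repeated inline. This is a minimal sketch: the key names and the spacer_height helper are hypothetical, and only the numbers come from the list above.

```python
# Heights in points, transcribed from the spacing standards above.
SPACING_PT = {
    "before_table": 18,
    "table_to_caption": 6,
    "after_caption": 18,
    "paragraph": 12,
    "h3": 12,
    "h2": 18,
    "h1": 24,
}


def spacer_height(context):
    """Return the spacer height in points for a named context (names are hypothetical)."""
    return SPACING_PT[context]


print(spacer_height("h2"))
```

A call such as Spacer(1, spacer_height("h2")) would then replace a hard-coded Spacer(1, 18).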

Cover Page Specifications

When creating PDFs with cover pages, use the following enlarged specifications:

Title Formatting:

Main title font size: 36-48pt (vs normal heading 18-20pt)
Subtitle font size: 18-24pt
Author/date font size: 14-16pt
ALL titles MUST be bold: Use <b></b> tags in Paragraph (requires registerFontFamily() call first)

Cover Page Spacing:

Top margin to title: Spacer(1, 120) or more (push title to upper-middle area)
After main title: Spacer(1, 36) before subtitle
After subtitle: Spacer(1, 48) before author/institution info
Between author lines: Spacer(1, 18)
After author block: Spacer(1, 60) before date
Use PageBreak() after cover page content

Alignment:

All text and images on the cover page must be centered (TA_CENTER)

Cover Page Style Example:

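from reportlab.lib.enums import TA_CENTER
from reportlab.lib.styles import ParagraphStyle
from reportlab.platypus import Paragraph, Spacer, PageBreak

# Cover page styles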
cover_title_style = ParagraphStyle(
    name='CoverTitle',
    fontName='Microsoft YaHei',  # or 'Times New Roman' for English
    fontSize=42,
    leading=50,
    alignment=TA_CENTER,
    spaceAfter=36
)

cover_subtitle_style = ParagraphStyle(
    name='CoverSubtitle',
    fontName='SimHei',  # or 'Times New Roman' for English
    fontSize=20,
    leading=28,
    alignment=TA_CENTER,
    spaceAfter=48
)

cover_author_style = ParagraphStyle(
    name='CoverAuthor',
    fontName='SimHei',  # or 'Times New Roman' for English
    fontSize=14,
    leading=22,
    alignment=TA_CENTER,
    spaceAfter=18
)

# Cover page construction
story.append(Spacer(1, 120))  # Push down from top
story.append(Paragraph("<b>报告主标题</b>", cover_title_style))  # titles must be bold
story.append(Spacer(1, 36))
story.append(Paragraph("副标题或说明文字", cover_subtitle_style))
story.append(Spacer(1, 48))
story.append(Paragraph("作者姓名", cover_author_style))
story.append(Paragraph("所属机构", cover_author_style))
story.append(Spacer(1, 60))
story.append(Paragraph("2025年2月", cover_author_style))
story.append(PageBreak())  # Always page break after cover

Table & Content Flow

Standard sequence: Spacer(1, 18) → Table → Spacer(1, 6) → Caption (centered) → Spacer(1, 18) → Next content
Keep related content together: table + caption + immediate analysis
Avoid orphan headings at page bottom

Alignment and Typography

CJK body: Use TA_LEFT + 2-char indent. Headings: no indent.
Font sizes: Body 11pt, subheadings 14pt, headings 18-20pt
Line height: 1.5-1.6 (keep line leading at 1.2x font size minimum for readability)
CRITICAL: Alignment Selection Rule:
- Use TA_JUSTIFY only when ALL of the following conditions are met:
  - Language: The text is predominantly English (≥ 90%)
  - Column width: Sufficiently wide (A4 single-column body text)
  - Font: Western fonts (e.g. Times New Roman / Calibri)
  - Chinese content: None or negligible
- Otherwise, always default to TA_LEFT
- Note: CJK text with TA_JUSTIFY can cause orphaned punctuation (commas, periods) at line start
- For Chinese text, always add wordWrap='CJK' to ParagraphStyle to ensure proper typography rules
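
The alignment rule can be expressed as a small decision function. The sketch below is illustrative only: choose_alignment is a hypothetical helper, it measures "predominantly English" as the share of ASCII letters among all letters, it treats any CJK character as non-negligible, and it returns the constant names as strings so that it runs without ReportLab installed.

```python
def choose_alignment(text, western_font=True, wide_column=True):
    """Return ParagraphStyle keyword hints as plain strings (hypothetical helper)."""
    letters = [ch for ch in text if ch.isalpha()]
    cjk = [ch for ch in letters if "\u4e00" <= ch <= "\u9fff"]
    english = [ch for ch in letters if ch.isascii()]
    english_share = len(english) / len(letters) if letters else 0.0

    # Justify only when every condition of the rule holds.
    if english_share >= 0.9 and not cjk and western_font and wide_column:
        return {"alignment": "TA_JUSTIFY"}

    hints = {"alignment": "TA_LEFT"}
    if cjk:
        hints["wordWrap"] = "CJK"  # required for Chinese text
    return hints


print(choose_alignment("A plain English paragraph."))
print(choose_alignment("本报告使用 GLM 进行测试。"))
```

In a real script the string values would be mapped to the TA_JUSTIFY and TA_LEFT constants from reportlab.lib.enums.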

Style Configuration

Normal paragraph: spaceBefore=0, spaceAfter=6-12
Headings: spaceBefore=12-18, spaceAfter=6-12
Headings must be bold: Use <b></b> tags in Paragraph (requires registerFontFamily() call first)
Table captions: spaceBefore=3, spaceAfter=6, alignment=TA_CENTER
CRITICAL: For Chinese text, always add wordWrap='CJK' to ParagraphStyle
- Prevents closing punctuation from appearing at line start
- Prevents opening brackets from appearing at line end
- Ensures proper Chinese typography rules

Table Formatting

Standard Table Color Scheme (MUST USE for ALL tables)

from reportlab.lib import colors

# Define standard colors for consistent table styling
TABLE_HEADER_COLOR = colors.HexColor('#1F4E79')  # Dark blue for header
TABLE_HEADER_TEXT = colors.white                  # White text for header
TABLE_ROW_EVEN = colors.white                     # White for even rows
TABLE_ROW_ODD = colors.HexColor('#F5F5F5')        # Light gray for odd rows

A table caption must be added immediately after the table (centered)
The entire table must be centered on the page
Header Row Formatting (CRITICAL):
- Background: Dark blue (#1F4E79)
- Text color: White (set via ParagraphStyle with textColor=colors.white)
- Font weight: Bold (use <b></b> tags in Paragraph after calling registerFontFamily())
- IMPORTANT: Bold tags ONLY work inside Paragraph() objects. Plain strings like '<b>Text</b>' will NOT render bold.
Cell Formatting (Inside the Table):
- Left/Right Cell Margin: Set to at least 120-200 twips (6-10 pt, approximately the width of one character)
- Text Alignment: All body cells within the same table must use the same alignment method.
- Font: ALL Chinese text and numbers in tables MUST use SimHei for Chinese PDFs. ALL English text and numbers in tables MUST use Times New Roman for English PDFs. ALL Chinese content and numbers MUST use SimHei, ALL English content MUST use Times New Roman for Mixed Chinese-English PDFs.
Units with Exponents (CRITICAL):
- PROHIBITED: W/m2, kg/m3, m/s2 (plain text exponents)
- CORRECT: Paragraph('W/m<super>2</super>', style), Paragraph('kg/m<super>3</super>', style) (proper superscript in Paragraph)
- Always use <super></super> tags inside Paragraph objects for unit exponents in table cells
Numeric Values in Tables (CRITICAL):
- Large numbers MUST use scientific notation: Paragraph('-1.246 × 10<super>8</super>', style) not -124600000
- Small decimals MUST use scientific notation: Paragraph('2.5 × 10<super>-4</super>', style) not 0.00025
- Threshold: Use scientific notation when |value| ≥ 10000 or |value| ≤ 0.001
- Format: Paragraph('coefficient × 10<super>exponent</super>', style) (e.g., Paragraph('-1.246 × 10<super>8</super>', style))
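
The threshold rule can be applied mechanically when building table rows. The following is a sketch, not part of this skill: sci_markup is a hypothetical helper that follows the stated thresholds (|value| ≥ 10000 or |value| ≤ 0.001), keeps up to four significant digits in the coefficient, and returns markup that still has to be wrapped in Paragraph().

```python
def sci_markup(value):
    """Format a number for a table cell using the stated thresholds (hypothetical helper)."""
    magnitude = abs(value)
    if value == 0 or 0.001 < magnitude < 10000:
        return f"{value:g}"  # plain notation; 'g' keeps up to 6 significant digits
    mantissa, exponent = f"{value:.3e}".split("e")
    # float() followed by 'g' trims trailing zeros: '2.500' becomes '2.5'.
    return f"{float(mantissa):g} × 10<super>{int(exponent)}</super>"


for v in (-124600000, 0.00025, 25.5):
    print(sci_markup(v))
```

A cell would then be built as Paragraph(sci_markup(value), cell_style).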

Table Cell Paragraph Wrapping (MANDATORY - REVIEW BEFORE EVERY TABLE)

STOP AND CHECK: Before creating ANY table, verify that ALL text cells use Paragraph().

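from reportlab.lib.enums import TA_CENTER
from reportlab.lib.styles import ParagraphStyle
from reportlab.lib.units import cm
from reportlab.platypus import Paragraph, Table

# 1) Key point for Chinese text: wordWrap="CJK"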
tbl_center = ParagraphStyle(
    "tbl_center",
    fontName="SimHei",
    fontSize=9,
    leading=12,
    alignment=TA_CENTER,
    wordWrap="CJK",
)

# 2) ALL content MUST be wrapped in Paragraph - NO EXCEPTIONS for text
# `findings` is assumed to be an iterable of (a, b, c) string tuples prepared elsewhere
findings_data = []
for a, b, c in findings:
    findings_data.append([
        Paragraph(a, tbl_center),
        Paragraph(b, tbl_center),
        Paragraph(c, tbl_center),   # ALL content MUST be wrapped in Paragraph
    ])

findings_table = Table(findings_data, colWidths=[1.8*cm, 3*cm, 9*cm])

Complete Table Example:

from reportlab.platypus import Table, TableStyle, Paragraph, Image
from reportlab.lib.styles import ParagraphStyle
from reportlab.lib import colors
from reportlab.lib.enums import TA_CENTER, TA_LEFT, TA_RIGHT, TA_JUSTIFY

# Define styles for table cells
header_style = ParagraphStyle(
    name='TableHeader',
    fontName='Times New Roman',
    fontSize=11,
    textColor=colors.white,
    alignment=TA_CENTER
)

cell_style = ParagraphStyle(
    name='TableCell',
    fontName='Times New Roman',
    fontSize=10,
    textColor=colors.black,
    alignment=TA_CENTER
)

cell_style_jus = ParagraphStyle(
    name='TableCellLeft',
    fontName='Times New Roman',
    fontSize=10,
    textColor=colors.black,
    alignment=TA_JUSTIFY
)

cell_style_right = ParagraphStyle(
    name='TableCellRight',
    fontName='Times New Roman',
    fontSize=10,
    textColor=colors.black,
    alignment=TA_RIGHT
)

# ✅ CORRECT: All text content wrapped in Paragraph()
data = [
    # Header row - bold text with Paragraph
    [
        Paragraph('<b>Parameter</b>', header_style),
        Paragraph('<b>Unit</b>', header_style),
        Paragraph('<b>Value</b>', header_style),
        Paragraph('<b>Note</b>', header_style)
    ],
    # Data rows - all text in Paragraph
    [
        Paragraph('Temperature', cell_style_jus),
        Paragraph('°C', cell_style),
        Paragraph('25.5', cell_style_jus),
        Paragraph('Ambient', cell_style)
    ],
    [
        Paragraph('Pressure', cell_style_jus),
        Paragraph('Pa', cell_style),
        Paragraph('1.01 × 10<super>5</super>', cell_style_jus),  # Scientific notation
        Paragraph('Standard', cell_style)
    ],
    [
        Paragraph('Density', cell_style_jus),
        Paragraph('kg/m<super>3</super>', cell_style),  # Unit with exponent
        Paragraph('1.225', cell_style_jus),
        Paragraph('Air at STP', cell_style)
    ],
    [
        Paragraph('H<sub>2</sub>O Content', cell_style_jus),  # Subscript
        Paragraph('%', cell_style),
        Paragraph('45.2', cell_style_jus),
        Paragraph('Relative humidity', cell_style)
    ]
]

# ❌ PROHIBITED: Plain strings - NEVER DO THIS
# data = [
#     ['<b>Parameter</b>', '<b>Unit</b>', '<b>Value</b>'],  # Bold won't work!
#     ['Pressure', 'Pa', '1.01 × 10<super>5</super>'],      # Superscript won't work!
# ]

# Create table
table = Table(data, colWidths=[120, 80, 100, 120])
table.setStyle(TableStyle([
    # Header styling
    ('BACKGROUND', (0, 0), (-1, 0), colors.HexColor('#1F4E79')),
    ('TEXTCOLOR', (0, 0), (-1, 0), colors.white),
    # Alternating row colors
    ('BACKGROUND', (0, 1), (-1, 1), colors.white),
    ('BACKGROUND', (0, 2), (-1, 2), colors.HexColor('#F5F5F5')),
    ('BACKGROUND', (0, 3), (-1, 3), colors.white),
    ('BACKGROUND', (0, 4), (-1, 4), colors.HexColor('#F5F5F5')),
    # Grid and alignment
    ('GRID', (0, 0), (-1, -1), 0.5, colors.grey),
    ('VALIGN', (0, 0), (-1, -1), 'MIDDLE'),
    ('LEFTPADDING', (0, 0), (-1, -1), 8),
    ('RIGHTPADDING', (0, 0), (-1, -1), 8),
    ('TOPPADDING', (0, 0), (-1, -1), 6),
    ('BOTTOMPADDING', (0, 0), (-1, -1), 6),
]))

# Example with image (Image is the ONLY exception - no Paragraph needed)
# data_with_image = [
#     [Paragraph('<b>Item</b>', header_style), Paragraph('<b>Image</b>', header_style)],
#     [Paragraph('Logo', cell_style), Image('logo.png', width=50, height=50)],  # Image directly, no Paragraph
# ]
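
The example above writes one BACKGROUND command per row by hand. For tables of arbitrary length, the same commands could be generated in a loop. This is a sketch under stated assumptions: row_background_commands is a hypothetical helper, and it returns hex strings so that it runs without ReportLab; each string would need to be converted with colors.HexColor() before the tuples are passed to TableStyle.

```python
HEADER_HEX = "#1F4E79"  # dark blue header row
WHITE_HEX = "#FFFFFF"   # first data row, matching the example above
GRAY_HEX = "#F5F5F5"    # alternating light gray row


def row_background_commands(n_rows):
    """Build BACKGROUND command tuples for n_rows rows (row 0 is the header)."""
    commands = [("BACKGROUND", (0, 0), (-1, 0), HEADER_HEX)]
    for row in range(1, n_rows):
        shade = WHITE_HEX if row % 2 == 1 else GRAY_HEX
        commands.append(("BACKGROUND", (0, row), (-1, row), shade))
    return commands


for command in row_background_commands(5):
    print(command)
```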

PDF Metadata (REQUIRED)

CRITICAL: ALL PDFs MUST have proper metadata set during creation.

Required Metadata Fields

Title: MUST match the filename (without .pdf extension)
Author: MUST be set to "RoutinAI"
Creator: MUST be set to "RoutinAI"
Subject: SHOULD describe the document purpose/content
Description/Keywords: SHOULD include relevant topics for searchability
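
Before reporting a PDF as finished, the required fields can be checked against these rules. The sketch below is illustrative: metadata_problems is a hypothetical helper that takes a plain dict keyed in the '/Title' style, so it runs without any PDF library; how the dict is obtained from a real file is left open.

```python
from pathlib import Path

REQUIRED = {"/Author": "RoutinAI", "/Creator": "RoutinAI"}


def metadata_problems(pdf_path, meta):
    """Return a list of rule violations for a metadata dict keyed like '/Title'."""
    problems = []
    expected_title = Path(pdf_path).stem  # filename without the .pdf extension
    if meta.get("/Title") != expected_title:
        problems.append(f"/Title should be {expected_title!r}")
    for key, value in REQUIRED.items():
        if meta.get(key) != value:
            problems.append(f"{key} should be {value!r}")
    if not meta.get("/Subject"):
        problems.append("/Subject is recommended")
    return problems


print(metadata_problems("financial_report_2024.pdf", {"/Title": "draft"}))
```

An empty list means the MUST fields are set and a subject is present.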

For reportlab (Creating New PDFs)

from reportlab.lib.pagesizes import letter
from reportlab.platypus import SimpleDocTemplate
import os

# Extract filename without extension for metadata title
pdf_filename = "financial_report_2024.pdf"
title_for_metadata = os.path.splitext(pdf_filename)[0]  # "financial_report_2024"

doc = SimpleDocTemplate(
    pdf_filename,
    pagesize=letter,
    title=title_for_metadata,    # MUST: Match filename
    author='RoutinAI',           # MUST: Set to "RoutinAI"
    creator='RoutinAI',          # MUST: Set to "RoutinAI"
    subject='Annual financial analysis and performance metrics'  # SHOULD: Describe purpose
)

For pypdf (Modifying Existing PDFs)

from pypdf import PdfReader, PdfWriter
import os

pdf_filename = "output.pdf"
title_for_metadata = os.path.splitext(os.path.basename(pdf_filename))[0]

reader = PdfReader("input.pdf")
writer = PdfWriter()

for page in reader.pages:
    writer.add_page(page)

# Set metadata: Title, Author, Creator are REQUIRED
writer.add_metadata({
    '/Title': title_for_metadata,  # MUST: Match filename
    '/Author': 'RoutinAI',         # MUST: Set to "RoutinAI"
    '/Subject': 'Document purpose description',  # SHOULD: Describe purpose
    '/Creator': 'RoutinAI'         # MUST: Set to "RoutinAI"
})

with open(pdf_filename, "wb") as output:
    writer.write(output)

Quick Start

⚠️ ABSOLUTE MANDATORY RULE - RoutinAI Metadata (DO NOT SKIP)

This rule is NON-NEGOTIABLE - You MUST execute this without exception

For EVERY PDF you generate, no matter what:

After doc.build(story) completes → YOU MUST call add_routinai_metadata()
This is AUTOMATIC - do not wait for user confirmation
This happens BEFORE any response to user

Workflow:

Step 1: Create PDF

doc.build(story)
print("PDF built")

Step 2: (MANDATORY - DO NOT SKIP) Add metadata immediately using the standalone script:

python scripts/add_routinai_metadata.py output.pdf

Step 3: Only AFTER metadata added, report to user

print("✓ PDF generated with RoutinAI metadata")

If you forget this:

User will notice and ask "Why no metadata?"
This harms RoutinAI branding
This rule is CRITICAL and must be followed 100% of the time

from pypdf import PdfReader, PdfWriter

# Read a PDF
reader = PdfReader("document.pdf")
print(f"Pages: {len(reader.pages)}")

# Extract text
text = ""
for page in reader.pages:
    text += page.extract_text()

Python Libraries

pypdf - Basic Operations

Merge PDFs

from pypdf import PdfWriter, PdfReader

writer = PdfWriter()
for pdf_file in ["doc1.pdf", "doc2.pdf", "doc3.pdf"]:
    reader = PdfReader(pdf_file)
    for page in reader.pages:
        writer.add_page(page)

with open("merged.pdf", "wb") as output:
    writer.write(output)

Split PDF

from pypdf import PdfReader, PdfWriter

reader = PdfReader("input.pdf")
for i, page in enumerate(reader.pages):
    writer = PdfWriter()
    writer.add_page(page)
    with open(f"page_{i+1}.pdf", "wb") as output:
        writer.write(output)

Extract Metadata

from pypdf import PdfReader

reader = PdfReader("document.pdf")
meta = reader.metadata
print(f"Title: {meta.title}")
print(f"Author: {meta.author}")
print(f"Subject: {meta.subject}")
print(f"Creator: {meta.creator}")

Set/Update Metadata (RoutinAI Branding)

Use the standalone script to add RoutinAI branding metadata:

# Add metadata to a single PDF (in-place)
python scripts/add_routinai_metadata.py document.pdf

# Add metadata with custom title
python scripts/add_routinai_metadata.py report.pdf -t "Q4 Financial Analysis"

# Batch process multiple PDFs
python scripts/add_routinai_metadata.py *.pdf

Rotate Pages

from pypdf import PdfReader, PdfWriter

reader = PdfReader("input.pdf")
writer = PdfWriter()

page = reader.pages[0]
page.rotate(90)  # Rotate 90 degrees clockwise
writer.add_page(page)

with open("rotated.pdf", "wb") as output:
    writer.write(output)

pdfplumber - Text and Table Extraction

Extract Text with Layout

import pdfplumber

with pdfplumber.open("document.pdf") as pdf:
    for page in pdf.pages:
        text = page.extract_text()
        print(text)

Extract Tables

import pdfplumber

with pdfplumber.open("document.pdf") as pdf:
    for i, page in enumerate(pdf.pages):
        tables = page.extract_tables()
        for j, table in enumerate(tables):
            print(f"Table {j+1} on page {i+1}:")
            for row in table:
                print(row)

reportlab - Create PDFs

Choosing the Right DocTemplate and Build Method

Decision Tree:

Do you need auto-TOC?
├─ YES → Use TocDocTemplate + doc.multiBuild(story)
│   (see Auto-Generated Table of Contents section)
│
└─ NO → Use SimpleDocTemplate + doc.build(story)
    (basic documents, or with optional Cross-References)

When to use each approach:

Requirement	DocTemplate	Build Method
Multi-page with TOC	`TocDocTemplate`	`multiBuild()`
Single-page or no TOC	`SimpleDocTemplate`	`build()`
With Cross-References (no TOC)	`SimpleDocTemplate`	`build()`
Both TOC + Cross-References	`TocDocTemplate`	`multiBuild()`

⚠️ CRITICAL:

multiBuild() is ONLY needed when using TableOfContents
Using build() with TocDocTemplate = TOC won't work
Using multiBuild() without TocDocTemplate = unnecessary overhead

Rich Text Formatting: Bold, Superscript, Subscript, and Special Characters

Prerequisites

To use <b>, <super>, <sub> tags, you must:

Register your fonts via registerFont()
Call registerFontFamily() to link normal/bold/italic variants
Wrap all tagged text in Paragraph() objects

CRITICAL: These tags ONLY work inside Paragraph() objects. Plain strings like '<b>Text</b>' will NOT render correctly.

Character Handling (see Core Constraint #5)

All rules for superscripts, subscripts, and mathematical/relational operators are defined in Core Constraint #5 (Character Safety Rule).

Quick reminder when writing Rich Text:

<b>, <super>, and <sub> tags ONLY work inside Paragraph() objects
Must call registerFontFamily() first to enable these tags
Plain strings like '<b>Text</b>' will NOT render; always use Paragraph()
For scientific notation: Paragraph('coefficient × 10<super>exponent</super>', style)
For chemical formulas: Paragraph('H<sub>2</sub>O', style)

Do NOT use Unicode escape sequences (e.g., for superscript and subscript digits, math operators and special symbols, or emoji characters) anywhere. If you are unsure whether a character is safe, wrap it in a Paragraph() with the appropriate tag.

Complete Python Example

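from reportlab.lib import colors
from reportlab.lib.enums import TA_JUSTIFY
from reportlab.lib.styles import ParagraphStyle
from reportlab.pdfbase import pdfmetrics
from reportlab.pdfbase.pdfmetrics import registerFontFamily
from reportlab.pdfbase.ttfonts import TTFont
from reportlab.platypus import Paragraph

# --- Register fonts and font family ---
pdfmetrics.registerFont(TTFont('Times New Roman', '/usr/share/fonts/truetype/english/Times-New-Roman.ttf'))

# CRITICAL: Must call registerFontFamily() to enable <b>, <super>, and <sub> tags
registerFontFamily('Times New Roman', normal='Times New Roman', bold='Times New Roman')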

# --- Define styles ---
body_style = ParagraphStyle(
    name='BodyStyle',
    fontName='Times New Roman',
    fontSize=10,
    textColor=colors.black,
    alignment=TA_JUSTIFY,
)
bold_style = ParagraphStyle(
    name='BoldStyle',
    fontName='Times New Roman',
    fontSize=10,
    textColor=colors.black,
    alignment=TA_JUSTIFY,
)
header_style = ParagraphStyle(
    name='HeaderStyle',
    fontName='Times New Roman',
    fontSize=10,
    textColor=colors.white,
    alignment=TA_JUSTIFY,
)

# --- Body text examples ---
# Bold title
title = Paragraph('<b>Scientific Formulas and Chemical Expressions</b>', bold_style)

# Math formula with superscript and mathematical symbol ×
math_text = Paragraph(
    'The Einstein mass-energy equivalence is expressed as E = mc<super>2</super>. '
    'In applied physics, the gravitational force is F = 6.674 × 10<super>-11</super> × '
    'm<sub>1</sub>m<sub>2</sub>/r<super>2</super>, '
    'and the Pythagorean theorem states a<super>2</super> + b<super>2</super> = c<super>2</super>.',
    body_style,
)

# Chemical expressions with subscript
chem_text = Paragraph(
    'The combustion of methane: CH<sub>4</sub> + 2O<sub>2</sub> '
    '= CO<sub>2</sub> + 2H<sub>2</sub>O. '
    'Sulfuric acid (H<sub>2</sub>SO<sub>4</sub>) reacts with sodium hydroxide to produce '
    'Na<sub>2</sub>SO<sub>4</sub> and water.',
    body_style,
)

Preventing Unwanted Line Breaks

Problem 1: English names broken at awkward positions

# PROHIBITED: "K.G. Palepu" may break after "K.G."
text = Paragraph("Professors (K.G. Palepu) proposed...",style)

# RIGHT: Use non-breaking space (U+00A0) to prevent breaking
text = Paragraph("Professors (K.G.\u00A0Palepu) proposed...",style)
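
When names with initials appear throughout a document, the replacement above can be applied programmatically. This is a minimal sketch, assuming initials look like "K.G." and are followed by a capitalized surname; protect_initials is a hypothetical helper, and the pattern will also bind cases such as a sentence ending in "B." followed by a capitalized word, so review the output for prose-heavy text.

```python
import re

NBSP = chr(0x00A0)  # non-breaking space (U+00A0)

# One or more "X." initials, a normal space, then a capitalized word
_INITIALS = re.compile(r"((?:[A-Z]\.)+) (?=[A-Z][a-z])")


def protect_initials(text):
    """Replace the space after initials such as 'K.G.' with a non-breaking space."""
    return _INITIALS.sub(lambda match: match.group(1) + NBSP, text)


print(repr(protect_initials("Professors (K.G. Palepu) proposed...")))
```

The result can be passed straight to Paragraph(), exactly like the hand-written string above.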

Problem 2: Punctuation at line start

# RIGHT: Add wordWrap='CJK' for proper typography
styles.add(ParagraphStyle(
    name='BodyStyle',
    fontName='SimHei',
    fontSize=10.5,
    leading=18,
    alignment=TA_LEFT,
    wordWrap='CJK'  # Prevents orphaned punctuation
))

Problem 3: Creating intentional line breaks

# PROHIBITED: Normal newline character does NOT create line breaks
text = Paragraph("Line 1\nLine 2\nLine 3", style)  # Will render as single line!

# RIGHT: Use <br/> tag for line breaks
text = Paragraph("Line 1<br/>Line 2<br/>Line 3", style)

# Alternative: Split into multiple Paragraph objects
story.append(Paragraph("Line 1", style))
story.append(Paragraph("Line 2", style))
story.append(Paragraph("Line 3", style))

Basic PDF Creation

from reportlab.lib.pagesizes import letter
from reportlab.pdfgen import canvas

c = canvas.Canvas("hello.pdf", pagesize=letter)
width, height = letter

# Add text
c.drawString(100, height - 100, "Hello World!")
c.drawString(100, height - 120, "This is a PDF created with reportlab")

# Add a line
c.line(100, height - 140, 400, height - 140)

# Save
c.save()

Auto-Generated Table of Contents

⚠️ CRITICAL WARNINGS

❌ FORBIDDEN: Manual Table of Contents

NEVER manually create TOC like this:

# ❌ PROHIBITED - DO NOT USE
toc_entries = [("1. Title", "5"), ("2. Section", "10")]
for entry, page in toc_entries:
    story.append(Paragraph(f"{entry} {'.'*50} {page}", style))

Why it's PROHIBITED:

Hardcoded page numbers become incorrect when content changes
No clickable hyperlinks
Manual leader dots are fragile
Must be manually updated with every document change

✅ ALWAYS use auto-generated TOC:

Key Implementation Requirements:

Custom TocDocTemplate class: Override afterFlowable() to capture TOC entries
Bookmark attributes: Set bookmark_name, bookmark_level, bookmark_text on each heading
Use doc.multiBuild(story): NOT doc.build() - multiBuild is required for TOC processing
Clickable hyperlinks: Generated automatically with proper styling

Helper Function Pattern:

def add_heading(text, style, level=0):
    """Create heading with bookmark for auto-TOC"""
    p = Paragraph(text, style)
    p.bookmark_name = text
    p.bookmark_level = level
    p.bookmark_text = text
    return p

# Usage:
story.append(add_heading("1. Introduction", styles['Heading1'], 0))
story.append(Paragraph('Content...', styles['Normal']))

Complete TOC Implementation Example

Copy and adapt this complete working code for your PDF with Table of Contents:

from reportlab.lib.pagesizes import letter
from reportlab.platypus import SimpleDocTemplate, Paragraph, PageBreak, Spacer
from reportlab.platypus.tableofcontents import TableOfContents
from reportlab.lib.styles import getSampleStyleSheet, ParagraphStyle
from reportlab.lib.units import inch

class TocDocTemplate(SimpleDocTemplate):
    def __init__(self, *args, **kwargs):
        SimpleDocTemplate.__init__(self, *args, **kwargs)

    def afterFlowable(self, flowable):
        """Capture TOC entries after each flowable is rendered"""
        if hasattr(flowable, 'bookmark_name'):
            level = getattr(flowable, 'bookmark_level', 0)
            text = getattr(flowable, 'bookmark_text', '')
            self.notify('TOCEntry', (level, text, self.page))

# Create document
doc = TocDocTemplate("document.pdf", pagesize=letter)
story = []
styles = getSampleStyleSheet()

# Create Table of Contents
# Assumes 'Times New Roman' is already registered (see Font Registration Template)
toc = TableOfContents()
toc.levelStyles = [
    ParagraphStyle(name='TOCHeading1', fontSize=14, leftIndent=20,
                   fontName='Times New Roman'),
    ParagraphStyle(name='TOCHeading2', fontSize=12, leftIndent=40,
                   fontName='Times New Roman'),
]
story.append(Paragraph("<b>Table of Contents</b>", styles['Title']))
story.append(Spacer(1, 0.2*inch))
story.append(toc)
story.append(PageBreak())

# Helper function: Create heading with TOC bookmark
def add_heading(text, style, level=0):
    p = Paragraph(text, style)
    p.bookmark_name = text
    p.bookmark_level = level
    p.bookmark_text = text
    return p

# Chapter 1: Introduction
story.append(add_heading("Chapter 1: Introduction", styles['Heading1'], 0))
story.append(Paragraph("This is the introduction chapter with some example content.",
                       styles['Normal']))
story.append(Spacer(1, 0.2*inch))

story.append(add_heading("1.1 Background", styles['Heading2'], 1))
story.append(Paragraph("Background information goes here.", styles['Normal']))


# Chapter 2: Conclusion
story.append(add_heading("Chapter 2: Conclusion", styles['Heading1'], 0))
story.append(Paragraph("This concludes our document.", styles['Normal']))
story.append(Spacer(1, 0.2*inch))

story.append(add_heading("2.1 Summary", styles['Heading2'], 1))
story.append(Paragraph("Summary of the document.", styles['Normal']))

# Build the document (must use multiBuild for TOC to work)
doc.multiBuild(story)

print("PDF with Table of Contents created successfully!")

Cross-References (Figures, Tables, Bibliography)

OPTIONAL: For academic papers requiring citation systems (LaTeX-style \ref{} and \cite{})

Key Principle: Pre-register all figures, tables, and references BEFORE using them in text.

Simple Implementation Pattern:

from reportlab.lib.pagesizes import letter
from reportlab.lib.styles import getSampleStyleSheet, ParagraphStyle
from reportlab.lib.enums import TA_CENTER
from reportlab.platypus import SimpleDocTemplate, Paragraph, Spacer, PageBreak
from reportlab.lib import colors
from reportlab.platypus import Table, TableStyle
from reportlab.lib.units import inch  # used for Spacer heights and column widths


class CrossReferenceDocument:
    """Manages cross-references throughout the document"""

    def __init__(self):
        self.figures = {}
        self.tables = {}
        self.refs = {}
        self.figure_counter = 0
        self.table_counter = 0
        self.ref_counter = 0

    def add_figure(self, name):
        """Add a figure and return its number"""
        if name not in self.figures:
            self.figure_counter += 1
            self.figures[name] = self.figure_counter
        return self.figures[name]

    def add_table(self, name):
        """Add a table and return its number"""
        if name not in self.tables:
            self.table_counter += 1
            self.tables[name] = self.table_counter
        return self.tables[name]

    def add_reference(self, name):
        """Add a reference and return its number"""
        if name not in self.refs:
            self.ref_counter += 1
            self.refs[name] = self.ref_counter
        return self.refs[name]


def build_document():
    doc = SimpleDocTemplate("cross_ref.pdf", pagesize=letter)
    xref = CrossReferenceDocument()
    styles = getSampleStyleSheet()

    # Caption style
    styles.add(ParagraphStyle(
        name='Caption',
        parent=styles['Normal'],
        alignment=TA_CENTER,
        fontSize=10,
        textColor=colors.HexColor('#333333')
    ))

    story = []

    # Step 1: Register all figures, tables, and references FIRST
    fig1 = xref.add_figure('sample')
    table1 = xref.add_table('data')
    ref1 = xref.add_reference('author2024')

    # Step 2: Use them in text
    intro = f"""
    See Figure {fig1} for details and Table {table1} for data<sup>[{ref1}]</sup>.
    """
    story.append(Paragraph(intro, styles['Normal']))
    story.append(Spacer(1, 0.2*inch))

    # Step 3: Create figures and tables with numbered captions
    story.append(Paragraph(f"<b>Figure {fig1}.</b> Sample Figure Caption",
        styles['Caption']
    ))

    # Table example
    header_style = ParagraphStyle(
        name='TableHeader',
        fontName='Times New Roman',
        fontSize=11,
        textColor=colors.white,
        alignment=TA_CENTER
    )

    cell_style = ParagraphStyle(
        name='TableCell',
        fontName='Times New Roman',
        fontSize=10,
        textColor=colors.black,
        alignment=TA_CENTER
    )

    # All text content wrapped in Paragraph()
    data = [
        [Paragraph('<b>Item</b>', header_style), Paragraph('<b>Value</b>', header_style)],
        [Paragraph('A', cell_style), Paragraph('10', cell_style)],
        [Paragraph('B', cell_style), Paragraph('20', cell_style)],
    ]
    t = Table(data, colWidths=[2*inch, 2*inch])
    t.setStyle(TableStyle([
        ('BACKGROUND', (0, 0), (-1, 0), colors.HexColor('#1F4E79')),
        ('TEXTCOLOR', (0, 0), (-1, 0), colors.white),
        ('ALIGN', (0, 0), (-1, -1), 'CENTER'),
        ('GRID', (0, 0), (-1, -1), 0.5, colors.grey),
    ]))
    story.append(t)
    story.append(Spacer(1, 6))
    story.append(Paragraph(f"<b>Table {table1}.</b> Sample Data Table",
        styles['Caption']
    ))

    story.append(PageBreak())

    # Step 4: Reference again in discussion
    discussion = f"""
    As shown in Figure {fig1} and Table {table1}, results are clear<sup>[{ref1}]</sup>.
    """
    story.append(Paragraph(discussion, styles['Normal']))

    # Step 5: Bibliography section
    story.append(PageBreak())
    story.append(Paragraph("<b>References</b>", styles['Heading1']))
    story.append(Paragraph(
        f"[{ref1}] Author, A. (2024). Example Reference. <i>Journal Name</i>.",
        styles['Normal']
    ))

    doc.build(story)
    print("PDF with cross-references created!")


if __name__ == '__main__':
    build_document()

Usage Notes:

Pre-registration is critical: Call add_figure()/add_table()/add_reference() at the START of your document
Citation format: Use Paragraph('<sup>[{ref_num}]</sup>') for inline citations
Caption format: Use Paragraph('<b>Figure {num}.</b>') or Paragraph('<b>Table {num}.</b>') with centered caption style
Combine with TOC: Use TocDocTemplate + doc.multiBuild(story) if both cross-refs and auto-TOC are needed

Command-Line Tools

pdftotext (poppler-utils)

# Extract text
pdftotext input.pdf output.txt

# Extract text preserving layout
pdftotext -layout input.pdf output.txt

# Extract specific pages
pdftotext -f 1 -l 5 input.pdf output.txt  # Pages 1-5

qpdf

# Merge PDFs
qpdf --empty --pages file1.pdf file2.pdf -- merged.pdf

# Split pages
qpdf input.pdf --pages . 1-5 -- pages1-5.pdf
qpdf input.pdf --pages . 6-10 -- pages6-10.pdf

# Rotate pages
qpdf input.pdf output.pdf --rotate=+90:1  # Rotate page 1 by 90 degrees

# Remove password
qpdf --password=mypassword --decrypt encrypted.pdf decrypted.pdf
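
For longer documents, the two split commands above generalize to one command per fixed-size chunk. The sketch below only builds the argument lists; qpdf_split_commands is a hypothetical helper, and the output naming pattern simply mirrors the file names used above.

```python
def qpdf_split_commands(input_pdf, total_pages, chunk_size):
    """Build one qpdf argument list per chunk of pages (1-based, inclusive)."""
    commands = []
    for start in range(1, total_pages + 1, chunk_size):
        end = min(start + chunk_size - 1, total_pages)
        output = f"pages{start}-{end}.pdf"
        commands.append(["qpdf", input_pdf, "--pages", ".", f"{start}-{end}", "--", output])
    return commands


for command in qpdf_split_commands("input.pdf", 12, 5):
    print(" ".join(command))
```

Each list can be handed to subprocess.run(command, check=True) once qpdf is installed.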

Common Tasks

Brand PDFs with RoutinAI Metadata

⚠️ CRITICAL MANDATORY RULE - PDF Metadata MUST be Added After Every PDF Generation

All PDFs MUST have metadata added immediately after creation - This is the FINAL step and CANNOT be skipped

Usage - Standalone Script:

# Add metadata to a single PDF (in-place)
python scripts/add_routinai_metadata.py document.pdf

# Add metadata to a single PDF (create new file)
python scripts/add_routinai_metadata.py input.pdf -o output.pdf

# Add metadata with custom title
python scripts/add_routinai_metadata.py report.pdf -t "Q4 Financial Analysis"

# Batch process all PDFs in current directory
python scripts/add_routinai_metadata.py *.pdf

# Quiet mode (no output)
python scripts/add_routinai_metadata.py document.pdf -q

# Show help
python scripts/add_routinai_metadata.py --help

Requirements:

After doc.build(story) completes → Immediately call the script
Do NOT wait for user reminder, Do NOT check task description - Execute automatically
Confirm metadata info to user after adding
Memory phrase: PDF build done, metadata must add, no need to remind

Extract Text from Scanned PDFs

# Requires: pip install pytesseract pdf2image
import pytesseract
from pdf2image import convert_from_path

# Convert PDF to images
images = convert_from_path('scanned.pdf')

# OCR each page
text = ""
for i, image in enumerate(images):
    text += f"Page {i+1}:\n"
    text += pytesseract.image_to_string(image)
    text += "\n\n"

print(text)

Add Watermark

from pypdf import PdfReader, PdfWriter

# Create watermark (or load existing)
watermark = PdfReader("watermark.pdf").pages[0]

# Apply to all pages
reader = PdfReader("document.pdf")
writer = PdfWriter()

for page in reader.pages:
    page.merge_page(watermark)
    writer.add_page(page)

with open("watermarked.pdf", "wb") as output:
    writer.write(output)

Password Protection

from pypdf import PdfReader, PdfWriter

reader = PdfReader("input.pdf")
writer = PdfWriter()

for page in reader.pages:
    writer.add_page(page)

# Add password
writer.encrypt("userpassword", "ownerpassword")

with open("encrypted.pdf", "wb") as output:
    writer.write(output)

Critical Reminders (MUST Follow)

Font Rules

FONT RESTRICTION: ONLY use the six registered fonts. NEVER use Arial, Helvetica, Courier, or any unregistered fonts.
In tables: for Chinese PDFs, ALL Chinese text and numbers MUST use SimHei; for English PDFs, ALL English text and numbers MUST use Times New Roman; for mixed Chinese-English PDFs, ALL Chinese content and numbers MUST use SimHei and ALL English content MUST use Times New Roman.
CRITICAL: Must call registerFontFamily() after registering fonts to enable <b>, <super>, and <sub> tags.
Mixed Chinese-English Text Font Handling: When a single string contains both Chinese and English characters (e.g., "My name is Lei Shen (沈磊)"), you MUST split the string by language and apply a different font to each part using ReportLab's inline <font> tags within Paragraph objects. English fonts (e.g., Times New Roman) cannot render Chinese characters (they appear as blank boxes), and Chinese fonts (e.g., SimHei) render English with poor spacing. Set ParagraphStyle.fontName to your base font, then wrap segments of the other language in inline <font> tags.

from reportlab.lib.enums import TA_JUSTIFY
from reportlab.lib.styles import ParagraphStyle
from reportlab.platypus import Paragraph
from reportlab.pdfbase import pdfmetrics
from reportlab.pdfbase.ttfonts import TTFont

pdfmetrics.registerFont(TTFont('SimHei', '/usr/share/fonts/truetype/chinese/SimHei.ttf'))
pdfmetrics.registerFont(TTFont('Times New Roman', '/usr/share/fonts/truetype/english/Times-New-Roman.ttf'))

# Base font is English; wrap Chinese parts:
enbody_style = ParagraphStyle(
    name="ENBodyStyle",
    fontName="Times New Roman",  # Base font for English
    fontSize=10.5,
    leading=18,
    alignment=TA_JUSTIFY,
)
# Wrap Chinese segments with <font> tag
# Adjacent string literals form ONE markup string; the style is the second argument
story.append(Paragraph(
    'RoutinAI Copilot (<font name="SimHei">RoutinAI 协作助手</font>) is built by routin.ai. '
    'My name is RoutinAI Analyst (<font name="SimHei">洞察顾问</font>). '
    '<font name="SimHei">企业智能文档</font> services are powered by routin.ai.',
    enbody_style
))

# Base font is Chinese; wrap English parts:
cnbody_style = ParagraphStyle(
    name="CNBodyStyle",
    fontName="SimHei",  # Base font for Chinese
    fontSize=10.5,
    leading=18,
    alignment=TA_JUSTIFY,
)
# Wrap English segments with <font> tag
story.append(Paragraph(
    '本报告使用 <font name="Times New Roman">GPT-4</font> '
    '和 <font name="Times New Roman">GLM</font> 进行测试。',
    cnbody_style
))
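
For an English-base paragraph, the manual wrapping shown above can be automated. The following is a minimal sketch, not a complete language detector: is_chinese and wrap_chinese are hypothetical helpers, and the code point ranges (CJK Unified Ideographs, CJK punctuation, fullwidth forms) are an assumption about what counts as "Chinese" here. It also assumes the input contains no raw markup characters such as < or &.

```python
from itertools import groupby


def is_chinese(ch):
    """Rough test for characters that should use the Chinese font."""
    code = ord(ch)
    return (
        0x4E00 <= code <= 0x9FFF      # CJK Unified Ideographs
        or 0x3000 <= code <= 0x303F   # CJK symbols and punctuation
        or 0xFF00 <= code <= 0xFFEF   # fullwidth forms
    )


def wrap_chinese(text, font_name="SimHei"):
    """Wrap each run of Chinese characters in an inline <font> tag."""
    parts = []
    for chinese, run in groupby(text, key=is_chinese):
        segment = "".join(run)
        if chinese:
            parts.append(f'<font name="{font_name}">{segment}</font>')
        else:
            parts.append(segment)
    return "".join(parts)


print(wrap_chinese("My name is Lei Shen (沈磊)"))
```

The returned markup is meant to be passed to Paragraph() with an English-base style such as enbody_style; for a Chinese-base paragraph the same idea applies with the test inverted and the English font name.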

Rich Text Tags (`<b>`, `<super>`, `<sub>`)

These tags ONLY work inside Paragraph() objects — plain strings will NOT render them.
Character Safety: Follow Core Constraint #5 strictly. Do not use forbidden Unicode superscript/subscript/math characters anywhere in the code. Always use <super> and <sub> tags inside Paragraph().
Scientific Notation in Tables: Paragraph('1.246 × 10<super>8</super>', style) — never write large numbers as plain digits.
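
Numeric values can be converted to this markup before they are wrapped in Paragraph(). A minimal sketch, assuming three decimal places in the mantissa is acceptable; sci_markup is a hypothetical helper name.

```python
def sci_markup(value, digits=3):
    """Return 'm × 10<super>e</super>' markup for use inside Paragraph()."""
    mantissa, exponent = f"{value:.{digits}e}".split("e")
    return f"{mantissa} × 10<super>{int(exponent)}</super>"


print(sci_markup(124600000))   # 1.246 × 10<super>8</super>
print(sci_markup(6.674e-11))   # 6.674 × 10<super>-11</super>
```

The result is a plain string, so it still has to go through Paragraph(sci_markup(value), style) to render.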

Line Breaks in Paragraph

CRITICAL: Paragraph does not treat a normal newline character (\n) as a line break. To create line breaks, you must use <br/> (or split the content into multiple Paragraph objects).

# PROHIBITED: newlines inside the string are NOT rendered as line breaks
sms3 = """Hi [FIRST_NAME]
You're invited! Join us for an exclusive first look at the Carolina Herrera Resort 2025 collection—before it opens to the public.
[DATE] | [TIME]
[Boutique Name]
_private champagne reception included_
Can I save you a spot? Just let me know!
[Your Name]"""
sms3_box = Table([[Paragraph(sms3, sms1_style)]], colWidths=[400])

# IMPORTANT:
# Paragraph does NOT treat '\n' as a line break.
# Use <br/> to force line breaks.
sms3 = """Hi [FIRST_NAME]<br/><br/>
You're invited! Join us for an exclusive first look at the Carolina Herrera Resort 2025 collection—before it opens to the public.<br/><br/>
[DATE] | [TIME]<br/>
[Boutique Name]<br/><br/>
<i>private champagne reception included</i><br/><br/>
Can I save you a spot? Just let me know!<br/><br/>
[Your Name]"""
sms3_box = Table([[Paragraph(sms3, sms1_style)]], colWidths=[400])
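
When the text arrives as a plain multi-line string, the conversion can be done in code instead of by hand. A minimal sketch, assuming the input is plain text without raw < or & characters; newlines_to_br is a hypothetical helper name.

```python
def newlines_to_br(text):
    """Join the lines of a plain-text string with <br/> tags."""
    return "<br/>".join(text.splitlines())


message = "Line 1\nLine 2\n\nLine 4"
print(newlines_to_br(message))  # Line 1<br/>Line 2<br/><br/>Line 4
```

An empty line in the input becomes two consecutive <br/> tags, which matches the blank-line spacing used in the corrected example above.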

Body Title & Heading Styles

All titles and sub-titles (except table headers) must be bold with black text: use Paragraph('<b>Title</b>', style) with textColor=colors.black.

Table Cell Content Rule (MANDATORY)

ALL text content in table cells MUST be wrapped in Paragraph(). This is NON-NEGOTIABLE.

❌ PROHIBITED - Plain strings in table cells:

# NEVER DO THIS - formatting will NOT work
data = [
    ['<b>Header</b>', 'Value'],           # Bold won't render
    ['Temperature', '25°C'],               # No style control
    ['Pressure', '1.01 × 10<super>5</super>'],  # Superscript won't work
]

✅ REQUIRED - All table text MUST be wrapped in Paragraph:

# ALWAYS DO THIS
data = [
    [Paragraph('<b>Header</b>', header_style), Paragraph('Value', header_style)],
    [Paragraph('Temperature', cell_style), Paragraph('25°C', cell_style)],
    [Paragraph('Pressure', cell_style), Paragraph('1.01 × 10<super>5</super>', cell_style)],
]

Why this is mandatory:

Rendering formatting tags (<b>, <super>, <sub>, and other inline tags)
Proper font application
Correct text alignment within cells
Consistent styling across the table

The ONLY exception: Image() objects can be placed directly in table cells without Paragraph wrapping.

Table Style Specifications

Header style: Must be bold with white text on dark blue background - use Paragraph('<b>Header</b>', header_style) + textColor=colors.white.
Standard color scheme: Dark blue header (#1F4E79), alternating white/light gray rows.
Color consistency: If a single PDF contains multiple tables, only one color scheme is allowed across all tables.
Alignment: Each body element within the same table must use the same alignment method.
Caption: ALL table captions must be centered and followed by Spacer(1, 18) before next content.
Spacing: Add Spacer(1, 18) BEFORE tables to maintain symmetric spacing with bottom.
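
The header and alternating-row colors can be generated for any table length. The sketch below builds plain command tuples in the shape TableStyle accepts; zebra_style_commands is a hypothetical helper, and the colors are passed in by the caller because the exact light gray is not specified here.

```python
def zebra_style_commands(n_rows, header_bg, odd_bg, even_bg):
    """Build BACKGROUND commands: one header row plus alternating body rows."""
    commands = [("BACKGROUND", (0, 0), (-1, 0), header_bg)]
    for row in range(1, n_rows):
        background = odd_bg if row % 2 == 1 else even_bg
        commands.append(("BACKGROUND", (0, row), (-1, row), background))
    return commands


# Strings stand in for color objects in this demonstration
for command in zebra_style_commands(4, "header", "white", "light gray"):
    print(command)
```

In a real document the list would be passed as TableStyle(zebra_style_commands(len(data), colors.HexColor('#1F4E79'), colors.white, your_light_gray)), where your_light_gray is whichever gray the document uses consistently.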

Document Structure

A PDF can contain ONLY ONE cover page and ONE back cover page.
The cover page and the back cover page MUST use the alignment method specified by TA_JUSTIFY.
PDF Metadata (REQUIRED): Title MUST match filename; Author and Creator MUST be "RoutinAI"; Subject SHOULD describe purpose.
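
To make the metadata rule concrete, the sketch below builds the fields it describes. It is only an illustration and not the contents of scripts/add_routinai_metadata.py, which remains the required way to add metadata. build_metadata and apply_metadata are hypothetical names, and reading "match filename" as the file name without its extension is an assumption.

```python
from pathlib import Path


def build_metadata(pdf_path, subject=""):
    """Return a PDF info dictionary following the metadata rule.

    Assumption: the title is the file name without its extension.
    """
    info = {
        "/Title": Path(pdf_path).stem,
        "/Author": "RoutinAI",
        "/Creator": "RoutinAI",
    }
    if subject:
        info["/Subject"] = subject
    return info


def apply_metadata(src_path, dst_path, subject=""):
    """Copy src_path to dst_path with the metadata attached (needs pypdf)."""
    from pypdf import PdfReader, PdfWriter  # third-party, imported lazily

    reader = PdfReader(src_path)
    writer = PdfWriter()
    for page in reader.pages:
        writer.add_page(page)
    writer.add_metadata(build_metadata(dst_path, subject))
    with open(dst_path, "wb") as output:
        writer.write(output)


print(build_metadata("reports/q4_analysis.pdf", "Quarterly review"))
```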

Image Handling

Preserve aspect ratio: Never adjust image aspect ratio. Must insert according to the original ratio.

from PIL import Image as PILImage
from reportlab.platypus import Image
# Get original dimensions
pil_img = PILImage.open('image.png')
orig_w, orig_h = pil_img.size
# Scale to fit width while preserving aspect ratio
target_width = 400
scale = target_width / orig_w
img = Image('image.png', width=target_width, height=orig_h * scale)
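
The same idea extends to a bounding box, where both a maximum width and a maximum height apply. A minimal sketch; fit_within is a hypothetical helper, and it scales small images up as well as large images down, so cap the scale at 1 if upscaling is not wanted.

```python
def fit_within(orig_w, orig_h, max_w, max_h):
    """Scale (orig_w, orig_h) to fit inside (max_w, max_h), keeping the ratio."""
    scale = min(max_w / orig_w, max_h / orig_h)
    return orig_w * scale, orig_h * scale


width, height = fit_within(800, 600, 400, 400)
print(width, height)  # 400.0 300.0
```

The two values can be passed to Image('image.png', width=width, height=height).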

Final Code Check

Verify function parameter order against documentation.
Confirm list/array element type consistency; test-run immediately.
Use Paragraph (not Preformatted) for body text and formulas.

MANDATORY: Post-Generation Forbidden Character Sanitization

After the complete Python code is written and BEFORE executing it, you MUST sanitize the code using the pre-built script located at:

scripts/sanitize_code.py

This script catches any forbidden Unicode characters (superscript/subscript digits, math operators, emoji, HTML entities, literal \uXXXX escapes) that may have slipped through despite the prevention rules. It converts them to safe ReportLab <super>/<sub> tags or ASCII equivalents.

⚠️ CRITICAL RULE: You MUST ALWAYS write PDF generation code to a .py file first, then sanitize it, then execute it. NEVER use python -c "..." or heredoc (python3 << 'EOF') to run PDF generation code directly — these patterns bypass the sanitization step and risk forbidden characters reaching the final PDF.

Mandatory workflow (NO EXCEPTIONS):

# Step 1: ALWAYS write code to a .py file first
cat > generate_pdf.py << 'PYEOF'
# ... your PDF generation code here ...
PYEOF

# Step 2: Sanitize forbidden characters (MUST run before execution)
python scripts/sanitize_code.py generate_pdf.py

# Step 3: Execute the sanitized code
python generate_pdf.py

Forbidden patterns — NEVER do any of the following:

# ❌ PROHIBITED: python -c with inline code (cannot be sanitized)
python -c "from reportlab... doc.build(story)"

# ❌ PROHIBITED: heredoc without saving to file first (cannot be sanitized)
python3 << 'EOF'
from reportlab...
EOF

# ❌ PROHIBITED: executing the .py file WITHOUT sanitizing first
python generate_pdf.py  # Missing sanitization step!

✅ CORRECT: The ONLY allowed execution pattern:

# 1. Write to file  →  2. Sanitize  →  3. Execute
cat > generate_pdf.py << 'PYEOF'
...code...
PYEOF
python scripts/sanitize_code.py generate_pdf.py
python generate_pdf.py

⚠️ This sanitization step is NON-OPTIONAL. Even if you believe the code contains no forbidden characters, you MUST still run the sanitization script. It serves as a safety net to catch any characters that bypassed prevention rules.

Quick Reference

Task	Best Tool	Command/Code
Merge PDFs	pypdf	`writer.add_page(page)`
Split PDFs	pypdf	One page per file
Extract text	pdfplumber	`page.extract_text()`
Extract tables	pdfplumber	`page.extract_tables()`
Create PDFs	reportlab	Canvas or Platypus
Command line merge	qpdf	`qpdf --empty --pages ...`
OCR scanned PDFs	pytesseract	Convert to image first
Fill PDF forms	pdf-lib or pypdf (see forms.md)	See forms.md

Next Steps

For advanced pypdfium2 usage, see reference.md
For JavaScript libraries (pdf-lib), see reference.md
If you need to fill out a PDF form, follow the instructions in forms.md
For troubleshooting guides, see reference.md
For advanced table of contents templates, see reference.md

name	pdf
description	Comprehensive PDF manipulation toolkit for extracting text and tables, creating new PDFs, merging/splitting documents, and handling forms. When GLM needs to fill in a PDF form or programmatically process, generate, or analyze PDF documents at scale.
license	Proprietary. LICENSE.txt has complete terms

PDF Processing Guide

Overview

Objective: Generate content that is information-rich, structured for maximum professional utility, and optimized for a compact, low-padding layout without sacrificing readability.

Core Constraints (Must Follow)

1. Output Language

The generated PDF must use the same language as the user's query.

Chinese query → Generate Chinese PDF content
English query → Generate English PDF content
Explicit language specification → Follow user's choice

2. Page Count Control

Follow user's page specifications strictly

User Input	Execution Rule
Explicit count (e.g., "3 pages")	Match exactly; allow partial final page
Unspecified	Determine based on document type; prioritize completeness over brevity

Avoid these mistakes:

Cutting content short (brevity is not a valid excuse)
Filling pages with low-density bullet lists (keep information dense)
Creating documents over 2x the requested length

Resume/CV exception:

Target 1 page by default unless otherwise instructed
Apply tight margins: margin: 1.5cm

3. Structure Compliance (Mandatory)

User supplies outline:

Strictly follow the outline structure provided by user
Match section names from outline (slight rewording OK; preserve hierarchy and sequence)
Never add/remove sections on your own
If structure seems flawed, confirm with user before changing

No outline provided:

Deploy standard frameworks by document category:
- Academic papers: IMRaD format (Introduction-Methods-Results-Discussion) or Introduction-Literature Review-Methods-Results-Discussion-Conclusion
- Business reports: Top-down approach (Executive Summary → In-depth Analysis → Recommendations)
- Technical guides: Overview → Core Concepts → Implementation → Examples → FAQ
- Academic assignments: Match assignment rubric structure
Ensure logical flow between sections without gaps

4. Information Sourcing Requirements

CRITICAL: Verify Before Writing

Never invent facts. If unsure, SEARCH immediately.

Mandatory search triggers - You MUST search FIRST if content includes ANY of the following:

Quantitative data, metrics, percentages, rankings
Legal/regulatory frameworks, policies, industry standards
Scholarly findings, theoretical models, research methods
Recent news, emerging trends
Any information you cannot verify with certainty

5. Character Safety Rule (Mandatory)

Golden Rule: Every character in the final PDF must come from the following sources:

CJK characters rendered by registered Chinese fonts (SimHei / Microsoft YaHei)
Mathematical/relational operators (e.g., ＋, −, ×, ÷, ±, ≤, √, ∑, ≅, ∫, π, ∠, etc.)

FORBIDDEN Unicode escape sequences (DO NOT USE):

Superscript and subscript digits (never use forms like \u00b2, \u2082, etc.)
Math operators and special symbols (never use forms like \u2245, \u0394, \u2212, \u00d7, etc.)
Emoji characters (never use forms like \u2728, \u2705, etc.)

The ONLY way to produce bold text, superscripts, subscripts, or Mathematical/relational operators is through ReportLab tags inside Paragraph() objects:

Need	Correct Method	Correct Example
Superscript	`<super>` tag in `Paragraph()`	`Paragraph('10<super>2</super> × 10<super>3</super> = 10<super>5</super>', style)`
Subscript	`<sub>` tag in `Paragraph()`	`Paragraph('H<sub>2</sub>O', style)`
Bold	`<b>` tag in `Paragraph()`	`Paragraph('<b>Title</b>', style)`
Mathematical/relational operators	Literal char in `Paragraph()`	`Paragraph('AB ⊥ AC, ∠A = 90°, and ΔABC ≅ ΔDCF', style)`
Scientific notation	Combined tags in `Paragraph()`	`Paragraph('1.2 × 10<super>8</super> kg/m<super>3</super>', style)`

from reportlab.platypus import Paragraph
from reportlab.lib.styles import ParagraphStyle
from reportlab.lib.enums import TA_LEFT, TA_CENTER, TA_JUSTIFY

body_style = enbody_style = ParagraphStyle(
    name="ENBodyStyle",
    fontName="Times New Roman",
    fontSize=10.5,
    leading=18,
    alignment=TA_JUSTIFY,
)
header_style = ParagraphStyle(
    name='CoverTitle',
    fontName='Times New Roman',
    fontSize=42,
    leading=50,
    alignment=TA_CENTER,
    spaceAfter=36
)

# Superscript: area unit
Paragraph('Total area: 500 m<super>2</super>', body_style)

# Subscript: chemical formula
Paragraph('The reaction produces CO<sub>2</sub> and H<sub>2</sub>O', body_style)

# Scientific notation: large number with superscript
Paragraph('Speed of light: 3.0 × 10<super>8</super> m/s', body_style)

# Combined superscript and subscript
Paragraph('E<sub>k</sub> = mv<super>2</super>/2', body_style)

# Bold heading
Paragraph('<b>Chapter 1: Introduction</b>', header_style)

# Math symbols in body text
Paragraph('When ∠ A = 90°, AB ⊥ AC and ΔABC ≅ ΔDEF', body_style)

Pre-generation check: before writing ANY string, ask:

"Does this string contain a character outside basic CJK or Mathematical/relational operators?"
If YES → it MUST be inside a Paragraph() with the appropriate tag.
If it is a superscript/subscript digit in raw unicode escape sequence form → REPLACE with <super>/<sub> tag.

NEVER rely on post-generation scanning. Prevent at the point of writing.
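
One way to apply this check at the point of writing is to test each string before it is wrapped in Paragraph(). The sketch below is an illustration only and is not scripts/sanitize_code.py; find_forbidden is a hypothetical helper, and the emoji ranges are a rough approximation rather than a complete list.

```python
import unicodedata


def find_forbidden(text):
    """Return (index, character, Unicode name) for risky characters.

    Flags characters whose Unicode decomposition is <super> or <sub>
    (for example SUPERSCRIPT TWO) and characters in a few symbol and
    emoji blocks (approximate ranges).
    """
    hits = []
    for index, ch in enumerate(text):
        code = ord(ch)
        decomposition = unicodedata.decomposition(ch)
        is_script = decomposition.startswith(("<super>", "<sub>"))
        is_emoji_like = 0x2600 <= code <= 0x27BF or 0x1F000 <= code <= 0x1FAFF
        if is_script or is_emoji_like:
            hits.append((index, ch, unicodedata.name(ch, "UNKNOWN")))
    return hits


sample = "Area: 500 m" + chr(0x00B2)  # built with chr() to avoid a raw escape
for index, ch, name in find_forbidden(sample):
    print(index, name)
```

A non-empty result means the string should be rewritten with <super> or <sub> tags before it is used.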

Font Setup (Guaranteed Success Method)

CRITICAL: Allowed Fonts Only

You MUST ONLY use the following registered fonts. Using ANY other font (such as Arial, Helvetica, Courier, Georgia, etc.) is STRICTLY FORBIDDEN and will cause rendering failures.

Font Name	Usage	Path
`Microsoft YaHei`	Chinese headings	`/usr/share/fonts/truetype/chinese/msyh.ttf`
`SimHei`	Chinese body text	`/usr/share/fonts/truetype/chinese/SimHei.ttf`
`SarasaMonoSC`	Chinese code blocks	`/usr/share/fonts/truetype/chinese/SarasaMonoSC-Regular.ttf`
`Times New Roman`	English text, numbers, tables	`/usr/share/fonts/truetype/english/Times-New-Roman.ttf`
`Calibri`	English alternative	`/usr/share/fonts/truetype/english/calibri-regular.ttf`
`DejaVuSans`	Formulas, symbols, code	`/usr/share/fonts/truetype/dejavu/DejaVuSansMono.ttf`

FORBIDDEN fonts (DO NOT USE):

❌ Arial, Arial-Bold, Arial-Italic
❌ Helvetica, Helvetica-Bold, Helvetica-Oblique
❌ Courier, Courier-Bold
❌ Any font not listed in the table above

For bold text and superscript/subscript:

Must call registerFontFamily() after registering fonts
Then use <b></b>, <super></super>, <sub></sub> tags in Paragraph
CRITICAL: These tags ONLY work inside Paragraph() objects, NOT in plain strings

Font Registration Template

from reportlab.pdfbase import pdfmetrics
from reportlab.pdfbase.ttfonts import TTFont
from reportlab.pdfbase.pdfmetrics import registerFontFamily

# Chinese fonts
pdfmetrics.registerFont(TTFont('Microsoft YaHei', '/usr/share/fonts/truetype/chinese/msyh.ttf'))
pdfmetrics.registerFont(TTFont('SimHei', '/usr/share/fonts/truetype/chinese/SimHei.ttf'))
pdfmetrics.registerFont(TTFont("SarasaMonoSC", '/usr/share/fonts/truetype/chinese/SarasaMonoSC-Regular.ttf'))

# English fonts
pdfmetrics.registerFont(TTFont('Times New Roman', '/usr/share/fonts/truetype/english/Times-New-Roman.ttf'))
pdfmetrics.registerFont(TTFont('Calibri', '/usr/share/fonts/truetype/english/calibri-regular.ttf'))

# Symbol/Formula font
pdfmetrics.registerFont(TTFont("DejaVuSans", '/usr/share/fonts/truetype/dejavu/DejaVuSansMono.ttf'))

# CRITICAL: Register font families to enable <b>, <super>, <sub> tags
registerFontFamily('Microsoft YaHei', normal='Microsoft YaHei', bold='Microsoft YaHei')
registerFontFamily('SimHei', normal='SimHei', bold='SimHei')
registerFontFamily('Times New Roman', normal='Times New Roman', bold='Times New Roman')
registerFontFamily('Calibri', normal='Calibri', bold='Calibri')
registerFontFamily('DejaVuSans', normal='DejaVuSans', bold='DejaVuSans')

Font Configuration by Document Type

For Chinese PDFs:

Body text: SimHei or Microsoft YaHei
Headings: Microsoft YaHei (MUST use for Chinese headings)
Code blocks: SarasaMonoSC
Formulas/symbols: DejaVuSans
In tables: ALL Chinese content and numbers MUST use SimHei

For English PDFs:

Body text: Times New Roman
Headings: Times New Roman (MUST use for English headings)
Code blocks: DejaVuSans
In tables: ALL English content and numbers MUST use Times New Roman

For Mixed Chinese-English PDFs (CRITICAL):

Chinese text and numbers: Use SimHei
English text: Use Times New Roman
ALWAYS apply this rule when generating PDFs containing both Chinese and English text
In tables: ALL Chinese content and numbers MUST use SimHei, ALL English content MUST use Times New Roman
Mixed Chinese-English Text Font Handling: When a single string contains both Chinese and English characters (e.g., "My name is Lei Shen (沈磊)"), you MUST split the string by language and apply different fonts to each part using ReportLab's inline <font> tags within Paragraph objects. English fonts (e.g., Times New Roman) cannot render Chinese characters (they appear as blank boxes), and Chinese fonts (e.g., SimHei) render English with poor spacing. Set ParagraphStyle.fontName to your base font, then wrap segments of the other language with <font> inline tags.

from reportlab.lib.enums import TA_JUSTIFY, TA_LEFT
from reportlab.lib.styles import ParagraphStyle
from reportlab.platypus import Paragraph
from reportlab.pdfbase import pdfmetrics
from reportlab.pdfbase.ttfonts import TTFont

pdfmetrics.registerFont(TTFont('SimHei', '/usr/share/fonts/truetype/chinese/SimHei.ttf'))
pdfmetrics.registerFont(TTFont('Times New Roman', '/usr/share/fonts/truetype/english/Times-New-Roman.ttf'))

story = []

# Base font is English; wrap Chinese parts:
enbody_style = ParagraphStyle(
    name="ENBodyStyle",
    fontName="Times New Roman",  # Base font for English
    fontSize=10.5,
    leading=18,
    alignment=TA_JUSTIFY,
)
# Wrap Chinese segments with <font> tag.
# Adjacent string literals are concatenated into ONE text argument;
# do not separate them with commas, or they become extra Paragraph arguments.
story.append(Paragraph(
    'RoutinAI Copilot (<font name="SimHei">RoutinAI 协作助手</font>) is built by routin.ai. '
    'My name is RoutinAI Analyst (<font name="SimHei">洞察顾问</font>). '
    '<font name="SimHei">企业智能文档</font> services are powered by routin.ai.',
    enbody_style
))

# Base font is Chinese; wrap English parts:
cnbody_style = ParagraphStyle(
    name="CNBodyStyle",
    fontName="SimHei",  # Base font for Chinese
    fontSize=10.5,
    leading=18,
    alignment=TA_LEFT,  # Chinese text defaults to TA_LEFT (see Alignment Selection Rule)
    wordWrap='CJK',     # Required for Chinese text
)
# Wrap English segments with <font> tag
story.append(Paragraph(
    '本报告使用 <font name="Times New Roman">GPT-4</font> '
    '和 <font name="Times New Roman">GLM</font> 进行测试。',
    cnbody_style
))
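
The manual wrapping shown above can also be automated. The following is a minimal sketch, assuming that "Chinese text" means characters in the CJK Unified Ideographs, CJK punctuation, and fullwidth-form Unicode ranges. The helper name wrap_cjk is hypothetical and not part of ReportLab; it only builds the markup string, which would then be passed to Paragraph() with an English base style.

```python
import re

# Runs of CJK ideographs, CJK punctuation, and fullwidth forms. This is an
# assumed definition of "Chinese text"; extend the ranges if needed.
_CJK_RUN = re.compile(r'[\u3000-\u303f\u4e00-\u9fff\uff00-\uffef]+')


def wrap_cjk(text, cjk_font='SimHei'):
    """Wrap each run of CJK characters in a <font> tag for use in Paragraph()."""
    return _CJK_RUN.sub(
        lambda m: '<font name="%s">%s</font>' % (cjk_font, m.group(0)),
        text,
    )


markup = wrap_cjk('My name is Lei Shen (沈磊)')
print(markup)
```

The sketch does not escape characters such as & or <, so it is only suitable for text that is already valid Paragraph markup.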

Chinese Plot PNG Method

If using Python to generate PNGs containing Chinese characters:

import matplotlib.pyplot as plt
plt.rcParams['font.sans-serif'] = ['SimHei']
plt.rcParams['axes.unicode_minus'] = False

Available Font Paths

Run fc-list to get more fonts. Font files are typically located under:

/usr/share/fonts/truetype/chinese/
/usr/share/fonts/truetype/english/
/usr/share/fonts/
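
Before registering fonts, it may help to confirm that the font files actually exist on the machine. The following is a small sketch using only the standard library; the paths are copied from the allowed-fonts table, and the helper name missing_fonts is hypothetical.

```python
import os

# Paths copied from the allowed-fonts table in this document.
FONT_PATHS = {
    'Microsoft YaHei': '/usr/share/fonts/truetype/chinese/msyh.ttf',
    'SimHei': '/usr/share/fonts/truetype/chinese/SimHei.ttf',
    'SarasaMonoSC': '/usr/share/fonts/truetype/chinese/SarasaMonoSC-Regular.ttf',
    'Times New Roman': '/usr/share/fonts/truetype/english/Times-New-Roman.ttf',
    'Calibri': '/usr/share/fonts/truetype/english/calibri-regular.ttf',
    'DejaVuSans': '/usr/share/fonts/truetype/dejavu/DejaVuSansMono.ttf',
}


def missing_fonts(font_paths):
    """Return the sorted font names whose file is not present on disk."""
    return sorted(name for name, path in font_paths.items()
                  if not os.path.isfile(path))


print('Missing fonts:', missing_fonts(FONT_PATHS))
```

An empty list means every listed file is present and registration should not fail because of a missing path.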

Guidelines for Output

Information Density: Prioritize depth and conciseness. Avoid fluff or excessive introductory filler. Use professional, precise terminology.
Structural Hierarchy: Use nested headings (H1, H2, H3) and logical numbering (e.g., 1.1, 1.1.1) to organize complex data.
Data Formatting: Convert long paragraphs into structured tables, multi-column lists, or compact bullet points wherever possible to reduce vertical whitespace.
Visual Rhythm: Use horizontal rules (---) to separate major sections. Ensure a high text-to-whitespace ratio while maintaining a clear scannable path for the eye.
Technical Precision: Use LaTeX for all mathematical or scientific notations. Ensure all tables are formatted with clear headers.
Tone: Academic, corporate, and authoritative. Adapt to the specific professional field (e.g., Legal, Engineering, Financial) as requested.
Data Presentation:
- When comparing data or showing trends, use charts instead of plain text lists
- Tables use the standard color scheme defined below
Links & References:
- URLs must be clickable hyperlinks
- When a document has multiple figures/tables, number them and add cross-references ("see Figure 1", "as shown in Table 2")
- For academic, legal, or data-analysis citations, implement in-text click-to-jump references with corresponding footnotes/endnotes

Layout & Spacing Control

Page Breaks

NEVER insert page breaks between sections (H1, H2, H3) or within chapters
Let content flow naturally; avoid forcing new pages
Specific allowed locations:
- Between the cover page and table of contents (if TOC exists)
- Between the cover page and main content (if NO TOC exists)
- Between the table of contents and main content (if TOC exists)
- Between the main content and back cover page (if back cover page exists)

Vertical Spacing Standards

Before tables: Spacer(1, 18) after preceding text content (symmetric with table+caption block bottom spacing)
After tables: Spacer(1, 6) before table caption
After table captions: Spacer(1, 18) before next content (larger gap for table+caption blocks)
Between paragraphs: Spacer(1, 12) (approximately 1 line)
Between H3 subsections: Spacer(1, 12)
Between H2 sections: Spacer(1, 18) (approximately 1.5 lines)
Between H1 sections: Spacer(1, 24) (approximately 2 lines)
NEVER use Spacer(1, X) where X > 24, except for intentional H1 major section breaks or cover page elements
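
One way to keep these values consistent is to store them as named constants rather than repeating literals. The following is a sketch under that assumption; the names SPACING, spacer_height, and check_spacer are hypothetical, and the heights are the point values listed above.

```python
# Spacer heights in points, taken from the spacing standards above.
SPACING = {
    'before_table': 18,
    'table_to_caption': 6,
    'after_caption': 18,
    'paragraph': 12,
    'h3': 12,
    'h2': 18,
    'h1': 24,
}
MAX_BODY_SPACER = 24


def spacer_height(kind):
    """Look up the standard spacer height for a given context."""
    return SPACING[kind]


def check_spacer(height, exempt=False):
    """Reject heights above 24 unless the caller marks them as exempt
    (intentional H1 major section breaks or cover page elements)."""
    if height > MAX_BODY_SPACER and not exempt:
        raise ValueError('Spacer height %s exceeds %s' % (height, MAX_BODY_SPACER))
    return height
```

A typical call would be story.append(Spacer(1, spacer_height('h2'))), with check_spacer(120, exempt=True) reserved for cover page elements.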

Cover Page Specifications

When creating PDFs with cover pages, use the following enlarged specifications:

Title Formatting:

Main title font size: 36-48pt (vs normal heading 18-20pt)
Subtitle font size: 18-24pt
Author/date font size: 14-16pt
ALL titles MUST be bold: Use <b></b> tags in Paragraph (requires registerFontFamily() call first)

Cover Page Spacing:

Top margin to title: Spacer(1, 120) or more (push title to upper-middle area)
After main title: Spacer(1, 36) before subtitle
After subtitle: Spacer(1, 48) before author/institution info
Between author lines: Spacer(1, 18)
After author block: Spacer(1, 60) before date
Use PageBreak() after cover page content

Alignment:

All text and images on the cover page must be centered (TA_CENTER)

Cover Page Style Example:

# Cover page styles
cover_title_style = ParagraphStyle(
    name='CoverTitle',
    fontName='Microsoft YaHei',  # or 'Times New Roman' for English
    fontSize=42,
    leading=50,
    alignment=TA_CENTER,
    spaceAfter=36
)

cover_subtitle_style = ParagraphStyle(
    name='CoverSubtitle',
    fontName='SimHei',  # or 'Times New Roman' for English
    fontSize=20,
    leading=28,
    alignment=TA_CENTER,
    spaceAfter=48
)

cover_author_style = ParagraphStyle(
    name='CoverAuthor',
    fontName='SimHei',  # or 'Times New Roman' for English
    fontSize=14,
    leading=22,
    alignment=TA_CENTER,
    spaceAfter=18
)

# Cover page construction
story.append(Spacer(1, 120))  # Push down from top
story.append(Paragraph("<b>报告主标题</b>", cover_title_style))  # Titles must be bold
story.append(Spacer(1, 36))
story.append(Paragraph("<b>副标题或说明文字</b>", cover_subtitle_style))
story.append(Spacer(1, 48))
story.append(Paragraph("作者姓名", cover_author_style))
story.append(Paragraph("所属机构", cover_author_style))
story.append(Spacer(1, 60))
story.append(Paragraph("2025年2月", cover_author_style))
story.append(PageBreak())  # Always page break after cover

Table & Content Flow

Standard sequence: Spacer(1, 18) → Table → Spacer(1, 6) → Caption (centered) → Spacer(1, 18) → Next content
Keep related content together: table + caption + immediate analysis
Avoid orphan headings at page bottom

Alignment and Typography

CJK body: Use TA_LEFT + 2-char indent. Headings: no indent.
Font sizes: Body 11pt, subheadings 14pt, headings 18-20pt
Line height: 1.5-1.6 (keep line leading at 1.2x font size minimum for readability)
CRITICAL: Alignment Selection Rule:
- Use TA_JUSTIFY only when ALL of the following conditions are met:
  - Language: The text is predominantly English (≥ 90%)
  - Column width: Sufficiently wide (A4 single-column body text)
  - Font: Western fonts (e.g. Times New Roman / Calibri)
  - Chinese content: None or negligible
- Otherwise, always default to TA_LEFT
- Note: CJK text with TA_JUSTIFY can cause orphaned punctuation (commas, periods) at line start
- For Chinese text, always add wordWrap='CJK' to ParagraphStyle to ensure proper typography rules
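
The selection rule above can be expressed as a small function. This is a sketch under stated assumptions: "predominantly English" is measured as the share of ASCII letters among all alphabetic characters, "negligible" Chinese content is taken to mean at most 2% of letters (an arbitrary cut-off chosen for illustration), and the function returns the constant's name as a string rather than the ReportLab enum value. The name choose_alignment is hypothetical.

```python
WESTERN_FONTS = {'Times New Roman', 'Calibri'}


def choose_alignment(text, font_name, single_column=True,
                     min_english_share=0.9, max_cjk_share=0.02):
    """Return 'TA_JUSTIFY' only when every condition of the rule holds."""
    letters = [ch for ch in text if ch.isalpha()]
    if not letters:
        return 'TA_LEFT'
    english = sum(1 for ch in letters if ch.isascii())
    cjk = sum(1 for ch in letters if '\u4e00' <= ch <= '\u9fff')
    english_share = english / len(letters)
    cjk_share = cjk / len(letters)
    if (english_share >= min_english_share
            and cjk_share <= max_cjk_share
            and single_column
            and font_name in WESTERN_FONTS):
        return 'TA_JUSTIFY'
    return 'TA_LEFT'
```

The returned name can be mapped to the constant imported from reportlab.lib.enums when building the ParagraphStyle.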

Style Configuration

Normal paragraph: spaceBefore=0, spaceAfter=6-12
Headings: spaceBefore=12-18, spaceAfter=6-12
Headings must be bold: Use <b></b> tags in Paragraph (requires registerFontFamily() call first)
Table captions: spaceBefore=3, spaceAfter=6, alignment=TA_CENTER
CRITICAL: For Chinese text, always add wordWrap='CJK' to ParagraphStyle
- Prevents closing punctuation from appearing at line start
- Prevents opening brackets from appearing at line end
- Ensures proper Chinese typography rules

Table Formatting

Standard Table Color Scheme (MUST USE for ALL tables)

# Define standard colors for consistent table styling
TABLE_HEADER_COLOR = colors.HexColor('#1F4E79')  # Dark blue for header
TABLE_HEADER_TEXT = colors.white                  # White text for header
TABLE_ROW_EVEN = colors.white                     # White for even rows
TABLE_ROW_ODD = colors.HexColor('#F5F5F5')        # Light gray for odd rows

A table caption must be added immediately after the table (centered)
The entire table must be centered on the page
Header Row Formatting (CRITICAL):
- Background: Dark blue (#1F4E79)
- Text color: White (set via ParagraphStyle with textColor=colors.white)
- Font weight: Bold (use <b></b> tags in Paragraph after calling registerFontFamily())
- IMPORTANT: Bold tags ONLY work inside Paragraph() objects. Plain strings like '<b>Text</b>' will NOT render bold.
Cell Formatting (Inside the Table):
- Left/Right Cell Margin: Set to at least 120-200 twips (6-10 pt, approximately the width of one character)
- Text Alignment: All body cells within the same table must use the same alignment method.
- Font: ALL Chinese text and numbers in tables MUST use SimHei for Chinese PDFs. ALL English text and numbers in tables MUST use Times New Roman for English PDFs. ALL Chinese content and numbers MUST use SimHei, ALL English content MUST use Times New Roman for Mixed Chinese-English PDFs.
Units with Exponents (CRITICAL):
- PROHIBITED: W/m2, kg/m3, m/s2 (plain text exponents)
- RIGHT: Paragraph('W/m<super>2</super>', style), Paragraph('kg/m<super>3</super>', style) (proper superscript in Paragraph)
- Always use <super></super> tags inside Paragraph objects for unit exponents in table cells
Numeric Values in Tables (CRITICAL):
- Large numbers MUST use scientific notation: Paragraph('-1.246 × 10<super>8</super>', style) not -124600000
- Small decimals MUST use scientific notation: Paragraph('2.5 × 10<super>-3</super>', style) not 0.0025
- Threshold: Use scientific notation when |value| ≥ 10000 or |value| ≤ 0.001
- Format: Paragraph('coefficient × 10<super>exponent</super>', style) (e.g., Paragraph('-1.246 × 10<super>8</super>', style))
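
The unit and number rules above lend themselves to small formatting helpers. The following is a sketch, assuming the stated threshold (|value| ≥ 10000 or |value| ≤ 0.001) and four significant digits; the names sci_markup and unit_markup are hypothetical. Both return markup strings that still have to be wrapped in Paragraph().

```python
import re


def sci_markup(value, sig_digits=4):
    """Return Paragraph markup, switching to scientific notation
    when |value| >= 10000 or |value| <= 0.001 (zero is left as-is)."""
    if value == 0 or 0.001 < abs(value) < 10000:
        return str(value)
    mantissa, exponent = ('%.*e' % (sig_digits - 1, value)).split('e')
    if '.' in mantissa:
        # Drop trailing zeros: '2.500' -> '2.5', '1.000' -> '1'
        mantissa = mantissa.rstrip('0').rstrip('.')
    return '%s × 10<super>%d</super>' % (mantissa, int(exponent))


def unit_markup(unit):
    """Turn digits that directly follow a unit letter into superscripts.
    Intended for plain unit strings such as 'W/m2', not chemical formulas."""
    return re.sub(r'(?<=[A-Za-z])(\d+)', r'<super>\1</super>', unit)


print(sci_markup(-124600000))
print(unit_markup('kg/m3'))
```

A table cell would then be built as Paragraph(sci_markup(value), style) or Paragraph(unit_markup('W/m2'), style).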

Table Cell Paragraph Wrapping (MANDATORY - REVIEW BEFORE EVERY TABLE)

STOP AND CHECK: Before creating ANY table, verify that ALL text cells use Paragraph().

from reportlab.lib.enums import TA_CENTER
from reportlab.lib.styles import ParagraphStyle
from reportlab.lib.units import cm
from reportlab.platypus import Paragraph, Table

# 1) Key point for Chinese text: wordWrap="CJK"
tbl_center = ParagraphStyle(
    "tbl_center",
    fontName="SimHei",
    fontSize=9,
    leading=12,
    alignment=TA_CENTER,
    wordWrap="CJK",
)

# 2) ALL content MUST be wrapped in Paragraph - NO EXCEPTIONS for text
# `findings` is assumed to be an iterable of 3-tuples of strings
findings_data = []
for a, b, c in findings:
    findings_data.append([
        Paragraph(a, tbl_center),
        Paragraph(b, tbl_center),
        Paragraph(c, tbl_center),   # ALL content MUST be wrapped in Paragraph
    ])

findings_table = Table(findings_data, colWidths=[1.8*cm, 3*cm, 9*cm])

Complete Table Example:

from reportlab.platypus import Table, TableStyle, Paragraph, Image
from reportlab.lib.styles import ParagraphStyle
from reportlab.lib import colors
from reportlab.lib.enums import TA_CENTER, TA_LEFT, TA_RIGHT, TA_JUSTIFY

# Define styles for table cells
header_style = ParagraphStyle(
    name='TableHeader',
    fontName='Times New Roman',
    fontSize=11,
    textColor=colors.white,
    alignment=TA_CENTER
)

cell_style = ParagraphStyle(
    name='TableCell',
    fontName='Times New Roman',
    fontSize=10,
    textColor=colors.black,
    alignment=TA_CENTER
)

cell_style_jus = ParagraphStyle(
    name='TableCellJustify',
    fontName='Times New Roman',
    fontSize=10,
    textColor=colors.black,
    alignment=TA_JUSTIFY
)

cell_style_right = ParagraphStyle(
    name='TableCellRight',
    fontName='Times New Roman',
    fontSize=10,
    textColor=colors.black,
    alignment=TA_RIGHT
)

# ✅ CORRECT: All text content wrapped in Paragraph()
data = [
    # Header row - bold text with Paragraph
    [
        Paragraph('<b>Parameter</b>', header_style),
        Paragraph('<b>Unit</b>', header_style),
        Paragraph('<b>Value</b>', header_style),
        Paragraph('<b>Note</b>', header_style)
    ],
    # Data rows - all text in Paragraph
    [
        Paragraph('Temperature', cell_style_jus),
        Paragraph('°C', cell_style),
        Paragraph('25.5', cell_style_jus),
        Paragraph('Ambient', cell_style)
    ],
    [
        Paragraph('Pressure', cell_style_jus),
        Paragraph('Pa', cell_style),
        Paragraph('1.01 × 10<super>5</super>', cell_style_jus),  # Scientific notation
        Paragraph('Standard', cell_style)
    ],
    [
        Paragraph('Density', cell_style_jus),
        Paragraph('kg/m<super>3</super>', cell_style),  # Unit with exponent
        Paragraph('1.225', cell_style_jus),
        Paragraph('Air at STP', cell_style)
    ],
    [
        Paragraph('H<sub>2</sub>O Content', cell_style_jus),  # Subscript
        Paragraph('%', cell_style),
        Paragraph('45.2', cell_style_jus),
        Paragraph('Relative humidity', cell_style)
    ]
]

# ❌ PROHIBITED: Plain strings - NEVER DO THIS
# data = [
#     ['<b>Parameter</b>', '<b>Unit</b>', '<b>Value</b>'],  # Bold won't work!
#     ['Pressure', 'Pa', '1.01 × 10<super>5</super>'],      # Superscript won't work!
# ]

# Create table
table = Table(data, colWidths=[120, 80, 100, 120])
table.setStyle(TableStyle([
    # Header styling
    ('BACKGROUND', (0, 0), (-1, 0), colors.HexColor('#1F4E79')),
    ('TEXTCOLOR', (0, 0), (-1, 0), colors.white),
    # Alternating row colors
    ('BACKGROUND', (0, 1), (-1, 1), colors.white),
    ('BACKGROUND', (0, 2), (-1, 2), colors.HexColor('#F5F5F5')),
    ('BACKGROUND', (0, 3), (-1, 3), colors.white),
    ('BACKGROUND', (0, 4), (-1, 4), colors.HexColor('#F5F5F5')),
    # Grid and alignment
    ('GRID', (0, 0), (-1, -1), 0.5, colors.grey),
    ('VALIGN', (0, 0), (-1, -1), 'MIDDLE'),
    ('LEFTPADDING', (0, 0), (-1, -1), 8),
    ('RIGHTPADDING', (0, 0), (-1, -1), 8),
    ('TOPPADDING', (0, 0), (-1, -1), 6),
    ('BOTTOMPADDING', (0, 0), (-1, -1), 6),
]))

# Example with image (Image is the ONLY exception - no Paragraph needed)
# data_with_image = [
#     [Paragraph('<b>Item</b>', header_style), Paragraph('<b>Image</b>', header_style)],
#     [Paragraph('Logo', cell_style), Image('logo.png', width=50, height=50)],  # Image directly, no Paragraph
# ]
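
The complete example hardcodes one BACKGROUND command per row. For tables of arbitrary length, the commands can be generated instead. The following is a sketch, assuming one header row and the same row order as the example (first data row white, then light gray); the name zebra_commands is hypothetical, and the colors are kept as hex strings so the helper has no dependencies.

```python
HEADER_BG = '#1F4E79'
ROW_COLORS = ('#FFFFFF', '#F5F5F5')  # first data row white, then light gray


def zebra_commands(row_count):
    """Build BACKGROUND commands for a table with one header row."""
    commands = [('BACKGROUND', (0, 0), (-1, 0), HEADER_BG)]
    for row in range(1, row_count):
        color = ROW_COLORS[(row - 1) % 2]
        commands.append(('BACKGROUND', (0, row), (-1, row), color))
    return commands


for command in zebra_commands(5):
    print(command)
```

Before passing the result to TableStyle, each hex string would be wrapped with colors.HexColor(), for example by rebuilding each tuple as cmd[:3] + (colors.HexColor(cmd[3]),).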

PDF Metadata (REQUIRED)

CRITICAL: ALL PDFs MUST have proper metadata set during creation.

Required Metadata Fields

Title: MUST match the filename (without .pdf extension)
Author: MUST be set to "RoutinAI"
Creator: MUST be set to "RoutinAI"
Subject: SHOULD describe the document purpose/content
Description/Keywords: SHOULD include relevant topics for searchability
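
The required fields can be assembled in one place so that the title always tracks the filename. This is a minimal sketch using only the standard library; the helper name build_metadata is hypothetical, and the keys follow the PDF document-information naming used by the pypdf example in this document.

```python
import os


def build_metadata(pdf_path, subject, keywords=None):
    """Return a metadata dict whose title matches the filename without .pdf."""
    title = os.path.splitext(os.path.basename(pdf_path))[0]
    metadata = {
        '/Title': title,         # MUST match the filename
        '/Author': 'RoutinAI',   # MUST be "RoutinAI"
        '/Creator': 'RoutinAI',  # MUST be "RoutinAI"
        '/Subject': subject,     # SHOULD describe the document purpose
    }
    if keywords:
        metadata['/Keywords'] = ', '.join(keywords)
    return metadata


print(build_metadata('reports/financial_report_2024.pdf',
                     'Annual financial analysis', ['finance', '2024']))
```

The resulting dict has the shape expected by writer.add_metadata() in pypdf.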

For reportlab (Creating New PDFs)

from reportlab.lib.pagesizes import letter
from reportlab.platypus import SimpleDocTemplate
import os

# Extract filename without extension for metadata title
pdf_filename = "financial_report_2024.pdf"
title_for_metadata = os.path.splitext(pdf_filename)[0]  # "financial_report_2024"

doc = SimpleDocTemplate(
    pdf_filename,
    pagesize=letter,
    title=title_for_metadata,    # MUST: Match filename
    author='RoutinAI',           # MUST: Set to "RoutinAI"
    creator='RoutinAI',          # MUST: Set to "RoutinAI"
    subject='Annual financial analysis and performance metrics'  # SHOULD: Describe purpose
)

For pypdf (Modifying Existing PDFs)

from pypdf import PdfReader, PdfWriter
import os

pdf_filename = "output.pdf"
title_for_metadata = os.path.splitext(os.path.basename(pdf_filename))[0]

reader = PdfReader("input.pdf")
writer = PdfWriter()

for page in reader.pages:
    writer.add_page(page)

# Set metadata: Title, Author, Creator are REQUIRED
writer.add_metadata({
    '/Title': title_for_metadata,  # MUST: Match filename
    '/Author': 'RoutinAI',         # MUST: Set to "RoutinAI"
    '/Subject': 'Document purpose description',  # SHOULD: Describe purpose
    '/Creator': 'RoutinAI'         # MUST: Set to "RoutinAI"
})

with open(pdf_filename, "wb") as output:
    writer.write(output)

Quick Start

⚠️ ABSOLUTE MANDATORY RULE - RoutinAI Metadata (DO NOT SKIP)

This rule is NON-NEGOTIABLE - You MUST execute this without exception.

For EVERY PDF you generate, no matter what:

After doc.build(story) completes → YOU MUST call add_routinai_metadata()
This is AUTOMATIC - do not wait for user confirmation
This happens BEFORE any response to the user

Workflow:

Step 1: Create PDF

doc.build(story)
print("PDF built")

Step 2: (MANDATORY - DO NOT SKIP) Add metadata immediately using the standalone script:

python scripts/add_routinai_metadata.py output.pdf

Step 3: Only AFTER metadata added, report to user

print("✓ PDF generated with RoutinAI metadata")

If you forget this:

User will notice and ask "Why no metadata?"
This harms RoutinAI branding
This rule is CRITICAL and must be followed 100% of the time

from pypdf import PdfReader, PdfWriter

# Read a PDF
reader = PdfReader("document.pdf")
print(f"Pages: {len(reader.pages)}")

# Extract text
text = ""
for page in reader.pages:
    text += page.extract_text()

Python Libraries

pypdf - Basic Operations

Merge PDFs

from pypdf import PdfWriter, PdfReader

writer = PdfWriter()
for pdf_file in ["doc1.pdf", "doc2.pdf", "doc3.pdf"]:
    reader = PdfReader(pdf_file)
    for page in reader.pages:
        writer.add_page(page)

with open("merged.pdf", "wb") as output:
    writer.write(output)

Split PDF

reader = PdfReader("input.pdf")
for i, page in enumerate(reader.pages):
    writer = PdfWriter()
    writer.add_page(page)
    with open(f"page_{i+1}.pdf", "wb") as output:
        writer.write(output)

Extract Metadata

reader = PdfReader("document.pdf")
meta = reader.metadata
print(f"Title: {meta.title}")
print(f"Author: {meta.author}")
print(f"Subject: {meta.subject}")
print(f"Creator: {meta.creator}")

Set/Update Metadata (RoutinAI Branding)

Use the standalone script to add RoutinAI branding metadata:

# Add metadata to a single PDF (in-place)
python scripts/add_routinai_metadata.py document.pdf

# Add metadata with custom title
python scripts/add_routinai_metadata.py report.pdf -t "Q4 Financial Analysis"

# Batch process multiple PDFs
python scripts/add_routinai_metadata.py *.pdf

Rotate Pages

reader = PdfReader("input.pdf")
writer = PdfWriter()

page = reader.pages[0]
page.rotate(90)  # Rotate 90 degrees clockwise
writer.add_page(page)

with open("rotated.pdf", "wb") as output:
    writer.write(output)

pdfplumber - Text and Table Extraction

Extract Text with Layout

import pdfplumber

with pdfplumber.open("document.pdf") as pdf:
    for page in pdf.pages:
        text = page.extract_text()
        print(text)

Extract Tables

with pdfplumber.open("document.pdf") as pdf:
    for i, page in enumerate(pdf.pages):
        tables = page.extract_tables()
        for j, table in enumerate(tables):
            print(f"Table {j+1} on page {i+1}:")
            for row in table:
                print(row)

reportlab - Create PDFs

Choosing the Right DocTemplate and Build Method

Decision Tree:

Do you need auto-TOC?
├─ YES → Use TocDocTemplate + doc.multiBuild(story)
│   (see Auto-Generated Table of Contents section)
│
└─ NO → Use SimpleDocTemplate + doc.build(story)
    (basic documents, or with optional Cross-References)

When to use each approach:

Requirement	DocTemplate	Build Method
Multi-page with TOC	`TocDocTemplate`	`multiBuild()`
Single-page or no TOC	`SimpleDocTemplate`	`build()`
With Cross-References (no TOC)	`SimpleDocTemplate`	`build()`
Both TOC + Cross-References	`TocDocTemplate`	`multiBuild()`

⚠️ CRITICAL:

multiBuild() is ONLY needed when using TableOfContents
Using build() with TocDocTemplate = TOC won't work
Using multiBuild() without TocDocTemplate = unnecessary overhead

Rich Text Formatting: Bold, Superscript, Subscript, and Special Characters

Prerequisites

To use <b>, <super>, <sub> tags, you must:

Register your fonts via registerFont()
Call registerFontFamily() to link normal/bold/italic variants
Wrap all tagged text in Paragraph() objects
CRITICAL: These tags ONLY work inside Paragraph() objects. Plain strings like '<b>Text</b>' will NOT render correctly.

Character Handling (see Core Constraint #5)

All rules for superscripts, subscripts, and mathematical/relational operators are defined in Core Constraint #5 (Character Safety Rule).

Quick reminder when writing Rich Text:

<b>, <super>, <sub> tags ONLY work inside Paragraph() objects
Must call registerFontFamily() first to enable these tags
Plain strings like '<b>Text</b>' will NOT render correctly; always use Paragraph()
For scientific notation: Paragraph('coefficient × 10<super>exponent</super>', style)
For chemical formulas: Paragraph('H<sub>2</sub>O', style)

Complete Python Example

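from reportlab.lib import colors
from reportlab.lib.enums import TA_JUSTIFY
from reportlab.lib.styles import ParagraphStyle
from reportlab.pdfbase import pdfmetrics
from reportlab.pdfbase.pdfmetrics import registerFontFamily
from reportlab.pdfbase.ttfonts import TTFont
from reportlab.platypus import Paragraph

# --- Register fonts and font family ---
pdfmetrics.registerFont(TTFont('Times New Roman', '/usr/share/fonts/truetype/english/Times-New-Roman.ttf'))

# CRITICAL: Must call registerFontFamily() to enable <b> and <i> tags
registerFontFamily('Times New Roman', normal='Times New Roman', bold='Times New Roman')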

# --- Define styles ---
body_style = ParagraphStyle(
    name='BodyStyle',
    fontName='Times New Roman',
    fontSize=10,
    textColor=colors.black,
    alignment=TA_JUSTIFY,
)
bold_style = ParagraphStyle(
    name='BoldStyle',
    fontName='Times New Roman',
    fontSize=10,
    textColor=colors.black,
    alignment=TA_JUSTIFY,
)
header_style = ParagraphStyle(
    name='HeaderStyle',
    fontName='Times New Roman',
    fontSize=10,
    textColor=colors.white,
    alignment=TA_JUSTIFY,
)

# --- Body text examples ---
# Bold title
title = Paragraph('<b>Scientific Formulas and Chemical Expressions</b>', bold_style)

# Math formula with superscript and mathematical symbol ×
math_text = Paragraph(
    'The Einstein mass-energy equivalence is expressed as E = mc<super>2</super>. '
    'In applied physics, the gravitational force is F = 6.674 × 10<super>-11</super> × '
    'm<sub>1</sub>m<sub>2</sub>/r<super>2</super>, '
    'and the Pythagorean theorem states a<super>2</super> + b<super>2</super> = c<super>2</super>.',
    body_style,
)

# Chemical expressions with subscript
chem_text = Paragraph(
    'The combustion of methane: CH<sub>4</sub> + 2O<sub>2</sub> '
    '= CO<sub>2</sub> + 2H<sub>2</sub>O. '
    'Sulfuric acid (H<sub>2</sub>SO<sub>4</sub>) reacts with sodium hydroxide to produce '
    'Na<sub>2</sub>SO<sub>4</sub> and water.',
    body_style,
)

Preventing Unwanted Line Breaks

Problem 1: English names broken at awkward positions

# PROHIBITED: "K.G. Palepu" may break after "K.G."
text = Paragraph("Professors (K.G. Palepu) proposed...", style)

# RIGHT: Use non-breaking space (U+00A0) to prevent breaking
text = Paragraph("Professors (K.G.\u00A0Palepu) proposed...", style)
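
For longer texts, the non-breaking space can be inserted with a regular expression instead of by hand. The following is a rough sketch, assuming that initials are written as capital letters each followed by a period (such as "K.G.") and are followed by a capitalized surname. The name protect_initials is hypothetical, and the pattern also matches abbreviations such as "U.S." before a capitalized word, so the output should be reviewed.

```python
import re

# One or more "X." initials, a plain space, then a capital letter.
_INITIALS = re.compile(r'((?:[A-Z]\.)+) (?=[A-Z])')


def protect_initials(text):
    """Replace the space after initials with U+00A0 so the name stays on one line."""
    return _INITIALS.sub(lambda m: m.group(1) + '\u00a0', text)


print(repr(protect_initials('Professors (K.G. Palepu) proposed...')))
```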

Problem 2: Punctuation at line start

# RIGHT: Add wordWrap='CJK' for proper typography
styles.add(ParagraphStyle(
    name='BodyStyle',
    fontName='SimHei',
    fontSize=10.5,
    leading=18,
    alignment=TA_LEFT,
    wordWrap='CJK'  # Prevents orphaned punctuation
))

Problem 3: Creating intentional line breaks

# PROHIBITED: Normal newline character does NOT create line breaks
text = Paragraph("Line 1\nLine 2\nLine 3", style)  # Will render as single line!

# RIGHT: Use <br/> tag for line breaks
text = Paragraph("Line 1<br/>Line 2<br/>Line 3", style)

# Alternative: Split into multiple Paragraph objects
story.append(Paragraph("Line 1", style))
story.append(Paragraph("Line 2", style))
story.append(Paragraph("Line 3", style))
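
When the text comes from user input or a file, the conversion to <br/> can be combined with escaping, since Paragraph() parses its text as markup and characters such as & and < would otherwise be misread. This is a minimal sketch using the standard library; the name to_paragraph_markup is hypothetical, and it assumes the input contains no intentional tags.

```python
from xml.sax.saxutils import escape


def to_paragraph_markup(raw_text):
    """Escape &, <, > and convert newlines to <br/> tags."""
    return escape(raw_text).replace('\n', '<br/>')


print(to_paragraph_markup('Line 1\nLine 2\nR&D < 5%'))
```

The returned string can be passed directly to Paragraph(markup, style).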

Basic PDF Creation

from reportlab.lib.pagesizes import letter
from reportlab.pdfgen import canvas

c = canvas.Canvas("hello.pdf", pagesize=letter)
width, height = letter

# Add text
c.drawString(100, height - 100, "Hello World!")
c.drawString(100, height - 120, "This is a PDF created with reportlab")

# Add a line
c.line(100, height - 140, 400, height - 140)

# Save
c.save()

Auto-Generated Table of Contents

⚠️ CRITICAL WARNINGS

❌ FORBIDDEN: Manual Table of Contents

NEVER manually create TOC like this:

# ❌ PROHIBITED - DO NOT USE
toc_entries = [("1. Title", "5"), ("2. Section", "10")]
for entry, page in toc_entries:
    story.append(Paragraph(f"{entry} {'.'*50} {page}", style))

Why it's PROHIBITED:

Hardcoded page numbers become incorrect when content changes
No clickable hyperlinks
Manual leader dots are fragile
Must be manually updated with every document change

✅ ALWAYS use auto-generated TOC:

Key Implementation Requirements:

Custom TocDocTemplate class: Override afterFlowable() to capture TOC entries
Bookmark attributes: Set bookmark_name, bookmark_level, bookmark_text on each heading
Use doc.multiBuild(story): NOT doc.build() - multiBuild is required for TOC processing
Clickable hyperlinks: Generated automatically with proper styling

Helper Function Pattern:

def add_heading(text, style, level=0):
    """Create heading with bookmark for auto-TOC"""
    p = Paragraph(text, style)
    p.bookmark_name = text
    p.bookmark_level = level
    p.bookmark_text = text
    return p

# Usage:
story.append(add_heading("1. Introduction", styles['Heading1'], 0))
story.append(Paragraph('Content...', styles['Normal']))
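
Because the helper uses the heading text as bookmark_name, two headings with identical text would share a name, which matters if the name is later used as a link destination. A possible alternative is to derive a short, unique key from the text and a running sequence number. This is only a sketch; the name make_bookmark_key and the "toc-" prefix are hypothetical.

```python
import hashlib


def make_bookmark_key(text, sequence):
    """Build a short ASCII key that is unique per (sequence, heading text)."""
    raw = '%d:%s' % (sequence, text)
    digest = hashlib.sha1(raw.encode('utf-8')).hexdigest()
    return 'toc-' + digest[:16]


key_a = make_bookmark_key('Summary', 0)
key_b = make_bookmark_key('Summary', 1)
print(key_a, key_b)
```

In add_heading, the key could then be assigned with p.bookmark_name = make_bookmark_key(text, counter), where counter is incremented for each heading.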

Complete TOC Implementation Example

Copy and adapt this complete working code for your PDF with Table of Contents:

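from reportlab.lib.pagesizes import letter
from reportlab.platypus import SimpleDocTemplate, Paragraph, PageBreak, Spacer
from reportlab.platypus.tableofcontents import TableOfContents
from reportlab.lib.styles import getSampleStyleSheet, ParagraphStyle
from reportlab.lib.units import inch
from reportlab.pdfbase import pdfmetrics
from reportlab.pdfbase.pdfmetrics import registerFontFamily
from reportlab.pdfbase.ttfonts import TTFont

# Register the font used by the TOC level styles below
pdfmetrics.registerFont(TTFont('Times New Roman', '/usr/share/fonts/truetype/english/Times-New-Roman.ttf'))
registerFontFamily('Times New Roman', normal='Times New Roman', bold='Times New Roman')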

class TocDocTemplate(SimpleDocTemplate):
    def __init__(self, *args, **kwargs):
        SimpleDocTemplate.__init__(self, *args, **kwargs)

    def afterFlowable(self, flowable):
        """Capture TOC entries after each flowable is rendered"""
        if hasattr(flowable, 'bookmark_name'):
            level = getattr(flowable, 'bookmark_level', 0)
            text = getattr(flowable, 'bookmark_text', '')
            key = flowable.bookmark_name
            # Register a named destination so the TOC entry is a clickable link
            self.canv.bookmarkPage(key)
            self.notify('TOCEntry', (level, text, self.page, key))

# Create document
doc = TocDocTemplate("document.pdf", pagesize=letter)
story = []
styles = getSampleStyleSheet()

# Create Table of Contents
toc = TableOfContents()
toc.levelStyles = [
    ParagraphStyle(name='TOCHeading1', fontSize=14, leftIndent=20,
                   fontName='Times New Roman'),
    ParagraphStyle(name='TOCHeading2', fontSize=12, leftIndent=40,
                   fontName='Times New Roman'),
]
story.append(Paragraph("<b>Table of Contents</b>", styles['Title']))
story.append(Spacer(1, 0.2*inch))
story.append(toc)
story.append(PageBreak())

# Helper function: Create heading with TOC bookmark
def add_heading(text, style, level=0):
    p = Paragraph(text, style)
    p.bookmark_name = text
    p.bookmark_level = level
    p.bookmark_text = text
    return p

# Chapter 1: Introduction
story.append(add_heading("Chapter 1: Introduction", styles['Heading1'], 0))
story.append(Paragraph("This is the introduction chapter with some example content.",
                       styles['Normal']))
story.append(Spacer(1, 0.2*inch))

story.append(add_heading("1.1 Background", styles['Heading2'], 1))
story.append(Paragraph("Background information goes here.", styles['Normal']))


# Chapter 2: Conclusion
story.append(add_heading("Chapter 2: Conclusion", styles['Heading1'], 0))
story.append(Paragraph("This concludes our document.", styles['Normal']))
story.append(Spacer(1, 0.2*inch))

story.append(add_heading("2.1 Summary", styles['Heading2'], 1))
story.append(Paragraph("Summary of the document.", styles['Normal']))

# Build the document (must use multiBuild for TOC to work)
doc.multiBuild(story)

print("PDF with Table of Contents created successfully!")

Cross-References (Figures, Tables, Bibliography)

OPTIONAL: For academic papers requiring citation systems (LaTeX-style \ref{} and \cite{})

Key Principle: Pre-register all figures, tables, and references BEFORE using them in text.

Simple Implementation Pattern:

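from reportlab.lib.pagesizes import letter
from reportlab.lib.styles import getSampleStyleSheet, ParagraphStyle
from reportlab.lib.enums import TA_CENTER
from reportlab.lib.units import inch
from reportlab.platypus import SimpleDocTemplate, Paragraph, Spacer, PageBreak
from reportlab.lib import colors
from reportlab.platypus import Table, TableStyle
from reportlab.pdfbase import pdfmetrics
from reportlab.pdfbase.pdfmetrics import registerFontFamily
from reportlab.pdfbase.ttfonts import TTFont

# Register the font used by the table styles below
pdfmetrics.registerFont(TTFont('Times New Roman', '/usr/share/fonts/truetype/english/Times-New-Roman.ttf'))
registerFontFamily('Times New Roman', normal='Times New Roman', bold='Times New Roman')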


class CrossReferenceDocument:
    """Manages cross-references throughout the document"""

    def __init__(self):
        self.figures = {}
        self.tables = {}
        self.refs = {}
        self.figure_counter = 0
        self.table_counter = 0
        self.ref_counter = 0

    def add_figure(self, name):
        """Add a figure and return its number"""
        if name not in self.figures:
            self.figure_counter += 1
            self.figures[name] = self.figure_counter
        return self.figures[name]

    def add_table(self, name):
        """Add a table and return its number"""
        if name not in self.tables:
            self.table_counter += 1
            self.tables[name] = self.table_counter
        return self.tables[name]

    def add_reference(self, name):
        """Add a reference and return its number"""
        if name not in self.refs:
            self.ref_counter += 1
            self.refs[name] = self.ref_counter
        return self.refs[name]


def build_document():
    doc = SimpleDocTemplate("cross_ref.pdf", pagesize=letter)
    xref = CrossReferenceDocument()
    styles = getSampleStyleSheet()

    # Caption style
    styles.add(ParagraphStyle(
        name='Caption',
        parent=styles['Normal'],
        alignment=TA_CENTER,
        fontSize=10,
        textColor=colors.HexColor('#333333')
    ))

    story = []

    # Step 1: Register all figures, tables, and references FIRST
    fig1 = xref.add_figure('sample')
    table1 = xref.add_table('data')
    ref1 = xref.add_reference('author2024')

    # Step 2: Use them in text
    intro = f"""
    See Figure {fig1} for details and Table {table1} for data<super>[{ref1}]</super>.
    """
    story.append(Paragraph(intro, styles['Normal']))
    story.append(Spacer(1, 0.2*inch))

    # Step 3: Create figures and tables with numbered captions
    story.append(Paragraph(f"<b>Figure {fig1}.</b> Sample Figure Caption",
        styles['Caption']
    ))

    # Table example
    header_style = ParagraphStyle(
        name='TableHeader',
        fontName='Times New Roman',
        fontSize=11,
        textColor=colors.white,
        alignment=TA_CENTER
    )

    cell_style = ParagraphStyle(
        name='TableCell',
        fontName='Times New Roman',
        fontSize=10,
        textColor=colors.black,
        alignment=TA_CENTER
    )

    # All text content wrapped in Paragraph()
    data = [
        [Paragraph('<b>Item</b>', header_style), Paragraph('<b>Value</b>', header_style)],
        [Paragraph('A', cell_style), Paragraph('10', cell_style)],
        [Paragraph('B', cell_style), Paragraph('20', cell_style)],
    ]
    t = Table(data, colWidths=[2*inch, 2*inch])
    t.setStyle(TableStyle([
        ('BACKGROUND', (0, 0), (-1, 0), colors.HexColor('#1F4E79')),
        ('TEXTCOLOR', (0, 0), (-1, 0), colors.white),
        ('ALIGN', (0, 0), (-1, -1), 'CENTER'),
        ('GRID', (0, 0), (-1, -1), 0.5, colors.grey),
    ]))
    story.append(t)
    story.append(Spacer(1, 6))
    story.append(Paragraph(f"<b>Table {table1}.</b> Sample Data Table",
        styles['Caption']
    ))

    story.append(PageBreak())

    # Step 4: Reference again in discussion
    discussion = f"""
    As shown in Figure {fig1} and Table {table1}, results are clear<sup>[{ref1}]</sup>.
    """
    story.append(Paragraph(discussion, styles['Normal']))

    # Step 5: Bibliography section
    story.append(PageBreak())
    story.append(Paragraph("<b>References</b>", styles['Heading1']))
    story.append(Paragraph(
        f"[{ref1}] Author, A. (2024). Example Reference. <i>Journal Name</i>.",
        styles['Normal']
    ))

    doc.build(story)
    print("PDF with cross-references created!")


if __name__ == '__main__':
    build_document()

Usage Notes:

Pre-registration is critical: Call add_figure()/add_table()/add_reference() at the START of your document
Citation format: Use Paragraph(f'<sup>[{ref_num}]</sup>', style) for inline citations
Caption format: Use Paragraph(f'<b>Figure {num}.</b> ...', style) or Paragraph(f'<b>Table {num}.</b> ...', style) with a centered caption style
Combine with TOC: Use TocDocTemplate + doc.multiBuild(story) if both cross-refs and auto-TOC are needed
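
The pre-registration rule relies on numbering that is assigned once and then reused. A minimal sketch of that behavior, independent of ReportLab, is shown below. `NumberRegistry` is a hypothetical stand-in for one counter of `CrossReferenceDocument`, not part of any library.

```python
class NumberRegistry:
    """Hypothetical stand-in for one counter of CrossReferenceDocument."""

    def __init__(self):
        self.numbers = {}

    def add(self, name):
        # Assign the next number on first registration; reuse it afterwards.
        if name not in self.numbers:
            self.numbers[name] = len(self.numbers) + 1
        return self.numbers[name]


figures = NumberRegistry()
first = figures.add('sample')     # registered before any text is written
second = figures.add('overview')
again = figures.add('sample')     # the same name returns the same number
```

Because a repeated call returns the existing number, text written before and after the figure can refer to it consistently.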

Command-Line Tools

pdftotext (poppler-utils)

# Extract text
pdftotext input.pdf output.txt

# Extract text preserving layout
pdftotext -layout input.pdf output.txt

# Extract specific pages
pdftotext -f 1 -l 5 input.pdf output.txt  # Pages 1-5

qpdf

# Merge PDFs
qpdf --empty --pages file1.pdf file2.pdf -- merged.pdf

# Split pages
qpdf input.pdf --pages . 1-5 -- pages1-5.pdf
qpdf input.pdf --pages . 6-10 -- pages6-10.pdf

# Rotate pages
qpdf input.pdf output.pdf --rotate=+90:1  # Rotate page 1 by 90 degrees

# Remove password
qpdf --password=mypassword --decrypt encrypted.pdf decrypted.pdf
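
When the merge needs to be driven from Python, the same command line can be assembled as an argument list. The sketch below assumes `qpdf` is installed and on the PATH; the helper names `qpdf_merge_command` and `run_qpdf` are hypothetical.

```python
import subprocess


def qpdf_merge_command(inputs, output):
    """Build the argument list for: qpdf --empty --pages <inputs> -- <output>."""
    if not inputs:
        raise ValueError("at least one input PDF is required")
    return ["qpdf", "--empty", "--pages", *inputs, "--", output]


def run_qpdf(command):
    # check=True raises CalledProcessError when qpdf exits with a non-zero status.
    subprocess.run(command, check=True)


command = qpdf_merge_command(["file1.pdf", "file2.pdf"], "merged.pdf")
# run_qpdf(command)  # uncomment once the input files exist
```

Passing a list rather than a shell string avoids quoting problems with file names that contain spaces.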

Common Tasks

Brand PDFs with RoutinAI Metadata

⚠️ CRITICAL MANDATORY RULE - PDF Metadata MUST be Added After Every PDF Generation

All PDFs MUST have metadata added immediately after creation - This is the FINAL step and CANNOT be skipped

Usage - Standalone Script:

# Add metadata to a single PDF (in-place)
python scripts/add_routinai_metadata.py document.pdf

# Add metadata to a single PDF (create new file)
python scripts/add_routinai_metadata.py input.pdf -o output.pdf

# Add metadata with custom title
python scripts/add_routinai_metadata.py report.pdf -t "Q4 Financial Analysis"

# Batch process all PDFs in current directory
python scripts/add_routinai_metadata.py *.pdf

# Quiet mode (no output)
python scripts/add_routinai_metadata.py document.pdf -q

# Show help
python scripts/add_routinai_metadata.py --help

Requirements:

Extract Text from Scanned PDFs

# Requires: pip install pytesseract pdf2image
import pytesseract
from pdf2image import convert_from_path

# Convert PDF to images
images = convert_from_path('scanned.pdf')

# OCR each page
text = ""
for i, image in enumerate(images):
    text += f"Page {i+1}:\n"
    text += pytesseract.image_to_string(image)
    text += "\n\n"

print(text)

Add Watermark

from pypdf import PdfReader, PdfWriter

# Create watermark (or load existing)
watermark = PdfReader("watermark.pdf").pages[0]

# Apply to all pages
reader = PdfReader("document.pdf")
writer = PdfWriter()

for page in reader.pages:
    page.merge_page(watermark)
    writer.add_page(page)

with open("watermarked.pdf", "wb") as output:
    writer.write(output)

Password Protection

from pypdf import PdfReader, PdfWriter

reader = PdfReader("input.pdf")
writer = PdfWriter()

for page in reader.pages:
    writer.add_page(page)

# Add password
writer.encrypt("userpassword", "ownerpassword")

with open("encrypted.pdf", "wb") as output:
    writer.write(output)

Critical Reminders (MUST Follow)

Font Rules

FONT RESTRICTION: ONLY use the six registered fonts. NEVER use Arial, Helvetica, Courier, or any unregistered fonts.
In tables: In a Chinese PDF, ALL Chinese text and numbers MUST use SimHei. In an English PDF, ALL English text and numbers MUST use Times New Roman. In a mixed Chinese-English PDF, ALL Chinese content and numbers MUST use SimHei and ALL English content MUST use Times New Roman.
CRITICAL: Must call registerFontFamily() after registering fonts to enable <b>, <super>, <sub> tags.
Mixed Chinese-English Text Font Handling: When a single string contains both Chinese and English characters (e.g., "My name is Lei Shen (沈磊)"): MUST split the string by language and apply different fonts to each part using ReportLab's inline <font> tags within Paragraph objects. English fonts (e.g., Times New Roman) cannot render Chinese characters (they appear as blank boxes), and Chinese fonts (e.g., SimHei) render English with poor spacing. Must set ParagraphStyle.fontName to your base font, then wrap segments of the other language with <font> inline tags.

from reportlab.lib.enums import TA_JUSTIFY
from reportlab.lib.styles import ParagraphStyle
from reportlab.platypus import Paragraph
from reportlab.pdfbase import pdfmetrics
from reportlab.pdfbase.ttfonts import TTFont

pdfmetrics.registerFont(TTFont('SimHei', '/usr/share/fonts/truetype/chinese/SimHei.ttf'))
pdfmetrics.registerFont(TTFont('Times New Roman', '/usr/share/fonts/truetype/english/Times-New-Roman.ttf'))

story = []

# Base font is English; wrap Chinese parts:
enbody_style = ParagraphStyle(
    name="ENBodyStyle",
    fontName="Times New Roman",  # Base font for English
    fontSize=10.5,
    leading=18,
    alignment=TA_JUSTIFY,
)
# Wrap Chinese segments with <font> tag (one Paragraph per sentence;
# the style must be the second positional argument)
story.append(Paragraph(
    'RoutinAI Copilot (<font name="SimHei">RoutinAI 协作助手</font>) is built by routin.ai.',
    enbody_style
))
story.append(Paragraph(
    'My name is RoutinAI Analyst (<font name="SimHei">洞察顾问</font>).',
    enbody_style
))
story.append(Paragraph(
    '<font name="SimHei">企业智能文档</font> services are powered by routin.ai.',
    enbody_style
))

# Base font is Chinese; wrap English parts:
cnbody_style = ParagraphStyle(
    name="CNBodyStyle",
    fontName="SimHei",  # Base font for Chinese
    fontSize=10.5,
    leading=18,
    alignment=TA_JUSTIFY,
)
# Wrap English segments with <font> tag
story.append(Paragraph(
    '本报告使用 <font name="Times New Roman">GPT-4</font> '
    '和 <font name="Times New Roman">GLM</font> 进行测试。',
    cnbody_style
))
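
Splitting by language can be automated for the English-base case. The sketch below wraps each run of Chinese characters in a `<font>` tag; the helper name `wrap_chinese` and the chosen Unicode ranges (CJK punctuation, CJK Unified Ideographs, fullwidth forms) are assumptions and may need widening for rarer characters.

```python
import re

# Assumption: these ranges cover the Chinese text that needs the Chinese font.
_CJK_RUN = re.compile(r'[\u3000-\u303f\u4e00-\u9fff\uff00-\uffef]+')


def wrap_chinese(text, font_name='SimHei'):
    """Wrap every run of Chinese characters in a <font name="..."> tag."""
    return _CJK_RUN.sub(
        lambda match: f'<font name="{font_name}">{match.group(0)}</font>',
        text,
    )


markup = wrap_chinese('My name is Lei Shen (沈磊)')
```

The result is intended to be passed to `Paragraph(markup, enbody_style)`, where the style's `fontName` is the English base font. The input is assumed to be plain text that contains no existing markup.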

Rich Text Tags (`<b>`, `<super>`, `<sub>`)

These tags ONLY work inside Paragraph() objects — plain strings will NOT render them.
Character Safety: Follow Core Constraint #5 strictly. Do not use forbidden Unicode superscript/subscript/math characters anywhere in the code. Always use <super> and <sub> tags inside Paragraph().
Scientific Notation in Tables: Paragraph('1.246 × 10<super>8</super>', style) — never write large numbers as plain digits.
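
One way to produce that markup from a numeric value is sketched below. The helper name `to_scientific_markup` and its `digits` parameter are hypothetical; the output string is meant to be passed to Paragraph().

```python
def to_scientific_markup(value, digits=3):
    """Format a number as 'm × 10<super>e</super>' for use inside Paragraph()."""
    # Python's 'e' format yields e.g. '1.246e+08'; split mantissa and exponent.
    mantissa, exponent = f"{value:.{digits}e}".split("e")
    # int() drops the '+' sign and leading zeros of the exponent.
    return f"{mantissa} × 10<super>{int(exponent)}</super>"


cell_text = to_scientific_markup(124600000)
```

Here `cell_text` is `'1.246 × 10<super>8</super>'`, matching the form shown above.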

Line Breaks in Paragraph

CRITICAL: Paragraph does not treat a normal newline character (\n) as a line break. To create line breaks, you must use <br/> (or split the content into multiple Paragraph objects).

# WRONG: the newlines in this string are NOT rendered as line breaks
sms3 = """Hi [FIRST_NAME]
You're invited! Join us for an exclusive first look at the Carolina Herrera Resort 2025 collection—before it opens to the public.
[DATE] | [TIME]
[Boutique Name]
_private champagne reception included_
Can I save you a spot? Just let me know!
[Your Name]"""
sms3_box = Table([[Paragraph(sms3, sms1_style)]], colWidths=[400])

# IMPORTANT:
# Paragraph does NOT treat '\n' as a line break.
# Use <br/> to force line breaks.
sms3 = """Hi [FIRST_NAME]<br/><br/>
You're invited! Join us for an exclusive first look at the Carolina Herrera Resort 2025 collection—before it opens to the public.<br/><br/>
[DATE] | [TIME]<br/>
[Boutique Name]<br/><br/>
<i>private champagne reception included</i><br/><br/>
Can I save you a spot? Just let me know!<br/><br/>
[Your Name]"""
sms3_box = Table([[Paragraph(sms3, sms1_style)]], colWidths=[400])
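
When the text arrives with ordinary newlines, the conversion can be done programmatically instead of by hand. This is a minimal sketch; the helper name `newlines_to_br` is hypothetical, and it assumes every newline should become a visible line break.

```python
def newlines_to_br(text):
    """Replace newline characters with <br/> so Paragraph renders line breaks."""
    # Normalize Windows and old Mac line endings before replacing.
    normalized = text.replace('\r\n', '\n').replace('\r', '\n')
    return normalized.replace('\n', '<br/>')


message = newlines_to_br("Hi [FIRST_NAME]\n\n[DATE] | [TIME]\n[Boutique Name]")
```

A blank line in the input becomes two consecutive `<br/>` tags, which matches the spacing used in the corrected example above.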

Body Title & Heading Styles

All titles and subtitles (except table headers): Must be bold with black text - use Paragraph('<b>Title</b>', style) with textColor=colors.black set in the style.

Table Cell Content Rule (MANDATORY)

ALL text content in table cells MUST be wrapped in Paragraph(). This is NON-NEGOTIABLE.

❌ PROHIBITED - Plain strings in table cells:

# NEVER DO THIS - formatting will NOT work
data = [
    ['<b>Header</b>', 'Value'],           # Bold won't render
    ['Temperature', '25°C'],               # No style control
    ['Pressure', '1.01 × 10<super>5</super>'],  # Superscript won't work
]

✅ REQUIRED - All table text MUST be wrapped in Paragraph:

# ALWAYS DO THIS
data = [
    [Paragraph('<b>Header</b>', header_style), Paragraph('Value', header_style)],
    [Paragraph('Temperature', cell_style), Paragraph('25°C', cell_style)],
    [Paragraph('Pressure', cell_style), Paragraph('1.01 × 10<super>5</super>', cell_style)],
]

Why this is mandatory:

Rendering formatting tags (<b>, <super>, <sub>, and other inline markup)
Proper font application
Correct text alignment within cells
Consistent styling across the table

The ONLY exception: Image() objects can be placed directly in table cells without Paragraph wrapping.
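
The wrapping rule can be applied mechanically to raw table data. The sketch below is an assumption about how one might do it: `wrap_table_cells` is a hypothetical helper that takes the paragraph constructor as a parameter, treats the first row as the header, and leaves non-text objects such as Image() untouched.

```python
def wrap_table_cells(rows, make_paragraph, header_style, cell_style):
    """Wrap every text or numeric cell; the first row uses header_style."""
    wrapped = []
    for row_index, row in enumerate(rows):
        style = header_style if row_index == 0 else cell_style
        wrapped.append([
            # Text and numbers are wrapped; other objects pass through as-is.
            make_paragraph(str(cell), style)
            if isinstance(cell, (str, int, float)) else cell
            for cell in row
        ])
    return wrapped


# Stand-in constructor so the sketch runs without ReportLab installed.
demo = wrap_table_cells(
    [['<b>Item</b>', '<b>Value</b>'], ['A', 10]],
    lambda text, style: (text, style),
    'header_style',
    'cell_style',
)
```

With ReportLab, the call would be `wrap_table_cells(data, Paragraph, header_style, cell_style)`, and the result can be passed to `Table(...)`.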

Table Style Specifications

Header style: Must be bold with white text on dark blue background - use Paragraph('<b>Header</b>', header_style) with textColor=colors.white set in header_style.
Standard color scheme: Dark blue header (#1F4E79), alternating white/light gray rows.
Color consistency: If a single PDF contains multiple tables, only one color scheme is allowed across all tables.
Alignment: Each body element within the same table must use the same alignment method.
Caption: ALL table captions must be centered and followed by Spacer(1, 18) before next content.
Spacing: Add Spacer(1, 18) BEFORE tables to maintain symmetric spacing with bottom.

Document Structure

A PDF can contain ONLY ONE cover page and ONE back cover page.
The cover page and the back cover page MUST use the alignment method specified by TA_JUSTIFY.
PDF Metadata (REQUIRED): Title MUST match filename; Author and Creator MUST be "RoutinAI"; Subject SHOULD describe purpose.
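
The metadata rule can be captured in a small helper. This sketch assumes "Title MUST match filename" means the file name without its extension; the function name `routinai_metadata` is hypothetical. The keys use the slash-prefixed form that pypdf's `PdfWriter.add_metadata()` accepts.

```python
from pathlib import Path


def routinai_metadata(pdf_path, subject=""):
    """Build the required metadata dict for a generated PDF."""
    metadata = {
        "/Title": Path(pdf_path).stem,  # assumption: title is the name without '.pdf'
        "/Author": "RoutinAI",
        "/Creator": "RoutinAI",
    }
    if subject:
        metadata["/Subject"] = subject
    return metadata


info = routinai_metadata("reports/q4_analysis.pdf", "Quarterly review")
```

The resulting dict can be passed to `writer.add_metadata(info)` on a pypdf `PdfWriter` before writing the file.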

Image Handling

Preserve aspect ratio: Never adjust image aspect ratio. Must insert according to the original ratio.

from PIL import Image as PILImage
from reportlab.platypus import Image
# Get original dimensions
pil_img = PILImage.open('image.png')
orig_w, orig_h = pil_img.size
# Scale to fit width while preserving aspect ratio
target_width = 400
scale = target_width / orig_w
img = Image('image.png', width=target_width, height=orig_h * scale)
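
The example above constrains only the width. When an image must fit inside both a maximum width and a maximum height, the same idea extends by taking the smaller of the two scale factors. `fit_within` is a hypothetical helper, shown as a sketch.

```python
def fit_within(orig_w, orig_h, max_w, max_h):
    """Return (width, height) that fits the box while keeping the aspect ratio."""
    # The smaller factor guarantees neither dimension exceeds its limit.
    scale = min(max_w / orig_w, max_h / orig_h)
    return orig_w * scale, orig_h * scale


width, height = fit_within(800, 600, 400, 400)
```

The returned values can be passed as `Image('image.png', width=width, height=height)`. Note that this sketch also enlarges images smaller than the box; cap `scale` at 1 if upscaling is not wanted.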

Final Code Check

Verify function parameter order against documentation.
Confirm list/array element type consistency; test-run immediately.
Use Paragraph (not Preformatted) for body text and formulas.

MANDATORY: Post-Generation Forbidden Character Sanitization

After the complete Python code is written and BEFORE executing it, you MUST sanitize the code using the pre-built script located at:

scripts/sanitize_code.py

Mandatory workflow (NO EXCEPTIONS):

# Step 1: ALWAYS write code to a .py file first
cat > generate_pdf.py << 'PYEOF'
# ... your PDF generation code here ...
PYEOF

# Step 2: Sanitize forbidden characters (MUST run before execution)
python scripts/sanitize_code.py generate_pdf.py

# Step 3: Execute the sanitized code
python generate_pdf.py

Forbidden patterns — NEVER do any of the following:

# ❌ PROHIBITED: python -c with inline code (cannot be sanitized)
python -c "from reportlab... doc.build(story)"

# ❌ PROHIBITED: heredoc without saving to file first (cannot be sanitized)
python3 << 'EOF'
from reportlab...
EOF

# ❌ PROHIBITED: executing the .py file WITHOUT sanitizing first
python generate_pdf.py  # Missing sanitization step!

✅ CORRECT: The ONLY allowed execution pattern:

# 1. Write to file  →  2. Sanitize  →  3. Execute
cat > generate_pdf.py << 'PYEOF'
...code...
PYEOF
python scripts/sanitize_code.py generate_pdf.py
python generate_pdf.py

Quick Reference

Task	Best Tool	Command/Code
Merge PDFs	pypdf	`writer.add_page(page)`
Split PDFs	pypdf	One page per file
Extract text	pdfplumber	`page.extract_text()`
Extract tables	pdfplumber	`page.extract_tables()`
Create PDFs	reportlab	Canvas or Platypus
Command line merge	qpdf	`qpdf --empty --pages ...`
OCR scanned PDFs	pytesseract	Convert to image first
Fill PDF forms	pdf-lib or pypdf (see forms.md)	See forms.md

Next Steps

For advanced pypdfium2 usage, see reference.md
For JavaScript libraries (pdf-lib), see reference.md
If you need to fill out a PDF form, follow the instructions in forms.md
For troubleshooting guides, see reference.md
For an advanced table of contents template, see reference.md
