pdf-processing

Name: Pdf Processing
Author: Ming-Kai-LC

Comprehensive PDF processing techniques for handling large files that exceed Claude Code's reading limits, including chunking strategies, text/table extraction, and OCR for scanned documents. Use when working with PDFs larger than 10-15MB or more than 30-50 pages.

Updated: 2025-11-16 19:06

package.json

"author": "Ming-Kai-LC"

"repository": "Ming-Kai-LC/self-learn"

name	pdf-processing
description	Comprehensive PDF processing techniques for handling large files that exceed Claude Code's reading limits, including chunking strategies, text/table extraction, and OCR for scanned documents. Use when working with PDFs larger than 10-15MB or more than 30-50 pages.
version	1.0.0
dependencies	python>=3.8, pypdf>=3.0.0, PyMuPDF>=1.23.0, pdfplumber>=0.9.0, pdf2image>=1.16.0, pytesseract>=0.3.10

PDF Processing for Claude Code

Provides comprehensive techniques and utilities for processing PDF files in Claude Code, especially large files that exceed direct reading capabilities.

Overview

Claude Code can read PDF files directly using the Read tool, but has critical limitations:

Official limits: 32MB max file size, 100 pages max
Real-world limits: Much lower (10-15MB, 30-50 pages)
Known issue: Claude Code crashes with large PDFs, causing session termination and context loss
Token cost: 1,500-3,000 tokens per page for text + additional for images

This skill provides workarounds, utilities, and best practices for handling PDFs of any size.
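
As a rough illustration of the token cost above, the sketch below multiplies a page count by the 1,500-3,000 tokens-per-page range quoted in this section. The function name `estimate_token_range` is hypothetical, and the result covers text only, since image tokens are not quantified here.

```python
def estimate_token_range(page_count, low_per_page=1500, high_per_page=3000):
    """Return a (low, high) estimate of text tokens for a PDF.

    The per-page defaults are the figures quoted in the Overview;
    they are rough guidance, not measured values.
    """
    if page_count < 0:
        raise ValueError("page_count must be non-negative")
    return page_count * low_per_page, page_count * high_per_page


low, high = estimate_token_range(40)
print(f"40 pages: roughly {low:,}-{high:,} tokens of text")
```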

Quick Start

Check if PDF is Too Large for Direct Reading

import os

def is_pdf_too_large(filepath, max_mb=10):
    """Check if PDF exceeds safe processing size."""
    size_mb = os.path.getsize(filepath) / (1024 * 1024)
    return size_mb > max_mb

# Use before attempting to read
if is_pdf_too_large("document.pdf"):
    print("PDF too large - use chunking strategies")
else:
    # Safe to read directly with Claude Code
    pass

Extract Text from PDF

import fitz  # PyMuPDF - fastest option

def extract_text_fast(pdf_path):
    """Extract all text from PDF quickly."""
    doc = fitz.open(pdf_path)
    text = ""
    for page in doc:
        text += page.get_text()
    doc.close()
    return text

# Usage
text = extract_text_fast("document.pdf")

Split Large PDF into Chunks

import os

from pypdf import PdfReader, PdfWriter

def chunk_pdf(input_path, pages_per_chunk=25, output_dir="chunks"):
    """Split PDF into smaller files."""
    reader = PdfReader(input_path)
    total_pages = len(reader.pages)

    os.makedirs(output_dir, exist_ok=True)

    for i in range(0, total_pages, pages_per_chunk):
        writer = PdfWriter()
        end = min(i + pages_per_chunk, total_pages)

        for page_num in range(i, end):
            writer.add_page(reader.pages[page_num])

        output_file = f"{output_dir}/chunk_{i//pages_per_chunk:03d}_pages_{i+1}-{end}.pdf"
        with open(output_file, "wb") as output:
            writer.write(output)

        print(f"Created {output_file}")

# Usage
chunk_pdf("large_document.pdf", pages_per_chunk=30)

Extract Tables from PDF

import pdfplumber

def extract_tables(pdf_path):
    """Extract all tables from PDF with high accuracy."""
    tables = []

    with pdfplumber.open(pdf_path) as pdf:
        for page_num, page in enumerate(pdf.pages, 1):
            page_tables = page.extract_tables()
            for table_num, table in enumerate(page_tables, 1):
                tables.append({
                    'page': page_num,
                    'table_num': table_num,
                    'data': table
                })

    return tables

# Usage
tables = extract_tables("report.pdf")
for t in tables:
    print(f"Page {t['page']}, Table {t['table_num']}")
    print(t['data'])

Python Libraries

pypdf (formerly PyPDF2)

Best for: Basic PDF operations (split, merge, rotate)
Speed: Slower than alternatives
Install: pip install pypdf

PyMuPDF (fitz)

Best for: Fast text extraction, general-purpose processing
Speed: 10-20x faster than pypdf
Install: pip install PyMuPDF

pdfplumber

Best for: Table extraction, precise text with coordinates
Speed: Moderate (0.10s per page)
Install: pip install pdfplumber

pdf2image

Best for: Converting PDF pages to images
Requires: Poppler (system dependency)
Install: pip install pdf2image

pytesseract

Best for: OCR on scanned PDFs
Requires: Tesseract (system dependency)
Install: pip install pytesseract
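
Before running any of the snippets, it can help to confirm which libraries and system tools are actually available. The following is a minimal sketch using only the standard library. `check_dependencies` is a hypothetical helper, and it assumes Poppler exposes a `pdftoppm` executable and Tesseract a `tesseract` executable on the PATH.

```python
import importlib.util
import shutil

# pip package name -> import name (PyMuPDF is imported as "fitz" in this document)
PYTHON_MODULES = {
    "pypdf": "pypdf",
    "PyMuPDF": "fitz",
    "pdfplumber": "pdfplumber",
    "pdf2image": "pdf2image",
    "pytesseract": "pytesseract",
}

# System dependency -> executable assumed to be on PATH
SYSTEM_TOOLS = {"Poppler": "pdftoppm", "Tesseract": "tesseract"}


def check_dependencies():
    """Return a dict mapping each dependency name to True if it was found."""
    status = {}
    for package, module in PYTHON_MODULES.items():
        status[package] = importlib.util.find_spec(module) is not None
    for name, executable in SYSTEM_TOOLS.items():
        status[name] = shutil.which(executable) is not None
    return status


for name, found in check_dependencies().items():
    print(f"{name}: {'ok' if found else 'MISSING'}")
```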

Chunking Strategies

1. Page-Based Splitting

Split PDF into fixed page batches.

When to use: Document structure is irrelevant; you need simple, predictable chunks

Optimal size: 20-30 pages per chunk (stays under 10MB typically)

# See Quick Start "Split Large PDF into Chunks"
chunk_pdf("document.pdf", pages_per_chunk=25)
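
The 20-30 page guideline assumes typical page sizes. For image-heavy files, one possible heuristic is to derive the page count per chunk from the average bytes per page. The sketch below assumes pages are roughly uniform in size, which is often not true, so treat the result as a starting point. The function name and the 8MB target are illustrative choices.

```python
def pages_per_chunk_for_size(file_size_bytes, total_pages, target_mb=8, max_pages=30):
    """Estimate how many pages fit in a chunk of about target_mb.

    Assumes every page contributes the same number of bytes.
    """
    if total_pages <= 0:
        raise ValueError("total_pages must be positive")
    avg_page_bytes = file_size_bytes / total_pages
    if avg_page_bytes == 0:
        return max_pages
    target_bytes = target_mb * 1024 * 1024
    pages = int(target_bytes // avg_page_bytes)
    # Always return at least one page, and never more than the page cap
    return max(1, min(max_pages, pages))


# A 60MB, 120-page file averages 0.5MB per page
print(pages_per_chunk_for_size(60 * 1024 * 1024, 120))
```

The result can be passed to `chunk_pdf` as `pages_per_chunk`.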

2. Size-Based Splitting

Monitor file size and split when threshold is reached.

When to use: Avoiding crashes is critical; page count is an unreliable indicator of file size

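from io import BytesIO

from pypdf import PdfReader, PdfWriter

def chunk_by_size(pdf_path, max_mb=8):
    """Split PDF, closing each chunk once it reaches the size threshold."""
    reader = PdfReader(pdf_path)
    writer = PdfWriter()
    chunk_num = 0

    def save(chunk_writer, number):
        with open(f"chunk_{number:03d}.pdf", "wb") as f:
            chunk_writer.write(f)

    for page in reader.pages:
        writer.add_page(page)

        # Measure the current chunk by serializing it to memory
        buffer = BytesIO()
        writer.write(buffer)
        size_mb = buffer.tell() / (1024 * 1024)

        if size_mb >= max_mb:
            # The chunk is saved after crossing the threshold,
            # so it can exceed max_mb by up to one page
            save(writer, chunk_num)
            chunk_num += 1
            writer = PdfWriter()  # Start new chunk

    # Save the pages left over after the last full chunk
    if len(writer.pages) > 0:
        save(writer, chunk_num)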

3. Overlapping Chunks

Include overlap between chunks to maintain context.

When to use: Content spans pages; losing context between chunks is problematic

Optimal overlap: 1-2 pages (or 10-20% of chunk size)

from pypdf import PdfReader, PdfWriter

def chunk_with_overlap(pdf_path, pages_per_chunk=25, overlap=2):
    """Split PDF with overlapping pages for context preservation."""
    if not 0 <= overlap < pages_per_chunk:
        raise ValueError("overlap must be non-negative and smaller than pages_per_chunk")

    reader = PdfReader(pdf_path)
    total_pages = len(reader.pages)

    chunk_num = 0
    start = 0

    while start < total_pages:
        writer = PdfWriter()
        end = min(start + pages_per_chunk, total_pages)

        for page_num in range(start, end):
            writer.add_page(reader.pages[page_num])

        output = f"chunk_{chunk_num:03d}_pages_{start+1}-{end}.pdf"
        with open(output, "wb") as f:
            writer.write(f)

        if end == total_pages:
            break  # Last chunk written; stepping back by the overlap would never terminate

        chunk_num += 1
        start = end - overlap  # Move forward with overlap
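
The overlap arithmetic can be checked without touching a PDF. The sketch below only plans the 1-based page ranges that an overlapping split would produce. `plan_overlapping_chunks` is a hypothetical helper, not part of the skill's scripts, and it stops as soon as the last page is covered so the loop always terminates.

```python
def plan_overlapping_chunks(total_pages, pages_per_chunk=25, overlap=2):
    """Return a list of (first_page, last_page) tuples, 1-based and inclusive."""
    if pages_per_chunk <= 0:
        raise ValueError("pages_per_chunk must be positive")
    if not 0 <= overlap < pages_per_chunk:
        raise ValueError("overlap must be non-negative and smaller than pages_per_chunk")

    ranges = []
    start = 0  # 0-based index of the first page in the current chunk
    while start < total_pages:
        end = min(start + pages_per_chunk, total_pages)
        ranges.append((start + 1, end))
        if end == total_pages:
            break  # Final chunk reached; do not step back by the overlap
        start = end - overlap
    return ranges


for first, last in plan_overlapping_chunks(60, pages_per_chunk=25, overlap=2):
    print(f"pages {first}-{last}")
```

With `overlap=0` the same function yields plain page-based ranges.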

4. Text Extraction First

Extract text, then chunk the text instead of PDF.

When to use: You only need text content, not layout/images

Advantage: Much smaller, faster to process, no crashes

def extract_and_chunk_text(pdf_path, chars_per_chunk=10000):
    """Extract text and split into manageable chunks."""
    import fitz

    doc = fitz.open(pdf_path)
    full_text = ""

    for page in doc:
        full_text += f"\n\n--- Page {page.number + 1} ---\n\n"
        full_text += page.get_text()

    doc.close()

    # Split text into chunks
    chunks = []
    for i in range(0, len(full_text), chars_per_chunk):
        chunks.append(full_text[i:i + chars_per_chunk])

    return chunks

# Usage
text_chunks = extract_and_chunk_text("large.pdf")
for i, chunk in enumerate(text_chunks):
    with open(f"text_chunk_{i:03d}.txt", "w", encoding="utf-8") as f:
        f.write(chunk)
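
Fixed character offsets can cut a chunk in the middle of a word or sentence. One possible refinement, sketched below under the assumption that blank lines mark paragraph boundaries in the extracted text, is to fill each chunk with whole paragraphs and hard-split only paragraphs that are longer than the limit. `chunk_text_on_paragraphs` is a hypothetical name.

```python
def chunk_text_on_paragraphs(text, max_chars=10000):
    """Split text into chunks of at most max_chars, breaking at blank lines where possible."""
    chunks = []
    current = ""
    for paragraph in text.split("\n\n"):
        # A single paragraph longer than the limit has to be cut by character count
        while len(paragraph) > max_chars:
            if current:
                chunks.append(current)
                current = ""
            chunks.append(paragraph[:max_chars])
            paragraph = paragraph[max_chars:]

        candidate = f"{current}\n\n{paragraph}" if current else paragraph
        if len(candidate) <= max_chars:
            current = candidate
        else:
            chunks.append(current)
            current = paragraph

    if current:
        chunks.append(current)
    return chunks


sample = "First paragraph.\n\nSecond paragraph.\n\nThird paragraph."
for chunk in chunk_text_on_paragraphs(sample, max_chars=40):
    print(repr(chunk))
```

The blank line between two chunks is dropped, so rejoining chunks with `"\n\n"` restores the text except where a long paragraph was hard-split.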

Handling Different PDF Types

Text-Based PDFs (Native Text)

PDFs created digitally with searchable text.

Detection:

import fitz

doc = fitz.open("document.pdf")
text = doc[0].get_text()  # First page
doc.close()

if len(text.strip()) > 50:
    print("Text-based PDF")
else:
    print("Likely scanned PDF")

Best approach: Direct text extraction with PyMuPDF or pdfplumber
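
Checking only the first page can misclassify documents whose first page is a cover image. A minimal sketch of a slightly more robust check follows: it applies the same 50-character threshold to several pages and reports whether the sample is text-based, scanned, or mixed. The names `classify_pages` and `classify_pdf_file` and the sample size of 5 are assumptions for illustration.

```python
def classify_pages(page_texts, min_chars=50):
    """Return 'text', 'scanned', or 'mixed' from a list of per-page text strings."""
    if not page_texts:
        raise ValueError("page_texts must not be empty")
    has_text = [len(text.strip()) > min_chars for text in page_texts]
    if all(has_text):
        return "text"
    if not any(has_text):
        return "scanned"
    return "mixed"


def classify_pdf_file(pdf_path, sample_pages=5, min_chars=50):
    """Classify a PDF by sampling its first few pages with PyMuPDF."""
    import fitz  # Imported lazily so classify_pages works without PyMuPDF

    doc = fitz.open(pdf_path)
    try:
        count = min(sample_pages, len(doc))
        texts = [doc[i].get_text() for i in range(count)]
    finally:
        doc.close()
    return classify_pages(texts, min_chars)


print(classify_pages(["A full page of extracted text. " * 5, ""]))
```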

Scanned PDFs (Images of Text)

PDFs created by scanning physical documents.

Requires: OCR (Optical Character Recognition)

Approach:

from pdf2image import convert_from_path
import pytesseract

def ocr_pdf(pdf_path):
    """Extract text from scanned PDF using OCR."""
    # Convert to images
    images = convert_from_path(pdf_path, dpi=300)

    # OCR each page
    text = ""
    for i, image in enumerate(images, 1):
        page_text = pytesseract.image_to_string(image)
        text += f"\n\n--- Page {i} ---\n\n{page_text}"

    return text

Performance note: OCR is much slower than direct text extraction

Mixed PDFs

Some pages have text, others are scanned.

Approach: Detect page-by-page and use appropriate method

def extract_mixed_pdf(pdf_path):
    """Handle PDFs with both text and scanned pages."""
    import fitz
    from pdf2image import convert_from_path
    import pytesseract

    doc = fitz.open(pdf_path)
    full_text = ""

    for page_num, page in enumerate(doc):
        text = page.get_text()

        if len(text.strip()) > 50:
            # Has text - use direct extraction
            full_text += f"\n\n--- Page {page_num + 1} (text) ---\n\n{text}"
        else:
            # Likely scanned - use OCR
            images = convert_from_path(pdf_path, first_page=page_num+1, last_page=page_num+1, dpi=300)
            ocr_text = pytesseract.image_to_string(images[0])
            full_text += f"\n\n--- Page {page_num + 1} (OCR) ---\n\n{ocr_text}"

    doc.close()
    return full_text

Helper Scripts

This skill includes pre-built scripts in the scripts/ directory:

chunk_pdf.py: Flexible PDF chunking with multiple strategies
extract_text.py: Unified text extraction (handles text-based and OCR)
extract_tables.py: Advanced table extraction with formatting
process_large_pdf.py: Orchestrate complete large PDF processing workflow

Using Helper Scripts

# Chunk a large PDF
python .claude/skills/pdf-processing/scripts/chunk_pdf.py large_doc.pdf --pages 30 --overlap 2

# Extract all text
python .claude/skills/pdf-processing/scripts/extract_text.py document.pdf --output text.txt

# Extract tables to CSV
python .claude/skills/pdf-processing/scripts/extract_tables.py report.pdf --output tables/

# Process large PDF end-to-end
python .claude/skills/pdf-processing/scripts/process_large_pdf.py huge_doc.pdf --strategy chunk --output processed/

Error Handling

Preventing Crashes

Key principle: Never trust file size alone. Check both size and page count before reading.

def safe_pdf_read(pdf_path, max_pages=30, max_mb=10):
    """Safely check if PDF can be read directly."""
    import os
    import fitz

    # Check file size
    size_mb = os.path.getsize(pdf_path) / (1024 * 1024)
    if size_mb > max_mb:
        return False, f"File too large: {size_mb:.1f}MB (max: {max_mb}MB)"

    # Check page count
    try:
        doc = fitz.open(pdf_path)
        page_count = len(doc)
        doc.close()

        if page_count > max_pages:
            return False, f"Too many pages: {page_count} (max: {max_pages})"

        return True, f"Safe to read: {size_mb:.1f}MB, {page_count} pages"

    except Exception as e:
        return False, f"Error checking PDF: {e}"

# Usage
safe, message = safe_pdf_read("document.pdf")
print(message)

if safe:
    # Use Claude Code Read tool
    pass
else:
    # Use chunking strategies
    pass
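
The branch above can be extended into a small decision helper that maps a PDF's properties to one of the approaches described in this document. The sketch below is one possible routing, not a prescribed one. The function name, the `needs_layout` flag, and the returned labels are all hypothetical.

```python
def choose_strategy(size_mb, page_count, has_text, needs_layout=False,
                    max_mb=10, max_pages=30):
    """Pick a processing approach from basic PDF properties.

    Returns one of: 'direct_read', 'ocr', 'chunk_pdf', 'extract_text'.
    """
    if size_mb <= max_mb and page_count <= max_pages:
        return "direct_read"   # Within the conservative limits used by safe_pdf_read
    if not has_text:
        return "ocr"           # Scanned pages need OCR before anything else
    if needs_layout:
        return "chunk_pdf"     # Keep layout and images by splitting the PDF itself
    return "extract_text"      # Text only: extract first, then chunk the text


print(choose_strategy(size_mb=45, page_count=200, has_text=True))
```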

Handling Corrupted PDFs

def is_pdf_valid(pdf_path):
    """Check if PDF is valid and readable."""
    try:
        import fitz
        doc = fitz.open(pdf_path)
        _ = len(doc)  # Force reading
        doc.close()
        return True, "PDF is valid"
    except Exception as e:
        return False, f"PDF is corrupted or invalid: {e}"

Graceful Degradation

def extract_with_fallback(pdf_path):
    """Try multiple extraction methods, falling back if needed."""

    # Try 1: PyMuPDF (fastest)
    try:
        import fitz
        doc = fitz.open(pdf_path)
        text = "\n".join(page.get_text() for page in doc)
        doc.close()
        if text.strip():
            return text, "pymupdf"
    except Exception as e:
        print(f"PyMuPDF failed: {e}")

    # Try 2: pdfplumber (more reliable)
    try:
        import pdfplumber
        with pdfplumber.open(pdf_path) as pdf:
            text = "\n".join(page.extract_text() or "" for page in pdf.pages)
        if text.strip():
            return text, "pdfplumber"
    except Exception as e:
        print(f"pdfplumber failed: {e}")

    # Try 3: OCR (last resort)
    try:
        from pdf2image import convert_from_path
        import pytesseract
        images = convert_from_path(pdf_path, dpi=300)
        text = "\n\n".join(pytesseract.image_to_string(img) for img in images)
        return text, "ocr"
    except Exception as e:
        print(f"OCR failed: {e}")

    return None, "all_methods_failed"

Best Practices

Always check file size before reading: Use safe_pdf_read() to avoid crashes
Prefer text extraction over direct reading: Extract text first, then process text files
Use overlapping chunks for context: 1-2 pages overlap prevents information loss
Choose the right tool: PyMuPDF for speed, pdfplumber for tables, OCR for scans
Monitor progress: For large PDFs, log progress to recover from interruptions
Save intermediate results: Don't lose progress if processing fails partway through
Test with small chunks first: Validate approach on 1-2 chunks before processing entire document
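
The advice on monitoring progress and saving intermediate results could be implemented with a small checkpoint file. The following is a minimal sketch, assuming per-page work is wrapped in a callable. `process_with_checkpoint` and the `progress.json` file name are hypothetical and not part of the skill's scripts.

```python
import json
import os


def process_with_checkpoint(total_pages, process_page, checkpoint_path="progress.json"):
    """Call process_page(index) for each page, skipping pages already recorded as done."""
    done = set()
    if os.path.exists(checkpoint_path):
        with open(checkpoint_path, encoding="utf-8") as f:
            done = set(json.load(f)["done"])

    for page_index in range(total_pages):
        if page_index in done:
            continue
        process_page(page_index)
        done.add(page_index)
        # Persist after every page so an interruption loses at most one page of work
        with open(checkpoint_path, "w", encoding="utf-8") as f:
            json.dump({"done": sorted(done)}, f)
        print(f"Processed page {page_index + 1}/{total_pages}")

    return sorted(done)
```

Rerunning the function with the same checkpoint path resumes at the first page that was not completed.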

Common Workflows

Workflow 1: Analyze Large Report

# 1. Check if direct read is safe
safe, msg = safe_pdf_read("report.pdf")

if not safe:
    # 2. Extract text instead
    text = extract_text_fast("report.pdf")

    # 3. Save to file for Claude to read
    with open("report_text.txt", "w", encoding="utf-8") as f:
        f.write(text)

    # 4. Process text file (much safer)
    # Claude can now read report_text.txt without crashes

Workflow 2: Extract Data from Multi-Page Invoice

# 1. Extract tables from all pages
tables = extract_tables("invoice_100pages.pdf")

# 2. Convert to structured format
import csv

for t in tables:
    filename = f"invoice_page{t['page']}_table{t['table_num']}.csv"
    with open(filename, "w", newline="", encoding="utf-8") as f:
        writer = csv.writer(f)
        writer.writerows(t['data'])
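
Extracted tables often need light cleanup before they are useful as CSV. The sketch below assumes cells arrive as strings or `None` and may contain embedded newlines, which is common in pdfplumber output. It normalizes whitespace and drops rows that are entirely empty. `clean_table` is a hypothetical helper.

```python
def clean_table(table):
    """Normalize a table (list of rows): None -> '', collapse whitespace, drop empty rows."""
    cleaned = []
    for row in table:
        cells = [
            " ".join(str(cell).split()) if cell is not None else ""
            for cell in row
        ]
        if any(cells):  # Skip rows where every cell is empty
            cleaned.append(cells)
    return cleaned


raw = [["Item", None], ["Widget\nLarge", " 3 "], [None, ""]]
print(clean_table(raw))
```

In Workflow 2 this would be applied as `writer.writerows(clean_table(t['data']))`.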

Workflow 3: Process Scanned Document Archive

# 1. Check if scanned
import fitz
doc = fitz.open("archive.pdf")
is_scanned = len(doc[0].get_text().strip()) < 50
doc.close()

if is_scanned:
    # 2. Use OCR
    text = ocr_pdf("archive.pdf")

    # 3. Save extracted text
    with open("archive_ocr.txt", "w", encoding="utf-8") as f:
        f.write(text)

Troubleshooting

Issue: "Claude Code crashed when reading PDF"

Solution: File was too large. Use chunking or text extraction first.

Issue: "Extracted text is gibberish"

Solution: The PDF might be scanned, or its fonts may lack a usable character mapping. Use OCR (ocr_pdf() function).

Issue: "Table extraction is inaccurate"

Solution: Use pdfplumber with custom table detection settings (see reference.md).

Issue: "OCR is too slow"

Solution: Reduce DPI (try 150-200 instead of 300), or process only needed pages.

Issue: "Out of memory when processing large PDF"

Solution: Process page-by-page instead of loading entire document. See process_large_pdf.py.
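
A minimal sketch of page-by-page processing follows. Instead of building one large string, it streams each page's text to disk as it is extracted. The contents of process_large_pdf.py are not shown in this document, so this is an independent illustration, and both function names are hypothetical.

```python
def iter_page_text(pdf_path):
    """Yield (page_number, text) one page at a time using PyMuPDF."""
    import fitz  # Imported lazily so the writer below works without PyMuPDF

    doc = fitz.open(pdf_path)
    try:
        for page in doc:
            yield page.number + 1, page.get_text()
    finally:
        doc.close()


def write_pages_incrementally(pages, output_path):
    """Write (page_number, text) pairs to a file as they arrive; return the page count."""
    count = 0
    with open(output_path, "w", encoding="utf-8") as out:
        for page_number, text in pages:
            out.write(f"\n\n--- Page {page_number} ---\n\n{text}")
            count += 1
    return count


# Usage (requires PyMuPDF and an existing file):
# write_pages_incrementally(iter_page_text("huge_doc.pdf"), "huge_doc.txt")
```

Only one page of text is held in memory at a time, at the cost of not having the full text available as a single string.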

Next Steps

For advanced techniques and detailed API references, see reference.md
For troubleshooting specific library issues, see library documentation
For custom workflows, combine techniques from Quick Start and Common Workflows sections

Installation

Required dependencies:

pip install pypdf PyMuPDF pdfplumber pdf2image pytesseract

System dependencies:

Poppler (for pdf2image): Installation guide
Tesseract (for OCR): Installation guide
