Run any Skill in Manus with one click

Get Started

pdf-extraction-fallback

Multi-stage fallback strategy for PDF/document extraction using sequential tool alternatives

Run Skill in Manus

Overview

Multi-stage fallback strategy for PDF/document extraction using sequential tool alternatives

Install command

npx skills add https://github.com/HKUDS/OpenSpace --skill pdf-extraction-fallback

Copy and paste this command into Claude Code to install the skill

Source

HKUDS/OpenSpace

Stars6,420

Forks792

UpdatedMarch 24, 2026 at 08:03

File Explorer

2 files

SKILL.md

readonly

name	pdf-extraction-fallback
description	Multi-stage fallback strategy for PDF/document extraction using sequential tool alternatives

PDF Extraction Fallback Strategy

When processing documents (especially PDFs), initial extraction attempts may fail due to formatting, encryption, or tool limitations. This skill provides a systematic fallback approach that tries multiple extraction methods before declaring failure.

Core Principle

Never declare completion after a single tool failure. Instead, iterate through a hierarchy of extraction methods, each with different capabilities and limitations.

Fallback Hierarchy

Attempt extraction methods in this order:

Stage 1: Direct PDF Reading

Try native PDF libraries first (fastest, preserves structure):

import PyPDF2
from pypdf import PdfReader

def extract_with_pypdf(pdf_path):
    reader = PdfReader(pdf_path)
    text = ""
    for page in reader.pages:
        text += page.extract_text() or ""
    return text

Stage 2: Shell-based Extraction (pdftotext)

If Stage 1 fails, use system tools:

# Install if needed: apt-get install poppler-utils
pdftotext -layout input.pdf output.txt
pdftotext -raw input.pdf output.txt  # Alternative layout

import subprocess

def extract_with_pdftotext(pdf_path):
    result = subprocess.run(
        ['pdftotext', '-layout', pdf_path, '-'],
        capture_output=True, text=True
    )
    if result.returncode == 0:
        return result.stdout
    raise Exception("pdftotext failed")

Stage 3: Alternative Python Parsers

Try different Python libraries with varying capabilities:

# pdfplumber - better for tables
import pdfplumber
def extract_with_pdfplumber(pdf_path):
    text = ""
    with pdfplumber.open(pdf_path) as pdf:
        for page in pdf.pages:
            text += page.extract_text() or ""
    return text

# pdfminer - handles complex layouts
from pdfminer.high_level import extract_text
def extract_with_pdfminer(pdf_path):
    return extract_text(pdf_path)

Stage 4: OCR Fallback

For scanned images or when text extraction fails:

# Using tesseract
convert input.pdf output-%d.png  # Convert to images first
tesseract output-0.png result --psm 6

# Using pytesseract
from pdf2image import convert_from_path
import pytesseract

def extract_with_ocr(pdf_path):
    images = convert_from_path(pdf_path, dpi=300)
    text = ""
    for image in images:
        text += pytesseract.image_to_string(image)
    return text

Implementation Pattern

def robust_pdf_extraction(pdf_path):
    """Try multiple extraction methods until one succeeds."""
    
    extraction_methods = [
        ("PyPDF2", extract_with_pypdf),
        ("pdftotext", extract_with_pdftotext),
        ("pdfplumber", extract_with_pdfplumber),
        ("pdfminer", extract_with_pdfminer),
        ("OCR", extract_with_ocr),
    ]
    
    errors = []
    for method_name, method_func in extraction_methods:
        try:
            print(f"Trying {method_name}...")
            text = method_func(pdf_path)
            if text and text.strip():
                print(f"Success with {method_name}")
                return text
            else:
                errors.append(f"{method_name}: empty result")
        except Exception as e:
            errors.append(f"{method_name}: {str(e)}")
            print(f"{method_name} failed: {e}")
            continue
    
    # All methods failed
    raise Exception(f"All extraction methods failed:\n" + "\n".join(errors))

Success Criteria

A method is considered successful when:

No exceptions are raised during execution
Non-empty text is returned (text.strip() has content)
Content quality meets task requirements (check for expected keywords/patterns)

Best Practices

Log each attempt - Record which methods were tried and why they failed
Validate output - Check extracted text contains expected content markers
Graceful degradation - Proceed with partial data if full extraction isn't possible
Cache successful method - Remember which method worked for similar files
Set timeouts - Prevent OCR or complex parsing from hanging indefinitely

Common Failure Modes

Symptom	Likely Cause	Best Fallback
Empty pages	Image-based PDF	OCR (Stage 4)
Garbled text	Encoding issues	pdftotext (Stage 2)
Missing tables	Simple parser	pdfplumber (Stage 3)
Permission errors	Encrypted PDF	Check password/permissions first
Layout lost	Complex formatting	pdftotext -layout or pdfplumber

When to Use

Processing unknown/untrusted PDF sources
Batch processing diverse document types
Critical tasks where extraction failure is not acceptable
Documents with mixed content (text + images + tables)

PDF Extraction Fallback Strategy

Core Principle

Never declare completion after a single tool failure. Instead, iterate through a hierarchy of extraction methods, each with different capabilities and limitations.

Fallback Hierarchy

Attempt extraction methods in this order:

Stage 1: Direct PDF Reading

Try native PDF libraries first (fastest, preserves structure):

import PyPDF2
from pypdf import PdfReader

def extract_with_pypdf(pdf_path):
    reader = PdfReader(pdf_path)
    text = ""
    for page in reader.pages:
        text += page.extract_text() or ""
    return text

Stage 2: Shell-based Extraction (pdftotext)

If Stage 1 fails, use system tools:

# Install if needed: apt-get install poppler-utils
pdftotext -layout input.pdf output.txt
pdftotext -raw input.pdf output.txt  # Alternative layout

import subprocess

def extract_with_pdftotext(pdf_path):
    result = subprocess.run(
        ['pdftotext', '-layout', pdf_path, '-'],
        capture_output=True, text=True
    )
    if result.returncode == 0:
        return result.stdout
    raise Exception("pdftotext failed")

Stage 3: Alternative Python Parsers

Try different Python libraries with varying capabilities:

# pdfplumber - better for tables
import pdfplumber
def extract_with_pdfplumber(pdf_path):
    text = ""
    with pdfplumber.open(pdf_path) as pdf:
        for page in pdf.pages:
            text += page.extract_text() or ""
    return text

# pdfminer - handles complex layouts
from pdfminer.high_level import extract_text
def extract_with_pdfminer(pdf_path):
    return extract_text(pdf_path)

Stage 4: OCR Fallback

For scanned images or when text extraction fails:

# Using tesseract
convert input.pdf output-%d.png  # Convert to images first
tesseract output-0.png result --psm 6

# Using pytesseract
from pdf2image import convert_from_path
import pytesseract

def extract_with_ocr(pdf_path):
    images = convert_from_path(pdf_path, dpi=300)
    text = ""
    for image in images:
        text += pytesseract.image_to_string(image)
    return text

Implementation Pattern

def robust_pdf_extraction(pdf_path):
    """Try multiple extraction methods until one succeeds."""
    
    extraction_methods = [
        ("PyPDF2", extract_with_pypdf),
        ("pdftotext", extract_with_pdftotext),
        ("pdfplumber", extract_with_pdfplumber),
        ("pdfminer", extract_with_pdfminer),
        ("OCR", extract_with_ocr),
    ]
    
    errors = []
    for method_name, method_func in extraction_methods:
        try:
            print(f"Trying {method_name}...")
            text = method_func(pdf_path)
            if text and text.strip():
                print(f"Success with {method_name}")
                return text
            else:
                errors.append(f"{method_name}: empty result")
        except Exception as e:
            errors.append(f"{method_name}: {str(e)}")
            print(f"{method_name} failed: {e}")
            continue
    
    # All methods failed
    raise Exception(f"All extraction methods failed:\n" + "\n".join(errors))

Success Criteria

A method is considered successful when:

No exceptions are raised during execution
Non-empty text is returned (text.strip() has content)
Content quality meets task requirements (check for expected keywords/patterns)

Best Practices

Log each attempt - Record which methods were tried and why they failed
Validate output - Check extracted text contains expected content markers
Graceful degradation - Proceed with partial data if full extraction isn't possible
Cache successful method - Remember which method worked for similar files
Set timeouts - Prevent OCR or complex parsing from hanging indefinitely

Common Failure Modes

Symptom	Likely Cause	Best Fallback
Empty pages	Image-based PDF	OCR (Stage 4)
Garbled text	Encoding issues	pdftotext (Stage 2)
Missing tables	Simple parser	pdfplumber (Stage 3)
Permission errors	Encrypted PDF	Check password/permissions first
Layout lost	Complex formatting	pdftotext -layout or pdfplumber

When to Use

Processing unknown/untrusted PDF sources
Batch processing diverse document types
Critical tasks where extraction failure is not acceptable
Documents with mixed content (text + images + tables)

pdf-extraction-fallback

PDF Extraction Fallback Strategy

Core Principle

Fallback Hierarchy

Stage 1: Direct PDF Reading

Stage 2: Shell-based Extraction (pdftotext)

Stage 3: Alternative Python Parsers

Stage 4: OCR Fallback

Implementation Pattern

Success Criteria

Best Practices

Common Failure Modes

When to Use

More from this repository

More from this repository

PDF Extraction Fallback Strategy

Core Principle

Fallback Hierarchy

Stage 1: Direct PDF Reading

Stage 2: Shell-based Extraction (pdftotext)

Stage 3: Alternative Python Parsers

Stage 4: OCR Fallback

Implementation Pattern

Success Criteria

Best Practices

Common Failure Modes

When to Use