
name	pdf-extraction-fallback-7db3aa
description	Resilient multi-tier PDF extraction with sequential fallback strategies when initial reading fails

PDF Extraction Fallback Strategy

Purpose

When processing complex documents (tax forms, legal documents, scanned materials), PDF extraction often fails on the first attempt. This skill provides a systematic fallback approach that tries multiple extraction methods in sequence until one succeeds.

When to Use

Initial PDF reading tools return errors or empty content
Document appears to be scanned/image-based rather than text-based
Previous extraction attempts produced incomplete or garbled output
Working with forms, tables, or structured documents that need reliable extraction
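One way to test the "scanned/image-based" condition above is to look at how much text each page yields. The sketch below is only an illustration: the function name `looks_scanned` and both thresholds are assumptions, and it takes the per-page strings from whichever extractor was tried first rather than reading the PDF itself.

```python
def looks_scanned(page_texts, min_chars_per_page=20, max_empty_ratio=0.5):
    """Guess whether a PDF is image-based from its per-page text layer.

    page_texts: one string (or None) per page, as returned by any text
    extractor. Both thresholds are illustrative assumptions, not tuned values.
    """
    if not page_texts:
        return True  # no pages yielded anything: treat as scanned
    # Count pages whose text layer is missing or nearly empty
    empty = sum(
        1 for t in page_texts
        if len((t or "").strip()) < min_chars_per_page
    )
    return empty / len(page_texts) > max_empty_ratio


print(looks_scanned(["", None, "  "]))                        # True
print(looks_scanned(["Form 1040 " * 10, "Line 1: wages " * 5]))  # False
```

If this returns True, it may be reasonable to skip straight to the OCR tier instead of trying every text-layer method first.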

Fallback Sequence

Tier 1: Shell-Based Extraction (pdftotext)

Start with the pdftotext command-line tool (shipped with Poppler and Xpdf), which often copes with PDFs that other readers reject:

# Extract text maintaining layout
pdftotext -layout input.pdf output.txt

# Extract in default reading order (no layout preservation)
pdftotext input.pdf output.txt

# Extract specific page range
pdftotext -f 1 -l 3 input.pdf output.txt

Check that the output contains meaningful content before accepting it; if it is empty or garbled, move on to Tier 2.

Tier 2: Python-Based Parsing

If shell tools fail, use Python libraries with different extraction approaches:

# Using PyPDF2 for basic text extraction (legacy package, superseded by pypdf)
import PyPDF2
with open('document.pdf', 'rb') as f:
    reader = PyPDF2.PdfReader(f)
    # "or ''" guards against pages that yield no text
    text = '\n'.join(page.extract_text() or '' for page in reader.pages)

# Using pdfplumber for tables and structured content
import pdfplumber
page_texts, page_tables = [], []
with pdfplumber.open('document.pdf') as pdf:
    for page in pdf.pages:
        # Collect per-page results instead of overwriting them on each iteration
        page_texts.append(page.extract_text() or '')
        page_tables.append(page.extract_tables())

# Using pypdf, the maintained successor to PyPDF2
from pypdf import PdfReader
reader = PdfReader('document.pdf')
text = '\n'.join(page.extract_text() or '' for page in reader.pages)

Tier 3: OCR Tools (for Scanned Documents)

If the PDF contains images or scanned content, use OCR:

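# Using tesseract via command line
# tesseract reads images, not PDFs, so rasterize the pages first
# (pdftoppm ships with Poppler; 300 DPI is a common choice for OCR)
pdftoppm -r 300 -png input.pdf page
for img in page-*.png; do
    tesseract "$img" "${img%.png}" --psm 6
done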

# Using Python with pytesseract
import pytesseract
from pdf2image import convert_from_path

images = convert_from_path('document.pdf')
text = ''.join(pytesseract.image_to_string(img) for img in images)

Implementation Pattern

import subprocess


class ExtractionError(Exception):
    """Raised when every extraction tier fails."""


def extract_pdf_resilient(pdf_path):
    """Try multiple extraction methods until one succeeds."""

    # Tier 1: Shell extraction ("-" sends the text to stdout)
    try:
        result = subprocess.run(
            ['pdftotext', '-layout', pdf_path, '-'],
            capture_output=True, text=True,
        )
        if result.returncode == 0 and len(result.stdout.strip()) > 100:
            return result.stdout, 'pdftotext'
    except OSError:
        pass  # pdftotext is not installed or could not be run

    # Tier 2: Python libraries
    try:
        import pdfplumber
        with pdfplumber.open(pdf_path) as pdf:
            text = '\n'.join(page.extract_text() or '' for page in pdf.pages)
        if text.strip():
            return text, 'pdfplumber'
    except Exception:
        pass  # missing library or unreadable PDF; fall through to OCR

    # Tier 3: OCR fallback
    try:
        from pdf2image import convert_from_path
        import pytesseract
        images = convert_from_path(pdf_path)
        text = '\n'.join(pytesseract.image_to_string(img) for img in images)
        if text.strip():
            return text, 'tesseract-ocr'
    except Exception:
        pass  # OCR dependencies missing or conversion failed

    raise ExtractionError("All extraction methods failed")

Decision Criteria

Indicator	Action
Empty output	Proceed to next tier
Garbled/special characters	Try next tier
Partial content	Accept if it meets the minimum threshold
Tool not available	Skip to next tier
Format-specific errors	Try alternative library
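The table above can be read as a small decision function. The following is a minimal sketch, not a fixed rule set: the name `next_action`, the 50-character cutoff (borrowed from the example threshold under Best Practices), and the 10% limit on replacement characters are all assumptions. Format-specific errors are not modeled here, since they surface as exceptions that the caller would catch before trying an alternative library.

```python
import shutil

MIN_CHARS = 50  # assumed minimum content threshold


def next_action(text, tool=None, min_chars=MIN_CHARS, max_bad_ratio=0.1):
    """Return 'accept' or 'next_tier' for one extraction attempt."""
    # Tool not available: skip to the next tier
    if tool is not None and shutil.which(tool) is None:
        return "next_tier"
    stripped = (text or "").strip()
    # Empty output: proceed to the next tier
    if not stripped:
        return "next_tier"
    # Garbled output: too many U+FFFD replacement characters
    if stripped.count("\ufffd") > len(stripped) * max_bad_ratio:
        return "next_tier"
    # Partial content: accept only if it meets the minimum threshold
    return "accept" if len(stripped) >= min_chars else "next_tier"


print(next_action(""))                      # next_tier
print(next_action("Taxable income " * 10))  # accept
```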

Best Practices

Validate each attempt - Check output length and quality before accepting
Log which method succeeded - Track which tier worked for future reference
Set minimum content thresholds - Don't accept trivial results (e.g., <50 chars)
Combine methods if needed - Some documents need multiple approaches for different sections
Preserve original file - Never modify the source PDF during extraction attempts
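For the "combine methods" point, one possible approach is to decide page by page: keep the text layer where it is long enough and fall back to OCR only for the pages where it is not. This is a sketch under assumptions: `combine_per_page` is a hypothetical helper, and `text_layer` and `ocr` stand in for whatever per-page extractors are available.

```python
def combine_per_page(page_count, text_layer, ocr, min_chars=50):
    """Use the text layer where it is good enough, OCR elsewhere.

    text_layer and ocr are hypothetical callables that take a zero-based
    page index and return a string (or None). min_chars is an assumed cutoff.
    """
    pages, methods = [], []
    for i in range(page_count):
        text = (text_layer(i) or "").strip()
        method = "text-layer"
        if len(text) < min_chars:
            # Text layer missing or too thin on this page: try OCR instead
            text = (ocr(i) or "").strip()
            method = "ocr"
        pages.append(text)
        methods.append(method)
    return "\n".join(pages), methods


# Stub extractors for illustration: page 1 has no usable text layer
layer = {0: "Typed cover letter. " * 5, 1: ""}
scan = {0: "", 1: "Scanned signature page, recognized by OCR. " * 2}
text, methods = combine_per_page(2, layer.get, scan.get)
print(methods)  # ['text-layer', 'ocr']
```

Returning the per-page method list alongside the text also covers the logging advice, since it records which approach produced each page.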

Error Handling

Catch exceptions at each tier instead of failing immediately
Log detailed error messages for debugging
Move on to the next tier when the current one returns output that fails the quality check, even if it partially succeeded
After all tiers fail, provide a clear summary of what was tried
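These rules could be implemented with a small tier runner that records why each attempt was rejected. The sketch below is illustrative only: `run_tiers`, `AllTiersFailed`, and the logger name are hypothetical, and the tiers are passed in as plain callables so the runner itself needs nothing beyond the standard library.

```python
import logging

logger = logging.getLogger("pdf_fallback")  # hypothetical logger name


class AllTiersFailed(Exception):
    """Hypothetical error type carrying a per-tier failure summary."""


def run_tiers(pdf_path, tiers, is_good):
    """Run each (name, extract) pair in order until is_good(text) passes.

    tiers: list of (name, callable taking pdf_path and returning a string).
    is_good: quality check applied to each tier's output.
    """
    attempts = []
    for name, extract in tiers:
        try:
            text = extract(pdf_path)
        except Exception as exc:
            # Record the failure and keep going instead of aborting
            logger.warning("%s failed on %s: %r", name, pdf_path, exc)
            attempts.append(f"{name}: error ({exc!r})")
            continue
        if is_good(text):
            logger.info("%s succeeded on %s", name, pdf_path)
            return text, name
        attempts.append(f"{name}: output rejected by quality check")
    # Every tier was tried: summarize what happened in each one
    raise AllTiersFailed("All extraction methods failed: " + "; ".join(attempts))
```

A caller would pass its own extractors, for example `run_tiers(path, [("pdftotext", f1), ("pdfplumber", f2)], is_good=check)`, where `f1`, `f2`, and `check` are placeholders for functions defined elsewhere.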

Output Quality Check

Before declaring extraction complete:

def validate_extraction(text):
    if not text or len(text.strip()) < 50:
        return False
    if text.count('\ufffd') > len(text) * 0.1:  # Too many U+FFFD replacement chars
        return False
    if len(set(text)) < 10:  # Too little character variety
        return False
    return True