pdf-extraction-fallbacks
Multi-fallback PDF/text extraction with early failure detection and sequential tool fallbacks
| name | pdf-extraction-fallbacks |
| description | Multi-fallback PDF/text extraction with early failure detection and sequential tool fallbacks |
When extracting text from PDFs (especially regulatory documents, handbooks, or protected content), single-method approaches often fail due to JavaScript protection, CORS restrictions, encoding issues, or corrupted downloads. This skill provides a robust multi-fallback workflow that detects failures early and tries sequential extraction methods.
Before attempting extraction, validate the downloaded file:
# Download the PDF
curl -L -o document.pdf "$URL"
# Check file size (reject if < 1KB - likely error page)
FILE_SIZE=$(stat -f%z document.pdf 2>/dev/null || stat -c%s document.pdf 2>/dev/null)
if [ "$FILE_SIZE" -lt 1024 ]; then
    echo "ERROR: File too small ($FILE_SIZE bytes) - likely not a valid PDF"
    # Check if it's an HTML error page
    head -c 200 document.pdf | grep -i "<html\|<!doctype\|error\|access denied" && \
        echo "Detected HTML error page instead of PDF"
    exit 1
fi
# Check PDF magic bytes
HEAD_BYTES=$(head -c 4 document.pdf)
if [ "$HEAD_BYTES" != "%PDF" ]; then
    echo "ERROR: File does not start with PDF magic bytes"
    head -c 100 document.pdf
    exit 1
fi
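The same pre-extraction checks can also be run from Python when the rest of the pipeline lives there. The following is a minimal sketch, not part of the original skill: the function name `validate_pdf`, the constants, and the `(ok, reason)` return shape are assumptions made for illustration. It uses the same 1 KB threshold and `%PDF` magic-byte test as the shell commands above.

```python
from pathlib import Path

MIN_SIZE_BYTES = 1024  # same 1 KB threshold as the shell check
# Markers mirroring the grep pattern used in the shell check
HTML_MARKERS = (b"<html", b"<!doctype", b"error", b"access denied")


def validate_pdf(path):
    """Return (ok, reason) for a downloaded file without parsing it."""
    p = Path(path)
    size = p.stat().st_size
    with p.open("rb") as f:
        head = f.read(200)  # only the first bytes are needed
    if size < MIN_SIZE_BYTES:
        if any(marker in head.lower() for marker in HTML_MARKERS):
            return False, f"HTML error page ({size} bytes)"
        return False, f"file too small ({size} bytes)"
    if not head.startswith(b"%PDF"):
        return False, "missing %PDF magic bytes"
    return True, "ok"
```

A caller would typically stop and retry the download when `ok` is false, rather than passing the file to any extractor.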
# Try pdftotext first (fastest, most reliable for simple PDFs)
if command -v pdftotext &> /dev/null; then
    pdftotext -layout document.pdf output.txt 2>/dev/null
    if [ -s output.txt ]; then
        WORD_COUNT=$(wc -w < output.txt)
        if [ "$WORD_COUNT" -gt 50 ]; then
            echo "SUCCESS: pdftotext extracted $WORD_COUNT words"
            exit 0
        fi
    fi
fi
# Try PyMuPDF - handles more complex PDFs
import fitz  # pymupdf
try:
    doc = fitz.open("document.pdf")
    text = ""
    for page in doc:
        text += page.get_text()
    if len(text.strip()) > 500:  # Sanity check
        with open("output.txt", "w") as f:
            f.write(text)
        print(f"SUCCESS: PyMuPDF extracted {len(text)} characters")
    else:
        print("WARNING: PyMuPDF extraction too short, trying next method")
except Exception as e:
    print(f"PyMuPDF failed: {e}")
# Try pdfplumber - better for tables and structured content
import pdfplumber
try:
    text = ""
    with pdfplumber.open("document.pdf") as pdf:
        for page in pdf.pages:
            page_text = page.extract_text()
            if page_text:
                text += page_text + "\n"
    if len(text.strip()) > 500:
        with open("output.txt", "w") as f:
            f.write(text)
        print(f"SUCCESS: pdfplumber extracted {len(text)} characters")
    else:
        print("WARNING: pdfplumber extraction too short")
except Exception as e:
    print(f"pdfplumber failed: {e}")
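If all methods fail, the PDF may be JavaScript-protected:
# Check for JavaScript in PDF
import fitz  # pymupdf
doc = fitz.open("document.pdf")
has_js = False
# PyMuPDF pages have no JavaScript getter, so scan the raw object
# definitions for JavaScript action keys instead (a substring heuristic).
for xref in range(1, doc.xref_length()):
    obj = doc.xref_object(xref)
    if "/JS" in obj or "/JavaScript" in obj:
        has_js = True
        break
if has_js:
    print("WARNING: PDF contains JavaScript - may be protected")
    # Try rendering pages as images and OCR (requires additional tools)
    # Or try alternative download source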
If the primary URL fails, try an alternative download source (.gov mirrors, archive.org).
Download PDF
│
├─→ File < 1KB? → REJECT (likely error page)
├─→ No %PDF header? → REJECT (not a PDF)
│
└─→ Valid PDF
│
├─→ pdftotext → >50 words? → SUCCESS
│ └─→ Try next
│
├─→ PyMuPDF → >500 chars? → SUCCESS
│ └─→ Try next
│
├─→ pdfplumber → >500 chars? → SUCCESS
│ └─→ Try next
│
└─→ All failed → Check for JS protection, try alternative sources
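The decision tree above can be sketched as a single fallback loop. This is an illustrative sketch under stated assumptions, not the skill's own implementation: the names `extract_with_fallbacks`, `with_pdftotext`, `with_pymupdf`, `with_pdfplumber`, and `DEFAULT_EXTRACTORS` are hypothetical, and one character threshold is applied to every method (the tree uses a 50-word check for pdftotext). Each library is imported inside its own function so that a missing package only fails that step.

```python
import subprocess


def extract_with_fallbacks(pdf_path, extractors, min_chars=500):
    """Try each (name, callable) in order; return (name, text) of the first success.

    Raises RuntimeError listing every failure if no method yields enough text.
    """
    failures = []
    for name, extract in extractors:
        try:
            text = extract(pdf_path)
        except Exception as exc:  # includes ImportError and a missing binary
            failures.append(f"{name}: {exc}")
            continue
        if text and len(text.strip()) > min_chars:
            return name, text
        failures.append(f"{name}: output too short")
    raise RuntimeError("all extraction methods failed: " + "; ".join(failures))


def with_pdftotext(pdf_path):
    # "-" as the output file makes pdftotext write to stdout
    result = subprocess.run(
        ["pdftotext", "-layout", pdf_path, "-"],
        capture_output=True, text=True, check=True,
    )
    return result.stdout


def with_pymupdf(pdf_path):
    import fitz  # PyMuPDF
    with fitz.open(pdf_path) as doc:
        return "".join(page.get_text() for page in doc)


def with_pdfplumber(pdf_path):
    import pdfplumber
    with pdfplumber.open(pdf_path) as pdf:
        return "\n".join(page.extract_text() or "" for page in pdf.pages)


# Order follows the decision tree: fastest tool first
DEFAULT_EXTRACTORS = [
    ("pdftotext", with_pdftotext),
    ("PyMuPDF", with_pymupdf),
    ("pdfplumber", with_pdfplumber),
]
```

Calling `extract_with_fallbacks("document.pdf", DEFAULT_EXTRACTORS)` returns the name of the method that worked along with the text. The collected failure messages in the raised error give a starting point for the JavaScript and alternative-source checks.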
| Symptom | Likely Cause | Solution |
|---|---|---|
| File < 100 bytes | JavaScript error page | Check CORS, try different user-agent |
| File ~1-5KB | HTML error/warning page | Parse HTML for actual PDF link |
| pdftotext returns empty | Encrypted/protected PDF | Try PyMuPDF with password handling |
| Garbled text output | Encoding issue | Try pdfplumber, specify encoding |
| Extraction very short | Images-only PDF | Need OCR (tesseract) |
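The table above can be turned into a rough triage helper. This is a sketch only: the function name `diagnose`, its parameters, the 10% garbling ratio, and the 500-character cutoff are assumptions chosen for illustration, and the table's "~1-5KB" row is generalized here to any non-PDF file of 100 bytes or more.

```python
def diagnose(file_size, head, text):
    """Map observed symptoms to the likely cause listed in the table.

    file_size: bytes on disk; head: first bytes of the file;
    text: extracted text ('' if extraction produced nothing).
    """
    if not head.startswith(b"%PDF"):
        if file_size < 100:
            return "JavaScript error page: check CORS, try a different user-agent"
        return "HTML error/warning page: parse the HTML for the actual PDF link"
    stripped = text.strip()
    if not stripped:
        return "encrypted/protected PDF: try PyMuPDF with password handling"
    # Hypothetical garbling heuristic: share of U+FFFD or non-printable characters
    bad = sum(
        1 for ch in stripped
        if ch == "\ufffd" or not (ch.isprintable() or ch.isspace())
    )
    if bad / len(stripped) > 0.1:
        return "encoding issue: try pdfplumber"
    if len(stripped) < 500:
        return "images-only PDF: OCR needed (tesseract)"
    return "no known symptom matched"
```

The returned string is only a hint for which row of the table to look at first; several causes can overlap, for example an image-only PDF that also yields empty output.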
Save as extract_pdf_robust.sh:
#!/bin/bash
set -e
URL="$1"
OUTPUT="${2:-output.txt}"
TEMP_PDF="temp_download.pdf"
if [ -z "$URL" ]; then
    echo "Usage: $0 URL [OUTPUT]" >&2
    exit 1
fi
echo "Downloading from: $URL"
curl -L -A "Mozilla/5.0" -o "$TEMP_PDF" "$URL"
# Validate
SIZE=$(stat -c%s "$TEMP_PDF" 2>/dev/null || stat -f%z "$TEMP_PDF")
echo "Downloaded: $SIZE bytes"
if [ "$SIZE" -lt 1024 ]; then
    echo "ERROR: File too small - checking content..."
    head -c 200 "$TEMP_PDF"
    exit 1
fi
if ! head -c 4 "$TEMP_PDF" | grep -q "%PDF"; then
    echo "ERROR: Not a valid PDF file"
    head -c 100 "$TEMP_PDF"
    exit 1
fi
# Try extraction methods. Paths are passed as arguments because the quoted
# heredoc does not expand shell variables such as $OUTPUT.
python3 - "$TEMP_PDF" "$OUTPUT" << 'PYTHON'
import sys

pdf_path, output_path = sys.argv[1], sys.argv[2]

# Method 1: PyMuPDF
try:
    import fitz  # imported here so a missing package falls through to Method 2
    doc = fitz.open(pdf_path)
    text = "".join(page.get_text() for page in doc)
    if len(text.strip()) > 500:
        with open(output_path, "w", encoding="utf-8") as f:
            f.write(text)
        print(f"PyMuPDF: {len(text)} chars")
        sys.exit(0)
except Exception as e:
    print(f"PyMuPDF failed: {e}")

# Method 2: pdfplumber
try:
    import pdfplumber
    text = ""
    with pdfplumber.open(pdf_path) as pdf:
        for page in pdf.pages:
            txt = page.extract_text()
            if txt:
                text += txt + "\n"
    if len(text.strip()) > 500:
        with open(output_path, "w", encoding="utf-8") as f:
            f.write(text)
        print(f"pdfplumber: {len(text)} chars")
        sys.exit(0)
except Exception as e:
    print(f"pdfplumber failed: {e}")

print("All extraction methods failed")
sys.exit(1)
PYTHON
- curl - For downloading
- poppler-utils (pdftotext) - Optional, fast extraction
- PyMuPDF (fitz) - pip install pymupdf
- pdfplumber - pip install pdfplumber
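Before running the workflow, it can help to confirm which of these dependencies are actually present, since the fallback order only matters for tools that exist. The following is a small sketch; the function name `check_dependencies` and the dictionary keys are hypothetical, and it assumes PyMuPDF is importable under its `fitz` module name as used throughout this document.

```python
import importlib.util
import shutil


def check_dependencies():
    """Report which tools from the dependency list are available."""
    return {
        # Command-line tools are looked up on PATH
        "curl": shutil.which("curl") is not None,
        "pdftotext": shutil.which("pdftotext") is not None,
        # Python packages are located without importing them
        "pymupdf": importlib.util.find_spec("fitz") is not None,
        "pdfplumber": importlib.util.find_spec("pdfplumber") is not None,
    }


if __name__ == "__main__":
    for name, found in check_dependencies().items():
        print(f"{name}: {'found' if found else 'missing'}")
```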