pdf-download-extract-fallback
Multi-step PDF download and text extraction with progressive fallback strategies
| name | pdf-download-extract-fallback |
| description | Multi-step PDF download and text extraction with progressive fallback strategies |
This skill provides a robust workflow for acquiring PDF documents from web sources and extracting their text content, with multiple fallback mechanisms to handle various failure modes.
PDFs fetched from the web often fail in predictable ways: JavaScript redirects, corrupted files, missing tools, or inaccessible content. This workflow improves the success rate by working through progressive fallback strategies.
Many PDF hosting sites use JavaScript-based redirects or block automated requests. curl follows HTTP redirects only and does not execute JavaScript, but a realistic browser user-agent gets past many simple bot filters:
curl -L -A "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/120.0.0.0 Safari/537.36" -o output.pdf "URL_HERE"
Key flags:
- `-L`: Follow redirects
- `-A`: Set user-agent header to mimic a real browser
- `-o`: Specify output filename

Always validate that the downloaded file is actually a PDF before attempting extraction:
file output.pdf
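The output should contain "PDF document". If it does not, the download most likely returned something else, such as an HTML error or redirect page; consult the troubleshooting table before relying on any extraction result.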
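Where the `file` utility is unavailable, the same check can be approximated by reading the first bytes of the file. The following is a minimal sketch, not part of the original skill: the function name `sniff_file_kind` is hypothetical, and searching the first 1024 bytes for the `%PDF-` marker is an assumption based on how lenient PDF readers commonly behave.

```python
def sniff_file_kind(path):
    """Classify a file as 'pdf', 'html', or 'unknown' from its first bytes."""
    with open(path, "rb") as handle:
        head = handle.read(1024)

    # An HTML error or redirect page usually starts with a doctype or <html> tag.
    lowered = head.lstrip().lower()
    if lowered.startswith((b"<!doctype html", b"<html")):
        return "html"

    # PDFs begin with a "%PDF-" marker, occasionally after a few stray bytes.
    if b"%PDF-" in head:
        return "pdf"

    return "unknown"
```

A result of `"html"` corresponds to the "HTML content in file" symptom, which usually means the URL redirected to an error page.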
First attempt extraction using the standard pdftotext utility (part of poppler-utils):
pdftotext output.pdf output.txt
If pdftotext is not available, install it:
# Debian/Ubuntu
apt-get update && apt-get install -y poppler-utils
# macOS
brew install poppler
# RHEL/CentOS
yum install -y poppler-utils
If pdftotext fails or produces poor results, use Python's PyMuPDF library:
import fitz  # PyMuPDF

doc = fitz.open("output.pdf")
text = ""
for page in doc:
    text += page.get_text()
doc.close()

# Write as UTF-8 so non-ASCII text does not depend on the system locale
with open("output.txt", "w", encoding="utf-8") as f:
    f.write(text)
Install if needed:
pip install pymupdf
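The two extraction steps can also be chained in Python so that each method is tried in order and the first one that yields non-empty text wins. This is a sketch under stated assumptions rather than the skill's own implementation: the function names are hypothetical, and treating whitespace-only output as a failure is an assumption standing in for "poor results".

```python
import shutil
import subprocess


def extract_with_pdftotext(pdf_path):
    """Run pdftotext and return its output; raise if the tool is missing or fails."""
    if shutil.which("pdftotext") is None:
        raise RuntimeError("pdftotext not installed")
    # Passing "-" as the output file makes pdftotext write to stdout.
    result = subprocess.run(
        ["pdftotext", pdf_path, "-"], capture_output=True, check=True
    )
    return result.stdout.decode("utf-8", errors="replace")


def extract_with_pymupdf(pdf_path):
    """Extract text with PyMuPDF; the import is lazy so a missing package only fails this step."""
    import fitz  # PyMuPDF

    with fitz.open(pdf_path) as doc:
        return "".join(page.get_text() for page in doc)


DEFAULT_EXTRACTORS = (
    ("pdftotext", extract_with_pdftotext),
    ("pymupdf", extract_with_pymupdf),
)


def extract_text(pdf_path, extractors=DEFAULT_EXTRACTORS):
    """Try each extractor in order; return (name, text) from the first that yields text."""
    errors = []
    for name, extractor in extractors:
        try:
            text = extractor(pdf_path)
        except Exception as exc:  # any failure moves on to the next method
            errors.append(f"{name}: {exc}")
            continue
        if text.strip():
            return name, text
        errors.append(f"{name}: empty output")
    raise RuntimeError("all extractors failed: " + "; ".join(errors))
```

The `extractors` parameter keeps the chain open-ended, so an OCR step could be appended later without changing the loop.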
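If the PDF cannot be accessed or extracted after all attempts, document the failure, generate the content from domain knowledge, and clearly mark the source limitations.
Example degradation note: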
NOTE: Source document [URL] was inaccessible due to [reason].
Content below combines partial extraction with established domain knowledge
for [topic]. Verify against official sources when available.
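When the note is generated programmatically, the bracketed placeholders can be filled by a small helper. This is an illustrative sketch; the function name `degradation_note` and its parameters are hypothetical and simply mirror the three placeholders in the template above.

```python
def degradation_note(url, reason, topic):
    """Fill the degradation-note template with the failing URL, the reason, and the topic."""
    return (
        f"NOTE: Source document {url} was inaccessible due to {reason}.\n"
        "Content below combines partial extraction with established domain knowledge\n"
        f"for {topic}. Verify against official sources when available."
    )
```

Placing the returned note at the top of the generated content keeps the limitation visible to whoever reads the result.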
#!/bin/bash
# pdf-extract-workflow.sh - Handles both URL downloads and local files
INPUT="$1"
OUTPUT_PDF="downloaded.pdf"
OUTPUT_TXT="extracted.txt"
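# Step 1: Acquire the PDF (download it or use a local file)
if [[ "$INPUT" =~ ^https?:// ]]; then
    # Mode A: URL download
    PDF_URL="$INPUT"
    echo "Downloading PDF from URL..."
    # -f makes curl fail on HTTP errors instead of saving the error page as the PDF
    if ! curl -fL -A "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36" -o "$OUTPUT_PDF" "$PDF_URL"; then
        echo "ERROR: Download failed: $PDF_URL"
        exit 1
    fi
else
    # Mode B: Local file
    if [ ! -f "$INPUT" ]; then
        echo "ERROR: Local file not found: $INPUT"
        exit 1
    fi
    OUTPUT_PDF="$INPUT"
    echo "Using local file: $INPUT"
fi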
# Step 2: Verify file type
echo "Verifying file type..."
if ! file "$OUTPUT_PDF" | grep -q "PDF document"; then
    echo "WARNING: File is not a valid PDF"
    echo "Attempting extraction anyway..."
fi
# Step 3: Try pdftotext
echo "Attempting pdftotext extraction..."
if command -v pdftotext &> /dev/null; then
    # Count it as success only if the output contains some readable text
    if pdftotext "$OUTPUT_PDF" "$OUTPUT_TXT" 2>/dev/null && grep -q '[[:alnum:]]' "$OUTPUT_TXT"; then
        echo "Extraction successful with pdftotext"
        exit 0
    fi
fi
# Step 4: Fallback to PyMuPDF
echo "Falling back to PyMuPDF..."
# Pass the paths as arguments so local-file mode reads the right file
python3 - "$OUTPUT_PDF" "$OUTPUT_TXT" << 'PYTHON_SCRIPT'
import sys

pdf_path, txt_path = sys.argv[1], sys.argv[2]
try:
    import fitz  # PyMuPDF
    doc = fitz.open(pdf_path)
    text = ""
    for page in doc:
        text += page.get_text()
    doc.close()
    with open(txt_path, "w", encoding="utf-8") as f:
        f.write(text)
    print("Extraction successful with PyMuPDF")
    sys.exit(0)
except Exception as e:
    print(f"PyMuPDF failed: {e}")
    sys.exit(1)
PYTHON_SCRIPT
# Step 5: Handle complete failure (domain knowledge fallback)
if [ $? -ne 0 ]; then
    echo "ERROR: All extraction methods failed."
    echo "ACTION: Generate content from domain knowledge and clearly mark source limitations."
    echo "Document the failure and proceed with knowledge-based content generation."
    exit 1
fi
| Symptom | Cause | Solution |
|---|---|---|
| HTML content in file | URL redirected to error page or wrong file type | Check HTTP status, verify file with file command |
| Empty extraction | Password-protected or scanned PDF | Try OCR tools or request accessible version |
| Garbled text | Encoding issues | Try PyMuPDF with different extraction mode |
| Curl blocked | Anti-bot measures | Add more headers, use delay between requests |
| pdftotext not found | Tool not installed | Run apt-get install poppler-utils or use PyMuPDF fallback |
| PyMuPDF import failed | Package not installed | Run pip install pymupdf |
| File not found (local) | Incorrect path or file not accessible | Verify file path, check permissions, confirm file was uploaded |
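The "Empty extraction" and "Garbled text" symptoms in the table can be detected automatically before the text is used. The heuristic below is a rough sketch, not a definitive test: the function name, the 50-character minimum, and the 0.8 readable-character ratio are all hypothetical values chosen for illustration and would need tuning on real documents.

```python
def extraction_quality(text, min_chars=50, min_readable_ratio=0.8):
    """Return 'empty', 'garbled', or 'ok' for a block of extracted text."""
    stripped = text.strip()
    if len(stripped) < min_chars:
        # Typical for scanned or password-protected PDFs.
        return "empty"

    # Count letters, digits, whitespace, and common punctuation as readable.
    punctuation = ".,;:!?'\"()-"
    readable = sum(
        ch.isalnum() or ch.isspace() or ch in punctuation for ch in stripped
    )
    if readable / len(stripped) < min_readable_ratio:
        # Typical for encoding problems or broken font mappings.
        return "garbled"
    return "ok"
```

A result other than `"ok"` is a reasonable trigger for moving to the next row of the table, for example OCR for `"empty"` or a different extraction mode for `"garbled"`.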
When you already have the PDF file locally (not from a URL):
if [ ! -f "your_file.pdf" ]; then
    echo "ERROR: File not found"
    echo "ACTION: Verify the file path and that the file was successfully uploaded"
    exit 1
fi
file your_file.pdf
Expected output should contain "PDF document". If not, the file may be corrupted or mislabeled.
After validation, skip directly to Step 3: Primary Extraction with pdftotext in the main workflow.
Before attempting any PDF extraction, verify your environment has the necessary tools:
# Check pdftotext availability
command -v pdftotext && echo "pdftotext: AVAILABLE" || echo "pdftotext: NOT FOUND - install poppler-utils"
# Check PyMuPDF availability
python3 -c "import fitz; print('PyMuPDF: AVAILABLE')" 2>/dev/null || echo "PyMuPDF: NOT FOUND - run: pip install pymupdf"
Installation commands if tools are missing:
# Install pdftotext (poppler-utils)
apt-get update && apt-get install -y poppler-utils # Debian/Ubuntu
yum install -y poppler-utils # RHEL/CentOS
brew install poppler # macOS
# Install PyMuPDF
pip install pymupdf
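The same availability checks can be run from Python, which is convenient when the workflow is driven by a script rather than a shell. This is a small sketch with a hypothetical function name; it assumes PyMuPDF is importable under the module name `fitz`, as in the examples above.

```python
import importlib.util
import shutil


def check_tools():
    """Report which extraction tools are available in the current environment."""
    return {
        # shutil.which returns None when the executable is not on PATH.
        "pdftotext": shutil.which("pdftotext") is not None,
        # find_spec returns None when the module cannot be found.
        "pymupdf": importlib.util.find_spec("fitz") is not None,
    }


if __name__ == "__main__":
    for tool, available in check_tools().items():
        print(f"{tool}: {'AVAILABLE' if available else 'NOT FOUND'}")
```

If both entries are `False`, install at least one of the tools before starting the extraction steps.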
This skill provides a robust workflow for acquiring PDF documents from web sources or processing locally-available files and extracting their text content, with multiple fallback mechanisms to handle various failure modes.
Before beginning, identify your scenario:
| Scenario | Start Here | Skip |
|---|---|---|
| PDF already on local disk | Step 2 (Verify File Type) | Step 1 (Download) |
| PDF at a web URL | Step 1 (Download) | None |
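The scenario table can be expressed as a one-line decision. The sketch below reuses the `^https?://` test from the workflow script; the function name `starting_step` is hypothetical, and anything that is not an HTTP(S) URL is assumed to be a local path.

```python
import re

# Same pattern the workflow script uses to recognise a URL input.
URL_PATTERN = re.compile(r"^https?://")


def starting_step(source):
    """Return the workflow step to start from for a URL or a local path."""
    if URL_PATTERN.match(source):
        return "Step 1 (Download)"
    return "Step 2 (Verify File Type)"
```

For example, `starting_step("https://example.com/report.pdf")` selects the download step, while `starting_step("reports/report.pdf")` goes straight to file-type verification.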
PDFs from web sources or local files often fail in predictable ways: corrupted files, missing tools, or inaccessible content. This workflow improves the success rate by working through progressive fallback strategies.