pdf-extract-shell-first
PDF text extraction with tool cascade prioritizing shell pdftotext before Python fallback
| name | pdf-extract-shell-first |
| description | PDF text extraction with tool cascade prioritizing shell pdftotext before Python fallback |
This skill provides an optimized workflow for extracting text content from PDF documents (local files or downloaded URLs) using a prioritized tool cascade that favors shell-based extraction before falling back to Python libraries.
Analysis of execution patterns shows:
- read_file on PDFs sometimes returns binary/image data instead of text
- run_shell with pdftotext has a higher success rate and fewer sandbox errors
- execute_code_sandbox can fail with "unknown error" in constrained environments

Before beginning, identify your scenario:
| Scenario | Start Here | Skip |
|---|---|---|
| PDF already on local disk | Step 1 (Try read_file) | Shell download steps |
| PDF at a web URL | Shell download, then Step 1 | None |
| Need maximum reliability | Full cascade (all 3 tools) | None |
If your PDF is at a web URL, download it first with curl, sending a browser user-agent:
curl -L -A "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/120.0.0.0 Safari/537.36" -o target.pdf "URL_HERE"
Key flags:
- `-L`: Follow redirects
- `-A`: Set user-agent header to mimic a real browser
- `-o`: Specify output filename

If you already have the PDF locally, skip to Step 1.
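A download can succeed at the HTTP level and still deliver an HTML error page instead of a PDF. The sketch below checks for the `%PDF-` header marker before any extraction is attempted. The function name `looks_like_pdf` and the 1 KiB search window are assumptions made for illustration, not part of this skill.

```python
from pathlib import Path


def looks_like_pdf(path: str) -> bool:
    """Return True if the file appears to start with a PDF header."""
    p = Path(path)
    if not p.is_file():
        return False
    with p.open("rb") as f:
        head = f.read(1024)
    # PDF files carry a "%PDF-" marker at (or very near) the start of the file.
    # Searching the first 1 KiB tolerates a few leading junk bytes.
    return b"%PDF-" in head
```

If this returns False for a freshly downloaded target.pdf, the URL most likely served a login page or an error document, and re-downloading is more useful than running the cascade.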
First, attempt to extract text using the read_file tool:
read_file(filetype="pdf", file_path="target.pdf")
Evaluate the response:
| Response Type | Interpretation | Next Action |
|---|---|---|
| Clean readable text | Success | Proceed to content analysis |
| Binary data / PNG image / garbled | read_file returned raw data | Go to Step 2 immediately |
| Error / timeout | Tool failure | Go to Step 2 immediately |
Critical: If read_file returns binary image data or garbled content, do not retry read_file. Immediately proceed to Step 2.
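The decision in the table above can be approximated with a small heuristic. This is a minimal sketch, assuming the read_file response is available as a Python string. The function name, the prefix list, and both thresholds are illustrative assumptions and may need tuning for real documents.

```python
# Prefixes that suggest raw file bytes were returned instead of extracted text.
BINARY_PREFIXES = ("%PDF-", "\x89PNG")


def looks_like_readable_text(response: str, min_chars: int = 20,
                             min_ratio: float = 0.85) -> bool:
    """Heuristic: True if the response is mostly printable characters."""
    stripped = response.strip()
    if len(stripped) < min_chars:
        return False
    if stripped.startswith(BINARY_PREFIXES):
        return False
    # Count printable characters; the Unicode replacement character is
    # treated as a sign of decoding damage, not as readable text.
    readable = sum(
        1 for ch in stripped
        if ch != "\ufffd" and (ch.isprintable() or ch in "\n\r\t")
    )
    return readable / len(stripped) >= min_ratio
```

A False result corresponds to the "Go to Step 2 immediately" rows: the response is not retried, and the cascade moves on.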
When read_file fails or returns binary data, use run_shell with pdftotext:
run_shell(command="pdftotext target.pdf output.txt")
Then read the extracted text:
read_file(filetype="txt", file_path="output.txt")
If pdftotext is not found, install it first:
run_shell(command="apt-get update && apt-get install -y poppler-utils")
# Or for macOS:
run_shell(command="brew install poppler")
Then retry:
run_shell(command="pdftotext target.pdf output.txt")
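Where a plain Python environment with subprocess access is available instead of the run_shell tool, the availability check and the extraction can be combined. The sketch below is one possible shape, not this skill's implementation. The function name and the `tool` parameter are hypothetical; the parameter exists only so the lookup can be exercised without poppler installed.

```python
import shutil
import subprocess
from pathlib import Path


def extract_with_pdftotext(pdf_path: str, txt_path: str,
                           tool: str = "pdftotext") -> bool:
    """Run pdftotext if it is on PATH; return True only on non-empty output."""
    if shutil.which(tool) is None:
        # Tool not installed: the caller should install poppler or fall back.
        return False
    result = subprocess.run([tool, pdf_path, txt_path],
                            capture_output=True, text=True)
    out = Path(txt_path)
    # Success requires a zero exit code and an output file with content.
    return result.returncode == 0 and out.is_file() and out.stat().st_size > 0
```

A False return covers three cases: tool missing, non-zero exit, and empty output. Any of them should send the workflow to the next stage of the cascade.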
Verify extraction quality:
- output.txt exists and has content

If pdftotext is unavailable or produces poor results, use Python's PyMuPDF via execute_code_sandbox:
import fitz  # PyMuPDF

doc = fitz.open("target.pdf")
page_count = len(doc)  # read before close(); a closed document cannot be queried
text = ""
for page in doc:
    text += page.get_text()
doc.close()

with open("output.txt", "w", encoding="utf-8") as f:
    f.write(text)

print(f"Extracted {len(text)} characters from {page_count} pages")
Execute via:
execute_code_sandbox(code="<python code above>")
Then read the result:
read_file(filetype="txt", file_path="output.txt")
Note: execute_code_sandbox may fail with "unknown error" in some environments. If this occurs, document the failure and proceed to Step 4.
If all extraction methods fail:
Example degradation documentation:
EXTRACTION FAILURE REPORT:
- Source: [URL or file path]
- read_file: Returned binary/image data (no text extraction)
- run_shell/pdftotext: [Tool not available / produced garbled output / succeeded]
- execute_code_sandbox/PyMuPDF: [Failed with unknown error / succeeded]
NOTE: Content below combines partial extraction with established domain
knowledge for [topic]. Verify against official sources when available.
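The report can be assembled from the recorded outcome of each tool so that no stage is left out. The following is a minimal sketch, assuming outcomes are collected in a dict keyed by tool name. The function name `build_failure_report` and its parameters are hypothetical.

```python
def build_failure_report(source: str, outcomes: dict[str, str], topic: str) -> str:
    """Format a failure report in the template shown above."""
    lines = ["EXTRACTION FAILURE REPORT:", f"- Source: {source}"]
    # One line per tool, in the order the tools were tried.
    for tool, outcome in outcomes.items():
        lines.append(f"- {tool}: {outcome}")
    lines.append("NOTE: Content below combines partial extraction with established domain")
    lines.append(f"knowledge for {topic}. Verify against official sources when available.")
    return "\n".join(lines)
```

Because dicts preserve insertion order, recording outcomes as each tool runs keeps the report in cascade order.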
             PDF to Extract
                    │
                    ▼
            ┌───────────────┐
            │   read_file   │
            │   (primary)   │
            └───────┬───────┘
                    │
      ┌─────────────┼─────────────┐
      │             │             │
Returns text Returns binary Error/timeout
     (✓)      / image data        │
      │             │             │
      ▼             ▼             ▼
   SUCCESS         ┌───────────────┐
                   │   run_shell   │
                   │   pdftotext   │
                   └───────┬───────┘
                           │
                   ┌───────┼───────┐
                   │       │       │
               Succeeds   Not   Garbled
                  (✓)   avail.  output
                   │       │       │
                   ▼       ▼       ▼
                SUCCESS ┌───────────────┐
                        │ execute_code  │
                        │   _sandbox    │
                        │    PyMuPDF    │
                        └───────┬───────┘
                                │
                        ┌───────┼───────┐
                        │       │       │
                    Succeeds  Fails   Error
                       (✓)      │       │
                        │       ▼       │
                        ▼    Domain     │
                    SUCCESS Knowledge   │
                                │       │
                                └───────┘
                                 FAILURE
                               DOCUMENTED
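The flow in the diagram reduces to "try each extractor in order, stop at the first usable text, and record why the others failed". A minimal sketch of that control flow follows. The extractor callables stand in for read_file, pdftotext, and PyMuPDF; `run_cascade` and the `Extractor` alias are names assumed here for illustration.

```python
from typing import Callable, Optional

# An extractor takes a PDF path and returns text, or None if it got nothing.
Extractor = Callable[[str], Optional[str]]


def run_cascade(
    pdf_path: str, extractors: list[tuple[str, Extractor]]
) -> tuple[Optional[str], dict[str, str]]:
    """Try extractors in order; return (text, failures recorded so far)."""
    failures: dict[str, str] = {}
    for name, extract in extractors:
        try:
            text = extract(pdf_path)
        except Exception as exc:
            # A crashing tool is recorded and the cascade continues.
            failures[name] = f"error: {exc}"
            continue
        if text and text.strip():
            return text, failures
        failures[name] = "no usable text"
    # Every stage failed: the caller documents the failures.
    return None, failures
```

When the first element of the result is None, the failures dict holds one entry per tool, which is the information the failure report needs.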
#!/bin/bash
# pdf-extract-cascade.sh
# Implements the full tool cascade for PDF extraction

INPUT="$1"
OUTPUT_PDF="target.pdf"
OUTPUT_TXT="output.txt"

if [ -z "$INPUT" ]; then
    echo "Usage: $0 <pdf-path-or-url>"
    exit 1
fi

# Step 0: Handle URL vs local file
if [[ "$INPUT" =~ ^https?:// ]]; then
    echo "Downloading PDF from URL..."
    if ! curl -L -A "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36" -o "$OUTPUT_PDF" "$INPUT"; then
        echo "ERROR: Download failed: $INPUT"
        exit 1
    fi
else
    if [ ! -f "$INPUT" ]; then
        echo "ERROR: Local file not found: $INPUT"
        exit 1
    fi
    OUTPUT_PDF="$INPUT"
fi

# Step 1: Verify file type (warn only; extraction is still attempted)
echo "Verifying file type..."
if ! file "$OUTPUT_PDF" | grep -q "PDF document"; then
    echo "WARNING: File is not a valid PDF"
    file "$OUTPUT_PDF"
fi

# Step 2: Try pdftotext (shell-first approach)
echo "Attempting pdftotext extraction..."
if command -v pdftotext &> /dev/null; then
    if pdftotext "$OUTPUT_PDF" "$OUTPUT_TXT" 2>/dev/null; then
        # -s: output file exists and is non-empty
        if [ -s "$OUTPUT_TXT" ]; then
            echo "SUCCESS: Extraction completed with pdftotext"
            wc -l "$OUTPUT_TXT"
            exit 0
        fi
    fi
fi

# Step 3: Fallback to PyMuPDF
# Paths are passed as arguments because the quoted heredoc does not expand
# shell variables, and the input may be a local file rather than target.pdf.
echo "Falling back to PyMuPDF..."
python3 - "$OUTPUT_PDF" "$OUTPUT_TXT" << 'PYTHON_SCRIPT'
import sys
import fitz  # PyMuPDF

pdf_path, txt_path = sys.argv[1], sys.argv[2]
try:
    doc = fitz.open(pdf_path)
    page_count = len(doc)  # read before close(); a closed document cannot be queried
    text = ""
    for page in doc:
        text += page.get_text()
    doc.close()
    with open(txt_path, "w", encoding="utf-8") as f:
        f.write(text)
    print(f"SUCCESS: Extracted {len(text)} characters from {page_count} pages")
    sys.exit(0)
except Exception as e:
    print(f"PyMuPDF failed: {e}")
    sys.exit(1)
PYTHON_SCRIPT

# Step 4: Handle complete failure
if [ $? -ne 0 ]; then
    echo "FAILURE: All extraction methods failed"
    echo "ACTION: Generate content from domain knowledge"
    echo "Document each tool's failure mode for future reference"
    exit 1
fi
- If read_file returns image/binary data, immediately switch to run_shell
- Confirm pdftotext is available before starting complex workflows

| Symptom | Likely Cause | Recommended Action |
|---|---|---|
| read_file returns PNG/binary | PDF rendered as image, not parsed | Immediately use run_shell with pdftotext |
| pdftotext: command not found | poppler-utils not installed | Run apt-get install poppler-utils first |
| pdftotext produces empty file | Password-protected or scanned PDF | Try PyMuPDF, or use OCR tools |
| execute_code_sandbox "unknown error" | Sandbox execution issue | Document failure, use domain knowledge fallback |
| Garbled text output | Encoding issues | Try PyMuPDF with page.get_text("text") |
| All tools fail | Severely corrupted or encrypted PDF | Document limitation, use knowledge-based content |
- read_file may return binary data
- execute_code_sandbox has shown instability

This skill (pdf-extract-shell-first) differs from the parent in these key ways:
- Tool order: read_file → run_shell → execute_code_sandbox
- Treats pdftotext via run_shell as preferred over Python