pdf-text-extraction
Extract text from PDFs using shell tools when read_file fails
| name | pdf-text-extraction |
| description | Extract text from PDFs using shell tools when read_file fails |
Use this skill when read_file with filetype='pdf' fails or returns unusable text.
The built-in PDF handler is unreliable for structured text extraction. Shell-based tools provide more robust alternatives.
# Extract text from PDF to stdout
pdftotext /path/to/file.pdf -
# Extract text to a file
pdftotext /path/to/file.pdf output.txt
# Preserve layout (maintains spacing/structure)
pdftotext -layout /path/to/file.pdf output.txt
Usage in agent:
run_shell command="pdftotext -layout /path/to/document.pdf -"
# Get PDF metadata (pages, author, creation date, etc.)
pdfinfo /path/to/file.pdf
Usage in agent:
run_shell command="pdfinfo /path/to/document.pdf"
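When the metadata needs to be consumed programmatically, the `Key: value` lines that pdfinfo prints can be turned into a dictionary. The following is a minimal sketch, not part of the original skill: the function names `parse_pdfinfo` and `read_pdfinfo` and the `SAMPLE` text are hypothetical, and the sample only illustrates the general output shape.

```python
import subprocess
from typing import Dict

# Illustrative sample only; real pdfinfo output has more fields.
SAMPLE = (
    "Title:          Example\n"
    "Pages:          3\n"
    "CreationDate:   Tue Jan  1 12:00:00 2019\n"
)


def parse_pdfinfo(output: str) -> Dict[str, str]:
    """Split each 'Key: value' line on the first colon only,
    so values that contain colons (such as timestamps) stay intact."""
    info: Dict[str, str] = {}
    for line in output.splitlines():
        key, sep, value = line.partition(":")
        if sep and key.strip():
            info[key.strip()] = value.strip()
    return info


def read_pdfinfo(path: str) -> Dict[str, str]:
    """Run pdfinfo on a file and parse its stdout (requires poppler-utils)."""
    result = subprocess.run(
        ["pdfinfo", path], capture_output=True, text=True, check=True
    )
    return parse_pdfinfo(result.stdout)
```

A page count can then be read with `int(read_pdfinfo("/path/to/file.pdf")["Pages"])`, assuming the `Pages` field is present in the output.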
import fitz # PyMuPDF
doc = fitz.open("/path/to/file.pdf")
text = ""
for page in doc:
    text += page.get_text()
doc.close()
print(text)
Usage in agent:
run_shell command="python3 -c \"import fitz; doc=fitz.open('file.pdf'); print(''.join(p.get_text() for p in doc))\""
import pdfplumber
with pdfplumber.open("/path/to/file.pdf") as pdf:
    for page in pdf.pages:
        text = page.extract_text()
        tables = page.extract_tables()  # For tabular data
Usage in agent:
run_shell command="python3 -c \"import pdfplumber; pdf=pdfplumber.open('file.pdf'); print(''.join(p.extract_text() or '' for p in pdf.pages))\""
Try pdftotext first - Fastest and most reliable for plain text
run_shell command="pdftotext -layout /path/to/file.pdf -"
If pdftotext is unavailable, check for Python libraries
run_shell command="python3 -c \"import fitz; print('PyMuPDF available')\""
For table/structured data, use pdfplumber
run_shell command="python3 -c \"import pdfplumber; ...\""
Verify extraction succeeded - Check output contains readable text, not binary data
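The verification step can be approximated with a simple heuristic: treat the output as readable when it is non-empty and mostly printable characters. This is a sketch under stated assumptions; the function name `looks_like_text` and the 0.9 threshold are illustrative choices, not values from the original skill.

```python
def looks_like_text(extracted: str, min_printable_ratio: float = 0.9) -> bool:
    """Return True when the extracted string is non-empty and mostly printable.

    Newlines, tabs, carriage returns, and form feeds count as printable,
    since they are ordinary separators in extracted text.
    The threshold is an assumption; tune it for your documents.
    """
    stripped = extracted.strip()
    if not stripped:
        return False
    printable = sum(
        1 for ch in stripped if ch.isprintable() or ch in "\n\t\r\f"
    )
    return printable / len(stripped) >= min_printable_ratio
```

A result of `False` suggests falling back to the next tool in the workflow rather than using the output as-is.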
If tools are not available:
# Ubuntu/Debian
apt-get install poppler-utils # pdftotext, pdfinfo
# Install Python libraries
pip install pymupdf pdfplumber
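Before installing anything, it may help to check which of the tools are already present. The sketch below uses only the Python standard library; the name `available_tools` is hypothetical, and the check only confirms that a command is on PATH or a module can be located, not that it works correctly.

```python
import importlib.util
import shutil
from typing import Dict

CLI_TOOLS = ("pdftotext", "pdfinfo")   # provided by poppler-utils
PY_MODULES = ("fitz", "pdfplumber")    # fitz is the import name of PyMuPDF


def available_tools() -> Dict[str, bool]:
    """Report which command-line tools and Python modules can be found."""
    status = {name: shutil.which(name) is not None for name in CLI_TOOLS}
    for module in PY_MODULES:
        # find_spec locates the module without importing it
        status[module] = importlib.util.find_spec(module) is not None
    return status


if __name__ == "__main__":
    for name, found in available_tools().items():
        print(f"{name}: {'available' if found else 'missing'}")
```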
Avoid read_file with filetype='pdf' for critical text extraction.
# Step 1: Try pdftotext
RESULT=$(pdftotext -layout /path/to/form.pdf - 2>/dev/null)
# Step 2: Verify we got text, not error
if [ -z "$RESULT" ]; then
    # Fallback to Python
    RESULT=$(python3 -c "import fitz; doc=fitz.open('/path/to/form.pdf'); print(''.join(p.get_text() for p in doc))")
fi
# Step 3: Use extracted text
echo "$RESULT"
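The same try-then-fall-back pattern can be sketched in Python, which makes it easier to add further extractors later. The helper names (`first_nonempty`, `pdftotext_extract`, `pymupdf_extract`) are hypothetical, and treating any exception as "try the next tool" is a simplifying assumption.

```python
import shutil
import subprocess
from typing import Callable, Iterable, Optional


def pdftotext_extract(path: str) -> str:
    """Run `pdftotext -layout <path> -` and return its stdout."""
    if shutil.which("pdftotext") is None:
        raise RuntimeError("pdftotext not found on PATH")
    result = subprocess.run(
        ["pdftotext", "-layout", path, "-"],
        capture_output=True, text=True, check=True,
    )
    return result.stdout


def pymupdf_extract(path: str) -> str:
    """Concatenate page text with PyMuPDF; raises ImportError if not installed."""
    import fitz  # PyMuPDF
    doc = fitz.open(path)
    try:
        return "".join(page.get_text() for page in doc)
    finally:
        doc.close()


def first_nonempty(
    path: str, extractors: Iterable[Callable[[str], str]]
) -> Optional[str]:
    """Try each extractor in order; return the first non-blank result, else None."""
    for extract in extractors:
        try:
            text = extract(path)
        except Exception:
            continue  # tool missing or failed: move on to the next one
        if text.strip():
            return text
    return None
```

Usage would look like `first_nonempty("/path/to/form.pdf", [pdftotext_extract, pymupdf_extract])`; a return value of `None` means every extractor failed or produced only whitespace.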