pdf-text-extraction-fallback
Extract text from PDFs using pdftotext when read_file returns binary data
| name | pdf-text-extraction-fallback |
| description | Extract text from PDFs using pdftotext when read_file returns binary data |
Use this skill when read_file with filetype="pdf" returns binary/image data instead of readable text content. This is a common issue with PDF files that contain embedded images or complex formatting.
Before attempting extraction, ensure you're using the correct parameter name:
- Use filetype (not file_type) for the read_file function

# Correct
read_file(filetype="pdf", file_path="document.pdf")
# Incorrect - may fail silently
read_file(file_type="pdf", file_path="document.pdf")
After calling read_file, check if the result contains:
- Binary or image data (for example, b'...' byte strings with non-text content)

If yes, proceed with the pdftotext workaround.
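The check above can be sketched as a small heuristic. The helper name `looks_like_binary`, the 10% threshold, and the assumption that read_file returns either `str` or `bytes` are illustrative choices, not part of the read_file API.

```python
def looks_like_binary(content, threshold=0.10):
    """Heuristic: return True if content appears to be binary rather than text.

    Assumes content is str, bytes, or bytearray (an assumption about read_file).
    """
    if isinstance(content, (bytes, bytearray)):
        try:
            content = bytes(content).decode("utf-8")
        except UnicodeDecodeError:
            return True  # not valid UTF-8, so treat it as binary
    if not content:
        return False  # empty input is handled separately by the caller
    if content.startswith("%PDF-"):
        return True  # raw PDF bytes were returned instead of extracted text
    # Count control characters (except common whitespace) and U+FFFD,
    # the replacement character left behind by lossy decoding.
    suspicious = sum(
        1
        for ch in content
        if (ord(ch) < 32 and ch not in "\t\n\r\f") or ch == "\ufffd"
    )
    return suspicious / len(content) > threshold
```

The threshold is a guess; lower it if partially garbled output slips through, or raise it if legitimate text is rejected.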
Use run_shell to call pdftotext, which extracts text directly from PDF files:
# Extract text to stdout
result = run_shell(command="pdftotext /path/to/document.pdf -")
text_content = result.stdout
The trailing - argument takes the place of the output file name and tells pdftotext to write the text to stdout instead of creating a file.
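Paths that contain spaces or shell metacharacters will break a command string built by plain concatenation. A minimal sketch of a safer builder follows, assuming run_shell passes the string to a POSIX shell; `build_pdftotext_command` is a hypothetical helper, while `-layout` is an existing pdftotext option that preserves the physical page layout.

```python
import shlex


def build_pdftotext_command(pdf_path, output="-", layout=False):
    """Build a shell-safe pdftotext command string (hypothetical helper)."""
    parts = ["pdftotext"]
    if layout:
        parts.append("-layout")  # keep the original physical layout
    parts.append(shlex.quote(pdf_path))
    parts.append(shlex.quote(output))  # "-" is left as-is and means stdout
    return " ".join(parts)
```

The result can be passed straight to run_shell, for example `run_shell(command=build_pdftotext_command("my report.pdf"))`.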
result = run_shell(command="pdftotext /path/to/document.pdf -")
if result.stderr:
    # Check for errors like "pdftotext not found"
    # May need to install poppler-utils
    pass
text_content = result.stdout
# text_content now contains the extracted text
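Output on stderr does not always mean failure, since pdftotext may print warnings for a malformed PDF and still extract text. Where plain Python is available, a sketch like the following checks for the binary up front and relies on the exit code instead. It uses the standard `subprocess` and `shutil` modules rather than run_shell; the function name and the `(text, error)` return shape are assumptions for illustration.

```python
import shutil
import subprocess


def run_pdftotext(pdf_path, executable="pdftotext"):
    """Return (text, error); exactly one of the two is None."""
    if shutil.which(executable) is None:
        return None, f"{executable} not found; install poppler-utils or poppler"
    proc = subprocess.run(
        [executable, pdf_path, "-"],  # "-" sends the text to stdout
        capture_output=True,
        text=True,
        encoding="utf-8",
        errors="replace",  # do not crash on stray undecodable bytes
    )
    if proc.returncode != 0:
        message = proc.stderr.strip()
        return None, message or f"{executable} exited with code {proc.returncode}"
    return proc.stdout, None
```

Passing the arguments as a list avoids shell quoting problems entirely.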
# Step 1: Try normal read with correct parameters
content = read_file(filetype="pdf", file_path="reference.pdf")
# Step 2: Check if content is readable
if not content or looks_like_binary(content):
    # Step 3: Fall back to pdftotext
    result = run_shell(command="pdftotext reference.pdf -")
    text_content = result.stdout
    # Step 4: Verify extraction succeeded
    if result.stderr:
        # Handle error (e.g., install pdftotext)
        pass
else:
    # The normal read already returned usable text
    text_content = content
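Step 4 can also inspect the extracted text itself. pdftotext separates pages with form feed characters, so a PDF with no text layer (a scan, for instance) can yield output made up only of whitespace and page breaks even when no error is reported. A minimal sketch of such a check, with a hypothetical helper name and an arbitrary default threshold:

```python
def extraction_succeeded(text, min_chars=1):
    """Return True if text has at least min_chars non-whitespace-padded characters.

    str.strip() also removes the form feed page breaks that pdftotext emits.
    """
    return text is not None and len(text.strip()) >= min_chars
```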
pdftotext is part of the poppler-utils package:
- Debian/Ubuntu: apt-get install poppler-utils
- macOS (Homebrew): brew install poppler

If the stdout approach has issues, output to a temporary file:
run_shell(command="pdftotext /path/to/document.pdf /tmp/output.txt")
text_content = read_file(filetype="txt", file_path="/tmp/output.txt")
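A fixed path such as /tmp/output.txt can collide with other runs and is never cleaned up. The sketch below uses a unique temporary file and removes it afterwards. It assumes run_shell accepts a `command` keyword as in the examples above and runs it through a POSIX shell; it reads the file with Python's built-in `open` rather than read_file, and `extract_via_temp_file` is a hypothetical name.

```python
import os
import shlex
import tempfile


def extract_via_temp_file(pdf_path, run_shell):
    """Run pdftotext into a unique temp file, return its text, then clean up."""
    fd, out_path = tempfile.mkstemp(suffix=".txt")
    os.close(fd)  # pdftotext writes the file itself; only the path is needed
    try:
        command = f"pdftotext {shlex.quote(pdf_path)} {shlex.quote(out_path)}"
        run_shell(command=command)
        with open(out_path, encoding="utf-8", errors="replace") as handle:
            return handle.read()
    finally:
        os.remove(out_path)
```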
- Verify the filetype parameter spelling before troubleshooting