pdf-text-extraction-9424c5

Extract text from PDF files using pdftotext when read_file returns binary data

Stars: 6,485

Forks: 808

Updated: March 24, 2026 at 08:03

Source: HKUDS/OpenSpace

Useful for (SOC): Software Developers, Computer and Mathematical Occupations, 15-1252, L4

name	pdf-text-extraction-9424c5
description	Extract text from PDF files using pdftotext when read_file returns binary data

PDF Text Extraction via pdftotext

Problem

When read_file is used on a PDF document, it may return binary image data or garbled content instead of readable text. This happens because a PDF is a binary container: its text is usually stored in compressed streams, and some PDFs hold only scanned page images, neither of which read_file can parse as plain text.
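
One way to recognise the problem programmatically is a small heuristic like the sketch below. The function name `looks_like_binary` and the 10% threshold are assumptions made for illustration, not part of the skill or of any read_file API; adjust them to the output you actually see.

```python
def looks_like_binary(content, threshold=0.10):
    """Heuristically decide whether read_file output is unusable as text.

    Hypothetical helper: returns True for empty output, raw PDF bytes,
    or text with a high share of control/replacement characters.
    """
    if isinstance(content, bytes):
        # Raw PDF data begins with the "%PDF-" header.
        if content.startswith(b"%PDF-"):
            return True
        content = content.decode("utf-8", errors="replace")
    if not content:
        # Nothing to work with, so treat it as unusable.
        return True
    if content.startswith("%PDF-"):
        return True
    # Count replacement characters and control characters other than
    # ordinary whitespace (tab, newline, carriage return, form feed).
    suspicious = sum(
        1 for ch in content
        if ch == "\ufffd" or (ord(ch) < 32 and ch not in "\t\n\r\f")
    )
    return suspicious / len(content) > threshold
```

If this returns True for the read_file result, that is a reasonable signal to switch to the pdftotext approach described in this document.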

Solution

Use the pdftotext command-line utility via run_shell to extract clean text content from PDF files.

Steps

1. Verify PDF file exists

import os

pdf_path = "path/to/document.pdf"
if not os.path.exists(pdf_path):
    raise FileNotFoundError(f"PDF not found: {pdf_path}")
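
Beyond checking that the path exists, it can help to confirm that the file really is a PDF before invoking pdftotext. The sketch below uses a hypothetical helper, `is_pdf_file`, that checks for the `%PDF-` header at the start of the file. It is deliberately strict and may reject otherwise readable files that have stray bytes before the header.

```python
import os


def is_pdf_file(path):
    """Return True if path is a regular file that begins with the PDF header."""
    if not os.path.isfile(path):
        return False
    with open(path, "rb") as f:
        # Well-formed PDFs start with the bytes "%PDF-" followed by a version.
        return f.read(5) == b"%PDF-"
```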

2. Extract text using pdftotext

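import shlex
from tools import run_shell

# Quote the path so spaces or quotes in the filename cannot break the shell command
quoted_pdf = shlex.quote(pdf_path)

# Extract text to stdout ("-" as the output file name means stdout)
result = run_shell(command=f"pdftotext {quoted_pdf} -", timeout=60)
pdf_text = result.stdout

# Alternative: extract to a temporary file
temp_txt = "/tmp/extracted.txt"
run_shell(command=f"pdftotext {quoted_pdf} {shlex.quote(temp_txt)}", timeout=60)
with open(temp_txt, "r", encoding="utf-8") as f:
    pdf_text = f.read()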

3. Handle parameter naming carefully

When calling read_file, pass the file type using the exact keyword the tool expects:

Use filetype="pdf" (not file_type)
Other tool implementations may name this parameter differently, so check the signature of the read_file you are calling

# Correct parameter usage
content = read_file(file_path="doc.pdf", filetype="pdf")

# If this returns binary/garbled data, fall back to pdftotext
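
The fallback described in this step can be sketched as a small wrapper. Because read_file and run_shell are only available inside the agent environment, this example takes both readers as arguments; `read_pdf_with_fallback`, `primary_reader`, and `fallback_extractor` are hypothetical names, and the usability check is a simple assumption rather than a definitive test.

```python
def read_pdf_with_fallback(pdf_path, primary_reader, fallback_extractor):
    """Try primary_reader first; use fallback_extractor if the result is unusable.

    Both callables take the PDF path and return the extracted content.
    """
    try:
        content = primary_reader(pdf_path)
    except Exception:
        # Treat any reader failure (e.g. a wrong parameter name) as "no content".
        content = None

    usable = (
        isinstance(content, str)
        and content.strip() != ""
        and not content.startswith("%PDF-")  # raw PDF data returned as text
        and "\x00" not in content            # NUL bytes indicate binary data
    )
    if usable:
        return content
    return fallback_extractor(pdf_path)
```

In the agent environment, `primary_reader` could be `lambda p: read_file(file_path=p, filetype="pdf")` and `fallback_extractor` could be the `extract_pdf_text` function from the Error Handling section.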

Common pdftotext Options

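Option	Description
`-`	Used as the output file name, writes the text to stdout
`-layout`	Maintain the original physical layout of the text
`-f <n>`	First page to convert
`-l <n>`	Last page to convert
`-q`	Quiet mode: suppress messages and errors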

Example with options:

import shlex  # quote the path so spaces or quotes in it cannot break the command
result = run_shell(command=f"pdftotext -layout -q {shlex.quote(pdf_path)} -", timeout=60)
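
When several options are combined, assembling the command string by hand gets error-prone. The sketch below builds it from keyword arguments and quotes every part with `shlex.join` (Python 3.8+). The helper name `build_pdftotext_command` and its parameters are hypothetical. Only the flags listed in the table above are used.

```python
import shlex


def build_pdftotext_command(pdf_path, output="-", layout=False, quiet=False,
                            first_page=None, last_page=None):
    """Build a shell-safe pdftotext command string from the given options."""
    parts = ["pdftotext"]
    if layout:
        parts.append("-layout")
    if quiet:
        parts.append("-q")
    if first_page is not None:
        parts += ["-f", str(int(first_page))]
    if last_page is not None:
        parts += ["-l", str(int(last_page))]
    # Input file first, then the output target ("-" means stdout).
    parts += [pdf_path, output]
    # shlex.join quotes each part, so spaces and quotes in paths are safe.
    return shlex.join(parts)


print(build_pdftotext_command("my file.pdf", layout=True, first_page=2, last_page=3))
# prints: pdftotext -layout -f 2 -l 3 'my file.pdf' -
```

The resulting string can be passed as the `command` argument of run_shell.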

Error Handling

import os
import shlex

from tools import run_shell

def extract_pdf_text(pdf_path):
    """Extract text from PDF using pdftotext with error handling."""
    if not os.path.exists(pdf_path):
        raise FileNotFoundError(f"PDF not found: {pdf_path}")

    # shlex.quote keeps spaces and quotes in the path from breaking the command
    result = run_shell(command=f"pdftotext {shlex.quote(pdf_path)} -", timeout=60)

    # A non-zero exit code means pdftotext could not process the file
    if result.returncode != 0:
        raise RuntimeError(f"pdftotext failed: {result.stderr}")

    return result.stdout.strip()
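
Outside the agent environment, where `tools.run_shell` is not available, the same pattern can be approximated with the standard library. This is a sketch under that assumption, not the skill's own implementation; the function name `extract_pdf_text_subprocess` is hypothetical. Passing the arguments as a list avoids shell quoting altogether, and the two extra `except` branches cover a missing binary and a timeout.

```python
import os
import subprocess


def extract_pdf_text_subprocess(pdf_path, timeout=60):
    """Extract text with pdftotext via subprocess, raising clear errors."""
    if not os.path.isfile(pdf_path):
        raise FileNotFoundError(f"PDF not found: {pdf_path}")

    try:
        proc = subprocess.run(
            ["pdftotext", pdf_path, "-"],  # "-" sends the text to stdout
            capture_output=True,
            encoding="utf-8",   # pdftotext writes UTF-8 by default
            errors="replace",
            timeout=timeout,
        )
    except FileNotFoundError as exc:
        # Raised by subprocess when the pdftotext executable cannot be found.
        raise RuntimeError("pdftotext is not installed or not on PATH") from exc
    except subprocess.TimeoutExpired as exc:
        raise RuntimeError(f"pdftotext timed out after {timeout}s") from exc

    if proc.returncode != 0:
        raise RuntimeError(
            f"pdftotext failed (exit {proc.returncode}): {proc.stderr.strip()}"
        )
    return proc.stdout.strip()
```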

When to Use This Pattern

read_file returns binary data, garbled text, or image content for a PDF
You need searchable/processable text from PDF documents
The PDF has a text layer (pdftotext cannot read scanned, image-only pages; for those, consider OCR tools)
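
To tell whether a PDF falls into the scanned-image case, one option is to look for pages that yielded almost no text. By default pdftotext ends each page with a form feed character (`\f`), so the raw output can be split per page. The helper name `pages_without_text` and the 20-character threshold are assumptions for illustration. Apply it to the raw output, before any `.strip()`, since stripping removes the trailing form feeds.

```python
def pages_without_text(pdf_text, min_chars=20):
    """Return 1-based page numbers whose extracted text is shorter than min_chars."""
    pages = pdf_text.split("\f")
    # Each page ends with a form feed, so the final split element is empty.
    if pages and pages[-1] == "":
        pages.pop()
    return [
        number
        for number, page in enumerate(pages, start=1)
        if len(page.strip()) < min_chars
    ]
```

If most pages are reported, the document is probably scanned and an OCR tool is the better choice.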

Prerequisites

pdftotext must be installed (part of poppler-utils on Debian/Ubuntu, poppler on macOS via Homebrew)
Verify availability: run_shell(command="which pdftotext")
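
The same availability check can be done from Python without a shell, using `shutil.which` from the standard library. The helper names `pdftotext_available` and `require_pdftotext` are hypothetical. The install hint repeats the package names given above.

```python
import shutil


def pdftotext_available():
    """Return True if a pdftotext executable is found on PATH."""
    return shutil.which("pdftotext") is not None


def require_pdftotext():
    """Raise a descriptive error if pdftotext is missing."""
    if not pdftotext_available():
        raise RuntimeError(
            "pdftotext not found; install poppler-utils (Debian/Ubuntu) "
            "or poppler (macOS via Homebrew)"
        )
```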
