name	pdf-text-extraction
description	Extract text from PDFs using shell tools when read_file fails

PDF Text Extraction (Fallback Method)

When to Use This Skill

Use this skill when read_file with filetype='pdf':

Returns binary image data instead of text
Produces errors or incomplete content
Fails to extract structured data reliably

The built-in PDF handler is unreliable for structured text extraction. Shell-based tools provide more robust alternatives.

Available Methods

Method 1: pdftotext (Recommended)

# Extract text from PDF to stdout
pdftotext /path/to/file.pdf -

# Extract text to a file
pdftotext /path/to/file.pdf output.txt

# Preserve layout (maintains spacing/structure)
pdftotext -layout /path/to/file.pdf output.txt

Usage in agent:

run_shell command="pdftotext -layout /path/to/document.pdf -"
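
When the extraction is driven from Python instead of run_shell, the same invocation can be wrapped in a small helper. This is a minimal sketch, assuming pdftotext from poppler-utils is on PATH; the helper names `build_pdftotext_cmd` and `pdftotext_extract` are hypothetical and not part of any library.

```python
import shutil
import subprocess


def build_pdftotext_cmd(pdf_path, layout=True):
    """Build the pdftotext argument list, writing text to stdout ("-")."""
    cmd = ["pdftotext"]
    if layout:
        cmd.append("-layout")  # preserve spacing/structure
    cmd += [pdf_path, "-"]
    return cmd


def pdftotext_extract(pdf_path, layout=True):
    """Return extracted text, or None if pdftotext is missing or fails."""
    if shutil.which("pdftotext") is None:
        return None  # tool not installed; caller should try another method
    proc = subprocess.run(
        build_pdftotext_cmd(pdf_path, layout),
        capture_output=True,
        encoding="utf-8",
        errors="replace",  # do not crash on unexpected bytes in the output
    )
    return proc.stdout if proc.returncode == 0 else None
```

Passing the arguments as a list avoids shell quoting problems with paths that contain spaces.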

Method 2: pdfinfo (Metadata)

# Get PDF metadata (pages, author, creation date, etc.)
pdfinfo /path/to/file.pdf

Usage in agent:

run_shell command="pdfinfo /path/to/document.pdf"
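
pdfinfo prints one `Key: value` pair per line, so its output is easy to turn into a dictionary, for example to read the page count before extracting. The sketch below assumes that line format; `parse_pdfinfo` is a hypothetical helper and the sample text is hand-written for illustration, not real tool output.

```python
def parse_pdfinfo(output):
    """Parse pdfinfo text output into a dict of field name -> value."""
    info = {}
    for line in output.splitlines():
        # Split on the first colon only: values such as dates contain colons.
        key, sep, value = line.partition(":")
        if sep and key.strip():
            info[key.strip()] = value.strip()
    return info


# Illustrative sample in the shape pdfinfo prints (made up for this example).
sample = """Title:          Quarterly Report
Pages:          12
Page size:      612 x 792 pts (letter)
CreationDate:   Tue Mar  3 10:15:00 2026
"""

info = parse_pdfinfo(sample)
print(int(info["Pages"]))
```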

Method 3: Python with PyMuPDF (fitz)

import fitz  # PyMuPDF

# The context manager closes the document when the block exits
with fitz.open("/path/to/file.pdf") as doc:
    text = "".join(page.get_text() for page in doc)
print(text)

Usage in agent:

run_shell command="python3 -c \"import fitz; doc=fitz.open('file.pdf'); print(''.join(p.get_text() for p in doc))\""

Method 4: Python with pdfplumber (Tables)

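import pdfplumber

with pdfplumber.open("/path/to/file.pdf") as pdf:
    for page in pdf.pages:
        text = page.extract_text() or ""  # Guard against pages with no text
        tables = page.extract_tables()  # For tabular data: one list of rows per table
        print(text)
        for table in tables:
            for row in table:
                print(row)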

Usage in agent:

run_shell command="python3 -c \"import pdfplumber; pdf=pdfplumber.open('file.pdf'); print(''.join(p.extract_text() or '' for p in pdf.pages))\""
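
Each table returned by `page.extract_tables()` is a list of rows, and each row is a list of cell strings, with `None` for empty cells. One way to hand that structure to downstream tools is to serialize it as CSV. This is a sketch only: `table_to_csv` is a hypothetical helper and the sample table is hand-written to match that shape.

```python
import csv
import io


def table_to_csv(table):
    """Serialize one table (list of rows of cells) to CSV; None cells become ''."""
    buf = io.StringIO()
    writer = csv.writer(buf, lineterminator="\n")
    for row in table:
        writer.writerow(["" if cell is None else cell for cell in row])
    return buf.getvalue()


# Hand-written sample in the shape of a single extracted table.
sample_table = [["Item", "Qty"], ["Widget", "2"], ["Gadget", None]]
print(table_to_csv(sample_table))
```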

Workflow

Try pdftotext first - Fastest and most reliable for plain text

run_shell command="pdftotext -layout /path/to/file.pdf -"

If pdftotext is unavailable, check for Python libraries

run_shell command="python3 -c \"import fitz; print('PyMuPDF available')\""

For table/structured data, use pdfplumber

run_shell command="python3 -c \"import pdfplumber; ...\""

Verify extraction succeeded - Check output contains readable text, not binary data
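
The availability checks in the workflow above can be sketched as a single selection function. It assumes the same preference order (pdftotext, then PyMuPDF, with pdfplumber preferred when tables are needed); the function name and the injectable `has_command` / `has_module` probes are hypothetical and exist only to make the logic testable.

```python
import importlib.util
import shutil


def pick_extraction_method(need_tables=False, has_command=None, has_module=None):
    """Return 'pdftotext', 'pymupdf', 'pdfplumber', or None if nothing is available."""
    if has_command is None:
        has_command = lambda name: shutil.which(name) is not None
    if has_module is None:
        has_module = lambda name: importlib.util.find_spec(name) is not None

    # Tables are the one case where pdfplumber is preferred over pdftotext.
    if need_tables and has_module("pdfplumber"):
        return "pdfplumber"
    if has_command("pdftotext"):
        return "pdftotext"
    if has_module("fitz"):  # PyMuPDF is imported as fitz
        return "pymupdf"
    if has_module("pdfplumber"):
        return "pdfplumber"
    return None


print(pick_extraction_method())
```

Whatever method is selected, the last workflow step still applies: check that the output is readable text before using it.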

Installation Notes

If tools are not available:

# Ubuntu/Debian
apt-get install poppler-utils  # pdftotext, pdfinfo

# Install Python libraries
pip install pymupdf pdfplumber

Anti-Patterns to Avoid

Do NOT rely solely on read_file with filetype='pdf' for critical text extraction
Do NOT assume PDF text is in any particular order - verify extracted content
Do NOT use image-based extraction unless the PDF is scanned; for scanned PDFs, use OCR
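
Verifying extracted content can be partly automated with a rough heuristic: output that is nearly empty or mostly non-printable suggests a failed extraction or a scanned PDF with no text layer. The sketch below is one such check; `looks_like_text` is a hypothetical helper, and both thresholds are arbitrary assumptions to tune per use case.

```python
def looks_like_text(extracted, min_chars=20, min_printable_ratio=0.9):
    """Heuristic: True if the string is long enough and mostly printable."""
    stripped = extracted.strip()  # also drops the form feeds between pages
    if len(stripped) < min_chars:
        return False
    printable = sum(
        (ch.isprintable() and ch != "\ufffd") or ch in "\n\t\r\f"
        for ch in stripped
    )
    return printable / len(stripped) >= min_printable_ratio


print(looks_like_text("Invoice number 12345\nTotal due: $40.00"))
print(looks_like_text("\f\f\f"))
```

A check like this does not confirm that the text is in reading order, so a manual spot check of critical fields is still worthwhile.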

Example: Complete Extraction Pattern

# Step 1: Try pdftotext
RESULT=$(pdftotext -layout /path/to/form.pdf - 2>/dev/null)

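# Step 2: Verify we got text, not an error or whitespace-only output
# (pdftotext still emits form feeds for pages with no text layer)
if [ -z "$(printf '%s' "$RESULT" | tr -d '[:space:]')" ]; then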
    # Fallback to Python
    RESULT=$(python3 -c "import fitz; doc=fitz.open('/path/to/form.pdf'); print(''.join(p.get_text() for p in doc))")
fi

# Step 3: Use extracted text
echo "$RESULT"
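
The same try-then-fall-back pattern can be expressed in Python as an ordered chain of extractors. This is a sketch under stated assumptions: `first_usable_text` is a hypothetical helper, and the stand-in extractors below only simulate a missing tool, a whitespace-only result, and a successful extraction. Real extractors would wrap the pdftotext and PyMuPDF calls shown earlier.

```python
def first_usable_text(extractors):
    """Run (name, callable) pairs in order; return the first non-blank result."""
    for name, extract in extractors:
        try:
            text = extract()
        except Exception:
            continue  # tool missing or extraction failed; try the next one
        if text and text.strip():
            return name, text
    return None, ""


def missing_tool():
    raise FileNotFoundError("pdftotext not installed")  # simulated failure


# Stand-in extractors for illustration only.
chain = [
    ("pdftotext", missing_tool),
    ("pymupdf", lambda: "\f\f"),  # simulated scanned PDF: page breaks only
    ("pdfplumber", lambda: "Form 1040\nName: ..."),
]
name, text = first_usable_text(chain)
print(name)
```

Returning the method name alongside the text makes it easier to log which tool produced the result.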