pdf-extractor

Stars: 1

Forks: 0

Updated: May 26, 2026, 06:03

Extracts text and structure from PDF documents with OCR fallback. Use when processing PDF files for downstream RAG or indexing pipelines.

Source

vuthuonghai-steve

vuthuonghai-steve/deep_work_by_steve

PDF Extractor

Persona

Senior Extraction Engineer specializing in PDF parsing with OCR capabilities. Converts PDF documents to clean Markdown while preserving structure.

Mission

Extract text, tables, and semantic structure from PDF documents using pypdf/pdftotext with Tesseract OCR fallback for scanned documents. Outputs clean Markdown.

```yaml
priority_order:
  - security_sandbox
  - content_completeness
  - format_fidelity
```

Workflow

Read data/normalize-rules.yaml
Read knowledge/pdf-processing.md
Execute scripts/pdf-extractor.py input.pdf -o output.md
Validate with loop/validate-output.md
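The Execute step can be combined with the G1_Security guardrail by building the container invocation explicitly. The following is a minimal sketch, not the skill's actual launcher: the image name `pdf-extractor:local`, the mount points `/work/in` and `/work/out`, and the assumption that gVisor is registered with Docker as the `runsc` runtime are all hypothetical.

```python
from pathlib import Path


def build_sandbox_command(pdf_path: str, out_dir: str,
                          image: str = "pdf-extractor:local",  # hypothetical image name
                          use_gvisor: bool = False) -> list[str]:
    """Build a `docker run` argument list with no network and a read-only root."""
    pdf = Path(pdf_path).resolve()
    out = Path(out_dir).resolve()
    cmd = [
        "docker", "run", "--rm",
        "--network", "none",                    # block network egress
        "--read-only",                          # container root filesystem is read-only
        "-v", f"{pdf}:/work/in/input.pdf:ro",   # input mounted read-only
        "-v", f"{out}:/work/out",               # the only writable location
    ]
    if use_gvisor:
        cmd += ["--runtime", "runsc"]           # assumes gVisor is installed and registered
    cmd += [image, "python", "scripts/pdf-extractor.py",
            "/work/in/input.pdf", "-o", "/work/out/output.md"]
    return cmd
```

Returning an argument list rather than a shell string keeps the PDF path from being interpreted by a shell, which matters when file names come from untrusted uploads. The list can be passed directly to `subprocess.run`.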

Guardrails

```yaml
G1_Security:
  must:
    - run in Docker/gVisor sandbox
    - block network egress
  must_not:
    - execute embedded scripts
    - store files outside sandbox

G2_Quality:
  must:
    - preserve heading hierarchy
    - preserve code blocks
    - normalize UTF-8 encoding
  must_not:
    - include binary content

G3_Fallback:
  must:
    - fallback: pdftotext → pypdf → OCR → error+HITL
    - trigger HITL when confidence < 70%
```
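The G3_Fallback chain can be read as an ordered list of extractors, each tried until one clears the confidence threshold. Below is a minimal sketch under stated assumptions: the skill does not say how confidence is computed or whether the 70% threshold applies per stage, so here each stage is a hypothetical callable that returns `(text, confidence)` and the threshold gates every stage. The names `Extractor`, `ExtractionResult`, and `extract_with_fallback` are illustrative only.

```python
from dataclasses import dataclass
from typing import Callable, Optional

# Each extractor returns (text, confidence in [0, 1]) or raises on failure.
Extractor = Callable[[str], tuple[str, float]]

CONFIDENCE_THRESHOLD = 0.70  # from G3_Fallback: trigger HITL below 70%


@dataclass
class ExtractionResult:
    text: str
    method: Optional[str]   # name of the stage that produced `text`
    confidence: float
    needs_human: bool       # True when HITL review should be requested


def extract_with_fallback(path: str,
                          chain: list[tuple[str, Extractor]]) -> ExtractionResult:
    """Try extractors in order; keep the best attempt and flag HITL if needed."""
    best = ExtractionResult(text="", method=None, confidence=0.0, needs_human=True)
    for name, extractor in chain:
        try:
            text, confidence = extractor(path)
        except Exception:
            continue  # this stage failed, move on to the next one
        if confidence >= CONFIDENCE_THRESHOLD and text.strip():
            return ExtractionResult(text, name, confidence, needs_human=False)
        if confidence > best.confidence:
            best = ExtractionResult(text, name, confidence, needs_human=True)
    return best  # nothing cleared the threshold: error + HITL path
```

In a real pipeline the three stages would wrap `pdftotext`, `pypdf`, and Tesseract in that order. Keeping them as plain callables leaves the order configurable and makes the chain easy to test with stubs.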

Output Contract

```yaml
output:
  format: markdown
  encoding: UTF-8
  markers:
    - "[TABLE_AMBIGUOUS]"
    - "[IMAGE_REMOVED]"
    - "[ENCRYPTED]"
    - "[OCR_APPLIED]"
```
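A consumer of the output can check parts of this contract mechanically. The sketch below is illustrative and is not the content of `loop/validate-output.md`, which is not shown here. The function name `check_output` and the use of NUL bytes as a cheap signal for leaked binary content are assumptions.

```python
MARKERS = ("[TABLE_AMBIGUOUS]", "[IMAGE_REMOVED]", "[ENCRYPTED]", "[OCR_APPLIED]")


def check_output(raw: bytes) -> dict:
    """Check UTF-8 validity, look for NUL bytes, and count contract markers."""
    report = {"valid_utf8": True, "has_binary": False, "markers": {}}
    try:
        text = raw.decode("utf-8")
    except UnicodeDecodeError:
        report["valid_utf8"] = False
        # Decode leniently so the remaining checks can still run.
        text = raw.decode("utf-8", errors="replace")
    # NUL bytes are a cheap (not exhaustive) signal of leaked binary content.
    report["has_binary"] = "\x00" in text
    report["markers"] = {m: text.count(m) for m in MARKERS}
    return report
```

Downstream RAG or indexing code could use the marker counts to route documents, for example by sending anything containing `[TABLE_AMBIGUOUS]` to manual table review.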

References

data/normalize-rules.yaml — Normalization config
knowledge/pdf-processing.md — PDF domain knowledge
scripts/pdf-extractor.py — PDF extraction script
loop/validate-output.md — Quality checklist

name	pdf-extractor
description	Extracts text and structure from PDF documents with OCR fallback. Use when processing PDF files for downstream RAG or indexing pipelines.
version	1.0.0
pipeline	{"stage_order":3,"input_contract":[{"type":"file","path":"input.pdf","required":true}],"output_contract":[{"type":"file","path":"output.md","format":"markdown"}]}
progressive_disclosure	{"tier1":[{"path":"SKILL.md","base":"skill_dir"},{"path":"data/normalize-rules.yaml","base":"skill_dir"}],"tier2":[{"path":"knowledge/pdf-processing.md","base":"skill_dir","load_when":"PDF processing phase"},{"path":"scripts/pdf-extractor.py","base":"skill_dir","load_when":"Execution phase"},{"path":"loop/validate-output.md","base":"skill_dir","load_when":"Validation phase"}]}
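The `progressive_disclosure` metadata suggests that tier-1 files are always loaded and tier-2 files are loaded only in a given phase. Here is a minimal sketch of that reading, assuming `load_when` is matched as an exact string. The matching rule and the helper name `files_for_phase` are assumptions, not documented behavior of any skill runtime.

```python
import json
from typing import Optional

# The progressive_disclosure value copied from the metadata above.
PROGRESSIVE_DISCLOSURE = json.loads("""
{"tier1":[{"path":"SKILL.md","base":"skill_dir"},
          {"path":"data/normalize-rules.yaml","base":"skill_dir"}],
 "tier2":[{"path":"knowledge/pdf-processing.md","base":"skill_dir",
           "load_when":"PDF processing phase"},
          {"path":"scripts/pdf-extractor.py","base":"skill_dir",
           "load_when":"Execution phase"},
          {"path":"loop/validate-output.md","base":"skill_dir",
           "load_when":"Validation phase"}]}
""")


def files_for_phase(config: dict, phase: Optional[str] = None) -> list[str]:
    """Return tier-1 paths, plus tier-2 paths whose `load_when` equals `phase`."""
    paths = [entry["path"] for entry in config.get("tier1", [])]
    if phase is not None:
        paths += [entry["path"] for entry in config.get("tier2", [])
                  if entry.get("load_when") == phase]
    return paths
```

Calling `files_for_phase(PROGRESSIVE_DISCLOSURE, "Execution phase")` would add `scripts/pdf-extractor.py` to the two tier-1 files, which keeps the extraction script out of context until it is needed.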
