pdf-extractor
Extracts text and structure from PDF documents with OCR fallback. Use when processing PDF files for downstream RAG or indexing pipelines.
| Field | Value |
| --- | --- |
| name | pdf-extractor |
| description | Extracts text and structure from PDF documents with OCR fallback. Use when processing PDF files for downstream RAG or indexing pipelines. |
| version | 1.0.0 |
| pipeline | {"stage_order":3,"input_contract":[{"type":"file","path":"input.pdf","required":true}],"output_contract":[{"type":"file","path":"output.md","format":"markdown"}]} |
| progressive_disclosure | {"tier1":[{"path":"SKILL.md","base":"skill_dir"},{"path":"data/normalize-rules.yaml","base":"skill_dir"}],"tier2":[{"path":"knowledge/pdf-processing.md","base":"skill_dir","load_when":"PDF processing phase"},{"path":"scripts/pdf-extractor.py","base":"skill_dir","load_when":"Execution phase"},{"path":"loop/validate-output.md","base":"skill_dir","load_when":"Validation phase"}]} |
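The `pipeline` entry above declares one required input file and one output file. As a rough sketch of how a runner might check that contract before and after the stage, the snippet below parses the same JSON and reports missing required inputs. The helper names `missing_inputs` and `expected_outputs` are hypothetical and are not part of the skill.

```python
import json
from pathlib import Path

# The pipeline contract, copied verbatim from the metadata table.
PIPELINE = json.loads(
    '{"stage_order":3,'
    '"input_contract":[{"type":"file","path":"input.pdf","required":true}],'
    '"output_contract":[{"type":"file","path":"output.md","format":"markdown"}]}'
)


def missing_inputs(pipeline: dict, workdir: Path) -> list:
    """Return paths of required input files that are absent from workdir."""
    return [
        entry["path"]
        for entry in pipeline.get("input_contract", [])
        if entry.get("type") == "file"
        and entry.get("required", False)
        and not (workdir / entry["path"]).is_file()
    ]


def expected_outputs(pipeline: dict) -> list:
    """Return the output file paths the stage promises to produce."""
    return [
        entry["path"]
        for entry in pipeline.get("output_contract", [])
        if entry.get("type") == "file"
    ]
```

A runner could refuse to start the stage when `missing_inputs(PIPELINE, Path("."))` is non-empty, and check the paths from `expected_outputs(PIPELINE)` once it finishes.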
Senior Extraction Engineer specializing in PDF parsing with OCR capabilities. Converts PDF documents to clean Markdown while preserving structure.
Extract text, tables, and semantic structure from PDF documents using pypdf/pdftotext with Tesseract OCR fallback for scanned documents. Outputs clean Markdown.
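The fallback behavior can be sketched as a chain of extractors tried in the order given under `G3_Fallback` (pdftotext, then pypdf, then OCR, then error plus human-in-the-loop review). This is a minimal illustration, not the code in `scripts/pdf-extractor.py`. The `Extraction` type, the `extract_with_fallback` function, and the idea that each extractor reports its own 0.0 to 1.0 confidence are assumptions made for the example. It also assumes that a stage succeeds whenever it returns non-empty text, and that confidence only decides whether the result is flagged for review. The extractors here are stubs standing in for the real tools.

```python
from dataclasses import dataclass
from typing import Callable, List, Optional, Tuple

HITL_THRESHOLD = 0.70  # from G3_Fallback: trigger HITL when confidence < 70%


@dataclass
class Extraction:
    text: str
    confidence: float  # hypothetical 0.0-1.0 score supplied by each extractor


Extractor = Callable[[str], Optional[Extraction]]


def extract_with_fallback(pdf_path: str, chain: List[Tuple[str, Extractor]]) -> dict:
    """Try each extractor in order and return the first non-empty result."""
    errors = []
    for name, extractor in chain:
        try:
            result = extractor(pdf_path)
        except Exception as exc:  # a failing stage falls through to the next one
            errors.append(f"{name}: {exc}")
            continue
        if result is None or not result.text.strip():
            errors.append(f"{name}: no text extracted")
            continue
        text = result.text
        if name == "ocr":
            text = "[OCR_APPLIED]\n" + text  # marker from the output spec
        return {
            "status": "ok",
            "method": name,
            "text": text,
            "needs_hitl": result.confidence < HITL_THRESHOLD,
            "errors": errors,
        }
    # Every stage failed: report the error and hand off to a human.
    return {"status": "error", "method": None, "text": "",
            "needs_hitl": True, "errors": errors}


# Stub extractors used only to exercise the chain.
def fake_pdftotext(path: str) -> Optional[Extraction]:
    raise RuntimeError("pdftotext not installed")


def fake_pypdf(path: str) -> Optional[Extraction]:
    return Extraction(text="", confidence=1.0)  # e.g. a scanned PDF with no text layer


def fake_ocr(path: str) -> Optional[Extraction]:
    return Extraction(text="Scanned page", confidence=0.55)


chain = [("pdftotext", fake_pdftotext), ("pypdf", fake_pypdf), ("ocr", fake_ocr)]
outcome = extract_with_fallback("input.pdf", chain)
print(outcome["method"], outcome["needs_hitl"])  # prints "ocr True"
```

Passing the extractors in as callables keeps the ordering and HITL logic testable without the real tools installed. Real wrappers around pdftotext, pypdf, or Tesseract would replace the stubs.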
```yaml
priority_order:
  - security_sandbox
  - content_completeness
  - format_fidelity
```

- `data/normalize-rules.yaml`
- `knowledge/pdf-processing.md`
- `scripts/pdf-extractor.py input.pdf -o output.md`
- `loop/validate-output.md`

```yaml
G1_Security:
  must:
    - run in Docker/gVisor sandbox
    - block network egress
  must_not:
    - execute embedded scripts
    - store files outside sandbox
G2_Quality:
  must:
    - preserve heading hierarchy
    - preserve code blocks
    - normalize UTF-8 encoding
  must_not:
    - include binary content
G3_Fallback:
  must:
    - fallback: pdftotext → pypdf → OCR → error+HITL
    - trigger HITL when confidence < 70%
output:
  format: markdown
  encoding: UTF-8
  markers:
    - "[TABLE_AMBIGUOUS]"
    - "[IMAGE_REMOVED]"
    - "[ENCRYPTED]"
    - "[OCR_APPLIED]"
```
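Several of the `G2_Quality` and `output` rules can be checked mechanically on the generated Markdown. The sketch below is one possible validator, not the contents of `loop/validate-output.md`. The function name `check_output` is hypothetical. It treats "binary content" as control characters other than newline, carriage return, and tab. It approximates "preserve heading hierarchy" by flagging headings that skip a level, since the source PDF's headings are not available here to compare against.

```python
import re
import unicodedata

# Markers listed in the output spec.
MARKERS = ("[TABLE_AMBIGUOUS]", "[IMAGE_REMOVED]", "[ENCRYPTED]", "[OCR_APPLIED]")


def check_output(raw: bytes) -> dict:
    """Run basic quality checks on the bytes of output.md."""
    try:
        text = raw.decode("utf-8")
    except UnicodeDecodeError as exc:
        return {"ok": False, "problems": [f"not valid UTF-8: {exc}"], "markers": {}}

    problems = []

    # Assumption: stray control characters indicate leaked binary content.
    if any(unicodedata.category(ch) == "Cc" and ch not in "\n\r\t" for ch in text):
        problems.append("control characters found (possible binary content)")

    in_fence = False
    prev_level = 0
    for line in text.splitlines():
        if line.lstrip().startswith("```"):
            in_fence = not in_fence  # track fenced code blocks
            continue
        if in_fence:
            continue  # '#' inside code is not a heading
        match = re.match(r"^(#{1,6})\s+\S", line)
        if match:
            level = len(match.group(1))
            if prev_level and level > prev_level + 1:
                problems.append(f"heading jumps from h{prev_level} to h{level}")
            prev_level = level
    if in_fence:
        problems.append("unclosed code fence")

    markers = {marker: text.count(marker) for marker in MARKERS}
    return {"ok": not problems, "problems": problems, "markers": markers}
```

The marker counts give a quick summary of how much of the document needed OCR or lost tables and images, which a reviewer can use when a result is escalated for human review.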