pdf-extractor

Stars: 1

Forks: 0

Updated: May 26, 2026, 06:03

Extracts text and structure from PDF documents with OCR fallback. Use when processing PDF files for downstream RAG or indexing pipelines.

Source

vuthuonghai-steve

vuthuonghai-steve/deep_work_by_steve

PDF Extractor

Persona

Senior Extraction Engineer specializing in PDF parsing with OCR capabilities. Converts PDF documents to clean Markdown while preserving structure.

Mission

Extract text, tables, and semantic structure from PDF documents using pypdf/pdftotext with Tesseract OCR fallback for scanned documents. Outputs clean Markdown.

```yaml
priority_order:
  - security_sandbox
  - content_completeness
  - format_fidelity
```

Workflow

1. Read data/normalize-rules.yaml
2. Read knowledge/pdf-processing.md
3. Execute scripts/pdf-extractor.py input.pdf -o output.md
4. Validate with loop/validate-output.md

Guardrails

G1_Security:
  must:
    - run in Docker/gVisor sandbox
    - block network egress
  must_not:
    - execute embedded scripts
    - store files outside sandbox
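
One way to satisfy G1_Security is to launch the extraction script inside a container with networking disabled and a read-only root filesystem. The sketch below only builds the `docker run` argument list and does not execute it. The image name `pdf-extractor-sandbox:latest`, the `/skill` and `/work` paths inside the container, and the presence of a `python` interpreter in the image are assumptions made for illustration and are not taken from the skill itself.

```python
from pathlib import Path


def build_sandbox_command(
    work_dir: Path,
    image: str = "pdf-extractor-sandbox:latest",  # hypothetical image name
    use_gvisor: bool = True,
) -> list[str]:
    """Build a `docker run` argv that runs the extractor with no network access."""
    cmd = [
        "docker", "run", "--rm",
        "--network", "none",   # block network egress
        "--read-only",         # root filesystem is not writable
        "--tmpfs", "/tmp",     # scratch space that disappears with the container
        "--volume", f"{work_dir.resolve()}:/work",  # the only writable host path
        "--workdir", "/skill",  # assumed location of the skill inside the image
    ]
    if use_gvisor:
        # Requires the gVisor runtime (runsc) to be installed on the Docker host.
        cmd += ["--runtime", "runsc"]
    cmd += [
        image,
        "python", "scripts/pdf-extractor.py",
        "/work/input.pdf", "-o", "/work/output.md",
    ]
    return cmd


if __name__ == "__main__":
    print(" ".join(build_sandbox_command(Path("./job"))))
```

Because only `work_dir` is mounted, the input and the generated Markdown stay in a single host directory, which is one reading of the "store files outside sandbox" prohibition.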

G2_Quality:
  must:
    - preserve heading hierarchy
    - preserve code blocks
    - normalize UTF-8 encoding
  must_not:
    - include binary content
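
The encoding and binary-content rules in G2_Quality could be enforced with a small post-processing step. The following is a minimal sketch under stated assumptions: it treats NFC as the target normalization form and treats control characters and undecodable bytes as "binary content". The actual rules live in data/normalize-rules.yaml, whose contents are not shown here, and the function name `normalize_text` is hypothetical.

```python
import unicodedata


def normalize_text(raw: bytes) -> str:
    """Decode extracted bytes as UTF-8 and strip binary-looking residue."""
    # Undecodable bytes become U+FFFD so they can be removed below.
    text = raw.decode("utf-8", errors="replace")
    # Assumption: NFC is the desired canonical form.
    text = unicodedata.normalize("NFC", text)
    # Use Unix line endings throughout.
    text = text.replace("\r\n", "\n").replace("\r", "\n")
    # Keep newlines and tabs; drop other control characters and U+FFFD.
    return "".join(
        ch
        for ch in text
        if ch in "\n\t" or (unicodedata.category(ch) != "Cc" and ch != "\ufffd")
    )


if __name__ == "__main__":
    sample = b"Cafe\xcc\x81\x00\r\nok\xff"
    print(repr(normalize_text(sample)))
```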

G3_Fallback:
  must:
    - fallback: pdftotext → pypdf → OCR → error+HITL
    - trigger HITL when confidence < 70%
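
The fallback chain in G3_Fallback can be sketched as an ordered list of extractor callables. This is an illustration, not the logic of scripts/pdf-extractor.py: the names `Extraction`, `NeedsHumanReview`, and `extract_with_fallback` are hypothetical, each extractor is assumed to return a `(text, confidence)` pair with confidence between 0.0 and 1.0, and a stage that raises or returns empty text is assumed to hand over to the next stage. Real extractors would wrap pdftotext, pypdf, and Tesseract; stubs are used here so that no library API is presumed.

```python
from dataclasses import dataclass
from typing import Callable, Sequence

# An extractor takes a PDF path and returns (text, confidence in 0.0..1.0).
Extractor = Callable[[str], tuple[str, float]]


@dataclass
class Extraction:
    text: str
    method: str
    confidence: float


class NeedsHumanReview(Exception):
    """Raised when the result should be escalated to a human (HITL)."""


def extract_with_fallback(
    pdf_path: str,
    extractors: Sequence[tuple[str, Extractor]],
    min_confidence: float = 0.70,
) -> Extraction:
    for name, extractor in extractors:
        try:
            text, confidence = extractor(pdf_path)
        except Exception:
            continue  # this stage failed; try the next one
        if not text.strip():
            continue  # nothing extracted (e.g. a scanned page); try the next one
        if confidence < min_confidence:
            raise NeedsHumanReview(
                f"{name} confidence {confidence:.0%} is below {min_confidence:.0%}"
            )
        return Extraction(text=text, method=name, confidence=confidence)
    raise NeedsHumanReview("all extraction stages failed")


if __name__ == "__main__":
    # Stub extractors standing in for pdftotext, pypdf, and OCR.
    stages = [
        ("pdftotext", lambda path: ("", 1.0)),
        ("pypdf", lambda path: ("", 1.0)),
        ("ocr", lambda path: ("Scanned page text", 0.91)),
    ]
    print(extract_with_fallback("input.pdf", stages))
```

Whether a low-confidence result should escalate immediately or first try the remaining stages is a design choice the guardrail does not settle; the sketch escalates immediately.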

Output Contract

output:
  format: markdown
  encoding: UTF-8
  markers:
    - "[TABLE_AMBIGUOUS]"
    - "[IMAGE_REMOVED]"
    - "[ENCRYPTED]"
    - "[OCR_APPLIED]"
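
A downstream consumer may want to know which of these markers appear in a generated file. The sketch below counts the four markers from the contract and lists any other bracketed upper-case token. Treating such tokens as "unknown markers" is an assumption made for this example, since the contract does not say other tokens are forbidden, and the function name `audit_markers` is hypothetical.

```python
import re
from collections import Counter

# The four markers listed in the output contract.
ALLOWED_MARKERS = {
    "[TABLE_AMBIGUOUS]",
    "[IMAGE_REMOVED]",
    "[ENCRYPTED]",
    "[OCR_APPLIED]",
}

# Bracketed tokens made of upper-case letters and underscores.
MARKER_PATTERN = re.compile(r"\[[A-Z][A-Z_]*\]")


def audit_markers(markdown: str) -> tuple[Counter, list[str]]:
    """Return (counts of contract markers, sorted list of unknown markers)."""
    found = MARKER_PATTERN.findall(markdown)
    counts = Counter(m for m in found if m in ALLOWED_MARKERS)
    unknown = sorted({m for m in found if m not in ALLOWED_MARKERS})
    return counts, unknown


if __name__ == "__main__":
    sample = "Intro\n[OCR_APPLIED]\n[IMAGE_REMOVED] text [IMAGE_REMOVED]\n[FOO_BAR]"
    counts, unknown = audit_markers(sample)
    print(dict(counts), unknown)
```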

References

data/normalize-rules.yaml — Normalization config
knowledge/pdf-processing.md — PDF domain knowledge
scripts/pdf-extractor.py — PDF extraction script
loop/validate-output.md — Quality checklist

name	pdf-extractor
description	Extracts text and structure from PDF documents with OCR fallback. Use when processing PDF files for downstream RAG or indexing pipelines.
version	1.0.0
pipeline	{"stage_order":3,"input_contract":[{"type":"file","path":"input.pdf","required":true}],"output_contract":[{"type":"file","path":"output.md","format":"markdown"}]}
progressive_disclosure	{"tier1":[{"path":"SKILL.md","base":"skill_dir"},{"path":"data/normalize-rules.yaml","base":"skill_dir"}],"tier2":[{"path":"knowledge/pdf-processing.md","base":"skill_dir","load_when":"PDF processing phase"},{"path":"scripts/pdf-extractor.py","base":"skill_dir","load_when":"Execution phase"},{"path":"loop/validate-output.md","base":"skill_dir","load_when":"Validation phase"}]}
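
The `pipeline` value above is plain JSON, so a caller could check the input contract before starting a run. This is a minimal sketch: the helper `missing_inputs` is hypothetical, and resolving each `path` relative to a working directory is an assumption about how the contract is meant to be read.

```python
import json
from pathlib import Path

# The pipeline value from the skill metadata, copied verbatim.
PIPELINE = json.loads(
    '{"stage_order":3,'
    '"input_contract":[{"type":"file","path":"input.pdf","required":true}],'
    '"output_contract":[{"type":"file","path":"output.md","format":"markdown"}]}'
)


def missing_inputs(pipeline: dict, base_dir: Path) -> list[str]:
    """List required input files that do not exist under base_dir."""
    return [
        item["path"]
        for item in pipeline.get("input_contract", [])
        if item.get("required") and not (base_dir / item["path"]).is_file()
    ]


if __name__ == "__main__":
    missing = missing_inputs(PIPELINE, Path("."))
    if missing:
        print("Missing required inputs:", ", ".join(missing))
    else:
        print("All required inputs are present.")
```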
