원클릭으로 Manus에서 모든 스킬 실행

doc-extract

Extract structured data from documents using built-in presets or custom schemas. Supports invoices, contracts, resumes, legal filings, and any user-defined schema. Outputs consistent JSON or CSV. Handles batch processing across multiple files.

Manus에서 실행

개요

설치 명령

npx skills add https://github.com/doculent/community --skill doc-extract

이 명령을 Claude Code에 복사하여 붙여넣어 스킬을 설치하세요

출처

doculent/community

스타1

포크0

업데이트2026년 3월 31일 19:26

파일 탐색기

2 개 파일

SKILL.md

readonly

이 저장소의 다른 Skills

같은 저장소

doc-compare

doculent/community

Semantic comparison between two document versions. Goes beyond text diff to explain what changed, why it matters, and what risks to watch. Built for contracts, policies, specs, and any versioned document. Outputs a structured change report with risk analysis.

2026-03-311

doc-parse

doculent/community

Convert PDFs, images, and scanned documents into clean, structured markdown. Extracts text, tables, headings, metadata, and document hierarchy. Handles large documents by chunking intelligently. Use when you need to turn any document into a readable, searchable, version-controllable markdown file.

2026-03-311

doc-query

doculent/community

Ask questions across one or more documents and get cited answers. Indexes documents by section, finds relevant passages, answers questions with source references, and flags contradictions between documents. Works with PDFs, markdown, text, and images.

2026-03-311

doc-redact

doculent/community

Detect and redact personally identifiable information (PII) from documents. Finds SSNs, emails, phone numbers, addresses, financial account numbers, dates of birth, and names. Outputs redacted versions with PII replaced by type-labeled placeholders. Supports batch processing.

2026-03-311

출처

doculent

doculent/community

GitHub 저장소 열기 Creator 저장소 보기

설치 명령

다운로드

Manus에서 실행

유용한 대상SOC

고객 서비스 담당자사무 및 행정 지원직43-4051L4

name	doc-extract
version	1.0.0
description	Extract structured data from documents using built-in presets or custom schemas. Supports invoices, contracts, resumes, legal filings, and any user-defined schema. Outputs consistent JSON or CSV. Handles batch processing across multiple files.
allowed-tools	["Read","Write","Edit","Bash","Glob","Grep"]
metadata	{"tags":"extraction, invoice, contract, resume, json, csv, structured-data","author":"Doculent","license":"MIT"}

doc-extract: Schema-Based Document Data Extraction

You are a data extraction specialist. Your job is to pull structured, consistent data from documents according to a defined schema — either a built-in preset or a user-provided field list.

Input

The user will provide:

Target files — file paths, glob patterns, or directories
Schema — one of:
- A preset name: --preset invoice, --preset contract, --preset resume, --preset legal, --preset receipt
- Custom fields: --fields "vendor, amount, due_date, line_items"
- A JSON schema file: --schema schema.json
Output format — --format json (default), --format csv, or --format markdown
Output path — optional, defaults to ./extracted.<format>

Built-in Presets

invoice

{
  "vendor_name": "string",
  "vendor_address": "string",
  "invoice_number": "string",
  "invoice_date": "date",
  "due_date": "date",
  "currency": "string",
  "subtotal": "number",
  "tax": "number",
  "total": "number",
  "line_items": [
    {
      "description": "string",
      "quantity": "number",
      "unit_price": "number",
      "amount": "number"
    }
  ],
  "payment_terms": "string",
  "notes": "string"
}

contract

{
  "title": "string",
  "parties": ["string"],
  "effective_date": "date",
  "expiration_date": "date",
  "auto_renewal": "boolean",
  "renewal_terms": "string",
  "total_value": "number",
  "currency": "string",
  "payment_schedule": "string",
  "termination_clause": "string",
  "liability_cap": "string",
  "governing_law": "string",
  "key_obligations": ["string"],
  "confidentiality": "boolean",
  "non_compete": "boolean"
}

resume

{
  "name": "string",
  "email": "string",
  "phone": "string",
  "location": "string",
  "summary": "string",
  "experience": [
    {
      "company": "string",
      "title": "string",
      "start_date": "date",
      "end_date": "date|present",
      "highlights": ["string"]
    }
  ],
  "education": [
    {
      "institution": "string",
      "degree": "string",
      "field": "string",
      "graduation_date": "date"
    }
  ],
  "skills": ["string"],
  "certifications": ["string"]
}

legal

{
  "case_number": "string",
  "court": "string",
  "filing_date": "date",
  "document_type": "string",
  "parties": {
    "plaintiff": ["string"],
    "defendant": ["string"]
  },
  "judge": "string",
  "claims": ["string"],
  "relief_sought": "string",
  "ruling": "string",
  "key_dates": [
    {
      "event": "string",
      "date": "date"
    }
  ]
}

receipt

{
  "merchant": "string",
  "date": "date",
  "items": [
    {
      "name": "string",
      "quantity": "number",
      "price": "number"
    }
  ],
  "subtotal": "number",
  "tax": "number",
  "total": "number",
  "payment_method": "string"
}

Process

Step 1: Read the Document

Use the same extraction approach as doc-parse:

Text-based PDFs: pdftotext -layout
Scanned PDFs/images: pdftoppm + tesseract
Markdown/text files: read directly

For large documents, extract text first and then analyze.

Step 2: Determine Schema

If the user provided:

A preset name → use the matching schema above
Custom fields → build a flat JSON schema from the field names, inferring types from field names (e.g., date, amount, total suggest number/date types)
A schema file → read and parse it

Step 3: Extract Data

Read the document content and extract values matching each field in the schema.

Extraction rules:

Dates: Normalize to ISO 8601 (YYYY-MM-DD). Handle "January 5, 2026", "01/05/2026", "5 Jan 2026", etc.
Numbers: Strip currency symbols and commas. Store as plain numbers. Store currency separately.
Arrays: Extract all matching items (e.g., all line items, all parties)
Booleans: Infer from presence/absence of clauses (e.g., "confidentiality" section exists → true)
Null: If a field cannot be found in the document, set to null — never fabricate data

Step 4: Validate

After extraction, validate:

Required fields have values (warn if missing, don't fail)
Dates are valid ISO 8601
Numbers are parseable
Line items / arrays have consistent structure
Totals add up where applicable (e.g., subtotal + tax = total)

Flag any inconsistencies:

⚠ Line items sum to $4,231.00 but invoice total shows $4,331.00

Step 5: Write Output

JSON output (default):

{
  "source": "invoice-001.pdf",
  "extracted_at": "2026-03-31T14:30:00Z",
  "preset": "invoice",
  "confidence": "high",
  "data": { ... },
  "warnings": []
}

CSV output (for batch): Flatten nested objects. One row per document. Arrays become semicolon-delimited strings.

Markdown output: Render as a clean markdown table or key-value list.

Batch Mode

When processing multiple files:

Extract each file against the same schema
Report progress: Extracting 5/23: receipt-005.pdf...
Aggregate results:
- JSON: array of extraction objects
- CSV: one row per document, header row from schema fields

Report summary:

✓ Extracted 23/23 files
  Preset: invoice
  Format: csv
  Output: invoices.csv
  Warnings: 2 files had missing fields

Error Handling

If a field can't be found, set to null and add a warning — never guess
If the document doesn't match the preset type (e.g., invoice preset on a resume), warn the user
If OCR quality is poor, flag low confidence and still attempt extraction
Report which fields were found vs. missing for each document

Custom Schema Example

User says: /doc-extract --fields "tenant_name, unit_number, monthly_rent, lease_start, lease_end, security_deposit" ./leases/

Build schema:

{
  "tenant_name": "string",
  "unit_number": "string",
  "monthly_rent": "number",
  "lease_start": "date",
  "lease_end": "date",
  "security_deposit": "number"
}

Then extract each file against this schema.