heavy-file-ingestion

Use when a user asks to read, analyze, summarize, or extract from a heavyweight file (PDF, DOCX, PPTX, XLSX, CSV, TSV). Converts the file into markdown or CSV first, generates a lightweight index, and only spends model tokens on the compressed artifact. Trigger on "read this PDF", "look through this spreadsheet", "summarize this deck", or any time raw file ingestion would waste tokens. Do NOT trigger for plain text, markdown, JSON, YAML, or source code files — those are cheap to read directly.

Install command

npx skills add https://github.com/dtsong/skill-governance --skill heavy-file-ingestion

Copy and paste this command into Claude Code to install the skill.

Source: dtsong/skill-governance

Stars: 0 · Forks: 0

Updated: April 5, 2026, 20:42

SKILL.md

name: heavy-file-ingestion

Heavy File Ingestion

Problem

Agents waste context and money reading heavyweight files raw. Convert first, reason second.

Trigger Conditions

User asks to read or analyze a PDF, slide deck, spreadsheet, or word-processing file
File is large, structured, or binary enough that raw ingestion is a bad trade
User wants a markdown working copy, CSV extraction, or a quick map before analysis
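
As a rough illustration of the trigger rule, the check below classifies a path by its extension. The extension set comes from the skill description; the function name `should_trigger` is hypothetical and is not part of the skill itself.

```python
from pathlib import Path

# Extensions the skill description names as heavyweight.
HEAVY_EXTENSIONS = {".pdf", ".docx", ".pptx", ".xlsx", ".csv", ".tsv"}


def should_trigger(path: str) -> bool:
    """Return True when the file type is one the skill says to convert first."""
    # Hypothetical helper: only the extension is inspected, not the content.
    return Path(path).suffix.lower() in HEAVY_EXTENSIONS
```

Plain text, markdown, JSON, YAML, and source files fall through to `False`, which matches the "do NOT trigger" clause in the description.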

Core Policy

1. Convert before reading. Do not dump raw heavyweight files into model context if a deterministic converter can create a cheaper artifact.
2. Index before reasoning. Read the generated index.md first: it tells you what is in the file, how clean the extraction was, and whether escalation is justified.
3. Match converter to file type.
   - PDFs and documents → markdown artifact
   - Presentations → markdown slide outline
   - Spreadsheets → CSV per sheet plus markdown manifest
4. Escalate by cost tier, not instinct.
   - Tier 1: deterministic converter + index
   - Tier 2: cheap model on extracted artifact only if quality flags say the deterministic pass lost structure
   - Tier 3: expensive model only after the file has been compressed into markdown, CSV, or a sampled subset
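
The file-type matching rule above can be sketched as a simple lookup. The mapping of extensions to categories (`.pdf` and `.docx` as documents, `.pptx` as presentations, `.xlsx` as spreadsheets) is an assumption drawn from the formats listed in the description, and `artifact_for` is a hypothetical name.

```python
from pathlib import Path
from typing import Optional

# Assumed extension-to-artifact mapping, based on the policy wording.
ARTIFACT_BY_EXTENSION = {
    ".pdf": "markdown artifact",
    ".docx": "markdown artifact",
    ".pptx": "markdown slide outline",
    ".xlsx": "CSV per sheet plus markdown manifest",
}


def artifact_for(path: str) -> Optional[str]:
    """Return the artifact type the policy names for this file, or None."""
    # CSV and TSV are not covered by the policy list, so they return None here.
    return ARTIFACT_BY_EXTENSION.get(Path(path).suffix.lower())
```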

Process

Step 1: Identify file path, extension, and rough size.
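
A minimal sketch of Step 1, using only the standard library. The helper name `describe_file` and the shape of the returned dictionary are illustrative assumptions; the skill does not prescribe how this information is gathered.

```python
from pathlib import Path


def describe_file(path: str) -> dict:
    """Collect the absolute path, lowercased extension, and size of a file."""
    p = Path(path).expanduser().resolve()
    size_bytes = p.stat().st_size  # raises FileNotFoundError if the file is missing
    return {
        "path": str(p),
        "extension": p.suffix.lower(),
        "size_bytes": size_bytes,
        "size_mb": round(size_bytes / (1024 * 1024), 2),
    }
```

Resolving to an absolute path up front is convenient because the converter command takes an absolute path as its argument.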

Step 2: Run the converter — do not read the original file directly.

```bash
uv run \
  --with pdfplumber \
  --with python-docx \
  --with python-pptx \
  --with openpyxl \
  python scripts/convert_heavy_file.py /absolute/path/to/file.ext
```

If markitdown is installed and preferred for PDF/DOCX, add --prefer markitdown.
Output lands in <source>.ob1/ by default.
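
If the converter is driven from Python rather than a shell, the same invocation might be assembled as below. This is a sketch under assumptions: the helper names are hypothetical, and it assumes that `--prefer` and `--output-dir` can follow the positional path and that `--output-dir` takes a directory argument, which the skill text does not spell out.

```python
import subprocess
from typing import Optional

# Dependencies listed in the documented uv command.
DEPENDENCIES = ["pdfplumber", "python-docx", "python-pptx", "openpyxl"]


def build_convert_command(
    source: str,
    prefer_markitdown: bool = False,
    output_dir: Optional[str] = None,
) -> list[str]:
    """Build the argument list for the documented `uv run` invocation."""
    cmd = ["uv", "run"]
    for dep in DEPENDENCIES:
        cmd += ["--with", dep]
    cmd += ["python", "scripts/convert_heavy_file.py", source]
    if prefer_markitdown:
        cmd += ["--prefer", "markitdown"]  # flag placement is an assumption
    if output_dir is not None:
        cmd += ["--output-dir", output_dir]  # argument form is an assumption
    return cmd


def run_converter(source: str, **options) -> None:
    """Run the converter and raise CalledProcessError if it exits non-zero."""
    subprocess.run(build_convert_command(source, **options), check=True)
```

Passing the command as a list avoids shell quoting problems when the source path contains spaces.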

Step 3: Read the generated index.md before any converted artifact.

Step 4: Use the index to decide the cheapest next action:

read_extracted_artifact → read the markdown or CSV and continue
install_dependency_and_retry → install the missing dependency and rerun
cheap_model_or_stronger_converter → retry with a better converter or use a cheap model on the extracted artifact only
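
The three next actions can be handled with a small lookup, sketched below. The layout of index.md is not documented here, so this example assumes the action name appears literally somewhere in the index text; `find_next_action` is a hypothetical helper, not part of the converter.

```python
from typing import Optional

# Action names and meanings as listed in Step 4.
NEXT_ACTIONS = {
    "read_extracted_artifact": "Read the markdown or CSV and continue.",
    "install_dependency_and_retry": "Install the missing dependency and rerun.",
    "cheap_model_or_stronger_converter": (
        "Retry with a better converter or use a cheap model "
        "on the extracted artifact only."
    ),
}


def find_next_action(index_text: str) -> Optional[str]:
    """Return the first known action name found in the index text, if any."""
    # Assumption: the index names the recommended action verbatim.
    for action in NEXT_ACTIONS:
        if action in index_text:
            return action
    return None
```

Returning `None` for an unrecognized index leaves the decision to the caller rather than guessing at an escalation.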

Step 5: Only escalate to a stronger model after the file has already been compressed.

Client Rules

Prefer deterministic scripts over model-based conversion.
Save converted artifacts next to the source file and work from those.
For spreadsheets, use the generated per-sheet CSV files instead of reasoning over workbook internals.
For PDFs, treat scan detection and low-density warnings as routing signals, not reasons to read the original raw.

Gotchas

The converter needs Python 3.10+ and the listed dependencies. If uv is unavailable, fall back to pip install for those dependencies, then run the script directly with python.
Scanned PDFs produce near-empty markdown — the scanned_pdf_suspected quality flag tells you this happened. Do not re-read the original; escalate to Docling or a vision model on the extracted artifact.
Large XLSX files with many sheets produce one CSV per sheet. Read the workbook.md manifest first to decide which sheets matter.
The --prefer markitdown flag only affects PDF and DOCX. Spreadsheets and presentations always use native extractors.
Conversion artifacts are written next to the source file. If the source is read-only or on a network mount, pass --output-dir to redirect.
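
One way to anticipate the read-only case in the last gotcha is to test whether the source's directory is writable before running the converter. This is only a sketch: `pick_output_dir` is a hypothetical helper, and `os.access` can be unreliable on some network mounts, so a failed conversion should still be handled.

```python
import os
from pathlib import Path
from typing import Optional


def pick_output_dir(source: str, fallback_dir: str) -> Optional[str]:
    """Return None to keep the default location, or a directory for --output-dir."""
    parent = Path(source).resolve().parent
    if os.access(parent, os.W_OK):
        return None  # default location next to the source should work
    return fallback_dir  # redirect artifacts somewhere writable
```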

References

| File | Load When | Contains |
| --- | --- | --- |
| `references/open-source-stack.md` | Choosing a fallback converter or explaining tool choices | Rationale for MarkItDown, Docling, pdfplumber, openpyxl selection |
