| name | extract |
| type | skill |
| category | instruction |
| description | General extraction/ingestion skill that routes to specific workflows based on input type. Extracts structured information from documents, emails, reviews, feedback, and other sources. |
| triggers | ["extract information","extract from document","ingest","extract decisions","extract training data","convert document","DOCX/PDF/XLSX conversion","/convert-to-md"] |
| modifies_files | true |
| needs_task | true |
| mode | execution |
| domain | ["operations"] |
| allowed-tools | Read,Grep,Glob,Edit,Write,Skill,Bash |
| version | 1.0.0 |
| permalink | skills-extract |
Extraction & Ingestion Skill
Taxonomy note: This skill provides domain expertise (HOW) for extracting structured information from documents and sources. See [[aops-core/skills/remember/references/TAXONOMY.md]] for the skill/workflow distinction.
General-purpose extraction skill that intelligently routes to specialized workflows based on input type. Extracts structured information from various sources and stores it appropriately (public framework vs. private data).
Framework Context
Universal axioms apply (enforced by rbg). P#52 (Read-Then-Write Memory) is especially relevant.
Search Before Creating
MANDATORY before creating any new extracted content: search PKB for existing knowledge on the same subject.
mcp__pkb__search(query="[topic/person/document subject]")
- If a match exists → augment it rather than creating a duplicate
- If people are mentioned → retrieve existing relationship context to enrich extraction
- If no match → proceed with creation
This prevents duplicate memories and grounds extraction in accumulated knowledge. See [[remember]] skill's "search first" step as the model.
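The search-first rule can be sketched as a small decision helper. This is a minimal illustration, not the framework's implementation: `plan_write` is a hypothetical name, `search` stands in for whatever callable wraps `mcp__pkb__search`, and the result shape (a list of dicts with an `id` key) is an assumption.

```python
from typing import Callable, Dict, List


def plan_write(topic: str, search: Callable[[str], List[Dict]]) -> Dict:
    """Decide whether to augment an existing PKB entry or create a new one.

    `search` is a hypothetical wrapper around mcp__pkb__search; it is assumed
    to return a list of matching entries, each a dict with an "id" key.
    """
    matches = search(topic)
    if matches:
        # A match exists: augment it rather than creating a duplicate.
        return {"action": "augment", "target": matches[0]["id"]}
    # No match: proceed with creation.
    return {"action": "create", "target": None}
```

Passing the search function in as an argument keeps the decision logic testable without a live PKB connection.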
Purpose
Provide a unified entry point for all extraction tasks:
- Training data extraction from feedback documents
- Archive information extraction (emails, correspondence)
- Decision extraction from task queues
- Knowledge extraction from documents
- Review pair extraction for LLM training
Workflow Routing
When invoked, analyze the input and route to the appropriate workflow:
1. Training Data Extraction
Signals:
- Input contains review feedback + source document
- User mentions "training", "extract patterns", "learning", "dataset"
- Documents contain tracked changes, comments, or annotations
- Goal is to build LLM training data
Route to: workflows/training-data.md
Storage:
- Sensitive data (actual review content, source documents): $ACA_DATA/processed/review_training/
- Generalized patterns (depersonalized principles): Framework docs or peer-review skill
2. Archive Information Extraction
Signals:
- Input is email archive, correspondence, receipts
- User mentions "archive", "preserve", "remember"
- Goal is to capture significant events/relationships
- Source is historical documents
Route to: Archive extraction logic (selective extraction, use /remember for storage)
Storage: Use Skill(skill="remember") for PKB storage
3. Decision Extraction
Signals:
- User mentions "decisions", "pending", "blocking"
- Goal is to surface approval/choice items
- Source is task queue
Route to: Existing aops-core/skills/decision-extract/SKILL.md
Storage: Daily note with decision formatting
4. Document Knowledge Extraction
Signals:
- Single document needs key information extracted
- User mentions "extract", "parse", "ingest"
- Not training data, not archive, not decisions
- Goal is structured information retrieval
Route to: workflows/document-knowledge.md (to be created)
Storage: Depends on content - PKB or framework docs
5. Document to Markdown Conversion
Signals:
- Input is a document file (DOCX, PDF, XLSX, TXT, PPTX, MSG, DOC, DOTX)
- User mentions "convert", "convert to markdown", "docx to markdown", "pdf to markdown"
- Invoked as /convert-to-md (alias preserved for backwards compatibility)
- Goal is format conversion, not structured extraction
Route to: workflows/docs-to-md.md
Storage: Converted .md files replace originals in the same directory
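The routing rules above could be approximated by a keyword matcher like the following. This is a rough sketch under stated assumptions: the real skill analyzes the input itself (file types, tracked changes, source) as well as the wording of the request, whereas this only looks at the request text. The function name `route`, the check order, and the `ask-user` fallback label are illustrative.

```python
# Keyword signals per workflow, checked in order. More specific workflows come
# first so that the generic "extract" keyword only matches as a last resort.
SIGNALS = [
    ("docs-to-md", ("convert", "to markdown")),
    ("training-data", ("training", "extract patterns", "learning", "dataset")),
    ("archive", ("archive", "preserve", "remember")),
    ("decision-extract", ("decisions", "pending", "blocking")),
    ("document-knowledge", ("extract", "parse", "ingest")),
]


def route(request: str) -> str:
    """Return the workflow whose keyword signals first match the request."""
    text = request.lower()
    for workflow, keywords in SIGNALS:
        if any(keyword in text for keyword in keywords):
            return workflow
    # No signal matched: the input type is unclear, so ask the user.
    return "ask-user"
```

For example, `route("extract training data from this review")` selects the training-data workflow because its signals are checked before the generic document-knowledge keywords.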
Workflow: Training Data Extraction
Input Types
Type A: Review with Inline Comments (DOCX with tracked changes)
- Example: Peer review with inline comments and suggestions
- Procedure: procedures/review-inline-comments.md
- Output: Training pairs + generalized principles
Type B: Separate Review + Source Documents
- Example: review.txt + source.pdf + metadata.json
- Workflow: Review-training extraction (match feedback to source evidence)
- Output: Training pairs matching feedback to source evidence
Type C: Revision History
- Example: Git history, Google Docs revision history, tracked changes
- Workflow: workflows/revision-history.md (to be created)
- Output: Before/after pairs with change rationales
Extraction Process
See procedures/review-inline-comments.md for detailed procedure.
Quick summary:
- Convert to workable format (preserve markup)
- Extract feedback units (text + comment pairs)
- Categorize feedback (type, scope, action)
- Identify patterns (group similar feedback)
- Generalize principles (abstract to transferable form)
- Separate sensitive/public (raw data vs. patterns)
- Store appropriately (sensitive → $ACA_DATA, patterns → framework)
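The middle steps (extract feedback units, categorize, identify patterns) could be modeled as below. This is a sketch only: the `FeedbackUnit` class, its field names, and the example labels are assumptions, and the detailed procedure remains the one in procedures/review-inline-comments.md.

```python
from collections import defaultdict
from dataclasses import dataclass
from typing import Dict, List, Tuple


@dataclass
class FeedbackUnit:
    text: str     # passage from the source document
    comment: str  # reviewer comment attached to that passage
    kind: str     # feedback type, e.g. "clarity" (labels are illustrative)
    scope: str    # e.g. "sentence", "section"
    action: str   # e.g. "rewrite", "add-citation"


def group_by_pattern(units: List[FeedbackUnit]) -> Dict[Tuple[str, str], List[FeedbackUnit]]:
    """Group similar feedback so recurring patterns become visible."""
    groups = defaultdict(list)
    for unit in units:
        groups[(unit.kind, unit.action)].append(unit)
    return dict(groups)
```

Groups with several members are candidates for generalization into a transferable principle; singletons may be too specific to generalize.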
Storage Rules
CRITICAL: Training data often contains sensitive information (author names, unpublished work, specific critiques).
Sensitive data → $ACA_DATA/processed/review_training/{collection_name}/:
- extracted_examples.json (full text/feedback pairs)
- training_pairs.jsonl (machine-readable format)
- collection_summary.md (with identifying information)
- Source documents (if retained)
Generalized patterns → Framework (public repo):
- aops-core/skills/hydrator/workflows/peer-review.md (update with principles)
- aops-core/skills/*/references/ (depersonalized examples)
- No names, no specific unpublished content, no identifying details
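As an illustration of the sensitive-data side of these rules, the sketch below resolves the collection directory from the `ACA_DATA` environment variable and writes `training_pairs.jsonl` there. The helper names and the shape of each pair are assumptions; raising an error when `ACA_DATA` is unset is one possible design choice, made here so that sensitive data is never written to a guessed location.

```python
import json
import os
from pathlib import Path
from typing import Dict, Iterable


def collection_dir(collection_name: str) -> Path:
    """Return $ACA_DATA/processed/review_training/{collection_name}/."""
    root = os.environ["ACA_DATA"]  # KeyError if unset: fail rather than guess
    return Path(root) / "processed" / "review_training" / collection_name


def write_training_pairs(collection_name: str, pairs: Iterable[Dict]) -> Path:
    """Write one JSON object per line to training_pairs.jsonl."""
    target = collection_dir(collection_name)
    target.mkdir(parents=True, exist_ok=True)
    out = target / "training_pairs.jsonl"
    with out.open("w", encoding="utf-8") as handle:
        for pair in pairs:
            handle.write(json.dumps(pair, ensure_ascii=False) + "\n")
    return out
```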
Quality Standards
High-quality extraction:
- Clear connection between feedback and source
- Sufficient context for learning
- Well-categorized with teaching points
- Patterns are transferable
Generalization quality:
- Principles are specific enough to apply
- Principles are general enough to transfer
- Examples span different contexts
- Limitations are documented
Workflow: Archive Information Extraction
Apply selective extraction logic.
Key principle: Most archival documents have NO long-term value. Be highly selective.
Extract: Concrete outcomes, significant relationships, financial records
Skip: Newsletters, invitations, administrative routine, mass communications
Storage: Use Skill(skill="remember") with proper tags and canonical identifiers.
Workflow: Decision Extraction
Delegate to aops-core/skills/decision-extract/SKILL.md.
Key principle: Extract tasks requiring approval/choice that are blocking other work.
Storage: Daily note with formatted decision list for batch processing.
Sensitive Data Handling
What is Sensitive?
- Author names and identifying information
- Unpublished work content
- Specific critiques of individuals' work
- Email content and correspondence
- Personal information
- Institutional confidential information
Storage Location: $ACA_DATA/processed/
Directory structure:
$ACA_DATA/processed/
├── review_training/
│ ├── {collection_name}/
│ │ ├── extracted_examples.json
│ │ ├── training_pairs.jsonl
│ │ ├── collection_summary.md
│ │ └── source_documents/
│ └── ...
├── email_archive/
│ └── ...
└── ...
Access: This directory:
- Is outside the public academicOps repo
- Is in the personal data directory ($ACA_DATA = /home/nic/brain)
- Should be backed up separately
- Is not committed to git
Depersonalization for Public Framework
When adding examples to public framework docs:
- Remove all names (use "Author", "Reviewer", or generic placeholders)
- Remove specific work titles (use generic descriptions)
- Remove institutional affiliations
- Generalize to principle, not specific instance
- Use constructed examples if real ones can't be depersonalized
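The name-removal step might be partly mechanized as follows. This is a minimal sketch, assuming the caller already knows which strings are identifying; the `depersonalize` helper is hypothetical, matches plain substrings case-insensitively, and does not replace a manual read-through before anything reaches the public repo.

```python
import re
from typing import Dict


def depersonalize(text: str, replacements: Dict[str, str]) -> str:
    """Replace each known identifying string with a generic placeholder."""
    # Longest strings first, so a full name is replaced before its first name.
    for original in sorted(replacements, key=len, reverse=True):
        placeholder = replacements[original]
        pattern = re.compile(re.escape(original), re.IGNORECASE)
        # A function replacement avoids backslash interpretation in placeholders.
        text = pattern.sub(lambda _match: placeholder, text)
    return text
```

With a constructed example, `depersonalize("Jane Doe argues that Jane's framing is weak", {"Jane Doe": "Author", "Jane": "Author"})` yields `"Author argues that Author's framing is weak"`. Work titles and affiliations can be handled through the same mapping.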
Integration with Other Skills
When to use /extract vs. specialized skills
Use /extract:
- Unclear what type of extraction is needed
- Multiple types of documents to process
- Want intelligent routing to appropriate workflow
Use specialized skill directly:
- /remember: when you know you want to add to the knowledge base
- /decision-extract: when specifically extracting decisions
- /review-training: when processing matched review/source pairs (legacy)
Note on /convert-to-md: This trigger is now an alias for /extract. Invoking /convert-to-md routes to the workflows/docs-to-md.md workflow.
Skill Composition
/extract → analyze input → route to:
- /remember (for archival preservation)
- /decision-extract (for pending decisions)
- training-data workflow (for LLM training data)
- document-knowledge workflow (for general extraction)
- docs-to-md workflow (for format conversion)
Error Handling
| Scenario | Behavior |
|---|---|
| Unclear input type | Ask user to clarify extraction goal |
| Cannot convert document format | Try alternative conversion, document failure |
| Ambiguous feedback | Flag with "quality": "ambiguous", include with caveats |
| No clear extraction value | Ask user if they want to skip or force extraction |
| Storage location unclear | Default to $ACA_DATA/processed/, confirm with user |
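The "ambiguous feedback" row could translate into a record like the one built below. Only the `"quality": "ambiguous"` flag comes from the table; the `make_pair` helper and the other field names (`text`, `comment`, `caveats`) are assumptions for illustration.

```python
from typing import Dict


def make_pair(text: str, comment: str, ambiguous: bool = False, caveat: str = "") -> Dict:
    """Build a training pair, flagging ambiguous feedback instead of dropping it."""
    pair = {"text": text, "comment": comment}
    if ambiguous:
        # Keep the pair, but mark it so downstream consumers can filter it.
        pair["quality"] = "ambiguous"
        pair["caveats"] = caveat
    return pair
```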
Examples
Example 1: Peer Review Extraction
Input: DOCX file with inline comments from peer review
User: /extract /path/to/review.docx --type peer-review
Agent:
- Detects inline comments → routes to the training-data workflow (procedures/review-inline-comments.md)
- Converts DOCX with pandoc --track-changes=all
- Extracts 18 feedback units
- Identifies 10 generalizable principles
- Stores sensitive data in $ACA_DATA/processed/review_training/aoir2026/
- Updates aops-core/skills/hydrator/workflows/peer-review.md with depersonalized principles
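The conversion step in this example could be driven from Python as sketched here. The `--track-changes=all` flag is the one named in the example; the helper names, the choice of plain `markdown` as the output format, and writing the `.md` file next to the source are assumptions. Running `convert_review` requires pandoc to be installed.

```python
import subprocess
from pathlib import Path
from typing import List


def build_pandoc_command(docx_path: str, out_path: str) -> List[str]:
    """Build a pandoc call that keeps insertions, deletions, and comments."""
    return [
        "pandoc",
        "--track-changes=all",  # preserve tracked changes and comments as markup
        "--from", "docx",
        "--to", "markdown",
        "--output", out_path,
        docx_path,
    ]


def convert_review(docx_path: str) -> Path:
    """Convert a reviewed DOCX to Markdown alongside the original file."""
    out_path = Path(docx_path).with_suffix(".md")
    subprocess.run(build_pandoc_command(docx_path, str(out_path)), check=True)
    return out_path
```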
Example 2: Email Archive
Input: Directory of email MSG files
User: /extract emails/2025-Q1/ --type archive
Agent:
- Detects email archive → routes to archive extraction logic
- Processes each email, applies judgment criteria
- Extracts significant events/relationships
- Uses /remember to store in PKB
- Skips 90% of emails as noise
Example 3: Auto-Detection
Input: Mixed documents without type specified
User: /extract documents/
Agent:
- Analyzes each document
- Detects: 2 peer reviews (tracked changes), 5 emails, 1 grant application
- Routes peer reviews → training-data workflow
- Routes emails → archive extraction
- Asks user about grant application (unclear extraction goal)
Validation Checklist
Before completing extraction:
Completeness:
Quality:
Sensitivity:
Documentation:
Future Enhancements
- Semi-automated pattern detection
- Batch processing for multiple documents
- Integration with continuous ingestion pipeline
- Quality metrics and validation tools
- Cross-collection pattern analysis