| name | template-extractor |
| description | Reverse-engineer document templates to extract exact design specifications and generate reusable AI prompts for pixel-perfect document recreation |
| category | tooling |
| version | 1.0.0 |
| triggers | ["extract template","reverse engineer format","document style guide","replicate formatting","template from document","document specification","format analysis"] |
| mcp_servers | {"required":[],"optional":["memory-mcp"],"auto_enable":false} |
Template Extractor is a systematic reverse-engineering tool that extracts precise design specifications from existing documents (DOCX, PPTX, XLSX, PDF) to enable pixel-perfect recreation. Unlike visual inspection, which leads to "close enough" approximations, this skill unpacks document file structures and parses their underlying XML to extract exact font sizes, color hex codes, spacing values, and layout configurations.
The skill generates two critical outputs: (1) a code-level technical specification with exact measurements and values, and (2) an AI-ready prompt that enables any language model to recreate documents in that exact style. This dual-output approach ensures both machine precision and human comprehension.
By validating extracted templates through test recreation and visual comparison, Template Extractor guarantees that generated specifications are accurate and actionable, eliminating the guesswork and iteration cycles typical of manual document formatting.
Use When:
Do Not Use:
Human visual inspection of documents leads to approximations: "that looks like 14pt" or "probably Arial". Template Extractor treats documents as ZIP archives containing structured XML, unpacking them to access authoritative sources like styles.xml, theme1.xml, and document.xml. This approach extracts definitive values rather than best guesses.
Why This Matters: A "close enough" color (#333333 vs #1F1F1F) creates subtle inconsistency that compounds across documents. A 1pt font size difference (11pt vs 12pt) changes readability and layout flow. Manual inspection cannot reliably detect these differences, but XML parsing provides ground truth.
In Practice:
- word/styles.xml for exact heading and body text specifications
- word/theme/theme1.xml for color scheme definitions (hex values, not visual approximations)
- word/document.xml for page layout settings (margins, orientation, dimensions)
- word/media/ directories

Template Extractor generates two complementary artifacts: a technical specification for developers/designers who need exact measurements, and an AI replication prompt for language models that will generate documents in this style. This separation serves distinct audiences while ensuring both receive accurate, actionable information.
Why This Matters: Developers need machine-readable values ("font: Calibri 11pt, line-height: 1.15, margin-top: 0pt, margin-bottom: 8pt") while AI systems need natural language instructions with embedded precision ("Use Calibri 11pt for body text with 1.15 line spacing. Add 8pt spacing after each paragraph."). A single output cannot optimize for both use cases.
In Practice:
Output 1 (TEMPLATE_SPEC.md): Structured technical reference
Output 2 (AI_PROMPT.md): Natural language generation instructions
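As a minimal sketch of the dual-output idea, the snippet below renders one extracted style record into both forms quoted above. The dictionary keys and the function names are hypothetical, chosen only for illustration; the skill does not prescribe this schema.

```python
# Hypothetical extracted values; the keys are illustrative, not a fixed schema.
body_style = {
    "font": "Calibri",
    "size_pt": 11,
    "line_height": 1.15,
    "space_before_pt": 0,
    "space_after_pt": 8,
}


def render_spec_line(style: dict) -> str:
    """Machine-readable form, suitable for a TEMPLATE_SPEC.md entry."""
    return (
        f"font: {style['font']} {style['size_pt']}pt, "
        f"line-height: {style['line_height']}, "
        f"margin-top: {style['space_before_pt']}pt, "
        f"margin-bottom: {style['space_after_pt']}pt"
    )


def render_prompt_line(style: dict) -> str:
    """Natural-language form, suitable for an AI_PROMPT.md instruction."""
    return (
        f"Use {style['font']} {style['size_pt']}pt for body text with "
        f"{style['line_height']} line spacing. "
        f"Add {style['space_after_pt']}pt spacing after each paragraph."
    )


print(render_spec_line(body_style))
print(render_prompt_line(body_style))
```

Both strings come from the same record, so the two outputs cannot drift apart on a value.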
Extraction accuracy is validated by using the generated specification to recreate a test document, then comparing it visually and structurally to the original. This closes the loop and ensures specifications are not just theoretically correct but practically actionable.
Why This Matters: Specifications can be technically accurate but incomplete, omitting critical details that only become apparent during recreation. A color scheme might be extracted perfectly, but if table border rules are missing, the recreated document will differ visibly. Verification through recreation catches these gaps before specifications are used in production.
In Practice:
Objective: Identify document type and unpack file structure to access underlying XML and assets.
Actions:
Detect Document Type
- .docx (Word), .pptx (PowerPoint), .xlsx (Excel), .pdf (PDF)

Unpack Document Structure
unpacked/ directory
unzip document.docx -d unpacked/
- word/ (DOCX), ppt/ (PPTX), or xl/ (XLSX) - main content
- [type]/theme/ - theme colors and fonts
- [type]/media/ - embedded images and logos
- _rels/ - relationships and references

Identify Key Files
- DOCX: word/document.xml, word/styles.xml, word/theme/theme1.xml
- PPTX: ppt/presentation.xml, ppt/slideLayouts/, ppt/theme/theme1.xml
- XLSX: xl/workbook.xml, xl/styles.xml, xl/theme/theme1.xml

Catalog Media Assets
- word/media/, ppt/media/, or xl/media/
- File names (e.g., image1.png, logo.svg) and formats

Output: Unpacked directory structure with cataloged XML files and media assets
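The unpack-and-catalog phase can be sketched with Python's standard zipfile module. This is an illustrative outline rather than the skill's actual implementation: the function name unpack_document and the shape of the returned dictionary are assumptions.

```python
import zipfile
from pathlib import Path

# Top-level content folder for each ZIP-based Office format.
CONTENT_DIRS = {".docx": "word", ".pptx": "ppt", ".xlsx": "xl"}


def unpack_document(doc_path, out_dir="unpacked"):
    """Unzip an Office document and catalog its XML parts and media files."""
    doc_path = Path(doc_path)
    suffix = doc_path.suffix.lower()
    if suffix not in CONTENT_DIRS:
        # PDFs and other formats are not ZIP archives and need another route.
        raise ValueError(f"not a ZIP-based Office format: {suffix}")

    with zipfile.ZipFile(doc_path) as archive:
        archive.extractall(Path(out_dir))
        names = archive.namelist()

    media_prefix = f"{CONTENT_DIRS[suffix]}/media/"
    return {
        "type": suffix.lstrip("."),
        "xml_parts": sorted(n for n in names if n.endswith(".xml")),
        "media": sorted(
            n for n in names
            if n.startswith(media_prefix) and not n.endswith("/")
        ),
    }
```

Calling unpack_document("company_report.docx") would leave the extracted tree under unpacked/ and return the catalog for the later phases.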
Objective: Parse XML files and extract all design elements using the comprehensive extraction checklist.
Actions:
Extract Theme Data (from theme1.xml)
- <a:clrScheme> for primary, secondary, accent colors
  - <a:srgbClr val="0078D4"> to #0078D4
  - dk1 (dark1), lt1 (light1), accent1-6
- <a:fontScheme> for major (headings) and minor (body) fonts
  - <a:majorFont><a:latin typeface="Calibri Light">
  - <a:minorFont><a:latin typeface="Calibri">

Extract Style Definitions (from styles.xml)
- <w:rFonts w:ascii="Calibri Light">
- <w:sz w:val="32"> (32 half-points = 16pt)
- <w:b/> (bold) or absence (regular)
- <w:color w:val="2E74B5"> = #2E74B5
- <w:spacing w:before="0" w:after="200"> (200 twips = 10pt)

Extract Page Settings (from document.xml)
- <w:pgSz w:w="12240" w:h="15840"> (12240 twips = 8.5", 15840 twips = 11")
- <w:pgSz w:orient="portrait"> or landscape
- <w:pgMar w:top="1440" w:right="1440" w:bottom="1440" w:left="1440"> (1440 twips = 1")
- w:header="720" (720 twips = 0.5")

Extract Structural Elements
Apply Extraction Checklist (see references/extraction-checklist.md)
Convert Units
Output: JSON data structure with all extracted specifications (extraction_results.json)
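The theme step can be illustrated with the standard library's ElementTree parser. The sketch below assumes the DrawingML namespace used by theme1.xml and reads sysClr entries through their lastClr attribute; parse_theme and its return shape are hypothetical names, not the format of extraction_results.json.

```python
import xml.etree.ElementTree as ET

# DrawingML namespace used by theme1.xml.
NS = {"a": "http://schemas.openxmlformats.org/drawingml/2006/main"}


def parse_theme(theme_xml: str) -> dict:
    """Collect color slots and major/minor Latin fonts from a theme part."""
    root = ET.fromstring(theme_xml)

    colors = {}
    scheme = root.find(".//a:clrScheme", NS)
    if scheme is not None:
        for slot in scheme:
            name = slot.tag.split("}")[1]  # dk1, lt1, accent1, ...
            srgb = slot.find("a:srgbClr", NS)
            system = slot.find("a:sysClr", NS)
            if srgb is not None:
                colors[name] = "#" + srgb.get("val").upper()
            elif system is not None and system.get("lastClr"):
                # System colors carry the last rendered value in lastClr.
                colors[name] = "#" + system.get("lastClr").upper()

    fonts = {}
    for key, tag in (("major", "a:majorFont"), ("minor", "a:minorFont")):
        latin = root.find(f".//{tag}/a:latin", NS)
        if latin is not None:
            fonts[key] = latin.get("typeface")

    return {"colors": colors, "fonts": fonts}


SAMPLE_THEME = """<a:theme xmlns:a="http://schemas.openxmlformats.org/drawingml/2006/main" name="Office">
  <a:themeElements>
    <a:clrScheme name="Office">
      <a:dk1><a:sysClr val="windowText" lastClr="000000"/></a:dk1>
      <a:accent1><a:srgbClr val="0078D4"/></a:accent1>
    </a:clrScheme>
    <a:fontScheme name="Office">
      <a:majorFont><a:latin typeface="Calibri Light"/></a:majorFont>
      <a:minorFont><a:latin typeface="Calibri"/></a:minorFont>
    </a:fontScheme>
  </a:themeElements>
</a:theme>"""

print(parse_theme(SAMPLE_THEME))
```

SAMPLE_THEME is a trimmed stand-in for a real theme part; an actual theme1.xml contains all twelve color slots.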
Objective: Copy all embedded media files to an organized assets directory for reference and reuse.
Actions:
Locate Media Files
- word/media/, ppt/media/, xl/media/ directories
- .png, .jpg, .svg, .emf, .wmf

Copy Assets
- assets/ directory in output location

Catalog Asset Usage

Generate Asset Reference
- ASSETS.md with table of all media files

Output: assets/ directory with all media files and ASSETS.md reference
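A possible shape for the asset phase is sketched below. The column layout of ASSETS.md is an assumption made for this example (the source only says it contains a table of all media files), and collect_assets is a hypothetical helper name.

```python
import shutil
from pathlib import Path

MEDIA_DIRS = ("word/media", "ppt/media", "xl/media")


def collect_assets(unpacked_dir, output_dir):
    """Copy embedded media into output_dir/assets and write ASSETS.md."""
    unpacked_dir, output_dir = Path(unpacked_dir), Path(output_dir)
    assets_dir = output_dir / "assets"
    assets_dir.mkdir(parents=True, exist_ok=True)

    rows = []
    for media_dir in MEDIA_DIRS:
        source = unpacked_dir / media_dir
        if not source.is_dir():
            continue
        for path in sorted(source.iterdir()):
            if path.is_file():
                shutil.copy2(path, assets_dir / path.name)
                rows.append((
                    path.name,
                    path.suffix.lstrip(".").lower(),
                    path.stat().st_size,
                    f"{media_dir}/{path.name}",
                ))

    # Assumed table layout: one row per media file.
    lines = ["| File | Format | Size (bytes) | Source |", "|---|---|---|---|"]
    lines += [f"| {n} | {fmt} | {size} | {src} |" for n, fmt, size, src in rows]
    (output_dir / "ASSETS.md").write_text("\n".join(lines) + "\n", encoding="utf-8")
    return rows
```

Where each asset is used in the document (header logo, cover image, and so on) still has to be filled in by hand or from the _rels/ files; this sketch does not attempt that.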
Objective: Generate both technical specification and AI replication prompt using the extracted data.
Actions:
Generate TEMPLATE_SPEC.md (Code-Level Specification)
- references/output-template.md (Output 1 section)

Generate AI_PROMPT.md (AI Replication Prompt)
- references/output-template.md (Output 2 section)

Package Assets
- assets/ directory is referenced in both outputs

Create README

Output:
- TEMPLATE_SPEC.md (technical specification)
- AI_PROMPT.md (AI replication prompt)
- ASSETS.md (media reference)
- README.md (usage guide)

Objective: Validate extraction accuracy by recreating a test document and comparing to the original.
Actions:
Generate Test Document
Visual Comparison
Structural Validation
Discrepancy Analysis
Iteration
Output: Verified specifications with documented accuracy and known limitations
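Structural validation can be approximated by extracting specifications from both the original and the recreated document and diffing them. The sketch below assumes both extractions are flattened into dictionaries with matching keys. The fidelity score here (the share of matching keys) is one possible definition chosen for illustration, not a metric defined by this skill.

```python
from numbers import Real


def compare_specs(original: dict, recreated: dict, tolerance: float = 0.0):
    """Return (key, original_value, recreated_value) for every mismatch."""
    discrepancies = []
    for key in sorted(set(original) | set(recreated)):
        expected, actual = original.get(key), recreated.get(key)
        numeric = all(
            isinstance(v, Real) and not isinstance(v, bool)
            for v in (expected, actual)
        )
        if numeric:
            differs = abs(expected - actual) > tolerance
        else:
            differs = expected != actual
        if differs:
            discrepancies.append((key, expected, actual))
    return discrepancies


def fidelity(original: dict, recreated: dict, tolerance: float = 0.0) -> float:
    """Fraction of keys that match (an assumed, illustrative metric)."""
    total = len(set(original) | set(recreated))
    if total == 0:
        return 1.0
    return 1 - len(compare_specs(original, recreated, tolerance)) / total


original = {"body.font": "Calibri", "body.size_pt": 11, "h1.color": "#2E74B5", "margin.top_in": 1.0}
recreated = {"body.font": "Calibri", "body.size_pt": 11, "h1.color": "#2E75B6", "margin.top_in": 1.0}

for key, expected, actual in compare_specs(original, recreated):
    print(f"{key}: expected {expected}, got {actual}")
```

A structural diff like this complements the visual comparison. It catches near-identical colors that the eye misses, while the visual pass catches elements the extraction never captured.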
Purpose: Extract specifications from Microsoft Word documents (.docx)
Key Files:
- word/document.xml: Document content and section properties
- word/styles.xml: Style definitions (headings, body, tables, lists)
- word/theme/theme1.xml: Theme colors and fonts
- word/media/: Embedded images and logos

Extraction Commands:
# Unpack document
unzip document.docx -d unpacked/
# Extract theme colors
grep -A 3 "<a:clrScheme" unpacked/word/theme/theme1.xml
# Extract heading styles
grep -A 10 "w:styleId=\"Heading1\"" unpacked/word/styles.xml
# List media assets
ls -lh unpacked/word/media/
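Line-oriented grep depends on how the XML happens to be broken into lines, and Office often writes a whole part on a single line. A namespace-aware parse is more dependable. The following sketch reads one style from styles.xml and applies the half-point and twip conversions. read_style is a hypothetical helper, and it deliberately ignores style inheritance (w:basedOn) and theme font references, which a complete extractor would need to resolve.

```python
import xml.etree.ElementTree as ET

W_NS = "http://schemas.openxmlformats.org/wordprocessingml/2006/main"
NS = {"w": W_NS}


def w(name: str) -> str:
    """Qualified attribute name in the WordprocessingML namespace."""
    return f"{{{W_NS}}}{name}"


def read_style(styles_xml: str, style_id: str) -> dict:
    """Extract font, size, weight, color, and spacing for one style."""
    root = ET.fromstring(styles_xml)
    for style in root.findall("w:style", NS):
        if style.get(w("styleId")) != style_id:
            continue
        spec = {}
        fonts = style.find("w:rPr/w:rFonts", NS)
        size = style.find("w:rPr/w:sz", NS)
        bold = style.find("w:rPr/w:b", NS)
        color = style.find("w:rPr/w:color", NS)
        spacing = style.find("w:pPr/w:spacing", NS)
        if fonts is not None and fonts.get(w("ascii")):
            spec["font"] = fonts.get(w("ascii"))
        if size is not None:
            spec["size_pt"] = int(size.get(w("val"))) / 2  # half-points
        # <w:b/> means bold unless it is explicitly switched off.
        spec["bold"] = bold is not None and bold.get(w("val"), "true") not in ("0", "false")
        if color is not None and color.get(w("val")) != "auto":
            spec["color"] = "#" + color.get(w("val"))
        if spacing is not None:
            spec["space_before_pt"] = int(spacing.get(w("before"), "0")) / 20  # twips
            spec["space_after_pt"] = int(spacing.get(w("after"), "0")) / 20
        return spec
    raise KeyError(style_id)


SAMPLE_STYLES = f"""<w:styles xmlns:w="{W_NS}">
  <w:style w:type="paragraph" w:styleId="Heading1">
    <w:name w:val="heading 1"/>
    <w:pPr><w:spacing w:before="0" w:after="200"/></w:pPr>
    <w:rPr>
      <w:rFonts w:ascii="Calibri Light"/><w:b/>
      <w:color w:val="2E74B5"/><w:sz w:val="32"/>
    </w:rPr>
  </w:style>
</w:styles>"""

print(read_style(SAMPLE_STYLES, "Heading1"))
```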
Key Conversions:
- <w:sz w:val="24"> = 24 half-points = 12pt
- <w:spacing w:after="200"> = 200 twips = 10pt
- <w:pgMar w:top="1440"> = 1440 twips = 1 inch
- <w:color w:val="1F4E78"> = #1F4E78

Special Considerations:
Purpose: Extract specifications from Microsoft PowerPoint presentations (.pptx)
Key Files:
- ppt/presentation.xml: Presentation-level settings
- ppt/slideLayouts/: Slide layout definitions
- ppt/slideMasters/: Master slide formatting
- ppt/theme/theme1.xml: Theme colors and fonts
- ppt/media/: Embedded images and graphics

Extraction Commands:
# Unpack presentation
unzip presentation.pptx -d unpacked/
# Extract theme colors
grep -A 3 "<a:clrScheme" unpacked/ppt/theme/theme1.xml
# List slide layouts
ls -1 unpacked/ppt/slideLayouts/
# Extract master slide settings
cat unpacked/ppt/slideMasters/slideMaster1.xml
Key Conversions:
- <p:sldSz cx="9144000" cy="6858000"> = 9144000 x 6858000 EMUs = 10" x 7.5" (914400 EMUs per inch)
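To make the EMU conversion concrete, here is a small sketch that reads the slide size from presentation.xml and converts it to inches. The helper name slide_size_inches is an assumption; the namespace is the standard PresentationML one.

```python
import xml.etree.ElementTree as ET

P_NS = "http://schemas.openxmlformats.org/presentationml/2006/main"
EMU_PER_INCH = 914400


def slide_size_inches(presentation_xml: str) -> tuple:
    """Return (width, height) of the slide in inches from <p:sldSz>."""
    root = ET.fromstring(presentation_xml)
    size = root.find("p:sldSz", {"p": P_NS})
    if size is None:
        raise ValueError("presentation.xml has no <p:sldSz> element")
    return (
        int(size.get("cx")) / EMU_PER_INCH,
        int(size.get("cy")) / EMU_PER_INCH,
    )


SAMPLE_PRESENTATION = (
    f'<p:presentation xmlns:p="{P_NS}">'
    '<p:sldSz cx="9144000" cy="6858000"/>'
    "</p:presentation>"
)

print(slide_size_inches(SAMPLE_PRESENTATION))  # (10.0, 7.5)
```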
Special Considerations:
Purpose: Extract specifications from Microsoft Excel spreadsheets (.xlsx)
Key Files:
- xl/workbook.xml: Workbook-level settings
- xl/styles.xml: Cell styles, fonts, colors, number formats
- xl/theme/theme1.xml: Theme colors and fonts
- xl/media/: Embedded charts and images

Extraction Commands:
# Unpack workbook
unzip workbook.xlsx -d unpacked/
# Extract theme colors
grep -A 3 "<a:clrScheme" unpacked/xl/theme/theme1.xml
# Extract cell styles
grep -A 5 "<cellXfs" unpacked/xl/styles.xml
# Extract fonts
grep -A 3 "<fonts" unpacked/xl/styles.xml
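As an alternative to grep, the font table in xl/styles.xml can be read with a namespace-aware parser. This sketch assumes the standard SpreadsheetML namespace and only handles colors given as explicit rgb values (stored as ARGB, so the last six hex digits are kept). Theme-indexed colors are left unresolved. read_fonts is a hypothetical helper name.

```python
import xml.etree.ElementTree as ET

S_NS = "http://schemas.openxmlformats.org/spreadsheetml/2006/main"
NS = {"s": S_NS}


def read_fonts(styles_xml: str) -> list:
    """List the fonts declared in <fonts>, in index order."""
    root = ET.fromstring(styles_xml)
    fonts = []
    for font in root.findall("s:fonts/s:font", NS):
        name = font.find("s:name", NS)
        size = font.find("s:sz", NS)
        color = font.find("s:color", NS)
        entry = {
            "name": name.get("val") if name is not None else None,
            # Excel stores sizes directly in points, not half-points.
            "size_pt": float(size.get("val")) if size is not None else None,
            "bold": font.find("s:b", NS) is not None,
        }
        if color is not None and color.get("rgb"):
            entry["color"] = "#" + color.get("rgb")[-6:]  # drop alpha byte
        fonts.append(entry)
    return fonts


SAMPLE_XLSX_STYLES = f"""<styleSheet xmlns="{S_NS}">
  <fonts count="2">
    <font><sz val="11"/><color theme="1"/><name val="Calibri"/></font>
    <font><b/><sz val="14"/><color rgb="FF1F4E78"/><name val="Calibri"/></font>
  </fonts>
</styleSheet>"""

for index, font in enumerate(read_fonts(SAMPLE_XLSX_STYLES)):
    print(index, font)
```

The list index matters: entries in cellXfs refer to fonts by position through their fontId attribute.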
Key Conversions:
- <sz val="11"> = 11pt (direct, not half-points)
- <col min="1" max="1" width="12.5"> (character units)
- <row r="1" ht="15"> (points)

Special Considerations:
Purpose: Extract specifications from PDF documents (limited extraction)
Key Limitations:
Extraction Approach:
# Extract metadata
pdfinfo document.pdf
# Extract fonts
pdffonts document.pdf
# Extract text with layout
pdftotext -layout document.pdf
# Extract images
mkdir -p extracted_images && pdfimages document.pdf extracted_images/img
Extraction Strategy:
- pdffonts to identify font families used
- pdftotext -layout to analyze spacing and structure

Special Considerations:
Use this comprehensive checklist to ensure no design element is missed. Check off each item as extracted.
OOXML half-points to points: divide by 2
Example: <w:sz w:val="24"/> = 12pt
OOXML twips to points: divide by 20
Example: <w:spacing w:after="200"/> = 10pt
OOXML EMUs to inches: divide by 914400
Example: 914400 EMUs = 1 inch
Points to pixels (at 96 DPI): multiply by 96/72 (approximately 1.333)
Example: 12pt = 16px
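The conversion rules above fit in a handful of one-line helpers. The function names are illustrative only; the divisors come straight from the formulas listed.

```python
def half_points_to_pt(value: float) -> float:
    """OOXML font sizes (w:sz) are stored in half-points."""
    return value / 2


def twips_to_pt(value: float) -> float:
    """A twip is one twentieth of a point."""
    return value / 20


def twips_to_inches(value: float) -> float:
    """1440 twips = 72 points = 1 inch."""
    return value / 1440


def emu_to_inches(value: float) -> float:
    """914400 English Metric Units per inch."""
    return value / 914400


def pt_to_px(points: float, dpi: int = 96) -> float:
    """Points are 1/72 inch, so pixels = points * dpi / 72."""
    return points * dpi / 72


print(half_points_to_pt(24))   # 12.0
print(twips_to_pt(200))        # 10.0
print(twips_to_inches(12240))  # 8.5
print(emu_to_inches(914400))   # 1.0
print(pt_to_px(12))            # 16.0
```

Using dpi / 72 instead of the rounded 1.333 avoids small errors such as 12pt coming out as 15.996px.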
| Anti-Pattern | Problem | Solution |
|---|---|---|
| Visual Estimation | Guessing colors/fonts from appearance leads to inaccuracy. "That looks like 14pt Arial" may actually be 13pt Calibri, causing subtle but compounding inconsistency. | Always extract from XML source files. Use styles.xml for fonts, theme1.xml for colors. Convert OOXML units precisely. |
| Skipping Verification | Generated templates may drift from original due to missing specifications or ambiguous instructions. Without testing, errors only surface in production. | Always verify by recreation and visual comparison. Generate test document using AI_PROMPT.md, compare side-by-side, iterate until >95% fidelity. |
| Missing Assets | Forgetting to extract logos/images breaks branding and layout. Even if specifications are perfect, missing media ruins output. | Check word/media/, ppt/media/, xl/media/ directories in all unpacked documents. Copy all files to assets/ and document in ASSETS.md. |
| Unit Conversion Errors | OOXML uses half-points, twips, and EMUs. Forgetting conversions leads to 2x font sizes, incorrect margins. | Use conversion formulas: half-points ÷ 2, twips ÷ 20, EMUs ÷ 914400. Validate converted values against visual inspection. |
| Ignoring Edge Cases | First page headers, section breaks, and linked styles have special behavior. Assuming uniform formatting causes inconsistency. | Document all variations in TEMPLATE_SPEC.md implementation notes. Test recreation with multi-page, multi-section documents. |
| Incomplete Color Extraction | Extracting only theme colors misses table borders, callout backgrounds, and special element styling. | Use extraction checklist to systematically capture core palette AND functional colors. Check every visual element. |
| Approximating Measurements | Rounding margins to "about 1 inch" or spacing to "roughly 10pt" compounds across pages. | Extract and document exact values. Use original OOXML values, not rounded approximations. |
| Single-Sample Testing | Verifying with only the original content hides template generalization issues. New content may break layout. | Test recreation with multiple content samples: different lengths, varied structure, additional tables/lists. Ensure template generalizes. |
| Forgetting Fallback Fonts | Primary font may not be available on all systems. No fallback causes browser/OS substitution, ruining design. | Document font stack with fallbacks: "Calibri, Arial, Helvetica, sans-serif". Test on systems without primary font. |
| Mixing RGB and Hex | Inconsistent color notation causes confusion. Hex in some places, RGB in others leads to translation errors. | Standardize on hex for all color specifications. Include RGB as reference in tables, but use hex as primary. |
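For the last row of the table, one way to keep notation consistent is to normalize every color to uppercase hex and derive RGB from it only for reference columns. The helpers below are a sketch with hypothetical names.

```python
HEX_DIGITS = set("0123456789abcdefABCDEF")


def normalize_hex(value: str) -> str:
    """Return a color as '#RRGGBB' in uppercase, with or without a leading '#'."""
    digits = value.strip().lstrip("#")
    if len(digits) != 6 or not set(digits) <= HEX_DIGITS:
        raise ValueError(f"not a 6-digit hex color: {value!r}")
    return "#" + digits.upper()


def hex_to_rgb(value: str) -> tuple:
    """Derive the (R, G, B) reference triple from a hex color."""
    digits = normalize_hex(value)[1:]
    return tuple(int(digits[i:i + 2], 16) for i in (0, 2, 4))


print(normalize_hex("1f4e78"))  # #1F4E78
print(hex_to_rgb("#1F4E78"))    # (31, 78, 120)
```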
The scripts/extract_template.py Python helper automates the extraction workflow for Office documents:
Usage:
python scripts/extract_template.py <document_path> <output_dir>
# Example
python scripts/extract_template.py company_report.docx ./extracted_template
What It Does:
- output_dir/unpacked/
- theme1.xml
- styles.xml (DOCX)
- document.xml (DOCX)
- output_dir/assets/
- extraction_results.json with all extracted data

Output Structure:
extracted_template/
├── unpacked/                 # Full ZIP contents
├── assets/                   # Copied media files
├── extraction_results.json   # Raw extraction data
└── [manual step: create TEMPLATE_SPEC.md and AI_PROMPT.md]
Next Steps After Script:
- extraction_results.json for raw data
- unpacked/ directory for detailed XML
- assets/ for logos and images
- TEMPLATE_SPEC.md using Output 1 template
- AI_PROMPT.md using Output 2 template

Limitations:
Template Extractor transforms document reverse-engineering from a manual, error-prone guessing process into a systematic, verifiable workflow. By unpacking file structures and parsing XML directly, it provides ground-truth specifications that enable pixel-perfect recreation without iteration cycles.
The dual-output approach (technical specification + AI prompt) ensures both developers and AI systems have actionable, precise instructions tailored to their needs. The verification phase closes the loop, validating that specifications are not just theoretically correct but practically usable.
This skill is essential for teams needing document consistency across multiple files, for migrating formatting between systems, and for creating reusable branded document generators. By eliminating guesswork and manual measurement, it accelerates document production while guaranteeing visual fidelity.
Use this skill whenever precision matters more than approximation - when "close enough" is not acceptable and exact replication is required.
mcp_servers:
required: []
optional: [memory-mcp]
auto_enable: false
Optional MCP Usage:
- templates/{document_type}/{organization}
- who: template-extractor, project: document-templates, intent: specification