| name | extraction-check |
| description | ATS text-extraction regression gate for the resume repo. Runs the compiled PDF through three open-source parsers (pdftotext, pdftotext -layout, Apache Tika) that real ATSs actually use, asserts clean extraction across eleven structural checks, and surfaces disagreement between parsers as diagnostic signal. Use when the user asks about ATS parseability, text extraction, PDF parsing reliability, whether a resume variant breaks for ATSs, how the extraction gate works, running or debugging extraction-check, comparing extractor outputs, adding assertions, writing new broken fixtures, interpreting cross-extractor disagreement, or the text-extraction hypothesis at .claude/context/text-extraction-hypothesis.md. Keywords: ATS, applicant tracking system, text extraction, PDF parsing, pdftotext, poppler, tika, Apache Tika, reading order, mojibake, cross-extractor, extraction check, regression gate, resume parser, parseability, soft hyphen, url dedup. |
| allowed-tools | Read, Write, Edit, Grep, Glob, Bash |
Extraction Check
Deterministic gate on will_cygan_resume.pdf. Shells out to three open-source PDF-to-text parsers (pdftotext, pdftotext -layout, and Apache Tika) that enterprise ATSs are built on, runs eleven assertions on the output, and fails if any parser sees something structurally wrong.
The underlying hypothesis is in .claude/context/text-extraction-hypothesis.md: text-extraction fidelity is the single highest-leverage ATS-side optimization, and real-world ATSs virtually all run on the same handful of open-source parsers. Extracting cleanly through those parsers is equivalent to passing ATS parsing at the file-format level.
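As a rough illustration of what "shelling out" to the three parsers can look like, here is a minimal Python sketch. It is not the repo's actual script: the function names, the extractor labels, and the assumption that a `tika` wrapper accepting `--text` is on the PATH (as the Homebrew formula provides) are all hypothetical. Extractors that are not installed are skipped rather than treated as failures.

```python
import shutil
import subprocess
from pathlib import Path


def extractor_commands(pdf: Path) -> dict[str, list[str]]:
    """Map a (hypothetical) extractor label to the argv that prints text to stdout."""
    return {
        # "-" as the output file tells pdftotext to write to stdout.
        "pdftotext": ["pdftotext", str(pdf), "-"],
        "pdftotext-layout": ["pdftotext", "-layout", str(pdf), "-"],
        # Assumes a `tika` launcher script wrapping tika-app is installed.
        "tika": ["tika", "--text", str(pdf)],
    }


def run_available(pdf: Path) -> dict[str, str]:
    """Run every extractor whose binary is on PATH and collect its text output."""
    outputs: dict[str, str] = {}
    for name, argv in extractor_commands(pdf).items():
        if shutil.which(argv[0]) is None:
            continue  # extractor not installed; skip it
        result = subprocess.run(
            argv, capture_output=True, text=True, encoding="utf-8", check=True
        )
        outputs[name] = result.stdout
    return outputs
```

Calling `run_available(Path("will_cygan_resume.pdf"))` would return one entry per installed extractor, which is the shape the assertions operate on in this sketch.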
When to invoke this skill
- "Is this resume change safe for ATSs?" / "Will this parse cleanly?"
- "Why did extraction-check fail on [assertion X]?"
- "Run the extraction check" / "Compare what each extractor sees"
- "Add a new assertion / broken fixture / extractor"
- "How does the gate work?"
- "The CI Extraction Check job is failing — help me diagnose"
- Any mention of pdftotext, poppler, Tika, mojibake, reading order, or cross-extractor divergence
Quickstart
just compile
just extraction-check
just test
Prereqs on macOS: brew install poppler tika typst. CI installs pinned Tika 3.3.0.
The eleven assertions
| # | Name | What it checks |
|---|---|---|
| 1 | 1-non-empty | Extracted text is >500 bytes (catches image-only or encoding-broken PDFs) |
| 2 | 2-section-order | Declared section headers appear in the expected visual order |
| 3 | 3-name-contact | Name and email are present, not glued together (ATS field-map footgun) |
| 4 | 4-job-contiguity | Title / company / date-start / date-end co-occur within a 300-char window per job |
| 5 | 5-date-format | At least one date range matches a consistent "Mon YYYY – …" format |
| 6 | 6-mojibake | Zero replacement chars, no flagged smart-quote/em-dash, and no Private Use Area codepoints U+E000–U+F8FF (icon-font tofu from stray fa-icon(...) calls) |
| 7 | 7-cross-extractor | All available extractors agree on section order and job count, and no two extractor outputs differ by more than 1.5× in byte count |
| 8 | 8-soft-hyphen | No U+00AD soft hyphens in any extractor's output (breaks hyphenated words across paragraphs in Tika) |
| 9 | 9-keyword-roundtrip | Every ATS-searchable keyword declared in [keywords].required survives extraction as an exact substring (guards against ligature collapse and font-substitution regressions) |
| 10 | 10-url-dedup | Every URL that appears in extractor output appears exactly once (redundant title links are emitted three times in Tika's URL block) |
| 11 | 11-section-boundary | Every adjacent pair of top-level section headers is separated by at least one blank line (prevents section-boundary parsers from merging sections) |
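Several of the character-level checks in the table reduce to simple scans over the extracted text. The following Python sketch shows one way assertions 6, 8, and 10 could be expressed. The helper names and the URL regex are assumptions made for illustration, and the smart-quote/em-dash part of assertion 6 is omitted because the flagged set is not specified here.

```python
import re
from collections import Counter

REPLACEMENT_CHAR = "\ufffd"  # U+FFFD, emitted when decoding fails
SOFT_HYPHEN = "\u00ad"       # U+00AD
URL_RE = re.compile(r"https?://\S+")  # deliberately loose; an assumption


def private_use_chars(text: str) -> list[str]:
    """Return every codepoint in the Private Use Area U+E000..U+F8FF."""
    return [ch for ch in text if "\ue000" <= ch <= "\uf8ff"]


def has_replacement_char(text: str) -> bool:
    return REPLACEMENT_CHAR in text


def has_soft_hyphen(text: str) -> bool:
    return SOFT_HYPHEN in text


def duplicated_urls(text: str) -> list[str]:
    """Return URLs that occur more than once, ignoring trailing punctuation."""
    counts = Counter(url.rstrip(".,;)") for url in URL_RE.findall(text))
    return sorted(url for url, n in counts.items() if n > 1)
```

In this sketch a check passes when the corresponding helper returns an empty list or `False` for every extractor's output.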
Assertion 7 is the canonical reading-order-scramble signal. When pdftotext says 2 jobs but pdftotext -layout says 1, the layout has a bug real ATSs will hit. The byte-ratio extension catches quieter divergence where extractors agree on structure but one is emitting dramatically more or less text than the others.
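To make the two halves of assertion 7 concrete, here is a hedged Python sketch of a section-order comparison and the 1.5× byte-ratio check. The function names and the `outputs` mapping (extractor label to extracted text) are hypothetical, and job counting is left out because it depends on the fixture data.

```python
from itertools import combinations


def section_order(text: str, headers: list[str]) -> list[str]:
    """Return the headers found in `text`, in order of first appearance."""
    positions = {h: text.find(h) for h in headers}
    found = [h for h in headers if positions[h] >= 0]
    return sorted(found, key=lambda h: positions[h])


def orders_agree(outputs: dict[str, str], headers: list[str]) -> bool:
    """True when every extractor sees the same section order."""
    orders = {tuple(section_order(t, headers)) for t in outputs.values()}
    return len(orders) <= 1


def byte_ratio_violations(
    outputs: dict[str, str], max_ratio: float = 1.5
) -> list[tuple[str, str]]:
    """Return extractor pairs whose UTF-8 byte counts differ by more than max_ratio."""
    sizes = {name: len(text.encode("utf-8")) for name, text in outputs.items()}
    bad = []
    for a, b in combinations(sorted(sizes), 2):
        small, large = sorted((sizes[a], sizes[b]))
        # An empty output is always flagged; it also avoids dividing by zero.
        if small == 0 or large / small > max_ratio:
            bad.append((a, b))
    return bad
```

A pair at exactly 1.5× is not flagged here, matching the "more than 1.5×" wording in the table.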
Reading further
For anything deeper, read the relevant reference:
- Architecture — script layout, data model, how evaluate_pdf orchestrates parallel extractors
- Running checks — local commands, CI workflow, interpreting output
- Comparing outputs — what each extractor tells you, resolving disagreement, manual cross-checks
- Extending — adding an assertion, a new extractor, or a broken fixture; updating fixtures when the resume changes
- Failure playbook — one entry per assertion: how failures look, root causes, how to fix
Ground rules
- Do not silently loosen an assertion to make it pass. If the 300-char window fires on a legitimate layout, investigate the layout first. Adjust thresholds only with a commit-message reason.
- Do not add a new extractor unless it represents a real ATS parser family. pypdf / pdfplumber are Python-native and don't match ATS behavior — see the hypothesis spec for why.
- The real resume is the ground truth for fixtures. scripts/extraction-check.fixtures.toml is hand-maintained against will_cygan_resume.typ. When the resume gains a job or renames a section, update the TOML in the same commit.
- Negative fixtures must fail cleanly. Each broken .typ exists to prove one assertion fires. If a fixture trips multiple assertions unintentionally, tighten or split the fixture before shipping.