pdf-markdown-validator
Validate PDF to Markdown conversion quality using multi-dimensional metrics. Assess table accuracy, style preservation (bold/italic/headings), robustness, and performance with standardized F1-scoring methodology.
| name | pdf-markdown-validator |
| description | Validate PDF to Markdown conversion quality using multi-dimensional metrics. Assess table accuracy, style preservation (bold/italic/headings), robustness, and performance with standardized F1-scoring methodology. |
| license | Proprietary (repository internal) |
| compatibility | Requires Rust (1.70+), Python (3.9+), pandoc (2.14+), and EdgeQuake PDF crate |
| metadata | {"repo":"raphaelmansuy/edgequake","area":"pdf-processing","languages":["Rust","Markdown","Python"],"frameworks":["edgequake-pdf"],"patterns":["Quality metrics","Evaluation harness","Ground-truth comparison"]} |
Validate PDF to Markdown conversion quality using a comprehensive, multi-dimensional evaluation framework. This skill provides standardized metrics, evaluation harnesses, and reporting tools to assess conversion fidelity across table accuracy, style preservation, robustness, and performance.
Use this skill when you need to:
The validation framework computes a composite quality score (0–100) combining four independent metric dimensions:
FinalScore = (0.40 × TableAccuracy) + (0.40 × StyleAccuracy)
+ (0.10 × Robustness) + (0.10 × Performance)
Each dimension is independent and can be evaluated separately or together.
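As a minimal sketch of the weighting above, the composite can be computed from four per-dimension scores on a 0–100 scale. The function name, the dictionary keys, and the example values are assumptions made for illustration; they are not taken from the skill's scripts.

```python
# Weights copied from the FinalScore formula above.
WEIGHTS = {"table": 0.40, "style": 0.40, "robustness": 0.10, "performance": 0.10}


def final_score(dimensions: dict[str, float]) -> float:
    """Combine per-dimension scores (each 0-100) into the composite score."""
    missing = WEIGHTS.keys() - dimensions.keys()
    if missing:
        raise ValueError(f"missing dimensions: {sorted(missing)}")
    return sum(WEIGHTS[name] * dimensions[name] for name in WEIGHTS)


# Hypothetical per-dimension scores, for illustration only.
example = {"table": 80.0, "style": 90.0, "robustness": 96.7, "performance": 70.0}
print(round(final_score(example), 2))
```

Because tables and style carry 80% of the weight between them, a regression in either moves the composite far more than the same regression in robustness or performance.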
Measures how accurately tables are detected and their cell content extracted.
Components:
Table Detection F1: IoU-based matching of predicted vs. gold tables (IoU ≥ 0.5 threshold)
Cell Content Accuracy: Token-level F1, averaged across the cells of matched tables
Formula:
TableAccuracy = (0.5 × TableDetectionF1) + (0.5 × CellContentAccuracy)
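The two components can be sketched as follows. This is an illustration under stated assumptions, not the skill's implementation: tables are assumed to be represented as `(x0, y0, x1, y1)` bounding boxes, matching is assumed to be greedy and one-to-one, and tokens are assumed to be whitespace-separated. All function names are hypothetical.

```python
from collections import Counter

Box = tuple[float, float, float, float]  # assumed layout: (x0, y0, x1, y1)


def iou(a: Box, b: Box) -> float:
    """Intersection-over-union of two axis-aligned boxes."""
    width = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    height = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = width * height
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / union if union > 0 else 0.0


def f1(true_positives: int, n_pred: int, n_gold: int) -> float:
    """F1 from counts; two empty sets count as a perfect match."""
    if n_pred == 0 and n_gold == 0:
        return 1.0
    if true_positives == 0:
        return 0.0
    precision, recall = true_positives / n_pred, true_positives / n_gold
    return 2 * precision * recall / (precision + recall)


def detection_f1(pred: list[Box], gold: list[Box], threshold: float = 0.5) -> float:
    """Greedy matching: each gold table is claimed by at most one prediction."""
    unmatched = list(gold)
    matched = 0
    for box in pred:
        best = max(unmatched, key=lambda g: iou(box, g), default=None)
        if best is not None and iou(box, best) >= threshold:
            unmatched.remove(best)
            matched += 1
    return f1(matched, len(pred), len(gold))


def token_f1(pred_cell: str, gold_cell: str) -> float:
    """Token-level F1 for one cell, using whitespace tokens (an assumption)."""
    pred, gold = pred_cell.split(), gold_cell.split()
    overlap = sum((Counter(pred) & Counter(gold)).values())
    return f1(overlap, len(pred), len(gold))


def table_accuracy(table_detection_f1: float, cell_content_accuracy: float) -> float:
    return 0.5 * table_detection_f1 + 0.5 * cell_content_accuracy
```

Under whitespace tokenization a cell such as `25.0` shares no token with `25`, so small numeric reformatting is penalized as a full cell miss; a real scorer may normalize values first.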
Interpretation:
Measures how accurately text formatting (bold, italic, heading levels) is preserved.
Components:
Formula:
StyleAccuracy = macro_average(BoldF1, ItalicF1, HeadingF1)
= (BoldF1 + ItalicF1 + HeadingF1) / 3
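A rough sketch of the macro-average, assuming the gold-file conventions (`**bold**`, `_italic_`, ATX headings) and exact-text matching of extracted spans. The regular expressions and function names are illustrative assumptions; the skill's scorer may extract and match spans differently.

```python
import re
from collections import Counter

BOLD = re.compile(r"\*\*(.+?)\*\*")
# Underscore italics only; underscores inside words (snake_case) are ignored.
ITALIC = re.compile(r"(?<![A-Za-z0-9])_(.+?)_(?![A-Za-z0-9])")
# Captures (hashes, text), so a wrong heading level counts as a mismatch.
HEADING = re.compile(r"^(#{1,6})[ \t]+(.+?)[ \t]*$", re.MULTILINE)


def span_f1(pred: list, gold: list) -> float:
    """F1 over multisets of extracted spans; two empty sets score 1.0."""
    if not pred and not gold:
        return 1.0
    overlap = sum((Counter(pred) & Counter(gold)).values())
    if overlap == 0:
        return 0.0
    precision, recall = overlap / len(pred), overlap / len(gold)
    return 2 * precision * recall / (precision + recall)


def style_accuracy(generated_md: str, gold_md: str) -> float:
    """Macro-average of BoldF1, ItalicF1, and HeadingF1."""
    scores = [
        span_f1(pattern.findall(generated_md), pattern.findall(gold_md))
        for pattern in (BOLD, ITALIC, HEADING)
    ]
    return sum(scores) / len(scores)


gold = "# Main Heading\nThis is **bold** and _italic_ text.\n## Sub Heading\nMore content here."
stripped = "# Main Heading\nThis is bold and italic text.\n## Sub Heading\nMore content here."
print(style_accuracy(gold, gold), style_accuracy(stripped, gold))
```

With these assumptions, output that keeps every heading but drops all inline emphasis scores one third, because the macro-average gives each of the three styles equal weight regardless of how often it occurs.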
Interpretation:
Measures system stability and validity across a test corpus, particularly edge cases.
Components:
pandoc syntax validation
Formula:
Robustness = (CrashFreeRate + MarkdownValidityRate + CompletenessRate) / 3
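The formula can be sketched from raw counts over a corpus. Two things here are assumptions: that every rate uses the full corpus size as its denominator, and the example counts (1 crash, 0 invalid, 2 incomplete out of 30), which were chosen only because they reproduce the 96.7 / 100 / 93.3 figures used in the robustness example later in this document.

```python
def robustness(total: int, crashed: int, invalid_markdown: int, incomplete: int) -> float:
    """Mean of crash-free, Markdown-validity, and completeness rates (0-1)."""
    if total <= 0:
        raise ValueError("corpus must contain at least one document")
    crash_free_rate = (total - crashed) / total
    markdown_validity_rate = (total - invalid_markdown) / total
    completeness_rate = (total - incomplete) / total
    return (crash_free_rate + markdown_validity_rate + completeness_rate) / 3


# Hypothetical counts for a 30-PDF corpus.
print(f"{robustness(30, crashed=1, invalid_markdown=0, incomplete=2):.1%}")
```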
Interpretation:
Measures processing speed relative to a baseline; the target for a 1-page PDF is roughly 200–500 ms.
Components:
Formula:
Performance = 0.5 × min(1.0, baseline_median / run_median)
+ 0.5 × min(1.0, baseline_p95 / run_p95)
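A small sketch of the formula, assuming each run is summarized as a list of per-document timings in milliseconds and that the 95th percentile is computed with linear interpolation. The function names and sample timings are hypothetical.

```python
import statistics


def p95(samples: list[float]) -> float:
    """95th percentile via linear interpolation (needs at least two samples)."""
    return statistics.quantiles(samples, n=100, method="inclusive")[94]


def performance(baseline_ms: list[float], run_ms: list[float]) -> float:
    """Score in [0, 1]; 1.0 means the run is at least as fast as the baseline."""
    median_ratio = statistics.median(baseline_ms) / statistics.median(run_ms)
    p95_ratio = p95(baseline_ms) / p95(run_ms)
    return 0.5 * min(1.0, median_ratio) + 0.5 * min(1.0, p95_ratio)


# Hypothetical timings: the current run is exactly twice as slow as the baseline.
baseline = [200.0, 220.0, 240.0, 260.0, 280.0]
current = [400.0, 440.0, 480.0, 520.0, 560.0]
print(performance(baseline, current))
```

The `min(1.0, …)` caps mean a faster run cannot score above 1.0, so speedups never offset regressions in the other half of the score.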
Interpretation:
Create .gold.md files for each test PDF:
# Copy reference Markdown next to PDF with .gold.md extension
cp reference_output.md test.gold.md
# Format: Each section annotated with metadata
# Gold format example:
# # Heading 1
# **bold text** and *italic text*
#
# | Column A | Column B |
# |----------|---------|
# | cell 1 | cell 2 |
# Evaluate against ground truth
cargo run -p edgequake-pdf --example real_dataset_eval -- \
--input crates/edgequake-pdf/test-data/real_dataset \
--gold \
--metrics
# Generate detailed report
python3 .github/skills/pdf-markdown-validator/scripts/validate.py \
--pdf-dir crates/edgequake-pdf/test-data/real_dataset \
--gold-dir . \
--output-report metrics_report.json
# View summary scores
cat metrics_report.json | jq '.summary'
# Analyze failures by category
python3 .github/skills/pdf-markdown-validator/scripts/analyze_failures.py \
metrics_report.json
# Embed metrics in standard cargo test output
cargo test -p edgequake-pdf -- --nocapture
# Fail CI if composite score below threshold
cargo test -p edgequake-pdf --features ci-strict
from pdf_validator import PDFValidator
validator = PDFValidator(
    pdf_dir="test-data/real_dataset",
    gold_dir="test-data/gold",
    metrics=["table", "style", "robustness", "performance"],
)
score = validator.evaluate()
print(f"Composite Score: {score.composite}/100")
# GitHub Actions example
- name: Validate PDF → Markdown
  run: |
    cargo run -p edgequake-pdf --example real_dataset_eval -- --metrics
    python3 .github/skills/pdf-markdown-validator/scripts/validate.py \
      --ci-mode --fail-below 75
Input PDF: two-column table with header row "Name, Age" and one data row "John, 25"
Gold Markdown:
| Name | Age |
| ---- | --- |
| John | 25 |
Generated Markdown (Perfect):
| Name | Age |
| ---- | --- |
| John | 25 |
Scores:
Generated Markdown (Partial Match):
| Name | Age |
| ---- | ---- |
| John | 25.0 |
Scores:
Gold Markdown:
# Main Heading
This is **bold** and _italic_ text.
## Sub Heading
More content here.
Generated Markdown (Perfect):
# Main Heading
This is **bold** and _italic_ text.
## Sub Heading
More content here.
Scores:
Generated Markdown (Partial):
# Main Heading
This is bold and italic text.
## Sub Heading
More content here.
Scores:
Test corpus: 30 PDFs (including 5 edge cases: corrupted, multilingual, scanned, etc.)
Results:
Robustness Score: (96.7 + 100 + 93.3) / 3 = 96.7%
Baseline (previous release):
Current run:
Scores:
# Navigate to PDF crate
cd edgequake/crates/edgequake-pdf
# Ensure ground-truth annotations exist
# Files should be named: <pdf_name>.gold.md
ls -1 test-data/real_dataset/*.gold.md
# Convert all PDFs to Markdown
cargo run -p edgequake-pdf --example real_dataset_eval -- --write
# Outputs written to: test-data/real_dataset/*.md
# Compute all metrics
python3 ../../.github/skills/pdf-markdown-validator/scripts/validate.py \
--pdf-dir test-data/real_dataset \
--gold-dir test-data/real_dataset \
--output-report validation_report.json \
--verbose
# View summary
cat validation_report.json | jq '.summary'
# Detailed per-document breakdown
cat validation_report.json | jq '.documents | .[] | {name, scores}'
# Identify failure patterns
python3 ../../.github/skills/pdf-markdown-validator/scripts/analyze_failures.py \
validation_report.json --group-by failure_type
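For triage without `jq`, the report can also be summarized in Python. This sketch assumes only what the `jq` filters above imply: a top-level `documents` array whose entries have `name` and `scores`. That `scores` maps metric names to numbers is an assumption, as are the function name and the sample report.

```python
from collections import defaultdict
from statistics import mean


def weakest_dimensions(report: dict) -> list[tuple[str, float]]:
    """Average each metric across documents and sort ascending (worst first)."""
    per_metric: dict[str, list[float]] = defaultdict(list)
    for document in report.get("documents", []):
        for metric, value in document.get("scores", {}).items():
            if isinstance(value, (int, float)):
                per_metric[metric].append(value)
    averages = ((metric, mean(values)) for metric, values in per_metric.items())
    return sorted(averages, key=lambda item: item[1])


# Hypothetical report contents; load a real one with json.load(open(path)).
sample_report = {
    "documents": [
        {"name": "a.pdf", "scores": {"table": 40.0, "style": 90.0}},
        {"name": "b.pdf", "scores": {"table": 60.0, "style": 80.0}},
    ]
}
print(weakest_dimensions(sample_report))
```

Sorting worst-first puts the dimension most worth investigating at the top of the list.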
The .gold.md files serve as reference implementations. Use this structure:
# Document Title (H1)
## Section Heading (H2)
This paragraph contains **bold text** and _italic text_ and **_bold-italic text_**.
### Subsection (H3)
#### Sub-subsection (H4)
**Note:** Use standard Markdown syntax. Be precise with:
- Bold: **text**
- Italic: _text_
- Bold-Italic: **_text_**
- Headings: # through #### for H1–H4
### Tables
| Column 1 | Column 2 | Column 3 |
| -------- | -------- | -------- |
| Cell 1 | Cell 2 | Cell 3 |
| Cell 4 | Cell 5 | Cell 6 |
Ensure:
- Pipes align properly
- Headers separated by `---|---` row
- No trailing spaces (can affect parsing)
### Code Blocks
\`\`\`python
def hello():
    print("world")
\`\`\`
Use triple backticks with language identifier.
### Lists
Bullet list:
- Item 1
- Item 2
  - Nested item
- Item 3
Numbered list:
1. First
2. Second
3. Third
### Edge Cases
- **Multi-line table cells**: Not standard Markdown; flatten to single line
- **Merged cells**: Not representable in Markdown tables; split into separate rows
- **Vertical headers**: Use first row convention (all cells with **bold**)
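The table rules in the template above (separator row after the header, consistent cell counts, no trailing spaces) can be checked mechanically before a gold file is committed. The following lint is a minimal sketch, not part of the skill's scripts; the function names are hypothetical, and escaped pipes inside cells are not handled.

```python
import re

SEPARATOR_CELL = re.compile(r"^:?-+:?$")


def split_row(line: str) -> list[str]:
    """Split one pipe-table row into trimmed cell strings."""
    return [cell.strip() for cell in line.strip().strip("|").split("|")]


def lint_table(lines: list[str]) -> list[str]:
    """Return human-readable problems found in one Markdown pipe table."""
    problems = []
    for number, line in enumerate(lines, start=1):
        if line != line.rstrip():
            problems.append(f"line {number}: trailing whitespace")
    rows = [split_row(line) for line in lines]
    if len(rows) < 2 or not all(SEPARATOR_CELL.match(cell) for cell in rows[1]):
        problems.append("line 2: missing header separator row")
    width = len(rows[0]) if rows else 0
    for number, row in enumerate(rows, start=1):
        if len(row) != width:
            problems.append(f"line {number}: expected {width} cells, found {len(row)}")
    return problems


good = ["| Name | Age |", "| ---- | --- |", "| John | 25 |"]
bad = ["| Name | Age |", "| John | 25 | "]
print(lint_table(good), lint_table(bad))
```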
# Full validation pipeline
python3 scripts/validate.py \
--pdf-dir <path/to/pdfs> \
--gold-dir <path/to/gold> \
[--output-report <report.json>] \
[--metrics table,style,robustness,performance] \
[--ci-mode] \
[--fail-below 75]
Options:
- --pdf-dir: Directory containing PDFs and generated .md files
- --gold-dir: Directory containing .gold.md reference files
- --output-report: JSON file for machine-readable results (default: validation_report.json)
- --metrics: Comma-separated metrics to compute (default: all)
- --ci-mode: Fail with non-zero exit code if score below threshold
- --fail-below: Minimum acceptable score (default: 75)
# Identify and categorize failures
python3 scripts/analyze_failures.py \
<report.json> \
[--group-by failure_type|document|metric] \
[--export <output.csv>]
# Compare two validation runs
python3 scripts/compare_runs.py \
<baseline_report.json> \
<current_report.json> \
[--show-improvements] \
[--show-regressions]
name: PDF Validation
on: [push, pull_request]
jobs:
  validate:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - name: Install dependencies
        run: |
          sudo apt-get install -y pandoc
          pip install -r .github/skills/pdf-markdown-validator/requirements.txt
      - name: Generate Markdown
        run: |
          cd edgequake/crates/edgequake-pdf
          cargo run --example real_dataset_eval -- --write
      - name: Validate conversion
        run: |
          python3 .github/skills/pdf-markdown-validator/scripts/validate.py \
            --pdf-dir edgequake/crates/edgequake-pdf/test-data/real_dataset \
            --gold-dir edgequake/crates/edgequake-pdf/test-data/real_dataset \
            --ci-mode \
            --fail-below 75
      - name: Upload report
        if: always()
        uses: actions/upload-artifact@v4
        with:
          name: validation_report
          path: validation_report.json
# In your evaluation extension
from pdf_validator import BaseMetric
class CustomMetric(BaseMetric):
    def compute(self, gold_md: str, generated_md: str) -> float:
        """Implement your metric logic; must return a float in [0, 1]."""
        raise NotImplementedError
# Register with validator
validator.register_metric("custom_metric", CustomMetric(), weight=0.1)
# Use different aggregation strategy
def weighted_macro_f1(f1_scores: dict) -> float:
    weights = {"bold": 0.4, "italic": 0.3, "heading": 0.3}
    # Iterate over the known weights so extra keys in f1_scores cannot raise KeyError
    return sum(f1_scores[k] * weights[k] for k in weights)
validator.set_style_aggregator(weighted_macro_f1)
Symptom: Many documents fail pandoc validation.
Diagnosis:
# Run pandoc on output directly
pandoc -f markdown -t html -o /dev/null generated.md
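To run the same check over a whole output directory, the command can be wrapped in a small helper. This is a sketch with hypothetical function names; it treats a zero exit status from pandoc as "valid", requires `pandoc` on the `PATH` (otherwise `subprocess.run` raises `FileNotFoundError`), and accepts an injectable checker so the rate calculation can be exercised without pandoc.

```python
import subprocess
from pathlib import Path
from typing import Callable, Iterable


def pandoc_accepts(path: Path) -> bool:
    """Run the pandoc command shown above; True when it exits with status 0."""
    result = subprocess.run(
        ["pandoc", "-f", "markdown", "-t", "html", "-o", "/dev/null", str(path)],
        capture_output=True,
        text=True,
    )
    return result.returncode == 0


def markdown_validity_rate(
    paths: Iterable[Path], check: Callable[[Path], bool] = pandoc_accepts
) -> float:
    """Fraction of files accepted by `check`; an empty corpus counts as 1.0."""
    files = list(paths)
    if not files:
        return 1.0
    return sum(1 for path in files if check(path)) / len(files)


# Example usage (directory name is illustrative):
# rate = markdown_validity_rate(Path("test-data/real_dataset").glob("*.md"))
```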
Common causes:
Solution: Check the .md file for syntax errors and regenerate if needed.
Symptom: TableAccuracy < 50% despite detecting tables.
Diagnosis: Table cells extracted but content is corrupted or misaligned.
Solution:
edgequake-pdf/src/processors/
Symptom: Performance score < 60%.
Diagnosis: Processing time increased 2-3x over baseline.
Steps:
cargo flamegraph -p edgequake-pdf
Symptom: Crashes only on PDFs in edge_cases/ subdirectory.
Diagnosis: Parser/processor doesn't handle malformed or unusual PDFs.
Solution:
.gold.md files
.pdf and .gold.md in git
edgequake/crates/edgequake-pdf/README.md
specs/27-improve-pdf.md (OODA loop methodology)
edgequake/crates/edgequake-pdf/examples/real_dataset_eval.rs
edgequake/crates/edgequake-pdf/test-data/real_dataset/
This skill enables systematic, quantified validation of PDF → Markdown conversions. Use it to:
Start with the Quick Start section, prepare ground-truth annotations, and run the validation pipeline. Iterate using the Workflow guide, focusing on lowest-scoring dimensions first.