| name | scientific-data-extraction |
| description | Extract structured data from scientific literature across multiple formats (PDF, HTML, images, plain text). Auto-detects scientific domain to recommend specialized tools for chemistry/materials when appropriate. Use this skill when: extracting numerical data from papers, digitizing graphs/plots, parsing tables from PDFs, extracting chemical properties or reactions, or converting unstructured scientific text to structured formats. Key capabilities: format detection and routing, domain-specific extraction (chemistry/materials), multi-method validation, table extraction, graph digitization, LLM-enhanced extraction with verification, confidence scoring. |
| allowed-tools | ["*"] |
Scientific Data Extraction Skill
Overview
This skill provides comprehensive guidance for extracting structured data from scientific literature across multiple input formats (PDF, HTML, images, plain text). It auto-detects the scientific domain to recommend specialized tools when appropriate (particularly for chemistry and materials science) and employs a hierarchical extraction approach with multi-method validation for high-confidence results.
When to Use This Skill
Use this skill when you need to:
- Extract numerical data from scientific papers, reports, or documents
- Digitize graphs and plots to recover underlying data points
- Parse tables from PDFs or images into structured formats (CSV, DataFrame, JSON)
- Extract chemical/materials data including properties, reactions, compounds, and structures
- Convert unstructured text to structured JSON or tabular formats
- Validate extracted data through multi-method cross-checking
- Process document batches with consistent extraction methodology
Input Format Detection
The first step is to identify the input format and route the document to the appropriate tools:
Plain Text (.txt, .md)
- Domain detection via keyword analysis
- NLP-based entity extraction (spaCy, Stanza)
- Regex patterns for structured data (numbers with units, chemical formulas)
- LLM-based structured extraction
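As a minimal sketch of the regex approach to numbers with units, the snippet below pairs a numeric pattern with a small unit list. The unit list and the names `VALUE_WITH_UNIT` and `find_values_with_units` are illustrative assumptions, not part of any library. Ranges such as "3.0-3.4 eV" would need separate handling, because the hyphen is read as a sign.

```python
import re

# Hypothetical unit list for illustration; extend it for your domain.
UNITS = r"(?:eV|meV|K|°C|GPa|MPa|nm|µm|mg|g|mL|L|%)"

# A signed decimal number (optional scientific notation) followed by a unit.
# The negative lookahead stops "K" from matching the start of "Kelvin", etc.
VALUE_WITH_UNIT = re.compile(
    rf"(?P<value>[-+]?\d+(?:\.\d+)?(?:[eE][-+]?\d+)?)\s*(?P<unit>{UNITS})(?![A-Za-z])"
)

def find_values_with_units(text):
    """Return (value, unit) pairs found in the text, in reading order."""
    return [(float(m.group("value")), m.group("unit"))
            for m in VALUE_WITH_UNIT.finditer(text)]

sample = "Anatase TiO2 shows a bandgap of 3.2 eV and was annealed at 450 °C."
print(find_values_with_units(sample))
```

The subscript digit in "TiO2" is not reported because no unit follows it.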
HTML (.html, web pages)
- HTML parsing with BeautifulSoup + lxml
- Table detection and extraction
- Text content extraction with structure preservation
- Domain-specific processing after text extraction
PDF (.pdf)
| Priority | Tool | Speed | Use Case |
|---|---|---|---|
| Quick | PyMuPDF4LLM | ~0.12s | Initial exploration, large batches |
| Standard | GROBID | Medium | Research-grade, reference parsing |
| Standard | Docling | Medium | Layout-aware, complex documents |
| Tables | Camelot | Fast | Bordered tables |
| Tables | Tabula | Fast | General tables |
| Tables | pdfplumber | Medium | Complex table structures |
| Deep | Marker-PDF | Slower | Scanned documents with OCR |
Images (.png, .jpg, .tiff)
| Content Type | Recommended Approach |
|---|---|
| Document scan | OCR (Tesseract/Surya) then text pipeline |
| Graph/Plot | WebPlotDigitizer workflow or LLM vision |
| Table image | Table Transformer or LLM vision |
| Chemical structure | OSRA or DECIMER for SMILES conversion |
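The routing step described in this section can be sketched as a lookup on the file extension. The pipeline labels and the `detect_format` helper are hypothetical names chosen for illustration. An extension can be wrong, so a production router would likely also inspect the file contents.

```python
from pathlib import Path

# Extension-to-pipeline map mirroring the four input formats above.
# The labels are hypothetical, not identifiers from any library.
ROUTES = {
    ".txt": "text", ".md": "text",
    ".html": "html", ".htm": "html",
    ".pdf": "pdf",
    ".png": "image", ".jpg": "image", ".jpeg": "image",
    ".tif": "image", ".tiff": "image",
}

def detect_format(path):
    """Map a file path to a pipeline label; 'unknown' if unrecognized."""
    return ROUTES.get(Path(path).suffix.lower(), "unknown")

for name in ("papers/Smith2021.PDF", "figures/fig3.tiff", "notes.docx"):
    print(name, "->", detect_format(name))
```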
Domain Detection
The skill automatically detects the scientific domain so that specialized tools can be applied:
Chemistry/Materials Indicators
- Chemical formulas (H2O, NaCl, TiO2)
- SMILES strings, InChI identifiers
- Reaction arrows (→, ⟶, ⇌)
- Property keywords: melting point, bandgap, conductivity, yield, purity
- Material names and IUPAC nomenclature
- Spectroscopic data patterns (NMR shifts, IR peaks)
When Chemistry/Materials Detected
Apply specialized tools:
- ChemDataExtractor v2: Property extraction, entity recognition, table parsing
- OpenChemIE: Reaction extraction from text, tables, and figures
- Domain-specific NER: Chemical named entity recognition
General Scientific Domain
Use general-purpose extraction:
- Standard NLP pipelines
- LLM-based structured extraction
- Template-based parsing
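One way to approximate the domain detection described above is to count indicator hits and compare the total with a threshold. This is only a sketch: the element subset, the keyword list, the threshold of 3, and the `detect_domain` name are all assumptions, and short uppercase words such as "OK" or "NO" will produce false formula matches.

```python
import re

# Small illustrative subset of element symbols (an assumption, not exhaustive).
ELEMENTS = "He|Li|Na|Mg|Al|Si|Cl|Ca|Ti|Fe|Cu|Zn|Sn|H|B|C|N|O|F|P|S|K"
# Two or more element symbols, each with an optional count: H2O, NaCl, TiO2.
FORMULA = re.compile(rf"\b(?:(?:{ELEMENTS})\d*){{2,}}\b")
PROPERTY_KEYWORDS = ("melting point", "bandgap", "conductivity", "yield", "purity")
REACTION_ARROWS = ("→", "⟶", "⇌")

def detect_domain(text, threshold=3):
    """Return 'chemistry' when enough indicators are present, else 'general'."""
    lowered = text.lower()
    score = len(FORMULA.findall(text))
    score += sum(lowered.count(k) for k in PROPERTY_KEYWORDS)
    score += sum(text.count(a) for a in REACTION_ARROWS)
    return "chemistry" if score >= threshold else "general"

print(detect_domain("The bandgap of TiO2 was 3.2 eV; NaCl was added to H2O."))
print(detect_domain("We surveyed 200 participants about their sleep habits."))
```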
Extraction Method Hierarchy
Apply methods in order of increasing complexity based on requirements:
Level 1: Quick Extraction (Speed Priority)
When to use: Initial exploration, large document batches, simple structured data
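# Quick PDF-to-Markdown conversion
import pymupdf4llm
text = pymupdf4llm.to_markdown("paper.pdf")
# Quick HTML table discovery (html_content holds the page source as a string)
from bs4 import BeautifulSoup
soup = BeautifulSoup(html_content, 'lxml')
tables = soup.find_all('table')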
Expected confidence: Lower, suitable for screening
Level 2: Standard Extraction (Balanced)
When to use: Research-grade extraction, structure preservation needed
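# GROBID-backed parsing with scipdf_parser (module name is scipdf; needs a running GROBID server)
import scipdf
article = scipdf.parse_pdf_to_dict("paper.pdf")
# Layout-aware conversion with Docling
from docling.document_converter import DocumentConverter
converter = DocumentConverter()
result = converter.convert("paper.pdf")
# Bordered tables with Camelot
import camelot
tables = camelot.read_pdf("paper.pdf", flavor='lattice')
df = tables[0].df  # first detected table as a pandas DataFrame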
Expected confidence: Medium-high
Level 3: Deep Extraction (Accuracy Priority)
When to use: Publication-quality data, domain-specific extraction
# Chemistry-aware parsing with ChemDataExtractor
from chemdataextractor import Document
doc = Document.from_file("paper.pdf")
records = doc.records  # call records.serialize() for plain dicts
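# Reaction extraction with OpenChemIE, which works on PDF paths
# (method name taken from its PDF interface; verify against the installed version)
from openchemie import OpenChemIE
model = OpenChemIE()
reactions = model.extract_reactions_from_text_in_pdf("paper.pdf")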
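# OCR-capable conversion of scanned PDFs with Marker
from marker.converters.pdf import PdfConverter
from marker.models import create_model_dict
converter = PdfConverter(artifact_dict=create_model_dict())  # loads the layout/OCR models
result = converter("scanned_paper.pdf")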
Expected confidence: High
Level 4: LLM-Enhanced Extraction
When to use: Complex figures, ambiguous data, validation needed
# Prompt for text extraction; fill it with text_prompt.format(text=...)
text_prompt = """
Extract all numerical data from this text as JSON:
- Property name
- Value (number only)
- Unit
- Context (what material/compound)
Text: {text}
"""
# Prompt for graph extraction; send it together with the graph image
graph_prompt = """
Analyze this graph image and extract:
1. X-axis label and range
2. Y-axis label and range
3. All data points as (x, y) pairs
4. Any error bars or uncertainty indicators
"""
Expected confidence: Highest when combined with validation
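Because LLM output is free-form, it helps to parse the response defensively before trusting it. The sketch below assumes the model returns a JSON array whose records use the keys `property`, `value`, `unit`, and `context`; those key names and the `parse_llm_records` helper are assumptions. The source-text check is a crude guard against hallucinated values and will miss reformatted numbers (for example "450" returned as 450.0).

```python
import json

# Assumed record keys, matching the four fields requested in the prompt.
REQUIRED_KEYS = {"property", "value", "unit", "context"}

def parse_llm_records(response_text, source_text):
    """Return (accepted_records, problems) for a JSON array of records."""
    try:
        records = json.loads(response_text)
    except json.JSONDecodeError:
        return [], ["response is not valid JSON"]
    if not isinstance(records, list):
        return [], ["response is not a JSON array"]
    accepted, problems = [], []
    for i, rec in enumerate(records):
        if not isinstance(rec, dict) or not REQUIRED_KEYS.issubset(rec):
            problems.append(f"record {i}: missing keys")
        elif isinstance(rec["value"], bool) or not isinstance(rec["value"], (int, float)):
            problems.append(f"record {i}: value is not a number")
        elif str(rec["value"]) not in source_text:
            # Crude grounding check: the number must appear verbatim in the source.
            problems.append(f"record {i}: value not found in source text")
        else:
            accepted.append(rec)
    return accepted, problems

source = "The bandgap of TiO2 is 3.2 eV."
response = ('[{"property": "bandgap", "value": 3.2, "unit": "eV", "context": "TiO2"},'
            ' {"property": "bandgap", "value": 9.9, "unit": "eV", "context": "ZnO"}]')
accepted, problems = parse_llm_records(response, source)
print(accepted, problems)
```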
Multi-Method Validation Pipeline
For high-confidence results, use multiple extraction methods and validate:
Step 1: Primary Extraction
Select a method based on input type and domain, then extract structured data.
Step 2: Secondary Extraction
Run an alternative method on the same source, compare the results, and flag discrepancies.
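The comparison in Step 2 could look like the sketch below, which assumes each method's output has already been reduced to a dict keyed by `(entity, property)`. That representation, the 1% relative tolerance, and the `compare_extractions` name are assumptions, and the values are illustrative only.

```python
import math

def compare_extractions(primary, secondary, rel_tol=0.01):
    """Compare two {(entity, property): value} dicts.

    Returns (agreed_keys, discrepancies).
    """
    agreed, discrepancies = [], []
    for key in sorted(primary.keys() | secondary.keys()):
        a, b = primary.get(key), secondary.get(key)
        if a is None or b is None:
            reason = "missing in one method"
        elif math.isclose(a, b, rel_tol=rel_tol):
            agreed.append(key)
            continue
        else:
            reason = "values differ"
        discrepancies.append({"key": key, "primary": a, "secondary": b, "reason": reason})
    return agreed, discrepancies

# Illustrative values only, not taken from any paper.
primary = {("TiO2", "bandgap"): 3.2, ("ZnO", "bandgap"): 3.37}
secondary = {("TiO2", "bandgap"): 3.2, ("ZnO", "bandgap"): 3.73, ("SnO2", "bandgap"): 3.6}
agreed, discrepancies = compare_extractions(primary, secondary)
print(agreed)
print(discrepancies)
```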
Step 3: LLM Verification Queries
Ask targeted questions to verify extracted data:
- "Is this value X consistent with the context Y?"
- "Does unit Z make sense for property P?"
- "Are there any missing data points in the expected range?"
Step 4: Confidence Scoring
confidence = {
"score": 0.0,
"level": "HIGH|MEDIUM|LOW|REVIEW",
"methods_agreed": [],
"discrepancies": [],
"verification_notes": ""
}
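A simple way to fill this record is to score agreement between methods and route any unresolved discrepancy to review. The weights (0.4 per agreeing method, 0.2 for passing verification) and the thresholds (0.8 and 0.5) are assumptions chosen for illustration, as is the `score_confidence` helper; tune them against manually checked extractions.

```python
def score_confidence(methods_agreed, discrepancies, verified=False, notes=""):
    """Build a confidence record from agreement and verification signals."""
    # Assumed weights: 0.4 per agreeing method, 0.2 bonus for LLM verification.
    score = 0.4 * len(methods_agreed) + (0.2 if verified else 0.0)
    score = round(min(1.0, score), 2)
    if discrepancies:
        level = "REVIEW"   # any unresolved disagreement goes to a human
    elif score >= 0.8:
        level = "HIGH"
    elif score >= 0.5:
        level = "MEDIUM"
    else:
        level = "LOW"
    return {
        "score": score,
        "level": level,
        "methods_agreed": list(methods_agreed),
        "discrepancies": list(discrepancies),
        "verification_notes": notes,
    }

print(score_confidence(["chemdataextractor", "llm_extraction"], [], verified=True))
```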
Step 5: Database Cross-Reference (Optional)
For chemistry/materials, compare against known databases:
- Materials Project
- AFLOW
- PubChem
- NIST databases
Flag significant deviations from expected ranges.
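The range check can be sketched with a local table standing in for a database lookup. `EXPECTED_RANGES` is a hypothetical stand-in (a real pipeline would query one of the databases listed above), the 10% margin is an assumption, and the single entry reuses the 3.0-3.4 eV TiO2 range from this document's output example.

```python
# Hypothetical reference table; in practice these bounds would come from
# a database query rather than a hard-coded dict.
EXPECTED_RANGES = {
    ("TiO2", "bandgap", "eV"): (3.0, 3.4),
}

def check_against_reference(entity, prop, value, unit, margin=0.1):
    """Return 'ok', 'deviation', or 'no_reference' for an extracted value."""
    key = (entity, prop, unit)
    if key not in EXPECTED_RANGES:
        return "no_reference"
    low, high = EXPECTED_RANGES[key]
    pad = margin * (high - low)   # tolerate values slightly outside the range
    return "ok" if low - pad <= value <= high + pad else "deviation"

print(check_against_reference("TiO2", "bandgap", 3.2, "eV"))
print(check_against_reference("TiO2", "bandgap", 5.1, "eV"))
```

Keying on the unit as well as the property means a value reported in a different unit yields `no_reference` rather than a false deviation.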
Output Format
Structure extracted data consistently:
{
"extraction_metadata": {
"source": "path/to/document.pdf",
"source_type": "pdf",
"domain_detected": "chemistry",
"methods_used": ["grobid", "chemdataextractor", "llm_verification"],
"timestamp": "2025-01-18T..."
},
"extracted_data": [
{
"data_type": "material_property",
"entity": "TiO2",
"property": "bandgap",
"value": 3.2,
"unit": "eV",
"source_location": {
"page": 4,
"section": "Results",
"table_id": "Table 2",
"row": 3
},
"confidence": {
"score": 0.95,
"level": "HIGH",
"methods_agreed": ["chemdataextractor", "llm_extraction"],
"verification_notes": "Value consistent with literature range 3.0-3.4 eV"
}
}
],
"validation_summary": {
"total_extracted": 47,
"high_confidence": 38,
"medium_confidence": 7,
"needs_review": 2,
"discrepancies": []
}
}
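The `validation_summary` block can be derived from the per-record confidence levels rather than maintained by hand. In this sketch the `summarize` helper is a hypothetical name, and counting both LOW and REVIEW records under `needs_review` is an assumption about how the levels map onto the three summary buckets.

```python
from collections import Counter

def summarize(extracted_data):
    """Build a validation_summary dict from a list of extracted records."""
    levels = Counter(item["confidence"]["level"] for item in extracted_data)
    return {
        "total_extracted": len(extracted_data),
        "high_confidence": levels["HIGH"],
        "medium_confidence": levels["MEDIUM"],
        # Assumption: LOW and REVIEW records both need a human look.
        "needs_review": levels["LOW"] + levels["REVIEW"],
        "discrepancies": [
            d
            for item in extracted_data
            for d in item["confidence"].get("discrepancies", [])
        ],
    }

records = [
    {"entity": "TiO2", "confidence": {"level": "HIGH"}},
    {"entity": "ZnO", "confidence": {"level": "MEDIUM"}},
    {"entity": "SnO2", "confidence": {"level": "REVIEW", "discrepancies": ["unit mismatch"]}},
]
summary = summarize(records)
print(summary)
```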
Step-by-Step Instructions
For PDF Data Extraction
- Identify document type: Scanned or text-based PDF
- Choose extraction level: Based on accuracy requirements
- Detect domain: Check for chemistry/materials indicators
- Extract text/structure: Use appropriate tool from hierarchy
- Extract tables separately: Use Camelot, Tabula, or pdfplumber
- Apply domain tools: If chemistry detected, use ChemDataExtractor
- Validate: Run secondary extraction or LLM verification
- Format output: Structure as JSON with confidence scores
For Graph/Plot Digitization
- Assess graph quality: Resolution, clarity, labeling
- Identify graph type: Line plot, scatter, bar chart, contour
- Choose method:
  - Simple, clear graphs: WebPlotDigitizer (manual calibration)
  - Complex or batch: LLM vision extraction
- Calibrate axes: Define coordinate system
- Extract data points: Manual selection or automatic detection
- Validate: Check extracted points against visual inspection
- Export: CSV or JSON format with uncertainty estimates
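Axis calibration amounts to a linear map from pixel coordinates to data values, fixed by two known points per axis (or a linear map in log space for logarithmic axes). The sketch below shows that arithmetic; `make_axis_mapper` and the pixel positions are hypothetical and not part of WebPlotDigitizer.

```python
import math

def make_axis_mapper(pixel_a, value_a, pixel_b, value_b, log_scale=False):
    """Return a function mapping a pixel coordinate to a data value,
    given two calibration points on the same axis."""
    if pixel_a == pixel_b:
        raise ValueError("calibration pixels must differ")
    if log_scale:
        # Interpolate linearly in log10 space for logarithmic axes.
        value_a, value_b = math.log10(value_a), math.log10(value_b)
    slope = (value_b - value_a) / (pixel_b - pixel_a)

    def to_value(pixel):
        v = value_a + slope * (pixel - pixel_a)
        return 10 ** v if log_scale else v

    return to_value

# x axis: pixel column 100 is x = 0, column 500 is x = 10.
x_of = make_axis_mapper(100, 0.0, 500, 10.0)
# y axis: pixel rows grow downward, so row 400 is y = 0 and row 50 is y = 7.
y_of = make_axis_mapper(400, 0.0, 50, 7.0)
print(x_of(300), y_of(225))
```

Calibrating each axis separately handles the inverted y direction of image coordinates without special cases, because the slope simply comes out negative.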
For Table Extraction
- Identify table type: Bordered (lattice) or borderless (stream)
- Choose tool:
  - Bordered: Camelot with flavor='lattice'
  - Borderless: Tabula or Camelot with flavor='stream'
  - Complex: pdfplumber for fine-grained control
- Extract to DataFrame: Review structure and headers
- Clean data: Fix merged cells, missing values, formatting
- Apply domain parsing: Convert units, parse chemical formulas
- Validate: Compare against source visually
- Export: CSV, JSON, or integrate into dataset
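For the cleaning step, table cells often arrive as strings such as "3.2 ± 0.1", "n/a", or numbers written with a Unicode minus sign. The sketch below normalizes one cell at a time; the `parse_cell` helper, the list of missing-value markers, and the returned dict layout are assumptions for illustration.

```python
import re

# A number with an optional "± error" or "+/- error" suffix.
CELL = re.compile(
    r"^(?P<value>[-+]?\d+(?:\.\d+)?)\s*(?:(?:±|\+/-)\s*(?P<err>\d+(?:\.\d+)?))?$"
)
# Assumed markers for missing values, including en and em dashes.
MISSING = {"", "-", "\u2013", "\u2014", "n/a", "na", "nd"}

def parse_cell(cell):
    """Split a table cell into value and uncertainty, keeping the raw text."""
    text = cell.strip().replace("\u2212", "-")   # Unicode minus to ASCII hyphen
    result = {"value": None, "uncertainty": None, "raw": text}
    if text.lower() in MISSING:
        return result
    match = CELL.match(text)
    if match:
        result["value"] = float(match.group("value"))
        if match.group("err"):
            result["uncertainty"] = float(match.group("err"))
    return result

for cell in (" 3.2 ± 0.1 ", "\u221278", "n/a", "TiO2"):
    print(repr(cell), "->", parse_cell(cell))
```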
For Chemistry/Materials Extraction
- Confirm domain: Verify chemistry/materials content
- Choose specialized tool:
  - Properties: ChemDataExtractor v2
  - Reactions: OpenChemIE
  - Structures from images: OSRA or DECIMER
- Configure extraction: Set up parsers for target properties
- Run extraction: Process document with domain tools
- Post-process: Normalize units, standardize identifiers
- Cross-reference: Compare against databases (Materials Project, PubChem)
- Validate: LLM verification of unusual values
- Export: Structured JSON with confidence scores
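The unit-normalization part of post-processing can be sketched as a table of affine conversions to one canonical unit per quantity. The `TO_CANONICAL` table, the choice of canonical units, and the `normalize` helper are assumptions, and the table covers only a few units. Unknown units are passed through unchanged so that nothing is silently dropped.

```python
# unit -> (canonical_unit, scale, offset), where canonical = value * scale + offset.
# Illustrative subset only; the canonical units are an assumption.
TO_CANONICAL = {
    "K": ("K", 1.0, 0.0),
    "°C": ("K", 1.0, 273.15),
    "eV": ("eV", 1.0, 0.0),
    "meV": ("eV", 1e-3, 0.0),
    "GPa": ("GPa", 1.0, 0.0),
    "MPa": ("GPa", 1e-3, 0.0),
}

def normalize(value, unit):
    """Convert (value, unit) to the canonical unit; pass unknown units through."""
    if unit not in TO_CANONICAL:
        return value, unit
    canonical, scale, offset = TO_CANONICAL[unit]
    return value * scale + offset, canonical

print(normalize(25, "°C"))
print(normalize(3200, "meV"))
print(normalize(12, "furlong"))
```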
Best Practices
- Always start with format detection - Correct tool selection depends on accurate format identification
- Use the simplest method that works - Start at Level 1 and escalate only if needed
- Preserve source location - Track page numbers, sections, table IDs for traceability
- Validate unusual values - Any value outside expected ranges should be flagged and verified
- Document extraction methodology - Record which tools and settings produced each data point
- Handle uncertainty explicitly - Include error bounds when available, note when values are approximate
- Cross-reference chemistry data - Always compare against known databases for sanity checking
- Use LLM verification judiciously - Most valuable for complex figures and ambiguous cases
Requirements
Core Python Packages
- pymupdf4llm: Quick PDF extraction
- pdfplumber: Detailed PDF analysis
- camelot-py: Table extraction (requires ghostscript)
- beautifulsoup4, lxml: HTML parsing
- spacy: NLP processing
- pandas: Data manipulation
Domain-Specific (Chemistry)
- chemdataextractor: Chemistry NLP (v2 recommended)
- openchemie: Reaction extraction
Optional
- tabula-py: Table extraction (requires Java)
- grobid (server): Academic PDF parsing
- docling: IBM document converter
- marker-pdf: OCR-capable PDF conversion
- tesseract or surya: OCR engines
Limitations
- Scanned documents require OCR - Quality depends on scan resolution and OCR accuracy
- Complex table structures - Merged cells, nested headers may require manual correction
- Graph digitization is approximate - Precision limited by image resolution and calibration
- Domain tools are specialized - Chemistry tools won't work well on biology or physics texts
- LLM extraction can hallucinate - Always validate with source or alternative method
- Some PDFs are protected - May not be extractable due to DRM or image-only content
Related Skills
- literature-review: For systematic literature searching and synthesis
- scientific-reviewer: For evaluating extracted data quality
- materials-databases: For cross-referencing extracted chemistry/materials data
- python-plotting: For visualizing extracted data
References
See the references/ directory for detailed documentation on:
- pdf-tools.md: Comprehensive PDF extraction tool comparison
- table-extraction.md: Table extraction methods and code examples
- graph-digitization.md: Graph data extraction techniques
- chemistry-tools.md: ChemDataExtractor and OpenChemIE usage
- llm-extraction.md: LLM-based extraction patterns and validation
See the examples/ directory for complete workflows:
- extract-from-pdf.md: End-to-end PDF extraction example
- extract-table-data.md: Table extraction comparison
- digitize-graph.md: Graph digitization guide
- chemistry-extraction.md: Chemistry-specific extraction workflow