| name | data-extraction |
| description | Use when extracting structured data from medical research PDFs, parsing study characteristics, patient demographics, outcomes, and results. Invoke for systematic review data collection from papers. |
Data Extraction Skill
This skill guides structured data extraction from research papers for systematic reviews.
When to Use
Invoke this skill when the user:
- Asks to extract data from a PDF
- Needs study characteristics pulled
- Wants patient demographics collected
- Requests outcome data extraction
- Mentions "data extraction" or "data collection"
Data Elements to Extract
1. Study Identification
| Field | Description | Example |
|---|
| study_id | FirstAuthorYear format | "Smith2023" |
| pmid | PubMed ID | "37654321" |
| doi | Digital Object Identifier | "10.1001/jamasurg.2023.1234" |
| title | Full article title | "..." |
2. Study Characteristics
| Field | Description | Values |
|---|
| year | Publication year | 2020 |
| country | Study location | "USA", "Japan" |
| study_design | Design type | "RCT", "Retrospective cohort" |
| multicenter | Single/multi | true/false |
| study_period | Enrollment dates | "2015-2020" |
3. Patient Demographics
| Field | Format | Notes |
|---|
| sample_size | Integer | Total N |
| age_mean | Number | Mean age |
| age_sd | Number | Standard deviation |
| age_median | Number | If no mean |
| age_iqr | [Q1, Q3] | Interquartile range |
| male_percent | 0-100 | Percentage male |
4. Clinical Characteristics (Neurosurgery)
Common scales and measures:
- GCS (Glasgow Coma Scale): 3-15
- GOS (Glasgow Outcome Scale): 1-5
- mRS (modified Rankin Scale): 0-6
- NIHSS (NIH Stroke Scale): 0-42
- Hunt-Hess: I-V
- Fisher Grade: 1-4
- WHO Grade: I-IV (tumors)
5. Intervention Details
intervention:
name: "Decompressive craniectomy"
type: "Surgical"
technique: "Unilateral frontotemporoparietal"
timing: "Within 48 hours"
details: "Bone flap ≥12cm diameter"
6. Outcome Data
Binary Outcomes (events/total)
outcomes:
- name: "Mortality"
type: "binary"
timepoint: "30 days"
intervention:
events: 12
total: 50
control:
events: 25
total: 52
Continuous Outcomes (mean ± SD)
outcomes:
- name: "Length of stay"
type: "continuous"
timepoint: "discharge"
intervention:
mean: 14.5
sd: 6.2
n: 50
control:
mean: 18.3
sd: 7.1
n: 52
Effect Estimates
effect_estimate:
measure: "OR"
value: 0.65
ci_lower: 0.42
ci_upper: 0.98
p_value: 0.038
Extraction Principles
DO:
- Extract only explicitly stated data
- Record the exact numbers from the paper
- Note units (mg, mm, days, months)
- Specify timepoints for each outcome
- Flag unclear or ambiguous values with "?"
- Document page numbers for key data
DON'T:
- Calculate or derive values (unless necessary)
- Assume missing data
- Interpret unclear statements
- Mix timepoints within outcomes
Quality Checks
After extraction, verify:
Common Issues
Converting Median/IQR to Mean/SD
When only median and IQR reported:
Mean ≈ Median (for symmetric distributions)
SD ≈ IQR / 1.35 (for normal distributions)
Extracting from Figures
- Use WebPlotDigitizer for graph data
- Note "extracted from figure" in comments
- Estimate uncertainty
Missing Control Group (Single-Arm)
For case series without controls:
outcomes:
- name: "Mortality"
type: "binary"
timepoint: "in-hospital"
single_arm:
events: 15
total: 100
Output Format
Use YAML format for structured extraction:
study_id: "Smith2023"
pmid: "37654321"
doi: "10.1001/jamasurg.2023.1234"
year: 2023
country: "USA"
study_design: "Retrospective cohort"
sample_size: 150
patient_demographics:
age_mean: 58.3
age_sd: 12.4
male_percent: 62
intervention:
name: "Decompressive craniectomy"
type: "Surgical"
outcomes:
- name: "Mortality"
type: "binary"
timepoint: "30 days"
intervention:
events: 12
total: 75
control:
events: 18
total: 75
notes: "Single-center study. High crossover rate (15%)."
Validation
After extraction, use the validate_extraction tool to check against schema:
mcp__neuroresearch__validate_extraction(data, schema_type="study")