---
name: extract-literature-model
description: This skill should be used when the user wants to "add a model from a paper", "extract a pharmacometric model from the literature", "implement a published PK/PD model in nlmixr2lib", or provides a scientific article, conference poster, supplement, or regulatory review document and asks to add the model to nlmixr2lib. Guides source review, standardized model file creation under inst/modeldb/, in-file source-trace verification, and validation vignette with PKNCA NCA checks.
---
Extract a pharmacometric model from the literature
Input: a scientific source describing a pharmacometric model (journal article, supplement, conference poster, or regulatory review).
Output: a packaged nlmixr2lib model file under inst/modeldb/, a validation vignette under vignettes/articles/, and updated registry artifacts — opened as a pull request against main.
Work through the six phases below. Stop and ask the user at any of the decision points called out explicitly; ambiguity is the main failure mode for this workflow, and silent assumptions are what get shipped as bugs.
References
Read these on demand; don't load them up front.
references/naming-conventions.md — parameter, compartment, IIV, and error-model names.
inst/references/covariate-columns.md — authoritative register of covariate column names. Consult before introducing any new covariate. (Installed with the package so R code like checkModelConventions() can parse it at runtime.)
references/model-file-template.md — skeleton for the .R file.
references/vignette-template.md — skeleton for the validation vignette.
references/pknca-recipes.md — PKNCA setups for single-dose, steady-state, and multi-dose NCA.
references/endogenous-validation.md — validation strategy for endogenous / mechanistic / turnover models where PKNCA is not the right check.
references/verification-checklist.md — checklist to walk after the first-pass implementation.
Phase 1 — Source acquisition and scoping
Step 0 — Source-file presence and self-acquisition
Before any of the numbered steps below, confirm every source file the task references is on disk and is a valid PDF. The task block includes Lead PDF, Supplements, Model files, and Source dir paths; check each path that is set.
For each missing or invalid file (file does not exist, is < 10 KB, or whose first 4 bytes are not %PDF), attempt OA acquisition before giving up. The task agent has Bash and WebFetch tools — use them. Try sources in this order, stopping at the first that yields a valid %PDF-headed download of ≥ 10 KB:
- CrossRef link array.
curl -sS "https://api.crossref.org/works/<DOI>" -A "literature-acquisition (mailto:wdenney@humanpredictions.com)" → inspect the message.link[] array; download any URL whose content-type contains pdf or whose URL ends in .pdf.
- Unpaywall.
curl -sS "https://api.unpaywall.org/v2/<DOI>?email=wdenney@humanpredictions.com" → if best_oa_location.url_for_pdf is set, download it.
- Europe PMC.
curl -sS "https://www.ebi.ac.uk/europepmc/webservices/rest/search?query=DOI:<DOI>&format=json" → if any result has a pmcid field, download https://europepmc.org/articles/<PMCID>?pdf=render. Most effective for Wiley, Elsevier, and Springer papers in PMC after embargo.
- NCBI PMC ID converter.
curl -sS "https://www.ncbi.nlm.nih.gov/pmc/utils/idconv/v1.0/?ids=<DOI>&format=json&tool=literature-acquisition&email=wdenney@humanpredictions.com" → download https://www.ncbi.nlm.nih.gov/pmc/articles/<PMCID>/pdf/.
- Publisher-specific patterns (only for known-OA outlets):
- BMC:
https://<journal>.biomedcentral.com/track/pdf/<DOI>
- Frontiers:
https://www.frontiersin.org/articles/<DOI>/pdf
- Nature OA / Springer Nature OA:
https://www.nature.com/articles/<id>.pdf (where <id> is the trailing segment of the DOI)
- Hindawi:
https://onlinelibrary.wiley.com/doi/pdf/<DOI> (post-Wiley acquisition)
- Dovepress:
https://www.dovepress.com/getfile.php?fileID=<file_id>
- Static Springer Nature media (for Nature Comm SI files):
https://static-content.springer.com/esm/art%3A<doi-percent-encoded>/MediaObjects/<id>_MOESM1_ESM.pdf
Validation after each download: confirm size ≥ 10 KB AND the first 4 bytes equal %PDF. Wiley/Elsevier non-OA endpoints typically return ~5 KB HTML challenge pages — these MUST be rejected (delete and try the next source). Do NOT extract from a file whose head bytes are not %PDF.
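The download gate above can be sketched as a small shell check (paths and filenames here are illustrative, not part of the skill's required interface):

```shell
# Hedged sketch of the post-download gate: accept a file only if it is
# >= 10 KB AND its first 4 bytes are the literal "%PDF".
is_valid_pdf() {
  f="$1"
  [ -f "$f" ] || return 1
  [ "$(wc -c < "$f" | tr -d ' ')" -ge 10240 ] || return 1   # size gate
  [ "$(head -c 4 "$f")" = "%PDF" ] || return 1              # magic-bytes gate
  return 0
}

# A ~5 KB HTML challenge page must be rejected and deleted:
printf '<html>access challenge</html>' > /tmp/challenge.pdf
is_valid_pdf /tmp/challenge.pdf || { echo "rejected"; rm -f /tmp/challenge.pdf; }
```

Running the demo prints "rejected" and removes the bad file, which is exactly the delete-and-try-next-source behavior the ladder requires.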
Title-content sanity check for the lead PDF: after a successful download, pdftotext -l 1 <file> - and confirm the title / first-author line approximately matches what the task block expects. CrossRef DOIs occasionally point to a different paper than the task's metadata claims (publishers' DOI sequence is not always continuous, e.g. 10.1038/aps.2014.120 was a Zhang ROR review when the task expected Lu 2015 tacrolimus). If the downloaded PDF's title disagrees with the task expectation, don't extract from it — sidecar-ask the operator with the actual-vs-expected metadata.
When to give up and sidecar. If all 5 source paths above have been tried and none produced a valid PDF whose title matches the task expectation, write a sidecar request describing what was attempted:
Lead PDF for <paper> not on disk and OA acquisition failed. Tried: CrossRef link array (<n> URLs), Unpaywall (oa_status=<status>, pdf_url=<url or none>), Europe PMC (PMCID=<id or none>), NCBI PMC (PMCID=<id or none>), publisher landing (<url>). Each returned <reason: HTML challenge page / 404 / not-PDF / title-mismatch>. Options: (A) operator drops the PDF on disk and re-dispatches, (B) operator emails corresponding author, (C) skip this task. Which applies?
The same ladder applies to supplements and errata:
- For Wiley supplements: try
https://onlinelibrary.wiley.com/action/downloadSupplement?doi=<DOI>&file=<filename>. Filenames often follow the pattern <journal-prefix><articleid>-sup-<NNNN>-<descriptor>.pdf; if the exact filename isn't known, fetch the article landing page and grep for downloadSupplement URLs.
- For Springer Nature SI:
https://static-content.springer.com/esm/art%3A<doi-percent-encoded>/MediaObjects/<id>_MOESM1_ESM.pdf. Increment MOESM2_ESM, MOESM3_ESM, etc. for additional supplements.
- For BMC: PMC mirror often has supplements at
https://www.ncbi.nlm.nih.gov/pmc/articles/<PMCID>/bin/<filename>.
- For Frontiers: typically inline at the article landing page; fetch the page and grep for
Image_*.pdf / Table_*.docx / Presentation_*.pdf in the HTML.
- For errata: search PubMed (
https://eutils.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi?db=pubmed&term=<title>+AND+erratum&retmode=json) for the original PMID + erratum link; download the erratum PDF via the same OA ladder.
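The Springer Nature MOESM enumeration above can be sketched as follows; the DOI is illustrative, and each candidate URL would then go through the same download-and-validate gate as the lead PDF:

```shell
# Sketch: build candidate SI URLs from a DOI using the static-content pattern.
doi="10.1038/s41467-020-12345-6"                # illustrative DOI
enc=$(printf '%s' "$doi" | sed 's,/,%2F,g')     # percent-encode the DOI slash
id="${doi##*/}"                                 # trailing DOI segment used in the filename
for i in 1 2 3; do
  echo "https://static-content.springer.com/esm/art%3A${enc}/MediaObjects/${id}_MOESM${i}_ESM.pdf"
done
```

Stop incrementing MOESM at the first 404, and validate every download with the size/magic-bytes gate before trusting it.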
If a supplement is unobtainable but the lead PDF is on disk, decide based on whether the missing supplement contains parameter values you need: if it does, sidecar-ask before extracting (see Phase 4 missing-parameter pathway); if it does not, document the gap in the vignette Errata and proceed.
Reasonable-attempts cap. Don't loop forever. The 5-source ladder above with one retry per source is the cap; after that, sidecar. If a source returns a 5xx error, retry once after 5 seconds; otherwise treat as failed and move on.
Numbered steps
1. Confirm the source type (journal article, supplement, poster, regulatory document).
2. Verify the on-disk file is the paper the task names. Open the source file (or its trimmed .md companion) and read the title + first-author line + journal + year. Compare against the task's Paper metadata block. If any of {first-author, year, journal, drug} disagrees, stop and sidecar-ask:
The on-disk file <filename> reports <actual-author> <actual-year>, <actual-journal>: "<actual-title>" but the task names <expected-author> <expected-year>, ... <expected-drug>. Options: (A) the task metadata is wrong and the file is the right paper — please correct the task metadata, (B) the file is mislabelled — please drop the correct PDF, (C) skip this task. Which applies?
This catches both "wrong-PMID-in-filename" (e.g. Frey 2010 saved as PMID_23436260.pdf) and "wrong-drug-guessed" task-generator errors. Do not proceed with extraction until the mismatch is resolved.
3. Check the study population species. Skim the Methods / Subjects section. If every PK dataset contributing to the final model is non-human (rat, mouse, cyno, dog, etc.), stop and sidecar-ask the operator before drafting anything:
This paper reports a preclinical-only (<species>) population PK model. nlmixr2lib is a library of human population-PK models. Should I (A) extract it anyway with clear preclinical metadata, (B) extract only a human-scaled projection if the paper includes one, or (C) skip this paper?
If the paper has both preclinical and human cohorts and the final model is fit to pooled or human-only data, proceed normally — this trigger is specifically for animal-only final models. Record the operator's decision in the PR body.
4. Prefer trimmed markdown when available. The preprocessor at mab_human_consensus/tracking/preprocess_papers.py writes a <stem>_trimmed.md next to each source file (PMC XML, PDF, DOCX, XLSX) containing only the sections the extraction actually needs: Title + Abstract + Methods + Results + Tables + Figure captions. The Introduction, Discussion, Conclusions, References, Acknowledgments, and publisher boilerplate are stripped. If PMID_<pmid>_pmc_trimmed.md (or PMID_<pmid>_trimmed.md for a PDF, or <stem>_trimmed.md for a supplement) exists, read it instead of the raw .xml / .pdf / .docx — it's typically 40-95% smaller with no loss of extractable content. Full-text sanity check on the trimmed file: ~15 KB+ (full-text trim) vs < 3 KB (abstract-only trim). Fall back to the raw source only if the _trimmed.md doesn't exist, the trim appears to have lost a specific piece of information you need (rare — only when the paper is structurally unusual), or you explicitly need the discarded sections (e.g., to quote a Discussion claim in the vignette narrative).
5. Verify the source contains full text, not just the abstract. Wiley / BJCP and some other publishers serve PMC XML containing only front matter + abstract. Before reading for model structure, run a quick sanity check (against the trimmed file if present, otherwise the raw source):
- Trimmed .md file size ≥ ~15 KB, or raw PMC XML ≥ ~40 KB (full-text XML is typically 100 KB+; abstract-only is usually < 20 KB).
- The file contains a materially-present Methods section (not just a "Methods" heading followed by one abstract paragraph).
- If only a PDF is on disk, confirm it runs past the abstract (multi-page, Methods / Results / Tables present).
If only the abstract is available, sidecar-ask:
The source on disk for <paper> contains only the abstract and front matter; full text appears to be blocked by the publisher. Options: (A) pause this task until full text is provided, (B) proceed only if a supplement / regulatory review on disk contains the model equations and parameter tables, (C) skip this paper. Which applies?
Never attempt extraction from an abstract alone — population-PK parameter values, covariate effects, and equations are not in an abstract.
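The full-text sanity check can be sketched as a shell gate; the ~15 KB threshold is from the text above, and the filenames are illustrative:

```shell
# Sketch: a trimmed .md passes only if it is >= ~15 KB AND contains a real
# Methods heading (not just front matter).
full_text_ok() {
  f="$1"
  [ "$(wc -c < "$f" | tr -d ' ')" -ge 15360 ] || return 1          # ~15 KB size gate
  grep -qiE '^#{0,4} *(materials and )?methods' "$f" || return 1   # Methods section present
}

printf '## Abstract\none paragraph only\n' > /tmp/abstract_only_trimmed.md
full_text_ok /tmp/abstract_only_trimmed.md || echo "abstract-only: sidecar-ask"
```

An abstract-only trim fails both gates and routes to the sidecar question above rather than to extraction.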
6. Detect upstream-popPK dependencies. Skim Methods for phrases like "PK was described using the popPK model previously developed from <study/phase>", "the structural PK model was fixed from <citation>", "covariate effects were carried over from <author> et al.", or "the PK model from <citation> was used as a backbone." If the current paper's PD model fixes its PK parameters from a separate publication that is not on disk:
- Try to identify the upstream paper from the references list (PMID, DOI, or full citation).
- If identifiable: sidecar-ask whether the operator wants this task to take a depends_on: [<upstream-task>] dependency and pause until the upstream task completes, versus extracting only the PD layer with the upstream PK parameters reproduced inline.
- If unidentifiable (e.g. "the popPK model from internal Phase 1/2 studies" with no specific citation): sidecar-ask whether to (A) skip the task, (B) proceed with parameters fixed inline as reported in the current paper, with a clear "upstream PK source not located" note in the vignette Errata, or (C) defer pending operator investigation.
Never silently fabricate upstream PK parameters from training data.
7. Always search for supplementary information. Supplements frequently contain the NONMEM control stream and parameter tables that disambiguate model structure. If the user provided only a main article, ask whether a supplement exists and request it.
8. Always search for errata, corrigenda, or author corrections. Check the journal's landing page for the article, the publisher's "corrections" / "notices" feed, and a search like "<first author> <year> <drug>" erratum on PubMed and Google Scholar. Ask the user whether they are aware of any corrections if the source is paywalled or the search is inconclusive. When an erratum revises a value used in the model (parameter estimate, covariate effect, equation, units), the erratum value takes precedence over the main publication. If multiple errata exist, the most recent supersedes earlier ones. Record the erratum citation in the model file's reference metadata alongside the main paper, and in every in-file source-trace comment whose value comes from the erratum, point to the erratum (not the original table).
9. Verify parameters are final estimates, not initial estimates. Supplement control streams usually list initial values in $THETA / $OMEGA; final values come from the paper's results table or $TABLE output. If only a control stream is available, confirm values against any published point estimates before treating them as final.
10. Multiple-model handling.
- Base model + final model → extract only the final.
- Any other "multiple model" case (per-subpopulation, per-endpoint, sensitivity analyses) → list the candidates to the user and ask which to extract. Offer "one," "all," or "a subset."
11. Systematic review / meta-analysis handling. If the source is a systematic review or meta-analysis that catalogs other authors' published popPK / PD models without developing an original model of its own, the default action is to skip the task and queue the cited primary papers for future extraction. Do not extract a cataloged model from the review's summary table — extracting from a secondary source loses model-selection rationale, covariate-encoding details (reference categories, units, allometric exponents, scaling normalisations), and the full residual-error / IIV structure that only the primary papers contain. Recognise systematic reviews by:
- An explicit "Review" / "Systematic Review" tag on page 1 or in the article-type header.
- A Methods section describing a literature-search protocol (PubMed / EMBASE / Scopus / Web of Science search query, PRISMA-style screening flowchart, inclusion / exclusion criteria).
- A Results section that tabulates other authors' models in side-by-side tables (one row per cited study, columns for structural model / parameter estimates / covariates).
- No $THETA / $OMEGA / $SIGMA block, no original VPC / GOF figures, no original NONMEM control stream attributable to the review's authors.
Sidecar-ask the operator with the list of cited primary papers and three options:
<paper-name> is a systematic review of previously published popPK / PD models. Per the standing policy, the recommended action is to skip this task and queue the cited primary papers individually for future extraction. Cited primary models: <numbered list with first-author + year + drug + journal>. Confirm: (A) skip this task and queue all cited primary papers for future extraction (operator-followups register), (B) skip without queueing references (e.g. the references duplicate already-queued tasks), (C) extract one or more models from the review's tables (operator names which; the review becomes the transcription source with provenance noted prominently in vignette Errata).
Option (A) is the recommended default. The cited-papers list is committed in the report so the operator can add them to the queue in batch.
12. Confirm the target subdirectory under inst/modeldb/ (usually specificDrugs/; endogenous, therapeuticArea, pharmacokinetics, and pharmacodynamics are also valid).
Phase 2 — Sync with origin/main and branch
Do this before any file changes:
git fetch origin
git checkout -b <firstauthor>-<year>-<drug> origin/main
Local main may be stale. The regenerated data/modeldb.rda / inst/modeldb.qs2 must reflect current origin/main or they will clobber models added upstream. Never push directly to main; always open a PR.
(When this skill runs as a task under claude_runner, the runner's preamble will note that a fresh worktree has already been set up — verify with git branch --show-current and skip the git fetch / git checkout -b in that case. The runner provides the authoritative instructions for its own environment.)
Worktree resumption — handle pre-existing WIP. Before drafting anything, run git status -s in the worktree. If the working tree is not clean (uncommitted modifications or untracked files), this worktree is a resumption of a prior task instance that crashed or was interrupted:
- Read every modified / untracked file under inst/modeldb/specificDrugs/, inst/modeldb/<other-categories>/, vignettes/articles/, data/, inst/, and NEWS.md.
- Decide whether the prior WIP is salvageable (sound enough to continue from where it stopped) or unsalvageable (revert to clean state via git restore . and git clean -fd, then start over).
- If salvageable: continue from where the prior run left off and note the resumption in the PR body so the reviewer is aware.
- If unsalvageable: clean and restart. No operator approval is required for the cleanup itself; the reset is automatic for an unsalvageable WIP.
If the worktree's branch was already committed and pushed in a prior run (i.e. git log origin/<branch>..HEAD shows no new commits and the branch exists on origin with task content), that is a different case — the task was already completed previously. Sidecar-ask:
Worktree branch <branch> is already pushed with prior task content. Options: (A) verify the pushed branch and exit cleanly without re-extracting, (B) tear down and re-extract from scratch (operator may have new source files / new skill version). Which applies?
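The Phase 2 resumption logic can be sketched as a triage function that classifies the worktree into the three cases above; remote and branch names follow the conventions in this skill, and nothing here modifies the tree:

```shell
# Sketch: classify a worktree as WIP / already-pushed / clean before drafting.
triage_worktree() {
  br=$(git branch --show-current)
  if [ -n "$(git status -s)" ]; then
    echo "WIP present: review before salvaging or cleaning"
  elif git ls-remote --exit-code origin "$br" >/dev/null 2>&1 \
       && [ -z "$(git log "origin/$br..HEAD" 2>/dev/null)" ]; then
    echo "already pushed: sidecar-ask before re-extracting"
  else
    echo "clean: proceed with extraction"
  fi
}
```

The "already pushed" branch is the sidecar case; "WIP present" still requires the salvageable-vs-unsalvageable judgment described above.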
Phase 3 — Model file
File path: inst/modeldb/<category>/<FirstAuthor>_<Year>_<drug>.R.
If the chosen <FirstAuthor>_<Year>_<drug> name collides with an existing file (rare — e.g., two same-author/year/drug entries with different scenarios), append a lowercase letter to the year: Author_2019a_drug.R, Author_2019b_drug.R. Use the same year-letter for both files in the pair so the chronological ordering is preserved.
The function name must equal the filename minus .R; buildModelDb() enforces this.
Use references/model-file-template.md as the starting skeleton and the two best-formed existing models as anchors:
inst/modeldb/specificDrugs/Clegg_2024_nirsevimab.R — covariates, maturation, correlated IIV, exported race-derivation helper.
inst/modeldb/specificDrugs/Hu_2026_clesrovimab.R — simpler case with Hill-type maturation.
The file body has this shape:
description, reference, vignette, units, covariateData, population — metadata before ini(). vignette is the basename of the validation vignette in vignettes/articles/ (e.g., "Clegg_2024_nirsevimab", no path, no extension); buildModelDb() extracts it so the list-of-models table can link to the rendered vignette on the pkgdown site.
ini() — parameters with label() and a trailing in-file comment pointing to the source location for every value. Wrap fixed parameters in fixed() — see the "Fixed parameters" subsection below.
model() — derived terms → individual parameters → micro-constants → ODEs → bioavailability → observation and error.
Fixed parameters in ini()
When the source paper holds a parameter at a known value rather than estimating it, encode that fact by wrapping the value in fixed() in ini(). This applies to every parameter type — structural THETAs, allometric exponents, IIV variances/covariances, residual-error magnitudes, covariate-effect coefficients — not just IIV. Failing to encode the fixed status loses load-bearing provenance: a downstream user cannot tell whether the value was estimated and reported as a point estimate, or whether the source authors held it constant; re-fitting the model under one assumption vs the other gives different results.
Recognise fixed parameters from any of these signals in the source:
- Explicit prose: "fixed during estimation", "fixed at ", "held fixed at the literature value", "not estimated", "set to 1 (fixed)".
- Allometric exponents reported without uncertainty (no SE / RSE / CI), especially the canonical 0.75 (CL-like) and 1 (V-like) values; if the paper says "allometric scaling with fixed exponents", encode those exponents as fixed().
- NONMEM $THETA lines with a FIX flag ($THETA (0.225 FIX) ; CL).
- NONMEM $OMEGA / $SIGMA lines with a FIX flag ($OMEGA 0.0 FIX for an off-diagonal that was structurally constrained to zero; $SIGMA 1.0 FIX for log-transform-both-sides residual where the variance is absorbed into the additive term).
- Bioavailability F1 = 1 / lfdepot <- log(1) set as a structural anchor when the paper writes "F was fixed to 1 because absolute bioavailability was not identifiable".
- Maturation reference values (e.g., reference WT = 70 kg, reference PMA = 40 weeks) — these are model-structural constants and are always fixed; encode them as numeric literals inside model() or as fixed() parameters in ini() if they appear in label()-worthy form.
- Carry-over from upstream: when the current paper inherits a structural-PK model from a prior publication and re-estimates only a subset, the inherited parameters are fixed; wrap them in fixed() and put the upstream citation in the trailing comment.
Encoding examples (illustrative patterns — a real ini() uses one form per parameter, so the repeated lcl / etalcl lines below are alternatives, not a literal block):
ini({
  lcl <- log(0.225) ; label("Clearance (L/h)")      # estimated THETA
  lcl <- fixed(log(2)) ; label("Clearance (L/h)")   # the same THETA held fixed
  e_wt_cl <- fixed(0.75) ; label("Allometric exponent on CL")
  lfdepot <- fixed(log(1)) ; label("Bioavailability (depot)")
  etalcl ~ 0.32                                     # estimated IIV variance
  etalvc ~ fixed(0.18)                              # fixed IIV variance
  etalcl + etalvc ~ c(0.32, fixed(0), 0.18)         # block with the off-diagonal fixed to 0
  CcaddSd <- fixed(0.10) ; label("Additive residual SD on log-Cc (LTBS)")
})
When in doubt — for example, a $THETA reported with an RSE of 0% but no FIX flag, or a parameter reported to three decimal places with no uncertainty — sidecar-ask the operator before guessing. Mis-classifying an estimated parameter as fixed (or vice versa) is a real error that downstream users will hit.
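When a NONMEM control stream is on disk, a quick scan for FIX flags helps ensure no fixed parameter is missed during transcription. The control-stream text below is illustrative:

```shell
# Sketch: surface every FIX flag in a supplement control stream, with line numbers.
cat > /tmp/run001.ctl <<'EOF'
$THETA (0.225 FIX) ; CL
$THETA (0, 3.5)    ; V
$OMEGA 0.0 FIX
EOF
grep -nE '\bFIX\b' /tmp/run001.ctl
```

Each hit should end up as a fixed() wrapper in ini(); each estimated line (here, V) stays unwrapped.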
Follow references/naming-conventions.md strictly:
- Structural PK parameters log-transformed: lka, lcl, lvc, lvp, lvp2, lq, lq2, lfdepot.
- IIV: eta + transformed name, e.g., etalcl (not etacl). Block correlations via etalcl + etalvc ~ c(var, cov, var).
- Residual error: propSd, addSd. Multi-output: CcpropSd, tumorSizeaddSd, etc.
- Compartments: depot, central, peripheral1, peripheral2, effect. Observation: Cc.
Covariate columns come from inst/references/covariate-columns.md. Before writing any covariate into the file:
- If the canonical name exists, use it and record the source column name in covariateData[[name]]$source_name.
- If the source name is an alias of an existing canonical name (e.g., source uses SEXM, canonical is SEXF), use the canonical name, note the required value transformation (SEXF = 1 - SEXM), and ask the user to confirm the effect-coefficient sign and reference-category implications before committing.
- If the concept isn't in the register at all, propose a new entry (canonical name, description, units, type, reference category, source aliases) and ask the user to confirm before adding it. The new entry is committed alongside the model.
- Do not add a change-log / history / "## Change log" / "## Summary" section or per-extraction history line to inst/references/covariate-columns.md. The H3 entry itself is the authoritative record; chronological history of when an entry was added or modified is read from git log. Any context that someone needs to use the covariate (derivation rules, scope-promotion rationale, name-collision history, naming-decision sidecars) belongs in the H3 entry's Description / Notes / Source aliases — not in a separate change log.
population uses the extensible schema in naming-conventions.md. Common fields: n_subjects, n_studies, age_range, weight_range, sex_female_pct, race_ethnicity, disease_state, dose_range, regions, notes. Add any additional keys the paper describes (e.g., ga_range, renal_function, co_medication) — do not force facts into the common schema.
Phase 4 — Verification (re-read the source)
After the first pass, re-read the source independently and walk through references/verification-checklist.md. Common pitfalls:
- NONMEM THETA log-vs-linear reporting and omega² = log(CV² + 1) for log-normal variance.
- NONMEM "additive on log-scale" ≡ proportional in nlmixr2's linear space.
- Reference weight / age for allometric and maturation terms (70 kg? 5 kg? 40 weeks PMA?).
- Reference category for categorical effects after composite-group renames.
- Bioavailability target compartment.
- Units consistency (dose × F ÷ V must give the declared concentration units).
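As a quick numeric check of the log-normal conversion in the first pitfall above:

```shell
# Worked check of omega^2 = log(CV^2 + 1): a reported 30% CV for a log-normal
# parameter corresponds to ~0.0862 on the log-variance scale.
awk 'BEGIN { cv = 0.30; printf "%.4f\n", log(cv * cv + 1) }'
# -> 0.0862
```

Papers that instead report omega² directly, or report CV% computed as sqrt(omega²)×100, will not round-trip through this formula — that mismatch is itself a signal to re-read the table footnotes.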
Every parameter's in-file source-trace comment must be verified — the comment states where it came from, so checking is a line-by-line audit.
When any uncertainty remains, ask the user using this format:
Ambiguity at [source location]. Two plausible interpretations: (A) …, (B) …. Which applies?
Never silently resolve ambiguity. Never tune parameter values to match a validation target.
Missing-parameter / author-correspondence pathway. If a parameter you need to populate the model is not present anywhere on disk (not in the main paper, not in any supplement, not in any associated regulatory review), do not substitute a training-data value or a "typical" textbook value. Sidecar-ask:
Parameter <name> (e.g. kdeg, V_DXd) is required for the <full-TMDD / payload-PK / structural> model but is not reported in any source on disk for <paper>. Options: (A) draft an author-correspondence email and pause this task pending reply (operator handles the email), (B) approximate with <QSS / steady-state / fixed-from-class> and document the approximation in vignette Errata, (C) skip this paper. Which applies?
Operator-followup tracking: unresolved-parameter cases are recorded in tracking/operator_followups.md (the F1, F2, ... numbered register) so the operator can batch-send author emails. When a reply arrives with a numeric value, the operator records it in the followups file; this skill does NOT email authors.
Non-paper provenance — annotate inline. When a parameter value did not come from the paper's text or tables — e.g. the operator read it off a graphical figure, an author supplied it via email correspondence, or it was carried from an upstream-task model file — record the provenance as an inline comment on the parameter line so the source-trace is unambiguous:
ini({
  lcl <- log(0.225) ; label("Clearance (L/day)")                     # paper, Table <n>
  lkdeg <- log(0.0231) ; label("Receptor degradation rate (1/day)")  # author email, <date>
  lvdxd <- log(0.038) ; label("DXd payload volume (L/kg)")           # figure-derived, Figure <n>
})
This is mandatory for any parameter not from the paper text/tables. The vignette's Assumptions and deviations / Errata section must also list the non-paper provenance (see references/vignette-template.md).
Phase 5 — Validation vignette
File path: vignettes/articles/<FirstAuthor>_<Year>_<drug>.Rmd, with matching VignetteIndexEntry. Drug-specific vignettes live under vignettes/articles/ so pkgdown builds them for the site but CRAN does not — .Rbuildignore excludes that directory. The basename (without .Rmd) must match the vignette <- "..." field in the model file. Only the legacy PK_2cmt_mAb_Davda_2014.Rmd remains at top-level vignettes/.
Use references/vignette-template.md. Required sections, in order:
- Header and setup — libraries include nlmixr2lib, PKNCA, rxode2, dplyr, ggplot2.
- Population — narrative reproducing the population metadata; cite the source table listing baseline demographics.
- Source trace — a dedicated table listing the source location (page / table / equation / figure) for every model equation and every ini() parameter. This is in addition to the in-file comments; the vignette gives reviewers a single place to audit provenance.
- Virtual cohort — covariate distributions match the population metadata. Use WHO weight-for-age curves for pediatric models.
- Simulation — rxode2::rxSolve(mod, events) for stochastic VPCs; rxode2::zeroRe() + rxSolve for typical-value replications.
- Replicate published figures — one code chunk per figure, caption linking to the source figure number ("Replicates Figure 4 of <citation>").
- PKNCA validation — required; no inline trapezoidal NCA. See references/pknca-recipes.md. The PKNCA formula must include a treatment grouping variable (conc ~ time | id/treatment) so per-group results can be compared against the paper.
- Comparison against published NCA — if the source paper reports Cmax / Tmax / AUC / half-life, render a side-by-side comparison table. Flag differences > 20% in the narrative and investigate the source — do not tune.
- Assumptions and deviations — explicit list of what you had to assume because the paper didn't say (race distribution, z-score stability, etc.).
For endogenous / turnover models where NCA isn't the right validation, replace the PKNCA section with the steady-state / perturbation-recovery / mass-balance checks described in references/endogenous-validation.md. For multi-output models, run one PKNCA block per output.
Endogenous and mechanistic models
Papers that describe endogenous turnover, steady-state balance, or mechanistic enzyme kinetics (e.g., Kim 2006 IgG FcRn recycling, Charbonneau 2021 phenylalanine) have a different shape than drug PK models:
- Parameters are mechanistic constants (Vmax, Km, kint, kcat, kpro, baseline concentrations bl_<species>, fractional-activity scalars like f_<enzyme>) rather than log-transformed CL/V.
- ini() usually has no IIV etas and no residual error — the model describes population-typical mechanism, not variability.
- model() has no dosing events; the state starts at a biological baseline (<state>(0) <- bl_<state> or <state>(0) <- css).
- Validation is not PKNCA. Use steady-state / perturbation-recovery / mass-balance checks. See references/endogenous-validation.md.
- Dimensional analysis is load-bearing. These models often mix mg/mL, mg/kg, L/kg, 1/day; a single unit slip silently corrupts the balance. Walk through every term in every ODE and the derived rates.
Naming conventions for mechanistic parameters are documented in references/naming-conventions.md under "Endogenous / mechanistic parameters."
Phase 6 — Registration, tests, docs, PR
1. Re-confirm the branch is on top of fresh origin/main (git fetch origin && git rebase origin/main if needed).
2. Run nlmixr2lib::buildModelDb() to regenerate data/modeldb.rda and inst/modeldb.qs2. Confirm the new model appears in modellib(). When verifying in R, do devtools::load_all(".") first so modellib() reads the worktree's in-development package, not the stale system install — see references/verification-checklist.md § "Verifying against the worktree's nlmixr2lib" for why a bare library(nlmixr2lib) can return a misleading FALSE.
3. Run nlmixr2lib::checkModelConventions(model = "<FirstAuthor>_<Year>_<drug>") and review the output. Any deviations from the canonical parameter / IIV / residual-error / covariate / compartment conventions (see references/naming-conventions.md and inst/references/covariate-columns.md) that the function flags should be either fixed in the model file before committing, or explicitly justified in the vignette's Assumptions and deviations section. buildModelDb() runs checkModelConventions() implicitly at package-build time, but running it explicitly on your new model makes drift visible before commit, not after CI. Paste the key lines of the output into the PR body so a reviewer can see what was checked.
4. Render the vignette locally and verify it completes without error in under 5 minutes of wall-clock time. This is a mandatory pre-push gate — the worker MUST run this command and verify exit 0 / HTML present before any git commit / git push of the new vignette. It catches the same failure modes pkgdown CI does — missing data columns, time-varying covariates assigned to rxEt objects that get silently dropped, PKNCA formulas referencing absent columns, simulation crashes — and surfaces them in seconds rather than after a CI cycle.
```sh
timeout 300 Rscript --vanilla -e "
pkgload::load_all('.', quiet = TRUE)
rmarkdown::render(
  'vignettes/articles/<FirstAuthor>_<Year>_<drug>.Rmd',
  quiet = FALSE
)
"
```
Interpret the result:
- Exit 0, output HTML present → vignette is clean; the gate is passed; proceed to commit.
- Exit non-zero with an R error → fix the vignette before committing. Read the traceback carefully; the most common causes are (a) a select() / PKNCA formula referencing a column that was never created (add it in the preceding mutate()), (b) a variable name used before it is defined (reorder chunks), (c) a simulation that errors because covariate values are out of range for the model, (d) event_table$col <- vector syntax against an rxEt object — rxode2 silently drops these assignments, so always materialize via as.data.frame() and add covariate columns there before passing to rxSolve().
- Exit 124 (timeout) → the vignette exceeds 5 minutes. Reduce simulation size: cut nSub (stochastic VPC subjects), shorten the observation grid, or move expensive chunks behind eval = FALSE with a note. Do not skip the time budget — pkgdown CI has strict wall-time limits, and a slow vignette breaks the build for everyone.
- C-level segfault (*** caught segfault ***) → broken R / rxode2 / nlmixr2 install, not a model-file problem. Stop and sidecar-ask the operator to fix the environment; do not work around it with --no-build-vignettes.
Do not interpret silence as success — re-run with quiet = FALSE and confirm an HTML output file landed next to the .Rmd. Do not assume "checkModelConventions was clean, vignette must be fine" — those are independent failure modes. The render gate is the only one that exercises the full data path the way pkgdown CI does.
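The exit-status triage above can be sketched in shell, using sleep as a stand-in for the Rscript call; the 1-second timeout here is illustrative only (the real gate uses 300 seconds):

```shell
# Stand-in for: timeout 300 Rscript --vanilla -e "...render..."
# 'sleep 5' plays the role of a too-slow render; timeout kills it
# after 1 second and returns exit status 124.
status=0
timeout 1 sleep 5 || status=$?

if [ "$status" -eq 124 ]; then
  echo "timeout: reduce simulation size before committing"
elif [ "$status" -ne 0 ]; then
  echo "render error: fix the vignette before committing"
else
  echo "gate passed: proceed to commit"
fi
```

The same three-way branch applies verbatim when the sleep is replaced by the real render command.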
- Verify there are no non-ASCII characters in the new model file or vignette (mandatory pre-push gate). R CMD check warns on non-ASCII strings in package data — and the description <- field of every model file is reified into data/modeldb.rda, so a single em dash, en dash, multiplication sign, or Greek letter in the description triggers "WARNING: found non-ASCII strings" and breaks the R-CMD-check matrix on every platform. Comments inside the model file do not trigger the data warning, but the same characters in vignette source can cause downstream encoding surprises (cross-platform, locale, knitr, pkgdown), so apply the gate to both files.
Run, replacing the placeholders:
```sh
for f in inst/modeldb/<category>/<FirstAuthor>_<Year>_<drug>.R \
         vignettes/articles/<FirstAuthor>_<Year>_<drug>.Rmd; do
  if LC_ALL=C grep -nP "[^\x00-\x7F]" "$f"; then
    echo "FAIL: non-ASCII in $f"; exit 1
  fi
done
echo "OK: all checked files are ASCII-only"
```
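As a sanity check on the pattern itself, independent of any repo files, the same character class fires on a literal string containing an em dash:

```shell
# U+2014 (em dash) is the three-byte UTF-8 sequence \342\200\224 (octal).
# Under LC_ALL=C the class [^\x00-\x7F] matches any byte outside ASCII,
# so each of those bytes triggers a match.
if printf 'CL \342\200\224 clearance\n' | LC_ALL=C grep -qP '[^\x00-\x7F]'; then
  echo "non-ASCII detected"
else
  echo "ASCII-only"
fi
```

Note that grep -P (Perl-compatible regexes) is a GNU grep extension; the gate assumes GNU grep, as the loop above already does.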
If anything matches, replace the offending characters with their ASCII equivalents before committing. Common substitutions that have triggered this warning in past extractions:
| Non-ASCII | ASCII replacement |
|---|---|
| — (em dash, U+2014) | -- |
| – (en dash, U+2013) | - |
| × (multiplication sign, U+00D7) | x (or * between numeric quantities) |
| · (middle dot, U+00B7) | * (multiplication) or . (decimal-style) |
| ≈ | ~= |
| → | -> |
| ∈ | in |
| ≤ | <= |
| ≡ | == |
| … | ... |
| ² | ^2 |
| µ (micro sign, U+00B5) | u (e.g., µg/mL becomes ug/mL) |
| Greek letters in prose (η, λ, etc.) | spell out (eta, lambda) |
| § (section sign) | Section |
The description string is the highest-impact site (it is what triggers the R-CMD-check warning), but be thorough — replace non-ASCII anywhere in the new model file and vignette so the gate stays clean even after edits.
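A few of the substitutions above can be scripted. The following is an illustrative stdin filter, not a complete or authoritative fixer; always review the resulting diff by hand before committing:

```shell
# Replace an em dash and an approximately-equal sign with their ASCII
# stand-ins from the table above. sed matches the literal multibyte
# sequences byte-for-byte; extend with more -e clauses as needed.
printf 'CL — clearance ≈ 5 L/h\n' \
  | sed -e 's/—/--/g' -e 's/≈/~=/g'
# prints: CL -- clearance ~= 5 L/h
```

Context-dependent characters (× and ·, which map to different replacements depending on position) are better fixed manually than scripted.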
- Run devtools::check(). Vignettes must build cleanly. A C-level segfault (*** caught segfault ***) during check() or vignette rendering is a red flag: it indicates a broken R / rxode2 / nlmixr2 install in the environment, not a model-file problem. Stop, sidecar-ask the operator to investigate and fix the environment, and do not work around it with --no-build-vignettes or similar flags.
- Add a short, single-line NEWS.md entry under the current development version. The goal is a scannable changelog — the model file and vignette already contain the full detail, so NEWS should mention only:
- the drug,
- a minimal reference (author + year + DOI link — not a full citation),
- and the population studied.
Format:
- Add <Author> <Year> <drug> ([doi:<doi>](https://doi.org/<doi>)) — <population phrase>.
Examples:
- Add Xu 2019 sarilumab ([doi:10.1007/s40262-019-00765-1](https://doi.org/10.1007/s40262-019-00765-1)) — adults with rheumatoid arthritis.
- Add Clegg 2024 nirsevimab ([doi:10.1007/s40262-024-01387-y](https://doi.org/10.1007/s40262-024-01387-y)) — preterm and term infants.
Do NOT include: covariate list, compartment count, IIV structure,
residual-error form, data origin, study counts, PKNCA sentence, or
anything else that lives in the model file's metadata or vignette. A
reviewer who wants those details clicks through to the model file.
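As a worked example, the NEWS line can be assembled from its parts; the values below are taken from the first example above:

```shell
# Compose the one-line NEWS entry from author/year/drug/DOI/population.
author="Xu"; year="2019"; drug="sarilumab"
doi="10.1007/s40262-019-00765-1"
pop="adults with rheumatoid arthritis"

# The '--' stops printf from treating the leading '- ' as an option.
printf -- '- Add %s %s %s ([doi:%s](https://doi.org/%s)) — %s.\n' \
  "$author" "$year" "$drug" "$doi" "$doi" "$pop"
```

This prints the Xu 2019 sarilumab line exactly as shown in the examples; append it under the current development version heading in NEWS.md.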
- Commit the model file, the vignette under vignettes/articles/, the regenerated modeldb.rda / modeldb.qs2 / modeldb.Rd, the NEWS.md entry, and any updates to inst/references/covariate-columns.md (if a new covariate was registered) together on the feature branch.
- Push the branch and open a PR against main using gh pr create with a title like Add <Author> <Year> <drug> model. Before pushing, confirm that both pre-push gates — the vignette render and the non-ASCII check — were run on the final state of the model and vignette files and both exited clean. If either file was edited after the last gate run, even for a typo fix or a cosmetic comment change, re-run both gates; do not push on the strength of an earlier check against an older copy.
Stop-and-ask triggers (consolidated)
Don't guess — ask the user when:
- The source has multiple non-hierarchical models and it's not obvious which to extract.
- Parameter values look like initial estimates rather than final.
- Covariate encoding isn't fully specified (reference category, units, transformation).
- A source column name is not in inst/references/covariate-columns.md (propose a new entry and confirm).
- A source column is an alias of an existing canonical name and the mapping involves value inversion or a reference-category flip.
- A parameter name deviates from the nlmixr2lib standard (propose the canonical name and confirm).
- PKNCA output disagrees with a published NCA table by more than ~20% after careful review.
- The source is paywalled and the user hasn't supplied the text.
- An erratum search is inconclusive (e.g., paywalled journal, ambiguous correction notice) — ask the user to confirm whether any corrections apply.
- The paper's final model was fit to animal-only data (see Phase 1 species-check step).
- The PMC XML / PDF on disk contains only the paper's abstract (see Phase 1 full-text-check step).
- The on-disk file's title / first-author / journal / year / drug disagrees with the task's Paper metadata block (see Phase 1 paper-identity-check step).
- The paper depends on an upstream popPK model that is not on disk — either the upstream paper is identifiable (queue dependency) or unidentifiable (operator decision needed). See Phase 1 upstream-popPK detection step.
- A required structural-model parameter is absent from every on-disk source (paper + supplements + regulatory review). Use the Phase 4 missing-parameter sidecar template; do not fall back to training data.
- The worktree at dispatch has a pre-existing pushed branch for this task ID (prior completed run). See Phase 2 worktree-resumption step.
- devtools::check() or vignette rendering produces a C-level segfault — the environment is broken; do not paper over with --no-build-vignettes or similar.
- The timeout 300 rmarkdown::render() gate exits 124 — the vignette exceeds 5 minutes; reduce simulation size before committing.
Use this fixed format for ambiguities:
Ambiguity at [source location]. Two plausible interpretations: (A) …, (B) …. Which applies?
A stop-and-ask trigger is not advisory: when one fires, stop work
on the current task and wait for the operator. Do not "document a
best guess and proceed" — silent best-guesses ship as bugs the
operator cannot retroactively correct without re-running the whole
extraction. If you find yourself thinking "I'll just pick the safest
option and move on," that itself is a stop-and-ask signal.
When running interactively, use AskUserQuestion and wait for the
answer. When running under a task runner, use the runner's
documented stop-and-ask protocol (the runner is responsible for the
mechanics — file paths, schema, status transitions, re-dispatch).
Either way: stop, ask, wait — do not guess past the trigger.
Refusal handling
If at any phase you find yourself producing a content-policy refusal in response to legitimate clinical-trial content (drug + dose + indication + adverse-event language is normal in pharmacometric papers and is not a safety concern), do not silently degrade the extraction. Sidecar-ask:
Phase step hit a content-policy refusal while reading <paper section / table>. The content is standard published clinical pharmacology (drug + dose + indication + AE language). Options: (A) rephrase the extraction prompt and continue, (B) skip the affected section and document the gap in vignette Errata, (C) defer the task pending operator review.
A refusal is operator-actionable signal, not a fatal error. Treat it the same as any other stop-and-ask trigger.
Constraints
- Never invent parameter values; if it's not in the source, ask.
- Never tune parameters to make a validation output match a target.
- This skill only adds new models. Retrofitting existing vignettes to use PKNCA, or renaming covariates in existing files, is a future separate skill.
- Never push directly to main. Open a PR.