Run any Skill in Manus with one click

bio-epidemiological-genomics-variant-surveillance

Stars943

Forks165

UpdatedMay 29, 2026 at 16:17

Assigns pathogen lineages (SARS-CoV-2 Pangolin via UShER mode; Nextclade clade + QC; pango-designation alias_key.json resolution) and tracks variant frequencies over time using Nextstrain (Augur + Auspice), wastewater deconvolution (Freyja, COJAC, alcov, lineagespot), lineage fitness modelling (Wenseleers / Bedford-Figgins multinomial logistic), and recombinant detection (3SEQ, RDP4, Bolotie). Covers Pangolin pangolin-data version pinning (mandatory for reproducibility), Nextclade dataset versioning (lineage-defining mutations change with dataset), Freyja barcode forward-only date constraint, ARTIC primer scheme version churn (V3 / V4 / V4.1 / V5.3.2 / Midnight 1200) with documented dropout regions, recombinant X-prefix Pango designation lag, GISAID vs INSDC dual-deposition tensions, and the Karthikeyan 2022 wastewater early-detection signal with explicit reproducibility caveats. Use when assigning Pango lineages and Nextclade clades to viral consensus sequences, building Nextstrain Augur surveillance pipelin

Installation

Install with Codex or Claude Copy this prompt, paste it into Codex, Claude, or another assistant, and let it review the skill page and install it for you.

Run Skill in Manus

Source

GPTomics

GPTomics/bioSkills

View GitHub Repository View Creator Repositories

Download

Run Skill in Manus

Related occupationsSOC

Based on SOC occupation classification

Software DevelopersComputer and Mathematical Occupations·SOC 15-1252

File Explorer

3 files

SKILL.md

readonly

Version Compatibility

Reference examples tested with: pangolin 4.3+ (pangolin-data 1.30+), nextclade 3.8+, augur 24.0+, freyja 1.4+, cojac 0.9+, lineagespot 1.6+ (Bioconductor), usher 0.6+, matUtils 0.6+, samtools 1.20+, lofreq 2.1+, ivar 1.4+, ARTIC pipeline 1.3+, snakemake 8.5+, pandas 2.2+, BioPython 1.84+, jq 1.7+.

Before using code patterns, verify installed versions match. If versions differ:

pangolin --all-versions -- prints pangolin + pangolin-data + scorpio + constellations versions
nextclade dataset list --tag latest sars-cov-2 -- list current dataset tags
freyja --version; freyja barcode-build --help (note: HYPHEN, not underscore; some legacy docs show barcode_build)
nextclade run --help -- v3+ syntax replaced v2; old nextclade invocation no longer works
augur --version; augur refine --help for current root-strategy flags

If pangolin --inference usher is rejected, the flag is --analysis-mode usher (no --inference). If nextclade --input-dataset DIR works, the installed version may be v2; v3 accepts both but --dataset NAME is the modern form for built-in datasets. Pangolin and Nextclade output column names differ between major releases -- introspect rather than retry.

Variant Surveillance

"Which lineages are circulating, and how fast are they growing?" -> Assign consensus or wastewater samples to a curated lineage / clade nomenclature, then track frequencies over time with explicit version pinning. The lineage assignment is NOT a stable property of the sequence; it is a property of the sequence interpreted by a specific pangolin-data / Nextclade-dataset version. Two labs running the same Pangolin binary with different pangolin-data versions can produce different calls on the same genome. For published or regulatory output, pin BOTH the executable AND the dataset version (pangolin --all-versions; nextclade dataset list --tag latest), and re-run the whole archive after every dataset update -- comparing today's BA.2.86 call to last month's "Unassigned" call is invalid.

CLI: pangolin sequences.fasta --analysis-mode usher --outfile lineage_report.csv -- UShER mode is the default since v4 (pangoLEARN deprecated mid-2023)
CLI: nextclade run --input-dataset nc_dataset/sars-cov-2 --output-tsv nc.tsv sequences.fasta -- clade + Pango + QC + mutations
CLI: freyja variants sample.bam --variants sample.variants.tsv --depths sample.depths.tsv --ref reference.fa then freyja demix sample.variants.tsv sample.depths.tsv --output sample.demix.tsv -- wastewater lineage deconvolution
CLI: augur refine --tree tree.nwk --alignment aln.fasta --metadata meta.tsv --output-tree refined.nwk --root oldest --timetree -- Nextstrain time-scaling

The Single Most Important Modern Insight -- Lineage assignment is dataset-version-dependent

A SARS-CoV-2 sequence called BA.5 today might be called BA.5.2.1 next week and KP.3 a month after that. pangolin-data and nextclade-dataset are updated weekly; lineage definitions evolve through pango-designation GitHub issues, often days-to-weeks before pangolin-data releases include the lineage. During the lag window, the same genome submitted in lab A (older pangolin-data) and lab B (current) gets different calls. The cross-lab "different lineage" result is then misread as biology. For any report, pin BOTH the executable AND the dataset version with pangolin --all-versions and nextclade dataset list --tag latest recorded alongside the call. For longitudinal studies, re-run the WHOLE archive after every dataset update -- comparing today's BA.2.86 call against last month's "Unassigned" call is invalid. Second-order insight: Pangolin's pangoLEARN mode was officially deprecated mid-2023 in favour of UShER mode (Pongmoragot 2024 Virus Evol 10:vead085); cross-study comparison of XBB sub-lineage prevalence from 2022 - mid-2023 is contaminated by the pangoLEARN -> UShER mode switch even when the same pangolin-data version is used.

Algorithmic Taxonomy

Tool	Mechanism	Inputs	Output	Strength	Fails when
Pangolin UShER mode (O'Toole 2021 Virus Evol 7:veab064; Pongmoragot 2024 Virus Evol 10:vead085)	Parsimony placement on daily-updated UShER mutation-annotated tree	SARS-CoV-2 consensus	Pango lineage call	UShER is the default since v4; more accurate than pangoLEARN for recent / divergent lineages	Designation lag for emerging lineages; recombinants require manual Pango-X designation
Pangolin pangoLEARN mode	Random-forest classifier trained on pangolin-data	SARS-CoV-2 consensus	Pango lineage call	Fast	DEPRECATED mid-2023; less accurate than UShER for novel sub-lineages
Nextclade (Aksamentov 2021 JOSS 6:3773)	Reference-tree placement + clade assignment + mutation calling + QC	Viral consensus (multi-pathogen)	Clade + Pango + QC + mutations	Integrated alignment QC; mutation outliers; recombination indicators	Dataset version drift changes lineage-defining mutations
Nextstrain Augur (Huddleston 2021 JOSS 6:2906)	Python CLI for subsampling + alignment + tree + ancestral-trait + time-tree	Genomes + metadata + sampling config	Auspice JSON for visualization	End-to-end pipeline for curated surveillance builds	Subsampling configuration drives results more than data; nextstrain.org subsamples ~3000-5000 of millions
UShER + matUtils + matOptimize + RIPPLES (Turakhia 2021 Nat Genet 53:809)	Parsimony placement on daily MAT; SPR refinement; recombination detection	New consensus + existing MAT	Updated MAT, subtrees, recombinant calls	Pandemic-scale (millions of genomes)	Parsimony branch lengths systematically shorter than ML; re-estimate before downstream R_e
Freyja (Karthikeyan 2022 Nature 609:101)	Depth-weighted LAD regression on barcode-matrix mutation frequencies	Wastewater BAM + barcode	Per-lineage abundance	Recovers expected abundances down to ~5%; quantitative	Lineages absent from barcode invisible; barcode is forward-only -- cannot deconvolve lineages designated after barcode date
COJAC (Jahn 2022 Nat Microbiol 7:1151)	Co-occurrence of signature mutations on the same read pair	Wastewater BAM	Per-lineage presence / absence	More robust than per-site frequencies; detected Alpha 13 days before clinical	Single-read amplicons (no co-occurrence) cannot resolve; requires paired-end or long-read
alcov	Lineage deconvolution similar paradigm to Freyja	Wastewater BAM	Per-lineage abundance	Alternative to Freyja	Less benchmarked
lineagespot (Pechlivanis 2022 Sci Rep 12:2659)	R/Bioconductor lineage deconvolution from VCF + signature mutations	VCF + reference lineage mutations	Per-lineage abundance	R / Bioconductor integration	Less ML-driven; smaller community
Wenseleers / Bedford-Figgins multinomial logistic (Abousamra, Figgins, Bedford 2024 PLoS Comput Biol 20:e1012443)	Multinomial logistic regression on lineage frequencies over time	Lineage frequencies + dates	Growth advantage per lineage with 95% CI	Standard for outbreak.info / cov-lineages.org	Marginal CI for one lineage hides covariance with all others; early estimates inflated
3SEQ (Boni 2007 Genetics 176:1035)	Triplet-based recombination detection	Aligned sequences	Recombinant candidates	General-purpose	High false-positive rate at low divergence
RDP4 / RDP5 (Martin 2015 Virus Evol 1:vev003)	Multiple-method recombination detection	Aligned sequences	Recombinant candidates	Multi-method consensus	Slow; parameter-sensitive
Bolotie (Varabyou 2021 Bioinformatics 37:2298)	SARS-CoV-2-specific recombination detection	SARS-CoV-2 consensus	Recombinant candidates	Tuned for SARS-CoV-2 sub-lineage divergence	Specialist tool

Decision Tree by Scenario

Scenario	Recommended	Why wrong choices fail
Assign lineage to a SARS-CoV-2 consensus	Pangolin with `--analysis-mode usher` (UShER default since v4) + Nextclade cross-check; pin pangolin-data and Nextclade dataset versions	pangoLEARN alone (deprecated); not pinning version (cross-lab calls diverge)
Detect emerging variants in wastewater	COJAC for early detection (co-occurrence on amplicon) + Freyja for quantitative tracking; pin Freyja barcode version	Naive site-frequency aggregation; comparing across barcode versions
Track lineage frequencies over time	Multinomial logistic regression (Wenseleers / Bedford-Figgins) OR Bayesian renewal equation; report covariance among lineages	Plotting raw counts without CI; reporting single-lineage growth advantage without covariance
Build a regional surveillance phylogeny	Nextstrain Augur pipeline; subsample to manageable size; TreeTime for dates; document subsampling	BEAST on raw 10k+ samples (intractable); not documenting subsampling
Compare wastewater results across labs	Same primer scheme + same Freyja barcode + same Pangolin / Nextclade version	Mixing primer schemes; mixing barcode versions; comparing across pangolin-data versions
QC a new SARS-CoV-2 genome	Nextclade (alignment QC; mutation outliers; recombination indicators)	Pangolin alone (passes confidently on bad genomes)
Detect recombinant lineages	Trust Pango-designation X-prefix assignments; for novel candidates use RDP5 / 3SEQ / Bolotie + manual review	Manual eyeballing of mutation patterns; ignoring designation lag
Phylogenetic context for outbreak	UShER + matUtils subtree extraction; re-estimate branch lengths via TreeTime for downstream R_e	Re-treeing from scratch every time
Estimate vaccine-escape risk	Lab assays (neutralisation, escape mutants) + structural prediction; genomic surveillance flags candidates	Pure genomic prediction without lab validation
Wastewater-to-cases conversion	Variant-specific shedding rate (Omicron BA.1 shed less per case than Delta); pin barcode + report uncertainty	Fixed RNA-to-cases ratio across variants is wrong; variant-specific shedding has been documented in the wastewater literature

Methodology evolves; before any high-stakes lineage report, verify Pangolin's current default analysis-mode and Nextclade's bundled dataset against pango-designation issues for any emerging lineage.

Pangolin Lineage Assignment With Version Pinning

Goal: Assign Pango lineages to SARS-CoV-2 consensus sequences using UShER mode (the default since v4; pangoLEARN deprecated mid-2023), with full pangolin-data version provenance preserved for reproducibility.

Approach: Always pass --analysis-mode usher; record pangolin --all-versions output alongside every lineage call; for published or regulatory output, pin pangolin-data to a specific release tag and re-run the whole archive whenever the version is updated.

pangolin sequences.fasta --analysis-mode usher --outfile lineage_report.csv
pangolin --all-versions > pangolin_versions.txt

pangolin --all-versions prints: pangolin executable version, pangolin-data version (weekly updated; mandatory pin for reproducibility), scorpio version, and constellations version. All four are version-sensitive; in published surveillance reports, pin all four.

Nextclade With Dataset Pinning

Goal: Assign Nextstrain clade, Pango lineage, mutations, and QC flags to SARS-CoV-2 consensus sequences with explicit dataset version provenance.

Approach: Fetch the current dataset with nextclade dataset get --name sars-cov-2 --output-dir nc_dataset/sars-cov-2; record the pathogen.json tag / commit hash; run nextclade run --input-dataset on the pre-downloaded folder so the dataset version is locked in for the analysis.

nextclade dataset get --name sars-cov-2 --output-dir nc_dataset/sars-cov-2
NC_DATASET_TAG=$(jq -r '.tag // .version' nc_dataset/sars-cov-2/pathogen.json)

nextclade run \
    --input-dataset nc_dataset/sars-cov-2 \
    --output-tsv nextclade.tsv \
    --output-json nextclade.json \
    sequences.fasta

echo "nextclade_dataset_tag: ${NC_DATASET_TAG}" > nextclade.metadata

Different dataset versions assign different mutations as "lineage-defining" because internal-node placement can shift as the tree grows. Cross-version comparison of mutation reports is therefore method-dependent.

Wastewater Lineage Deconvolution With Freyja

Goal: Estimate per-lineage abundance in a wastewater pooled sample with explicit handling of the barcode forward-only date constraint, primer-scheme awareness, and residual mass interpretation.

Approach: Confirm barcode date postdates sample collection; if not, freyja barcode-build from the current UShER tree; variant call with freyja variants then deconvolve with freyja demix; inspect the resid column (residual mass NOT assigned to known lineages; high resid indicates a novel lineage is invisible); apply primer-scheme-aware coverage masking; report variant-specific uncertainty.

freyja update

freyja variants \
    sample.bam \
    --variants sample.variants.tsv \
    --depths sample.depths.tsv \
    --ref reference.fa

freyja demix \
    sample.variants.tsv \
    sample.depths.tsv \
    --output sample.demix.tsv

The Freyja --barcodes (or default bundled) date MUST postdate the sample collection date. Lineages designated after the barcode date cannot be detected -- the demixing silently fails and presents as elevated abundance of the closest parent lineage. For samples potentially containing emerging lineages, regenerate barcodes:

freyja barcode-build \
    --pb-and-meta usher_tree.pb \
    --output-dir custom_barcodes

Subsequent methodological extensions to Karthikeyan have appeared in the wastewater literature, and recent benchmarks comparing the major deconvolution tools (Freyja, COJAC, alcov, lineagespot, LCS) confirm Freyja and COJAC consistently perform well, with performance degrading at low coverage and for divergent lineages.

COJAC for Co-Occurrence Detection

Goal: Detect emerging variants in wastewater earlier than per-site frequency methods by requiring co-occurrence of two signature mutations on the same amplicon (read pair).

Approach: COJAC checks read pairs for joint occurrence of variant-defining mutations; the inferential leap is more robust because a single site can have shared mutations across lineages, but two signature mutations on the same read pair strongly imply a single lineage. Detected Alpha 13 days before clinical samples in Swiss data (Jahn 2022 Nat Microbiol 7:1151).

cojac cooc-mutbamscan \
    -a primer_scheme.bed \
    -m variants_definitions.yaml \
    -b sample.bam \
    -o sample.cooc.tsv

Nextstrain Augur Pipeline

Goal: Build a curated regional surveillance phylogeny with subsampling, alignment, tree, ancestral-trait inference, and time-scaling -- in Auspice-visualisable format. The Nextstrain platform was introduced by Hadfield 2018 Bioinformatics 34:4121; Augur is the Python CLI (Huddleston 2021 JOSS 6:2906).

Approach: Pull the latest official pathogen build from github.com/nextstrain/; subsample to manageable size (typically 3000-5000 genomes per global build; smaller regional); document subsampling configuration explicitly (it drives the result more than the underlying data per Hodcroft 2021).

augur align --sequences seqs.fasta --reference-sequence ref.gb --output aligned.fasta
augur tree --alignment aligned.fasta --output tree.nwk
augur refine \
    --tree tree.nwk \
    --alignment aligned.fasta \
    --metadata meta.tsv \
    --output-tree refined.nwk \
    --output-node-data branch_lengths.json \
    --timetree \
    --root oldest \
    --coalescent skyline
augur ancestral --tree refined.nwk --alignment aligned.fasta --output-node-data nt_muts.json
augur translate --tree refined.nwk --ancestral-sequences nt_muts.json --reference-sequence ref.gb --output-node-data aa_muts.json
augur traits --tree refined.nwk --metadata meta.tsv --columns country region --output-node-data traits.json
augur export v2 \
    --tree refined.nwk \
    --metadata meta.tsv \
    --node-data branch_lengths.json nt_muts.json aa_muts.json traits.json \
    --output auspice.json

Hodcroft 2021 Nature 591:30 documented that Nextstrain subsampling configurations drive lineage-time estimates more than the underlying data. Two researchers using the official pipeline with different subsampling can get different MRCA dates and migration patterns from the same raw genomes.

Per-Method Failure Modes

pangolin-data version skew between labs

Trigger: Two labs submit the same consensus genome to Pangolin with different pangolin-data versions; the lineage call differs.

Mechanism: Lineage designation happens through pango-designation GitHub issues -- days-to-weeks before pangolin-data releases include the lineage. During the lag, the same genome is callable as the parent (older pangolin-data) or the child (current). pangolin-data is updated weekly.

Symptom: Cross-lab lineage prevalence comparisons over time show implausible jumps coinciding with pangolin-data release dates rather than biology.

Fix: Pin pangolin-data version explicitly with pangolin --all-versions recorded alongside every call. For published or regulatory output, re-run the WHOLE archive against a single pangolin-data version before reporting.

Freyja barcode predates the sample collection date

Trigger: Wastewater sample collected after a new lineage was designated; Freyja barcode built before that designation.

Mechanism: Freyja barcodes are built from the UShER tree at a specific date; lineages designated AFTER the barcode date cannot be detected. The demixing silently fails -- the new lineage's signal is misassigned to its closest parent.

Symptom: Wastewater sample shows implausibly high abundance of a single parent lineage; new lineage that should be present is reported as 0%.

Fix: Run freyja update regularly; for samples potentially containing emerging lineages, regenerate barcodes with freyja barcode-build from the current UShER tree. Report resid (residual mass not assigned to known lineages); high resid indicates a novel lineage is being missed.

ARTIC primer dropout misread as deletion

Trigger: SARS-CoV-2 surveillance using ARTIC V4.1 amplicons; new variant has mutation at primer site; amplicons 64 / 76 / 88-90 silently drop out.

Mechanism: When a primer fails to bind, the amplicon doesn't amplify; consensus calling produces N's or reference-derived calls in that region. This LOOKS LIKE a deletion in downstream analysis but is actually missing data. Itokawa 2020 PLoS ONE 15:e0239403 documented primer interactions specifically.

Symptom: "Deletion" calls cluster in known dropout amplicons; Pangolin / Nextclade lineage call shifts when masked positions are filled with reference.

Fix: Inspect per-amplicon coverage with samtools depth -aa; mask consensus positions in dropped amplicons (use Ns -- Pangolin and Nextclade handle Ns gracefully). Document primer scheme version (V3 / V4 / V4.1 / V5.3.2 / Midnight) per isolate.

Recombinant assigned to one parent lineage

Trigger: A SARS-CoV-2 recombinant (e.g., XEC = KS.1.1 x KP.3.3) emerges; pango-designation has not yet issued the X-prefix designation; Pangolin assigns to one of the parents.

Mechanism: Pangolin in either mode assigns a recombinant to one parent lineage if no Pango-X designation exists yet. Identifying recombinants requires breakpoint detection (3SEQ, Bolotie, RDP4) and manual designation through pango-designation; the designation can lag emergence by weeks-to-months for novel recombinants.

Symptom: Outbreak interpretation conflates a recombinant lineage with its parent; transmissibility / immune-escape claims are wrong.

Fix: For any candidate emerging lineage with unusual mutations, run Bolotie or 3SEQ for recombination detection; cross-check Pangolin vs Nextclade lineage call; submit candidate recombinants to cov-lineages issue tracker if novel.

pangoLEARN result reported as authoritative

Trigger: Pangolin run with --analysis-mode pangolearn (or via legacy Docker image that defaults to pangoLEARN); user reports the call.

Mechanism: Pongmoragot 2024 Virus Evol 10:vead085 demonstrated UShER mode is significantly more accurate for recent / divergent lineages. pangoLEARN was officially deprecated mid-2023.

Symptom: Cross-lab comparison reveals one lab using pangoLEARN (legacy) and another using UShER; calls differ at borderline lineages.

Fix: Switch to --analysis-mode usher (default since v4). For longitudinal datasets crossing the mid-2023 mode-switch, re-run the historical archive against UShER mode.

Freyja barcode-build vs barcode_build flag

Trigger: Script written from older Freyja documentation using freyja barcode_build (underscore).

Mechanism: Current Freyja versions use barcode-build (hyphen); the underscore form may not be recognised.

Symptom: Subprocess fails with "unrecognized command".

Fix: Use freyja barcode-build (hyphen). Verify with freyja --help.

Wenseleers / Bedford lineage-growth CI hides covariance

Trigger: Reporting a single lineage's growth advantage 95% CI from a multinomial logistic regression.

Mechanism: The CI for any single lineage is conditional on all other lineages being held at their estimated growth rates; the marginal CI hides covariance among lineages. Early growth-advantage estimates are systematically too large; they shrink as more time passes (alternative explanations become identifiable).

Symptom: Initial published growth advantage > later refined estimate; "outlier-fast" lineages later moderated.

Fix: Report the full multinomial covariance matrix or at minimum the rank-ordered growth advantages with simultaneous CIs. Cite Abousamra 2024 PLoS Comput Biol 20:e1012443.

Nextstrain subsampling drives the result

Trigger: Nextstrain Augur build with default subsampling at 3000-5000 genomes from millions; user interprets the tree topology as authoritative.

Mechanism: Hodcroft 2021 Nature 591:30 commented that subsampling decisions drive lineage-time estimates more than the underlying data. Two researchers using the official Nextstrain pipeline with different subsampling configurations get different MRCA dates and migration patterns from the same raw genomes.

Symptom: Published Nextstrain tree differs from another analysis on the same raw data; conclusions sensitive to subsampling.

Fix: Document subsampling configuration explicitly in any Nextstrain build; run sensitivity analysis with alternative subsampling; treat MRCA dates and migration calls with appropriate uncertainty.

Reconciliation: When Methods Disagree

Pattern	Likely cause	Action
Pangolin "BA.2.86", Nextclade clade "23I"	Equivalent at different resolutions -- BA.2.86 is within 23I	Report both; Pango lineage for sub-clade resolution
Pangolin "BA.5.2", Nextclade "Unassigned"	Nextclade dataset older than pangolin-data; OR Nextclade QC failed	Update Nextclade dataset; re-run; inspect QC fields
Pangolin UShER and pangoLEARN disagree	pangoLEARN is the deprecated decision-tree classifier	Trust UShER call
Freyja shows 0% of expected lineage	Lineage absent from current barcode (post-dates barcode)	Rebuild barcodes (`freyja barcode-build`); confirm lineage is in the UShER tree the barcode is built from
Freyja confidence < 0.7 on dominant lineage	Sub-100x coverage OR amplicon dropout	Inspect per-amplicon coverage; consider re-sequencing; report as indeterminate
Nextclade and Pangolin disagree on recombinant	Recombinants inherently ambiguous; depends on which parent's SNPs dominate	Report as recombinant candidate; submit to cov-lineages if novel
Two consecutive pangolin-data releases call the same consensus differently	Lineage definitions revised between releases	Pin pangolin-data; record version + date alongside lineage
COJAC detects a variant Freyja does not	COJAC's co-occurrence requirement more sensitive at low abundance	Trust COJAC for early detection; Freyja for quantitative tracking
Wastewater Freyja result conflicts with clinical lineage prevalence	Barcode staleness; primer dropout in wastewater; faecal shedding rate varies by variant	Update barcode; check per-amplicon coverage; flag variant-specific shedding

Quantitative Thresholds

Quantity	Threshold	Source / rationale
Pangolin min coverage for lineage call	>=50% genome coverage (~14kb)	Pangolin convention
Nextclade QC stop-codon threshold	Per dataset; check `pathogen.json`	Nextclade dataset-specific
Freyja minimum coverage per site	>=10x typical	Freyja convention; per-site weighting accounts for variance
Freyja `resid` flag threshold	Project-specific; >0.1 typically indicates novel lineage missed	Freyja documentation
COJAC early-detection lead time vs clinical	Up to 13 days in Swiss data	Jahn 2022 Nat Microbiol 7:1151
Karthikeyan 2022 wastewater Omicron lead	11 days before clinical detection (San Diego)	Most-favourable configuration; subsequent retrospective analyses produced detection lags ranging from -5 to +3 days
ARTIC V4.1 known chronic dropouts	Amplicons 64, 76, 88-90	Itokawa 2020 / community documentation
Augur subsampling typical	3000-5000 genomes per global build	Nextstrain convention; document explicitly
Multinomial logistic growth-advantage early estimate inflation	Systematically too large; shrinks over time	Abousamra 2024 PLoS Comput Biol 20:e1012443
GISAID 2024-2025 weekly submission rate	~5,000-20,000/week (down from ~500,000/week peak early 2022)	Community-documented; emerging-lineage detection lag increased

Common Errors

Error / symptom	Cause	Solution
`pangolin --inference usher` rejected	Flag is `--analysis-mode usher`	`--analysis-mode usher`
`nextclade run --input-dataset DIR` rejected on v2	v2 used `--input-dataset` differently	Verify `nextclade --version`; v3+ accepts pre-downloaded dataset folder
`freyja barcode_build` rejected	Current is `barcode-build` (hyphen)	Use hyphen form
Pangolin output column not present	Column names changed between major releases	Introspect output schema; `pangolin --all-versions`
Freyja silently misassigns new lineage	Barcode predates lineage designation	Rebuild barcodes; check `resid`
Nextclade and Pangolin disagree	Different versions; recombinant; QC	Update both; reconcile per table
ARTIC consensus has Ns clustered in one region	Primer dropout in that amplicon	Mask the amplicon; document scheme version
Lineage frequency shows implausible jump	pangolin-data version drift	Pin version; re-run archive
Augur tree topology changes between runs	Subsampling randomness	Pin random seed; document subsampling
Augur `refine` requires `--root`	Shallow tree without explicit root strategy	`--root best` / `oldest` / `residual`
Wastewater Freyja result differs from clinical	Barcode staleness or primer-scheme mismatch	Update barcode; document scheme; check coverage
COJAC misses a known variant	Single-read amplicons (no co-occurrence)	Re-sequence with paired-end or long-read

Anticipated Reviewer Pushback

Pushback	Response
"Pangolin version?"	`pangolin --all-versions` recorded; pinned for the analysis; archive re-run on dataset update
"Nextclade dataset version?"	Dataset tag from `pathogen.json` recorded; pre-downloaded folder used to lock the version
"Why UShER not pangoLEARN?"	pangoLEARN deprecated mid-2023 (Pongmoragot 2024); UShER default since v4
"How were ARTIC dropouts handled?"	Per-amplicon coverage checked; failed amplicons masked; primer scheme documented per isolate
"Were recombinants checked for?"	Bolotie / 3SEQ run on candidates; cross-checked Pangolin vs Nextclade; submitted to cov-lineages for novel candidates
"Wastewater barcode date?"	Barcode date postdates sample collection; `freyja barcode-build` from current UShER tree if needed
"Wastewater-to-cases conversion?"	Variant-specific shedding rate flagged in the wastewater literature; not assumed constant
"Lineage growth-advantage CI?"	Multinomial covariance reported; early estimates noted as inflated (Abousamra 2024)
"Nextstrain subsampling?"	Configuration explicit; sensitivity analysis run; MRCA / migration treated with uncertainty (Hodcroft 2021)

References

O'Toole Á, Scher E, Underwood A et al (2021) Assignment of epidemiological lineages in an emerging pandemic using the pangolin tool. Virus Evol 7(2):veab064. doi:10.1093/ve/veab064
Aksamentov I, Roemer C, Hodcroft EB, Neher RA (2021) Nextclade: clade assignment, mutation calling and quality control for viral genomes. J Open Source Softw 6(67):3773. doi:10.21105/joss.03773
Karthikeyan S, Levy JI, De Hoff P et al (2022) Wastewater sequencing reveals early cryptic SARS-CoV-2 variant transmission. Nature 609(7925):101-108. doi:10.1038/s41586-022-05049-6
Hadfield J, Megill C, Bell SM et al (2018) Nextstrain: real-time tracking of pathogen evolution. Bioinformatics 34(23):4121-4123. doi:10.1093/bioinformatics/bty407
Huddleston J, Hadfield J, Sibley TR et al (2021) Augur: a bioinformatics toolkit for phylogenetic analyses of human pathogens. J Open Source Softw 6(57):2906. doi:10.21105/joss.02906
Turakhia Y, Thornlow B, Hinrichs AS et al (2021) Ultrafast Sample placement on Existing tRees (UShER) enables real-time phylogenetics for the SARS-CoV-2 pandemic. Nat Genet 53(6):809-816. doi:10.1038/s41588-021-00862-7
Pongmoragot J, Pearson C, Borg ML et al (2024) Comparison of UShER-based and pangoLEARN-based Pangolin lineage assignments for SARS-CoV-2 sequences. Virus Evol 10(1):vead085. doi:10.1093/ve/vead085
Jahn K, Dreifuss D, Topolsky I et al (2022) Early detection and surveillance of SARS-CoV-2 genomic variants in wastewater using COJAC. Nat Microbiol 7(8):1151-1160. doi:10.1038/s41564-022-01185-x
Pechlivanis N, Tsagiopoulou M, Maniou MC et al (2022) Detecting SARS-CoV-2 lineages and mutational load in municipal wastewater and a use-case in the metropolitan area of Thessaloniki, Greece. Sci Rep 12:2659. doi:10.1038/s41598-022-06625-6
Itokawa K, Sekizuka T, Hashino M, Tanaka R, Kuroda M (2020) Disentangling primer interactions improves SARS-CoV-2 genome sequencing by multiplex tiling PCR. PLoS ONE 15(9):e0239403. doi:10.1371/journal.pone.0239403
Hodcroft EB, De Maio N, Lanfear R et al (2021) Want to track pandemic variants faster? Fix the bioinformatics bottleneck. Nature 591(7848):30-33. doi:10.1038/d41586-021-00525-x
Abousamra E, Figgins M, Bedford T (2024) Fitness models provide accurate short-term forecasts of SARS-CoV-2 variant frequency. PLoS Comput Biol 20(9):e1012443. doi:10.1371/journal.pcbi.1012443
Boni MF, Posada D, Feldman MW (2007) An exact nonparametric method for inferring mosaic structure in sequence triplets. Genetics 176(2):1035-1047. doi:10.1534/genetics.106.068874
Martin DP, Murrell B, Golden M, Khoosal A, Muhire B (2015) RDP4: detection and analysis of recombination patterns in virus genomes. Virus Evol 1(1):vev003. doi:10.1093/ve/vev003
Varabyou A, Pockrandt C, Salzberg SL, Pertea M (2021) Rapid detection of inter-clade recombination in SARS-CoV-2 with Bolotie. Bioinformatics 37(15):2298-2300. doi:10.1093/bioinformatics/btab080

Related Skills

pathogen-typing - Lineage assignment overlaps with typing; this skill owns longitudinal frequency tracking and wastewater deconvolution
phylodynamics - Lineage-stratified BDSKY / BICEPS R_e estimation runs downstream of lineage assignment
transmission-inference - SARS-CoV-2 cluster definition combines lineage + 0-2 SNPs + epi link
amr-surveillance - Antiviral drug-resistance mutation tracking is the variant-surveillance analogue for AMR
phylogenetics/modern-tree-inference - IQ-TREE / RAxML for non-UShER topology
phylogenetics/tree-io - Tree parsing and format conversion for Augur output
comparative-genomics/whole-genome-alignment - Reference-based alignment for SNP calling
variant-calling/vcf-basics - VCF for lineage-defining mutations
variant-calling/variant-calling - Variant calling for wastewater (lofreq, ivar)
variant-calling/filtering-best-practices - Per-amplicon coverage filtering for ARTIC
read-alignment/bwa-alignment - Read mapping upstream
read-alignment/minimap2-alignment - Long-read alignment for ARTIC-Midnight 1200
read-qc/quality-reports - Sequencing QC upstream
database-access/sra-data - SRA / INSDC retrieval; GISAID is a separate restricted-access source
data-visualization/multipanel-figures - Lineage frequency / wastewater plotting
workflows/somatic-variant-pipeline - End-to-end orchestration patterns

name	bio-epidemiological-genomics-variant-surveillance
description	Assigns pathogen lineages (SARS-CoV-2 Pangolin via UShER mode; Nextclade clade + QC; pango-designation alias_key.json resolution) and tracks variant frequencies over time using Nextstrain (Augur + Auspice), wastewater deconvolution (Freyja, COJAC, alcov, lineagespot), lineage fitness modelling (Wenseleers / Bedford-Figgins multinomial logistic), and recombinant detection (3SEQ, RDP4, Bolotie). Covers Pangolin pangolin-data version pinning (mandatory for reproducibility), Nextclade dataset versioning (lineage-defining mutations change with dataset), Freyja barcode forward-only date constraint, ARTIC primer scheme version churn (V3 / V4 / V4.1 / V5.3.2 / Midnight 1200) with documented dropout regions, recombinant X-prefix Pango designation lag, GISAID vs INSDC dual-deposition tensions, and the Karthikeyan 2022 wastewater early-detection signal with explicit reproducibility caveats. Use when assigning Pango lineages and Nextclade clades to viral consensus sequences, building Nextstrain Augur surveillance pipelines, deconvolving wastewater pooled samples into lineage frequencies with Freyja, tracking lineage frequencies and growth advantages over time, pinning pangolin-data / Nextclade dataset versions for reproducibility, handling ARTIC primer dropouts (V4.1 amplicons 64 / 76 / 88-90), or running variant surveillance for SARS-CoV-2 / influenza / Mpox / RSV / H5N1 / measles.
tool_type	mixed
primary_tool	Pangolin

name	bio-epidemiological-genomics-variant-surveillance
description	Assigns pathogen lineages (SARS-CoV-2 Pangolin via UShER mode; Nextclade clade + QC; pango-designation alias_key.json resolution) and tracks variant frequencies over time using Nextstrain (Augur + Auspice), wastewater deconvolution (Freyja, COJAC, alcov, lineagespot), lineage fitness modelling (Wenseleers / Bedford-Figgins multinomial logistic), and recombinant detection (3SEQ, RDP4, Bolotie). Covers Pangolin pangolin-data version pinning (mandatory for reproducibility), Nextclade dataset versioning (lineage-defining mutations change with dataset), Freyja barcode forward-only date constraint, ARTIC primer scheme version churn (V3 / V4 / V4.1 / V5.3.2 / Midnight 1200) with documented dropout regions, recombinant X-prefix Pango designation lag, GISAID vs INSDC dual-deposition tensions, and the Karthikeyan 2022 wastewater early-detection signal with explicit reproducibility caveats. Use when assigning Pango lineages and Nextclade clades to viral consensus sequences, building Nextstrain Augur surveillance pipelines, deconvolving wastewater pooled samples into lineage frequencies with Freyja, tracking lineage frequencies and growth advantages over time, pinning pangolin-data / Nextclade dataset versions for reproducibility, handling ARTIC primer dropouts (V4.1 amplicons 64 / 76 / 88-90), or running variant surveillance for SARS-CoV-2 / influenza / Mpox / RSV / H5N1 / measles.
tool_type	mixed
primary_tool	Pangolin

bio-epidemiological-genomics-variant-surveillance

More from this repository

Version Compatibility

Variant Surveillance

The Single Most Important Modern Insight -- Lineage assignment is dataset-version-dependent

Algorithmic Taxonomy

Decision Tree by Scenario

Pangolin Lineage Assignment With Version Pinning

Nextclade With Dataset Pinning

Wastewater Lineage Deconvolution With Freyja

COJAC for Co-Occurrence Detection

Nextstrain Augur Pipeline

Per-Method Failure Modes

pangolin-data version skew between labs

Freyja barcode predates the sample collection date

ARTIC primer dropout misread as deletion

Recombinant assigned to one parent lineage

pangoLEARN result reported as authoritative

Freyja barcode-build vs barcode_build flag

Wenseleers / Bedford lineage-growth CI hides covariance

Nextstrain subsampling drives the result

Reconciliation: When Methods Disagree

Quantitative Thresholds

Common Errors

Anticipated Reviewer Pushback

References

Related Skills

Version Compatibility

Variant Surveillance

The Single Most Important Modern Insight -- Lineage assignment is dataset-version-dependent

Algorithmic Taxonomy

Decision Tree by Scenario

Pangolin Lineage Assignment With Version Pinning

Nextclade With Dataset Pinning

Wastewater Lineage Deconvolution With Freyja

COJAC for Co-Occurrence Detection

Nextstrain Augur Pipeline

Per-Method Failure Modes

pangolin-data version skew between labs

Freyja barcode predates the sample collection date

ARTIC primer dropout misread as deletion

Recombinant assigned to one parent lineage

pangoLEARN result reported as authoritative

Freyja barcode-build vs barcode_build flag

Wenseleers / Bedford lineage-growth CI hides covariance

Nextstrain subsampling drives the result

Reconciliation: When Methods Disagree

Quantitative Thresholds

Common Errors

Anticipated Reviewer Pushback

References

Related Skills

More from this repository