| name | bio-epidemiological-genomics-transmission-inference |
| description | Infers person-to-person transmission from pathogen genomes using outbreaker2 (Campbell 2018), TransPhylo (Didelot 2017), phybreak (Klinkenberg 2017), BadTrIP (De Maio 2018), SCOTTI (De Maio 2016), BEASTLIER (Hall 2015), and SNP-distance / cluster-picker approaches (HIV-TRACE for HIV; transcluster). Defines outbreak clusters using pathogen-specific SNP thresholds (NOT a universal cutoff -- TB <=12 SNPs / Walker 2013; MRSA <=15 / Coll 2017; C. difficile <=2 / Eyre 2013; Klebsiella <=21 / Snitkin 2012), models within-host diversity and transmission bottlenecks (Worby-Lipsitch-Hanage 2014; McCrone 2018; Sobel Leonard 2017; Lythgoe 2021 SARS-CoV-2), integrates contact-tracing data, distinguishes generation interval from serial interval (Britton & Scalia Tomba 2019; Ali 2020), attributes source via Bayesian source attribution (Mather 2013 DT104; islandR), and reconciles transmission-network reconstruction with epi metadata. Use when investigating outbreaks for who-infected-whom, defining SNP-cluster outbreak definitions, accounting for unsampled intermediates, choosing between outbreaker2 (rich epi data) and TransPhylo (genomic-only after dated phylogeny), running source attribution between host populations, calling HIV-TRACE thresholds appropriate to the local subtype, or distinguishing recent transmission from reactivation in TB / chronic HIV. |
| tool_type | mixed |
| primary_tool | TransPhylo |
Version Compatibility
Reference examples tested with: TransPhylo 1.4+ (R), outbreaker2 1.2+ (R), phybreak 0.5+ (R), BadTrIP via BEAST 2.7+ package manager, BEASTLIER via BEAST 1.10+, transcluster 1.0+ (R), HIV-TRACE 1.5+, snp-dists 0.8+, ape 5.8+ (R), igraph 1.6+ (R), TreeTime 0.11+, BactDating 1.1+ (R), BEAST 2.7.6+, lofreq 2.1+, deepSNV via Bioconductor 3.18+, pandas 2.2+, BioPython 1.84+.
Before using code patterns, verify installed versions match. If versions differ:
- R:
packageVersion('TransPhylo'); ?inferTTree to confirm arg names
- R:
packageVersion('outbreaker2'); ?create_config -- iteration count is set via n_iter in the config object, NOT as iters to outbreaker()
- Python:
pip show lofreq; check whether deep variant calling supports the target MAF
- CLI:
snp-dists --help; hiv-trace --help
If R rejects an argument, the function signature changed between minor releases; ?function_name is authoritative.
Transmission Inference
"Who infected whom in this outbreak, and is this even an outbreak?" -> Pick the question first (cluster definition vs WIWS who-infected-whom vs source attribution), then the method that fits the data (rich epi + dense sampling -> outbreaker2; sparse sampling + good dated tree -> TransPhylo; longitudinal within-host samples -> BEASTLIER / BadTrIP; rapid surveillance triage -> SNP-distance with pathogen-tuned threshold). Genomic distance is necessary but not sufficient for direction: two isolates 3 SNPs apart could be A->B, B->A, A->Unknown->B, or two-from-one common source. Direction inference requires temporal data, within-host diversity, contact-tracing data, or all three.
- R:
outbreaker2::outbreaker(data=outbreaker_data(dates=..., dna=..., w_dens=..., f_dens=..., ctd=...), config=create_config(n_iter=1e6)) -- dense outbreak with contact data
- R:
TransPhylo::inferTTree(ptree, mcmcIterations=1e5, w.shape=1.3, w.scale=10) -- sparse outbreak from a dated tree
- CLI:
snp-dists -c gubbins.filtered_polymorphic_sites.fasta > pairwise.csv -- pairwise SNP for cluster triage
- CLI:
hiv-trace --threshold 0.015 -- HIV cluster definition at the US-CDC default (subtype B); reconsider for non-B subtypes
The Single Most Important Modern Insight -- There is no universal SNP cutoff for transmission
The pathogen-specific SNP threshold varies by 10x across taxa (TB <=12 SNPs, C. difficile <=2, MRSA <=15, Salmonella cgMLST <=5, Klebsiella <=21, SARS-CoV-2 not defined by SNP alone). Substitution rate, recombination, generation time, within-host diversity, and (for Mpox) APOBEC3 editing all vary by 100x. Walker 2013 Lancet Infect Dis 13:137 derived the TB <=12 SNP cutoff from UK Oxfordshire (low-transmission, contact-traced, household settings); applying the same threshold in Cape Town or Mumbai inflates apparent recent-transmission rates 2-5x because clonal isolates linked through long-past common ancestors get pooled with truly recent transmissions. Worby, Lipsitch & Hanage 2014 PLoS Comput Biol 10:e1003549 formally showed that within-host bacterial diversity puts an irreducible upper bound on the resolution of SNP-distance transmission-network reconstruction even with repeated sampling. Always cite the pathogen-specific source AND its derivation population; never apply a threshold outside its validated context without an explicit caveat. For TB / HIV / chronic infections, naive SNP cutoffs fail because of reactivation and within-host coalescence -- use TransPhylo or outbreaker2 with within-host-aware priors.
Algorithmic Taxonomy
| Tool | Mechanism | Inputs | Output | Strength | Fails when |
|---|
| Pairwise SNP threshold (snp-dists; cluster picker) | Count SNPs between pairs; threshold + linkage | Core-SNP alignment | Adjacency at threshold | Fast; intuitive; standard for surveillance triage | Pathogen-specific cutoff; convergent evolution and recombination violate distance assumptions |
| HIV-TRACE (Kosakovsky Pond 2018 Mol Biol Evol 35:1812) | TN93 pairwise distance + threshold (default 1.5%) | HIV-1 pol or other gene | Cluster membership | CDC standard for US HIV surveillance | 1.5% threshold is US-CDC subtype B specific; under-clusters subtype C in southern Africa |
| outbreaker2 (Campbell 2018 BMC Bioinformatics 19:363) | MCMC; sequence + generation-interval + sampling-time + contact-tracing | Dated genomes + epi data | Posterior WIWS + unsampled intermediates + R_e | Integrates epi data explicitly; modular likelihood | ~100-200 cases practical limit; assumes one infection event per case (no within-host populations) |
| TransPhylo (Didelot 2017 Mol Biol Evol 34:997) | Coalescent within-host + birth-death between-host; colours a dated tree | Time-scaled tree + sampling dates | Posterior transmission tree + R_t + unsampled cases | Works from a tree, not raw genomes; scales to ~1000 tips; explicit within-host coalescence | Sensitive to within-host effective population size prior; requires good dated phylogeny |
| phybreak (Klinkenberg 2017 PLoS Comput Biol 13:e1005495) | Joint phylogeny + transmission inference via MCMC | Dated genomes | Posterior transmission tree | Proper within-host handling; fast for small outbreaks | <=100 cases; less benchmarked than outbreaker2/TransPhylo |
| BadTrIP (De Maio 2018 PLoS Comput Biol 14:e1006117) | Bayesian; explicit handling of multi-strain infections | Dated genomes | Posterior transmission tree with strain-level resolution | Handles within-host diversity / mixed infections (TB, HIV) | Slow; specialist tool |
| SCOTTI (De Maio 2016 PLoS Comput Biol 12:e1005130) | Structured-coalescent transmission inference (BEAST 2 package) | Dated genomes | Posterior transmission tree under structured coalescent | Sampling-aware; correctly models unsampled intermediates | Computationally heavy; specialist setup |
| BEASTLIER (Hall 2015 PLoS Comput Biol 11:e1004613) | Joint phylogeny + transmission partitioning | Dated genomes; ideally with multiple isolates per host | Posterior transmission tree with within-host partition | Postdoc-grade identifiability with within-host samples | Single-isolate-per-host data is under-identified |
| transcluster (Stimson 2019 Mol Biol Evol 36:587) | Per-pair posterior probability under SNP + time prior | Dated genomes | Per-pair cluster membership probability | Probabilistic; pathogen-tuned priors | Pair-level only; no full transmission tree |
| Sobel Leonard 2017 J Virol 91:e00171-17 beta-binomial bottleneck | Estimate transmission bottleneck size from donor-recipient deep sequencing | Donor + recipient deep-sequence allele frequencies | Bottleneck Nb posterior | Estimates an otherwise unobservable quantity | Requires deep-sequenced donor-recipient pairs |
| islandR / Bayesian source attribution (Mather 2013 Science 341:1514) | Bayesian per-population allele-frequency model | Reference collections per host source + query genome | Per-source posterior probability | Standard in Salmonella / Campylobacter food-safety surveillance | Source-attribution circularity: trained-on-distribution reproduces that distribution |
Decision Tree by Scenario
| Scenario | Recommended approach | Why wrong choices fail |
|---|
| "Is this even an outbreak?" routine surveillance triage | snp-dists after Gubbins; pathogen-tuned threshold (Walker 2013 for TB, Eyre 2013 for C. diff, EFSA cgMLST <=5 for Salmonella); cross-check cgMLST distance | Universal SNP threshold across pathogens (10x variation) |
| Densely sampled outbreak with contact-tracing data | outbreaker2 with ctd contact matrix + generation-time prior + sampling-time prior | TransPhylo without epi data (loses information from contacts); naive SNP threshold (ignores within-host diversity) |
| Sparsely sampled, longer-time-scale outbreak | TransPhylo on a BactDating-derived dated tree | outbreaker2 (sampling-completeness assumption broken); SNP threshold inflates clusters with unsampled intermediates |
| TB outbreak with possible reactivation | TransPhylo + transcluster with TB-tuned priors; long within-host coalescent matters | SNP cutoff insufficient -- reactivation can have 0 SNPs from years-old strains |
| Hospital outbreak with possible mixed infection | BadTrIP / SCOTTI | Consensus-only methods (SNP distance, outbreaker2) ambiguous on mixed-strain |
| Multi-site outbreak with import suspected | TransPhylo + MASCOT-derived migration; source-attribution as separate analysis | Source attribution needs phylogeographic component beyond TransPhylo alone |
| Food-vehicle / environmental source attribution | islandR / Bayesian source attribution (Mather 2013 framework); manual cluster + phylogeographic plot | Naive phylogenetic placement loses the per-source priors |
| Sub-sampled outbreak (<50% cases sequenced) | outbreaker2 (handles unsampled cases explicitly with pi sampling parameter) | Raw SNP cutoff -- unsampled intermediates break SNP-distance reasoning |
| Recombining pathogen (S. pneumo, E. coli STEC, K. pneumoniae) | Gubbins / ClonalFrameML mask FIRST; then any of the above | Recombination inflates apparent SNP distance and creates false convergent transmission inference |
| HIV cluster definition | HIV-TRACE 1.5% for subtype B (US-CDC standard); reconsider for non-B subtypes | Applying 1.5% threshold globally without subtype caveat |
| Estimate transmission bottleneck | Sobel Leonard 2017 beta-binomial on deep-sequenced donor-recipient pairs | Consensus-only sequences cannot quantify bottleneck size |
Methodology evolves; before any high-stakes who-infected-whom claim, web-search "outbreak transmission inference benchmark 2025" for current best practice.
outbreaker2 With Contact Data
Goal: Infer who-infected-whom posterior for a densely sampled outbreak with epi metadata, jointly estimating generation interval and unsampled-case proportion.
Approach: Build outbreaker_data with sampling dates, DNA alignment, generation-time density w_dens, sampling-time density f_dens, and contact-tracing matrix ctd; configure MCMC via create_config(n_iter=N); run; summarise posterior over WIWS.
library(outbreaker2)
library(ape)
dna <- read.dna('alignment.fasta', format='fasta')
dates <- read.csv('sampling_dates.csv')
ctd_matrix <- as.matrix(read.csv('contact_matrix.csv', row.names=1))
w_dens <- dgamma(1:30, shape=2.5, scale=2)
f_dens <- dgamma(1:30, shape=2, scale=3)
data <- outbreaker_data(dates=dates$collection_date, dna=dna,
w_dens=w_dens, f_dens=f_dens, ctd=ctd_matrix)
cfg <- create_config(n_iter=1e6, sample_every=200, find_import=TRUE)
res <- outbreaker(data=data, config=cfg)
summary(res)
w_dens is the generation-time distribution (time from infection of A to infection of B) -- NOT the serial interval (time between symptom onsets); using one in place of the other biases inference. Britton & Scalia Tomba J R Soc Interface 16:20180670 (2019) formalised the bias for emerging epidemics; for SARS-CoV-2 with substantial pre-symptomatic transmission (Ali 2020 Science 369:1106), the serial interval shortened from 7.8 to 2.2 days under NPI, and naive SI-based inference was biased.
TransPhylo From a Dated Tree
Goal: Infer transmission tree posterior from a time-scaled phylogeny when raw genomes are not directly usable or when the outbreak is too large for outbreaker2 (>200 cases).
Approach: Time-scale the tree first (BactDating after Gubbins for bacteria; BEAST or TreeTime for viruses); convert to TransPhylo ptree with ptreeFromPhylo; run inferTTree with generation-time prior and within-host effective population size prior; summarise via medTTree (medoid transmission tree) and posterior probabilities per WIWS pair.
library(TransPhylo)
library(ape)
tree <- read.nexus('dated_tree.nexus')
date_last_sample <- 2024.95
ptree <- ptreeFromPhylo(tree, dateLastSample=date_last_sample)
w.shape <- 1.3
w.scale <- 10
ws.shape <- 1.1
ws.scale <- 7
neg <- 0.5
res <- inferTTree(ptree, mcmcIterations=1e5,
w.shape=w.shape, w.scale=w.scale,
ws.shape=ws.shape, ws.scale=ws.scale,
startNeg=neg, dateT=date_last_sample + 0.1)
med_tree <- medTTree(res)
pairs <- extractTTree(med_tree)$ttree
w.* is the generation-time Gamma prior; ws.* is the sampling-time Gamma prior. Both must reflect the pathogen's biology (e.g., TB w.scale = months; SARS-CoV-2 w.scale = days). Wrong priors silently bias the transmission-tree posterior.
SNP-Cluster Definition With Pathogen-Specific Thresholds
Goal: Define outbreak clusters from a recombination-masked core-SNP alignment using the published pathogen-specific threshold, with the threshold's source population caveated.
Approach: Snippy -> snippy-core -> Gubbins on core.full.aln for bacteria -> snp-dists -> single-linkage clustering at the pathogen-specific threshold; cite Walker 2013 (TB), Eyre 2013 (C. diff), Coll 2017 (MRSA), Snitkin 2012 (Klebsiella) per organism; flag any extrapolation outside the threshold's validation population.
snippy-core --ref reference.fa --prefix core snippy_out/*
run_gubbins.py --prefix gubbins core.full.aln
snp-dists -c gubbins.filtered_polymorphic_sites.fasta > pairwise.csv
import pandas as pd
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster
dist = pd.read_csv('pairwise.csv', index_col=0)
condensed = dist.values[np.triu_indices(len(dist), k=1)]
THRESHOLD_TB = 12
THRESHOLD_MRSA = 15
THRESHOLD_CDIFF = 2
THRESHOLD_KPNEUMO = 21
linkage_matrix = linkage(condensed, method='single')
clusters = fcluster(linkage_matrix, t=THRESHOLD_TB, criterion='distance')
Per-Method Failure Modes
Pairwise SNP threshold applied outside its validation population
Trigger: Walker 2013 UK 5/12-SNP TB threshold applied to Cape Town or Mumbai high-transmission settings.
Mechanism: Walker 2013 Lancet Infect Dis 13:137 calibrated the 5/12 SNP threshold on Oxfordshire community / household contact-traced data (low-transmission). In high-prevalence settings, clonal isolates linked through long-past common ancestors fall within the threshold without recent direct transmission.
Symptom: Country-level Mtb genomic-epi report shows 60-80% of cases in "transmission clusters", far exceeding clinical contact-tracing rates.
Fix: Cite the threshold's source population; for high-prevalence settings, derive a local threshold from epidemiologically-anchored case pairs in the local cohort rather than importing a UK-low-transmission cutoff. For transmission-direction claims, supplement with TransPhylo / outbreaker2.
Direction of transmission asserted from pairwise SNP distance alone
Trigger: Outbreak report concluding "A -> B" because A has earlier sampling date and 3 SNPs from B.
Mechanism: A 3-SNP pairwise difference is consistent with A->B, B->A, Unknown->both, or A->Unknown->B. Worby, Lipsitch & Hanage 2014 PLoS Comput Biol 10:e1003549 formalised the irreducible uncertainty. Earlier sampling date does not establish earlier infection date because of within-host evolution and asymptomatic carriage.
Symptom: Outbreak conclusions claim directionality without within-host data or contact tracing; reviewers from the Didelot / Worby groups push back.
Fix: Use "transmission consistent with genomics" not "transmission demonstrated". For direction claims, require within-host samples (BEASTLIER), contact-tracing data (outbreaker2 with ctd), or both. Cite Worby 2014 as the upper bound on what SNP distance can establish.
Unsampled intermediates collapsed into A->B direct links
Trigger: Outbreak with <50% sequencing coverage; transmission inference assumes all cases sampled.
Mechanism: When sampling is incomplete, inferred A->B "direct" transmissions are routinely A->Unknown->B chains. This systematically inflates inferred R_e (longer chains compressed), underestimates generation interval, and biases topology toward bushy trees.
Symptom: Inferred R_e is implausibly high (each "tip" appears to spawn extra children once unsampled intermediates collapse into apparent direct links); generation interval estimate is implausibly short; topology appears bushier than expected.
Fix: Use outbreaker2 with explicit pi (sampling proportion) parameter, or TransPhylo / SCOTTI which model unsampled intermediates explicitly. Cite the unsampled-intermediates caveat in every transmission-inference report.
Narrow transmission bottleneck makes consensus-only inference WORSE than coalescent intuition predicts
Trigger: Consensus-genome transmission-pair inference for a pathogen with documented narrow bottleneck (influenza 1-2 virions per McCrone 2018 eLife 7:e35962; SARS-CoV-2 <10 virions per Lythgoe 2021 Science 372:eabg0821).
Mechanism: When the transmission bottleneck is narrow, donor and recipient consensus genomes are near-identical by default -- the bottleneck strips most within-host diversity. Near-identity therefore does NOT discriminate direct transmission from infection by an unsampled intermediate or from a shared common source. Naive coalescent intuition predicts that "more transmissions = more divergence"; the opposite is true under a narrow bottleneck.
Symptom: Most pairs in a dense outbreak appear identical or 1 SNP apart; SNP-distance-based cluster definitions become uninformative; transmission-direction claims based on consensus difference are unfalsifiable.
Fix: For narrow-bottleneck pathogens, supplement consensus-based methods with deep within-host variant calling (lofreq / deepSNV / VarScan2 at MAF >= 1%) on donor-recipient pairs; estimate bottleneck size explicitly via Sobel Leonard 2017 J Virol 91:e00171-17 beta-binomial estimator; report transmission claims as "consistent with" rather than "demonstrated by" consensus identity. Pair-level resolution requires within-host data; without it, claim only cluster membership, not direction.
Generation interval and serial interval used interchangeably
Trigger: outbreaker2 / EpiNow2 / similar tools fed the serial-interval distribution (w_dens set from symptom-to-symptom data) when the model wants generation-interval (infection-to-infection).
Mechanism: Generation interval = time from infection of A to infection of B; serial interval = time from symptom onset of A to symptom onset of B. They differ when incubation periods vary or pre-symptomatic transmission is substantial. Britton & Scalia Tomba 2019 J R Soc Interface 16:20180670 formalised the bias for emerging epidemics; Ali 2020 Science 369:1106 showed for SARS-CoV-2 the SI shortened from 7.8 to 2.2 days under NPI.
Symptom: Inferred R_e is biased; comparison to case-based R_t (also often SI-based) shows compounding bias.
Fix: Document which distribution w_dens actually encodes. For SARS-CoV-2 with substantial pre-symptomatic transmission, generation interval is ~5 days in the ancestral-strain literature; serial interval was ~4-5 days early but shortened to 2-3 under NPI. Cite Britton 2019.
HIV-TRACE 1.5% threshold applied to non-subtype-B HIV
Trigger: HIV-TRACE run on subtype C sequences from southern Africa with the default 1.5% TN93 threshold.
Mechanism: Kosakovsky Pond et al 2018 Mol Biol Evol 35:1812 documented HIV-TRACE methodology; the 1.5% threshold is the US-CDC default tuned for subtype B in MSM cohorts. Subtype C in southern Africa has higher diversity per unit time and more recent epidemics; the 1.5% threshold under-clusters there.
Symptom: Cluster definitions in southern African subtype C HIV surveillance under-detect transmission; comparison to US surveillance literature shows incompatible cluster sizes.
Fix: Tune threshold for the local subtype and population; cite the local validation. UKHSA / ECDC use different thresholds; document which.
Source attribution circularity
Trigger: Bayesian source attribution model (Mather 2013 Science 341:1514 framework) trained on a reference collection that over-represents one host population.
Mechanism: Source-attribution models reproduce the host-distribution of their training data unless explicitly corrected. If 80% of training isolates are from cattle, the model will tend to attribute new isolates to cattle even when the true source is poultry.
Symptom: Source attribution reproduces the sampling intensity of the reference collection; conclusions are circular.
Fix: Weight by inverse sampling intensity per source category; use rarefied reference collections; report attribution alongside the reference-collection composition as a caveat.
Primer-scheme dropout misread as real divergence
Trigger: SARS-CoV-2 outbreak comparison across samples sequenced with different ARTIC primer schemes (V3 / V4 / V4.1 / V5.3.2); "differences" concentrated in one amplicon are interpreted as real SNPs.
Mechanism: ARTIC primer dropouts produce N's or reference-derived consensus calls in failed amplicons (Itokawa 2020 PLoS ONE 15:e0239403); these LOOK LIKE deletions or reference matches in downstream analysis but are missing data. Cross-scheme comparison without masking failed amplicons produces spurious transmission differences.
Symptom: Cluster definitions differ implausibly between ARTIC-V3 and ARTIC-V4.1 samples; "differences" cluster in known dropout amplicons (V4.1 amplicons 64, 76, 88-90).
Fix: Mask failed amplicons per sample (samtools depth + per-amplicon coverage); document primer scheme version per isolate; for transmission inference, exclude positions in any sample's dropout regions.
Reconciliation: When Methods Disagree
| Pattern | Likely cause | Action |
|---|
| outbreaker2 and TransPhylo disagree on WIWS | Different sampling-completeness assumptions; outbreaker2 expects ~dense sampling, TransPhylo handles sparse | Pick the method whose assumption matches the data; cite the choice |
| SNP threshold cluster and outbreaker2 cluster differ | SNP threshold ignores temporal data and contacts | Trust outbreaker2 (integrates more evidence); SNP cluster is triage only |
| Two consecutive Pangolin versions give different lineage for a "transmission pair" | Lineage definitions revised | Re-run both samples against a single Pango / pangolin-data version |
| TB cluster definition flips between 5 and 12 SNP threshold | Walker 2013 ambiguous range | Run TransPhylo for transmission-direction posterior; report SNP-distance with cluster picker certainty |
| HIV cluster differs between HIV-TRACE 1.5% and 2.0% | Threshold sensitivity at boundary | Subtype-specific calibration; cite the chosen threshold's validation |
| Source attribution differs between islandR runs with different reference panels | Sampling-intensity bias | Re-run with rarefied or inverse-weighted reference; report multiple scenarios |
Quantitative Thresholds
| Pathogen | "Outbreak cluster" threshold | Source / rationale |
|---|
| Mycobacterium tuberculosis (whole-genome core SNP) | <=12 SNPs (likely transmission); <=5 SNPs (recent transmission) | Walker 2013 Lancet Infect Dis 13:137 (UK low-transmission setting) |
| Staphylococcus aureus (core genome) | <=15 SNPs (within hospital outbreak); <=40 SNPs (broader temporal cluster) | Coll 2017 Clin Infect Dis 65:1781 |
| Klebsiella pneumoniae (KPC outbreak) | <=21 SNPs | Snitkin 2012 Sci Transl Med 4:148ra116 |
| Salmonella enterica (cgMLST EnteroBase) | <=5 allelic differences (cluster); <=7 (extended cluster) | EnteroBase / EFSA harmonised |
| Listeria monocytogenes (PulseNet cgMLST) | <=4 allelic differences | PulseNet protocol convention |
| E. coli (cgMLST, EnteroBase) | <=10 allelic differences (STEC outbreak) | EnteroBase convention |
| Neisseria gonorrhoeae | <=25 core SNPs (transmission) | UKHSA STI framework |
| Clostridioides difficile (core SNP, recombination-masked) | <=2 SNPs (likely direct); <=10 (plausible within 6 months) | Eyre 2013 NEJM 369:1195 |
| SARS-CoV-2 (whole-genome) | No fixed cutoff; 0-2 SNPs + epi link + sampling window | Lythgoe 2021 Science 372:eabg0821 |
| HIV-1 subtype B (TN93 distance) | 1.5% genetic distance (HIV-TRACE default; US-CDC standard) | Kosakovsky Pond 2018 Mol Biol Evol 35:1812 |
| Mpox clade IIb | <=2 SNPs cluster threshold; APOBEC3 editing inflates apparent distance | Mpox 2022 outbreak APOBEC3-editing literature |
| Transmission bottleneck -- influenza | ~1-2 virions (narrow) | McCrone 2018 eLife 7:e35962 |
| Transmission bottleneck -- SARS-CoV-2 | <10 virions (tight) | Lythgoe 2021 Science 372:eabg0821 |
| Generation interval -- SARS-CoV-2 ancestral | ~5 days | SARS-CoV-2 ancestral-strain literature |
CRITICAL: a number from one pathogen does NOT transfer to another. Always cite the source population.
Common Errors
| Error / symptom | Cause | Solution |
|---|
outbreaker2 rejects iters arg | Iterations set via n_iter in the config object | create_config(n_iter=N) |
| TransPhylo MCMC fails to converge | Within-host Ne prior misspecified; bad input tree | Tune startNeg; verify tree dating quality |
| Cluster definition flips between linkage methods | Single-linkage vs complete-linkage on borderline pairs | Document; sensitivity analysis |
| outbreaker2 estimates implausible R_e | Sampling proportion mis-specified | Set pi based on epi knowledge or estimate within outbreaker2 |
| Transmission inferred between two distant lineages | Recombination unmasked | Run Gubbins on core.full.aln first |
| HIV-TRACE clusters incompatible across labs | Different subtype calibration | Document subtype; use locally validated threshold |
| Source attribution always pointing at one host | Reference-collection bias | Re-weight or rarify reference panel |
snp-dists -t rejected | -t flag doesn't exist; default IS tab; -c for CSV | Use -c for CSV; default for TSV |
| Snippy outputs disagree across samples | Different reference; reference mismatch silently shifts SNP coordinates | Always document reference; use same reference cross-lab |
Anticipated Reviewer Pushback
| Pushback | Response |
|---|
| "What SNP threshold and on what population?" | Cite Walker 2013 / Eyre 2013 / Coll 2017 per pathogen; caveat the population if extrapolating |
| "Were unsampled intermediates handled?" | outbreaker2 pi parameter or TransPhylo / SCOTTI explicit modelling; never a raw SNP-distance method on sub-sampled data |
| "Direction of transmission inference?" | Within-host samples + contact tracing required for direction; otherwise "consistent with" phrasing |
| "Generation interval vs serial interval?" | Documented w_dens source; cite Britton 2019 if SI used as approximation for GI |
| "Why TransPhylo / outbreaker2 / phybreak?" | Decision tree based on sampling completeness, dataset size, contact-tracing availability |
| "Was within-host diversity considered?" | TransPhylo's within-host coalescent OR BadTrIP for mixed-strain; bottleneck size from Sobel Leonard 2017 if relevant |
| "HIV-TRACE 1.5% threshold outside subtype B?" | Acknowledged US-CDC subtype B origin; either use locally validated threshold or document caveat |
| "Source attribution sampling-intensity bias?" | Re-weighted reference collection or rarified; cite Mather 2013 limitation |
| "Was forward simulation run as a sanity check?" | SLiM / FAVITES / SEEDY if claims are high-stakes; routinely under-done in published transmission inference |
References
- Worby CJ, Lipsitch M, Hanage WP (2014) Within-host bacterial diversity hinders accurate reconstruction of transmission networks from genomic distance data. PLoS Comput Biol 10(3):e1003549. doi:10.1371/journal.pcbi.1003549
- Campbell F, Didelot X, Fitzjohn R, Ferguson N, Cori A, Jombart T (2018) outbreaker2: a modular platform for outbreak reconstruction. BMC Bioinformatics 19(Suppl 11):363. doi:10.1186/s12859-018-2330-z
- Didelot X, Fraser C, Gardy J, Colijn C (2017) Genomic infectious disease epidemiology in partially sampled and ongoing outbreaks. Mol Biol Evol 34(4):997-1007. doi:10.1093/molbev/msw275
- Klinkenberg D, Backer JA, Didelot X, Colijn C, Wallinga J (2017) Simultaneous inference of phylogenetic and transmission trees in infectious disease outbreaks. PLoS Comput Biol 13(5):e1005495. doi:10.1371/journal.pcbi.1005495
- De Maio N, Worby CJ, Wilson DJ, Stoesser N (2018) Bayesian reconstruction of transmission within outbreaks using genomic variants. PLoS Comput Biol 14(4):e1006117. doi:10.1371/journal.pcbi.1006117
- De Maio N, Wu CH, Wilson DJ (2016) SCOTTI: efficient reconstruction of transmission within outbreaks with the structured coalescent. PLoS Comput Biol 12(9):e1005130. doi:10.1371/journal.pcbi.1005130
- Hall M, Woolhouse M, Rambaut A (2015) Epidemic reconstruction in a phylogenetics framework: transmission trees as partitions of the node set. PLoS Comput Biol 11(12):e1004613. doi:10.1371/journal.pcbi.1004613
- Stimson J, Gardy J, Mathema B et al (2019) Beyond the SNP threshold: identifying outbreak clusters using inferred transmissions. Mol Biol Evol 36(3):587-603. doi:10.1093/molbev/msy242
- Walker TM, Ip CLC, Harrell RH et al (2013) Whole-genome sequencing to delineate Mycobacterium tuberculosis outbreaks: a retrospective observational study. Lancet Infect Dis 13(2):137-146. doi:10.1016/S1473-3099(12)70277-3
- Coll F, Harrison EM, Toleman MS et al (2017) Longitudinal genomic surveillance of MRSA in the UK reveals transmission patterns in hospitals and the community. Clin Infect Dis 65(11):1781-1789. doi:10.1093/cid/cix645
- Eyre DW, Cule ML, Wilson DJ et al (2013) Diverse sources of C. difficile infection identified on whole-genome sequencing. N Engl J Med 369(13):1195-1205. doi:10.1056/NEJMoa1216064
- Snitkin ES, Zelazny AM, Thomas PJ et al (2012) Tracking a hospital outbreak of carbapenem-resistant Klebsiella pneumoniae with whole-genome sequencing. Sci Transl Med 4(148):148ra116. doi:10.1126/scitranslmed.3004129
- Lythgoe KA, Hall M, Ferretti L et al (2021) SARS-CoV-2 within-host diversity and transmission. Science 372(6539):eabg0821. doi:10.1126/science.abg0821
- McCrone JT, Woods RJ, Martin ET et al (2018) Stochastic processes constrain the within and between host evolution of influenza virus. eLife 7:e35962. doi:10.7554/eLife.35962
- Sobel Leonard A, Weissman DB, Greenbaum B, Ghedin E, Koelle K (2017) Transmission bottleneck size estimation from pathogen deep-sequencing data, with an application to human influenza A virus. J Virol 91(14):e00171-17. doi:10.1128/JVI.00171-17
- Britton T, Scalia Tomba G (2019) Estimation in emerging epidemics: biases and remedies. J R Soc Interface 16(150):20180670. doi:10.1098/rsif.2018.0670
- Ali ST, Wang L, Lau EHY et al (2020) Serial interval of SARS-CoV-2 was shortened over time by nonpharmaceutical interventions. Science 369(6507):1106-1109. doi:10.1126/science.abc9004
- Kosakovsky Pond SL, Weaver S, Leigh Brown AJ, Wertheim JO (2018) HIV-TRACE (TRAnsmission Cluster Engine): A tool for large-scale molecular epidemiology of HIV-1 and other rapidly evolving pathogens. Mol Biol Evol 35(7):1812-1819. doi:10.1093/molbev/msy016
- Mather AE, Reid SWJ, Maskell DJ et al (2013) Distinguishable epidemics of multidrug-resistant Salmonella Typhimurium DT104 in different hosts. Science 341(6153):1514-1517. doi:10.1126/science.1240578
- Itokawa K, Sekizuka T, Hashino M, Tanaka R, Kuroda M (2020) Disentangling primer interactions improves SARS-CoV-2 genome sequencing by multiplex tiling PCR. PLoS ONE 15(9):e0239403. doi:10.1371/journal.pone.0239403
Related Skills
- pathogen-typing - SNP-cluster / cgMLST cluster definition feeds transmission inference
- phylodynamics - Time-scaled tree from BactDating / BEAST / TreeTime feeds TransPhylo
- amr-surveillance - Resistant-clone outbreak inference combines AMR + transmission
- variant-surveillance - Lineage assignment cross-checks transmission cluster boundaries
- phylogenetics/divergence-dating - Calibrated trees for non-pathogen contexts
- phylogenetics/bayesian-inference - BEAST mechanics beyond outbreak phylodynamics
- comparative-genomics/whole-genome-alignment - Core-genome alignment for SNP-typing
- variant-calling/vcf-basics - Per-isolate variant calls for SNP-typing
- variant-calling/variant-calling - SNP calling that feeds snp-dists
- read-alignment/bwa-alignment - Read mapping upstream
- data-visualization/network-visualization - Transmission tree visualisation
- workflows/somatic-variant-pipeline - End-to-end orchestration patterns