| name | bio-metagenomics-strain-tracking |
| description | Track bacterial strains using MASH, sourmash, fastANI, and inStrain. Compare genomes, detect contamination, and monitor strain-level variation. Use when needing sub-species resolution for outbreak tracking, transmission analysis, or within-host strain dynamics. |
| tool_type | cli |
| primary_tool | MASH |
Version Compatibility
Reference examples tested with: Bowtie2 2.5.3+, MetaPhlAn 4.1+, numpy 1.26+, pandas 2.2+, samtools 1.19+, scipy 1.12+
Before using code patterns, verify installed versions match. If versions differ:
- Python:
pip show <package> then help(module.function) to check signatures
- CLI:
<tool> --version then <tool> --help to confirm flags
If code throws ImportError, AttributeError, or TypeError, introspect the installed
package and adapt the example to match the actual API rather than retrying.
Strain Tracking
"Track bacterial strains across my samples" → Resolve sub-species variation using genome sketching (Mash/sourmash), average nucleotide identity (fastANI), or within-sample strain profiling (inStrain) for outbreak tracking and transmission analysis.
- CLI:
mash dist, sourmash compare, fastANI, inStrain profile
Identify and track bacterial strains at sub-species resolution.
Tool Comparison
| Tool | Method | Best For |
|---|
| MASH | MinHash sketches | Fast distance estimation |
| sourmash | MinHash + containment | Metagenome comparisons |
| fastANI | ANI calculation | Accurate species/strain ID |
| inStrain | SNV profiling | Strain dynamics in metagenomes |
MASH
Installation
conda install -c bioconda mash
Create Sketch
mash sketch -o genome.msh genome.fasta
mash sketch -o reference_db.msh genomes/*.fasta
mash sketch -m 2 -r -o reads.msh reads.fastq.gz
Calculate Distance
mash dist genome1.fasta genome2.fasta
mash dist reference_db.msh query.fasta > distances.tsv
mash screen reference_db.msh reads.fastq.gz > screen_results.tsv
Interpret MASH Distance
| Distance | Interpretation |
|---|
| < 0.05 | Same species/strain |
| 0.05-0.15 | Same species |
| 0.15-0.25 | Same genus |
| > 0.25 | Different genus |
Cluster Genomes
mash triangle genomes/*.fasta > distances.phylip
mash triangle -E genomes/*.fasta > distances.tsv
sourmash
Installation
conda install -c bioconda sourmash
Create Signatures
sourmash sketch dna -p scaled=1000,k=31 genome.fasta -o genome.sig
sourmash sketch dna -p scaled=1000,k=31 genomes/*.fasta -o genomes.sig
sourmash sketch protein -p scaled=100,k=10 proteins.faa -o proteins.sig
Compare Signatures
sourmash compare *.sig -o comparison.npy --csv comparison.csv
sourmash search query.sig database.sig --threshold 0.8
sourmash gather metagenome.sig database.sig -o gather_results.csv
Taxonomy Assignment
sourmash database download gtdb-rs214-k31.zip
sourmash lca classify --db gtdb-rs214-k31.lca.json.gz --query query.sig
sourmash lca summarize --db gtdb-rs214-k31.lca.json.gz --query metagenome.sig
fastANI
Installation
conda install -c bioconda fastani
Calculate ANI
fastANI -q query.fasta -r reference.fasta -o ani_result.txt
fastANI -q query.fasta --rl reference_list.txt -o ani_results.txt
fastANI --ql genome_list.txt --rl genome_list.txt -o all_vs_all.txt --matrix
Interpret ANI
| ANI | Interpretation |
|---|
| >99% | Same strain |
| 95-99% | Same species |
| <95% | Different species |
inStrain
For strain-level analysis in metagenomes.
Installation
conda install -c bioconda instrain
Profile Strains
bowtie2 -x reference -1 reads_1.fq -2 reads_2.fq | \
samtools sort -o mapped.bam
inStrain profile mapped.bam reference.fasta -o instrain_output -p 8
Compare Samples
for bam in sample*.bam; do
inStrain profile $bam reference.fasta -o ${bam%.bam}_IS -p 8
done
inStrain compare -i sample*_IS -o comparison_IS -p 8
Key Outputs
cat instrain_output/output/SNVs.tsv
cat instrain_output/output/gene_info.tsv
cat instrain_output/output/genome_info.tsv
Complete Workflow: Outbreak Tracking
Goal: Identify potential outbreak clusters by computing pairwise genomic distances across isolate genomes using multiple complementary methods.
Approach: Sketch genomes with MASH for fast distance estimation, compute ANI with fastANI for accurate species-level resolution, compare sourmash signatures for containment analysis, and cluster close matches to identify transmission pairs.
#!/bin/bash
set -euo pipefail
GENOMES_DIR=$1
OUTPUT_DIR=$2
mkdir -p $OUTPUT_DIR
echo "=== MASH sketching ==="
mash sketch -o $OUTPUT_DIR/genomes.msh $GENOMES_DIR/*.fasta
echo "=== MASH distances ==="
mash dist $OUTPUT_DIR/genomes.msh $OUTPUT_DIR/genomes.msh > $OUTPUT_DIR/mash_distances.tsv
echo "=== fastANI ==="
ls $GENOMES_DIR/*.fasta > $OUTPUT_DIR/genome_list.txt
fastANI --ql $OUTPUT_DIR/genome_list.txt \
--rl $OUTPUT_DIR/genome_list.txt \
-o $OUTPUT_DIR/fastani_results.txt \
--matrix
echo "=== sourmash signatures ==="
sourmash sketch dna -p scaled=1000,k=31 $GENOMES_DIR/*.fasta -o $OUTPUT_DIR/all.sig
sourmash compare $OUTPUT_DIR/all.sig -o $OUTPUT_DIR/sourmash.npy --csv $OUTPUT_DIR/sourmash.csv
echo "=== Identify clusters ==="
python3 << 'EOF'
import pandas as pd
import numpy as np
mash = pd.read_csv('${OUTPUT_DIR}/mash_distances.tsv', sep='\t', header=None,
names=['ref', 'query', 'distance', 'pvalue', 'shared'])
close = mash[(mash['distance'] < 0.001) & (mash['ref'] != mash['query'])]
print("Potential outbreak pairs (MASH distance < 0.001):")
print(close[['ref', 'query', 'distance']])
EOF
echo "=== Complete ==="
Python Analysis
import pandas as pd
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster
from scipy.spatial.distance import squareform
mash = pd.read_csv('mash_distances.tsv', sep='\t', header=None,
names=['ref', 'query', 'dist', 'pval', 'shared'])
samples = sorted(set(mash['ref'].tolist()))
dist_matrix = mash.pivot(index='ref', columns='query', values='dist').fillna(0)
dist_matrix = dist_matrix.loc[samples, samples]
condensed = squareform(dist_matrix.values)
Z = linkage(condensed, method='average')
clusters = fcluster(Z, t=0.05, criterion='distance')
cluster_df = pd.DataFrame({'sample': samples, 'cluster': clusters})
print(cluster_df.groupby('cluster').size())
Related Skills
- metagenomics/kraken-classification - Taxonomic classification
- genome-assembly/contamination-detection - Contamination screening
- phylogenetics/modern-tree-inference - Phylogenetic analysis
- metagenomics/metaphlan-profiling - Species profiling