with one click
bio-vcf-statistics
// Generate variant statistics, sample concordance, and quality metrics using bcftools stats and gtcheck. Use when evaluating variant quality, comparing samples, or summarizing VCF contents.
// Generate variant statistics, sample concordance, and quality metrics using bcftools stats and gtcheck. Use when evaluating variant quality, comparing samples, or summarizing VCF contents.
[HINT] Download the complete skill directory including SKILL.md and all related files
| name | bio-vcf-statistics |
| description | Generate variant statistics, sample concordance, and quality metrics using bcftools stats and gtcheck. Use when evaluating variant quality, comparing samples, or summarizing VCF contents. |
| tool_type | cli |
| primary_tool | bcftools |
Reference examples tested with: bcftools 1.19+, numpy 1.26+
Before using code patterns, verify installed versions match. If versions differ:
pip show <package> then help(module.function) to check signatures<tool> --version then <tool> --help to confirm flagsIf code throws ImportError, AttributeError, or TypeError, introspect the installed package and adapt the example to match the actual API rather than retrying.
Generate statistics and quality metrics using bcftools.
| Command | Purpose |
|---|---|
bcftools stats | Comprehensive variant statistics |
bcftools gtcheck | Sample concordance and relatedness |
bcftools query | Custom summaries |
Goal: Generate comprehensive variant statistics including counts, Ti/Tv ratio, and quality distributions.
Approach: Run bcftools stats and parse section-tagged output lines (SN, TSTV, AF, QUAL, DP).
"How many variants are in this VCF?" ā Compute summary counts, substitution types, and quality distributions from variant records.
bcftools stats input.vcf.gz > stats.txt
bcftools stats input.vcf.gz | grep "^SN"
Output sections:
SN - Summary numbersTSTV - Transitions/transversionsSiS - Singleton statsAF - Allele frequency distributionQUAL - Quality distributionIDD - Indel distributionST - Substitution typesDP - Depth distributionbcftools stats input.vcf.gz | grep "^SN" | cut -f3-
Reports:
bcftools stats input.vcf.gz | grep "^TSTV"
Transitions (purine-to-purine: AāG, or pyrimidine-to-pyrimidine: CāT) are chemically favored over transversions because they preserve the purine/pyrimidine ring structure. CpG deamination (methylated CāT) is the single most common point mutation in vertebrate genomes and is a transition, which further inflates the Ti/Tv ratio. Exomes have higher Ti/Tv than whole genomes because coding regions are enriched for CpG dinucleotides.
Expected Ti/Tv ratio:
bcftools stats -s - input.vcf.gz > per_sample.txt
bcftools stats input1.vcf.gz input2.vcf.gz > comparison.txt
bcftools stats -r chr1:1000000-2000000 input.vcf.gz > region_stats.txt
bcftools stats -R exome.bed input.vcf.gz > exome_stats.txt
Goal: Visualize variant statistics as publication-quality plots.
Approach: Pipe bcftools stats output to plot-vcfstats to generate PDF and PNG plots.
bcftools stats input.vcf.gz > stats.txt
plot-vcfstats -p output_dir stats.txt
Creates:
output_dir/summary.pdfbcftools stats file1.vcf.gz file2.vcf.gz > comparison.txt
plot-vcfstats -p comparison_dir comparison.txt
Interpreting variant statistics requires context-dependent thresholds. A metric that looks normal in WGS may be alarming in WES.
| Metric | WGS Expected | WES Expected | Flag If |
|---|---|---|---|
| Ti/Tv ratio | ~2.0-2.1 | ~2.8-3.3 | WGS: <1.8 or >2.5; WES: <2.5 or >3.5 |
| Het/Hom ratio | 1.5-2.0 | 1.5-2.0 | Outlier relative to cohort |
| Call rate per variant | >95% | >95% | <90% |
| Call rate per sample | >95% | >95% | <90% |
| Singleton rate (WGS) | ~100-200k | Variable | >2x cohort mean |
| Mendelian error rate (trios) | <0.5% | <0.5% | >1% |
Interpretation notes:
A variant caller with 99% overall accuracy may perform at 70% in difficult genomic regions. Always evaluate stratified by region complexity.
bcftools stats -R easy_regions.bed input.vcf.gz > easy_stats.txt
bcftools stats -R difficult_regions.bed input.vcf.gz > difficult_stats.txt
Key stratification categories (using GIAB stratification BED files):
GIAB stratification BED files are available at github.com/genome-in-a-bottle/genome-stratifications. Comparing Ti/Tv ratio between easy and difficult regions is a quick diagnostic: a drop in Ti/Tv within difficult regions confirms elevated false-positive rates there.
For multi-sample VCFs, per-sample and per-site population metrics reveal systematic issues invisible in single-sample analyses.
Per-sample checks:
Per-site checks:
bcftools query -f '%CHROM\t%POS\t%INFO/InbreedingCoeff\n' input.vcf.gz | \
awk '$3 < -0.3' > excess_het_sites.txt
Goal: Verify sample identity and detect sample swaps by comparing genotype concordance.
Approach: Use bcftools gtcheck to compute pairwise discordance rates; interpret thresholds based on expected relatedness.
bcftools gtcheck -G 1 input.vcf.gz > relatedness.txt
bcftools gtcheck -g reference.vcf.gz query.vcf.gz
Discordance thresholds for interpreting results:
Output format (DC lines):
DC 0 sample1 sample2 0.95 1234 1200
Fields: DC tag, index, sample1, sample2, discordance rate, sites compared, discordant sites. Sample swaps are one of the most common errors in genomics studies. Running gtcheck is essential for any multi-sample study and should be performed early in the QC pipeline before downstream analysis.
Goal: Compute targeted summary statistics using bcftools query and shell tools.
Approach: Extract specific fields with bcftools query/view and aggregate with awk for counts, means, and distributions.
bcftools view -H input.vcf.gz | wc -l
# SNPs
bcftools view -v snps -H input.vcf.gz | wc -l
# Indels
bcftools view -v indels -H input.vcf.gz | wc -l
bcftools view -f PASS -H input.vcf.gz | wc -l
bcftools query -f '%QUAL\n' input.vcf.gz | \
awk '{sum+=$1; count++} END {print "Mean QUAL:", sum/count}'
bcftools query -f '%INFO/DP\n' input.vcf.gz | \
awk '{sum+=$1; count++} END {print "Mean DP:", sum/count}'
# Count heterozygous sites per sample
bcftools query -f '[%GT\t]\n' input.vcf.gz | \
awk -F'\t' '{for(i=1;i<=NF;i++) if($i=="0/1" || $i=="0|1") het[i]++}
END {for(i in het) print "Sample", i, "het:", het[i]}'
bcftools query -f '%INFO/AF\n' input.vcf.gz | \
awk '{
if($1<0.01) rare++
else if($1<0.05) low++
else if($1<0.5) common++
else freq++
} END {
print "Rare (<1%):", rare
print "Low (1-5%):", low
print "Common (5-50%):", common
print "Frequent (>50%):", freq
}'
Goal: Compute per-sample variant counts, genotype distributions, and missingness rates.
Approach: Use bcftools query/view/stats per sample to tabulate sample-level metrics.
bcftools query -l input.vcf.gz
bcftools query -l input.vcf.gz | wc -l
for sample in $(bcftools query -l input.vcf.gz); do
count=$(bcftools view -s "$sample" -H input.vcf.gz | \
bcftools view -c 1 -H | wc -l)
echo "$sample: $count"
done
bcftools stats -s - input.vcf.gz | grep "^PSC"
Goal: Compute variant statistics programmatically in Python for custom analyses.
Approach: Iterate variants with cyvcf2, accumulate counts by type, and compute quality/genotype distributions with numpy.
from cyvcf2 import VCF
stats = {'snps': 0, 'indels': 0, 'other': 0}
for variant in VCF('input.vcf.gz'):
if variant.is_snp:
stats['snps'] += 1
elif variant.is_indel:
stats['indels'] += 1
else:
stats['other'] += 1
print(f'SNPs: {stats["snps"]}')
print(f'Indels: {stats["indels"]}')
print(f'Other: {stats["other"]}')
from cyvcf2 import VCF
import numpy as np
quals = []
for variant in VCF('input.vcf.gz'):
if variant.QUAL:
quals.append(variant.QUAL)
quals = np.array(quals)
print(f'Mean QUAL: {np.mean(quals):.1f}')
print(f'Median QUAL: {np.median(quals):.1f}')
print(f'Min QUAL: {np.min(quals):.1f}')
print(f'Max QUAL: {np.max(quals):.1f}')
from cyvcf2 import VCF
vcf = VCF('input.vcf.gz')
samples = vcf.samples
hom_ref = [0] * len(samples)
het = [0] * len(samples)
hom_alt = [0] * len(samples)
missing = [0] * len(samples)
for variant in vcf:
for i, gt in enumerate(variant.gt_types):
if gt == 0:
hom_ref[i] += 1
elif gt == 1:
het[i] += 1
elif gt == 3:
hom_alt[i] += 1
else:
missing[i] += 1
for i, sample in enumerate(samples):
print(f'{sample}: HOM_REF={hom_ref[i]}, HET={het[i]}, HOM_ALT={hom_alt[i]}, MISS={missing[i]}')
from cyvcf2 import VCF
transitions = 0
transversions = 0
ti_pairs = {('A', 'G'), ('G', 'A'), ('C', 'T'), ('T', 'C')}
for variant in VCF('input.vcf.gz'):
if not variant.is_snp:
continue
ref = variant.REF
alt = variant.ALT[0]
if (ref, alt) in ti_pairs:
transitions += 1
else:
transversions += 1
ratio = transitions / transversions if transversions > 0 else 0
print(f'Transitions: {transitions}')
print(f'Transversions: {transversions}')
print(f'Ti/Tv ratio: {ratio:.2f}')
bcftools stats input.vcf.gz > stats.txt
grep "^SN" stats.txt | cut -f3-
grep "^TSTV" stats.txt | cut -f5
plot-vcfstats -p qc_plots stats.txt
bcftools stats raw.vcf.gz filtered.vcf.gz > comparison.txt
grep "^SN" comparison.txt | head -20
| Task | Command |
|---|---|
| Full stats | bcftools stats input.vcf.gz |
| Summary only | bcftools stats input.vcf.gz | grep "^SN" |
| Ti/Tv ratio | bcftools stats input.vcf.gz | grep "^TSTV" |
| Per-sample | bcftools stats -s - input.vcf.gz |
| Compare VCFs | bcftools stats file1.vcf.gz file2.vcf.gz |
| Sample check | bcftools gtcheck -G 1 input.vcf.gz |
| Plot stats | plot-vcfstats -p dir stats.txt |
| Error | Cause | Solution |
|---|---|---|
No data | Empty VCF | Check if VCF has variants |
plot-vcfstats not found | Not installed | Install with bcftools |
Cannot open | Invalid VCF | Check file format |