with one click
bio-pileup-generation
// Generate pileup data for variant calling using samtools mpileup and pysam. Use when preparing data for variant calling, analyzing per-position read data, or calculating allele frequencies.
// Generate pileup data for variant calling using samtools mpileup and pysam. Use when preparing data for variant calling, analyzing per-position read data, or calculating allele frequencies.
[HINT] Download the complete skill directory including SKILL.md and all related files
| name | bio-pileup-generation |
| description | Generate pileup data for variant calling using samtools mpileup and pysam. Use when preparing data for variant calling, analyzing per-position read data, or calculating allele frequencies. |
| tool_type | cli |
| primary_tool | samtools |
Reference examples tested with: bcftools 1.19+, pysam 0.22+, samtools 1.19+
Before using code patterns, verify installed versions match. If versions differ:
pip show <package> then help(module.function) to check signatures<tool> --version then <tool> --help to confirm flagsIf code throws ImportError, AttributeError, or TypeError, introspect the installed package and adapt the example to match the actual API rather than retrying.
Generate pileup data for variant calling and position-level analysis.
"Generate pileup from BAM" → Produce per-position read summaries showing depth, bases, and qualities.
samtools mpileup -f ref.fa input.bambam.pileup(chrom, start, end) (pysam)"Count alleles at a position" → Extract per-base read support at a specific genomic coordinate.
pileup_column.pileups and count bases (pysam)Pileup shows all reads covering each position in the reference, used for:
samtools mpileup -g/-u (BCF output for variant calling) has been deprecated since samtools 1.9 -- the genotype-likelihood code now lives in bcftools mpileup, which keeps mpileup logic versioned alongside bcftools call and avoids version-skew bugs.
| Use case | Recommended tool |
|---|---|
| Quick allele counts at known sites | samtools mpileup or pysam pileup |
| Germline variant calling (small genomes, simple cohorts) | bcftools mpileup -> bcftools call |
| Germline WGS / WES production | DeepVariant or HaplotypeCaller (not mpileup) |
| Somatic SNV/indel | Mutect2 / VarDict / VarScan2 (direct from BAM) |
| Long-read small variants | clair3 / DeepVariant ONT (direct from BAM) |
| Long-read SV | Sniffles / cuteSV (direct from BAM) |
| Ultra-low-frequency (ctDNA / MRD) | fgbio consensus -> bcftools call or hot-spot Mutect2 |
| Per-position allele counts (custom) | pysam pileup |
samtools mpileup (without -g) is still the standard tool for human-readable per-position read summaries.
samtools mpileup -f reference.fa input.bam > pileup.txt
samtools mpileup -f reference.fa -r chr1:1000000-2000000 input.bam
samtools mpileup -f reference.fa -l targets.bed input.bam
samtools mpileup -f reference.fa sample1.bam sample2.bam sample3.bam > pileup.txt
Text pileup format (6 columns per sample):
chr1 1000 A 15 ............... FFFFFFFFFFF
chr1 1001 T 12 ............ FFFFFFFFFFFF
| Column | Description |
|---|---|
| 1 | Chromosome |
| 2 | Position (1-based) |
| 3 | Reference base |
| 4 | Read depth |
| 5 | Read bases |
| 6 | Base qualities |
| Symbol | Meaning |
|---|---|
. | Match on forward strand |
, | Match on reverse strand |
ACGT | Mismatch (uppercase = forward) |
acgt | Mismatch (lowercase = reverse) |
^Q | Start of read (Q = MAPQ as ASCII) |
$ | End of read |
+NNN | Insertion of N bases |
-NNN | Deletion of N bases |
* | Deleted base |
> / < | Reference skip (intron) |
samtools mpileup -f reference.fa -q 20 input.bam
samtools mpileup -f reference.fa -Q 20 input.bam
samtools mpileup -f reference.fa -q 20 -Q 20 input.bam
# samtools mpileup default -d 8000 silently truncates targeted / mt-DNA / amplicon / UMI-deduped data
# bcftools mpileup default -d 250 is far lower; both must be set explicitly when piping
samtools mpileup -f reference.fa -d 0 input.bam # no cap
samtools mpileup -f reference.fa -d 1000000 input.bam # explicit high cap
# WRONG -- samtools 8000 cap, then bcftools 250 cap re-applied
samtools mpileup -f ref.fa in.bam | bcftools call -mv
# RIGHT -- single tool, explicit -d
bcftools mpileup -d 1000000 -f ref.fa in.bam | bcftools call -mv
When -f ref.fa is passed, BAQ is enabled by default. BAQ Phred-scales the probability that a base is misaligned (HMM realignment over a small window) and reduces base quality near indels. Tradeoffs: ~30% slower; suppresses FP SNVs near indels; hurts indel detection sensitivity.
| Flag | Behavior |
|---|---|
(default with -f) | BAQ on (computed from CIGAR if MD missing) |
-B / --no-BAQ | Disable BAQ -- raw qualities |
-E / --redo-BAQ | Force recompute (after BQSR; if MD stale) |
BAQ ON for: short-read germline SNV (BWA, Bowtie2, HISAT2), short-read somatic SNV.
BAQ OFF (-B) for: long-read variant calling (ONT, PacBio HiFi), SV calling, RNA-seq near splice junctions, viral / amplicon, ultra-deep ctDNA from consensus reads (consensus quality already inflated), aDNA (qualities pre-rescaled by mapDamage).
-A (count anomalous read pairs / orphans) is required for amplicon -- amplicon reads are by design not properly paired.
-aa (output all positions, including zero-coverage) is required for ARTIC SARS-CoV-2 consensus generation.
| Library | Flags |
|---|---|
| Short-read germline WGS (BWA) | -q 20 -Q 20 -d 0 (BAQ on default) |
| Short-read tumor WGS | -q 1 -Q 13 -d 0 -B (low MAPQ kept; BAQ off) |
| Amplicon viral (ARTIC) | -aa -A -d 600000 -B -Q 20 |
| Capture / exome | -q 20 -Q 20 -d 250 |
| Long-read ONT R10.4+ | -q 30 -Q 0 -B -d 0 --max-BQ 50 |
| PacBio HiFi | -q 20 -Q 0 -B -d 0 |
| RNA-seq variants | -q 20 -Q 20 -B -d 0 |
| Forensic / aDNA | -q 0 -Q 0 -A -d 0 -B |
Goal: Call variants from alignment data using the pileup-based approach.
Approach: Use bcftools mpileup (not samtools mpileup -g) so genotype-likelihood code is co-versioned with bcftools call. Apply quality and depth caps explicitly; annotate FORMAT fields needed for downstream filtering.
bcftools mpileup -f reference.fa -d 1000000 -q 20 -Q 20 \
--annotate FORMAT/AD,FORMAT/DP,FORMAT/SP,INFO/AD \
input.bam | \
bcftools call -mv -Oz -o variants.vcf.gz
bcftools index -t variants.vcf.gz
bcftools mpileup -f reference.fa --threads 4 -d 250 -q 20 -Q 20 \
-a FORMAT/AD,FORMAT/DP s1.bam s2.bam s3.bam | \
bcftools call -mv --threads 4 -Oz -o joint.vcf.gz
For somatic / low-VAF, prefer Mutect2 / Strelka2 / DeepVariant -- materially better than mpileup-based callers.
When fragment length < 2 * read_length, R1 and R2 overlap. bcftools mpileup detects overlap and counts once by default; samtools mpileup does not. Doubled support inflates somatic VAFs at sites covered by overlapping pairs (especially in cfDNA / FFPE). Prefer bcftools mpileup or use samtools mpileup --ignore-overlaps.
import pysam
with pysam.AlignmentFile('input.bam', 'rb') as bam:
for pileup_column in bam.pileup('chr1', 1000000, 1001000):
print(f'{pileup_column.reference_name}:{pileup_column.pos} depth={pileup_column.n}')
import pysam
with pysam.AlignmentFile('input.bam', 'rb') as bam:
for pileup_column in bam.pileup('chr1', 1000000, 1000001, truncate=True):
print(f'Position: {pileup_column.pos}')
print(f'Depth: {pileup_column.n}')
for pileup_read in pileup_column.pileups:
if pileup_read.is_del:
print(' Deletion')
elif pileup_read.is_refskip:
print(' Reference skip')
else:
qpos = pileup_read.query_position
base = pileup_read.alignment.query_sequence[qpos]
qual = pileup_read.alignment.query_qualities[qpos]
print(f' {base} (Q{qual})')
import pysam
from collections import Counter
def allele_counts(bam_path, chrom, pos):
counts = Counter()
with pysam.AlignmentFile(bam_path, 'rb') as bam:
for pileup_column in bam.pileup(chrom, pos, pos + 1, truncate=True):
if pileup_column.pos != pos:
continue
for pileup_read in pileup_column.pileups:
if pileup_read.is_del:
counts['DEL'] += 1
elif pileup_read.is_refskip:
continue
else:
qpos = pileup_read.query_position
base = pileup_read.alignment.query_sequence[qpos]
counts[base.upper()] += 1
return dict(counts)
counts = allele_counts('input.bam', 'chr1', 1000000)
print(counts) # {'A': 45, 'G': 5}
import pysam
from collections import Counter
def allele_frequency(bam_path, chrom, pos, min_qual=20):
counts = Counter()
with pysam.AlignmentFile(bam_path, 'rb') as bam:
for pileup_column in bam.pileup(chrom, pos, pos + 1, truncate=True,
min_base_quality=min_qual):
if pileup_column.pos != pos:
continue
for pileup_read in pileup_column.pileups:
if pileup_read.is_del or pileup_read.is_refskip:
continue
qpos = pileup_read.query_position
base = pileup_read.alignment.query_sequence[qpos]
counts[base.upper()] += 1
total = sum(counts.values())
if total == 0:
return {}
return {base: count / total for base, count in counts.items()}
freq = allele_frequency('input.bam', 'chr1', 1000000)
for base, f in sorted(freq.items(), key=lambda x: -x[1]):
print(f'{base}: {f:.1%}')
import pysam
with pysam.AlignmentFile('input.bam', 'rb') as bam:
for pileup_column in bam.pileup('chr1', 1000000, 1001000,
truncate=True,
min_mapping_quality=20,
min_base_quality=20):
print(f'{pileup_column.pos}: {pileup_column.n}')
import pysam
def pileup_text(bam_path, ref_path, chrom, start, end):
with pysam.AlignmentFile(bam_path, 'rb') as bam:
with pysam.FastaFile(ref_path) as ref:
for pileup_column in bam.pileup(chrom, start, end, truncate=True):
pos = pileup_column.pos
ref_base = ref.fetch(chrom, pos, pos + 1)
depth = pileup_column.n
bases = []
for pileup_read in pileup_column.pileups:
if pileup_read.is_del:
bases.append('*')
elif pileup_read.is_refskip:
bases.append('>')
else:
qpos = pileup_read.query_position
base = pileup_read.alignment.query_sequence[qpos]
if base.upper() == ref_base.upper():
bases.append('.' if not pileup_read.alignment.is_reverse else ',')
else:
bases.append(base.upper() if not pileup_read.alignment.is_reverse else base.lower())
print(f'{chrom}\t{pos+1}\t{ref_base}\t{depth}\t{"".join(bases)}')
pileup_text('input.bam', 'reference.fa', 'chr1', 1000000, 1000100)
| Option | Description | Common pitfall |
|---|---|---|
-f FILE | Reference FASTA | Triggers BAQ ON by default |
-r REGION | Restrict to region | |
-l FILE | BED file of regions | |
-q INT | Min mapping quality | Aligner-dependent semantics |
-Q INT | Min base quality | -Q 0 with default overlap detection has subtle behavior |
-d INT | Max depth | Default 8000 silently truncates; bcftools mpileup default is 250 |
-B | Disable BAQ | Often correct for long reads, SV, viral, consensus |
-A | Count anomalous pairs | Required for amplicon |
-aa | Output all positions | Required for consensus generation |
--ignore-overlaps | Disable mate-overlap correction | Rarely correct |
--max-BQ INT | Cap BQ (default 60) | Useful for ONT (Q values inflated) |
-g (DEPRECATED) | Old BCF output | Use bcftools mpileup instead |
| Task | Command |
|---|---|
| Basic pileup | samtools mpileup -f ref.fa in.bam |
| Quality filter | samtools mpileup -f ref.fa -q 20 -Q 20 in.bam |
| Region | samtools mpileup -f ref.fa -r chr1:1-1000 in.bam |
| To bcftools | bcftools mpileup -f ref.fa -d 1000000 in.bam | bcftools call -mv |
| Error | Cause | Solution |
|---|---|---|
No FASTA reference | Missing -f option | Add -f reference.fa |
Reference mismatch | Wrong reference | Use same reference as alignment |
| Out of memory | High coverage region | Use -d to cap depth |