Run any Skill in Manus with one click

bio-alignment-indexing

Stars943

Forks165

UpdatedMay 29, 2026 at 16:17

Create and use BAI/CSI indices for BAM/CRAM files using samtools and pysam. Use when enabling random access to alignment files or fetching specific genomic regions.

Installation

Install with Codex or Claude Copy this prompt, paste it into Codex, Claude, or another assistant, and let it review the skill page and install it for you.

Run Skill in Manus

Source

GPTomics

GPTomics/bioSkills

View GitHub Repository View Creator Repositories

Download

Run Skill in Manus

Related occupationsSOC

Based on SOC occupation classification

Software DevelopersComputer and Mathematical Occupations·SOC 15-1252

File Explorer

3 files

SKILL.md

readonly

More from this repository

same repository

bio-ribo-seq-initiation-site-mapping

GPTomics/bioSkills

Map translation initiation sites, including non-AUG and alternative starts, from initiation-drug ribosome profiling (TI-seq). Use when locating start codons, detecting near-cognate or upstream initiation, or analyzing harringtonine, lactimidomycin (GTI-seq/QTI-seq), or retapamulin (Ribo-RET) data.

2026-06-20943

bio-ribo-seq-orf-detection

GPTomics/bioSkills

Detect and quantify translated ORFs from Ribo-seq using 3-nucleotide periodicity, including uORFs, internal ORFs, dORFs, and novel ORFs. Use when finding actively translated regions beyond annotated CDS, classifying ORFs by the 2022 community standard, quantifying ORF-level translation, or choosing between periodicity-based callers.

2026-06-20943

bio-ribo-seq-riboseq-preprocessing

GPTomics/bioSkills

Preprocess ribosome profiling reads with UMI handling, adapter trimming, contaminant/rRNA depletion, and footprint-aware alignment. Use when preparing Ribo-seq FASTQ for periodicity QC, ORF detection, translation efficiency, or stalling analysis, or when deciding how to deduplicate, which aligner to use, or how to size-select ribosome-protected fragments.

2026-06-20943

bio-ribo-seq-ribosome-periodicity

GPTomics/bioSkills

Validate Ribo-seq library quality by measuring 3-nucleotide periodicity and calibrating read-length-specific P-site offsets. Use when checking whether footprints capture genuine translation, determining P-site offsets for downstream ORF/TE/stalling analysis, or deciding which read lengths to keep.

2026-06-20943

bio-ribo-seq-ribosome-stalling

GPTomics/bioSkills

Detect ribosome pausing and stalling at codon resolution from Ribo-seq, using local-relative occupancy metrics and A-site assignment. Use when studying elongation dynamics, codon dwell times, pause motifs, or ribosome collisions, and when judging whether a pause is real biology or a cycloheximide artifact.

2026-06-20943

bio-ribo-seq-translation-efficiency

GPTomics/bioSkills

Quantify translation efficiency (TE) as ribosome occupancy relative to mRNA abundance and test for differential TE between conditions. Use when separating translational from transcriptional regulation, distinguishing genuine translational control from buffering, or choosing between riborex, Xtail, anota2seq, and DESeq2 interaction models.

2026-06-20943

name	bio-alignment-indexing
description	Create and use BAI/CSI indices for BAM/CRAM files using samtools and pysam. Use when enabling random access to alignment files or fetching specific genomic regions.
tool_type	cli
primary_tool	samtools

Version Compatibility

Reference examples tested with: pysam 0.22+, samtools 1.19+

Before using code patterns, verify installed versions match. If versions differ:

Python: pip show <package> then help(module.function) to check signatures
CLI: <tool> --version then <tool> --help to confirm flags

If code throws ImportError, AttributeError, or TypeError, introspect the installed package and adapt the example to match the actual API rather than retrying.

Alignment Indexing

Create indices for random access to alignment files using samtools and pysam.

"Index a BAM file" -> Create a .bai/.csi index enabling random access to genomic regions.

CLI: samtools index file.bam
Python: pysam.index('file.bam')

Index Types

Index	Extension	Max contig	Bin shift	When required
BAI	`.bai` / `.bam.bai`	2^29-1 = ~536 Mbp	fixed (16 kb)	Default for human, mouse, fly, fish
CSI	`.csi` / `.bam.csi`	2^(min_shift + depth*3)	configurable via `-m`	Required for any contig >536 Mbp
CRAI	`.crai` / `.cram.crai`	chunk-based	n/a	CRAM only
TBI	`.tbi`	2^29-1	fixed	tabix VCF/BED -- same limit as BAI

Which Index for Which Genome

Genome	Largest contig	Index
GRCh38 / GRCh37 (human)	248 Mbp	BAI
GRCm39 (mouse)	195 Mbp	BAI
GRCz11 (zebrafish), TAIR10 (Arabidopsis)	78 Mbp / 30 Mbp	BAI
Wheat IWGSC (Triticum aestivum)	~830 Mbp avg	CSI
Pine, fir, axolotl, sugar pine	multi-Gbp	CSI with larger `-m`
Long-read assembly with very large contigs	varies	check `cut -f2 ref.fa.fai \| sort -nr \| head -1`

For polyploid plants and salamander-scale genomes, increase the bin shift:

# Default CSI matches BAI bin layout: 2^(14 + 5*3) = 2^29 ≈ 512 Mbp per contig
samtools index -c file.bam

# Larger min_shift for contigs >512 Mbp (wheat, axolotl, sugar pine)
samtools index -c -m 18 file.bam   # 2^(18+15) = 2^33 = ~8.5 Gbp per contig

Index file precedence trap: when both .bai and .csi exist, samtools uses .bai. After re-indexing to CSI for a long contig, delete the old .bai or operations fail confusingly.

samtools index

Create BAI Index

samtools index input.bam
# Creates input.bam.bai

Create CSI Index

samtools index -c input.bam
# Creates input.bam.csi

Specify Output Name

samtools index input.bam output.bai

Multi-threaded Indexing

samtools index -@ 4 input.bam

Index CRAM

samtools index input.cram
# Creates input.cram.crai

Index Requirements

Indexing requires coordinate-sorted files:

# Check sort order
samtools view -H input.bam | grep "^@HD"
# Should show SO:coordinate

# Sort if needed, then index
samtools sort -o sorted.bam input.bam
samtools index sorted.bam

Using Indices for Region Access

Goal: Extract reads overlapping specific genomic coordinates from an indexed BAM.

Approach: With the index present, samtools view or pysam.fetch() can jump directly to the relevant file offset instead of scanning the entire file.

samtools view with Region

# Requires index file present
samtools view input.bam chr1:1000000-2000000

Multiple Regions

samtools view input.bam chr1:1000-2000 chr2:3000-4000

Regions from BED File

samtools view -L regions.bed input.bam

pysam Python Alternative

Create Index

import pysam

pysam.index('input.bam')
# Creates input.bam.bai

Create CSI Index

# pysam.index passes through to samtools index; pass the -c flag for CSI.
pysam.index('-c', 'input.bam')
# Produces input.bam.csi.

Fetch with Index

with pysam.AlignmentFile('input.bam', 'rb') as bam:
    # fetch() requires index
    for read in bam.fetch('chr1', 1000000, 2000000):
        print(read.query_name)

Check if Indexed

import pysam
from pathlib import Path

def is_indexed(bam_path):
    bam_path = Path(bam_path)
    return (bam_path.with_suffix('.bam.bai').exists() or
            Path(str(bam_path) + '.bai').exists() or
            bam_path.with_suffix('.bam.csi').exists())

if not is_indexed('input.bam'):
    pysam.index('input.bam')

Fetch Multiple Regions

regions = [('chr1', 1000, 2000), ('chr1', 5000, 6000), ('chr2', 1000, 2000)]

with pysam.AlignmentFile('input.bam', 'rb') as bam:
    for chrom, start, end in regions:
        count = sum(1 for _ in bam.fetch(chrom, start, end))
        print(f'{chrom}:{start}-{end}: {count} reads')

Count Reads in Region

with pysam.AlignmentFile('input.bam', 'rb') as bam:
    count = bam.count('chr1', 1000000, 2000000)
    print(f'Reads in region: {count}')

Get Reads Covering Position

with pysam.AlignmentFile('input.bam', 'rb') as bam:
    for read in bam.fetch('chr1', 1000000, 1000001):
        if read.reference_start <= 1000000 < read.reference_end:
            print(f'{read.query_name} covers position 1000000')

Index File Locations

samtools looks for indices in two locations:

input.bam.bai   # Standard location
input.bai       # Alternative location

For CRAM:

input.cram.crai

idxstats - Index Statistics

Get Per-Chromosome Counts

samtools idxstats input.bam

Output format:

chr1    248956422    5000000    0
chr2    242193529    4500000    0
*       0            0          10000

Columns: reference name, length, mapped reads, unmapped reads.

What "mapped" Actually Counts (Caveat)

The mapped column counts every alignment record with that RNAME, including secondary AND supplementary. For long-read minimap2 output, where a single read can produce many supplementary chimeric alignments, idxstats overcounts input reads -- typically 1.5-3x.

For unique read counts, use primary-only:

samtools view -c -F 2304 input.bam chr1   # primary only

Cross-check unmapped consistency (a senior sanity check):

samtools idxstats file.bam | awk '{sum+=$4} END {print sum}'   # idxstats unmapped (sum across all rows; PE orphans get a contig RNAME)
samtools view -c -f 4 -F 2304 file.bam                         # primary unmapped (should match)

Sum Total Mapped Reads

samtools idxstats input.bam | awk '{sum += $3} END {print sum}'

pysam idxstats

with pysam.AlignmentFile('input.bam', 'rb') as bam:
    for stat in bam.get_index_statistics():
        print(f'{stat.contig}: {stat.mapped} mapped, {stat.unmapped} unmapped')

FASTA Index (faidx)

Related but different - index reference FASTA for random access:

samtools faidx reference.fa
# Creates reference.fa.fai

# Fetch region from indexed FASTA
samtools faidx reference.fa chr1:1000-2000

pysam FastaFile

with pysam.FastaFile('reference.fa') as ref:
    seq = ref.fetch('chr1', 1000, 2000)
    print(seq)

Quick Reference

Task	samtools	pysam
Create BAI	`samtools index file.bam`	`pysam.index('file.bam')`
Create CSI	`samtools index -c file.bam`	`pysam.index('file.bam', csi=True)`
Fetch region	`samtools view file.bam chr1:1-1000`	`bam.fetch('chr1', 0, 1000)`
Count in region	`samtools view -c file.bam chr1:1-1000`	`bam.count('chr1', 0, 1000)`
Index stats	`samtools idxstats file.bam`	`bam.get_index_statistics()`
Index FASTA	`samtools faidx ref.fa`	Automatic with FastaFile

Index Staleness

If the BAM was modified after indexing, the index points to wrong file offsets and region queries return wrong (or zero) reads. Quick check:

if [ input.bam -nt input.bam.bai ]; then
    echo "Index older than BAM; re-indexing"
    samtools index input.bam
fi

Contig-Naming Sanity Check

A leading cause of "my variant calling produced empty VCFs" tickets: querying chrM against a BAM that uses MT (or chr1 vs 1). Always inspect contig conventions before region queries:

samtools view -H input.bam | grep '^@SQ' | head -3
# Compare with reference dict:
samtools dict ref.fa | head -3

UCSC convention uses chr1/chrM; Ensembl/NCBI uses 1/MT. The two are not interchangeable; tools fail with "contig not found" or silently return zero reads.

Common Errors

Error	Cause	Solution
`random alignment retrieval only works for indexed BAM`	Missing index	Run `samtools index file.bam`
`file is not sorted`	Unsorted BAM	Sort first with `samtools sort`
`chromosome not found`	Wrong chromosome name	Check names with `samtools view -H`
Region query returns zero reads on a known-populated locus	Stale BAI / `chr` vs no-`chr` mismatch	Re-index; verify naming convention
BAI silently truncates reads on contigs >536 Mbp	Plant / amphibian / amplified genome	Use CSI: `samtools index -c file.bam`

Related Skills

sam-bam-basics - View and convert alignment files
alignment-sorting - Sort BAM files (required before indexing)
alignment-filtering - Filter by regions using index
bam-statistics - Use idxstats for quick counts
sequence-io/read-sequences - Index FASTA with SeqIO.index_db()