| name | bio-genome-intervals-gtf-gff-handling |
| description | Parse, query, and convert GTF and GFF3 annotation files. Extract gene, transcript, and exon coordinates using gffread, gtfparse, and gffutils. Use when extracting specific features from gene annotations or converting between annotation formats. |
| tool_type | mixed |
| primary_tool | gffread |
Version Compatibility
Reference examples tested with: bedtools 2.31+, pandas 2.2+
Before using code patterns, verify installed versions match. If versions differ:
- Python: pip show <package>, then help(module.function) to check signatures
- CLI: <tool> --version, then <tool> --help to confirm flags
If code throws ImportError, AttributeError, or TypeError, introspect the installed
package and adapt the example to match the actual API rather than retrying.
GTF/GFF Handling
"Parse gene annotations from GTF/GFF" → Read gene models, extract features by type (gene, exon, CDS), and query attributes from annotation files.
- Python: gffutils.create_db('file.gtf', 'file.db') (gffutils), pyranges.read_gtf() (pyranges)
- CLI: awk on tab-delimited GTF fields
GTF and GFF3 are standard gene annotation formats. Both use 1-based coordinates.
Format Comparison
| Feature | GTF | GFF3 |
|---|---|---|
| Coordinate system | 1-based, inclusive | 1-based, inclusive |
| Hierarchy | Implicit (gene_id, transcript_id) | Explicit (Parent attribute) |
| Attribute format | key "value"; | key=value; |
| Comments | # | # |
| Fasta sequences | Not standard | ##FASTA directive |
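The coordinate row above matters most when moving to a 0-based, half-open format such as BED. The following is a minimal sketch of that conversion; the helper names are hypothetical, and the interval is the DDX11L1 gene record used elsewhere in this document.

```python
def gtf_to_bed_interval(start, end):
    """Convert a 1-based inclusive interval (GTF/GFF3) to 0-based half-open (BED)."""
    if start < 1 or end < start:
        raise ValueError(f"invalid 1-based interval: {start}-{end}")
    return start - 1, end  # only the start shifts; the end value is unchanged


def bed_to_gtf_interval(start, end):
    """Convert a 0-based half-open interval (BED) back to 1-based inclusive."""
    if start < 0 or end <= start:
        raise ValueError(f"invalid 0-based interval: {start}-{end}")
    return start + 1, end


bed_start, bed_end = gtf_to_bed_interval(11869, 14409)
length_from_gtf = 14409 - 11869 + 1    # inclusive length needs the +1
length_from_bed = bed_end - bed_start  # half-open length is a plain difference
```

Both length calculations should agree, which is a quick way to check that a conversion did not introduce an off-by-one error.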
GTF Format
chr1 HAVANA gene 11869 14409 . + . gene_id "ENSG00000223972"; gene_name "DDX11L1";
chr1 HAVANA transcript 11869 14409 . + . gene_id "ENSG00000223972"; transcript_id "ENST00000456328";
chr1 HAVANA exon 11869 12227 . + . gene_id "ENSG00000223972"; transcript_id "ENST00000456328"; exon_number "1";
GFF3 Format
chr1 HAVANA gene 11869 14409 . + . ID=ENSG00000223972;Name=DDX11L1
chr1 HAVANA mRNA 11869 14409 . + . ID=ENST00000456328;Parent=ENSG00000223972
chr1 HAVANA exon 11869 12227 . + . ID=exon1;Parent=ENST00000456328
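The two attribute syntaxes shown above can be parsed with a few lines of standard-library Python. This is a simplified sketch, not a full parser: the function names are hypothetical, and it assumes no semicolons occur inside quoted GTF values.

```python
from urllib.parse import unquote


def parse_gtf_attributes(field):
    """Parse 'key "value"; key "value";' into {key: [values]}."""
    attrs = {}
    for item in field.strip().split(";"):
        item = item.strip()
        if not item:
            continue
        key, _, value = item.partition(" ")
        # GTF keys may repeat (for example several tag entries), so collect lists.
        attrs.setdefault(key, []).append(value.strip().strip('"'))
    return attrs


def parse_gff3_attributes(field):
    """Parse 'key=value;key=v1,v2' into {key: [values]}."""
    attrs = {}
    for item in field.strip().split(";"):
        item = item.strip()
        if not item:
            continue
        key, _, value = item.partition("=")
        # GFF3 separates multiple values with commas and percent-encodes
        # reserved characters.
        attrs[key] = [unquote(v) for v in value.split(",")]
    return attrs


gtf = parse_gtf_attributes('gene_id "ENSG00000223972"; gene_name "DDX11L1";')
gff = parse_gff3_attributes("ID=ENST00000456328;Parent=ENSG00000223972")
```

In the GFF3 result, the Parent entry is what links a transcript to its gene, whereas in GTF the same link is implied by a shared gene_id.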
Parse GTF with gtfparse (Python)
Installation
pip install gtfparse
Basic Parsing
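import gtfparse
# Load the GTF; attribute keys such as gene_id and gene_name become columns
df = gtfparse.read_gtf('annotation.gtf')
# gtfparse 2.x may return a polars DataFrame; convert it for the pandas-style filters below
if hasattr(df, 'to_pandas'):
    df = df.to_pandas()
print(df.columns)
# Filter by feature type (third GTF column)
genes = df[df['feature'] == 'gene']
transcripts = df[df['feature'] == 'transcript']
exons = df[df['feature'] == 'exon']
# All records (gene, transcripts, exons, ...) annotated with one gene symbol
gene_df = df[df['gene_name'] == 'TP53']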
Extract Gene Coordinates
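import gtfparse
df = gtfparse.read_gtf('annotation.gtf')
if hasattr(df, 'to_pandas'):  # gtfparse 2.x may return a polars DataFrame
    df = df.to_pandas()
genes = df[df['feature'] == 'gene'][['seqname', 'start', 'end', 'strand', 'gene_id', 'gene_name']]
# BED is 0-based, half-open: subtract 1 from start and leave end unchanged
genes_bed = genes.copy()
genes_bed['start'] = genes_bed['start'] - 1
# BED6 column order is chrom, start, end, name, score, strand.
# Swap gene_name for gene_id in the name column if symbols are missing or not unique.
genes_bed['score'] = 0
genes_bed = genes_bed[['seqname', 'start', 'end', 'gene_name', 'score', 'strand']]
genes_bed.to_csv('genes.bed', sep='\t', header=False, index=False)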
Get Exons for Gene
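import gtfparse
df = gtfparse.read_gtf('annotation.gtf')
if hasattr(df, 'to_pandas'):  # gtfparse 2.x may return a polars DataFrame
    df = df.to_pandas()
# Exon rows for one gene symbol, across all of its transcripts
tp53_exons = df[(df['gene_name'] == 'TP53') & (df['feature'] == 'exon')]
tp53_exons = tp53_exons[['seqname', 'start', 'end', 'transcript_id', 'exon_number']]
print(tp53_exons)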
Parse GFF with gffutils (Python)
Installation
pip install gffutils
Create Database
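import gffutils
# Build the SQLite database once; force=True overwrites an existing annotation.db
db = gffutils.create_db('annotation.gff3', 'annotation.db',
                        force=True, merge_strategy='create_unique')
# Later sessions can reopen the existing database without rebuilding it
db = gffutils.FeatureDB('annotation.db')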
Query Features
import gffutils
db = gffutils.FeatureDB('annotation.db')
# Count features of each type
for featuretype in db.featuretypes():
    count = db.count_features_of_type(featuretype)
    print(f'{featuretype}: {count}')
# Iterate over all genes
for gene in db.features_of_type('gene'):
    print(f'{gene.id}: {gene.seqid}:{gene.start}-{gene.end}')
# Look up one gene; the key must match the ID attribute exactly as written in the file
gene = db['ENSG00000141510']
print(f'{gene.attributes["Name"][0]}: {gene.seqid}:{gene.start}-{gene.end}')
# Walk the hierarchy: gene -> mRNA -> exon
for transcript in db.children(gene, featuretype='mRNA'):
    print(f'  Transcript: {transcript.id}')
    for exon in db.children(transcript, featuretype='exon', order_by='start'):
        print(f'    Exon: {exon.start}-{exon.end}')
Get Introns
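import gffutils
db = gffutils.FeatureDB('annotation.db')
transcript = db['ENST00000269305']
# interfeatures() returns the gaps between consecutive features,
# so the exons must be passed in sorted by start position
exons = db.children(transcript, featuretype='exon', order_by='start')
introns = list(db.interfeatures(exons, new_featuretype='intron'))
for intron in introns:
    print(f'Intron: {intron.start}-{intron.end}')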
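To make the intron logic explicit, here is a standard-library sketch of the same idea: introns are the gaps between consecutive exons of one transcript. The function name and the exon coordinates are illustrative assumptions, and the sketch assumes the exons do not overlap.

```python
def introns_from_exons(exons):
    """Return gaps between exons given as (start, end), 1-based inclusive."""
    ordered = sorted(exons)  # sort by start so neighbours are adjacent exons
    introns = []
    for (_, prev_end), (next_start, _) in zip(ordered, ordered[1:]):
        # An intron exists only if at least one base separates the two exons.
        if next_start > prev_end + 1:
            introns.append((prev_end + 1, next_start - 1))
    return introns


# Hypothetical three-exon transcript, deliberately given out of order.
example_exons = [(13221, 14409), (11869, 12227), (12613, 12721)]
example_introns = introns_from_exons(example_exons)
```

A transcript with N non-adjacent exons yields N - 1 introns, so this example produces two.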
Convert Formats with gffread (CLI)
Installation
conda install -c bioconda gffread
GTF to GFF3
gffread annotation.gtf -o annotation.gff3
GFF3 to GTF
gffread annotation.gff3 -T -o annotation.gtf
Extract Sequences
gffread -w transcripts.fa -g genome.fa annotation.gtf
gffread -x cds.fa -g genome.fa annotation.gtf
gffread -y proteins.fa -g genome.fa annotation.gtf
Filter Features
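# Keep only transcripts that have CDS features; -T writes GTF (the default output is GFF3)
gffread annotation.gtf -C -T -o coding.gtf
# --keep-genes takes no value; it only preserves gene records in the output.
# To select by biotype, filter on the attribute text (gene_biotype in Ensembl, gene_type in GENCODE)
awk -F'\t' '/^#/ || $9 ~ /gene_biotype "protein_coding"/' annotation.gtf > protein_coding.gtf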
Extract Regions with bedtools
Get Promoters
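# 1 bp TSS intervals in BED coordinates: transcript start on +, transcript end on -
awk -F'\t' '$3 == "transcript"' annotation.gtf | \
awk -F'\t' -v OFS='\t' '{
    if ($7 == "+") print $1, $4-1, $4, ".", ".", $7;
    else print $1, $5-1, $5, ".", ".", $7;
}' > tss.bed
# genome.txt lists chromosome name and length, tab-separated; -s makes -l mean upstream on both strands
bedtools flank -i tss.bed -g genome.txt -l 2000 -r 0 -s > promoters.bed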
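The strand handling in the commands above is easy to get wrong, so the following standard-library sketch spells out the arithmetic for a single transcript. The function name and parameters are hypothetical; the result is meant to mirror the 2000 bp upstream flank of a 1 bp TSS interval, clipped to the chromosome.

```python
def promoter_interval(start, end, strand, upstream=2000, chrom_size=None):
    """Upstream flank of the TSS as a 0-based half-open interval.

    start and end are the 1-based inclusive transcript coordinates from the GTF.
    """
    if strand == "+":
        tss0 = start - 1                      # 0-based position of the TSS base
        left, right = tss0 - upstream, tss0   # bases immediately before the TSS
    elif strand == "-":
        # On the minus strand the TSS is the last base; upstream lies to the right.
        left, right = end, end + upstream
    else:
        raise ValueError(f"strand must be '+' or '-', got {strand!r}")
    left = max(0, left)                       # clip at the chromosome start
    if chrom_size is not None:
        right = min(right, chrom_size)        # clip at the chromosome end
    return left, right


plus_promoter = promoter_interval(11869, 14409, "+")
minus_promoter = promoter_interval(11869, 14409, "-")
```

Note that the flank excludes the TSS base itself; extend the interval by one base on the TSS side if the promoter definition should include it.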
Get Gene Bodies
awk '$3 == "gene"' annotation.gtf | \
awk -v OFS='\t' '{
split($0, a, "gene_id \""); split(a[2], b, "\"");
print $1, $4-1, $5, b[1], ".", $7;
}' > genes.bed
Get Exons
awk '$3 == "exon"' annotation.gtf | \
awk -v OFS='\t' '{print $1, $4-1, $5, ".", ".", $7}' | \
sort -k1,1 -k2,2n | uniq > exons.bed
Python: GTF to BED Conversion
Goal: Convert GTF annotation features to BED format with proper coordinate system translation for use with bedtools operations.
Approach: Load the GTF with gtfparse, filter to the desired feature type (gene, exon, etc.), convert from 1-based inclusive GTF coordinates to 0-based half-open BED coordinates by subtracting 1 from start positions.
import gtfparse
import pandas as pd
def gtf_to_bed(gtf_path, feature_type='gene', output_path=None):
    '''Convert GTF features to BED format.'''
    df = gtfparse.read_gtf(gtf_path)
    if hasattr(df, 'to_pandas'):  # gtfparse 2.x may return a polars DataFrame
        df = df.to_pandas()
    features = df[df['feature'] == feature_type].copy()
    bed = pd.DataFrame({
        'chrom': features['seqname'],
        'start': features['start'] - 1,  # 1-based inclusive -> 0-based half-open
        'end': features['end'],
        # Prefer gene_name, fall back to gene_id, then to '.'
        'name': features.get('gene_name', features.get('gene_id', '.')),
        'score': 0,
        'strand': features['strand']
    })
    if output_path:
        bed.to_csv(output_path, sep='\t', header=False, index=False)
    return bed
genes_bed = gtf_to_bed('annotation.gtf', 'gene', 'genes.bed')
exons_bed = gtf_to_bed('annotation.gtf', 'exon', 'exons.bed')
Validate GTF/GFF
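# -E warns about duplicate transcript IDs and other potential problems in the records
gffread -E annotation.gtf
gffread -E annotation.gff3
# GFF3-only validator from GenomeTools
gt gff3validator annotation.gff3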
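When neither tool is installed, a quick structural check can still catch the most common problems. The sketch below is an assumption-laden stand-in, not a replacement for the validators above: the function name is hypothetical and it only checks column count, coordinates, strand, and phase.

```python
def check_annotation_line(line):
    """Return a list of problems found in one GTF/GFF3 record line."""
    fields = line.rstrip("\n").split("\t")
    if len(fields) != 9:
        return [f"expected 9 tab-separated columns, found {len(fields)}"]
    problems = []
    start, end, strand, phase = fields[3], fields[4], fields[6], fields[7]
    if not (start.isdecimal() and end.isdecimal()):
        problems.append("start and end must be positive integers")
    elif int(start) < 1 or int(start) > int(end):
        problems.append("start must be >= 1 and <= end")
    if strand not in {"+", "-", ".", "?"}:
        problems.append(f"unexpected strand {strand!r}")
    if phase not in {"0", "1", "2", "."}:
        problems.append(f"unexpected phase {phase!r}")
    return problems


good = "\t".join(["chr1", "HAVANA", "exon", "11869", "12227", ".", "+", ".", "ID=exon1"])
bad = "\t".join(["chr1", "HAVANA", "exon", "12227", "11869", ".", "+", ".", "ID=exon1"])
good_problems = check_annotation_line(good)
bad_problems = check_annotation_line(bad)
```

Comment and directive lines (those starting with #) should be skipped before calling the function, since they are not nine-column records.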
Common Attributes
GTF Attributes
| Attribute | Description |
|---|---|
| gene_id | Ensembl gene ID |
| gene_name | Gene symbol |
| gene_biotype | protein_coding, lncRNA, etc. (named gene_type in GENCODE files) |
| transcript_id | Ensembl transcript ID |
| transcript_name | Transcript symbol |
| exon_number | Exon position in transcript |
| exon_id | Ensembl exon ID |
GFF3 Attributes
| Attribute | Description |
|---|---|
| ID | Unique feature identifier |
| Name | Display name |
| Parent | Parent feature ID |
| Dbxref | Database cross-references |
| gene_biotype | Gene type |
Memory-Efficient Processing
import gtfparse
# Keep only gene and exon records in the returned table
df = gtfparse.read_gtf('annotation.gtf',
                       features=['gene', 'exon'])
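If even the filtered table is too large, the file can be streamed one line at a time instead. The following is a minimal standard-library sketch under that assumption; the generator name and the in-memory demo input are hypothetical.

```python
import io


def iter_features(handle, wanted=("gene", "exon")):
    """Yield (seqid, feature, start, end, strand) for selected feature types."""
    wanted = set(wanted)
    for line in handle:
        if line.startswith("##FASTA"):
            break  # GFF3 sequence section: no annotation records follow
        if line.startswith("#") or not line.strip():
            continue  # skip comments, directives, and blank lines
        fields = line.rstrip("\n").split("\t")
        if len(fields) != 9 or fields[2] not in wanted:
            continue
        yield fields[0], fields[2], int(fields[3]), int(fields[4]), fields[6]


# Tiny in-memory stand-in for a real annotation file.
rows = [
    ["chr1", "HAVANA", "gene", "11869", "14409", ".", "+", ".", 'gene_id "ENSG00000223972";'],
    ["chr1", "HAVANA", "transcript", "11869", "14409", ".", "+", ".", 'gene_id "ENSG00000223972";'],
    ["chr1", "HAVANA", "exon", "11869", "12227", ".", "+", ".", 'gene_id "ENSG00000223972";'],
]
demo = io.StringIO("# demo\n" + "".join("\t".join(r) + "\n" for r in rows))
records = list(iter_features(demo))
```

With a real file, pass an open file handle in place of the StringIO object; only one line is held in memory at a time, at the cost of losing the convenience of a DataFrame.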
Related Skills
- bed-file-basics - BED format and conversion
- interval-arithmetic - Gene/exon overlap analysis
- proximity-operations - TSS proximity analysis
- differential-expression/de-results - Gene coordinate mapping