bio-expression-matrix-gene-id-mapping

Updated May 25, 2026 at 18:23

Maps between gene identifier systems (Ensembl, Entrez, HGNC symbol, UniProt, RefSeq, MANE) using AnnotationDbi, biomaRt, mygene, pyensembl, and Ensembl REST. Encodes Ensembl version stripping with GENCODE _PAR_Y preservation, the Ziemann 2016 Excel autocorrect debacle and Bruford 2020 HGNC renames (SEPT*->SEPTIN*, MARCH*->MARCHF*, MARC*->MTARC*, DEC1->DELEC1), OCT4/POU5F1 alias resolution, biomaRt archive endpoints for release pinning, the `filters` (plural) gotcha, MANE Select for clinical reporting, cross-species orthology via Ensembl Compara / OMA / OrthoDB, and tx2gene construction for tximport. Use when converting gene IDs across systems, handling renamed symbols, building tx2gene, pinning to a specific Ensembl release for reproducibility, or mapping cross-species orthologs.

Source

GPTomics

GPTomics/bioSkills

Version Compatibility

Reference examples tested with: biomaRt 2.58+, AnnotationDbi 1.66+, org.Hs.eg.db 3.18+, org.Mm.eg.db 3.18+, GenomicFeatures 1.54+, mygene 1.38+ (Python), pyensembl 2.3+, pandas 2.2+, rtracklayer 1.62+

Before using code patterns, verify installed versions match. If versions differ:

R: packageVersion('<pkg>') then ?function_name to verify parameters
Python: pip show <package> then help(module.function) to check signatures

If code throws ImportError, AttributeError, or TypeError, introspect the installed package and adapt the example to match the actual API rather than retrying.

Gene ID Mapping

"Convert gene IDs from X to Y" -> Query the appropriate annotation source (local org.db for speed, biomaRt for Ensembl-specific attributes, mygene for cross-database aliases, Ensembl REST for low-level access), with version pinning for reproducibility and explicit handling of one-to-many mappings, withdrawn symbols, and species-specific naming.

The Single Most Important Modern Insight -- Excel autocorrect renamed real genes

Ziemann, Eren, El-Osta 2016 Genome Biol 17:177 scanned 18 leading genomics journals and found ~20% of papers with Excel-attached supplementary gene lists had silently mangled symbols (SEPT2 -> 2-Sep, MARCH1 -> 1-Mar, ...). Five years later the problem persisted. HGNC's response (Bruford, Braschi, Denny, Jones, Seal, Tweedie 2020 Nat Genet 52:754) was to rename the affected genes:

Old	New	Affected
`SEPT#`	`SEPTIN#`	SEPT1 - SEPT14 -> SEPTIN1 - SEPTIN14
`MARCH#`	`MARCHF#`	MARCH1 - MARCH11 -> MARCHF1 - MARCHF11
`MARC#`	`MTARC#`	MARC1, MARC2 -> MTARC1, MTARC2
`DEC1`	`DELEC1`	DEC1 -> DELEC1

Code that hard-codes old symbols silently drops these genes when joined against post-2020 annotations. Detection on import: if a gene column contains ^\d{1,2}-(Jan|Feb|Mar|...|Dec)$ patterns, the file was Excel-corrupted. Always read.csv(colClasses=c(gene='character')) (R) or pd.read_csv(dtype={'gene': str}) (Python) -- but the damage is at Excel-save time, not import time.
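The detection rule above can be sketched in Python. This is a minimal illustration rather than part of the original skill: the names `find_excel_dates` and `restore_symbol` are hypothetical, and the reverse lookup only covers the gene families listed in the rename table. It assumes that `MARC#` symbols are mangled to the same `#-Mar` form as `MARCH#`, so the sketch returns every candidate instead of guessing.

```python
import re

# Matches Excel date-mangled symbols such as "2-Sep" or "1-Mar".
EXCEL_DATE = re.compile(
    r'^(\d{1,2})-(Jan|Feb|Mar|Apr|May|Jun|Jul|Aug|Sep|Oct|Nov|Dec)$'
)

# Month abbreviation -> current HGNC prefixes it may have come from
# (limited to the families in the rename table; an assumption of this sketch).
CANDIDATE_PREFIXES = {
    'Sep': ['SEPTIN'],
    'Mar': ['MARCHF', 'MTARC'],  # assumed ambiguous: MARCH# and MARC#
    'Dec': ['DELEC'],
}

def find_excel_dates(symbols):
    """Return the entries that look like Excel-converted dates."""
    return [s for s in symbols if EXCEL_DATE.match(str(s))]

def restore_symbol(mangled):
    """Return candidate current symbols for a mangled entry (possibly empty)."""
    match = EXCEL_DATE.match(mangled)
    if match is None:
        return []
    number, month = match.groups()
    return [f'{prefix}{int(number)}' for prefix in CANDIDATE_PREFIXES.get(month, [])]
```

An empty list from `restore_symbol` means the entry needs manual review; recovering from the original, non-Excel source file is always preferable to guessing.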

Two related insights that determine half the practical work:

Ensembl version suffixes matter sometimes and not others. ENSG00000123456.7 is release-specific; the unversioned ENSG00000123456 is the stable cross-release ID. STRIP for cross-release joins, MSigDB lookups, gene-set databases. KEEP for intra-release reproducibility and clinical reports. CRITICAL: the naive sub('\\..*', '', x) regex ALSO strips the GENCODE _PAR_Y suffix in releases 25-43, collapsing chrY PAR duplicates onto their chrX counterparts. Use sub('\\.[0-9]+(_PAR_Y)?$', '\\1', x).
Never use HGNC symbols as the primary computational key. Symbols change. Use Ensembl or Entrez as keys; carry symbols only as display labels in the final results table.
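A Python counterpart of the preserving regex, as a small sketch. The helper name `strip_version` is hypothetical and the version numbers in the example IDs are illustrative only. It contrasts the naive split with the `_PAR_Y`-aware substitution.

```python
import re

# '.N' version suffix, optionally followed by the GENCODE '_PAR_Y' tag.
_VERSION = re.compile(r'\.[0-9]+(_PAR_Y)?$')

def strip_version(gene_id):
    """Drop the '.N' version but keep a trailing '_PAR_Y' tag."""
    return _VERSION.sub(lambda m: m.group(1) or '', gene_id)

# Illustrative IDs: one ordinary gene plus the chrX and chrY copies of a PAR gene.
ids = ['ENSG00000141510.18', 'ENSG00000182378.14', 'ENSG00000182378.14_PAR_Y']

naive = [g.split('.')[0] for g in ids]   # collapses the chrY copy onto chrX
safe = [strip_version(g) for g in ids]   # keeps the two PAR rows distinct
```

With the naive split the last two IDs become identical, which is exactly the duplicate-row problem described above; the preserving version keeps three distinct keys.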

Algorithmic Taxonomy

Tool	Source	Speed	Strength	Use for
AnnotationDbi + `org.Hs.eg.db` / `org.Mm.eg.db`	NCBI Gene snapshot, pinned at Bioc install	Fast, local	Stable, version-pinned	Default for Ensembl <-> Entrez <-> Symbol within Bioconductor
biomaRt	Ensembl BioMart over HTTP	Slow for >5k queries; timeouts	Ensembl-specific attributes (biotype, transcript versions, paralogs, orthologs)	Need Ensembl-specific fields; archive endpoints for release pinning
mygene.info / mygene (Python)	REST API to a curated meta-database	Batched requests, 1000 IDs per call	Best for symbol/alias/prev_symbol resolution	Cross-database; HGNC withdrawn symbol resolution; non-R environments
Ensembl REST	Direct REST API to Ensembl	Rate-limited (15 req/sec)	Low-level access to variant consequence, sequence, etc.	Specialized queries not covered by biomaRt
pyensembl	Local Ensembl database (Python)	Fast, local, version-pinned	Reproducible offline; gene objects with transcript and exon access	Python pipelines needing rich annotation
HGNC API direct	https://rest.genenames.org	REST	Authoritative source for HGNC	Symbol provenance, prev/alias detection

Decision Tree by Scenario

Scenario	Recommended approach	Why
R Bioconductor pipeline, Ensembl <-> Entrez <-> Symbol	`AnnotationDbi::mapIds(org.Hs.eg.db, ...)`	Fastest, version-pinned, stable
Need Ensembl-only attributes (biotype, paralog, ortholog)	`biomaRt::useEnsembl(version=N)`	Only biomaRt exposes these
Cross-database with alias and withdrawn-symbol fallback	mygene `querymany(scopes='symbol,alias,prev_symbol')`	Designed for this case
Python pipeline, reproducible	`pyensembl` with pinned release	Offline, version-locked
Clinical report needing canonical transcript per gene	MANE Select (Morales 2022 Nature 604:310)	Cross-database consensus (RefSeq + Ensembl)
Cross-species mouse <-> human	Ensembl Compara `getLDS` filtered to `one2one`	Compara has best coverage; one2one most defensible
Building tx2gene for tximport	`GenomicFeatures::makeTxDbFromGFF` on the SAME GTF used in quantification	Annotation pinning matters
Need to reproduce a 2023 analysis exactly	`useEnsembl(version=109)` (or whichever release was used)	Without `version=`, biomaRt floats to current release
GRCh37 (legacy clinical)	`useEnsembl(GRCh=37)` dedicated permanent endpoint	GRCh37 -> GRCh38 mappings are not 1:1

AnnotationDbi + org.db

Goal: Map Ensembl gene IDs to symbols, Entrez IDs, or descriptions using a local Bioconductor annotation package.

Approach: mapIds() with the source keytype and target column; handle one-to-many via multiVals.

library(org.Hs.eg.db)
library(AnnotationDbi)

ensembl_ids <- sub('\\.[0-9]+(_PAR_Y)?$', '\\1', rownames(counts))

symbols  <- mapIds(org.Hs.eg.db, keys = ensembl_ids,
                    keytype = 'ENSEMBL', column = 'SYMBOL',
                    multiVals = 'first')
entrez   <- mapIds(org.Hs.eg.db, keys = ensembl_ids,
                    keytype = 'ENSEMBL', column = 'ENTREZID',
                    multiVals = 'first')
descrips <- mapIds(org.Hs.eg.db, keys = ensembl_ids,
                    keytype = 'ENSEMBL', column = 'GENENAME',
                    multiVals = 'first')

keytypes(org.Hs.eg.db)

multiVals options: 'first' (silent), 'asNA' (NA for ambiguous), 'list' (preserve all). For DE results tables, 'first' is typical but the mapping rate should be reported.
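For Python pipelines working from an exported mapping table, the three policies can be mirrored in pandas. This is a sketch under assumptions: `resolve_multi` is a hypothetical helper, not an AnnotationDbi function, and the toy table uses placeholder IDs.

```python
import pandas as pd

def resolve_multi(mapping, key, value, policy='first'):
    """Collapse a long key/value table to one entry per key."""
    grouped = mapping.groupby(key, sort=False)[value]
    if policy == 'first':
        # Keep the first value seen for each key (silent about ambiguity).
        return grouped.first()
    if policy == 'asNA':
        # Keep a value only when the key maps to exactly one distinct target.
        return grouped.first().where(grouped.nunique() == 1)
    if policy == 'list':
        # Preserve every target as a list.
        return grouped.apply(list)
    raise ValueError(f'unknown policy: {policy!r}')

# Placeholder IDs: key 'B' maps to two different targets.
toy = pd.DataFrame({
    'ENSEMBL': ['A', 'B', 'B', 'C'],
    'SYMBOL':  ['a', 'b1', 'b2', 'c'],
})

first = resolve_multi(toy, 'ENSEMBL', 'SYMBOL', 'first')
as_na = resolve_multi(toy, 'ENSEMBL', 'SYMBOL', 'asNA')
mapping_rate = as_na.notna().mean()  # fraction of keys with an unambiguous target
```

Reporting `mapping_rate` next to the results table makes the cost of the chosen policy visible.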

For mouse: org.Mm.eg.db. For other organisms: check Bioconductor AnnotationData -> OrgDb list.

biomaRt with Version Pinning

Goal: Query Ensembl BioMart with the EXACT release version, for reproducibility.

Approach: useEnsembl(version=N) pins; listEnsemblArchives() lists available archives.

library(biomaRt)

ensembl <- useEnsembl(biomart = 'genes',
                       dataset = 'hsapiens_gene_ensembl',
                       version = 110)

ensembl_grch37 <- useEnsembl(biomart = 'genes',
                              dataset = 'hsapiens_gene_ensembl',
                              GRCh = 37)

mapping <- getBM(
    attributes = c('ensembl_gene_id', 'hgnc_symbol', 'entrezgene_id',
                   'gene_biotype', 'description'),
    filters    = 'ensembl_gene_id',
    values     = ensembl_ids,
    mart       = ensembl
)

The filters= argument is PLURAL. The singular filter= may work via R's partial matching but breaks unpredictably if another argument starts with f. Always spell filters= and values= fully.

Multiple filters:

genes_in_region <- getBM(
    attributes = c('ensembl_gene_id', 'hgnc_symbol'),
    filters    = c('chromosome_name', 'start', 'end'),
    values     = list('16', 1100000, 1250000),
    mart       = ensembl
)

Without version=, biomaRt floats to the current release -- a script written in 2023 against Ensembl 109 produces different mappings when rerun later against whatever release is current at that time. ALWAYS pin for any published analysis. Cache the mapping table alongside the analysis for reproducibility.

listEnsemblArchives() shows the available historical releases.

mygene (Python)

Goal: Map between any identifier systems using the curated MyGene.info meta-database with alias fallback.

Approach: MyGeneInfo().querymany(ids, scopes, fields, species); large ID lists are sent in batches of 1000 per request automatically.

import mygene

mg = mygene.MyGeneInfo()

results = mg.querymany(['ENSG00000141510', 'ENSG00000012048', 'ENSG00000141736'],
                        scopes='ensembl.gene', fields='symbol,entrezgene,uniprot',
                        species='human')

mapping = {r['query']: r.get('symbol', None) for r in results}

results = mg.querymany(['SEPT1', 'MARCH1', 'OCT4'],
                        scopes='symbol,alias,prev_symbol',
                        fields='symbol,entrezgene,ensembl.gene',
                        species='human')

For paper-derived gene lists where symbols may be old or aliases (OCT4 vs POU5F1, MARCH1 vs MARCHF1, SEPT2 vs SEPTIN2), scopes='symbol,alias,prev_symbol' handles the resolution. The MyGene database aggregates HGNC's prev/alias columns.

OCT4 is the common usage; POU5F1 is the official HGNC symbol; in MSigDB the gene is POU5F1; in a Western blot legend it's "Oct4". For mapping a stem-cell paper to an Ensembl-quantified matrix, scope to aliases.
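Alias queries can return several hits for one input, so the result list needs post-processing. The sketch below works offline on a list shaped like `querymany` output (dicts carrying `query`, and either `notfound` or hit fields with a `_score`). The helper `best_hits` is hypothetical, and the toy records, scores, and the `PLACEHOLDER2` symbol are hand-written for illustration, not real API output.

```python
def best_hits(results, field='symbol'):
    """Pick the top-scoring hit per query; report missing and ambiguous queries."""
    best = {}     # query -> (score, value)
    seen = {}     # query -> set of distinct values returned
    order = []    # queries in first-seen order
    for hit in results:
        query = hit['query']
        if query not in seen:
            seen[query] = set()
            order.append(query)
        if hit.get('notfound') or field not in hit:
            continue
        seen[query].add(hit[field])
        score = hit.get('_score', 0.0)
        if query not in best or score > best[query][0]:
            best[query] = (score, hit[field])
    resolved = {q: best[q][1] for q in order if q in best}
    missing = [q for q in order if q not in best]
    ambiguous = [q for q in order if len(seen[q]) > 1]
    return resolved, missing, ambiguous

# Hand-written toy records in the shape of querymany output.
toy_results = [
    {'query': 'SEPT1', '_score': 20.0, 'symbol': 'SEPTIN1'},
    {'query': 'OCT4', '_score': 18.0, 'symbol': 'POU5F1'},
    {'query': 'OCT4', '_score': 2.0, 'symbol': 'PLACEHOLDER2'},
    {'query': 'NOT_A_GENE', 'notfound': True},
]

resolved, missing, ambiguous = best_hits(toy_results)
```

Queries listed in `ambiguous` deserve a manual look, since an alias can legitimately point to more than one gene.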

pyensembl

from pyensembl import EnsemblRelease

ensembl = EnsemblRelease(110, species='human')

gene = ensembl.gene_by_id('ENSG00000141510')
gene.gene_name

gene = ensembl.genes_by_name('TP53')[0]
gene.gene_id

mapping = {}
for eid in ensembl_ids:
    try:
        gene = ensembl.gene_by_id(eid.split('.')[0])
        mapping[eid] = gene.gene_name
    except ValueError:
        mapping[eid] = None

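pyensembl needs a one-time download and indexing of the release (pyensembl install --release 110 --species human on the command line, or ensembl.download() followed by ensembl.index() in Python); thereafter lookups are offline and version-locked.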

Apply Mapping to Count Matrix -- Handling One-to-Many

Goal: Convert the gene index of a count matrix to a different ID type, summing reads from multiple source IDs that map to the same target.

Approach: Look up mapping, replace index, aggregate duplicates by SUM (not mean -- counts add).

import pandas as pd
import mygene

def map_count_matrix_ids(counts, from_type='ensembl.gene', to_type='symbol',
                         species='human'):
    '''Map gene IDs in count matrix index, summing reads when multiple source map to one target.'''
    mg = mygene.MyGeneInfo()
    clean = [g.split('.')[0] for g in counts.index]
    results = mg.querymany(clean, scopes=from_type, fields=to_type, species=species)
    mapping = {r['query']: r[to_type] for r in results if to_type in r}
    new_index = [mapping.get(g.split('.')[0], g) for g in counts.index]
    counts_mapped = counts.copy()
    counts_mapped.index = new_index
    counts_mapped = counts_mapped.groupby(counts_mapped.index).sum()
    return counts_mapped

mapped = map_count_matrix_ids(counts, 'ensembl.gene', 'symbol')

Counts ADD when collapsing multiple source genes to one target. Means or medians would be wrong (they understate library size for the merged target).
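A tiny worked example makes the difference concrete. The matrix and the ID-to-symbol dictionary below are toy values invented for illustration.

```python
import pandas as pd

# Toy counts: two source IDs (ENSG_A, ENSG_B) map to the same target symbol.
counts = pd.DataFrame(
    {'s1': [10, 30, 5], 's2': [20, 40, 0]},
    index=['ENSG_A', 'ENSG_B', 'ENSG_C'],
)
to_symbol = {'ENSG_A': 'GENE1', 'ENSG_B': 'GENE1', 'ENSG_C': 'GENE2'}

target = counts.index.map(to_symbol)
by_sum = counts.groupby(target).sum()    # GENE1 gets 10 + 30 reads in s1
by_mean = counts.groupby(target).mean()  # GENE1 gets only 20 reads in s1

# Summing preserves each sample's library size; averaging shrinks it.
library_size_preserved = bool((by_sum.sum() == counts.sum()).all())
```

In sample `s1` the summed matrix still totals 45 reads, while the averaged one totals 25, which would bias any library-size normalization downstream.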

library(biomaRt)

ensembl <- useEnsembl(biomart = 'genes', dataset = 'hsapiens_gene_ensembl', version = 110)

clean <- sub('\\.[0-9]+(_PAR_Y)?$', '\\1', rownames(counts))

mapping <- getBM(
    attributes = c('ensembl_gene_id', 'hgnc_symbol'),
    filters    = 'ensembl_gene_id',
    values     = clean,
    mart       = ensembl
)

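# biomaRt returns '' (not NA) for genes without an HGNC symbol;
# drop those rows so they do not all collapse into a single empty-named gene
mapping <- mapping[mapping$hgnc_symbol != '', ]

counts_df <- as.data.frame(counts)
counts_df$ensembl <- clean
# Inner join: IDs without a mapping are dropped here, so report nrow(merged) / nrow(counts_df)
merged <- merge(counts_df, mapping, by.x = 'ensembl', by.y = 'ensembl_gene_id')
counts_by_symbol <- aggregate(. ~ hgnc_symbol,
                              data = merged[, setdiff(colnames(merged), 'ensembl')],
                              FUN = sum)
rownames(counts_by_symbol) <- counts_by_symbol$hgnc_symbol
counts_by_symbol$hgnc_symbol <- NULL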

Handle Unmapped IDs

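def robust_id_mapping(gene_ids, from_type, to_type, species='human'):
    '''Map IDs with mygene; unmapped IDs fall back to themselves and are reported.'''
    import mygene
    mg = mygene.MyGeneInfo()
    clean = [g.split('.')[0] for g in gene_ids]
    # Query each distinct ID once
    results = mg.querymany(list(dict.fromkeys(clean)), scopes=from_type,
                           fields=to_type, species=species)
    # First hit per query wins; a dict lookup also avoids the O(n^2) list.index() search,
    # which returned the wrong original whenever two inputs shared a stripped ID
    found = {}
    for r in results:
        if to_type in r:
            found.setdefault(r['query'], r[to_type])
    mapping, unmapped = {}, []
    for original, key in zip(gene_ids, clean):
        if key in found:
            mapping[original] = found[key]
        else:
            mapping[original] = original
            unmapped.append(original)
    print(f'Mapped: {len(gene_ids) - len(unmapped)}/{len(gene_ids)}')
    return mapping, unmapped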

Unmapped fraction is a QC signal:

<5% unmapped: normal (rare genes, recent deprecations)
5-20% unmapped: check Ensembl release alignment; check for HGNC renames
>20% unmapped: wrong annotation release, wrong species, or wrong source ID type
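The thresholds above can be turned into a small check that runs after every mapping step. The function name `unmapped_qc` and the wording of the verdict strings are hypothetical; only the 5% and 20% cut-offs come from the list above.

```python
def unmapped_qc(n_total, n_unmapped):
    """Return (unmapped fraction, verdict) using the 5% / 20% cut-offs."""
    if n_total <= 0:
        raise ValueError('n_total must be positive')
    fraction = n_unmapped / n_total
    if fraction < 0.05:
        verdict = 'normal'
    elif fraction <= 0.20:
        verdict = 'check Ensembl release alignment and HGNC renames'
    else:
        verdict = 'wrong release, wrong species, or wrong source ID type'
    return fraction, verdict

fraction, verdict = unmapped_qc(n_total=20000, n_unmapped=600)
```

It pairs naturally with the `unmapped` list returned by a mapping helper: pass `len(gene_ids)` and `len(unmapped)` and log the verdict alongside the analysis.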

MANE Select for Clinical Reporting

Goal: Use the single representative transcript per gene with identical exon/CDS in RefSeq AND Ensembl for clinical variant reporting.

Approach: Download the MANE TSV; join on Ensembl_Gene -> Ensembl_nuc (transcript) and RefSeq_nuc.

Morales J, Pujar S, Loveland JE et al. 2022 Nature 604:310-315 established MANE Select. ~19,000+ protein-coding genes have a single agreed transcript with matched coordinates across RefSeq (NM_xxxxxx) and Ensembl/GENCODE (ENST00000xxxxxxx). MANE Plus Clinical adds extra transcripts at loci where Select misses clinical variants.

For clinical reports with HGVS notation like NM_000546.6:c.215C>G, use the MANE Select RefSeq accession. The MANE TSV (downloadable from NCBI) provides the Ensembl crosswalk.
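A sketch of the crosswalk in pandas. It assumes the summary TSV carries the columns `Ensembl_Gene`, `symbol`, `RefSeq_nuc`, `Ensembl_nuc`, and `MANE_status`, with `MANE Select` as a status value; verify these against the header of the file actually downloaded. The helper name `load_mane_select` is hypothetical, and the inline rows are toy data (the second gene and its accessions are placeholders, and the gene version number is illustrative).

```python
import io
import pandas as pd

def load_mane_select(handle):
    """Return MANE Select transcripts indexed by unversioned Ensembl gene ID."""
    mane = pd.read_csv(handle, sep='\t', dtype=str)
    select = mane[mane['MANE_status'] == 'MANE Select'].copy()
    # Unversioned gene ID as the join key; transcript accessions stay versioned.
    select['gene_id'] = select['Ensembl_Gene'].str.replace(r'\.[0-9]+$', '', regex=True)
    return select.set_index('gene_id')[['symbol', 'RefSeq_nuc', 'Ensembl_nuc']]

# Toy stand-in for the downloaded file (assumed column names, placeholder second row).
TOY_TSV = (
    '#NCBI_GeneID\tEnsembl_Gene\tsymbol\tRefSeq_nuc\tEnsembl_nuc\tMANE_status\n'
    'GeneID:7157\tENSG00000141510.18\tTP53\tNM_000546.6\tENST00000269305.9\tMANE Select\n'
    'GeneID:0\tENSG_TOY.1\tTOYGENE\tNM_TOY.1\tENST_TOY.1\tMANE Plus Clinical\n'
)

mane_select = load_mane_select(io.StringIO(TOY_TSV))
refseq_for_tp53 = mane_select.loc['ENSG00000141510', 'RefSeq_nuc']
```

`pd.read_csv` also accepts a file path, so the same function can be pointed at the real download instead of the inline string.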

Cross-Species Orthologs

Goal: Map mouse <-> human (or any pair) for cross-species integration or pathway transfer.

Approach: Ensembl Compara via biomaRt getLDS; filter to orthology type appropriate to use.

library(biomaRt)

human <- useEnsembl(biomart = 'genes', dataset = 'hsapiens_gene_ensembl', version = 110)
mouse <- useEnsembl(biomart = 'genes', dataset = 'mmusculus_gene_ensembl', version = 110)

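# attributesL must be attributes of the linked (mouse) dataset
orthologs <- getLDS(
    attributes  = c('hgnc_symbol', 'ensembl_gene_id'),
    filters     = 'ensembl_gene_id',
    values      = human_gene_ids,
    mart        = human,
    attributesL = c('mgi_symbol', 'ensembl_gene_id'),
    martL       = mouse
)

# Orthology type is a homolog attribute of the HUMAN dataset; fetch it with getBM
ortho_type <- getBM(
    attributes = c('ensembl_gene_id', 'mmusculus_homolog_ensembl_gene',
                   'mmusculus_homolog_orthology_type'),
    filters = 'ensembl_gene_id', values = human_gene_ids, mart = human
)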

Strategy	When	Trade-off
`one2one` orthologs only	Cross-species scRNA-seq integration; conservative DE comparison	Loses genes with paralog expansions; lower coverage
Include `one2many`	Broader gene coverage needed	Must select within group (highest confidence; highest expression)
Include `many2many`	Maximum inclusivity	Introduces ambiguity; use with caution

The "homology threshold" problem: no automatic threshold reliably separates true orthologs from paralogs across all gene families. For pathway transfer (mouse signature -> human), filter to one2one and accept the coverage loss.

Alternative sources: OMA (Hierarchical Orthologous Groups, cleaner one2one when present, smaller coverage); OrthoDB (hierarchical at multiple taxonomic levels). OrthoFinder for custom genomes.
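Once an ortholog table has been exported, the strategies in the table can be applied reproducibly in pandas. This sketch assumes columns named after the biomaRt homolog attributes (`mmusculus_homolog_orthology_type`, `mmusculus_homolog_perc_id`); the helper `select_orthologs` and the toy IDs and identity values are hypothetical.

```python
import pandas as pd

TYPE_COL = 'mmusculus_homolog_orthology_type'
IDENT_COL = 'mmusculus_homolog_perc_id'

def select_orthologs(table, strategy='one2one'):
    """Reduce an ortholog table to at most one mouse partner per human gene."""
    if strategy == 'one2one':
        # Conservative: keep only pairs annotated as one-to-one.
        return table[table[TYPE_COL] == 'ortholog_one2one'].reset_index(drop=True)
    if strategy == 'best_identity':
        # Stated rule instead of "first row": highest percent identity wins,
        # ties broken by original row order (stable sort).
        ranked = table.sort_values(IDENT_COL, ascending=False, kind='stable')
        best = ranked.drop_duplicates('ensembl_gene_id')
        return best.sort_index().reset_index(drop=True)
    raise ValueError(f'unknown strategy: {strategy!r}')

# Toy table with placeholder IDs: H2 has two candidate mouse partners.
toy = pd.DataFrame({
    'ensembl_gene_id': ['H1', 'H2', 'H2', 'H3'],
    'mmusculus_homolog_ensembl_gene': ['M1', 'M2a', 'M2b', 'M3'],
    TYPE_COL: ['ortholog_one2one', 'ortholog_one2many',
               'ortholog_one2many', 'ortholog_many2many'],
    IDENT_COL: [95.0, 60.0, 82.5, 40.0],
})

strict = select_orthologs(toy, 'one2one')
broad = select_orthologs(toy, 'best_identity')
```

Writing the selection rule down as code is what makes the broader strategies defensible: the same input always yields the same partner.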

PAR Gene Complications

Pseudo-autosomal region (PAR) genes exist on both X and Y with identical sequences. In GENCODE 25-43, the chrY copy has a _PAR_Y suffix. In GENCODE 44+ (Ensembl 110+), chrY PAR genes get their own ENSG accessions.

par_genes_human = ['SHOX', 'IL3RA', 'SLC25A6', 'P2RY8', 'AKAP17A', 'ASMT', 'DHRSX']
dup_ids = counts.index[counts.index.duplicated()].unique()
if len(dup_ids) > 0:
    print(f'Duplicate gene entries: {len(dup_ids)}')
    counts = counts.groupby(counts.index).sum()

Reads from PAR regions cannot be unambiguously assigned to X or Y. Some references mask the Y-chromosome PAR to avoid double-counting; verify what the alignment reference does before building the matrix.

Build tx2gene for tximport

Goal: Create the transcript-to-gene mapping needed by tximport for gene-level summarization.

Approach: Build from the SAME GTF used to construct the Salmon/kallisto index, OR pull from biomaRt with version pinning.

library(GenomicFeatures)

txdb <- makeTxDbFromGFF('annotation.gtf.gz')
k <- keys(txdb, keytype = 'TXNAME')
tx2gene <- AnnotationDbi::select(txdb, k, 'GENEID', 'TXNAME')

library(biomaRt)

mart <- useEnsembl(biomart = 'genes', dataset = 'hsapiens_gene_ensembl', version = 110)
tx2gene <- getBM(
    attributes = c('ensembl_transcript_id_version', 'ensembl_gene_id_version'),
    mart       = mart
)
colnames(tx2gene) <- c('TXNAME', 'GENEID')

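import gzip
import pandas as pd

def tx2gene_from_gtf(gtf_path):
    '''Build a TXNAME/GENEID table from the transcript rows of a GTF (plain or .gz).'''
    opener = gzip.open if str(gtf_path).endswith('.gz') else open
    records = []
    with opener(gtf_path, 'rt') as f:
        for line in f:
            if line.startswith('#'):
                continue
            fields = line.rstrip('\n').split('\t')
            # Column 3 is the feature type; keep transcript rows only
            if len(fields) < 9 or fields[2] != 'transcript':
                continue
            # Column 9 looks like: gene_id "X"; transcript_id "Y"; ...
            attrs = dict(a.strip().split(' ', 1) for a in fields[8].split(';') if a.strip())
            records.append({'TXNAME': attrs['transcript_id'].strip('"'),
                            'GENEID': attrs['gene_id'].strip('"')})
    return pd.DataFrame(records).drop_duplicates()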

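CRITICAL: the tx2gene MUST use the same versioning convention as the Salmon/kallisto index. If the index used ENST00000269305.9 and tx2gene has ENST00000269305 (unversioned), the IDs do not match. tximport stops with an error when no transcript matches; when only some match, it drops the rest and prints just a message, so partial mismatches lose data quietly. Either rebuild tx2gene with matching IDs or pass ignoreTxVersion = TRUE to tximport.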
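A quick pre-flight check can catch the mismatch before quantification results are imported. The sketch below compares the transcript IDs from a quantification file with those in tx2gene, first exactly and then after dropping version suffixes. The helper `diagnose_tx_ids` is hypothetical, and the second transcript ID in the example is a placeholder.

```python
import re

# '.N' version suffix, optionally followed by the GENCODE '_PAR_Y' tag.
_VERSION = re.compile(r'\.[0-9]+(_PAR_Y)?$')

def _unversioned(ids):
    return {_VERSION.sub(lambda m: m.group(1) or '', i) for i in ids}

def diagnose_tx_ids(quant_ids, tx2gene_ids):
    """Report exact and version-insensitive overlap between two ID sets."""
    quant, ref = set(quant_ids), set(tx2gene_ids)
    if not quant:
        raise ValueError('quant_ids is empty')
    exact = len(quant & ref) / len(quant)
    quant_u = _unversioned(quant)
    loose = len(quant_u & _unversioned(ref)) / len(quant_u)
    return {
        'exact': exact,                     # fraction matching as written
        'unversioned': loose,               # fraction matching once versions are dropped
        'version_mismatch': loose > exact,  # True when versioning is the problem
    }

report = diagnose_tx_ids(
    quant_ids=['ENST00000269305.9', 'ENST00000000001.1'],   # second ID is a placeholder
    tx2gene_ids=['ENST00000269305', 'ENST00000000001'],
)
```

A high `unversioned` overlap with a low `exact` overlap points at the versioning convention; low values for both point at a different annotation release or species.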

ID Type Reference

Type	Example	Stability	Use case
Ensembl Gene	ENSG00000141510	Stable across releases; versioned	RNA-seq, GTFs, primary computational key
Ensembl Transcript	ENST00000269305	Stable; versioned	Transcript-level analysis
Entrez Gene	7157	Stable; never reused	NCBI databases, KEGG pathways
HGNC Symbol	TP53	Changes (see SEPT/MARCH renames)	Display labels only
UniProt	P04637	Stable; versioned releases	Protein databases
RefSeq mRNA	NM_000546	Stable; versioned	Clinical reports, HGVS notation
MANE Select	NM_000546.6 / ENST00000269305.9	Stable consensus	Clinical variant reporting
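Before mapping, it helps to confirm which ID type a column actually holds. The following sketch guesses the type from the shape of the identifier, using the examples in the table. The function `guess_id_type` and its labels are hypothetical; the patterns are heuristics (a symbol could in principle resemble an accession), so treat the result as a hint, not a guarantee.

```python
import re

# Ordered (label, pattern) pairs; the first full match wins.
_PATTERNS = [
    ('ensembl_gene', re.compile(r'ENS[A-Z]{0,4}G\d{11}(\.\d+)?(_PAR_Y)?')),
    ('ensembl_transcript', re.compile(r'ENS[A-Z]{0,4}T\d{11}(\.\d+)?(_PAR_Y)?')),
    ('refseq_mrna', re.compile(r'NM_\d+(\.\d+)?')),
    ('entrez', re.compile(r'\d+')),
    # UniProt accession format (6 or 10 characters).
    ('uniprot', re.compile(
        r'[OPQ][0-9][A-Z0-9]{3}[0-9]|[A-NR-Z][0-9]([A-Z][A-Z0-9]{2}[0-9]){1,2}')),
]

def guess_id_type(identifier):
    """Return a best-guess ID type label for a single identifier."""
    text = str(identifier).strip()
    for label, pattern in _PATTERNS:
        if pattern.fullmatch(text):
            return label
    return 'symbol_or_unknown'

def summarize_id_types(identifiers):
    """Count guessed types across a column to spot mixed or unexpected inputs."""
    summary = {}
    for identifier in identifiers:
        label = guess_id_type(identifier)
        summary[label] = summary.get(label, 0) + 1
    return summary

summary = summarize_id_types(['ENSG00000141510', 'ENST00000269305', '7157', 'TP53'])
```

A column that reports more than one type is a common cause of a low mapping rate.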

Per-Method Failure Modes

`_PAR_Y` stripped, chrY duplicates collapsed

Trigger: GENCODE v40 count matrix; rownames(counts) <- sub('\\..*', '', rownames(counts)); duplicate row indices and inflated chrY PAR gene counts.

Mechanism: Default regex strips _PAR_Y along with the version suffix. Two distinct rows (chrX and chrY copies) become the same ENSG ID; aggregate sums them.

Symptom: Counts for PAR genes double; sex check shows females expressing chrY genes; downstream rowGroupBy returns warnings.

Fix: Use the preserving regex: sub('\\.[0-9]+(_PAR_Y)?$', '\\1', x). Or upgrade quantification to GENCODE 44+ where _PAR_Y is retired.

`biomaRt` returned 0 rows without warning

Trigger: getBM(attributes=..., filter='ensembl_gene_id', values=ids, mart=mart) -- note singular filter.

Mechanism: R's partial matching usually resolves filter -> filters, but in some package versions or with conflicting argument names, the call silently passes nothing.

Symptom: Empty result data frame; no error.

Fix: Always spell filters= and values= fully.

HGNC SEPT/MARCH symbols silently dropped

Trigger: Code copies a pre-2020 list of septin genes (SEPT1, SEPT2, ...); current org.db / biomaRt returns no matches.

Mechanism: HGNC renamed all SEPT# to SEPTIN# in 2020.

Symptom: 0% mapping rate for septin genes; functional analyses missing septin pathways.

Fix: Use mygene scopes='symbol,alias,prev_symbol'; or update the input list to current symbols.

biomaRt drift between runs

Trigger: A 2023 analysis used useEnsembl() without version=; rerun in 2026 produces 200 fewer significant genes.

Mechanism: Without version=, biomaRt floats to the current release. Symbols, biotypes, and gene boundaries change between releases.

Symptom: Non-reproducible results across runs of the same script.

Fix: Pin useEnsembl(version=N) where N is the release used in the original analysis. Cache the mapping table.

tx2gene version mismatch with Salmon index

Trigger: tximport(files, type='salmon', tx2gene) runs but the gene-level counts have far fewer genes than expected.

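Mechanism: Salmon index built with versioned transcript IDs (ENST00000269305.9) but tx2gene carries different IDs for some or all transcripts (unversioned ENST00000269305, or versions from another release). Transcripts without an exact match are dropped during the mapping step; if none match at all, tximport stops with an error instead.

Symptom: Lower-than-expected gene count; a tximport message reporting transcripts missing from tx2gene.

Fix: Match versioning convention: rebuild tx2gene with the same versioning as the index. GenomicFeatures::makeTxDbFromGFF on the same GTF as the index is the safest path. As a fallback, tximport(..., ignoreTxVersion = TRUE) matches on unversioned IDs.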

Cross-species mapping reports many2many, user picks one arbitrarily

Trigger: Mouse-to-human mapping returns 1.3 mouse genes per human gene on average; user takes the first row of each duplicate.

Mechanism: Many2many orthology is genuinely ambiguous; "first row" is unprincipled and irreproducible across biomaRt API versions.

Symptom: Different mappings on rerun; conflicting downstream gene sets.

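Fix: Either filter to mmusculus_homolog_orthology_type == 'ortholog_one2one' (conservative) or keep one partner per gene by a stated rule, such as highest percent identity (mmusculus_homolog_perc_id_r1) among high-confidence pairs (mmusculus_homolog_orthology_confidence == 1).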

Common errors

Error / symptom	Cause	Fix
`filters` returns empty	Singular `filter=` partial-matched against another argument	Spell `filters=` fully
`1-Mar` in gene column	Excel autocorrected `MARCH1`	Re-import with explicit string type; map back to MARCHF1
pyensembl `ValueError: gene not found`	ID not in pinned release; or versioned ID (ENSG...N) passed to a lookup keyed on unversioned IDs	Strip version before lookup; verify release
Duplicate rownames after aggregate	Collapsed multiple source IDs to one target; OR `_PAR_Y` stripped	Sum-collapse expected; for PAR_Y use preserving regex
biomaRt timeout for >5k IDs	Query too large	Chunk into batches of 1000
Wrong species mapping	Helper functions above default to `species='human'`; mouse query returns nothing	Pass `species='mouse'` explicitly
`ENSEMBL` keytype not available	Older org.db package or non-human/mouse	`keytypes(orgdb)` to verify
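The batching fix for large queries is independent of the backend, so it can be sketched generically. Here `query_fn` is a hypothetical stand-in for whatever function performs one remote lookup on a list of IDs and returns a list of rows; `chunked` and `query_in_batches` are illustrative names as well.

```python
def chunked(ids, size=1000):
    """Yield consecutive slices of at most `size` IDs."""
    for start in range(0, len(ids), size):
        yield ids[start:start + size]

def query_in_batches(ids, query_fn, size=1000):
    """Deduplicate IDs, query them batch by batch, and concatenate the rows."""
    unique_ids = list(dict.fromkeys(ids))  # drop repeats, keep first-seen order
    rows = []
    for batch in chunked(unique_ids, size):
        rows.extend(query_fn(batch))
    return rows

# Stand-in for a remote lookup: records the batch sizes it was called with.
batch_sizes = []

def fake_lookup(batch):
    batch_sizes.append(len(batch))
    return [(gene_id, gene_id.lower()) for gene_id in batch]

example_ids = [f'ID{i}' for i in range(2500)] + ['ID0']  # one repeated ID
rows = query_in_batches(example_ids, fake_lookup)
```

In a real pipeline a short pause or retry between batches may also be needed to stay within the service's rate limits.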

References

Ziemann M, Eren Y, El-Osta A. 2016. Gene name errors are widespread in the scientific literature. Genome Biol 17:177. doi:10.1186/s13059-016-1044-7
Bruford EA, Braschi B, Denny P, Jones TEM, Seal RL, Tweedie S. 2020. Guidelines for human gene nomenclature. Nat Genet 52:754-758. doi:10.1038/s41588-020-0669-3
Morales J, Pujar S, Loveland JE, et al. 2022. A joint NCBI and EMBL-EBI transcript set for clinical genomics and research. Nature 604:310-315. doi:10.1038/s41586-022-04558-8
Durinck S, Spellman PT, Birney E, Huber W. 2009. Mapping identifiers for the integration of genomic datasets with the R/Bioconductor package biomaRt. Nat Protoc 4(8):1184-1191. doi:10.1038/nprot.2009.97
Carlson M, Falcon S, Pages H, Li N. 2019. org.Hs.eg.db: Genome wide annotation for Human. R package. Bioconductor.
Wu C et al. 2013. BioGPS and MyGene.info: organizing online, gene-centric information. Nucleic Acids Res 41(D1):D561-D565. doi:10.1093/nar/gks1114
Frankish A et al. 2021. GENCODE 2021. Nucleic Acids Res 49(D1):D916-D923. doi:10.1093/nar/gkaa1087
Howe KL et al. 2021. Ensembl 2021. Nucleic Acids Res 49(D1):D884-D891. doi:10.1093/nar/gkaa942
Soneson C, Love MI, Robinson MD. 2015. Differential analyses for RNA-seq: transcript-level estimates improve gene-level inferences. F1000Res 4:1521. doi:10.12688/f1000research.7563.2

Related Skills

counts-ingest - Building count matrices and tx2gene
metadata-joins - Joining annotation with sample tables
normalization - Biotype filtering before normalization
sparse-handling - Single-cell row metadata in AnnData
differential-expression/de-results - Annotating DE results; gene-symbol display
rna-quantification/tximport-workflow - Detailed tximport + tx2gene workflow
pathway-analysis/go-enrichment - Entrez IDs required
pathway-analysis/kegg-pathways - Entrez IDs; strain-specific organism codes
database-access/biomart-queries - General biomaRt patterns
database-access/uniprot-access - UniProt mapping details
database-access/ortholog-inference - De novo ortholog inference for custom genomes

name	bio-expression-matrix-gene-id-mapping
description	Maps between gene identifier systems (Ensembl, Entrez, HGNC symbol, UniProt, RefSeq, MANE) using AnnotationDbi, biomaRt, mygene, pyensembl, and Ensembl REST. Encodes Ensembl version stripping with GENCODE _PAR_Y preservation, the Ziemann 2016 Excel autocorrect debacle and Bruford 2020 HGNC renames (SEPT->SEPTIN, MARCH->MARCHF, MARC->MTARC, DEC1->DELEC1), OCT4/POU5F1 alias resolution, biomaRt archive endpoints for release pinning, the `filters` (plural) gotcha, MANE Select for clinical reporting, cross-species orthology via Ensembl Compara / OMA / OrthoDB, and tx2gene construction for tximport. Use when converting gene IDs across systems, handling renamed symbols, building tx2gene, pinning to a specific Ensembl release for reproducibility, or mapping cross-species orthologs.
tool_type	mixed
primary_tool	biomaRt
