| name | dynamo-geneid-convert |
| description | Convert Ensembl-style gene IDs to gene symbols in `dynamo` with `dynamo.preprocessing.convert2gene_symbol` or `dynamo.preprocessing.convert2symbol`, including human and zebrafish IDs, version-suffix stripping, `AnnData.var_names` updates, and optional preprocessing handoff. Use when adapting `docs/tutorials/notebooks/110_geneid_convert_tutorial.ipynb`, standardizing `adata.var_names`, mapping Ensembl IDs to symbols, or doing identifier cleanup before a `Preprocessor` recipe. |
Dynamo Gene ID Convert
Goal
Convert Ensembl-style identifiers to gene symbols in a way that another agent can actually rerun: choose the correct conversion path, preserve traceability columns, handle species and version-suffix edge cases, and only hand off to Preprocessor after IDs are standardized.
Quick Workflow
- Inspect
adata.var_names or the raw ID list and determine whether the user needs a batch table or an in-place AnnData update.
- Strip version suffixes such as
.1 into a query column before you treat mapping quality as final.
- Use
convert2gene_symbol(...) when you need an explicit mapping table, reproducible merge logic, or non-human species control.
- Use
convert2symbol(adata, ...) only when in-place AnnData mutation is the right abstraction and the prefix / scopes rules are satisfied.
- Keep the original identifier in
adata.var, set adata.var_names from symbol only after validating duplicates and missing mappings, and run preprocessing afterward, not before.
Interface Summary
convert2gene_symbol(input_names, scopes='ensembl.gene', ensembl_release=None, species=None, force_rebuild=False) returns a DataFrame indexed by query with at least symbol and _score.
- The live source uses vendored
pyensembl data access, not a remote MyGene batch API.
convert2symbol(adata, scopes=None, subset=True) updates adata.var, adds query plus symbol, and can rewrite adata.var_names in place.
Preprocessor.preprocess_adata(adata, recipe='monocle', tkey=None, experiment_type=None) is only the downstream handoff. Current source exposes five recipe branches:
monocle, seurat, sctransform, pearson_residuals, monocle_pearson_residuals.
Read references/source-grounding.md before documenting parameters more narrowly than the notebook does.
Conversion Path Selection
- Use
convert2gene_symbol(...) as the default for notebook conversion work.
It is the safest path when you need explicit species, ensembl_release, manual merge logic, or per-ID validation.
- Use
convert2symbol(adata) for human or mouse Ensembl-style adata.var_names when direct in-place conversion is acceptable.
- Use
convert2symbol(adata, scopes='ensembl.gene') for zebrafish or other non-human prefixes if you still want the in-place helper.
- Do not trust the notebook's generic
scopes explanation alone.
In current source, convert2gene_symbol keeps scopes only for compatibility, while convert2symbol still branches on scopes and prefix heuristics.
Minimal Execution Patterns
For an explicit mapping table and controlled merge:
import dynamo as dyn
adata = dyn.sample_data.hematopoiesis_raw()
adata.var["ensembl_id"] = adata.var_names
adata.var["query"] = adata.var_names.str.split(".").str[0]
mapping = dyn.preprocessing.convert2gene_symbol(
adata.var["query"].tolist(),
species="human",
)
adata.var = (
adata.var
.merge(mapping, left_on="query", right_index=True, how="left")
.set_index(adata.var.index)
)
mapped = adata.var["symbol"].notna()
adata = adata[:, mapped].copy()
adata.var_names = adata.var["symbol"].astype(str)
For in-place conversion on supported prefixes:
import dynamo as dyn
adata = dyn.sample_data.hematopoiesis_raw()
adata.var["ensembl_id"] = adata.var_names
dyn.preprocessing.convert2symbol(adata, subset=True)
For zebrafish IDs with version suffixes:
import dynamo as dyn
result = dyn.preprocessing.convert2gene_symbol(
["ENSDARG00000035558.1"],
ensembl_release=77,
)
dyn.preprocessing.convert2symbol(adata, scopes="ensembl.gene")
Optional Preprocess Integration
- Perform gene-ID conversion before preprocessing unless the user explicitly wants to preserve notebook timing.
- If the user proceeds into preprocessing, default to
recipe='monocle' unless they ask for a different branch.
- If the user requests Pearson residuals but still cares about downstream velocity-safe layers, prefer
monocle_pearson_residuals.
Read references/preprocess-handoff.md before choosing a non-default recipe.
Validation
After conversion, check these items:
adata.var["query"] stores the version-stripped identifier actually used for lookup.
adata.var["symbol"] exists and the mapping rate is acceptable for the dataset.
- The original ID remains preserved, for example in
adata.var["ensembl_id"].
adata.var_names contains symbols only after duplicate and null handling is explicit.
- Representative conversions still match live source behavior:
ENSG00000141510 -> TP53 and ENSDARG00000035558(.1) -> gps2 with ensembl_release=77.
- If preprocessing follows,
Preprocessor.preprocess_adata(..., recipe=...) should run only after symbol assignment is settled.
Constraints
- Do not describe conversion as MyGene-backed just because older prose or notebook wording suggests that pattern; current source uses vendored
pyensembl.
- Do not assume
convert2symbol(adata) auto-detects every species. In current source it auto-classifies some human / mouse gene or transcript prefixes, but zebrafish without explicit scopes raises.
- Do not assume the docstring's
ensembl_release default is authoritative. Current code assigns 77 when the argument is omitted.
- The first conversion run may install
polars or download / index Ensembl data, so cold-start execution can be slower than the notebook suggests.
Resource Map
- Read
references/source-grounding.md for inspected signatures, live-source behavior, and branch coverage.
- Read
references/conversion-paths.md for human vs zebrafish decision rules and scopes handling.
- Read
references/preprocess-handoff.md for the downstream recipe branches.
- Read
references/source-notebook-map.md to see which notebook sections were preserved or intentionally dropped.
- Read
references/compatibility.md when notebook wording and current source behavior appear to disagree.