cellxgene-census
Query CELLxGENE Census (61M+ cells). Search by cell type/tissue/disease/organism; get AnnData, stream out-of-core, train PyTorch models. For your own data use scanpy; for annotated data use anndata.
| name | cellxgene-census |
| description | Query CELLxGENE Census (61M+ cells). Search by cell type/tissue/disease/organism; get AnnData, stream out-of-core, train PyTorch models. For your own data use scanpy; for annotated data use anndata. |
| license | MIT |
CZ CELLxGENE Census provides programmatic access to 61+ million standardized single-cell RNA-seq observations from human and mouse. It enables population-scale queries by cell type, tissue, disease, and donor metadata, returning expression data as AnnData objects or PyTorch dataloaders for ML workflows.
pip install cellxgene-census
# For ML workflows
pip install "cellxgene-census[experimental]"
API Rate Limits: Census uses TileDB-SOMA cloud backend. No explicit rate limit, but large queries (>1M cells) should use out-of-core processing (Module 4) to avoid memory exhaustion. Always use context managers for proper resource cleanup.
import cellxgene_census

with cellxgene_census.open_soma() as census:
    # Get B cells from lung
    adata = cellxgene_census.get_anndata(
        census=census,
        organism="Homo sapiens",
        obs_value_filter="cell_type == 'B cell' and tissue_general == 'lung' and is_primary_data == True",
        obs_column_names=["cell_type", "disease", "donor_id"],
    )
    print(f"Retrieved {adata.n_obs} cells × {adata.n_vars} genes")
    # Retrieved ~15000 cells × 60664 genes
Connect to Census and discover available data.
import cellxgene_census

# Open latest stable version (always use context manager)
with cellxgene_census.open_soma() as census:
    # Summary statistics (the summary table stores label/value pairs)
    summary = census["census_info"]["summary"].read().concat().to_pandas()
    total_cells = summary.set_index("label").loc["total_cell_count", "value"]
    print(f"Total cells: {int(total_cells):,}")

    # List all datasets
    datasets = census["census_info"]["datasets"].read().concat().to_pandas()
    print(f"Total datasets: {len(datasets)}")
    print(datasets[["dataset_title", "dataset_total_cell_count"]].head())

# Open specific version for reproducibility
with cellxgene_census.open_soma(census_version="2023-07-25") as census:
    # Reproducible analysis code here
    pass
Query cell-level metadata without downloading expression data.
import cellxgene_census

with cellxgene_census.open_soma() as census:
    # Get unique cell types in brain
    cell_metadata = cellxgene_census.get_obs(
        census,
        "homo_sapiens",
        value_filter="tissue_general == 'brain' and is_primary_data == True",
        column_names=["cell_type", "disease", "assay"]
    )
    print(f"Total brain cells: {len(cell_metadata):,}")
    print(cell_metadata["cell_type"].value_counts().head(10))

    # Gene metadata query
    gene_metadata = cellxgene_census.get_var(
        census,
        "homo_sapiens",
        value_filter="feature_name in ['CD4', 'CD8A', 'FOXP3']",
        column_names=["feature_id", "feature_name", "feature_length"]
    )
    print(gene_metadata)
    # Returns DataFrame with Ensembl IDs, gene symbols, and lengths
Retrieve expression matrices as AnnData objects for queries returning <100k cells.
import cellxgene_census

with cellxgene_census.open_soma() as census:
    # Query by cell type + tissue + disease
    adata = cellxgene_census.get_anndata(
        census=census,
        organism="Homo sapiens",
        obs_value_filter="cell_type == 'T cell' and disease == 'COVID-19' and is_primary_data == True",
        var_value_filter="feature_name in ['CD4', 'CD8A', 'CD19', 'FOXP3']",
        obs_column_names=["cell_type", "tissue_general", "donor_id"],
    )
    print(f"Shape: {adata.shape}")  # (n_cells, 4)
    print(f"Metadata columns: {list(adata.obs.columns)}")
Filter syntax reference:
- Combine conditions with and, or
- Match against a list: feature_name in ['CD4', 'CD8A']
- Compare numeric fields: cell_count > 1000
- Include is_primary_data == True to avoid duplicate cells
Stream expression data in chunks for queries exceeding available RAM.
import cellxgene_census
import tiledbsoma as soma

with cellxgene_census.open_soma() as census:
    # Estimate query size first
    metadata = cellxgene_census.get_obs(
        census, "homo_sapiens",
        value_filter="tissue_general == 'brain' and is_primary_data == True",
        column_names=["soma_joinid"]
    )
    n_cells = len(metadata)
    print(f"Query will return {n_cells:,} cells")

    # If >100k cells, use streaming
    query = census["census_data"]["homo_sapiens"].axis_query(
        measurement_name="RNA",
        obs_query=soma.AxisQuery(
            value_filter="tissue_general == 'brain' and is_primary_data == True"
        ),
        var_query=soma.AxisQuery(
            value_filter="feature_name in ['FOXP2', 'TBR1', 'SATB2']"
        )
    )

    # Incremental statistics over the stored (non-zero) entries
    n_obs, total = 0, 0.0
    for batch in query.X("raw").tables():
        values = batch["soma_data"].to_numpy()
        n_obs += len(values)
        total += values.sum()
    print(f"Processed {n_obs:,} non-zero entries, mean={total/n_obs:.4f}")
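The loop above averages over stored non-zero entries only. If a per-gene mean over all cells is wanted, the zeros have to be counted in the denominator. The following is a minimal sketch of that accumulation, not part of the Census API: `per_gene_mean` and `fake_batches` are hypothetical names, and the batches are simulated as plain dictionaries that mimic the `soma_dim_0`, `soma_dim_1`, and `soma_data` columns of a sparse SOMA table.

```python
from collections import defaultdict


def per_gene_mean(batches, n_cells):
    """Accumulate per-gene sums over sparse batches, then divide by n_cells.

    Each batch maps "soma_dim_1" (gene joinid) and "soma_data" (value) to
    equal-length sequences. Zeros are implicit in a sparse matrix, so the
    denominator is the total cell count, not the number of stored entries.
    """
    sums = defaultdict(float)
    for batch in batches:
        for gene_id, value in zip(batch["soma_dim_1"], batch["soma_data"]):
            sums[gene_id] += value
    return {gene_id: total / n_cells for gene_id, total in sums.items()}


# Simulated stand-in for the streamed tables: 4 cells, 2 genes
fake_batches = [
    {"soma_dim_0": [0, 1], "soma_dim_1": [10, 10], "soma_data": [2.0, 4.0]},
    {"soma_dim_0": [3], "soma_dim_1": [11], "soma_data": [8.0]},
]
means = per_gene_mean(fake_batches, n_cells=4)
print(means)  # {10: 1.5, 11: 2.0}
```

With real Arrow tables, the columns would likely need converting first (for example with `to_pylist()` or `to_numpy()`), and `n_cells` would come from the `get_obs()` count taken before streaming.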
Check which datasets measured specific genes (not all genes are in all datasets).
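import cellxgene_census

with cellxgene_census.open_soma() as census:
    # Full presence matrix: rows = datasets, columns = genes (by soma_joinid)
    presence = cellxgene_census.get_presence_matrix(
        census, "homo_sapiens", measurement_name="RNA"
    )
    # Look up the joinids of the genes of interest, then subset the columns
    genes = cellxgene_census.get_var(
        census,
        "homo_sapiens",
        value_filter="feature_name in ['CD4', 'CD8A', 'PTPRC']",
        column_names=["soma_joinid", "feature_name"]
    )
    subset = presence[:, genes["soma_joinid"].to_numpy()]
    print(f"Presence matrix shape: {subset.shape}")
    # (n_datasets, n_genes): True if gene measured in dataset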
Train models directly on Census data using the experimental dataloader.
import cellxgene_census
import tiledbsoma as soma
from cellxgene_census.experimental.ml import ExperimentDataPipe, experiment_dataloader

with cellxgene_census.open_soma() as census:
    # API as in cellxgene-census 1.x; later releases move this loader to tiledbsoma_ml
    datapipe = ExperimentDataPipe(
        census["census_data"]["homo_sapiens"],
        measurement_name="RNA",
        X_name="raw",
        obs_query=soma.AxisQuery(
            value_filter="tissue_general == 'liver' and is_primary_data == True"
        ),
        obs_column_names=["cell_type"],
        batch_size=128,
        shuffle=True,
    )
    dataloader = experiment_dataloader(datapipe)
    for X, labels in dataloader:
        # X: gene expression tensor; labels: integer-encoded cell metadata
        print(f"Batch X shape: {X.shape}, labels shape: {labels.shape}")
        break  # Show first batch only
The Census is organized as a SOMA (Stack of Matrices, Annotated) collection:
census/
├── census_info/
│   ├── summary          # Total cell counts
│   └── datasets         # Dataset metadata
└── census_data/
    ├── homo_sapiens/
    │   ├── obs          # Cell metadata (61M+ rows)
    │   └── ms/RNA/
    │       ├── var      # Gene metadata (~60k rows)
    │       └── X/raw    # Expression matrix (sparse)
    └── mus_musculus/
        └── ...
| Field | Type | Description | Example Values |
|---|---|---|---|
| cell_type | str | Cell Ontology label | "B cell", "neuron", "macrophage" |
| tissue_general | str | Coarse tissue grouping | "brain", "lung", "blood" |
| tissue | str | Specific tissue | "prefrontal cortex", "alveolar tissue" |
| disease | str | Disease state | "normal", "COVID-19", "lung adenocarcinoma" |
| assay | str | Sequencing assay | "10x 3' v3", "Smart-seq2" |
| is_primary_data | bool | True = unique cell | Always filter True |
| donor_id | str | Donor identifier | Used for batch effects |
tissue_general vs tissue
Use tissue_general for broad cross-tissue analyses and tissue for specific tissue queries:
# Broad: all immune system cells
obs_value_filter = "tissue_general == 'immune system'"
# Specific: only PBMCs
obs_value_filter = "tissue == 'peripheral blood mononuclear cell'"
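Filter strings like these can also be assembled from Python values, which helps avoid quoting mistakes when the same conditions are reused across queries. The helper below is a hypothetical convenience sketch (`build_value_filter` is not part of `cellxgene_census`); it only joins conditions with `and` and assumes the column names are valid obs or var fields.

```python
def build_value_filter(**conditions):
    """Join keyword conditions into a value_filter string using `and`.

    bool        -> column == True / False
    list, tuple -> column in [...]
    other       -> column == <repr of value>
    """
    clauses = []
    for column, value in conditions.items():
        if isinstance(value, bool):
            clauses.append(f"{column} == {value}")
        elif isinstance(value, (list, tuple)):
            clauses.append(f"{column} in {list(value)!r}")
        else:
            clauses.append(f"{column} == {value!r}")
    return " and ".join(clauses)


obs_filter = build_value_filter(
    cell_type="macrophage",
    tissue_general=["lung", "liver"],
    is_primary_data=True,
)
print(obs_filter)
# cell_type == 'macrophage' and tissue_general in ['lung', 'liver'] and is_primary_data == True
```

The resulting string can be passed as `obs_value_filter` to `get_anndata()` or as `value_filter` to `get_obs()`. Conditions that need `or` or inequality operators would still have to be written by hand.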
Goal: Compare macrophage gene expression across tissues.
import cellxgene_census
import scanpy as sc

with cellxgene_census.open_soma() as census:
    adata = cellxgene_census.get_anndata(
        census=census,
        organism="Homo sapiens",
        obs_value_filter=(
            "cell_type == 'macrophage' and "
            "tissue_general in ['lung', 'liver', 'brain'] and "
            "is_primary_data == True"
        ),
        obs_column_names=["cell_type", "tissue_general", "donor_id", "disease"],
    )
print(f"Macrophages: {adata.n_obs} cells from {adata.obs['tissue_general'].nunique()} tissues")

# Standard scanpy analysis
sc.pp.normalize_total(adata, target_sum=1e4)
sc.pp.log1p(adata)
sc.pp.highly_variable_genes(adata, n_top_genes=2000)
sc.pp.pca(adata, n_comps=50)
sc.pp.neighbors(adata)
sc.tl.umap(adata)

# Differential expression across tissues
sc.tl.rank_genes_groups(adata, groupby="tissue_general")
sc.pl.umap(adata, color=["tissue_general", "disease"])
Goal: Compare marker gene expression between COVID-19 and healthy controls.
Visualize the comparison with a scanpy dotplot or matrixplot.
| Parameter | Function/Endpoint | Default | Range / Options | Effect |
|---|---|---|---|---|
| organism | get_anndata, get_obs | required | "Homo sapiens", "Mus musculus" | Species selection |
| census_version | open_soma | latest stable | Date string "YYYY-MM-DD" | Pin to specific data release |
| obs_value_filter | get_anndata, get_obs | None | SOMA filter expression | Cell-level filtering |
| var_value_filter | get_anndata, get_var | None | SOMA filter expression | Gene-level filtering |
| obs_column_names | get_anndata, get_obs | all columns | list of field names | Reduces data transfer |
| batch_size | experiment_dataloader | 128 | 32–512 | PyTorch batch size |
| shuffle | experiment_dataloader | False | True/False | Randomize training order |
Always filter is_primary_data == True: Without this filter, duplicate cells across datasets inflate counts and bias analyses.
Estimate query size before loading: Call get_obs() with column_names=["soma_joinid"] to count cells before downloading expression data. Use out-of-core processing for >100k cells.
Pin census_version for reproducibility: The default "latest stable" changes periodically. Always specify the version for published analyses.
Select only needed metadata columns: Passing obs_column_names reduces data transfer and memory usage significantly for large queries.
Use tissue_general for cross-tissue analyses: The tissue field has hundreds of specific values; tissue_general provides ~30 coarse groupings suitable for comparative analyses.
Anti-pattern: querying all genes when you need a few. Specify var_value_filter to retrieve only genes of interest. Downloading the full ~60k gene matrix for 3 marker genes wastes bandwidth and memory.
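The size check recommended above can be turned into a small decision helper. This is a rough sketch under stated assumptions: `plan_query` is a hypothetical function, the 100k-cell threshold is the rule of thumb used in this guide, and the memory figure is a dense upper bound at 4 bytes per value (Census matrices are sparse, so real usage is normally lower).

```python
def plan_query(n_cells, n_genes, max_cells_in_memory=100_000, bytes_per_value=4):
    """Pick a loading strategy and estimate a dense upper bound on memory.

    Returns (strategy, dense_gib). The threshold and the bytes-per-value
    figure are assumptions, not Census limits.
    """
    dense_gib = n_cells * n_genes * bytes_per_value / 1024**3
    if n_cells <= max_cells_in_memory:
        strategy = "get_anndata"
    else:
        strategy = "axis_query (out-of-core)"
    return strategy, dense_gib


# n_cells would come from len(get_obs(..., column_names=["soma_joinid"]))
for n_cells, n_genes in [(15_000, 60_664), (500_000, 3)]:
    strategy, dense_gib = plan_query(n_cells, n_genes)
    print(f"{n_cells:,} cells x {n_genes:,} genes -> {strategy}, <= {dense_gib:.2f} GiB dense")
```

The second case shows why both practices matter together: even with only three genes selected, a query over 500,000 cells is still routed to streaming because the cell count exceeds the threshold.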
import cellxgene_census

with cellxgene_census.open_soma() as census:
    metadata = cellxgene_census.get_obs(
        census, "homo_sapiens",
        value_filter="is_primary_data == True",
        column_names=["tissue_general", "cell_type", "disease"]
    )

# Cells, distinct cell types, and distinct diseases per tissue
summary = metadata.groupby("tissue_general").agg(
    n_cells=("cell_type", "size"),
    n_cell_types=("cell_type", "nunique"),
    n_diseases=("disease", "nunique"),
).sort_values("n_cells", ascending=False)
print(summary.head(10))
import cellxgene_census

with cellxgene_census.open_soma() as census:
    adata = cellxgene_census.get_anndata(
        census=census,
        organism="Homo sapiens",
        obs_value_filter="tissue_general == 'heart' and is_primary_data == True",
        obs_column_names=["cell_type", "disease", "donor_id", "assay"],
    )

# Save the query result for offline analysis
adata.write_h5ad("heart_cells.h5ad")
print(f"Saved {adata.n_obs} cells to heart_cells.h5ad")
| Problem | Cause | Solution |
|---|---|---|
| MemoryError on get_anndata() | Query returns too many cells | Check count with get_obs() first; use out-of-core axis_query() for >100k cells |
| Duplicate cells in results | Missing is_primary_data == True filter | Add is_primary_data == True to all obs_value_filter queries |
| Gene not found | Wrong gene name or gene not in Census | Check spelling (case-sensitive); try Ensembl ID via feature_id; verify with get_presence_matrix() |
| ConnectionError / timeout | Census backend temporarily unavailable | Retry after 1-2 minutes; pin a specific census_version for reliability |
| Version inconsistencies | Using default "latest" across sessions | Always specify census_version in production code |
| Slow query performance | Downloading all metadata columns | Specify only needed columns via obs_column_names |
| ImportError: cellxgene_census | Package not installed | pip install cellxgene-census (note the hyphen) |
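The "retry after 1-2 minutes" advice for connection errors can be automated with a small wrapper. This is a generic sketch, not a Census feature: `with_retries` is a hypothetical helper, and the exception types listed in `retry_on` are an assumption, since the error actually raised by the backend may differ and should be checked against what you observe.

```python
import time


def with_retries(fn, attempts=3, delay_seconds=60.0,
                 retry_on=(ConnectionError, TimeoutError), sleep=time.sleep):
    """Call fn(), retrying on the given exceptions with a fixed delay.

    The last failure is re-raised once all attempts are used up.
    """
    for attempt in range(1, attempts + 1):
        try:
            return fn()
        except retry_on:
            if attempt == attempts:
                raise
            sleep(delay_seconds)


# Demonstration with a fake query that fails twice before succeeding
calls = {"n": 0}


def flaky_query():
    calls["n"] += 1
    if calls["n"] < 3:
        raise ConnectionError("backend temporarily unavailable")
    return "ok"


# sleep is stubbed out so the demonstration does not actually wait
result = with_retries(flaky_query, attempts=3, sleep=lambda seconds: None)
print(result, calls["n"])  # ok 3
```

In practice, `fn` would be a function that opens the Census inside its own `with cellxgene_census.open_soma(...)` block and returns the query result, so each retry starts from a fresh connection.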