| name | lamindb-data-management |
| description | Open-source FAIR biology data framework. Version artifacts (AnnData, DataFrame, Zarr), track lineage, validate via ontologies (Bionty), query datasets. Integrates with Nextflow, Snakemake, W&B, scVI. For scRNA-seq use scanpy; for ontology lookups use bionty. |
| license | Apache-2.0 |
LaminDB — Biological Data Management
Overview
LaminDB is an open-source data framework for biology that makes data queryable, traceable, and FAIR (Findable, Accessible, Interoperable, Reusable). It combines data lakehouse architecture, lineage tracking, biological ontology validation, and a unified Python API for managing biological datasets from raw files to annotated, curated artifacts.
When to Use
- Managing and versioning biological datasets (scRNA-seq, spatial, flow cytometry, multi-modal)
- Tracking computational lineage (which code produced which data)
- Validating and curating data against biological ontologies (cell types, genes, tissues, diseases)
- Building queryable data lakehouses across multiple experiments
- Ensuring reproducibility with automatic environment and provenance capture
- Integrating with workflow managers (Nextflow, Snakemake) or MLOps (W&B, MLflow)
- Standardizing metadata with ontology-based annotation (Bionty)
- For single-cell analysis pipelines (clustering, DE), use scanpy instead
- For ontology lookups only without data management, use bionty directly
Prerequisites
pip install lamindb
pip install 'lamindb[bionty,zarr,fcs]'
Setup: Requires instance initialization before use:
lamin login
lamin init --storage ./my-data --name my-project
Instance types: Local SQLite (development), Cloud + SQLite (small teams), Cloud + PostgreSQL (production).
Quick Start
import lamindb as ln
ln.track()
import pandas as pd
df = pd.DataFrame({"gene": ["TP53", "BRCA1"], "score": [0.95, 0.87]})
artifact = ln.Artifact.from_df(df, key="results/gene_scores.parquet", description="Gene importance scores")
artifact.save()
print(f"Saved: {artifact.uid}, size: {artifact.size}")
results = ln.Artifact.filter(key__startswith="results/").df()
print(f"Found {len(results)} artifacts")
ln.finish()
Core API
1. Artifacts — Data Objects
Artifacts are versioned data objects (files, DataFrames, AnnData, arrays).
import lamindb as ln
import pandas as pd
import anndata as ad
ln.track()
df = pd.DataFrame({"sample": ["A", "B"], "value": [1.5, 2.3]})
df_artifact = ln.Artifact.from_df(df, key="experiments/batch1.parquet").save()
print(f"ID: {df_artifact.uid}, Version: {df_artifact.version}")
adata = ad.read_h5ad("counts.h5ad")
adata_artifact = ln.Artifact.from_anndata(adata, key="scrna/batch1.h5ad", description="scRNA-seq batch 1").save()
figure_artifact = ln.Artifact("results/figure.png", key="figures/fig1.png").save()
df_loaded = df_artifact.load()  # load into memory
path = df_artifact.cache()  # download to a local path
df_updated = df.assign(value=df["value"] * 2)  # example of changed data
artifact_v2 = ln.Artifact.from_df(df_updated, key="experiments/batch1.parquet", revises=df_artifact).save()
print(f"v1: {df_artifact.uid}, v2: {artifact_v2.uid}")
print(f"Latest version: {artifact_v2.is_latest}")
df_artifact.delete(permanent=False)
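The `revises=` relationship can be pictured as a chain of records that share a key, where creating a revision demotes the previous version. The following is a minimal pure-Python sketch of that idea, not LaminDB's implementation; `ToyArtifact` and `version_history` are hypothetical names used only for illustration, and the sketch runs without a LaminDB instance.

```python
from dataclasses import dataclass
from typing import Optional
import itertools

_uid_counter = itertools.count(1)

@dataclass
class ToyArtifact:
    """Hypothetical stand-in for a versioned record (not the LaminDB class)."""
    key: str
    revises: Optional["ToyArtifact"] = None
    is_latest: bool = True
    uid: str = ""

    def __post_init__(self):
        # Every version gets its own identifier, even under the same key.
        self.uid = f"uid{next(_uid_counter):04d}"
        # Creating a revision demotes the version it revises.
        if self.revises is not None:
            self.revises.is_latest = False

def version_history(artifact):
    """Walk the revises chain from newest to oldest."""
    chain = []
    while artifact is not None:
        chain.append(artifact)
        artifact = artifact.revises
    return chain

v1 = ToyArtifact(key="experiments/batch1.parquet")
v2 = ToyArtifact(key="experiments/batch1.parquet", revises=v1)
history = version_history(v2)
print([a.uid for a in history], v1.is_latest, v2.is_latest)
```

Both versions keep the same key and differ only in identifier and position in the chain, which is why a new key is not needed for an updated dataset.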
2. Lineage Tracking
Automatic provenance capture for reproducibility.
import lamindb as ln
ln.track(params={"method": "PCA", "n_components": 50})
input_data = ln.Artifact.get(key="raw/counts.h5ad")
adata = input_data.load()
output = ln.Artifact.from_anndata(adata, key="processed/pca.h5ad").save()
output.view_lineage()
ln.finish()
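Conceptually, lineage is a graph in which each run links the artifacts it read to the artifacts it wrote. The sketch below models that with plain dictionaries and walks upstream from an output. The `runs` log, the script names, and the `upstream` helper are hypothetical and only illustrate the idea; they are not how LaminDB stores or queries lineage.

```python
from collections import deque

# Hypothetical run log: each run records the artifact keys it read and wrote.
runs = [
    {"transform": "qc.py", "inputs": ["raw/counts.h5ad"], "outputs": ["processed/filtered.h5ad"]},
    {"transform": "pca.py", "inputs": ["processed/filtered.h5ad"], "outputs": ["processed/pca.h5ad"]},
]

def upstream(key, runs):
    """Return every artifact key that `key` was derived from, nearest first."""
    produced_by = {out: run for run in runs for out in run["outputs"]}
    seen, queue = [], deque([key])
    while queue:
        current = queue.popleft()
        run = produced_by.get(current)
        if run is None:
            continue  # a raw input with no recorded producer
        for parent in run["inputs"]:
            if parent not in seen:
                seen.append(parent)
                queue.append(parent)
    return seen

ancestors = upstream("processed/pca.h5ad", runs)
print(ancestors)
```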
3. Querying and Filtering
Search and filter artifacts by metadata, features, and annotations.
import lamindb as ln
artifacts = ln.Artifact.filter(key__startswith="scrna/").df()
print(f"Found {len(artifacts)} scRNA-seq artifacts")
recent = ln.Artifact.filter(
created_at__gte="2026-01-01",
size__gt=1000000
).df()
immune = ln.Artifact.filter(
cell_types__name="T cell",
tissues__name="PBMC"
).df()
artifact = ln.Artifact.get(key="results/final.parquet")
artifact = ln.Artifact.filter(key="results/final.parquet").one_or_none()
results = ln.Artifact.search("gene expression PBMC")
artifact = ln.Artifact.get(key="large_dataset.h5ad")
backed = artifact.open()
subset = backed[backed.obs["cell_type"] == "B cell"]
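The `field__lookup=value` arguments used above follow the Django-style convention of a field name, a double underscore, and a comparison suffix. As a rough illustration of how such expressions can be interpreted, here is a small pure-Python filter over dictionaries. `toy_filter` and `LOOKUPS` are hypothetical names, and the sketch deliberately ignores relation traversal such as `cell_types__name`.

```python
import operator

# Comparison functions for a few Django-style lookup suffixes.
LOOKUPS = {
    "exact": operator.eq,
    "gt": operator.gt,
    "gte": operator.ge,
    "lte": operator.le,
    "startswith": lambda value, prefix: value.startswith(prefix),
    "contains": lambda value, part: part in value,
}

def toy_filter(records, **conditions):
    """Keep records that satisfy every `field__lookup=value` condition."""
    matched = []
    for record in records:
        ok = True
        for expr, expected in conditions.items():
            field, _, lookup = expr.partition("__")
            compare = LOOKUPS[lookup or "exact"]  # no suffix means equality
            if not compare(record[field], expected):
                ok = False
                break
        if ok:
            matched.append(record)
    return matched

records = [
    {"key": "scrna/batch1.h5ad", "size": 2_500_000},
    {"key": "scrna/batch2.h5ad", "size": 400_000},
    {"key": "figures/fig1.png", "size": 80_000},
]
large_scrna = toy_filter(records, key__startswith="scrna/", size__gt=1_000_000)
print([r["key"] for r in large_scrna])
```

Multiple keyword arguments combine with AND, which matches how the filter calls in this section are written.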
4. Annotation and Validation
Curate datasets against schemas and ontology terms.
import lamindb as ln
import bionty as bt
artifact = ln.Artifact.get(key="scrna/batch1.h5ad")
artifact.features.add_values({
"tissue": "PBMC",
"condition": "treated",
"organism": "human",
"batch": 1
})
adata = artifact.load()
curator = ln.curators.AnnDataCurator(adata, schema)  # schema: an ln.Schema defined elsewhere (not shown)
try:
    curator.validate()
    artifact = curator.save_artifact(key="validated/batch1.h5ad")
    print("Validation passed")
except ln.errors.ValidationError as e:
    print(f"Validation failed: {e}")
adata.obs["cell_type"] = bt.CellType.standardize(adata.obs["cell_type"])
5. Biological Ontologies (Bionty)
Access standardized biological vocabularies for annotation.
import bionty as bt
bt.CellType.import_source()
results = bt.CellType.search("T helper")
print(results.head())
t_cell = bt.CellType.get(name="T cell")
print(f"Ontology ID: {t_cell.ontology_id}")
children = t_cell.children.all()
parents = t_cell.parents.all()
print(f"Children: {[c.name for c in children]}")
validated = bt.CellType.validate(["T cell", "B cell", "Unknown_type"])
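To make the difference between validating and standardizing concrete, the sketch below uses a tiny hand-written vocabulary: validation reports which labels are already canonical, and standardization maps known synonyms to canonical names while leaving unknown labels untouched. `VOCAB` and both functions are hypothetical stand-ins written for this example; they do not reproduce Bionty's behavior in detail.

```python
# Hypothetical mini-vocabulary: canonical name -> known synonyms.
VOCAB = {
    "T cell": {"T-cell", "T lymphocyte"},
    "B cell": {"B-cell", "B lymphocyte"},
}

def validate(labels):
    """Return one boolean per label: True if it is a canonical name."""
    return [label in VOCAB for label in labels]

def standardize(labels):
    """Map known synonyms to canonical names; leave unknown labels unchanged."""
    synonym_to_name = {syn: name for name, syns in VOCAB.items() for syn in syns}
    return [synonym_to_name.get(label, label) for label in labels]

labels = ["T cell", "B-cell", "Unknown_type"]
before = validate(labels)
cleaned = standardize(labels)
after = validate(cleaned)
print(before, cleaned, after)
```

A label that is still invalid after standardization, such as `Unknown_type` here, needs manual review rather than automatic mapping.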
6. Collections and Organization
Group related artifacts for batch operations.
import lamindb as ln
artifacts = ln.Artifact.filter(key__startswith="scrna/batch_").all()
collection = ln.Collection(artifacts, name="scRNA-seq batches Q1 2026").save()
print(f"Collection: {collection.name}, {collection.n_objects} artifacts")
for artifact in collection.artifacts.all():
    print(f"  {artifact.key}: {artifact.size} bytes")
Key Concepts
Core Entity Model
| Entity | Purpose | Example |
|---|---|---|
| Artifact | Versioned data object | counts.h5ad, results.parquet |
| Run | Single code execution | Notebook run, script execution |
| Transform | Code definition (notebook, script, pipeline) | analysis.ipynb |
| Feature | Typed metadata field | tissue, condition, batch |
| Collection | Group of related artifacts | "Experiment batches" |
| ULabel | Universal label for custom categorization | "high_quality", "pilot" |
Data Types Supported
| Format | Method | Use Case |
|---|---|---|
| DataFrame | Artifact.from_df() | Tabular data, metadata tables |
| AnnData | Artifact.from_anndata() | Single-cell data |
| MuData | Artifact.from_mudata() | Multi-modal data |
| Any file | Artifact("path") | Images, FASTQ, custom formats |
| Zarr | Via zarr extra | Large array data |
| TileDB-SOMA | Via tiledbsoma extra | Scalable cell-level queries |
track() / finish() Pattern
Every analysis session should be wrapped:
ln.track(params={"key": "value"})
ln.finish()
Common Workflows
Workflow: Multi-Experiment Data Lakehouse
import lamindb as ln
import anndata as ad
ln.track()
data_files = ["batch1.h5ad", "batch2.h5ad", "batch3.h5ad"]
tissues = ["PBMC", "bone_marrow", "PBMC"]
conditions = ["control", "treated", "treated"]
for i, (file, tissue, condition) in enumerate(zip(data_files, tissues, conditions), start=1):
    adata = ad.read_h5ad(file)
    artifact = ln.Artifact.from_anndata(
        adata, key=f"scrna/batch_{i}.h5ad", description=f"scRNA-seq batch {i}"
    ).save()
    artifact.features.add_values({
        "tissue": tissue, "condition": condition, "batch": i
    })
    print(f"Registered batch {i}: {artifact.uid}")
treated_pbmc = ln.Artifact.filter(
key__startswith="scrna/",
features__tissue="PBMC",
features__condition="treated"
).all()
print(f"Found {len(treated_pbmc)} matching datasets")
adatas = [a.load() for a in treated_pbmc]
combined = ad.concat(adatas)
print(f"Combined: {combined.shape}")
ln.finish()
Workflow: Validated Data Curation
import lamindb as ln
import bionty as bt
import anndata as ad
ln.track()
bt.CellType.import_source()
bt.Gene.import_source(organism="human")
adata = ad.read_h5ad("raw_counts.h5ad")
print(f"Raw: {adata.shape}")
validated = bt.CellType.validate(adata.obs["cell_type"].unique())
if not all(validated):
    adata.obs["cell_type"] = bt.CellType.standardize(adata.obs["cell_type"])
gene_validated = bt.Gene.validate(adata.var_names)
print(f"Valid genes: {sum(gene_validated)}/{len(gene_validated)}")
curator = ln.curators.AnnDataCurator(adata, schema)  # schema: an ln.Schema defined elsewhere (not shown)
curator.validate()
artifact = curator.save_artifact(key="curated/validated_counts.h5ad")
print(f"Saved curated artifact: {artifact.uid}")
ln.finish()
Workflow: Nextflow Pipeline Integration
- In each Nextflow process, import lamindb and call ln.track()
- Load input artifacts with ln.Artifact.get(key=...); cache to a local path
- Run the analysis; save the output as a new artifact with ln.Artifact(...).save()
- Call ln.finish(); lineage automatically links inputs to outputs
Key Parameters
| Parameter | Function | Default | Options | Effect |
|---|---|---|---|---|
| key | Artifact() | None | String path | Hierarchical storage key (e.g., "project/data.h5ad") |
| description | Artifact() | None | String | Human-readable description |
| revises | Artifact() | None | Artifact | Previous version to revise |
| params | ln.track() | None | Dict | Parameters for the current run |
| organism | bt.Gene.import_source() | None | "human", "mouse" | Organism for ontology |
| permanent | .delete() | False | True/False | Permanent vs archive deletion |
| __startswith | .filter() | n/a | String | Key prefix filter |
| __gte, __lte | .filter() | n/a | Value | Greater/less than or equal |
| __contains | .filter() | n/a | String | Substring match |
Best Practices
- Always wrap analysis with ln.track() / ln.finish(): This captures lineage automatically. Without it, artifacts have no provenance.
- Use hierarchical keys: Structure as project/experiment/datatype/file.ext (e.g., immunology/exp42/scrna/counts.h5ad). This enables prefix-based queries.
- Anti-pattern, duplicating data instead of versioning: Use the revises= parameter to create new versions, not new keys for the same dataset.
- Validate early: Run schema validation before analysis. Catching bad metadata early saves debugging time downstream.
- Use ontologies for standardization: Map free-text labels to ontology terms (e.g., "T helper cell" → CL:0000912). This enables cross-dataset queries.
- Anti-pattern, loading large files without checking size: Use .filter().df() to inspect metadata first, then .load() or .open() (backed mode) for large files.
- Query metadata first, load data second: Filter with .filter() to find relevant artifacts, then load only what you need.
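The hierarchical-key convention above can be enforced with a small helper so that every registered artifact follows the same project/experiment/datatype/file layout. This is a sketch under that assumption; `make_key` is a hypothetical helper, not part of LaminDB, and the file names are made up for the example.

```python
from pathlib import PurePosixPath

def make_key(project, experiment, datatype, filename):
    """Join the four levels into a forward-slash storage key."""
    parts = [project, experiment, datatype, filename]
    for part in parts:
        # Reject empty components and embedded slashes so the depth stays fixed.
        if not part or "/" in part:
            raise ValueError(f"invalid key component: {part!r}")
    return str(PurePosixPath(*parts))

keys = [
    make_key("immunology", "exp42", "scrna", "counts.h5ad"),
    make_key("immunology", "exp42", "flow", "panel1.fcs"),
    make_key("immunology", "exp43", "scrna", "counts.h5ad"),
]
# A prefix selects everything under one experiment.
exp42 = [k for k in keys if k.startswith("immunology/exp42/")]
print(exp42)
```

Because the depth is fixed, a prefix such as `immunology/exp42/` maps directly onto a `key__startswith` query.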
Common Recipes
Recipe: Bulk Dataset Registration
import lamindb as ln
from pathlib import Path
ln.track()
data_dir = Path("raw_data/")
for fcs_file in data_dir.glob("*.fcs"):
    artifact = ln.Artifact(str(fcs_file), key=f"flow_cytometry/{fcs_file.name}").save()
    artifact.features.add_values({"assay": "flow_cytometry", "source": "batch_import"})
    print(f"Registered: {fcs_file.name} -> {artifact.uid}")
ln.finish()
Recipe: View and Export Lineage
import lamindb as ln
artifact = ln.Artifact.get(key="results/final_analysis.h5ad")
artifact.view_lineage()
run = artifact.run
print(f"Created by: {run.transform.name}")
print(f"User: {run.created_by.name}")
print(f"Date: {run.created_at}")
print(f"Input artifacts: {[a.key for a in run.input_artifacts.all()]}")
Recipe: Ontology Hierarchy Exploration
import bionty as bt
bt.CellType.import_source()
t_cell = bt.CellType.get(name="T cell")
print(f"Parents: {[p.name for p in t_cell.parents.all()]}")
print(f"Children: {[c.name for c in t_cell.children.all()]}")
children = t_cell.children.all()
for child in children:
    grandchildren = child.children.all()
    print(f"  {child.name}: {[gc.name for gc in grandchildren]}")
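The recipe above stops at grandchildren. Collecting every descendant to arbitrary depth is a standard graph traversal, sketched here over a plain parent-to-children mapping. The `CHILDREN` dictionary is illustrative data written for this example, not taken from the Cell Ontology, and `descendants` is a hypothetical helper; with real records, the lookup inside the loop would be replaced by the `.children.all()` calls shown in the recipe.

```python
# Hypothetical parent -> children mapping (illustrative, not real ontology data).
CHILDREN = {
    "T cell": ["helper T cell", "cytotoxic T cell"],
    "helper T cell": ["Th1 cell", "Th17 cell"],
    "cytotoxic T cell": [],
    "Th1 cell": [],
    "Th17 cell": [],
}

def descendants(term, children=CHILDREN):
    """Collect all terms below `term`, depth-first, without duplicates."""
    found = []
    stack = list(reversed(children.get(term, [])))
    while stack:
        current = stack.pop()
        if current in found:
            continue  # ontologies are DAGs, so a term can be reached twice
        found.append(current)
        stack.extend(reversed(children.get(current, [])))
    return found

all_below = descendants("T cell")
print(all_below)
```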
Troubleshooting
| Problem | Cause | Solution |
|---|---|---|
| InstanceNotSetupError | Instance not initialized | Run lamin init --storage ./data --name my-project |
| ln.track() fails | No transform context | Run inside a notebook/script, not REPL; or pass transform explicitly |
| Artifact key conflict | Key already exists (not a version) | Use revises= for versioning, or choose a different key |
| ValidationError | Data doesn't match schema | Run curator.validate() to see specific failures; standardize terms |
| Slow queries on large instances | No index on filtered field | Use .df() for overview first; add database indexes for frequently filtered fields |
| Ontology import fails | Network issue or wrong organism | Check internet connection; specify organism="human" explicitly |
| FileNotFoundError on .cache() | Cloud artifact not synced | Check storage connectivity; use artifact.load() instead for in-memory access |
Related Skills
- anndata-data-structure — AnnData format used as primary data container in LaminDB for single-cell data
- scanpy-scrna-seq — single-cell analysis pipeline; LaminDB manages data that scanpy analyzes
- scvi-tools-single-cell — deep learning models for single-cell; integrates with LaminDB for data/model tracking
References