| name | scvi-covariate-validation |
| description | Validate scVI batch correction covariate choices by comparing latent space quality metrics |
| author | smith6jt |
| date | "2026-02-23T00:00:00.000Z" |
scVI Covariate Validation - Research Notes
Experiment Overview
| Item | Details |
|---|---|
| Date | 2026-02-23 |
| Goal | Determine whether Age/Gender categorical covariates improve or degrade scVI latent space quality for disease-progression analysis |
| Environment | Python 3.10, scvi-tools 1.4.0, scanpy 1.11.5, NVIDIA RTX 4000 Ada, conda scvi-env |
| Status | Success — Original model with covariates validated as superior |
Context
The Islet Explorer project uses scVI to batch-correct 2.6M single-cell CODEX proteomic measurements across 15 nPOD pancreas donors (5 ND, 5 Aab+, 5 T1D). The original model was trained with categorical_covariate_keys=["Age", "Gender"] but NO batch_key. A question arose: since each donor has a unique Age value (float64 precision), Age+Gender effectively encodes donor identity. Does this help or hurt the latent space?
Key confounding: each imageid (donor) belongs to exactly one donor_status (disease group), so donor identity is fully nested within disease group. Aggressively removing donor signal risks also removing disease biology. The covariate mechanism offers a softer correction than batch_key.
scVI Covariate Mechanisms
batch_key: Primary batch variable. scVI learns a batch-specific decoder, forcing the encoder to produce batch-invariant latent representations. Aggressive correction — can remove correlated biological signal.
categorical_covariate_keys: One-hot encoded and concatenated to the network inputs: the decoder by default, and also the encoder when the model is created with encode_covariates=True. Softer conditioning: the model can learn to use or ignore these covariates as needed. Does NOT force invariance.
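To make the idea of one-hot conditioning concrete, here is a minimal NumPy sketch of encoding a categorical covariate and appending it to a layer's input. It illustrates the general mechanism only and is not scvi-tools' internal code; the helper `one_hot`, the array names, and the toy sizes are assumptions made for this example.

```python
import numpy as np

def one_hot(codes: np.ndarray, n_categories: int) -> np.ndarray:
    """Return an (n_cells, n_categories) one-hot matrix for integer codes."""
    out = np.zeros((codes.shape[0], n_categories), dtype=np.float32)
    out[np.arange(codes.shape[0]), codes] = 1.0
    return out

rng = np.random.default_rng(0)
z = rng.normal(size=(4, 10)).astype(np.float32)  # toy latent vectors (n_latent=10)
gender_codes = np.array([0, 1, 1, 0])            # hypothetical covariate codes

# Conditioning = appending the covariate indicator columns to the layer input.
decoder_input = np.concatenate([z, one_hot(gender_codes, 2)], axis=1)
print(decoder_input.shape)  # (4, 12)
```

Because the covariate only adds input columns, the network is free to give those columns small weights, which is one way to read the "use or ignore" behaviour described above.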
Verified Workflow
1. Check covariate uniqueness
Before comparing models, verify whether your covariates uniquely identify batches:
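import anndata as ad

adata = ad.read_h5ad("single_cell_analysis/CODEX_scvi_BioCov_phenotyped_newDuctal.h5ad")

# One row per donor, holding the Age/Gender values recorded for that donor
donor_meta = adata.obs.groupby('imageid', observed=True)[['Age', 'Gender']].first()
# Distinct Age+Gender combinations across donors
age_gender_pairs = donor_meta[['Age', 'Gender']].drop_duplicates()

# Equal counts mean that Age+Gender uniquely identifies each donor
print(f"Unique donors: {len(donor_meta)}")
print(f"Unique Age+Gender pairs: {len(age_gender_pairs)}")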
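The count comparison above can also be expressed as an explicit two-way uniqueness test. The sketch below is one possible way to do that with pandas; the function name `covariates_identify_donors` and the toy table are hypothetical and stand in for `adata.obs`.

```python
import pandas as pd

def covariates_identify_donors(obs: pd.DataFrame, donor_key: str, covariate_keys: list) -> bool:
    """True if each donor has exactly one covariate combination and vice versa."""
    pairs = obs[[donor_key] + covariate_keys].drop_duplicates()
    one_combo_per_donor = not pairs[donor_key].duplicated().any()
    one_donor_per_combo = not pairs.duplicated(subset=covariate_keys).any()
    return bool(one_combo_per_donor and one_donor_per_combo)

# Toy stand-in for adata.obs (illustrative values, not the study's data).
obs = pd.DataFrame({
    "imageid": ["d1", "d1", "d2", "d2", "d3"],
    "Age": [20.59, 20.59, 20.70, 20.70, 20.75],
    "Gender": ["F", "F", "M", "M", "F"],
})
print(covariates_identify_donors(obs, "imageid", ["Age", "Gender"]))  # True
```

A `True` result means the covariates act as a donor identifier, which is the situation this note investigates.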
2. Retrain without covariates (identical architecture)
Keep everything the same except remove categorical_covariate_keys:
import scvi

# Original model registration, shown for reference only:
# scvi.model.SCVI.setup_anndata(
#     adata, layer="counts",
#     categorical_covariate_keys=["Age", "Gender"]
# )

# Comparison model: same data and layer, no covariates
scvi.model.SCVI.setup_anndata(
    adata, layer="counts"
)

# Architecture and training settings match the original model
model = scvi.model.SCVI(
    adata,
    n_latent=10, n_layers=2, n_hidden=128,
    dropout_rate=0.1, dispersion='gene-batch',
    gene_likelihood='nb'
)
model.train(max_epochs=400, early_stopping=True,
            early_stopping_patience=45)
3. Extract and aggregate latent representations
For islet-level analysis, aggregate single-cell latents to islet level:
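import numpy as np

# Latent embedding for every cell, shape (n_cells, n_latent)
latent = model.get_latent_representation()

# Keep cells whose Parent annotation is an islet core ("Islet_<number>")
core_mask = adata.obs['Parent'].str.match(r'^Islet_\d+$', na=False).to_numpy()
core = adata.obs[core_mask].copy()
core['latent'] = list(latent[core_mask])

# Average the latent vectors of each islet, skipping islets with fewer than 20 cells
islet_latents = {}
for islet_id, group in core.groupby('islet_id'):
    if len(group) >= 20:
        islet_latents[islet_id] = np.mean(np.stack(group['latent'].values), axis=0)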
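The loop above can also be written as a grouped mean, which keeps the islet IDs as an index and makes it easier to align labels later. This is a minimal sketch on toy data; `aggregate_latents` and its arguments are hypothetical names, and the 20-cell threshold is taken from the loop above.

```python
import numpy as np
import pandas as pd

def aggregate_latents(latent: np.ndarray, islet_ids, min_cells: int = 20) -> pd.DataFrame:
    """Mean latent vector per islet, keeping islets with at least min_cells cells."""
    df = pd.DataFrame(latent)
    df["islet_id"] = np.asarray(islet_ids)
    grouped = df.groupby("islet_id")
    counts = grouped.size()
    means = grouped.mean()
    return means.loc[counts[counts >= min_cells].index]

rng = np.random.default_rng(0)
latent = rng.normal(size=(50, 10))                 # toy latent matrix
islet_ids = ["A"] * 30 + ["B"] * 15 + ["C"] * 5    # only "A" has >= 20 cells
islet_means = aggregate_latents(latent, islet_ids, min_cells=20)
print(islet_means.shape)  # (1, 10)
```

`islet_means.to_numpy()` gives a matrix that can be used in place of the stacked dictionary values, with `islet_means.index` providing the row order.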
4. Run diagnostic metrics (5 complementary approaches)
import numpy as np
import pandas as pd
from scipy.stats import spearmanr
from sklearn.decomposition import PCA
from sklearn.linear_model import LinearRegression
from sklearn.metrics import silhouette_samples
from sklearn.neighbors import NearestNeighbors

X = np.stack(list(islet_latents.values()))  # shape (n_islets, n_latent)
# Per-islet labels, in the same order as the rows of X
imageid = [...]
status = [...]

# Metric 1: R² of the latent space explained by each label.
# Labels are one-hot encoded so that category codes carry no artificial ordering.
for label_name, labels in [('imageid', imageid), ('donor_status', status)]:
    design = pd.get_dummies(pd.Series(labels)).to_numpy(dtype=float)
    r2 = LinearRegression().fit(design, X).score(design, X)
    print(f"R²({label_name}): {r2:.4f}")

# Metric 2: eta² per latent dimension (between-donor / total sum of squares)
for dim in range(X.shape[1]):
    groups = {}
    for label, val in zip(imageid, X[:, dim]):
        groups.setdefault(label, []).append(val)
    ss_between = sum(len(g) * (np.mean(g) - np.mean(X[:, dim]))**2 for g in groups.values())
    ss_total = np.sum((X[:, dim] - np.mean(X[:, dim]))**2)
    eta2 = ss_between / ss_total
    print(f"Dim {dim}: eta²(donor)={eta2:.4f}")

# Metric 3: silhouette by donor and by status (cosine distance)
sil_donor = silhouette_samples(X, imageid, metric='cosine')
sil_status = silhouette_samples(X, status, metric='cosine')
print(f"Sil(donor): {np.mean(sil_donor):.4f}")
print(f"Sil(status): {np.mean(sil_status):.4f}")

# Metric 4: k-NN mixing. n_neighbors=16 leaves k=15 after dropping each point itself
nn = NearestNeighbors(n_neighbors=16, metric='cosine').fit(X)
_, indices = nn.kneighbors(X)
same_donor = np.mean([
    np.mean([imageid[j] == imageid[i] for j in indices[i, 1:]])
    for i in range(len(X))
])
print(f"k-NN same-donor: {same_donor:.4f}")

# Metric 5: correlation of the first principal component with pseudotime.
# pseudotime holds per-islet values aligned with the rows of X and is computed elsewhere.
pca = PCA(n_components=5).fit(X)
pc0 = pca.transform(X)[:, 0]
r, p = spearmanr(pc0, pseudotime)
print(f"PC0 vs pseudotime: r={r:.3f}")
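The k-NN step above prints a same-donor fraction, while the results in this note are reported as enrichment ("x"). One plausible definition, assumed here because the note does not spell it out, is the observed same-label fraction divided by the fraction expected if labels were randomly mixed. `knn_label_enrichment` and the toy data are hypothetical.

```python
import numpy as np
from sklearn.neighbors import NearestNeighbors

def knn_label_enrichment(X, labels, k=15, metric="cosine"):
    """Observed same-label neighbor fraction divided by the random-mixing expectation."""
    labels = np.asarray(labels)
    nn = NearestNeighbors(n_neighbors=k + 1, metric=metric).fit(X)
    _, idx = nn.kneighbors(X)
    neighbors = idx[:, 1:]  # drop column 0, assumed to be the point itself
    observed = np.mean(labels[neighbors] == labels[:, None])
    # Chance that a randomly drawn *other* point shares the label, averaged over points.
    _, inverse, counts = np.unique(labels, return_inverse=True, return_counts=True)
    expected = np.mean((counts[inverse] - 1) / (len(labels) - 1))
    return observed / expected

rng = np.random.default_rng(0)
# Two tight toy clusters pointing in different directions
X_toy = np.vstack([
    rng.normal([5, 0, 0], 0.1, size=(20, 3)),
    rng.normal([0, 5, 0], 0.1, size=(20, 3)),
])
labels_toy = ["d1"] * 20 + ["d2"] * 20
enrichment = knn_label_enrichment(X_toy, labels_toy, k=15)
print(f"same-label enrichment: {enrichment:.2f}x")
```

A value near 1 indicates well-mixed labels; larger values indicate that neighbors share the label more often than chance.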
5. Generate supplemental figure (8-panel comparison)
Failed Attempts (Critical)
| Attempt | Why it Failed | Lesson Learned |
|---|---|---|
| Using batch_key="imageid" | Each imageid (donor) belongs to exactly one donor_status (disease group), so donor is fully nested within disease group. scVI learns a batch-invariant latent space, which removes disease signal along with donor signal | Never use batch_key when the batch variable is perfectly confounded with the biological variable of interest. Use categorical_covariate_keys for softer correction |
| Removing all covariates (no Age/Gender) | Donor variance in latent space increased by 40% (R²: 0.397→0.554). k-NN same-donor fraction jumped from 60%→85%. Pseudotime gradient in PC0 disappeared (r: 0.71→0.10) | Even soft covariates that are bijective with donor identity provide meaningful correction. The one-hot conditioning helps the model learn donor-agnostic features |
| Evaluating only R²(status) for biology | No-covariate model showed HIGHER R²(status) (0.149 vs 0.107), which looked better in isolation | Always evaluate biology metrics RELATIVE to donor metrics. The no-covariate model had higher status R² but also much higher donor R² — the extra biology signal was dwarfed by donor confounding. Use the excess eta² (donor - status) as the key metric |
| Relying solely on silhouette scores | Silhouette is sensitive to cluster shape assumptions and can be misleading for non-spherical distributions | Use multiple complementary metrics: R², eta², silhouette, k-NN mixing, and PCA analysis. Agreement across ≥4 metrics gives confidence |
| Low max_tokens in diagnostics script | Training completed but diagnostic output was truncated/lost when background task files were cleaned up | Save all diagnostic output to explicit log files (not just stdout). Use tee plus a dedicated log file path |
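The logging lesson in the last row can also be applied inside the diagnostics script itself, as an alternative to piping through tee. The following is a minimal sketch using Python's standard `logging` module; the logger name, the function `make_diagnostic_logger`, and the log file location are assumptions for illustration.

```python
import logging
import os
import sys
import tempfile

def make_diagnostic_logger(log_path: str) -> logging.Logger:
    """Logger that writes each message to stdout and appends it to log_path."""
    logger = logging.getLogger("scvi_diagnostics")  # hypothetical logger name
    logger.setLevel(logging.INFO)
    logger.handlers.clear()   # avoid duplicate lines if called more than once
    logger.propagate = False
    for handler in (logging.StreamHandler(sys.stdout), logging.FileHandler(log_path)):
        handler.setFormatter(logging.Formatter("%(asctime)s %(message)s"))
        logger.addHandler(handler)
    return logger

# Hypothetical log location; in practice use a path that survives task cleanup.
log_path = os.path.join(tempfile.gettempdir(), "scvi_diagnostics.log")
log = make_diagnostic_logger(log_path)
log.info("R2(imageid): %.4f", 0.397)
```

Replacing the `print` calls in the diagnostics with `log.info` keeps a persistent copy of every metric even if stdout is lost.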
Final Parameters
n_latent: 10
n_layers: 2
n_hidden: 128
dropout_rate: 0.1
dispersion: gene-batch
gene_likelihood: nb
categorical_covariate_keys: ["Age", "Gender"]
batch_key: null
max_epochs: 400
early_stopping: true
early_stopping_patience: 45
n_islets_evaluated: 1024
metrics_used:
- R² (donor vs status, linear regression)
- eta² per latent dimension (ANOVA decomposition)
- Silhouette score (cosine metric)
- k-NN same-donor/same-status fraction (k=15)
- PCA component analysis (pseudotime correlation)
acceptance_criteria:
- Original model wins majority (≥3/5) of key metrics
- Donor silhouette < +0.1 (well-mixed)
- PC0 pseudotime |r| > 0.3 (biology drives principal axis)
conda_env: scvi-env
python: 3.10
scvi-tools: 1.4.0
gpu: NVIDIA RTX 4000 Ada Generation
training_time: ~82 min (2.6M cells, no covariates)
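The acceptance criteria listed above are simple threshold checks, so they can be encoded directly. This is a small sketch, not the script used in the experiment; `check_acceptance` is a hypothetical helper, and the inputs are the original model's values from the results table in this note.

```python
def check_acceptance(wins: int, donor_silhouette: float, pc0_pseudotime_r: float) -> dict:
    """Return pass/fail for each acceptance criterion."""
    return {
        "wins_majority": wins >= 3,                      # at least 3 of 5 key metrics
        "donor_well_mixed": donor_silhouette < 0.1,      # donor silhouette below +0.1
        "biology_on_pc0": abs(pc0_pseudotime_r) > 0.3,   # |r| above 0.3
    }

# Original (Age/Gender) model, values as reported in the results table.
result = check_acceptance(wins=4, donor_silhouette=-0.041, pc0_pseudotime_r=0.712)
print(result, all(result.values()))
```

Returning the individual outcomes, rather than a single boolean, makes it clear which criterion fails when a model is rejected.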
Key Insights
- Soft vs hard correction: categorical_covariate_keys provides gentler correction than batch_key. When batch is confounded with biology (e.g., each disease group has unique donors), soft correction preserves biological signal while still reducing donor effects.
- Unique Age values = effective donor ID: Float64 age values (e.g., 20.59, 20.70, 20.75) are practically unique per donor. Combined with Gender, this is a bijection to imageid. The model learns donor-specific adjustments via these covariates.
- Excess eta² is the key metric: eta²(donor) - eta²(status) captures how much latent variance is attributable to donor identity BEYOND what's explained by disease. Lower excess = better.
- PC0 pseudotime gradient is the visual proof: The most compelling evidence is comparing PCA plots colored by pseudotime. The original model shows a smooth disease-progression gradient; the no-covariate model buries it under donor clustering.
- Multiple metrics needed: No single metric tells the whole story. R² showed mixed results (biology slightly better without covariates), but silhouette, k-NN, eta², and PCA all agreed the original model was superior. Use ≥4 complementary metrics.
- Training time: The no-covariate model trained in ~82 minutes. The original model was pre-existing, so its training time was not re-measured in this experiment. The covariates should add negligible overhead (just one-hot encoding concatenation).
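The excess eta² described in the insights above can be sketched as follows. Averaging the per-dimension differences into a single number is an assumption, since the note does not state how dimensions were combined; `eta_squared` and `excess_eta_squared` are hypothetical helper names, and the data is a toy example.

```python
import numpy as np

def eta_squared(values, labels) -> float:
    """Between-group sum of squares over total sum of squares for one dimension."""
    values = np.asarray(values, dtype=float)
    labels = np.asarray(labels)
    grand_mean = values.mean()
    ss_total = np.sum((values - grand_mean) ** 2)
    ss_between = sum(
        np.sum(labels == g) * (values[labels == g].mean() - grand_mean) ** 2
        for g in np.unique(labels)
    )
    return float(ss_between / ss_total)

def excess_eta_squared(X, donor, status) -> float:
    """Mean over latent dimensions of eta2(donor) - eta2(status)."""
    X = np.asarray(X, dtype=float)
    diffs = [eta_squared(X[:, d], donor) - eta_squared(X[:, d], status)
             for d in range(X.shape[1])]
    return float(np.mean(diffs))

# Toy case: all variance is between donors, none between disease groups.
X_toy = np.array([[0.0], [0.0], [2.0], [2.0], [0.0], [0.0], [2.0], [2.0]])
donor = ["a", "a", "b", "b", "c", "c", "d", "d"]
status = ["ND", "ND", "ND", "ND", "T1D", "T1D", "T1D", "T1D"]
print(excess_eta_squared(X_toy, donor, status))  # 1.0
```

Because donors are nested within disease groups, eta²(donor) can never be smaller than eta²(status), so the excess is non-negative and reads as donor-only variance.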
Execution Results (2026-02-23)
| Metric | Original (Age/Gender) | No Covariates | Winner |
|---|---|---|---|
| R² imageid | 0.397 | 0.554 | Original |
| R² donor_status | 0.107 | 0.149 | No covariates |
| eta² excess (donor-status) | 0.291 | 0.405 | Original |
| Silhouette donor (cosine) | -0.041 | +0.250 | Original |
| k-NN same-donor enrichment | 8.4x | 11.9x | Original |
| k-NN same-status enrichment | 2.17x | 2.62x | No covariates |
| PC0 pseudotime r | 0.712 | 0.101 | Original |
| Training time | N/A (pre-existing) | 82.5 min | — |
| Overall winner | 4/5 key metrics | 1/5 | Original |