scvi-covariate-validation
Validate scVI batch correction covariate choices by comparing latent space quality metrics
| name | scvi-covariate-validation |
| description | Validate scVI batch correction covariate choices by comparing latent space quality metrics |
| author | smith6jt |
| date | "2026-02-23T00:00:00.000Z" |
| Item | Details |
|---|---|
| Date | 2026-02-23 |
| Goal | Determine whether Age/Gender categorical covariates improve or degrade scVI latent space quality for disease-progression analysis |
| Environment | Python 3.10, scvi-tools 1.4.0, scanpy 1.11.5, NVIDIA RTX 4000 Ada, conda scvi-env |
| Status | Success — Original model with covariates validated as superior |
The Islet Explorer project uses scVI to batch-correct 2.6M single-cell CODEX proteomic measurements across 15 nPOD pancreas donors (5 ND, 5 Aab+, 5 T1D). The original model was trained with categorical_covariate_keys=["Age", "Gender"] but NO batch_key. A question arose: since each donor has a unique Age value (float64 precision), Age+Gender effectively encodes donor identity. Does this help or hurt the latent space?
Key confounding: each imageid (donor) belongs to exactly one donor_status (disease group), so donor identity fully determines disease status. Aggressively removing donor signal risks also removing disease biology. The covariate mechanism offers a softer correction than batch_key.
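The confounding described above can be checked directly from the cell metadata. The sketch below is a minimal illustration on a toy table: the helper name `is_nested` is hypothetical, the toy values are made up, and only the column names `imageid` and `donor_status` come from this document. On real data, `obs` would be `adata.obs`.

```python
import pandas as pd

def is_nested(obs: pd.DataFrame, inner: str, outer: str) -> bool:
    """Return True if every value of `inner` maps to exactly one value of `outer`."""
    n_outer_per_inner = obs.groupby(inner, observed=True)[outer].nunique()
    return bool((n_outer_per_inner == 1).all())

# Toy metadata standing in for adata.obs (values are illustrative only)
obs = pd.DataFrame({
    "imageid": ["d1", "d1", "d2", "d2", "d3", "d3", "d4", "d4"],
    "donor_status": ["ND", "ND", "Aab+", "Aab+", "T1D", "T1D", "ND", "ND"],
})

# Each row of the crosstab has a single non-zero column when donors are nested
print(pd.crosstab(obs["imageid"], obs["donor_status"]))
donor_in_status = is_nested(obs, "imageid", "donor_status")
status_in_donor = is_nested(obs, "donor_status", "imageid")
print("donor nested in status:", donor_in_status)
print("status nested in donor:", status_in_donor)
```

If the first check is true, any method that forces donor invariance can also erase differences between disease groups, which is the risk this section describes.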
- batch_key: Primary batch variable. scVI learns a batch-specific decoder, forcing the encoder to produce batch-invariant latent representations. This is aggressive correction and can remove correlated biological signal.
- categorical_covariate_keys: One-hot encoded and concatenated to the decoder input (and to the encoder input when encode_covariates=True). This is softer conditioning: the model can learn to use or ignore these covariates as needed. It does NOT force invariance.

Before comparing models, verify whether your covariates uniquely identify batches:
import anndata as ad
adata = ad.read_h5ad("single_cell_analysis/CODEX_scvi_BioCov_phenotyped_newDuctal.h5ad")
# Check if Age+Gender uniquely identifies donors
donor_meta = adata.obs.groupby('imageid')[['Age', 'Gender']].first()
age_gender_pairs = donor_meta[['Age', 'Gender']].drop_duplicates()
print(f"Unique donors: {len(donor_meta)}")
print(f"Unique Age+Gender pairs: {len(age_gender_pairs)}")
# If these are equal, Age+Gender IS a donor identifier
Keep everything the same except remove categorical_covariate_keys:
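import scvi

# Run ONE of the two setup_anndata calls below. A newly created model uses the
# most recent registration, so calling both in sequence trains the second variant.

# ORIGINAL (with covariates)
scvi.model.SCVI.setup_anndata(
    adata, layer="counts",
    categorical_covariate_keys=["Age", "Gender"],
)

# NEW (no covariates): compare against the original
scvi.model.SCVI.setup_anndata(
    adata, layer="counts",
    # NO covariate keys
)

# Architecture and training settings are identical for both variants
model = scvi.model.SCVI(
    adata,
    n_latent=10, n_layers=2, n_hidden=128,
    dropout_rate=0.1, dispersion='gene-batch',
    gene_likelihood='nb',
)
model.train(max_epochs=400, early_stopping=True,
            early_stopping_patience=45)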
For islet-level analysis, aggregate single-cell latents to islet level:
import numpy as np

latent = model.get_latent_representation()

# Filter to core islet cells only (Parent values such as "Islet_12")
core_mask = adata.obs['Parent'].astype(str).str.match(r'^Islet_\d+$').to_numpy()
core = adata.obs[core_mask].copy()
core['latent'] = list(latent[core_mask])

# Aggregate: mean latent per islet, skipping islets with fewer than 20 cells
islet_latents = {}
for islet_id, group in core.groupby('islet_id'):
    if len(group) >= 20:
        islet_latents[islet_id] = np.mean(np.stack(group['latent'].values), axis=0)
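The aggregation above keeps only the latent vectors, while the later metrics also need one donor and one disease label per islet. As one possible alternative, the sketch below performs the same mean-per-islet aggregation with a pandas groupby and carries the labels along. The function name `aggregate_islets` and the toy inputs are illustrative assumptions; `islet_id`, `imageid`, and `donor_status` are the column names used in this document, and islet IDs are assumed to be unique across donors.

```python
import numpy as np
import pandas as pd

LABEL_COLS = ["imageid", "donor_status"]

def aggregate_islets(latent, obs, min_cells=20):
    """Mean latent vector per islet, plus one donor/status label per islet.

    latent: (n_cells, n_latent) array aligned row-for-row with obs.
    obs: DataFrame with 'islet_id', 'imageid' and 'donor_status' columns.
    """
    dims = [f"z{i}" for i in range(latent.shape[1])]
    df = pd.concat(
        [pd.DataFrame(latent, columns=dims, index=obs.index),
         obs[["islet_id"] + LABEL_COLS]],
        axis=1,
    )
    grouped = df.groupby("islet_id", observed=True)
    sizes = grouped.size()
    keep = sizes.index[sizes >= min_cells]           # islets with enough cells
    means = grouped[dims].mean().loc[keep]
    labels = grouped[LABEL_COLS].first().loc[keep]   # same row order as means
    return means.to_numpy(), labels

# Toy example: islets "a" and "b" have 25 cells each, "c" has 5 and is dropped
rng = np.random.default_rng(0)
obs = pd.DataFrame({
    "islet_id": ["a"] * 25 + ["b"] * 25 + ["c"] * 5,
    "imageid": ["d1"] * 25 + ["d2"] * 30,
    "donor_status": ["ND"] * 25 + ["T1D"] * 30,
})
latent = rng.normal(size=(len(obs), 10))
X, labels = aggregate_islets(latent, obs)
print(X.shape, labels["imageid"].tolist())
```

Because `means` and `labels` share the same index order, `labels["imageid"]` and `labels["donor_status"]` can be passed straight to the metric code as per-islet label lists.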
from sklearn.metrics import silhouette_samples
from sklearn.decomposition import PCA
from sklearn.neighbors import NearestNeighbors
import numpy as np
X = np.stack(list(islet_latents.values())) # (n_islets, 10)
imageid = [...] # donor labels per islet
status = [...] # ND/Aab+/T1D per islet
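# --- Metric 1: R² (MANOVA-style variance explained) ---
from sklearn.linear_model import LinearRegression
from sklearn.preprocessing import LabelEncoder
le = LabelEncoder()
for label_name, labels in [('imageid', imageid), ('donor_status', status)]:
    # One-hot encode the groups. A single integer code would impose an
    # arbitrary ordering on the categories and distort the variance explained.
    codes = le.fit_transform(labels)
    y = np.eye(len(le.classes_))[codes]
    # score() averages R² uniformly over the latent dimensions
    r2 = LinearRegression().fit(y, X).score(y, X)
    print(f"R²({label_name}): {r2:.4f}")
# GOOD: R²(donor) is LOW, R²(status) is HIGH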
# --- Metric 2: Per-dimension eta² (ANOVA decomposition) ---
for dim in range(X.shape[1]):
    # Collect this dimension's values by donor
    groups = {}
    for label, val in zip(imageid, X[:, dim]):
        groups.setdefault(label, []).append(val)
    grand_mean = np.mean(X[:, dim])
    ss_between = sum(len(g) * (np.mean(g) - grand_mean)**2 for g in groups.values())
    ss_total = np.sum((X[:, dim] - grand_mean)**2)
    eta2 = ss_between / ss_total
    print(f"Dim {dim}: eta²(donor)={eta2:.4f}")
# GOOD: No single dimension has eta²(donor) > 0.9
# --- Metric 3: Silhouette scores (cosine) ---
sil_donor = silhouette_samples(X, imageid, metric='cosine')
sil_status = silhouette_samples(X, status, metric='cosine')
print(f"Sil(donor): {np.mean(sil_donor):.4f}") # GOOD: near 0 (mixed)
print(f"Sil(status): {np.mean(sil_status):.4f}") # GOOD: positive (separated)
# --- Metric 4: k-NN mixing (k=15) ---
# n_neighbors=16 because each point is returned as its own nearest neighbor
nn = NearestNeighbors(n_neighbors=16, metric='cosine').fit(X)
_, indices = nn.kneighbors(X)
same_donor = np.mean([
    np.mean([imageid[j] == imageid[i] for j in indices[i, 1:]])
    for i in range(len(X))
])
print(f"k-NN same-donor: {same_donor:.4f}")  # GOOD: close to random (1/n_donors)
# --- Metric 5: PCA pseudotime gradient ---
from scipy.stats import spearmanr
# pseudotime: one value per islet, aligned with the rows of X (computed elsewhere)
pca = PCA(n_components=5).fit(X)
pc0 = pca.transform(X)[:, 0]
r, p = spearmanr(pc0, pseudotime)
print(f"PC0 vs pseudotime: r={r:.3f}")  # GOOD: |r| > 0.5 (biology drives PC0)
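The results in this document report k-NN mixing as an enrichment (for example 8.4x) rather than a raw fraction, but the definition is not spelled out. One plausible reading, shown here as an assumption and not as the project's actual formula, is the observed same-label neighbor fraction divided by the fraction expected if labels were randomly mixed. The helper name `knn_same_label_enrichment` is hypothetical, and the neighbor search is brute-force NumPy so the example has no extra dependencies.

```python
import numpy as np

def knn_same_label_enrichment(X, labels, k=15):
    """Return (enrichment, observed, expected) for same-label k-NN neighbors.

    ASSUMPTION: enrichment = observed fraction / chance fraction.
    """
    labels = np.asarray(labels)
    n = len(labels)
    # Cosine similarity from row-normalized dot products
    Xn = X / np.linalg.norm(X, axis=1, keepdims=True)
    sim = Xn @ Xn.T
    np.fill_diagonal(sim, -np.inf)            # a point is not its own neighbor
    nbrs = np.argsort(-sim, axis=1)[:, :k]    # indices of the k most similar points
    observed = float(np.mean(labels[nbrs] == labels[:, None]))
    # Chance level: probability that a randomly chosen other point shares the label
    _, inverse, counts = np.unique(labels, return_inverse=True, return_counts=True)
    expected = float(np.mean((counts[inverse] - 1) / (n - 1)))
    return observed / expected, observed, expected

# Toy data: two donors pointing in different directions, 50 islets each
rng = np.random.default_rng(0)
centers = np.array([[10.0, 0.0, 0.0], [0.0, 10.0, 0.0]])
donor = np.repeat(["d1", "d2"], 50)
X = centers[np.repeat([0, 1], 50)] + rng.normal(size=(100, 3))

enrichment, observed, expected = knn_same_label_enrichment(X, donor)
enr_shuffled = knn_same_label_enrichment(X, rng.permutation(donor))[0]
print(f"separated donors: {enrichment:.2f}x (observed {observed:.2f}, chance {expected:.2f})")
print(f"shuffled labels:  {enr_shuffled:.2f}x")
```

Normalizing by the chance level makes the number comparable across label sets with different group counts, which is useful when reading same-donor and same-status values side by side.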
# See scripts/plot_scvi_comparison.py for full implementation
# Key panels:
# A: PCA colored by donor — visual mixing assessment
# B: PCA colored by disease — biological signal preservation
# C: Per-dimension eta² bars — where donor variance lives
# D: Donor silhouette distributions — mixing quality
# E: k-NN neighbor composition — same-donor fraction
# F: Summary metrics comparison — overall scorecard
# G: Original model pseudotime gradient — biology dominates
# H: No-covariate pseudotime gradient — donor dominates
| Attempt | Why it Failed | Lesson Learned |
|---|---|---|
| Using batch_key="imageid" | Each imageid (donor) belongs to exactly one donor_status (disease group). scVI learns a batch-invariant latent, which removes disease signal along with donor signal | Never use batch_key when the batch variable is perfectly confounded with the biological variable of interest. Use categorical_covariate_keys for softer correction |
| Removing all covariates (no Age/Gender) | Donor variance in latent space increased by 40% (R²: 0.397→0.554). k-NN same-donor fraction jumped from 60%→85%. Pseudotime gradient in PC0 disappeared (r: 0.71→0.10) | Even soft covariates that are bijective with donor identity provide meaningful correction. The one-hot conditioning helps the model learn donor-agnostic features |
| Evaluating only R²(status) for biology | No-covariate model showed HIGHER R²(status) (0.149 vs 0.107), which looked better in isolation | Always evaluate biology metrics RELATIVE to donor metrics. The no-covariate model had higher status R² but also much higher donor R² — the extra biology signal was dwarfed by donor confounding. Use the excess eta² (donor - status) as the key metric |
| Relying solely on silhouette scores | Silhouette is sensitive to cluster shape assumptions and can be misleading for non-spherical distributions | Use multiple complementary metrics: R², eta², silhouette, k-NN mixing, and PCA analysis. Agreement across ≥4 metrics gives confidence |
| Low max_tokens in diagnostics script | Training completed but diagnostic output was truncated/lost when background task files were cleaned up | Save all diagnostic output to explicit log files (not just stdout). Use tee plus a dedicated log file path |
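The last lesson in the table can be applied from inside Python as well as with tee. This is a minimal sketch, assuming a hypothetical logger name and a hypothetical log file in the system temp directory. It writes every diagnostic line to both the console and a file, so the output survives cleanup of background-task files.

```python
import logging
import os
import sys
import tempfile

def make_diagnostics_logger(log_path, name="scvi_diagnostics"):
    """Logger that prints each message to stdout and appends it to log_path."""
    logger = logging.getLogger(name)
    logger.setLevel(logging.INFO)
    logger.handlers.clear()      # avoid duplicate lines if called more than once
    logger.propagate = False
    fmt = logging.Formatter("%(asctime)s %(message)s")
    for handler in (logging.StreamHandler(sys.stdout), logging.FileHandler(log_path)):
        handler.setFormatter(fmt)
        logger.addHandler(handler)
    return logger

# Hypothetical location; in practice, point this at a persistent project directory
log_path = os.path.join(tempfile.gettempdir(), "scvi_diagnostics.log")
log = make_diagnostics_logger(log_path)
log.info("R2(imageid): %.4f", 0.397)  # sample line using a value reported in this document
```

Replacing `print` calls in the diagnostics script with `log.info` is enough to get a durable record of each metric.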
# scVI model (CANONICAL — with covariates)
n_latent: 10
n_layers: 2
n_hidden: 128
dropout_rate: 0.1
dispersion: gene-batch
gene_likelihood: nb
categorical_covariate_keys: ["Age", "Gender"]
batch_key: null # intentionally NOT set
max_epochs: 400
early_stopping: true
early_stopping_patience: 45
# Comparison diagnostics
n_islets_evaluated: 1024 # with min_cells=20
metrics_used:
- R² (donor vs status, linear regression)
- eta² per latent dimension (ANOVA decomposition)
- Silhouette score (cosine metric)
- k-NN same-donor/same-status fraction (k=15)
- PCA component analysis (pseudotime correlation)
acceptance_criteria:
- Original model wins majority (≥3/5) of key metrics
- Donor silhouette < +0.1 (well-mixed)
- PC0 pseudotime |r| > 0.3 (biology drives principal axis)
# Environment
conda_env: scvi-env
python: 3.10
scvi-tools: 1.4.0
gpu: NVIDIA RTX 4000 Ada Generation
training_time: ~82 min (2.6M cells, no covariates)
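The acceptance criteria in the config above can be written as an explicit check. The sketch below assumes a hypothetical `check_acceptance` helper. The thresholds come from the config, and the values passed in are the ones reported for the two models in this document.

```python
def check_acceptance(metrics_won, sil_donor, pc0_r,
                     min_wins=3, max_sil_donor=0.1, min_abs_r=0.3):
    """Evaluate the three acceptance criteria; returns criterion name -> bool."""
    return {
        "wins_majority": metrics_won >= min_wins,        # at least 3 of 5 key metrics
        "donor_well_mixed": sil_donor < max_sil_donor,   # donor silhouette < +0.1
        "biology_drives_pc0": abs(pc0_r) > min_abs_r,    # PC0 pseudotime |r| > 0.3
    }

original = check_acceptance(metrics_won=4, sil_donor=-0.041, pc0_r=0.712)
no_cov = check_acceptance(metrics_won=1, sil_donor=0.250, pc0_r=0.101)

print("original:", original, "-> accepted:", all(original.values()))
print("no covariates:", no_cov, "-> accepted:", all(no_cov.values()))
```

Keeping the thresholds as keyword arguments makes it easy to see which criterion fails if they are tightened later.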
- categorical_covariate_keys provides gentler correction than batch_key. When batch is confounded with biology (e.g., each disease group has unique donors), soft correction preserves biological signal while still reducing donor effects.
- eta²(donor) - eta²(status) captures how much latent variance is attributable to donor identity BEYOND what's explained by disease. Lower excess = better.

| Metric | Original (Age/Gender) | No Covariates | Winner |
|---|---|---|---|
| R² imageid | 0.397 | 0.554 | Original |
| R² donor_status | 0.107 | 0.149 | No covariates |
| eta² excess (donor-status) | 0.291 | 0.405 | Original |
| Silhouette donor (cosine) | -0.041 | +0.250 | Original |
| k-NN same-donor enrichment | 8.4x | 11.9x | Original |
| k-NN same-status enrichment | 2.17x | 2.62x | No covariates |
| PC0 pseudotime r | 0.712 | 0.101 | Original |
| Training time | N/A (pre-existing) | 82.5 min | — |
| Overall winner | 4/5 key metrics | 1/5 | Original |
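The excess eta² row can be illustrated with a small helper. This is a sketch rather than the project's script: the function names are hypothetical, the toy data is random, and averaging eta² across latent dimensions is an assumption, since the document does not state how the per-dimension values were combined.

```python
import numpy as np

def eta_squared(X, labels):
    """Per-dimension eta²: between-group sum of squares over total sum of squares."""
    labels = np.asarray(labels)
    grand_mean = X.mean(axis=0)
    ss_total = ((X - grand_mean) ** 2).sum(axis=0)
    ss_between = np.zeros(X.shape[1])
    for group in np.unique(labels):
        Xg = X[labels == group]
        ss_between += len(Xg) * (Xg.mean(axis=0) - grand_mean) ** 2
    return ss_between / ss_total

def excess_eta_squared(X, donor, status):
    """ASSUMPTION: mean over dimensions of eta²(donor) - eta²(status)."""
    return float(np.mean(eta_squared(X, donor) - eta_squared(X, status)))

# Toy data: 6 donors with 40 islets each, nested in 3 groups of 2 donors
rng = np.random.default_rng(0)
donor = np.repeat(np.arange(6), 40)
status = donor // 2
X = rng.normal(size=(6, 10))[donor] + rng.normal(size=(240, 10))

excess = excess_eta_squared(X, donor, status)
print(f"mean excess eta²: {excess:.3f}")
```

When donors are nested in disease groups, eta²(donor) can never be smaller than eta²(status) in any dimension, so the excess is non-negative and measures donor variance left over after accounting for disease group.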
- categorical_covariate_keys vs batch_key
- data/DATA_PROVENANCE.md: Full data lineage
- scripts/retrain_scvi_no_covariates.py: Retraining script
- scripts/plot_scvi_comparison.py: Supplemental figure generator
- scripts/scvi_comparison_results/: All comparison outputs