scvi-covariate-validation
Validate scVI batch correction covariate choices by comparing latent space quality metrics
| Field | Value |
|---|---|
| name | scvi-covariate-validation |
| description | Validate scVI batch correction covariate choices by comparing latent space quality metrics |
| author | smith6jt |
| date | 2026-02-23 |

| Item | Details |
|---|---|
| Date | 2026-02-23 |
| Goal | Determine whether Age/Gender categorical covariates improve or degrade scVI latent space quality for disease-progression analysis |
| Environment | Python 3.10, scvi-tools 1.4.0, scanpy 1.11.5, NVIDIA RTX 4000 Ada, conda scvi-env |
| Status | Success — Original model with covariates validated as superior |
The Islet Explorer project uses scVI to batch-correct 2.6M single-cell CODEX proteomic measurements across 15 nPOD pancreas donors (5 ND, 5 Aab+, 5 T1D). The original model was trained with categorical_covariate_keys=["Age", "Gender"] but NO batch_key. A question arose: since each donor has a unique Age value (float64 precision), Age+Gender effectively encodes donor identity. Does this help or hurt the latent space?
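Key confounding: imageid (donor) is nested within donor_status (disease group). Every donor belongs to exactly one group, so donor effects and disease effects cannot be fully separated. Aggressively removing donor signal risks also removing disease biology. The covariate mechanism offers a softer correction than batch_key.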
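
The confounding described above can be checked directly from the cell metadata. The following is a minimal sketch, assuming obs columns named imageid and donor_status as in the text. The helper name `is_nested` and the toy table are illustrative only and are not project data.

```python
import pandas as pd

def is_nested(obs: pd.DataFrame, inner: str, outer: str) -> bool:
    """Return True if every `inner` level maps to exactly one `outer` level."""
    levels_per_inner = obs.groupby(inner, observed=True)[outer].nunique()
    return bool((levels_per_inner == 1).all())

# Toy stand-in for adata.obs (hypothetical values, not project data)
toy_obs = pd.DataFrame({
    "imageid": ["d1", "d1", "d2", "d2", "d3", "d3"],
    "donor_status": ["ND", "ND", "Aab+", "Aab+", "T1D", "T1D"],
})

donor_nested = is_nested(toy_obs, "imageid", "donor_status")
print(f"Each donor maps to a single status: {donor_nested}")

# Donors per disease group: disease effects can only be estimated across these
print(toy_obs.groupby("donor_status", observed=True)["imageid"].nunique())
```

If the check returns True, any method that fully removes donor signal can also remove disease signal, which is the motivation for comparing softer conditioning.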
- batch_key: Primary batch variable. scVI learns a batch-specific decoder, forcing the encoder to produce batch-invariant latent representations. Aggressive correction: can remove correlated biological signal.
- categorical_covariate_keys: One-hot encoded and concatenated to the encoder and decoder inputs. Softer conditioning: the model can learn to use or ignore these covariates as needed. Does NOT force invariance.

Before comparing models, verify whether your covariates uniquely identify batches:
```python
import anndata as ad

adata = ad.read_h5ad("single_cell_analysis/CODEX_scvi_BioCov_phenotyped_newDuctal.h5ad")

# Check if Age+Gender uniquely identifies donors
donor_meta = adata.obs.groupby('imageid')[['Age', 'Gender']].first()
age_gender_pairs = donor_meta[['Age', 'Gender']].drop_duplicates()
print(f"Unique donors: {len(donor_meta)}")
print(f"Unique Age+Gender pairs: {len(age_gender_pairs)}")
# If these are equal, Age+Gender IS a donor identifier
```
Keep everything the same except remove categorical_covariate_keys:
```python
import scvi

# ORIGINAL (with covariates), shown for reference
scvi.model.SCVI.setup_anndata(
    adata, layer="counts",
    categorical_covariate_keys=["Age", "Gender"]
)

# NEW (no covariates): compare against this.
# A newly constructed model uses the most recent setup_anndata call,
# so the model below is the no-covariate variant.
scvi.model.SCVI.setup_anndata(
    adata, layer="counts"
    # NO covariate keys
)

model = scvi.model.SCVI(
    adata,
    n_latent=10, n_layers=2, n_hidden=128,
    dropout_rate=0.1, dispersion='gene-batch',
    gene_likelihood='nb'
)
model.train(max_epochs=400, early_stopping=True,
            early_stopping_patience=45)
```
For islet-level analysis, aggregate single-cell latents to islet level:
```python
import numpy as np

latent = model.get_latent_representation()

# Filter to core islet cells only (na=False treats missing Parent as non-islet)
core_mask = adata.obs['Parent'].str.match(r'^Islet_\d+$', na=False).to_numpy()
core = adata.obs[core_mask].copy()
core['latent'] = list(latent[core_mask])

# Aggregate: mean latent per islet (min_cells filter)
islet_latents = {}
for islet_id, group in core.groupby('islet_id'):
    if len(group) >= 20:
        islet_latents[islet_id] = np.mean(np.stack(group['latent'].values), axis=0)
```
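
The per-islet loop above can also be written without storing one array per cell in a DataFrame column, which may matter at 2.6M cells. The following is a sketch, assuming the islet labels and the latent matrix are row-aligned. `aggregate_latent` is a hypothetical helper and the demo arrays are synthetic.

```python
import numpy as np
import pandas as pd

def aggregate_latent(latent, islet_ids, min_cells=20):
    """Mean latent vector per islet, keeping islets with >= min_cells cells."""
    df = pd.DataFrame(latent, index=pd.Index(islet_ids, name="islet_id"))
    grouped = df.groupby(level="islet_id", sort=True)
    counts = grouped.size()
    means = grouped.mean()
    # Keep only islets that pass the minimum cell count
    return means.loc[counts[counts >= min_cells].index]

# Synthetic demo: islet "C" has only 5 cells and is filtered out
rng = np.random.default_rng(0)
demo_latent = rng.normal(size=(50, 10))
demo_ids = np.array(["A"] * 25 + ["B"] * 20 + ["C"] * 5)
islet_means = aggregate_latent(demo_latent, demo_ids)
print(islet_means.shape)
```

The result is a DataFrame indexed by islet, so `islet_means.to_numpy()` gives the same kind of (n_islets, n_latent) matrix that the metrics code expects.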
```python
import numpy as np
import pandas as pd
from scipy.stats import spearmanr
from sklearn.decomposition import PCA
from sklearn.linear_model import LinearRegression
from sklearn.metrics import silhouette_samples
from sklearn.neighbors import NearestNeighbors

X = np.stack(list(islet_latents.values()))  # (n_islets, 10)
imageid = [...]     # donor labels per islet
status = [...]      # ND/Aab+/T1D per islet
pseudotime = [...]  # pseudotime per islet (computed elsewhere)

# --- Metric 1: R² (MANOVA-style variance explained) ---
# One-hot encode the labels: integer codes would impose an arbitrary ordering
for label_name, labels in [('imageid', imageid), ('donor_status', status)]:
    y = pd.get_dummies(pd.Series(labels), drop_first=True).to_numpy(dtype=float)
    r2 = LinearRegression().fit(y, X).score(y, X)  # mean R² over latent dims
    print(f"R²({label_name}): {r2:.4f}")
# GOOD: R²(donor) is LOW, R²(status) is HIGH

# --- Metric 2: Per-dimension eta² (ANOVA decomposition) ---
for dim in range(X.shape[1]):
    groups = {}
    for label, val in zip(imageid, X[:, dim]):
        groups.setdefault(label, []).append(val)
    ss_between = sum(len(g) * (np.mean(g) - np.mean(X[:, dim]))**2 for g in groups.values())
    ss_total = np.sum((X[:, dim] - np.mean(X[:, dim]))**2)
    eta2 = ss_between / ss_total
    print(f"Dim {dim}: eta²(donor)={eta2:.4f}")
# GOOD: No single dimension has eta²(donor) > 0.9

# --- Metric 3: Silhouette scores (cosine) ---
sil_donor = silhouette_samples(X, imageid, metric='cosine')
sil_status = silhouette_samples(X, status, metric='cosine')
print(f"Sil(donor): {np.mean(sil_donor):.4f}")    # GOOD: near 0 (mixed)
print(f"Sil(status): {np.mean(sil_status):.4f}")  # GOOD: positive (separated)

# --- Metric 4: k-NN mixing (k=15) ---
# n_neighbors=16 because each point's nearest neighbor is itself
nn = NearestNeighbors(n_neighbors=16, metric='cosine').fit(X)
_, indices = nn.kneighbors(X)
same_donor = np.mean([
    np.mean([imageid[j] == imageid[i] for j in indices[i, 1:]])
    for i in range(len(X))
])
print(f"k-NN same-donor: {same_donor:.4f}")  # GOOD: close to random (1/n_donors)

# --- Metric 5: PCA pseudotime gradient ---
pca = PCA(n_components=5).fit(X)
pc0 = pca.transform(X)[:, 0]
r, p = spearmanr(pc0, pseudotime)
print(f"PC0 vs pseudotime: r={r:.3f}")  # GOOD: |r| > 0.5 (biology drives PC0)
```
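
The results in this document report k-NN mixing as an enrichment factor rather than a raw fraction. One way to convert between the two is sketched below. It assumes enrichment means the observed same-label neighbor fraction divided by the fraction expected under random mixing. The source does not state the exact baseline, so this definition and the helper name `knn_enrichment` are assumptions.

```python
import numpy as np
from sklearn.neighbors import NearestNeighbors

def knn_enrichment(X, labels, k=15, metric="cosine"):
    """Return (observed, expected, enrichment) for same-label k-NN neighbors."""
    X = np.asarray(X, dtype=float)
    labels = np.asarray(labels)
    nn = NearestNeighbors(n_neighbors=k + 1, metric=metric).fit(X)
    _, idx = nn.kneighbors(X)
    neighbors = idx[:, 1:]  # column 0 is the query point itself (barring exact ties)
    observed = float(np.mean(labels[neighbors] == labels[:, None]))
    # Chance level: probability that a randomly drawn *other* point shares the label
    _, inverse, counts = np.unique(labels, return_inverse=True, return_counts=True)
    expected = float(np.mean((counts[inverse] - 1) / (len(labels) - 1)))
    return observed, expected, observed / expected

# Synthetic demo: two tight clusters pointing in different directions
rng = np.random.default_rng(0)
centers = np.zeros((2, 10))
centers[0, 0] = 5.0
centers[1, 1] = 5.0
demo_labels = np.repeat(["donor_a", "donor_b"], 30)
demo_X = np.repeat(centers, 30, axis=0) + rng.normal(scale=0.1, size=(60, 10))
observed, expected, enrichment = knn_enrichment(demo_X, demo_labels, k=5)
print(f"observed={observed:.3f} expected={expected:.3f} enrichment={enrichment:.2f}x")
```

An enrichment near 1x indicates labels are mixed at chance level. Because the baseline depends on group sizes, enrichment is easier to compare across donor and status labels than the raw fraction.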
```python
# See scripts/plot_scvi_comparison.py for full implementation
# Key panels:
#   A: PCA colored by donor (visual mixing assessment)
#   B: PCA colored by disease (biological signal preservation)
#   C: Per-dimension eta² bars (where donor variance lives)
#   D: Donor silhouette distributions (mixing quality)
#   E: k-NN neighbor composition (same-donor fraction)
#   F: Summary metrics comparison (overall scorecard)
#   G: Original model pseudotime gradient (biology dominates)
#   H: No-covariate pseudotime gradient (donor dominates)
```
| Attempt | Why it Failed | Lesson Learned |
|---|---|---|
| Using batch_key="imageid" | imageid is nested within donor_status (each donor belongs to exactly one disease group). scVI learns batch-invariant latent → removes disease signal along with donor signal | Never use batch_key when the batch variable is perfectly confounded with the biological variable of interest. Use categorical_covariate_keys for softer correction |
| Removing all covariates (no Age/Gender) | Donor variance in latent space increased by 40% (R²: 0.397→0.554). k-NN same-donor fraction jumped from 60%→85%. Pseudotime gradient in PC0 disappeared (r: 0.71→0.10) | Even soft covariates that are bijective with donor identity provide meaningful correction. The one-hot conditioning helps the model learn donor-agnostic features |
| Evaluating only R²(status) for biology | No-covariate model showed HIGHER R²(status) (0.149 vs 0.107), which looked better in isolation | Always evaluate biology metrics RELATIVE to donor metrics. The no-covariate model had higher status R² but also much higher donor R² — the extra biology signal was dwarfed by donor confounding. Use the excess eta² (donor - status) as the key metric |
| Relying solely on silhouette scores | Silhouette is sensitive to cluster shape assumptions and can be misleading for non-spherical distributions | Use multiple complementary metrics: R², eta², silhouette, k-NN mixing, and PCA analysis. Agreement across ≥4 metrics gives confidence |
| Low max_tokens in diagnostics script | Training completed but diagnostic output was truncated/lost when background task files were cleaned up | Save all diagnostic output to explicit log files (not just stdout). Use tee plus a dedicated log file path |
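
The last lesson in the table can be applied from inside Python as well as with tee. The following is a minimal sketch using the standard logging module. The logger name, the helper `make_diagnostics_logger`, and the log path are placeholders, not names from the project scripts.

```python
import logging
import sys
import tempfile
from pathlib import Path

def make_diagnostics_logger(log_path, name="scvi_diagnostics"):
    """Logger that writes every message to both stdout and a log file."""
    log_path = Path(log_path)
    log_path.parent.mkdir(parents=True, exist_ok=True)
    logger = logging.getLogger(name)
    logger.setLevel(logging.INFO)
    logger.handlers.clear()   # avoid duplicate lines if called twice
    logger.propagate = False
    fmt = logging.Formatter("%(asctime)s %(message)s")
    handlers = (
        logging.StreamHandler(sys.stdout),
        logging.FileHandler(log_path, mode="a", encoding="utf-8"),
    )
    for handler in handlers:
        handler.setFormatter(fmt)
        logger.addHandler(handler)
    return logger

# Demo: write one metric line to a temporary log file (placeholder path)
demo_log = Path(tempfile.mkdtemp()) / "diagnostics.log"
log = make_diagnostics_logger(demo_log)
log.info("R2(imageid): %.4f", 0.397)
```

Each message is flushed to the file as it is emitted, so the log survives even if the process output is truncated or the background task is cleaned up.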
```yaml
# scVI model (CANONICAL, with covariates)
n_latent: 10
n_layers: 2
n_hidden: 128
dropout_rate: 0.1
dispersion: gene-batch
gene_likelihood: nb
categorical_covariate_keys: ["Age", "Gender"]
batch_key: null  # intentionally NOT set
max_epochs: 400
early_stopping: true
early_stopping_patience: 45

# Comparison diagnostics
n_islets_evaluated: 1024  # with min_cells=20
metrics_used:
  - R² (donor vs status, linear regression)
  - eta² per latent dimension (ANOVA decomposition)
  - Silhouette score (cosine metric)
  - k-NN same-donor/same-status fraction (k=15)
  - PCA component analysis (pseudotime correlation)
acceptance_criteria:
  - Original model wins majority (≥3/5) of key metrics
  - Donor silhouette < +0.1 (well-mixed)
  - PC0 pseudotime |r| > 0.3 (biology drives principal axis)

# Environment
conda_env: scvi-env
python: 3.10
scvi-tools: 1.4.0
gpu: NVIDIA RTX 4000 Ada Generation
training_time: ~82 min (2.6M cells, no covariates)
```
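
The two numeric acceptance thresholds can be checked mechanically. The following is a small sketch that uses the silhouette and PC0 values reported in the results table of this document. The dictionary layout and the function name `passes_thresholds` are illustrative.

```python
# Values copied from the results table in this document
results = {
    "original": {"sil_donor": -0.041, "pc0_pseudotime_r": 0.712},
    "no_covariates": {"sil_donor": 0.250, "pc0_pseudotime_r": 0.101},
}

def passes_thresholds(metrics, max_sil_donor=0.1, min_abs_r=0.3):
    """Apply the donor-silhouette and PC0-pseudotime acceptance thresholds."""
    return {
        "donor_well_mixed": metrics["sil_donor"] < max_sil_donor,
        "biology_drives_pc0": abs(metrics["pc0_pseudotime_r"]) > min_abs_r,
    }

checks = {name: passes_thresholds(m) for name, m in results.items()}
for name, outcome in checks.items():
    print(name, outcome)
```

With these inputs the original model passes both thresholds and the no-covariate model fails both, which matches the conclusion stated in the summary.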
categorical_covariate_keys provides gentler correction than batch_key. When batch is confounded with biology (e.g., each disease group has unique donors), soft correction preserves biological signal while still reducing donor effects.

The excess eta², eta²(donor) - eta²(status), captures how much latent variance is attributable to donor identity BEYOND what's explained by disease. Lower excess = better.

| Metric | Original (Age/Gender) | No Covariates | Winner |
|---|---|---|---|
| R² imageid | 0.397 | 0.554 | Original |
| R² donor_status | 0.107 | 0.149 | No covariates |
| eta² excess (donor-status) | 0.291 | 0.405 | Original |
| Silhouette donor (cosine) | -0.041 | +0.250 | Original |
| k-NN same-donor enrichment | 8.4x | 11.9x | Original |
| k-NN same-status enrichment | 2.17x | 2.62x | No covariates |
| PC0 pseudotime r | 0.712 | 0.101 | Original |
| Training time | N/A (pre-existing) | 82.5 min | — |
| Overall winner | 4/5 key metrics | 1/5 | Original |
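
The excess metric described above can be sketched as follows. This assumes a pooled multivariate eta² (between-group sum of squares over total sum of squares, summed across latent dimensions). The source does not specify how per-dimension values were pooled, so treat this as one plausible definition. The function names and the synthetic data are illustrative.

```python
import numpy as np

def eta_squared(X, labels):
    """Pooled eta²: between-group SS / total SS, summed over all columns of X."""
    X = np.asarray(X, dtype=float)
    labels = np.asarray(labels)
    grand_mean = X.mean(axis=0)
    ss_total = np.sum((X - grand_mean) ** 2)
    ss_between = 0.0
    for level in np.unique(labels):
        group = X[labels == level]
        ss_between += len(group) * np.sum((group.mean(axis=0) - grand_mean) ** 2)
    return ss_between / ss_total

def excess_eta_squared(X, donor, status):
    """Donor variance share beyond what the status grouping already explains."""
    return eta_squared(X, donor) - eta_squared(X, status)

# Synthetic demo: 6 donors nested in 3 status groups, with a donor-level shift
rng = np.random.default_rng(0)
donor = np.repeat(np.arange(6), 40)
status = donor // 2
X = rng.normal(size=(240, 10)) + rng.normal(size=(6, 10))[donor]
excess = excess_eta_squared(X, donor, status)
print(f"excess eta2: {excess:.3f}")
```

Because donors are nested within status groups, eta²(donor) can never be smaller than eta²(status), so the excess is non-negative and can be compared directly between models.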
- categorical_covariate_keys vs batch_key
- data/DATA_PROVENANCE.md: Full data lineage
- scripts/retrain_scvi_no_covariates.py: Retraining script
- scripts/plot_scvi_comparison.py: Supplemental figure generator
- scripts/scvi_comparison_results/: All comparison outputs