scvi-covariate-validation

Stars: 2

Forks: 0

Updated: March 5, 2026 at 23:14

Validate scVI batch correction covariate choices by comparing latent space quality metrics

Source: smith6jt-cop/Skills_Registry

name	scvi-covariate-validation
description	Validate scVI batch correction covariate choices by comparing latent space quality metrics
author	smith6jt
date	"2026-02-23T00:00:00.000Z"

scVI Covariate Validation - Research Notes

Experiment Overview

Item	Details
Date	2026-02-23
Goal	Determine whether Age/Gender categorical covariates improve or degrade scVI latent space quality for disease-progression analysis
Environment	Python 3.10, scvi-tools 1.4.0, scanpy 1.11.5, NVIDIA RTX 4000 Ada, conda `scvi-env`
Status	Success — Original model with covariates validated as superior

Context

The Islet Explorer project uses scVI to batch-correct 2.6M single-cell CODEX proteomic measurements across 15 nPOD pancreas donors (5 ND, 5 Aab+, 5 T1D). The original model was trained with categorical_covariate_keys=["Age", "Gender"] but NO batch_key. A question arose: since each donor has a unique Age value (float64 precision), Age+Gender effectively encodes donor identity. Does this help or hurt the latent space?

Key confounding: each imageid (donor) belongs to exactly one donor_status (disease group), so donor identity is fully nested within disease status. Aggressively removing donor signal risks also removing disease biology. The covariate mechanism offers a softer correction than batch_key.

scVI Covariate Mechanisms

batch_key: Primary batch variable. scVI one-hot encodes it and conditions the decoder on it, and batch-specific parameters (for example dispersion='gene-batch') are indexed by it, which pushes the encoder toward batch-invariant latent representations. Aggressive correction: can remove correlated biological signal.
categorical_covariate_keys: One-hot encoded and concatenated to the decoder inputs, and to the encoder inputs as well when the model is created with encode_covariates=True. Softer conditioning: the model can learn to use or ignore these covariates as needed. Does NOT force invariance.
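
To make "one-hot encoded and concatenated" concrete, here is a minimal NumPy sketch of conditioning a network input on categorical covariates. It illustrates the idea only and is not scvi-tools internals; the function names `one_hot` and `conditioned_input` and the toy category counts are hypothetical.

```python
import numpy as np

def one_hot(codes: np.ndarray, n_categories: int) -> np.ndarray:
    """Return an (n_cells, n_categories) one-hot matrix for integer codes."""
    out = np.zeros((codes.shape[0], n_categories), dtype=np.float32)
    out[np.arange(codes.shape[0]), codes] = 1.0
    return out

def conditioned_input(z, covariate_codes, n_categories_per_covariate):
    """Append one one-hot block per covariate to the latent vectors."""
    blocks = [z] + [
        one_hot(codes, n)
        for codes, n in zip(covariate_codes, n_categories_per_covariate)
    ]
    return np.concatenate(blocks, axis=1)

rng = np.random.default_rng(0)
z = rng.normal(size=(4, 10)).astype(np.float32)  # 4 cells, n_latent=10
age_codes = np.array([0, 1, 2, 2])               # toy data: 3 distinct Age values
gender_codes = np.array([0, 1, 0, 0])            # toy data: 2 Gender values
network_in = conditioned_input(z, [age_codes, gender_codes], [3, 2])
print(network_in.shape)
```

Because the covariates enter only as extra input columns, the network is free to assign them small weights, which is the sense in which this conditioning is "soft".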

Verified Workflow

1. Check covariate uniqueness

Before comparing models, verify whether your covariates uniquely identify batches:

import anndata as ad
adata = ad.read_h5ad("single_cell_analysis/CODEX_scvi_BioCov_phenotyped_newDuctal.h5ad")

# Check if Age+Gender uniquely identifies donors
# observed=True skips unused categories if imageid is a pandas categorical
donor_meta = adata.obs.groupby('imageid', observed=True)[['Age', 'Gender']].first()
age_gender_pairs = donor_meta[['Age', 'Gender']].drop_duplicates()
print(f"Unique donors: {len(donor_meta)}")
print(f"Unique Age+Gender pairs: {len(age_gender_pairs)}")
# If these are equal, Age+Gender IS a donor identifier
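
The same check can be tried on a toy table without loading the real dataset. This is a sketch under stated assumptions: the helper name `covariates_identify_donors` is hypothetical, the Age values reuse the examples quoted in Key Insights, and the Gender values are made up for illustration.

```python
import pandas as pd

# Toy obs table: three donors, two cells each (Gender values are illustrative)
obs = pd.DataFrame({
    "imageid": ["d1", "d1", "d2", "d2", "d3", "d3"],
    "Age": [20.59, 20.59, 20.70, 20.70, 20.75, 20.75],
    "Gender": ["F", "F", "M", "M", "F", "F"],
})

def covariates_identify_donors(obs, donor_key, covariate_keys):
    """True if every donor maps to one covariate combination and no two donors share one."""
    grouped = obs.groupby(donor_key, observed=True)[covariate_keys]
    # Each donor must have a single value per covariate
    consistent = (grouped.nunique() == 1).all().all()
    per_donor = grouped.first()
    # Distinct combinations must equal the number of donors
    return bool(consistent and len(per_donor) == len(per_donor.drop_duplicates()))

print(covariates_identify_donors(obs, "imageid", ["Age", "Gender"]))
```

The extra consistency test guards against a donor carrying more than one Age or Gender value, which `.first()` alone would silently hide.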

2. Retrain without covariates (identical architecture)

Keep everything the same except remove categorical_covariate_keys:

import scvi

# ORIGINAL (with covariates)
scvi.model.SCVI.setup_anndata(
    adata, layer="counts",
    categorical_covariate_keys=["Age", "Gender"]
)

# NEW (no covariates): run this instead of the setup above, then compare
scvi.model.SCVI.setup_anndata(
    adata, layer="counts"
    # NO covariate keys
)

model = scvi.model.SCVI(
    adata,
    n_latent=10, n_layers=2, n_hidden=128,
    # With no batch_key there is a single batch, so 'gene-batch' behaves like 'gene'
    dropout_rate=0.1, dispersion='gene-batch',
    gene_likelihood='nb'
)
model.train(max_epochs=400, early_stopping=True,
            early_stopping_patience=45)

3. Extract and aggregate latent representations

For islet-level analysis, aggregate single-cell latents to islet level:

import numpy as np

latent = model.get_latent_representation()
# Filter to core islet cells only (na=False treats missing Parent values as non-matches)
core_mask = adata.obs['Parent'].str.match(r'^Islet_\d+$', na=False).to_numpy()
core = adata.obs[core_mask].copy()
core['latent'] = list(latent[core_mask])

# Aggregate: mean latent per islet (min_cells filter)
islet_latents = {}
for islet_id, group in core.groupby('islet_id'):
    if len(group) >= 20:
        islet_latents[islet_id] = np.mean(np.stack(group['latent'].values), axis=0)
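
With millions of cells, storing one array per row in an object column can be slow. A vectorized alternative is sketched below on synthetic data; the helper name `aggregate_latents` and the toy islet labels are hypothetical, and it assumes the latent matrix and the group labels are aligned row by row.

```python
import numpy as np
import pandas as pd

def aggregate_latents(latent, group_ids, min_cells=20):
    """Mean latent vector per group, keeping only groups with >= min_cells cells."""
    cols = [f"z{i}" for i in range(latent.shape[1])]
    df = pd.DataFrame(latent, columns=cols)
    df["group"] = np.asarray(group_ids)
    sizes = df.groupby("group").size()
    keep = sizes[sizes >= min_cells].index
    return df[df["group"].isin(keep)].groupby("group")[cols].mean()

rng = np.random.default_rng(0)
latent = rng.normal(size=(65, 10))  # 65 synthetic cells, 10 latent dimensions
islet_ids = np.array(["islet_a"] * 40 + ["islet_b"] * 20 + ["islet_c"] * 5)
islet_means = aggregate_latents(latent, islet_ids)
print(islet_means.shape)  # islet_c is dropped: only 5 cells
```

The result is a DataFrame indexed by islet, so `islet_means.to_numpy()` gives the matrix used by the diagnostics and `islet_means.index` gives the matching islet order.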

4. Run diagnostic metrics (5 complementary approaches)

from sklearn.metrics import silhouette_samples
from sklearn.decomposition import PCA
from sklearn.neighbors import NearestNeighbors
import numpy as np

X = np.stack(list(islet_latents.values()))  # (n_islets, 10)
imageid = [...]   # donor labels per islet
status = [...]    # ND/Aab+/T1D per islet

# --- Metric 1: R² (MANOVA-style variance explained) ---
from sklearn.linear_model import LinearRegression
from sklearn.preprocessing import OneHotEncoder

for label_name, labels in [('imageid', imageid), ('donor_status', status)]:
    # One-hot encode: integer label codes would impose an arbitrary ordering on the groups
    y = OneHotEncoder().fit_transform(np.asarray(labels).reshape(-1, 1)).toarray()
    r2 = LinearRegression().fit(y, X).score(y, X)
    print(f"R²({label_name}): {r2:.4f}")
# GOOD: R²(donor) is LOW, R²(status) is HIGH

# --- Metric 2: Per-dimension eta² (ANOVA decomposition) ---
for dim in range(X.shape[1]):
    groups = {}
    for label, val in zip(imageid, X[:, dim]):
        groups.setdefault(label, []).append(val)
    ss_between = sum(len(g) * (np.mean(g) - np.mean(X[:, dim]))**2 for g in groups.values())
    ss_total = np.sum((X[:, dim] - np.mean(X[:, dim]))**2)
    eta2 = ss_between / ss_total
    print(f"Dim {dim}: eta²(donor)={eta2:.4f}")
# GOOD: No single dimension has eta²(donor) > 0.9

# --- Metric 3: Silhouette scores (cosine) ---
sil_donor = silhouette_samples(X, imageid, metric='cosine')
sil_status = silhouette_samples(X, status, metric='cosine')
print(f"Sil(donor): {np.mean(sil_donor):.4f}")   # GOOD: near 0 (mixed)
print(f"Sil(status): {np.mean(sil_status):.4f}")  # GOOD: positive (separated)

# --- Metric 4: k-NN mixing (k=15) ---
nn = NearestNeighbors(n_neighbors=16, metric='cosine').fit(X)
_, indices = nn.kneighbors(X)
same_donor = np.mean([
    np.mean([imageid[j] == imageid[i] for j in indices[i, 1:]])
    for i in range(len(X))
])
print(f"k-NN same-donor: {same_donor:.4f}")  # GOOD: close to random (1/n_donors)

# --- Metric 5: PCA pseudotime gradient ---
from scipy.stats import spearmanr
pseudotime = [...]  # per-islet pseudotime, same order as X
pca = PCA(n_components=5).fit(X)
pc0 = pca.transform(X)[:, 0]
r, p = spearmanr(pc0, pseudotime)
print(f"PC0 vs pseudotime: r={r:.3f}")  # GOOD: |r| > 0.5 (biology drives PC0)

5. Generate supplemental figure (8-panel comparison)

# See scripts/plot_scvi_comparison.py for full implementation
# Key panels:
# A: PCA colored by donor — visual mixing assessment
# B: PCA colored by disease — biological signal preservation
# C: Per-dimension eta² bars — where donor variance lives
# D: Donor silhouette distributions — mixing quality
# E: k-NN neighbor composition — same-donor fraction
# F: Summary metrics comparison — overall scorecard
# G: Original model pseudotime gradient — biology dominates
# H: No-covariate pseudotime gradient — donor dominates

Failed Attempts (Critical)

Attempt	Why it Failed	Lesson Learned
Using `batch_key="imageid"`	Each imageid belongs to exactly one donor_status (disease group), so donor is fully nested within disease. scVI learns a batch-invariant latent space and removes disease signal along with donor signal	Never use `batch_key` when the batch variable is perfectly confounded with the biological variable of interest. Use `categorical_covariate_keys` for softer correction
Removing all covariates (no Age/Gender)	Donor variance in latent space increased by 40% (R²: 0.397→0.554). k-NN same-donor fraction jumped from 60%→85%. Pseudotime gradient in PC0 disappeared (r: 0.71→0.10)	Even soft covariates that are bijective with donor identity provide meaningful correction. The one-hot conditioning helps the model learn donor-agnostic features
Evaluating only R²(status) for biology	No-covariate model showed HIGHER R²(status) (0.149 vs 0.107), which looked better in isolation	Always evaluate biology metrics RELATIVE to donor metrics. The no-covariate model had higher status R² but also much higher donor R² — the extra biology signal was dwarfed by donor confounding. Use the excess eta² (donor - status) as the key metric
Relying solely on silhouette scores	Silhouette is sensitive to cluster shape assumptions and can be misleading for non-spherical distributions	Use multiple complementary metrics: R², eta², silhouette, k-NN mixing, and PCA analysis. Agreement across ≥4 metrics gives confidence
Low max_tokens in diagnostics script	Training completed but diagnostic output was truncated/lost when background task files were cleaned up	Save all diagnostic output to explicit log files (not just stdout). Use `tee` plus a dedicated log file path
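
The last lesson (write diagnostics to an explicit log file, not only stdout) can also be done from inside the script with the standard `logging` module. The following is a minimal sketch; the helper name `make_diagnostics_logger`, the logger name, and the temporary log path are all hypothetical.

```python
import logging
import tempfile
from pathlib import Path

def make_diagnostics_logger(log_path):
    """Return a logger that writes every message to both the console and log_path."""
    log_path = Path(log_path)
    log_path.parent.mkdir(parents=True, exist_ok=True)
    logger = logging.getLogger("scvi_diagnostics")
    logger.setLevel(logging.INFO)
    logger.propagate = False
    if not logger.handlers:  # avoid duplicate handlers on repeated calls
        formatter = logging.Formatter("%(asctime)s %(message)s")
        for handler in (logging.FileHandler(log_path), logging.StreamHandler()):
            handler.setFormatter(formatter)
            logger.addHandler(handler)
    return logger

# Demo: a temporary directory stands in for a real results folder
log_file = Path(tempfile.mkdtemp()) / "diagnostics.log"
logger = make_diagnostics_logger(log_file)
logger.info("R2(imageid)=%.3f", 0.397)
```

Each record is flushed to the file as it is emitted, so the metrics survive even if the process output is later truncated or cleaned up.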

Final Parameters

# scVI model (CANONICAL — with covariates)
n_latent: 10
n_layers: 2
n_hidden: 128
dropout_rate: 0.1
dispersion: gene-batch
gene_likelihood: nb
categorical_covariate_keys: ["Age", "Gender"]
batch_key: null  # intentionally NOT set
max_epochs: 400
early_stopping: true
early_stopping_patience: 45

# Comparison diagnostics
n_islets_evaluated: 1024  # with min_cells=20
metrics_used:
  - R² (donor vs status, linear regression)
  - eta² per latent dimension (ANOVA decomposition)
  - Silhouette score (cosine metric)
  - k-NN same-donor/same-status fraction (k=15)
  - PCA component analysis (pseudotime correlation)
acceptance_criteria:
  - Original model wins majority (≥3/5) of key metrics
  - Donor silhouette < +0.1 (well-mixed)
  - PC0 pseudotime |r| > 0.3 (biology drives principal axis)

# Environment
conda_env: scvi-env
python: 3.10
scvi-tools: 1.4.0
gpu: NVIDIA RTX 4000 Ada Generation
training_time: ~82 min (2.6M cells, no covariates)
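
The acceptance criteria listed above can be expressed as a small check. This is a sketch, not the project's script: the `ComparisonSummary` class and `passes_acceptance` function are hypothetical names, and the example values are taken from the Execution Results table.

```python
from dataclasses import dataclass

@dataclass
class ComparisonSummary:
    metrics_won: int         # key metrics won by this model
    metrics_total: int       # key metrics compared
    donor_silhouette: float  # mean cosine silhouette by donor
    pc0_pseudotime_r: float  # Spearman r between PC0 and pseudotime

def passes_acceptance(summary, min_wins=3, max_donor_silhouette=0.1, min_abs_r=0.3):
    """Apply the three thresholds from acceptance_criteria."""
    return (
        summary.metrics_won >= min_wins
        and summary.donor_silhouette < max_donor_silhouette
        and abs(summary.pc0_pseudotime_r) > min_abs_r
    )

original = ComparisonSummary(4, 5, -0.041, 0.712)
no_covariates = ComparisonSummary(1, 5, 0.250, 0.101)
print(passes_acceptance(original), passes_acceptance(no_covariates))
```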

Key Insights

Soft vs hard correction: categorical_covariate_keys provides gentler correction than batch_key. When batch is confounded with biology (e.g., each disease group has unique donors), soft correction preserves biological signal while still reducing donor effects.
Unique Age values = effective donor ID: Float64 age values (e.g., 20.59, 20.70, 20.75) are practically unique per donor. Combined with Gender, this is a bijection to imageid. The model learns donor-specific adjustments via these covariates.
Excess eta² is the key metric: eta²(donor) - eta²(status) captures how much latent variance is attributable to donor identity BEYOND what's explained by disease. Lower excess = better.
PC0 pseudotime gradient is the visual proof: The most compelling evidence is comparing PCA plots colored by pseudotime. The original model shows a smooth disease-progression gradient; the no-covariate model buries it under donor clustering.
Multiple metrics needed: No single metric tells the whole story. R² showed mixed results (biology slightly better without covariates), but silhouette, k-NN, eta², and PCA all agreed the original model was superior. Use ≥4 complementary metrics.
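Training time: The no-covariate model trained in 82.5 minutes on this GPU. The original model was pre-existing and its training time was not recorded, so the two cannot be compared directly. The covariates are expected to add negligible overhead (just one-hot encoding concatenation).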
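
The excess eta² described above can be computed for a single latent dimension as follows. This is a minimal sketch on toy values; the function names are hypothetical, and how the per-dimension values are combined into one summary number is not specified here.

```python
import numpy as np

def eta_squared(values, labels):
    """Fraction of variance in values explained by group membership (SS_between / SS_total)."""
    values = np.asarray(values, dtype=float)
    labels = np.asarray(labels)
    grand_mean = values.mean()
    ss_total = np.sum((values - grand_mean) ** 2)
    ss_between = sum(
        values[labels == g].size * (values[labels == g].mean() - grand_mean) ** 2
        for g in np.unique(labels)
    )
    return float(ss_between / ss_total)

def excess_eta_squared(values, donor, status):
    """eta²(donor) - eta²(status): donor variance beyond what disease status explains."""
    return eta_squared(values, donor) - eta_squared(values, status)

# Toy data: 4 donors nested in 2 status groups, 2 islets per donor
values = [0, 0, 2, 2, 10, 10, 12, 12]
donor = ["d1", "d1", "d2", "d2", "d3", "d3", "d4", "d4"]
status = ["ND", "ND", "ND", "ND", "T1D", "T1D", "T1D", "T1D"]
print(round(excess_eta_squared(values, donor, status), 4))
```

When donors are nested within status groups, eta²(donor) can never be smaller than eta²(status), so the excess is always non-negative and a value near zero means donor adds little beyond disease group.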

Execution Results (2026-02-23)

Metric	Original (Age/Gender)	No Covariates	Winner
R² imageid	0.397	0.554	Original
R² donor_status	0.107	0.149	No covariates
eta² excess (donor-status)	0.291	0.405	Original
Silhouette donor (cosine)	-0.041	+0.250	Original
k-NN same-donor enrichment	8.4x	11.9x	Original
k-NN same-status enrichment	2.17x	2.62x	No covariates
PC0 pseudotime r	0.712	0.101	Original
Training time	N/A (pre-existing)	82.5 min	—
Overall winner	4/5 key metrics	1/5	Original
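
The table reports k-NN results as enrichment, while the workflow code prints a raw same-donor fraction. The notes do not define enrichment, so the sketch below makes an explicit assumption: enrichment is the observed same-label neighbor fraction divided by the fraction expected under random mixing, where a point in a group of size n_g expects (n_g - 1)/(n - 1). The function name `knn_label_enrichment` is hypothetical and the neighbor search is a brute-force cosine search in NumPy.

```python
import numpy as np

def knn_label_enrichment(X, labels, k=15):
    """Observed same-label k-NN fraction divided by its random-mixing expectation."""
    X = np.asarray(X, dtype=float)
    labels = np.asarray(labels)
    n = X.shape[0]
    # Cosine similarity between all pairs of rows
    unit = X / np.linalg.norm(X, axis=1, keepdims=True)
    sim = unit @ unit.T
    np.fill_diagonal(sim, -np.inf)  # a point is not its own neighbor
    neighbours = np.argsort(-sim, axis=1)[:, :k]
    observed = (labels[neighbours] == labels[:, None]).mean()
    # ASSUMPTION: random-mixing baseline is (n_g - 1) / (n - 1) for each point's group
    uniq, counts = np.unique(labels, return_counts=True)
    count_of = dict(zip(uniq, counts))
    expected = np.mean([(count_of[l] - 1) / (n - 1) for l in labels])
    return float(observed / expected)

rng = np.random.default_rng(0)
# Two toy groups pointing in nearly orthogonal directions
group_a = np.column_stack([np.ones(20), rng.uniform(0, 0.1, 20)])
group_b = np.column_stack([rng.uniform(0, 0.1, 20), np.ones(20)])
X_toy = np.vstack([group_a, group_b])
labels_toy = np.array(["a"] * 20 + ["b"] * 20)
enrichment = knn_label_enrichment(X_toy, labels_toy, k=5)
print(round(enrichment, 3))
```

Under this definition a value of 1 means neighbors are as mixed as chance, and larger values mean stronger clustering by label.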

References

scVI: Deep generative modeling for single-cell transcriptomics — Lopez et al. 2018
scvi-tools documentation: covariates — categorical_covariate_keys vs batch_key
Islet Explorer: data/DATA_PROVENANCE.md — Full data lineage
Islet Explorer: scripts/retrain_scvi_no_covariates.py — Retraining script
Islet Explorer: scripts/plot_scvi_comparison.py — Supplemental figure generator
Islet Explorer: scripts/scvi_comparison_results/ — All comparison outputs