| name | tooluniverse-variant-predictor-dms-validation |
| description | Validate a variant-effect predictor (AlphaMissense, ESM-C SAE, ESM logits, EVE, conservation scores, or any per-variant numeric score) against experimental deep mutational scanning (DMS) data. Computes per-variant predictor scores, splits variants into neutral vs disruptive groups by DMS effect, runs a Mann-Whitney U test on the predictor scores, and sweeps the stratification thresholds for robustness. Use when you need to know whether a predictor's scores track real functional disruption on a specific protein. |
| disable-model-invocation | true |
Variant-effect predictor benchmarking against DMS
The core user question: "I have a variant-effect predictor — does it actually
correlate with experimental DMS measurements on this protein?"
The predictor can be anything that assigns a numeric score to single missense
variants:
- ESM-C 6B Sparse Autoencoder (SAE) feature drops at the mutation site
- AlphaMissense pathogenicity scores
- ESM logits-based variant scoring (
ESM_score_sequence)
- EVE / EVE++ scores
- Conservation scores (ConSurf, Rate4Site)
- DynaMut2 ΔΔG predictions
- A custom in-house model
This skill validates ALL of these against a DMS dataset with the same statistical
framework. SAE is shown as the worked example because the surrounding skills in
this collection are SAE-themed, but the procedure is predictor-agnostic.
When to use this skill
- You picked a variant-effect predictor and want to know if it's worth trusting
on your protein of interest
- You're comparing two predictors on the same DMS dataset (run this skill twice,
compare the resulting per-K Mann-Whitney U p-values + effect sizes)
- Reviewers want robustness evidence — the parameter sweep (K, neutral band,
disruptive quantile) is exactly that
- You're publishing a new variant-effect method and need a benchmark figure
Not for:
- Per-position / per-feature interpretation — use
tooluniverse-residue-functional-mechanism-interpretation
- Single-variant interpretation (you have one variant, no DMS) — use
tooluniverse-protein-sae-variant-interpretation or
tooluniverse-protein-lof-mechanism
- Building a predictor from scratch — out of scope
Required inputs
| Input | Format | Example |
|---|
| DMS effect matrix | (20 amino acids × n_positions) np.array, NaN for unmeasured | from MaveDB_get_effect_matrix |
| Disruptive tail convention | "top" (ΔΔG positive = destabilizing) or "bottom" (fitness low = LoF) | metadata from DMS retrieval step |
| Per-variant predictor scores | (20 × n_positions) np.array matching DMS layout | computed from your chosen predictor — see Step 2 |
| Aggregation K | int, mean of top-K when the predictor outputs many sub-scores per variant | only relevant for multi-feature predictors like SAE; ignore otherwise |
Workflow
Step 1: Retrieve DMS as an effect matrix
One call returns a ready-to-analyze (20 × n_positions) matrix — HGVS parsing,
single-missense filtering, score-field detection, and (optional) UniProt
numbering verification are done inside the tool:
r = MaveDB_get_effect_matrix(
urn="urn:mavedb:00000115-a-7",
uniprot_id="P01116",
)
matrix = np.array(r["data"]["matrix"], dtype=np.float32)
positions = r["data"]["positions"]
amino_acid_order = r["data"]["amino_acid_order"]
If numbering_offset != 0, the MaveDB position numbering differs from the
UniProt canonical sequence — apply the offset before joining to any other
source (PDB / SAE features / AlphaMissense).
Step 2: Compute per-variant predictor scores
Pick one of these predictor sources. The choice changes Step 2 only; Steps
3–5 are identical.
Predictor option A — ESM-C 6B SAE (the worked example)
For every variant, sum SAE activations across the residue window, compute
drop = max(0, WT − mut), and aggregate to one score per variant via the
top-K mean of drops:
import numpy as np
def sae_drop_per_variant(wt_pooled, variant_pooled, K=3):
"""SAE drop for one variant = mean of the K largest feature drops."""
drops = np.maximum(0.0, wt_pooled - variant_pooled)
sorted_desc = -np.sort(-drops)
return float(sorted_desc[:K].mean())
Two ways to score a saturation sweep — pick based on batch size:
| When | Use | Forge cost (for 19 alts × N positions) |
|---|
| Saturation at ≤100 variants OR you only need top-K-per-variant deltas | ESM_score_variant_sae_batch(sequence, variants=[...], top_k_features=10) | 1 + 19N (1 ref + 1 per mut) |
| Full-protein per-feature tensor (e.g. for downstream PCA / clustering) | Loop ESM_get_sae_features(sequence=mutant), cache by (sequence, position) | 2 × 19N (cached reruns free) |
The batch tool is the right default — it halves Forge cost vs the per-variant
disruption pattern, and the cap of 100 variants per call covers saturation
mutagenesis at one position (19) or short positional sweeps (e.g. positions
10-15: 90 variants). For longer sweeps, split into multiple batch calls.
For the full per-residue × per-feature tensor needed by some predictor
analyses, fall back to the loop pattern. The full library scale is
well-tested: tests/integration/test_dms_pipeline_e2e_kras.py runs ~300
mutants for KRAS positions 10–25.
Predictor option B — AlphaMissense (hegelab proxy: categorical, not per-substitution numeric)
The TU AlphaMissense tools proxy a public API (hegelab.org) that returns
residue-level categorical assignments, not per-(position, alt_aa) numeric
scores. Three calls are available; pick the one that matches your scale:
| Tool | Signature | Returns |
|---|
AlphaMissense_get_variant_score | uniprot_id, variant (e.g. "p.G12V") | Residue-level data INCLUDING the alt_aa's bin (benign / ambiguous / pathogenic); also mean/mean_all |
AlphaMissense_get_residue_scores | uniprot_id, position | Same residue-level data (per-position lookup, cheaper than 19 variant calls) |
AlphaMissense_get_protein_scores | uniprot_id | Whole-protein dump in one call (cheapest for full-protein DMS analysis) |
The response shape is the same for all three. Example for KRAS pos 12:
r = AlphaMissense_get_residue_scores(uniprot_id="P01116", position=12)
To populate the (20, n_positions) predictor matrix, parse the bin
strings and map each alt_aa to a numeric score (bin-midpoint is the standard
imputation since the tool doesn't expose true per-substitution numerics):
BIN_MIDPOINTS = {"benign": 0.17, "ambiguous": 0.452, "pathogenic": 0.782}
def parse_bin_list(bin_str):
"""'6:A,C,D,R,S,V' → ['A','C','D','R','S','V']; '' → []."""
if not bin_str:
return []
_, aas = bin_str.split(":", 1)
return aas.split(",")
def am_per_variant_matrix(uniprot_id, positions, aa_index):
AAS = list(aa_index)
M = np.full((20, len(positions)), np.nan, dtype=np.float32)
p = AlphaMissense_get_protein_scores(uniprot_id=uniprot_id)
per_res = {row["resi"]: row for row in p["data"]["scores"]}
for pos_idx, pos in enumerate(positions):
row = per_res.get(pos, {})
for cat in ("benign", "ambiguous", "pathogenic"):
for alt in parse_bin_list(row.get(f"{cat}_all", "")):
if alt in aa_index:
M[aa_index[alt], pos_idx] = BIN_MIDPOINTS[cat]
return M
Higher-resolution alternative — DeepMind bulk CSV (only if you genuinely
need true per-substitution numerics rather than bin-midpoints):
Use the bulk CSV when you need rank-correlation analysis (Spearman benefits
from continuous values, not 3-bin midpoints). Use the TU proxy when you only
need the binary "is this variant in the pathogenic bin" signal.
Both options are free, no API key required.
Predictor option C — ESM logits-based score
ESM_score_sequence(
sequence=mutant_sequence,
model="esmc-600m-2024-12",
)
Predictor option D — any external score
Bring your own. Just produce a (20, n_positions) np.ndarray aligned to the
DMS matrix.
Step 3: Stratify DMS into neutral vs disruptive
def categorize(dms_matrix, disruptive_tail, neutral_abs=0.1, disruptive_quantile=0.05):
"""Split variants into neutral and disruptive masks.
disruptive_tail: 'top' (positive = destabilizing, e.g. folding ΔΔG)
'bottom' (low = LoF, e.g. fitness)
"""
flat = dms_matrix[~np.isnan(dms_matrix)]
if disruptive_tail == "top":
cut = np.quantile(flat, 1 - disruptive_quantile)
disruptive = dms_matrix >= cut
elif disruptive_tail == "bottom":
cut = np.quantile(flat, disruptive_quantile)
disruptive = dms_matrix <= cut
else:
raise ValueError("disruptive_tail must be 'top' or 'bottom'")
neutral = np.abs(dms_matrix) <= neutral_abs
disruptive = disruptive & ~np.isnan(dms_matrix)
neutral = neutral & ~np.isnan(dms_matrix) & ~disruptive
return neutral, disruptive
Keep the neutral band tight (|effect| ≤ 0.1). A loose neutral band leaks
weakly-disruptive variants into the "neutral" group and erodes the contrast.
Sign matters — get disruptive_tail from the DMS-retrieval skill's metadata;
a flipped sign silently inverts every conclusion.
Step 3.5 (MANDATORY): Pre-MWU sanity gate — catch silent NaN failures
Before running any statistical test, verify the predictor scores actually
populated. Silent NaN matrices are the most common failure mode in this
workflow — a batch SAE compute that errored midway, an AlphaMissense fetch
that skipped variants, an ESM forge call that timed out — and they produce
"successful" runs that report meaningless statistics.
neutral, disruptive = categorize(dms_matrix, disruptive_tail)
s_neutral_all = predictor_scores[neutral]
s_disruptive_all = predictor_scores[disruptive]
s_neutral_finite = s_neutral_all[~np.isnan(s_neutral_all)]
s_disruptive_finite = s_disruptive_all[~np.isnan(s_disruptive_all)]
coverage_n = len(s_neutral_finite) / max(len(s_neutral_all), 1)
coverage_d = len(s_disruptive_finite) / max(len(s_disruptive_all), 1)
if len(s_neutral_finite) < 5 or len(s_disruptive_finite) < 5:
raise ValueError(
f"Insufficient predictor scores: only {len(s_neutral_finite)} neutral "
f"and {len(s_disruptive_finite)} disruptive non-NaN values. "
f"Coverage = {coverage_n:.0%} / {coverage_d:.0%}. "
f"Predictor matrix may be empty / failed to populate. "
f"INVESTIGATE THE PREDICTOR COMPUTATION STEP before continuing."
)
if coverage_n < 0.5 or coverage_d < 0.5:
print(f"WARNING: predictor coverage only {coverage_n:.0%} (neutral) / "
f"{coverage_d:.0%} (disruptive). Results may be biased toward the "
f"non-NaN subset.")
If you hit the ValueError: do not paper over with "predictor X wins by
default". The right response is to debug the predictor computation step
(re-run, check API keys, check log files) and report what failed.
Sign-convention double-check (also mandatory if the dataset is new):
verify disruptive_tail against the data using an internal landmark. The
metadata field can be misleading on subsets (e.g. a window of TP53 DBD
might run the opposite direction from the full TP53 abundance assay). The
cheapest check is Spearman correlation between DMS effect and a predictor
known to align in a fixed direction (AlphaMissense pathogenicity score is
always positive=more-damaging):
from scipy.stats import spearmanr
flat_dms = dms_matrix.ravel()
flat_pred = predictor_scores.ravel()
mask = ~(np.isnan(flat_dms) | np.isnan(flat_pred))
rho, p_corr = spearmanr(flat_dms[mask], flat_pred[mask])
expected_sign = "+" if disruptive_tail == "top" else "-"
got_sign = "+" if rho > 0 else "-"
if got_sign != expected_sign and abs(rho) > 0.1:
print(f"WARNING: Spearman rho = {rho:.3f} (sign={got_sign}) but "
f"disruptive_tail='{disruptive_tail}' expects sign={expected_sign}. "
f"Sign convention may be inverted for THIS subset. Verify before "
f"interpreting MWU.")
Step 4: One-sided Mann-Whitney U test
from scipy.stats import mannwhitneyu
u, p = mannwhitneyu(s_disruptive_finite, s_neutral_finite, alternative="greater")
print(f"disruptive median = {np.median(s_disruptive_finite):.4f}, "
f"neutral median = {np.median(s_neutral_finite):.4f}, p = {p:.3g}")
For SAE-style multi-feature predictors with a top-K parameter, run this for
K ∈ {1, 3, 10} and report all three — the best K is usually different across
predictors and you want the comparison transparent, not tuned.
MWU is the discrimination test, but it's not the only valid analysis.
Consider also reporting (one or both):
- Spearman rank correlation between predictor and DMS effect — captures
graded calibration the binary neutral/disruptive split discards. Best for
"is this predictor calibrated?" questions, not just "does it discriminate?"
- AUROC at a clinically-meaningful disruption threshold (e.g. ΔΔG > 1
kcal/mol) — gives a 0-1 number practitioners recognize and lets you compare
to literature predictors directly.
The eval that motivated this skill (KRAS folding AlphaMissense benchmark)
got qualitatively similar answers from MWU (p=0.20, "not reliable") and
Spearman (ρ=0.23, p<1e-30, "weakly calibrated") — but Spearman's reading was
richer. If the user just asks "is X reliable?", give both unless one is
clearly inappropriate for the data shape.
Step 5: Robustness sweep
sweep = []
for neutral_abs in (0.05, 0.1, 0.2):
for q in (0.05, 0.1):
neut, disr = categorize(dms_matrix, disruptive_tail,
neutral_abs=neutral_abs,
disruptive_quantile=q)
s_n = predictor_scores[neut][~np.isnan(predictor_scores[neut])]
s_d = predictor_scores[disr][~np.isnan(predictor_scores[disr])]
if len(s_n) < 5 or len(s_d) < 5:
continue
_u, p = mannwhitneyu(s_d, s_n, alternative="greater")
sweep.append({"neutral_abs": neutral_abs, "disruptive_q": q, "p": p,
"n_n": len(s_n), "n_d": len(s_d)})
A predictor that passes only at one (neutral_abs, q) point is suspect. A
predictor that passes across the grid is robust.
Step 6: Visualize + interpret
import matplotlib.pyplot as plt
fig, ax = plt.subplots(figsize=(4, 4))
ax.boxplot(
[s_neutral, s_disruptive],
labels=[f"neutral (n={len(s_neutral)})", f"disruptive (n={len(s_disruptive)})"]
)
ax.set_ylabel("Predictor score")
ax.set_title(f"p = {p:.3g}")
plt.tight_layout()
plt.savefig("predictor_vs_dms.png", dpi=150)
Interpretation
| Result | What it means |
|---|
| p < 0.01 AND robust across sweep | Predictor reliably distinguishes disruptive from neutral on this protein. Safe to use for prioritization (not classification — see limits) |
| p < 0.05 but flips signs in sweep | Borderline. Effect exists but is sensitive to stratification — needs more data or tighter neutral band |
| p > 0.05 | Predictor is not informative on this DMS assay. Possible causes: wrong disruptive_tail, predictor mis-calibrated for this protein family, DMS measures something the predictor wasn't trained for |
| Significant at K=1 but not K=10 (multi-feature predictors only) | Disruption is concentrated in a few features (K=1 best) vs distributed across many (K=10) |
Comparing two predictors
Run this skill twice on the same DMS dataset (e.g. SAE drops AND AlphaMissense),
keep the same stratification (disruptive_tail, neutral_abs, disruptive_quantile),
and compare:
| Metric | Better predictor has |
|---|
| Lower p-value | More confidence of discrimination |
| Larger median gap (disruptive − neutral) | Larger effect size |
| Fewer NaNs in coverage | Predicts more variants |
| Robust across sweep | More reliable in different conditions |
Honest limitations
- Per-protein test only. Generalizing "predictor X is good" requires running
this on multiple proteins from different families.
- Per-assay only. A predictor calibrated for stability might fail on
binding-fitness assays — same protein, different DMS, different result.
- Coarse signal. This says "predictor responds to mutational disruptiveness."
It does NOT say "predictor identifies the right residues" — for that, run
tooluniverse-residue-functional-mechanism-interpretation.
- MWU assumes independence. DMS variants at the same position are mildly
correlated (shared structural context); the p-values are slightly optimistic.
- Tail definitions are arbitrary. 5% disruptive cutoff and 0.1 neutral band
are reasonable defaults; the right thresholds depend on the assay's
signal-to-noise. The sweep is what gives you robustness, not any single choice.
Cross-references
| Step | Tool / Skill |
|---|
| DMS retrieval | MaveDB_get_effect_matrix |
| Per-variant SAE scoring (≤100 variants) | ESM_score_variant_sae_batch (preferred — N+1 calls) |
| Per-variant SAE scoring (full tensor / unlimited) | ESM_get_sae_features (loop + cache), ESM_score_variant_sae_disruption (single variant) |
| Per-variant AlphaMissense (single lookup) | AlphaMissense_get_variant_score(uniprot_id, variant) |
| Per-position AlphaMissense (saturation) | AlphaMissense_get_residue_scores(uniprot_id, position) |
| Whole-protein AlphaMissense (cheapest for DMS) | AlphaMissense_get_protein_scores(uniprot_id) |
| Per-variant ESM logits | ESM_score_sequence |
| Structural prior (for predictor analysis) | Structure_annotate_per_residue |
| Next step: per-hotspot mechanism | tooluniverse-residue-functional-mechanism-interpretation |
| Final visualization | Step 7 of tooluniverse-residue-functional-mechanism-interpretation (annotated heatmap + callouts) |