| name | eeg-foundation-sae-interpretability |
| description | Mechanistic interpretability of EEG foundation models using Sparse Autoencoders (SAEs). Extracts interpretable feature dictionaries from EEG transformer embeddings via TopK SAEs, benchmarks monosemanticity across architectures (SleepFM, REVE, LaBraM), and introduces concept steering with target vs. off-target probe metrics. Use when: interpreting EEG models, sparse autoencoders for neural data, EEG foundation model analysis, mechanistic interpretability of time-series models, concept steering in brain models, EEG feature disentanglement. Activation: EEG SAE, EEG interpretability, sparse autoencoder EEG, EEG foundation model, concept steering EEG, EEG monosemanticity, EEG feature dictionary. |
Mechanistic Interpretability of EEG Foundation Models via Sparse Autoencoders
Apply TopK Sparse Autoencoders to extract interpretable features from EEG foundation model embeddings. (arXiv: 2605.13930)
Core Methodology
TopK SAE on EEG Transformers
Train TopK Sparse Autoencoders on embeddings from EEG foundation models to extract feature dictionaries:
encoded = topk(Encoder(x), k)
decoded = Decoder(encoded)
loss = MSE(x, decoded) + λ * sparsity_penalty
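The three lines above can be made concrete with a small sketch. It is written in plain Python for readability rather than speed, and everything in it is an assumption for illustration: the helper names (`topk`, `matvec`, `sae_forward`), the random weights, the toy dimensions, and the choice to apply a ReLU before the TopK step. Because TopK already caps the number of active features at `k`, the sketch computes only the MSE term and leaves the optional penalty term out.

```python
import random

def topk(values, k):
    """Keep the k largest entries of `values`; set all others to zero."""
    order = sorted(range(len(values)), key=lambda i: values[i], reverse=True)
    keep = set(order[:k])
    return [v if i in keep else 0.0 for i, v in enumerate(values)]

def matvec(matrix, vector):
    """Multiply a row-major matrix (list of rows) by a vector."""
    return [sum(w * v for w, v in zip(row, vector)) for row in matrix]

def sae_forward(x, w_enc, b_enc, w_dec, b_dec, k):
    """Encode, keep the top-k activations, decode, and return the MSE."""
    pre = [a + b for a, b in zip(matvec(w_enc, x), b_enc)]
    # ReLU followed by TopK; this ordering is an assumption of the sketch.
    encoded = topk([max(p, 0.0) for p in pre], k)
    decoded = [a + b for a, b in zip(matvec(w_dec, encoded), b_dec)]
    mse = sum((xi - di) ** 2 for xi, di in zip(x, decoded)) / len(x)
    return encoded, decoded, mse

# Toy sizes and random weights, purely for illustration.
random.seed(0)
input_dim, dict_size, k = 8, 32, 4
w_enc = [[random.gauss(0, 0.1) for _ in range(input_dim)] for _ in range(dict_size)]
w_dec = [[random.gauss(0, 0.1) for _ in range(dict_size)] for _ in range(input_dim)]
b_enc = [0.0] * dict_size
b_dec = [0.0] * input_dim
x = [random.gauss(0, 1) for _ in range(input_dim)]

encoded, decoded, mse = sae_forward(x, w_enc, b_enc, w_dec, b_dec, k)
active = sum(1 for v in encoded if v != 0.0)
print(f"active features: {active} of {dict_size}, reconstruction MSE: {mse:.4f}")
```

At most `k` entries of `encoded` are nonzero for any input, which is what makes each input describable by a short list of dictionary features.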
Cross-Architecture Benchmarking
Apply SAEs across three architecturally distinct EEG transformers:
| Architecture | Focus | SAE Transferability |
|---|---|---|
| SleepFM | Sleep staging | Robust feature extraction |
| REVE | General EEG | Cross-dataset features |
| LaBraM | Brain activity | Clinical feature grounding |
Clinical Taxonomy Grounding
Ground extracted features against clinical categories:
- Abnormality — pathological EEG patterns
- Age — developmental/aging signatures
- Sex — sex-specific neural patterns
- Medication — drug-induced EEG changes
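The source does not say how features are scored against these categories. One simple possibility, shown here only as a hedged sketch, is to rank features by the gap in mean activation between recordings with and without a binary label. The function name `rank_features_by_label` and the toy data are hypothetical.

```python
def mean(values):
    """Arithmetic mean, returning 0.0 for an empty list."""
    return sum(values) / len(values) if values else 0.0

def rank_features_by_label(activations, labels):
    """Rank SAE features by the gap in mean activation between label groups.

    activations: one feature vector per recording (all the same length).
    labels: one 0/1 value per recording (e.g. abnormal vs. normal).
    Returns (feature_index, gap) pairs, largest absolute gap first.
    """
    n_features = len(activations[0])
    gaps = []
    for j in range(n_features):
        pos = [a[j] for a, y in zip(activations, labels) if y == 1]
        neg = [a[j] for a, y in zip(activations, labels) if y == 0]
        gaps.append((j, mean(pos) - mean(neg)))
    return sorted(gaps, key=lambda item: abs(item[1]), reverse=True)

# Toy data: feature 1 fires mostly on label-1 recordings.
activations = [
    [0.0, 2.0, 0.5],
    [0.1, 1.8, 0.4],
    [0.0, 0.1, 0.5],
    [0.2, 0.0, 0.6],
]
labels = [1, 1, 0, 0]

ranking = rank_features_by_label(activations, labels)
for index, gap in ranking:
    print(f"feature {index}: mean activation gap {gap:+.2f}")
```

For a continuous category such as age, a correlation between activation and the label value would play the same role as the mean gap used here.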
Dictionary Health Audit
Intrinsic procedure to evaluate SAE quality:
- Monosemanticity — single feature → single concept
- Entanglement — feature-concept mapping complexity
- Coverage — fraction of variance explained
- Sparsity — average active features per input
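Two of these metrics follow directly from their definitions and can be sketched in a few lines. The version below assumes coverage means one minus the ratio of residual to total sum of squares, and sparsity means the mean count of nonzero features per input. The function names and toy values are hypothetical. Monosemanticity and entanglement are left out because the source does not give formulas for them.

```python
def explained_variance(inputs, reconstructions):
    """Coverage: 1 - residual sum of squares / total sum of squares."""
    dim = len(inputs[0])
    means = [sum(x[d] for x in inputs) / len(inputs) for d in range(dim)]
    total = sum((x[d] - means[d]) ** 2 for x in inputs for d in range(dim))
    residual = sum(
        (x[d] - r[d]) ** 2
        for x, r in zip(inputs, reconstructions)
        for d in range(dim)
    )
    return 1.0 - residual / total

def mean_active_features(codes):
    """Sparsity: average number of nonzero features per input."""
    return sum(sum(1 for v in code if v != 0.0) for code in codes) / len(codes)

# Toy embeddings, their SAE reconstructions, and the sparse codes behind them.
inputs = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]]
reconstructions = [[1.0, 2.0], [3.0, 4.0], [5.0, 5.0]]
codes = [[0.0, 1.2, 0.0, 0.3], [0.5, 0.0, 0.0, 0.0], [0.0, 0.0, 0.7, 0.1]]

coverage = explained_variance(inputs, reconstructions)
sparsity = mean_active_features(codes)
print(f"coverage: {coverage:.4f}, mean active features: {sparsity:.2f}")
```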
Concept Steering & Probe Metrics
Target vs. Off-Target Probe Area:
- Quantify steering selectivity
- Three operational regimes identified:
  - Selective steering: activates the target concept without off-target effects
  - Mixed activation: partial selectivity
  - Entangled steering: activates multiple concepts simultaneously
The same hyperparameter setting transfers robustly across all three architectures (SleepFM, REVE, LaBraM).
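The source names the target vs. off-target probe area but does not give its formula, so the following is only one plausible reading. It integrates each probe's shift from its unsteered baseline over a sweep of steering strengths (trapezoid rule) and labels the regime from the ratio of the two areas. The function names, the 0.25 and 1.0 ratio thresholds, and the probe scores are all hypothetical.

```python
def trapezoid_area(xs, ys):
    """Area under a piecewise-linear curve through the points (xs, ys)."""
    return sum(
        (xs[i + 1] - xs[i]) * (ys[i + 1] + ys[i]) / 2.0
        for i in range(len(xs) - 1)
    )

def probe_areas(strengths, target_scores, off_target_scores):
    """Area of each probe's absolute shift from its unsteered baseline."""
    target_shift = [abs(s - target_scores[0]) for s in target_scores]
    off_shift = [abs(s - off_target_scores[0]) for s in off_target_scores]
    return (
        trapezoid_area(strengths, target_shift),
        trapezoid_area(strengths, off_shift),
    )

def steering_regime(target_area, off_area, selective_below=0.25):
    """Label the regime from the off-target/target ratio (thresholds assumed)."""
    if target_area == 0.0:
        return "no effect"
    ratio = off_area / target_area
    if ratio < selective_below:
        return "selective"
    if ratio < 1.0:
        return "mixed"
    return "entangled"

# Made-up probe scores at steering strengths 0..3 (strength 0 is the baseline).
strengths = [0.0, 1.0, 2.0, 3.0]
target_scores = [0.50, 0.70, 0.85, 0.95]
off_target_scores = [0.50, 0.52, 0.55, 0.56]

target_area, off_area = probe_areas(strengths, target_scores, off_target_scores)
regime = steering_regime(target_area, off_area)
print(f"target area {target_area:.3f}, off-target area {off_area:.3f}: {regime}")
```

In this toy sweep the target probe moves far more than the off-target probe, so the ratio falls under the assumed threshold and the direction is labeled selective.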
Implementation Workflow
Step 1: Extract Embeddings
# Load a pretrained EEG foundation model and encode recordings into embeddings.
model = load_eeg_model("SleepFM")
embeddings = model.encode(eeg_data)
Step 2: Train TopK SAE
sae = TopKSAE(
    input_dim=embeddings.shape[-1],  # width of the model embeddings
    dict_size=16384,                 # number of dictionary features
    k=32,                            # active features kept per input
)
sae.train(embeddings, lr=1e-4, batch_size=256)
Step 3: Audit Dictionary Health
# Compute the intrinsic dictionary health metrics for the trained SAE.
audit = DictionaryAudit(sae, eeg_data)
monosemanticity = audit.compute_monosemanticity()
entanglement = audit.compute_entanglement()
coverage = audit.compute_coverage()
Step 4: Concept Steering
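# Find one steering direction per labeled concept.
directions = identify_directions(sae, labeled_data, concepts=["abnormality", "age"])
# Shift the embeddings along the abnormality direction.
steered = steer(embeddings, directions["abnormality"], strength=2.0)
# Compare the probe response on the target concept with the off-target concepts.
probe_score = probe(steered, target="abnormality", off_target=["age", "sex"])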
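The `steer` call above is not defined in the source. A common way to implement such a step, assumed here rather than taken from the source, is to add a scaled unit vector along the concept direction (for example, an SAE decoder vector) to each embedding. The helper names and toy vectors below are hypothetical.

```python
import math

def unit(vector):
    """Scale a vector to unit Euclidean length."""
    norm = math.sqrt(sum(v * v for v in vector))
    return [v / norm for v in vector]

def steer(embedding, direction, strength):
    """Shift an embedding by `strength` along the normalized direction."""
    d = unit(direction)
    return [e + strength * di for e, di in zip(embedding, d)]

def projection(embedding, direction):
    """Scalar projection of an embedding onto the normalized direction."""
    d = unit(direction)
    return sum(e * di for e, di in zip(embedding, d))

# Toy embedding and concept direction, purely for illustration.
embedding = [1.0, 0.0, 2.0]
direction = [0.0, 3.0, 4.0]

steered = steer(embedding, direction, strength=2.0)
shift = projection(steered, direction) - projection(embedding, direction)
print(f"steered embedding: {steered}, shift along direction: {shift:.2f}")
```

Normalizing the direction makes `strength` directly comparable across concepts: the projection onto the direction moves by exactly `strength`, and components orthogonal to it are unchanged.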
Key Findings
- SAEs transfer across architectures — a single hyperparameter setting works for SleepFM, REVE, and LaBraM
- Clinical features are recoverable — abnormality, age, sex, and medication signatures emerge as sparse features
- Steering selectivity varies — three distinct regimes from selective to fully entangled
- Intrinsic audit predicts transfer — dictionary health metrics predict downstream steering quality
Activation Conditions
Use this skill when:
- Interpreting EEG foundation model internals
- Applying SAEs to neural time-series data
- Benchmarking EEG model architectures
- Performing concept steering on brain models
- Analyzing feature entanglement in neural representations
- Evaluating clinical trustworthiness of EEG models
Related Skills
- eeg-foundation-lrp-interpretability: LRP-based EEG interpretability
- eeg-foundation-model-adapters: EEG foundation model domain adaptation
- mechanistic-interpretability: General mechanistic interpretability methods