| name | mindalign-eeg-visual-decoding |
| description | Tri-modal contrastive framework (EEG, vision, language) for zero-shot visual decoding. Reports 54.1% Top-1 accuracy on a 200-way zero-shot benchmark, versus 32.4% for the strongest prior baseline (ATMS). Use for: brain-computer interface visual reconstruction, EEG-based image retrieval, non-invasive neural decoding, multimodal brain signal analysis. |
| license | Complete terms in LICENSE.txt |
| metadata | {"arxiv_id":"2605.24523","published":"2026-05-23","authors":"Zexuan Chen, Sichao Liu, Runhao Lu, Huichao Qi, Alexandra Woolgar, Xi Vincent Wang, Lihui Wang","tags":["eeg","visual-decoding","contrastive-learning","brain-computer-interface","zero-shot","multimodal","neuroscience","language-grounding"],"source":"arXiv:2605.24523"} |
MindAlign: Tri-Modal EEG-Vision-Language Visual Decoding
Overview
MindAlign introduces a tri-modal contrastive framework that aligns EEG brain signals, visual images, and language descriptions in a unified latent space for zero-shot visual decoding.
Key result: 54.1% Top-1 accuracy on the 200-way zero-shot benchmark, versus 32.4% for the prior state of the art (ATMS). That is a gain of 21.7 percentage points, or +67% relative.
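The relative figure follows directly from the two reported accuracies. A quick check, using only the numbers quoted above (the chance level assumes uniform guessing over the 200 candidates):

```python
# Reported Top-1 accuracies (percent) on the 200-way zero-shot benchmark
mindalign_top1 = 54.1
prior_top1 = 32.4

absolute_gain = mindalign_top1 - prior_top1         # percentage points
relative_gain = 100 * absolute_gain / prior_top1    # percent of the prior score
chance_level = 100 / 200                            # uniform guessing over 200 classes

print(f"absolute: +{absolute_gain:.1f} points")
print(f"relative: +{relative_gain:.0f}%")
print(f"chance:   {chance_level:.1f}%")
```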
When to Use
- Zero-shot visual decoding from EEG signals
- Brain-computer interface image retrieval
- Non-invasive neural representation alignment
- Multimodal brain signal analysis (EEG + vision + text)
- Studying neural correlates of visual object recognition
- Cross-subject generalization of visual decoders
Architecture
EEG Signal (T×C) → EEG Encoder → e_eeg ∈ R^d ───┐
Image → CLIP/CN-CLIP → e_img ∈ R^d ─────────────┼→ Unified Latent Space (contrastive alignment)
Text Description (LLM-generated) → e_txt ∈ R^d ─┘
Two-Stage Design
Stage 1: Masked Reconstruction Pre-training
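import torch.nn as nn
import torch.nn.functional as F

class EEGMaskedAutoencoder(nn.Module):
    # TransformerEncoder, TransformerDecoder, tokenize and random_mask are
    # project-level helpers that are not defined in this document.
    def __init__(self, n_channels=128, n_timepoints=512, d_model=512):
        super().__init__()
        self.channel_embed = nn.Linear(1, d_model)
        self.time_embed = nn.Linear(1, d_model)
        self.transformer = TransformerEncoder(d_model, n_heads=8, n_layers=6)
        self.decoder = TransformerDecoder(d_model, n_heads=8, n_layers=4)

    def forward(self, eeg, mask_ratio=0.75):
        tokens = self.tokenize(eeg)                          # EEG -> token sequence
        visible_tokens, mask_idx = self.random_mask(tokens, mask_ratio)
        latent = self.transformer(visible_tokens)            # encode visible tokens only
        reconstructed = self.decoder(latent, mask_idx)       # predict the masked tokens
        loss = F.mse_loss(reconstructed, tokens[mask_idx])   # loss on masked positions only
        return loss, latent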
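The masking step itself is not shown in the source. A minimal, dependency-free sketch of what it might look like, assuming tokens are masked uniformly at random as in standard masked autoencoders; `random_mask_indices` is a hypothetical helper name and the paper's exact scheme may differ:

```python
import random

def random_mask_indices(n_tokens, mask_ratio=0.75, seed=None):
    """Split token positions into (visible, masked) index lists.

    Hypothetical helper: assumes uniform random masking.
    """
    if not 0.0 <= mask_ratio < 1.0:
        raise ValueError("mask_ratio must be in [0, 1)")
    n_masked = int(round(n_tokens * mask_ratio))
    rng = random.Random(seed)
    masked = sorted(rng.sample(range(n_tokens), n_masked))
    masked_set = set(masked)
    visible = [i for i in range(n_tokens) if i not in masked_set]
    return visible, masked

visible_idx, masked_idx = random_mask_indices(n_tokens=16, mask_ratio=0.75, seed=0)
print(len(visible_idx), len(masked_idx))
```

With `mask_ratio=0.75` the encoder only processes a quarter of the tokens, and the reconstruction loss is evaluated on the remaining three quarters.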
Stage 2: Tri-Modal Contrastive Alignment
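import torch
import torch.nn as nn
import torch.nn.functional as F

class MindAlignTraining(nn.Module):
    # n_subjects defaults to 10, the number of participants in Things-EEG2.
    def __init__(self, eeg_encoder, image_encoder, text_encoder, d=512, n_subjects=10):
        super().__init__()
        self.eeg_enc = eeg_encoder
        self.img_enc = image_encoder
        self.txt_enc = text_encoder
        # One linear adapter per subject absorbs inter-subject variability
        self.subject_adapters = nn.ModuleDict({
            f'subj_{i}': nn.Linear(d, d) for i in range(n_subjects)
        })

    def forward(self, eeg, images, texts, subject_ids):
        e_eeg = self.eeg_enc(eeg)
        # Route each trial through its own subject's adapter, so batches that
        # mix several subjects are handled correctly
        e_eeg = torch.stack([
            self.subject_adapters[f'subj_{int(s)}'](e)
            for e, s in zip(e_eeg, subject_ids)
        ])
        e_img = self.img_enc(images)
        e_txt = self.txt_enc(texts)
        loss = self.contrastive_loss_trimodal(e_eeg, e_img, e_txt)
        return loss

    def clip_loss(self, a, b, tau):
        # Not shown in the source; assumed here to be the symmetric InfoNCE
        # loss used by CLIP, where row i of a matches row i of b
        logits = a @ b.t() / tau
        targets = torch.arange(a.size(0), device=a.device)
        return 0.5 * (F.cross_entropy(logits, targets)
                      + F.cross_entropy(logits.t(), targets))

    def contrastive_loss_trimodal(self, e_eeg, e_img, e_txt, tau=0.07):
        e_eeg = F.normalize(e_eeg, dim=-1)
        e_img = F.normalize(e_img, dim=-1)
        e_txt = F.normalize(e_txt, dim=-1)
        loss_ei = self.clip_loss(e_eeg, e_img, tau)   # EEG <-> image (primary term)
        loss_et = self.clip_loss(e_eeg, e_txt, tau)   # EEG <-> text
        loss_it = self.clip_loss(e_img, e_txt, tau)   # image <-> text
        alpha = 0.3                                   # weight of the two text terms
        return loss_ei + alpha * (loss_et + loss_it)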
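To make the weighting concrete, here is a dependency-free worked example of the combined objective on toy 2-D embeddings. It assumes each pairwise term is a symmetric InfoNCE loss, which is the usual reading of a "CLIP loss" but is not spelled out in the source; the embeddings are made up for illustration:

```python
import math

def normalize(v):
    n = math.sqrt(sum(x * x for x in v))
    return [x / n for x in v]

def clip_loss(a, b, tau=0.07):
    """Symmetric InfoNCE; a[i] and b[i] are a matching pair."""
    a = [normalize(v) for v in a]
    b = [normalize(v) for v in b]
    logits = [[sum(x * y for x, y in zip(u, v)) / tau for v in b] for u in a]

    def mean_cross_entropy(rows):
        # Row i should put its probability mass on column i
        total = 0.0
        for i, row in enumerate(rows):
            m = max(row)
            log_sum_exp = m + math.log(sum(math.exp(x - m) for x in row))
            total += log_sum_exp - row[i]
        return total / len(rows)

    transposed = [list(col) for col in zip(*logits)]
    return 0.5 * (mean_cross_entropy(logits) + mean_cross_entropy(transposed))

def trimodal_loss(e_eeg, e_img, e_txt, tau=0.07, alpha=0.3):
    return clip_loss(e_eeg, e_img, tau) + alpha * (
        clip_loss(e_eeg, e_txt, tau) + clip_loss(e_img, e_txt, tau))

# Toy 2-D embeddings for three classes (illustrative values only)
img = [[1.0, 0.0], [0.0, 1.0], [-1.0, 0.0]]
txt = [[0.9, 0.1], [0.1, 0.9], [-0.9, 0.1]]
eeg_aligned = [[0.8, 0.2], [0.2, 0.8], [-0.8, 0.2]]
eeg_shuffled = [eeg_aligned[1], eeg_aligned[2], eeg_aligned[0]]

loss_aligned = trimodal_loss(eeg_aligned, img, txt)
loss_shuffled = trimodal_loss(eeg_shuffled, img, txt)
print(loss_aligned < loss_shuffled)
```

The image-text term is identical in both cases, so the gap comes entirely from the two EEG terms, with the EEG-text term contributing at weight 0.3.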
EEG Encoder Architecture
Key Components
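import torch.nn as nn

class EEGEncoder(nn.Module):
    # GraphAttentionLayer and SubjectBatchNorm are project-level modules that
    # are not defined in this document. n_subjects defaults to 10, the number
    # of participants in Things-EEG2.
    def __init__(self, n_channels=128, d_model=512, n_subjects=10):
        super().__init__()
        # Graph attention over electrodes, with edges from functional connectivity
        self.channel_graph_attn = GraphAttentionLayer(
            n_channels, d_model,
            adjacency='functional_connectivity'
        )
        # 1x1 convolution mixes channels at each time step
        self.spatial_conv = nn.Conv1d(n_channels, d_model, kernel_size=1)
        # kernel_size=25 with padding=12 keeps the number of time steps unchanged
        self.temporal_conv = nn.Conv1d(d_model, d_model,
                                       kernel_size=25, padding=12)
        self.subject_norm = SubjectBatchNorm(d_model, n_subjects)

    def forward(self, eeg, subject_id):
        # eeg: (batch, n_channels, n_timepoints), the layout nn.Conv1d expects
        # The source computes h_spatial but does not show how it is fused with
        # the temporal branch, so it is unused below.
        h_spatial = self.channel_graph_attn(eeg)
        h_temporal = self.temporal_conv(
            self.spatial_conv(eeg)
        )
        h = h_temporal.mean(dim=-1)            # average over time -> (batch, d_model)
        h = self.subject_norm(h, subject_id)   # subject-specific normalization
        return h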
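The kernel and padding choices above determine whether the time axis survives the convolutions intact. A small check using the output-length formula documented for `torch.nn.Conv1d`; the input length of 250 is an assumption taken from the 250-sample epochs used later in this document:

```python
def conv1d_out_length(length, kernel_size, padding=0, stride=1, dilation=1):
    """Output length of a 1-D convolution (formula from the torch.nn.Conv1d docs)."""
    return (length + 2 * padding - dilation * (kernel_size - 1) - 1) // stride + 1

spatial_len = conv1d_out_length(250, kernel_size=1)                 # spatial_conv
temporal_len = conv1d_out_length(250, kernel_size=25, padding=12)   # temporal_conv
unpadded_len = conv1d_out_length(250, kernel_size=25)               # same kernel, no padding

print(spatial_len, temporal_len, unpadded_len)
```

Both layers preserve the sequence length, so the final mean over the last dimension averages over every time step of the epoch.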
LLM Text Generation
Critical insight: generate a textual description of each image class with an LLM, and use these descriptions as the language modality during contrastive training.
def generate_image_descriptions(image_classes, model):
    """Generate rich textual descriptions for contrastive training.

    `model` is any LLM client object exposing `generate(prompt) -> str`
    (for example a thin wrapper around GPT-4). A bare model-name string
    such as 'gpt-4' has no such method and will not work here.
    """
    descriptions = {}
    template = """Describe the image of '{class_name}' in detail,
including:
1. Visual appearance and shape
2. Color and texture
3. Typical context/scene
4. Distinctive features
Keep description factual and visually-grounded (3-4 sentences)."""
    for cls in image_classes:
        response = model.generate(template.format(class_name=cls))
        descriptions[cls] = response
    return descriptions
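The description pipeline can be exercised offline before spending any LLM calls. The sketch below uses a hypothetical `StubLLM` stand-in (not a real client) and a hypothetical `describe_classes` variant that skips classes it has already described:

```python
PROMPT_TEMPLATE = (
    "Describe the image of '{class_name}' in detail, including:\n"
    "1. Visual appearance and shape\n"
    "2. Color and texture\n"
    "3. Typical context/scene\n"
    "4. Distinctive features\n"
    "Keep description factual and visually-grounded (3-4 sentences)."
)

class StubLLM:
    """Hypothetical stand-in for a real LLM client; records the prompts it receives."""
    def __init__(self):
        self.prompts = []

    def generate(self, prompt):
        self.prompts.append(prompt)
        return f"[description for prompt #{len(self.prompts)}]"

def describe_classes(image_classes, llm):
    """One description per unique class, so repeated classes cost a single call."""
    descriptions = {}
    for cls in image_classes:
        if cls not in descriptions:
            descriptions[cls] = llm.generate(PROMPT_TEMPLATE.format(class_name=cls))
    return descriptions

llm = StubLLM()
descriptions = describe_classes(["aardvark", "banjo", "aardvark"], llm)
print(len(descriptions), len(llm.prompts))
```

Swapping `StubLLM` for a real client only requires that the replacement expose the same `generate(prompt)` method.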
Key Finding: Compact Embeddings Win
# Zero-shot accuracy (percent) by image encoder
image_encoders_comparison = {
    'CN-CLIP (compact)': {'top1': 54.1, 'top5': 83.4},
    'CLIP ViT-B/32': {'top1': 41.2, 'top5': 72.1},
    'CLIP ViT-L/14': {'top1': 38.7, 'top5': 69.3},
    'prior_SOTA': {'top1': 32.4, 'top5': 64.0}
}
Benchmark Results
| Method | Top-1 (200-way) | Top-5 (200-way) |
|---|---|---|
| EEGClip (2022) | 15.6% | 42.3% |
| BraVL (2023) | 24.8% | 56.1% |
| ATMS (2024) | 32.4% | 64.0% |
| MindAlign (2026) | 54.1% | 83.4% |
Significance: Wilcoxon p < 0.01 vs all baselines, confirmed on Things-EEG2 and Things-MEG datasets.
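For readers reproducing the Top-1 and Top-5 columns, the metric can be computed from a trial-by-class similarity matrix as sketched below. The scores are made-up toy values, and `top_k_accuracy` is an illustrative helper rather than the paper's evaluation code:

```python
def top_k_accuracy(similarity, true_idx, k):
    """Fraction of trials whose true class is among the k highest-scoring classes.

    similarity[i][j] is the score of trial i against gallery class j.
    """
    hits = 0
    for scores, target in zip(similarity, true_idx):
        ranked = sorted(range(len(scores)), key=lambda j: scores[j], reverse=True)
        hits += target in ranked[:k]
    return hits / len(true_idx)

# Toy example: 4 trials scored against 5 classes
similarity = [
    [0.9, 0.1, 0.2, 0.0, 0.3],   # true class 0 ranked 1st
    [0.2, 0.4, 0.8, 0.1, 0.0],   # true class 1 ranked 2nd
    [0.5, 0.6, 0.1, 0.7, 0.2],   # true class 2 ranked 5th
    [0.1, 0.2, 0.3, 0.9, 0.0],   # true class 3 ranked 1st
]
true_idx = [0, 1, 2, 3]

print(top_k_accuracy(similarity, true_idx, k=1))
print(top_k_accuracy(similarity, true_idx, k=2))
```

In the 200-way setting each row would hold 200 scores, one per candidate class.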
Implementation Steps
1. Data Preparation
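import mne
from datasets import load_dataset

dataset = load_dataset('things-eeg2')

def preprocess_eeg(raw_eeg, events, sfreq=1000, target_sfreq=250):
    # raw_eeg: float64 array (n_channels, n_samples); events: stimulus onsets
    filtered = mne.filter.filter_data(raw_eeg, sfreq, 0.1, 100)   # 0.1-100 Hz band-pass
    resampled = mne.filter.resample(filtered, up=target_sfreq, down=sfreq)   # 1000 Hz -> 250 Hz
    # epoch_data is a helper not defined in this document; it cuts the
    # -0.2 s to 0.8 s window around each event
    epochs = epoch_data(resampled, events, tmin=-0.2, tmax=0.8)
    return epochs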
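The epoching helper is not shown, but the index arithmetic it has to perform is simple. A sketch under two assumptions: event onsets are stored as sample indices at the original rate, and the window is half-open. Both function names are hypothetical:

```python
def rescale_onset(onset_sample, sfreq, target_sfreq):
    """Map an event index from the original to the resampled time axis."""
    return int(round(onset_sample * target_sfreq / sfreq))

def epoch_sample_range(onset_sample, sfreq, tmin=-0.2, tmax=0.8):
    """Return (start, stop) sample indices for the window [tmin, tmax) around an onset."""
    start = onset_sample + int(round(tmin * sfreq))
    stop = onset_sample + int(round(tmax * sfreq))
    return start, stop

onset_250 = rescale_onset(4000, sfreq=1000, target_sfreq=250)   # onset at 4.0 s
start, stop = epoch_sample_range(onset_250, sfreq=250)
print(onset_250, start, stop, stop - start)
```

Each epoch therefore spans 250 samples at 250 Hz, which matches the `n_timepoints=250` passed to the Stage 1 model.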
2. Stage 1 Pre-training
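from torch.optim import AdamW

# n_timepoints=250: a 1 s epoch (-0.2 s to 0.8 s) sampled at 250 Hz
model = EEGMaskedAutoencoder(n_channels=128, n_timepoints=250)
optimizer = AdamW(model.parameters(), lr=1e-4, weight_decay=0.01)
for epoch in range(100):
    for eeg_batch in unlabeled_eeg_loader:
        optimizer.zero_grad()                        # clear gradients from the previous step
        loss, _ = model(eeg_batch, mask_ratio=0.75)
        loss.backward()
        optimizer.step()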
3. Stage 2 Contrastive Training
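from torch.optim import AdamW

eeg_encoder = model.transformer   # Stage 1 encoder (named `transformer` in EEGMaskedAutoencoder)
img_encoder = load_cn_clip()
txt_encoder = load_text_encoder()
mindalign = MindAlignTraining(eeg_encoder, img_encoder, txt_encoder)
# llm_client: any object exposing generate(prompt) -> str
descriptions = generate_image_descriptions(all_image_classes, llm_client)
# Optimizer settings are reused from Stage 1 as a placeholder; the source does not state them
optimizer = AdamW(mindalign.parameters(), lr=1e-4, weight_decay=0.01)
for epoch in range(50):
    # The loader must also yield each trial's image class so its description can be looked up
    for eeg, images, image_classes, subject_ids in labeled_loader:
        texts = [descriptions[img_class] for img_class in image_classes]
        optimizer.zero_grad()
        loss = mindalign(eeg, images, texts, subject_ids)
        loss.backward()
        optimizer.step()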
4. Zero-Shot Decoding
import torch
import torch.nn.functional as F

def zero_shot_decode(eeg_trial, image_gallery, model, k=5):
    """
    Given an EEG trial, return the k gallery images most similar to it.
    """
    with torch.no_grad():                                       # inference only
        e_eeg = model.eeg_enc(eeg_trial).reshape(1, -1)         # (1, d)
        e_imgs = model.img_enc(image_gallery)                   # (N, d)
    similarities = F.cosine_similarity(e_eeg, e_imgs, dim=-1)   # (N,)
    top_k_idx = similarities.topk(k).indices
    return image_gallery[top_k_idx]
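The retrieval rule is plain nearest-neighbour search under cosine similarity. A dependency-free illustration on toy 2-D vectors, where the query stands in for an EEG embedding and the gallery for image embeddings (all values are made up):

```python
import math

def cosine(u, v):
    dot = sum(x * y for x, y in zip(u, v))
    return dot / (math.sqrt(sum(x * x for x in u)) * math.sqrt(sum(y * y for y in v)))

def retrieve_top_k(query, gallery, k=5):
    """Return gallery indices sorted by decreasing cosine similarity to the query."""
    scores = [cosine(query, g) for g in gallery]
    order = sorted(range(len(gallery)), key=lambda i: scores[i], reverse=True)
    return order[:k]

gallery = [[1.0, 0.0], [0.7, 0.7], [0.0, 1.0], [-1.0, 0.0]]
query = [0.9, 0.3]

top2 = retrieve_top_k(query, gallery, k=2)
print(top2)
```

Because cosine similarity ignores vector length, only the direction of the EEG embedding matters for the ranking.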
Neurophysiological Alignment
The paper validates that decoding patterns match established neuroscience:
temporal_importance = analyze_temporal_attention(model)
Pitfalls
- Subject variability: EEG varies enormously across subjects; subject adapters are critical
- CN-CLIP key: In the reported comparison, replacing CN-CLIP with CLIP ViT-B/32 or CLIP ViT-L/14 lowers Top-1 accuracy from 54.1% to 41.2% and 38.7%, respectively
- Text balance: Too much text weight overwhelms EEG-image signal (α=0.3 optimal)
- Masked pretraining: Stage 1 without sufficient unlabeled data will hurt Stage 2
- Things-EEG2 specific: Timing window (-0.2 to 0.8s) tuned for this dataset; adjust for other paradigms
- Compact geometry rule: Always test multiple image encoders; compact usually beats large for EEG alignment
Key References
- Primary: Chen et al. (2026). "MindAlign: Bridging EEG, Vision, and Language for Zero-Shot Visual Decoding." arXiv:2605.24523
- Things-EEG2 dataset: Gifford et al. (2022)
- CN-CLIP: Yang et al. (2022)
- CLIP: Radford et al. (2021)
- Code: https://github.com/anon-eeg/eeg_image_decoding
Activation Keywords
EEG visual decoding, brain-computer interface, zero-shot image retrieval, contrastive learning, tri-modal alignment, EEG-vision-language, Things-EEG2, masked autoencoder pretraining, neural image reconstruction, non-invasive BCI, MindAlign, EEG zero-shot