| name | whisper-ecog-alignment |
| description | Whisper-ECoG alignment methodology mapping speech foundation model representations to human cortical activity using interpretable time-resolved neural encoding |
| category | neuroscience |
| activation_keywords | ["whisper","ecog","speech encoding","brain alignment","speech foundation model","temporal encoding","phoneme organization","cortical speech processing","time-resolved neural encoding","soft attention","hierarchical brain alignment"] |
| version | 1.0.0 |
| paper_id | arXiv:2606.02305 |
| authors | ["Matteo Ciferri","Tommaso Boccato","Michal Olak","Matteo Ferrante","Nicola Toschi"] |
| date | "2026-06-01T00:00:00.000Z" |
| conference | ICLR 2026 Workshop on Representational Alignment (Re-Align) |
Whisper-ECoG Alignment: Speech Foundation Models & Brain Representations
Core Discovery
Mapping Whisper Representations to Human ECoG Responses - Speech foundation models (Whisper) provide a useful framework for studying time-resolved cortical speech representations, with intermediate layers showing strongest brain alignment.
Key Findings
1. Hierarchical Brain-Model Alignment
Whisper Layer Correspondence with Neural Activity:
- Intermediate Whisper layers (not early/late) provide strongest correspondence with ECoG responses
- Supports hierarchical match between model representations and cortical speech processing
- Layer-wise brain alignment reveals progressive abstraction in both brain and model
layer_alignment = {
    'early_layers': 0.45,
    'intermediate_layers': 0.72,
    'late_layers': 0.58,
}
2. Time-Resolved Neural Encoder Architecture
Novel Encoder Components:
import torch.nn as nn

class TimeResolvedNeuralEncoder(nn.Module):
    """
    Interpretable neural encoder combining:
    1. Speech embeddings (from Whisper)
    2. Recurrent temporal model (for dynamics)
    3. Soft attention (for temporal alignment)
    """
    def __init__(self, whisper_dim, hidden_dim, attention_heads, num_ecog_channels=64):
        # num_ecog_channels=64 is a placeholder default (assumption); set it to
        # the number of recorded electrodes
        super().__init__()
        self.embedding_layer = nn.Linear(whisper_dim, hidden_dim)
        self.recurrent_model = nn.LSTM(hidden_dim, hidden_dim, batch_first=True)
        # batch_first=True keeps attention in the LSTM's (batch, time, feature) layout
        self.attention = nn.MultiheadAttention(
            hidden_dim, attention_heads, batch_first=True
        )
        self.neural_decoder = nn.Linear(hidden_dim, num_ecog_channels)

    def forward(self, whisper_embeddings, time_steps=None):
        # whisper_embeddings: (batch, time, whisper_dim)
        # time_steps is accepted for interface compatibility but not used here
        processed = self.embedding_layer(whisper_embeddings)
        temporal_features, _ = self.recurrent_model(processed)
        # Self-attention over the recurrent features; weights are (batch, time, time)
        attended, attention_weights = self.attention(
            temporal_features, temporal_features, temporal_features
        )
        ecog_predictions = self.neural_decoder(attended)
        return ecog_predictions, attention_weights
Why This Architecture Works:
- Temporal structure: ECoG has high temporal resolution → recurrent model captures dynamics
- Soft attention: Reveals local temporal alignment between embeddings and neural responses
- Linear baseline comparison: Temporally structured modeling outperforms simple linear mappings
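The linear baseline named in the last bullet can be made concrete. The sketch below is an assumption about what such a baseline might look like, not the authors' implementation: a closed-form ridge regression from time-lagged embeddings to ECoG channels, using only NumPy. The names `add_lags` and `fit_ridge`, the lag set, the regularization strength, and the synthetic data are all hypothetical.

```python
import numpy as np

def add_lags(features, lags):
    """Stack time-shifted copies of `features` (time, dim) along the feature axis.

    A positive lag k places the embedding from time t - k at row t, so the
    regression can use past stimulus context. Rows without history stay zero.
    """
    n_time, n_dim = features.shape
    lagged = np.zeros((n_time, n_dim * len(lags)))
    for i, lag in enumerate(lags):
        shifted = np.zeros_like(features)
        if lag == 0:
            shifted[:] = features
        else:
            shifted[lag:] = features[:-lag]
        lagged[:, i * n_dim:(i + 1) * n_dim] = shifted
    return lagged

def fit_ridge(X, Y, alpha=1.0):
    """Closed-form ridge weights: W = (X^T X + alpha I)^-1 X^T Y."""
    gram = X.T @ X + alpha * np.eye(X.shape[1])
    return np.linalg.solve(gram, X.T @ Y)

# Synthetic check: targets that depend on the embedding two steps back
rng = np.random.default_rng(0)
embeddings = rng.standard_normal((500, 8))   # (time, embedding_dim)
true_weights = rng.standard_normal((8, 4))   # 4 hypothetical channels
ecog = np.zeros((500, 4))
ecog[2:] = embeddings[:-2] @ true_weights

X = add_lags(embeddings, lags=[0, 1, 2, 3])
W = fit_ridge(X, ecog, alpha=1e-3)
predictions = X @ W
```

A baseline like this only sees a fixed window of lags, which is one way to frame what a recurrent model with attention adds: a learned, variable-length temporal context.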
3. Phonemic Interpretability Analysis
Anatomically Coherent Phoneme Organization:
phoneme_categories = {
    'obstruents': ['frontal_left_channels'],
    'sonorants': ['temporal_right_channels'],
    'vowels': ['auditory_cortex'],
    'consonants': ['motor_areas'],
}
def analyze_phoneme_organization(electrodes, phoneme_labels):
    """
    Identify anatomically coherent phoneme-category organization
    among encoding-informative electrodes
    """
    encoding_scores = compute_encoding_performance(electrodes)
    informative_electrodes = select_top_k(encoding_scores, k=20)
    phoneme_clusters = cluster_by_phoneme_response(
        informative_electrodes, phoneme_labels
    )
    anatomical_coherence = verify_anatomical_grouping(phoneme_clusters)
    return phoneme_clusters, anatomical_coherence
Interpretability Result: Electrodes that best encode speech show organized phoneme categories aligned with anatomical regions.
Methodology Steps
Step 1: Extract Whisper Embeddings
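import torch
import whisper

def extract_whisper_embeddings(audio_path, model_size='base'):
    """
    Extract layer-wise Whisper encoder embeddings for one audio file
    Returns a dict mapping 'layer_{i}' to a (frames, dim) tensor,
    one entry per encoder transformer block
    """
    model = whisper.load_model(model_size)
    # The encoder expects a 30 s log-mel spectrogram, not raw audio samples
    audio = whisper.pad_or_trim(whisper.load_audio(audio_path))
    mel = whisper.log_mel_spectrogram(audio, n_mels=model.dims.n_mels).to(model.device)
    embeddings = {}
    hooks = []
    # Forward hooks capture every block's output in a single encoder pass
    for layer_idx, block in enumerate(model.encoder.blocks):
        hooks.append(block.register_forward_hook(
            lambda _module, _inputs, output, i=layer_idx:
                embeddings.__setitem__(f'layer_{i}', output[0].detach())
        ))
    with torch.no_grad():
        model.encoder(mel.unsqueeze(0))
    for hook in hooks:
        hook.remove()
    return embeddings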
Step 2: Record/Process ECoG Data
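def process_ecog_data(raw_ecog, sampling_rate, audio_timeline):
    """
    Preprocess intracranial ECoG recordings (assumed channels x time)
    Key steps:
    1. Bandpass filter (speech-relevant frequencies)
    2. Normalize each channel over time
    3. Segment by speech timeline
    bandpass_filter, z_score_normalize and align_with_audio_segments are
    placeholder helpers that the surrounding codebase must supply
    """
    filtered = bandpass_filter(raw_ecog, lowcut=1, highcut=100, fs=sampling_rate)
    # axis=1 is the time axis, so each channel is standardized independently
    normalized = z_score_normalize(filtered, axis=1)
    aligned_ecog = align_with_audio_segments(normalized, audio_timeline)
    return aligned_ecog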
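The filtering and normalization steps in Step 2 rely on helpers that the text does not define. A minimal sketch of two of them follows, assuming a (channels, time) array and a zero-phase Butterworth filter from SciPy; the filter order and the synthetic test signal are assumptions, and the source does not say which filter design was used.

```python
import numpy as np
from scipy.signal import butter, sosfiltfilt

def bandpass_filter(data, lowcut, highcut, fs, order=4):
    """Zero-phase Butterworth band-pass along the time axis (last axis)."""
    sos = butter(order, [lowcut, highcut], btype='bandpass', fs=fs, output='sos')
    return sosfiltfilt(sos, data, axis=-1)

def z_score_normalize(data, axis=-1):
    """Standardize to zero mean and unit variance along `axis`."""
    mean = data.mean(axis=axis, keepdims=True)
    std = data.std(axis=axis, keepdims=True)
    return (data - mean) / np.where(std == 0, 1.0, std)

# Synthetic (channels, time) recording: a 10 Hz component that should pass,
# plus 200 Hz interference and a DC offset that should be removed
fs = 1000
t = np.arange(2000) / fs
raw = np.vstack([
    np.sin(2 * np.pi * 10 * t) + 0.5 * np.sin(2 * np.pi * 200 * t),
    2.0 * np.sin(2 * np.pi * 10 * t) + 3.0,
])
filtered = bandpass_filter(raw, lowcut=1, highcut=100, fs=fs)
normalized = z_score_normalize(filtered, axis=1)
```

Second-order sections with forward-backward filtering are used here because they stay numerically stable for narrow low-frequency edges and introduce no phase delay, which matters when neural responses are later aligned to audio timestamps.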
Step 3: Train Time-Resolved Encoder
import torch
import torch.nn.functional as F

def train_encoder(whisper_embeddings, ecog_data, epochs=100):
    """
    Train time-resolved neural encoder
    Loss: Predict actual ECoG responses from Whisper embeddings
    """
    # whisper_dim=512 matches the encoder width of Whisper 'base'
    encoder = TimeResolvedNeuralEncoder(
        whisper_dim=512, hidden_dim=256, attention_heads=4
    )
    optimizer = torch.optim.Adam(encoder.parameters(), lr=1e-4)
    for epoch in range(epochs):
        predictions, attention_weights = encoder(
            whisper_embeddings, ecog_data.timestamps
        )
        # Mean squared error stands in for the unspecified prediction loss (assumption)
        loss = F.mse_loss(predictions, ecog_data.signals)
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
    # attention_weights come from the final epoch's forward pass
    return encoder, attention_weights
Step 4: Layer-wise Alignment Analysis
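import torch

def analyze_layer_alignment(whisper_embeddings, ecog_data):
    """
    Compare encoding performance across Whisper layers
    Find intermediate layers with strongest brain alignment
    """
    layer_scores = {}
    for layer_name, embeddings in whisper_embeddings.items():
        # Fit a separate encoder per layer: an encoder trained on one layer's
        # embeddings cannot be scored fairly on another layer's
        encoder, _ = train_encoder(embeddings, ecog_data)
        with torch.no_grad():
            predictions, _ = encoder(embeddings, ecog_data.timestamps)
        # For a real comparison, score on held-out data rather than the training set
        score = compute_correlation(predictions, ecog_data.signals)
        layer_scores[layer_name] = score
    best_layer = max(layer_scores, key=layer_scores.get)
    return layer_scores, best_layer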
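Step 4 calls `compute_correlation` without defining it. One common choice for an encoding score, assumed here rather than taken from the source, is the Pearson correlation between predicted and recorded activity per electrode, averaged across electrodes. The sketch works on NumPy arrays of shape (time, channels); `per_channel_correlation` is a hypothetical helper name.

```python
import numpy as np

def per_channel_correlation(predictions, signals):
    """Pearson r between predicted and recorded traces, one value per channel.

    Both inputs are (time, channels) arrays. Channels with zero variance
    in either input get r = 0 instead of NaN.
    """
    pred = predictions - predictions.mean(axis=0)
    true = signals - signals.mean(axis=0)
    covariance = (pred * true).sum(axis=0)
    scale = np.sqrt((pred ** 2).sum(axis=0) * (true ** 2).sum(axis=0))
    return covariance / np.where(scale == 0, np.inf, scale)

def compute_correlation(predictions, signals):
    """Scalar encoding score: mean per-channel Pearson r."""
    return float(per_channel_correlation(predictions, signals).mean())

# Synthetic example: predictions equal to the signal plus independent noise
rng = np.random.default_rng(0)
signals = rng.standard_normal((200, 3))
noisy_predictions = signals + 0.5 * rng.standard_normal((200, 3))
score = compute_correlation(noisy_predictions, signals)
```

Keeping the per-channel values available is useful for the later electrode-level analyses, since the scalar mean hides which electrodes are well predicted.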
Step 5: Phoneme Interpretability
def phoneme_interpretability(encoder, electrodes, phoneme_labels, threshold):
    """
    Analyze phoneme-category organization in encoding electrodes
    threshold: minimum encoding score for an electrode to count as informative
    """
    electrode_scores = compute_per_electrode_encoding(encoder)
    informative_electrodes = electrodes[electrode_scores > threshold]
    phoneme_responses = extract_phoneme_responses(
        informative_electrodes, phoneme_labels
    )
    clusters = cluster_electrodes_by_phoneme(phoneme_responses)
    coherence = check_anatomical_grouping(clusters)
    return clusters, coherence
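The clustering helpers in Step 5 are left abstract. As one simple illustration, assumed here and not necessarily the grouping used in the paper, each electrode can be assigned to the phoneme category that evokes its largest mean response. The function names, the tuning matrix, and the synthetic responses are hypothetical.

```python
import numpy as np

def category_profiles(responses, labels, categories):
    """Mean response per electrode and phoneme category.

    responses: (n_events, n_electrodes) array, one row per phoneme occurrence
    labels:    length-n_events sequence of category names
    returns:   (n_electrodes, n_categories) array
    """
    labels = np.asarray(labels)
    return np.stack(
        [responses[labels == category].mean(axis=0) for category in categories],
        axis=1,
    )

def group_by_preferred_category(profiles, categories):
    """Group electrode indices by the category that drives them most."""
    preferred = profiles.argmax(axis=1)
    return {
        category: np.flatnonzero(preferred == i).tolist()
        for i, category in enumerate(categories)
    }

# Synthetic example: 3 electrodes, each tuned to one category
categories = ['obstruents', 'sonorants', 'vowels']
labels = ['obstruents', 'vowels', 'sonorants', 'vowels', 'obstruents', 'sonorants']
tuning = np.array([[1.0, 0.1, 0.2],    # electrode 0 prefers obstruents
                   [0.2, 0.1, 1.0],    # electrode 1 prefers vowels
                   [0.1, 1.0, 0.3]])   # electrode 2 prefers sonorants
rng = np.random.default_rng(0)
responses = np.stack([tuning[:, categories.index(label)] for label in labels])
responses = responses + 0.01 * rng.standard_normal(responses.shape)

profiles = category_profiles(responses, labels, categories)
clusters = group_by_preferred_category(profiles, categories)
```

The resulting groups can then be compared against electrode anatomy to ask whether electrodes sharing a preferred category also share a cortical region.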
Critical Insights
1. Why Intermediate Layers Align Best
Explanation:
- Early layers: Too acoustic (low-level features) → limited semantic abstraction
- Late layers: Too abstract (high-level semantics) → lose temporal precision
- Intermediate layers: Balance of acoustic + semantic + temporal structure → matches cortical processing hierarchy
cortical_hierarchy = {
    'primary_auditory': {'level': 0, 'features': 'acoustic'},
    'secondary_auditory': {'level': 1, 'features': 'phoneme'},
    'association_areas': {'level': 2, 'features': 'semantic'},
}
# Layer ranges below assume a 24-layer encoder (e.g. Whisper medium);
# smaller models such as 'base' (6 encoder layers) need the ranges rescaled
whisper_hierarchy = {
    'layers_0-6': {'level': 0, 'features': 'acoustic'},
    'layers_7-12': {'level': 1, 'features': 'intermediate'},
    'layers_13-24': {'level': 2, 'features': 'semantic'},
}
2. Temporal Structure Importance
Key Finding: High-resolution ECoG benefits from temporally structured modeling beyond linear mappings.
encoding_methods = {
    'linear_mapping': {'score': 0.52},
    'temporal_encoder': {'score': 0.72},
    'linear_with_attention': {'score': 0.65},
}
Why: ECoG captures temporal dynamics at millisecond resolution → simple linear mappings miss temporal structure.
3. Attention Reveals Local Alignment
Attention Map Interpretation:
- Attention weights show when (temporally) embeddings align with neural responses
- Local peaks indicate specific speech moments with strongest encoding
- Attention provides interpretability into temporal alignment dynamics
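One way to read a self-attention map along the lines of the bullets above is to average the weights along each diagonal, which gives the mean attention as a function of the offset between the attended time step and the current one. This is a sketch under assumptions: it takes a single (query_time, key_time) matrix, such as one batch element of the head-averaged weights that `nn.MultiheadAttention` returns by default, and the function name and synthetic map are hypothetical.

```python
import numpy as np

def attention_lag_profile(attention, max_lag):
    """Average attention weight as a function of (key index - query index).

    attention: (n_query, n_key) matrix whose rows sum to 1
    max_lag:   largest absolute offset to evaluate, smaller than the matrix size
    """
    lags = np.arange(-max_lag, max_lag + 1)
    profile = np.array(
        [np.diagonal(attention, offset=lag).mean() for lag in lags]
    )
    return lags, profile

# Synthetic map: every time step attends mostly to the step 3 frames earlier
n = 50
scores = np.full((n, n), -5.0)
for step in range(3, n):
    scores[step, step - 3] = 5.0
attention = np.exp(scores) / np.exp(scores).sum(axis=1, keepdims=True)

lags, profile = attention_lag_profile(attention, max_lag=10)
peak_lag = int(lags[np.argmax(profile)])  # -3 by construction
```

A sharp peak at a consistent negative lag would suggest that predictions at each moment draw on a specific earlier stretch of speech, while a flat profile would suggest the attention carries little temporal information.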
Applications
1. Speech Foundation Model Evaluation
Use Case: Evaluate Whisper (and other speech models) for biological plausibility.
def evaluate_speech_model(model, ecog_dataset):
    """
    Assess biological plausibility via brain alignment
    """
    # hidden_dim and attention_heads mirror the values used in train_encoder
    encoder = TimeResolvedNeuralEncoder(
        model.embedding_dim, hidden_dim=256, attention_heads=4
    )
    alignment_score = train_and_score(encoder, model, ecog_dataset)
    return alignment_score
2. Neural Speech Decoding
Use Case: Decode neural activity to predict perceived speech.
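class NeuralToSpeechDecoder(nn.Module):
    """
    Conceptual sketch only: TimeResolvedNeuralEncoder defines no `inverse`
    method and `WhisperDecoder` is a placeholder, so both must be implemented
    before this runs.
    """
    def __init__(self):
        super().__init__()
        self.neural_encoder = TimeResolvedNeuralEncoder(...)
        self.speech_decoder = WhisperDecoder()

    def decode(self, ecog_signals):
        neural_features = self.neural_encoder.inverse(ecog_signals)
        predicted_speech = self.speech_decoder(neural_features)
        return predicted_speech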
3. Phoneme-Level Brain Mapping
Use Case: Map phoneme representations to cortical regions.
phoneme_cortex_map = {
    'obstruents': 'left_frontal',
    'vowels': 'auditory_cortex',
    'nasals': 'motor_areas',
}
Technical Pitfalls
Pitfall 1: Overfitting to Specific Whisper Layer
Problem: Training encoder on single "best" layer without exploring alternatives.
Solution:
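layer_scores = {
    layer: evaluate_encoder_on_layer(layer) for layer in all_whisper_layers
}
# Rank the layers above threshold by score so that [:3] selects the top three
best_layers = sorted(
    (layer for layer, score in layer_scores.items() if score > threshold),
    key=layer_scores.get,
    reverse=True,
)
multi_layer_encoder = combine_top_layers(best_layers[:3])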
Pitfall 2: Ignoring Temporal Dynamics
Problem: Using only linear mappings without temporal modeling.
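Solution: Always include a recurrent temporal component for ECoG. nn.LSTM returns an (output, state) tuple, so it needs a small wrapper module rather than nn.Sequential:
class MinimalEncoder(nn.Module):
    def __init__(self, embedding_dim, hidden_dim, ecog_channels):
        super().__init__()
        self.project = nn.Linear(embedding_dim, hidden_dim)
        self.lstm = nn.LSTM(hidden_dim, hidden_dim, batch_first=True)
        self.readout = nn.Linear(hidden_dim, ecog_channels)
    def forward(self, x):
        output, _ = self.lstm(self.project(x))  # discard the (h, c) state
        return self.readout(output)
minimal_encoder = MinimalEncoder(embedding_dim, hidden_dim, ecog_channels)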
Pitfall 3: Misinterpreting Attention Weights
Problem: Treating attention peaks as "true" alignment without validation.
Solution: Validate attention against ground-truth speech timeline:
def validate_attention(attention_weights, speech_timeline):
    """
    Check if attention peaks align with actual speech events
    """
    peaks = extract_attention_peaks(attention_weights)
    speech_events = identify_speech_events(speech_timeline)
    alignment = compute_peak_event_alignment(peaks, speech_events)
    assert alignment > 0.7, "Attention validation failed"
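The validation above depends on two helpers that are not defined in the text. A minimal sketch of what they might compute follows, assuming the attention map has already been reduced to a 1-D salience trace over time (for example, the total attention each time step receives) and that speech events are given as sample indices. The peak criterion, the `min_height` and `tolerance` parameters, and the example values are assumptions.

```python
import numpy as np

def extract_attention_peaks(salience, min_height):
    """Indices of local maxima of a 1-D trace that reach `min_height`."""
    s = np.asarray(salience, dtype=float)
    is_peak = (
        (s[1:-1] > s[:-2])          # rises from the previous sample
        & (s[1:-1] >= s[2:])        # does not rise into the next sample
        & (s[1:-1] >= min_height)   # is large enough to count
    )
    return np.flatnonzero(is_peak) + 1

def compute_peak_event_alignment(peaks, events, tolerance):
    """Fraction of events with at least one peak within `tolerance` samples."""
    peaks = np.asarray(peaks)
    events = np.asarray(events)
    if peaks.size == 0 or events.size == 0:
        return 0.0
    nearest = np.abs(events[:, None] - peaks[None, :]).min(axis=1)
    return float((nearest <= tolerance).mean())

# Synthetic trace with three bumps; two of the three events fall near a bump
salience = np.zeros(100)
salience[[20, 50, 80]] = [1.0, 0.8, 0.9]
events = [21, 49, 65]

peaks = extract_attention_peaks(salience, min_height=0.5)
alignment = compute_peak_event_alignment(peaks, events, tolerance=2)
```

Reporting the alignment fraction alongside the tolerance makes the check easier to interpret than a bare pass or fail, because a generous tolerance can make almost any peak pattern look aligned.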
Pitfall 4: Phoneme Organization False Discovery
Problem: Finding phoneme clusters that don't reflect actual phoneme processing.
Solution: Cross-validate with behavioral phoneme tasks:
def validate_phoneme_clusters(clusters, behavioral_data):
    """
    Verify phoneme organization reflects actual phoneme perception
    """
    discrimination_scores = behavioral_data.phoneme_discrimination
    correlation = correlate_clusters_with_behavior(clusters, discrimination_scores)
    return correlation > 0.6
Validation Procedures
1. Encoding Score Thresholds
encoding_quality_thresholds = {
    'poor': 0.4,
    'moderate': 0.55,
    'good': 0.65,
    'excellent': 0.72,
}
2. Temporal Precision Check
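import torch

def check_temporal_precision(encoder, whisper_embeddings, ecog_data):
    """
    Verify encoder captures millisecond-level dynamics
    """
    # nn.Module has no predict(); call the encoder on the stimulus embeddings
    with torch.no_grad():
        predictions, _ = encoder(whisper_embeddings, ecog_data.timestamps)
    temporal_correlation = compute_temporal_structure_correlation(
        predictions, ecog_data.signals
    )
    return temporal_correlation > 0.8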
3. Phoneme Organization Coherence
def verify_anatomical_coherence(phoneme_clusters):
    """
    Ensure phoneme categories align with known cortical regions
    """
    for category, electrodes in phoneme_clusters.items():
        # Collect the region of every electrode in the cluster
        regions = {get_anatomical_region(electrode) for electrode in electrodes}
        assert len(regions) == 1, f"{category} spans multiple regions: {regions}"
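Requiring every electrode in a cluster to share one region is a strict test, and with few electrodes some apparent coherence can arise by chance. As a complementary check, offered as an assumption and not as part of the source methodology, a purity score can be compared against a label-shuffling null. The function names, the region labels `'STG'` and `'vSMC'`, and the example data are hypothetical.

```python
import numpy as np
from collections import Counter

def region_purity(cluster_ids, regions):
    """Fraction of electrodes that lie in their cluster's majority region."""
    cluster_ids = np.asarray(cluster_ids)
    regions = np.asarray(regions)
    matched = 0
    for cluster in np.unique(cluster_ids):
        members = regions[cluster_ids == cluster].tolist()
        matched += Counter(members).most_common(1)[0][1]
    return matched / len(regions)

def purity_permutation_test(cluster_ids, regions, n_perm=1000, seed=0):
    """Observed purity and the p-value of seeing it under shuffled cluster labels."""
    rng = np.random.default_rng(seed)
    observed = region_purity(cluster_ids, regions)
    null = np.array([
        region_purity(rng.permutation(cluster_ids), regions)
        for _ in range(n_perm)
    ])
    # Add-one correction keeps the estimated p-value strictly positive
    p_value = (1 + np.sum(null >= observed)) / (n_perm + 1)
    return observed, float(p_value)

# Synthetic example: two clusters that map perfectly onto two regions
cluster_ids = [0] * 6 + [1] * 6
regions = ['STG'] * 6 + ['vSMC'] * 6
observed, p_value = purity_permutation_test(cluster_ids, regions)
```

A high purity with a small p-value supports anatomical coherence, whereas a high purity that the shuffled labels reach just as often is more likely a by-product of electrode coverage.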
Integration with Other Skills
Related Skills
- [[brain-digital-twins-execution-semantics]] - Brain digital twin methodology
- [[neural-encoding-evaluation-ground-truth]] - Neural encoding evaluation frameworks
- [[vlm-visual-cortex-alignment-robustness]] - Vision-language model brain alignment
- [[neuromorphic-supremacy-hybrid-astrocytic-spiking]] - Neuromorphic architectures
- [[sae-optimality-structures-dictionaries]] - Sparse autoencoder interpretability
- [[whisper-speech-recognition]] - Whisper model applications
Key References
Methodology Papers
- Whisper (Radford et al., 2022) - Speech foundation model
- ECoG speech encoding (Pasley et al., 2012) - Neural speech decoding
- Temporal neural encoding (Kell et al., 2018) - Time-resolved models
Brain-Model Alignment Papers
- Vision-brain alignment (Schrimpf et al., 2021) - VGG/ResNet vs. IT cortex
- Language-brain alignment (Caucheteux & King, 2022) - GPT vs. language cortex
- Speech-brain alignment (Vaidya et al., 2022) - SpeechNet vs. auditory cortex
Activation Keywords
Primary: whisper-ecog-alignment, speech encoding, brain alignment, temporal encoder
Secondary: speech foundation model, ecog, phoneme organization, cortical speech, soft attention, hierarchical alignment
Summary
Whisper-ECoG Alignment demonstrates that speech foundation models (Whisper) offer a useful framework for studying cortical speech representations. Key findings:
- Intermediate Whisper layers align most strongly with ECoG (hierarchical match)
- Time-resolved encoder with attention outperforms linear mappings
- Attention reveals local temporal alignment dynamics
- Phoneme interpretability shows anatomically coherent organization
Impact: Speech foundation models bridge computational neuroscience and AI, enabling interpretable neural encoding research.