| name | brain-dit-fmri-foundation-model |
| description | Brain-DiT universal multi-state fMRI foundation model with metadata-conditioned diffusion pretraining. Trigger words: Brain-DiT, fMRI foundation model, diffusion transformer, multi-state, metadata-conditioned |
Brain-DiT: Universal Multi-State fMRI Foundation Model
A large-scale fMRI foundation model pretrained on 349,898 sessions from 24 datasets, using metadata-conditioned diffusion with a Diffusion Transformer (DiT) architecture.
Metadata
- Source: arXiv:2604.12683v1
- Authors: Brain foundation model researchers (2026)
- Published: 2026-04-14
- Domain: fMRI Analysis, Foundation Models, Diffusion Models, Brain Imaging
Core Methodology
Key Innovation
Current fMRI foundation models rely on limited brain state ranges and mismatched pretraining tasks, restricting their ability to learn generalized representations across diverse brain states. Brain-DiT addresses this by:
- Massive Multi-Dataset Training: 349,898 sessions spanning diverse states
- Metadata-Conditioned Diffusion: Conditioning on acquisition metadata for state-aware generation
- Diffusion Transformer Architecture: DiT for learning multi-scale representations
Training Data Coverage
- Resting State: 145,230 sessions
- Task-Based: 127,450 sessions (motor, cognitive, emotion)
- Naturalistic: 54,890 sessions (movie watching, narrative)
- Clinical: 18,120 sessions (disease states)
- Sleep: 4,208 sessions
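As a quick sanity check, the per-state counts above can be summed and turned into shares of the corpus. This is only arithmetic on the numbers listed here; the dictionary keys are shorthand labels chosen for illustration.

```python
# Session counts per brain state, copied from the list above.
sessions = {
    "resting": 145_230,
    "task": 127_450,
    "naturalistic": 54_890,
    "clinical": 18_120,
    "sleep": 4_208,
}

total = sum(sessions.values())
shares = {state: count / total for state, count in sessions.items()}

print(f"total sessions: {total:,}")
for state, share in shares.items():
    print(f"{state:>12}: {share:6.1%}")
```

The counts add up to the 349,898 sessions quoted in the overview, and the split is heavily imbalanced: resting state makes up roughly 41.5% of sessions while sleep is about 1.2%.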
Technical Framework
Diffusion Pretraining
Unlike prior fMRI models using masked reconstruction in raw or latent space, Brain-DiT uses:
- Diffusion Process: Progressive denoising of brain activity patterns
- Conditional Generation: Metadata (TR, task type, scanner) as conditioning
- Multi-Scale Learning: Captures both fine-grained functional structure and global semantics
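The denoising objective can be written in the standard noise-prediction form used by DDPM-style models, which matches the training loop given later in this guide. The notation is an assumption for illustration: $x_0$ is a clean fMRI volume, $m$ the acquisition metadata, $\bar{\alpha}_t$ the cumulative noise schedule, and $\epsilon_\theta$ the network.

```latex
% Forward (noising) process at timestep t
x_t = \sqrt{\bar{\alpha}_t}\, x_0 + \sqrt{1 - \bar{\alpha}_t}\, \epsilon,
\qquad \epsilon \sim \mathcal{N}(0, I)

% Metadata-conditioned noise-prediction loss
\mathcal{L}(\theta) = \mathbb{E}_{x_0,\, m,\, t,\, \epsilon}
\left[ \left\| \epsilon - \epsilon_\theta(x_t, t, m) \right\|_2^2 \right]
```

Conditioning enters only through $\epsilon_\theta$: the noising process itself does not depend on the metadata.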
DiT Architecture Adaptations
- 3D Spatial Attention: Process volumetric fMRI data
- Temporal Modeling: Capture temporal dynamics within and across TRs
- Metadata Embedding: Encode acquisition parameters as conditioning vectors
Implementation Guide
Prerequisites
- PyTorch 2.0+
- nibabel, nilearn, and NumPy for loading and preprocessing NIfTI data
- MONAI for medical imaging
- Diffusers library for diffusion models
- Access to large-scale fMRI datasets
Step-by-Step
1. Data Preprocessing
import nibabel as nib
import numpy as np
from nilearn import signal

def preprocess_fmri(fmri_path, mask_path, tr, standardize=True):
    """
    Standard fMRI preprocessing for Brain-DiT

    Args:
        fmri_path: Path to 4D fMRI NIfTI file
        mask_path: Path to a 3D brain mask NIfTI file on the same grid
        tr: Repetition time in seconds
        standardize: Whether to z-score normalize each voxel time series
    Returns:
        preprocessed: (H, W, D, T) preprocessed fMRI data, zero outside the mask
        metadata: Dict with acquisition parameters
    """
    img = nib.load(fmri_path)
    data = img.get_fdata()  # (H, W, D, T)
    mask = nib.load(mask_path).get_fdata().astype(bool)  # (H, W, D)

    # signal.clean expects (n_timepoints, n_voxels), hence the transposes
    data_clean = signal.clean(
        data[mask].T,
        detrend=True,
        standardize=standardize,
        low_pass=0.1,    # Hz
        high_pass=0.01,  # Hz
        t_r=tr
    ).T

    # Write the cleaned time series back into the volume
    preprocessed = np.zeros_like(data)
    preprocessed[mask] = data_clean

    metadata = {
        'tr': tr,
        'n_volumes': data.shape[-1],
        'shape': data.shape[:3]
    }
    return preprocessed, metadata
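Before the volumes reach the model they need a channel axis and spatial sizes divisible by the patch size. The sketch below assumes the array built by `preprocess_fmri` has time on the last axis (as the code above constructs it) and pads with zeros at the end of each spatial axis. `to_model_volumes` is a hypothetical helper, not part of the published method.

```python
import numpy as np

def to_model_volumes(preprocessed, patch_size=8):
    """Turn a (H, W, D, T) array into (T, 1, H', W', D') zero-padded volumes.

    H', W', D' are the spatial sizes rounded up to a multiple of patch_size.
    """
    # Move time to the front: (H, W, D, T) -> (T, H, W, D)
    volumes = np.moveaxis(preprocessed, -1, 0)
    # Zero padding needed at the end of each spatial axis
    pads = [(-size) % patch_size for size in volumes.shape[1:]]
    pad_width = [(0, 0)] + [(0, p) for p in pads]
    volumes = np.pad(volumes, pad_width, mode="constant")
    # Add the channel axis expected by a 3D convolution
    return volumes[:, None].astype(np.float32)

# Synthetic stand-in for a preprocessed scan on the 91 x 109 x 91 MNI 2 mm grid
fake = np.random.default_rng(0).standard_normal((91, 109, 91, 4))
batch = to_model_volumes(fake)
print(batch.shape)  # (4, 1, 96, 112, 96)
```

Padding at the end of each axis keeps the original voxels at their original indices, so the output can be cropped back with `batch[..., :91, :109, :91]`.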
2. Metadata Embedding
import torch
import torch.nn as nn

class MetadataEmbedder(nn.Module):
    """Embed acquisition metadata into conditioning vectors"""

    def __init__(self, metadata_dims, embed_dim=512):
        # metadata_dims is accepted but not used; vocabulary sizes are fixed below
        super().__init__()
        # Categorical fields: lookup tables
        self.scanner_embed = nn.Embedding(50, 128)
        self.task_embed = nn.Embedding(100, 256)
        self.state_embed = nn.Embedding(20, 128)
        # Continuous fields: linear projections
        self.tr_projection = nn.Linear(1, 64)
        self.age_projection = nn.Linear(1, 64)
        # Fuse the concatenated embeddings into one conditioning vector
        self.combiner = nn.Sequential(
            nn.Linear(128 + 256 + 128 + 64 + 64, embed_dim),
            nn.SiLU(),
            nn.Linear(embed_dim, embed_dim)
        )

    def forward(self, scanner_id, task_id, state_id, tr, age):
        """
        Args:
            scanner_id: Scanner type indices (B,)
            task_id: Task type indices (B,)
            state_id: Brain state indices (B,)
            tr: Repetition time in seconds (B, 1)
            age: Subject age (B, 1)
        Returns:
            conditioning: (B, embed_dim)
        """
        scanner_emb = self.scanner_embed(scanner_id)
        task_emb = self.task_embed(task_id)
        state_emb = self.state_embed(state_id)
        tr_emb = self.tr_projection(tr)
        age_emb = self.age_projection(age)
        combined = torch.cat([scanner_emb, task_emb, state_emb, tr_emb, age_emb], dim=-1)
        return self.combiner(combined)
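The embedder expects integer indices for categorical fields and `(B, 1)` float tensors for continuous ones. One way to get there from raw session records is sketched below. The vocabularies, the `encode_metadata` helper, the "unknown" fallback at index 0, and the age scaling are all assumptions for illustration; the source does not list the actual categories.

```python
import torch

# Hypothetical vocabularies; the real category lists are not given in the source.
SCANNER_VOCAB = {"unknown": 0, "siemens_prisma": 1, "ge_mr750": 2}
TASK_VOCAB = {"unknown": 0, "rest": 1, "motor": 2, "movie": 3}
STATE_VOCAB = {"unknown": 0, "resting": 1, "task": 2, "naturalistic": 3}

def encode_metadata(records):
    """Convert per-session dicts into the keyword tensors MetadataEmbedder.forward takes."""
    def ids(key, vocab):
        # Unseen categories fall back to index 0 ("unknown")
        return torch.tensor([vocab.get(r[key], 0) for r in records], dtype=torch.long)

    def column(key):
        # Continuous values become float tensors of shape (B, 1)
        return torch.tensor([[float(r[key])] for r in records], dtype=torch.float32)

    return {
        "scanner_id": ids("scanner", SCANNER_VOCAB),
        "task_id": ids("task", TASK_VOCAB),
        "state_id": ids("state", STATE_VOCAB),
        "tr": column("tr"),
        "age": column("age") / 100.0,  # assumed scaling to keep inputs near unit range
    }

records = [
    {"scanner": "siemens_prisma", "task": "rest", "state": "resting", "tr": 0.72, "age": 29},
    {"scanner": "new_vendor", "task": "movie", "state": "naturalistic", "tr": 2.0, "age": 61},
]
metadata = encode_metadata(records)
print({k: tuple(v.shape) for k, v in metadata.items()})
```

The returned dict can be passed as `embedder(**metadata)` because its keys match the `forward` argument names.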
3. Brain-DiT Architecture
class BrainDiTBlock(nn.Module):
    """DiT block adapted for 3D fMRI data"""

    def __init__(self, dim, num_heads=8, mlp_ratio=4):
        super().__init__()
        self.norm1 = nn.LayerNorm(dim)
        self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.norm2 = nn.LayerNorm(dim)
        mlp_hidden = int(dim * mlp_ratio)
        self.mlp = nn.Sequential(
            nn.Linear(dim, mlp_hidden),
            nn.GELU(),
            nn.Linear(mlp_hidden, dim)
        )
        # Adaptive LayerNorm: the conditioning vector produces a shift, scale,
        # and gate for each of the two sub-layers (attention and MLP)
        self.adaLN_modulation = nn.Sequential(
            nn.SiLU(),
            nn.Linear(dim, 6 * dim)
        )

    def forward(self, x, c):
        """
        Args:
            x: (B, N, D) flattened 3D fMRI patches
            c: (B, D) conditioning vector from metadata
        Returns:
            output: (B, N, D)
        """
        shift_msa, scale_msa, gate_msa, shift_mlp, scale_mlp, gate_mlp = \
            self.adaLN_modulation(c).chunk(6, dim=-1)

        # Modulated self-attention with a gated residual connection
        x_norm = self.norm1(x)
        x_norm = x_norm * (1 + scale_msa.unsqueeze(1)) + shift_msa.unsqueeze(1)
        attn_out, _ = self.attn(x_norm, x_norm, x_norm)
        x = x + gate_msa.unsqueeze(1) * attn_out

        # Modulated MLP with a gated residual connection
        x_norm = self.norm2(x)
        x_norm = x_norm * (1 + scale_mlp.unsqueeze(1)) + shift_mlp.unsqueeze(1)
        x = x + gate_mlp.unsqueeze(1) * self.mlp(x_norm)
        return x
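The original DiT design zero-initializes the last linear layer of the modulation network ("adaLN-Zero"), so every gate starts at zero and each block begins as an identity mapping. Whether Brain-DiT does the same is not stated here; the sketch below only shows the effect on a stand-in modulation module with the same structure as `adaLN_modulation`. `zero_init_modulation` is a hypothetical helper.

```python
import torch
import torch.nn as nn

def zero_init_modulation(modulation: nn.Sequential) -> None:
    """Zero the last linear layer so shift, scale, and gate all start at 0."""
    last = modulation[-1]
    nn.init.zeros_(last.weight)
    nn.init.zeros_(last.bias)

dim = 16
modulation = nn.Sequential(nn.SiLU(), nn.Linear(dim, 6 * dim))
zero_init_modulation(modulation)

x = torch.randn(2, 5, dim)   # (B, N, D) tokens
c = torch.randn(2, dim)      # (B, D) conditioning vector
shift, scale, gate, *_ = modulation(c).chunk(6, dim=-1)

branch = torch.randn_like(x)  # stand-in for an attention or MLP output
out = x + gate.unsqueeze(1) * branch

# With a zero gate the residual branch contributes nothing at initialization
print(torch.equal(out, x))
```

Applied to a block, this would be `zero_init_modulation(block.adaLN_modulation)` for each block after construction.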
class BrainDiT(nn.Module):
    """Brain-DiT: Diffusion Transformer for fMRI"""

    def __init__(self, patch_size=8, embed_dim=768, depth=24, num_heads=12):
        super().__init__()
        self.patch_size = patch_size
        # Non-overlapping 3D patches -> one token per patch
        self.patch_embed = nn.Conv3d(1, embed_dim, kernel_size=patch_size, stride=patch_size)
        self.time_embed = nn.Sequential(
            nn.Linear(1, 256),
            nn.SiLU(),
            nn.Linear(256, embed_dim)
        )
        self.metadata_embedder = MetadataEmbedder(None, embed_dim)
        self.blocks = nn.ModuleList([
            BrainDiTBlock(embed_dim, num_heads) for _ in range(depth)
        ])
        # Project each token back to the voxels of its patch
        self.final = nn.Sequential(
            nn.LayerNorm(embed_dim),
            nn.Linear(embed_dim, patch_size**3)
        )

    def forward(self, x_noisy, t, metadata):
        """
        Args:
            x_noisy: Noisy fMRI (B, 1, H, W, D); H, W, D must be divisible by patch_size
            t: Diffusion timestep (B,)
            metadata: Dict with acquisition metadata
        Returns:
            predicted_noise: (B, 1, H, W, D)
        """
        x = self.patch_embed(x_noisy)       # (B, C, h, w, d) patch grid
        B, _, h, w, d = x.shape
        x = x.flatten(2).transpose(1, 2)    # (B, N, C) with N = h * w * d

        t_emb = self.time_embed(t.view(-1, 1).float())
        m_emb = self.metadata_embedder(**metadata)
        conditioning = t_emb + m_emb

        for block in self.blocks:
            x = block(x, conditioning)
        x = self.final(x)                   # (B, N, p**3)

        # Unpatchify: place each token's p**3 values back at its grid position.
        # A plain view() would keep the element count but scramble voxel order.
        p = self.patch_size
        x = x.reshape(B, h, w, d, p, p, p)
        x = x.permute(0, 1, 4, 2, 5, 3, 6)  # (B, h, p, w, p, d, p)
        x = x.reshape(B, 1, h * p, w * p, d * p)
        return x
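Mapping tokens back to a volume is easy to get wrong, because a reshape with the right element count can still put voxels in the wrong place. A round-trip check like the one below makes the ordering explicit. `patchify` and `unpatchify` are hypothetical stand-alone helpers that follow the same token order as flattening a `Conv3d` patch grid.

```python
import torch

def patchify(vol, p):
    """(B, 1, H, W, D) -> (B, N, p**3), tokens ordered row-major over the patch grid."""
    B, _, H, W, D = vol.shape
    h, w, d = H // p, W // p, D // p
    x = vol.reshape(B, h, p, w, p, d, p)
    x = x.permute(0, 1, 3, 5, 2, 4, 6)   # (B, h, w, d, p, p, p)
    return x.reshape(B, h * w * d, p ** 3)

def unpatchify(tokens, grid, p):
    """(B, N, p**3) -> (B, 1, H, W, D); exact inverse of patchify."""
    B = tokens.shape[0]
    h, w, d = grid
    x = tokens.reshape(B, h, w, d, p, p, p)
    x = x.permute(0, 1, 4, 2, 5, 3, 6)   # (B, h, p, w, p, d, p)
    return x.reshape(B, 1, h * p, w * p, d * p)

vol = torch.randn(2, 1, 16, 24, 8)
tokens = patchify(vol, p=8)
restored = unpatchify(tokens, grid=(2, 3, 1), p=8)
print(tokens.shape, torch.equal(restored, vol))
```

Using a non-cubic test volume, as here, catches axis mix-ups that a cubic volume would hide.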
4. Diffusion Training Loop
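import torch
import torch.nn.functional as F

def train_step(model, optimizer, fmri_batch, metadata_batch, diffusion_schedule, num_timesteps):
    """Single training step for Brain-DiT

    diffusion_schedule: callable mapping integer timesteps (B,) to the
        cumulative signal level alpha_bar_t in (0, 1)
    num_timesteps: number of diffusion steps covered by the schedule
    """
    batch_size = fmri_batch.size(0)
    t = torch.randint(0, num_timesteps, (batch_size,), device=fmri_batch.device)
    noise = torch.randn_like(fmri_batch)
    # Reshape to (B, 1, 1, 1, 1) so alpha_t broadcasts over (B, 1, H, W, D)
    alpha_t = diffusion_schedule(t).reshape(-1, 1, 1, 1, 1)
    noisy_fmri = torch.sqrt(alpha_t) * fmri_batch + torch.sqrt(1 - alpha_t) * noise
    predicted_noise = model(noisy_fmri, t, metadata_batch)
    loss = F.mse_loss(predicted_noise, noise)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()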
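The training step relies on a `diffusion_schedule` that the guide does not define. A minimal sketch is given below, assuming a linear beta schedule with the common DDPM defaults (1000 steps, betas from 1e-4 to 0.02). The schedule Brain-DiT actually uses is not stated in this document, and `make_linear_schedule` is a hypothetical helper.

```python
import torch

def make_linear_schedule(num_timesteps=1000, beta_start=1e-4, beta_end=0.02):
    """Return a callable t -> alpha_bar_t for a linear beta schedule (assumed defaults)."""
    betas = torch.linspace(beta_start, beta_end, num_timesteps, dtype=torch.float64)
    # alpha_bar_t is the cumulative product of (1 - beta) up to step t
    alphas_cumprod = torch.cumprod(1.0 - betas, dim=0).float()

    def diffusion_schedule(t):
        # Look up alpha_bar for each timestep and reshape to (B, 1, 1, 1, 1)
        # so it broadcasts against (B, 1, H, W, D) volumes
        return alphas_cumprod.to(t.device)[t].view(-1, 1, 1, 1, 1)

    return diffusion_schedule

diffusion_schedule = make_linear_schedule()
t = torch.tensor([0, 499, 999])
alpha_bar = diffusion_schedule(t)

# Noise a batch of toy volumes at three different timesteps
x0 = torch.randn(3, 1, 8, 8, 8)
noise = torch.randn_like(x0)
x_t = torch.sqrt(alpha_bar) * x0 + torch.sqrt(1 - alpha_bar) * noise
print(alpha_bar.flatten(), x_t.shape)
```

The signal level falls monotonically with `t`, so early timesteps leave the volume almost untouched and late ones are close to pure noise.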
Applications
- Cross-State Transfer Learning: Pretrain on diverse states, fine-tune on specific tasks
- Clinical Screening: Detect disease states using learned representations
- Data Imputation: Fill missing or corrupted fMRI data
- Synthesis: Generate realistic fMRI for data augmentation
- Interpretability: Analyze what the model learns about brain organization
Evaluation Results
- Resting State Prediction: Outperforms previous models by 15%
- Task Classification: State-of-the-art on 7 downstream tasks
- Transfer Learning: Effective zero-shot and few-shot transfer
- Ablation Studies: Metadata conditioning crucial for performance
Pitfalls
- Data Requirements: Needs large-scale multi-site datasets for pretraining
- Computational Cost: Training requires significant GPU resources
- Scanner Effects: Despite metadata conditioning, scanner bias may persist
- Interpretability: Representations learned by diffusion models can be harder to interpret than those of autoregressive models
Related Skills
- brain-foundation-model-batch-effects
- fmri-connectivity-analysis
- brain-digital-twins-execution-semantics
- calcium-foundation-model
References
@article{braindit2026,
title={Brain-DiT: A Universal Multi-state fMRI Foundation Model with Metadata-Conditioned Diffusion Pretraining},
journal={arXiv preprint arXiv:2604.12683},
year={2026}
}