| name | pyvene-interventions |
| description | Provides guidance for performing causal interventions on PyTorch models using pyvene's declarative intervention framework. Use when conducting causal tracing, activation patching, interchange intervention training, or testing causal hypotheses about model behavior. |
| version | 1.0.0 |
| author | Orchestra Research |
| license | MIT |
| tags | ["Causal Intervention","pyvene","Activation Patching","Causal Tracing","Interpretability"] |
| dependencies | ["pyvene>=0.1.8","torch>=2.0.0","transformers>=4.30.0"] |
pyvene: Causal Interventions for Neural Networks
pyvene is Stanford NLP's library for performing causal interventions on PyTorch models. It provides a declarative, dict-based framework for activation patching, causal tracing, and interchange intervention training, which makes intervention experiments reproducible and shareable.
GitHub: stanfordnlp/pyvene (840+ stars)
Paper: pyvene: A Library for Understanding and Improving PyTorch Models via Interventions (NAACL 2024)
When to Use pyvene
Use pyvene when you need to:
- Perform causal tracing (ROME-style localization)
- Run activation patching experiments
- Conduct interchange intervention training (IIT)
- Test causal hypotheses about model components
- Share/reproduce intervention experiments via HuggingFace
- Work with any PyTorch architecture (not just transformers)
Consider alternatives when:
- You need exploratory activation analysis → Use TransformerLens
- You want to train/analyze SAEs → Use SAELens
- You need remote execution on massive models → Use nnsight
- You want lower-level control → Use nnsight
Installation
pip install pyvene
Standard import:
import pyvene as pv
Core Concepts
IntervenableModel
The main class that wraps any PyTorch model with intervention capabilities:
import pyvene as pv
from transformers import AutoModelForCausalLM, AutoTokenizer
model = AutoModelForCausalLM.from_pretrained("gpt2")
tokenizer = AutoTokenizer.from_pretrained("gpt2")
config = pv.IntervenableConfig(
    representations=[
        pv.RepresentationConfig(
            layer=8,
            component="block_output",
            intervention_type=pv.VanillaIntervention,
        )
    ]
)
intervenable = pv.IntervenableModel(config, model)
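Conceptually, an interchange intervention caches an activation from a source run and substitutes it into a base run at the configured location. The sketch below illustrates that idea on a hypothetical two-layer toy network in plain Python. The names `layer1`, `layer2`, and `toy_forward` are illustrative assumptions and say nothing about how pyvene implements this internally.

```python
# Conceptual sketch only: a hypothetical toy network, not pyvene internals.

def layer1(x):
    # Hidden activation: two units computed from a 2-d input.
    return [x[0] + x[1], x[0] - x[1]]

def layer2(h):
    # Scalar output computed from the hidden activation.
    return 2 * h[0] + 3 * h[1]

def toy_forward(x, patched_hidden=None):
    """Run the toy network, optionally overwriting the hidden activation."""
    h = layer1(x)
    if patched_hidden is not None:
        h = patched_hidden  # the "intervention": replace the activation
    return layer2(h)

base, source = [1, 2], [5, 1]

# 1) Cache the hidden activation from the source run.
source_hidden = layer1(source)

# 2) Rerun the base input with the source activation swapped in.
original = toy_forward(base)
counterfactual = toy_forward(base, patched_hidden=source_hidden)

print(original, counterfactual)
```

Because the whole hidden layer is swapped here, the counterfactual output equals the source output. Patching only part of the activation (a single position or unit) is what makes the comparison informative in practice.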
Intervention Types
| Type | Description | Use Case |
|---|---|---|
| VanillaIntervention | Swap activations between runs | Activation patching |
| AdditionIntervention | Add activations to base run | Steering, ablation |
| SubtractionIntervention | Subtract activations | Ablation |
| ZeroIntervention | Zero out activations | Component knockout |
| RotatedSpaceIntervention | DAS trainable intervention | Causal discovery |
| CollectIntervention | Collect activations | Probing, analysis |
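To make the first four rows concrete, here is a minimal sketch of what each operation does to a base activation, written on plain Python lists. The helper functions are hypothetical and only mirror the table's descriptions; they are not the library's classes.

```python
# Conceptual sketch: hypothetical helpers mirroring the table's descriptions.

def vanilla(base, source):
    """Swap: the base activation is replaced by the source activation."""
    return list(source)

def addition(base, source):
    """Add the source activation to the base activation."""
    return [b + s for b, s in zip(base, source)]

def subtraction(base, source):
    """Subtract the source activation from the base activation."""
    return [b - s for b, s in zip(base, source)]

def zero(base):
    """Knock out the activation entirely."""
    return [0.0 for _ in base]

base = [1.0, 2.0, 3.0]
source = [0.5, -1.0, 4.0]

for name, result in [
    ("vanilla", vanilla(base, source)),
    ("addition", addition(base, source)),
    ("subtraction", subtraction(base, source)),
    ("zero", zero(base)),
]:
    print(f"{name:12s} {result}")
```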
Component Targets
components = [
    "block_input",
    "block_output",
    "mlp_input",
    "mlp_output",
    "mlp_activation",
    "attention_input",
    "attention_output",
    "attention_value_output",
    "query_output",
    "key_output",
    "value_output",
    "head_attention_value_output",
]
Workflow 1: Causal Tracing (ROME-style)
Locate where factual associations are stored by corrupting inputs and restoring activations.
Step-by-Step
import pyvene as pv
from transformers import AutoModelForCausalLM, AutoTokenizer
import torch
model = AutoModelForCausalLM.from_pretrained("gpt2-xl")
tokenizer = AutoTokenizer.from_pretrained("gpt2-xl")
clean_prompt = "The Space Needle is in downtown"
corrupted_prompt = "The ##### ###### ## ## ########"
clean_tokens = tokenizer(clean_prompt, return_tensors="pt")
corrupted_tokens = tokenizer(corrupted_prompt, return_tensors="pt")
# Run the clean prompt once; the hidden states are kept for reference
with torch.no_grad():
    clean_outputs = model(**clean_tokens, output_hidden_states=True)
    clean_states = clean_outputs.hidden_states

def run_causal_trace(layer, position):
    """Restore clean activation at specific layer and position."""
    config = pv.IntervenableConfig(
        representations=[
            pv.RepresentationConfig(
                layer=layer,
                component="block_output",
                intervention_type=pv.VanillaIntervention,
                unit="pos",
                max_number_of_units=1,
            )
        ]
    )
    intervenable = pv.IntervenableModel(config, model)
    # Base is the corrupted run, source is the clean run;
    # one position is copied from source to base
    _, patched_outputs = intervenable(
        base=corrupted_tokens,
        sources=[clean_tokens],
        unit_locations={"sources->base": ([[[position]]], [[[position]]])},
        output_original_output=True,
    )
    # Probability of the correct answer after restoration
    probs = torch.softmax(patched_outputs.logits[0, -1], dim=-1)
    seattle_token = tokenizer.encode(" Seattle")[0]
    return probs[seattle_token].item()

# Sweep every (layer, position) pair
n_layers = model.config.n_layer
# The two prompts may tokenize to different lengths,
# so only sweep positions that exist in both
seq_len = min(
    clean_tokens["input_ids"].shape[1],
    corrupted_tokens["input_ids"].shape[1],
)
results = torch.zeros(n_layers, seq_len)
for layer in range(n_layers):
    for pos in range(seq_len):
        results[layer, pos] = run_causal_trace(layer, pos)
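Once the sweep finishes, the usual next step is to find where restoration helps most. The sketch below assumes the grid has been converted to nested lists (for example with `results.tolist()`) and uses made-up numbers; `strongest_restoration` is a hypothetical helper, not part of pyvene.

```python
def strongest_restoration(grid):
    """Return (layer, position, value) of the largest entry in a 2-D grid."""
    best = None
    for layer, row in enumerate(grid):
        for pos, value in enumerate(row):
            if best is None or value > best[2]:
                best = (layer, pos, value)
    return best

# Hypothetical probabilities for a 3-layer, 4-token sweep (made-up numbers).
toy_results = [
    [0.02, 0.05, 0.03, 0.04],
    [0.03, 0.41, 0.06, 0.10],
    [0.02, 0.12, 0.05, 0.30],
]

layer, pos, prob = strongest_restoration(toy_results)
print(f"Peak restoration at layer {layer}, position {pos}: p = {prob:.2f}")
```

A single peak is only a summary. Plotting the full grid as a heatmap usually shows whether the effect is concentrated at one token or spread across layers.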
Workflow 2: Activation Patching for Circuit Analysis
Test which components are necessary for a specific behavior.
Step-by-Step
import pyvene as pv
from transformers import AutoModelForCausalLM, AutoTokenizer
import torch
model = AutoModelForCausalLM.from_pretrained("gpt2")
tokenizer = AutoTokenizer.from_pretrained("gpt2")
clean_prompt = "When John and Mary went to the store, Mary gave a bottle to"
corrupted_prompt = "When John and Mary went to the store, John gave a bottle to"
clean_tokens = tokenizer(clean_prompt, return_tensors="pt")
corrupted_tokens = tokenizer(corrupted_prompt, return_tensors="pt")
john_token = tokenizer.encode(" John")[0]
mary_token = tokenizer.encode(" Mary")[0]
def logit_diff(logits):
    """IO - S logit difference."""
    return logits[0, -1, john_token] - logits[0, -1, mary_token]

def patch_attention(layer):
    config = pv.IntervenableConfig(
        representations=[
            pv.RepresentationConfig(
                layer=layer,
                component="attention_output",
                intervention_type=pv.VanillaIntervention,
            )
        ]
    )
    intervenable = pv.IntervenableModel(config, model)
    _, patched_outputs = intervenable(
        base=corrupted_tokens,
        sources=[clean_tokens],
    )
    return logit_diff(patched_outputs.logits).item()

results = []
for layer in range(model.config.n_layer):
    diff = patch_attention(layer)
    results.append(diff)
    print(f"Layer {layer}: logit diff = {diff:.3f}")
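Raw logit differences are easier to compare across layers when rescaled against the unpatched baselines. The sketch below shows one common convention, which the workflow above does not prescribe: 0 means the patch recovered nothing and 1 means it recovered the full clean behavior. The baseline values are made-up placeholders; in practice they would come from running `logit_diff` on the unpatched clean and corrupted prompts.

```python
def normalized_effect(patched, clean, corrupted):
    """Fraction of the clean-vs-corrupted gap recovered by a patch."""
    gap = clean - corrupted
    if gap == 0:
        raise ValueError("clean and corrupted baselines are identical")
    return (patched - corrupted) / gap

# Made-up logit differences, for illustration only.
clean_diff, corrupted_diff = 3.0, -1.0
per_layer = [-1.0, 0.0, 1.0, 3.0]

scores = [normalized_effect(p, clean_diff, corrupted_diff) for p in per_layer]
for layer, score in enumerate(scores):
    print(f"Layer {layer}: recovered {score:.0%} of the gap")
```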
Workflow 3: Interchange Intervention Training (IIT)
Train interventions to discover causal structure.
Step-by-Step
import pyvene as pv
from transformers import AutoModelForCausalLM
import torch
model = AutoModelForCausalLM.from_pretrained("gpt2")
config = pv.IntervenableConfig(
    representations=[
        pv.RepresentationConfig(
            layer=6,
            component="block_output",
            intervention_type=pv.RotatedSpaceIntervention,
            low_rank_dimension=64,
        )
    ]
)
intervenable = pv.IntervenableModel(config, model)
optimizer = torch.optim.Adam(
    intervenable.get_trainable_parameters(),
    lr=1e-4
)
# `dataloader` and `criterion` are not defined in this snippet; supply your own
for base_input, source_input, target_output in dataloader:
    optimizer.zero_grad()
    _, outputs = intervenable(
        base=base_input,
        sources=[source_input],
    )
    loss = criterion(outputs.logits, target_output)
    loss.backward()
    optimizer.step()
# The intervention key format may differ between pyvene versions;
# inspect intervenable.interventions.keys() if this lookup fails
rotation = intervenable.interventions["layer.6.block_output"][0].rotate_layer
DAS (Distributed Alignment Search)
config = pv.IntervenableConfig(
    representations=[
        pv.RepresentationConfig(
            layer=8,
            component="block_output",
            intervention_type=pv.LowRankRotatedSpaceIntervention,
            low_rank_dimension=1,
        )
    ]
)
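The idea behind a rotated-space intervention can be sketched in two dimensions: rotate both activations into a new basis, swap the first `k` coordinates, then rotate back. The example below is a toy illustration in plain Python with a hand-picked angle standing in for the learned rotation. The function names are hypothetical, and the library's trainable version works on high-dimensional hidden states.

```python
import math

def rotate(vec, theta):
    """Apply a 2-D rotation (an orthogonal change of basis)."""
    c, s = math.cos(theta), math.sin(theta)
    return [c * vec[0] - s * vec[1], s * vec[0] + c * vec[1]]

def rotated_interchange(base, source, theta, k):
    """Swap the first k rotated coordinates of base with those of source."""
    rotated_base = rotate(base, theta)
    rotated_source = rotate(source, theta)
    mixed = rotated_source[:k] + rotated_base[k:]
    return rotate(mixed, -theta)  # rotate back to the original basis

base, source = [1.0, 0.0], [0.0, 2.0]
theta = math.pi / 2  # hypothetical stand-in for a learned rotation

patched = rotated_interchange(base, source, theta, k=1)
print([round(v, 6) for v in patched])
```

With `k` equal to the full dimension the result is the source activation, and with `k=0` it is the base activation. Training searches for the rotation under which a small `k` already transfers the behavior of interest.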
Workflow 4: Model Steering (Honest LLaMA)
Steer model behavior during generation.
import pyvene as pv
from transformers import AutoModelForCausalLM, AutoTokenizer
model = AutoModelForCausalLM.from_pretrained("meta-llama/Llama-2-7b-hf")
tokenizer = AutoTokenizer.from_pretrained("meta-llama/Llama-2-7b-hf")
intervenable = pv.IntervenableModel.load(
    "zhengxuanzenwu/intervenable_honest_llama2_chat_7B",
    model=model,
)
prompt = "Is the earth flat?"
inputs = tokenizer(prompt, return_tensors="pt")
# generate() returns (base_outputs, intervened_outputs); keep the second
_, outputs = intervenable.generate(
    inputs,
    max_new_tokens=100,
    do_sample=False,
)
print(tokenizer.decode(outputs[0]))
Saving and Sharing Interventions
# Save to and load from a local directory
intervenable.save("./my_intervention")
intervenable = pv.IntervenableModel.load(
    "./my_intervention",
    model=model,
)
# Save and load using a HuggingFace repo ID
intervenable.save_intervention("username/my-intervention")
intervenable = pv.IntervenableModel.load(
    "username/my-intervention",
    model=model,
)
Common Issues & Solutions
Issue: Wrong intervention location
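# Wrong: "mlp" is not one of the names listed under Component Targets
config = pv.RepresentationConfig(
    component="mlp",
)
# Right: name the exact activation to intervene on
config = pv.RepresentationConfig(
    component="mlp_output",
)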
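A small guard can catch this mistake before a config is built. The sketch below checks a name against the list from the Component Targets section and suggests near matches. `KNOWN_COMPONENTS`, `suggest`, and `check_component` are hypothetical helpers, and the list is only the one given in this document, so other model types may accept other names.

```python
# Component names copied from the Component Targets list in this document.
KNOWN_COMPONENTS = [
    "block_input", "block_output",
    "mlp_input", "mlp_output", "mlp_activation",
    "attention_input", "attention_output", "attention_value_output",
    "query_output", "key_output", "value_output",
    "head_attention_value_output",
]

def suggest(name):
    """List known components that contain the given string."""
    return [c for c in KNOWN_COMPONENTS if name in c]

def check_component(name):
    """Return name if it is a known component, else raise with suggestions."""
    if name in KNOWN_COMPONENTS:
        return name
    raise ValueError(
        f"Unknown component {name!r}; did you mean one of {suggest(name)}?"
    )

print(check_component("mlp_output"))
try:
    check_component("mlp")
except ValueError as err:
    print(err)
```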
Issue: Dimension mismatch
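# Declare the unit and how many units each intervention touches
config = pv.RepresentationConfig(
    unit="pos",
    max_number_of_units=1,
)
# Pass matching positions for source and base
intervenable(
    base=base_tokens,
    sources=[source_tokens],
    unit_locations={"sources->base": ([[[5]]], [[[5]]])},
)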
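Writing the nested lists by hand is error-prone, so a helper that builds and bounds-checks them can be useful. The sketch below assumes the nesting reads as interventions, then batch items, then positions, which is how the `[[[5]]]` example above appears to be structured. `build_unit_locations` is a hypothetical helper, not a pyvene function.

```python
def build_unit_locations(positions, base_len, source_len, batch_size=1):
    """Build a "sources->base" mapping for a single intervention.

    Assumed nesting: [intervention][batch item][position].
    """
    limit = min(base_len, source_len)
    for p in positions:
        if not 0 <= p < limit:
            raise IndexError(f"position {p} is outside a sequence of length {limit}")

    def nested():
        # Fresh lists each time so source and base entries are not aliased.
        return [[list(positions) for _ in range(batch_size)]]

    return {"sources->base": (nested(), nested())}

print(build_unit_locations([5], base_len=8, source_len=8))
```

Checking positions against the shorter of the two sequences surfaces an out-of-range index as a clear error before the forward pass, instead of as a shape mismatch deep inside the model.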
Issue: Memory with large models
# Option 1: trade compute for memory
model.gradient_checkpointing_enable()
# Option 2: intervene on a single component rather than many
config = pv.IntervenableConfig(
    representations=[
        pv.RepresentationConfig(
            layer=8,
            component="block_output",
        )
    ]
)
Issue: LoRA integration
config = pv.RepresentationConfig(
    intervention_type=pv.LoRAIntervention,
    low_rank_dimension=16,
)
Key Classes Reference
| Class | Purpose |
|---|---|
| IntervenableModel | Main wrapper for interventions |
| IntervenableConfig | Configuration container |
| RepresentationConfig | Single intervention specification |
| VanillaIntervention | Activation swapping |
| RotatedSpaceIntervention | Trainable DAS intervention |
| CollectIntervention | Activation collection |
Supported Models
pyvene works with any PyTorch model. Tested on:
- GPT-2 (all sizes)
- LLaMA / LLaMA-2
- Pythia
- Mistral / Mixtral
- OPT
- BLIP (vision-language)
- ESM (protein models)
- Mamba (state space)
Reference Documentation
For detailed API documentation, tutorials, and advanced usage, see the references/ folder.
Comparison with Other Tools
| Feature | pyvene | TransformerLens | nnsight |
|---|---|---|---|
| Declarative config | Yes | No | No |
| HuggingFace sharing | Yes | No | No |
| Trainable interventions | Yes | Limited | Yes |
| Any PyTorch model | Yes | Transformers only | Yes |
| Remote execution | No | No | Yes (NDIF) |