representation-steering

LLM representation steering and activation patching methodology for mechanistic interpretability. Use when analyzing how steering vectors affect LLM internals, conducting activation patching experiments, or investigating causal mechanisms in neural networks. Keywords: representation steering, activation patching, mechanistic interpretability, steering vectors, OV circuit, QK circuit, refusal steering.

Install command

npx skills add https://github.com/hiyenwong/ai_collection --skill representation-steering

Copy and paste this command into Claude Code to install the skill

Source

hiyenwong/ai_collection

Stars: 1

Forks: 0

Updated: June 4, 2026 at 02:00

SKILL.md

name

representation-steering

description

Representation Steering Skill

Description

Framework for analyzing and applying steering vectors to LLMs, based on mechanistic interpretability methods from recent research (see Resources).

Activation Keywords

representation steering
activation patching
mechanistic interpretability
steering vectors
OV circuit analysis
QK circuit analysis
refusal steering
机制可解释性 (mechanistic interpretability)
表示转向 (representation steering)
激活修补 (activation patching)

Tools Used

exec: Run Python scripts for activation analysis
read: Load model configurations and weights
write: Save analysis results and patching configurations

Key Concepts

Steering Vectors

Vectors added to a model's intermediate activations at inference time to shift its behavior without modifying weights. Effective for alignment-related behaviors such as refusal.

Activation Patching

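Method for tracing causal mechanisms: activations at specific layers and token positions are replaced with activations from a different run (for example, a clean versus a corrupted prompt), and the resulting change in output is measured.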
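
A minimal sketch of activation patching on a toy two-layer network, assuming NumPy. The network, the weights `W1`/`W2`, and the helpers `forward` and `patch_units` are illustrative assumptions, not part of this skill or of any library; a real experiment would hook the activations of an actual LLM instead.

```python
import numpy as np

rng = np.random.default_rng(0)
d_in, d_hidden, d_out = 4, 8, 3
W1 = rng.normal(size=(d_in, d_hidden))   # toy first-layer weights (assumption)
W2 = rng.normal(size=(d_hidden, d_out))  # toy second-layer weights (assumption)


def forward(x, patch=None):
    """Run the toy network, optionally overwriting the hidden activation."""
    hidden = np.tanh(x @ W1)
    if patch is not None:
        hidden = patch  # activation patching: swap in a cached activation
    return hidden @ W2, hidden


clean_x = rng.normal(size=d_in)
corrupt_x = rng.normal(size=d_in)
clean_out, clean_hidden = forward(clean_x)
corrupt_out, corrupt_hidden = forward(corrupt_x)


def patch_units(units):
    """Patch only the given hidden units from the clean run into the corrupted run."""
    mixed = corrupt_hidden.copy()
    mixed[units] = clean_hidden[units]
    out, _ = forward(corrupt_x, patch=mixed)
    return out


# Fraction of the clean/corrupted output gap closed by patching each unit alone
# (can be negative if a single-unit patch moves the output further away).
gap = np.linalg.norm(corrupt_out - clean_out)
for unit in range(d_hidden):
    remaining = np.linalg.norm(patch_units([unit]) - clean_out)
    print(unit, round(1.0 - remaining / gap, 3))
```

Patching every hidden unit at once reproduces the clean output exactly in this toy, because everything downstream depends only on that activation. In a real model, the per-unit (or per-layer, per-position) scores are what localize the mechanism.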

Circuit Analysis

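OV circuit: the value-output pathway of an attention head (its value and output projections), which determines what information is written to the residual stream; this is where steering primarily operates
QK circuit: the query-key pathway, which determines the attention scores (where each token attends); steering largely bypasses it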
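
For reference, the two circuits can be written out for a single attention head. The notation ($X$ for the residual-stream input, $W_Q, W_K, W_V, W_O$ for the head's projections, $d_k$ for the head dimension) is the standard transformer convention and is an assumption here, not taken from the skill or the referenced paper. Causal masking is omitted for brevity.

```latex
\begin{align}
A &= \operatorname{softmax}\!\left(\frac{(X W_Q)(X W_K)^{\top}}{\sqrt{d_k}}\right)
  && \text{(QK circuit: where each token attends)} \\
\operatorname{head}(X) &= A \, X \, W_V W_O
  && \text{(OV circuit: what is written to the residual stream)}
\end{align}
```

Freezing the QK circuit means holding $A$ fixed at its unsteered value while the steered $X$ still flows through $W_V W_O$.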

Workflow

Step 1: Identify Target Layer

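Locate the layer where steering has the largest effect:

# Effective layers are typically in the middle of the network (e.g., layers 10-15 of a 32-layer model)
# find_critical_layer is a placeholder; it is not defined in this skill
target_layer = find_critical_layer(model, behavior_type)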
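
One way such a layer search could work is a simple sweep: steer at each layer in turn and keep the layer that shifts a behavior score the most. The sketch below assumes NumPy and a toy residual network standing in for the LLM. The `run` helper, the `readout` score direction, and this version of `find_critical_layer` (whose signature differs from the placeholder above) are all hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)
n_layers, d_model = 6, 16
# Toy residual network standing in for a transformer (assumption).
weights = [rng.normal(scale=0.3, size=(d_model, d_model)) for _ in range(n_layers)]
readout = rng.normal(size=d_model)  # hypothetical direction that scores the behavior


def run(x, steer_layer=None, steering_vector=None, alpha=1.0):
    """Forward pass; optionally add alpha * steering_vector after one layer."""
    h = x.copy()
    for layer, w in enumerate(weights):
        h = h + np.tanh(h @ w)
        if layer == steer_layer:
            h = h + alpha * steering_vector
    return float(h @ readout)


def find_critical_layer(x, steering_vector, alpha=1.0):
    """Return the layer whose steering shifts the score most, plus all shifts."""
    baseline = run(x)
    shifts = [abs(run(x, layer, steering_vector, alpha) - baseline)
              for layer in range(n_layers)]
    return int(np.argmax(shifts)), shifts


x = rng.normal(size=d_model)
v = rng.normal(size=d_model)
target_layer, shifts = find_critical_layer(x, v)
print(target_layer, np.round(shifts, 3))
```

With a real model, the score would be a behavioral metric (for example, refusal rate over a prompt set) averaged over many inputs rather than a single readout on one input.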

Step 2: Extract Steering Vector

Compute the difference between target-layer activations on positive and negative examples (inputs that do and do not exhibit the target behavior):

steering_vector = positive_activation - negative_activation
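
With more than one example per class, a common choice is the difference of means. The sketch below assumes NumPy and synthetic activations; the array names and the planted `true_direction` are illustrative, and the skill does not state that the referenced paper uses exactly this estimator.

```python
import numpy as np

rng = np.random.default_rng(0)
d_model = 8
true_direction = np.zeros(d_model)
true_direction[2] = 1.0  # synthetic "behavior" direction used to build the data

# Synthetic activations at the target layer, shape (n_examples, d_model).
negative_acts = rng.normal(size=(200, d_model))
positive_acts = rng.normal(size=(200, d_model)) + 3.0 * true_direction

# Difference of means: average each set first, then subtract.
steering_vector = positive_acts.mean(axis=0) - negative_acts.mean(axis=0)

# Optional: normalize so that alpha alone controls the steering magnitude.
unit_steering_vector = steering_vector / np.linalg.norm(steering_vector)

print(int(np.argmax(np.abs(steering_vector))))
```

Averaging over many examples cancels prompt-specific noise, so the planted dimension dominates the resulting vector.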

Step 3: Apply Multi-Token Patching

Apply the steering vector at multiple token positions, not just the first:

for token_pos in target_positions:
    # Write the steered activation back so later layers see it
    activation[token_pos] = activation[token_pos] + alpha * steering_vector
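
The same step can be vectorized. This is a minimal sketch assuming NumPy and activations stored as a `(seq_len, d_model)` array; `apply_steering` is a hypothetical helper, not part of the skill.

```python
import numpy as np


def apply_steering(activations, steering_vector, positions, alpha=1.0):
    """Return a copy of activations with alpha * steering_vector added at positions.

    activations: array of shape (seq_len, d_model)
    positions: iterable of distinct token indices to steer
    """
    patched = activations.copy()
    patched[list(positions)] += alpha * steering_vector
    return patched


seq_len, d_model = 5, 4
activations = np.zeros((seq_len, d_model))
steering_vector = np.array([1.0, 0.0, -1.0, 0.5])

patched = apply_steering(activations, steering_vector, positions=[1, 2, 4], alpha=2.0)
print(patched)
```

Returning a copy keeps the unsteered activations available for comparison, which is convenient when the same run is also used as a patching baseline.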

Step 4: Analyze OV Circuit

Decompose the attention contributions and isolate the OV pathway:

ov_contribution = analyze_ov_circuit(layer_activations)
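
One way to probe how much of the steering effect flows through the OV pathway is to freeze the attention pattern at its unsteered value and compare. The sketch below does this for a single toy attention head, assuming NumPy and random weights; `attention_pattern` and `ov_output` are hypothetical helpers, and the printed ratio is specific to this toy and says nothing about the 8.75% figure reported for real models.

```python
import numpy as np

rng = np.random.default_rng(0)
seq_len, d_model, d_head = 6, 16, 4
W_Q, W_K, W_V = (rng.normal(scale=0.3, size=(d_model, d_head)) for _ in range(3))
W_O = rng.normal(scale=0.3, size=(d_head, d_model))


def softmax(scores):
    scores = scores - scores.max(axis=-1, keepdims=True)
    exp = np.exp(scores)
    return exp / exp.sum(axis=-1, keepdims=True)


def attention_pattern(x):
    """QK circuit: causal attention weights computed from queries and keys."""
    scores = (x @ W_Q) @ (x @ W_K).T / np.sqrt(d_head)
    mask = np.triu(np.ones((len(x), len(x)), dtype=bool), k=1)
    scores = np.where(mask, -np.inf, scores)
    return softmax(scores)


def ov_output(x, pattern):
    """OV circuit: move value vectors according to a given attention pattern."""
    return pattern @ (x @ W_V) @ W_O


x = rng.normal(size=(seq_len, d_model))
steering_vector = rng.normal(size=d_model)
x_steered = x + 0.5 * steering_vector  # steer every position

baseline = ov_output(x, attention_pattern(x))
full = ov_output(x_steered, attention_pattern(x_steered))
frozen_qk = ov_output(x_steered, attention_pattern(x))  # pattern from unsteered run

# Size of the output change with QK frozen, relative to the full steered change.
total_change = np.linalg.norm(full - baseline)
frozen_change = np.linalg.norm(frozen_qk - baseline)
print(round(frozen_change / total_change, 3))
```

A ratio close to 1 would indicate that most of the steering effect survives without any change to the attention pattern, which is the shape of the argument behind the OV-dominance finding.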

Step 5: Sparsification

Zero out most dimensions of the steering vector while preserving performance:

sparse_vector = sparsify(steering_vector, keep_ratio=0.01)  # 99% sparse
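
A possible implementation of `sparsify`, assuming NumPy and assuming that entries are selected by magnitude (top-k). The skill does not say which selection criterion the referenced paper uses, so treat the criterion as an assumption.

```python
import numpy as np


def sparsify(steering_vector, keep_ratio=0.01):
    """Keep the largest-magnitude entries and zero out the rest."""
    n_keep = max(1, int(round(keep_ratio * steering_vector.size)))
    # Indices of the n_keep entries with the largest absolute value.
    keep_idx = np.argpartition(np.abs(steering_vector), -n_keep)[-n_keep:]
    sparse = np.zeros_like(steering_vector)
    sparse[keep_idx] = steering_vector[keep_idx]
    return sparse


rng = np.random.default_rng(0)
steering_vector = rng.normal(size=4096)
sparse_vector = sparsify(steering_vector, keep_ratio=0.01)

# Cosine similarity between the sparse and dense vectors as a quick sanity check.
cosine = sparse_vector @ steering_vector / (
    np.linalg.norm(sparse_vector) * np.linalg.norm(steering_vector)
)
print(np.count_nonzero(sparse_vector), round(float(cosine), 3))
```

Cosine similarity is only a proxy. Whether a sparse vector still steers the model has to be checked on the behavioral metric itself.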

Findings from Research

Different steering methods rely on functionally interchangeable circuits when applied at the same layer
The OV circuit is the primary pathway: freezing the QK circuit (all attention scores) reduces performance by only 8.75%
Steering vectors can be sparsified by 90-99% without major performance loss
Semantically interpretable concepts emerge in the OV decomposition

Error Handling

Steering Not Effective

Check target layer (may need adjustment)
Increase alpha (steering magnitude)
Verify multi-token patching is applied

Model Instability

Reduce alpha magnitude
Apply to fewer layers
Use sparse steering vector
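
When tuning alpha in either direction, it can help to express the steering magnitude relative to the typical activation norm at the target layer rather than as a raw number. The sketch below assumes NumPy and synthetic activations; `alpha_for_relative_norm` and the ratios swept are a heuristic for illustration, not something prescribed by the skill or the referenced paper.

```python
import numpy as np


def alpha_for_relative_norm(activations, steering_vector, ratio):
    """Return alpha such that ||alpha * v|| = ratio * mean per-token activation norm."""
    mean_norm = np.linalg.norm(activations, axis=-1).mean()
    return ratio * mean_norm / np.linalg.norm(steering_vector)


rng = np.random.default_rng(0)
activations = rng.normal(size=(10, 64))  # synthetic (seq_len, d_model) activations
steering_vector = rng.normal(size=64)

# Sweep from gentle to aggressive steering; the ratios are illustrative only.
for ratio in (0.1, 0.5, 1.0, 2.0):
    alpha = alpha_for_relative_norm(activations, steering_vector, ratio)
    print(ratio, round(float(alpha), 3))
```

Starting at a small ratio and increasing until the behavior changes gives a rough lower bound on a usable alpha. Output degradation at higher ratios marks the point to back off.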

Resources

Reference paper: arXiv:2604.08524
Key finding: Freezing all attention scores drops performance by only 8.75%
