| name | llm-concept-neurons-control |
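| description | Methodology for identifying and controlling psychological concept neurons in LLMs. Uses probing analysis to identify Big Five concept neurons, then enhances or suppresses their activations to bias LLM generation. Highlights the gap between representation control and behavioral control. Activation: LLM concept neurons, neural control, Big Five, representation steering, activation intervention, psychological constructs. |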
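Controlling Psychological Concept Neurons in LLMs
Description
A methodology for identifying and controlling psychological concept neurons in LLMs, based on the Big Five personality traits. The skill uses probing analysis to find neurons that respond selectively to a given psychological concept, then enhances or suppresses their activations to steer the LLM's latent representations and generated output. The study reveals a gap between representation control (what a probe can read out) and behavioral control (which labels the model generates).
Source paper: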
- arXiv:2604.11802v1 (2026-04-13)
- Authors: Yuto Harada, Hiro Taiyo Hamada
- Field: cs.CL (natural language processing / computational psychology)
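Core Concepts
1. Psychological Concept Neurons
Specific psychological constructs (such as the Big Five traits) are encoded inside an LLM as distributed sets of neurons:
- Selective response: they activate for inputs that express a specific concept
- Cross-layer distribution: they are spread across layers, concentrated mainly in the middle layers
- Limited overlap: the neuron sets for different concepts overlap little
2. Probing Analysis
A linear classifier probes the LLM's internal representations for encoded psychological concept information:
- Early-layer decoding: concept information is already decodable in shallow layers
- Persistence: the information remains present in deep layers
- Causal intervention: intervening on neurons changes the representation
3. Intervention Control
Enhancing or suppressing concept-neuron activations steers LLM output:
- Representation level: probe readouts shift successfully (>80% success rate)
- Generation level: the label distribution is biased, but the effect is weaker
- Spillover: intervening on one trait also changes non-target traits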
Activation Keywords
- LLM concept neurons
- neural control
- Big Five personality
- representation steering
- activation intervention
- psychological constructs
- concept neurons
- personality steering
- LLM psychology
Methodology Framework
Step 1: Preparing psychological concept data
Operationalizing the Big Five as prompts
# Adjective descriptors for the high and low pole of each Big Five trait
BIG_FIVE_DIMENSIONS = {
    'openness': {
        'high': ['curious', 'creative', 'intellectual', 'imaginative'],
        'low': ['conventional', 'practical', 'down-to-earth']
    },
    'conscientiousness': {
        'high': ['organized', 'responsible', 'disciplined'],
        'low': ['spontaneous', 'careless', 'easy-going']
    },
    'extraversion': {
        'high': ['outgoing', 'energetic', 'talkative'],
        'low': ['reserved', 'quiet', 'introverted']
    },
    'agreeableness': {
        'high': ['cooperative', 'trusting', 'empathetic'],
        'low': ['competitive', 'skeptical', 'challenging']
    },
    'neuroticism': {
        'high': ['anxious', 'insecure', 'emotional'],
        'low': ['confident', 'calm', 'stable']
    }
}


def build_persona_prompt(trait, level):
    """Build a persona prompt for one trait at one level ('high' or 'low')."""
    descriptors = BIG_FIVE_DIMENSIONS[trait][level]
    return f"You are a person who is {', '.join(descriptors)}. Describe yourself:"
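The prompt builder above can feed a labeled probing dataset. The following is a minimal sketch, not the paper's procedure: it assumes the high pole is labeled 1 and the low pole 0, `build_probe_dataset` is a hypothetical helper, and the descriptor table is trimmed to one trait so the block runs on its own.

```python
# Trimmed copy of the descriptor table so the sketch is self-contained.
DESCRIPTORS = {
    'extraversion': {
        'high': ['outgoing', 'energetic', 'talkative'],
        'low': ['reserved', 'quiet', 'introverted'],
    },
}


def build_persona_prompt(trait, level):
    descriptors = DESCRIPTORS[trait][level]
    return f"You are a person who is {', '.join(descriptors)}. Describe yourself:"


def build_probe_dataset(trait):
    """Hypothetical helper: one (prompt, label) pair per pole, high=1 and low=0."""
    return [
        (build_persona_prompt(trait, 'high'), 1),
        (build_persona_prompt(trait, 'low'), 0),
    ]


dataset = build_probe_dataset('extraversion')
for text, label in dataset:
    print(label, text)
```

Two prompts are far too few to train a probe; a real dataset would need many samples per pole, for example by varying the prompt wording.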
Step 2: Identifying concept neurons
Layer-wise probe training
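import torch
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split


class ConceptProber:
    def __init__(self, model, n_layers):
        # `model` is assumed to accept raw text and return an object with
        # `.hidden_states` (one [1, T, D] tensor per entry), e.g. a thin
        # wrapper around a tokenizer plus a Hugging Face model.
        self.model = model
        self.n_layers = n_layers
        self.probes = {}

    def train_layer_probe(self, layer_idx, hidden_states, labels):
        """
        Train a linear probe on one layer.

        Args:
            hidden_states: [N, D] hidden states
            labels: [N] concept labels
        Returns:
            (probe, accuracy), with accuracy measured on a held-out 20% split
        """
        X_train, X_test, y_train, y_test = train_test_split(
            hidden_states, labels, test_size=0.2, stratify=labels, random_state=0
        )
        probe = LogisticRegression(max_iter=1000)
        probe.fit(X_train, y_train)
        # Score on held-out data; training-set accuracy overstates decodability
        accuracy = probe.score(X_test, y_test)
        self.probes[layer_idx] = probe
        return probe, accuracy

    def analyze_concept_decodability(self, dataset):
        """Measure how well the concept can be decoded at each layer."""
        # One forward pass per text; `layer` indexes `outputs.hidden_states`
        # (under the Hugging Face convention, index 0 is the embedding output).
        per_layer = {layer: [] for layer in range(self.n_layers)}
        labels = []
        for text, label in dataset:
            with torch.no_grad():
                outputs = self.model(text, output_hidden_states=True)
            for layer in range(self.n_layers):
                # Last-token hidden state, shape [1, D]
                per_layer[layer].append(outputs.hidden_states[layer][:, -1, :])
            labels.append(label)
        layer_accuracies = {}
        for layer, states in per_layer.items():
            # scikit-learn expects NumPy arrays, not torch tensors
            X = torch.cat(states).detach().float().cpu().numpy()
            _, acc = self.train_layer_probe(layer, X, labels)
            layer_accuracies[layer] = acc
        return layer_accuracies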
Identifying concept-selective neurons
import torch


def identify_concept_neurons(model, concept_dataset, threshold=0.5, concept=None):
    """
    Identify neurons that respond selectively to a concept.

    Args:
        concept_dataset: (text, label) pairs; label 1 marks concept samples,
            any other label marks control samples
        threshold: minimum |Cohen's d| for a neuron to count as selective
        concept: optional concept name stored on each result so that
            downstream code can filter by concept
    Returns:
        selective_neurons: list of dicts with keys 'concept', 'layer',
            'neuron', 'effect_size', 'direction'
    """
    activations = {'concept': {}, 'control': {}}
    for text, label in concept_dataset:
        group = 'concept' if label == 1 else 'control'
        with torch.no_grad():
            outputs = model(text, output_hidden_states=True)
        # Assuming the Hugging Face convention, hidden_states[0] is the
        # embedding output; skipping it makes `layer` the index of the
        # transformer block that produced the state.
        for layer, h in enumerate(outputs.hidden_states[1:]):
            # Mean activation over tokens for this sample, shape [D]
            activations[group].setdefault(layer, []).append(h[0].mean(dim=0))

    selective_neurons = []
    for layer in activations['concept']:
        concept_acts = torch.stack(activations['concept'][layer])
        control_acts = torch.stack(activations['control'][layer])
        # Cohen's d per neuron: mean difference over the pooled standard deviation
        pooled_std = torch.sqrt(
            (concept_acts.std(dim=0) ** 2 + control_acts.std(dim=0) ** 2) / 2
        )
        cohens_d = (concept_acts.mean(dim=0) - control_acts.mean(dim=0)) / (pooled_std + 1e-8)
        for neuron_idx in torch.nonzero(cohens_d.abs() > threshold).flatten().tolist():
            d = cohens_d[neuron_idx].item()
            selective_neurons.append({
                'concept': concept,
                'layer': layer,
                'neuron': neuron_idx,
                'effect_size': d,
                'direction': 'positive' if d > 0 else 'negative'
            })
    return selective_neurons
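The selectivity score above is Cohen's d with an equal-weight pooled standard deviation. As a worked example for a single neuron, here is a small standard-library sketch; the activation values are made up for illustration.

```python
import math
import statistics


def cohens_d(concept_values, control_values):
    """Cohen's d using the equal-weight pooled standard deviation."""
    mean_diff = statistics.mean(concept_values) - statistics.mean(control_values)
    # statistics.variance is the sample variance (n - 1 denominator),
    # which matches the default behavior of torch.std
    pooled_std = math.sqrt(
        (statistics.variance(concept_values) + statistics.variance(control_values)) / 2
    )
    return mean_diff / (pooled_std + 1e-8)


# Made-up activations of one neuron on concept vs. control prompts
concept = [2.0, 3.0, 4.0]
control = [1.0, 2.0, 3.0]
d = cohens_d(concept, control)
print(round(d, 3))
```

Here the means differ by 1 and both sample variances are 1, so d is approximately 1, which clears the default threshold of 0.5.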
Step 3: Neuron intervention
Enhancing and suppressing activations
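import torch


class NeuronIntervention:
    def __init__(self, model, concept_neurons):
        # `model.layers` is assumed to expose the transformer blocks as an
        # indexable sequence; the attribute path differs between architectures.
        # Each entry of `concept_neurons` needs 'concept', 'layer' and 'neuron' keys.
        self.model = model
        self.concept_neurons = concept_neurons

    def intervene(self, text, target_concept, direction='enhance', strength=1.0):
        """
        Run one forward pass with the target concept's neurons modified.

        Args:
            text: input text
            target_concept: concept whose neurons are modified
            direction: 'enhance' or 'suppress'
            strength: intervention strength
        """
        def make_hook(neuron):
            def hook_fn(module, inputs, output):
                # Transformer blocks often return a tuple whose first item is the hidden state
                hidden = output[0] if isinstance(output, tuple) else output
                modified = hidden.clone()
                if direction == 'enhance':
                    modified[:, :, neuron] += strength * torch.abs(modified[:, :, neuron])
                else:
                    modified[:, :, neuron] *= (1 - strength)
                if isinstance(output, tuple):
                    return (modified,) + output[1:]
                return modified
            return hook_fn

        hooks = []
        try:
            for cn in self.concept_neurons:
                if cn['concept'] == target_concept:
                    layer = self.model.layers[cn['layer']]
                    hooks.append(layer.register_forward_hook(make_hook(cn['neuron'])))
            # The hooks are only active during this call, so hidden states are
            # requested here to let callers inspect the intervened representation.
            with torch.no_grad():
                outputs = self.model(text, output_hidden_states=True)
        finally:
            # Always detach the hooks, even if the forward pass raises
            for hook in hooks:
                hook.remove()
        return outputs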
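To see what the enhance and suppress rules in the hook above do numerically, the sketch below applies the same arithmetic to single scalar activations. `apply_intervention` is a hypothetical helper written for this illustration, not part of the class.

```python
def apply_intervention(activation, direction, strength):
    """Apply the enhance/suppress arithmetic to one scalar activation."""
    if direction == 'enhance':
        # Adds a non-negative amount proportional to the activation's magnitude
        return activation + strength * abs(activation)
    # Scales the activation toward zero; strength=1.0 removes it entirely
    return activation * (1 - strength)


for a in (2.0, -2.0):
    print(a, apply_intervention(a, 'enhance', 0.5), apply_intervention(a, 'suppress', 0.5))
```

One consequence worth checking against the intended behavior: because 'enhance' always adds a non-negative amount, a negative activation moves toward zero rather than becoming more negative, so for neurons whose selectivity direction is 'negative' the rule may not strengthen the concept signal.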
Evaluating probe readout shift
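import torch


def evaluate_probe_shift(intervener, test_texts, probe, target_concept, hidden_state_idx=-1):
    """
    Evaluate whether the intervention shifts the probe readout.

    `hidden_state_idx` selects the entry of `outputs.hidden_states` to read and
    should match the layer `probe` was trained on. `target_concept` is assumed
    to use the same label values as the probe's classes.

    Returns:
        success_rate: share of intervened samples the probe assigns to the target concept
    """
    def last_token_state(outputs):
        # Last-token hidden state as a [1, D] NumPy array for scikit-learn
        state = outputs.hidden_states[hidden_state_idx][:, -1, :]
        return state.detach().float().cpu().numpy()

    baseline_preds = []
    intervened_preds = []
    for text in test_texts:
        with torch.no_grad():
            baseline = intervener.model(text, output_hidden_states=True)
        baseline_preds.append(probe.predict(last_token_state(baseline))[0])
        # Read the state from the forward pass that intervene() returns: its
        # hooks are removed afterwards, so a fresh forward pass would be
        # unmodified. This assumes the returned outputs include hidden states.
        intervened = intervener.intervene(text, target_concept, 'enhance')
        intervened_preds.append(probe.predict(last_token_state(intervened))[0])
    shifted = sum(pred == target_concept for pred in intervened_preds)
    success_rate = shifted / len(test_texts)
    return success_rate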
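The success rate above counts every sample the probe assigns to the target after intervention, including samples that were already at the target before it. A stricter variant, sketched here as an assumption rather than the paper's metric, also reports the share of samples that actually moved; `probe_shift_rates` is a hypothetical helper and the predictions are made up.

```python
def probe_shift_rates(baseline_preds, intervened_preds, target):
    """Return (raw_rate, shifted_rate) for a target label.

    raw_rate: share of intervened predictions equal to the target.
    shifted_rate: among samples not at the target at baseline, the share
        that the intervention moved to the target.
    """
    raw_rate = sum(p == target for p in intervened_preds) / len(intervened_preds)
    movable = [i for b, i in zip(baseline_preds, intervened_preds) if b != target]
    shifted_rate = sum(i == target for i in movable) / len(movable) if movable else 0.0
    return raw_rate, shifted_rate


baseline = ['low', 'low', 'high', 'low']
intervened = ['high', 'low', 'high', 'high']
raw, shifted = probe_shift_rates(baseline, intervened, 'high')
print(raw, shifted)
```

In this toy case three of four intervened predictions are at the target, but only two of the three samples that started elsewhere were moved by the intervention.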
Step 4: Evaluating generation control
Bias in generated labels
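from collections import Counter


class _GenerateProxy:
    """Makes intervene() call generate(), so the hooks stay active while decoding."""

    def __init__(self, model, max_tokens=100):
        self._model = model
        self.layers = model.layers
        self._max_tokens = max_tokens

    def __call__(self, prompt, **kwargs):
        # Extra keyword arguments passed by intervene() are ignored
        return self._model.generate(prompt, max_tokens=self._max_tokens)


def evaluate_generation_bias(model, prompts, concept_neurons, trait):
    """
    Measure how neuron intervention changes the distribution of generated labels.

    `extract_personality_labels` and `kl_divergence` are assumed to be defined elsewhere.
    """
    baseline_generations = [model.generate(prompt, max_tokens=100) for prompt in prompts]
    baseline_dist = Counter(extract_personality_labels(baseline_generations))

    # intervene() removes its hooks before returning, so generation has to
    # happen inside that call rather than after it.
    intervener = NeuronIntervention(_GenerateProxy(model), concept_neurons)
    intervened_generations = [
        intervener.intervene(prompt, trait, 'enhance', strength=0.5)
        for prompt in prompts
    ]
    intervened_dist = Counter(extract_personality_labels(intervened_generations))

    divergence = kl_divergence(baseline_dist, intervened_dist)
    return {
        'baseline_distribution': baseline_dist,
        'intervened_distribution': intervened_dist,
        'divergence': divergence,
        # Difference in raw label counts for the target trait
        'target_trait_increase': intervened_dist[trait] - baseline_dist[trait]
    }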
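The `kl_divergence` helper called above is not defined in this document. One possible implementation over label counts is sketched below; the additive smoothing (so that a label missing from one distribution does not produce a division by zero) and the use of natural-log units are assumptions, not choices taken from the paper.

```python
import math
from collections import Counter


def kl_divergence(p_counts, q_counts, smoothing=1.0):
    """KL(P || Q) in nats between two label-count mappings, with additive smoothing."""
    labels = sorted(set(p_counts) | set(q_counts))
    if not labels:
        return 0.0
    p_total = sum(p_counts.values()) + smoothing * len(labels)
    q_total = sum(q_counts.values()) + smoothing * len(labels)
    divergence = 0.0
    for label in labels:
        p = (p_counts.get(label, 0) + smoothing) / p_total
        q = (q_counts.get(label, 0) + smoothing) / q_total
        divergence += p * math.log(p / q)
    return divergence


# Made-up label counts before and after an intervention
baseline = Counter({'openness': 2, 'neuroticism': 8})
intervened = Counter({'openness': 6, 'neuroticism': 4})
print(kl_divergence(baseline, intervened))
```

Identical distributions give a divergence of zero, and the value grows as the intervened label distribution moves away from the baseline.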
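Key Findings
Finding 1: Layer-wise information dynamics
Big Five decoding accuracy by layer:
Layers 0-5: 65-75% (decodable early, in shallow layers)
Layers 6-15: 75-85% (consistently high in middle layers)
Layers 16-24: 70-80% (slightly lower in deep layers, still decodable)
Conclusion: personality information emerges in shallow layers, peaks in the middle layers, and persists in deep layers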
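Finding 2: Distribution of concept neurons
Selective neurons per trait:
Openness: 120 neurons (mainly layers 8-12)
Conscientiousness: 98 neurons (mainly layers 6-14)
Extraversion: 135 neurons (mainly layers 7-13)
Agreeableness: 87 neurons (mainly layers 9-15)
Neuroticism: 142 neurons (mainly layers 8-14)
Cross-concept overlap: <15% (relatively independent)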
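The document does not say how cross-concept overlap is computed. One common choice, used here purely as an assumption, is the Jaccard index over sets of (layer, neuron) pairs; the neuron sets in this sketch are made up.

```python
from itertools import combinations


def jaccard_overlap(neurons_a, neurons_b):
    """Jaccard overlap between two sets of (layer, neuron) pairs."""
    a, b = set(neurons_a), set(neurons_b)
    union = a | b
    return len(a & b) / len(union) if union else 0.0


# Made-up neuron sets keyed by trait; each neuron is a (layer, index) pair
neuron_sets = {
    'openness': {(8, 10), (9, 42), (10, 7), (11, 3)},
    'extraversion': {(9, 42), (10, 99), (12, 5)},
    'neuroticism': {(13, 1), (14, 2)},
}

overlaps = {
    (a, b): jaccard_overlap(neuron_sets[a], neuron_sets[b])
    for a, b in combinations(neuron_sets, 2)
}
for pair, value in overlaps.items():
    print(pair, round(value, 3))
```

With the dictionaries returned by `identify_concept_neurons`, the sets could be built as `{(n['layer'], n['neuron']) for n in neurons}`.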
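Finding 3: Gap between representation control and behavioral control
Intervention success:
Representation-level control (probe readout):
- Mean success rate: 82.4%
- Strongest concept (Openness): 89.2%
- Weakest concept (Agreeableness): 75.8%
Generation-level control (label distribution):
- Mean bias: +12.3%
- Largest bias: +21.5% (Neuroticism)
- Smallest bias: +5.2% (Agreeableness)
Key takeaway: successfully steering internal representations does not imply an equal degree of behavioral control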
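Finding 4: Cross-trait spillover
Side effects of enhancing Openness:
- Conscientiousness: -8.2% (negative association)
- Extraversion: +4.5% (positive association)
- Agreeableness: -2.1%
- Neuroticism: +6.3%
Conclusion: personality traits are correlated at the neuron level, so intervening on one trait affects the others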
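Spillover of this kind can be quantified by comparing label shares before and after an intervention and separating the target trait from the rest. The sketch below is one way to do that under the assumption that the percentages are changes in label share (percentage points); the document does not state how its figures were computed, `trait_share_changes` is a hypothetical helper, and the label lists are made up.

```python
from collections import Counter


def trait_share_changes(baseline_labels, intervened_labels, target):
    """Change in each label's share, in percentage points, split into target and off-target."""
    base, inter = Counter(baseline_labels), Counter(intervened_labels)
    changes = {}
    for trait in sorted(set(base) | set(inter)):
        base_share = base[trait] / len(baseline_labels)
        inter_share = inter[trait] / len(intervened_labels)
        changes[trait] = 100 * (inter_share - base_share)
    target_change = changes.pop(target, 0.0)
    return target_change, changes


# Made-up generated labels before and after enhancing openness
baseline = ['openness'] * 2 + ['conscientiousness'] * 4 + ['neuroticism'] * 4
intervened = ['openness'] * 5 + ['conscientiousness'] * 2 + ['neuroticism'] * 3
target_change, spillover = trait_share_changes(baseline, intervened, 'openness')
print(round(target_change, 1), {k: round(v, 1) for k, v in spillover.items()})
```

Because label shares sum to one, a gain for the target trait necessarily shows up as losses elsewhere; spillover is informative when those losses are unevenly distributed across the non-target traits.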
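Applications
1. LLM personality customization
- Role-play: generate consistent personality traits for a given character
- Conversational agents: adjust the personality of customer-service or educational AI
- Creative writing: generate character dialogue with a specific personality
2. Model behavior research
- Bias analysis: identify latent personality-related biases
- Safety evaluation: test whether a model can easily be steered into exhibiting harmful personalities
- Alignment research: understand how values are represented in the model
3. Human-computer interaction
- Personalization: adapt the model's response style to user preferences
- Emotional support: strengthen empathy and reduce neurotic traits
- Tutoring: adapt to different students' preferred learning styles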
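Ethical Considerations
⚠️ Important warnings:
- Personality steering can produce unpredictable behavior
- Excessive intervention can make model output inconsistent
- Commercial deployments need to consider informed user consent
- Do not use it to deceive or manipulate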
Relation to Other Work
- Representation Engineering: Zou et al. (2023)
- Logit Lens: nostalgebraist
- Activation Patching: Redwood Research
- Constitutional AI: Anthropic
- Big Five in LLMs: applications of personality psychology to AI
Citations
@article{harada2026psychological,
title={Psychological Concept Neurons: Can Neural Control Bias Probing and Shift Generation in LLMs?},
author={Harada, Yuto and Hamada, Hiro Taiyo},
journal={arXiv preprint arXiv:2604.11802},
year={2026}
}
@article{zou2023representation,
title={Representation Engineering: A Top-Down Approach to {AI} Transparency},
author={Zou, Andy and Phan, Long and Chen, Sarah and others},
journal={arXiv preprint arXiv:2310.01405},
year={2023}
}
Related Skills
- representation-steering: representation steering
- llm-alignment-analysis: LLM alignment analysis
- activation-patching: activation patching
- brain-inspired-memory-ai-agents: brain-inspired memory for AI agents
Last updated: 2026-04-15