
llm-concept-neurons-control

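A methodology for identifying and controlling psychological concept neurons in LLMs. It uses probing analysis to identify Big Five personality concept neurons, then intervenes to enhance or suppress their activations in order to bias LLM generation. It reveals a gap between representation control and behavioral control. Activation: LLM concept neurons, neural control, Big Five, representation steering, activation intervention, psychological constructs.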


Install command

npx skills add https://github.com/hiyenwong/ai_collection --skill llm-concept-neurons-control

Copy and paste this command into Claude Code to install the skill

Source

hiyenwong/ai_collection

Stars: 1

Forks: 0

Updated: June 4, 2026 at 02:00


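Psychological Concept Neuron Control in LLMs

Description

A methodology for identifying and controlling psychological concept neurons in LLMs, based on the Big Five personality traits. The skill uses probing analysis to identify neurons that respond selectively to specific psychological concepts, then enhances or suppresses their activations to control the LLM's latent representations and generated output. The research reveals a gap between representation control (probe readability) and behavioral control (label generation).

Source paper:

arXiv:2604.11802v1 (2026-04-13)
Authors: Yuto Harada, Hiro Taiyo Hamada
Field: cs.CL (natural language processing / computational psychology)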

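Core Concepts

1. Psychological Concept Neurons

Specific psychological constructs (such as the Big Five personality traits) are encoded inside an LLM as distributed sets of neurons:

Selective response: they activate on inputs related to a specific concept
Cross-layer distribution: they are concentrated mainly in the middle layers
Limited overlap: neuron sets for different concepts overlap little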

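2. Probing Analysis

Linear classifiers probe the LLM's internal representations for encoded psychological concept information:

Early-layer decoding: concept information is already decodable in shallow layers
Persistence: the information persists into deep layers
Causal intervention: intervening on neurons changes the representation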

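3. Intervention Control

LLM output is controlled by enhancing or suppressing concept-neuron activations:

Representation level: probe readouts shift successfully (average success rate above 80%)
Generation level: the label distribution becomes biased, but the effect is weaker
Spillover effects: intervening on one trait also changes non-target traits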
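The enhance/suppress operation can be illustrated on a toy network. This is a minimal sketch, not the paper's code: it assumes PyTorch, uses a two-layer MLP as a stand-in for an LLM, and introduces a hypothetical helper `scale_neurons` that multiplies selected hidden units by a factor through a forward hook.

```python
import torch
import torch.nn as nn


def scale_neurons(module, neuron_idx, factor):
    """Register a forward hook that multiplies selected output units by `factor`.

    factor > 1 enhances, 0 <= factor < 1 suppresses. Returns the hook handle.
    (Hypothetical helper for illustration only.)
    """
    def hook_fn(mod, inputs, output):
        modified = output.clone()
        modified[..., neuron_idx] *= factor
        return modified  # a non-None return value replaces the module's output
    return module.register_forward_hook(hook_fn)


torch.manual_seed(0)
toy = nn.Sequential(nn.Linear(4, 8), nn.ReLU(), nn.Linear(8, 3))
x = torch.randn(2, 4)

with torch.no_grad():
    baseline = toy(x)
    # Suppress hidden units 0 and 3 at the ReLU output
    handle = scale_neurons(toy[1], neuron_idx=[0, 3], factor=0.0)
    suppressed_hidden = toy[1](toy[0](x))
    intervened = toy(x)
    handle.remove()  # always remove hooks, or later calls stay modified
    restored = toy(x)
```

The hook stays active only between registration and `handle.remove()`, so any forward pass or generation that should be affected has to run inside that window.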

Activation Keywords

LLM concept neurons
neural control
Big Five personality
representation steering
activation intervention
psychological constructs
概念神经元 (concept neurons)
神经控制 (neural control)
人格操控 (personality manipulation)
LLM心理 (LLM psychology)

Methodology Framework

Step 1: Psychological Concept Data Preparation

Turning the Big Five into questionnaire-style prompts

BIG_FIVE_DIMENSIONS = {
    'openness': {
        'high': ['curious', 'creative', 'intellectual', 'imaginative'],
        'low': ['conventional', 'practical', 'down-to-earth']
    },
    'conscientiousness': {
        'high': ['organized', 'responsible', 'disciplined'],
        'low': ['spontaneous', 'careless', 'easy-going']
    },
    'extraversion': {
        'high': ['outgoing', 'energetic', 'talkative'],
        'low': ['reserved', 'quiet', 'introverted']
    },
    'agreeableness': {
        'high': ['cooperative', 'trusting', 'empathetic'],
        'low': ['competitive', 'skeptical', 'challenging']
    },
    'neuroticism': {
        'high': ['anxious', 'insecure', 'emotional'],
        'low': ['confident', 'calm', 'stable']
    }
}

# Build the prompt
def build_persona_prompt(trait, level):
    descriptors = BIG_FIVE_DIMENSIONS[trait][level]
    return f"You are a person who is {', '.join(descriptors)}. Describe yourself:"
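To train probes, the prompts need labels. The sketch below shows one possible way to expand a trait dictionary into (prompt, label) pairs; `make_labeled_prompts` and the one-descriptor-per-prompt scheme are assumptions for illustration, not something the source specifies, and `dims` is an abbreviated copy of the dictionary above.

```python
def make_labeled_prompts(dimensions):
    """Expand a trait dictionary into (prompt, (trait, level)) pairs.

    One prompt is produced per descriptor so that every trait/level
    combination contributes several samples. (Hypothetical helper.)
    """
    dataset = []
    for trait, levels in dimensions.items():
        for level, descriptors in levels.items():
            for descriptor in descriptors:
                prompt = f"You are a person who is {descriptor}. Describe yourself:"
                dataset.append((prompt, (trait, level)))
    return dataset


# Abbreviated trait dictionary, same shape as BIG_FIVE_DIMENSIONS
dims = {
    'extraversion': {'high': ['outgoing', 'energetic'], 'low': ['reserved', 'quiet']},
    'neuroticism': {'high': ['anxious', 'insecure'], 'low': ['confident', 'calm']},
}
dataset = make_labeled_prompts(dims)
```

A realistic probe dataset would likely need many more paraphrases per trait than this toy expansion provides.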

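Step 2: Concept Neuron Identification

Layer-wise probe training

import numpy as np
import torch
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split


class ConceptProber:
    """Layer-wise linear probes. Assumes a Hugging Face-style model and tokenizer."""

    def __init__(self, model, tokenizer):
        self.model = model
        self.tokenizer = tokenizer
        self.probes = {}

    def train_layer_probe(self, layer_idx, hidden_states, labels):
        """
        Train a probe for a single layer.
        Args:
            hidden_states: [N, D] array of hidden states
            labels: [N] array of concept labels
        Returns:
            (probe, held-out accuracy)
        """
        X_train, X_test, y_train, y_test = train_test_split(
            hidden_states, labels, test_size=0.2, stratify=labels, random_state=0
        )
        probe = LogisticRegression(max_iter=1000)
        probe.fit(X_train, y_train)
        self.probes[layer_idx] = probe

        # Evaluate on held-out data; training accuracy would overstate decodability
        accuracy = probe.score(X_test, y_test)
        return probe, accuracy

    @torch.no_grad()
    def analyze_concept_decodability(self, dataset):
        """
        Measure how decodable the concept is at each layer.
        """
        per_layer, labels = {}, []

        # One forward pass per text yields the hidden states of every layer
        for text, label in dataset:
            inputs = self.tokenizer(text, return_tensors="pt").to(self.model.device)
            outputs = self.model(**inputs, output_hidden_states=True)
            # hidden_states[0] is the embedding output; index i > 0 is block i - 1
            for layer, h in enumerate(outputs.hidden_states):
                per_layer.setdefault(layer, []).append(h[0, -1, :].float().cpu().numpy())  # last token
            labels.append(label)

        labels = np.asarray(labels)
        layer_accuracies = {}

        # Train one probe per layer
        for layer, states in per_layer.items():
            _, acc = self.train_layer_probe(layer, np.stack(states), labels)
            layer_accuracies[layer] = acc

        return layer_accuracies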

Identifying concept-selective neurons

import torch


@torch.no_grad()
def identify_concept_neurons(model, tokenizer, concept_dataset, concept, threshold=0.5):
    """
    Identify neurons that respond selectively to a specific concept.

    Assumes a Hugging Face-style model and tokenizer. A "neuron" here is one
    dimension of a layer's hidden state.

    Args:
        concept_dataset: iterable of (text, label); label 1 marks a concept
            example, anything else a control example
        concept: concept name stored with each neuron (e.g. 'openness')
        threshold: minimum |Cohen's d| for a neuron to count as selective

    Returns:
        selective_neurons: list of dicts with keys
            'concept', 'layer', 'neuron', 'effect_size', 'direction'
    """
    activations = {1: {}, 0: {}}  # group -> layer -> list of [D] vectors

    # Collect token-averaged activations for concept and control samples
    for text, label in concept_dataset:
        group = 1 if label == 1 else 0
        inputs = tokenizer(text, return_tensors="pt").to(model.device)
        outputs = model(**inputs, output_hidden_states=True)
        for layer, h in enumerate(outputs.hidden_states):
            activations[group].setdefault(layer, []).append(h[0].mean(dim=0).float().cpu())

    selective_neurons = []

    for layer in activations[1]:
        concept_acts = torch.stack(activations[1][layer])
        control_acts = torch.stack(activations[0][layer])

        # Cohen's d effect size per neuron (needs at least two samples per group)
        pooled_std = torch.sqrt((concept_acts.var(dim=0) + control_acts.var(dim=0)) / 2)
        cohens_d = (concept_acts.mean(dim=0) - control_acts.mean(dim=0)) / (pooled_std + 1e-8)

        # Keep neurons whose selectivity exceeds the threshold
        for neuron_idx in torch.nonzero(cohens_d.abs() > threshold).flatten().tolist():
            d = cohens_d[neuron_idx].item()
            selective_neurons.append({
                'concept': concept,
                'layer': layer,
                'neuron': neuron_idx,
                'effect_size': d,
                'direction': 'positive' if d > 0 else 'negative'
            })

    return selective_neurons
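The selectivity score used above is Cohen's d with an equal-weight pooled standard deviation. A worked example on a single neuron may make the arithmetic easier to check; this sketch uses only the Python standard library, and the activation values are made up for illustration.

```python
from math import sqrt
from statistics import mean, variance


def cohens_d(concept_values, control_values, eps=1e-8):
    """Cohen's d for one neuron, pooling the two sample variances equally."""
    pooled_std = sqrt((variance(concept_values) + variance(control_values)) / 2)
    return (mean(concept_values) - mean(control_values)) / (pooled_std + eps)


# Toy activations of one neuron on concept vs. control inputs (illustrative values)
concept = [2.0, 3.0, 4.0]   # mean 3.0, sample variance 1.0
control = [0.0, 1.0, 2.0]   # mean 1.0, sample variance 1.0

d = cohens_d(concept, control)   # (3.0 - 1.0) / 1.0, roughly 2.0
is_selective = abs(d) > 0.5      # same threshold as the default above
```

`statistics.variance` is the sample variance, which matches the default behavior of `torch.std` in the function above.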

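Step 3: Neuron Intervention

Enhancing/suppressing activations

import torch
from collections import Counter
from contextlib import contextmanager


class NeuronIntervention:
    """
    Assumes a Hugging Face-style model and tokenizer. `layers` is the model's
    list of transformer blocks (for example model.model.layers in Llama-style
    models); it is passed explicitly because the attribute path varies.
    """

    def __init__(self, model, tokenizer, layers, concept_neurons):
        self.model = model
        self.tokenizer = tokenizer
        self.layers = layers
        self.concept_neurons = concept_neurons  # dicts with 'concept', 'layer', 'neuron'

    @staticmethod
    def _make_hook(neurons, direction, strength):
        def hook_fn(module, inputs, output):
            # Transformer blocks often return a tuple whose first item is the hidden state
            is_tuple = isinstance(output, tuple)
            modified = (output[0] if is_tuple else output).clone()
            # Modify the activations of the selected neurons
            if direction == 'enhance':
                modified[..., neurons] += strength * modified[..., neurons].abs()
            else:  # suppress
                modified[..., neurons] *= (1 - strength)
            return (modified, *output[1:]) if is_tuple else modified
        return hook_fn

    @contextmanager
    def active(self, target_concept, direction='enhance', strength=1.0):
        """Keep the intervention hooks registered for the duration of a `with` block."""
        by_layer = {}
        for cn in self.concept_neurons:
            # Layer 0 is the embedding output and has no block to hook
            if cn['concept'] == target_concept and cn['layer'] > 0:
                by_layer.setdefault(cn['layer'], []).append(cn['neuron'])

        # hidden_states[i] is the output of block i - 1
        hooks = [
            self.layers[layer - 1].register_forward_hook(
                self._make_hook(neurons, direction, strength))
            for layer, neurons in by_layer.items()
        ]
        try:
            yield
        finally:
            # Remove hooks even if the forward pass raises
            for hook in hooks:
                hook.remove()

    def intervene(self, text, target_concept, direction='enhance', strength=1.0):
        """
        Run one forward pass while intervening on a concept's neurons.

        Args:
            text: input text
            target_concept: target concept
            direction: 'enhance' or 'suppress'
            strength: intervention strength
        """
        inputs = self.tokenizer(text, return_tensors="pt").to(self.model.device)
        with self.active(target_concept, direction, strength), torch.no_grad():
            return self.model(**inputs, output_hidden_states=True)

Evaluating probe readout shift

def last_token_state(outputs, layer):
    """Last-token hidden state of one layer as a [1, D] NumPy array."""
    return outputs.hidden_states[layer][:, -1, :].float().cpu().numpy()


def evaluate_probe_shift(intervener, test_texts, probe, probe_layer, target_concept):
    """
    Evaluate whether the intervention shifts probe readouts.

    Args:
        probe_layer: index into outputs.hidden_states the probe was trained on;
            it should be at or after the intervened layers to see any effect

    Returns:
        success_rate: among texts not already classified as the target concept,
            the fraction whose probe prediction becomes the target concept
    """
    model, tokenizer = intervener.model, intervener.tokenizer
    shifted, eligible = 0, 0

    for text in test_texts:
        # Baseline
        inputs = tokenizer(text, return_tensors="pt").to(model.device)
        with torch.no_grad():
            baseline = model(**inputs, output_hidden_states=True)
        baseline_pred = probe.predict(last_token_state(baseline, probe_layer))[0]

        # After intervention (hidden states are captured while hooks are active)
        intervened = intervener.intervene(text, target_concept, 'enhance')
        intervened_pred = probe.predict(last_token_state(intervened, probe_layer))[0]

        if baseline_pred != target_concept:
            eligible += 1
            shifted += int(intervened_pred == target_concept)

    # Success rate
    return shifted / eligible if eligible else 0.0

Step 4: Generation Control Evaluation

Label generation bias

def generate_text(model, tokenizer, prompt, max_new_tokens=100):
    """Generate a continuation and return it without the prompt."""
    inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
    with torch.no_grad():
        output_ids = model.generate(**inputs, max_new_tokens=max_new_tokens)
    prompt_len = inputs["input_ids"].shape[1]
    return tokenizer.decode(output_ids[0, prompt_len:], skip_special_tokens=True)


def evaluate_generation_bias(intervener, prompts, trait, strength=0.5):
    """
    Evaluate how neuron intervention affects the distribution of generated labels.

    extract_personality_labels and kl_divergence are helpers that this
    document does not define.
    """
    model, tokenizer = intervener.model, intervener.tokenizer

    # Baseline generation and label distribution
    baseline_generations = [generate_text(model, tokenizer, p) for p in prompts]
    baseline_dist = Counter(extract_personality_labels(baseline_generations))

    # Generation under intervention: hooks must stay registered during generate()
    with intervener.active(trait, 'enhance', strength):
        intervened_generations = [generate_text(model, tokenizer, p) for p in prompts]
    intervened_dist = Counter(extract_personality_labels(intervened_generations))

    # KL divergence (a chi-square statistic would also work)
    divergence = kl_divergence(baseline_dist, intervened_dist)

    return {
        'baseline_distribution': baseline_dist,
        'intervened_distribution': intervened_dist,
        'divergence': divergence,
        'target_trait_increase': intervened_dist[trait] - baseline_dist[trait]
    }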
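`kl_divergence` is called above but never defined in the source. One possible implementation over label `Counter`s is sketched below; the add-alpha smoothing and its default value are assumptions made here so that labels missing from one side do not produce a division by zero.

```python
from collections import Counter
from math import log


def kl_divergence(p_counts, q_counts, alpha=1.0):
    """KL(P || Q) in nats between two label Counters, with add-alpha smoothing."""
    labels = sorted(set(p_counts) | set(q_counts))
    p_total = sum(p_counts.values()) + alpha * len(labels)
    q_total = sum(q_counts.values()) + alpha * len(labels)

    kl = 0.0
    for label in labels:
        # Counter returns 0 for labels it has never seen
        p = (p_counts[label] + alpha) / p_total
        q = (q_counts[label] + alpha) / q_total
        kl += p * log(p / q)
    return kl


# Toy label counts (illustrative values, not results from the paper)
baseline = Counter({'openness': 10, 'neuroticism': 10})
intervened = Counter({'openness': 16, 'neuroticism': 4})

same = kl_divergence(baseline, baseline)
shifted = kl_divergence(baseline, intervened)
```

With the argument order used in `evaluate_generation_bias`, this measures how poorly the intervened distribution explains the baseline one; swapping the arguments gives a different value because KL divergence is not symmetric.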

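Key Findings

Finding 1: Layer-wise information dynamics

Decoding accuracy for Big Five information (by layer):

Layers 0-5:   65-75%  decodable quickly in early layers
Layers 6-15:  75-85%  sustained high accuracy in middle layers
Layers 16-24: 70-80%  slight drop in deep layers, still decodable

Conclusion: Personality information emerges in shallow layers, peaks in the middle layers, and persists into deep layers.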

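Finding 2: Concept neuron distribution

Distribution of selective neurons:

Openness:          120 neurons (mainly layers 8-12)
Conscientiousness:  98 neurons (mainly layers 6-14)
Extraversion:      135 neurons (mainly layers 7-13)
Agreeableness:      87 neurons (mainly layers 9-15)
Neuroticism:       142 neurons (mainly layers 8-14)

Cross-concept overlap: <15% (relatively independent)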
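The source does not say how cross-concept overlap is measured. As one plausible reading, the sketch below computes the Jaccard overlap between sets of (layer, neuron) pairs; the metric choice, the helper name `pairwise_overlap`, and the toy neuron sets are all assumptions.

```python
from itertools import combinations


def pairwise_overlap(neuron_sets):
    """Jaccard overlap between every pair of concept neuron sets.

    neuron_sets maps a concept name to a set of (layer, neuron) tuples.
    """
    overlaps = {}
    for (name_a, set_a), (name_b, set_b) in combinations(neuron_sets.items(), 2):
        union = set_a | set_b
        overlaps[(name_a, name_b)] = len(set_a & set_b) / len(union) if union else 0.0
    return overlaps


# Toy neuron sets (illustrative, not taken from the paper)
neuron_sets = {
    'openness': {(8, 10), (9, 42), (10, 7)},
    'extraversion': {(9, 42), (11, 3)},
}
overlaps = pairwise_overlap(neuron_sets)
# One shared neuron out of four distinct ones
```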

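Finding 3: Gap between representation control and behavioral control

Intervention success rates:

Representation-level control (probe readout):
  - Average success rate: 82.4%
  - Strongest concept (Openness): 89.2%
  - Weakest concept (Agreeableness): 75.8%

Generation-level control (label distribution):
  - Average bias: +12.3%
  - Largest bias: +21.5% (Neuroticism)
  - Smallest bias: +5.2% (Agreeableness)

Key finding: Successfully manipulating internal representations does not imply an equal degree of behavioral control.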

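Finding 4: Cross-trait spillover effects

Side effects of enhancing Openness:
  - Conscientiousness: -8.2% (negative correlation)
  - Extraversion:      +4.5% (positive correlation)
  - Agreeableness:     -2.1%
  - Neuroticism:       +6.3%

Conclusion: Personality traits are correlated at the neural level, so intervening on a single trait affects the others.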
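Spillover can be quantified by comparing each label's share of generations before and after an intervention. The sketch below assumes the changes are expressed in percentage points, which the source does not state; `trait_shifts` is a hypothetical helper and the counts are toy values, not the paper's results.

```python
from collections import Counter


def trait_shifts(baseline, intervened):
    """Change in each label's share of generations, in percentage points."""
    b_total = sum(baseline.values()) or 1
    i_total = sum(intervened.values()) or 1
    labels = sorted(set(baseline) | set(intervened))
    return {
        label: 100.0 * (intervened[label] / i_total - baseline[label] / b_total)
        for label in labels
    }


# Toy label counts over 100 generations each (illustrative only)
baseline = Counter({'openness': 20, 'conscientiousness': 30, 'neuroticism': 50})
intervened = Counter({'openness': 40, 'conscientiousness': 20, 'neuroticism': 40})

shifts = trait_shifts(baseline, intervened)
# Non-target labels with a non-zero shift indicate spillover
spillover = {k: v for k, v in shifts.items() if k != 'openness' and abs(v) > 1e-9}
```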

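Application Scenarios

1. LLM personality customization

Role-play: generate consistent personality traits for a specific character
Conversational agents: adjust the personality of customer-service and educational AI
Creative writing: generate character dialogue with a specific personality

2. Model behavior research

Bias analysis: identify latent personality-related biases
Safety evaluation: test whether a model is easily manipulated into displaying a dangerous personality
Alignment research: understand how values are represented in the model

3. Human-computer interaction optimization

Personalization: adjust the model's response style to user preferences
Emotional support: strengthen empathy and reduce neurotic traits
Educational tutoring: adapt to different students' learning-style preferences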

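Ethical Considerations

⚠️ Important warnings:

Personality manipulation may produce unpredictable behavior
Excessive intervention may make model output inconsistent
Commercial deployment must consider informed user consent
Avoid use for deception or manipulation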

Related Work

Representation Engineering: Zou et al. (2023)
Logit Lens: nostalgebraist
Activation Patching: Redwood Research
Constitutional AI: Anthropic
Big Five in LLMs: applications of personality psychology in AI

Citation

@article{harada2026psychological,
  title={Psychological Concept Neurons: Can Neural Control Bias Probing and Shift Generation in LLMs?},
  author={Harada, Yuto and Hamada, Hiro Taiyo},
  journal={arXiv preprint arXiv:2604.11802},
  year={2026}
}

@article{zou2023representation,
  title={Representation Engineering: A Top-Down Approach to {AI} Transparency},
  author={Zou, Andy and Phan, Long and Chen, Sarah and others},
  journal={arXiv preprint arXiv:2310.01405},
  year={2023}
}
