locomo-benchmark

Stars: 2

Forks: 3

Updated: January 20, 2026, 18:09

Run LoCoMo benchmark for long-term conversational memory

Source

genomewalker

genomewalker/cc-soul

LoCoMo Benchmark

Evaluate cc-soul's memory against the LoCoMo benchmark (ACL 2024) for long-term conversational memory.

Quick Start

Run the benchmark script:

# Test one conversation (default: conv-26)
python3 $PLUGIN_DIR/scripts/locomo-benchmark.py

# Test specific conversations
python3 $PLUGIN_DIR/scripts/locomo-benchmark.py conv-26 conv-30

# Full benchmark (all 10 conversations)
python3 $PLUGIN_DIR/scripts/locomo-benchmark.py --full

# Limit QA pairs per conversation
python3 $PLUGIN_DIR/scripts/locomo-benchmark.py --max-qa 20

Here $PLUGIN_DIR is the cc-soul repository root (/maps/projects/fernandezguerra/apps/repos/cc-soul in the author's setup) or the path of the installed plugin.
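The invocations above imply a small command-line interface: optional conversation IDs, a --full switch, and a --max-qa limit. The following is a minimal sketch of how such an interface could be parsed with argparse. It is not taken from locomo-benchmark.py; the function name build_parser and the destination names are assumptions made for illustration.

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    """Hypothetical parser mirroring the flags shown in Quick Start."""
    parser = argparse.ArgumentParser(
        prog="locomo-benchmark.py",
        description="Run LoCoMo benchmark for long-term conversational memory",
    )
    # Zero or more conversation IDs; falls back to conv-26 when none are given.
    parser.add_argument(
        "conversations",
        nargs="*",
        default=["conv-26"],
        help="conversation IDs to test (default: conv-26)",
    )
    # Run all 10 conversations instead of a selection.
    parser.add_argument("--full", action="store_true", help="run the full benchmark")
    # Cap the number of QA pairs evaluated per conversation.
    parser.add_argument("--max-qa", type=int, default=None, help="limit QA pairs per conversation")
    return parser


if __name__ == "__main__":
    print(build_parser().parse_args())
```

With this sketch, running the parser on no arguments selects conv-26 only, matching the default documented above.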

What the Script Does

1. Downloads LoCoMo data from GitHub to /tmp/locomo/ (if not present)
2. Ingests conversations into cc-soul memory:
   - Extracts session summaries as observations
   - Creates triplets for speaker facts
   - Tags with sample_id for retrieval
3. Evaluates QA pairs:
   - Retrieves context using chitta recall --tag {sample_id}
   - Calculates F1 score vs ground truth
4. Reports results by category
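The F1 step can be illustrated with a token-overlap F1 of the kind commonly used for QA evaluation. This is a sketch under assumptions: the source does not state how the script normalizes answers, so the lowercasing, punctuation stripping, and article removal below are illustrative choices, and the names normalize and token_f1 are hypothetical.

```python
import re
import string
from collections import Counter


def normalize(text: str) -> list[str]:
    """Lowercase, drop punctuation and English articles, split on whitespace.

    Assumed normalization; the actual script may differ.
    """
    text = text.lower()
    text = "".join(ch for ch in text if ch not in string.punctuation)
    text = re.sub(r"\b(a|an|the)\b", " ", text)
    return text.split()


def token_f1(prediction: str, ground_truth: str) -> float:
    """Harmonic mean of token precision and recall between two answers."""
    pred = normalize(prediction)
    gold = normalize(ground_truth)
    if not pred or not gold:
        # Both empty counts as a match; one empty counts as a miss.
        return float(pred == gold)
    # Multiset intersection counts each shared token at most as often
    # as it appears in both answers.
    overlap = sum((Counter(pred) & Counter(gold)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred)
    recall = overlap / len(gold)
    return 2 * precision * recall / (precision + recall)
```

For example, a prediction of "7 May 2023" against a ground truth of "May 2023" shares two of three predicted tokens and both gold tokens, giving precision 2/3, recall 1, and F1 of 0.8.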

Baseline Scores (from paper)

Model	F1
Human ceiling	87.9%
AutoMem	90.5%
GPT-4	32.1%
GPT-3.5	23.7%
Mistral-7B	13.9%

Data

Repository: https://github.com/snap-research/locomo
Local cache: /tmp/locomo/data/locomo10.json
10 conversations, ~200 QA pairs each, ~35 sessions per conversation
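To check the cached file against the figures above, something like the following could be used. It is a sketch that assumes locomo10.json is a JSON list of conversation objects, each with a "sample_id" string and a "qa" list; that layout is an assumption and should be verified against the downloaded file. The helper names load_samples and qa_counts are hypothetical.

```python
import json
from pathlib import Path


def load_samples(path: Path) -> list[dict]:
    """Read the benchmark file; assumes a top-level JSON list."""
    with path.open(encoding="utf-8") as fh:
        return json.load(fh)


def qa_counts(samples: list[dict]) -> dict[str, int]:
    """Map each sample_id to its number of QA pairs.

    Assumes each sample has "sample_id" and, optionally, a "qa" list.
    """
    return {s["sample_id"]: len(s.get("qa", [])) for s in samples}
```

Calling qa_counts(load_samples(Path("/tmp/locomo/data/locomo10.json"))) should then return one entry per conversation, if the assumed layout holds.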

Manual Execution

If you prefer to run manually:

# Ensure data exists (clone only if /tmp/locomo is missing)
[ -d /tmp/locomo ] || git clone https://github.com/snap-research/locomo /tmp/locomo

# Run benchmark
python3 /maps/projects/fernandezguerra/apps/repos/cc-soul/scripts/locomo-benchmark.py conv-26

Expected Output

=== LoCoMo Benchmark Results ===

Total QA Pairs: 50
Overall F1: XX.X%

By Category:
  Multi-hop (n=XX): XX.X%
  Single-hop (n=XX): XX.X%
  Temporal (n=XX): XX.X%
  Open-domain (n=XX): XX.X%
  Adversarial (n=XX): XX.X%

Per Conversation:
  conv-26: XX.X% (50 QA)

Comparison (from paper):
  Human ceiling: 87.9%
  GPT-4 baseline: 32.1%
  cc-soul: XX.X%

name	locomo-benchmark
description	Run LoCoMo benchmark for long-term conversational memory
execution	inline
model	inherit
aliases	["locomo","benchmark-memory"]

Categories

Cat	Name	Description
1	Multi-hop	Requires connecting multiple facts
2	Single-hop	Direct fact retrieval
3	Temporal	Date/time questions
4	Open-domain	General knowledge
5	Adversarial	Should answer "no information"
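Per-category reporting can be sketched as grouping per-question F1 scores by the category number in the table above and averaging each group. This is an illustrative sketch, not the script's code: the names CATEGORY_NAMES, mean_f1_by_category, and format_report are hypothetical, and the input is assumed to be a sequence of (category, f1) pairs.

```python
from collections import defaultdict

# Category numbers and names as listed in the table above.
CATEGORY_NAMES = {
    1: "Multi-hop",
    2: "Single-hop",
    3: "Temporal",
    4: "Open-domain",
    5: "Adversarial",
}


def mean_f1_by_category(results):
    """Group (category, f1) pairs and return {name: (count, mean_f1)}."""
    buckets = defaultdict(list)
    for category, f1 in results:
        buckets[category].append(f1)
    return {
        CATEGORY_NAMES.get(category, f"Category {category}"): (
            len(scores),
            sum(scores) / len(scores),
        )
        for category, scores in sorted(buckets.items())
    }


def format_report(summary):
    """Render one indented line per category, with F1 as a percentage."""
    return "\n".join(
        f"  {name} (n={count}): {mean * 100:.1f}%"
        for name, (count, mean) in summary.items()
    )
```

Reporting the count next to each mean makes it easier to see when a category score rests on only a handful of questions, for instance when --max-qa is small.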
