
locomo-benchmark

Stars: 2

Forks: 3

Updated: 20 January 2026, 18:09

Run LoCoMo benchmark for long-term conversational memory


Source: genomewalker/cc-soul


SKILL.md


name	locomo-benchmark
description	Run LoCoMo benchmark for long-term conversational memory
execution	inline
model	inherit
aliases	["locomo","benchmark-memory"]

LoCoMo Benchmark

Evaluate cc-soul's memory against the LoCoMo benchmark (ACL 2024) for long-term conversational memory.

Quick Start

Run the benchmark script:

# Test one conversation (default: conv-26)
python3 $PLUGIN_DIR/scripts/locomo-benchmark.py

# Test specific conversations
python3 $PLUGIN_DIR/scripts/locomo-benchmark.py conv-26 conv-30

# Full benchmark (all 10 conversations)
python3 $PLUGIN_DIR/scripts/locomo-benchmark.py --full

# Limit QA pairs per conversation
python3 $PLUGIN_DIR/scripts/locomo-benchmark.py --max-qa 20

Here, $PLUGIN_DIR is /maps/projects/fernandezguerra/apps/repos/cc-soul (or the path where the plugin is installed).

What the Script Does

1. Downloads LoCoMo data from GitHub to /tmp/locomo/ (if not present)
2. Ingests conversations into cc-soul memory:
   - Extracts session summaries as observations
   - Creates triplets for speaker facts
   - Tags with sample_id for retrieval
3. Evaluates QA pairs:
   - Retrieves context using chitta recall --tag {sample_id}
   - Calculates F1 score vs ground truth
4. Reports results by category
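
The F1 step above can be illustrated with a token-overlap score. This is a minimal sketch, assuming the common SQuAD-style normalization (lowercasing, dropping punctuation and English articles); the exact normalization used by locomo-benchmark.py is not stated here, and the names normalize and token_f1 are hypothetical.

```python
import re
import string
from collections import Counter


def normalize(text: str) -> list[str]:
    """Lowercase, drop punctuation and English articles, split on whitespace."""
    text = text.lower()
    text = "".join(ch for ch in text if ch not in string.punctuation)
    text = re.sub(r"\b(a|an|the)\b", " ", text)
    return text.split()


def token_f1(prediction: str, ground_truth: str) -> float:
    """Harmonic mean of token precision and recall (assumed scoring rule)."""
    pred = normalize(prediction)
    gold = normalize(ground_truth)
    if not pred or not gold:
        # Both empty counts as a match; one empty counts as a miss.
        return float(pred == gold)
    # Multiset intersection counts each shared token at most as often
    # as it appears in both answers.
    overlap = sum((Counter(pred) & Counter(gold)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred)
    recall = overlap / len(gold)
    return 2 * precision * recall / (precision + recall)
```

A prediction that contains the full ground truth plus extra tokens keeps perfect recall but loses precision, so overly long answers are penalized.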

Baseline Scores (from paper)

Model	F1
Human ceiling	87.9%
AutoMem	90.5%
GPT-4	32.1%
GPT-3.5	23.7%
Mistral-7B	13.9%

Data

Repository: https://github.com/snap-research/locomo
Local cache: /tmp/locomo/data/locomo10.json
10 conversations, ~200 QA pairs each, ~35 sessions per conversation
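
To check the cached file against the figures above, a short script can count QA pairs per conversation. This is a sketch under assumptions: it assumes locomo10.json is a JSON list of objects, each with a sample_id field and a qa list. The helper names load_samples and summarize are hypothetical and not part of cc-soul.

```python
import json
from pathlib import Path

# Local cache path taken from the Data section above.
CACHE = Path("/tmp/locomo/data/locomo10.json")


def load_samples(path: Path = CACHE) -> list[dict]:
    """Read the benchmark file (assumed to be a JSON list of conversations)."""
    return json.loads(path.read_text(encoding="utf-8"))


def summarize(samples: list[dict]) -> dict[str, int]:
    """Map each conversation's sample_id to its number of QA pairs.

    The "qa" key is an assumption about the file layout; a missing
    key is treated as zero QA pairs.
    """
    return {s["sample_id"]: len(s.get("qa", [])) for s in samples}


if __name__ == "__main__":
    if CACHE.exists():
        for sample_id, n_qa in summarize(load_samples()).items():
            print(f"{sample_id}: {n_qa} QA pairs")
    else:
        print(f"{CACHE} not found; clone the repository first")
```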

Manual Execution

If you prefer to run manually:

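# Ensure data exists (clone only if /tmp/locomo is missing; git clone fails on an existing non-empty directory)
[ -d /tmp/locomo ] || git clone https://github.com/snap-research/locomo /tmp/locomo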

# Run benchmark
python3 /maps/projects/fernandezguerra/apps/repos/cc-soul/scripts/locomo-benchmark.py conv-26

Expected Output

=== LoCoMo Benchmark Results ===

Total QA Pairs: 50
Overall F1: XX.X%

By Category:
  Multi-hop (n=XX): XX.X%
  Single-hop (n=XX): XX.X%
  Temporal (n=XX): XX.X%
  Open-domain (n=XX): XX.X%
  Adversarial (n=XX): XX.X%

Per Conversation:
  conv-26: XX.X% (50 QA)

Comparison (from paper):
  Human ceiling: 87.9%
  GPT-4 baseline: 32.1%
  cc-soul: XX.X%



Categories

Cat	Name	Description
1	Multi-hop	Requires connecting multiple facts
2	Single-hop	Direct fact retrieval
3	Temporal	Date/time questions
4	Open-domain	General knowledge
5	Adversarial	Should answer "no information"
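
The category numbers in the table can be used to group per-question scores into a per-category report. The sketch below is illustrative only: it assumes each scored question is available as a (category_id, f1) pair, and the names CATEGORY_NAMES, f1_by_category, and format_report are hypothetical.

```python
from collections import defaultdict

# Category ids and names as listed in the table above.
CATEGORY_NAMES = {
    1: "Multi-hop",
    2: "Single-hop",
    3: "Temporal",
    4: "Open-domain",
    5: "Adversarial",
}


def f1_by_category(scored):
    """Group (category_id, f1) pairs and return {name: (count, mean_f1)}."""
    buckets = defaultdict(list)
    for category_id, f1 in scored:
        buckets[category_id].append(f1)
    return {
        CATEGORY_NAMES.get(cat, f"Unknown ({cat})"): (len(vals), sum(vals) / len(vals))
        for cat, vals in sorted(buckets.items())
    }


def format_report(scored) -> str:
    """Render one line per category, e.g. 'Name (n=2): 50.0%'."""
    lines = ["By Category:"]
    for name, (n, mean) in f1_by_category(scored).items():
        lines.append(f"  {name} (n={n}): {mean:.1%}")
    return "\n".join(lines)
```

Categories with no scored questions are simply omitted, which avoids dividing by zero when a run is limited with --max-qa.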

