locomo-benchmark

Stars: 2

Forks: 3

Updated: January 20, 2026 at 18:09

Run LoCoMo benchmark for long-term conversational memory

Source

genomewalker

genomewalker/cc-soul

SKILL.md

name	locomo-benchmark
description	Run LoCoMo benchmark for long-term conversational memory
execution	inline
model	inherit
aliases	["locomo","benchmark-memory"]

LoCoMo Benchmark

Evaluate cc-soul's memory against the LoCoMo benchmark (ACL 2024) for long-term conversational memory.

Quick Start

Run the benchmark script:

# Test one conversation (default: conv-26)
python3 $PLUGIN_DIR/scripts/locomo-benchmark.py

# Test specific conversations
python3 $PLUGIN_DIR/scripts/locomo-benchmark.py conv-26 conv-30

# Full benchmark (all 10 conversations)
python3 $PLUGIN_DIR/scripts/locomo-benchmark.py --full

# Limit QA pairs per conversation
python3 $PLUGIN_DIR/scripts/locomo-benchmark.py --max-qa 20

Here $PLUGIN_DIR is the cc-soul repository root: /maps/projects/fernandezguerra/apps/repos/cc-soul in the author's setup, or the installed plugin path on other systems.
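The invocations above imply a small command-line interface: optional conversation IDs, a --full switch, and a --max-qa limit. As a rough illustration only, the sketch below shows how such arguments could be parsed with argparse. The function name build_parser and the attribute names are hypothetical; the real locomo-benchmark.py may be structured differently.

```python
import argparse


def build_parser():
    """Hypothetical parser mirroring the flags shown in Quick Start."""
    parser = argparse.ArgumentParser(prog="locomo-benchmark.py")
    # Zero or more conversation IDs; falls back to conv-26 when none are given.
    parser.add_argument("conversations", nargs="*", default=["conv-26"],
                        help="conversation IDs to evaluate, e.g. conv-26 conv-30")
    # Run all 10 conversations instead of the listed ones.
    parser.add_argument("--full", action="store_true",
                        help="run the full benchmark")
    # Optional cap on QA pairs per conversation (None means no cap).
    parser.add_argument("--max-qa", type=int, default=None,
                        help="limit QA pairs per conversation")
    return parser


if __name__ == "__main__":
    args = build_parser().parse_args()
    print(args.conversations, args.full, args.max_qa)
```

With no arguments, this sketch selects only conv-26, matching the default described above.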

What the Script Does

1. Downloads LoCoMo data from GitHub to /tmp/locomo/ (if not present)
2. Ingests conversations into cc-soul memory:
   - Extracts session summaries as observations
   - Creates triplets for speaker facts
   - Tags each entry with sample_id for retrieval
3. Evaluates QA pairs:
   - Retrieves context using chitta recall --tag {sample_id}
   - Calculates F1 score vs ground truth
4. Reports results by category
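The scoring step compares each predicted answer with the ground truth using F1. The script's exact normalization is not documented here, so the following is only a minimal sketch of the token-overlap F1 commonly used for QA evaluation (lowercasing, stripping punctuation and English articles). The names normalize and token_f1 are illustrative, not taken from the script.

```python
import re
import string
from collections import Counter


def normalize(text):
    """Lowercase, drop punctuation and articles, and split into tokens."""
    text = str(text).lower()  # answers may be stored as numbers
    text = "".join(ch for ch in text if ch not in string.punctuation)
    text = re.sub(r"\b(a|an|the)\b", " ", text)
    return text.split()


def token_f1(prediction, ground_truth):
    """Harmonic mean of token precision and recall (multiset overlap)."""
    pred = normalize(prediction)
    gold = normalize(ground_truth)
    if not pred or not gold:
        # Both empty counts as a match; one empty counts as a miss.
        return float(pred == gold)
    overlap = sum((Counter(pred) & Counter(gold)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred)
    recall = overlap / len(gold)
    return 2 * precision * recall / (precision + recall)


if __name__ == "__main__":
    print(token_f1("May 2023", "7 May 2023"))
```

In the example call, two of the three ground-truth tokens are recovered with perfect precision, so the score falls between 0 and 1 rather than being all-or-nothing.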

Baseline Scores (from paper)

Model	F1
Human ceiling	87.9%
AutoMem	90.5%
GPT-4	32.1%
GPT-3.5	23.7%
Mistral-7B	13.9%

Data

Repository: https://github.com/snap-research/locomo
Local cache: /tmp/locomo/data/locomo10.json
10 conversations, ~200 QA pairs each, up to 35 sessions per conversation
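To check these figures against the local cache, a short script can count QA pairs per conversation. This sketch assumes the JSON file is a top-level list of samples, each with a "sample_id" string and a "qa" list. That layout is an assumption about the dataset format and should be verified against the downloaded file.

```python
import json
from pathlib import Path

# Path taken from the "Local cache" entry above.
DEFAULT_PATH = Path("/tmp/locomo/data/locomo10.json")


def load_samples(path=DEFAULT_PATH):
    """Read the cached dataset (assumed: a JSON list of sample dicts)."""
    return json.loads(Path(path).read_text(encoding="utf-8"))


def qa_counts(samples):
    """Map each sample_id to its number of QA pairs (assumed keys)."""
    return {s["sample_id"]: len(s.get("qa", [])) for s in samples}


if __name__ == "__main__":
    if DEFAULT_PATH.exists():
        for sample_id, n in qa_counts(load_samples()).items():
            print(f"{sample_id}: {n} QA pairs")
    else:
        print(f"{DEFAULT_PATH} not found; run the benchmark once to download it")
```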

Manual Execution

If you prefer to run manually:

# Ensure data exists (clone only if /tmp/locomo is missing)
[ -d /tmp/locomo ] || git clone https://github.com/snap-research/locomo /tmp/locomo

# Run benchmark
python3 /maps/projects/fernandezguerra/apps/repos/cc-soul/scripts/locomo-benchmark.py conv-26

Expected Output

=== LoCoMo Benchmark Results ===

Total QA Pairs: 50
Overall F1: XX.X%

By Category:
  Multi-hop (n=XX): XX.X%
  Single-hop (n=XX): XX.X%
  Temporal (n=XX): XX.X%
  Open-domain (n=XX): XX.X%
  Adversarial (n=XX): XX.X%

Per Conversation:
  conv-26: XX.X% (50 QA)

Comparison (from paper):
  Human ceiling: 87.9%
  GPT-4 baseline: 32.1%
  cc-soul: XX.X%

Categories

Cat	Name	Description
1	Multi-hop	Requires connecting multiple facts
2	Single-hop	Direct fact retrieval
3	Temporal	Date/time questions
4	Open-domain	General knowledge
5	Adversarial	Should answer "no information"
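Using the category numbering from the table above, the per-category breakdown shown under Expected Output could be computed roughly as follows. This is a sketch under assumptions: each evaluated QA pair is represented as a (category, f1) tuple, and the helper names f1_by_category and format_report are hypothetical rather than taken from the script.

```python
from collections import defaultdict

# Category numbers and names as listed in the table above.
CATEGORY_NAMES = {
    1: "Multi-hop",
    2: "Single-hop",
    3: "Temporal",
    4: "Open-domain",
    5: "Adversarial",
}


def f1_by_category(results):
    """Group (category, f1) pairs and return {name: (count, mean_f1)}."""
    buckets = defaultdict(list)
    for category, f1 in results:
        buckets[category].append(f1)
    return {
        CATEGORY_NAMES.get(cat, f"Category {cat}"): (len(scores), sum(scores) / len(scores))
        for cat, scores in sorted(buckets.items())
    }


def format_report(results):
    """Render the 'By Category' lines in the style of Expected Output."""
    lines = ["By Category:"]
    for name, (n, mean) in f1_by_category(results).items():
        lines.append(f"  {name} (n={n}): {mean * 100:.1f}%")
    return "\n".join(lines)


if __name__ == "__main__":
    print(format_report([(1, 1.0), (1, 0.5), (3, 0.0)]))
```

Averaging per category rather than only overall keeps a weak category (for example adversarial questions) from being hidden by stronger ones.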