locomo-benchmark
Run LoCoMo benchmark for long-term conversational memory
| Field | Value |
|---|---|
| name | locomo-benchmark |
| description | Run LoCoMo benchmark for long-term conversational memory |
| execution | inline |
| model | inherit |
| aliases | ["locomo","benchmark-memory"] |
Evaluate cc-soul's memory against the LoCoMo benchmark (ACL 2024) for long-term conversational memory.
Run the benchmark script:
# Test one conversation (default: conv-26)
python3 $PLUGIN_DIR/scripts/locomo-benchmark.py
# Test specific conversations
python3 $PLUGIN_DIR/scripts/locomo-benchmark.py conv-26 conv-30
# Full benchmark (all 10 conversations)
python3 $PLUGIN_DIR/scripts/locomo-benchmark.py --full
# Limit QA pairs per conversation
python3 $PLUGIN_DIR/scripts/locomo-benchmark.py --max-qa 20
Here $PLUGIN_DIR is /maps/projects/fernandezguerra/apps/repos/cc-soul (or the path of the installed plugin).
- Dataset directory: /tmp/locomo/ (cloned if not present)
- Memory recall: chitta recall --tag {sample_id}

| Cat | Name | Description |
|---|---|---|
| 1 | Multi-hop | Requires connecting multiple facts |
| 2 | Single-hop | Direct fact retrieval |
| 3 | Temporal | Date/time questions |
| 4 | Open-domain | General knowledge |
| 5 | Adversarial | Should answer "no information" |
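
Category 5 differs from the others because the correct behaviour is to decline to answer. A minimal sketch of how such answers could be scored follows. The refusal phrases and the function names are hypothetical, chosen for illustration; the benchmark script may match different wording.

```python
# Hypothetical refusal phrases; the real script may look for other wording.
REFUSAL_MARKERS = ("no information", "not mentioned")


def is_refusal(prediction: str) -> bool:
    """Return True if the prediction declines to answer."""
    text = prediction.lower()
    return any(marker in text for marker in REFUSAL_MARKERS)


def score_adversarial(prediction: str) -> float:
    """Score a category-5 answer: 1.0 for a refusal, 0.0 otherwise."""
    return 1.0 if is_refusal(prediction) else 0.0
```

A substring match like this is deliberately lenient: any answer containing one of the phrases counts as a refusal, even if it goes on to guess.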

| Model | F1 |
|---|---|
| Human ceiling | 87.9% |
| AutoMem | 90.5% |
| GPT-4 | 32.1% |
| GPT-3.5 | 23.7% |
| Mistral-7B | 13.9% |
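
The scores above are F1 values. As an illustration, the sketch below computes a token-overlap F1 between a predicted and a reference answer, in the style commonly used for question-answering benchmarks. It assumes lowercasing, punctuation stripping and article removal as normalization. The helper names are hypothetical, and the benchmark script's actual scoring may differ.

```python
import re
import string
from collections import Counter


def normalize(text: str) -> list[str]:
    """Lowercase, drop punctuation and English articles, split into tokens."""
    text = text.lower()
    text = "".join(ch for ch in text if ch not in string.punctuation)
    text = re.sub(r"\b(a|an|the)\b", " ", text)
    return text.split()


def token_f1(prediction: str, reference: str) -> float:
    """Harmonic mean of token precision and recall, in [0, 1]."""
    pred = normalize(prediction)
    ref = normalize(reference)
    if not pred or not ref:
        # Two empty answers match; one empty answer does not.
        return float(pred == ref)
    # Multiset intersection counts each shared token at most as often
    # as it appears in both answers.
    overlap = sum((Counter(pred) & Counter(ref)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred)
    recall = overlap / len(ref)
    return 2 * precision * recall / (precision + recall)
```

For example, scoring the prediction "7 May 2023" against the reference "May 2023" gives precision 2/3 and recall 1, so F1 is 0.8.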

- Dataset repository: https://github.com/snap-research/locomo
- Data file: /tmp/locomo/data/locomo10.json

If you prefer to run manually:
# Ensure data exists
git clone https://github.com/snap-research/locomo /tmp/locomo
# Run benchmark
python3 /maps/projects/fernandezguerra/apps/repos/cc-soul/scripts/locomo-benchmark.py conv-26
=== LoCoMo Benchmark Results ===
Total QA Pairs: 50
Overall F1: XX.X%
By Category:
Multi-hop (n=XX): XX.X%
Single-hop (n=XX): XX.X%
Temporal (n=XX): XX.X%
Open-domain (n=XX): XX.X%
Adversarial (n=XX): XX.X%
Per Conversation:
conv-26: XX.X% (50 QA)
Comparison (from paper):
Human ceiling: 87.9%
GPT-4 baseline: 32.1%
cc-soul: XX.X%
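
A report of this shape could be assembled from per-question scores roughly as follows. This is a sketch under assumptions: `summarize` is a hypothetical helper, each record is a (category id, F1) pair, and the overall figure is taken as the mean over all QA pairs rather than the mean of the category means. The script may aggregate differently.

```python
from collections import defaultdict

# Category ids and names as listed in the category table.
CATEGORY_NAMES = {
    1: "Multi-hop",
    2: "Single-hop",
    3: "Temporal",
    4: "Open-domain",
    5: "Adversarial",
}


def summarize(records):
    """Build a text report from (category_id, f1) pairs, with f1 in [0, 1]."""
    by_category = defaultdict(list)
    for category, f1 in records:
        by_category[category].append(f1)

    total = sum(len(scores) for scores in by_category.values())
    # Assumption: overall F1 is averaged over QA pairs, not over categories.
    overall = sum(sum(s) for s in by_category.values()) / total if total else 0.0

    lines = [
        "=== LoCoMo Benchmark Results ===",
        f"Total QA Pairs: {total}",
        f"Overall F1: {overall * 100:.1f}%",
        "By Category:",
    ]
    for category in sorted(by_category):
        scores = by_category[category]
        name = CATEGORY_NAMES.get(category, f"Category {category}")
        mean = sum(scores) / len(scores)
        lines.append(f"{name} (n={len(scores)}): {mean * 100:.1f}%")
    return "\n".join(lines)
```

Averaging over QA pairs weights each category by its size, so a category with many questions dominates the overall figure.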