locomo-benchmark
Run LoCoMo benchmark for long-term conversational memory
| name | locomo-benchmark |
| description | Run LoCoMo benchmark for long-term conversational memory |
| execution | inline |
| model | inherit |
| aliases | ["locomo","benchmark-memory"] |
Evaluate cc-soul's memory against the LoCoMo benchmark (ACL 2024) for long-term conversational memory.
Run the benchmark script:
# Test one conversation (default: conv-26)
python3 $PLUGIN_DIR/scripts/locomo-benchmark.py
# Test specific conversations
python3 $PLUGIN_DIR/scripts/locomo-benchmark.py conv-26 conv-30
# Full benchmark (all 10 conversations)
python3 $PLUGIN_DIR/scripts/locomo-benchmark.py --full
# Limit QA pairs per conversation
python3 $PLUGIN_DIR/scripts/locomo-benchmark.py --max-qa 20
Here, $PLUGIN_DIR is /maps/projects/fernandezguerra/apps/repos/cc-soul (or the installed plugin path).
- Dataset directory: /tmp/locomo/ (if not present)
- Recall command: chitta recall --tag {sample_id}

| Cat | Name | Description |
|---|---|---|
| 1 | Multi-hop | Requires connecting multiple facts |
| 2 | Single-hop | Direct fact retrieval |
| 3 | Temporal | Date/time questions |
| 4 | Open-domain | General knowledge |
| 5 | Adversarial | Should answer "no information" |
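
The category numbers in the table can be mapped to names when post-processing results. The following is a minimal sketch, assuming each QA record is a dict with an integer `category` field; that record shape and the helper name `count_by_category` are assumptions for illustration, not something this document specifies.

```python
from collections import Counter

# Category IDs and names, copied from the category table.
LOCOMO_CATEGORIES = {
    1: "Multi-hop",
    2: "Single-hop",
    3: "Temporal",
    4: "Open-domain",
    5: "Adversarial",
}

def count_by_category(qa_pairs):
    """Count QA pairs per category name; unknown IDs are grouped under 'Unknown'."""
    counts = Counter()
    for qa in qa_pairs:
        name = LOCOMO_CATEGORIES.get(qa.get("category"), "Unknown")
        counts[name] += 1
    return dict(counts)

# Hypothetical records, for illustration only.
sample = [{"category": 1}, {"category": 3}, {"category": 3}, {"category": 9}]
print(count_by_category(sample))
```

Grouping unrecognized IDs under a separate label keeps a malformed record from silently inflating one of the five real categories.
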
Reference F1 scores:

| Model | F1 |
|---|---|
| Human ceiling | 87.9% |
| AutoMem | 90.5% |
| GPT-4 | 32.1% |
| GPT-3.5 | 23.7% |
| Mistral-7B | 13.9% |
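
F1 for question answering is commonly computed as token overlap between the predicted and reference answers. The scoring code in locomo-benchmark.py is not shown here, so the following is only a sketch of a SQuAD-style token F1. It assumes lowercasing, punctuation stripping, and whitespace tokenization; the actual script may normalize differently. The function names are hypothetical.

```python
import string
from collections import Counter

def normalize(text):
    """Lowercase, strip ASCII punctuation, split on whitespace (assumed normalization)."""
    text = text.lower()
    text = text.translate(str.maketrans("", "", string.punctuation))
    return text.split()

def token_f1(prediction, reference):
    """Harmonic mean of token precision and recall between two answer strings."""
    pred = normalize(prediction)
    ref = normalize(reference)
    if not pred or not ref:
        # Both empty counts as a match; exactly one empty counts as a miss.
        return float(pred == ref)
    # Multiset intersection counts each shared token at most as often as it appears in both.
    overlap = sum((Counter(pred) & Counter(ref)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred)
    recall = overlap / len(ref)
    return 2 * precision * recall / (precision + recall)
```

Under this definition a verbose but correct answer is penalized through precision, which is one reason token F1 can understate the quality of long generated answers.
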
- Dataset repository: https://github.com/snap-research/locomo
- Data file: /tmp/locomo/data/locomo10.json

If you prefer to run manually:
# Ensure data exists
git clone https://github.com/snap-research/locomo /tmp/locomo
# Run benchmark
python3 /maps/projects/fernandezguerra/apps/repos/cc-soul/scripts/locomo-benchmark.py conv-26
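
Before a manual run it can help to confirm that the dataset file is in place. The following is a minimal sketch, assuming the file is a JSON list of samples that each carry a `sample_id` and a `qa` list; that layout is an assumption about locomo10.json, not something stated in this document, and the helper names are hypothetical.

```python
import json
from pathlib import Path

# Path taken from the manual steps; adjust if the dataset was cloned elsewhere.
DATA_FILE = Path("/tmp/locomo/data/locomo10.json")

def summarize(samples):
    """Return {sample_id: number of QA pairs} for a parsed sample list (assumed layout)."""
    return {s.get("sample_id", "?"): len(s.get("qa", [])) for s in samples}

def load_samples(path=DATA_FILE):
    """Parse the dataset file, failing with a clear message if it is missing."""
    if not path.exists():
        raise FileNotFoundError(f"{path} not found; clone the dataset first")
    with path.open(encoding="utf-8") as fh:
        return json.load(fh)

if __name__ == "__main__":
    try:
        print(summarize(load_samples()))
    except FileNotFoundError as err:
        print(err)
```
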
=== LoCoMo Benchmark Results ===
Total QA Pairs: 50
Overall F1: XX.X%
By Category:
Multi-hop (n=XX): XX.X%
Single-hop (n=XX): XX.X%
Temporal (n=XX): XX.X%
Open-domain (n=XX): XX.X%
Adversarial (n=XX): XX.X%
Per Conversation:
conv-26: XX.X% (50 QA)
Comparison (from paper):
Human ceiling: 87.9%
GPT-4 baseline: 32.1%
cc-soul: XX.X%
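
The overall and per-category lines of a report like this can be derived from per-question scores. The following is a sketch only, assuming each figure is the plain mean of per-question F1 values; how locomo-benchmark.py actually aggregates is not shown here, and the function names are hypothetical.

```python
from collections import defaultdict

def aggregate(scores):
    """scores: iterable of (category_name, f1) pairs with f1 in [0, 1]."""
    by_cat = defaultdict(list)
    for cat, f1 in scores:
        by_cat[cat].append(f1)
    total = sum(len(v) for v in by_cat.values())
    # Overall score is the mean over all questions, not the mean of category means.
    overall = sum(sum(v) for v in by_cat.values()) / total if total else 0.0
    per_cat = {c: (len(v), sum(v) / len(v)) for c, v in by_cat.items()}
    return overall, per_cat

def format_report(scores):
    """Render overall and per-category means as percentage lines."""
    overall, per_cat = aggregate(scores)
    lines = [f"Overall F1: {overall * 100:.1f}%", "By Category:"]
    for cat, (n, mean) in per_cat.items():
        lines.append(f"{cat} (n={n}): {mean * 100:.1f}%")
    return "\n".join(lines)

# Hypothetical scores, for illustration only.
demo = [("Temporal", 1.0), ("Temporal", 0.5), ("Single-hop", 0.0)]
print(format_report(demo))
```

Averaging over questions rather than over categories means that categories with more QA pairs weigh more heavily in the overall figure.
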