locomo-benchmark

Stars: 2

Forks: 3

Last updated: 20 January 2026 at 18:09

Run LoCoMo benchmark for long-term conversational memory

Source

genomewalker

genomewalker/cc-soul

SKILL.md

name	locomo-benchmark
description	Run LoCoMo benchmark for long-term conversational memory
execution	inline
model	inherit
aliases	["locomo","benchmark-memory"]

LoCoMo Benchmark

Evaluate cc-soul's memory against the LoCoMo benchmark (ACL 2024) for long-term conversational memory.

Quick Start

Run the benchmark script:

# Test one conversation (default: conv-26)
python3 $PLUGIN_DIR/scripts/locomo-benchmark.py

# Test specific conversations
python3 $PLUGIN_DIR/scripts/locomo-benchmark.py conv-26 conv-30

# Full benchmark (all 10 conversations)
python3 $PLUGIN_DIR/scripts/locomo-benchmark.py --full

# Limit QA pairs per conversation
python3 $PLUGIN_DIR/scripts/locomo-benchmark.py --max-qa 20

$PLUGIN_DIR is the cc-soul root directory: /maps/projects/fernandezguerra/apps/repos/cc-soul, or the path where the plugin is installed.
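The invocations above suggest a small command-line interface: optional conversation IDs, a --full switch, and a --max-qa limit. The following is a minimal sketch of how such arguments could be parsed with argparse. It is not the actual code of locomo-benchmark.py; the function name parse_args and the attribute names are assumptions made for illustration.

```python
import argparse


def parse_args(argv=None):
    """Hypothetical parser mirroring the documented invocations."""
    parser = argparse.ArgumentParser(
        description="Run the LoCoMo benchmark (illustrative sketch)"
    )
    # Zero or more conversation IDs; falls back to conv-26 when none are given.
    parser.add_argument("conversations", nargs="*", default=["conv-26"])
    # Run all 10 conversations instead of a selection.
    parser.add_argument("--full", action="store_true")
    # Optional cap on QA pairs evaluated per conversation.
    parser.add_argument("--max-qa", type=int, default=None)
    return parser.parse_args(argv)


if __name__ == "__main__":
    args = parse_args()
    print(args.conversations, args.full, args.max_qa)
```

How the real script combines --full with explicit conversation IDs is not stated in this document, so the sketch leaves that decision to the caller.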

What the Script Does

1. Downloads LoCoMo data from GitHub to /tmp/locomo/ (if not present)
2. Ingests conversations into cc-soul memory:
   - Extracts session summaries as observations
   - Creates triplets for speaker facts
   - Tags with sample_id for retrieval
3. Evaluates QA pairs:
   - Retrieves context using chitta recall --tag {sample_id}
   - Calculates F1 score vs ground truth
4. Reports results by category
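The F1 step above can be illustrated with a token-overlap F1 of the kind commonly used for QA evaluation. This is a sketch under assumptions: the exact normalization the script applies (lowercasing, punctuation and article removal, any stemming) is not described here, and the names normalize and token_f1 are hypothetical.

```python
import re
import string
from collections import Counter


def normalize(text):
    """Lowercase, drop punctuation and English articles, split into tokens."""
    text = text.lower()
    text = "".join(ch for ch in text if ch not in string.punctuation)
    text = re.sub(r"\b(a|an|the)\b", " ", text)
    return text.split()


def token_f1(prediction, ground_truth):
    """Token-level F1 between a predicted answer and the reference answer."""
    pred = normalize(prediction)
    gold = normalize(ground_truth)
    if not pred or not gold:
        # Both empty counts as a match; only one empty counts as a miss.
        return float(pred == gold)
    # Multiset intersection counts each shared token at most as often as it
    # appears in both answers.
    overlap = sum((Counter(pred) & Counter(gold)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred)
    recall = overlap / len(gold)
    return 2 * precision * recall / (precision + recall)


if __name__ == "__main__":
    print(token_f1("in May 2023", "May 2023"))
```

For example, the prediction "in May 2023" against the reference "May 2023" shares two of three predicted tokens, giving precision 2/3, recall 1, and F1 0.8.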

Baseline Scores (from paper)

Model	F1
Human ceiling	87.9%
AutoMem	90.5%
GPT-4	32.1%
GPT-3.5	23.7%
Mistral-7B	13.9%

Data

Repository: https://github.com/snap-research/locomo
Local cache: /tmp/locomo/data/locomo10.json
10 conversations, ~200 QA pairs each, ~35 sessions per conversation
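To check the figures above against a local copy, one could count QA pairs per conversation. The sketch below assumes the cache file is a JSON list of samples, each with a "sample_id" string and a "qa" list; that layout is an assumption about the dataset rather than something this document states, and the helper names load_samples and qa_counts are hypothetical.

```python
import json
from pathlib import Path


def load_samples(path="/tmp/locomo/data/locomo10.json"):
    """Read the cached dataset; assumes the file holds a JSON list of samples."""
    with Path(path).open(encoding="utf-8") as fh:
        return json.load(fh)


def qa_counts(samples):
    """Map each sample_id to its number of QA pairs (assumed 'qa' key)."""
    return {s["sample_id"]: len(s.get("qa", [])) for s in samples}


if __name__ == "__main__":
    for sample_id, n in qa_counts(load_samples()).items():
        print(f"{sample_id}: {n} QA pairs")
```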

Manual Execution

If you prefer to run manually:

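# Ensure data exists (clone only if /tmp/locomo is missing; git clone fails on an existing non-empty directory)
[ -d /tmp/locomo ] || git clone https://github.com/snap-research/locomo /tmp/locomo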

# Run benchmark
python3 /maps/projects/fernandezguerra/apps/repos/cc-soul/scripts/locomo-benchmark.py conv-26

Expected Output

=== LoCoMo Benchmark Results ===

Total QA Pairs: 50
Overall F1: XX.X%

By Category:
  Multi-hop (n=XX): XX.X%
  Single-hop (n=XX): XX.X%
  Temporal (n=XX): XX.X%
  Open-domain (n=XX): XX.X%
  Adversarial (n=XX): XX.X%

Per Conversation:
  conv-26: XX.X% (50 QA)

Comparison (from paper):
  Human ceiling: 87.9%
  GPT-4 baseline: 32.1%
  cc-soul: XX.X%
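The header and per-category part of this report can be produced from a list of per-question scores. The sketch below is illustrative rather than the script's actual reporting code: format_report and CATEGORY_NAMES are hypothetical names, the input is assumed to be (category number, F1) pairs with F1 in [0, 1], and the "Per Conversation" and "Comparison" sections are omitted. The category names are the five used in this document.

```python
from collections import defaultdict

# Category numbers and names as listed in this document.
CATEGORY_NAMES = {
    1: "Multi-hop",
    2: "Single-hop",
    3: "Temporal",
    4: "Open-domain",
    5: "Adversarial",
}


def format_report(results):
    """Build the summary text from (category_id, f1) pairs, f1 in [0, 1]."""
    results = list(results)
    by_cat = defaultdict(list)
    for cat, f1 in results:
        by_cat[cat].append(f1)
    # Overall F1 is the plain mean over all QA pairs.
    overall = sum(f1 for _, f1 in results) / len(results) if results else 0.0
    lines = [
        "=== LoCoMo Benchmark Results ===",
        "",
        f"Total QA Pairs: {len(results)}",
        f"Overall F1: {overall:.1%}",
        "",
        "By Category:",
    ]
    for cat in sorted(by_cat):
        scores = by_cat[cat]
        name = CATEGORY_NAMES.get(cat, f"Category {cat}")
        lines.append(f"  {name} (n={len(scores)}): {sum(scores) / len(scores):.1%}")
    return "\n".join(lines)


if __name__ == "__main__":
    print(format_report([(1, 1.0), (1, 0.0), (2, 0.5)]))
```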

Categories

Cat	Name	Description
1	Multi-hop	Requires connecting multiple facts
2	Single-hop	Direct fact retrieval
3	Temporal	Date/time questions
4	Open-domain	General knowledge
5	Adversarial	Should answer "no information"
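Category 5 differs from the others because the correct behaviour is to decline to answer. One way to score it, sketched here as an assumption and not taken from the benchmark script, is to check the prediction for a refusal phrase. The function name adversarial_score and the marker phrases are hypothetical.

```python
def adversarial_score(prediction, refusal_markers=("no information", "not mentioned")):
    """Return 1.0 if the answer declines to answer, else 0.0 (assumed rule)."""
    text = prediction.lower()
    return 1.0 if any(marker in text for marker in refusal_markers) else 0.0


if __name__ == "__main__":
    print(adversarial_score("No information available."))
    print(adversarial_score("She moved in 2021."))
```

A substring check like this is lenient: an answer that hedges and then guesses would still score 1.0, so a stricter rule may be preferable in practice.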