| name | benchmark-context |
| description | Automatically benchmark your custom memory implementation against established systems like Supermemory. Set up a public benchmark or create your own, then compare solutions on quality, latency, features, and cost through a simple UI and CLI. |
MemoryBench Integration Skill
Automatically benchmark your custom memory implementation against established systems like Supermemory, Mem0, and Zep.
What This Skill Does
When you invoke this skill from your project, it handles the complete benchmarking process end-to-end:
- Asks you 5 questions about your setup preferences
- Analyzes your memory code to understand how it works
- Generates integration code automatically
- Runs the full benchmark comparing your system to competitors
- Shows you the results with clear performance comparisons
No manual commands needed - everything runs automatically from start to finish.
When to Use This Skill
Use this skill when you:
- Have built a custom memory/context system and want to see how it performs
- Need objective metrics comparing your implementation to industry solutions
- Want to benchmark on standardized datasets (conversational memory, RAG, dialogue)
- Are deciding which memory system to use and want a data-driven comparison
How It Works
The 7 Automated Phases
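Phase 1: Setup
- Verifies you're running from your project root (not from memorybench)
- Clones memorybench to ./memorybench and installs dependencies, unless it already exists
- Asks the 5 initial questions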
Phase 2: Discovery
- Uses AI agents to analyze your memory codebase
- Identifies initialization, ingestion, and search methods
- Detects required configuration and API keys
- Understands your data formats
Phase 3: Code Generation
- Creates a provider adapter implementing the MemoryBench interface
- Copies and adapts your memory code into the provider
- Generates custom prompts if needed for your result format
- See Provider Template Reference
Phase 4: Registration
- Updates src/types/provider.ts with your provider name
- Registers provider in src/providers/index.ts
- Adds configuration in src/utils/config.ts
- Documents environment variables
Phase 5: Configuration
- Creates .env.local with required API keys
- Asks for your provider credentials
- Asks for comparison provider keys (if selected)
- Asks for OpenAI/Anthropic key for judging
Phase 6: Validation
- Runs a quick test with a single question
- Verifies provider initialization works
- Confirms ingestion and search work correctly
- If issues arise, asks whether you want to debug or abort
Phase 7: Benchmark Execution
- Runs full benchmark automatically
- Shows real-time progress (ingestion, indexing, search, answers, evaluation)
- Compares against selected competitors
- Presents final results with accuracy and latency metrics
Initial Questions
The skill will ask you these questions upfront:
1. Provider Name
What should we call your memory provider?
- Use lowercase, no spaces (e.g., "mymemory", "contextengine")
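The naming rule can be checked before any files are generated. The sketch below is a hypothetical helper, not part of MemoryBench: the source only says "lowercase, no spaces", so the exact pattern (a leading letter followed by lowercase letters, digits, or hyphens) is an assumption.

```typescript
// Hypothetical helper: the source only requires "lowercase, no spaces".
// Limiting names to a leading letter plus letters, digits, and hyphens
// is an assumption made for this sketch.
const PROVIDER_NAME_PATTERN = /^[a-z][a-z0-9-]*$/;

function isValidProviderName(name: string): boolean {
  return PROVIDER_NAME_PATTERN.test(name);
}

console.log(isValidProviderName("mymemory"));  // true
console.log(isValidProviderName("My Memory")); // false
```

A stricter or looser pattern may be appropriate, since the name is later used as a directory name and as a string literal in a union type.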
2. Memory Code Location
Where is your memory implementation?
- Examples: src/lib/memory, packages/memory, src/services/context
3. Benchmark Dataset
Which dataset matches your use case?
- LoCoMo - Long-term conversational memory across multiple sessions spanning days/weeks
  - Best for: Chat apps, AI assistants, customer support bots
- LongMemEval - Memory with long documents and complex retrieval
  - Best for: RAG systems, document analysis, knowledge bases
- ConvoMem - Multi-turn conversation understanding and context tracking
  - Best for: Dialogue systems, interview bots, meeting assistants
See Benchmarks Reference for detailed information.
4. Comparison Targets (Multi-select)
Which systems to compare against?
- Supermemory - Fast hybrid vector + graph memory with automatic extraction
- Mem0 - Persistent memory for AI agents with graph relationships
- Zep - Long-term conversation memory with automatic summarization
- Filesystem - Baseline vector search (no API required)
- RAG - Baseline RAG with LLM extraction (no API required)
5. Test Size
How many questions to benchmark?
- Small (5 questions) - Quick validation, ~2-5 minutes
- Medium (20 questions) - Good sample size, ~10-15 minutes
- Full (all questions) - Complete evaluation, ~30-60 minutes
Working Directory Management
Important: You must run this skill from your project root, NOT from memorybench.
your-project/               ← Run skill from here
├── src/
│   └── lib/memory/         ← Your memory implementation
└── memorybench/            ← Skill clones this automatically
    └── src/providers/      ← Your provider adapter goes here
The skill will:
- Verify you're in your project (not in memorybench)
- Clone memorybench to ./memorybench
- Use relative paths (../src/lib/memory) when analyzing your code
- Run benchmarks with cd memorybench && bun run src/index.ts ...
What Gets Created
After the skill runs, you'll have:
your-project/
└── memorybench/
    ├── .env.local                  # Your API keys
    ├── src/
    │   ├── providers/
    │   │   └── {yourname}/
    │   │       ├── index.ts        # Provider implementation
    │   │       └── prompts.ts      # Custom prompts (optional)
    │   ├── types/provider.ts       # Updated with your provider
    │   └── providers/index.ts      # Registered
    └── data/runs/{run-id}/         # Benchmark results
        ├── checkpoint.json         # Run state
        ├── results/                # Per-question results
        └── report.json             # Final metrics
After Completion
Once the benchmark finishes, the skill shows:
Summary Scores:
- Accuracy percentage for each provider
- Average search latency
- Success rate (questions answered vs failed)
Key Findings:
- "Your provider achieved 76% accuracy vs 82% for Supermemory"
- "Search latency: 145ms (yours) vs 98ms (best competitor)"
Next Steps:
- View detailed results: cd memorybench && bun run src/index.ts serve
- See failures: cd memorybench && bun run src/index.ts show-failures -r {run-id}
- Try different benchmark: Run this skill again with another dataset
- Run manually: cd memorybench && bun run src/index.ts run -p {name} -b {benchmark}
Troubleshooting
If something goes wrong:
- "Provider not initialized" - Check API keys in .env.local
- Ingestion fails - Check data format transformation, see Data Formats Reference
- Search returns no results - Verify containerTag handling and indexing completion
- Answers are wrong - May need custom prompts for your result format
See Debugging Reference for detailed troubleshooting.
Implementation Instructions for Claude
When executing this skill, follow these steps:
Step 1: Verify Environment
Check that we're in the user's project (not in memorybench):
basename "$(pwd)" | grep -q "memorybench" && echo "ERROR" || echo "OK"
If ERROR, tell the user to run the skill from their project root, then exit.
Step 2: Clone MemoryBench
Check if memorybench already exists:
[ -d "memorybench" ] && echo "EXISTS" || echo "NOT_FOUND"
If NOT_FOUND, clone it using EXACTLY this command (do not modify the URL):
git clone https://github.com/supermemoryai/memorybench.git memorybench
Then install dependencies:
cd memorybench && bun install && cd ..
IMPORTANT: You MUST use the URL https://github.com/supermemoryai/memorybench.git - do not infer or use any other URL.
If EXISTS, use the existing installation (no action needed).
Step 3: Gather User Input
Use the AskUserQuestion tool to ask all 5 questions at once:
- Provider name (text input suggestion)
- Memory code location (text input suggestion)
- Benchmark dataset (single select: locomo, longmemeval, convomem)
- Comparison targets (multi-select: supermemory, mem0, zep, filesystem, rag)
- Test size (single select: 5, 20, or full)
Step 4: Analyze User's Memory Code
Use Task tool with subagent_type=Explore to analyze the provided memory code location. Look for initialization, add/ingest methods, search/query methods, and configuration needs.
Step 5: Generate Provider Code
Based on discovery, create:
- memorybench/src/providers/{providerName}/index.ts, using the template from references/provider-template.md
- Optionally memorybench/src/providers/{providerName}/prompts.ts, if custom formatting is needed
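To make the shape of a generated adapter concrete, here is a minimal sketch of a provider that wraps a memory system. Every type and method name in it (MemoryProvider, Session, SearchHit, the initialize/ingest/search signatures) is an assumption made for illustration; the actual interface is defined in the memorybench repository and in references/provider-template.md and may differ. The keyword-overlap store stands in for your own memory code.

```typescript
// All names below are illustrative assumptions, not the real MemoryBench types.
interface SessionMessage { role: string; content: string; }
interface Session { sessionId: string; messages: SessionMessage[]; }
interface SearchHit { sessionId: string; content: string; score: number; }

interface MemoryProvider {
  name: string;
  initialize(config: Record<string, string>): Promise<void>;
  ingest(sessions: Session[], containerTag: string): Promise<void>;
  search(query: string, containerTag: string, limit: number): Promise<SearchHit[]>;
}

// Stand-in for the user's memory system: an in-memory keyword-overlap store.
class MyMemoryProvider implements MemoryProvider {
  name = "mymemory";
  private initialized = false;
  private store = new Map<string, SearchHit[]>();

  async initialize(_config: Record<string, string>): Promise<void> {
    this.initialized = true;
  }

  // Each containerTag gets its own isolated set of memories.
  async ingest(sessions: Session[], containerTag: string): Promise<void> {
    if (!this.initialized) throw new Error("Provider not initialized");
    const entries = this.store.get(containerTag) ?? [];
    for (const session of sessions) {
      for (const message of session.messages) {
        entries.push({ sessionId: session.sessionId, content: message.content, score: 0 });
      }
    }
    this.store.set(containerTag, entries);
  }

  // Scores each stored message by how many query terms it contains.
  async search(query: string, containerTag: string, limit: number): Promise<SearchHit[]> {
    if (!this.initialized) throw new Error("Provider not initialized");
    const terms = query.toLowerCase().split(/\s+/).filter(Boolean);
    return (this.store.get(containerTag) ?? [])
      .map((hit) => {
        const text = hit.content.toLowerCase();
        return { ...hit, score: terms.filter((t) => text.includes(t)).length };
      })
      .filter((hit) => hit.score > 0)
      .sort((a, b) => b.score - a.score)
      .slice(0, limit);
  }
}
```

Keying the store by containerTag matters in this sketch because a search against the wrong tag returns nothing, which is the same symptom as the "Search returns no results" troubleshooting entry.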
Step 6: Register Provider
Update these files in memorybench:
- src/types/provider.ts - Add to ProviderName union type
- src/providers/index.ts - Import and register in providers Record
- src/utils/config.ts - Add case in getProviderConfig()
- .env.example - Document environment variables
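The three code changes might look roughly like the sketch below. It only mirrors the names mentioned above (ProviderName, providers, getProviderConfig); the surrounding shapes, the ProviderConfig fields, and the MYMEMORY_API_KEY variable are assumptions, and the real files in memorybench should be treated as the reference. The environment is passed in as a parameter here purely to keep the sketch self-contained.

```typescript
// Illustrative shapes only; the real definitions live in memorybench.
type ProviderName = "supermemory" | "mem0" | "zep" | "filesystem" | "rag" | "mymemory";

interface ProviderConfig { apiKey: string; }
interface Provider { name: ProviderName; }

// Typing the registry as Record<ProviderName, ...> means the compiler
// rejects a union member that has no registered provider.
const providers: Record<ProviderName, () => Provider> = {
  supermemory: () => ({ name: "supermemory" }),
  mem0: () => ({ name: "mem0" }),
  zep: () => ({ name: "zep" }),
  filesystem: () => ({ name: "filesystem" }),
  rag: () => ({ name: "rag" }),
  mymemory: () => ({ name: "mymemory" }), // newly registered provider
};

// MYMEMORY_API_KEY is a hypothetical variable name for this sketch.
function getProviderConfig(
  name: ProviderName,
  env: Record<string, string | undefined>,
): ProviderConfig {
  switch (name) {
    case "mymemory": {
      const apiKey = env["MYMEMORY_API_KEY"];
      if (apiKey === undefined || apiKey === "") {
        throw new Error("MYMEMORY_API_KEY is not set");
      }
      return { apiKey };
    }
    default:
      // Existing cases are elided in this sketch.
      return { apiKey: "" };
  }
}
```

Failing early on a missing key gives a clearer message than a later "Provider not initialized" error.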
Step 7: Configure Environment
Ask the user for API keys (OpenAI or Anthropic for judging, their provider keys, and any comparison provider keys).
Create or update memorybench/.env.local with provided values.
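"Create or update" implies that keys already present in the file should be overwritten in place and new keys appended, without disturbing comments or unrelated entries. A minimal sketch of that merge follows; mergeEnvFile is a hypothetical helper, the variable names in the usage example are assumptions, and values are written unquoted, so anything containing spaces or # would need extra handling.

```typescript
// Hypothetical helper: merges KEY=value updates into existing .env content.
function mergeEnvFile(existing: string, updates: Record<string, string>): string {
  const remaining = new Map(Object.entries(updates));
  const trimmed = existing.replace(/\n+$/, "");
  const lines = trimmed === "" ? [] : trimmed.split("\n");

  const out = lines.map((line) => {
    const key = /^([A-Za-z_][A-Za-z0-9_]*)=/.exec(line)?.[1];
    if (key !== undefined && remaining.has(key)) {
      const value = remaining.get(key) ?? "";
      remaining.delete(key);
      return `${key}=${value}`; // overwrite the existing entry in place
    }
    return line; // comments, blank lines, and untouched keys pass through
  });

  // Append keys that were not already in the file.
  remaining.forEach((value, key) => out.push(`${key}=${value}`));
  return out.join("\n") + "\n";
}

const merged = mergeEnvFile("OPENAI_API_KEY=old\n# judge key above\n", {
  OPENAI_API_KEY: "new",
  MYMEMORY_API_KEY: "abc",
});
console.log(merged);
// OPENAI_API_KEY=new
// # judge key above
// MYMEMORY_API_KEY=abc
```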
Step 8: Validation Test
Run quick test:
cd memorybench && bun run src/index.ts test -p {providerName} -b {benchmark} -q question_1
If it fails, show the error and ask the user whether they want to debug or abort.
Step 9: Run Benchmark
Based on user selections, run the benchmark command and LET IT COMPLETE without polling.
With comparisons:
cd memorybench && bun run src/index.ts compare -p {providerName},{others} -b {benchmark} -l {limit}
Without comparisons:
cd memorybench && bun run src/index.ts run -p {providerName} -b {benchmark} -l {limit}
(Omit -l {limit} if user selected "full")
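The choice between the two commands and the handling of the limit flag can be sketched as a small function. The subcommands and flags (compare, run, -p, -b, -l) come from the commands above; the BenchmarkSelection type and the buildBenchmarkArgs helper are hypothetical names introduced for illustration.

```typescript
// Hypothetical types describing the answers gathered in Step 3.
type TestSize = 5 | 20 | "full";

interface BenchmarkSelection {
  providerName: string;
  benchmark: "locomo" | "longmemeval" | "convomem";
  comparisons: string[]; // e.g. ["supermemory", "mem0"]; empty means no comparison
  testSize: TestSize;
}

// Returns the arguments to pass to `bun`, to be run from inside memorybench/.
function buildBenchmarkArgs(sel: BenchmarkSelection): string[] {
  const subcommand = sel.comparisons.length > 0 ? "compare" : "run";
  const providerList = [sel.providerName, ...sel.comparisons].join(",");
  const args = ["run", "src/index.ts", subcommand, "-p", providerList, "-b", sel.benchmark];
  if (sel.testSize !== "full") {
    args.push("-l", String(sel.testSize)); // omitted entirely for a full run
  }
  return args;
}

const args = buildBenchmarkArgs({
  providerName: "mymemory",
  benchmark: "locomo",
  comparisons: ["supermemory", "mem0"],
  testSize: 20,
});
console.log("bun " + args.join(" "));
// bun run src/index.ts compare -p mymemory,supermemory,mem0 -b locomo -l 20
```

Returning an argument array rather than a single string avoids shell quoting issues if the arguments are later handed to a process-spawning API.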
IMPORTANT INSTRUCTIONS FOR RUNNING:
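- Use Bash tool with run_in_background: false (or omit the parameter) so the command runs synchronously
- Set a long timeout (e.g., timeout: 600000 for 10 minutes); if the selected test size is expected to run longer than the timeout allows, use the background alternative below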
- DO NOT poll with BashOutput repeatedly - let the command complete on its own
- The benchmark will show progress automatically in its output
- Only when the command fully completes, proceed to Step 10
Alternative if synchronous doesn't work:
- Use run_in_background: true
- Tell the user "The benchmark is running in the background" and give the estimate for the selected test size (~2-5 minutes for 5 questions, ~10-15 minutes for 20, ~30-60 minutes for full)
- Use BashOutput ONCE after an appropriate wait time to get final results
- Do NOT repeatedly poll for progress updates
Step 10: Present Results
When the benchmark completes, parse the final output and present:
- Accuracy scores for each provider
- Latency metrics
- Key findings and comparisons
- Suggested next steps
Look for sections in the output like:
- "COMPARISON RESULTS" or "Final Report"
- Accuracy percentages
- Average latencies
- Success rates
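Once the per-provider numbers have been extracted, the key findings can be phrased mechanically. The sketch below assumes a ProviderResult shape (accuracy as a 0-1 fraction, average search latency in milliseconds); that shape and the keyFindings helper are hypothetical and are not the schema of report.json. The sentence templates follow the examples in the "After Completion" section.

```typescript
// Hypothetical result shape; the real report format may differ.
interface ProviderResult {
  provider: string;
  accuracy: number;           // fraction between 0 and 1
  avgSearchLatencyMs: number; // mean search latency in milliseconds
}

function keyFindings(results: ProviderResult[], yours: string): string[] {
  const mine = results.find((r) => r.provider === yours);
  const others = results.filter((r) => r.provider !== yours);
  if (mine === undefined || others.length === 0) return [];

  // Pick the strongest competitor on each axis separately.
  const mostAccurate = others.reduce((a, b) => (b.accuracy > a.accuracy ? b : a));
  const fastest = others.reduce((a, b) => (b.avgSearchLatencyMs < a.avgSearchLatencyMs ? b : a));
  const pct = (x: number): string => `${Math.round(x * 100)}%`;

  return [
    `Your provider achieved ${pct(mine.accuracy)} accuracy vs ${pct(mostAccurate.accuracy)} for ${mostAccurate.provider}`,
    `Search latency: ${Math.round(mine.avgSearchLatencyMs)}ms (yours) vs ${Math.round(fastest.avgSearchLatencyMs)}ms (best competitor)`,
  ];
}
```

With no comparison providers selected the function returns an empty list, in which case only the raw scores are worth presenting.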
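Reference Documentation
- Provider Template Reference (references/provider-template.md)
- Benchmarks Reference
- Data Formats Reference
- Debugging Reference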
Success Criteria
The skill completes successfully when:
- ✅ Provider adapter generated and registered
- ✅ Validation test passes (single question works)
- ✅ Full benchmark completes without errors
- ✅ Results displayed with comparison scores
- ✅ You can see how your system stacks up against competitors
Running the Skill
From your project root:
/memorybench
skill:memorybench
That's it! The skill handles everything else automatically.