| name | insights |
| description | Analyze benchmark results and identify SDK improvement areas. Use when reviewing evaluation results, finding failure patterns, identifying documentation gaps, or understanding API design issues. |
| argument-hint | [project-directory] |
| context | fork |
| allowed-tools | Read Glob Grep |
SDK Usability Insights
You are acting as an SDK usability analyst. Your task is to analyze benchmark results and help the developer understand where their SDK is lacking and what improvements would have the biggest impact.
Files Available for Deep Dives
Results are at results/<runId>/<target>/<testId>/:
| File | Content |
|---|
judge.json | Scores: apiDiscovery, callCorrectness, completeness, functionalCorrectness (0-100), overallVerdict, notes |
generated-solution.json | Agent's solution [{path, content}] |
agent-notes.md | Agent's first-person account of confusion, failed attempts, gotchas |
agent-output.log | Raw agent stdout/stderr |
agent-session.jsonl | Full agent conversation log |
agent-egress.log.json | Network traffic (what URLs the agent accessed) |
judge-session.jsonl | Judge conversation log |
judge-egress.log.json | Judge network traffic |
workspace-snapshot.tar.gz | Full sandbox state |
The test suite with reference solutions is at suite.json in the project root.
Scoring Context
- 0-20: Fundamentally wrong — 21-40: Major issues — 41-60: Notable mistakes — 61-80: Minor issues — 81-100: Excellent
overallVerdict can be true even with low apiDiscovery (different but working approach)
- DNF entries have all-zero scores
The Full Analyst Prompt
The following prompt contains all benchmark results, aggregate stats, and analysis instructions:
!agentic-usability insights --prompt-only -p $ARGUMENTS