analyze-eval
Investigate a single failing eval from the convex-evals system. Use when the user shares a visualizer URL pointing to a specific eval, asks about a specific failing eval, or references a specific eval ID.
| name | analyze-eval |
| description | Investigate a single failing eval from the convex-evals system. Use when the user shares a visualizer URL pointing to a specific eval, asks about a specific failing eval, or references a specific eval ID. |
https://convex-evals.netlify.app/experiment/.../run/$runId/$category/$evalId

The visualizer URL pattern is:
/experiment/$experimentId/run/$runId/$category/$evalId?tab=steps

- `$runId`: the Convex document ID for the run (e.g. jn7922j1w29pdxm76bj9ps0enx80mg9e)
- `$evalId`: the Convex document ID for the specific eval (e.g. jh73jvjz2n00gfeve1dt5h963s80mbc6)

You need the evalId to query.
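As a rough illustration of the URL pattern above, the IDs can be pulled out of a visualizer link with a few lines of TypeScript. This is only a sketch: `parseVisualizerUrl` and `VisualizerIds` are hypothetical names, not part of convex-evals, and the code assumes the path always has exactly the six segments shown in the pattern.

```typescript
// Sketch: extract the IDs from a visualizer URL of the form
// /experiment/$experimentId/run/$runId/$category/$evalId?tab=steps
// `VisualizerIds` and `parseVisualizerUrl` are hypothetical helper names.
interface VisualizerIds {
  experimentId: string;
  runId: string;
  category: string;
  evalId: string;
}

function parseVisualizerUrl(input: string): VisualizerIds | null {
  let pathname: string;
  try {
    // The base is only used when `input` is a bare path rather than a full URL.
    pathname = new URL(input, "https://convex-evals.netlify.app").pathname;
  } catch {
    return null;
  }
  const parts = pathname.split("/").filter((p) => p.length > 0);
  // Expected segments: ["experiment", experimentId, "run", runId, category, evalId]
  if (parts.length !== 6 || parts[0] !== "experiment" || parts[2] !== "run") {
    return null;
  }
  const [, experimentId, , runId, category, evalId] = parts;
  return { experimentId, runId, category, evalId };
}
```

The query string (such as `?tab=steps`) is ignored, and anything that does not match the pattern yields `null` rather than a partial result.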
Run the internal action from the evalScores/ directory. Always use --prod to query the production database (where CI writes results):
npx convex run --prod debug:getEvalDebugInfo '{"evalId": "<evalId>"}'
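If the command needs to be scripted, one possible wrapper is sketched below. `buildDebugInfoArgs` and `fetchEvalDebugInfo` are hypothetical helpers; the sketch assumes Node.js, that `npx` is on the PATH, and that the command prints nothing but the JSON result on stdout, which may not hold in every setup.

```typescript
import { execFileSync } from "node:child_process";

// Builds the argument list equivalent to:
//   npx convex run --prod debug:getEvalDebugInfo '{"evalId": "<evalId>"}'
// JSON.stringify produces the JSON payload, so no manual quoting is needed.
function buildDebugInfoArgs(evalId: string): string[] {
  return ["convex", "run", "--prod", "debug:getEvalDebugInfo", JSON.stringify({ evalId })];
}

// Runs the command from the evalScores/ directory and parses its output.
// Assumption: stdout contains only the JSON object returned by the action.
function fetchEvalDebugInfo(evalId: string, evalScoresDir: string): unknown {
  const stdout = execFileSync("npx", buildDebugInfoArgs(evalId), {
    cwd: evalScoresDir,
    encoding: "utf8",
  });
  return JSON.parse(stdout);
}
```

Passing the arguments as an array avoids shell quoting problems with the JSON payload.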
This returns a JSON object with:
| Field | Contents |
|---|---|
| `eval` | Name, category, evalPath, status (pass/fail + failure reason), task text |
| `run` | Model name, provider, experiment name, run status |
| `steps` | Array of step results: filesystem, install, deploy, tsc, eslint, tests, each with pass/fail/skipped and failure reason |
| `outputFiles` | Map of file path -> file content from the model's generated output (unzipped) |
| `evalSourceFiles` | Map of file path -> file content from the eval source (answer dir, grader, TASK.txt, etc.) |
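To make the payload easier to work with, it can be modelled roughly as below, with two small helpers: one that finds the first failed step and one that picks out the grader and answer files. The shapes are assumptions for illustration. Only `steps`, `status.kind === "failed"`, `failureReason`, `outputFiles`, and `evalSourceFiles` come from this document; the `name` field, the `"passed"` and `"skipped"` labels, and placing `failureReason` on the status object are guesses.

```typescript
// Hypothetical types for the debug payload; see the assumptions noted above.
type StepStatus =
  | { kind: "failed"; failureReason: string }
  | { kind: "passed" } // assumed label
  | { kind: "skipped" }; // assumed label

interface StepResult {
  name: string; // assumed field, e.g. "filesystem", "install", "deploy", "tsc", "eslint", "tests"
  status: StepStatus;
}

type FileMap = Record<string, string>; // file path -> file content

interface EvalDebugInfo {
  steps: StepResult[];
  outputFiles: FileMap;
  evalSourceFiles: FileMap;
}

// Returns the first step whose status.kind is "failed", or null if none failed.
function firstFailedStep(
  steps: StepResult[],
): { name: string; failureReason: string } | null {
  for (const step of steps) {
    if (step.status.kind === "failed") {
      return { name: step.name, failureReason: step.status.failureReason };
    }
  }
  return null;
}

// Keeps only grader test files and files under an answer/ directory.
function expectedFiles(evalSourceFiles: FileMap): FileMap {
  const picked: FileMap = {};
  for (const [path, content] of Object.entries(evalSourceFiles)) {
    if (path.endsWith("grader.test.ts") || path.includes("answer/")) {
      picked[path] = content;
    }
  }
  return picked;
}
```

Starting from the first failed step keeps the investigation focused on the earliest error.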
With the data returned, compare:

- `steps`: find the first entry with `status.kind === "failed"`. The `failureReason` field has the error message.
- `outputFiles`: the model's code.
- `evalSourceFiles`: the answer directory and grader test files.
- `eval.task`: the TASK.txt content.

Common failure patterns:

Compare `outputFiles` against `evalSourceFiles` (look for files like `grader.test.ts` or `answer/`) to understand what the tests expected.

Classify the failure as one of:

Summarize: