add-eval
Design, implement, validate, and calibrate a new eval for the convex-evals suite. Use when the user wants to add a new eval, create an eval, test a new Convex concept, or expand eval coverage.
| name | add-eval |
| description | Design, implement, validate, and calibrate a new eval for the convex-evals suite. Use when the user wants to add a new eval, create an eval, test a new Convex concept, or expand eval coverage. |
Follow these steps whenever the user asks to create a new eval. Read .cursor/skills/add-eval/reference.md for grader helpers, test patterns, and conventions.
Switch to Plan mode immediately. Steps 0-2 (gather info, research, design) are collaborative and read-only. The user should see the research findings and approve the eval design before any files are created. Switch back to Agent mode at Step 3.
Determine the following (ask the user if not provided):
List the existing categories by scanning evals/ top-level directories, then propose the best-fit category. The current categories are:
| Category | Scope |
|---|---|
| 000-fundamentals | Basic Convex concepts (empty functions, schema definition, crons, scheduling) |
| 001-data_modeling | Schema design, indexes, relationships, unions, optional fields |
| 002-queries | Reading data, joins, pagination, aggregation, filtering |
| 003-mutations | Writing data, inserts, patches, deletes, cascades |
| 004-actions | HTTP fetch, file storage, node runtime, HTTP action routing |
| 005-idioms | File organization, internal functions, batch patterns |
| 006-clients | useQuery, useMutation, usePaginatedQuery |
Determine the eval number by listing existing evals in the chosen category and picking the next sequential number.
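The numbering step can be sketched as a small pure function. This is only an illustration: `nextEvalNumber` is a hypothetical helper, and it assumes eval directories follow the same zero-padded `<NNN>-<slug>` naming as the categories above, starting at `000`.

```typescript
// Hypothetical helper, not part of the repo.
// Assumes eval directories are named "<NNN>-<slug>", mirroring the
// category naming (e.g. "000-fundamentals").
function nextEvalNumber(existingDirNames: string[]): string {
  const numbers = existingDirNames
    // Ignore anything that is not a numbered directory (README.md, etc.).
    .filter((name) => /^\d+-/.test(name))
    // parseInt reads the leading digits and stops at the dash.
    .map((name) => Number.parseInt(name, 10));
  const next = numbers.length === 0 ? 0 : Math.max(...numbers) + 1;
  // Zero-pad to three digits to match the existing prefixes.
  return String(next).padStart(3, "0");
}
```

Passing the directory listing of the chosen category yields the prefix for the new eval slug. Gaps in the existing sequence are not reused, since the function always returns the highest number plus one.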
Run these four research tracks. Use sub-agents or parallel tool calls where possible.
- Fetch https://docs.convex.dev/llms.txt to get the docs table of contents.
- Check runner/models/guidelines.ts (or the generated guidelines) for the concept being tested.
- See reference.md for the full catalog.
- Scan evals/ to get a high-level view of coverage.

You should already be in Plan mode. Present the full eval design to the user for review.
Write the complete TASK.txt content. Follow these principles:
- internalMutation or how pagination works, that's a meaningful signal, not a task problem.

Describe the files that will be created and the key implementation approach. Don't write the full code yet, just the structure and important decisions.
Describe how the eval will be graded:
- reference.md for the catalog and decision tree).

If unit tests cannot fully verify the concept (e.g. testing that a model uses an index rather than a filter, or testing code organization patterns), STOP and discuss with the user. Present the options:
- getSchema, hasIndexForFields)
- createAIGraderTest, currently disabled and requires a repo change to re-enable)

Let the user decide before proceeding.
Summarize the guideline context before implementation:
If a new guideline is likely needed, design it minimal-first. Every token in the guidelines is sent with every prompt, so bloat costs real money. Start with the smallest guideline that teaches the critical pattern (usually one code example), test it, and only expand if models still fail. Avoid pinning specific dependency versions in guidelines as they age quickly. Prefer "always install the latest version" instead.
Before presenting the design, critically evaluate it. Warn the user if:
- .withIndex() calls). We're testing knowledge, not instruction-following.

After the user approves the design, switch back to Agent mode and implement:
Create directory: evals/<category>/<eval_slug>/
Write TASK.txt with the approved content.
Create answer directory:
answer/package.json:
{
"name": "convexbot",
"version": "1.0.0",
"dependencies": {
"convex": "^1.31.2"
}
}
- answer/convex/schema.ts (if applicable)
- answer/convex/index.ts)

Run codegen:
cd evals/<category>/<eval_slug>/answer && bunx convex codegen
Write grader.test.ts using the approved test approach. Import paths are relative:
import { responseClient, responseAdminClient, addDocuments } from "../../../grader";
import { api } from "./answer/convex/_generated/api";
Adjust the depth of ../ based on the eval's nesting level.
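As a rough illustration of the depth rule, the number of `../` segments equals the number of directories between the repo root and the eval. The helper below is hypothetical and assumes the shared grader module sits at the repo root, which is what the `../../../grader` import from `evals/<category>/<eval_slug>/` implies.

```typescript
// Hypothetical helper, not part of the repo.
// `evalDir` is the eval directory relative to the repo root.
function graderImportPath(evalDir: string): string {
  // Count non-empty path segments so trailing slashes do not add depth.
  const depth = evalDir.split("/").filter((segment) => segment.length > 0).length;
  return "../".repeat(depth) + "grader";
}
```

For a standard eval such as `evals/002-queries/009-example` (an invented slug) this gives `../../../grader`; an eval nested one level deeper would need one more `../`.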
Typecheck:
bun run typecheck
First run canonical answer validation for the new eval:
TEST_FILTER=<category>/<eval_slug> bun run validate:answers
Then run the eval for one model as a smoke test. This validates model generation against the new eval:
MODELS=anthropic/claude-sonnet-4.6 TEST_FILTER=<category>/<eval_slug> bun run local:run
Do NOT set CONVEX_EVAL_URL or CONVEX_AUTH_TOKEN, so results stay local-only.
If the smoke test fails:
- run.log to understand what happened.

Start with a smaller representative set of models to calibrate difficulty. If the result is unclear, expand to a broader sweep. Launch separate background processes, one per model:
# Suggested first-pass set
MODELS=anthropic/claude-sonnet-4.6 TEST_FILTER=<category>/<eval_slug> bun run local:run &
MODELS=openai/gpt-5.4 TEST_FILTER=<category>/<eval_slug> bun run local:run &
MODELS=google/gemini-3.1-pro-preview TEST_FILTER=<category>/<eval_slug> bun run local:run &
MODELS=anthropic/claude-haiku-4.5 TEST_FILTER=<category>/<eval_slug> bun run local:run &
wait
If those results are too noisy or too uniform, expand to a broader sweep across providers and tiers. The user can override the list. Do NOT set CONVEX_EVAL_URL or CONVEX_AUTH_TOKEN.
Monitor the background processes by reading their terminal output files. Each process runs one eval so they should complete in a few minutes.
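If the runs are launched from a script rather than typed by hand, the local-only rule can be enforced when building each process environment. The sketch below is an assumption about how one might do that; `localRunEnv` is a hypothetical helper, and only the four variable names come from this document.

```typescript
type Env = Record<string, string | undefined>;

// Variables that must stay unset so results remain local-only.
const REMOTE_REPORTING_VARS = ["CONVEX_EVAL_URL", "CONVEX_AUTH_TOKEN"];

// Hypothetical helper: builds the environment for one `bun run local:run`
// invocation, copying the base environment and dropping the remote variables.
function localRunEnv(baseEnv: Env, model: string, testFilter: string): Env {
  const env: Env = { ...baseEnv, MODELS: model, TEST_FILTER: testFilter };
  for (const name of REMOTE_REPORTING_VARS) {
    delete env[name];
  }
  return env;
}
```

The returned object could be passed as the `env` option of whatever process spawner the script uses. Stripping the variables instead of trusting the caller's shell guards against a token left over from an earlier session.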
Run each model at least twice (ideally three times) to distinguish systematic failures from flaky ones. Model output is non-deterministic, so a single pass or fail doesn't tell the whole story. Common variance sources:
Collect pass/fail from all model runs and present a summary table:
Model Result
----------------------------- ------
anthropic/claude-sonnet-4.6 PASS
anthropic/claude-haiku-4.5 FAIL
openai/gpt-5.4 PASS
...
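Because each model is run more than once, the per-model result has to be collapsed before it goes into the table. The sketch below is one possible way to do that. The `FLAKY` label and both function names are illustrative assumptions; the table format shown in this document only uses PASS and FAIL.

```typescript
type Verdict = "PASS" | "FAIL" | "FLAKY";

// Collapse repeated runs of one model into a single verdict.
// FLAKY (a hypothetical label) marks models that both passed and failed.
function verdict(runs: boolean[]): Verdict {
  if (runs.length === 0) throw new Error("no runs recorded");
  if (runs.every((passed) => passed)) return "PASS";
  if (runs.every((passed) => !passed)) return "FAIL";
  return "FLAKY";
}

// Render a plain-text table in the same two-column layout as above.
function summaryTable(results: Record<string, boolean[]>): string {
  const entries = Object.entries(results);
  const width = Math.max("Model".length, ...entries.map(([model]) => model.length));
  const lines = [
    "Model".padEnd(width) + "  Result",
    "-".repeat(width) + "  ------",
    ...entries.map(([model, runs]) => model.padEnd(width) + "  " + verdict(runs)),
  ];
  return lines.join("\n");
}
```

A model that passes two runs and fails one would show up as FLAKY, which is a cue to look at its output before calling it a systematic failure.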
Then assess the results:
Then explicitly ask: is this primarily an eval/task gap, a model gap, or a guideline gap?
- validate-guidelines skill after the eval is settled.

Model output is preserved in the temp directory printed at the start of each run (Using tempdir: ...). These directories are NOT cleaned up automatically. For each model, look at:
- <tempdir>/output/<provider>/<model>/<category>/<eval>/ for the generated source files
- run.log in that directory for step-by-step output (install, deploy, tsc, eslint, vitest)
- node_modules/ for the actual resolved dependency versions
- package.json for what the model requested vs what was installed

This is essential for distinguishing "model wrote bad code" from "model's code is fine but the grader is too strict" from "dependency version bug".
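To make the directory layout concrete, the path for one model's output can be assembled from values already used in the run commands. This is a sketch with hypothetical helper names. It assumes the model id has the `<provider>/<model>` shape of the MODELS values and the eval id has the `<category>/<eval>` shape of the TEST_FILTER values.

```typescript
// Hypothetical helper: builds <tempdir>/output/<provider>/<model>/<category>/<eval>/
function modelOutputDir(tempdir: string, modelId: string, evalId: string): string {
  // Drop trailing slashes so the joined path has no doubled separators.
  const root = tempdir.replace(/\/+$/, "");
  return `${root}/output/${modelId}/${evalId}/`;
}

// The artifacts listed above, resolved against that directory.
function artifactsToInspect(outputDir: string): string[] {
  return [
    `${outputDir}run.log`,
    `${outputDir}package.json`,
    `${outputDir}node_modules/`,
  ];
}
```

The tempdir argument is whatever the run printed after "Using tempdir:", so nothing here guesses its location.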
Push back with specific recommendations if calibration looks off. Suggest concrete changes to the task, answer, or tests.
- bun run typecheck passes
- bun run validate:answers passes for the new eval