| name | behavioral-evals |
| description | Guidance for creating, running, fixing, and promoting behavioral evaluations. Use when verifying agent decision logic, debugging failures, debugging prompt steering, or adding workspace regression tests. |
Behavioral Evals
Overview
Behavioral evaluations (evals) are tests that validate the agent's decision-making (e.g., tool choice) rather than pure functionality. They are critical for verifying prompt changes, debugging steerability, and preventing regressions.
> [!NOTE]
> Single Source of Truth: For core concepts, policies, running tests, and general best practices, always refer to evals/README.md.
🔄 Workflow Decision Tree
- Does a prompt/tool change need validation?
  - No -> Normal integration tests.
  - Yes -> Continue below.
- Is it UI/Interaction heavy?
- Is it a new test?
  - Yes -> Set policy to USUALLY_PASSES.
  - No -> ALWAYS_PASSES (locks in regression).
- Are you fixing a failure or promoting a test?
📋 Quick Checklist
1. Setup Workspace
Seed the workspace with the files the scenario needs, using the files object to simulate a realistic project (e.g., a Node.js project with a package.json).
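As a rough illustration of the seeding step, the sketch below builds a files object for a small Node.js project. The `WorkspaceFiles` type, the `listSeededPaths` helper, and the file contents are all hypothetical; the shape the eval harness actually expects is documented in evals/README.md and may differ.

```typescript
// Hypothetical shape: relative path -> file contents.
// The real `files` option may accept a different structure.
type WorkspaceFiles = Record<string, string>;

// A minimal but realistic-looking Node.js project.
const files: WorkspaceFiles = {
  "package.json": JSON.stringify(
    { name: "demo-app", version: "1.0.0", scripts: { test: "vitest run" } },
    null,
    2,
  ),
  "src/index.ts": "export const add = (a: number, b: number) => a + b;\n",
  "README.md": "# demo-app\n",
};

// Illustrative helper: lists the seeded paths in sorted order so a test
// author can eyeball what the agent will see in the workspace.
function listSeededPaths(seed: WorkspaceFiles): string[] {
  return Object.keys(seed).sort();
}
```

Keeping the seed small makes it easier to attribute the agent's decisions to the prompt rather than to incidental files.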
2. Write Assertions
Audit agent decisions using rig.setBreakpoint() (AppRig only) or index verification on rig.readToolLogs().
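One way to read "index verification" is comparing the positions of tool calls in the log to confirm the agent acted in the intended order. The sketch below assumes a simplified log entry with a single `toolName` field; the objects actually returned by rig.readToolLogs() may be shaped differently, and the helper names and tool names here are hypothetical.

```typescript
// Hypothetical, simplified log entry. The real entries returned by
// rig.readToolLogs() may carry more fields or different names.
interface ToolLogEntry {
  toolName: string;
}

// Index of the first call to `toolName`, or -1 if it was never called.
function firstCallIndex(logs: ToolLogEntry[], toolName: string): number {
  return logs.findIndex((entry) => entry.toolName === toolName);
}

// True only when both tools were called and the first call to `earlier`
// comes before the first call to `later`.
function calledBefore(
  logs: ToolLogEntry[],
  earlier: string,
  later: string,
): boolean {
  const a = firstCallIndex(logs, earlier);
  const b = firstCallIndex(logs, later);
  return a !== -1 && b !== -1 && a < b;
}

// Sample data standing in for a real tool log (tool names are made up).
const sampleLogs: ToolLogEntry[] = [
  { toolName: "read_file" },
  { toolName: "replace" },
  { toolName: "run_shell_command" },
];
```

Inside an eval's assertion, a check such as `calledBefore(logs, "read_file", "replace")` would express "the agent read the file before editing it" without depending on exact call counts.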
3. Verify
Run single tests locally with Vitest. Confirm stability locally before relying on CI workflows.
📦 Bundled Resources
Detailed procedural guides:
- creating.md: Assertion strategies, Rig selection, Mock MCPs.
- fixing.md: Step-by-step automated investigation, architecture diagnosis guidelines.
- promoting.md: Candidate identification criteria and threshold guidelines.