| name | agent-evaluation-framework |
| description | Workflow for evaluating and refining agent debugging capabilities using designated test cases and Swarm principles. Use when evaluating subagent performance or creating benchmarks. Do not use for regular bug fixing. |
Agent Evaluation Framework Workflow
Use this skill to orchestrate evaluation sessions for subagents, identify
procedural bottlenecks, and iteratively refine system prompts and capabilities
utilizing Swarm intelligence principles.
0. Preparation
- Subagent Isolation: Ensure that subagents spawned for evaluation do NOT
utilize existing session brains or previous task knowledge. This is critical
to maintain the integrity of meta-testing.
- Worktree Pre-creation: Create an isolated git worktree using git worktree
for each test case beforehand. Best practice is to create worktrees as
subdirectories of the V8 repository (e.g., in a worktrees/ directory within
the V8 root). Report to the user where the worktrees were created. Builds in a
worktree MUST use that worktree's own copy of tools/dev/gm.py. gm.py
automatically runs setup_worktree_build.py to prepare the symlinks, so
setup_worktree_build.py does not need to be run manually.
- Test Injection: Copy the target test case into the worktree (e.g.,
test/mjsunit/repro.js).
- Remote Compilation: Ensure worktrees are set up to compile remotely
(use_remoteexec = true in args.gn) before proceeding.
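The preparation steps above can be sketched as a small planner that only builds the commands and leaves execution to the orchestrator. The worktrees/ location, test/mjsunit/repro.js, and tools/dev/gm.py come from the text; the function names, the --detach choice, the use of cp, and the x64.release build configuration are assumptions for illustration.

```python
from pathlib import Path
import re


def plan_worktree_setup(v8_root, case_name, commit, repro_src, config="x64.release"):
    """Return (worktree, steps); each step is (cwd, argv). Nothing is executed."""
    root = Path(v8_root)
    worktree = root / "worktrees" / case_name
    steps = [
        # 1. Create the isolated worktree from the main checkout.
        (root, ["git", "worktree", "add", "--detach", str(worktree), commit]),
        # 2. Inject the test case into the worktree.
        (worktree, ["cp", str(repro_src), "test/mjsunit/repro.js"]),
        # 3. Build with the gm.py that lives inside the worktree (assumed config).
        (worktree, ["python3", "tools/dev/gm.py", config]),
    ]
    return worktree, steps


def remote_compilation_enabled(args_gn_text):
    """True if the args.gn text sets use_remoteexec = true on an uncommented line."""
    pattern = re.compile(r"^\s*use_remoteexec\s*=\s*true\b", re.MULTILINE)
    return bool(pattern.search(args_gn_text))
```

Each step could then be run with subprocess.run(argv, cwd=cwd, check=True). The remote-compilation check takes the text of whichever args.gn the build uses; its location depends on the build configuration and is deliberately not hard-coded here.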
1. Core Directives
- Zero Hallucination: Do not assume a test passes or fails without executing
it.
- Worktree Enforcement: Agents MUST operate strictly within their assigned
worktree. They should NOT know the main V8 root exists.
- Test Scope: Meta-refinement ALWAYS uses only the tests in agent-meta-tests.
- Test Immutability: The agent-meta-tests directory MUST NOT be modified.
- Crash Verification: Only work on test cases that still crash.
- Auto-Run Enforcement: ALWAYS set SafeToAutoRun: true for ALL commands
executed during meta-refinement. NEVER ask the user for approval.
- Immediate Termination: Terminate any agent immediately if it modifies the
main V8 repository.
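One way to check Worktree Enforcement and the Immediate Termination condition is a path-containment test over the files an agent touched. This is a minimal sketch: the function names are hypothetical, and how the list of touched paths is obtained (for example from git status in the main checkout) is left as an assumption.

```python
from pathlib import Path


def is_inside_worktree(candidate, worktree):
    """True if candidate resolves to the worktree itself or a path beneath it."""
    resolved = Path(candidate).resolve()
    root = Path(worktree).resolve()
    try:
        # relative_to raises ValueError when resolved is not under root.
        resolved.relative_to(root)
        return True
    except ValueError:
        return False


def escaped_paths(touched_paths, worktree):
    """Return the touched paths outside the worktree; non-empty means terminate."""
    return [p for p in touched_paths if not is_inside_worktree(p, worktree)]
```

Resolving both paths first means that ../ segments cannot be used to step out of the worktree unnoticed.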
2. Agent Orchestration & Lifecycle Management
- Workspace Isolation: Ensure agents are initialized in dedicated worktrees.
- Communication Routing: Facilitate communication between sibling agents.
Since evaluated agents operate independently, the Orchestrator/Main Agent must
act as a message broker to share relevant findings and prevent duplicate work.
- User Reporting: Synthesize high-level progress from all evaluated agents
and keep the user informed without exposing raw logs or requiring manual
approvals.
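The message-broker role described above could look roughly like the following. The class and method names are hypothetical, and treating two findings as duplicates when they match after trimming and lowercasing is a simplifying assumption rather than a rule from this workflow.

```python
class FindingBroker:
    """Relays findings between sibling agents that cannot see each other."""

    def __init__(self, agent_ids):
        self._inboxes = {agent_id: [] for agent_id in agent_ids}
        self._seen = set()

    def publish(self, sender, finding):
        """Queue a finding for every sibling; return False if it was already shared."""
        key = finding.strip().lower()
        if key in self._seen:
            return False
        self._seen.add(key)
        for agent_id, inbox in self._inboxes.items():
            if agent_id != sender:
                inbox.append((sender, finding))
        return True

    def drain(self, agent_id):
        """Return and clear the pending messages for one agent."""
        messages = self._inboxes[agent_id]
        self._inboxes[agent_id] = []
        return messages
```

The orchestrator would call publish whenever an evaluated agent reports something relevant and drain before sending that agent its next instruction.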
3. Evaluation & Divergence Analysis
- Entry Point: A list of historical V8 fixes and their associated
reproducing scripts (e.g., from test/mjsunit/ or Buganizer).
- Execution: Initialize the agent in an isolated worktree checked out at the
parent commit of the target fix. Copy the repro script into the worktree and
instruct the agent to resolve the bug.
- Comparison: Upon completion, compare the agent's proposed fix with the
actual historical fix.
- Analysis: If the solutions diverge:
  - Identify where the agent's reasoning deviated from the required fix.
  - Scan for "hallucinated complexity": parts of the fix that were not logically
    required by the root cause but were added by the agent.
  - Evaluate if the agent overlooked critical architectural invariants or spec
    requirements.
- Hasty Fix Detection: Specifically check whether the agent's solution merely
disabled an optimization or feature instead of fixing the underlying logic
error.
- Root Cause Tracing: Manually trace the logical steps required to reach the
correct historical fix. Identify the exact decision point at which the agent
chose a shallow path over a deep one.
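As a first pass over the Comparison step, the files touched by the agent's diff can be set against those touched by the historical fix. This is a minimal sketch with hypothetical function names. It assumes both fixes are available as git-style unified diffs, and file-level overlap is only a coarse signal: the reasoning analysis described above still has to be done by reading the changes.

```python
def changed_files(diff_text):
    """Collect paths from the '+++ b/<path>' headers of a git-style unified diff."""
    files = set()
    for line in diff_text.splitlines():
        # Deleted files appear as '+++ /dev/null' and are not counted here.
        if line.startswith("+++ b/"):
            files.add(line[len("+++ b/"):].strip())
    return files


def compare_fixes(agent_diff, historical_diff):
    """Summarize file-level divergence between the agent's fix and the real one."""
    agent = changed_files(agent_diff)
    historical = changed_files(historical_diff)
    return {
        "shared": sorted(agent & historical),
        # Touched only by the agent: candidates for hallucinated complexity.
        "agent_only": sorted(agent - historical),
        # Touched only by the historical fix: places the agent never reached.
        "missed": sorted(historical - agent),
    }
```

A non-empty agent_only list is a prompt to check for hallucinated complexity or a hasty fix, and a non-empty missed list points at where root cause tracing should start.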
4. Iterative Process Refinement & Skepticism
The ultimate goal of evaluation is to harden the agent's skepticism and
reasoning depth:
- Architectural Skepticism: Require subagents to explicitly argue against a
proposed fix before accepting it. Look at the problem from multiple orthogonal
angles.
- Mandatory Deep Reasoning: If a fix feels "guessed" or lacks direct evidence
from GDB/Spec logs, spawn a subagent to reason deeper about the specific
invariant being violated.
- Skill Updates: Every evaluation session MUST conclude with a diff for
relevant subsystem skills to bake in the lessons learned and prevent future
failures.
- analyze_brain.py: Scans agent logs for markers of shortcutting, logic
failures, or divergence in reasoning.