agent-evaluation-framework

Name: Agent Evaluation Framework
Author: v8

// Workflow for evaluating and refining agent debugging capabilities using designated test cases and Swarm principles. Use when evaluating subagent performance or creating benchmarks. Do not use for regular bug fixing.

$ git log --oneline --stat

stars:25,042

forks:4,274

updated:May 12, 2026 at 10:16

SKILL.md

name	agent-evaluation-framework
description	Workflow for evaluating and refining agent debugging capabilities using designated test cases and Swarm principles. Use when evaluating subagent performance or creating benchmarks. Do not use for regular bug fixing.

Agent Evaluation Framework Workflow

Use this skill to orchestrate evaluation sessions for subagents, identify procedural bottlenecks, and iteratively refine system prompts and capabilities utilizing Swarm intelligence principles.

0. Preparation

Subagent Isolation: Ensure that subagents spawned for evaluation do NOT utilize existing session brains or previous task knowledge. This is critical to maintain the integrity of meta-testing.
Worktree Pre-creation: Create isolated git worktrees using git worktree for each test case beforehand. Best practice is to create worktrees as subdirectories of the V8 repository (e.g., in a worktrees/ directory within the V8 root). Report where the worktrees were created to the user. Inside worktrees, builds MUST use the tools/dev/gm.py tool INSIDE the worktree. gm.py will automatically run setup_worktree_build.py to prepare the symlinks; manual execution of setup_worktree_build.py is not required.
Test Injection: Copy the target test case into the worktree (e.g., test/mjsunit/repro.js).
Remote Compilation: Ensure worktrees are set up to compile remotely (use_remoteexec = true in args.gn) before proceeding.
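The preparation steps above could be scripted roughly as follows. This is a minimal sketch, not the skill's own tooling: the helper names, the worktrees/<name> layout, the x64.debug target, and the choice of a detached checkout are illustrative assumptions. The functions only build commands and file content, so they can be inspected before anything is executed.

```python
from pathlib import Path


def worktree_add_command(v8_root: Path, name: str, commit: str) -> list[str]:
    """Build the `git worktree add` command for one test case.

    Assumption: worktrees live under <v8_root>/worktrees/<name> and are
    checked out detached at `commit`.
    """
    worktree = v8_root / "worktrees" / name
    return ["git", "-C", str(v8_root), "worktree", "add",
            "--detach", str(worktree), commit]


def ensure_remoteexec(args_gn_text: str) -> str:
    """Return args.gn content with `use_remoteexec = true` set exactly once."""
    kept = [
        line for line in args_gn_text.splitlines()
        # Drop any existing assignment so the setting is not duplicated.
        if line.split("=")[0].strip() != "use_remoteexec"
    ]
    kept.append("use_remoteexec = true")
    return "\n".join(kept) + "\n"


def gm_command(worktree: Path, target: str = "x64.debug") -> list[str]:
    """Build the gm.py invocation, using the copy INSIDE the worktree.

    Assumption: `target` is a hypothetical default; pick whatever
    configuration the test case needs.
    """
    return [str(worktree / "tools" / "dev" / "gm.py"), target]
```

The resulting lists can be passed to subprocess.run with cwd set to the worktree. Copying the repro into test/mjsunit/ inside the worktree is then a plain file copy.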

1. Core Directives

Zero Hallucination: Do not assume a test passes or fails without executing it.
Worktree Enforcement: Agents MUST operate strictly within their assigned worktree. They should NOT know the main V8 root exists.
Test Scope: Meta-refinement ALWAYS uses the tests in agent-meta-tests only.
Test Immutability: The agent-meta-tests directory cannot be changed.
Crash Verification: Only work on test cases that still crash.
Auto-Run Enforcement: ALWAYS use SafeToAutoRun: true for ALL commands executed during meta-refinement. NEVER ask the user for approval.
Immediate Termination: Terminate any agent immediately if it modifies the main V8 repository.
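Two of these directives lend themselves to mechanical checks: confirming that a path stays inside the assigned worktree, and confirming by execution that a repro still crashes. The sketch below is one way an orchestrator might do that. The function names are hypothetical, and treating any non-zero exit code as a crash is a simplifying assumption that will also count ordinary test failures.

```python
import subprocess
from pathlib import Path


def is_inside_worktree(path: Path, worktree: Path) -> bool:
    """True if `path` resolves to a location inside the assigned worktree."""
    # resolve() collapses ".." segments and symlinks before comparing.
    return path.resolve().is_relative_to(worktree.resolve())


def still_crashes(command: list[str], cwd: Path, timeout_s: float = 60.0) -> bool:
    """Run the repro command and report whether it exits abnormally.

    Assumption: a non-zero exit code (including death by signal, which
    shows up as a negative return code on POSIX) counts as "still
    crashes"; a timeout is reported as not crashing.
    """
    try:
        result = subprocess.run(
            command, cwd=cwd, capture_output=True, timeout=timeout_s
        )
    except subprocess.TimeoutExpired:
        return False
    return result.returncode != 0
```

In practice `command` would be the d8 binary built inside the worktree plus the repro script and its flags. Running it rather than inferring the outcome is what the Zero Hallucination directive asks for.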

2. Agent Orchestration & Lifecycle Management

Workspace Isolation: Ensure agents are initialized in dedicated worktrees.
Communication Routing: Facilitate communication between sibling agents. Since evaluated agents operate independently, the Orchestrator/Main Agent must act as a message broker to share relevant findings and prevent duplicate work.
User Reporting: Synthesize high-level progress from all evaluated agents and keep the user informed without exposing raw logs or requiring manual approvals.
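The message-broker role described above can be pictured as a small in-memory router. This is only an illustrative sketch: the Broker class, its method names, and exact-match deduplication are assumptions, not part of the skill.

```python
from collections import defaultdict
from dataclasses import dataclass, field


@dataclass
class Broker:
    """Relays findings between sibling agents through the orchestrator."""
    agents: set[str] = field(default_factory=set)
    inboxes: dict[str, list[str]] = field(
        default_factory=lambda: defaultdict(list)
    )
    seen: set[str] = field(default_factory=set)

    def register(self, agent_id: str) -> None:
        self.agents.add(agent_id)

    def publish(self, sender: str, finding: str) -> bool:
        """Forward a finding to every other agent; drop exact duplicates."""
        if finding in self.seen:
            return False
        self.seen.add(finding)
        for agent in self.agents - {sender}:
            self.inboxes[agent].append(f"[from {sender}] {finding}")
        return True

    def drain(self, agent_id: str) -> list[str]:
        """Return and clear the pending messages for one agent."""
        messages = self.inboxes[agent_id]
        self.inboxes[agent_id] = []
        return messages
```

Because agents never talk to each other directly, the orchestrator can also use the same stream of findings to build the high-level summary it reports to the user.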

3. Evaluation & Divergence Analysis

Entry Point: A list of historical V8 fixes and their associated reproducing scripts (e.g., from test/mjsunit/ or Buganizer).
Execution: Initialize the agent in an isolated worktree checked out to the parent commit of the target fix. Copy the repro script and command the agent to resolve the bug.
Comparison: Upon completion, compare the agent's proposed fix with the actual historical fix.
Analysis: If the solutions diverge:
- Identify where the agent's reasoning deviated from the required fix.
- Scan for "hallucinated complexity": parts of the fix that the agent added but that the root cause did not logically require.
- Evaluate whether the agent overlooked critical architectural invariants or spec requirements.
- Hasty Fix Detection: Specifically check whether the agent's solution merely disabled an optimization or feature instead of addressing the underlying logic error.
Root Cause Tracing: Manually trace the logical steps required to reach the correct historical fix. Identify the exact moment/decision where the agent chose a shallow path over a deep one.
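A first, coarse pass at the comparison step is to check which files each fix touches. The sketch below assumes both fixes are available as git-style unified diffs (for example the output of git diff in the worktree and of the historical commit). The function names and result keys are hypothetical, and file-level overlap is only a starting point for the manual analysis, not a verdict.

```python
def touched_files(unified_diff: str) -> set[str]:
    """Collect paths from the `+++ b/<path>` headers of a git-style diff."""
    files = set()
    for line in unified_diff.splitlines():
        if line.startswith("+++ b/"):
            files.add(line[len("+++ b/"):].strip())
    return files


def compare_fixes(agent_diff: str, historical_diff: str) -> dict[str, set[str]]:
    """Split touched files into shared, agent-only, and missed sets."""
    agent = touched_files(agent_diff)
    reference = touched_files(historical_diff)
    return {
        "shared": agent & reference,
        # Files only the agent changed: candidates for hallucinated complexity.
        "agent_only": agent - reference,
        # Files only the historical fix changed: possibly overlooked invariants.
        "missed": reference - agent,
    }
```

An agent_only entry that points at a flag definition or feature toggle is worth a closer look under Hasty Fix Detection.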

4. Iterative Process Refinement & Skepticism

The ultimate goal of evaluation is to harden the agent's skepticism and reasoning depth:

Architectural Skepticism: Require subagents to explicitly argue against a proposed fix before accepting it. Look at the problem from multiple orthogonal angles.
Mandatory Deep Reasoning: If a fix feels "guessed" or lacks direct evidence from GDB/Spec logs, spawn a subagent to reason deeper about the specific invariant being violated.
Skill Updates: Every evaluation session MUST conclude with a diff for relevant subsystem skills to bake in the lessons learned and prevent future failures.
analyze_brain.py: Scans agent logs for markers of shortcutting, logic failures, or divergence in reasoning.
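The source does not show how analyze_brain.py works. As a rough illustration of the idea only, a log scanner could match lines against a few marker patterns. Everything below is an assumption: the marker categories, the phrases, and the scan_log function are invented for this sketch and do not describe the real script.

```python
import re

# Hypothetical marker patterns; the real analyze_brain.py may use other signals.
MARKERS = {
    "shortcut": re.compile(r"\b(quick fix|workaround|just disable)\b", re.IGNORECASE),
    "unverified": re.compile(r"\b(probably|should work|i assume)\b", re.IGNORECASE),
}


def scan_log(text: str) -> dict[str, list[int]]:
    """Map each marker category to the 1-based line numbers where it appears."""
    hits: dict[str, list[int]] = {name: [] for name in MARKERS}
    for number, line in enumerate(text.splitlines(), start=1):
        for name, pattern in MARKERS.items():
            if pattern.search(line):
                hits[name].append(number)
    return hits
```

Flagged lines are pointers for a human or a skeptical subagent to re-examine; a phrase match alone does not show that the reasoning was shallow.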

related-skills.json

same repository

trace-processor.md

from "v8/v8"

Managing and querying Perfetto traces using the trace_processor MCP server.

2026-05-21 · 25.0k

v8-commands.md

from "v8/v8"

Key commands for building, debugging, and testing in V8. Use when needing command syntax for gm.py, d8 flags, or run-tests.py. Do not use for environment setup.

2026-05-19 · 25.0k

workflow-perf.md

from "v8/v8"

Workflow for performance and memory evaluation in V8. Use when tasked with improving the performance or memory usage of a workload in V8. Do not use when debugging a crash or functionality issue.

2026-05-13 · 25.0k

workflow-debugging.md

from "v8/v8"

Workflow for issue-based debugging in V8. Use when tasked with debugging a specific issue, usually associated with a Buganizer ID or a specific reproduction script. Do not use for performance regressions.

2026-05-12 · 25.0k

v8-security-triaging.md

from "v8/v8"

Guides the initial analysis and impact assessment of a V8 security report, strictly excluding implementation or fixing.

2026-05-08 · 25.0k

v8-poc-classification.md

from "v8/v8"

Checks if a POC provided by some JS and d8 flags is a vulnerability or just a regular bug.

2026-05-08 · 25.0k

package.json

"author": "v8"

"repository": "v8/v8"


