agent-evaluation-framework

Name: Agent Evaluation Framework
Author: nwjs

Workflow for evaluating and refining agent debugging capabilities using designated test cases and Swarm principles. Use when evaluating subagent performance or creating benchmarks. Do not use for regular bug fixing.


stars:45

forks:47

updated: 2026-05-20 23:08

SKILL.md


name: agent-evaluation-framework
description: Workflow for evaluating and refining agent debugging capabilities using designated test cases and Swarm principles. Use when evaluating subagent performance or creating benchmarks. Do not use for regular bug fixing.

Agent Evaluation Framework Workflow

Use this skill to orchestrate evaluation sessions for subagents, identify procedural bottlenecks, and iteratively refine system prompts and capabilities utilizing Swarm intelligence principles.

0. Preparation

Subagent Isolation: Ensure that subagents spawned for evaluation do NOT utilize existing session brains or previous task knowledge. This is critical to maintain the integrity of meta-testing.
Worktree Pre-creation: Create an isolated git worktree for each test case beforehand using git worktree. Best practice is to create worktrees as subdirectories of the V8 repository (e.g., in a worktrees/ directory within the V8 root). Report to the user where the worktrees were created. Builds in a worktree MUST be run with the tools/dev/gm.py tool from INSIDE that worktree (or use tools/dev/setup_worktree_build.py to prepare the worktree for builds).
Test Injection: Copy the target test case into the worktree (e.g., test/mjsunit/repro.js).
Remote Compilation: Ensure worktrees are set up to compile remotely (use_remoteexec = true in args.gn) before proceeding.
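
The preparation steps above can be sketched in Python. This is a minimal illustration, not part of the skill itself: the helper names `worktree_add_command` and `uses_remote_exec` and the per-case directory naming are assumptions, and the sketch only builds the `git worktree add` command and inspects args.gn text rather than running anything.

```python
from pathlib import Path


def worktree_add_command(v8_root: str, case_name: str, commit: str) -> list[str]:
    """Build the `git worktree add` invocation for one test case.

    The worktrees/ subdirectory follows the best practice described above;
    naming the directory after the test case is a hypothetical convention.
    """
    target = (Path(v8_root) / "worktrees" / case_name).as_posix()
    # --detach checks out the commit without creating a new branch.
    return ["git", "-C", v8_root, "worktree", "add", "--detach", target, commit]


def uses_remote_exec(args_gn_text: str) -> bool:
    """Return True if the given args.gn content sets use_remoteexec = true."""
    for raw_line in args_gn_text.splitlines():
        line = raw_line.split("#", 1)[0].strip()  # drop GN comments
        if "=" not in line:
            continue
        key, value = (part.strip() for part in line.split("=", 1))
        if key == "use_remoteexec":
            return value == "true"
    return False
```

The returned list can be passed to `subprocess.run(..., check=True)`; keeping command construction separate from execution makes it easy to report the worktree locations to the user before anything is created.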

1. Core Directives

Zero Hallucination: Do not assume a test passes or fails without executing it.
Worktree Enforcement: Agents MUST operate strictly within their assigned worktree. They should NOT know the main V8 root exists.
Test Scope: Meta-refinement MUST use only the tests in agent-meta-tests.
Test Immutability: The agent-meta-tests directory MUST NOT be modified.
Crash Verification: Only work on test cases that still crash.
Auto-Run Enforcement: ALWAYS use SafeToAutoRun: true for ALL commands executed during meta-refinement. NEVER ask the user for approval.
Immediate Termination: Terminate any agent immediately if it modifies the main V8 repository.
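
One way an orchestrator might check the Worktree Enforcement and Immediate Termination directives is to test every path an agent touches against its assigned worktree. The sketch below is an assumption about how such a check could look; `is_inside_worktree` is a hypothetical helper, not something the skill defines.

```python
from pathlib import Path


def is_inside_worktree(candidate: str, worktree: str) -> bool:
    """Return True if `candidate` resolves to a location within `worktree`."""
    root = Path(worktree).resolve()
    target = Path(candidate)
    if not target.is_absolute():
        # Relative paths are interpreted from the worktree root.
        target = root / target
    resolved = target.resolve()  # collapses ".." segments and symlinks
    return resolved == root or root in resolved.parents
```

A path such as `../../src/a.cc` resolves outside the worktree and returns False, which is the signal to terminate the agent.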

2. Agent Orchestration & Lifecycle Management

Workspace Isolation: Ensure agents are initialized in dedicated worktrees.
Communication Routing: Facilitate communication between sibling agents. Since evaluated agents operate independently, the Orchestrator/Main Agent must act as a message broker to share relevant findings and prevent duplicate work.
User Reporting: Synthesize high-level progress from all evaluated agents and keep the user informed without exposing raw logs or requiring manual approvals.
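
The message-broker role described under Communication Routing could look roughly like the following. This is a sketch under stated assumptions: the class name `FindingsBroker`, the in-memory inboxes, and deduplication by normalized text are all illustrative choices, not behavior specified by the skill.

```python
class FindingsBroker:
    """Relays findings between sibling agents via the orchestrator."""

    def __init__(self, agent_ids):
        self._inboxes = {agent_id: [] for agent_id in agent_ids}
        self._seen = set()

    def publish(self, sender, finding):
        """Forward a finding to every sibling once; return True if it was new."""
        key = " ".join(finding.lower().split())  # normalize case and whitespace
        if key in self._seen:
            return False  # already shared, so siblings are not notified again
        self._seen.add(key)
        for agent_id, inbox in self._inboxes.items():
            if agent_id != sender:
                inbox.append((sender, finding))
        return True

    def drain(self, agent_id):
        """Return and clear the pending messages for one agent."""
        pending, self._inboxes[agent_id] = self._inboxes[agent_id], []
        return pending
```

Because agents never talk to each other directly, the orchestrator stays the single place where findings can be filtered or summarized for the user.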

3. Evaluation & Divergence Analysis

Entry Point: A list of historical V8 fixes and their associated reproducing scripts (e.g., from test/mjsunit/ or Buganizer).
Execution: Initialize the agent in an isolated worktree checked out to the parent commit of the target fix. Copy the repro script and command the agent to resolve the bug.
Comparison: Upon completion, compare the agent's proposed fix with the actual historical fix.
Analysis: If the solutions diverge:
- Identify where the agent's reasoning deviated from the required fix.
- Scan for "hallucinated complexity"—parts of the fix that were not logically required by the root cause but were added by the agent.
- Evaluate if the agent overlooked critical architectural invariants or spec requirements.
- Hasty Fix Detection: Specifically check whether the agent's solution merely disabled an optimization or feature instead of addressing the underlying logic error.
Root Cause Tracing: Manually trace the logical steps required to reach the correct historical fix. Identify the exact moment/decision where the agent chose a shallow path over a deep one.
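
A coarse first pass at the Comparison step is to compare which files each fix touches. The sketch below assumes both fixes are available as git-style unified diffs; `touched_files` and `divergence_report` are hypothetical names. It only flags where to look and does not replace the manual reasoning analysis described above.

```python
def touched_files(unified_diff: str) -> set[str]:
    """Collect file paths from the '+++ b/<path>' headers of a git diff.

    Limitation: a hunk line whose content itself begins with '++ b/' would be
    misread as a header; deleted files ('+++ /dev/null') are ignored.
    """
    prefix = "+++ b/"
    files = set()
    for line in unified_diff.splitlines():
        if line.startswith(prefix):
            files.add(line[len(prefix):].strip())
    return files


def divergence_report(agent_diff: str, historical_diff: str) -> dict[str, list[str]]:
    """Summarize file-level overlap between the agent's fix and the real fix."""
    agent = touched_files(agent_diff)
    reference = touched_files(historical_diff)
    return {
        "shared": sorted(agent & reference),
        # Files only the agent changed: candidates for hallucinated complexity.
        "agent_only": sorted(agent - reference),
        # Files only the historical fix changed: places the agent never reached.
        "missed": sorted(reference - agent),
    }
```

A non-empty `agent_only` list that names a flags or feature-toggle file is a reasonable prompt to run the Hasty Fix Detection check.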

4. Iterative Process Refinement & Skepticism

The ultimate goal of evaluation is to harden the agent's skepticism and reasoning depth:

Architectural Skepticism: Require subagents to explicitly argue against a proposed fix before accepting it. Look at the problem from multiple orthogonal angles.
Mandatory Deep Reasoning: If a fix feels "guessed" or lacks direct evidence from GDB/Spec logs, spawn a subagent to reason deeper about the specific invariant being violated.
Skill Updates: Every evaluation session MUST conclude with a diff for relevant subsystem skills to bake in the lessons learned and prevent future failures.
analyze_brain.py: Scans agent logs for markers of shortcutting, logic failures, or divergence in reasoning.
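
The kind of log scan attributed to analyze_brain.py might be approximated as below. This is not the actual script: its markers and output format are not given here, so the pattern names and phrases in `SHORTCUT_PATTERNS` and the `scan_log` function are purely illustrative assumptions.

```python
import re

# Hypothetical marker phrases; the real analyze_brain.py markers are not documented here.
SHORTCUT_PATTERNS = {
    "guessing": re.compile(r"\b(probably|i guess|should be fine)\b", re.IGNORECASE),
    "disabled_feature": re.compile(
        r"\bdisabl(e|ed|ing) (the )?(optimization|feature|flag)\b", re.IGNORECASE
    ),
    "unverified_result": re.compile(
        r"\bassum(e|ed|ing) the test (passes|fails)\b", re.IGNORECASE
    ),
}


def scan_log(log_text: str) -> list[tuple[int, str, str]]:
    """Return (line_number, marker_name, line) for each suspicious log line."""
    hits = []
    for number, line in enumerate(log_text.splitlines(), start=1):
        for name, pattern in SHORTCUT_PATTERNS.items():
            if pattern.search(line):
                hits.append((number, name, line.strip()))
    return hits
```

Keyword matching like this produces false positives, so hits are best treated as pointers for the Root Cause Tracing step rather than as verdicts.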


package.json

"author": "nwjs"

"repository": "nwjs/v8"


