agent-evaluation-direct
Evaluate VibeTeam agents by running tasks directly via scripts/run_agent.py and comparing responses
| Field | Value |
|---|---|
| name | agent-evaluation-direct |
| description | Evaluate VibeTeam agents by running tasks directly via scripts/run_agent.py and comparing responses |
| license | MIT |
| compatibility | opencode |
| metadata | {"audience":"developers","workflow":"evaluation"} |
Evaluate agents by running tasks directly and comparing responses. OpenCode submits tasks to each agent using scripts/run_agent.py, then evaluates results with agents/benchmark.py.
- Run scripts/run_agent.py for each framework
- Use ComparativeEvaluator to score responses

Run single agent:
python scripts/run_agent.py autogen "List 3 GitHub issues"
python scripts/run_agent.py crewai "List 3 GitHub issues"
python scripts/run_agent.py openhands "List 3 GitHub issues"
Run all agents:
python scripts/run_agent.py all "List 3 GitHub issues"
JSON output (for parsing):
python scripts/run_agent.py autogen "List 3 GitHub issues" --json
Options:
- --role: Agent role (software_engineer, support_engineer, or release_engineer)
- --json: Output as JSON
- --timeout: Timeout in seconds (default: 180)

For each agent run, capture:
| Field | Description |
|---|---|
| Framework | autogen, crewai, openhands |
| Task | The input task |
| Response | Agent's output |
| Latency | Time in ms |
| Success | true/false |
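
The captured fields above can be held in a small record type. The following is a minimal sketch, assuming the --json output is a single JSON object with keys named framework, task, response, latency_ms, and success. Only the response field is mentioned in this document; the other key names, the AgentRun class, and the parse_run helper are hypothetical and should be checked against the actual output of scripts/run_agent.py.

```python
import json
from dataclasses import dataclass


@dataclass
class AgentRun:
    """One captured run; mirrors the table of captured fields."""
    framework: str     # autogen, crewai, or openhands
    task: str          # the input task
    response: str      # the agent's output
    latency_ms: float  # time in ms
    success: bool      # true/false


def parse_run(raw: str) -> AgentRun:
    """Parse one JSON payload into an AgentRun.

    Key names other than "response" are assumptions, not a documented schema.
    """
    data = json.loads(raw)
    return AgentRun(
        framework=data["framework"],
        task=data["task"],
        response=data.get("response", ""),
        latency_ms=float(data.get("latency_ms", 0)),
        success=bool(data.get("success", False)),
    )


# Hypothetical payload used only to exercise the parser.
sample = (
    '{"framework": "autogen", "task": "List 3 GitHub issues", '
    '"response": "1. ...", "latency_ms": 1234, "success": true}'
)
run = parse_run(sample)
print(run.framework, run.latency_ms, run.success)
```

Keeping the record flat makes it straightforward to write one row per run into a report.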
| Framework | Input | Output | Score | Feedback | Recommendations |
|---|---|---|---|---|---|
| AutoGen | {task} | {truncated}... | 4/5 | {feedback} | {improvements} |
| CrewAI | {task} | {truncated}... | 3/5 | {feedback} | {improvements} |
| OpenHands | {task} | {truncated}... | 5/5 | {feedback} | {improvements} |
| Metric | Value |
|---|---|
| Winner | {framework} |
| Reasoning | {why} |
| Judge Model | {model} |
| Eval Time | {ms} |
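
The report templates above could be filled in programmatically. The sketch below is illustrative only: the dictionary keys, the helper names, and the sample data are hypothetical, and choosing the winner by highest score is an assumption (in practice the winner and reasoning may come from the judge model).

```python
def truncate(text: str, limit: int = 40) -> str:
    """Flatten whitespace and shorten long output for the Output column."""
    flat = " ".join(text.split())
    return flat if len(flat) <= limit else flat[:limit] + "..."


def render_comparison(task: str, results: list) -> str:
    """Build the per-framework comparison table as markdown text."""
    lines = [
        "| Framework | Input | Output | Score | Feedback | Recommendations |",
        "|---|---|---|---|---|---|",
    ]
    for r in results:
        lines.append(
            f"| {r['framework']} | {task} | {truncate(r['output'])} "
            f"| {r['score']}/5 | {r['feedback']} | {r['recommendations']} |"
        )
    return "\n".join(lines)


def pick_winner(results: list) -> str:
    """Return the framework with the highest score (first one wins ties)."""
    return max(results, key=lambda r: r["score"])["framework"]


# Hypothetical results, reusing the scores from the template above.
results = [
    {"framework": "AutoGen", "output": "issue list", "score": 4,
     "feedback": "clear", "recommendations": "add links"},
    {"framework": "CrewAI", "output": "issue list", "score": 3,
     "feedback": "terse", "recommendations": "add titles"},
    {"framework": "OpenHands", "output": "issue list", "score": 5,
     "feedback": "complete", "recommendations": "none"},
]
table = render_comparison("List 3 GitHub issues", results)
print(table)
print("Winner:", pick_winner(results))
```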
# Run and capture output
python scripts/run_agent.py autogen "YOUR_TASK" --json > /tmp/autogen.json
python scripts/run_agent.py crewai "YOUR_TASK" --json > /tmp/crewai.json
python scripts/run_agent.py openhands "YOUR_TASK" --json > /tmp/openhands.json
Or run all at once:
python scripts/run_agent.py all "YOUR_TASK" --json
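
The same capture step can be driven from Python instead of shell redirection. This is a minimal sketch, assuming the script is run from the repository root, that it prints a single JSON document to stdout when --json is given, and that --role and --timeout take a value as the next argument. The helper names build_command and run_and_capture are hypothetical.

```python
import json
import subprocess
import sys

FRAMEWORKS = ("autogen", "crewai", "openhands")


def build_command(framework, task, role=None, timeout=None):
    """Assemble the argument list for scripts/run_agent.py with --json."""
    cmd = [sys.executable, "scripts/run_agent.py", framework, task, "--json"]
    if role is not None:
        cmd += ["--role", role]
    if timeout is not None:
        cmd += ["--timeout", str(timeout)]
    return cmd


def run_and_capture(framework, task, runner=subprocess.run, **options):
    """Run one agent and return its parsed JSON output.

    `runner` is injectable so the function can be exercised without
    actually invoking an agent.
    """
    proc = runner(
        build_command(framework, task, **options),
        capture_output=True,
        text=True,
        check=False,
    )
    # Assumes stdout holds exactly one JSON document.
    return json.loads(proc.stdout)
```

For example, looping over FRAMEWORKS and calling run_and_capture(name, "YOUR_TASK", role="software_engineer") would collect one parsed result per framework. Passing the task as a list element rather than a shell string avoids quoting problems when the task contains quotes or spaces.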
Use ComparativeEvaluator from agents/benchmark.py:
- Extract the response field from each agent's output
- Call evaluator.evaluate(task, responses)

| Score | Meaning |
|---|---|
| 0 | Failed/error |
| 1 | Mostly wrong |
| 2 | Partial |
| 3 | Acceptable |
| 4 | Good |
| 5 | Excellent |
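
The extraction step and the scoring rubric above can be sketched as follows. The shape of the responses argument is not specified in this document; the sketch assumes a mapping from framework name to response text, and the helper names are hypothetical. The evaluator call is left as a comment because the constructor arguments of ComparativeEvaluator are not documented here.

```python
import json

# Rubric from the score table above.
SCORE_MEANINGS = {
    0: "Failed/error",
    1: "Mostly wrong",
    2: "Partial",
    3: "Acceptable",
    4: "Good",
    5: "Excellent",
}


def collect_responses(raw_outputs: dict) -> dict:
    """Pull the `response` field out of each agent's JSON output."""
    return {
        framework: json.loads(raw).get("response", "")
        for framework, raw in raw_outputs.items()
    }


def describe_score(score: int) -> str:
    """Render a score with its rubric label, rejecting out-of-range values."""
    if score not in SCORE_MEANINGS:
        raise ValueError(f"score must be an integer from 0 to 5, got {score!r}")
    return f"{score}/5 ({SCORE_MEANINGS[score]})"


# Hypothetical captured outputs, e.g. the contents of /tmp/autogen.json etc.
raw_outputs = {
    "autogen": '{"response": "Issue A, Issue B, Issue C"}',
    "crewai": '{"response": "Issue A"}',
}
responses = collect_responses(raw_outputs)

# With the real evaluator (from agents/benchmark.py) this would continue as:
#   result = evaluator.evaluate(task, responses)
print(responses)
print(describe_score(4))
```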
| Task | Role |
|---|---|
| List 3 recent GitHub issues | software_engineer |
| Summarize Sentry errors this week | support_engineer |
| Generate release notes for v1.2.0 | release_engineer |
| Triage open PRs | software_engineer |
| Check CI status | release_engineer |
| File | Purpose |
|---|---|
| scripts/run_agent.py | CLI to run agents with tasks |
| agents/benchmark.py | ComparativeEvaluator for scoring |
| agents/autogen/*.py | AutoGen agents |
| agents/crewai/*.py | CrewAI agents |
| agents/openhands/*.py | OpenHands agents |