
agent-evaluation-direct

Stars: 1

Forks: 0

Updated: January 28, 2026, 21:53

Evaluate VibeTeam agents by running tasks directly via scripts/run_agent.py and comparing responses


Source: VibeTechnologies/VibeTeam


Agent Evaluation (Direct) Skill

Evaluate agents by running tasks directly and comparing responses. OpenCode submits tasks to each agent using scripts/run_agent.py, then evaluates results with agents/benchmark.py.

Workflow

1. Define task - OpenCode determines the evaluation task
2. Run agents - Execute scripts/run_agent.py for each framework
3. Collect responses - Capture output from each agent
4. Evaluate - Use ComparativeEvaluator to score responses
5. Report - Present results in table format

CLI Commands

Run single agent:

python scripts/run_agent.py autogen "List 3 GitHub issues"
python scripts/run_agent.py crewai "List 3 GitHub issues"
python scripts/run_agent.py openhands "List 3 GitHub issues"

Run all agents:

python scripts/run_agent.py all "List 3 GitHub issues"

JSON output (for parsing):

python scripts/run_agent.py autogen "List 3 GitHub issues" --json

Options:

--role - Agent role: software_engineer, support_engineer, release_engineer
--json - Output as JSON
--timeout - Timeout in seconds (default: 180)
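
When the CLI is driven from another script, the options above can be assembled into an argument list. The following is a minimal sketch, not part of the repository: the helper name `build_command` is hypothetical, and it assumes `--role` and `--timeout` take their value as the next argument (for example `--role software_engineer`), which the option list does not spell out.

```python
from typing import List, Optional

RUNNER = "scripts/run_agent.py"  # path taken from the commands above


def build_command(
    framework: str,
    task: str,
    role: Optional[str] = None,
    as_json: bool = False,
    timeout: Optional[int] = None,
) -> List[str]:
    """Build the argv list for one run_agent.py invocation (hypothetical helper)."""
    argv = ["python", RUNNER, framework, task]
    if role is not None:
        # Assumption: the flag and its value are passed as separate arguments.
        argv += ["--role", role]
    if as_json:
        argv.append("--json")
    if timeout is not None:
        # Assumption: same flag/value form as --role; the value is in seconds.
        argv += ["--timeout", str(timeout)]
    return argv
```

Returning a list rather than a single string avoids shell quoting problems when the task text contains spaces or quotes.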

Required Output Format

Agent Responses

For each agent run, capture:

| Field | Description |
| --- | --- |
| Framework | autogen, crewai, or openhands |
| Task | The input task |
| Response | Agent's output |
| Latency | Time in ms |
| Success | true/false |
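
One way to hold these fields in code is a small record type. This is a sketch under stated assumptions: the class name `AgentRun`, the function `parse_agent_run`, and the JSON key names (`framework`, `task`, `response`, `latency_ms`, `success`) are illustrative only. The actual keys emitted by `--json` are not documented here and should be checked against real output.

```python
import json
from dataclasses import dataclass
from typing import Any, Dict


@dataclass
class AgentRun:
    """One captured agent run, mirroring the fields in the table above."""
    framework: str
    task: str
    response: str
    latency_ms: float
    success: bool


def parse_agent_run(raw: str) -> AgentRun:
    """Parse one JSON document into an AgentRun.

    Assumption: the key names below are hypothetical; adjust them to
    whatever scripts/run_agent.py actually prints with --json.
    """
    data: Dict[str, Any] = json.loads(raw)
    return AgentRun(
        framework=str(data["framework"]),
        task=str(data["task"]),
        response=str(data.get("response", "")),       # missing output becomes ""
        latency_ms=float(data.get("latency_ms", 0.0)),
        success=bool(data.get("success", False)),     # missing flag counts as failure
    )
```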

Evaluation Results

| Framework | Input | Output | Score | Feedback | Recommendations |
| --- | --- | --- | --- | --- | --- |
| AutoGen | {task} | {truncated}... | 4/5 | {feedback} | {improvements} |
| CrewAI | {task} | {truncated}... | 3/5 | {feedback} | {improvements} |
| OpenHands | {task} | {truncated}... | 5/5 | {feedback} | {improvements} |
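
The results table can be produced with a few lines of string formatting. The sketch below renders rows as a Markdown table and truncates long outputs. The function names, the row dictionary keys, and the 40-character truncation limit are assumptions made for illustration; the source only specifies the column names and the `{truncated}...` convention.

```python
from typing import Dict, List

COLUMNS = ["Framework", "Input", "Output", "Score", "Feedback", "Recommendations"]


def cell(text: object) -> str:
    """Make a value safe for a single table cell."""
    flat = " ".join(str(text).split())   # collapse newlines and runs of spaces
    return flat.replace("|", "\\|")      # keep literal pipes from splitting the cell


def truncate(text: str, limit: int = 40) -> str:
    """Shorten long agent output and mark the cut with '...'."""
    flat = " ".join(text.split())
    return flat if len(flat) <= limit else flat[:limit] + "..."


def format_results(rows: List[Dict[str, object]]) -> str:
    """Render evaluation rows as a Markdown table (row keys are hypothetical)."""
    lines = [
        "| " + " | ".join(COLUMNS) + " |",
        "|" + "|".join([" --- "] * len(COLUMNS)) + "|",
    ]
    for row in rows:
        cells = [
            cell(row["framework"]),
            cell(row["task"]),
            cell(truncate(str(row["output"]))),
            f"{row['score']}/5",
            cell(row["feedback"]),
            cell(row["recommendations"]),
        ]
        lines.append("| " + " | ".join(cells) + " |")
    return "\n".join(lines)
```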

Summary

| Metric | Value |
| --- | --- |
| Winner | {framework} |
| Reasoning | {why} |
| Judge Model | {model} |
| Eval Time | {ms} |

Evaluation Steps

Step 1: Run Each Agent

# Run and capture output
python scripts/run_agent.py autogen "YOUR_TASK" --json > /tmp/autogen.json
python scripts/run_agent.py crewai "YOUR_TASK" --json > /tmp/crewai.json
python scripts/run_agent.py openhands "YOUR_TASK" --json > /tmp/openhands.json

Or run all at once:

python scripts/run_agent.py all "YOUR_TASK" --json
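
The same step can be scripted when latency and success need to be recorded alongside the output. The following sketch uses only the standard library. The helper names are hypothetical, and treating exit code 0 as success is an assumption, since the script may instead report success inside its JSON. The 180-second default mirrors the CLI's documented `--timeout` default.

```python
import subprocess
import sys
import time
from typing import Dict, List


def python_argv(args: List[str]) -> List[str]:
    """Prefix args with the current interpreter instead of a bare `python`."""
    return [sys.executable, *args]


def run_and_capture(argv: List[str], timeout: float = 180.0) -> Dict[str, object]:
    """Run one command; return its stdout, elapsed milliseconds, and a success flag."""
    start = time.perf_counter()
    try:
        proc = subprocess.run(argv, capture_output=True, text=True, timeout=timeout)
        # Assumption: a zero exit code means the agent run succeeded.
        stdout, success = proc.stdout, proc.returncode == 0
    except subprocess.TimeoutExpired:
        # subprocess.run kills the child process once the timeout elapses.
        stdout, success = "", False
    latency_ms = (time.perf_counter() - start) * 1000.0
    return {"stdout": stdout, "latency_ms": latency_ms, "success": success}
```

A call such as `run_and_capture(python_argv(["scripts/run_agent.py", "autogen", "YOUR_TASK", "--json"]))` would then stand in for the first shell command above.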

Step 2: Evaluate Responses

Use ComparativeEvaluator from agents/benchmark.py:

1. Extract the response field from each agent's output
2. Call evaluator.evaluate(task, responses)
3. Format the results into a table
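
The first two steps can be sketched as follows. Only the call `evaluator.evaluate(task, responses)` comes from this document. The shape of `responses` (here a dict from framework name to response text), the run dictionary keys, and the `StubEvaluator` stand-in are assumptions, so that the example runs without the repository. In practice the evaluator would be a `ComparativeEvaluator` from `agents/benchmark.py`, whose constructor arguments and return type are not documented here.

```python
from typing import Any, Dict, List


class StubEvaluator:
    """Stand-in with the same evaluate(task, responses) call shape (hypothetical)."""

    def evaluate(self, task: str, responses: Dict[str, str]) -> Dict[str, Any]:
        # Echo what was received; a real evaluator would score each response.
        return {"task": task, "frameworks": sorted(responses)}


def evaluate_runs(evaluator: Any, task: str, runs: List[Dict[str, Any]]) -> Any:
    """Extract each run's response and hand them all to the evaluator.

    Assumption: `responses` maps framework name to response text.
    Failed runs are kept (with an empty response) so they can still be scored.
    """
    responses = {run["framework"]: run.get("response", "") for run in runs}
    return evaluator.evaluate(task, responses)
```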

Scoring Scale

| Score | Meaning |
| --- | --- |
| 0 | Failed/error |
| 1 | Mostly wrong |
| 2 | Partial |
| 3 | Acceptable |
| 4 | Good |
| 5 | Excellent |
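
For reporting, the scale can be kept as a lookup table so that out-of-range scores are caught early. A minimal sketch follows; the names `SCORE_MEANINGS` and `describe_score` are hypothetical, and the labels are copied from the table above.

```python
SCORE_MEANINGS = {
    0: "Failed/error",
    1: "Mostly wrong",
    2: "Partial",
    3: "Acceptable",
    4: "Good",
    5: "Excellent",
}


def describe_score(score: int) -> str:
    """Return a label such as '4/5 (Good)', rejecting values off the 0-5 scale."""
    # bool is a subclass of int, so exclude it explicitly.
    if isinstance(score, bool) or not isinstance(score, int):
        raise TypeError("score must be an integer")
    if score not in SCORE_MEANINGS:
        raise ValueError(f"score must be between 0 and 5, got {score}")
    return f"{score}/5 ({SCORE_MEANINGS[score]})"
```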

Example Tasks

| Task | Role |
| --- | --- |
| List 3 recent GitHub issues | software_engineer |
| Summarize Sentry errors this week | support_engineer |
| Generate release notes for v1.2.0 | release_engineer |
| Triage open PRs | software_engineer |
| Check CI status | release_engineer |
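
These task and role pairs can be expanded into one command per framework to build a full evaluation matrix. The sketch below only constructs argument lists and does not run anything. The function name is hypothetical, and it assumes `--role` takes its value as the next argument.

```python
from itertools import product
from typing import Iterator, List

FRAMEWORKS = ["autogen", "crewai", "openhands"]

# Task and role pairs copied from the table above.
EXAMPLE_TASKS = [
    ("List 3 recent GitHub issues", "software_engineer"),
    ("Summarize Sentry errors this week", "support_engineer"),
    ("Generate release notes for v1.2.0", "release_engineer"),
    ("Triage open PRs", "software_engineer"),
    ("Check CI status", "release_engineer"),
]


def evaluation_matrix() -> Iterator[List[str]]:
    """Yield one argv list per (task, framework) pair, task-major order."""
    for (task, role), framework in product(EXAMPLE_TASKS, FRAMEWORKS):
        # Assumption: --role and its value are separate arguments.
        yield ["python", "scripts/run_agent.py", framework, task, "--role", role, "--json"]
```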

Key Files

| File | Purpose |
| --- | --- |
| `scripts/run_agent.py` | CLI to run agents with tasks |
| `agents/benchmark.py` | `ComparativeEvaluator` for scoring |
| `agents/autogen/*.py` | AutoGen agents |
| `agents/crewai/*.py` | CrewAI agents |
| `agents/openhands/*.py` | OpenHands agents |


name	agent-evaluation-direct
description	Evaluate VibeTeam agents by running tasks directly via scripts/run_agent.py and comparing responses
license	MIT
compatibility	opencode
metadata	{"audience":"developers","workflow":"evaluation"}
