agent-evaluation-direct

Stars: 1

Forks: 0

Updated: January 28, 2026 at 21:53

Evaluate VibeTeam agents by running tasks directly via scripts/run_agent.py and comparing responses

Source

VibeTechnologies

VibeTechnologies/VibeTeam

SKILL.md

name	agent-evaluation-direct
description	Evaluate VibeTeam agents by running tasks directly via scripts/run_agent.py and comparing responses
license	MIT
compatibility	opencode
metadata	{"audience":"developers","workflow":"evaluation"}

Agent Evaluation (Direct) Skill

Evaluate agents by running tasks directly and comparing responses. OpenCode submits tasks to each agent using scripts/run_agent.py, then evaluates results with agents/benchmark.py.

Workflow

1. Define task - OpenCode determines the evaluation task
2. Run agents - Execute scripts/run_agent.py for each framework
3. Collect responses - Capture output from each agent
4. Evaluate - Use ComparativeEvaluator to score responses
5. Report - Present results in table format

CLI Commands

Run single agent:

python scripts/run_agent.py autogen "List 3 GitHub issues"
python scripts/run_agent.py crewai "List 3 GitHub issues"
python scripts/run_agent.py openhands "List 3 GitHub issues"

Run all agents:

python scripts/run_agent.py all "List 3 GitHub issues"

JSON output (for parsing):

python scripts/run_agent.py autogen "List 3 GitHub issues" --json

Options:

--role - Agent role: software_engineer, support_engineer, release_engineer
--json - Output as JSON
--timeout - Timeout in seconds (default: 180)
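
When the CLI is driven from another Python script, the invocations above can be assembled programmatically. The following is a minimal sketch: `build_command` is a hypothetical helper, not part of VibeTeam, and it assumes that `--role` and `--timeout` take their value as the next argument, which the option list does not spell out.

```python
import sys
from typing import List, Optional

# Values taken from the CLI examples and the --role option description.
FRAMEWORKS = ("autogen", "crewai", "openhands", "all")
ROLES = ("software_engineer", "support_engineer", "release_engineer")


def build_command(
    framework: str,
    task: str,
    role: Optional[str] = None,
    as_json: bool = False,
    timeout: Optional[int] = None,
) -> List[str]:
    """Return an argv list for scripts/run_agent.py (hypothetical helper)."""
    if framework not in FRAMEWORKS:
        raise ValueError(f"unknown framework: {framework!r}")
    if role is not None and role not in ROLES:
        raise ValueError(f"unknown role: {role!r}")

    cmd = [sys.executable, "scripts/run_agent.py", framework, task]
    if role is not None:
        # ASSUMPTION: the flag and its value are separate arguments.
        cmd += ["--role", role]
    if as_json:
        cmd.append("--json")
    if timeout is not None:
        cmd += ["--timeout", str(timeout)]
    return cmd
```

Passing the task as a single list element avoids shell quoting problems when the task text contains spaces or quotes.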

Required Output Format

Agent Responses

For each agent run, capture:

Field	Description
Framework	autogen, crewai, openhands
Task	The input task
Response	Agent's output
Latency	Time in ms
Success	true/false
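
One way to hold these fields is a small record type. The sketch below is illustrative only: the `--json` output schema is not documented here, so the key names (`framework`, `task`, `response`, `latency_ms`, `success`) are assumptions and may need adjusting to what `scripts/run_agent.py` actually emits.

```python
import json
from dataclasses import dataclass


@dataclass
class AgentResponse:
    framework: str     # autogen, crewai, or openhands
    task: str          # the input task
    response: str      # the agent's output
    latency_ms: float  # time in ms
    success: bool      # true/false


def parse_agent_response(raw: str) -> AgentResponse:
    """Build an AgentResponse from one agent's JSON output.

    ASSUMPTION: the key names below are guesses, not a documented schema.
    """
    data = json.loads(raw)
    return AgentResponse(
        framework=data["framework"],
        task=data["task"],
        response=data.get("response", ""),
        latency_ms=float(data.get("latency_ms", 0.0)),
        success=bool(data.get("success", False)),
    )
```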

Evaluation Results

Framework	Input	Output	Score	Feedback	Recommendations
AutoGen	{task}	{truncated}...	4/5	{feedback}	{improvements}
CrewAI	{task}	{truncated}...	3/5	{feedback}	{improvements}
OpenHands	{task}	{truncated}...	5/5	{feedback}	{improvements}
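
A table in this shape could be rendered with a few lines of Python. This is a sketch under assumptions: the row keys, the 40-character truncation limit, and the tab-separated layout are illustrative choices, not something the skill prescribes.

```python
from typing import Dict, List

COLUMNS = ["Framework", "Input", "Output", "Score", "Feedback", "Recommendations"]


def truncate(text: str, limit: int = 40) -> str:
    """Collapse whitespace and cut long output, marking the cut with '...'."""
    text = " ".join(text.split())
    return text if len(text) <= limit else text[:limit] + "..."


def format_results(rows: List[Dict[str, object]]) -> str:
    """Render evaluation rows as a tab-separated table with a header line."""
    lines = ["\t".join(COLUMNS)]
    for row in rows:
        lines.append("\t".join([
            str(row["framework"]),
            str(row["task"]),
            truncate(str(row["output"])),
            f"{row['score']}/5",
            str(row["feedback"]),
            str(row["recommendations"]),
        ]))
    return "\n".join(lines)
```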

Summary

Metric	Value
Winner	{framework}
Reasoning	{why}
Judge Model	{model}
Eval Time	{ms}

Evaluation Steps

Step 1: Run Each Agent

# Run and capture output
python scripts/run_agent.py autogen "YOUR_TASK" --json > /tmp/autogen.json
python scripts/run_agent.py crewai "YOUR_TASK" --json > /tmp/crewai.json
python scripts/run_agent.py openhands "YOUR_TASK" --json > /tmp/openhands.json

Or run all at once:

python scripts/run_agent.py all "YOUR_TASK" --json
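
The same step can be scripted so that latency and success are recorded alongside the output. The following is a minimal sketch, not the project's own tooling: `agent_command` and `run_and_time` are hypothetical helpers, and success is assumed to mean a zero exit code before the timeout expires.

```python
import subprocess
import sys
import time
from typing import Dict, List


def agent_command(framework: str, task: str) -> List[str]:
    """argv for one agent run with JSON output, mirroring the commands above."""
    return [sys.executable, "scripts/run_agent.py", framework, task, "--json"]


def run_and_time(cmd: List[str], timeout_s: float = 180.0) -> Dict[str, object]:
    """Run a command, capture stdout, and measure wall-clock latency in ms."""
    start = time.perf_counter()
    try:
        proc = subprocess.run(
            cmd, capture_output=True, text=True, timeout=timeout_s
        )
        stdout = proc.stdout
        # ASSUMPTION: a zero exit code is treated as success.
        success = proc.returncode == 0
    except subprocess.TimeoutExpired:
        # subprocess.run kills the child process when the timeout expires.
        stdout, success = "", False
    latency_ms = (time.perf_counter() - start) * 1000.0
    return {"stdout": stdout, "success": success, "latency_ms": latency_ms}
```

The default of 180 seconds mirrors the CLI's `--timeout` default; this outer timeout only guards against a run that hangs past it.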

Step 2: Evaluate Responses

Use ComparativeEvaluator from agents/benchmark.py:

1. Extract the response field from each agent's JSON output
2. Call evaluator.evaluate(task, responses)
3. Format the results into a table
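
These steps might be wired together roughly as follows. Only the call `evaluator.evaluate(task, responses)` comes from this document; how `ComparativeEvaluator` is constructed, the shape of `responses` (assumed here to be a dict of framework name to response text), and the return value are all assumptions. `PlaceholderEvaluator` is a stand-in so the sketch runs on its own, and it is not the real evaluator.

```python
import json
from typing import Dict, Protocol


class Evaluator(Protocol):
    """Anything exposing evaluate(task, responses), as named in this skill."""

    def evaluate(self, task: str, responses: Dict[str, str]): ...


def extract_responses(raw_outputs: Dict[str, str]) -> Dict[str, str]:
    """Map framework name -> text of the 'response' field of its JSON output."""
    responses: Dict[str, str] = {}
    for framework, raw in raw_outputs.items():
        try:
            data = json.loads(raw)
        except json.JSONDecodeError:
            data = {}  # unparseable output is treated as an empty response
        text = data.get("response", "") if isinstance(data, dict) else ""
        responses[framework] = str(text)
    return responses


class PlaceholderEvaluator:
    """Hypothetical stand-in: scores 3 for any non-empty response, else 0."""

    def evaluate(self, task: str, responses: Dict[str, str]) -> Dict[str, int]:
        return {name: (3 if text else 0) for name, text in responses.items()}


def evaluate_outputs(evaluator: Evaluator, task: str, raw_outputs: Dict[str, str]):
    """Extract the response fields, then hand them to the evaluator."""
    return evaluator.evaluate(task, extract_responses(raw_outputs))
```

In real use, the placeholder would be replaced by a `ComparativeEvaluator` instance from `agents/benchmark.py`, after checking its actual constructor and return type.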

Scoring Scale

Score	Meaning
0	Failed/error
1	Mostly wrong
2	Partial
3	Acceptable
4	Good
5	Excellent
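
The scale can be encoded as a lookup table for labelling scores and choosing the winner reported in the Summary. This is a sketch under a stated assumption: the skill does not say how ties are resolved, so `pick_winner` (a hypothetical helper) breaks ties alphabetically.

```python
from typing import Dict

SCORE_MEANINGS = {
    0: "Failed/error",
    1: "Mostly wrong",
    2: "Partial",
    3: "Acceptable",
    4: "Good",
    5: "Excellent",
}


def describe_score(score: int) -> str:
    """Return a label such as '4/5 (Good)'; reject scores outside 0-5."""
    if score not in SCORE_MEANINGS:
        raise ValueError(f"score must be an integer from 0 to 5, got {score!r}")
    return f"{score}/5 ({SCORE_MEANINGS[score]})"


def pick_winner(scores: Dict[str, int]) -> str:
    """Return the framework with the highest score.

    ASSUMPTION: ties are broken alphabetically by framework name.
    """
    for score in scores.values():
        describe_score(score)  # validates the range
    return min(scores, key=lambda name: (-scores[name], name))
```

With the placeholder scores from the Evaluation Results table (AutoGen 4, CrewAI 3, OpenHands 5), `pick_winner` returns `"OpenHands"`.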

Example Tasks

Task	Role
List 3 recent GitHub issues	software_engineer
Summarize Sentry errors this week	support_engineer
Generate release notes for v1.2.0	release_engineer
Triage open PRs	software_engineer
Check CI status	release_engineer

Key Files

File	Purpose
`scripts/run_agent.py`	CLI to run agents with tasks
`agents/benchmark.py`	`ComparativeEvaluator` for scoring
`agents/autogen/*.py`	AutoGen agents
`agents/crewai/*.py`	CrewAI agents
`agents/openhands/*.py`	OpenHands agents
