agent-evaluation

Updated: February 13, 2026, 06:08

Instruct OpenCode to run VibeTeam agent benchmarks using agents/benchmark.py and report results in table format

Source: VibeTechnologies/VibeTeam

Agent Evaluation Skill

OpenCode evaluates VibeTeam agents using agents/benchmark.py and reports results in table format.

CRITICAL: Pre-Flight Deployment Verification

Before running ANY evaluation, you MUST verify that the deployed code matches the repo code. Git-sync updates files on disk, but Python processes don't hot-reload, so pods must be restarted to pick up new code.

Step 1: Check deployed commit matches repo HEAD

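# Load .env and export its variables to child processes
set -a && source .env && set +a
# Get the commit SHA running in the cluster. The glob must be expanded by a shell
# inside the container, because kubectl exec does not run the command through one.
DEPLOYED_SHA=$(kubectl exec deployment/vibeteam-gateway -n vibeteam -c gateway -- sh -c 'cat /code/.git/worktrees/*/HEAD' 2>/dev/null)
# Refresh origin/master so the comparison uses the current remote HEAD
git fetch --quiet origin master
LOCAL_SHA=$(git rev-parse origin/master)
echo "Deployed: $DEPLOYED_SHA"
echo "Repo:     $LOCAL_SHA"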

If they differ, wait for git-sync (30s) and rerun the check, or restart the pods manually (see Step 2).
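
The comparison in Step 1 can also be scripted. The following is a minimal sketch and not part of the repository: `shas_match` is a hypothetical helper, and it assumes both values are plain hexadecimal commit SHAs, possibly abbreviated or carrying trailing whitespace. If the worktree `HEAD` file holds a symbolic ref (`ref: refs/heads/...`) instead of a SHA, the helper reports a mismatch rather than guessing.

```python
import re

# 7 to 40 hex digits covers abbreviated and full SHA-1 commit ids.
_HEX_SHA = re.compile(r"^[0-9a-f]{7,40}$")


def shas_match(deployed: str, local: str) -> bool:
    """Return True when two commit SHAs refer to the same commit.

    Hypothetical helper: tolerates surrounding whitespace, upper-case hex,
    and an abbreviated SHA on either side (prefix match).
    """
    a, b = deployed.strip().lower(), local.strip().lower()
    # An empty or non-hex value (e.g. a failed kubectl exec) never matches.
    if not (_HEX_SHA.match(a) and _HEX_SHA.match(b)):
        return False
    return a.startswith(b) or b.startswith(a)


if __name__ == "__main__":
    full = "3f2a9c1d5e6b7a8c9d0e1f2a3b4c5d6e7f8091a2"
    print(shas_match(full + "\n", full))  # True
    print(shas_match(full, "3f2a9c1"))    # True (abbreviated SHA)
    print(shas_match("", full))           # False (nothing was read from the pod)
```

Treating an empty deployed value as a mismatch matters here because the shell command discards errors with `2>/dev/null`, so a failed `kubectl exec` would otherwise look like an empty but valid answer.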

Step 2: Verify the Python process loaded current code

# Check when the pod last started (Python loads modules at startup)
kubectl get pods -n vibeteam -l app=vibeteam-gateway -o jsonpath='{.items[*].status.containerStatuses[0].state.running.startedAt}'
kubectl get pods -n vibeteam -l app=openhands-svc -o jsonpath='{.items[*].status.containerStatuses[0].state.running.startedAt}'

If pods started BEFORE the latest commit, the Python process is running stale code. Restart:

# Resume first: a deployment paused in Step 4 of an earlier run cannot be restarted.
# The resume fails harmlessly when the deployment is not paused.
kubectl rollout resume deployment/vibeteam-gateway -n vibeteam 2>/dev/null || true
kubectl rollout restart deployment/vibeteam-gateway -n vibeteam
kubectl rollout resume deployment/openhands-svc -n vibeteam 2>/dev/null || true
kubectl rollout restart deployment/openhands-svc -n vibeteam
# Wait until both rollouts finish (up to 120s each)
kubectl rollout status deployment/vibeteam-gateway -n vibeteam --timeout=120s
kubectl rollout status deployment/openhands-svc -n vibeteam --timeout=120s
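
The "started BEFORE the latest commit" test in Step 2 can be expressed as a timestamp comparison. This is a sketch under stated assumptions: `parse_k8s_time` and `stale_pods` are hypothetical names, the `startedAt` values are assumed to be in the second-resolution UTC form Kubernetes prints (for example `2026-02-13T06:08:00Z`), and the commit time is supplied by the caller (it could come from something like `git log -1 --format=%cI origin/master`). The commit time is only a lower bound, since git-sync pulls the commit some time after it is created.

```python
from datetime import datetime, timezone


def parse_k8s_time(value: str) -> datetime:
    """Parse a Kubernetes timestamp such as 2026-02-13T06:08:00Z as UTC."""
    return datetime.strptime(value, "%Y-%m-%dT%H:%M:%SZ").replace(tzinfo=timezone.utc)


def stale_pods(started_at: list[str], commit_time: datetime) -> list[str]:
    """Return the start times of pods that started before the commit was made."""
    return [ts for ts in started_at if parse_k8s_time(ts) < commit_time]


if __name__ == "__main__":
    # The jsonpath expression prints one value per pod, separated by spaces.
    raw = "2026-02-13T05:00:00Z 2026-02-13T07:30:00Z"
    commit = datetime(2026, 2, 13, 6, 8, tzinfo=timezone.utc)
    print(stale_pods(raw.split(), commit))  # ['2026-02-13T05:00:00Z']
```

Any non-empty result means at least one pod predates the commit and the restart sequence above applies.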

Step 3: Verify specific code changes

If you changed specific code (e.g., removed "Thinking..." messages), verify that the deployed files reflect the change:

# Example: check that "Thinking..." is not in the deployed gateway code
kubectl exec deployment/vibeteam-gateway -n vibeteam -c gateway -- grep -rn "Thinking" /code/current/vibeteam/ 2>/dev/null
# Should return nothing

Step 4: Pause rollouts before eval

kubectl rollout pause deployment/vibeteam-gateway -n vibeteam
kubectl rollout pause deployment/openhands-svc -n vibeteam

Only proceed with evaluation after all checks pass.

Workflow

1. Run the pre-flight checks (see above) to verify that the deployed code matches the repo
2. Run each agent (AutoGen, CrewAI, OpenHands) with the given task
3. Call ComparativeEvaluator.evaluate() from agents/benchmark.py
4. Present results in the table format below

Required Output Format

Evaluation Results

Framework	Input	Output	Score	Feedback	Recommendations
AutoGen	{task}	{first 100 chars}...	4/5	{judge feedback}	{specific improvements}
CrewAI	{task}	{first 100 chars}...	3/5	{judge feedback}	{specific improvements}
OpenHands	{task}	{first 100 chars}...	5/5	{judge feedback}	{specific improvements}

Summary

Metric	Value
Winner	{framework name}
Reasoning	{why winner was chosen}
Judge Model	{model used, e.g. gpt-5-2}
Eval Time	{milliseconds}
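
To make the Evaluation Results layout concrete, here is a small rendering sketch. It is illustrative only: the `Result` dataclass and `render_results` are hypothetical and do not reflect the return type of `ComparativeEvaluator.evaluate()`, and the sketch assumes the table is emitted as a Markdown pipe table. It appends `...` only when the output was actually truncated to 100 characters, which is one reading of the template.

```python
from dataclasses import dataclass


@dataclass
class Result:
    """Hypothetical container for one framework's evaluation row."""
    framework: str
    task: str
    output: str
    score: int  # 0-5, per the scoring scale
    feedback: str
    recommendations: str


def _cell(text: str) -> str:
    """Keep a value on one table row: collapse whitespace and escape pipes."""
    return " ".join(text.split()).replace("|", "\\|")


def render_results(results: list[Result]) -> str:
    """Render rows in the column order of the required output format."""
    header = "| Framework | Input | Output | Score | Feedback | Recommendations |"
    divider = "|---|---|---|---|---|---|"
    rows = []
    for r in results:
        # Show only the first 100 characters of the agent output.
        preview = r.output[:100] + ("..." if len(r.output) > 100 else "")
        rows.append(
            f"| {r.framework} | {_cell(r.task)} | {_cell(preview)} "
            f"| {r.score}/5 | {_cell(r.feedback)} | {_cell(r.recommendations)} |"
        )
    return "\n".join([header, divider, *rows])


if __name__ == "__main__":
    demo = Result("AutoGen", "github-issue-triage", "x" * 150, 4,
                  "placeholder feedback", "placeholder recommendation")
    print(render_results([demo]))
```

Escaping pipes and collapsing newlines keeps multi-line judge feedback from breaking the row structure.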

Scoring Scale

Score	Meaning
0	Failed/error/refusal
1	Mostly wrong
2	Partial, missing key elements
3	Acceptable
4	Good, comprehensive
5	Excellent
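
The scale above maps directly onto a lookup table. A minimal sketch, assuming scores are whole numbers from 0 to 5; `describe_score` is a hypothetical helper, not a function in `agents/benchmark.py`:

```python
SCORE_MEANINGS = {
    0: "Failed/error/refusal",
    1: "Mostly wrong",
    2: "Partial, missing key elements",
    3: "Acceptable",
    4: "Good, comprehensive",
    5: "Excellent",
}


def describe_score(score: int) -> str:
    """Format a judge score as 'N/5 (meaning)', rejecting out-of-range values."""
    # bool is a subclass of int, so exclude it explicitly.
    if isinstance(score, bool) or not isinstance(score, int) or score not in SCORE_MEANINGS:
        raise ValueError(f"score must be an integer from 0 to 5, got {score!r}")
    return f"{score}/5 ({SCORE_MEANINGS[score]})"


if __name__ == "__main__":
    print(describe_score(4))  # 4/5 (Good, comprehensive)
```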

CLI Alternative

cd ~/workspace/vibebrowser/VibeTeam
set -a && source .env && set +a
python -m agents.benchmark --tasks github-issue-triage --frameworks autogen crewai openhands

Predefined tasks: sentry-weekly-summary, github-issue-triage, release-notes
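
To run every predefined task, the command line can be generated per task. This sketch uses only the `--tasks` and `--frameworks` flags shown above. Whether `--tasks` accepts several names in one call is not stated here, so the sketch builds one command per task. `benchmark_command` is a hypothetical helper, and rejecting names outside the predefined list is this sketch's choice rather than documented CLI behavior.

```python
import shlex

PREDEFINED_TASKS = ("sentry-weekly-summary", "github-issue-triage", "release-notes")
FRAMEWORKS = ("autogen", "crewai", "openhands")


def benchmark_command(task: str, frameworks: tuple[str, ...] = FRAMEWORKS) -> list[str]:
    """Build the argv list for one benchmark run of a predefined task."""
    if task not in PREDEFINED_TASKS:
        raise ValueError(f"unknown task {task!r}; expected one of {PREDEFINED_TASKS}")
    return ["python", "-m", "agents.benchmark", "--tasks", task,
            "--frameworks", *frameworks]


if __name__ == "__main__":
    # Print the commands; pass each list to subprocess.run(...) to execute it.
    for task in PREDEFINED_TASKS:
        print(shlex.join(benchmark_command(task)))
```

Building an argv list instead of a single string avoids shell quoting problems if a task name ever contains spaces.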

Key Files

File	Purpose
`agents/benchmark.py`	`ComparativeEvaluator`, `QualityEvaluator`
`agents/benchmark.py:286`	`QualityEvaluator` class
`agents/benchmark.py:459`	`ComparativeEvaluator` class

Environment

Required before running:

set -a && source .env && set +a

Variables: AZURE_OPENAI_API_KEY, AZURE_OPENAI_ENDPOINT, AZURE_OPENAI_DEPLOYMENT
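
A quick check that the three variables are actually visible to the Python process can catch a `.env` file that was sourced without exporting. A minimal sketch; `missing_vars` is a hypothetical helper, not part of the repository:

```python
import os
from typing import Mapping, Optional

REQUIRED_VARS = (
    "AZURE_OPENAI_API_KEY",
    "AZURE_OPENAI_ENDPOINT",
    "AZURE_OPENAI_DEPLOYMENT",
)


def missing_vars(env: Optional[Mapping[str, str]] = None) -> list[str]:
    """Return the required variables that are unset or blank in env."""
    env = os.environ if env is None else env
    return [name for name in REQUIRED_VARS if not env.get(name, "").strip()]


if __name__ == "__main__":
    missing = missing_vars()
    print("Missing:", ", ".join(missing) if missing else "none")
```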

name	agent-evaluation
description	Instruct OpenCode to run VibeTeam agent benchmarks using agents/benchmark.py and report results in table format
license	MIT
compatibility	opencode
metadata	{"audience":"developers","workflow":"evaluation"}
