agent-evaluation
Instruct OpenCode to run VibeTeam agent benchmarks using agents/benchmark.py and report results in table format
| Field | Value |
|---|---|
| name | agent-evaluation |
| description | Instruct OpenCode to run VibeTeam agent benchmarks using agents/benchmark.py and report results in table format |
| license | MIT |
| compatibility | opencode |
| metadata | {"audience":"developers","workflow":"evaluation"} |
OpenCode evaluates VibeTeam agents using agents/benchmark.py and reports results in table format.
Before running ANY evaluation, you MUST verify that the deployed code matches the repo code. Git-sync updates files on disk, but Python processes don't hot-reload — pods must be restarted.
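# Load settings from .env, then get the commit SHA running in the cluster
set -a && source .env && set +a
# kubectl exec does not start a shell, so the glob must be expanded by sh inside the container
DEPLOYED_SHA=$(kubectl exec deployment/vibeteam-gateway -n vibeteam -c gateway -- sh -c 'cat /code/.git/worktrees/*/HEAD' 2>/dev/null)
# origin/master is only as fresh as the last fetch; run git fetch first if unsure
LOCAL_SHA=$(git rev-parse origin/master)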
echo "Deployed: $DEPLOYED_SHA"
echo "Repo: $LOCAL_SHA"
If they differ, wait for git-sync (about 30s) and compare again, or restart the pods manually.
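The comparison above can also be done programmatically. The following is a minimal sketch, not part of the repository: `normalize_sha` and `shas_match` are hypothetical helper names, and the sketch assumes the deployed `HEAD` file holds a raw commit SHA (a detached worktree) rather than a `ref:` line.

```python
def normalize_sha(raw: str) -> str:
    """Return the first non-empty line of `raw`, stripped and lower-cased."""
    for line in raw.splitlines():
        line = line.strip()
        if line:
            return line.lower()
    return ""


def shas_match(deployed: str, local: str) -> bool:
    """True when both values name the same commit.

    Empty input (for example, a failed `kubectl exec`) never matches, and an
    abbreviated SHA matches only if it is a prefix of at least 7 characters.
    """
    a, b = normalize_sha(deployed), normalize_sha(local)
    if not a or not b:
        return False
    shorter, longer = sorted((a, b), key=len)
    return len(shorter) >= 7 and longer.startswith(shorter)
```

Treating an empty value as a mismatch matters here because `DEPLOYED_SHA` is captured with `2>/dev/null`, so a failed lookup yields an empty string rather than an error.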
# Check when the pod last started (Python loads modules at startup)
kubectl get pods -n vibeteam -l app=vibeteam-gateway -o jsonpath='{.items[*].status.containerStatuses[0].state.running.startedAt}'
kubectl get pods -n vibeteam -l app=openhands-svc -o jsonpath='{.items[*].status.containerStatuses[0].state.running.startedAt}'
If pods started BEFORE the latest commit, the Python process is running stale code. Restart:
kubectl rollout resume deployment/vibeteam-gateway -n vibeteam 2>/dev/null || true
kubectl rollout restart deployment/vibeteam-gateway -n vibeteam
kubectl rollout resume deployment/openhands-svc -n vibeteam 2>/dev/null || true
kubectl rollout restart deployment/openhands-svc -n vibeteam
kubectl rollout status deployment/vibeteam-gateway -n vibeteam --timeout=120s
kubectl rollout status deployment/openhands-svc -n vibeteam --timeout=120s
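The stale-code rule above (a pod that started before the latest commit is running old modules) can be sketched as a small check. This is an illustration under assumptions: `parse_k8s_time` and `stale_pods` are hypothetical names, the start times are the space-separated output of the `jsonpath` queries, and the commit time is taken from something like `git log -1 --format=%cI origin/master`.

```python
from datetime import datetime, timezone


def parse_k8s_time(value: str) -> datetime:
    """Parse a timestamp such as '2024-05-01T12:00:00Z' into an aware datetime."""
    # Older Python versions reject a trailing 'Z', so spell the offset out.
    parsed = datetime.fromisoformat(value.strip().replace("Z", "+00:00"))
    if parsed.tzinfo is None:
        parsed = parsed.replace(tzinfo=timezone.utc)  # assume UTC if unlabeled
    return parsed


def stale_pods(started_at: str, commit_time: str) -> list[str]:
    """Return the start times (as given) of pods that started before the commit.

    `started_at` is the space-separated jsonpath output; `commit_time` is an
    ISO 8601 timestamp with an offset.
    """
    commit = parse_k8s_time(commit_time)
    return [t for t in started_at.split() if parse_k8s_time(t) < commit]
```

Note that this mirrors the document's rule exactly: it compares against the commit time, not the moment git-sync pulled the commit, so a pod that started in between would still need a restart.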
If you changed specific code (e.g., removed "Thinking..." messages), verify that the change is present in the deployed files:
# Example: check that "Thinking..." is not in the deployed gateway code
kubectl exec deployment/vibeteam-gateway -n vibeteam -c gateway -- grep -rn "Thinking" /code/current/vibeteam/ 2>/dev/null
# Should return nothing
# Pause the rollouts again once verification is complete (the restart step resumed them)
kubectl rollout pause deployment/vibeteam-gateway -n vibeteam
kubectl rollout pause deployment/openhands-svc -n vibeteam
Only proceed with evaluation after all checks pass.
Run ComparativeEvaluator.evaluate() from agents/benchmark.py and report the results in the following tables.

| Framework | Input | Output | Score | Feedback | Recommendations |
|---|---|---|---|---|---|
| AutoGen | {task} | {first 100 chars}... | 4/5 | {judge feedback} | {specific improvements} |
| CrewAI | {task} | {first 100 chars}... | 3/5 | {judge feedback} | {specific improvements} |
| OpenHands | {task} | {first 100 chars}... | 5/5 | {judge feedback} | {specific improvements} |

| Metric | Value |
|---|---|
| Winner | {framework name} |
| Reasoning | {why winner was chosen} |
| Judge Model | {model used, e.g. gpt-5-2} |
| Eval Time | {milliseconds} |

| Score | Meaning |
|---|---|
| 0 | Failed/error/refusal |
| 1 | Mostly wrong |
| 2 | Partial, missing key elements |
| 3 | Acceptable |
| 4 | Good, comprehensive |
| 5 | Excellent |
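The report format and the 0 to 5 scale above can be combined in a small renderer. This is a sketch only: `cell`, `render_results`, and the dictionary keys (`framework`, `input`, `output`, `score`, `feedback`, `recommendations`) are hypothetical and are not the return type of `ComparativeEvaluator.evaluate()`, which this document does not specify.

```python
from typing import Optional

# Meanings copied from the score table in this document.
SCORE_MEANINGS = {
    0: "Failed/error/refusal",
    1: "Mostly wrong",
    2: "Partial, missing key elements",
    3: "Acceptable",
    4: "Good, comprehensive",
    5: "Excellent",
}


def cell(text: str, limit: Optional[int] = None) -> str:
    """Flatten whitespace, optionally truncate, and escape pipes for a table cell."""
    flat = " ".join(text.split())
    if limit is not None and len(flat) > limit:
        flat = flat[:limit] + "..."
    return flat.replace("|", "\\|")


def render_results(rows: list[dict]) -> str:
    """Render rows (hypothetical dict shape) as the six-column results table."""
    lines = [
        "| Framework | Input | Output | Score | Feedback | Recommendations |",
        "|---|---|---|---|---|---|",
    ]
    for row in rows:
        score = row["score"]
        if not (isinstance(score, int) and score in SCORE_MEANINGS):
            raise ValueError(f"score must be an integer 0-5, got {score!r}")
        lines.append(
            "| {} | {} | {} | {}/5 | {} | {} |".format(
                cell(row["framework"]), cell(row["input"]),
                cell(row["output"], limit=100), score,
                cell(row["feedback"]), cell(row["recommendations"]),
            )
        )
    return "\n".join(lines)
```

Escaping `|` and collapsing newlines keeps multi-line judge feedback from breaking the table layout.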
cd ~/workspace/vibebrowser/VibeTeam
set -a && source .env && set +a
python -m agents.benchmark --tasks github-issue-triage --frameworks autogen crewai openhands
Predefined tasks: sentry-weekly-summary, github-issue-triage, release-notes
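To run every predefined task in turn, the documented command can be generated once per task. The sketch below only builds argument lists and prints them; `benchmark_commands` is a hypothetical helper, and rejecting names outside the predefined list is a choice made for this sketch, not confirmed behavior of the CLI. It passes a single task per invocation because that is the only form shown above.

```python
import shlex

# Task and framework names as listed in this document.
PREDEFINED_TASKS = ("sentry-weekly-summary", "github-issue-triage", "release-notes")
FRAMEWORKS = ("autogen", "crewai", "openhands")


def benchmark_commands(tasks=PREDEFINED_TASKS, frameworks=FRAMEWORKS):
    """Return one argv list per task, mirroring the documented CLI call."""
    unknown = [t for t in tasks if t not in PREDEFINED_TASKS]
    if unknown:
        raise ValueError(f"not a predefined task: {unknown}")
    return [
        ["python", "-m", "agents.benchmark", "--tasks", task, "--frameworks", *frameworks]
        for task in tasks
    ]


# Print the commands so they can be reviewed before anything is executed.
for argv in benchmark_commands():
    print(shlex.join(argv))
```

Each list can be handed to `subprocess.run(argv, check=True)` from the repository root once the environment variables are loaded.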
| File | Purpose |
|---|---|
| agents/benchmark.py | ComparativeEvaluator, QualityEvaluator |
| agents/benchmark.py:286 | QualityEvaluator class |
| agents/benchmark.py:459 | ComparativeEvaluator class |
Required before running:
set -a && source .env && set +a
Variables: AZURE_OPENAI_API_KEY, AZURE_OPENAI_ENDPOINT, AZURE_OPENAI_DEPLOYMENT
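A quick preflight check can confirm that the variables listed above are actually visible to the Python process before a benchmark starts. This is a minimal sketch; `missing_vars` is a hypothetical helper and not part of agents/benchmark.py.

```python
import os

# Variable names as listed in this document.
REQUIRED_VARS = (
    "AZURE_OPENAI_API_KEY",
    "AZURE_OPENAI_ENDPOINT",
    "AZURE_OPENAI_DEPLOYMENT",
)


def missing_vars(env=None):
    """Return the required variable names that are unset or blank in `env`."""
    env = os.environ if env is None else env
    return [name for name in REQUIRED_VARS if not env.get(name, "").strip()]
```

Calling `missing_vars()` with no argument inspects the current process environment; a non-empty result usually means `.env` was sourced without exporting its values.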