researcher-evaluation
Technical playbook for GenAI agent evaluation using agents/benchmark.py with G-Eval methodology
| Field | Value |
|---|---|
| name | researcher-evaluation |
| description | Technical playbook for GenAI agent evaluation using agents/benchmark.py with G-Eval methodology |
| license | MIT |
| compatibility | opencode |
| metadata | {"audience":"developers","workflow":"evaluation"} |
Technical playbook for evaluating GenAI agents. OpenCode uses agents/benchmark.py directly.
| Framework | Input | Output | Score | Feedback | Recommendations |
|---|---|---|---|---|---|
| AutoGen | {task} | {truncated}... | 4/5 | {feedback} | {improvements} |
| CrewAI | {task} | {truncated}... | 3/5 | {feedback} | {improvements} |
| OpenHands | {task} | {truncated}... | 5/5 | {feedback} | {improvements} |
| Metric | Value |
|---|---|
| Winner | {framework} |
| Reasoning | {why} |
| Judge Model | {model} |
| Eval Time | {ms} |
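The Winner row can be derived from the per-framework scores in the comparison table. The following is a minimal sketch, assuming results are held as a plain mapping from framework name to a 0-5 score; `pick_winner` and its tie handling are hypothetical and not taken from agents/benchmark.py.

```python
from typing import Dict, List


def pick_winner(scores: Dict[str, int]) -> List[str]:
    """Return the framework(s) with the highest score; ties return all leaders."""
    if not scores:
        raise ValueError("no scores to compare")
    best = max(scores.values())
    # Sort so that tied frameworks come back in a stable, predictable order.
    return sorted(name for name, score in scores.items() if score == best)


# Placeholder scores copied from the comparison table.
example = {"AutoGen": 4, "CrewAI": 3, "OpenHands": 5}
print(pick_winner(example))
```

Returning a list rather than a single name keeps ties visible, so the Reasoning field can explain how a tie was broken instead of hiding it.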
| Score | Meaning |
|---|---|
| 0 | Failed/error |
| 1 | Mostly wrong |
| 2 | Partial |
| 3 | Acceptable |
| 4 | Good |
| 5 | Excellent |
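When a judge returns a non-integer or out-of-range value, it helps to normalize it onto this scale before reporting. A small sketch, assuming scores arrive as numbers; `SCORE_MEANINGS` and `describe_score` are hypothetical names for illustration, not part of agents/benchmark.py.

```python
# Labels copied from the score table.
SCORE_MEANINGS = {
    0: "Failed/error",
    1: "Mostly wrong",
    2: "Partial",
    3: "Acceptable",
    4: "Good",
    5: "Excellent",
}


def describe_score(raw: float) -> str:
    """Round to the nearest integer, clamp to 0-5, and attach the label.

    Note: Python's round() sends exact halves to the nearest even integer.
    """
    clamped = max(0, min(5, round(raw)))
    return f"{clamped}/5 ({SCORE_MEANINGS[clamped]})"


print(describe_score(4.2))
```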
| Dimension | Description |
|---|---|
| Accuracy | Facts correct, no hallucinations |
| Completeness | All sub-tasks addressed |
| Actionability | Concrete next steps |
| Clarity | Well-structured |
| Relevance | On topic |
| Efficiency | Concise |
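In G-Eval, a judge model is given the criteria, asked to reason step by step, and the final score is taken as a probability-weighted average over the candidate score tokens. The sketch below illustrates both pieces using the dimensions from the table. It is an assumption-laden illustration: `build_judge_prompt`, `expected_score`, and the prompt wording are hypothetical, and how agents/benchmark.py actually builds its prompt or aggregates scores is not shown in this document.

```python
from typing import Dict

# Dimensions copied from the table.
DIMENSIONS = {
    "Accuracy": "Facts correct, no hallucinations",
    "Completeness": "All sub-tasks addressed",
    "Actionability": "Concrete next steps",
    "Clarity": "Well-structured",
    "Relevance": "On topic",
    "Efficiency": "Concise",
}


def build_judge_prompt(task: str, output: str) -> str:
    """Assemble a judge prompt listing every dimension (hypothetical wording)."""
    criteria = "\n".join(f"- {name}: {desc}" for name, desc in DIMENSIONS.items())
    return (
        "You are evaluating an agent's response.\n\n"
        f"Task:\n{task}\n\n"
        f"Response:\n{output}\n\n"
        f"Criteria:\n{criteria}\n\n"
        "Reason step by step about each criterion, then give one overall "
        "integer score from 0 (failed) to 5 (excellent)."
    )


def expected_score(token_probs: Dict[int, float]) -> float:
    """Probability-weighted score over candidate score tokens.

    The weights are normalized, so they do not need to sum to exactly 1.
    """
    total = sum(token_probs.values())
    if total <= 0:
        raise ValueError("probabilities must sum to a positive value")
    return sum(score * p for score, p in token_probs.items()) / total


# A judge that splits its probability evenly between 4 and 5.
print(expected_score({4: 0.5, 5: 0.5}))
```

Weighting by token probability yields finer-grained scores than a single sampled integer, which helps separate frameworks that would otherwise tie.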
Chain-of-Thought evaluation steps:
python -m agents.benchmark --tasks github-issue-triage --frameworks autogen crewai openhands
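To launch the same run from a script, the argument list can be built programmatically. This sketch uses only the `--tasks` and `--frameworks` flags shown in the command above; `build_benchmark_command` is a hypothetical helper, and it deliberately takes a single task because the document does not show whether `--tasks` accepts several values.

```python
import sys
from typing import List


def build_benchmark_command(task: str, frameworks: List[str]) -> List[str]:
    """Build the argv list for one benchmark run without executing it."""
    if not frameworks:
        raise ValueError("at least one framework is required")
    return [
        sys.executable, "-m", "agents.benchmark",
        "--tasks", task,
        "--frameworks", *frameworks,
    ]


cmd = build_benchmark_command(
    "github-issue-triage", ["autogen", "crewai", "openhands"]
)
# Skip the interpreter path, which varies by machine.
print(" ".join(cmd[1:]))
```

The resulting list can be passed to `subprocess.run(cmd, check=True)` from the repository root, which avoids shell quoting issues.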
| File | Line | Class/Function |
|---|---|---|
| agents/benchmark.py | 286 | QualityEvaluator |
| agents/benchmark.py | 459 | ComparativeEvaluator |
| agents/benchmark.py | 694 | Benchmark |