researcher-evaluation
Technical playbook for GenAI agent evaluation using agents/benchmark.py with G-Eval methodology

| Field | Value |
|---|---|
| name | researcher-evaluation |
| description | Technical playbook for GenAI agent evaluation using agents/benchmark.py with G-Eval methodology |
| license | MIT |
| compatibility | opencode |
| metadata | `{"audience":"developers","workflow":"evaluation"}` |

Technical playbook for evaluating GenAI agents. OpenCode uses agents/benchmark.py directly.

| Framework | Input | Output | Score | Feedback | Recommendations |
|---|---|---|---|---|---|
| AutoGen | {task} | {truncated}... | 4/5 | {feedback} | {improvements} |
| CrewAI | {task} | {truncated}... | 3/5 | {feedback} | {improvements} |
| OpenHands | {task} | {truncated}... | 5/5 | {feedback} | {improvements} |

| Metric | Value |
|---|---|
| Winner | {framework} |
| Reasoning | {why} |
| Judge Model | {model} |
| Eval Time | {ms} |
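
As a rough illustration of how the Winner row could be derived from the per-framework scores, the sketch below picks the highest-scoring framework and renders the summary table. The `FrameworkResult` type, `pick_winner`, `render_summary`, and the tie-breaking rule are hypothetical names and assumptions made for this example; they are not taken from agents/benchmark.py, and the judge model name and timing are placeholder values.

```python
from dataclasses import dataclass


@dataclass
class FrameworkResult:
    """Hypothetical container for one row of the per-framework table."""
    framework: str
    score: int  # 0-5, same scale as the score table


def pick_winner(results: list[FrameworkResult]) -> FrameworkResult:
    """Return the highest-scoring result.

    Assumed tie-break: the earliest entry wins, because max() returns
    the first maximal element it encounters.
    """
    if not results:
        raise ValueError("no results to compare")
    return max(results, key=lambda r: r.score)


def render_summary(winner: FrameworkResult, reasoning: str,
                   judge_model: str, eval_ms: int) -> str:
    """Render the Metric/Value summary as a markdown table."""
    rows = [
        ("Winner", winner.framework),
        ("Reasoning", reasoning),
        ("Judge Model", judge_model),
        ("Eval Time", f"{eval_ms} ms"),
    ]
    lines = ["| Metric | Value |", "|---|---|"]
    lines += [f"| {name} | {value} |" for name, value in rows]
    return "\n".join(lines)


# Scores mirror the template rows above; everything else is a placeholder.
results = [
    FrameworkResult("AutoGen", 4),
    FrameworkResult("CrewAI", 3),
    FrameworkResult("OpenHands", 5),
]
winner = pick_winner(results)
summary = render_summary(winner, reasoning="highest score",
                         judge_model="example-judge", eval_ms=1200)
print(summary)
```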

| Score | Meaning |
|---|---|
| 0 | Failed/error |
| 1 | Mostly wrong |
| 2 | Partial |
| 3 | Acceptable |
| 4 | Good |
| 5 | Excellent |
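
The scale above maps naturally onto a lookup table. A minimal sketch, assuming scores are plain integers; `SCORE_MEANINGS` and `describe_score` are hypothetical names for this example and may not match how agents/benchmark.py represents scores.

```python
# Score scale copied from the table above.
SCORE_MEANINGS = {
    0: "Failed/error",
    1: "Mostly wrong",
    2: "Partial",
    3: "Acceptable",
    4: "Good",
    5: "Excellent",
}


def describe_score(score: int) -> str:
    """Format a score as 'N/5 (Meaning)', rejecting out-of-range values."""
    # bool is a subclass of int, so exclude it explicitly.
    if isinstance(score, bool) or not isinstance(score, int):
        raise TypeError(f"score must be an int, got {type(score).__name__}")
    if score not in SCORE_MEANINGS:
        raise ValueError(f"score must be between 0 and 5, got {score}")
    return f"{score}/5 ({SCORE_MEANINGS[score]})"


print(describe_score(4))
```

Rejecting out-of-range values early keeps a malformed judge response from silently turning into a valid-looking score.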

| Dimension | Description |
|---|---|
| Accuracy | Facts correct, no hallucinations |
| Completeness | All sub-tasks addressed |
| Actionability | Concrete next steps |
| Clarity | Well-structured |
| Relevance | On topic |
| Efficiency | Concise |
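
The source does not say how the six dimensions combine into a single score. One possibility, shown here purely as an assumption, is an unweighted mean of per-dimension scores on the same 0-5 scale; `DIMENSIONS` and `overall_score` are hypothetical names for this sketch.

```python
# Dimension names copied from the table above.
DIMENSIONS = (
    "Accuracy",
    "Completeness",
    "Actionability",
    "Clarity",
    "Relevance",
    "Efficiency",
)


def overall_score(dimension_scores: dict[str, int]) -> float:
    """Average per-dimension scores (assumed: equal weights, 0-5 each)."""
    missing = [d for d in DIMENSIONS if d not in dimension_scores]
    if missing:
        raise ValueError(f"missing dimensions: {', '.join(missing)}")
    for name in DIMENSIONS:
        value = dimension_scores[name]
        if not 0 <= value <= 5:
            raise ValueError(f"{name} score out of range: {value}")
    # Plain division always yields a float, even for whole-number means.
    return sum(dimension_scores[d] for d in DIMENSIONS) / len(DIMENSIONS)


example = {
    "Accuracy": 5,
    "Completeness": 4,
    "Actionability": 3,
    "Clarity": 4,
    "Relevance": 5,
    "Efficiency": 3,
}
print(overall_score(example))
```

If some dimensions matter more for a given task (for example Accuracy in triage work), a weighted mean would be the obvious variation.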

Chain-of-Thought evaluation steps:

Example invocation:

python -m agents.benchmark --tasks github-issue-triage --frameworks autogen crewai openhands
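
When the benchmark is driven from a script, the same command line can be assembled programmatically. The sketch below only builds and prints the command; `build_benchmark_command` is a hypothetical helper, and it assumes `--tasks` accepts several values the way `--frameworks` does in the invocation above, which the source does not confirm.

```python
import shlex


def build_benchmark_command(tasks: list[str], frameworks: list[str]) -> list[str]:
    """Build the argv list for `python -m agents.benchmark`.

    Only the --tasks and --frameworks flags shown in the source are used.
    """
    if not tasks:
        raise ValueError("at least one task is required")
    if not frameworks:
        raise ValueError("at least one framework is required")
    return [
        "python", "-m", "agents.benchmark",
        "--tasks", *tasks,
        "--frameworks", *frameworks,
    ]


cmd = build_benchmark_command(
    tasks=["github-issue-triage"],
    frameworks=["autogen", "crewai", "openhands"],
)
# shlex.join quotes arguments safely for display or logging.
print(shlex.join(cmd))
```

To actually run it, the list can be passed to `subprocess.run(cmd, check=True)` from a directory where the `agents` package is importable.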

| File | Line | Class/Function |
|---|---|---|
| agents/benchmark.py | 286 | QualityEvaluator |
| agents/benchmark.py | 459 | ComparativeEvaluator |
| agents/benchmark.py | 694 | Benchmark |