researcher-evaluation

Stars: 1

Forks: 0

Last updated: 28 January 2026 at 21:53

Technical playbook for GenAI agent evaluation using agents/benchmark.py with G-Eval methodology

Source

VibeTechnologies

VibeTechnologies/VibeTeam

SKILL.md

name	researcher-evaluation
description	Technical playbook for GenAI agent evaluation using agents/benchmark.py with G-Eval methodology
license	MIT
compatibility	opencode
metadata	{"audience":"developers","workflow":"evaluation"}

Researcher Evaluation Skill

Technical playbook for evaluating GenAI agents. OpenCode uses agents/benchmark.py directly.

Required Output Format

Evaluation Results

| Framework | Input | Output | Score | Feedback | Recommendations |
| --- | --- | --- | --- | --- | --- |
| AutoGen | {task} | {truncated}... | 4/5 | {feedback} | {improvements} |
| CrewAI | {task} | {truncated}... | 3/5 | {feedback} | {improvements} |
| OpenHands | {task} | {truncated}... | 5/5 | {feedback} | {improvements} |

Summary

| Metric | Value |
| --- | --- |
| Winner | {framework} |
| Reasoning | {why} |
| Judge Model | {model} |
| Eval Time | {ms} |
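The output format above can be produced mechanically. The following is a minimal sketch, not the code in `agents/benchmark.py`: the `FrameworkResult` dataclass, its field names, the 40-character truncation limit, and the rule that the highest score wins (first entry on ties) are all assumptions made for illustration.

```python
from dataclasses import dataclass


@dataclass
class FrameworkResult:
    # Hypothetical record; field names mirror the table columns above.
    framework: str
    task: str
    output: str
    score: int
    feedback: str
    recommendations: str


def truncate(text: str, limit: int = 40) -> str:
    """Shorten long outputs and mark the cut with '...'."""
    return text if len(text) <= limit else text[:limit] + "..."


def cell(text: str) -> str:
    """Keep a value on one table row: flatten newlines, escape pipes."""
    return text.replace("\n", " ").replace("|", "\\|")


def row(cells: list[str]) -> str:
    return "| " + " | ".join(cell(c) for c in cells) + " |"


def render_report(results: list[FrameworkResult], judge_model: str,
                  eval_time_ms: int, reasoning: str) -> str:
    """Render the results and summary tables as Markdown text."""
    columns = ["Framework", "Input", "Output", "Score", "Feedback", "Recommendations"]
    lines = ["Evaluation Results", "", row(columns), row(["---"] * len(columns))]
    for r in results:
        lines.append(row([r.framework, r.task, truncate(r.output),
                          f"{r.score}/5", r.feedback, r.recommendations]))
    # Assumption: the winner is the highest-scoring framework.
    winner = max(results, key=lambda r: r.score)
    lines += ["", "Summary", "", row(["Metric", "Value"]), row(["---", "---"]),
              row(["Winner", winner.framework]),
              row(["Reasoning", reasoning]),
              row(["Judge Model", judge_model]),
              row(["Eval Time", f"{eval_time_ms} ms"])]
    return "\n".join(lines)
```

Calling `render_report(results, judge_model=..., eval_time_ms=..., reasoning=...)` returns a single string that can be printed or written to a report file.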

Scoring Scale

| Score | Meaning |
| --- | --- |
| 0 | Failed/error |
| 1 | Mostly wrong |
| 2 | Partial |
| 3 | Acceptable |
| 4 | Good |
| 5 | Excellent |
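The scale is small enough to encode as a lookup table, which also gives a natural place to reject out-of-range judge scores. This is a sketch; `SCORE_MEANINGS` and `describe_score` are hypothetical names, not identifiers taken from the repository.

```python
# Labels copied from the scoring scale above.
SCORE_MEANINGS = {
    0: "Failed/error",
    1: "Mostly wrong",
    2: "Partial",
    3: "Acceptable",
    4: "Good",
    5: "Excellent",
}


def describe_score(score: int) -> str:
    """Return 'N/5 (label)' or raise if the score is not an integer in 0..5."""
    # bool is a subclass of int, so reject it explicitly.
    if isinstance(score, bool) or not isinstance(score, int):
        raise ValueError(f"score must be an integer, got {score!r}")
    if score not in SCORE_MEANINGS:
        raise ValueError(f"score must be between 0 and 5, got {score}")
    return f"{score}/5 ({SCORE_MEANINGS[score]})"
```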

Evaluation Dimensions

| Dimension | Description |
| --- | --- |
| Accuracy | Facts correct, no hallucinations |
| Completeness | All sub-tasks addressed |
| Actionability | Concrete next steps |
| Clarity | Well-structured |
| Relevance | On topic |
| Efficiency | Concise |
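The skill does not say how the six dimensions combine into the single 0-5 score. One possible reading, shown here purely as an assumption, is to score each dimension on the same 0-5 scale and take the unweighted mean, rounded half up. The names `DIMENSIONS` and `overall_score` are hypothetical.

```python
from statistics import mean

# Dimension names taken from the table above, lower-cased as dict keys.
DIMENSIONS = ("accuracy", "completeness", "actionability",
              "clarity", "relevance", "efficiency")


def overall_score(per_dimension: dict[str, int]) -> int:
    """Combine per-dimension scores into one 0..5 score.

    Assumption: unweighted mean, rounded half up. The real evaluator
    may weight dimensions differently or let the judge decide directly.
    """
    missing = [d for d in DIMENSIONS if d not in per_dimension]
    if missing:
        raise ValueError(f"missing dimensions: {missing}")
    for name in DIMENSIONS:
        if not 0 <= per_dimension[name] <= 5:
            raise ValueError(f"{name} must be between 0 and 5")
    average = mean(per_dimension[d] for d in DIMENSIONS)
    return int(average + 0.5)  # round half up; average is never negative
```

An unweighted mean treats a hallucination the same as a verbose answer; if accuracy should dominate, a weighted mean or a cap on the total when accuracy is low would be a reasonable variation.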

Methodology (G-Eval)

Chain-of-Thought evaluation steps:

1. Check factual accuracy
2. Verify task completion
3. Assess actionability
4. Check for hallucinations
5. Evaluate conciseness

After working through the steps, assign a score from 0 to 5.
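In a G-Eval-style setup, the steps above are given to a judge model as explicit chain-of-thought instructions, and the final score is read back from its reply. The sketch below covers only prompt construction and score parsing; the prompt wording, the `Score: N` reply convention, and the function names are assumptions, and the call to the judge model itself is deliberately left out.

```python
import re

# Evaluation steps copied from the methodology list above.
EVAL_STEPS = [
    "Check factual accuracy",
    "Verify task completion",
    "Assess actionability",
    "Check for hallucinations",
    "Evaluate conciseness",
]


def build_judge_prompt(task: str, output: str) -> str:
    """Build a chain-of-thought judging prompt (wording is illustrative)."""
    steps = "\n".join(f"{i}. {step}" for i, step in enumerate(EVAL_STEPS, 1))
    return (
        "You are evaluating an agent's response.\n\n"
        f"Task:\n{task}\n\n"
        f"Response:\n{output}\n\n"
        "Work through these steps, writing your reasoning for each:\n"
        f"{steps}\n\n"
        "Finish with a line of the form 'Score: N', "
        "where N is an integer from 0 to 5."
    )


def parse_score(reply: str) -> int:
    """Extract the last 'Score: N' line from the judge's reply."""
    matches = re.findall(r"^Score:\s*([0-5])(?:/5)?\s*$", reply, flags=re.MULTILINE)
    if not matches:
        raise ValueError("judge reply contains no 'Score: N' line")
    return int(matches[-1])
```

Taking the last match lets the judge mention intermediate scores while reasoning without confusing the parser.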

CLI

python -m agents.benchmark --tasks github-issue-triage --frameworks autogen crewai openhands
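To drive the same command from a script, it can be assembled as an argument list and passed to `subprocess.run`. This sketch uses only the `--tasks` and `--frameworks` flags shown above; the helper names are hypothetical, and whether `--tasks` accepts more than one value is not stated, so a single task is passed.

```python
import subprocess
import sys


def build_command(task: str, frameworks: list[str]) -> list[str]:
    """Assemble the benchmark invocation as an argument list (no shell quoting needed)."""
    return [sys.executable, "-m", "agents.benchmark",
            "--tasks", task,
            "--frameworks", *frameworks]


def run_benchmark(task: str, frameworks: list[str]) -> str:
    """Run the benchmark from the repository root and return its stdout.

    Raises subprocess.CalledProcessError if the command exits non-zero.
    """
    completed = subprocess.run(build_command(task, frameworks),
                               check=True, capture_output=True, text=True)
    return completed.stdout
```

Using `sys.executable` rather than a bare `python` keeps the run inside the current virtual environment.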

Key Files

| File | Line | Class/Function |
| --- | --- | --- |
| `agents/benchmark.py` | 286 | `QualityEvaluator` |
| `agents/benchmark.py` | 459 | `ComparativeEvaluator` |
| `agents/benchmark.py` | 694 | `Benchmark` |
