researcher-evaluation

Stars: 1

Forks: 0

Updated: January 28, 2026, 21:53

Technical playbook for GenAI agent evaluation using agents/benchmark.py with G-Eval methodology

Source: VibeTechnologies/VibeTeam

Researcher Evaluation Skill

Technical playbook for evaluating GenAI agents. OpenCode uses agents/benchmark.py directly.

Required Output Format

Evaluation Results

Framework	Input	Output	Score	Feedback	Recommendations
AutoGen	{task}	{truncated}...	4/5	{feedback}	{improvements}
CrewAI	{task}	{truncated}...	3/5	{feedback}	{improvements}
OpenHands	{task}	{truncated}...	5/5	{feedback}	{improvements}

Summary

Metric	Value
Winner	{framework}
Reasoning	{why}
Judge Model	{model}
Eval Time	{ms}
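
The two tables above can be generated mechanically once per-framework results are collected. The following is a minimal sketch, not code from agents/benchmark.py: the `EvalResult` record, the `render_report` helper, and the 40-character truncation limit are all hypothetical names and values chosen for illustration.

```python
from dataclasses import dataclass


@dataclass
class EvalResult:
    """Hypothetical record for one framework's run (not from agents/benchmark.py)."""
    framework: str
    task: str
    output: str
    score: int  # 0-5, per the scoring scale
    feedback: str
    recommendations: str


def truncate(text: str, limit: int = 40) -> str:
    """Shorten long outputs and mark the cut with '...'. The limit is an assumption."""
    return text if len(text) <= limit else text[:limit] + "..."


def render_report(results: list[EvalResult], reasoning: str,
                  judge_model: str, eval_ms: int) -> str:
    """Build the tab-separated Evaluation Results and Summary tables."""
    lines = [
        "Evaluation Results",
        "",
        "Framework\tInput\tOutput\tScore\tFeedback\tRecommendations",
    ]
    for r in results:
        lines.append("\t".join([
            r.framework, r.task, truncate(r.output),
            f"{r.score}/5", r.feedback, r.recommendations,
        ]))
    # Highest score wins; on a tie, max() keeps the first framework listed.
    winner = max(results, key=lambda r: r.score)
    lines += [
        "",
        "Summary",
        "",
        "Metric\tValue",
        f"Winner\t{winner.framework}",
        f"Reasoning\t{reasoning}",
        f"Judge Model\t{judge_model}",
        f"Eval Time\t{eval_ms}",
    ]
    return "\n".join(lines)
```

Picking the winner by highest score is an assumption; the playbook only requires that the Summary name a winner and give the reasoning.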

Scoring Scale

Score	Meaning
0	Failed/error
1	Mostly wrong
2	Partial
3	Acceptable
4	Good
5	Excellent
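
The scale above maps directly onto a lookup table. As a small sketch, the hypothetical helper `describe_score` (an illustrative name, not part of agents/benchmark.py) validates a score and formats it in the `N/5` style used in the results table:

```python
# Meanings copied from the scoring scale above.
SCORE_MEANINGS = {
    0: "Failed/error",
    1: "Mostly wrong",
    2: "Partial",
    3: "Acceptable",
    4: "Good",
    5: "Excellent",
}


def describe_score(score: int) -> str:
    """Return 'N/5 (meaning)' for a valid score; reject anything off the scale."""
    # bool is a subclass of int, so exclude it explicitly.
    if isinstance(score, bool) or not isinstance(score, int):
        raise TypeError("score must be an integer")
    if score not in SCORE_MEANINGS:
        raise ValueError("score must be between 0 and 5")
    return f"{score}/5 ({SCORE_MEANINGS[score]})"
```

Rejecting out-of-range values early keeps a malformed judge reply from silently appearing in a report.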

Evaluation Dimensions

Dimension	Description
Accuracy	Facts correct, no hallucinations
Completeness	All sub-tasks addressed
Actionability	Concrete next steps
Clarity	Well-structured
Relevance	On topic
Efficiency	Concise
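
The playbook does not say how the six dimensions combine into a single score. One possible approach, shown here purely as an assumption, is an unweighted mean rounded half up; `overall_score` is a hypothetical helper, not a function from agents/benchmark.py.

```python
# Dimension names taken from the table above, lowercased for use as keys.
DIMENSIONS = (
    "accuracy",
    "completeness",
    "actionability",
    "clarity",
    "relevance",
    "efficiency",
)


def overall_score(per_dimension: dict[str, int]) -> int:
    """Collapse per-dimension scores (each 0-5) into one 0-5 score.

    Assumption: equal weights and round-half-up. The real evaluator may differ.
    """
    missing = [d for d in DIMENSIONS if d not in per_dimension]
    if missing:
        raise ValueError(f"missing dimensions: {missing}")
    for name in DIMENSIONS:
        if not 0 <= per_dimension[name] <= 5:
            raise ValueError(f"{name} score out of range")
    mean = sum(per_dimension[d] for d in DIMENSIONS) / len(DIMENSIONS)
    # int(x + 0.5) rounds half up; the built-in round() rounds half to even.
    return int(mean + 0.5)
```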

Methodology (G-Eval)

Chain-of-Thought evaluation steps:

1. Check factual accuracy
2. Verify task completion
3. Assess actionability
4. Check for hallucinations
5. Evaluate conciseness

→ Score 0-5
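
The steps above can be turned into a judge prompt whose reply ends in a parseable score. This is a minimal sketch under stated assumptions: the prompt wording, the `Score: N` reply convention, and the names `build_judge_prompt`, `parse_score`, and `evaluate` are all hypothetical, and the judge is passed in as a plain callable so that no particular LLM client is implied.

```python
import re
from typing import Callable

# Evaluation steps copied from the methodology above.
EVAL_STEPS = [
    "Check factual accuracy",
    "Verify task completion",
    "Assess actionability",
    "Check for hallucinations",
    "Evaluate conciseness",
]


def build_judge_prompt(task: str, output: str) -> str:
    """Assemble a chain-of-thought prompt listing the evaluation steps."""
    steps = "\n".join(f"{i}. {s}" for i, s in enumerate(EVAL_STEPS, start=1))
    return (
        "You are evaluating an agent's response.\n\n"
        f"Task:\n{task}\n\n"
        f"Response:\n{output}\n\n"
        f"Evaluation steps:\n{steps}\n\n"
        "Work through the steps in order, then finish with a line "
        "'Score: N' where N is an integer from 0 to 5."
    )


def parse_score(reply: str) -> int:
    """Extract the last 'Score: N' (N in 0-5) from the judge's reply."""
    matches = re.findall(r"Score:\s*([0-5])\b", reply)
    if not matches:
        raise ValueError("no score found in judge reply")
    return int(matches[-1])  # the final verdict wins if several appear


def evaluate(task: str, output: str, judge: Callable[[str], str]) -> int:
    """Run one evaluation; `judge` maps a prompt string to the model's reply text."""
    return parse_score(judge(build_judge_prompt(task, output)))
```

A stub such as `lambda prompt: "Reasoning...\nScore: 4"` can stand in for the judge when testing the parsing logic offline.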

CLI

python -m agents.benchmark --tasks github-issue-triage --frameworks autogen crewai openhands
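
When the benchmark has to be launched from another script, the command above can be assembled as an argument list. The sketch below uses only the flags shown in that command; `build_benchmark_command` and `run_benchmark` are hypothetical helper names, and passing more than one value to `--tasks` is an assumption that the source does not confirm.

```python
import subprocess
import sys


def build_benchmark_command(tasks: list[str], frameworks: list[str]) -> list[str]:
    """Return the argv list for `python -m agents.benchmark ...`."""
    if not tasks or not frameworks:
        raise ValueError("at least one task and one framework are required")
    return [
        sys.executable, "-m", "agents.benchmark",
        "--tasks", *tasks,            # multiple tasks here is an assumption
        "--frameworks", *frameworks,
    ]


def run_benchmark(tasks: list[str], frameworks: list[str]) -> int:
    """Run the benchmark from the repository root and return its exit code."""
    completed = subprocess.run(build_benchmark_command(tasks, frameworks), check=False)
    return completed.returncode
```

Using `sys.executable` keeps the child process in the same interpreter and virtual environment as the caller.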

Key Files

File	Line	Class/Function
`agents/benchmark.py`	286	`QualityEvaluator`
`agents/benchmark.py`	459	`ComparativeEvaluator`
`agents/benchmark.py`	694	`Benchmark`
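
Line numbers like those in the table above drift as a file is edited. As a rough sketch, the standard-library `ast` module can report where each class or function is currently defined; `find_definitions` is a hypothetical helper written for this illustration.

```python
import ast


def find_definitions(source: str, names: set[str]) -> dict[str, int]:
    """Map each requested class/function name to the line where it is defined."""
    found: dict[str, int] = {}
    for node in ast.walk(ast.parse(source)):
        is_def = isinstance(node, (ast.ClassDef, ast.FunctionDef, ast.AsyncFunctionDef))
        if is_def and node.name in names:
            # ast.walk visits top-level nodes first, so keep the first match.
            found.setdefault(node.name, node.lineno)
    return found
```

Run from the repository root, it could be applied like this:

```python
with open("agents/benchmark.py", encoding="utf-8") as fh:
    print(find_definitions(fh.read(), {"QualityEvaluator", "ComparativeEvaluator", "Benchmark"}))
```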

name	researcher-evaluation
description	Technical playbook for GenAI agent evaluation using agents/benchmark.py with G-Eval methodology
license	MIT
compatibility	opencode
metadata	{"audience":"developers","workflow":"evaluation"}
