eval-reporter

Stars: 5

Forks: 0

Last updated: 21 February 2026, 07:23

Generate HTML reports and structured metrics from model evaluation results. Creates publication-ready reports with embedded outputs (images, audio, charts) and metrics.json for cross-model comparison. Use when generating reports, writing metrics, creating evaluation summaries, or formatting benchmark results. Triggers on "generate report", "write metrics", "create report", "evaluation summary", "benchmark results", "format results".

Source

nyosegawa

nyosegawa/agentic-bench

SKILL.md

name

eval-reporter

description

Eval Reporter

You are generating evaluation reports from model benchmark results.

Your Goal

Given evaluation outputs and metrics:

- Write structured metrics.json with all measurements
- Write a beautiful HTML report directly (no template; you design the page)
- Save everything to results/YYYY-MM-DD_modelname/

Workflow

Step 1: Collect Data

Gather from the previous phases:

- Model profile from research phase: name, URL, description, architecture, params, license, notable features
- Stage results: smoke test pass/fail, quality outputs, performance metrics
- Generated artifacts: images, audio, text files in artifacts/
- Timing and cost information
- Device info: GPU, VRAM, framework versions
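
The inputs listed above could be gathered into one structure before any file is written. The following is a minimal sketch using Python dataclasses; every class and field name is hypothetical and chosen for illustration, not taken from the skill's scripts.

```python
from dataclasses import dataclass, field, asdict


@dataclass
class ModelProfile:
    # Mirrors the "model profile" item; field names are assumptions.
    name: str
    url: str
    description: str = ""
    architecture: str = ""
    params: str = ""
    license: str = ""
    notable_features: list[str] = field(default_factory=list)


@dataclass
class EvalInputs:
    profile: ModelProfile
    smoke_test_passed: bool
    quality_outputs: list[dict] = field(default_factory=list)
    performance: dict = field(default_factory=dict)
    artifacts: list[str] = field(default_factory=list)  # paths under artifacts/
    timing_and_cost: dict = field(default_factory=dict)
    device: dict = field(default_factory=dict)  # GPU, VRAM, framework versions

    def missing(self) -> list[str]:
        """Return the names of sections that are still empty."""
        data = asdict(self)
        keys = ("quality_outputs", "performance", "artifacts",
                "timing_and_cost", "device")
        return [k for k in keys if not data[k]]
```

Calling `missing()` before writing the report makes gaps visible early, so they can be documented rather than silently omitted.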

Step 2: Write metrics.json

Run the metrics writer:

python .claude/skills/eval-reporter/scripts/metrics_writer.py \
  --output results/YYYY-MM-DD_modelname/metrics.json \
  --model MODEL_ID \
  --model-type MODEL_TYPE \
  --provider PROVIDER \
  --gpu GPU_NAME \
  --json-data '{"stages": {...}}'

Or construct the JSON directly following references/report-format.md.
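
For the direct-construction route, a sketch like the one below may help. The key names loosely mirror the CLI flags shown above and are assumptions; the actual schema is defined in references/report-format.md, which is not reproduced here, so treat this as a starting shape and not as the required format. The helper names `build_metrics` and `write_metrics` are hypothetical.

```python
import json
from datetime import date
from pathlib import Path


def build_metrics(model, model_type, provider, gpu, stages):
    """Assemble a metrics dict. Key names are assumptions, not the real schema."""
    return {
        "model": model,
        "model_type": model_type,
        "provider": provider,
        "gpu": gpu,
        "date": date.today().isoformat(),
        "stages": stages,
    }


def write_metrics(metrics, output_path):
    """Write metrics as indented UTF-8 JSON, creating parent directories."""
    path = Path(output_path)
    path.parent.mkdir(parents=True, exist_ok=True)
    text = json.dumps(metrics, indent=2, ensure_ascii=False) + "\n"
    path.write_text(text, encoding="utf-8")
    return path
```

Keeping the build step separate from the write step makes it easy to inspect or validate the dict against the reference schema before anything touches disk.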

Step 3: Write HTML Report

Write the HTML yourself. Do not use a template. You are an ML engineer writing a report for a technical audience. Design the page to be informative, honest, and beautiful.

Consult references/report-format.md for design guidelines (CSS palette, component examples).

Required sections:

- Header: Model name (linked to HuggingFace page), type badge, date, provider/GPU
- Model Overview: Architecture, parameter count, license, key features, 1-2 sentence description from model card
- Execution Environment: GPU, VRAM, framework versions, provider
- Smoke Test: Pass/fail, load time, any issues
- Quality Results: Test inputs paired with outputs:
  - TTS: show input text above each <audio> player
  - Image gen: show prompt above each <img>
  - LLM: show prompt and response together
  - Always show what went IN and what came OUT
- Performance: Key metrics table, comparison to published benchmarks if known
- Conclusion: Your honest assessment: strengths, weaknesses, comparison to claimed performance, practical recommendations
- Reproduction: Link to workspace/run.py, note the provider and cost

Design principles:

- Dark theme preferred (see references/report-format.md for CSS palette)
- Responsive layout, readable at any width
- Use semantic HTML (<table>, <audio>, <img>), not div soup
- Reference artifacts by relative path (not base64)
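
The Quality Results pairing (input above output, semantic tags, relative paths) can be sketched as a small rendering helper. This is only an illustration: the function name, the `kind` labels, and the choice of `<figure>`/`<figcaption>` are assumptions, and the skill itself asks for the page to be hand-designed, not generated from a fixed template.

```python
from html import escape


def render_quality_item(kind, prompt, output):
    """Render one input/output pair as semantic HTML.

    kind:   "tts", "image", or "llm" (hypothetical labels).
    output: a relative artifact path for tts/image, or response text for llm.
    """
    shown_input = f"<figcaption>{escape(prompt)}</figcaption>"
    if kind == "tts":
        body = f'<audio controls src="{escape(output)}"></audio>'
    elif kind == "image":
        body = f'<img src="{escape(output)}" alt="{escape(prompt)}">'
    elif kind == "llm":
        body = f"<blockquote>{escape(output)}</blockquote>"
    else:
        raise ValueError(f"unknown kind: {kind}")
    # The input is emitted first so the prompt sits above the result.
    return f"<figure>{shown_input}{body}</figure>"
```

Escaping both the prompt and the model output matters here, since LLM responses can contain markup that would otherwise break the page.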

Step 4: Organize Result Directory

results/YYYY-MM-DD_modelname/
├── report.html          # Human-readable report (GitHub Pages ready)
├── metrics.json         # Structured data for comparison
├── artifacts/           # Generated outputs (images, audio, text)
│   ├── output_001.png
│   ├── output_002.wav
│   └── ...
└── workspace/           # Reproduction scripts (committed)
    └── run.py           # The inference script used
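
The layout above could be created up front with a helper along these lines. A minimal sketch: the function names are hypothetical, and the rule for turning a model ID into the `modelname` part (drop the organization prefix, replace unusual characters with hyphens) is an assumption, since the source does not specify it.

```python
import re
from datetime import date
from pathlib import Path


def result_dir_name(model_id, day=None):
    """Build 'YYYY-MM-DD_modelname'. The slug rule is an assumption."""
    day = day or date.today()
    short_name = model_id.split("/")[-1]
    slug = re.sub(r"[^A-Za-z0-9._-]+", "-", short_name).strip("-")
    return f"{day.isoformat()}_{slug}"


def make_result_dir(root, model_id, day=None):
    """Create the result directory with artifacts/ and workspace/ inside it."""
    base = Path(root) / result_dir_name(model_id, day)
    for sub in ("artifacts", "workspace"):
        (base / sub).mkdir(parents=True, exist_ok=True)
    return base
```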

Step 5: Update Index Page

Regenerate the top-level index.html (report listing for GitHub Pages):

python .claude/skills/eval-reporter/scripts/generate_index.py
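
For orientation, the kind of work an index generator does can be sketched as follows. This is not the repository's generate_index.py, whose behaviour is not shown here; the function names and the output markup are assumptions made purely for illustration.

```python
from html import escape
from pathlib import Path


def list_reports(results_root):
    """Return result directory names that contain a report.html, newest first."""
    root = Path(results_root)
    names = [p.parent.name for p in root.glob("*/report.html")]
    # The YYYY-MM-DD prefix means a plain string sort is chronological.
    return sorted(names, reverse=True)


def render_index(names, results_dir="results"):
    """Render a bare list of links to each report, using relative paths."""
    items = "\n".join(
        f'  <li><a href="{escape(results_dir)}/{escape(n)}/report.html">'
        f"{escape(n)}</a></li>"
        for n in names
    )
    return f"<ul>\n{items}\n</ul>"
```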

Important

- Consult references/report-format.md for the metrics.json schema and CSS design reference
- Images should be referenced by relative path in HTML (not base64, to keep file size small)
- Audio files: use <audio controls> tags in HTML
- Always include the run.py script for reproducibility
- Be honest in assessments: document failures and limitations clearly
- The report should be self-contained and understandable without reading metrics.json
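
Several of these rules are mechanical enough to check before publishing. The sketch below, using only the standard library, flags embedded data URIs, non-relative sources, missing artifact files, and a missing workspace/run.py. The function name `check_report` and the exact set of checks are assumptions; it cannot judge the honesty or readability points.

```python
from html.parser import HTMLParser
from pathlib import Path


class _SrcCollector(HTMLParser):
    """Collect src attributes from <img>, <audio>, and <source> tags."""

    def __init__(self):
        super().__init__()
        self.sources = []

    def handle_starttag(self, tag, attrs):
        if tag in ("img", "audio", "source"):
            for name, value in attrs:
                if name == "src" and value:
                    self.sources.append(value)


def check_report(report_path):
    """Return a list of problems found in report.html (empty if none found)."""
    report = Path(report_path)
    parser = _SrcCollector()
    parser.feed(report.read_text(encoding="utf-8"))
    problems = []
    for src in parser.sources:
        if src.startswith("data:"):
            problems.append(f"embedded data URI: {src[:30]}...")
        elif "://" in src or src.startswith("/"):
            problems.append(f"not a relative path: {src}")
        elif not (report.parent / src).is_file():
            problems.append(f"missing artifact: {src}")
    if not (report.parent / "workspace" / "run.py").is_file():
        problems.append("missing workspace/run.py")
    return problems
```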

More from this repository

gpu-runner

nyosegawa/agentic-bench

Execute model inference on GPU cloud providers. Handles code generation, deployment, execution, and result collection across HF Inference API/Endpoints, Colab, Modal, beam.cloud, Vast.ai, and RunPod. Use when running models on GPU, deploying to cloud, executing notebooks, or troubleshooting GPU execution failures. Triggers on "run on GPU", "execute model", "deploy to modal", "colab notebook", "beam deploy", "HF inference", "HF endpoints", "vast", "runpod".

2026-02-21 (5 stars)

model-researcher

nyosegawa/agentic-bench

Investigate model specifications, requirements, and evaluation strategy. Use when researching a model before benchmarking: reading HuggingFace model cards, estimating VRAM requirements, selecting GPU providers, and determining evaluation approach. Triggers on "model research", "investigate model", "model info", "VRAM estimate", "which provider", "model card".

2026-02-21 (5 stars)

agentic-bench

nyosegawa/agentic-bench

Autonomous model validation and benchmarking. Investigates any ML model (LLM, image gen, TTS, time series, etc.), runs it on GPU cloud, evaluates quality and performance, and generates HTML reports. Use when user asks to verify, benchmark, evaluate, or test a model. Triggers on "verify model", "benchmark", "evaluate model", "test model", "run benchmark", "model evaluation", "モデルを検証", "ベンチマーク", "モデルを試して".

2026-02-21 (5 stars)
