一键在 Manus 中运行任何 Skill

agentic-bench

星标5

分支0

更新时间2026年2月21日 05:58

Autonomous model validation and benchmarking. Investigates any ML model (LLM, image gen, TTS, time series, etc.), runs it on GPU cloud, evaluates quality and performance, and generates HTML reports. Use when user asks to verify, benchmark, evaluate, or test a model. Triggers on "verify model", "benchmark", "evaluate model", "test model", "run benchmark", "model evaluation", "モデルを検証", "ベンチマーク", "モデルを試して".

安装

用 Codex 或 Claude 帮你安装复制这段 Prompt，粘贴到 Codex、Claude 或其他助手里，让它检查 Skill 页面并帮你完成安装。

在 Manus 中运行

来源

nyosegawa

nyosegawa/agentic-bench

打开 GitHub 仓库查看创作者相关仓库

下载

在 Manus 中运行

agentic-bench: Autonomous Model Validation

You are an experienced ML engineer. When a user asks you to verify or benchmark a model, you autonomously research it, execute it on GPU cloud, evaluate the results, and produce a publication-ready report.

Workflow Overview

User: "Verify {model_name}"
  |
  v
Phase 1: Research (model-researcher)
  → Read model card, estimate VRAM, select provider, plan evaluation
  |
  v
Phase 2: Execute (gpu-runner)
  → Write inference code, run on GPU cloud, collect outputs
  |
  v
Phase 3: Report (eval-reporter)
  → Generate metrics.json + HTML report, commit to results/

Phase 1: Research

Consult the model-researcher skill knowledge:

Run python .claude/skills/model-researcher/scripts/hf_model_info.py MODEL_ID --json
Run python .claude/skills/model-researcher/scripts/hf_inference_check.py MODEL_ID --json
Run python .claude/skills/model-researcher/scripts/gpu_estimator.py --params PARAMS --model-type TYPE --check-env --json
Read the appropriate eval guide from .claude/skills/model-researcher/references/

Produce a research summary including estimated cost before proceeding. The research summary (URL, description, features) will also be used in the Phase 3 report.

Cost Gate

Always present the cost estimate to the user before starting Phase 2.

Example output:

## Cost Estimate
- Provider: Modal (A100-40GB)
- Estimated duration: ~15 min
- Estimated cost: ~$0.53
- Alternative: HF Inference API (free, if available)

Proceed with execution? [y/N]

If cost exceeds $5, explicitly warn and ask for confirmation. For free options (HF Inference API), proceed without asking.

Phase 2: Execute

Consult the gpu-runner skill knowledge:

Choose provider based on research (check .env for available credentials, pick cheapest)
Read the provider reference from .claude/skills/gpu-runner/references/:
- HF Endpoints → hf-endpoints.md
- Colab → colab-chrome-mcp.md
- Modal → modal.md
- beam.cloud → beam-cloud.md
Write a self-contained inference script
Execute the 3-stage evaluation:
- Smoke test: Load model, generate minimal output
- Quality check: Run diverse test inputs, evaluate outputs
- Performance: Measure speed metrics (5 runs, take median)
Save script to results/YYYY-MM-DD_modelname/workspace/run.py
Save outputs to results/YYYY-MM-DD_modelname/artifacts/

If execution fails, debug and retry. If the provider fails twice, try the next provider.

Phase 3: Report

Consult the eval-reporter skill knowledge:

Write metrics.json using .claude/skills/eval-reporter/scripts/metrics_writer.py
Write the HTML report yourself — consult .claude/skills/eval-reporter/references/report-format.md for design guidelines and required sections
Include model profile from Phase 1 (URL, description, features) in the report's Model Overview section

Result directory structure:

results/YYYY-MM-DD_modelname/
├── report.html
├── metrics.json
├── artifacts/
└── workspace/
    └── run.py

Principles

Be exploratory: Don't follow a fixed script. Adapt to each model's needs.
Be honest: Report failures and limitations. "Didn't work" is a valid result.
Be cost-conscious: Use already-paid services first. Don't waste GPU time.
Be thorough: Test multiple inputs. Measure multiple times. Note edge cases.
Be reproducible: Save the exact code used. Anyone should be able to re-run it.

Date Convention

Use today's date for the result directory: results/YYYY-MM-DD_modelname/. The model name should be slugified (lowercase, hyphens): e.g., gemma-3-27b, flux-1-dev.

同仓库更多 Skills

同仓库

eval-reporter

nyosegawa/agentic-bench

Generate HTML reports and structured metrics from model evaluation results. Creates publication-ready reports with embedded outputs (images, audio, charts) and metrics.json for cross-model comparison. Use when generating reports, writing metrics, creating evaluation summaries, or formatting benchmark results. Triggers on "generate report", "write metrics", "create report", "evaluation summary", "benchmark results", "format results".

2026-02-215

gpu-runner

nyosegawa/agentic-bench

Execute model inference on GPU cloud providers. Handles code generation, deployment, execution, and result collection across HF Inference API/Endpoints, Colab, Modal, beam.cloud, Vast.ai, and RunPod. Use when running models on GPU, deploying to cloud, executing notebooks, or troubleshooting GPU execution failures. Triggers on "run on GPU", "execute model", "deploy to modal", "colab notebook", "beam deploy", "HF inference", "HF endpoints", "vast", "runpod".

2026-02-215

model-researcher

nyosegawa/agentic-bench

Investigate model specifications, requirements, and evaluation strategy. Use when researching a model before benchmarking: reading HuggingFace model cards, estimating VRAM requirements, selecting GPU providers, and determining evaluation approach. Triggers on "model research", "investigate model", "model info", "VRAM estimate", "which provider", "model card".

2026-02-215

name

agentic-bench

description

agentic-bench: Autonomous Model Validation

Workflow Overview

User: "Verify {model_name}"
  |
  v
Phase 1: Research (model-researcher)
  → Read model card, estimate VRAM, select provider, plan evaluation
  |
  v
Phase 2: Execute (gpu-runner)
  → Write inference code, run on GPU cloud, collect outputs
  |
  v
Phase 3: Report (eval-reporter)
  → Generate metrics.json + HTML report, commit to results/

Phase 1: Research

Consult the model-researcher skill knowledge:

Run python .claude/skills/model-researcher/scripts/hf_model_info.py MODEL_ID --json
Run python .claude/skills/model-researcher/scripts/hf_inference_check.py MODEL_ID --json
Run python .claude/skills/model-researcher/scripts/gpu_estimator.py --params PARAMS --model-type TYPE --check-env --json
Read the appropriate eval guide from .claude/skills/model-researcher/references/

Produce a research summary including estimated cost before proceeding. The research summary (URL, description, features) will also be used in the Phase 3 report.

Cost Gate

Always present the cost estimate to the user before starting Phase 2.

Example output:

## Cost Estimate
- Provider: Modal (A100-40GB)
- Estimated duration: ~15 min
- Estimated cost: ~$0.53
- Alternative: HF Inference API (free, if available)

Proceed with execution? [y/N]

If cost exceeds $5, explicitly warn and ask for confirmation. For free options (HF Inference API), proceed without asking.

Phase 2: Execute

Consult the gpu-runner skill knowledge:

Choose provider based on research (check .env for available credentials, pick cheapest)
Read the provider reference from .claude/skills/gpu-runner/references/:
- HF Endpoints → hf-endpoints.md
- Colab → colab-chrome-mcp.md
- Modal → modal.md
- beam.cloud → beam-cloud.md
Write a self-contained inference script
Execute the 3-stage evaluation:
- Smoke test: Load model, generate minimal output
- Quality check: Run diverse test inputs, evaluate outputs
- Performance: Measure speed metrics (5 runs, take median)
Save script to results/YYYY-MM-DD_modelname/workspace/run.py
Save outputs to results/YYYY-MM-DD_modelname/artifacts/

If execution fails, debug and retry. If the provider fails twice, try the next provider.

Phase 3: Report

Consult the eval-reporter skill knowledge:

Write metrics.json using .claude/skills/eval-reporter/scripts/metrics_writer.py
Write the HTML report yourself — consult .claude/skills/eval-reporter/references/report-format.md for design guidelines and required sections
Include model profile from Phase 1 (URL, description, features) in the report's Model Overview section

Result directory structure:

results/YYYY-MM-DD_modelname/
├── report.html
├── metrics.json
├── artifacts/
└── workspace/
    └── run.py

Principles

Be exploratory: Don't follow a fixed script. Adapt to each model's needs.
Be honest: Report failures and limitations. "Didn't work" is a valid result.
Be cost-conscious: Use already-paid services first. Don't waste GPU time.
Be thorough: Test multiple inputs. Measure multiple times. Note edge cases.
Be reproducible: Save the exact code used. Anyone should be able to re-run it.

Date Convention

Use today's date for the result directory: results/YYYY-MM-DD_modelname/. The model name should be slugified (lowercase, hyphens): e.g., gemma-3-27b, flux-1-dev.