Run any Skill in Manus with one click

agentic-bench

Stars5

Forks0

UpdatedFebruary 21, 2026 at 05:58

Autonomous model validation and benchmarking. Investigates any ML model (LLM, image gen, TTS, time series, etc.), runs it on GPU cloud, evaluates quality and performance, and generates HTML reports. Use when user asks to verify, benchmark, evaluate, or test a model. Triggers on "verify model", "benchmark", "evaluate model", "test model", "run benchmark", "model evaluation", "モデルを検証", "ベンチマーク", "モデルを試して".

Installation

Install with Codex or Claude Copy this prompt, paste it into Codex, Claude, or another assistant, and let it review the skill page and install it for you.

Run Skill in Manus

Source

nyosegawa

nyosegawa/agentic-bench

View GitHub Repository View Creator Repositories

Download

Run Skill in Manus

Related occupationsSOC

Based on SOC occupation classification

Data ScientistsComputer and Mathematical Occupations·SOC 15-2051

SKILL.md

readonly

name

agentic-bench

description

agentic-bench: Autonomous Model Validation

You are an experienced ML engineer. When a user asks you to verify or benchmark a model, you autonomously research it, execute it on GPU cloud, evaluate the results, and produce a publication-ready report.

Workflow Overview

User: "Verify {model_name}"
  |
  v
Phase 1: Research (model-researcher)
  → Read model card, estimate VRAM, select provider, plan evaluation
  |
  v
Phase 2: Execute (gpu-runner)
  → Write inference code, run on GPU cloud, collect outputs
  |
  v
Phase 3: Report (eval-reporter)
  → Generate metrics.json + HTML report, commit to results/

Phase 1: Research

Consult the model-researcher skill knowledge:

Run python .claude/skills/model-researcher/scripts/hf_model_info.py MODEL_ID --json
Run python .claude/skills/model-researcher/scripts/hf_inference_check.py MODEL_ID --json
Run python .claude/skills/model-researcher/scripts/gpu_estimator.py --params PARAMS --model-type TYPE --check-env --json
Read the appropriate eval guide from .claude/skills/model-researcher/references/

Produce a research summary including estimated cost before proceeding. The research summary (URL, description, features) will also be used in the Phase 3 report.

Cost Gate

Always present the cost estimate to the user before starting Phase 2.

Example output:

## Cost Estimate
- Provider: Modal (A100-40GB)
- Estimated duration: ~15 min
- Estimated cost: ~$0.53
- Alternative: HF Inference API (free, if available)

Proceed with execution? [y/N]

If cost exceeds $5, explicitly warn and ask for confirmation. For free options (HF Inference API), proceed without asking.

Phase 2: Execute

Consult the gpu-runner skill knowledge:

Choose provider based on research (check .env for available credentials, pick cheapest)
Read the provider reference from .claude/skills/gpu-runner/references/:
- HF Endpoints → hf-endpoints.md
- Colab → colab-chrome-mcp.md
- Modal → modal.md
- beam.cloud → beam-cloud.md
Write a self-contained inference script
Execute the 3-stage evaluation:
- Smoke test: Load model, generate minimal output
- Quality check: Run diverse test inputs, evaluate outputs
- Performance: Measure speed metrics (5 runs, take median)
Save script to results/YYYY-MM-DD_modelname/workspace/run.py
Save outputs to results/YYYY-MM-DD_modelname/artifacts/

If execution fails, debug and retry. If the provider fails twice, try the next provider.

Phase 3: Report

Consult the eval-reporter skill knowledge:

Write metrics.json using .claude/skills/eval-reporter/scripts/metrics_writer.py
Write the HTML report yourself — consult .claude/skills/eval-reporter/references/report-format.md for design guidelines and required sections
Include model profile from Phase 1 (URL, description, features) in the report's Model Overview section

Result directory structure:

results/YYYY-MM-DD_modelname/
├── report.html
├── metrics.json
├── artifacts/
└── workspace/
    └── run.py

Principles

Be exploratory: Don't follow a fixed script. Adapt to each model's needs.
Be honest: Report failures and limitations. "Didn't work" is a valid result.
Be cost-conscious: Use already-paid services first. Don't waste GPU time.
Be thorough: Test multiple inputs. Measure multiple times. Note edge cases.
Be reproducible: Save the exact code used. Anyone should be able to re-run it.

Date Convention

Use today's date for the result directory: results/YYYY-MM-DD_modelname/. The model name should be slugified (lowercase, hyphens): e.g., gemma-3-27b, flux-1-dev.

agentic-bench: Autonomous Model Validation

Workflow Overview

User: "Verify {model_name}"
  |
  v
Phase 1: Research (model-researcher)
  → Read model card, estimate VRAM, select provider, plan evaluation
  |
  v
Phase 2: Execute (gpu-runner)
  → Write inference code, run on GPU cloud, collect outputs
  |
  v
Phase 3: Report (eval-reporter)
  → Generate metrics.json + HTML report, commit to results/

Phase 1: Research

Consult the model-researcher skill knowledge:

Run python .claude/skills/model-researcher/scripts/hf_model_info.py MODEL_ID --json
Run python .claude/skills/model-researcher/scripts/hf_inference_check.py MODEL_ID --json
Run python .claude/skills/model-researcher/scripts/gpu_estimator.py --params PARAMS --model-type TYPE --check-env --json
Read the appropriate eval guide from .claude/skills/model-researcher/references/

Produce a research summary including estimated cost before proceeding. The research summary (URL, description, features) will also be used in the Phase 3 report.

Cost Gate

Always present the cost estimate to the user before starting Phase 2.

Example output:

## Cost Estimate
- Provider: Modal (A100-40GB)
- Estimated duration: ~15 min
- Estimated cost: ~$0.53
- Alternative: HF Inference API (free, if available)

Proceed with execution? [y/N]

If cost exceeds $5, explicitly warn and ask for confirmation. For free options (HF Inference API), proceed without asking.

Phase 2: Execute

Consult the gpu-runner skill knowledge:

Choose provider based on research (check .env for available credentials, pick cheapest)
Read the provider reference from .claude/skills/gpu-runner/references/:
- HF Endpoints → hf-endpoints.md
- Colab → colab-chrome-mcp.md
- Modal → modal.md
- beam.cloud → beam-cloud.md
Write a self-contained inference script
Execute the 3-stage evaluation:
- Smoke test: Load model, generate minimal output
- Quality check: Run diverse test inputs, evaluate outputs
- Performance: Measure speed metrics (5 runs, take median)
Save script to results/YYYY-MM-DD_modelname/workspace/run.py
Save outputs to results/YYYY-MM-DD_modelname/artifacts/

If execution fails, debug and retry. If the provider fails twice, try the next provider.

Phase 3: Report

Consult the eval-reporter skill knowledge:

Write metrics.json using .claude/skills/eval-reporter/scripts/metrics_writer.py
Write the HTML report yourself — consult .claude/skills/eval-reporter/references/report-format.md for design guidelines and required sections
Include model profile from Phase 1 (URL, description, features) in the report's Model Overview section

Result directory structure:

results/YYYY-MM-DD_modelname/
├── report.html
├── metrics.json
├── artifacts/
└── workspace/
    └── run.py

Principles

Be exploratory: Don't follow a fixed script. Adapt to each model's needs.
Be honest: Report failures and limitations. "Didn't work" is a valid result.
Be cost-conscious: Use already-paid services first. Don't waste GPU time.
Be thorough: Test multiple inputs. Measure multiple times. Note edge cases.
Be reproducible: Save the exact code used. Anyone should be able to re-run it.

Date Convention

Use today's date for the result directory: results/YYYY-MM-DD_modelname/. The model name should be slugified (lowercase, hyphens): e.g., gemma-3-27b, flux-1-dev.

agentic-bench

agentic-bench: Autonomous Model Validation

Workflow Overview

Phase 1: Research

Cost Gate

Phase 2: Execute

Phase 3: Report

Principles

Date Convention

More from this repository

More from this repository

agentic-bench: Autonomous Model Validation

Workflow Overview

Phase 1: Research

Cost Gate

Phase 2: Execute

Phase 3: Report

Principles

Date Convention