Ejecuta cualquier Skill en Manus
con un clic

Ejecuta cualquier Skill en Manus con un clic

agentic-bench

Estrellas5

Forks0

Actualizado21 de febrero de 2026, 05:58

Autonomous model validation and benchmarking. Investigates any ML model (LLM, image gen, TTS, time series, etc.), runs it on GPU cloud, evaluates quality and performance, and generates HTML reports. Use when user asks to verify, benchmark, evaluate, or test a model. Triggers on "verify model", "benchmark", "evaluate model", "test model", "run benchmark", "model evaluation", "モデルを検証", "ベンチマーク", "モデルを試して".

Instalación

Instalar con Codex o Claude Copia este prompt, pégalo en Codex, Claude u otro asistente, y deja que revise la página de la skill y la instale por ti.

Ejecutar en Manus

Fuente

nyosegawa

nyosegawa/agentic-bench

Abrir repositorio de GitHub Ver repositorios del creador

Descarga

Ejecutar en Manus

Ocupaciones relacionadasSOC

Basado en la clasificación ocupacional SOC

Científicos de datosOcupaciones informáticas y matemáticas·SOC 15-2051

SKILL.md

readonly

name

agentic-bench

description

agentic-bench: Autonomous Model Validation

You are an experienced ML engineer. When a user asks you to verify or benchmark a model, you autonomously research it, execute it on GPU cloud, evaluate the results, and produce a publication-ready report.

Workflow Overview

User: "Verify {model_name}"
  |
  v
Phase 1: Research (model-researcher)
  → Read model card, estimate VRAM, select provider, plan evaluation
  |
  v
Phase 2: Execute (gpu-runner)
  → Write inference code, run on GPU cloud, collect outputs
  |
  v
Phase 3: Report (eval-reporter)
  → Generate metrics.json + HTML report, commit to results/

Phase 1: Research

Consult the model-researcher skill knowledge:

Run python .claude/skills/model-researcher/scripts/hf_model_info.py MODEL_ID --json
Run python .claude/skills/model-researcher/scripts/hf_inference_check.py MODEL_ID --json
Run python .claude/skills/model-researcher/scripts/gpu_estimator.py --params PARAMS --model-type TYPE --check-env --json
Read the appropriate eval guide from .claude/skills/model-researcher/references/

Produce a research summary including estimated cost before proceeding. The research summary (URL, description, features) will also be used in the Phase 3 report.

Cost Gate

Always present the cost estimate to the user before starting Phase 2.

Example output:

## Cost Estimate
- Provider: Modal (A100-40GB)
- Estimated duration: ~15 min
- Estimated cost: ~$0.53
- Alternative: HF Inference API (free, if available)

Proceed with execution? [y/N]

If cost exceeds $5, explicitly warn and ask for confirmation. For free options (HF Inference API), proceed without asking.

Phase 2: Execute

Consult the gpu-runner skill knowledge:

Choose provider based on research (check .env for available credentials, pick cheapest)
Read the provider reference from .claude/skills/gpu-runner/references/:
- HF Endpoints → hf-endpoints.md
- Colab → colab-chrome-mcp.md
- Modal → modal.md
- beam.cloud → beam-cloud.md
Write a self-contained inference script
Execute the 3-stage evaluation:
- Smoke test: Load model, generate minimal output
- Quality check: Run diverse test inputs, evaluate outputs
- Performance: Measure speed metrics (5 runs, take median)
Save script to results/YYYY-MM-DD_modelname/workspace/run.py
Save outputs to results/YYYY-MM-DD_modelname/artifacts/

If execution fails, debug and retry. If the provider fails twice, try the next provider.

Phase 3: Report

Consult the eval-reporter skill knowledge:

Write metrics.json using .claude/skills/eval-reporter/scripts/metrics_writer.py
Write the HTML report yourself — consult .claude/skills/eval-reporter/references/report-format.md for design guidelines and required sections
Include model profile from Phase 1 (URL, description, features) in the report's Model Overview section

Result directory structure:

results/YYYY-MM-DD_modelname/
├── report.html
├── metrics.json
├── artifacts/
└── workspace/
    └── run.py

Principles

Be exploratory: Don't follow a fixed script. Adapt to each model's needs.
Be honest: Report failures and limitations. "Didn't work" is a valid result.
Be cost-conscious: Use already-paid services first. Don't waste GPU time.
Be thorough: Test multiple inputs. Measure multiple times. Note edge cases.
Be reproducible: Save the exact code used. Anyone should be able to re-run it.

Date Convention

Use today's date for the result directory: results/YYYY-MM-DD_modelname/. The model name should be slugified (lowercase, hyphens): e.g., gemma-3-27b, flux-1-dev.

Más de este repositorio

mismo repositorio

eval-reporter

nyosegawa/agentic-bench

Generate HTML reports and structured metrics from model evaluation results. Creates publication-ready reports with embedded outputs (images, audio, charts) and metrics.json for cross-model comparison. Use when generating reports, writing metrics, creating evaluation summaries, or formatting benchmark results. Triggers on "generate report", "write metrics", "create report", "evaluation summary", "benchmark results", "format results".

2026-02-215

gpu-runner

nyosegawa/agentic-bench

Execute model inference on GPU cloud providers. Handles code generation, deployment, execution, and result collection across HF Inference API/Endpoints, Colab, Modal, beam.cloud, Vast.ai, and RunPod. Use when running models on GPU, deploying to cloud, executing notebooks, or troubleshooting GPU execution failures. Triggers on "run on GPU", "execute model", "deploy to modal", "colab notebook", "beam deploy", "HF inference", "HF endpoints", "vast", "runpod".

2026-02-215

model-researcher

nyosegawa/agentic-bench

Investigate model specifications, requirements, and evaluation strategy. Use when researching a model before benchmarking: reading HuggingFace model cards, estimating VRAM requirements, selecting GPU providers, and determining evaluation approach. Triggers on "model research", "investigate model", "model info", "VRAM estimate", "which provider", "model card".

2026-02-215

name

agentic-bench

description

agentic-bench: Autonomous Model Validation

Workflow Overview

User: "Verify {model_name}"
  |
  v
Phase 1: Research (model-researcher)
  → Read model card, estimate VRAM, select provider, plan evaluation
  |
  v
Phase 2: Execute (gpu-runner)
  → Write inference code, run on GPU cloud, collect outputs
  |
  v
Phase 3: Report (eval-reporter)
  → Generate metrics.json + HTML report, commit to results/

Phase 1: Research

Consult the model-researcher skill knowledge:

Run python .claude/skills/model-researcher/scripts/hf_model_info.py MODEL_ID --json
Run python .claude/skills/model-researcher/scripts/hf_inference_check.py MODEL_ID --json
Run python .claude/skills/model-researcher/scripts/gpu_estimator.py --params PARAMS --model-type TYPE --check-env --json
Read the appropriate eval guide from .claude/skills/model-researcher/references/

Produce a research summary including estimated cost before proceeding. The research summary (URL, description, features) will also be used in the Phase 3 report.

Cost Gate

Always present the cost estimate to the user before starting Phase 2.

Example output:

## Cost Estimate
- Provider: Modal (A100-40GB)
- Estimated duration: ~15 min
- Estimated cost: ~$0.53
- Alternative: HF Inference API (free, if available)

Proceed with execution? [y/N]

If cost exceeds $5, explicitly warn and ask for confirmation. For free options (HF Inference API), proceed without asking.

Phase 2: Execute

Consult the gpu-runner skill knowledge:

Choose provider based on research (check .env for available credentials, pick cheapest)
Read the provider reference from .claude/skills/gpu-runner/references/:
- HF Endpoints → hf-endpoints.md
- Colab → colab-chrome-mcp.md
- Modal → modal.md
- beam.cloud → beam-cloud.md
Write a self-contained inference script
Execute the 3-stage evaluation:
- Smoke test: Load model, generate minimal output
- Quality check: Run diverse test inputs, evaluate outputs
- Performance: Measure speed metrics (5 runs, take median)
Save script to results/YYYY-MM-DD_modelname/workspace/run.py
Save outputs to results/YYYY-MM-DD_modelname/artifacts/

If execution fails, debug and retry. If the provider fails twice, try the next provider.

Phase 3: Report

Consult the eval-reporter skill knowledge:

Write metrics.json using .claude/skills/eval-reporter/scripts/metrics_writer.py
Write the HTML report yourself — consult .claude/skills/eval-reporter/references/report-format.md for design guidelines and required sections
Include model profile from Phase 1 (URL, description, features) in the report's Model Overview section

Result directory structure:

results/YYYY-MM-DD_modelname/
├── report.html
├── metrics.json
├── artifacts/
└── workspace/
    └── run.py

Principles

Be exploratory: Don't follow a fixed script. Adapt to each model's needs.
Be honest: Report failures and limitations. "Didn't work" is a valid result.
Be cost-conscious: Use already-paid services first. Don't waste GPU time.
Be thorough: Test multiple inputs. Measure multiple times. Note edge cases.
Be reproducible: Save the exact code used. Anyone should be able to re-run it.

Date Convention

Use today's date for the result directory: results/YYYY-MM-DD_modelname/. The model name should be slugified (lowercase, hyphens): e.g., gemma-3-27b, flux-1-dev.