eval-recipes-runner

Run Microsoft's eval-recipes benchmarks to validate amplihack improvements against baseline agents. Activates when testing with eval-recipes, running evals, or benchmarking changes.

Install command

npx skills add https://github.com/rysweet/AzureHayMaker --skill eval-recipes-runner

Copy and paste this command into Claude Code to install the skill

Source

rysweet/AzureHayMaker

Stars: 0

Forks: 1

Updated: January 15, 2026, 21:26

SKILL.md

name	eval-recipes-runner
version	1.0.0
description	Run Microsoft's eval-recipes benchmarks to validate amplihack improvements against baseline agents. Activates when testing with eval-recipes, running evals, or benchmarking changes.

eval-recipes Runner Skill

Purpose

Run Microsoft's eval-recipes benchmarks to validate amplihack improvements against baseline agents.

When to Use

User asks to "test with eval-recipes"
User says "run the evals" or "benchmark this change"
User wants to validate improvements against codex/claude_code
Testing a PR branch to prove it improves scores

Capabilities

I can run eval-recipes benchmarks to:

Test specific amplihack branches
Compare against baseline agents (codex, claude_code)
Run specific tasks (linkedin_drafting, email_drafting, etc.)
Compare before/after scores for PRs
Generate reports with score improvements

How It Works

Setup (One-Time)

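# Run from the root of the repository that holds .claude/agents/eval-recipes,
# and remember that location before changing directory
PROJECT_ROOT="$(pwd)"

# Clone eval-recipes from Microsoft
git clone https://github.com/microsoft/eval-recipes.git ~/eval-recipes
cd ~/eval-recipes

# Copy our agent configs (from the project, not from the eval-recipes checkout)
cp -r "$PROJECT_ROOT"/.claude/agents/eval-recipes/* data/agents/

# Install dependencies
uv sync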

Running Benchmarks

Test a specific branch:

# Update install.dockerfile to use specific branch
# Then run benchmark
cd ~/eval-recipes
uv run eval_recipes/main.py --agent amplihack --task linkedin_drafting --trials 3

Compare before/after:

# Test baseline (main)
uv run eval_recipes/main.py --agent amplihack --task linkedin_drafting

# Test PR branch (this assumes a separate agent config named amplihack_pr1443
# whose install.dockerfile checks out the PR branch)
uv run eval_recipes/main.py --agent amplihack_pr1443 --task linkedin_drafting

# Compare the scores reported by the two runs
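
The comparison step above is left as a comment. One possible way to fill it in is sketched below, assuming the two scores have already been read from the benchmark output by hand. `score_delta` is a hypothetical helper name and is not part of eval-recipes.

```shell
#!/usr/bin/env bash
# Print the signed difference between a candidate score and a baseline score.
# score_delta is a hypothetical helper, not something eval-recipes provides.
score_delta() {
  local baseline="$1" candidate="$2"
  # awk does the floating-point subtraction that bash integer arithmetic cannot.
  awk -v b="$baseline" -v c="$candidate" 'BEGIN { printf "%+.1f\n", c - b }'
}

# Figures from the Example Usage section: 6.5 baseline, 35.2 with the PR.
score_delta 6.5 35.2   # prints +28.7
```

A positive result means the PR branch scored higher than the baseline on that task.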

Available Tasks

Common tasks from eval-recipes:

linkedin_drafting - Create tool for LinkedIn posts (scored 6.5/100 before PR #1443)
email_drafting - Create CLI tool for emails (scored 26/100 before)
arxiv_paper_summarizer - Research tool
github_docs_extractor - Documentation tool
Many more in ~/eval-recipes/data/tasks/

Typical Workflow

When user says "test this change with eval-recipes":

Identify the branch/PR to test

Update agent config to use that branch:

# In .claude/agents/eval-recipes/amplihack/install.dockerfile
RUN git clone https://github.com/rysweet/...git /tmp/amplihack && \
    cd /tmp/amplihack && \
    git checkout BRANCH_NAME && \
    pip install -e .

Copy to eval-recipes:

cp -r .claude/agents/eval-recipes/* ~/eval-recipes/data/agents/

Run benchmark:

cd ~/eval-recipes
uv run eval_recipes/main.py --agent amplihack --task TASK_NAME --trials 3

Report scores and compare with baseline
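
The branch-switching step in this workflow can be scripted rather than edited by hand. The sketch below assumes the install.dockerfile contains a single `git checkout <branch>` token, as in the snippet above. `set_checkout_branch` is a hypothetical helper name.

```shell
#!/usr/bin/env bash
# Point the `git checkout ...` line of a Dockerfile at a given branch.
# set_checkout_branch is a hypothetical helper for illustration only.
set_checkout_branch() {
  local dockerfile="$1" branch="$2" tmp
  tmp="$(mktemp)" || return 1
  # "|" is the sed delimiter because branch names such as feat/... contain "/".
  # Branch names containing "|" or "&" would need escaping first.
  if sed -E "s|(git checkout )[^ ]+|\1${branch}|" "$dockerfile" > "$tmp"; then
    # Write back through the original path so its permissions are preserved.
    cat "$tmp" > "$dockerfile"
  fi
  rm -f "$tmp"
}

# Example (path taken from the workflow above):
# set_checkout_branch .claude/agents/eval-recipes/amplihack/install.dockerfile \
#   feat/issue-1435-task-classification
```

The edited file still has to be copied into ~/eval-recipes/data/agents/ before the benchmark is run.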

Expected Scores

Baseline (main branch):

Overall: 40.6/100
LinkedIn: 6.5/100
Email: 26/100

With PR #1443 (task classification):

Overall: 55-60/100 expected (+15-20 points over baseline)
LinkedIn: 30-40/100 (creates actual tool)
Email: 45/100 (consistent execution)

Example Usage

User says: "Test PR #1443 with eval-recipes on the LinkedIn task"

I do:

Update install.dockerfile to checkout feat/issue-1435-task-classification
Copy to eval-recipes: cp -r .claude/agents/eval-recipes/* ~/eval-recipes/data/agents/
Run: cd ~/eval-recipes && uv run eval_recipes/main.py --agent amplihack --task linkedin_drafting --trials 3
Report results: "Score: 35.2/100 (up from 6.5 baseline)"

Prerequisites

eval-recipes cloned to ~/eval-recipes
API key in environment: export ANTHROPIC_API_KEY=sk-ant-...
Docker installed (for containerized runs)
uv installed: curl -LsSf https://astral.sh/uv/install.sh | sh
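
These prerequisites can be checked before starting a long benchmark run. The sketch below is one way to do that. `check_prereqs` and the `EVAL_RECIPES_DIR` override are hypothetical names introduced here, and the default directory is the ~/eval-recipes location used throughout this skill.

```shell
#!/usr/bin/env bash
# Report every missing prerequisite; return non-zero if anything is missing.
# check_prereqs and EVAL_RECIPES_DIR are hypothetical names for this sketch.
check_prereqs() {
  local cmd missing=0
  for cmd in git docker uv; do
    if ! command -v "$cmd" >/dev/null 2>&1; then
      echo "missing command: $cmd"
      missing=1
    fi
  done
  if [ -z "${ANTHROPIC_API_KEY:-}" ]; then
    echo "ANTHROPIC_API_KEY is not set"
    missing=1
  fi
  if [ ! -d "${EVAL_RECIPES_DIR:-$HOME/eval-recipes}" ]; then
    echo "eval-recipes checkout not found"
    missing=1
  fi
  return "$missing"
}

# Example: check_prereqs && echo "ready to run benchmarks"
```

The function only tests that the commands and the variable are present. It does not verify that the Docker daemon is running or that the API key is valid.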

Notes

Benchmarks take 2-15 minutes per task depending on complexity
Multiple trials (3-5) give more reliable averages
Docker builds can be cached for speed
Results saved to .benchmark_results/ in eval-recipes repo
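
The point about multiple trials can be made concrete by averaging the per-trial scores. The sketch below assumes each score file holds a single number on its first field. That file format is an assumption, not something this skill documents. `mean_score` is a hypothetical helper name.

```shell
#!/usr/bin/env bash
# Average the scores stored in the given files, printed to one decimal place.
# Assumes one numeric score per file; mean_score is a hypothetical helper.
mean_score() {
  if [ "$#" -eq 0 ]; then
    echo "usage: mean_score FILE..." >&2
    return 1
  fi
  # NF skips blank lines; awk reads each file separately, so a missing
  # trailing newline in one file cannot merge two scores together.
  awk 'NF { sum += $1; n++ }
       END { if (n == 0) exit 1; printf "%.1f\n", sum / n }' "$@"
}

# Example, using the path pattern from the Automation section:
# mean_score ~/eval-recipes/.benchmark_results/*/amplihack/*/score.txt
```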

Automation

For fully autonomous testing:

# Test suite for a PR (run from the eval-recipes checkout)
cd ~/eval-recipes
tasks="linkedin_drafting email_drafting arxiv_paper_summarizer"
for task in $tasks; do
  uv run eval_recipes/main.py --agent amplihack --task "$task" --trials 3
done

# Compare results
cat .benchmark_results/*/amplihack/*/score.txt
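
For unattended runs it can help if one failing task does not hide the status of the others. The sketch below wraps the same benchmark command in a loop that keeps going and lists failures at the end. It assumes the command exits non-zero when a run fails, which this skill does not state. `run_one` and `run_suite` are hypothetical names.

```shell
#!/usr/bin/env bash
# Hypothetical wrapper around the benchmark command used throughout this skill.
run_one() {
  uv run eval_recipes/main.py --agent "$1" --task "$2" --trials 3
}

# Run every task, continue after a failure, and report the failures at the end.
run_suite() {
  local agent="$1"
  shift
  local task failed=""
  for task in "$@"; do
    echo "running: $task"
    if ! run_one "$agent" "$task"; then
      failed="$failed $task"
    fi
  done
  if [ -n "$failed" ]; then
    echo "failed:$failed"
    return 1
  fi
  echo "all tasks completed"
}

# Example (from ~/eval-recipes):
# run_suite amplihack linkedin_drafting email_drafting arxiv_paper_summarizer
```

Keeping the command in its own function also makes it easy to swap in a dry-run stub while testing the loop.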
