agent-evaluation

Stars: 453

Forks: 139

Updated: February 11, 2026 at 08:53

Testing and benchmarking LLM agents, including behavioral testing, capability assessment, reliability metrics, and production monitoring, where even top agents achieve less than 50% on real-world benchmarks. Use when: agent testing, agent evaluation, benchmark agents, agent reliability, test agent.

Source

Dokhacgiakhoa

Dokhacgiakhoa/Agent-skills-setup-for-AntiGravity

Related occupations (SOC occupation classification)

Software Quality Assurance Analysts and Testers · Computer and Mathematical Occupations · SOC 15-1253

SKILL.md

version	4.1.0-fractal
name	agent-evaluation
description	Testing and benchmarking LLM agents, including behavioral testing, capability assessment, reliability metrics, and production monitoring, where even top agents achieve less than 50% on real-world benchmarks. Use when: agent testing, agent evaluation, benchmark agents, agent reliability, test agent.
source	vibeship-spawner-skills (Apache 2.0)

Agent Evaluation

You're a quality engineer who has seen agents that aced benchmarks fail spectacularly in production. You've learned that evaluating LLM agents is fundamentally different from testing traditional software—the same input can produce different outputs, and "correct" often has no single answer.

You've built evaluation frameworks that catch issues before production: behavioral regression tests, capability assessments, and reliability metrics. You understand that the goal isn't 100% test pass rate—it

Capabilities

agent-testing
benchmark-design
capability-assessment
reliability-metrics
regression-testing

Requirements

testing-fundamentals
llm-fundamentals

Patterns

🧠 Knowledge Modules (Fractal Skills)

1. Statistical Test Evaluation

2. Behavioral Contract Testing

3. Adversarial Testing

4. ❌ Single-Run Testing

5. ❌ Only Happy Path Tests

6. ❌ Output String Matching
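
The module names above suggest running each test many times, checking behavior rather than exact strings, and reporting a pass rate instead of a single pass/fail. The following is a minimal sketch of that idea, not the skill's own implementation: `wilson_interval`, `evaluate`, `make_flaky_agent`, and `refund_contract` are hypothetical names introduced here, the agent is a seeded stand-in rather than a real LLM call, and the Wilson score interval is one reasonable choice of confidence interval for a pass rate.

```python
import math
import random
from typing import Callable


def wilson_interval(passes: int, runs: int, z: float = 1.96) -> tuple[float, float]:
    """Wilson score interval for a binomial pass rate (z=1.96 is roughly 95%)."""
    if runs <= 0:
        raise ValueError("runs must be positive")
    p = passes / runs
    denom = 1 + z**2 / runs
    centre = (p + z**2 / (2 * runs)) / denom
    half = z * math.sqrt(p * (1 - p) / runs + z**2 / (4 * runs**2)) / denom
    return max(0.0, centre - half), min(1.0, centre + half)


def evaluate(
    agent: Callable[[str], str],
    prompt: str,
    contract: Callable[[str], bool],
    runs: int = 20,
) -> dict:
    """Run the agent `runs` times on one prompt and summarize contract passes."""
    passes = sum(1 for _ in range(runs) if contract(agent(prompt)))
    low, high = wilson_interval(passes, runs)
    return {
        "passes": passes,
        "runs": runs,
        "pass_rate": passes / runs,
        "ci_low": low,
        "ci_high": high,
    }


def make_flaky_agent(seed: int = 0) -> Callable[[str], str]:
    """Hypothetical stand-in for a non-deterministic agent (assumption, not a real API)."""
    rng = random.Random(seed)

    def agent(prompt: str) -> str:
        if rng.random() < 0.8:
            # Two differently worded but equally acceptable answers.
            return rng.choice([
                "Refund approved for order 123.",
                "Order 123: your refund is approved.",
            ])
        return "I cannot help with that."

    return agent


def refund_contract(output: str) -> bool:
    """Behavioral contract: checks required facts, not an exact output string."""
    text = output.lower()
    return all(token in text for token in ("refund", "approved", "123"))


if __name__ == "__main__":
    report = evaluate(make_flaky_agent(), "Refund order 123", refund_contract, runs=50)
    print(report)
```

Because the contract accepts any wording that carries the required facts, both phrasings from the stand-in agent pass, which an exact string comparison would not allow. The same `evaluate` call could be pointed at adversarial or malformed prompts, with a contract describing the safe behavior expected there, so that coverage is not limited to the happy path.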
