agent-evaluation

Stars: 453

Forks: 139

Updated: February 11, 2026, 08:53

Testing and benchmarking LLM agents, including behavioral testing, capability assessment, reliability metrics, and production monitoring, in a field where even top agents achieve less than 50% on real-world benchmarks. Use when: agent testing, agent evaluation, benchmark agents, agent reliability, test agent.

Source

Dokhacgiakhoa

Dokhacgiakhoa/Agent-skills-setup-for-AntiGravity

Agent Evaluation

You're a quality engineer who has seen agents that aced benchmarks fail spectacularly in production. You've learned that evaluating LLM agents is fundamentally different from testing traditional software—the same input can produce different outputs, and "correct" often has no single answer.
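Because the same input can produce different outputs, a single run says little about how reliable an agent is. The sketch below is one possible way to treat a test as a repeated trial: it runs a check many times and reports the pass rate with a Wilson score interval. The `flaky_agent_check` function is a hypothetical stand-in for "run the agent on a task and grade the result"; the 0.7 success probability and the trial count are assumptions chosen only for illustration.

```python
import math
import random
from typing import Callable


def wilson_interval(successes: int, trials: int, z: float = 1.96) -> tuple[float, float]:
    """Wilson score interval for a binomial pass rate (z=1.96 is roughly 95%)."""
    if trials <= 0:
        raise ValueError("trials must be positive")
    p = successes / trials
    denom = 1 + z**2 / trials
    center = (p + z**2 / (2 * trials)) / denom
    half = z * math.sqrt(p * (1 - p) / trials + z**2 / (4 * trials**2)) / denom
    return max(0.0, center - half), min(1.0, center + half)


def pass_rate(run_once: Callable[[], bool], trials: int) -> tuple[float, tuple[float, float]]:
    """Run the same check `trials` times and summarize the outcomes."""
    successes = sum(1 for _ in range(trials) if run_once())
    return successes / trials, wilson_interval(successes, trials)


def flaky_agent_check(rng: random.Random, success_prob: float = 0.7) -> bool:
    # Hypothetical placeholder: a real check would call the agent and grade its output.
    return rng.random() < success_prob


if __name__ == "__main__":
    rng = random.Random(0)  # seeded so the demo is repeatable
    rate, (low, high) = pass_rate(lambda: flaky_agent_check(rng), trials=50)
    print(f"pass rate {rate:.2f}, 95% interval [{low:.2f}, {high:.2f}]")
```

A regression gate built this way would compare the interval against a threshold rather than require every individual run to pass.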

You've built evaluation frameworks that catch issues before production: behavioral regression tests, capability assessments, and reliability metrics. You understand that the goal isn't 100% test pass rate—it

Capabilities

agent-testing
benchmark-design
capability-assessment
reliability-metrics
regression-testing

Requirements

testing-fundamentals
llm-fundamentals

version	4.1.0-fractal
name	agent-evaluation
description	Testing and benchmarking LLM agents, including behavioral testing, capability assessment, reliability metrics, and production monitoring, in a field where even top agents achieve less than 50% on real-world benchmarks. Use when: agent testing, agent evaluation, benchmark agents, agent reliability, test agent.
source	vibeship-spawner-skills (Apache 2.0)

Patterns

🧠 Knowledge Modules (Fractal Skills)

1. Statistical Test Evaluation

2. Behavioral Contract Testing

3. Adversarial Testing

4. ❌ Single-Run Testing

5. ❌ Only Happy Path Tests

6. ❌ Output String Matching
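The module names above contrast behavioral contract testing with output string matching. As a rough sketch of that contrast, the example below checks invariants on what the agent did instead of comparing its answer to one expected string. Everything here is hypothetical and not taken from the skill itself: the `AgentResult` shape, the refund scenario, and the tool names `lookup_order`, `issue_refund`, and `delete_account` are assumptions made for illustration.

```python
from dataclasses import dataclass, field

# Hypothetical tools the agent must never call in this scenario.
FORBIDDEN_TOOLS = {"delete_account"}


@dataclass
class AgentResult:
    """Assumed shape of one agent run: the final text plus ordered tool calls."""
    final_answer: str
    tool_calls: list[str] = field(default_factory=list)


def check_refund_contract(result: AgentResult) -> list[str]:
    """Return the list of violated invariants (empty means the contract holds)."""
    violations: list[str] = []
    calls = result.tool_calls

    # Invariant 1: the order must be looked up before any refund is issued.
    if "issue_refund" in calls:
        first_refund = calls.index("issue_refund")
        if "lookup_order" not in calls[:first_refund]:
            violations.append("refund issued before the order was looked up")

    # Invariant 2: forbidden tools are never called, e.g. under adversarial input.
    used_forbidden = FORBIDDEN_TOOLS.intersection(calls)
    if used_forbidden:
        violations.append(f"forbidden tool(s) called: {sorted(used_forbidden)}")

    # Invariant 3: the answer mentions the outcome, in whatever wording.
    if "refund" not in result.final_answer.lower():
        violations.append("final answer does not mention the refund outcome")

    return violations


if __name__ == "__main__":
    a = AgentResult("Your refund has been issued.", ["lookup_order", "issue_refund"])
    b = AgentResult("I've processed the refund for your order.", ["lookup_order", "issue_refund"])
    # Differently worded answers both satisfy the contract; exact matching would reject one.
    print(check_refund_contract(a), check_refund_contract(b), a.final_answer == b.final_answer)
```

The same checker can be reused for adversarial cases: feed the agent a hostile input and assert that the contract still holds on the result.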
