agent-evaluation

Stars: 453

Forks: 139

Last updated: February 11, 2026 at 08:53

Testing and benchmarking LLM agents, including behavioral testing, capability assessment, reliability metrics, and production monitoring, in a setting where even top agents achieve less than 50% on real-world benchmarks. Use when: agent testing, agent evaluation, benchmark agents, agent reliability, test agent.

Source

Dokhacgiakhoa

Dokhacgiakhoa/Agent-skills-setup-for-AntiGravity

Related occupations (SOC)

Based on the SOC occupational classification

Software Quality Assurance Analysts and Testers · Computer and Mathematical Occupations · SOC 15-1253

SKILL.md

version	4.1.0-fractal
name	agent-evaluation
description	Testing and benchmarking LLM agents, including behavioral testing, capability assessment, reliability metrics, and production monitoring, in a setting where even top agents achieve less than 50% on real-world benchmarks. Use when: agent testing, agent evaluation, benchmark agents, agent reliability, test agent.
source	vibeship-spawner-skills (Apache 2.0)

Agent Evaluation

You're a quality engineer who has seen agents that aced benchmarks fail spectacularly in production. You've learned that evaluating LLM agents is fundamentally different from testing traditional software—the same input can produce different outputs, and "correct" often has no single answer.
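The point that "correct" often has no single answer can be made concrete by checking properties of an output instead of comparing it to one expected string. The following is a minimal sketch, assuming a hypothetical agent that replies with a JSON object carrying `status` and `answer` fields; the function name `check_contract` and both field names are illustrative and do not come from this skill.

```python
import json


def check_contract(output: str) -> list[str]:
    """Return the list of violated properties; an empty list means the output passes.

    The contract here is an assumption made for illustration: the agent must
    reply with a JSON object holding a known status and a non-empty answer.
    """
    try:
        data = json.loads(output)
    except json.JSONDecodeError:
        return ["output is not valid JSON"]
    if not isinstance(data, dict):
        return ["output is not a JSON object"]

    violations = []
    if data.get("status") not in {"ok", "error"}:
        violations.append("status must be 'ok' or 'error'")
    answer = data.get("answer")
    if not isinstance(answer, str) or not answer.strip():
        violations.append("answer must be a non-empty string")
    return violations


if __name__ == "__main__":
    # Two differently worded replies to the same prompt both satisfy the contract.
    reply_a = '{"status": "ok", "answer": "Paris is the capital of France."}'
    reply_b = '{"answer": "The capital is Paris", "status": "ok"}'
    print(check_contract(reply_a), check_contract(reply_b))
```

A check of this kind tolerates wording differences between runs while still failing on structural breakage, which exact string comparison cannot do.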

You've built evaluation frameworks that catch issues before production: behavioral regression tests, capability assessments, and reliability metrics. You understand that the goal isn't 100% test pass rate—it

Capabilities

agent-testing
benchmark-design
capability-assessment
reliability-metrics
regression-testing

Requirements

testing-fundamentals
llm-fundamentals

Patterns

🧠 Knowledge Modules (Fractal Skills)

1. Statistical Test Evaluation

2. Behavioral Contract Testing

3. Adversarial Testing

4. ❌ Single-Run Testing

5. ❌ Only Happy Path Tests

6. ❌ Output String Matching
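The module list above contrasts statistical evaluation with single-run testing. One way to sketch that contrast, under stated assumptions: run the same test case many times, report the pass rate, and gate on the lower bound of a Wilson score interval instead of on a single outcome. The names `wilson_interval`, `evaluate`, and `run_once`, the default of 50 trials, and the 0.8 threshold are all hypothetical choices for illustration, not part of this skill.

```python
import math
import random
from typing import Callable


def wilson_interval(successes: int, trials: int, z: float = 1.96) -> tuple[float, float]:
    """Wilson score interval for a binomial proportion (z=1.96 is roughly 95%)."""
    if trials <= 0:
        raise ValueError("trials must be positive")
    p = successes / trials
    denom = 1 + z * z / trials
    center = (p + z * z / (2 * trials)) / denom
    half = z * math.sqrt(p * (1 - p) / trials + z * z / (4 * trials * trials)) / denom
    return max(0.0, center - half), min(1.0, center + half)


def evaluate(run_once: Callable[[], bool], trials: int = 50,
             min_pass_rate: float = 0.8) -> dict:
    """Run one test case `trials` times and summarize its reliability.

    `run_once` stands in for "invoke the agent and check the result"; it must
    return True on success. The case passes only if the interval's lower
    bound clears `min_pass_rate`.
    """
    successes = sum(1 for _ in range(trials) if run_once())
    low, high = wilson_interval(successes, trials)
    return {
        "pass_rate": successes / trials,
        "ci_low": low,
        "ci_high": high,
        "passed": low >= min_pass_rate,
    }


if __name__ == "__main__":
    # Stub agent that succeeds about 85% of the time (seeded for repeatability).
    rng = random.Random(0)
    print(evaluate(lambda: rng.random() < 0.85))
```

Gating on the lower bound is a deliberately conservative design choice: an observed 40 out of 50 passes does not clear a 0.8 threshold, because the interval still allows a true pass rate well below it.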
