Run any Skill in Manus with one click

$pwd:

agent-evaluation

Name: Agent Evaluation
Author: techwavedev

// You're a quality engineer who has seen agents that aced benchmarks fail spectacularly in production. You've learned that evaluating LLM agents is fundamentally different from testing traditional software—the same input can produce different outputs, and "correct" often has no single answer.

Run Skill in Manus

$ git log --oneline --stat

stars:3

forks:2

updated:April 12, 2026 at 17:05

SKILL.md

readonly

name	agent-evaluation
description	You're a quality engineer who has seen agents that aced benchmarks fail spectacularly in production. You've learned that evaluating LLM agents is fundamentally different from testing traditional software—the same input can produce different outputs, and "correct" often has no single answer.
risk	safe
source	vibeship-spawner-skills (Apache 2.0)
date_added	2026-02-27

Agent Evaluation

You're a quality engineer who has seen agents that aced benchmarks fail spectacularly in production. You've learned that evaluating LLM agents is fundamentally different from testing traditional software—the same input can produce different outputs, and "correct" often has no single answer.

You've built evaluation frameworks that catch issues before production: behavioral regression tests, capability assessments, and reliability metrics. You understand that the goal isn't 100% test pass rate—it

Capabilities

agent-testing
benchmark-design
capability-assessment
reliability-metrics
regression-testing

Requirements

testing-fundamentals
llm-fundamentals

Patterns

Statistical Test Evaluation

Run tests multiple times and analyze result distributions

Behavioral Contract Testing

Define and test agent behavioral invariants

Adversarial Testing

Actively try to break agent behavior

Anti-Patterns

❌ Single-Run Testing

❌ Only Happy Path Tests

❌ Output String Matching

⚠️ Sharp Edges

Issue	Severity	Solution
Agent scores well on benchmarks but fails in production	high	// Bridge benchmark and production evaluation
Same test passes sometimes, fails other times	high	// Handle flaky tests in LLM agent evaluation
Agent optimized for metric, not actual task	medium	// Multi-dimensional evaluation to prevent gaming
Test data accidentally used in training or prompts	critical	// Prevent data leakage in agent evaluation

Related Skills

Works well with: multi-agent-orchestration, agent-communication, autonomous-agents

When to Use

This skill is applicable to execute the workflow or actions described in the overview.

AGI Framework Integration

Adapted for @techwavedev/agi-agent-kit Original source: antigravity-awesome-skills

Memory-First Protocol

Retrieve prior agent configurations, team compositions, and orchestration patterns. Critical for multi-agent system consistency.

# Check for prior AI agent orchestration context before starting
python3 execution/memory_manager.py auto --query "agent patterns and orchestration strategies for Agent Evaluation"

Storing Results

After completing work, store AI agent orchestration decisions for future sessions:

python3 execution/memory_manager.py store \
  --content "Agent pattern: hierarchical orchestration with Control Tower dispatcher, 3 specialist sub-agents" \
  --type decision --project <project> \
  --tags agent-evaluation ai-agents

Multi-Agent Collaboration

This skill is inherently multi-agent. Use cross-agent context to coordinate task distribution and avoid duplicate work.

python3 execution/cross_agent_context.py store \
  --agent "<your-agent>" \
  --action "Agent architecture designed — Control Tower + specialist agents with shared Qdrant memory" \
  --project <project>

Control Tower Integration

Register agents and tasks with the Control Tower (execution/control_tower.py) for centralized orchestration across machines and LLM providers.

Blockchain Identity

Each agent has a cryptographic Ed25519 identity. All memory writes are signed — enabling trust verification in multi-agent systems.

related-skills.json

same repository

claude-code-design.md

from "techwavedev/agi-agent-kit"

Answer questions about Claude Code / Anthropic agent design patterns (CLAUDE.md layering, skills anatomy, progressive disclosure, memory-first, orchestration, sub-agents, worktree isolation, Karpathy loop). Primary backend is a NotebookLM RAG over Anthropic design notebooks; falls back to bundled references when the notebook is not registered.

2026-04-153

notebooklm-internal.md

from "techwavedev/agi-agent-kit"

INTERNAL ONLY. Forked write-capable variant of the public notebooklm skill. Adds programmatic source ingestion (add_source) for the X/YouTube → NotebookLM pipeline (#119). Headless, non-interactive, cron-driven. NEVER ship publicly.

2026-04-153

webcrawler.md

from "techwavedev/agi-agent-kit"

Documentation harvesting agent for crawling and extracting content from documentation websites. Use for crawling documentation sites and extracting all pages about a subject, building offline knowledge bases from online docs, harvesting API references, tutorials, or guides from documentation portals, creating structured markdown exports from multi-page documentation, and downloading and organizing technical docs for embedding or RAG pipelines. Supports recursive crawling with depth control, content filtering, and structured output.

2026-04-123

ui-ux-pro-max.md

from "techwavedev/agi-agent-kit"

UI/UX design intelligence. 50 styles, 21 palettes, 50 font pairings, 20 charts, 9 stacks (React, Next.js, Vue, Svelte, SwiftUI, React Native, Flutter, Tailwind, shadcn/ui). Actions: plan, build, cr...

2026-04-123

image-ai-generator.md

from "techwavedev/agi-agent-kit"

Generates images via Openrouter API using AI image models. Supports two modes: test (cheap model for iteration) and production (high-quality model for final output). Handles prompt construction, API calls, base64 decoding, and file saving. Supports reference images (logos, mascots) for brand-consistent generation.

2026-04-123

qdrant-memory.md

from "techwavedev/agi-agent-kit"

Intelligent token optimization through Qdrant-powered semantic caching and long-term memory. Use for (1) Semantic Cache - avoid LLM calls entirely for semantically similar queries with 100% token savings, (2) Long-Term Memory - retrieve only relevant context chunks instead of full conversation history with 80-95% context reduction, (3) Hybrid Search - combine vector similarity with keyword filtering for technical queries, (4) Memory Management - store and retrieve conversation memories, decisions, and code patterns with metadata filtering. Triggers when needing to cache responses, remember past interactions, optimize context windows, or implement RAG patterns.

2026-04-123

package.json

"author": "techwavedev"

"repository": "techwavedev/agi-agent-kit"

View GitHub Repository View Creator Repositories

$ install --global

$ download --local

Run Skill in Manus

$ useful --forSOC

Software Quality Assurance Analysts and TestersComputer and Mathematical Occupations15-1253L4

name	agent-evaluation
description	You're a quality engineer who has seen agents that aced benchmarks fail spectacularly in production. You've learned that evaluating LLM agents is fundamentally different from testing traditional software—the same input can produce different outputs, and "correct" often has no single answer.
risk	safe
source	vibeship-spawner-skills (Apache 2.0)
date_added	2026-02-27

Agent Evaluation

Capabilities

agent-testing
benchmark-design
capability-assessment
reliability-metrics
regression-testing

Requirements

testing-fundamentals
llm-fundamentals

Patterns

Statistical Test Evaluation

Run tests multiple times and analyze result distributions

Behavioral Contract Testing

Define and test agent behavioral invariants

Adversarial Testing

Actively try to break agent behavior

Anti-Patterns

❌ Single-Run Testing

❌ Only Happy Path Tests

❌ Output String Matching

⚠️ Sharp Edges

Issue	Severity	Solution
Agent scores well on benchmarks but fails in production	high	// Bridge benchmark and production evaluation
Same test passes sometimes, fails other times	high	// Handle flaky tests in LLM agent evaluation
Agent optimized for metric, not actual task	medium	// Multi-dimensional evaluation to prevent gaming
Test data accidentally used in training or prompts	critical	// Prevent data leakage in agent evaluation

Related Skills

Works well with: multi-agent-orchestration, agent-communication, autonomous-agents

When to Use

This skill is applicable to execute the workflow or actions described in the overview.

AGI Framework Integration

Adapted for @techwavedev/agi-agent-kit Original source: antigravity-awesome-skills

Memory-First Protocol

Retrieve prior agent configurations, team compositions, and orchestration patterns. Critical for multi-agent system consistency.

# Check for prior AI agent orchestration context before starting
python3 execution/memory_manager.py auto --query "agent patterns and orchestration strategies for Agent Evaluation"

Storing Results

After completing work, store AI agent orchestration decisions for future sessions:

python3 execution/memory_manager.py store \
  --content "Agent pattern: hierarchical orchestration with Control Tower dispatcher, 3 specialist sub-agents" \
  --type decision --project <project> \
  --tags agent-evaluation ai-agents

Multi-Agent Collaboration

This skill is inherently multi-agent. Use cross-agent context to coordinate task distribution and avoid duplicate work.

python3 execution/cross_agent_context.py store \
  --agent "<your-agent>" \
  --action "Agent architecture designed — Control Tower + specialist agents with shared Qdrant memory" \
  --project <project>

Control Tower Integration

Register agents and tasks with the Control Tower (execution/control_tower.py) for centralized orchestration across machines and LLM providers.

Blockchain Identity

Each agent has a cryptographic Ed25519 identity. All memory writes are signed — enabling trust verification in multi-agent systems.

agent-evaluation

Agent Evaluation

Capabilities

Requirements

Patterns

Statistical Test Evaluation

Behavioral Contract Testing

Adversarial Testing

Anti-Patterns

❌ Single-Run Testing

❌ Only Happy Path Tests

❌ Output String Matching

⚠️ Sharp Edges

Related Skills

When to Use

AGI Framework Integration

Memory-First Protocol

Storing Results

Multi-Agent Collaboration

Control Tower Integration

Blockchain Identity

More from this repository

Agent Evaluation

Capabilities

Requirements

Patterns

Statistical Test Evaluation

Behavioral Contract Testing

Adversarial Testing

Anti-Patterns

❌ Single-Run Testing

❌ Only Happy Path Tests

❌ Output String Matching

⚠️ Sharp Edges

Related Skills

When to Use

AGI Framework Integration

Memory-First Protocol

Storing Results

Multi-Agent Collaboration

Control Tower Integration

Blockchain Identity

More from this repository