agent-evaluation

Tests and benchmarks LLM agents covering behavioral testing, capability assessment, reliability metrics, and production monitoring. Use when evaluating agent quality, designing eval suites, building regression tests, or measuring real-world reliability beyond benchmark scores.

Stars: 11

Forks: 1

Updated: May 25, 2026 at 13:43

Source

VKirill

VKirill/antigravity-for-claude-code

Useful for: Software Quality Assurance Analysts and Testers (Computer and Mathematical Occupations, SOC 15-1253, L4)

| Topic | File |
| --- | --- |
| Index + framework decision map | references/REFERENCE.md |
| Behavioral testing: invariant, contract, snapshot, regression suites | references/behavioral-testing.md |
| Capability benchmarks: SWE-bench, AgentBench, HumanEval, Pass@k, calibration | references/capability-benchmarks.md |
| Reliability metrics: N-run consistency, failure taxonomy, error budgets, SLOs | references/reliability-metrics.md |
| LLM-as-judge: rubric design, bias mitigation, dual-judge, human correlation | references/llm-as-judge.md |
| Production monitoring: LangSmith, LangFuse, Arize Phoenix, Helicone, drift detection | references/production-monitoring.md |
| Red teaming: prompt injection, jailbreaks, tool misuse, data leakage, adversarial inputs | references/red-teaming.md |
| Statistical design: sample size, CIs, A/B testing, multi-armed bandit | references/statistical-design.md |
| Routing eval cases: 10 positive / 10 negative / 5 edge | references/eval-cases.md |
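
The capability-benchmarks entry above names Pass@k. As a rough illustration only, and not taken from the referenced file, the sketch below computes the commonly used unbiased estimator 1 - C(n-c, k) / C(n, k), where n is the number of sampled attempts per task and c is how many of them passed. The function name `pass_at_k` and its argument names are assumptions made for this example.

```python
from math import comb


def pass_at_k(n: int, c: int, k: int) -> float:
    """Estimate pass@k from n sampled attempts, c of which passed.

    Hypothetical helper for illustration; not part of the skill's files.
    """
    if not 0 <= c <= n:
        raise ValueError("c must satisfy 0 <= c <= n")
    if not 1 <= k <= n:
        raise ValueError("k must satisfy 1 <= k <= n")
    # Fewer than k failing attempts: every size-k subset contains a pass.
    if n - c < k:
        return 1.0
    # Probability that a random size-k subset holds only failures,
    # subtracted from 1.
    return 1.0 - comb(n - c, k) / comb(n, k)


if __name__ == "__main__":
    # 10 attempts on one task, 3 passed.
    for k in (1, 5, 10):
        print(k, round(pass_at_k(10, 3, k), 4))
```

For a suite of tasks, one would typically average this per-task value across tasks; how many attempts to sample per task is a statistical-design choice this sketch does not address.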

name	agent-evaluation
description	Tests and benchmarks LLM agents covering behavioral testing, capability assessment, reliability metrics, and production monitoring. Use when evaluating agent quality, designing eval suites, building regression tests, or measuring real-world reliability beyond benchmark scores.
tags	["agents","testing"]
source	vibeship-spawner-skills (Apache 2.0)
risk	low-stakes

agent-evaluation

Usage

Use this skill when

Do not use this skill when

Purpose

Capabilities

Behavioral Testing

Capability Benchmarks

Reliability Metrics

LLM-as-Judge

Production Monitoring

Red Teaming

Statistical Design

Behavioral Traits

Important Constraints

Related Skills

API Reference
