agent-evaluation

Tests and benchmarks LLM agents covering behavioral testing, capability assessment, reliability metrics, and production monitoring. Use when evaluating agent quality, designing eval suites, building regression tests, or measuring real-world reliability beyond benchmark scores.

Stars: 11

Forks: 1

Updated: May 25, 2026, 13:43

Source

VKirill

VKirill/antigravity-for-claude-code

Applicable occupation (SOC)

Software Quality Assurance Analysts and Testers · Computer and Mathematical Occupations · 15-1253 · L4

Files: 11 (showing SKILL.md, read-only)

Topic	File
Index + framework decision map	references/REFERENCE.md
Behavioral testing: invariant, contract, snapshot, regression suites	references/behavioral-testing.md
Capability benchmarks: SWE-bench, AgentBench, HumanEval, Pass@k, calibration	references/capability-benchmarks.md
Reliability metrics: N-run consistency, failure taxonomy, error budgets, SLOs	references/reliability-metrics.md
LLM-as-judge: rubric design, bias mitigation, dual-judge, human correlation	references/llm-as-judge.md
Production monitoring: LangSmith, LangFuse, Arize Phoenix, Helicone, drift detection	references/production-monitoring.md
Red teaming: prompt injection, jailbreaks, tool misuse, data leakage, adversarial inputs	references/red-teaming.md
Statistical design: sample size, CIs, A/B testing, multi-armed bandit	references/statistical-design.md
Routing eval cases: 10 positive / 10 negative / 5 edge	references/eval-cases.md
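
The index above names Pass@k and confidence intervals without showing how they are computed. As a rough illustration only, the sketch below uses the standard unbiased pass@k estimator, 1 - C(n-c, k) / C(n, k), and a Wilson score interval for an N-run success rate. The function names `pass_at_k` and `wilson_interval` are hypothetical and are not taken from this skill's reference files, which may define these metrics differently.

```python
from math import comb, sqrt


def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k estimate from n samples of which c passed.

    Hypothetical helper: computes 1 - C(n - c, k) / C(n, k).
    """
    if not 0 <= c <= n or not 1 <= k <= n:
        raise ValueError("require 0 <= c <= n and 1 <= k <= n")
    if n - c < k:
        # Fewer than k failures exist, so any k samples include a pass.
        return 1.0
    return 1.0 - comb(n - c, k) / comb(n, k)


def wilson_interval(successes: int, runs: int, z: float = 1.96) -> tuple[float, float]:
    """Wilson score interval for a success rate over repeated runs.

    Hypothetical helper: z = 1.96 corresponds to roughly 95% coverage.
    """
    if runs <= 0 or not 0 <= successes <= runs:
        raise ValueError("require runs > 0 and 0 <= successes <= runs")
    p = successes / runs
    denom = 1.0 + z * z / runs
    center = (p + z * z / (2 * runs)) / denom
    half = z * sqrt(p * (1 - p) / runs + z * z / (4 * runs * runs)) / denom
    return max(0.0, center - half), min(1.0, center + half)


if __name__ == "__main__":
    # Assumed example numbers: 10 attempts per task, 3 of them passing.
    print(f"pass@1 = {pass_at_k(10, 3, 1):.3f}")
    print(f"pass@5 = {pass_at_k(10, 3, 5):.3f}")
    # Assumed example: an agent succeeds on 8 of 10 repeated runs.
    low, high = wilson_interval(8, 10)
    print(f"success rate 0.80, 95% interval [{low:.3f}, {high:.3f}]")
```

With only 10 runs the interval is wide, which is one reason to report a rate from repeated runs together with its interval rather than a single pass/fail result.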

name	agent-evaluation
description	Tests and benchmarks LLM agents covering behavioral testing, capability assessment, reliability metrics, and production monitoring. Use when evaluating agent quality, designing eval suites, building regression tests, or measuring real-world reliability beyond benchmark scores.
tags	["agents","testing"]
source	vibeship-spawner-skills (Apache 2.0)
risk	low-stakes

agent-evaluation

Usage

Use this skill when

Do not use this skill when

Purpose

Capabilities

Behavioral Testing

Capability Benchmarks

Reliability Metrics

LLM-as-Judge

Production Monitoring

Red Teaming

Statistical Design

Behavioral Traits

Important Constraints

Related Skills

API Reference
