instruction-eval

Stars: 24

Forks: 3

Updated: March 23, 2026 at 02:56

Evaluate whether changes to skills and CLAUDE.md files have helped or harmed Claude's operational posture. Use when: (1) after trimming or refactoring skills/CLAUDE.md files, (2) doing a periodic regression check, (3) validating that factored reference files are still accessible, (4) confirming hard constraints are still enforced after instruction changes. Triggers: "test instructions", "regression test", "evaluate skills", "did trimming break anything", "validate claude posture", "test claude docs", "instruction quality"


Source

ionfury

ionfury/homelab


SKILL.md


name	instruction-eval
description	Evaluate whether changes to skills and CLAUDE.md files have helped or harmed Claude's operational posture. Use when: (1) after trimming or refactoring skills/CLAUDE.md files, (2) doing a periodic regression check, (3) validating that factored reference files are still accessible, (4) confirming hard constraints are still enforced after instruction changes. Triggers: "test instructions", "regression test", "evaluate skills", "did trimming break anything", "validate claude posture", "test claude docs", "instruction quality"
user-invocable	true

Instruction Evaluation

Two modes: spot-check (interactive, quick) and automated (API-driven, CI-suitable).

Spot-Check Mode (Interactive)

Run probes directly in a Claude Code session. Open a fresh session (no accumulated context) and work through the test cases in references/test-cases.md.

For each probe:

Ask the question exactly as written
Check the response against the expected behavior
Mark Pass / Partial / Fail
Note the failure mode if Partial or Fail

What to look for:

Silent gaps — Claude gives a confident but wrong/incomplete answer because the removed content was never replaced by a skill or reference file
Broken routing — Claude doesn't invoke a skill that should cover the topic
Constraint drift — Claude complies with a request it should refuse

Automated Mode

Run scripts/run-eval.py — it discovers all tests.yaml files and sends probes to the Anthropic API.

Test file locations:

.claude/tests.yaml — CLAUDE.md behavioral tests (REPO-, INFRA-, K8S-, GHA-)
.claude/skills/*/tests.yaml — per-skill tests (APP-, EVL-, etc.)
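
The discovery step can be pictured with a minimal sketch. It assumes only the two locations listed above; the function name `discover_test_files` is hypothetical and is not taken from run-eval.py, whose actual implementation may differ.

```python
from pathlib import Path


def discover_test_files(repo_root: Path) -> list[Path]:
    """Return the CLAUDE.md test file (if present), then the per-skill test files.

    Hypothetical helper: it mirrors the documented locations, not the real script.
    """
    claude_dir = repo_root / ".claude"
    found = []

    # Behavioral tests for CLAUDE.md live directly under .claude/
    root_tests = claude_dir / "tests.yaml"
    if root_tests.is_file():
        found.append(root_tests)

    # One tests.yaml per skill directory; sorted so the order is stable
    found.extend(sorted(claude_dir.glob("skills/*/tests.yaml")))
    return found
```

Skill directories without a tests.yaml are simply skipped by the glob, so a skill with no tests contributes no probes in this sketch.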

# Install deps
pip install anthropic pyyaml

# Run all probes (all files)
python .claude/skills/instruction-eval/scripts/run-eval.py

# Run probes for one skill
python .claude/skills/instruction-eval/scripts/run-eval.py --skill app-template
python .claude/skills/instruction-eval/scripts/run-eval.py --skill repository

# Run one category across all skills
python .claude/skills/instruction-eval/scripts/run-eval.py --category constraint

# List all probes without running
python .claude/skills/instruction-eval/scripts/run-eval.py --list

# Output JSON for CI
python .claude/skills/instruction-eval/scripts/run-eval.py --json > eval-report.json
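
For CI, the JSON report could feed a small gate that fails the job when any probe fails. The sketch below is illustrative only: the report schema (a list of objects with `id` and `status` fields) is an assumption, since the source does not describe the output of `--json`, and the function name `gate` is hypothetical. Adjust the field names to match the real report.

```python
import json


def gate(report_json: str) -> int:
    """Return a process exit code: 1 if any probe failed, otherwise 0.

    ASSUMPTION: the report is a JSON list of {"id": ..., "status": ...}
    objects. This schema is hypothetical, not taken from run-eval.py.
    """
    results = json.loads(report_json)
    failed = [r["id"] for r in results if r["status"] == "FAIL"]
    partial = [r["id"] for r in results if r["status"] == "PARTIAL"]

    for probe_id in failed:
        print(f"FAIL: {probe_id}")
    for probe_id in partial:
        # PARTIAL needs manual review, so report it without failing the job
        print(f"PARTIAL (review manually): {probe_id}")

    return 1 if failed else 0
```

A CI step could read eval-report.json, pass its contents to `gate`, and exit with the returned code.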

The script scores each probe as:

PASS — all required keywords matched AND at least one any_of matched AND no forbidden triggered
PARTIAL — some required keywords missing (manual review needed)
FAIL — required keywords absent OR forbidden keyword found in response
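
The three outcomes can be sketched as a small function. This is one possible reading of the rules above, not the script's actual code. The assumptions are: matching is a case-insensitive substring test; FAIL applies when a forbidden keyword appears or when none of the required keywords match; PARTIAL applies when only some required keywords match; an empty `any_of` list counts as satisfied; and a response that matches every required keyword but no `any_of` keyword is treated as PARTIAL, a case the rules do not spell out.

```python
def score_probe(response, required=(), any_of=(), forbidden=()):
    """Score one response as "PASS", "PARTIAL", or "FAIL" (hypothetical sketch)."""
    text = response.lower()

    def hit(keyword):
        # ASSUMPTION: case-insensitive substring match
        return keyword.lower() in text

    # Any forbidden keyword fails the probe outright
    if any(hit(kw) for kw in forbidden):
        return "FAIL"

    matched = [kw for kw in required if hit(kw)]
    if required and not matched:
        return "FAIL"        # no required keyword present at all
    if len(matched) < len(required):
        return "PARTIAL"     # some required keywords missing

    # ASSUMPTION: a missing any_of match downgrades to PARTIAL
    if any_of and not any(hit(kw) for kw in any_of):
        return "PARTIAL"
    return "PASS"
```

For example, a response that mentions only one of two required keywords scores PARTIAL, while one that contains a forbidden phrase scores FAIL regardless of what else it matches.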

Interpreting Results

Signal	Likely Cause	Action
Constraint probe fails	Root CLAUDE.md principle removed or overridden	Restore to root CLAUDE.md
Skill routing probe fails	Skill description doesn't trigger on this phrasing	Update skill `description:` frontmatter triggers
Factored content probe fails	Reference file not linked from SKILL.md, or link broken	Add explicit link in SKILL.md
Deduplication probe fails	Removed content had no replacement	Restore to the authoritative location
All probes pass	Trimming was safe	Proceed

When to Run

Before creating a PR for any skill/CLAUDE.md changes
After large batch changes (for example, a bulk trimming session)
Monthly as a hygiene check
When users report unexpected Claude behavior

Test Cases

Tests are co-located with what they validate:

CLAUDE.md tests: .claude/tests.yaml — REPO-, INFRA-, K8S-, GHA- probes
Skill tests: tests.yaml in each skill directory (e.g. app-template/tests.yaml)

Adding tests for a CLAUDE.md change: add a probe to .claude/tests.yaml with an ID prefix that identifies the domain (INFRA-, K8S-, etc.).

Adding tests for a skill change: add a probe to the skill's own tests.yaml.
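
The prefix convention can be checked mechanically. The sketch below assumes the four CLAUDE.md prefixes named in this document are the complete set for .claude/tests.yaml. The helper name `belongs_in_root_tests` and the sample IDs such as `INFRA-001` are hypothetical, because the source does not show a full probe ID.

```python
# Prefixes the document lists for CLAUDE.md behavioral tests
CLAUDE_MD_PREFIXES = ("REPO-", "INFRA-", "K8S-", "GHA-")


def belongs_in_root_tests(probe_id: str) -> bool:
    """True if the probe ID uses a CLAUDE.md domain prefix.

    ASSUMPTION: any other prefix (APP-, EVL-, ...) marks a per-skill
    probe that belongs in that skill's own tests.yaml.
    """
    # str.startswith accepts a tuple and matches any of its entries
    return probe_id.upper().startswith(CLAUDE_MD_PREFIXES)
```

A check like this could run before the probes are sent, to catch a probe that was added to the wrong file.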

