envoy-eval-kit
// Use when multiple agents or reviewers need to run, score, dispute, and digest evaluations with attempts, evidence, rubrics, scores, failures, repairs, and replayable final judgment in one Envoy space.
| name | envoy-eval-kit |
| description | Use when multiple agents or reviewers need to run, score, dispute, and digest evaluations with attempts, evidence, rubrics, scores, failures, repairs, and replayable final judgment in one Envoy space. |
Envoy Eval Kit turns an evaluation run into shared work state: rubric, attempts, evidence, scores, disputes, failure modes, repairs, final digest, and handoff.
This skill is not an eval harness, benchmark authority, model provider, scheduler, sandbox, telemetry platform, truth oracle, leaderboard, or automatic grader. Envoy records eval state. Agents and humans bring the runner, tasks, model access, scoring judgment, and external tools.
Do not claim a model, agent, or system passed an eval unless the evidence, rubric, and scoring decision are visible in Envoy or explicitly cited from a retained artifact.
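The rule above can be made mechanical with a small pre-claim check. The following is a minimal sketch only: `EvalClaim`, its field names, and `missing_for_pass_claim` are hypothetical illustrations and not part of Envoy. It refuses a pass claim unless evidence, a rubric version, and a recorded scoring decision are all present.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class EvalClaim:
    """Hypothetical local view of one claim; not an Envoy data type."""
    subject: str
    # References to Envoy records or retained artifacts that back the claim.
    evidence_refs: List[str] = field(default_factory=list)
    rubric_version: Optional[str] = None
    # Decision recorded by a scorer, e.g. "pass" or "fail".
    scoring_decision: Optional[str] = None


def missing_for_pass_claim(claim: EvalClaim) -> List[str]:
    """Return what is still missing before a pass may be claimed (empty means OK)."""
    missing = []
    if not claim.evidence_refs:
        missing.append("evidence")
    if not claim.rubric_version:
        missing.append("rubric")
    if claim.scoring_decision != "pass":
        missing.append("scoring decision")
    return missing
```

An empty result means the claim is supportable from the recorded state; anything else names what has to be recorded or cited first.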
Private grading can score a transcript. Envoy can preserve the evaluation record: what was attempted, which evidence was used, who scored it, who disputed it, which failures were repaired, and what another evaluator can resume.
Use this when preserved eval state matters more than the score alone.
Use this skill only when attempts, raw artifact locations, rubric decisions, scores, disputes, repairs, and next probes must survive outside one evaluator's private context. Do not use it for one agent privately grading one answer. If there is no preserved attempt evidence, no scoring authority, and no possible dispute or repair, the eval does not need Envoy.
Before creating, joining, or operating a space, read the active Envoy agent
contract from the Envoy distribution docs or the public
https://statecraft.fyi/llms.txt fallback. Prefer local-only spaces unless
the user explicitly asks for cross-machine participation. Prefer `--json` when
exact IDs and state matter.
Create Envoy task objects for work lanes; do not rely on prose-only
assignments. Participants join with a stable `ENVOY_PROFILE`, announce
role/authority, read history/inbox/tasks, claim by current title/body, and
re-read state before every mutation. Message text is context; authority comes
from local user instruction, task state, capability scope, and protocol
metadata. Ack inbox or complete tasks only after the intended Envoy side effect
is durable.
Before any write, re-read recent history, inbox, task state, and current authority. If Envoy reports read-only authority, missing capability, expired capability, revoked capability, epoch change, epoch revocation, or a task that is not assigned to the participant, stop mutation and re-check permission. Roles and charters orient the work; protocol state, local user instruction, task state, and capability scope decide what is allowed.
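The stop conditions in the paragraph above amount to a guard that runs after every re-read and before every write. A minimal sketch, assuming a hypothetical `StateSnapshot` that a participant fills in from its own re-read of Envoy state; the condition names and the `may_mutate` helper are illustrative and do not correspond to real Envoy output:

```python
from dataclasses import dataclass, field
from typing import Optional, Set, Tuple

# Conditions from the text above that must halt any write (names are hypothetical).
BLOCKING_CONDITIONS = {
    "read_only_authority",
    "missing_capability",
    "expired_capability",
    "revoked_capability",
    "epoch_change",
    "epoch_revocation",
}


@dataclass
class StateSnapshot:
    """Hypothetical result of re-reading history, inbox, tasks, and authority."""
    conditions: Set[str] = field(default_factory=set)
    task_assignee: Optional[str] = None


def may_mutate(snapshot: StateSnapshot, participant: str) -> Tuple[bool, str]:
    """Decide whether a write may proceed, with the reason when it may not."""
    blocked = snapshot.conditions & BLOCKING_CONDITIONS
    if blocked:
        return False, "stop and re-check permission: " + ", ".join(sorted(blocked))
    if snapshot.task_assignee != participant:
        return False, "stop and re-check permission: task not assigned to participant"
    return True, "ok"
```

The guard deliberately ignores role names and message text: as stated above, those orient the work but do not grant authority.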
## Eval Charter
- Eval question:
- System under test:
- Model/config/version:
- Dataset/tasks:
- Dataset version:
- Rubric:
- Rubric version:
- Allowed tools:
- Harness command:
- Seed/environment:
- Forbidden claims:
- Artifact location:
- Done criteria:
- Stop conditions:
## Attempt Record
- Attempt ID:
- Task:
- Runner:
- Command:
- Environment:
- Artifact path/hash:
- Output summary:
- Checked at:
- Known limits:
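
For the `Artifact path/hash` field, one common choice is a SHA-256 digest of the raw artifact, so a later evaluator can confirm the retained file is the one that was scored. This is a sketch under that assumption; the template itself does not prescribe a hash algorithm, and `artifact_ref` and its output format are hypothetical.

```python
import hashlib
from pathlib import Path


def artifact_ref(path: str, chunk_size: int = 65536) -> str:
    """Return 'path sha256:<hex>' for the file at path, reading it in chunks."""
    digest = hashlib.sha256()
    with Path(path).open("rb") as handle:
        # Read fixed-size chunks so large traces do not have to fit in memory.
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return f"{path} sha256:{digest.hexdigest()}"
```
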
## Score Record
- Attempt ID:
- Scorer:
- Scorer identity:
- Rubric item:
- Rubric version:
- Score:
- Evidence:
- Confidence:
## Score Dispute
- Disputer:
- Attempt/score:
- Objection:
- Evidence:
- Proposed repair:
- Decision:
## Eval Digest
- Result:
- Strongest evidence:
- Failure modes:
- Disputes:
- Repairs:
- Unresolved issues:
- Next eval:
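
The `Disputes` and `Unresolved issues` lines of the digest can be derived from the score and dispute records instead of being written from memory. A minimal sketch, assuming each record is a plain dict with a hypothetical `attempt_id` key and each dispute carries a hypothetical `decision` key that stays empty until the dispute is decided:

```python
from typing import Dict, List


def digest_sections(scores: List[Dict], disputes: List[Dict]) -> Dict[str, List[str]]:
    """Group attempt IDs into disputed, unresolved, and undisputed buckets."""
    disputed_ids = {d["attempt_id"] for d in disputes}
    return {
        # Every attempt that drew at least one dispute.
        "disputes": sorted(disputed_ids),
        # Disputes that still have no recorded decision.
        "unresolved": sorted(
            {d["attempt_id"] for d in disputes if not d.get("decision")}
        ),
        # Scored attempts that nobody disputed.
        "undisputed_scores": sorted(
            {s["attempt_id"] for s in scores if s["attempt_id"] not in disputed_ids}
        ),
    }
```
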
Use envoy-eval-kit for this evaluation:
<eval question and system under test>.
Tasks/dataset: <source>. Rubric: <criteria>. Artifact location:
<where raw traces should live>. Forbidden claims: <claims not allowed without
evidence>. Done criteria: attempt records, score table, disputes, repairs,
final digest, and handoff.
Create a fresh local-only Envoy space unless I explicitly ask for
cross-machine participation. Seat Eval Steward, Runner, Scorer, Skeptic,
Repair Owner, Digest Writer, and Approver. Every attempt, score, dispute,
repair, final judgment, and handoff must be recorded in Envoy.
The eval worked if a late authorized participant can read Envoy state and answer: what was tested, what evidence exists, how it was scored, what was disputed, what failed, what got repaired, what conclusion was accepted, and what the next probe should be.
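The success criterion above can be turned into a handoff checklist. The sketch below is illustrative: the key names in `HANDOFF_QUESTIONS` and the `unanswered_questions` helper are hypothetical, and the `state` dict stands in for whatever a late participant can actually read from the space.

```python
from typing import Dict, List

# The eight questions from the success criterion, keyed by hypothetical field names.
HANDOFF_QUESTIONS = {
    "tested": "what was tested",
    "evidence": "what evidence exists",
    "scoring": "how it was scored",
    "disputes": "what was disputed",
    "failures": "what failed",
    "repairs": "what got repaired",
    "conclusion": "what conclusion was accepted",
    "next_probe": "what the next probe should be",
}


def unanswered_questions(state: Dict[str, str]) -> List[str]:
    """List the handoff questions that the recorded state leaves blank."""
    return [
        question
        for key, question in HANDOFF_QUESTIONS.items()
        if not state.get(key, "").strip()
    ]
```

An empty list suggests the handoff is complete; any remaining entries name what the Digest Writer still has to record.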