eval-outcomes
// Holdout-safe Outcomes grading: project the locked eval substrate, then ingest one verdict.
Review completed work and learn.
Make an out-of-session Claude (Managed Agent or Agent SDK loop) AgentOps-native — via skills + the ao CLI + CI, not hooks.
Explain AgentOps workflows.
Run autonomous improvement loops.
| Field | Value |
| --- | --- |
| name | eval-outcomes |
| description | Holdout-safe Outcomes grading: project the locked eval substrate, then ingest one verdict. |
Quick Ref: Grade an agent's output via Outcomes (or any model) without forking the bar.
`ao eval outcomes compile` projects a locked Task into a holdout-safe rubric (it refuses to leak `target`/`ground_truth`, since Managed Agents are not ZDR); grade it; `ao eval outcomes ingest` writes the score back as the one council verdict record. Outcomes is a projection, never an alternate authority.
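To make "holdout-safe" concrete, here is a minimal Python sketch of the idea behind the projection: drop the holdout fields before anything leaves the machine, then verify that nothing leaked. It is not the `ao` implementation. Only the field names `target` and `ground_truth` come from the text above; the task layout (`id`, `prompt`, `samples`) and the helper names `project_rubric` and `find_leaks` are hypothetical.

```python
# Illustrative only: the real `ao eval outcomes compile` logic is not shown in this document.
# The two field names come from the text above; everything else is assumed.
HOLDOUT_FIELDS = frozenset({"target", "ground_truth"})


def project_rubric(node):
    """Return a copy of `node` with holdout fields removed at every nesting level."""
    if isinstance(node, dict):
        return {k: project_rubric(v) for k, v in node.items() if k not in HOLDOUT_FIELDS}
    if isinstance(node, list):
        return [project_rubric(v) for v in node]
    return node


def find_leaks(node, path=""):
    """List the paths of any holdout fields still present in `node`."""
    leaks = []
    if isinstance(node, dict):
        for key, value in node.items():
            child = f"{path}.{key}" if path else key
            if key in HOLDOUT_FIELDS:
                leaks.append(child)
            leaks.extend(find_leaks(value, child))
    elif isinstance(node, list):
        for index, value in enumerate(node):
            leaks.extend(find_leaks(value, f"{path}[{index}]"))
    return leaks


# Hypothetical locked task; the schema is an assumption for illustration.
task = {
    "id": "task-1",
    "prompt": "Summarize the incident report.",
    "target": "expected summary",
    "samples": [{"input": "report text", "ground_truth": "gold summary"}],
}

rubric = project_rubric(task)
if find_leaks(rubric):
    raise ValueError("refusing to emit a rubric that contains holdout fields")
```

The check after projection mirrors the "refuses to leak" behavior: the rubric is only usable if the leak scan comes back empty, and the original task is left unmodified.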
Codex has no Managed Agents loop. Run the same `ao eval outcomes compile <input.json>` to get a holdout-safe rubric, grade it locally (Inspect AI over the dev split, or the bushido llama.cpp Qwen grader over tailnet), then run `ao eval outcomes ingest <score.json> --json`. Net effect: Codex never touches the cloud Outcomes API but still produces a byte-identical verdict record.
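The local compile, grade, ingest sequence could be scripted roughly as follows. This is a sketch under stated assumptions, not the canonical workflow. The two `ao` command lines are taken from the paragraph above; that `compile` prints the rubric to stdout, the shape of the score file, and the `grade` callable are all assumptions. The `run` parameter exists only so the flow can be dry-run without the `ao` binary installed.

```python
import json
import subprocess
from pathlib import Path


def compile_cmd(input_path):
    """Command line from the text: ao eval outcomes compile <input.json>."""
    return ["ao", "eval", "outcomes", "compile", str(input_path)]


def ingest_cmd(score_path):
    """Command line from the text: ao eval outcomes ingest <score.json> --json."""
    return ["ao", "eval", "outcomes", "ingest", str(score_path), "--json"]


def run_local_grade(input_path, score_path, grade, run=subprocess.run):
    """Compile a rubric, grade it with a local callable, then ingest the score.

    Assumptions: `compile` writes the rubric to stdout, and `grade` returns a
    JSON-serializable score object. Neither is confirmed by this document.
    """
    compiled = run(compile_cmd(input_path), capture_output=True, text=True, check=True)
    score = grade(compiled.stdout)  # hypothetical local grader (e.g. a wrapper around your own model)
    Path(score_path).write_text(json.dumps(score))
    return run(ingest_cmd(score_path), capture_output=True, text=True, check=True)
```

Passing a stub for `run` that records the argument lists is a cheap way to confirm the ordering (compile first, ingest last) before pointing the script at a real repository.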
Load and follow the skill instructions from the sibling `SKILL.md`, or read `skills/eval-outcomes/SKILL.md` in the host repo for the canonical specification. Then apply the three-phase Workflow (compile → grade → ingest), honoring the Critical Constraints: never send holdout `target`/`ground_truth`/PII; carry `judge_content_hash`; register a global Dolt burn for holdout grades.