| name | cautilus-agent |
| description | Use when intentful behavior evaluation itself is the task and the repo should run Cautilus's checked-in workflow instead of reconstructing compare, held-out, and review commands by hand. |
Cautilus Agent
Use Cautilus Agent when intentful behavior evaluation itself is the task and the repo wants to run the checked-in Cautilus workflow instead of rebuilding eval fixtures, packets, reports, claim discovery, review, or improve commands by hand.
For external host repos during the current contract rewrite, treat evaluate fixture, evaluate observation, and post-run eval skill-experiment compare as stable; claim discovery automation, improve automation, live app-runner workflows, and review-learning capture remain opt-in.
eval skill-experiment compare compares host-preserved baseline and variant outputs; it does not clone, install, or execute skills.
Cautilus Agent assumes a Cautilus binary is available.
In the Cautilus product repo itself, prefer the checked-in source launcher ./bin/cautilus over cautilus on PATH, because the installed machine binary can lag the current checkout.
In consumer repos, use cautilus on PATH.
If no binary is available, install the CLI first and verify with cautilus --version.
To materialize Cautilus Agent in a host repo, run cautilus init --repo-root . from the repo root.
Product Shape
Cautilus is a standalone binary plus Cautilus Agent.
Host repos own adapters, fixtures, prompts, wrappers, and policy.
The binary owns command discovery, packet examples, deterministic scans, validation, and reusable evaluation artifacts.
Cautilus Agent owns routing, sequencing, user-facing decision boundaries, and LLM-backed claim review work.
eval and improve may still exercise model-involving behavior through adapter-owned runners.
The current external-adoption front door is eval: verify bounded intentful behavior with explicit fixtures and adapters, and compare host-preserved skill experiment outputs after a run.
The broader product also includes opt-in claim and improve surfaces.
During the contract rewrite, do not present claim, improve, live app-runner workflows, or review-learning packet capture as stable cross-repo defaults.
CLI First
Resolve the binary before workflow commands:
CAUTILUS_BIN=cautilus
if [ -x ./bin/cautilus ]; then CAUTILUS_BIN=./bin/cautilus; fi
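The resolution step above can be wrapped so that a missing binary fails loudly instead of surfacing later as a confusing workflow error. This is a minimal sketch: resolve_cautilus_bin is a hypothetical helper name, not something the CLI ships.

```shell
# resolve_cautilus_bin (hypothetical helper, not part of the Cautilus CLI)
# Prefers the checked-in launcher, falls back to PATH, and fails otherwise.
resolve_cautilus_bin() {
  if [ -x ./bin/cautilus ]; then
    printf '%s\n' ./bin/cautilus
  elif command -v cautilus >/dev/null 2>&1; then
    printf '%s\n' cautilus
  else
    echo "cautilus not found: install the CLI, then verify with cautilus --version" >&2
    return 1
  fi
}
```

A caller might use it as CAUTILUS_BIN=$(resolve_cautilus_bin) || exit 1 before any workflow command.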
Let the binary print command families and packet examples.
Use "$CAUTILUS_BIN" commands --json, "$CAUTILUS_BIN" --help, portable cautilus doctor commands --json / cautilus --help, or a command's --example-input / --example-output surface instead of copying broad command lists into the answer.
Use command-cookbook.md only after the binary has identified the relevant command family and a concrete multi-step invocation is needed.
No-Input Orientation
When invoked with no task detail, orient first:
"$CAUTILUS_BIN" doctor status --repo-root . --json
Read cautilus.agent_status.v1 as the current product map.
Summarize binary health, agent-surface readiness, adapter state, selected claim-state availability, scan entries, linked Markdown depth, and nextBranches.
When claimState.orientationState is present, treat it as the selected claim map for status and branch commands.
Keep claimState.configuredState as the writable discovery baseline, not necessarily the most useful review packet.
Then help the user pick the next branch or stop, presenting branch labels and reasons in coordinator-facing language first.
Treat each nextBranches[].label as the human choice name; branch id is a secondary stable packet reference only when it helps auditability.
If the user answers with a number or short label after you present branch choices, resolve that answer against the branch labels you just showed and execute that selected branch.
Do not reinterpret a numbered branch selection as a request for another status summary.
If nextBranches includes initialize_adapter and the user delegated setup continuation, run the adapter setup branch and then rerun doctor status.
If claim state is missing, present the bounded scan entries and depth, explicitly ask the user to "confirm this scan scope or adjust it", state that LLM review needs a separate review budget, and only then enter claim discovery.
If claim state exists, read or refresh that packet before planning new proof work.
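The number-or-label rule above can be sketched as a small lookup over the labels that were just presented. The helper name resolve_branch_choice and the prefix-matching behavior for short labels are assumptions made for illustration; the labels themselves would come from nextBranches[].label.

```shell
# resolve_branch_choice ANSWER LABEL...   (hypothetical helper)
# Maps a reply (1-based number or label prefix) onto the presented labels,
# in the order they were shown. Returns 1 when nothing matches.
resolve_branch_choice() {
  answer=$1
  shift
  [ -n "$answer" ] || return 1
  i=1
  for label in "$@"; do
    case $answer in
      *[!0-9]*)
        # Non-numeric reply: treat it as a literal label prefix.
        case $label in
          "$answer"*) printf '%s\n' "$label"; return 0 ;;
        esac
        ;;
      *)
        # Numeric reply: position in the list the user just saw.
        if [ "$i" -eq "$answer" ]; then
          printf '%s\n' "$label"
          return 0
        fi
        ;;
    esac
    i=$((i + 1))
  done
  return 1
}
```

A failed lookup is a reason to re-ask with the same labels, not to print another status summary.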
Declared Claim Discovery
Use this path when the user asks whether a repo proves what it claims, whether docs and behavior are aligned, or which scenarios still need to be created.
In external consumer repos during the contract rewrite, use this path only after the user explicitly opts into non-eval Cautilus work.
For these direct questions, do not run discover claims until scan entries/depth are stated and the user confirms or adjusts the scope; keep LLM review as a separate budgeted branch.
Do not hard-code the search to README; by default, the binary starts from adapter-owned claim_discovery.entries or README.md/AGENTS.md/CLAUDE.md and follows repo-local Markdown links to depth 3.
Use repeated --source arguments only when the user or adapter has selected an explicit truth-surface inventory.
Initial scan:
"$CAUTILUS_BIN" discover claims --repo-root . --output <claims.json>
cautilus discover claims --repo-root . --output <claims.json>
Refresh from existing state:
"$CAUTILUS_BIN" discover claims --previous <claims.json> --refresh-plan --output <refresh-plan.json>
When the selected branch is "compare the saved claim map with recent repo changes", "refresh existing claim state", or an equivalent refresh_claims_from_diff branch, run the refresh-plan command above and read refreshSummary.
Do not satisfy that branch by running only discover claims status; discover claims status belongs to the separate "inspect the saved claim map" branch.
Read refreshSummary first.
Explain it as a saved claim map catching up to current repo changes, not as a completed claim refresh.
Say what was recorded, what was not changed yet, and which next choice is safest.
Prefer labels such as "compare the saved claim map with recent repo changes" and "update the saved claim map before review or eval planning"; include internal branch ids only as references when needed.
After writing a refresh plan, inspect .refreshSummary directly, for example with jq '.refreshSummary' <refresh-plan.json>.
Do not reconstruct changedClaimCount, carriedForwardClaimCount, or source hotspots from raw changedSources or claimPlan when refreshSummary exists.
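To keep the refresh branch and the inspect branch from being confused, the branch-to-command mapping can be written down explicitly. This sketch only prints the subcommand arguments; claims_args_for_branch is a hypothetical helper, the paths are assumed to contain no spaces, and the status form mirrors the discover claims status --input <claims.json> --sample-claims 10 invocation used elsewhere in this document.

```shell
# claims_args_for_branch BRANCH CLAIMS_JSON REFRESH_PLAN_JSON   (hypothetical helper)
# Prints the discover-claims arguments that belong to the selected branch.
claims_args_for_branch() {
  branch=$1
  claims=$2
  plan=$3
  case $branch in
    refresh_claims_from_diff|"refresh existing claim state"|"compare the saved claim map with recent repo changes")
      # Refresh branches always go through the refresh plan.
      printf '%s\n' "discover claims --previous $claims --refresh-plan --output $plan"
      ;;
    "inspect the saved claim map")
      # Inspection is the only branch satisfied by status alone.
      printf '%s\n' "discover claims status --input $claims --sample-claims 10"
      ;;
    *)
      return 1
      ;;
  esac
}
```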
Status from existing state:
"$CAUTILUS_BIN" discover claims status --input <claims.json> --sample-claims 10
Before creating fixtures, keep proof class and readiness separate.
Use human-auditable for source/doc judgment, deterministic for unit/lint/type/build/schema/CI proof, cautilus-eval for model/agent/prompt/skill/workflow evidence, needs-scenario for claims needing scenario decomposition, and needs-alignment for docs/code/adapter/skill surfaces that must be reconciled before proof would be honest.
After discovery or refresh, summarize scanned entry files, linked Markdown count and depth, raw candidate count, claim summary by proof mechanism/readiness/evidence/review/lifecycle, and the groups that look ready for deterministic tests, Cautilus scenarios, alignment work, or human-auditable review.
When the next natural branch is claim review, explain that it is a budgeted LLM review branch before presenting it as a choice.
The coordinator should understand that choosing review means setting a review budget before any reviewer lanes, result application, eval planning, edits, or commits happen.
Use discover claims status --sample-claims <n> as the canonical status view before hand-inspecting packet fields.
Read discoveryBoundary before judging false negatives.
If a declared promise is present inside the configured entry docs or linked Markdown graph and discovery missed it, report a discover claims false-negative bug; if a user-facing feature is missing from that graph, report it as an entry-surface or narrative gap unless another reviewed artifact proves the binary skipped an in-scope declaration.
During curation, reduce false positives before proof work and scan for likely missing public promises so the user can decide whether they were intentionally excluded or simply under-documented.
Read actionSummary.primaryBuckets before making a next-work recommendation.
Use the bucket recommendedActor and summary fields to separate agent work, human confirmation, deterministic proof, Cautilus eval planning, scenario design, alignment, and split-or-defer branches.
Use actionSummary.crossCuttingSignals for review debt or stale-evidence warnings that can coexist with a primary proof branch.
When preparing a focused review queue, pass --action-bucket <bucket> to discover claims review-input instead of hand-filtering claim JSON.
When triaging needs-scenario or needs-alignment, stay inside packet surfaces: prepare the focused action bucket, record reclassification decisions as cautilus.claim_review_result.v1, and apply them with discover claims apply-review.
Promote only concrete command, packet, runner, schema, or skill-behavior claims to ready-for-proof; keep product-boundary, future-surface, ownership, umbrella, and readiness-definition claims non-ready until split or aligned.
In the Cautilus product repo, when raw claim status or review packets are too large for a maintainer to judge directly, run npm run claims:status-report and read .cautilus/claims/claim-status-report.md before asking for human decisions.
If the maintainer is reviewing from a constrained terminal or phone, run npm run claims:status-server so they can read the report in a browser and save section comments as .cautilus/claims/claim-status-comments.json.
When raw candidates are too granular for product judgment, curate a canonical claim spec tree before continuing HITL.
Treat raw candidates as high-recall proof-planning inputs, not as the human-facing promise map.
For Cautilus itself, keep the curated user-facing promises in docs/specs/user/index.spec.md and per-claim pages under docs/specs/user/.
Keep the maintainer-facing index in docs/specs/contracts/index.spec.md.
In the Cautilus product repo, product-meaning review should start from those spec docs; use the status report for packet audit, debugging, or deciding which remaining raw candidates are not yet absorbed.
User-facing claims must use plain product language.
Order user-facing spec indexes by the user's feature mental model before cross-cutting implementation promises.
For Cautilus itself, lead with claim, eval, improve, then doctor or readiness, then supporting promises such as portability, packet/reporting surfaces, and proof-debt visibility.
For other repos, infer the equivalent top user jobs from the adapter, README, and source docs instead of copying Cautilus-specific command names.
Use adapter semantic_groups, source-doc headings, declared product surfaces, and README or guide structure as the portable signal for those top user jobs.
Keep each index short; put subclaims and evidence placeholders in the per-claim spec pages.
Maintainer-facing claims may use internal terms, but they must stay aligned with the user-facing claim specs and preserve source refs, proof route, evidence status, and next action.
Review packets and curation artifacts preserve absorbed raw claim ids and fingerprints when available; stable catalog docs may summarize the absorbed raw themes instead of listing volatile line-based ids.
When those catalogs change in the Cautilus product repo, run npm run claims:canonical-map before npm run claims:status-report so the report shows how raw user claims compress into the current U-claim catalog.
If discover claims status or doctor status reports gitState.isStale=true, run discover claims --previous <claims.json> --refresh-plan before claim review, review-result application, or eval planning.
Do not launch reviewers, apply review results, plan evals, edit files, or commit artifacts from a stale claim packet unless the user explicitly asks to override stale state.
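The stale-state rule can be expressed as a guard that runs before any review, apply, or eval-planning step. This is a sketch under assumptions: require_fresh_claims is a hypothetical helper, and the caller is expected to have already read the gitState.isStale value from the status output.

```shell
# require_fresh_claims IS_STALE [USER_OVERRIDE]   (hypothetical helper)
# IS_STALE is the gitState.isStale value already read from status output.
# USER_OVERRIDE is "yes" only when the user explicitly asked to override.
require_fresh_claims() {
  is_stale=$1
  user_override=${2:-no}
  if [ "$is_stale" = "true" ] && [ "$user_override" != "yes" ]; then
    echo "claim packet is stale: run discover claims --previous <claims.json> --refresh-plan first" >&2
    return 1
  fi
  return 0
}
```

The override argument exists only for the explicit user request; an agent should not set it on its own.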
If a view is missing, prefer adding a product-owned summary option or review packet over guessing raw JSON keys with ad hoc jq.
If a new discovery run changes only volatile metadata such as the reviewed git commit and not source inventory, candidate claims, labels, or evidence refs, report it as a semantic no-op and do not create a pointer-only commit.
When reporting a refresh-plan result to a coordinator, use this shape:
- What I did: recorded a comparison between the saved claim map and the current checkout.
- What I found: cite refreshSummary.changedSourceCount, changedClaimCount, carriedForwardClaimCount, and the top changedClaimSources.
- What I did not do: did not update the saved claim map, review claims, plan evals, or edit product files unless the user selected that branch.
- Recommended next choice: use the first safe refreshSummary.nextActions item in plain language.
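The four-part report shape above could be produced by a formatter like the following. It is a sketch: report_refresh_plan is a hypothetical helper, and its arguments are values the caller has already read from refreshSummary rather than anything recomputed from raw packet fields.

```shell
# report_refresh_plan CHANGED_SOURCES CHANGED_CLAIMS CARRIED_FORWARD TOP_SOURCES NEXT_ACTION
# Hypothetical helper: formats values already read from refreshSummary.
report_refresh_plan() {
  printf '%s\n' "- What I did: recorded a comparison between the saved claim map and the current checkout."
  printf '%s\n' "- What I found: $1 changed sources, $2 changed claims, $3 carried-forward claims; top changed claim sources: $4."
  printf '%s\n' "- What I did not do: did not update the saved claim map, review claims, plan evals, or edit product files."
  printf '%s\n' "- Recommended next choice: $5"
}
```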
LLM-backed claim review is a separate branch; before launching it, state the selected review budget in these exact labels: maximum clusters, claims per cluster, parallel lanes, excerpt chars, retry policy, and skipped-cluster policy.
Then ask the user to confirm or adjust that budget, and stop until that confirmation or adjustment arrives.
If the user selects the review branch without naming a budget, use the command defaults as a conservative deterministic prepare-input budget and say so before running discover claims review-input.
After discover claims review-input, restate the concrete selected budget with the same labels before any reviewer lane, even if the user already asked to launch a reviewer.
After a fresh first discovery for any review branch, run discover claims status --input <claims.json> --sample-claims <n> before discover claims review-input; this is the product-owned review queue summary, not optional narration.
After discover claims review-input, stop at the packet boundary unless the user has explicitly delegated reviewer launch.
When reviewer launch is explicitly delegated, use the smallest honest launch budget if none was provided: one cluster, one claim, one reviewer lane, no retries.
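Restating the budget with the exact labels is easy to get wrong by hand, so a small printer can help. In this sketch print_review_budget is a hypothetical helper; the defaults for clusters, claims, lanes, and retries follow the smallest honest launch budget described above, while the "not specified" placeholders for excerpt chars and skipped-cluster policy are assumptions, since no defaults are stated for them here.

```shell
# print_review_budget [MAX_CLUSTERS] [CLAIMS_PER_CLUSTER] [LANES] [EXCERPT_CHARS] [RETRY] [SKIPPED]
# Hypothetical helper: prints the review budget using the six required labels.
print_review_budget() {
  printf 'maximum clusters: %s\n' "${1:-1}"
  printf 'claims per cluster: %s\n' "${2:-1}"
  printf 'parallel lanes: %s\n' "${3:-1}"
  # No default is documented for the next value, so the fallback is a placeholder.
  printf 'excerpt chars: %s\n' "${4:-not specified}"
  printf 'retry policy: %s\n' "${5:-no retries}"
  # Same here: the placeholder signals that the user still has to choose.
  printf 'skipped-cluster policy: %s\n' "${6:-not specified}"
}
```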
The default single-lane launch in an agent session is the current agent acting as the reviewer lane: review the selected cluster and write a valid cautilus.claim_review_result.v1 packet.
For claim evidence, prefer evidenceStatus=unknown unless the review input includes direct verified evidence; when attaching Cautilus eval evidence, use kind=eval-summary for eval-summary.json, kind=eval-observed for eval-observed.json, and reserve kind=fixture for checked-in fixture or scenario input files.
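The evidence-kind rule can be sketched as a filename lookup. evidence_kind_for is a hypothetical helper; it deliberately refuses to guess for any other file, because kind=fixture is reserved for checked-in fixture or scenario inputs and needs a deliberate decision by the caller.

```shell
# evidence_kind_for PATH   (hypothetical helper)
# Maps a Cautilus eval artifact path to its evidence kind by basename.
evidence_kind_for() {
  case ${1##*/} in
    eval-summary.json)  echo eval-summary ;;
    eval-observed.json) echo eval-observed ;;
    *)
      # Anything else needs an explicit choice; never default to fixture.
      return 1
      ;;
  esac
}
```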
Use an external reviewer CLI helper only when that external lane was explicitly selected.
The launch is only complete when the selected reviewer lane returns a cautilus.claim_review_result.v1 packet; if an external lane is blocked from reaching its model provider, report a blocked reviewer launch rather than treating the helper invocation as evidence.
This branch proves reviewer launch, not review-result merge behavior.
It ends when the selected reviewer lane writes the cautilus.claim_review_result.v1 packet and reports the packet path, reviewed claim count, evidence label decisions, unresolved questions, and next branch choices.
If you need local confidence before stopping, inspect the written JSON shape directly; do not call discover claims apply-review as packet validation inside the reviewer-launch branch.
Treat discover claims apply-review as the next branch even when the output path is temporary and the intent is only validation.
After reviewer launch, stop before review-result application, eval planning, edits, or commits unless the user explicitly delegates the next branch.
In the later review-to-eval branch, apply cautilus.claim_review_result.v1, validate the reviewed claim packet, and only then plan eval fixtures for reviewed cautilus-eval claims that are ready-for-proof.
When review-to-eval is explicitly delegated in the same agent turn, keep the same bounded reviewer budget unless the user names a larger one, write the applied claim packet to a separate artifact, validate that artifact, run evaluate claims plan from the validated artifact, and stop before fixture authoring, eval execution, product edits, or commits.
The review and eval-planning commands reject stale claim packets by default; treat that error as a prompt to refresh, not as a reason to pass --allow-stale-claims automatically.
After a human review, HITL chunk, issue, pull request review, review.json, or review-summary.json produces a source review decision about a Cautilus proposal, use "$CAUTILUS_BIN" review feedback build to materialize cautilus.review_feedback.v1.
Include proposal evidence for accepted, narrowed, reframed, or rejected decisions, and reserve proposal-less packets for missing_critical.
Read the packet before reporting it, and do not treat it as eval pass/fail or source-ref verification.
When comparing multiple explicitly selected review-feedback packets, use "$CAUTILUS_BIN" review feedback summarize --input <packet> ... to materialize cautilus.review_feedback_summary.v1 without inventing active-run packet discovery or default packet placement.
Eval Routing
Cautilus has two top-level evaluation surfaces and four fixture presets.
Use dev for AI-assisted development work such as repo contracts, tools, and skills.
Use app for AI-powered product behavior such as chat, prompt, and service responses.
The canonical command families are claim, eval, and improve.
Use cautilus evaluate fixture --fixture <fixture.json> when the repo already has a checked-in fixture and adapter-owned runner.
When the agent runtime is read-only, pass an explicit writable --output-dir; prefer /dev/shm/cautilus-<label> when available, otherwise a writable external temp directory.
For a fixture-runtime smoke where doctor --scope agent-surface or doctor status already shows the local skill surface is ready, --skip-preflight is the right boundary in a read-only agent runtime; state that preflight was skipped because the ready state was already observed.
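Choosing the writable output directory for a read-only runtime can be sketched as follows. pick_output_dir is a hypothetical helper, and the mktemp template used for the fallback is an assumption about what counts as a suitable external temp directory.

```shell
# pick_output_dir LABEL   (hypothetical helper)
# Prefers /dev/shm/cautilus-LABEL; otherwise creates a fresh temp directory.
pick_output_dir() {
  label=$1
  if [ -d /dev/shm ] && [ -w /dev/shm ]; then
    dir=/dev/shm/cautilus-$label
    mkdir -p "$dir" || return 1
  else
    # Fallback: a unique directory under TMPDIR (or /tmp when TMPDIR is unset).
    dir=$(mktemp -d "${TMPDIR:-/tmp}/cautilus-$label.XXXXXX") || return 1
  fi
  printf '%s\n' "$dir"
}
```

The printed path is what would then be passed as the explicit --output-dir value.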
Routing rule:
dev / repo: an AI development agent must obey the repo work contract.
dev / skill: a checked-in or portable development skill must still trigger, execute, and validate cleanly.
app / chat: multi-turn product conversation behavior regressed after a prompt or wrapper change.
app / prompt: a single product input/output behavior must remain stable.
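The routing rule is a closed set of four surface and preset pairs, which a lookup can make explicit. In this sketch eval_route_intent is a hypothetical helper; the descriptions are shortened from the list above.

```shell
# eval_route_intent SURFACE PRESET   (hypothetical helper)
# Prints the intent behind a surface/preset pair, or fails for unknown pairs.
eval_route_intent() {
  case "$1/$2" in
    dev/repo)   echo "AI development agent must obey the repo work contract" ;;
    dev/skill)  echo "development skill must still trigger, execute, and validate cleanly" ;;
    app/chat)   echo "multi-turn product conversation behavior" ;;
    app/prompt) echo "single product input/output behavior must remain stable" ;;
    *)
      echo "unknown route: $1/$2" >&2
      return 1
      ;;
  esac
}
```

Rejecting unknown pairs early keeps a mistyped preset from silently routing to the wrong fixture family.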
Use cautilus discover scenarios --json only when you need the proposal-input normalization catalog.
Use cautilus doctor --repo-root . when the selected branch is repo evaluation readiness.
Use cautilus doctor --repo-root . --scope agent-surface or doctor --scope agent-surface when the selected branch is only local Cautilus Agent install/readiness.
Workflow
- Start from doctor status, an explicit user branch, or an existing Cautilus packet.
- Restate the selected claim, baseline, intended behavior, and decision boundary.
- Use workspace start for a multi-command run instead of inventing unrelated /tmp paths by hand.
- Use evaluate comparison prepare when the run needs clean git-ref A/B workspaces.
- Run adapter-defined preflight commands before long evaluations.
- Run cautilus evaluate fixture for checked-in fixtures and read eval-summary.json as the first bounded evaluation decision.
- Build report.json only when the workflow needs the broader report/review/evidence/improve packet layer.
- If the adapter defines executor_variants, run cautilus evaluate review variants instead of retyping ad hoc shell commands.
If evaluate review variants are requested but unavailable on the selected adapter, treat that as a gate defect to fix or explicitly waive before release.
- Use discover scenarios propose when normalized proposal candidates already exist and the next move is a checked-in scenario packet.
- Use improve or GEPA-style search only after the claim and held-out proof surface are explicit.
- Report exact commands, exact adapter selection, exact artifact paths, and the final recommendation.
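The checked-in fixture step of this workflow can be sketched as a wrapper that runs the evaluation and then locates the first decision packet. run_fixture_eval is a hypothetical helper, and it assumes that eval-summary.json is written directly into the directory passed as --output-dir; confirm the actual artifact layout with the command's own example output before relying on it.

```shell
# run_fixture_eval BIN FIXTURE OUTPUT_DIR   (hypothetical helper)
# Runs a checked-in fixture and prints the path of the summary packet.
run_fixture_eval() {
  bin=$1
  fixture=$2
  out=$3
  "$bin" evaluate fixture --fixture "$fixture" --output-dir "$out" || return 1
  # Assumption: the summary packet lands directly in the output directory.
  if [ ! -f "$out/eval-summary.json" ]; then
    echo "eval-summary.json not found in $out" >&2
    return 1
  fi
  printf '%s\n' "$out/eval-summary.json"
}
```

Printing the packet path rather than its contents keeps the exact artifact location available for the final report.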
When the target repo is Cautilus itself, prefer the checked-in self-dogfood wrappers over rebuilding the mode/report/review chain by hand; see self-dogfood-runner.md for wrapper entries and claim boundaries.
Packet Reading
Use product-owned outputs instead of paraphrasing command results from memory.
cautilus.agent_status.v1: no-input orientation over binary health, agent surface, adapter, claim state, scan scope, and next branches.
cautilus.claim_proof_plan.v1: source-ref-backed claim candidates and proof-layer routing, not a verdict.
cautilus.claim_status_summary.v1: grouped status from an existing claim packet.
cautilus.claim_validation_report.v1: claim packet and evidence-ref validation, not evidence discovery.
cautilus.claim_eval_plan.v1: intermediate eval-fixture planning over reviewed claims, not a host fixture writer.
eval-cases.json: product-normalized test cases handed to the host-owned runner.
eval-observed.json: observed behavior packet written by the runner.
eval-summary.json: first bounded evaluation decision packet.
report.json, review.json, review-summary.json, evidence-bundle.json: broader decision packets for review, evidence, and improvement workflows.
improve-input.json, improve-search-result.json, improve-proposal.json, revision-artifact.json: bounded improvement and handoff packets.
Humans usually read HTML renders first, then the underlying packet if they need exact fields.
Agents should read the packet first, then cite HTML only when a browser view is the deliverable.
Guardrails
- Keep host-repo fixtures, prompts, wrappers, and acceptance policy outside the product boundary.
- Prefer CLI examples and schemas over hand-written packet JSON.
- Treat deterministic tests and CI checks as host-owned proof when they are cheaper and stronger than an LLM behavior evaluation.
- Treat review loops and improver output as bounded decision artifacts, not open-ended retries.