| name | task |
| description | Structured operational hygiene for research, data gathering, and novel tasks that don't have their own play |
| disable-model-invocation | true |
Task
For any task that doesn't have its own play: research, data gathering, one-off
analysis, building something new. The task itself is unstructured, but the
operational hygiene around it is not.
Arguments
$ARGUMENTS is optional:
- {task-name}: used as the filename for the task brief
  (box/research/{task-name}-brief.md). If omitted, derive it from the task
  description.
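When {task-name} is omitted, the name has to be derived from the task description. A minimal sketch of one way to do that, assuming a simple lowercase-and-hyphenate rule; the helper names `derive_task_name` and `brief_path`, the six-word cap, and the `untitled-task` fallback are all hypothetical and not part of this play:

```python
import re


def derive_task_name(description: str, max_words: int = 6) -> str:
    """Turn a free-text task description into a filename-safe task name."""
    # Lowercase, then keep only runs of ASCII letters and digits as words.
    words = re.findall(r"[a-z0-9]+", description.lower())
    # Cap the word count so the brief filename stays readable (assumed limit).
    return "-".join(words[:max_words]) or "untitled-task"


def brief_path(task_name: str) -> str:
    """Apply the play's path convention: box/research/{task-name}-brief.md."""
    return f"box/research/{task_name}-brief.md"
```

For example, `brief_path(derive_task_name("Audit PostHog signup events!"))` yields `box/research/audit-posthog-signup-events-brief.md`.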
Also Read
Before starting, read these references:
- reference/framework.md: tendency catalog, intervention model; you need
  these for unstructured terrain
- reference/tooling-logistics.md: tested recipes for any data source
  you're about to hit. Check this for gotchas, access patterns, and token
  loading before attempting to access external sources yourself.
- box/posthog-events.md: if querying PostHog, check known event names
  before guessing
Phases
The play moves through five phases. Set todos at play start and mark each
phase transition explicitly.
[ ] Start: task brief created at box/research/{name}-brief.md
[ ] Plan: queries/steps written in task brief before executing
[ ] Execute: intermediate results saved to durable files
[ ] Synthesize: deliverable produced
[ ] Close: code committed, log updated, play promotion evaluated
Phase Details
1. Start
Resuming a prior investigation (.agent/state/active.json exists):
- Read active.json for the investigation ID and brief_path
- Skip steps 1-2 of the new-investigation list below (copying and filling in
  the template); reopen the existing task brief at brief_path
- Run python3 box/agent-state.py views --investigation I-NNN to regenerate
  brief.md from current state (always regenerate on resume regardless of
  status; this eliminates stale views from interrupted sessions)
- Read .agent/views/brief.md for current investigation state (verified vs.
  tentative claims, open questions, recommended next action)
- Run python3 box/agent-state.py log session_start --investigation I-NNN --session-id $SESSION_ID
- Set the five phase todos as usual
Starting a new investigation (no active.json, or starting fresh):
- Copy box/research/task-brief-template.md to box/research/{name}-brief.md
- Fill in the Objective and Intermediate Files sections
- Check reference/tooling-logistics.md for gotchas relevant to your data
  sources
- Run python3 box/agent-state.py log session_start --session-id $SESSION_ID --brief-path box/research/{name}-brief.md (auto-generates investigation ID)
- Set the five phase todos
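The choice between the two paths above comes down to whether .agent/state/active.json exists. A minimal sketch of that branch, assuming active.json is a JSON object with a `brief_path` key holding a path relative to the repository root (the key name and layout are assumptions; only the file locations come from this play). The function `open_brief` is hypothetical and covers only locating or creating the brief, not the agent-state.py logging calls:

```python
import json
import shutil
from pathlib import Path

ACTIVE = Path(".agent/state/active.json")
TEMPLATE = Path("box/research/task-brief-template.md")


def open_brief(name: str, root: Path = Path(".")) -> tuple[str, Path]:
    """Return ("resume" or "new", path to the task brief)."""
    active = root / ACTIVE
    if active.exists():
        # Resume: reuse the brief recorded in active.json; never re-copy the template.
        state = json.loads(active.read_text(encoding="utf-8"))
        return "resume", root / state["brief_path"]  # assumed key name
    # New investigation: copy the template to box/research/{name}-brief.md.
    brief = root / "box" / "research" / f"{name}-brief.md"
    brief.parent.mkdir(parents=True, exist_ok=True)
    shutil.copyfile(root / TEMPLATE, brief)
    return "new", brief
```

Checking for active.json first means an interrupted session never overwrites an existing brief with a blank template.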
The task brief is the durable artifact. If context compaction hits, the brief
survives. If the session crashes, the next session can resume from the brief.
2. Plan
Write your queries, steps, and merge logic in the task brief before executing
any of them. This is the phase gate: nothing runs until the plan is written
down.
If you discover during planning that the task shape is unclear, say so.
Planning is where ambiguity gets resolved, not mid-execution.
Get user approval on the brief before moving to Execute. The plan is a
checkpoint, not just a file. Present it inline or via plan approval and wait
for explicit approval before executing.
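The Plan gate can be pictured as a small state machine that refuses to enter Execute until the plan is both written and approved. This is only an illustrative sketch: the `PlayState` class and its flags are hypothetical, and the play itself enforces the gate through todos and user approval, not through code.

```python
from dataclasses import dataclass

PHASES = ["Start", "Plan", "Execute", "Synthesize", "Close"]


@dataclass
class PlayState:
    phase: str = "Start"
    plan_written: bool = False   # queries/steps are in the task brief
    plan_approved: bool = False  # user gave explicit approval

    def advance(self) -> str:
        """Move to the next phase, enforcing the Plan -> Execute gate."""
        if self.phase == PHASES[-1]:
            raise RuntimeError("Close is the final phase")
        nxt = PHASES[PHASES.index(self.phase) + 1]
        if nxt == "Execute" and not (self.plan_written and self.plan_approved):
            raise RuntimeError("Plan must be written and approved before Execute")
        self.phase = nxt
        return nxt
```

Keeping "written" and "approved" as separate flags mirrors the text: a plan saved to the brief is not yet a passed checkpoint.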
3. Execute
Save each intermediate result to a durable file as it comes back. The file
paths should already be in the task brief from the Plan phase.
Log every data source interaction via the agent memory system:
- Each data source read:
python3 box/agent-state.py log source_read --source {name} --record-id {id}
- Each query:
python3 box/agent-state.py log query_run --tool {name} --query "{text}"
- Each source skip:
python3 box/agent-state.py log source_skipped --source {name} --reason "{why}"
- Delegate dispatch/collection:
python3 box/agent-state.py log delegate_dispatched|delegate_collected --task-id {id} ...
The --investigation flag is optional when active.json exists. Intermediate
results still save to box/research/ files as before — the event log tracks
that they were read.
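If these logging calls are issued from a script rather than typed by hand, building the argument list explicitly avoids shell-quoting problems with query text. A minimal sketch, using only the event names and flags listed above; the helper function names are hypothetical, and `run_logged` assumes the script is invoked from the repository root:

```python
import shlex
import subprocess

BASE = ["python3", "box/agent-state.py", "log"]


def source_read_cmd(source: str, record_id: str) -> list[str]:
    return BASE + ["source_read", "--source", source, "--record-id", record_id]


def query_run_cmd(tool: str, query: str) -> list[str]:
    return BASE + ["query_run", "--tool", tool, "--query", query]


def source_skipped_cmd(source: str, reason: str) -> list[str]:
    return BASE + ["source_skipped", "--source", source, "--reason", reason]


def run_logged(cmd: list[str]) -> None:
    """Execute a logging command; raises CalledProcessError on a non-zero exit."""
    # Passing a list (no shell=True) keeps quotes and spaces in the query intact.
    subprocess.run(cmd, check=True)


def preview(cmd: list[str]) -> str:
    """Render the command as a copy-pasteable shell line."""
    return shlex.join(cmd)
```

`preview(query_run_cmd("posthog", "SELECT count(*) FROM events"))` gives a correctly quoted one-liner that can be pasted into the task brief.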
When results look suspicious, stop and investigate. Do not rationalize
unexpected data. The tendency is to explain why wrong data is acceptable rather
than fix it. If a number looks off, the default is "this is wrong and I need
to understand why," not "this is probably fine because..."
If you need to modify code (scripts, client libraries, etc.) as part of
execution, note the changes in the task brief's Decisions section.
Context cost of data gathering. Subagent delegation protects main context
from volume reading, but collect results stay in context for the rest of the
session. Use output_instructions to constrain return shape (file paths + key
findings, not full excerpts). For direct file reads, Grep for structure first,
then read only the line ranges you need. Every KB spent here is a KB
unavailable for Synthesize, Close, and session-end.
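The "structure first, then line ranges" pattern can be sketched in plain Python for cases where a script does the reading. Both helpers below are hypothetical illustrations, not tools this play provides: `grep_lines` finds where sections start, and `read_line_range` pulls only the span of interest.

```python
import re
from itertools import islice


def grep_lines(path: str, pattern: str) -> list[tuple[int, str]]:
    """Return (1-indexed line number, line) for every line matching pattern."""
    rx = re.compile(pattern)
    with open(path, encoding="utf-8") as fh:
        return [(n, line.rstrip("\n")) for n, line in enumerate(fh, 1) if rx.search(line)]


def read_line_range(path: str, start: int, end: int) -> list[str]:
    """Return lines start..end (1-indexed, inclusive) without keeping the rest."""
    if start < 1 or end < start:
        raise ValueError("expected 1 <= start <= end")
    with open(path, encoding="utf-8") as fh:
        # islice skips the first start-1 lines and stops after line `end`.
        return [line.rstrip("\n") for line in islice(fh, start - 1, end)]
```

A typical use is `grep_lines(path, r"^#")` to list headings, followed by `read_line_range` on the one section the task actually needs.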
4. Synthesize
⚠ SCOPE GATE. Before writing the deliverable, generate scope accounting
from the event log:
python3 box/agent-state.py accounting
Present the machine-derived output to the user. Fill in the agent-supplied
sections (plan commitments, additional context) before presenting. Wait for
the user to confirm coverage is sufficient before writing.
The machine-derived sections (sources read, queries run, delegates, claim
status) are authoritative — they come from the event log, not from memory.
The agent-supplied sections are explicitly marked as manual input.
Re-read intermediate files before composing. The data was read earlier but
the compose step must work from the files, not from memory of what they said.
This applies even when the data feels recent and in-context: "I already read
this" is the exact instinct that suppresses the re-read. Proved twice:
2026-03-27 (proposal reading), 2026-04-09 (phase assignments, blocking analysis).
Produce the deliverable. The shape depends on the task — could be a CSV, a
report, a card draft, a script.
Log claims in the deliverable. Claims proposed in the deliverable should
be logged:
Correctness claims require primary source verification. When the deliverable
includes judgments about whether an external system (bot, API, feature) produced
a correct or incorrect result, label those judgments as candidates pending
verification — not findings — until verified against the system's code, data,
or actual behavior. Conversation text, user reports, and error descriptions are
intermediaries, not primary sources. Verification is per-claim: verifying one
claim does not validate others about different features. Proved 2026-04-16:
5 bot conversations labeled "bot was incorrect" from conversation evidence;
codebase verification of 1 found the bot described a real feature applied to
the wrong problem.
Writing to large target files. Edit returns the full file as confirmation.
For end-of-file appends of approved content, Bash cat >> avoids the
echo-back.
Compaction insurance. Use approve_content for composed prose going to
production surfaces (card descriptions, analysis comments, observations, email
drafts, findings messages). Formulaic mutations (tracked replies, "This
shipped!" replies, reactions, link-back lines, state changes) go through
execute_approved only. Saved files at
.agenterminal/approved/{content_type}/ survive compaction. After compaction,
read the saved file to recover approved text. See play-specific approval
points in box/shortcut-ops.md for content_type and filename conventions
per play.
5. Close
⚠ NARRATION GATE. Close-phase steps feel procedural after the deliverable
ships — completion bias activates here reliably. Before EACH step below: say
one sentence to the user about what you are doing and wait for acknowledgment
before tool calls that commit, write log entries, or update reference docs.
Proved 2026-04-15: commit to main + two log entries + tooling update executed
as autonomous stream, user had to interrupt.
- End the investigation session and regenerate views:
python3 box/agent-state.py log session_end --session-id $SESSION_ID
python3 box/agent-state.py views
- Commit any code changes made during the task
- Update the investigation log if there were operational learnings:
python3 box/log-cli.py write --date ... --topic ... --lesson ... --bullet ...
- Play promotion: Was this a one-off, or did we just discover a repeatable
pattern? If the task is likely to recur, the artifacts from this run (the
task brief, the intermediate steps, the gotchas encountered) are the seed
for a new dedicated play. Draft the play definition and propose adding it
to the index.
When NOT to Use
This play is for tasks that hit external data sources and produce durable
artifacts: CSVs, reports, scripts. It is not for conversation-shaped work
like process design, brainstorming, or retrospectives, where the conversation
itself is the deliverable. The user triggers this play explicitly; don't
self-activate based on the word "novel."
What This Play Does NOT Prescribe
- What data sources to use
- How to structure the investigation
- What the deliverable looks like
- Methodology or analysis approach
Those flex to the task. The phases and their gates don't.