| name | deliverable |
| description | Multi-session deliverable play for projects spanning 3+ sessions with concrete outputs (proposals, strategies, wireframes). Provides project-level structure, evidence provenance, and cross-session handoff. |
| disable-model-invocation | true |
Multi-Session Deliverable Play
For projects that span multiple sessions and produce a concrete deliverable
(proposal, wireframe, strategy doc, email sequence, etc.). Provides project-level
structure that the Novel Task Play's per-session /task structure doesn't cover.
When to Use
- Work will span 3+ sessions
- Output is a concrete deliverable (not a card or investigation)
- Multiple data sources or research inputs feed the deliverable
- The deliverable will be consumed by someone (human, dev team, future session)
who wasn't present for the research
Arguments
$ARGUMENTS is optional:
{project-name}: Used as the filename prefix for all project artifacts
(box/research/{project-name}-inputs.md, etc.). If omitted, derive from the
task description.
Also Read
reference/framework.md: tendency catalog, intervention model
reference/tooling-logistics.md: tested recipes for data sources
box/posthog-events.md: if querying PostHog
Project Artifacts
Every project produces four artifacts in box/research/:
| Artifact | File | What it is | When created |
|---|---|---|---|
| Inputs | {project}-inputs.md | Source-of-truth evidence base. Every claim has a source (URL, file path, insight link, conversation ID). Methodology recorded. Claims tagged with verification status. | Session 1 |
| Deliverable | {project}-proposal.md (or -wireframe, -plan, etc.) | The output. References inputs for evidence. Does not introduce unsourced claims. | Phase 2 |
| State doc | {project}-state.md | Session handoff: file map with provenance, methodology, numbered decisions with dates/rationale, what's superseded, what's open. Entry point for any session. | Session 1, updated every session |
| Investigation log | Via python3 box/log-cli.py write | Operational learnings. Not raw findings — those go in the inputs doc. | As needed |
Create the artifact split early — after the first data-gathering session,
not after a combined doc gets unwieldy.
The state doc is the entry point. Any new session starts by reading the
state doc, not the deliverable. The state doc is the index; the other artifacts
are the content.
Phases
[ ] Phase 1 — Research: inputs doc built with sourced claims
[ ] Phase 2 — Proposal: deliverable drafted from inputs doc
[ ] Phase 3 — Execution (if applicable): artifacts that consume the proposal
[ ] Close: consistency pass, state doc current, user approval
Phase 1: Research
Goal: build the evidence base. The inputs doc is the artifact.
Sourcing discipline
Every claim entering the inputs doc must have provenance. This is the gate
that prevents unsourced claims from reaching the deliverable.
Web research claims must include a URL. Not a domain name — a URL. A
citation to "onely.com" or "backstageseo.com" without a URL is not a source.
Statistics must trace to a named study with methodology. A specific number
(correlation coefficient, percentage, multiplier) appearing on an SEO agency
blog is not a source — it's a secondary claim. The source is the study that
produced the number: who conducted it, what methodology, what sample size, when.
If the secondary source doesn't cite a primary study, the claim is flagged at
entry.
Flag-at-entry, not flag-later. If a web research subagent returns a
statistic that appears in multiple secondary sources but none cite a primary
study, it gets flagged before entering the inputs doc. Not consumed and audited
later. The cost of flagging at entry is one annotation. The cost of consuming
and auditing later is propagation corrections across every downstream artifact.
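The entry gate above can be sketched as a small check. This is a minimal illustration, not part of the play's tooling: the function names `is_full_url` and `status_at_entry` are hypothetical, and treating a missing URL as an automatic flag is one reading of "is not a source".

```python
from urllib.parse import urlparse


def is_full_url(source: str) -> bool:
    """Return True for an http(s) URL with a host; a bare domain name fails."""
    parsed = urlparse(source.strip())
    return parsed.scheme in ("http", "https") and bool(parsed.netloc)


def status_at_entry(source: str, cites_primary_study: bool) -> str:
    """Decide the tag a web-research statistic gets before it enters the inputs doc.

    Assumption: a statistic with no URL, or whose secondary source cites no
    primary study, is FLAGGED immediately; anything else from a secondary
    source starts as UNVERIFIED until checked against the primary source.
    """
    if not is_full_url(source) or not cites_primary_study:
        return "FLAGGED"
    return "UNVERIFIED"
```

For example, `status_at_entry("onely.com", True)` is flagged because a domain name is not a URL, while a full article URL that names its underlying study enters as UNVERIFIED.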
Verification statuses
| Status | Meaning |
|---|---|
| VERIFIED | Claim traced to primary source (API data, direct inspection, raw file, codebase, named study with URL) |
| UNVERIFIED | Claim sourced from secondary source (web research, subagent, vendor blog) — not yet checked against primary source |
| FLAGGED | Specific concern about accuracy, internal consistency, or missing primary source |
| SOURCE-INACCESSIBLE | Primary source exists but can't be accessed (paywalled, expired) |
| POINT-IN-TIME | Data was accurate at collection date but may have changed. Date noted. |
| JUDGMENT | Interpretive claim (thresholds, qualitative assessments). Directionally sound, specific threshold unstated. |
| PLACEHOLDER | Explicitly acknowledged as an assumption. Awaiting real data. |
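One way to keep these statuses consistent is to model them as a closed set. The sketch below is an assumption about how an inputs-doc entry could be represented: the `Claim` class, the `tag_line` helper, and the bracketed line format are all hypothetical. Only the seven status names come from the table above.

```python
from dataclasses import dataclass
from datetime import date
from enum import Enum
from typing import Optional


class Status(Enum):
    VERIFIED = "VERIFIED"
    UNVERIFIED = "UNVERIFIED"
    FLAGGED = "FLAGGED"
    SOURCE_INACCESSIBLE = "SOURCE-INACCESSIBLE"
    POINT_IN_TIME = "POINT-IN-TIME"
    JUDGMENT = "JUDGMENT"
    PLACEHOLDER = "PLACEHOLDER"


@dataclass(frozen=True)
class Claim:
    text: str
    source: str  # URL, file path, insight link, or conversation ID
    status: Status
    collected_on: Optional[date] = None

    def __post_init__(self) -> None:
        if not self.source.strip():
            raise ValueError("every claim needs a source")
        # The table says POINT-IN-TIME claims have their date noted.
        if self.status is Status.POINT_IN_TIME and self.collected_on is None:
            raise ValueError("POINT-IN-TIME claims must note the collection date")


def tag_line(claim: Claim) -> str:
    """Render a claim as one tagged line (hypothetical format)."""
    suffix = f", collected {claim.collected_on.isoformat()}" if claim.collected_on else ""
    return f"[{claim.status.value}] {claim.text} (source: {claim.source}{suffix})"
```

Rejecting a POINT-IN-TIME claim without a date at construction time mirrors the flag-at-entry idea: the problem is caught before the claim is written anywhere.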
Subagent methodology
Subagent delegation is the default for volume research. The pattern:
- Delegate claim extraction / research to subagents
- Collect structured results
- Spot-check before writing to disk. For any subagent finding that changes
a strategic conclusion, verify at least one claim against its cited primary
source before writing to the inputs doc. Use whichever tool reaches the
source: Claude in Chrome for JS-rendered or auth-gated pages, a Tavily
search, or a Read of the actual file. This is mechanical, not advisory.
- Write to disk section-by-section. The inputs doc on disk is the
checkpoint. Don't accumulate findings in context and write at the end.
Research verification database
For projects with volume web research (50+ claims across multiple delegate
dispatches), use the research verification database instead of manual
spot-checking. The DB enforces verification completeness structurally —
unverified claims cannot reach the generated output.
CLI tool: python3 box/research-claims.py
Workflow (replaces manual subagent methodology for volume research):
- Init project:
  research-claims.py init --project NAME --inputs-doc PATH --sections 3.1,3.2,3.4a
- Dispatch + import (one session, mechanical):
  - Kick off delegates with save_path to box/research/{project}/delegate-output/
  - As results arrive: research-claims.py import --from-delegate PATH --section X
  - Or deferred: research-claims.py import --scan-dir PATH --section-map JSON
  - Then: research-claims.py scrape --project NAME (fetches source content)
- Verify (one session per section, judgment):
  - research-claims.py status --project NAME (see what needs review)
  - research-claims.py verify --project NAME --section X --needs-review
  - For each claim: read source context, call update --claim-id N --status STATUS --snapshot-id N
  - Source content is pre-loaded from DB snapshots. No fetching during verification.
- Generate (one session, mechanical):
  - research-claims.py generate --project NAME --section X --dry-run (preview)
  - research-claims.py generate --project NAME --section X --apply (writes to box/research/{project}/generated/section-{X}.md; atomic, chmod 444, audit-trailed)
- Assemble (after all sections generated):
  - research-claims.py assemble --project NAME (substitutes generated sections into <!-- GENERATED: X --> placeholders in the inputs doc)
  - Produces box/research/{project}/assembled-inputs.md (read-only)
  - The assembled file is the canonical input for Phase 2 (proposal drafting)
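To make the assemble step concrete, here is a minimal sketch of placeholder substitution. It is not the implementation inside research-claims.py: the function name `assemble_text`, the regex, and the choice to fail on a missing section are assumptions made for illustration.

```python
import re

# Matches placeholders such as <!-- GENERATED: 3.4a -->
PLACEHOLDER = re.compile(r"<!--\s*GENERATED:\s*([\w.]+)\s*-->")


def assemble_text(inputs_text: str, sections: dict) -> str:
    """Replace each placeholder with the generated text for that section ID.

    Assumption: a placeholder with no generated section is an error, so a
    partially generated project cannot produce an assembled file.
    """
    missing = []

    def substitute(match):
        section_id = match.group(1)
        if section_id not in sections:
            missing.append(section_id)
            return match.group(0)
        return sections[section_id].rstrip("\n")

    result = PLACEHOLDER.sub(substitute, inputs_text)
    if missing:
        raise KeyError(f"no generated section for: {', '.join(missing)}")
    return result
```

Failing loudly on a missing section keeps the assembled file all-or-nothing, which fits its role as the canonical Phase 2 input.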
Claim statuses: pending → auto_confirmed (mechanical quote match) →
confirmed / partial / rejected / excluded (agent judgment). Only
confirmed, partial, rejected, and excluded are generation-eligible.
auto_confirmed requires agent review before generation.
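The eligibility rule can be read as a simple set-membership check. The sketch below assumes that generation for a section is blocked while any of its claims is still pending or auto_confirmed; the real tool may instead skip such claims. The function names are hypothetical.

```python
# Statuses named above as generation-eligible (all require agent judgment).
GENERATION_ELIGIBLE = frozenset({"confirmed", "partial", "rejected", "excluded"})


def blocking_claims(claims):
    """Return the IDs of claims whose status still needs agent review.

    `claims` is an iterable of (claim_id, status) pairs.
    """
    return [claim_id for claim_id, status in claims if status not in GENERATION_ELIGIBLE]


def can_generate(claims) -> bool:
    """A section is ready only when no claim is pending or auto_confirmed."""
    return not blocking_claims(claims)
```

Note that auto_confirmed lands in the blocking list on purpose: a mechanical quote match is treated as a hint for the reviewer, not as a verdict.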
File protection: A PreToolUse hook blocks Edit/Write to generated/ and
assembled-inputs.md. Generated sections are written only by
research-claims.py generate --apply and consumed only through assemble.
Source quality meta-assessment
When a research artifact (web research report, competitor analysis, landscape
survey) is collected, assess the source quality at the artifact level before
processing individual claims:
- How many distinct sources are cited?
- Do citations include URLs or just domain names?
- Are sources primary research or secondary commentary?
- Is there a self-referential citation pattern (sources citing each other)?
Record the meta-assessment in the inputs doc. This flags low-quality source
artifacts before their individual claims enter the evidence base.
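The countable parts of this assessment can be tallied mechanically. The sketch below covers only the first two questions (distinct sources, and URLs versus bare domain names); whether a source is primary research and whether sources cite each other still need a reader's judgment. The function name and the returned keys are hypothetical.

```python
from urllib.parse import urlparse


def _host_if_url(citation: str):
    """Return the lowercase host for an http(s) URL, or None for anything else."""
    parsed = urlparse(citation.strip())
    if parsed.scheme in ("http", "https") and parsed.netloc:
        return parsed.netloc.lower()
    return None


def meta_assess(citations):
    """Count citations, how many carry a real URL, and how many distinct sources appear."""
    hosts = [_host_if_url(c) for c in citations]
    with_url = sum(1 for h in hosts if h is not None)
    # Citations without a URL are compared by their lowercased text.
    distinct = {h if h is not None else c.strip().lower() for h, c in zip(hosts, citations)}
    return {
        "total": len(citations),
        "with_url": with_url,
        "without_url": len(citations) - with_url,
        "distinct_sources": len(distinct),
    }
```

A report where `without_url` dominates, or where `distinct_sources` is small relative to `total`, is a candidate for a low-quality note in the inputs doc before any of its claims are processed.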
Flagged claims registry
The registry is the organizing artifact for claim resolution. Each flagged
claim gets:
| Field | Content |
|---|---|
| ID | Sequential (F-01, F-02, ...) |
| Claim | The specific text |
| Location | File and line |
| Issue | What's wrong or unverified |
| Disposition | What to do about it and when |
Organize by priority:
- P1: Internal inconsistencies — same project says two different things. Fix required.
- P2: Unsourced load-bearing — no source, and downstream decisions depend on it.
- P3: Unsourced external — no primary source, used as a strategic premise.
- P4: Unverifiable from disk — likely accurate but no raw file saved.
- P5: Judgment — interpretive framing. Directionally sound.
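A registry with these fields and priorities might be modeled as below. This is a sketch under assumptions: the class names, the integer encoding of P1 to P5, and the `undisposed` helper are hypothetical. Only the field names and the F-01 numbering come from the tables above.

```python
from dataclasses import dataclass


@dataclass
class FlaggedClaim:
    claim: str        # the specific text
    location: str     # file and line
    issue: str        # what's wrong or unverified
    disposition: str  # what to do about it and when ("" if not yet decided)
    priority: int     # 1 (internal inconsistency) through 5 (judgment)
    id: str = ""      # assigned by the registry


class Registry:
    def __init__(self) -> None:
        self._entries = []

    def add(self, entry: FlaggedClaim) -> str:
        """Assign the next sequential ID (F-01, F-02, ...) and store the entry."""
        entry.id = f"F-{len(self._entries) + 1:02d}"
        self._entries.append(entry)
        return entry.id

    def by_priority(self):
        """P1 first; sorted() is stable, so entry order is kept within a priority."""
        return sorted(self._entries, key=lambda e: e.priority)

    def undisposed(self):
        """Entries that would fail a 'flagged claims have dispositions' check."""
        return [e for e in self._entries if not e.disposition.strip()]
```

Keeping the ID sequential and independent of priority means a claim keeps its identifier when it is re-prioritized, so references from other documents stay valid.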
Codebase grounding
If the deliverable references product behavior, budget at least one session
specifically for codebase grounding before writing proposal sections.
Exit gate
Inputs doc exists with sourced claims. Every factual claim traces to a primary
source or is explicitly flagged. Flagged claims have dispositions. State doc
has methodology and decisions. User approval before Phase 2.
Phase 2: Proposal
Goal: produce the deliverable draft from the inputs doc.
Entry gate
No claim in the deliverable that isn't in the inputs doc. If drafting
introduces new claims (from adversarial reviews, additional research, the
writer's own knowledge), they route through the inputs doc first: record source,
tag verification status, then reference from the deliverable. The instinct to
"just include it and source later" is where provenance dies.
Section-at-a-time drafting
Write a section, then check claims against the inputs doc before starting the
next section. The section is the unit of work, not the whole document.
Claim-check requires tool calls, not recall. After drafting a section,
verify at least 3 key claims with targeted Read calls to the inputs doc
(specific line ranges). Cognitive cross-referencing from memory — even of
recently-read content — creates a false verification gate. Proved 2026-04-10:
cognitive-only checking passed 21 claims with checkmarks; first mechanical
spot-check caught a data discrepancy (12 vs 13 metrics count).
Bulk corrections require a written plan. When applying more than 5
corrections to a single file, write the full before/after list to disk first.
The list serves as both an audit trail and narration structure — natural
breakpoints between edit batches. Advisory commitments to "narrate as I go"
fail under batch momentum; the written plan creates structure mechanically.
Proved 2026-04-10: three monitor flags for silent mutations, three verbal
commitments to narrate, zero behavioral change — until correction summary
file was written, after which per-category narration held.
Unsourced claims are not silent. Options: source it (add to inputs →
reference from deliverable), mark as unverified with a visible tag, or cut it.
"Probably true" is not a verification status.
Adversarial review
Project-specific, not play-mandated. Some projects benefit from adversarial
delegate review on each section. If used, record in the state doc. Claims
from adversarial reviews are new claims — they route through the inputs doc
like any other.
Exit gate
Every factual claim in the deliverable traces to the inputs doc. No
unverified claims without visible tags. State doc updated with proposal
decisions. User approval before Phase 3.
Phase 3: Execution (if applicable)
Goal: produce execution artifacts (cards, wireframes, implementations) that
consume the proposal.
Execution surfaces new findings
Wireframe sessions find competitor claims are wrong. Card-writing sessions
find product behavior doesn't match. These findings route back through
inputs doc → proposal update. Execution does not silently override the
proposal. If reality contradicts the proposal, the proposal gets corrected
first.
Mid-project corrections
When new research invalidates a finding that's already been consumed
downstream:
- Update the inputs doc with the corrected finding
- Update the deliverable if it references the finding
- For execution artifacts already produced: add a correction header at
the top of each affected document pointing to the inputs doc. This is the
standard pattern when inline edits to every consuming document aren't
feasible in the current session.
- Record which documents have correction headers vs. inline fixes in the
state doc.
Propagation scope scales with cascade depth. If the project has cascade
structure (L1 feeds L2 feeds L3 feeds L4), each correction may touch
multiple layers. Scope the propagation work explicitly rather than treating
it as a closing checkbox.
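Scoping the propagation work amounts to walking the cascade from the corrected artifact to everything that consumes it, directly or indirectly. The sketch below assumes the project keeps a map of which artifact feeds which; the map format and the function name `downstream` are hypothetical.

```python
from collections import deque


def downstream(feeds: dict, corrected: str) -> list:
    """Breadth-first list of every artifact that consumes `corrected`.

    `feeds` maps an artifact name to the artifacts that read from it.
    The corrected artifact itself is not included in the result.
    """
    seen = {corrected}
    order = []
    queue = deque([corrected])
    while queue:
        current = queue.popleft()
        for consumer in feeds.get(current, []):
            if consumer not in seen:
                seen.add(consumer)
                order.append(consumer)
                queue.append(consumer)
    return order
```

With `feeds = {"L1": ["L2"], "L2": ["L3"], "L3": ["L4"]}`, a correction at L1 returns all three lower layers, which is the list of documents that need either an inline fix or a correction header.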
State doc becomes critical
Each session's update records: which execution artifacts were produced, which
proposal sections they consumed, and whether execution surfaced findings that
changed the proposal or inputs.
Exit gate
Closing consistency pass (see below).
Closing Consistency Pass
Before declaring a deliverable complete:
- Claim propagation check. Flagged claims from the inputs doc are not
present in the deliverable or execution artifacts under different wording.
Delegate a scan to a subagent if the artifact set is large.
- State doc decisions match deliverable text. Spot-check key decisions.
- "What's still open" is accurate. No items marked open that are resolved;
no resolved items missing from the record.
- Inputs doc is the single evidence source. No deliverable section
references a claim that isn't in the inputs doc.
- Correction headers documented. If mid-project corrections used the
header pattern, the state doc lists which documents have headers vs.
inline fixes.
Cross-Session Handoff
At session end, the state doc must be current. The state doc answers:
- What files exist and what each one is (file map with provenance status)
- What methodology was used (reproducible)
- What key decisions were made, when, and why (numbered, dated)
- What was superseded and by what
- What's still open
Any new session starts by reading the state doc. The state doc is the index;
the other artifacts are the content.
Origin
Generalized from the onboarding activation project (Feb–Mar 2026) and
validated by the SEO cascade provenance retrofit (Mar 2026, 4 sessions).
The retrofit tested the play against 40+ artifacts across 5 cascade levels,
extracting ~1,000 claims and resolving 52 flagged entries. Key learnings
that shaped this play: (a) sourcing discipline at entry prevents downstream
audit costs, (b) the flagged claims registry is the most valuable artifact,
(c) subagent spot-checking must be mechanical not advisory, (d) correction
headers are the right pattern for mid-project propagation, (e) web research
source quality varies dramatically and needs artifact-level assessment.
Amendments (2026-04-10, competitive landscape Session 6)
Origin: Competitor claims tagged UNVERIFIED flowed through Phase 1 exit gate
("traces to primary source OR explicitly flagged") into Phase 2, into the
deliverable, and through a declared-complete Phase 2 exit gate. The "or
flagged" escape hatch allowed unverified claims to reach stakeholder-facing
output. Proved: competitive landscape project, 6 sessions, 42 contradictions
found in systematic verification (16% error rate across 259 claims).
Amendment 1: Phase 1 exit gate — verification tier requirement
Current gate: "traces to a primary source OR explicitly flagged." New gate:
every claim that will appear in the deliverable body (not just the inputs
doc) must be either VERIFIED or explicitly framed as market context. The "or
flagged" escape hatch closes for deliverable-bound claims. Flagging stays
fine for the inputs doc.
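The amended gate can be expressed as a filter over the inputs doc. The sketch below is illustrative only: the `InputClaim` fields, including the `deliverable_bound` and `framed_as_market_context` booleans, are assumptions about how a project might record this information.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class InputClaim:
    claim_id: str
    status: str                      # e.g. "VERIFIED", "UNVERIFIED", "FLAGGED"
    deliverable_bound: bool          # will this claim appear in the deliverable body?
    framed_as_market_context: bool = False


def gate_failures(claims):
    """IDs of deliverable-bound claims that are neither VERIFIED nor framed as market context."""
    return [
        c.claim_id
        for c in claims
        if c.deliverable_bound
        and c.status != "VERIFIED"
        and not c.framed_as_market_context
    ]
```

A FLAGGED claim that stays in the inputs doc passes this gate; the same claim marked deliverable-bound does not.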
Amendment 2: Phase 2 — pre-draft verification pass per section
Before drafting each section, identify the load-bearing claims — the ones
where the conclusion changes if the claim is wrong. Verify those mechanically
(fetch the page, read the file, run the query). Non-load-bearing context can
stay at web-research level but must be framed as such.
Amendment 3: Deliverable structure separates evidence tiers
"What Tailwind Does" and "Gaps" sections use verified-only claims. "Competitor
Landscape" sections get an explicit sourcing note at the top: "Based on vendor
marketing and review sites. See Methodology for verification approach." The
reader always knows which voice they're reading.
Amendment 4: Completion declaration — mechanical readiness check
Before declaring any phase done, grep the deliverable for internal tags
(UNVERIFIED, SUBAGENT-SOURCED, JUDGMENT, SOURCE-QUALITY-WARNING). Any hits
must be resolved (verified, reframed, or cut) before completion. This is
a 30-second check that would have caught the UNVERIFIED claims that reached
the deliverable in this project.
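The same check can be run from Python when grep is not convenient. This is a minimal sketch: the function name `find_internal_tags` is hypothetical, and matching is case-sensitive on the assumption that internal tags are always written in capitals.

```python
import re

INTERNAL_TAGS = ("UNVERIFIED", "SUBAGENT-SOURCED", "JUDGMENT", "SOURCE-QUALITY-WARNING")
TAG_PATTERN = re.compile(r"\b(" + "|".join(re.escape(tag) for tag in INTERNAL_TAGS) + r")\b")


def find_internal_tags(text: str):
    """Return (line_number, tag) for every internal tag left in the text."""
    hits = []
    for number, line in enumerate(text.splitlines(), start=1):
        for match in TAG_PATTERN.finditer(line):
            hits.append((number, match.group(1)))
    return hits
```

To use it on a deliverable, read the file with `pathlib.Path(path).read_text()` and treat any non-empty result as a blocker for the completion declaration.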
Amendment 5: Exit gate — audience question
After the process checklist, one more gate: "Would you hand this file to the
stated audience right now?" If no, it's not done. This is the meta-gate that
catches what the mechanical checks miss.