---
name: auto-research-tick
description: Single entry point for the learn-ai auto-research loop. Reads docs/audits/auto-research/state.json to determine mode (baseline / Build Alpha functionality validation / nightly / dormant) and runs that mode's instructions. Use when invoked manually as `/auto-research-tick`, when fired by the nightly cron (after hardening), or when explicitly told to "run auto-research" or "do an auto-research tick". Auto-trigger only on those exact phrases — do not invoke this skill speculatively. Baseline and Build Alpha functionality validation modes are implemented; nightly mode is dormant until the hardening gate in baseline-math-rigor.md is clear.
---
Auto-research tick
The auto-research loop has two long-term modes (baseline, nightly) plus dormant states, and may also run a bounded one-shot Build Alpha functionality validation pass when the state asks for it. This skill is the single entry point. The current mode lives in docs/audits/auto-research/state.json.
Mode dispatch
Read docs/audits/auto-research/state.json first. Branch on mode:
| mode value | Action |
|---|---|
| baseline-not-started | Initialize the baseline (see §Baseline mode → kickoff). |
| baseline-in-progress | Resume the baseline from cursor (see §Baseline mode → resume). |
| baseline-complete-awaiting-remediation | The baseline is done. Print a one-paragraph status from the most recent run summary, list open P0/P1 findings, and exit. Do not start nightly mode. |
| build-alpha-validation-pending | Run the one-shot Build Alpha functionality validation pass (see §Build Alpha functionality validation mode). |
| build-alpha-validation-complete-awaiting-review | The validation pass is done. Print the latest Build Alpha validation report path, the executive verdict, and the top architect recommendations; then exit. |
| hardened-nightly | Not implemented yet. When you see this, refuse and tell the user to manually invoke the (future) nightly skill — the baseline must be acknowledged complete by a human before nightly mode goes live. |
Anything else: stop and ask the user. Do not guess.
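The dispatch rule can be sketched as a guard; `dispatch` and `KNOWN_MODES` are hypothetical helper names, and the mode list mirrors the table above:

```python
import json

# Modes this version of the skill recognizes (mirrors the dispatch table).
KNOWN_MODES = {
    "baseline-not-started",
    "baseline-in-progress",
    "baseline-complete-awaiting-remediation",
    "build-alpha-validation-pending",
    "build-alpha-validation-complete-awaiting-review",
    "hardened-nightly",
}

def dispatch(state_path: str) -> str:
    """Read state.json and return the mode to branch on.
    Any unrecognized mode raises: stop and ask the user, never guess."""
    with open(state_path) as f:
        mode = json.load(f)["mode"]
    if mode not in KNOWN_MODES:
        raise ValueError(f"unrecognized mode {mode!r}: stop and ask the user")
    return mode
```
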
Hard constraints (every mode)
These never change. If a constraint conflicts with the user's instruction in a single tick, stop and ask.
- Read-only outside `docs/audits/auto-research/`. No edits to production code, tests, fixtures, configs, or any file under `PythonDataService/`, `Backend/`, `Frontend/`, `references/`, `.claude/`, `.codex/` (if present), or any other source path.
- No commits, no branches, no PRs, no pushes. Even if the user has previously authorized commits in another context.
- No new dependencies. Not in `requirements-*.txt`, `*.csproj`, `package.json`, or anywhere else.
- No regenerating golden fixtures. Ever.
- No loosening tolerances. Ever.
- No restarting containers. If a container is required and down, record the check as `not run, container down: <container_name>` and continue. Do not run `./restart.sh`, `podman compose up`, or any rebuild/restart command. Functionality-validation mode may inspect already-running services, logs, browser output, and HTTP responses; it may not mutate the runtime to make a check pass.
- No external network fetches (WebFetch / GitHub MCP) without an explicit user authorization recorded in the finding doc. Default to vendored references only: `references/`, `docs/references/`, and the framework docs already cited in `.claude/rules/`.
- No running LEAN, no running QuantLib outside its existing Python wrapper, no installing anything.
- No installing Playwright. Functionality-validation mode should use Playwright only if an existing runtime/tool is already available. Do not run commands that download browsers or packages during the audit.
- Targeted tests only. Run `pytest -k <name>`, `dotnet test --filter <name>`, or `vitest run -t <name>` to verify a specific static finding. Never run the full test suite during a baseline tick.
- No "greenwashing" tests — never write a test asserting current behavior just to make a finding "go away".
Budget and termination
Per tick:
- Hard cap: 8 hours of wall-clock work or until usage runs out (whichever first).
- Soft check every ~10 minutes: persist `state.json` and a partial run summary so a forced exit leaves clean state.
- On rate-limit / usage-cap detection: write the run summary, save state, exit cleanly. Do not retry; the next nightly cron (or next manual invocation) will resume.
- Exit early if a full sweep of the configured scope produces no new P0/P1/P2 findings — record that in the run summary as the "no new findings" terminator.
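The budget rules above can be sketched as a small tracker; the class name and the injectable clock are illustrative assumptions, not an existing module:

```python
import time

class TickBudget:
    """Per-tick budget bookkeeping: an 8-hour hard cap plus ~10-minute
    soft checkpoints at which state.json and a partial summary persist."""

    def __init__(self, hard_cap_s=8 * 3600, soft_interval_s=600, clock=time.monotonic):
        self._clock = clock
        self._start = clock()
        self._last_checkpoint = self._start
        self.hard_cap_s = hard_cap_s
        self.soft_interval_s = soft_interval_s

    def exhausted(self) -> bool:
        # Hard cap reached: write the run summary, save state, exit cleanly.
        return self._clock() - self._start >= self.hard_cap_s

    def checkpoint_due(self) -> bool:
        # Soft check: persist state roughly every soft_interval_s seconds.
        now = self._clock()
        if now - self._last_checkpoint >= self.soft_interval_s:
            self._last_checkpoint = now
            return True
        return False
```
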
State management
`docs/audits/auto-research/state.json` schema (v1):

{
  "mode": "baseline-not-started | baseline-in-progress | baseline-complete-awaiting-remediation | build-alpha-validation-pending | build-alpha-validation-complete-awaiting-review | hardened-nightly",
  "phase": 1,
  "phase_name": "inventory",
  "cursor": "PythonDataService/app/engine/indicators/sma.py",
  "last_run": "2026-05-06T03:14:00-04:00",
  "budget_used_seconds": 24300,
  "open_findings": ["F-0001", "F-0002"],
  "closed_findings": [],
  "runs": [
    {"date": "2026-05-06", "phases_touched": [1, 2], "opened": 7, "closed": 0, "stop_reason": "budget"}
  ],
  "baseline_started_at": "2026-05-06T23:00:00-04:00",
  "baseline_completed_at": null,
  "schema_version": 1
}
Write atomically: write to state.json.tmp, then rename. Persist after every finding open/close and at least every 10 minutes during long static scans.
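A minimal sketch of the atomic persist, assuming the state is a plain JSON-serializable dict (`save_state` is a hypothetical helper, not an existing module):

```python
import json
import os
import tempfile

def save_state(state: dict, path: str) -> None:
    """Write state.json atomically: dump to a temp file in the same
    directory, then rename over the target (rename is atomic on POSIX),
    so a forced exit never leaves a half-written state file."""
    dirname = os.path.dirname(path) or "."
    fd, tmp_path = tempfile.mkstemp(dir=dirname, suffix=".tmp")
    try:
        with os.fdopen(fd, "w") as f:
            json.dump(state, f, indent=2)
            f.flush()
            os.fsync(f.fileno())
        os.replace(tmp_path, path)  # atomic swap; readers never see a partial file
    except BaseException:
        os.remove(tmp_path)
        raise
```
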
Finding doc schema
One file per P0/P1/P2 finding at docs/audits/auto-research/findings/F-NNNN-<slug>.md. P3 findings roll up into a single findings/P3-rollup.md.
---
id: F-0001
severity: P0 | P1 | P2
status: open | repro-test-written | awaiting-human | deferred | fixed-verified | wontfix
area: inventory | python-authority | timestamp | provenance | fixture | tolerance | ingestion | wire | frontend-consumption | documentation
canonical_file: <path or "n/a">
reference: <vendored path, paper citation, or "missing">
first_seen: 2026-05-06
last_seen: 2026-05-06
phase: 1
---
## What
One paragraph in plain language.
## Where
File + line numbers. Multiple locations OK.
## Why this severity
Tied to the taxonomy in §9 of `baseline-math-rigor.md`.
## Reproduction
Command, expected vs observed. Skip if static-only.
## Suggested resolution (NOT auto-applied)
What a human would change. The skill writes this; the skill does NOT apply it.
## Provenance of the finding itself
Phase + cursor that produced it. Which reference (vendored path or commit) was consulted, if any.
Status values the skill is allowed to set: `open`, `repro-test-written`, `awaiting-human`, `fixed-verified`. The skill must respect human-set `deferred` and `wontfix` and never reopen them silently.
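The status-ownership rule can be sketched as a guard; `may_set_status` is a hypothetical helper, and the status sets mirror the lists above:

```python
# Statuses the skill may set on a finding; deferred/wontfix are human-owned.
SKILL_SETTABLE = {"open", "repro-test-written", "awaiting-human", "fixed-verified"}
HUMAN_ONLY = {"deferred", "wontfix"}

def may_set_status(current: str, new: str) -> bool:
    """Return True only for transitions the skill is allowed to make."""
    if current in HUMAN_ONLY:
        return False  # never silently override a human decision
    return new in SKILL_SETTABLE
```
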
Finding ID allocation: monotonically increasing across the lifetime of findings/. On startup, scan existing files and pick the next free ID.
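A sketch of the startup ID scan, assuming the `F-NNNN-<slug>.md` naming above (`next_finding_id` is an illustrative name):

```python
import re
from pathlib import Path

def next_finding_id(findings_dir: str) -> str:
    """Scan findings/ for F-NNNN-*.md files and return the next free ID."""
    used = [
        int(m.group(1))
        for p in Path(findings_dir).glob("F-*.md")
        if (m := re.match(r"F-(\d{4})-", p.name))
    ]
    return f"F-{max(used, default=0) + 1:04d}"
```
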
Deduplication
Before opening a finding, compute a hash key over `(area, canonical_file_or_location, finding_type)`. If a finding with that key exists in `findings/` (any status), update its `last_seen` and add a note rather than creating a duplicate.
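One way to sketch the dedup key; the hashing scheme itself is an illustrative choice, and any stable key over the same triple works:

```python
import hashlib

def dedup_key(area: str, location: str, finding_type: str) -> str:
    """Stable dedup key over (area, canonical_file_or_location, finding_type).
    The unit-separator join avoids collisions from embedded delimiters."""
    raw = "\x1f".join((area, location, finding_type))
    return hashlib.sha256(raw.encode("utf-8")).hexdigest()[:12]
```
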
============================================================
BASELINE MODE
============================================================
The baseline is a one-shot comprehensive audit producing recommendations only. Its full spec lives in docs/audits/auto-research/baseline-math-rigor.md. The skill's job is to fill that doc in and write the supporting findings/ files.
Kickoff (mode: baseline-not-started)
- Set `mode` to `baseline-in-progress`, `phase` to 1, `phase_name` to `inventory`, and `baseline_started_at` to now (ISO 8601 with offset).
- Update `baseline-math-rigor.md`: set `Status:` to in-progress, `Started:` to today, `Run count:` to 1.
- Create today's run file at `docs/audits/auto-research/runs/YYYY-MM-DD.md` with a header skeleton.
- Begin Phase 1.
Resume (mode: baseline-in-progress)
- Read `cursor`. Pick up from the next item in the current phase.
- If `last_run` is more than ~36 hours ago, note the gap in today's run summary (this likely means cron didn't fire — surface it, don't hide it).
- Increment `Run count` in `baseline-math-rigor.md`.
- Append a new row to §7 (Runs) at the start of the run, finalize counts at the end.
Phases (dependency-ordered, severity sub-sorted within)
The phase order matches §5 of `baseline-math-rigor.md`. Each phase runs its static sweep first; targeted tests run only when a finding's severity warrants verification and the relevant container is up.
Phase 1 — Canonical math inventory & source-of-truth gaps
Goal: Make docs/math-sources-of-truth.md correct and complete.
Static sweep:
- For every row in `docs/math-sources-of-truth.md`: confirm the Canonical file path exists; record findings for moved/renamed/deleted entries.
- Code-side scan to surface undocumented canonical math the registry doesn't list. Heuristics:
  - Files in `PythonDataService/app/engine/` not referenced anywhere in the registry
  - Files in `PythonDataService/app/services/` whose names match math concepts (`*_pricer*`, `*_greeks*`, `*_iv*`, `*solver*`, `*statistics*`, `*backtest*`, `*indicator*`, `*valuation*`)
  - Methods in `Backend/Services/Implementation/*.cs` that contain arithmetic over price / Greek / indicator values (grep for `Math.`, `decimal`, multiplication of model fields, etc.) and aren't registered as legacy duplicates
  - TS files matching `Frontend/src/app/utils/*math*`, `*pricer*`, `*greeks*`, `*calculator*`, `*compute*` not registered
- Confirm the same against `docs/architecture/engine-authority-map.md` and `docs/architecture/numerical-authority-migration-plan.md`. Findings record drift between any two of: registry, authority map, migration plan, code reality.
Severity heuristics:
- Unregistered canonical math doing live computation → P1
- Registered file moved/renamed without registry update → P1
- Registry entry stuck in `pending-migration` long after the migration plan's stated phase → P2 (escalate to P1 if the duplicate is still serving live traffic)
- Authority-map / migration-plan drift vs registry → P2
Output: §2 of baseline-math-rigor.md populated; one finding per inventory mismatch.
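The registry existence check from the sweep above can be sketched as follows; the assumption that canonical paths live in markdown table cells is hypothetical and must be adapted to the registry's real layout:

```python
import re
from pathlib import Path

def missing_canonical_files(registry_md: str, repo_root: str = ".") -> list[str]:
    """Flag registry rows whose canonical path no longer exists on disk
    (moved/renamed/deleted). Crude heuristic: any markdown-table cell
    that looks like a .py/.cs/.ts source path is treated as a path."""
    missing = []
    for line in Path(registry_md).read_text().splitlines():
        for cell in (c.strip() for c in line.split("|")):
            if re.fullmatch(r"[\w./-]+\.(py|cs|ts)", cell):
                if not (Path(repo_root) / cell).exists():
                    missing.append(cell)
    return missing
```
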
Phase 2 — Python math-authority violations (rule 5)
Goal: Confirm Python is the authority for canonical math; every duplicate has a parity test naming the canonical file.
Static sweep:
- For each registry row marked `legacy-ok` or `pending-migration`: open the duplicate file and confirm:
  - Its provenance block (or header comment) names the canonical Python file in `Canonical implementation`
  - It has a parity test named in `Validated against`, and that test exists
- For every `Backend/Services/Implementation/*.cs` containing arithmetic on price / Greek / indicator / statistic values: confirm it is either a passthrough to Python, an aggregation of Python-supplied numbers, or has a registered legacy-ok justification with parity test.
- Same sweep across `Frontend/src/app/**/*.ts` for `Math.`-heavy files outside the registered legacy set (currently `Frontend/src/app/utils/black-scholes.ts` is the registered exception).
Severity heuristics:
- A non-Python layer computing canonical math without a parity test → P1 (escalate to P0 if the resulting number is rendered to the user as authoritative).
- Duplicate that points at canonical but has no parity test → P1.
- Stale parity test (last green run > 6 months ago) → P2.
Phase 3 — Timestamp boundary violations
Goal: Confirm int64 ms UTC is the only wire and storage format and the ban list is clean.
Static sweep — Python: grep `PythonDataService/` for:
- `datetime.utcnow`, `datetime.utcfromtimestamp`
- `datetime.now()` without `tz=`
- `pd.to_datetime(` without `utc=True` (multi-line aware)
- `.strftime(".*Z")` on a naive datetime
- Any field literally named `timestamp` / `ts` / `time` typed as `str` in a Pydantic model

Static sweep — .NET: grep `Backend/` for:
- `DateTime.Parse(`
- `DateTime.ParseExact(` without an explicit offset designator in the format string
- Field named `timestamp` / `ts` / `time` typed `string` or `DateTime` in a DTO
- Any new `DateTime` instance in a canonicalization or ingestion path

Static sweep — TypeScript: grep `Frontend/src/` for:
- `new Date(<string variable>)` where the input variable did not come from a literal full-ISO-with-tz
- `Date.parse(`
- Field named `timestamp` / `ts` / `time` typed `string` or `Date` in an interface that crosses the wire
Severity heuristics:
- Ban-list violation in an active ingestion path → P0
- Ban-list violation in a serialization or DTO path → P0 if user-visible numbers depend on it; P1 otherwise
- Ban-list violation in a non-canonicalization helper → P2
- DTO field typed `string` for a wire-side timestamp → P1
Reference: `.claude/rules/numerical-rigor.md` → "Timestamp rigor → Ban list".
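A sketch of the static ban-list sweep; the regexes below cover only an illustrative subset of the Python-side list, and the authoritative patterns live in `.claude/rules/numerical-rigor.md`:

```python
import re
from pathlib import Path

# Illustrative subset of the Python-side ban list.
BANNED = [
    (re.compile(r"datetime\.utcnow\("), "naive utcnow"),
    (re.compile(r"datetime\.utcfromtimestamp\("), "naive utcfromtimestamp"),
    (re.compile(r"datetime\.now\(\s*\)"), "now() without tz="),
]

def sweep(root: str, glob: str = "**/*.py") -> list[tuple[str, int, str]]:
    """Return (path, line_number, reason) for each ban-list hit under root."""
    hits = []
    for path in sorted(Path(root).glob(glob)):
        for lineno, line in enumerate(path.read_text(errors="ignore").splitlines(), 1):
            for pattern, reason in BANNED:
                if pattern.search(line):
                    hits.append((str(path), lineno, reason))
    return hits
```
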
Phase 4 — Provenance & reference gaps
Goal: Every canonical math file carries the 4-field block: Formula / Reference / Canonical implementation / Validated against.
Static sweep:
- For every Python file listed as Canonical in `docs/math-sources-of-truth.md`: open it and look in the module docstring or the relevant function/class docstring for the four field labels.
- Same for the `Backend/` and `Frontend/` canonical-or-legacy-ok files in the registry.
- Cross-check that `docs/references/<name>.md` exists for every entry that names one.
Severity heuristics:
- Canonical file with no provenance block at all → P1
- Provenance block missing one or more fields → P1 if `Reference` or `Validated against` is missing; P2 otherwise
- `Validated against:` says "manually checked" / "looks right" / similar non-test phrasing → P1
- Reference cited but no corresponding `docs/references/<name>.md` → P2
Phase 5 — Golden fixture gaps
Goal: Every canonical math has a fixture under PythonDataService/tests/fixtures/golden/<name>/ with input, output, and attribution.
Static sweep:
- For every canonical Python math file in the registry: walk `tests/fixtures/golden/` looking for a directory whose name matches.
- For each found fixture: confirm presence of input, output, and an attribution file (`README.md` or `attribution.json`) that includes reference source, generation date, and regeneration command.
- Flag fixtures referenced in tests but missing on disk.
- Flag fixtures present on disk but referenced by no test.
Severity heuristics:
- Canonical math listed as `pending-fixture` in the registry → P1
- Canonical math NOT listed as `pending-fixture` but fixture missing → P0 (the registry lies)
- Fixture present, attribution missing → P2
- Fixture older than 12 months with no upstream reference change → P3 (rollup)
Phase 6 — Tolerance hygiene
Goal: Every float comparison declares atol and rtol; loosened tolerances are justified.
Static sweep:
- grep `PythonDataService/tests/`, `Backend.Tests/`, and `Frontend/src/` for:
  - `np.allclose(`, `np.isclose(`, `assertAlmostEqual` — flag any without explicit `atol=` and `rtol=`
  - `assert.closeTo(` (Vitest/Jest) — flag without explicit precision
  - `Assert.Equal(.., .., delta:` — flag without an inline rationale comment
- For each loosened tolerance (`atol > 1e-9` for indicators, `atol > 1e-6` for PnL, `atol > 1e-6` or `rtol > 1e-6` for Greeks, `atol > 1e-10` for probabilities): require an inline justification comment OR a justification in `docs/references/<name>.md`. Flag if absent.
Severity heuristics:
- Bare `np.allclose(a, b)` with defaults → P1
- Loosened tolerance with no justification → P1
- Justification present but vague ("close enough", "after a few bars") → P2
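The `np.allclose` / `np.isclose` check from the sweep above can be sketched as a single-line heuristic; a real sweep must also handle calls split across lines:

```python
import re

def flags_bare_tolerance(line: str) -> bool:
    """Flag np.allclose/np.isclose calls that omit an explicit atol= or rtol=.
    Single-line heuristic only; multi-line calls need a proper parser."""
    m = re.search(r"np\.(?:allclose|isclose)\(([^)]*)\)", line)
    if m is None:
        return False
    args = m.group(1)
    return "atol=" not in args or "rtol=" not in args
```
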
Phase 7 — Ingestion fidelity
Goal: Polygon / IBKR / FRED / any external feed → preserves timestamp, dtype, ordering, monotonicity, and surfaces duplicates rather than silencing them.
Static sweep:
- Find every external-API client in `PythonDataService/app/services/` (`polygon_*`, `ibkr_*`, `fred_*`, etc.).
- For each: verify that the timestamp parsing path lands at int64 ms UTC, that it fails fast on duplicates and on non-monotonic sequences, and that there is no silent `drop_duplicates` and no forward-fill.
- Verify dtype handling: numeric fields stay numeric; precision loss (e.g., `round(x, 6)` before wire) is flagged.
Severity heuristics:
- Silent dedup or forward-fill in an ingestion path → P0
- Timestamp string passed through without canonicalization → P0
- `round(x, N)` before wire on a numeric value the consumer treats as authoritative → P1
- dtype coercion inferred by pandas with no explicit `dtype=` argument → P2
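The fail-fast stance on ingested timestamps can be sketched as a validator; `validate_ingested_timestamps` is a hypothetical helper, not an existing service function:

```python
import numpy as np

def validate_ingested_timestamps(ts_ms: np.ndarray) -> np.ndarray:
    """Fail fast on bad ingested timestamps instead of silently repairing them.
    Expects int64 ms UTC, strictly increasing; duplicates and out-of-order
    rows are surfaced as errors, never dropped or forward-filled."""
    if ts_ms.dtype != np.int64:
        raise TypeError(f"expected int64 ms UTC, got {ts_ms.dtype}")
    diffs = np.diff(ts_ms)
    if (diffs == 0).any():
        raise ValueError("duplicate timestamps in ingestion batch")
    if (diffs < 0).any():
        raise ValueError("non-monotonic timestamps in ingestion batch")
    return ts_ms
```
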
Phase 8 — Wire fidelity (Python → Backend → GraphQL → Frontend)
Goal: A number computed in Python arrives at the Frontend signal unmangled.
Static sweep:
- For each canonical Python output that is consumed by the Frontend (find via FastAPI router → `Backend/Services/*Service.cs` typed HttpClient → GraphQL resolver → Frontend service → component signal):
  - At each hop, the field types preserve the value (no `decimal` → `double` narrowing without justification, no int64 ms → string mutation, no number → string mutation in DTOs)
  - No layer recomputes the value
- Same trace for timestamp fields.
Severity heuristics:
- Recomputation of a canonical value in Backend or Frontend → P0 (this is a math-authority violation; cross-link to a Phase 2 finding)
- Type narrowing on a numeric field that crosses the wire → P1
- Timestamp converted to string mid-pipeline (not at the display boundary) → P0
Phase 9 — Frontend consumption / display-only violations
Goal: UI displays without recomputing; DatePipe / toFixed / chart formatters are display-only and never round-tripped.
Static sweep:
- grep `Frontend/src/` for `toFixed(`, `DatePipe`, `formatDate(`, `parseFloat(`, `Number(` on incoming numeric fields.
- Confirm the formatted result is rendered to the DOM only — never stored back in a signal that crosses the wire, never sent to a service.
- Flag any chart-formatter output that gets persisted or sent in a request.
Severity heuristics:
- Display-formatted value sent back over the wire → P0
- Display-formatted value stored in a signal that's read by another computation → P1
- Inconsistent formatting between layers (e.g., 6dp here, 4dp there) for the same canonical field → P2 (rollup if cosmetic)
Phase 10 — Documentation & auditability polish
Goal: Reference notes complete, reconciliation reports present, warmup documented per indicator.
Static sweep:
- For every entry in `docs/math-sources-of-truth.md` with a reference: confirm `docs/references/<name>.md` exists and includes commit/tag attribution.
- For every reconciled port: confirm `docs/references/reconciliations/<name>.md` exists.
- For every indicator: confirm warmup behavior is documented in the module docstring per `.claude/rules/numerical-rigor.md` → "Warmup rigor".
Severity heuristics:
- Missing reference note for a canonical math entry → P2
- Missing warmup docstring on an indicator → P3 (rollup)
- Missing reconciliation report for a strategy claimed parity-pinned → P2
Phase completion and exit
A phase is "complete" when:
- The static sweep has visited every item in scope for that phase
- Every finding has been written to `findings/` with `status: open` (or `awaiting-human` if it's been seen before)
- The relevant section of `baseline-math-rigor.md` (§3.x and the row in §1) is updated

Update `state.json`: increment `phase`, reset `cursor`, persist atomically.
Baseline completion
When phase 10 finishes:
- Re-run the executive summary in §0 of `baseline-math-rigor.md` from the current `findings/` state.
- Populate §5 with concrete recommendations per group, ordered for unblocking.
- Confirm the §6 hardening gate matches the actual state of the repo (some boxes may already be checked).
- Set `mode` to `baseline-complete-awaiting-remediation`. Set `baseline_completed_at`.
- Write a final run summary noting "baseline sweep complete".
- Tell the user (in the run summary) that the next step is human remediation, and that the nightly cron should not be scheduled until the §6 gate is clear.
============================================================
BUILD ALPHA FUNCTIONALITY VALIDATION MODE
============================================================
This is a one-shot functionality validation pass for the recently added Build Alpha-style research features. It is not a replacement for the numerical-rigor baseline and it must not modify production code. It writes evidence and conclusions only under docs/audits/auto-research/.
The detailed test charter lives at docs/audits/auto-research/build-alpha-functionality-validation.md. Read that file before doing any work.
Purpose
Act like a quant analyst and a system architect reviewing a research workstation:
- Visually inspect the actual UI output where the app is running.
- Capture Playwright screenshots/snapshots of each inspected Build Alpha / Research Lab screen when an existing Playwright runtime or browser automation tool is available.
- Trace the displayed numbers back to Python-owned API output whenever possible.
- Record the exact parameters/configurations tried.
- Judge functional correctness, numerical plausibility, reproducibility, and leakage risk.
- Conclude what should be fixed or built next, ordered by correctness impact.
Accuracy and functional correctness are more important than broad coverage. Do not call a feature "validated" because the screen renders; a feature is validated only when the request, response, displayed values, and domain interpretation all agree.
Preflight
- Read:
  - `docs/audits/auto-research/build-alpha-functionality-validation.md`
  - `docs/architecture/build-alpha-style-features-1-8-research-spec.md`
  - `docs/audits/auto-research/state.json`
  - the latest Build Alpha / Research Lab implementation files discovered by static search
- Capture environment facts in the report:
- current git branch and commit
- service/container status
- frontend URL checked
- backend/Python endpoint health if available
- Playwright/runtime availability and browser/preview availability
- If the app is not running, record visual checks as `not run, service down` and continue with static/API checks that do not require restarting containers.
- If Playwright is unavailable, record visual evidence as `not run, Playwright unavailable` and continue with API/static checks. Do not install Playwright or download browser binaries during the tick.
Validation Targets
Validate only implemented surfaces. Compare the spec's Features 1-8 against the actual app and classify each target as one of:
- `validated`
- `partially validated`
- `not implemented`
- `not run: dependency unavailable`
- `failed functional correctness`
- `failed numerical correctness`
Targets:
- Strategy spec + run ledger / reproducibility.
- Signal catalog and primitive metadata.
- Backtest results workbench: equity, drawdown, trade log, metrics, provenance.
- Walk-forward / OOS validation.
- Monte Carlo risk lab.
- Noise, shifted-data, slippage, and cost perturbation tests.
- Null baselines and distribution comparison.
- Parameter sensitivity and parsimony scoring.
Required Evidence
The report must include:
- Parameter matrix tried: exact symbol, resolution, data window, EMA/RSI parameters, hold/cost/fill settings, seeds, simulation counts, fold settings, perturbation configs, null baseline settings, sensitivity grid.
- Playwright visual evidence: each screen inspected, viewport used, interaction path, screenshot/snapshot path under `docs/audits/auto-research/snapshots/YYYY-MM-DD/`, what looked correct/incorrect, and whether cards/tables/charts were populated with coherent values.
- Numerical trace: for every headline metric shown in the UI, identify the source field from Python/API output and compare UI value vs source value. Use explicit tolerances for display rounding.
- Quant conclusions: whether the result is plausible, overfit-looking, under-sampled, unstable, or merely a plumbing proof.
- Architect recommendations: specific fixes or next builds, ordered by correctness risk and dependency sequence.
- Open risks: anything not proven because data, endpoint, browser, or service availability blocked it.
Playwright Snapshot Requirements
Use Playwright as the preferred visual evidence mechanism. Acceptable sources:
- A Claude/browser automation tool backed by Playwright.
- An already-installed local Node package that resolves `playwright` or `@playwright/test`.
- An already-available `playwright` CLI on PATH.
Do not install packages or browsers. If no acceptable source is available, state that directly in the report.
For each implemented screen, capture at least:
- Initial loaded state.
- Configured EMA control parameters before submit/run.
- Completed results state after the run finishes.
- Any error/empty state encountered.
Use at least desktop viewport 1440x1000. If time permits, also capture mobile/tablet only for layout-breaking evidence; numerical correctness is the priority. Screenshots should be full-page where practical and saved under:
`docs/audits/auto-research/snapshots/YYYY-MM-DD/<feature>-<state>.png`
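A sketch of the snapshot path convention above; using the UTC date is an assumption made here so filenames line up with run summaries regardless of local timezone:

```python
from datetime import datetime, timezone
from pathlib import Path

def snapshot_path(feature: str, state: str,
                  root: str = "docs/audits/auto-research/snapshots") -> Path:
    """Build the dated snapshot path: snapshots/YYYY-MM-DD/<feature>-<state>.png."""
    day = datetime.now(timezone.utc).date().isoformat()
    return Path(root) / day / f"{feature}-{state}.png"
```
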
Default EMA Control Parameters
Use these defaults unless the implemented UI/API only supports a narrower set. Record any substitutions.
- Symbol: `SPY`
- Resolution: 15-minute regular-session bars
- Strategy: EMA crossover
- Fast EMA: 5
- Slow EMA: 10
- RSI filter: off for the baseline run; if supported, also test RSI(14) with thresholds 30/70
- Exit: 5-bar hold or the implemented declarative hold-period exit
- Costs: zero-cost baseline; if supported, also test 1 bps slippage/commission sensitivity
- Seeds: 42 for deterministic validations; 43 as a sanity check that stochastic paths change
- Walk-forward: fixed-spec validation first; optimize-on-train only if explicitly implemented
- Monte Carlo: use the implemented default; if configurable, use 100 simulations for a quick validation and record that it is a smoke-level sample, not a production confidence run
- Sensitivity grid: fast EMA 3-12, slow EMA 8-30; reject `fast >= slow`
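The sensitivity grid with the `fast >= slow` rejection can be sketched as:

```python
def sensitivity_grid(fast_range=range(3, 13), slow_range=range(8, 31)):
    """Enumerate (fast, slow) EMA pairs for the sensitivity sweep,
    rejecting degenerate combinations where fast >= slow."""
    return [(f, s) for f in fast_range for s in slow_range if f < s]
```
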
Output
Create or update today's report:
docs/audits/auto-research/runs/YYYY-MM-DD-build-alpha-functionality-validation.md
At the end:
- Set `state.json.mode` to `build-alpha-validation-complete-awaiting-review`.
- Set `state.json.last_run` to the current ISO timestamp with offset.
- Set `state.json.cursor` to the report path and a one-line verdict.
- Do not change `baseline_completed_at`, `remediation_completed_at`, or historical baseline finding state.
If interrupted, leave mode as build-alpha-validation-pending, update cursor with the last completed target, and write a partial report.
============================================================
RUN SUMMARY TEMPLATE
============================================================
docs/audits/auto-research/runs/YYYY-MM-DD.md:
# Auto-research run — YYYY-MM-DD
**Mode:** baseline-in-progress
**Started:** <iso>
**Ended:** <iso>
**Stop reason:** budget | usage-cap | no-new-findings | manual-interrupt | error
**Phases touched:** [1, 2]
**Cursor at end:** <path or n/a>
## Findings opened
- F-NNNN — Px — area — one-line summary
## Findings updated (last_seen bumped)
- F-NNNN — Px — area — one-line summary
## Findings closed
- F-NNNN — verified by …
## Skipped
- <area / file> — reason (e.g., container down, pending human input)
## Notes for the human
- Anything the human should look at before the next run.
When you don't know
If the state file is missing, malformed, or describes a mode this version of the skill doesn't recognize: stop, write nothing, tell the user. The user owns the loop's lifecycle, not the skill.