| name | abm-reproducibility-checker |
| description | Verify another researcher can reproduce a WAGF experiment — manifests, seeds, configs, runnable commands, data provenance vs git blame, figure-script outputs match references. Use when the user says "audit reproducibility", "prepare for submission", "check this experiment folder", or any time a results directory needs a pre-publication integrity sweep. |
WAGF: ABM Reproducibility Checker
Verify that an outside researcher can reproduce a Water Agent Governance
Framework (WAGF) experiment from the artefacts on disk.
This skill exists because two real failure patterns hit the project:
- Priority-schema confound (2026-04-19): a default-off CLI flag was
silently injected into a subset of runs, shifting Y1 action
distributions by >40 pp. Caught only after a manual spot-check.
- v21 dir-naming-vs-code-state mismatch (2026-04-25): 3 of 5
baseline seeds were generated before the v21 code fix landed,
despite the directory being named production_v21_*. The paper text
claimed n=5 v21 data; the actual disk state was 3 v20 + 2 v21. Caught
only by cross-referencing per-seed trace timestamps against
git blame on irrigation_env.py.
This skill catches that class of failure before submission.
When to Use
Load this skill when the user says any of:
- "Audit this experiment for reproducibility."
- "Prepare this folder for paper submission."
- "Check whether seed X was actually generated by the code we cite."
- "Verify the figure script and the underlying data are consistent."
- "Run a pre-submission reproducibility sweep."
Do NOT use this skill for:
- Generic literature audits → research-hub.
- Manuscript prose review → academic-writing-skills.
- Citation verification → verify-references.
Inputs
The user must supply ONE of:
- A path to a results directory (e.g.,
examples/irrigation_abm/results/production_v21_42yr_seed46/).
- A path to a results-tree root (e.g.,
examples/single_agent/results/JOH_FINAL_v2/gemma4_e4b/), which the
skill will search recursively.
- A paper Methods section + figure script paths to cross-validate.
If none is supplied, ask for the path. Do not invent one.
Workflow
- Manifest sweep. For every <run_dir>/reproducibility_manifest.json
under the input path, parse and record: model, seed, git_commit,
temperature, top_p, num_ctx, num_predict, thinking_mode,
governance_profile, agent_types_config, config_hash,
timestamp. Schema reference at
broker/core/experiment_runner.py:281 (_collect_reproducibility_metadata).
- Code-vs-data cross-reference. For each manifest with a
git_commit field, run git log --follow <relevant_code_path> and
confirm the data's trace timestamp falls AFTER the date of the commit
that introduced the relied-upon behaviour. The <relevant_code_path> is
inferred from agent_types_config (e.g., irrigation runs depend on
examples/irrigation_abm/irrigation_env.py).
- Sentinel sweep. Call
broker/components/analytics/audit.py:detect_audit_sentinels_in_csv
on every audit CSV in the tree. Any column that holds a single value in
≥80% of rows is flagged unless it is on the documented reserved
list in broker/INVARIANTS.md (Invariant 4).
- Figure-script trace. For every numeric claim in the supplied
paper text or any paper/nature_water/scripts/gen_*.py, identify the
exact CSV / JSONL the script reads. Confirm the file exists, the
row count is non-zero, and its mtime post-dates the most recent
relevant code commit.
- Command runnability. Read the bat / shell scripts that produced
the data (often referenced in the manifest's notes or beside the
results dir). Verify each command's flags exist in the entry-point
parser (e.g., run_experiment.py --help).
- Test-suite gate. Run pytest broker/ tests/ and report any
failures or skipped tests with their messages.
- Write report.
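The manifest sweep in the first step can be sketched as follows. This is a minimal illustration, not the project's implementation: the helper name sweep_manifests is hypothetical, and the field list is copied from the step above. Which fields are required rather than optional is defined in references/manifest_schema.md, so the sketch only reports absent fields and does not judge them.

```python
import json
from pathlib import Path

# Field names as listed in the manifest-sweep step. Required-vs-optional
# status is NOT known here, so absences are reported, not judged.
EXPECTED_FIELDS = (
    "model", "seed", "git_commit", "temperature", "top_p", "num_ctx",
    "num_predict", "thinking_mode", "governance_profile",
    "agent_types_config", "config_hash", "timestamp",
)


def sweep_manifests(root):
    """Collect every reproducibility_manifest.json under root.

    Hypothetical helper: returns one record per manifest, in sorted path
    order, with the recorded field values and the names of absent fields.
    Unreadable or malformed manifests produce an 'error' record instead
    of being skipped silently.
    """
    records = []
    for path in sorted(Path(root).rglob("reproducibility_manifest.json")):
        run_dir = str(path.parent)
        try:
            manifest = json.loads(path.read_text(encoding="utf-8"))
        except (OSError, json.JSONDecodeError) as exc:
            records.append({"run_dir": run_dir, "error": str(exc)})
            continue
        if not isinstance(manifest, dict):
            records.append({"run_dir": run_dir, "error": "manifest is not a JSON object"})
            continue
        records.append({
            "run_dir": run_dir,
            "fields": {name: manifest.get(name) for name in EXPECTED_FIELDS},
            "missing": [name for name in EXPECTED_FIELDS if name not in manifest],
        })
    return records
```

Keeping malformed manifests in the output as error records means the report can list them as findings rather than letting a run drop out of the count unnoticed.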
Outputs
Write three files under analysis/reproducibility/ (create if missing):
Refusal Protocol (mandatory)
The skill MUST refuse to:
- Declare a result reproducible without a git_commit field in the
manifest AND a successful code-vs-data cross-reference. State the
missing field explicitly.
- Treat a constant-column sentinel as benign unless it is on the
broker/INVARIANTS.md Invariant 4 reserved list. Otherwise flag.
- Invent commit hashes, file paths, or seed numbers the user did not
supply. Ask the user for clarification instead.
- Generate a "GREEN" verdict if any test in pytest broker/ tests/
fails. Demote to YELLOW with the failure list.
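The sentinel rule above can be illustrated with a standalone sketch. It does not reproduce detect_audit_sentinels_in_csv, whose signature is not shown in this document. The function name flag_constant_columns is hypothetical, "constant" is assumed to mean that one value covers at least 80% of rows, and the reserved set must be supplied by the caller from broker/INVARIANTS.md rather than guessed.

```python
import csv
import io
from collections import Counter


def flag_constant_columns(csv_text, reserved=frozenset(), threshold=0.8):
    """Return {column: (dominant_value, share)} for suspicious columns.

    Hypothetical helper. A column is flagged when its most common value
    covers at least `threshold` of the rows and the column name is not in
    `reserved` (the caller-supplied Invariant 4 list).
    """
    reader = csv.DictReader(io.StringIO(csv_text))
    rows = list(reader)
    if not rows:
        return {}
    flagged = {}
    for column in reader.fieldnames:
        if column in reserved:
            continue
        # Dominant value and the fraction of rows that carry it.
        value, count = Counter(row[column] for row in rows).most_common(1)[0]
        share = count / len(rows)
        if share >= threshold:
            flagged[column] = (value, share)
    return flagged
```

Passing the reserved names explicitly keeps the refusal rule visible at the call site: a column is treated as benign only because the caller named it, never by default.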
Output structure contract
Every reproducibility_report.md MUST contain these five sections in
order, with these exact headings:
## Scope (paths surveyed, run counts)
## Verdict (GREEN / YELLOW / RED + one-sentence rationale)
## Findings (numbered list, each with severity + evidence + fix)
## Reproducible Command List (so a peer can re-run this audit)
## Caveats (what was NOT checked, e.g., binary dependencies,
network-only data, private model weights)
Never collapse these sections into prose. Never hide ambiguity.
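A report can be checked against this contract mechanically. The sketch below is one possible check, with a hypothetical function name, and it assumes the exact heading text is the part before each parenthetical description (for example "## Scope"). It does not skip fenced code blocks, so a line starting with "## " inside a fence would be counted as a heading.

```python
REQUIRED_HEADINGS = [
    "## Scope",
    "## Verdict",
    "## Findings",
    "## Reproducible Command List",
    "## Caveats",
]


def contract_violations(report_text):
    """List problems with the level-2 heading sequence.

    Hypothetical helper: an empty list means all five required headings
    are present exactly once and in the required order. Additional
    level-2 headings are ignored (an assumption; the contract does not
    say whether extras are allowed).
    """
    found = [line.rstrip() for line in report_text.splitlines()
             if line.startswith("## ")]
    problems = [f"missing heading: {h}" for h in REQUIRED_HEADINGS
                if h not in found]
    # Only check ordering once every heading is known to be present.
    present = [h for h in found if h in REQUIRED_HEADINGS]
    if not problems and present != REQUIRED_HEADINGS:
        problems.append("headings out of order or duplicated")
    return problems
```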
Bundled resources
- references/checklist.md: the full pre-submission checklist with
YES/NO questions per artefact category.
- references/manifest_schema.md: the canonical
reproducibility_manifest.json schema with field meanings and
required-vs-optional flags.
- references/known_failure_patterns.md: catalogued historical
failures (priority-schema, v21 dir-name mismatch) with detection
recipes. Update this file as new failure patterns are discovered.
- scripts/repro_audit.py: runnable Python wrapper that performs the
manifest sweep + sentinel sweep + git cross-reference and emits the
three output artefacts.
Acceptance criteria
The skill is ready when:
- It produces all three output files for any non-empty results dir.
- It correctly flags
examples/irrigation_abm/results/_archive_pre_v21fix_2026-04-24/
as RED with reason "data trace_ts < relevant code commit date".
- It correctly classifies
examples/irrigation_abm/results/production_v21_42yr_seed46/ as GREEN.
- The five-section output contract is followed in every report.
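The RED condition in the second criterion reduces to a timestamp comparison, sketched below under stated assumptions. The function names are hypothetical. Both inputs are assumed to be ISO 8601 strings, for instance a trace timestamp read from the run and a commit date obtained with git log -1 --format=%cI <commit>, and a timestamp with no UTC offset is assumed to be UTC. The fix commit itself must come from the user or from git history, never from a guess.

```python
from datetime import datetime, timezone

RED_REASON = "data trace_ts < relevant code commit date"


def _as_utc(stamp):
    """Parse an ISO 8601 string; naive values are assumed to be UTC."""
    parsed = datetime.fromisoformat(stamp)
    if parsed.tzinfo is None:
        parsed = parsed.replace(tzinfo=timezone.utc)
    return parsed.astimezone(timezone.utc)


def check_trace_against_fix(trace_ts, fix_commit_date):
    """Return a RED finding if the trace predates the fix commit, else None.

    Hypothetical helper. The comparison is strict, matching the reason
    string: a trace stamped exactly at the commit date is not flagged.
    """
    if _as_utc(trace_ts) < _as_utc(fix_commit_date):
        return {
            "severity": "RED",
            "reason": RED_REASON,
            "evidence": {"trace_ts": trace_ts, "commit_date": fix_commit_date},
        }
    return None
```

Normalising both values to UTC before comparing matters when traces and commits were produced on machines in different time zones. Comparing the raw strings would misorder them.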