abm-reproducibility-checker

Name: Abm Reproducibility Checker
Author: WenyuChiou

Verify another researcher can reproduce a WAGF experiment: manifests, seeds, configs, runnable commands, data provenance vs git blame, figure-script outputs match references. Use when the user says "audit reproducibility", "prepare for submission", "check this experiment folder", or any time a results directory needs a pre-publication integrity sweep.


Updated: April 26, 2026, 03:12


SKILL.md


name: abm-reproducibility-checker
description: Verify another researcher can reproduce a WAGF experiment: manifests, seeds, configs, runnable commands, data provenance vs git blame, figure-script outputs match references. Use when the user says "audit reproducibility", "prepare for submission", "check this experiment folder", or any time a results directory needs a pre-publication integrity sweep.

WAGF: ABM Reproducibility Checker

Verify that an outside researcher can reproduce a Water Agent Governance Framework (WAGF) experiment from the artefacts on disk.

This skill exists because two real failure patterns hit the project:

Priority-schema confound (2026-04-19): a default-off CLI flag was silently injected into a subset of runs, shifting Y1 action distributions by >40 pp. Caught only after a manual spot-check.
v21 dir-naming-vs-code-state mismatch (2026-04-25): 3 of 5 baseline seeds were generated before the v21 code fix landed, despite the directory being named production_v21_*. The paper text claimed n=5 v21 data; the actual disk state was 3 v20 + 2 v21. Caught only by cross-referencing per-seed trace timestamps against git blame on irrigation_env.py.

This skill catches that class of failure before submission.

When to Use

Load this skill when the user says any of:

"Audit this experiment for reproducibility."
"Prepare this folder for paper submission."
"Check whether seed X was actually generated by the code we cite."
"Verify the figure script and the underlying data are consistent."
"Run a pre-submission reproducibility sweep."

Do NOT use this skill for:

Generic literature audits → research-hub.
Manuscript prose review → academic-writing-skills.
Citation verification → verify-references.

Inputs

The user must supply ONE of:

A path to a results directory (e.g., examples/irrigation_abm/results/production_v21_42yr_seed46/).
A path to a results-tree root (e.g., examples/single_agent/results/JOH_FINAL_v2/gemma4_e4b/) which the skill will recurse.
A paper Methods section + figure script paths to cross-validate.

If none is supplied, ask for the path. Do not invent one.

Workflow

1. Manifest sweep: for every <run_dir>/reproducibility_manifest.json under the input path, parse and record model, seed, git_commit, temperature, top_p, num_ctx, num_predict, thinking_mode, governance_profile, agent_types_config, config_hash, and timestamp. Schema reference: broker/core/experiment_runner.py:281 (_collect_reproducibility_metadata).
2. Code-vs-data cross-reference: for each manifest with a git_commit field, run git log --follow <relevant_code_path> and confirm that the data's trace timestamps fall after the date of the commit that introduced the relied-upon behaviour. The <relevant_code_path> is inferred from agent_types_config (e.g., irrigation runs depend on examples/irrigation_abm/irrigation_env.py).
3. Sentinel sweep: call broker/components/analytics/audit.py:detect_audit_sentinels_in_csv on every audit CSV in the tree. Flag any column that holds a single value in ≥80% of rows unless it is on the documented reserved list in broker/INVARIANTS.md (Invariant 4).
4. Figure-script trace: for every numeric claim in the supplied paper text or in any paper/nature_water/scripts/gen_*.py script, identify the exact CSV / JSONL file that the script reads. Confirm that the file exists, contains at least one row, and has an mtime later than the most recent relevant code commit.
5. Command runnability: read the .bat / shell scripts that produced the data (often referenced in the manifest's notes or stored beside the results dir). Verify that every flag used by each command exists in the entry-point parser (e.g., check against run_experiment.py --help).
6. Test-suite gate: run pytest broker/ tests/ and report any failed or skipped tests with their messages.
7. Write report: emit the three files described under Outputs.
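
The manifest sweep and the timestamp comparison in the code-vs-data cross-reference can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the bundled scripts/repro_audit.py: the function names sweep_manifests and classify_trace and the PRE_FIX label are hypothetical, the field list is copied from the manifest sweep step, and timestamps are assumed to be ISO 8601 strings that are either all naive or all timezone-aware.

```python
import json
from datetime import datetime
from pathlib import Path

# Fields the manifest sweep step says to record from each manifest.
EXPECTED_FIELDS = (
    "model", "seed", "git_commit", "temperature", "top_p", "num_ctx",
    "num_predict", "thinking_mode", "governance_profile",
    "agent_types_config", "config_hash", "timestamp",
)


def sweep_manifests(root):
    """Return one record per reproducibility_manifest.json found under root."""
    records = []
    for path in sorted(Path(root).rglob("reproducibility_manifest.json")):
        run_dir = str(path.parent)
        try:
            data = json.loads(path.read_text(encoding="utf-8"))
        except (OSError, json.JSONDecodeError) as exc:
            # Unreadable or malformed manifests are reported, not skipped.
            records.append({"run_dir": run_dir, "error": str(exc)})
            continue
        if not isinstance(data, dict):
            records.append({"run_dir": run_dir, "error": "manifest is not a JSON object"})
            continue
        records.append({
            "run_dir": run_dir,
            "fields": {key: data.get(key) for key in EXPECTED_FIELDS},
            "missing": [key for key in EXPECTED_FIELDS if key not in data],
        })
    return records


def classify_trace(first_trace_ts, code_commit_date):
    """Compare the earliest trace timestamp with the relevant commit date.

    POST_FIX_OK matches the inventory example in this document;
    PRE_FIX is a hypothetical label for the failing case.
    """
    trace = datetime.fromisoformat(first_trace_ts)
    commit = datetime.fromisoformat(code_commit_date)
    return "POST_FIX_OK" if trace > commit else "PRE_FIX"
```

Listing missing fields explicitly, rather than silently defaulting them, keeps the sweep aligned with the refusal rule that a missing git_commit must be stated. Obtaining the commit date itself (for example from git log on the relevant code path) is left out of the sketch so that the comparison stays a pure function.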

Outputs

Write three files under analysis/reproducibility/ (create if missing):

reproducibility_report.md, a one-page summary with:
- Inputs surveyed (paths, run counts, manifest count)
- GREEN / YELLOW / RED verdict
- Numbered findings (each with severity, evidence, recommended fix)
- Reproducible command list (so the next reader can re-run the audit)
- Caveats (what was NOT checked)

artifact_inventory.yml — machine-readable inventory:

runs:
  - path: examples/irrigation_abm/results/production_v21_42yr_seed46
    manifest_present: true
    manifest_git_commit: 4be5092
    data_first_trace_ts: 2026-03-04T05:57:00
    data_last_trace_ts: 2026-03-04T16:05:35
    relevant_code_commit_date: 2026-03-03T19:38:27
    verdict: POST_FIX_OK
sentinel_findings:
  - csv: ...
    column: cog_is_novel_state
    constancy_rate: 1.0
    status: RESERVED  # per INVARIANTS.md Invariant 4

missing_repro_steps.md — TODO list of human actions needed to close any RED finding.
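
The constancy_rate values recorded under sentinel_findings can be illustrated with a small standard-library sketch. It is a stand-in for, not a copy of, detect_audit_sentinels_in_csv, whose actual behaviour is not shown here. The names constancy_rates and flag_sentinels and the FLAGGED status are hypothetical; the 0.8 threshold and the RESERVED status come from the sentinel sweep step and the inventory example.

```python
import csv
import io
from collections import Counter


def constancy_rates(csv_text):
    """Map each column to the share of rows holding its most common value."""
    reader = csv.DictReader(io.StringIO(csv_text))
    rows = list(reader)
    if not rows:
        return {}
    rates = {}
    for column in reader.fieldnames:
        counts = Counter(row.get(column) for row in rows)
        rates[column] = counts.most_common(1)[0][1] / len(rows)
    return rates


def flag_sentinels(csv_text, reserved=frozenset(), threshold=0.8):
    """Return findings for columns at or above the constancy threshold.

    Columns named in `reserved` (assumed to be loaded from the Invariant 4
    list by the caller) are marked RESERVED; all others are marked FLAGGED.
    """
    findings = []
    for column, rate in constancy_rates(csv_text).items():
        if rate >= threshold:
            status = "RESERVED" if column in reserved else "FLAGGED"
            findings.append(
                {"column": column, "constancy_rate": rate, "status": status}
            )
    return findings
```

For example, a column whose value is identical in every row gets a constancy_rate of 1.0, which is the case shown for cog_is_novel_state in the inventory example.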

Refusal Protocol (mandatory)

The skill MUST refuse to:

Declare a result reproducible without a git_commit field in the manifest AND a successful code-vs-data cross-reference. State the missing field explicitly.
Treat a constant-column sentinel as benign unless it is on the broker/INVARIANTS.md Invariant 4 reserved list. Otherwise flag.
Invent commit hashes, file paths, or seed numbers the user did not supply. Ask via clarification.
Generate a "GREEN" verdict if any test in pytest broker/ tests/ fails. Demote to YELLOW with the failure list.
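
One way to encode these refusal rules is a small verdict function that can only return GREEN when no rule is violated. The sketch below is an assumption-laden illustration: decide_verdict and the run-record keys path, git_commit, and cross_reference_ok are hypothetical names, and treating a missing git_commit as RED is an assumption, since the protocol only states that such a result must not be declared reproducible.

```python
# Higher number means a worse verdict.
SEVERITY = {"GREEN": 0, "YELLOW": 1, "RED": 2}


def decide_verdict(runs, failed_tests):
    """Return (verdict, findings) for a list of run records.

    Each run is a dict with a 'path', an optional 'git_commit', and an
    optional boolean 'cross_reference_ok' (all hypothetical key names).
    """
    findings = []
    for run in runs:
        if not run.get("git_commit"):
            # Assumption: a missing commit is treated as RED.
            findings.append(("RED", f"{run['path']}: manifest has no git_commit field"))
        elif not run.get("cross_reference_ok", False):
            findings.append(("RED", f"{run['path']}: code-vs-data cross-reference did not pass"))
    for test_name in failed_tests:
        # Per the protocol, any failing test rules out GREEN.
        findings.append(("YELLOW", f"test failed: {test_name}"))
    verdict = max(
        (severity for severity, _ in findings),
        key=SEVERITY.get,
        default="GREEN",
    )
    return verdict, findings
```

Because the verdict is derived from the worst finding, a failing test alone yields YELLOW, while any provenance problem dominates and yields RED.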

Output structure contract

Every reproducibility_report.md MUST contain these five sections in order, with these exact headings:

## Scope (paths surveyed, run counts)
## Verdict (GREEN / YELLOW / RED + one-sentence rationale)
## Findings (numbered list, each with severity + evidence + fix)
## Reproducible Command List (so a peer can re-run this audit)
## Caveats (what was NOT checked, e.g., binary dependencies, network-only data, private model weights)

Never collapse these sections into prose. Never hide ambiguity.
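
The heading contract lends itself to a mechanical check. The following sketch assumes that the exact headings are the text before each parenthetical description (for example "## Scope"), which is an interpretation rather than something the contract spells out; check_report_structure is a hypothetical helper name.

```python
# Assumed exact headings: the parenthetical notes are read as descriptions.
REQUIRED_HEADINGS = [
    "## Scope",
    "## Verdict",
    "## Findings",
    "## Reproducible Command List",
    "## Caveats",
]


def check_report_structure(report_text):
    """Return a list of problems; an empty list means the contract holds."""
    headings = [
        line.strip()
        for line in report_text.splitlines()
        if line.startswith("## ")
    ]
    problems = [
        f"missing heading: {heading}"
        for heading in REQUIRED_HEADINGS
        if heading not in headings
    ]
    # Compare the order in which required headings appear with the contract order.
    present = [heading for heading in headings if heading in REQUIRED_HEADINGS]
    expected = [heading for heading in REQUIRED_HEADINGS if heading in present]
    if present != expected:
        problems.append("headings are out of order or duplicated")
    return problems
```

Running this against a generated reproducibility_report.md before writing it to disk would catch a collapsed or reordered report early.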

Bundled resources

references/checklist.md — the full pre-submission checklist with YES/NO questions per artefact category.
references/manifest_schema.md — the canonical reproducibility_manifest.json schema with field meanings and required-vs-optional flags.
references/known_failure_patterns.md — catalogued historical failures (priority-schema, v21 dir-name mismatch) with detection recipes. Update this file as new failure patterns are discovered.
scripts/repro_audit.py — runnable Python wrapper that performs the manifest sweep + sentinel sweep + git cross-reference and emits the three output artefacts.

Acceptance criteria

The skill is ready when:

It produces all three output files for any non-empty results dir.
It correctly flags examples/irrigation_abm/results/_archive_pre_v21fix_2026-04-24/ as RED with reason "data trace_ts < relevant code commit date".
It correctly classifies examples/irrigation_abm/results/production_v21_42yr_seed46/ as GREEN.
The five-section output contract is followed in every report.

related-skills.json

Same repository

llm-agent-audit-trace-analyzer.md

from "WenyuChiou/WAGF"

Turn raw WAGF audit traces (household_governance_audit.csv + raw/*.jsonl) into paper-ready governance metrics — IBR, EHE, rejection taxonomy, retry outcomes, model-condition comparisons. Use when the user says "analyze these traces", "compute governance metrics", "summarize rejection and retry outcomes", or hands over a results directory and asks "what does this say".

2026-05-26

wagf-domain-builder.md

from "WenyuChiou/WAGF"

Walk a researcher (PhD, collaborator, lab-mate) through building their first single-agent WAGF domain — from "I have a research question + maybe an external model" to "I have a working WAGF experiment producing audit traces." Conducts a structured S0-S7 interview, invokes `broker.tools.scaffold_domain` at S4, guides 4 surgical edits in S5, and runs `broker.tools.validate_prompt` after every change. Hands off to `wagf-coupling-designer` for any coupling work and to `wagf-experiment-designer` / `abm-reproducibility-checker` once the domain runs green. Use when the user says "I want to build a WAGF model for <my domain>", "help me set up a new domain", "I'm new to WAGF and have a research question", or "scaffold a domain from scratch".

2026-05-26

model-coupling-contract-checker.md

from "WenyuChiou/WAGF"

Verify the contract between WAGF/ABM agents and an external model (flood, hydrology, irrigation, seismic, catastrophe) — units, time steps, state mutation direction, feedback-loop double-counting. Use when the user says "check ABM-model coupling", "audit feedback loop", "verify units between WAGF and X model", or asks to confirm an external-model integration is safe.

2026-05-17

wagf-coupling-designer.md

from "WenyuChiou/WAGF"

Walk a researcher through designing the LLM↔external-model interface — decision flow IN, observation flow OUT — for a single-agent WAGF domain. Emits a coupling contract, a working mock adapter, and a pattern-specific real-model adapter scaffold so the WAGF side can be built and smoke-tested BEFORE the real model is wired in. Use when the user says "I want to couple my LLM agents to <my simulator>", "help me design the WAGF↔X interface", "scaffold the external model adapter", "draft a coupling contract", "I have a Python / R / CSV-based model and want WAGF to drive it". Sister skill to `model-coupling-contract-checker` (which AUDITS existing contracts; this one DESIGNS new ones).

2026-05-17

wagf-quickstart.md

from "WenyuChiou/WAGF"

First-time WAGF setup walkthrough — environment check, smoke test, first experiment, and handoff to the four lifecycle skills. Use when the user says "I just cloned WAGF", "set up WAGF", "first WAGF run", "I'm new to this", "where do I start with WAGF", or opens a Claude Code session in a freshly-cloned WAGF repo without a clear task.

2026-04-26

wagf-experiment-designer.md

from "WenyuChiou/WAGF"

Turn a WAGF research question into a reproducible experiment matrix (model × governance × seed × metric × artefact path). Use when the user says "design an experiment", "plan an ablation", "compare strict vs disabled", "set up cross-model evaluation", or wants a runnable matrix written to .research/.

2026-04-26

package.json

"author": "WenyuChiou"

"repository": "WenyuChiou/WAGF"


