llm-agent-audit-trace-analyzer

Name: Llm Agent Audit Trace Analyzer
Author: WenyuChiou

// Turn raw WAGF audit traces (household_governance_audit.csv + raw/*.jsonl) into paper-ready governance metrics — IBR, EHE, rejection taxonomy, retry outcomes, model-condition comparisons. Use when the user says "analyze these traces", "compute governance metrics", "summarize rejection and retry outcomes", or hands over a results directory and asks "what does this say".

Updated: May 26, 2026, 14:38

SKILL.md

name	llm-agent-audit-trace-analyzer
description	Turn raw WAGF audit traces (household_governance_audit.csv + raw/*.jsonl) into paper-ready governance metrics — IBR, EHE, rejection taxonomy, retry outcomes, model-condition comparisons. Use when the user says "analyze these traces", "compute governance metrics", "summarize rejection and retry outcomes", or hands over a results directory and asks "what does this say".

WAGF: LLM Agent Audit Trace Analyzer

Convert raw WAGF audit artefacts into paper-ready governance metrics and diagnostic tables. The skill is a thin orchestration layer over existing, validated production scripts. It never reimplements the formulas.

When to Use

Load this skill when the user says any of:

"Analyze these WAGF audit traces."
"Compute governance metrics from JSONL logs."
"Summarize rejection and retry outcomes by model and condition."
"Give me an IBR / EHE table for these runs."
"What does this audit log say about validator coverage?"

Do NOT use this skill for:

Reproducibility verification → abm-reproducibility-checker.
Designing a NEW experiment → wagf-experiment-designer.
Generic CSV summarization → data-analyst.

Inputs

The user must supply ONE of:

A path to a single run directory (contains household_governance_audit.csv + raw/*.jsonl).
A path to a results-tree root (e.g. examples/single_agent/results/JOH_FINAL_v2/) that the skill recurses to find runs.
An explicit list of (model, condition, seed) triples.

If nothing is supplied, ask the user. Do not invent a path.
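As a rough illustration of how a supplied path could be resolved into run directories, the sketch below walks the tree and keeps every directory that holds a non-empty household_governance_audit.csv next to at least one raw/*.jsonl file. The function name `discover_runs` and the returned `(runs, skipped)` pair are assumptions made for this example; they are not taken from scripts/run_analyzer.py.

```python
from pathlib import Path

AUDIT_CSV = "household_governance_audit.csv"


def discover_runs(root):
    """Split candidate run dirs under `root` into usable and skipped ones.

    Hypothetical helper: a directory counts as a run when it holds a
    non-empty audit CSV file and at least one raw/*.jsonl trace.
    """
    root = Path(root)
    if not root.is_dir():
        # Mirror the rule above: never guess a path, report the problem.
        raise FileNotFoundError(f"results path does not exist: {root}")

    runs, skipped = [], []
    # rglob also matches a CSV sitting directly in `root`, so a single
    # run directory and a results-tree root are handled the same way.
    for csv_path in sorted(root.rglob(AUDIT_CSV)):
        run_dir = csv_path.parent
        csv_has_bytes = csv_path.stat().st_size > 0
        has_raw = any((run_dir / "raw").glob("*.jsonl"))
        (runs if csv_has_bytes and has_raw else skipped).append(run_dir)
    return runs, skipped
```

Returning the skipped directories, rather than dropping them, keeps partial runs visible so they can be listed as caveats later.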

Workflow

1. Discover runs: walk the input path; for each run dir, record path, model, seed, condition (governed / disabled), and confirm the audit CSV is non-empty.
2. Per-run metrics (delegate to existing scripts; do NOT reimplement):
   - IBR = R1+R3+R4 fraction → use examples/single_agent/analysis/gemma4_nw_crossmodel_analysis.py:compute_ibr_components (canonical formula, post-test-fix d8a5da4).
   - EHE = normalized Shannon entropy → use examples/irrigation_abm/analysis/nw_bootstrap_ci.py:shannon_entropy (or for irrigation, examples/irrigation_abm/analysis/compute_ibr.py).
   - Sentinel sweep → use broker/components/analytics/audit.py:detect_audit_sentinels_in_csv.
   - Temporal diagnostics M1/M2/M3 → call examples/single_agent/analysis/compute_temporal_diagnostics.py with --gemma4-pipeline {v1,v2} flag matched to the dataset.
   - Retry compliance Pattern A/B/C → call examples/single_agent/analysis/compute_retry_compliance.py.
3. Aggregate: group by (model, condition); compute mean ± s.d. across seeds; paired t-test for governed-vs-disabled; 95% CI.
4. Detect anomalies: any sentinel-flagged column not on the broker/INVARIANTS.md Invariant 4 reserved list → flag in caveats.
5. Write artefacts (see Outputs).
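The aggregation step can be pictured with a small standard-library sketch. It assumes the per-run metrics have already been produced by the canonical scripts and are available as dictionaries with `model`, `condition`, `seed`, and a metric key; the function names `summarize` and `paired_difference` are hypothetical. It stops at the paired t statistic: the p-value and 95% CI need a t distribution, for example via `scipy.stats.ttest_rel`, which is left out here to keep the example dependency-free.

```python
import math
import statistics
from collections import defaultdict


def summarize(rows, metric="ibr"):
    """Mean and sample s.d. of `metric` per (model, condition) across seeds."""
    groups = defaultdict(list)
    for row in rows:
        groups[(row["model"], row["condition"])].append(row[metric])
    return {
        key: (
            statistics.mean(vals),
            statistics.stdev(vals) if len(vals) > 1 else float("nan"),
        )
        for key, vals in groups.items()
    }


def paired_difference(rows, model, metric="ibr"):
    """Paired t statistic for governed minus disabled, matched on seed."""
    by_cond = {"governed": {}, "disabled": {}}
    for row in rows:
        if row["model"] == model and row["condition"] in by_cond:
            by_cond[row["condition"]][row["seed"]] = row[metric]

    # Only seeds present in both conditions form a pair.
    seeds = sorted(set(by_cond["governed"]) & set(by_cond["disabled"]))
    diffs = [by_cond["governed"][s] - by_cond["disabled"][s] for s in seeds]
    n = len(diffs)
    if n < 2:
        raise ValueError("need at least two paired seeds for a t statistic")

    mean_diff = statistics.mean(diffs)
    std_err = statistics.stdev(diffs) / math.sqrt(n)
    return {
        "n_pairs": n,
        "mean_diff": mean_diff,
        "t_stat": mean_diff / std_err if std_err > 0 else float("nan"),
        "df": n - 1,
        # Fewer than 3 pairs must be reported with an explicit warning.
        "warning": "n < 3 paired observations" if n < 3 else None,
    }
```

Pairing on seed, instead of comparing group means, is what makes the test paired; seeds that appear in only one condition are excluded and would belong in the caveats.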

Outputs

Write all under analysis/ (create if missing):

governance_metrics.csv — long-format table with rows = (model, condition, seed) and columns = (n_decisions, ibr, r1, r3, r4, ehe, rejection_rate, retry_rate, m1_rate, m2_rate, m3_rate).
governance_summary.md — structured paper-ready report (see contract below).
rejection_taxonomy.csv — per (model, condition, validator_rule) rejection counts and percentages.
retry_outcomes.csv — Pattern A/B/C distribution per (model, condition).
model_condition_comparison.md — pairwise governed-vs-disabled paired t-tests with ΔIBR, ΔEHE, 95% CI, p-value per model.
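To make the governance_metrics.csv layout concrete, here is a minimal sketch that writes one row per (model, condition, seed) with the metric columns listed above. The helper name `write_governance_metrics` is an assumption for illustration, and the metric values are expected to come from the canonical scripts rather than from this code.

```python
import csv
from pathlib import Path

KEY_COLUMNS = ["model", "condition", "seed"]
METRIC_COLUMNS = [
    "n_decisions", "ibr", "r1", "r3", "r4", "ehe",
    "rejection_rate", "retry_rate", "m1_rate", "m2_rate", "m3_rate",
]


def write_governance_metrics(rows, out_dir="analysis"):
    """Write per-run metric rows to <out_dir>/governance_metrics.csv."""
    out_dir = Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)  # create analysis/ if missing
    out_path = out_dir / "governance_metrics.csv"
    with out_path.open("w", newline="", encoding="utf-8") as handle:
        # DictWriter rejects unexpected keys, so a stray column fails loudly;
        # keys missing from a row are written as empty cells.
        writer = csv.DictWriter(handle, fieldnames=KEY_COLUMNS + METRIC_COLUMNS)
        writer.writeheader()
        writer.writerows(rows)
    return out_path
```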

Output structure contract

governance_summary.md MUST contain these four sections in order with these exact headings:

## Scope — paths analysed, run counts per (model, condition)
## Headline metrics — one short narrative paragraph (≤120 words) stating the headline finding (e.g., "Validators reduced IBR from X% to Y% across N models; the reduction was significant (p<…) in K of N").
## Metrics table — the model × condition × metric matrix in markdown table form.
## Caveats — explicit list of: missing seeds, sentinel-flagged columns, partial runs, datasets that mix code versions (cross-link to abm-reproducibility-checker if any).

The report must also end with a fifth section, ## Reproducible command list, so the next reader can re-run the analysis.

Never collapse these into prose. Never bury caveats inside the narrative.
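A contract like this is easy to check mechanically. The sketch below is one possible check, with the hypothetical name `check_summary_structure`; it treats the four required headings plus the closing command list as the expected order and reports anything missing, repeated, or misplaced.

```python
REQUIRED_HEADINGS = [
    "## Scope",
    "## Headline metrics",
    "## Metrics table",
    "## Caveats",
    "## Reproducible command list",
]


def check_summary_structure(markdown_text):
    """Return a list of contract problems; an empty list means it passes."""
    headings = [
        line.strip()
        for line in markdown_text.splitlines()
        if line.startswith("## ")
    ]
    problems = [
        f"missing heading: {h}" for h in REQUIRED_HEADINGS if h not in headings
    ]
    # Compare the required headings that are present against canonical order.
    present = [h for h in headings if h in REQUIRED_HEADINGS]
    expected = [h for h in REQUIRED_HEADINGS if h in present]
    if present != expected:
        problems.append("required headings are out of order or repeated")
    return problems
```

Running this on governance_summary.md before reporting success is a cheap way to catch a report whose caveats were folded into the narrative.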

Refusal Protocol

The skill MUST refuse to:

Invent a metric formula. If a metric the user asks for is not covered by the existing compute_*.py scripts, say so explicitly and ask whether to add a new script (out-of-scope for this skill).
Pool seeds across model versions or code versions without flagging. When pooling, the caveats section must list every distinct git commit observed across the pooled manifests.
Report a paired t-test p-value when n < 3 paired observations without an explicit warning.
Drop sentinel-flagged columns silently. Always list them in caveats.
Compare runs whose agent_types_config paths differ without a warning.
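Several of these rules reduce to simple pre-flight checks. The sketch below shows one way they might be wired up. It assumes, purely for illustration, that the covered metrics are the columns of governance_metrics.csv and that commit hashes and sentinel-flagged column names have already been collected; `refusal_checks` is a hypothetical name.

```python
KNOWN_METRICS = {
    "ibr", "r1", "r3", "r4", "ehe",
    "rejection_rate", "retry_rate", "m1_rate", "m2_rate", "m3_rate",
}


def refusal_checks(requested_metrics, n_pairs, commits, sentinel_columns):
    """Return (refusals, caveats) for one analysis request.

    refusals: requests that must be declined rather than improvised.
    caveats:  items that must be listed in the Caveats section.
    """
    refusals = [
        f"unknown metric '{name}': no existing script covers it"
        for name in requested_metrics
        if name not in KNOWN_METRICS
    ]

    caveats = []
    if n_pairs < 3:
        caveats.append(f"paired t-test based on only {n_pairs} paired observations")
    distinct_commits = sorted(set(commits))
    if len(distinct_commits) > 1:
        caveats.append("pooled runs span commits: " + ", ".join(distinct_commits))
    for column in sentinel_columns:
        caveats.append(f"sentinel-flagged column: {column}")
    return refusals, caveats
```

Keeping refusals and caveats as separate lists reflects the distinction in the protocol: an unknown metric stops the request, while small samples, mixed commits, and flagged columns are reported alongside the results.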

Bundled resources

references/trace_schema.md — the full household_governance_audit.csv column dictionary plus the raw *.jsonl record shape.
references/metrics_definitions.md — exact formulas with file:line refs to the canonical implementations.
references/script_index.md — which existing script computes which metric, with usage examples.
scripts/run_analyzer.py — runnable wrapper that orchestrates the per-run metric computation and emits the five output artefacts.

Acceptance criteria

The skill is ready when:

It produces all five artefacts for any non-empty results dir.
For examples/single_agent/results/JOH_FINAL_v2/gemma4_e4b/Group_C/, the recomputed Table 5 row matches paper SI to ±0.001 on IBR mean.
For a tree containing both governed and disabled subtrees, the paired t-test reproduces SFig 1 numbers within ±0.01 on every ΔIBR.
Caveats section is non-empty for any input that has missing seeds or flagged sentinel columns.
Refusal protocol triggers when fed an unknown metric name (e.g., "compute the WAGF coherence index").
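The two numeric criteria are absolute-tolerance comparisons, which could be expressed along the following lines. The helper name `within_tolerance` is an assumption, and the reference values have to be read from the paper SI and SFig 1; none are reproduced here.

```python
import math

IBR_MEAN_TOL = 0.001   # Table 5 row, IBR mean
DELTA_IBR_TOL = 0.01   # SFig 1, every delta-IBR


def within_tolerance(recomputed, reference, abs_tol):
    """True when the recomputed value matches the reference within abs_tol."""
    # rel_tol=0.0 makes this a purely absolute comparison.
    return math.isclose(recomputed, reference, rel_tol=0.0, abs_tol=abs_tol)


def all_deltas_match(recomputed_deltas, reference_deltas, abs_tol=DELTA_IBR_TOL):
    """Check every per-model delta; both arguments map model name to value."""
    if recomputed_deltas.keys() != reference_deltas.keys():
        return False  # a missing or extra model is a failure, not a skip
    return all(
        within_tolerance(recomputed_deltas[m], reference_deltas[m], abs_tol)
        for m in reference_deltas
    )
```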

related-skills.json

same repository

wagf-domain-builder.md

from "WenyuChiou/WAGF"

Walk a researcher (PhD, collaborator, lab-mate) through building their first single-agent WAGF domain — from "I have a research question + maybe an external model" to "I have a working WAGF experiment producing audit traces." Conducts a structured S0-S7 interview, invokes `broker.tools.scaffold_domain` at S4, guides 4 surgical edits in S5, and runs `broker.tools.validate_prompt` after every change. Hands off to `wagf-coupling-designer` for any coupling work and to `wagf-experiment-designer` / `abm-reproducibility-checker` once the domain runs green. Use when the user says "I want to build a WAGF model for <my domain>", "help me set up a new domain", "I'm new to WAGF and have a research question", or "scaffold a domain from scratch".

2026-05-26

model-coupling-contract-checker.md

from "WenyuChiou/WAGF"

Verify the contract between WAGF/ABM agents and an external model (flood, hydrology, irrigation, seismic, catastrophe) — units, time steps, state mutation direction, feedback-loop double-counting. Use when the user says "check ABM-model coupling", "audit feedback loop", "verify units between WAGF and X model", or asks to confirm an external-model integration is safe.

2026-05-17

wagf-coupling-designer.md

from "WenyuChiou/WAGF"

Walk a researcher through designing the LLM↔external-model interface — decision flow IN, observation flow OUT — for a single-agent WAGF domain. Emits a coupling contract, a working mock adapter, and a pattern-specific real-model adapter scaffold so the WAGF side can be built and smoke-tested BEFORE the real model is wired in. Use when the user says "I want to couple my LLM agents to <my simulator>", "help me design the WAGF↔X interface", "scaffold the external model adapter", "draft a coupling contract", "I have a Python / R / CSV-based model and want WAGF to drive it". Sister skill to `model-coupling-contract-checker` (which AUDITS existing contracts; this one DESIGNS new ones).

2026-05-17

abm-reproducibility-checker.md

from "WenyuChiou/WAGF"

Verify another researcher can reproduce a WAGF experiment — manifests, seeds, configs, runnable commands, data provenance vs git blame, figure-script outputs match references. Use when the user says "audit reproducibility", "prepare for submission", "check this experiment folder", or any time a results directory needs a pre-publication integrity sweep.

2026-04-26

wagf-quickstart.md

from "WenyuChiou/WAGF"

First-time WAGF setup walkthrough — environment check, smoke test, first experiment, and handoff to the four lifecycle skills. Use when the user says "I just cloned WAGF", "set up WAGF", "first WAGF run", "I'm new to this", "where do I start with WAGF", or opens a Claude Code session in a freshly-cloned WAGF repo without a clear task.

2026-04-26

wagf-experiment-designer.md

from "WenyuChiou/WAGF"

Turn a WAGF research question into a reproducible experiment matrix (model × governance × seed × metric × artefact path). Use when the user says "design an experiment", "plan an ablation", "compare strict vs disabled", "set up cross-model evaluation", or wants a runnable matrix written to .research/.

2026-04-26

package.json

"author": "WenyuChiou"

"repository": "WenyuChiou/WAGF"
