一键在 Manus 中运行任何 Skill

diagnose

Root-cause a failing or wrong empirical result with a disciplined reproduce → minimise → hypothesise → instrument → fix loop, instead of guessing-and-poking. Use when the user says "why is my regression wrong", "this number changed", "my script errors out", "the result won't reproduce", "debug this", "this estimate looks wrong", or "it worked yesterday". Tuned for research code (R/Stata/Python): type coercion, NA/merge blow-ups, factor levels, clustering/SE choices, weighting, collinearity/convergence, seeds, package-version drift. Use `--no-fix` to localize the root cause without editing shared or load-bearing files.

在 Manus 中运行

星标1,263

分支2,569

更新时间2026年6月9日 21:33

来源

pedrohcgs

pedrohcgs/claude-code-my-workflow

打开 GitHub 仓库查看创作者相关仓库

安装命令

下载

在 Manus 中运行

适用职业SOC

软件开发工程师计算机与数学类职业15-1252L4

SKILL.md

readonly

同仓库更多 Skills

同仓库

submission-disclosures

pedrohcgs/claude-code-my-workflow

Generate the submission-time disclosure block for a manuscript — the AI-use disclosure statement matched to the target journal's policy, CRediT author-contribution roles, conflict-of-interest statement, and data-availability statement. Use when the user says "AI disclosure", "disclosure statement", "do I need to disclose Claude", "CRediT roles", "conflict of interest statement", "data availability statement", or is preparing a submission package. NOT statistical-disclosure screening of restricted-data outputs — that is /disclosure-check.

2026-06-101.3k

audit-reproducibility

pedrohcgs/claude-code-my-workflow

Enforce the replication-protocol.md rule by cross-checking numeric claims in a manuscript against the actual R / Stata / Python outputs. Report PASS/FAIL per claim against tolerance thresholds. Use before submission and before releasing a replication package.

2026-06-091.3k

deep-audit

pedrohcgs/claude-code-my-workflow

Deep consistency audit of the entire repository infrastructure. Launches 4 parallel specialist agents to find factual errors, code bugs, count mismatches, and cross-document inconsistencies. Then fixes all issues and loops until clean. Use when: after making broad changes, before releases, or when user says "audit", "find inconsistencies", "check everything".

2026-06-091.3k

create-lecture

pedrohcgs/claude-code-my-workflow

Create a new Beamer lecture `.tex` from source papers and materials, with notation consistency checks and the project's preamble wired in. Use when user says "create a lecture on X", "new lecture from these papers", "start a deck on topic Y", "scaffold a new Beamer file", "build me a lecture from these PDFs". Scaffolds the full deck — NOT for compiling existing `.tex` (use `/compile-latex`).

2026-06-091.3k

review-paper

pedrohcgs/claude-code-my-workflow

Comprehensive manuscript review with three modes: single-pass (default), --adversarial critic-fixer loop, and --peer [journal] simulated peer-review pipeline (editor + 2 dispositioned referees + editorial decision, calibrated to a target journal). R&R continuation via --peer --r2/--r3; hostile-editor stress test via --peer --stress; reviewer-disposition variance reporting via --peer --variance N. Auto-invokes /review-r + /audit-reproducibility on referenced scripts unless --no-cross-artifact.

2026-06-091.3k

did-event-study

pedrohcgs/claude-code-my-workflow

Run a staggered difference-in-differences / event-study analysis to the Sant'Anna practitioner standard — drives the canonical packages (R `did`/`DRDID`/`didFF`/`contdid`; Stata `csdid`/`drdid`), enforces the doubly-robust default, a mandatory diagnostic + sensitivity suite, uniform-band inference, replicate-and-verify-against-source discipline, and ends in a graded credibility verdict. Use when user says "run a DiD", "event study", "staggered adoption", "Callaway Sant'Anna", "att_gt", "csdid", "did with multiple periods", or points at panel data with a treatment-timing variable. NEVER reimplements an estimator.

2026-06-091.3k

name

diagnose

description

/diagnose — Root-Cause a Wrong or Failing Result

Find why an analysis errors, returns the wrong number, or won't reconcile — with a structured debugging loop rather than scattershot edits. Adapted from the diagnose pattern in mattpocock/skills, reshaped for empirical research code where the bug is usually a silent wrong number, not a crash.

The discipline: never edit before you can reproduce, and never fix before you can explain. A guessed fix that makes the symptom disappear without a named root cause is how a wrong number gets laundered into a published table.

When to use

A regression / estimate returns a value you can't explain, or one that changed when nothing should have.
A script errors out and the stack trace doesn't point at the real cause.
A result "won't reproduce" — different number on re-run, on another machine, or after a package update.
A replication claim fails /audit-reproducibility and you need to localize which step drifted.

Diagnose is symptom-driven and single-target: ONE wrong number / ONE failing run. Use a sibling instead when the job is different:

/audit-reproducibility — verify all numeric claims in a manuscript against current code (claim-driven, whole-paper). If you have one FAILing claim and want to localize which pipeline step produced it, /audit-reproducibility hands off to /diagnose; if you want to re-check every table number, start there.
/review-r — code-quality review with no specific symptom.
/capture-environment — snapshot the environment when version/seed drift is the suspect.

Phases

Phase 0 — Pin the symptom (expected vs. actual)

State the bug as a falsifiable gap before touching anything:

Expected: the value/behaviour you believe is correct, and why (a prior run, a paper table, a hand calculation, a theoretical sign).
Actual: the value/error observed now, copied verbatim (full message, not a paraphrase).
Tolerance: the threshold that separates "same" from "different", keyed to the source of expected — prior run on the same machine → machine-epsilon + display rounding; a published table → rounding + small slack (~1e-3); a hand calculation → ~0.01; a theoretical prediction → an economic-significance band, not a decimal. Don't chase 1e-12 floating-point noise; don't wave away a 5% gap. (See replication-protocol.md.)

If expected/actual can't be stated, the task is understanding, not diagnosis — stop and clarify first.

Phase 1 — Reproduce deterministically (get a reliable red)

A bug you can't reproduce on demand can't be fixed, only hidden.

Fix every source of nondeterminism: set the seed, pin the working directory, record sessionInfo() / pip freeze / Stata version (lean on /capture-environment).
Re-run the smallest unit that exhibits the bug and confirm it fails every time. An intermittent failure is its own hypothesis (uninitialised RNG, order-dependent merge, race in parallel code) — note it and carry it into Phase 3.

Phase 2 — Minimise to an MWE

Shrink until the bug sits in the open:

Data: subset to the smallest rows/columns that still reproduce (often one group, one period, a handful of rows).
Code: strip the pipeline to the shortest path from input to wrong output; comment out everything the symptom survives without.
Each removal that keeps the bug is information; each that kills it is a stronger signal — record which.

The MWE is the deliverable even if the fix is later trivial: it's what makes the root cause undeniable.

Phase 3 — Hypothesise (enumerate, then rank)

List candidate causes before testing any — a written list beats poking because it prevents fixating on the first idea. For research code, walk the usual suspects (all of these run cleanly with no error message — they are silent-wrong-number bugs):

Types & coercion — a numeric read as character/factor; integer overflow; date parsed wrong; TRUE/FALSE ↔ 1/0.
Missingness — NA dropped silently, na.rm flipping a mean, listwise deletion changing the sample mid-pipeline.
Joins & shape — a many-to-many merge inflating rows; duplicate keys; an unbalanced panel where balance was assumed.
Specification — wrong clustering level, fixed effects absorbed twice, a lag/lead off by one.
Bad controls & colliders — a control that is post-treatment, a mediator on the causal path, or a descendant of treatment (adding it induces bias, invisibly). The tell: a coefficient that moves the "wrong way" or shrinks implausibly when a control enters.
Numerical stability & convergence — an optimizer that didn't converge (check the convergence code, not just the estimates), a singular/near-singular Hessian, collinearity (high VIF, a dropped column), tolerance set too loose, under/overflow with very small/large weights or coefficients.
Weighting & aggregation — weights silently dropped/truncated, weights renormalised wrong, frequency vs. probability vs. analytic weights confused, a weight applied after rather than before a transform.
Sample — a filter that runs before vs. after a transform; an outlier rule applied inconsistently.
Environment — a package/Stata version bump that changed a default; a seed that moved; locale/encoding.

For a genuinely ambiguous bug, fan out the top competing hypotheses to parallel Task subagents (one per hypothesis, context: fork), each instructed to try to confirm its own cause on the MWE and report back — the loop-first analogue of asking three colleagues at once (see orchestrator-protocol.md).

Phase 3b — Reduce the hypotheses (so you don't launder a guess)

Each hypothesis (whether tested by hand or by a fan-out Task) returns {hypothesis, evidence for, evidence against, confidence, one-line conclusion}. Then:

One clear winner (high confidence, others refuted) → proceed to Phase 4 to confirm the mechanism.
A near-tie (top two within ~20 percentage points) → do not pick one; go to Phase 4 instrumentation to discriminate.
None above ~50% → report ambiguity and ask the user; do not edit on a coin-flip.

Phase 4 — Instrument & localize (bisect, don't stare)

Test the ranked hypotheses cheaply:

Bisect the pipeline — check the intermediate value at the midpoint of the data flow; the bug is upstream or downstream of it. Repeat. Binary search finds the offending line in log2(n) steps, not n.
Bisect history — if it "worked yesterday", compare against the last-good commit/output to pin the change that introduced it. (git bisect is fine here — it never discards work; the destructive git commands are blocked by git-guardrails.py, this is not one of them.)
Instrument with diagnostic primitives, not guesses — at each stage inspect: str() / summary() for types & NA patterns; row & column counts before and after every transform; table(factor) to catch a silently dropped level; cor() / VIF for unexpected collinearity; weight diagnostics range(w), sum(w), table(is.na(w)); and the regression's convergence flag. The stage where a count drops unexpectedly, a factor level vanishes, correlation jumps, or weights go sparse is the culprit stage.

End Phase 4 with a one-sentence root cause naming the exact line/step and mechanism.

Phase 5 — Fix & verify (then guard against regression)

Confidence gate (the anti-laundering rule): do not apply a fix unless the root cause is named and its mechanism is explicit. If Phase 3b left a near-tie, behave as --no-fix: report the candidates and ask. Editing research code on an unproven hypothesis is exactly the laundering this skill exists to prevent.

Unless --no-fix is set:

Apply the minimal fix at the root cause — not a downstream patch that masks it (prefer fixing the bad merge over filtering its duplicate rows afterward).
Re-run the MWE → confirm actual == expected within the Phase-0 tolerance.
Re-run the full unit and any dependent step → confirm the fix didn't move another number. If the result feeds a manuscript claim, re-check it (cross-ref the passport in /audit-reproducibility).

Note a prevention — the assertion/check that would have caught this earlier. One concrete guard per bug class:

Bug class	One-line guard
Types & coercion	`stopifnot(is.numeric(x))` after read
Missingness	explicit `na.rm = FALSE`; `stopifnot(sum(is.na(x)) == 0)`
Joins & shape	record `nrow` pre-merge; `stopifnot(nrow(out) == nrow(left))` for a 1:1 join
Weighting	`stopifnot(abs(sum(w) - 1) < 1e-8)` or `!anyNA(w)`
Convergence	assert the optimizer/model convergence flag is OK before using estimates
Sample	one explicit `filter()` with a stated reason, not a mid-pipe drop
Environment	pin versions in `renv.lock`; `set.seed()` at the top of each script

Propose the guard; don't silently install a test suite.

With --no-fix, stop after the root cause is named and report it for the user to fix by hand.

Worked example

A staggered-DiD ATT jumped from −0.043 to −0.071 after a data refresh; nothing in the spec changed.

# Phase 1 — reproduce: set.seed(1); same script, same number every run. Red is stable.

# Phase 2 — MWE: one cohort, two periods still shows the jump.
#           Strip to: read panel -> merge covariates -> feols(). Bug survives the merge step.

# Phase 4 — instrument: row counts before/after each step
nrow(panel)                         # 12,400  (expected)
nrow(merge(panel, covars, by="id")) # 12,933  <-- inflated! a many-to-many merge

# Root cause: the refresh left duplicate covars rows for a subset of ids; the
# join fans those ids out, 12,400 -> 12,933 (+533 rows), re-weighting the ATT
# toward the duplicated units.

# Phase 5 — minimal fix at the root (dedup the key), NOT a downstream row filter:
covars <- covars[!duplicated(covars$id), ]
# re-run: ATT back to -0.043 within tolerance; full pipeline re-checked, no other number moved.

# Prevention (Joins & shape guard):
stopifnot(nrow(merge(panel, covars, by = "id")) == nrow(panel))

Output / report format

Write a short diagnosis to quality_reports/diagnoses/YYYY-MM-DD_<slug>.md (create the directory first: mkdir -p quality_reports/diagnoses). These reports may contain real data values and file paths — they are project-internal and gitignored, like session logs. Include:

Symptom: expected vs. actual (+ tolerance).
MWE: the minimal input/code that reproduces it.
Root cause: the exact line/step and mechanism.
Fix: the diff applied (or, with --no-fix, the recommended change).
Verification: MWE + full-run re-check results.
Prevention: the guard that would have caught it.

Plus a chat summary leading with the one-line root cause.

Cross-language notes

The usual-suspects model is illustrated in R but the bug classes are language-neutral; the diagnostic idioms differ:

R — anyNA() / table(is.na(x)); factors silently drop unused levels; set.seed(); sessionInfo().
Stata — tab v, missing and explicit ./.a–.z extended missing; set seed; version; weights as [fw=] vs [pw=] vs [aw=] is a frequent silent bug.
Python — df.isnull().sum(); numpy.nan ≠ None; pandas vs numpy NaN handling differ; np.random.seed() / a passed random_state; pip freeze.

(Forkers in other fields: the five structural classes — Types, Missingness, Joins, Sample, Environment — are discipline-neutral; the econometric suspects above are the worked instance.)

Exit behavior

Outcome	Action
Root cause NAMED (high confidence), fix applied, re-verified	report root cause + diff + prevention
`--no-fix`	stop at a named root cause; write the report, make no edit to source
Phase 0 blocked (no statable expected/actual)	halt, ask for the expected value — diagnosis needs a target
Phase 1 blocked (cannot reproduce / nondeterminism)	report the nondeterminism as the finding (it is the bug class) + how to make the analysis deterministic; do not edit blindly
Phase 3b near-tie / <50%	report the competing hypotheses and ask the user; do not apply a fix

Flags

--no-fix — Diagnose only: run through naming the root cause (Phases 0–4) and write the report, but make no edit to source. Use when you want to apply the fix yourself, or when the file is shared/load-bearing and an automated edit is inappropriate.

Cross-references

.claude/skills/review-r/SKILL.md — code-quality review with no specific symptom (diagnose is symptom-driven).
.claude/skills/audit-reproducibility/SKILL.md — verify all numeric claims against code; diagnose localizes a single failing one (and is the natural hand-off from a FAIL).
.claude/skills/capture-environment/SKILL.md — snapshot the environment when version/seed drift is the suspect.
.claude/rules/replication-protocol.md — the tolerance contract that defines "same number", and the "If Mismatch" hand-off to this skill.
.claude/rules/orchestrator-protocol.md — the fan-out primitive used for competing-hypothesis testing in Phase 3.

What this skill does NOT do

Review code with no symptom — that is /review-r. Diagnose needs an expected-vs-actual gap to chase.
Re-audit every claim in a paper — that is /audit-reproducibility. Diagnose fixes one bug deeply.
Build a test suite — it proposes the single guard that would have caught this bug; standing test infrastructure is separate dev work.
Commit the fix — branching / committing is /commit's job.