rigorous-experiments

This skill should be used when designing, running, validating, or auditing statistical experiments on personal or observational time-series data (health metrics, speech/text corpora, behavioral logs, diaries, n-of-1 self-tracking). It enforces pre-registration, exact permutation tests, FDR discipline, data-validation gates, adversarial code review, and cross-validation with external models. Triggers on "design an experiment", "test this hypothesis on my data", "is this correlation real", "audit these findings", "pre-register", "validate this dataset", or any n-of-1 / quantified-self analysis request.

Stars: 254

Forks: 36

Updated: June 8, 2026 at 17:46

Source

glebis

glebis/claude-skills


SKILL.md

Rigorous Experiments

Run statistical experiments on observational/personal time-series data that survive scrutiny. Distilled from a 54-experiment n-of-1 program in which sampled permutation tests, missing-data artifacts, app-categorization bugs and collinear mechanisms repeatedly manufactured — and then destroyed — "findings". Every rule here exists because its absence once produced a wrong conclusion.

Modes

Pick the mode matching the request; chain them for a full study.

| Mode | When | Reference |
|---|---|---|
| design | New hypothesis or study | `references/design.md` |
| conduct | Implementing + running the experiment | `references/statistics.md` |
| validate-data | Before trusting ANY new data source | `references/data-validation.md` |
| cross-validate | Findings worth defending; code review; external model review (e.g. GPT Pro) | `references/cross-validation.md` |
| investigate-leads | A sweep/run produced leads (p<0.06, not FDR-confirmed) | `references/lead-investigation.md` |
| audit | Re-examining past claims, registries of findings | `references/statistics.md` §Audit |

Non-negotiable core (all modes)

1. Pre-register before computing. Hypotheses, exact tests, family size m, and the acceptance threshold go in the script docstring BEFORE the first run. Post-hoc tests are reported as descriptive, never promoted.
2. Exact permutation, never sampled, on small n. A session sequence of n=19 has 18 non-identity circular shifts, so the minimum honest p is 1/19 ≈ 0.053. Sampling 2000 shifts with replacement fabricates precision (this killed a flagship "q=0.028" finding). Use scripts/perm_stats.py.
3. Permute over the full calendar, not the compressed series. Shifting a gap-compressed series breaks the timeline; keep missingness as NaN masks re-applied per shift. Event indicators must be pure 0/1 with no gaps; missingness lives only in the outcome series.
4. Benjamini-Hochberg (BH) with FIXED family size m, a LITERAL CONSTANT declared at design time, never len(tests) (that defeats pre-registration; the linter rejects it). Assert the run matches the declared m. Keep confirmatory families small and separate from exploratory sweeps: pooling everything into one BH buries true effects, and cherry-picking families manufactures them. Plain BH assumes independent or positively dependent tests; for strongly dependent lag families use Benjamini-Yekutieli (BY) or maxT resampling.
5. Stationarity check before correlating trending series. Exact circular shift on a trending series is "exactly, reproducibly wrong": report prewhitened-r (AR1 residuals) and stationary bootstrap alongside.
6. Stratify before pooling (Simpson check): within group (e.g. therapy/coaching) and within regime (pre/post known breaks). A pooled r=−0.25 once hid therapy −0.64 vs coaching +0.53.
7. Controls can re-describe a finding, not just kill it. When a control collapses an effect, check collinearity of control and predictor: r(self-focus, session-length)=0.79 meant "mechanism ambiguous", not "effect fake". Report the decomposition.
8. Honest statuses: confirmed (q<0.10 exact) ≠ lead (p<0.06) ≠ null ≠ descriptive. Status flips are recorded, never silently edited. Nulls with adequate power are findings. Robust ≠ significant: a lead surviving leave-one-out at small n is still underpowered, so it is a candidate for prospective test, not a finding.
8b. Series scope is part of the test. A lagged "[t+1]" means the next unit in the series the hypothesis is about, not the next pooled row; define scope before lagging (it once flipped a sign). When recomputing a prior result, reproduce a stored artifact on that scope first.
9. Privacy: raw text/audio never enters output files or external uploads; only statistics, rates and embedding-derived scores do.
10. Plain-language reporting: every statistic carries its practical meaning inline; define r/p/q/n once per report; no untranslated jargon calques. Narrative first, numbers as support.
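The exact-permutation and fixed-m rules above can be sketched as follows. This is a minimal illustration, not the contents of scripts/perm_stats.py: the function names, the `M_FAMILY` constant, and the choice of Pearson r as the test statistic are all assumptions made for the example. It reads "NaN masks re-applied per shift" as: shift the 0/1 event series over the full calendar, keep the outcome fixed, and drop the same missing outcome days on every shift.

```python
import numpy as np

M_FAMILY = 3  # hypothetical pre-registered family size: a literal, never len(tests)


def exact_circular_shift_p(event, outcome):
    """Two-sided exact circular-shift p-value for corr(event, outcome).

    event:   0/1 indicator over the full calendar, no gaps.
    outcome: float series over the same calendar, NaN where missing.
    """
    event = np.asarray(event, dtype=float)
    outcome = np.asarray(outcome, dtype=float)
    n = len(event)
    observed = ~np.isnan(outcome)  # missingness lives only in the outcome

    def corr_for(shifted_event):
        x, y = shifted_event[observed], outcome[observed]
        if x.std() == 0 or y.std() == 0:
            return 0.0  # undefined correlation treated as no association
        return float(np.corrcoef(x, y)[0, 1])

    r_obs = corr_for(event)
    # Enumerate every non-identity shift exactly once; no random sampling.
    r_null = [corr_for(np.roll(event, k)) for k in range(1, n)]
    hits = sum(abs(r) >= abs(r_obs) - 1e-12 for r in r_null)
    p = (1 + hits) / (1 + len(r_null))  # floor is 1/n, e.g. 1/19 for n=19
    return r_obs, p


def bh_fixed_m(p_values, m):
    """Benjamini-Hochberg q-values with a pre-declared family size m."""
    p = np.asarray(p_values, dtype=float)
    assert len(p) == m, f"run has {len(p)} tests but pre-registered m={m}"
    order = np.argsort(p)
    scaled = p[order] * m / np.arange(1, m + 1)
    # Enforce monotone q-values from the largest rank downward.
    q_sorted = np.minimum.accumulate(scaled[::-1])[::-1]
    q = np.empty(m)
    q[order] = np.minimum(q_sorted, 1.0)
    return q
```

With n=19 the smallest p this function can return is 1/19, which is the point of rule 2: no amount of resampling can honestly report anything smaller. The length assertion in `bh_fixed_m` is one way to make a run fail loudly when the number of tests drifts from the declared m.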

Workflow (full study)

1. validate-data gate on any new source (see the reference; the checklist has caught zero-vs-missing conflation, dedup semantics, substring category bugs, rolling purge windows, and timezone conventions).
2. design: pre-registered hypotheses + family + power sanity.
3. conduct: implement with scripts/perm_stats.py; run; write results JSON with tests, statuses, and caveats including known limitations.
4. cross-validate: adversarial code review (e.g. Codex read-only) BEFORE trusting results; fix findings; re-run. For major claims, external model review with a privacy-screened archive.
5. investigate-leads on anything that surfaced as a lead, not at the same scale but with the triage battery: LOO (leave-one-out), directionality, detrend-vs-step, within-cycle, prewhiten+bootstrap. Consolidate same-direction leads into one composite. Mark diagnostic runs descriptive_only: true.
6. Verdicts in honest prose (mixed/rejected allowed); report; registry update with status provenance.
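As a rough illustration of the "results JSON with tests, statuses, and caveats" step, the sketch below assigns the four honest statuses from the thresholds stated earlier (confirmed at q<0.10, lead at p<0.06) and assembles a JSON-serializable record. The field names (`experiment`, `family_m`, `tests`, `caveats`) and both function names are hypothetical; only `descriptive_only` comes from the skill text, and the real schema may differ.

```python
import json

Q_CONFIRMED = 0.10  # confirmed: q < 0.10, assuming q came from exact tests
P_LEAD = 0.06       # lead: p < 0.06 but not FDR-confirmed


def status_for(p, q, preregistered=True):
    """Map one test's p and q to confirmed / lead / null / descriptive."""
    if not preregistered:
        return "descriptive"  # post-hoc tests are never promoted
    if q < Q_CONFIRMED:
        return "confirmed"
    if p < P_LEAD:
        return "lead"
    return "null"


def build_results(experiment_id, family_m, tests, caveats, descriptive_only=False):
    """Assemble a results record; the layout here is illustrative only."""
    if not tests and not descriptive_only:
        raise ValueError("a run with no tests must set descriptive_only=True")
    rows = [
        dict(t, status=status_for(t["p"], t["q"], t.get("preregistered", True)))
        for t in tests
    ]
    return {
        "experiment": experiment_id,
        "family_m": family_m,
        "descriptive_only": descriptive_only,
        "tests": rows,
        "caveats": list(caveats),
    }


example = build_results(
    "exp_example",
    2,
    [{"name": "h1", "p": 0.01, "q": 0.02}, {"name": "h2", "p": 0.05, "q": 0.20}],
    ["small n; prospective replication needed"],
)
serialized = json.dumps(example, indent=2)
```

Keeping the status derivation in one function makes a later status flip traceable to a changed p or q rather than to a hand edit.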

Viewing results

Launch the bundled explorer over any directory of results JSONs:

python3 scripts/explorer.py <results_dir> [--port 8799] [--pattern "exp*.json"] [--sort newest|oldest]

Generates explorer.html in the directory, starts (or reuses) a loopback http server on the port, and opens the browser: experiment list with confirmed/lead badges, filter, sortable test tables color-coded by status, verdicts, caveats, raw JSON. The page fetches result files live — re-running experiments updates the view; re-run the script only when new result files appear. Serve over localhost, never file:// (CDN fonts) and never on a non-loopback interface (results may contain personal statistics).
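The loopback-only constraint can be illustrated with the standard library alone. This is a sketch of the idea, not the code in scripts/explorer.py; the helper name `make_loopback_server` is hypothetical, and the default port simply mirrors the one in the command above.

```python
from functools import partial
from http.server import SimpleHTTPRequestHandler, ThreadingHTTPServer


def make_loopback_server(directory, port=8799):
    """Create an HTTP server for `directory` bound to 127.0.0.1 only."""
    handler = partial(SimpleHTTPRequestHandler, directory=directory)
    # Binding to 127.0.0.1 (not 0.0.0.0) keeps results off the network.
    return ThreadingHTTPServer(("127.0.0.1", port), handler)
```

Calling `serve_forever()` on the returned server starts serving; passing `port=0` asks the operating system for any free port.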

Evals

Run python3 evals/run_evals.py (from the skill directory) to lint an experiment script/results pair against the standards (pre-registration present, fixed literal m, exact perm usage, caveats, no raw text in outputs). A diagnostic/triage run that intentionally mints no new tests sets descriptive_only: true in its results JSON to satisfy the "has tests" check. Eval cases in evals/cases/ document expected pass/fail examples.
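To make the "fixed literal m" check concrete, here is one way such a lint could be written with Python's `ast` module. It is an assumption-laden sketch, not the logic of evals/run_evals.py: the function name and the convention that the family size lives in a module-level variable called `M_FAMILY` are both hypothetical.

```python
import ast


def family_size_is_literal(source, name="M_FAMILY"):
    """Return True if `name` is assigned an integer literal at module level."""
    tree = ast.parse(source)
    for node in tree.body:
        if not isinstance(node, ast.Assign):
            continue
        for target in node.targets:
            if isinstance(target, ast.Name) and target.id == name:
                value = node.value
                return (
                    isinstance(value, ast.Constant)
                    and isinstance(value.value, int)
                    and not isinstance(value.value, bool)
                )
    return False  # no declaration found counts as a failure
```

A script declaring `M_FAMILY = 12` passes, while `M_FAMILY = len(tests)` fails because the right-hand side is a call rather than a constant.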
