| name | disclosure-check |
| description | Pre-screen analysis outputs (tables, figures, logs) built on restricted or confidential data for statistical-disclosure-limitation problems before any release. Scans for small cell counts, complementary-suppression gaps, dominance (p-percent / (n,k)), re-identifiable exact counts, PII leakage, and unrounded sensitive statistics; classifies each finding CRITICAL / WARNING / OK and gates on any CRITICAL. Use before depositing or sharing restricted-data results, or when the user says "disclosure check", "SDL scan", "is this output safe to release", "check for small cells", "disclosure avoidance", "pre-screen for the RDC", or "can I export this from the enclave". |
| argument-hint | [outputs-dir] [--provider census|irs|irb|generic] [--threshold N] (outputs-dir defaults to scripts/R/_outputs/) |
| disable-model-invocation | true |
| allowed-tools | ["Read","Grep","Glob","Write","Bash"] |
| effort | high |
/disclosure-check — Statistical-Disclosure-Limitation pre-screen
Scan analysis outputs built on restricted or confidential data (Census FSRDC, IRS SOI, administrative registers, linked health records, proprietary firm panels) for the disclosure-avoidance problems that get an export request rejected — before it reaches the data provider's official disclosure review. The skill is a pre-screen, not a substitute for that review.
Core principle: A single un-suppressed n=3 cell, an exact count that pins down one firm, or a p-percent dominance failure can re-identify a person or establishment. Catch it on your machine, not in the rejection email from the RDC analyst.
When to use
- Before requesting an export from a Census FSRDC / secure data enclave / RDC.
- Before depositing restricted-data results to openICPSR, a journal, or a co-author outside the enclave.
- Before sharing any figure, table, or log derived from confidential microdata.
- As a release gate. Pair with a pre-commit / pre-deposit invocation so no restricted-data output ships un-screened. This is the foundation of the data-management plan for any restricted-data project.
Inputs
$0 — outputs directory to scan. Defaults to scripts/R/_outputs/. Recognised siblings: scripts/stata/_outputs/, scripts/python/_outputs/, or any export-staging directory (e.g., a to_review/ folder the analyst stages for the RDC).
--provider — selects which disclosure-rule profile to load (Phase 0). One of census / irs / irb / generic. Providers differ — thresholds and rules are not interchangeable; default generic is deliberately conservative.
--threshold N — override the minimum cell count (default n<10). Census FSRDC commonly uses 10 for establishments; IRS and many IRBs differ. Always reconcile with your provider's written rules.
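As an illustration only, the three inputs above could be parsed as follows if the skill were wrapped in a Python script. The wrapper, the `build_parser` name, and the `outputs_dir` attribute are hypothetical; only the flag names, the allowed provider values, and the defaults come from this document.

```python
import argparse

# Provider profiles named in this document; "generic" is the conservative default.
PROVIDERS = ("census", "irs", "irb", "generic")


def build_parser():
    """Build a parser mirroring [outputs-dir] [--provider ...] [--threshold N]."""
    parser = argparse.ArgumentParser(prog="disclosure-check")
    # Positional outputs directory, optional, with the documented default.
    parser.add_argument("outputs_dir", nargs="?", default="scripts/R/_outputs/")
    parser.add_argument("--provider", choices=PROVIDERS, default="generic")
    # Minimum cell count: counts strictly below this value are flagged.
    parser.add_argument("--threshold", type=int, default=10)
    return parser
```

Restricting `--provider` to a fixed set of choices makes a mistyped profile fail at parse time instead of silently falling back to the generic rules.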
Workflow
Phase 0: Load the provider's disclosure rules
- Read .claude/rules/confidential-data.md for the project's restricted-data handling contract and the rule-profile placeholder.
- Load the --provider profile (a placeholder config the forker fills in from their signed agreement; Census, IRS, and IRB rules differ and supersede any default here):
- min cell count (default n<10),
- dominance rules: p-percent (a cell is unsafe if the total minus the two largest contributions is less than p% of the largest contribution, so the runner-up could estimate the largest within p%) and (n,k) (a cell is unsafe if the top n units contribute more than k% of the total),
- rounding required for sensitive statistics (counts, totals, ratios),
- top-coding / bottom-coding thresholds for extreme values,
- geographic minimum population for any geocoded statistic.
- If no signed-rule values are recorded, fall back to the conservative generic profile and flag prominently in the report that real provider thresholds must be substituted.
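The two dominance rules in the profile can be sketched as below. This assumes the common textbook formulations: a cell fails p-percent when the total minus the two largest contributions is less than p% of the largest, and fails (n,k) when the top n contributors exceed k% of the total. The function names are hypothetical, contributions are assumed non-negative, and the p, n, k values in the usage note are illustrative, not any provider's actual parameters.

```python
def fails_p_percent(contributions, p):
    """Return True if the cell is sensitive under the p-percent rule.

    Assumes non-negative contributions. The cell is sensitive when the
    remainder (total minus the two largest contributions) is less than
    p% of the largest contribution.
    """
    xs = sorted(contributions, reverse=True)
    if not xs:
        return False  # empty cell: nothing to protect
    remainder = sum(xs[2:])  # 0 when there are only one or two contributors
    # Integer-friendly comparison avoids floating-point edge cases.
    return 100 * remainder < p * xs[0]


def fails_nk(contributions, n, k):
    """Return True if the top n contributors hold more than k% of the total."""
    xs = sorted(contributions, reverse=True)
    total = sum(xs)
    if total == 0:
        return False
    return 100 * sum(xs[:n]) > k * total
```

For example, with illustrative parameters p=10 and (n,k)=(2,80), a cell with contributions `[900, 60, 25, 15]` fails both rules, while `[400, 300, 200, 100]` passes both.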
Phase 1: Scan the outputs directory
Glob the outputs dir for .tex, .csv, .txt, .log, .smcl, .out, .md tables and figure-data files. For each:
- Cell counts: parse table cells / frequency columns; flag any count 0 < n < threshold that is not already suppressed.
- Complementary-suppression gaps: if one cell in a row/column is suppressed but the margin total and the other cells let a reader back it out by subtraction, the suppression is incomplete.
- Dominance: for any total/mean cell where unit-level contributions are available (or inferable), apply the p-percent and (n,k) rules.
- Exact re-identifying counts: small exact integers (e.g., "4 hospitals", "1 firm", a max/min that is a single observation) that single out a unit.
- PII leakage: regex for names, SSNs (\d{3}-\d{2}-\d{4}), exact dates of birth, addresses, exact lat/long or fine geocodes, record IDs that survived into an output.
- Unrounded sensitive statistics: exact unrounded counts/totals where the provider requires rounding.
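Three of the checks above (small cells, a single-cell complementary-suppression gap, and the SSN pattern) can be sketched in a few lines. This is a simplified illustration: it assumes one table row has already been parsed into a list of counts, uses `None` as a hypothetical marker for a suppressed cell, and covers only the one-suppressed-cell case of back-out by subtraction. All function names are hypothetical.

```python
import re

# SSN pattern from the PII check, with word boundaries added.
SSN_RE = re.compile(r"\b\d{3}-\d{2}-\d{4}\b")
SUPPRESSED = None  # hypothetical marker for an already-suppressed cell


def small_cells(counts, threshold=10):
    """Return indices of unsuppressed counts with 0 < n < threshold."""
    return [i for i, n in enumerate(counts)
            if n is not SUPPRESSED and 0 < n < threshold]


def recoverable_by_subtraction(counts, margin_total):
    """Return the backed-out value when exactly one cell is suppressed
    and the margin total is published; otherwise return None."""
    hidden = [i for i, n in enumerate(counts) if n is SUPPRESSED]
    if len(hidden) != 1 or margin_total is None:
        return None
    return margin_total - sum(n for n in counts if n is not SUPPRESSED)


def has_ssn(text):
    """Return True if the text contains an SSN-shaped token."""
    return SSN_RE.search(text) is not None
```

With a row `[120, None, 36]` and a published margin of 160, `recoverable_by_subtraction` returns 4: the suppressed cell is fully recoverable, so the suppression is incomplete until a second cell or the margin is also withheld.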
Phase 2: Classify each finding — CRITICAL / WARNING / OK
| Disposition | Meaning | Examples |
|---|---|---|
| CRITICAL | Would fail the provider's disclosure review; blocks release. | n=3 cell un-suppressed; complementary-suppression hole; p-percent dominance failure; any PII; an exact count identifying ≤2 units. |
| WARNING | Plausibly safe but needs a human judgment call. | Cell at exactly the threshold; unrounded total just over a rounding base; geographic statistic near the min-population floor. |
| OK | Within the loaded rules, no action needed. | Counts ≥ threshold and rounded; dominance passes; no PII. |
When two findings interact (a suppressed cell + a recoverable margin), report them together — the gate cares about the joint disclosure risk, not each cell in isolation. Be economics-aware: DiD / event-study cell counts per (cohort × period), IV first-stage subsamples, RCT arm × stratum balance tables, and panel firm-counts are the usual offenders.
Phase 3: Suggest remediation
For each CRITICAL / WARNING, propose the standard SDL fix, in order of preference:
- Suppress the offending cell (and its complement, if a margin allows back-out).
- Round counts/totals to the provider's base (e.g., nearest 10 or 15).
- Top-code / bottom-code extreme values.
- Aggregate — collapse thin categories, coarsen geography, widen bins until every cell clears the threshold.
- Drop the statistic if no remediation preserves both safety and meaning.
Each suggestion names the file, the cell/location, the rule it violates, and the concrete edit — never auto-applies it (the analyst owns the disclosure decision).
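Two of these remediations, rounding and aggregation, are simple enough to sketch. The helpers below are hypothetical illustrations for integer counts: one rounds to the nearest multiple of a base, the other greedily merges adjacent ordered bins until each merged bin reaches the threshold. In keeping with the rule above, they only compute a proposed value for the report; nothing is written back to the output files.

```python
def round_to_base(value, base=10):
    """Round a non-negative integer count to the nearest multiple of base
    (ties round up)."""
    return base * ((value + base // 2) // base)


def collapse_adjacent(counts, threshold=10):
    """Greedily merge adjacent bins so each merged bin is >= threshold.

    A thin tail that cannot reach the threshold on its own is folded into
    the last merged bin. If the grand total is below the threshold, the
    single returned bin is still unsafe and the statistic may need dropping.
    """
    merged = []
    running = 0
    for n in counts:
        running += n
        if running >= threshold:
            merged.append(running)
            running = 0
    if running:
        if merged:
            merged[-1] += running
        else:
            merged.append(running)
    return merged
```

For instance, `round_to_base(1234, 15)` gives 1230, and `collapse_adjacent([4, 3, 12, 25, 2])` gives `[19, 27]`, preserving the total of 46 while removing every cell below 10.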
Phase 4: Gate
Exit non-zero on any CRITICAL. WARNINGs surface but do not block. See Exit behavior.
Output / Report format
Write quality_reports/disclosure_check_[outputs-dir-slug].md:
# Disclosure Check: [outputs dir]
**Date:** [YYYY-MM-DD]
**Provider profile:** census | irs | irb | generic (rules source: confidential-data.md)
**Min cell count:** [N] **Dominance:** p=[p]%, (n,k)=([n],[k]%) **Rounding base:** [b]
## Summary
| Disposition | Count |
|---|---|
| CRITICAL | M |
| WARNING | W |
| OK | P |
| **Verdict** | **PASS / FAIL** (FAIL iff M > 0) |
## CRITICAL (blocks release)
| File | Location | Rule violated | Observed | Suggested remediation |
|---|---|---|---|---|
| tab3_by_cohort.tex | row "2008", col "n" | min cell (n<10) | n=4 | suppress cell + suppress complement in margin |
## WARNING (human judgment)
| File | Location | Concern | Suggested action |
|---|---|---|---|
## OK
[counts only, or a short list]
## Next steps
1. Resolve every CRITICAL — suppress / round / top-code / aggregate, then re-run.
2. Review WARNINGs with the agreement's written rules in hand.
3. Re-run until zero CRITICAL, THEN submit to the provider's OFFICIAL disclosure review.
Exit behavior
- Zero CRITICAL: exit 0; report printed. (WARNINGs allowed — they are surfaced, not blocking.)
- Any CRITICAL: exit 1; summary to stderr. This makes the skill usable as a release / pre-deposit gate. Mirrors /audit-reproducibility's gate semantics: WARNING ≠ FAIL, only CRITICAL blocks.
- No rules loaded (generic fallback): the scan still runs at the conservative generic defaults and, absent any CRITICAL, exits 0 with a prominent warning that real provider thresholds were not supplied, so the result reflects those defaults and not the actual agreement.
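The gate semantics can be sketched as a small function that maps findings to an exit code. The shape of a finding (a dict with a `disposition` key) and the `gate` name are assumptions made for illustration; the skill does not prescribe a data structure.

```python
import sys


def gate(findings, stream=sys.stderr):
    """Return the exit code for a list of findings.

    Each finding is assumed to be a dict with a "disposition" key set to
    "CRITICAL", "WARNING", or "OK". Only CRITICAL findings block.
    """
    critical = [f for f in findings if f["disposition"] == "CRITICAL"]
    if critical:
        # Summary goes to stderr so it is visible when used as a gate.
        print(f"FAIL: {len(critical)} CRITICAL finding(s)", file=stream)
        return 1
    return 0
```

A calling script would end with `sys.exit(gate(findings))`, so a pre-deposit hook can rely on the process exit status alone.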
Flags
--provider <name> — Load that data provider's disclosure rules (one of census, irs, irb, generic). Default: generic, the conservative small-cell ruleset.
--threshold <n> — Override the minimum cell-count threshold (default n<10); match your data-use agreement's actual rule.
What this skill does NOT do
- It does not replace the data provider's official disclosure review. Census/RDC, IRS, and IRB analysts run the authoritative review; this skill pre-screens so the official review is more likely to pass on the first pass. A PASS here is not clearance to release.
- It does not certify your rules are correct. It applies the thresholds you load from your signed agreement; if the loaded --provider profile is wrong, the scan is wrong. Reconcile with the written agreement, not a default.
- It does not move, encrypt, or transmit data, and it never exfiltrates microdata from the enclave; it reads only the staged outputs you point it at.
- It does not catch every disclosure risk. Differencing across released tables, longitudinal re-identification, and model-based inferential disclosure can evade a per-file scan. A clean run is necessary, not sufficient.