| name | dark-code-audit |
| description | Audits a codebase for dark code risk: code that was generated, passed automated checks, and shipped without anyone understanding it. Produces a structured audit report with a hotspot map, comprehension debt scorecard (spec coverage %, context layer coverage %, review depth), ownership gap analysis, top failure scenarios, and a prioritized action plan. Use this skill before a security review, compliance review, or major refactor; when new engineers join and the codebase feels opaque; after a period of high AI-assisted development velocity; quarterly as a health check; or any time you hear "audit for dark code", "comprehension debt", "dark code risk", "what do we not understand about this codebase", "knowledge gap analysis", "who owns what", or "we've been shipping AI code really fast lately". This skill does not recommend "add more monitoring" — it identifies where human comprehension is missing and prescribes structural fixes.
Dark Code Audit
You are conducting a dark code audit. Dark code is code that was never understood by anyone:
AI-generated, passed automated checks, shipped — and the comprehension step never happened.
This is different from technical debt. Technical debt is code the author understood but cut
corners on. Dark code is code that no one has ever fully understood. The distinction matters
because the fix is different: you can pay down technical debt by refactoring; you address
dark code by building comprehension artifacts, context layers, and comprehension gates.
Never recommend "add more monitoring" or "add a supervisory layer" as a primary fix. These
are the wrong answers — monitoring tells you when something broke, not why, and a supervisory
layer just adds another dark layer on top. The fixes must happen upstream of the code running.
Step 1: Automated Discovery
Before the interview, gather everything the codebase can tell you automatically. Run in parallel:
Context layer coverage:
- Glob for MODULE_MANIFEST.md across the codebase; list which directories have one and which don't
- Glob for BEHAVIORAL_CONTRACTS.md; same
- Glob for DECISION_LOG.md; same
- Glob for COMPREHENSION_ARTIFACT.md in .claude/comprehension/ to see which PRs had gate reviews
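The coverage check above can be sketched in Python. This is a minimal illustration, not the skill's actual implementation: it assumes each top-level directory is one module, and the function names `context_layer_coverage` and `coverage_percent` are hypothetical.

```python
from pathlib import Path

# File names come from the skill text; treating each top-level directory
# as one module is an assumption made for this sketch.
CONTEXT_FILES = ("MODULE_MANIFEST.md", "BEHAVIORAL_CONTRACTS.md", "DECISION_LOG.md")


def context_layer_coverage(root: str) -> dict[str, dict[str, bool]]:
    """Map each top-level module directory to the context files it contains."""
    coverage = {}
    for module in sorted(Path(root).iterdir()):
        # Skip plain files and hidden directories such as .git or .claude.
        if not module.is_dir() or module.name.startswith("."):
            continue
        coverage[module.name] = {
            name: (module / name).is_file() for name in CONTEXT_FILES
        }
    return coverage


def coverage_percent(coverage: dict[str, dict[str, bool]], name: str) -> float:
    """Percentage of modules that contain the given context file."""
    if not coverage:
        return 0.0
    return 100.0 * sum(files[name] for files in coverage.values()) / len(coverage)
```

The per-file percentages map directly onto the "N/N (X%)" lines of the discovery summary. A repository with nested services would need a different definition of "module" than the one assumed here.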
Velocity indicators:
- If git is available:
  - git log --oneline -50 to see recent commit velocity and message patterns
  - git shortlog -sn --no-merges HEAD to see contributor distribution (high concentration in a few names, or many one-time contributors, are both risk signals); pass HEAD explicitly, because without a revision git shortlog reads from standard input when it is not attached to a terminal
- Look for commit messages that are obviously AI-generated (comprehensive, impersonal, covering every edge case in the message)
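One way to turn the shortlog output into the two risk signals named above (concentration and one-time contributors) is sketched below. The helper names and the choice of "top 2" as the concentration window are assumptions for illustration. The git call passes HEAD explicitly, since `git shortlog` reads from standard input when no revision is given and stdin is not a terminal.

```python
import subprocess


def run_shortlog(repo: str) -> str:
    """Return `git shortlog -sn --no-merges HEAD` output for a repository."""
    result = subprocess.run(
        ["git", "shortlog", "-sn", "--no-merges", "HEAD"],
        cwd=repo, capture_output=True, text=True, check=True,
    )
    return result.stdout


def parse_shortlog(output: str) -> list[tuple[str, int]]:
    """Parse shortlog summary lines ("  120<TAB>Alice") into (author, count)."""
    rows = []
    for line in output.splitlines():
        if not line.strip():
            continue
        count, author = line.split(None, 1)
        rows.append((author.strip(), int(count)))
    return rows


def top_share(rows: list[tuple[str, int]], top_n: int = 2) -> float:
    """Percentage of commits made by the top_n most active contributors."""
    total = sum(count for _, count in rows)
    if total == 0:
        return 0.0
    top = sum(sorted((count for _, count in rows), reverse=True)[:top_n])
    return 100.0 * top / total


def one_time_contributors(rows: list[tuple[str, int]]) -> int:
    """Number of contributors with exactly one commit."""
    return sum(1 for _, count in rows if count == 1)
```

What counts as "high concentration" is left to the auditor; the sketch only produces the numbers.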
Structural risk patterns:
- Grep for shared resource patterns: redis\|cache\|REDIS, database client imports, queue client imports
- Glob for *.env.example or similar to see which external services this system touches
- List top-level directories and identify which represent distinct modules or services
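The shared-resource grep can be approximated with a small scan that groups hits by module. This is a rough sketch: the regular expression mirrors the grep pattern above, while the source-file suffixes, the one-directory-per-module layout, and the function name are assumptions.

```python
import re
from pathlib import Path

# Pattern mirrors the grep expression in the skill text; the suffix list
# is an illustrative assumption and should match the codebase's languages.
SHARED_RESOURCE = re.compile(r"redis|cache|REDIS")
SOURCE_SUFFIXES = {".py", ".ts", ".js", ".go", ".java", ".rb"}


def modules_touching_shared_resources(root: str) -> dict[str, list[str]]:
    """Map each top-level module to the source files that mention a shared resource."""
    hits: dict[str, list[str]] = {}
    root_path = Path(root)
    for path in sorted(root_path.rglob("*")):
        if not path.is_file() or path.suffix not in SOURCE_SUFFIXES:
            continue
        rel = path.relative_to(root_path)
        # Files directly at the root do not belong to any module.
        if len(rel.parts) < 2:
            continue
        # Skip anything under hidden directories such as .git.
        if any(part.startswith(".") for part in rel.parts):
            continue
        text = path.read_text(encoding="utf-8", errors="ignore")
        if SHARED_RESOURCE.search(text):
            hits.setdefault(rel.parts[0], []).append(rel.as_posix())
    return hits
```

A module that appears here alongside several others is a candidate for the "shared infrastructure clients found in" line of the discovery summary. A text match is only a lead, so each hit still needs a human to confirm it is a real client.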
Existing documentation:
- Read root README and any CLAUDE.md
- Check for existing ADRs (docs/adr/, decisions/, architecture/)
- Check for spec documents (specs/, requirements/, *.spec.md)
Present a discovery summary before starting the interview:
Auto-discovered:
• Modules/directories: N
• With MODULE_MANIFEST.md: N/N (X%)
• With BEHAVIORAL_CONTRACTS.md: N/N (X%)
• With DECISION_LOG.md: N/N (X%)
• Comprehension artifacts (prior gate reviews): N
• Shared infrastructure clients found in: [list of modules]
• Contributor concentration: [N engineers touched N% of commits]
• ADRs / spec docs: found / not found
Starting interview. Correct anything that looks wrong.
Step 2: Interview (Four Groups)
Conduct the interview in four groups. Ask each group's questions together — don't ask one at a time. Make it clear that short answers are fine; you're building a picture, not writing an essay.
Group 1: Architecture and system scope
- What are the main services or modules in this system? (if not clear from directory structure)
- Which services communicate with each other? Any cross-service data flows not obvious from imports?
- Are there any services that were built primarily by AI tools in the last 6–12 months?
- Is there any part of the system where "nobody really knows how it works"?
Group 2: AI tool usage
- Are AI coding tools (Claude, Copilot, Cursor, etc.) actively used on this codebase?
- Is there a requirement for spec/design approval before generating code, or does code go directly from prompt to PR?
- Are AI-generated PRs reviewed differently than human-written PRs — more scrutiny, less, or the same?
- Has the team ever shipped something from an AI that caused an incident? What happened?
Group 3: Team and ownership
- Who owns each major module or service? Is that documented anywhere?
- Have there been significant team changes in the last year — layoffs, departures, team restructuring?
- Are there modules where the original author has left? Was their knowledge transferred, or not?
- Is there any code that the team is afraid to touch? ("We don't change that file.")
Group 4: Development and deployment practices
- Does the team write specs or requirements before implementing features, or is the ticket the spec?
- Are there pre-merge reviews that specifically check for comprehension (blast radius, side effects) — not just "does it work"?
- Is there a process for capturing architectural decisions before they're implemented?
- Are there automated tests? What do they cover? What don't they cover?
Step 3: Analyze
With auto-discovery data and interview answers, analyze across three categories:
Velocity dark code (code that moved too fast to understand):
- Services with high AI usage + no spec requirement + no comprehension gate review
- Modules modified by many different engineers with no central owner
- High commit velocity with low test coverage
- "We just got it working" modules with no documented decisions
Structural dark code (emergent behavior nobody designed):
- Cross-service data flows identified in discovery that have no documenting schema or contract
- Shared resources (Redis, databases) accessed by multiple services with no ownership clarity
- Non-engineer-accessible workflows (scheduled jobs, event triggers, webhooks) with no manifest
- Services where the answer to "what does this do?" is "it depends on what called it"
Compounding factors:
- Knowledge loss: original authors of key modules have left
- Ownership dissolution: modules with no clear owner
- The "observability trap": the team believes Datadog/logs/metrics = comprehension (it doesn't)
- Regulatory exposure: services handling PII/payments/user data without behavioral contracts
Step 4: Write the Audit Report
Write the report to docs/dark-code-audit/YYYY-MM-DD-audit.md (create directory if needed).
Use today's date. If a previous audit exists in that directory, note what has changed.
Use the template in references/audit-template.md and the scoring guide in
references/scoring-rubric.md.
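The file-naming rule above (dated report, directory created on demand, comparison against an earlier audit) could be handled as in the following sketch. The helper names are hypothetical, and "previous audit" is assumed to mean the most recent file with an earlier date in its name.

```python
import re
from datetime import date
from pathlib import Path
from typing import Optional

AUDIT_DIR = "docs/dark-code-audit"
AUDIT_NAME = re.compile(r"^\d{4}-\d{2}-\d{2}-audit\.md$")


def report_path(repo_root: str, today: date) -> Path:
    """Return the dated report path, creating the audit directory if needed."""
    audit_dir = Path(repo_root) / AUDIT_DIR
    audit_dir.mkdir(parents=True, exist_ok=True)
    return audit_dir / f"{today.isoformat()}-audit.md"


def previous_audit(repo_root: str, today: date) -> Optional[Path]:
    """Return the most recent audit dated before today, or None if there is none."""
    audit_dir = Path(repo_root) / AUDIT_DIR
    if not audit_dir.is_dir():
        return None
    current = f"{today.isoformat()}-audit.md"
    # ISO dates sort chronologically as plain strings.
    earlier = sorted(
        p.name for p in audit_dir.iterdir()
        if AUDIT_NAME.match(p.name) and p.name < current
    )
    return audit_dir / earlier[-1] if earlier else None
```

If `previous_audit` returns a path, that file is the baseline for the "what has changed" note in the new report.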
The report must include:
- Executive summary: 3–5 bullet points covering what the audit found, the risk level, and the single most important thing to do next
- Hotspot map: a table of modules rated by dark code risk, with columns Module, Risk level, Context layers, Owner known, AI-heavy, Notes
- Top 3 failure scenarios: specific, plausible incidents that the current state makes possible, not abstract risks. Write them as brief narratives: "Service X calls Service Y's /compute endpoint. Y has no BEHAVIORAL_CONTRACTS.md. If Y's team changes the retry behavior, X's caller will [specific consequence]."
- Ownership gap analysis: modules with no identified owner, and the risk each poses
- Comprehension debt scorecard, using references/scoring-rubric.md:
  - Spec coverage % (what % of features/modules have a written spec)
  - Context layer coverage % (what % of modules have MODULE_MANIFEST.md + BEHAVIORAL_CONTRACTS.md + DECISION_LOG.md)
  - Review depth score (0–3: no review / existence check / functional review / comprehension gate)
  - Explainability score (can the team answer: what does X do with customer data on day Y?)
  - Overall comprehension debt level: LOW / MEDIUM / HIGH / CRITICAL
- Prioritized action plan: 3–5 specific, executable actions. Each action should name the specific module or practice to address, say exactly what to do (run /context-layer-generator on X, set up /comprehension-gate for PRs to Y), and explain what risk it reduces.
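As an illustration of the hotspot map, the sketch below renders a Markdown table with the six columns listed above. The `Hotspot` type, the use of the LOW / MEDIUM / HIGH / CRITICAL scale for individual modules, and the "n/3" format for context layers are assumptions of this example. The actual layout should follow references/audit-template.md.

```python
from dataclasses import dataclass

# Sorting order is an assumption: highest risk first.
RISK_ORDER = {"CRITICAL": 0, "HIGH": 1, "MEDIUM": 2, "LOW": 3}


@dataclass
class Hotspot:
    module: str
    risk: str            # one of the keys in RISK_ORDER
    context_layers: int  # how many of the three context files exist (0-3)
    owner_known: bool
    ai_heavy: bool
    notes: str = ""


def render_hotspot_map(hotspots: list[Hotspot]) -> str:
    """Render hotspots as a Markdown table, highest risk first."""
    lines = [
        "| Module | Risk level | Context layers | Owner known | AI-heavy | Notes |",
        "| --- | --- | --- | --- | --- | --- |",
    ]
    for h in sorted(hotspots, key=lambda h: (RISK_ORDER[h.risk], h.module)):
        owner = "yes" if h.owner_known else "no"
        ai = "yes" if h.ai_heavy else "no"
        lines.append(
            f"| {h.module} | {h.risk} | {h.context_layers}/3 | {owner} | {ai} | {h.notes} |"
        )
    return "\n".join(lines)
```

Sorting by risk keeps the modules that need attention at the top of the report, which matches how the executive summary and action plan refer back to the table.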
Guardrails
Do not recommend:
- "Add more monitoring or logging" as a fix for dark code. Monitoring tells you when something broke. It does not give anyone comprehension of why.
- "Add a supervisory AI layer." A layer that watches dark code is itself dark if it wasn't understood when built.
- "Rewrite it." Rewriting dark code without first building comprehension just creates new dark code faster.
Do distinguish:
- Confirmed findings (from auto-discovery or direct interview answer) vs. risks (inferred from patterns)
- Velocity dark code (addressable now with context layers + gate) vs. structural dark code (requires runtime tooling beyond Claude Code's scope)
After Writing
Report:
Dark code audit complete.
Comprehension debt level: [LOW / MEDIUM / HIGH / CRITICAL]
Hotspots: [N modules at HIGH risk]
Top action: [single most important next step]
Full report: docs/dark-code-audit/YYYY-MM-DD-audit.md
Recommended next steps:
• /context-layer-generator [highest-risk module] — build context layers
• /comprehension-gate — run on any pending PRs touching hotspot modules