| name | generate-data-lineage |
| description | Assembles a data flow narrative from MODULE_MANIFEST.md and BEHAVIORAL_CONTRACTS.md context files, answering the explainability question: "What does the system do with [data type] for [user journey]?" Use before a compliance or security review, when a dark-code-audit flags "Explainability: Partial", when onboarding a new engineer who needs to understand data flows, or when preparing for GDPR, EU AI Act, or SOC 2 review. Reads context layers across the codebase, interviews for gaps, and writes docs/data-lineage/YYYY-MM-DD-<name>.md with a confidence rating. Invoke as: /generate-data-lineage (all PII-touching flows in the codebase) /generate-data-lineage --journey user-signup (specific user journey) /generate-data-lineage --module path/to/mod (flows for a specific module) /generate-data-lineage --type payment (specific data type) |
/generate-data-lineage
Assembles a data flow narrative from context layers — answering the question any compliance reviewer, security engineer, or new hire will eventually ask: "What does the system actually do with this data?"
What this is and what it isn't
This skill produces a structural lineage based on documented behavior. It describes what the system should do based on MODULE_MANIFEST.md and BEHAVIORAL_CONTRACTS.md. It is not a runtime trace — it does not tell you what happened to a specific customer's data on a specific date. For incident response requiring that level of precision, you need application-level audit logs in addition to this document.
For most purposes, including compliance prep, engineer onboarding, preparing a GDPR record of processing activities (ROPA), and security review, the structural lineage is sufficient. Even a LOW-confidence lineage document that makes its gaps explicit is more useful than no document at all.
Arguments
- (none): traces all PII-touching data flows across the whole codebase
- `--journey <name>`: traces a specific user journey (e.g., "user-signup", "payment", "account-deletion")
- `--module <path>`: traces data flows for a specific module only
- `--type <category>`: traces a specific data type (e.g., "payment", "PII", "credentials", "health")
Phase 1: Discover data-touching modules
Use Glob to find all MODULE_MANIFEST.md files in the repo. For each, look for:
- Data classification fields mentioning: PII, personal data, credentials, payment data, health data, financial data, user identity, email, name, address, phone number
- Shared resource entries (databases, cache, queues) associated with user data
- Data flows sections showing user-sourced input
Build a candidate list of data-touching modules.
If --module or --type was specified, filter to the relevant subset. If --journey was specified, note it — you'll use it to filter which modules are relevant (a "user signup" journey involves the auth module, profile module, and notification module, but probably not the analytics aggregation module).
Report the candidate list before proceeding:
Data-touching modules found: [N]
- auth/ — handles credentials, session tokens
- profile/ — handles PII (name, email, address)
- billing/ — handles payment data
- notifications/ — handles email addresses
Modules without context files (gaps):
- recommendations/ — appears to handle user_id but has no MODULE_MANIFEST.md
Proceeding with documented modules. Gaps will be noted in Open Questions.
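As a rough illustration of the discovery step, the sketch below walks a repository for MODULE_MANIFEST.md files and flags the ones that mention the classification keywords listed above. The function name `find_data_touching_modules` and the plain substring matching are assumptions made for this sketch. The skill itself uses Glob and reads the manifests' data classification fields, so treat this as an approximation and not as the implementation.

```python
from pathlib import Path

# Keywords copied from the Phase 1 list. Matching is a case-insensitive
# substring check (an assumption of this sketch), so short words such as
# "name" will produce false positives that a human should review.
DATA_KEYWORDS = (
    "pii", "personal data", "credentials", "payment data", "health data",
    "financial data", "user identity", "email", "name", "address",
    "phone number",
)


def find_data_touching_modules(repo_root: str) -> dict[str, list[str]]:
    """Map each module directory to the keywords its manifest mentions."""
    root = Path(repo_root)
    candidates: dict[str, list[str]] = {}
    for manifest in sorted(root.rglob("MODULE_MANIFEST.md")):
        text = manifest.read_text(encoding="utf-8", errors="replace").lower()
        hits = [kw for kw in DATA_KEYWORDS if kw in text]
        if hits:
            # Report the module as a repo-relative directory, e.g. "profile/".
            module = manifest.parent.relative_to(root).as_posix() + "/"
            candidates[module] = hits
    return candidates
```

Modules that handle user data but have no manifest at all will not show up in this scan, which is why they are reported separately as gaps.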
Phase 2: Read behavioral contracts for each module
For each data-touching module, read BEHAVIORAL_CONTRACTS.md. For each interface that handles the target data type, extract:
| Field | Source |
|---|---|
| Interface name and signature | BEHAVIORAL_CONTRACTS.md interface section |
| Transformations applied | Side effects section — what changes about the data |
| Where data is written | Side effects — DB tables, cache keys, queues |
| What leaves the service | External calls, events emitted, responses returned |
| Data sensitivity | Data classification field |
| Retention/expiry | Retention constraints in MODULE_MANIFEST.md |
Order the modules by data flow: entry point → processing → storage → egress. For a user signup journey: API handler receives data → validation module transforms it → profile module stores it → notification module sends it to external email service.
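One way to picture the extraction and ordering is a small record per interface, sorted by its position in the flow. This is a minimal sketch: the `ContractRecord` fields mirror the table above, while the stage labels, the class name, and `order_by_flow` are hypothetical names chosen for illustration.

```python
from dataclasses import dataclass, field

# Stage order follows the text: entry point, processing, storage, egress.
# The label strings themselves are hypothetical.
STAGE_ORDER = {"entry": 0, "processing": 1, "storage": 2, "egress": 3}


@dataclass
class ContractRecord:
    module: str
    interface: str
    stage: str  # one of the STAGE_ORDER keys
    transformations: list[str] = field(default_factory=list)
    written_to: list[str] = field(default_factory=list)
    egress: list[str] = field(default_factory=list)
    sensitivity: str = "unknown"
    retention: str = "unknown"


def order_by_flow(records: list[ContractRecord]) -> list[ContractRecord]:
    """Sort records by flow stage; sorted() is stable, so ties keep input order."""
    return sorted(records, key=lambda r: STAGE_ORDER[r.stage])
```

Defaulting sensitivity and retention to "unknown" keeps missing fields visible, so they can be carried into the interview phase instead of being silently dropped.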
Phase 3: Interview for gaps
For each module with a data classification but incomplete contracts, ask targeted questions. Batch them — don't ask per-field per-module:
I need to fill in some gaps for these modules. For each:
auth/ — I can see it stores session tokens but the retention policy is not documented.
- How long are sessions retained? What triggers expiry?
- Does any external service receive the session token (e.g., an analytics or logging service)?
recommendations/ — This module appears to use user_id but has no context files.
- What personal data does this module process?
- Where is it stored and who else can read it?
- Is there a deletion or expiry path?
If the user can't answer, mark the gap in Open Questions and proceed with LOW confidence.
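The batching idea can be sketched as grouping every outstanding question under its module and emitting a single prompt. The function name `batch_questions` and the `(module, question)` tuple shape are assumptions for this example.

```python
from collections import defaultdict


def batch_questions(gaps: list[tuple[str, str]]) -> str:
    """Group (module, question) pairs into one prompt, one section per module."""
    by_module: dict[str, list[str]] = defaultdict(list)
    for module, question in gaps:
        by_module[module].append(question)

    lines = ["I need to fill in some gaps for these modules. For each:"]
    # Dicts preserve insertion order, so modules appear in first-seen order.
    for module, questions in by_module.items():
        lines.append(module)
        lines.extend(f"- {q}" for q in questions)
    return "\n".join(lines)
```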
Phase 4: Determine confidence level
Before writing:
- HIGH — all modules in the flow have complete MODULE_MANIFEST.md and BEHAVIORAL_CONTRACTS.md; no interview gaps
- MEDIUM — some modules have partial context; gaps are documented in Open Questions but the main flow is clear
- LOW — significant gaps; one or more key modules in the flow are missing context files
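The three levels can be read as a simple decision rule. The sketch below is one possible interpretation, not the skill's definitive logic: `ModuleContext`, its fields, and the `is_key` flag (standing in for the judgment of which modules are "key" to the flow) are all hypothetical.

```python
from dataclasses import dataclass


@dataclass
class ModuleContext:
    name: str
    has_manifest: bool
    has_contracts: bool
    open_gaps: int = 0   # interview questions left unanswered
    is_key: bool = True  # assumption: whether the flow depends on this module


def confidence_level(modules: list[ModuleContext]) -> str:
    """Return HIGH, MEDIUM, or LOW following the Phase 4 rules."""
    def documented(m: ModuleContext) -> bool:
        return m.has_manifest and m.has_contracts

    # LOW: a key module in the flow is missing context files.
    if any(m.is_key and not documented(m) for m in modules):
        return "LOW"
    # HIGH: every module has both files and no interview gaps remain.
    if all(documented(m) and m.open_gaps == 0 for m in modules):
        return "HIGH"
    # MEDIUM: partial context, with gaps recorded in Open Questions.
    return "MEDIUM"
```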
Phase 5: Write the lineage document
Write to docs/data-lineage/YYYY-MM-DD-<name>.md. Derive <name> from the argument provided (journey name, data type, or module name). Create the directory if it doesn't exist.
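A minimal sketch of deriving the output path, assuming the name is slugified so that a module path such as path/to/mod stays a single file name. The helper name `lineage_doc_path` and the "all-flows" fallback for the no-argument case are assumptions; the source does not say what name is used when no argument is given.

```python
import re
from datetime import date
from pathlib import Path
from typing import Optional


def lineage_doc_path(name: str, today: Optional[date] = None, root: str = ".") -> Path:
    """Build docs/data-lineage/YYYY-MM-DD-<name>.md, creating the directory."""
    # Collapse anything that is not a lowercase letter or digit into hyphens.
    slug = re.sub(r"[^a-z0-9]+", "-", name.lower()).strip("-")
    if not slug:
        slug = "all-flows"  # hypothetical fallback when no argument was given
    day = (today or date.today()).isoformat()
    out_dir = Path(root) / "docs" / "data-lineage"
    out_dir.mkdir(parents=True, exist_ok=True)
    return out_dir / f"{day}-{slug}.md"
```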
Use this structure:
# Data Lineage: [Journey/Type/Module]
**Scope:** [what data types and user journey this covers]
**Last updated:** [today's date]
**Confidence:** HIGH / MEDIUM / LOW
**Confidence note:** [brief explanation — e.g., "recommendations/ module lacks context files"]
> This document describes the system's *documented* behavior based on context layer files.
> It is not a runtime trace. For incident response requiring evidence of what actually happened
> to specific data, application-level audit logs are required in addition to this document.
---
## Entry Points
[Where this data type enters the system — which interfaces, which modules, from which callers]
| Entry point | Module | Interface | Data received |
|-------------|--------|-----------|---------------|
---
## Flow Narrative
[Step-by-step trace through the system]
### Step 1: [Module Name] — [what happens here]
**Interface:** `[interface name]`
**Input:** [what data arrives]
**Transformation:** [what changes about the data, if anything]
**Output:** [what leaves this module and where it goes]
**Written to:** [DB table, cache key, queue — or "nothing persisted at this step"]
### Step 2: ...
[Continue for each module in the flow]
---
## Storage Locations
Where this data is at rest.
| Location | Module | Data stored | Retention period | Who can read |
|----------|--------|-------------|-----------------|--------------|
---
## Egress Points
Where data leaves the system.
| Destination | Module | Interface | Data sent | Mechanism |
|-------------|--------|-----------|-----------|-----------|
[If none: "No documented egress points for this data type."]
---
## Deletion / Expiry Path
How data is eventually removed from each storage location.
[If unknown for any location: note it in Open Questions]
---
## Open Questions
Fields or modules where context was incomplete. These are investigation targets before this
lineage document can be relied upon for compliance purposes.
| Module | Gap | Impact |
|--------|-----|--------|
---
## Modules With Missing Context Files
The following modules appear to process this data type but have no context files.
Their behavior is not reflected in this lineage document.
[List — or "None. All data-touching modules have context files." for HIGH confidence]
**To fill these gaps:** Run `/context-layer-generator` on each module listed, then re-run
`/generate-data-lineage` to update this document.
Phase 6: Update MODULE_MANIFEST.md data flow sections
If the lineage process reveals that any existing MODULE_MANIFEST.md data flow sections are incomplete or inaccurate (e.g., a downstream consumer not listed, a data type not mentioned), update those files and note the changes.
After writing
Report:
- Journey/type/module documented
- Confidence level and reason
- Number of open questions
- Number of modules missing context files
- Path to the generated document
- Whether any MODULE_MANIFEST.md files were updated
If confidence is LOW, suggest the priority order for running /context-layer-generator — start with the module that handles the most sensitive data or sits at the entry point of the flow.
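The priority suggestion could be sketched as a sort over the modules that are missing context files. The numeric sensitivity ranking below is purely illustrative and not taken from the source, as are the names `MissingModule` and `generator_priority`. This version ranks by sensitivity first and uses entry-point position as a tiebreaker, which is only one reading of the guidance above.

```python
from dataclasses import dataclass

# Hypothetical ranking: higher means more sensitive. Adjust to local policy.
SENSITIVITY_RANK = {"credentials": 4, "health": 4, "payment": 3, "pii": 2}


@dataclass
class MissingModule:
    name: str
    data_type: str
    is_entry_point: bool = False


def generator_priority(modules: list[MissingModule]) -> list[MissingModule]:
    """Order modules for /context-layer-generator, most urgent first."""
    return sorted(
        modules,
        key=lambda m: (
            -SENSITIVITY_RANK.get(m.data_type, 0),  # most sensitive first
            not m.is_entry_point,                   # entry points break ties
            m.name,                                 # stable, readable fallback
        ),
    )
```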