| name | gitnexus-taint-analysis |
| description | Use when working on, reviewing, or extending GitNexus's CFG/taint/PDG subsystem (the `--pdg` layers), or when reasoning about source→sink data-flow findings. Examples: "How does taint analysis work here?", "Why didn't explain find this flow?", "Add a new sink/source", "Review the interprocedural taint code". |
CFG & Taint Analysis with GitNexus
Expert knowledge for the opt-in --pdg program-analysis subsystem: control-flow
graphs, reaching definitions, and intra- + inter-procedural taint. Read this
before touching gitnexus/src/core/ingestion/cfg/** or
gitnexus/src/core/ingestion/taint/**, or when explaining a finding.
When to Use
- "How does the taint engine work / why is this flow (not) reported?"
- Adding a source, sink, or sanitizer to the model.
- Extending or reviewing the CFG / reaching-defs / taint / summary code.
- Understanding the
explain MCP tool's findings (intra- vs inter-procedural).
- Debugging a false positive or false negative in
--pdg output.
The layered substrate (build order)
Taint runs on the graph, not beside it. Each layer is opt-in behind --pdg
and a default analyze run is byte-identical (the golden parity gate is the
hard floor for every change here).
L1 CFG per-function basic blocks + control-flow edges (M1 #2081)
L2 REACHING_DEF GEN/KILL def→use data dependence (pure solver) (M2 #2082)
L3 Taint (intra) source→sink over RD facts, minus sanitizers (M3 #2083)
L4 Taint (inter) per-function summaries composed over CALLS (M4 #2084)
- Worker-built, main-thread-solved. The parse worker builds each function's
CFG + harvests def/use + call-site facts onto
ParsedFile.cfgSideChannel
(plain, structured-clone-safe data — never AST nodes). The main thread runs
the pure solvers. NEVER re-parse on the main thread (re-introduces the #1983
OOM).
- In-phase emit (KTD1). L1–L4-harvest all run INSIDE the scope-resolution
pdg window (
scope-resolution/pipeline/run.ts, gated input.pdg === true),
because the disk-backed ParsedFile store is cleared when that phase ends — a
standalone post-mro phase would read empty data. The cross-function fixpoint
(L4) is the exception: it runs in its OWN registered phase (taintSummaries)
AFTER scope-resolution, because it needs the COMPLETE call graph, and consumes
small plain summary data threaded out via ScopeResolutionOutput.
- Pure-solver contract.
computeReachingDefs, computeTaintFlows,
harvestFunctionSummary, and solveInterprocTaint are pure and deterministic
(no graph, no I/O, no logger; sorted outputs). Snapshot tests and
content-derived edge ids depend on it.
Intra-procedural taint (L3)
Forward reachability over RD facts from matched sources to matched sinks,
killed by sanitizers. Key design points worth internalizing:
- Occurrence-tagged sites. A flat per-arg binding set cannot tell
exec(escape(x)) (safe) from exec(x) (finding); the harvest records nested
call structure (SiteRecord.parent/via-tags) so sanitizer interposition is
precise.
- Kind-set sanitizer model. A taint carries a set of neutralized
SinkKinds; a sink fires unless its kind is in the set. So escape(req.body)
suppresses res.send (xss) but STILL fires db.query (sql) — a kind-blind
kill would be a suppressed live injection (the forbidden FN direction).
path.basename(t) neutralizes path-traversal only, not command-injection.
- Statement-level finding identity. NOT block-pair (block conflation drops
distinct findings;
exec(req.body, req.query) is two findings).
- Persisted as
TAINTED edges (BasicBlock→BasicBlock); the path rides the
reason column via the shared versioned codec (taint/path-codec.ts).
Interprocedural taint (L4) — the functional/summary method
The production approach (Sharir-Pnueli 1981; the same shape as Meta's Pysa and
Mariana Trench, and FB Infer) — NOT full IFDS tabulation. Each function is
reduced to a compact summary, and summaries are composed over the already-
resolved CALLS graph.
Summary shape (taint/summary-model.ts, whole-parameter granularity):
| Edge | Meaning | Analogue |
|---|
param→return | a param flows to the return value | TITO — reserved (the floor already covers its recall; precision pass deferred) |
param→callee-arg | a param flows into arg j of a call (carries the path's neutralized sink kinds) | TITO into callee |
param→sink | a param reaches a modelled sink | partial/triggered sink |
source→return | the function generates+returns a source | generative — composed via the caller's callResults |
source→callee-arg | a generated source flows into a call | fixpoint SEED |
callResults | a user-function call's result flows to a sink/return/callee-arg in the caller | composes with callee source→return |
The fixpoint (taint/interproc-solver.ts): the unit is (function, parameter, source). Seed from source→callee-arg, propagate via
param→callee-arg, fire a finding when a tainted param meets param→sink.
- Cycle-safe by monotonicity. The tainted-set is monotone over a finite
lattice (
fn × param × source), so the worklist converges — a recursive call
just re-proposes an already-visited entry. SCC condensation would only refine
processing order; correctness/termination don't require it.
- Source-discriminated state (load-bearing). Key the state by the SOURCE
too. Keying only by
(fn, param) collapses multi-source flows: a sink param
tainted by source A is marked visited and a later flow from source B is dropped
before firing — the recurring multi-source bug class. (Bit M3; bit M4 U9.)
- Name-based call join. Match a summary's call-arg edge to a
CALLS edge by
CALLEE NAME, not call-site line — line-base parity (CFG 1-based vs reference
site) is fragile; the callee identity is exact and context-insensitivity
taints the callee's param identically at every call site.
- Persisted as
TAINT_PATH edges (Function→Function), function-level hop chain
in reason via the same codec; confidence < the intra-procedural 1.0.
Context-insensitivity is the accepted trade-off at this tier: one summary
per function, return/call-site merging accepted (security-conservative). Expect
some FP from merging; the bigger FN sources are unmodeled features (below).
Known false-negative classes (documented, deferred)
The largest is closures/callbacks (arr.forEach(() => sink(y))) — taint
into a callback is dropped without per-library models (true of CodeQL's JS libs
too). Also deferred: field/property flows (obj.x = taint; sink(obj.y)),
field-sensitive access paths, guard-style sanitizers, implicit/control-dependence
flows, promise/async-await threading, and destructured/rest params before a
tainted simple param (the summary port index is the binding ordinal, not the
formal arg position — needs a formal-param index threaded from the worker
BindingEntry). The interprocedural join is also context-insensitive: when one
caller invokes two distinct same-named callees, a flow into one
over-attributes to both (sound — over-report, never a missed flow). Absence of a
finding is NOT proof of safety.
GitNexus-specific gotchas
- Function↔CFG join.
FunctionCfg.functionStartLine is 1-based; Function/
Method node startLine is 0-based — join at startLine - 1. Function nodes
have no column, so same-line functions ({a:()=>x(), b:()=>y()}) are
ambiguous → drop (the summary driver counts unresolved) rather than
cross-wire.
- No rel-property index (S1). Kuzu has no secondary index on relationship
properties, and unanchored
[:TAINTED*]/[:TAINT_PATH*] queries explode.
TAINT_PATH is therefore MATERIALIZED + anchored at analyze time, never
traversed live; explain reads it source-anchored + LIMIT-guarded.
explain is the only discovery surface. TAINTED/TAINT_PATH are
deliberately OUT of VALID_RELATION_TYPES (impact's allow-list) and the web
schema (pinned in security.test.ts). explain enumerates both layers
(cross-function findings carry interprocedural: true).
- One shared codec. Both the emit path and
explain import
taint/path-codec.ts. Two hand-rolled copies of a wire format drift — never
fork it. New metadata extends the format WITHIN the version when writer +
reader ship together.
- Cache versioning. A worker-harvest shape change bumps the parse-cache pdg
NAMESPACE (
pdg:N), NOT SCHEMA_BUMP (which cold-invalidates every user).
Persisted-graph/config changes ride RepoMeta.pdg's key-union mismatch →
full writeback. Model content rides taintModelVersion.
Adding a source / sink / sanitizer
Edit the language model in taint/typescript-model.ts (registered via the
explicit registerBuiltinTaintModels seam, keyed by SupportedLanguages). The
spec is hashable data (no functions). A sanitizer's neutralizes lists the
EXACT sink kinds it defends — never a blanket kill. Add a fixture + assert the
finding (or its absence) in test/unit/taint/ (real-source harness:
test/helpers/ts-cfg-harness.ts); the end-to-end proof is
test/integration/cfg/.
Validation checklist for any --pdg change
1. tsc clean (schema additions are exhaustiveness-checked; watch the
api.ts getNodeQuery runtime read-path if a node label is added).
2. Targeted vitest by directory (test/unit/taint, test/unit/cfg,
test/integration/cfg) — verify by ISOLATION, not full-suite exit
(known load-flakes). `node scripts/build.js` before worker/integration runs.
3. Flag-off golden byte-identical (pipeline-graph-golden.test.ts).
4. bench/cfg/measure.mjs --check (no fingerprint drift / budget regression).
5. detect_changes() before commit; impact({direction:'upstream'}) before
editing shared symbols (KnowledgeGraph, RepoMeta, RelationshipType, codec).
Prior art (for deeper design questions)
Sharir & Pnueli 1981 (functional approach); Reps-Horwitz-Sagiv IFDS (POPL 1995);
FlowDroid/StubDroid (access-path summaries); Pysa & Mariana Trench (TITO /
propagations, parallel SCC fixpoint); CodeQL Models-as-Data (the richest port
notation, incl. callback ports); Infer (content-keyed incremental summaries).