| name | vulnerability-triage |
| description | Reference vocabulary for interpreting vulnerability findings — detector-vs-impact distinction, severity anchoring on demonstrated evidence, the eleven-item interpretation rubric, delegation transparency, primitive-extent scaling, the disqualifier taxonomy (D-0..D-4), CVSS Achievable / Environmental framing, hedging-phrase elimination, and falsification asymmetry. Read when interpreting a finding to decide whether the demonstrated evidence supports the severity it would warrant. Surface-neutral; pulls in the relevant surface skill (e.g. `memory-safety-c-cpp`) for bug-class-specific exploitability factors. |
Vulnerability Triage
Reference vocabulary for interpreting what an investigation actually demonstrated and translating it into threat-model language. Catalogs the rubric, disqualifiers, scoring conventions, and pitfalls a triage decision needs to reason about.
Core rule
Severity is anchored on what was actually demonstrated, never on the theoretical maximum implied by the vulnerable code. A detector firing on an isolated harness — sanitizer, fuzz crash, static analyzer hit, tainted-flow report, schema-validator alert, assertion failure — proves the detector caught something. It does NOT prove attacker-visible impact. Confidentiality, integrity, and availability claims require the corresponding effect to have been observed in production-equivalent conditions. When the demonstrated observation is weaker than the theoretical ceiling, score the demonstrated level and label the ceiling as such.
Detector evidence vs impact evidence
These two are routinely confused. Hold the line between them.
Detector evidence is anything a tool flagged: ASAN / UBSAN / TSAN / MSan crash, libFuzzer reproducer, static-analyzer warning, taint-flow trace, assertion site, schema-validator alert, AST pattern match. Detector evidence proves the detector caught an anomaly at the named site. It says nothing on its own about whether an attacker can reach that site, supply attacker-useful values, or observe the result.
Impact evidence is an observation of an attacker-visible CIA effect under production-equivalent conditions: reading bytes the attacker did not previously hold, mutating state the attacker did not previously control, denying service to a user other than the attacker, gaining a capability the attacker did not previously have. Impact evidence is anchored on observable artifacts — a crash, sanitizer trace, changed output, callback fired, file read, error message, measurable state change.
"Ran without error" is not impact evidence. If the expected effect was not observed, either the experiment was wrong or the bug is not triggered — diagnose which; do not paper over. A finding supported only by detector evidence is a detector hit awaiting demonstration, not a confirmed vulnerability.
Interpretation rubric
The rubric below has eleven items; work through all of them for each finding. Skipping any one of them is a triage failure. Each item names a question, what counts as a complete answer, and what an incomplete answer looks like.
- Reproduction. Re-run the exact input. Confirm the same observation. Complete answer: same input, same observation, same artifact (crash signature, sanitizer trace, observed output). Incomplete: "should reproduce", "did not retry", or a different observation than the original. If reproduction fails, the finding is not yet triageable.
- Effect realism. Classify the observation against the four categories below; the first two support an impact claim, the last two refute it.
  - Attacker-uncontrolled target state observed (genuine impact).
  - Attacker-controlled effect on state the attacker did not previously hold (genuine impact).
  - Attacker's own input echoed or round-tripped back, or constant / neutral state (zeroed memory, schema-sanitized placeholder, default config, empty collection): no impact.
  - Non-exploitable error path the attacker cannot steer: no impact.
  Reject impact claims when the observation falls in the last two categories.
- Attacker control surface. Is the effect's extent adversary-chosen, or bounded by structure the attacker cannot influence? Complete answer: explicitly state which input fields drive the effect's size, location, or content, and which are fixed by the surrounding code. Incomplete: "the attacker controls the input" without naming which subset of the input controls which dimension of the effect.
- Adjacency / payload model. What attacker-useful content plausibly reaches the affected location in production state? For memory-read findings, what target state sits adjacent to the affected region? For injection findings, what sink does the tainted value reach? For auth-bypass findings, what resources become reachable? State this as a reasoned model unless the experiment measured the specific payload.
- Channel reality. In deployed configurations, does the attacker actually observe the effect? Some channels realize impact (metadata APIs, attacker-rendered output, error messages echoed back). Others absorb or drop it (batch jobs, log-only sinks, sandboxed processes, observation channels not exposed to the caller). Complete answer: name the production channel through which the effect would be observed; trace it from the bug site to where the attacker sees it.
- Production hardening. What mitigations present in production (sanitizers, schema validators, CSP, ASLR, stack canaries, auth middleware, sandbox, rate limits, RELRO, FORTIFY_SOURCE) change the attacker's success rate? Did the experiment use the production profile or bypass it? A harness with hardening stripped tells you nothing about production outcomes; score what production does, not what the harness did.
  When a mitigation table asserts that a runtime cap, flag, or mode is at a particular default value in production, cite the line that registers that default (GObject property spec, struct initializer, constant read at startup), not a call site that happens to leave it unset. A harness that overrides the default via the library's public API (set_max_X(0), set_strict(false)) looks identical at the bug site to a real default of 0; the registration cite is what distinguishes them.
- External exploitability trace. Does the input originate from an external / untrusted source, and through what delivery channel? Name it concretely: network protocol, file upload, queue consumer, environment-variable propagation, IPC. Complete answer: a named delivery channel from an untrusted boundary to the bug site. Incomplete: "an attacker could supply input" without naming how.
- Duplicate / CVE check. Search the project's issue tracker, CVE databases, and changelog for the code location. Note overlap with existing fixes even when partial; a fix for one variant does not necessarily cover sibling variants.
- Severity score. CVSS 3.1 anchored on demonstrated evidence under the Achievable / Environmental framing below. Justify each metric against observations. If a metric is ambiguous, pick the lower value and note what work would support the higher.
- Delegation transparency. See the Delegation transparency section below. If the finding depends on a delegate library, both linkage realism and delegate-side validation analysis must be present. If either is missing, the score caps at Low regardless of other rubric items.
- Primitive extent. See the Primitive extent section below. The scaling axes for this finding's primitive class are enumerated. For each axis, either discover demonstrated it or it is documented as un-explored. Un-explored axes that would change severity drive an insufficient verdict with a directive; they do NOT raise the achievable score on extrapolation alone. For memory-safety findings, impact surface and heap layout reachability are explicitly analyzed.
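The rubric above can be tracked mechanically. The following is a minimal sketch, assuming a triage record that stores one free-text answer per rubric item; the identifier names (`RUBRIC_ITEMS`, `EffectClass`, `missing_items`) are hypothetical and not part of any existing tool.

```python
from enum import Enum

# The eleven rubric items, in the order the rubric lists them (names are illustrative).
RUBRIC_ITEMS = (
    "reproduction",
    "effect_realism",
    "attacker_control_surface",
    "adjacency_payload_model",
    "channel_reality",
    "production_hardening",
    "external_exploitability_trace",
    "duplicate_cve_check",
    "severity_score",
    "delegation_transparency",
    "primitive_extent",
)


class EffectClass(Enum):
    """The four effect-realism categories from the rubric."""
    UNCONTROLLED_TARGET_STATE = 1       # genuine impact
    CONTROLLED_EFFECT_ON_NEW_STATE = 2  # genuine impact
    ECHO_OR_NEUTRAL_STATE = 3           # no impact
    UNSTEERABLE_ERROR_PATH = 4          # no impact


def supports_impact(effect: EffectClass) -> bool:
    """Only the first two categories support an impact claim."""
    return effect in (
        EffectClass.UNCONTROLLED_TARGET_STATE,
        EffectClass.CONTROLLED_EFFECT_ON_NEW_STATE,
    )


def missing_items(answers: dict) -> list:
    """Return rubric items with no answer or a blank answer (a triage failure)."""
    return [item for item in RUBRIC_ITEMS if not str(answers.get(item, "")).strip()]
```

A non-empty result from `missing_items` means the triage is incomplete; the check does not judge whether an answer is a complete one in the rubric's sense, which still needs a human read.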
Delegation transparency
When the violation site depends on values produced by a delegate library (any upstream parser, validator, decoder, or runtime not under the project's source control), severity cannot anchor beyond "latent code smell" until the delegate's own validation has been confirmed to allow the trigger values through. A harness that links against a project-internal stub of the delegate falsely demonstrates reachability: the trigger reaches the violation site because the stub's permissive behavior bypassed the real delegate's validation, not because the real delegate would let it through.
Before scoring any severity above Low (CVSS ~3.x) on a finding that depends on a delegate library, two confirmations are required:
- Linkage realism. Cite evidence that the harness links against the real delegate library version production would link, not a project-internal stub. Acceptable evidence: dynamic linker output (ldd, otool -L), build manifest, package version pin. Project-internal stub directories (typically named *-stub/, *_mock/, mock-*/, fake-*/) resolved by the linker disqualify the evidence.
- Delegate-side validation analysis. Cite the specific delegate-side check that the trigger must pass through (function name, file, validation expression) and explain why it allows the trigger values through. If no such check exists in the delegate, state that explicitly with citation to the delegate's source.
Without both, the finding is a hardening opportunity at most. Score Low, recommend the upstream-side guard as defense-in-depth, and flag for harness redesign.
This rule sits above the rubric — failing it caps the achievable score regardless of the rubric's other items.
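The two confirmations can be checked mechanically once the linker-resolved paths are in hand. This is a minimal sketch, assuming the paths have already been extracted from `ldd` or `otool -L` output by some other step; the function names and the `libexample` paths in the usage note are hypothetical. The directory patterns are the ones named above.

```python
from fnmatch import fnmatch
from typing import Optional

# Stub directory patterns named in the text above (trailing slash dropped).
STUB_DIR_PATTERNS = ("*-stub", "*_mock", "mock-*", "fake-*")


def resolves_to_stub(resolved_path: str) -> bool:
    """True if any directory component of a linker-resolved path looks like a stub."""
    directories = resolved_path.strip("/").split("/")[:-1]  # drop the file name
    return any(fnmatch(d, p) for d in directories for p in STUB_DIR_PATTERNS)


def delegate_cap(resolved_paths: list, validation_citation: Optional[str]) -> Optional[str]:
    """Return "Low" when the delegation rule caps the score, else None (no cap)."""
    linkage_real = bool(resolved_paths) and not any(
        resolves_to_stub(p) for p in resolved_paths
    )
    validation_analyzed = bool(validation_citation and validation_citation.strip())
    return None if (linkage_real and validation_analyzed) else "Low"
```

For example, `delegate_cap(["/opt/app/vendor/libexample-stub/libexample.so"], "some citation")` returns `"Low"` because the stub directory disqualifies the linkage evidence. A name match is only a heuristic: a stub in an unconventionally named directory passes this check, so the build manifest or version pin still needs a manual look.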
Primitive extent
The achievable CVSS score anchors on what discover actually demonstrated, including the scaling extent discover exercised, not on a larger extent that the source suggests is reachable but that no experiment exercised. A primitive's theoretical ceiling implied by the surrounding code is not impact evidence; that is the same projection the Core rule forbids.
Triage's primitive-extent job is enumeration and gap-checking, not extrapolation. For each finding, list the scaling axes that apply to the primitive class. For each axis, classify whether discover demonstrated it (and at what value) or left it un-explored. An un-explored axis whose value would change severity if exercised is a triage gap, not a license to score on the projected value.
When an un-explored axis would change severity, the verdict is insufficient with a directive that names the axis, the boundary value to push toward, and what discover should run to characterize it. This is one of the concrete triggers for the insufficient-verdict discipline elsewhere in this skill — the achievable score does NOT rise on extrapolation alone. If every axis that matters has been demonstrated by discover (even at a non-maximum value), score on the demonstrated extent and note which axes were sampled where.
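The enumeration-and-gap-check described above might be sketched as follows. The `Axis` record and `extent_verdict` function are hypothetical names introduced for illustration; the verdict strings are placeholders for whatever labels the workflow actually uses.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Axis:
    name: str
    demonstrated_value: Optional[str]  # None means discover left the axis un-explored
    changes_severity: bool             # would exercising the axis change the score?
    boundary_to_push: str              # boundary value the directive should name
    experiment: str                    # what discover should run to characterize it


def extent_verdict(axes: list) -> tuple:
    """Return ("score-demonstrated", []) or ("insufficient", directives).

    An un-explored axis that would change severity never raises the score;
    it produces a directive naming the axis, the boundary, and the experiment.
    """
    directives = [
        f"axis={a.name}; push toward {a.boundary_to_push}; run: {a.experiment}"
        for a in axes
        if a.demonstrated_value is None and a.changes_severity
    ]
    if directives:
        return ("insufficient", directives)
    return ("score-demonstrated", [])
```

An axis demonstrated at a non-maximum value still counts as demonstrated here, which matches the rule that scoring proceeds on the demonstrated extent with a note on where each axis was sampled.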
Common scaling axes per primitive class — for each, the entry names what discover should push to characterize the axis:
- Out-of-bounds write/read. Discover should push distance past the allocation, total bytes accessed, attacker control over the written/read byte values, and stride between accesses. The minimum trigger may demonstrate one byte past one allocation with a fixed value; the axes ask whether that extends to attacker-chosen bytes across a larger span at a controllable stride.
- Use-after-free. Discover should push reuse delay against any deferred-free queue or generation counter, and characterize the type set that can land in the freed slot under attacker-influenceable allocation pressure.
- Type confusion. Discover should push the type-pair set the confusion lands on across the surrounding dispatch, and the depth of polymorphic dispatch reached in the confused state before the program faults or returns.
- Integer overflow into undersized allocation. Discover should push the undersize ratio between requested and allocated size, and the extent of the downstream write or read that consumes the un-truncated value.
- Unbounded iteration. Discover should push the iteration count against the buffer or frame the iteration writes into, and the attacker control over the value written at each step.
- Resource exhaustion / DoS. For primitives whose impact is allocation/CPU/IO exhaustion rather than memory corruption, discover should push four axes:
  - File-size-to-effect amplification: peak RSS, CPU time, or FD count divided by input file size on disk.
  - Wire-body-to-effect amplification when transport compression is in scope: the same numerator divided by compressed POST body size. This is material for AV:N primitives, where gzip/br/zstd handled by upstream proxies can dominate cost-to-attacker.
  - Per-item-size × item-count multiplication when the bug multiplies via list/array containers. Both ceilings are attacker-controlled; document them independently.
  - The allocation-timing locus: eager parse before any application code runs vs. lazy on-demand. This determines where in the call graph mitigations can be sited.
Score the achievable severity on the maximum extent discover demonstrated across these axes. If discover only ran the minimum trigger and an axis above plausibly changes severity, return insufficient with the axis named in the directive — do not silently raise the score, and do not silently cap it without flagging the gap.
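The resource-exhaustion amplification ratios named in the axis list above are simple quotients. The sketch below shows the arithmetic on hypothetical measurements; the numbers are invented for illustration and do not come from any real experiment.

```python
def amplification(effect: float, input_bytes: int) -> float:
    """Effect (peak RSS bytes, CPU seconds, or FD count) per input byte."""
    if input_bytes <= 0:
        raise ValueError("input size must be positive")
    return effect / input_bytes


def container_ceiling(per_item_bytes: int, item_count: int) -> int:
    """Total allocation when the bug multiplies per-item size by item count."""
    return per_item_bytes * item_count


# Hypothetical measurements for one trigger file.
peak_rss_bytes = 512 * 1024 * 1024  # 512 MiB peak RSS
file_size_on_disk = 64 * 1024       # 64 KiB input file
compressed_body_size = 2 * 1024     # 2 KiB compressed body on the wire

file_ratio = amplification(peak_rss_bytes, file_size_on_disk)     # 8192.0
wire_ratio = amplification(peak_rss_bytes, compressed_body_size)  # 262144.0
```

Reporting both ratios separately keeps the file-size axis and the wire-body axis distinct, since only the second applies when transport compression is in scope.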
When evaluating a heap OOB primitive, two additional analyses feed the impact axis. These reason over facts already in evidence (struct layouts, allocation order in the source, the crafted trigger's input shape) rather than projecting from one demonstrated trigger to a different one — they are static-analysis triage, not extrapolation:
- Impact surface. Enumerate the adjacent allocation classes within the write stride: object vtables, library-internal struct fields, allocator metadata, and similar. Name what corruption produces, and which CVSS axis each corruption shape feeds (I, A, C). The enumeration must be concrete (named struct types, named fields) — "memory corruption could affect anything" is not enumeration.
- Heap layout reachability. Is the heap state attacker-influenceable from the same input that triggers the bug? If yes, the evidence supports the lower Attack Complexity value (heap grooming is in-band rather than requiring a separate vector). If no, name the limiting factor (out-of-arena allocations, fresh process per request, deterministic allocator).
These two analyses are answerable from artifacts the workflow already produces (struct layouts, allocation traces, the crafted trigger input). They drive the CVSS impact metrics (C, I, A) and the Attack Complexity metric; they are not optional for memory-safety findings.
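The impact-surface enumeration can be approximated from a struct layout and the demonstrated write pattern. This is a simplified sketch under stated assumptions: the layout is given as byte offsets measured from the end of the overflowed allocation, the write pattern is a fixed number of fixed-width writes at a fixed stride, and every type and field name is hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Field:
    owner: str      # struct type name
    name: str       # field name
    offset: int     # byte offset from the end of the overflowed allocation
    size: int       # field size in bytes
    cvss_axis: str  # "C", "I", or "A": which impact metric the corruption feeds


def fields_in_reach(layout, first: int, count: int, stride: int, width: int = 1):
    """Return fields touched by `count` writes of `width` bytes.

    Writes start `first` bytes past the allocation and are spaced `stride` apart.
    """
    touched = []
    for f in layout:
        for i in range(count):
            start = first + i * stride
            # Half-open interval overlap: [start, start+width) vs [offset, offset+size)
            if start < f.offset + f.size and f.offset < start + width:
                touched.append(f)
                break
    return touched
```

For a layout containing a hypothetical `conn_t.handler_vtable` at offset 16 and `conn_t.refcount` at offset 32, three one-byte writes at stride 8 reach only the vtable field. The output is a list of named fields, which is the concrete enumeration the text asks for; what each corruption produces still has to be argued per field.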
Mitigation reachability. The abort site (function, file, line) where discover's demonstrated trigger actually faults or aborts is part of the primitive's characterization — record it alongside the scaling axes. Any mitigation triage recommends or accepts must sit at or upstream of that abort site in the call graph. A mitigation in code reachable only after the abort fires is unreachable in the demonstrated attack path and is not a valid recommendation, even if the static analysis of the surrounding code suggests it would catch the trigger. This check is most often violated when the abort fires inside a delegate library during eager parse and the proposed fix sits in application-side code that runs after parse completes.
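The reachability check above can be sketched against a linearized attack path. This assumes the demonstrated path is recorded as the ordered list of functions entered up to and including the abort site, which is a simplification of a real call graph; the function names in the usage note are hypothetical.

```python
def mitigation_reachable(attack_path: list, abort_site: str, mitigation_site: str) -> bool:
    """True if the mitigation sits at or upstream of the abort site.

    `attack_path` is the ordered list of functions entered on the demonstrated
    trigger, ending at the abort site. Code that would only run after the
    abort does not appear in it.
    """
    if abort_site not in attack_path:
        raise ValueError("abort site is not on the demonstrated attack path")
    if mitigation_site not in attack_path:
        return False  # never executed before the abort on this path
    return attack_path.index(mitigation_site) <= attack_path.index(abort_site)
```

With a path of `["http_handler", "delegate_parse", "delegate_alloc_table"]` and an abort in `delegate_alloc_table`, a guard proposed in `http_handler` is reachable, while one proposed in an application-side `app_validate_document` that runs after parse completes is not.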
Disqualifier taxonomy
When the verdict is insufficient, the supporting notes carry one of the labels below. The labels disambiguate the gap and let downstream routing target it. The taxonomy is for justifying gaps in already-presented findings — not for seek-and-find dismissal of findings that meet the rubric.
- D-0: Evidence-synthesis failure. A prior experiment or analysis disproved the hypothesis the finding now rests on, but the finding was assembled without integrating that result. The contradiction is in the artifacts; the finding ignored it.
- D-1: Test, mock, example, or documentation code. The bug site lives in non-production code: unit test fixtures, example harnesses, sample inputs in documentation, mock implementations, debug-only utilities. Production callers do not reach it.
- D-1.5: Privilege tautology. The "attack" requires the attacker to already hold the capability the bug supposedly grants. A vulnerability must harm someone other than the attacker. Root reading a root-owned file, an admin reading admin-only data through an admin-only path, or a process reading its own memory through its own debug interface is not a vulnerability.
- D-2: Disqualifying preconditions. The bug requires conditions the attacker cannot supply: chaining with a separate unconfirmed vulnerability, a victim taking a specific action, physical access to the host, prior authentication where the bypass is not itself the finding, a non-default configuration the deployer would not reasonably enable.
- D-3: Hedged claim that did not survive verification. The finding's severity claim depends on language that hedges over the evidence rather than stating it (see "Hedging-language elimination" below). When the hedge is stripped, the claim collapses or contradicts the artifacts.
- D-4: No security impact. The bug class is real, but no CIA effect follows. Memory leak without amplification, resource exhaustion without disproportion, cosmetic state corruption that no consumer reads, an observation channel that absorbs the effect before any caller sees it, debug-only side effects.
CVSS Achievable / Environmental scoring
Triage emits one CVSS 3.1 vector and score per finding. That score is the Achievable / Environmental number — it reflects what was demonstrated under production-equivalent hardening, not the inherent worst case the bug class could imply on a hardening-stripped target.
Conventions for the metrics:
- AV (Attack Vector). Scored on how the attacker reaches the vulnerable code, not how they reach the host. A network-reachable service whose vulnerable function requires a local side-channel to trigger is AV:L for that finding, even though the service is on the network.
- PR (Privileges Required). Scored on the privileges needed to reach the vulnerable path, not the privileges needed to log into the system at all. If the vulnerable path is gated behind admin-only middleware, score PR:H, even if the broader service has anonymous endpoints elsewhere.
- UI / S / C / I / A. Anchored on what was observed under the production hardening profile. If production hardening absorbs the effect (a schema validator strips the payload; padding zeros the leak; a sandbox contains the side effect), the impact metric drops to None for the absorbed dimension and the absorption is noted.
- Hardening-stripped harnesses contribute nothing here. A Tier-1 harness compiled without canaries, ASLR, or FORTIFY_SOURCE, or linked with a custom permissive allocator, may score very high in the abstract; that score is not the Achievable / Environmental score. If the production profile was not exercised, score what the production profile would do and call out the gap.
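To show how an absorbed dimension moves the number, the sketch below computes a score from a vector string using the CVSS 3.1 base-score equations and metric weights. Applying the base equations to the demonstrated vector is an assumption made for illustration; the text above does not say which calculator or metric group the workflow uses.

```python
import math

WEIGHTS = {
    "AV": {"N": 0.85, "A": 0.62, "L": 0.55, "P": 0.2},
    "AC": {"L": 0.77, "H": 0.44},
    "UI": {"N": 0.85, "R": 0.62},
    "CIA": {"H": 0.56, "L": 0.22, "N": 0.0},
}
PR_WEIGHTS = {  # Privileges Required weight depends on Scope
    "U": {"N": 0.85, "L": 0.62, "H": 0.27},
    "C": {"N": 0.85, "L": 0.68, "H": 0.5},
}


def roundup(value: float) -> float:
    """CVSS 3.1 Roundup: smallest one-decimal number >= value, float-safe."""
    scaled = round(value * 100000)
    if scaled % 10000 == 0:
        return scaled / 100000.0
    return (math.floor(scaled / 10000) + 1) / 10.0


def base_score(vector: str) -> float:
    """Score a vector such as 'AV:N/AC:L/PR:N/UI:N/S:U/C:H/I:N/A:N'."""
    m = dict(part.split(":") for part in vector.split("/"))
    scope = m["S"]
    cia = WEIGHTS["CIA"]
    iss = 1 - (1 - cia[m["C"]]) * (1 - cia[m["I"]]) * (1 - cia[m["A"]])
    if scope == "U":
        impact = 6.42 * iss
    else:
        impact = 7.52 * (iss - 0.029) - 3.25 * (iss - 0.02) ** 15
    exploitability = (
        8.22 * WEIGHTS["AV"][m["AV"]] * WEIGHTS["AC"][m["AC"]]
        * PR_WEIGHTS[scope][m["PR"]] * WEIGHTS["UI"][m["UI"]]
    )
    if impact <= 0:
        return 0.0  # every impact dimension absorbed or unobserved
    total = impact + exploitability
    if scope == "C":
        total *= 1.08
    return roundup(min(total, 10))


harness_only = base_score("AV:N/AC:L/PR:N/UI:N/S:U/C:H/I:N/A:N")  # 7.5
absorbed = base_score("AV:N/AC:L/PR:N/UI:N/S:U/C:N/I:N/A:N")      # 0.0
```

The two example vectors differ only in Confidentiality: if production padding zeros the leaked bytes, C drops to None and the score falls from 7.5 to 0.0, which is the effect the UI / S / C / I / A convention describes.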
Hedging-language elimination
Hedged claims drift severity upward without evidence. Every hedge below must either be verified to a value-level statement or struck from the finding entirely. They are claims-to-verify, not claims-to-include.
- "could potentially"
- "may be exploitable"
- "in theory"
- "subject to specific conditions"
- "if configured incorrectly"
- "low likelihood"
- "informational risk"
- "cannot be ruled out"
- "depending on environment"
- "under certain circumstances"
- "in some configurations"
A finding that depends on hedging language to assert its severity is a D-3 disqualifier candidate. Verified replacements name the specific value, configuration, or condition that turns the hedge into an observation.
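A finding draft can be scanned for the phrases listed above before review. This is a minimal sketch using a case-insensitive literal match; `find_hedges` is a hypothetical helper, and a match only marks a claim to verify, it does not decide whether the claim survives.

```python
import re

# The hedging phrases listed above.
HEDGES = (
    "could potentially", "may be exploitable", "in theory",
    "subject to specific conditions", "if configured incorrectly",
    "low likelihood", "informational risk", "cannot be ruled out",
    "depending on environment", "under certain circumstances",
    "in some configurations",
)
_HEDGE_RE = re.compile("|".join(re.escape(h) for h in HEDGES), re.IGNORECASE)


def find_hedges(text: str) -> list:
    """Return (1-based line number, hedge phrase) for every hedge in the text."""
    hits = []
    for lineno, line in enumerate(text.splitlines(), start=1):
        for match in _HEDGE_RE.finditer(line):
            hits.append((lineno, match.group(0).lower()))
    return hits
```

Each hit should be resolved by replacing the phrase with the specific value, configuration, or condition that was observed, or by striking the sentence.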
Falsification asymmetry
One observed exploitation refutes "no bug." One failed exploitation does NOT refute "this bug." The burden is asymmetric: a single positive observation outweighs many failed attempts against the same evidence. Practical consequences:
- If any approach succeeds in demonstrating the attacker-visible effect, the finding has impact evidence — even if other approaches against the same hypothesis failed.
- If all plausible approaches failed against the same evidence, the finding is insufficient, not "ruled out." A different approach, a different harness tier, or a different production profile may still demonstrate impact.
- Negative results constrain the demonstrated severity; they do not prove the bug-class is absent. State what was tested and what was not, and what range of severities the negatives rule out.
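The asymmetry reduces to a small decision rule. The sketch below uses hypothetical names (`Attempt`, `asymmetric_verdict`) and placeholder verdict strings; note that no input can produce a "ruled out" result.

```python
from dataclasses import dataclass


@dataclass
class Attempt:
    approach: str          # e.g. the harness tier or production profile used
    effect_observed: bool  # was the attacker-visible effect observed?


def asymmetric_verdict(attempts: list) -> str:
    """One success is impact evidence; failures alone never yield "ruled out"."""
    if any(a.effect_observed for a in attempts):
        return "impact-evidence"
    return "insufficient"
```

An empty attempt list also returns `"insufficient"`, since nothing has been tested. The verdict string should travel with the list of approaches so the notes can state what was and was not tried.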
Common pitfalls
Patterns the rubric catches that get missed in practice. These are surface-neutral; per-surface restatements (e.g. heap-OOB-read adjacency rules, sanitizer-on-padding behavior) live in the relevant surface skill.
- Detector evidence rebadged as impact evidence. A sanitizer fired on a Tier-1 isolated function. The finding writeup translates that into "remote code execution" without ever naming where the production code observes the effect. The detector report says only that the detector caught something; that is not the same claim as remote code execution.
- Round-tripped attacker input mistaken for leaked target state. The bytes the bug "leaks" are the bytes the attacker just supplied, possibly with some encoding transformation. That is not a confidentiality finding. Any leak claim must demonstrate adjacency: what target state sits next to the affected region, and that the observation contained that target state.
- Padding, zeroed memory, or sanitized placeholders mistaken for sensitive content. An OOB read returns NUL bytes, schema-sanitized fields, or default-config values. The detector fired; impact has not been demonstrated.
- Hardening-stripped harness scored as production. The harness ran without ASLR, without canaries, without FORTIFY_SOURCE, with a permissive custom allocator, or with the validating middleware bypassed. The production binary intercepts the effect; the harness did not see that interception. The Achievable / Environmental score is what production does, not what the harness did.
- Harness API override mistaken for production default. The harness called the library's public configuration API to lift a safety valve (set_max_X(0), disable strict mode, suppress validation) so a latent bug becomes reachable, which is a legitimate research move. The mitigations table then asserts that the production default is the lifted value, conflating "the harness set it to 0" with "the default is 0". Distinct from compiler-hardening stripping: this is the library's own runtime configuration, and the registration site (property spec, initializer) is the ground truth, not the call-site override.
- Theoretical ceiling promoted to demonstrated severity by rewording. "This pattern could yield RCE" gets quoted, then rewritten without the hedge, then cited as "demonstrated RCE" two paragraphs later. The rubric's reproduction and effect-realism items refute this; the hedging-elimination rule names it.
- Variant tunnel-vision. The harness reproduced one variant of the underlying pattern; the finding reports only that variant. When the same arithmetic / lifetime / dispatch shape reaches sibling variants on the same surface, those are findings in their own right. Document each, even when the experiment exercised only one.
- Channel that absorbs the effect. The bug site exists and the value is attacker-controlled, but the production channel between the bug site and any observer is a logging path, a batch job that runs offline, or a sandbox whose violations never escape. The effect happened; the attacker cannot observe it. Score impact at None for the dimension the channel absorbs.
- Privilege tautology dressed as a finding. The "attacker" is the same principal as the "victim." A user reading their own data through their own session, an admin invoking an admin-only API, a process inspecting its own memory. D-1.5; not a vulnerability.
Insufficient-verdict discipline
When evidence does not support the severity the finding would warrant, the verdict is insufficient. The notes accompanying that verdict MUST name the specific gap concretely:
- The parameter to sweep, with its named range and the boundary values not yet exercised.
- The mitigation to exercise, with the production profile to rebuild under and the hardening flags to enable.
- The adjacency to verify, with the target-state class that must be observed leaked rather than padding or attacker-input.
- The hardening profile to rebuild under, with the specific compile flags or runtime configuration the prior experiment skipped.
- The disqualifier label (D-0..D-4) that the finding currently triggers, when applicable.
Generic gap statements ("do more discovery", "needs more evidence", "investigate further") are forbidden. The downstream consumer of an insufficient verdict needs a directive concrete enough to act on without rederiving the gap.
A finding's insufficient verdict is not a closure — it is a redirection. The finding remains open; the gap statement is what closes it on the next round.
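The discipline above can be enforced with a small validator on the notes attached to an insufficient verdict. This is a sketch under stated assumptions: the `InsufficientNote` record, its field names, and the gap-kind labels are hypothetical; the forbidden phrases and disqualifier labels are the ones this document names.

```python
from dataclasses import dataclass
from typing import Optional

GENERIC_GAPS = ("do more discovery", "needs more evidence", "investigate further")
DISQUALIFIERS = ("D-0", "D-1", "D-1.5", "D-2", "D-3", "D-4")
GAP_KINDS = ("parameter", "mitigation", "adjacency", "hardening-profile")


@dataclass
class InsufficientNote:
    gap_kind: str   # one of GAP_KINDS
    target: str     # named parameter, mitigation, target-state class, or profile
    detail: str     # range and boundaries, flags to enable, or state to observe
    disqualifier: Optional[str] = None  # D-0..D-4 label, when applicable


def note_problems(note: InsufficientNote) -> list:
    """Return the reasons a note is not concrete enough to act on."""
    found = []
    if note.gap_kind not in GAP_KINDS:
        found.append(f"unknown gap kind: {note.gap_kind!r}")
    for label, text in (("target", note.target), ("detail", note.detail)):
        lowered = text.strip().lower()
        if not lowered:
            found.append(f"empty {label}")
        elif any(generic in lowered for generic in GENERIC_GAPS):
            found.append(f"generic gap statement in {label}: {text!r}")
    if note.disqualifier is not None and note.disqualifier not in DISQUALIFIERS:
        found.append(f"unknown disqualifier label: {note.disqualifier!r}")
    return found
```

An empty result means the note names a gap kind, a target, and a detail without falling back on a forbidden phrase. It does not check that the named range or flag is the right one, which remains a reviewer's call.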