| name | fact-extraction-pipeline |
| description | Patterns for adding extractors and write surfaces to the PR 12 fact-extraction pipeline (`apps/erify_api/src/orchestration/fact-extraction/`). Use BEFORE implementing any new `IngestionExtractor`, paired-atomic write, `SystemFactKey`, or hydrated-scope target type. Required reading before PR 12.3.2 (violations), 12.4 (review surface), or any follow-on extractor work — every Codex finding on PR 12.1.2 (#103) and PR 12.2 (#104) is encoded here. The "State-transition handoff between co-submitted facts" section is mandatory before adding ANY fact whose write semantics depend on another fact's value in the same submission. |
Fact Extraction Pipeline
The extraction pipeline routes COMPLETED task content into typed indexed columns on Show / ShowCreator / ShowPlatform / ShowPlatformViolation via IngestionExtractor strategy objects. It is high-concurrency and audit-critical — almost every correctness bug surfaces as either silent data loss or misclassified audit rows.
Canonical files
Outcome routing order (HARD RULE)
The per-fact loop MUST run filters in this exact order. Each later filter assumes the earlier ones short-circuited.
1. handledPairedKeys / handledPairedPlatformContentKeys (paired path consumed)
2. isFactValueAbsent(rawValue) → noop:value_absent (no audit)
3. !extractorRegistry.resolve(key) → skipped_no_extractor (no audit)
4. Per-target stale check (hydrated) → skipped_stale_target (no audit)
5. isFactColliding(fact, colliding) → skipped_collision (SKIPPED audit)
6. processor.applyAndAudit(...) → written / skipped_lower_priority (write + audit)
Why skipped_stale_target must precede skipped_collision: a stale target is unwritable by definition; emitting a SKIPPED_LOWER_PRIORITY audit for it would mislabel an unwritable row as a contested write. (Codex P2 on PR #103.)
Why skipped_no_extractor must precede the collision check: unregistered keys are silent no-ops by registry contract — emitting a SKIPPED audit for a key nothing in this binary can write is fictional. (Codex P2 on PR #98.)
Predicate-on-write rule (race-safe update*Actuals)
Every update*Actuals helper on a model service MUST scope its write predicate by:
{ uid, <scopeParentId>, deletedAt: null }
Use updateMany, check result.count === 0, throw HttpError.notFound(...) when no row matched. Reference: ShowPlatformService.updateActuals.
async updateActuals(uid: string, showId: bigint, payload: { ... }): Promise<void> {
const result = await this.repo.updateMany(
{ uid, showId, deletedAt: null },
{ ...payload },
);
if (result.count === 0) {
throw HttpError.notFound('ShowPlatform', uid);
}
}
Why all three filters:
uid — primary key
<scopeParentId> — guards cross-scope reassignment race (Codex P1 #8 on PR #103: platform reassigned to another show between read and write would be mutated under the wrong audit context)
deletedAt: null — guards concurrent soft-delete race (Codex P2 #7)
The extractor / paired processor MUST then catch (err) { if (err instanceof NotFoundException) return { kind: 'noop', reason: 'target_stale' }; throw err; } around the call. Same pattern goes around the initial getByUid read — both ends of the read-write race must collapse to target_stale.
Per-target collision detection (hydrated scopes)
CollisionTracker has two tiers — never collapse to one:
type CollisionTracker = {
showScope: Set<SystemFactKey>;
perTarget: Set<string>;
};
- Show scope (
target: 'show'): schema-only match on the sibling is sufficient (the sibling will write to the same one target).
- Hydrated scope (
target: 'show_creator' | 'show_platform'): collision is per (factKey, targetUid). Walk sibling content (NOT schema). Mark collision only when:
- The sibling value passes
!isFactValueAbsent(value) (blank / cleared values aren't competing writes — Codex P1 #6)
- The parsed content key's
(factKey, targetUid) is in the current task's writing set
Wrong (PR 12.1.2 pre-fix): per-fact-key collision blocked unrelated platform paired writes whenever any sibling shared the fact key, leaving valid actuals stale. (Codex P1 #5.)
Construct the helper map per sibling: factKeyByFieldId: Map<fieldId, SystemFactKey>. Sibling field IDs differ from the current task's — match on the canonical fact key, not the field id.
Narrow error-class collapsing
Never catch {} or catch (err) { return fallback } blindly. Only collapse known error classes; rethrow unknowns.
try {
row = await this.svc.getByUid(uid);
} catch (err) {
if (err instanceof NotFoundException) {
return { kind: 'noop', reason: 'target_stale' };
}
throw err;
}
Why: collapsing all errors as target_stale would silently swallow production incidents (Prisma outage, connection failure) as routine stale assignments. The outer service catch wraps unknowns as extractor_error so the failure stays visible. (Codex P1 #2.)
Provenance assignment at the submission boundary
ActualsSource (the source the resolver compares via canResolverOverwrite) is decided upstream of the engine, in TaskOrchestrationService.submitTaskContent — NOT inside fact-extraction.service.ts. The engine only consumes it. Get this wrong and every downstream priority decision is wrong while every test that asserts only "extraction fired" stays green.
Rules (PR 12.4.6):
MANAGER (rank 4) is reserved for an actual manager override — options.mode === 'admin' AND payload.content !== undefined (the actor changed content) AND the role is STUDIO_ROLE.ADMIN/STUDIO_ROLE.MANAGER. Everything else — a plain approval, and every bulk approval (bulkApproveTasks routes through submitTaskContent with { status } only, no content) — stays OPERATOR (rank 1) so a later PLATFORM sync (rank 3) can still overwrite it. Tagging plain approvals MANAGER would permanently freeze actuals against platform truth.
auditContext.actorRole is a lowercase StudioRole ('admin'/'manager', from request.studioMembership.role). Compare against the STUDIO_ROLE constants, never uppercase literals like 'ADMIN' — that comparison is silently always-false and collapses every write back to OPERATOR. Type the field as StudioRole, not string, so the compiler catches it.
- Tests MUST assert the resolved
source (extractFromTask called with expect.objectContaining({ source: 'MANAGER' })), not merely that extraction was invoked — a "did it fire" assertion cannot catch a provenance regression.
Persisted-JSON registry lookups
Snapshots, metadata, and task content are persisted JSON cast to a TS type at read time. The TS type is NOT load-bearing — mixed-version / legacy / future-binary data can carry keys this binary doesn't know.
Every enum / registry lookup off persisted data MUST guard for undefined:
const definition = SYSTEM_FACT_KEY_DEFINITIONS[key];
if (!definition) continue;
Sites that need this: collectBoundFacts, findCollidingFacts, any metadata.actuals_source[factKey] discriminator, any audit-action enum access off persisted JSON. (Codex P1 #9: a single unguarded .target deref aborted the entire extractFromTask run with TypeError on a mixed-version sibling.)
Per-target paired atomic write
When a fact has both *_start_time and *_end_time (or any merged-validation pair), the per-extractor flow is racy — Codex P1 on PR #101 spells out why. The fix is an atomic applyPaired{Entity}Actuals per-target processor:
- ONE
@Transactional() per target — multiple targets on the same task = multiple transactions, so a validation failure on platform A doesn't roll back platform B's already-written pair.
- Merged-pair validation gated on EFFECTIVE write (
startCanWrite && !startUnchanged), not canWrite. Otherwise a no-op resubmission against a stored pair that's already inverted (because updateShow itself doesn't enforce range ordering) surfaces as extractor_error. (Codex P2 on PR #101.)
- Catch
NotFoundException from updateActuals → target_stale on both sides (Codex P2 on PR #103).
- Caller filters absent / unparseable / colliding / stale BEFORE invoking; the processor's contract is "both sides are writable values."
Reference template: applyPairedShowPlatformActuals + tryAtomicPairedShowPlatformActuals.
Adding a new extractor — checklist
For PR 12.2 (creator), 12.3.2 (violations), or any future extractor. Each box maps to a real Codex finding from PRs #91 / #101 / #103.
Catalog + schema:
Model service:
Extractor class:
Service wiring:
Collision detection (findCollidingFacts):
Tests (mirror the regression coverage on PR #103):
Field-id naming in tests
FIELD_ID_PART = /^fld_[a-z0-9]{10,}$/ in parseHydratedContentKey. Underscores after fld_ are NOT allowed and silently fail to parse — fld_plat_start won't produce facts. Use fld_platstart1 (10+ lowercase alphanumeric only). Spent real debugging time on this during PR 12.1.2.
Outcome → audit table
| Outcome | Audit written? | Notes |
|---|
written | ✅ CREATE / UPDATE | Pairs with column write inside same transaction |
skipped_lower_priority | ✅ SKIPPED_LOWER_PRIORITY | Records the attempt + losing source |
skipped_collision | ✅ SKIPPED_LOWER_PRIORITY | collision_reason: 'cross_task_same_fact_key' in metadata |
skipped_stale_target | ❌ no audit | Unwritable row — no contested-write record |
skipped_no_extractor | ❌ no audit | Silent registry contract |
noop / value_absent | ❌ no audit | Operator left field blank |
noop / value_unchanged | ❌ no audit | Idempotent resubmission |
noop / target_stale | ❌ no audit | Stale-target race after a DB read |
noop / extractor_error | ❌ no audit | Logged + swallowed by outer service catch |
State-transition handoff between co-submitted facts
PR 12.2 (creator actuals + attendance) shipped after eight Codex iterations. Reading them now, the column itself is fine: ShowCreator.attendanceReason holds either the LATE reason (creator_actual_start_time > Show.startTime) or the MISSING reason (attendanceMissing = true), and those states are mutually exclusive under the read-side derivation rule in TASK_INPUT_FACT_BINDING_DESIGN.md. In any steady state, only one writer's reason is semantically meaningful.
The real bug pattern was the cross-fact handoff during state transitions in the same submission. A submission can simultaneously toggle attendance_missing = false AND set actual_start_time > show.startTime — that's a legitimate MISSING → LATE transition, and the late extractor's reason needs to take the column while the missing extractor's false-write must NOT clear it. Each extractor runs in its own transaction with no shared context, so the coordination had to be threaded through ExtractedFact.coSubmittedFactKeysForTarget — and every iteration uncovered a new way that signal could be wrong.
Mandatory rules when two facts derive from mutually-exclusive states
These are the invariants the inference scheme depends on. Break any one of them and a state-transition submission silently loses the operator's context.
-
Resolve column values as operator-provided > preserve-existing > fallback. Never trimmedReason || FALLBACK — a retry that omits the sidecar would downgrade a real operator reason to placeholder text. The fallback only seeds the FIRST write into an empty column.
-
Detect drift before writing. A same-time / same-flag resubmission with a different reason MUST flush the column; the equality short-circuit cannot mask reason corrections. Conversely, a write whose resolved value matches storage MUST short-circuit to value_unchanged to avoid audit noise + idempotent rewrites that hide bugs.
-
Co-submission ownership must be derived from the CURRENT RUN's writing-eligible facts — never from persisted metadata.actuals_source. Historical metadata records "ever happened," not "happening in this submission"; relying on it traps stale reasons on creators whose state has long since transitioned.
-
A fact's pre-flight writing-eligibility filter (writingFacts) MUST include every condition the extractor uses to noop. Currently:
!isFactValueAbsent(rawValue) (null/undefined/empty)
extractorRegistry.has(factKey) (registered)
isFactValueParseable(fact) — parser-aware check keyed on SYSTEM_FACT_KEY_DEFINITIONS[factKey].field_type (datetime → parseDateTimeValue, checkbox → parseBooleanValue)
- Priority skips are an accepted residual gap (cannot be predicted without reading current state)
If you add a new field-type or a new noop reason inside an extractor, ALSO extend the pre-flight filter or coSubmittedFactKeysForTarget will lie to siblings.
-
Clearing a shared column requires explicit co-submission check. When fact A toggles off and would normally clear the column, check whether fact B is co-submitted IN THIS RUN. If yes, leave the column alone — fact B will write its own reason as part of the transition. The coSubmittedFactKeysForTarget set on ExtractedFact exists for this; populate it from writingFacts only.
Hydrated content-key sidecar parsing
parseHydratedContentKey's UID_PART = /^[\w-]+$/ regex accepts underscores, so sidecar keys like fld_x:creator:<uid>__reason parse as valid hydrated facts with a target UID suffixed by __reason. Always filter known suffixes (TASK_CONTENT_REASON_SUFFIX, TASK_CONTENT_EXTRA_SUFFIX) before calling parseHydratedContentKey in collectBoundFacts. The hydration layer's own validator already guards these suffixes — the extraction layer must too.
When this pattern applies (and when it doesn't)
Apply these rules whenever two or more fact keys write to the same column AND the read-side derivation treats them as mutually exclusive. The shared column is correct design — the work is to keep the cross-fact handoff coherent during transitions.
If you're considering adding a THIRD fact key to such a column, stop. Two-writer coordination is already at the limit of what runtime inference can express coherently (eight iterations on PR 12.2 is the evidence). Three writers means six pairwise transitions to model, with no symmetry to lean on — at that point split the column or introduce explicit per-write source attribution.
Why this section exists
The full iteration log is in PR #104's commit history: f9014f84 → 8d2e84d8 → 2fae3daa → 7d117b61 → 5cae322c → 6f1daf02 → 7098371e → 25f6719c. Each fix exposed the next layer of the same root cause: an implicit invariant in the cross-fact handoff that hadn't been encoded yet. Reading the commits in order is the fastest way to internalize why the rules above are absolute.
Related skills