name: priority-review
description: Refresh priority signals, manage active epics, and refresh Near Term pool
disable-model-invocation: true

Priority Review

Refresh the priority signals table for the groomed pool (Backlog + Near Term), manage active epics, compute the Near Term baseline, and display the comparison table. The table and Near Term composition are the artifacts; the skill is the refresh workflow that keeps them current.

Modes

Default (no flag): Full refresh — all card types, Impact + Severity scoring.
--bugs: Bug cards only. Filters to story_type: bug at classification, display, scoring, and ranking. Uses a bug-tuned classification prompt that adds blast_radius and workaround_exists fields. Only Severity scoring (no Impact). Typically fits in a single delegate batch.

Constraints

Classification model: Use claude-sonnet-4-6 for agent classification. Bake-off showed haiku gets counts right but misclassifies evidence_type; sonnet and opus are equivalent on judgment, sonnet is cheaper.
Subagents are classifiers, not summarizers. They return structured JSON. Review evidence_type assignments if any look wrong — sonnet is good but not perfect.
Spot-check at least 3 classifications before importing. Verify intercom_conversations counts against the actual classification inputs: card descriptions plus any linked-context summaries included in the batch file. If any error is found, double the sample size before merging. Repeat until a full sample passes clean.
Spot-check verification must be grounded in tool calls. Grep or Read the source file to confirm specific values — do not verify from memory of a prior read, even if the file is still in context. Proved 2026-04-23: checkmark verification from 6-turn-old memory, monitor flagged twice.
Multi-batch spot-checks need one tool call per batch. When N delegates return, run N Bash reads (one per batch file). Do not construct verification table entries from delegate rationale — that is the delegate verifying itself. If a batch wasn't read, report it as unverified. Proved 2026-04-24: 7 scope batches, 5 verified by tool calls, 2 table rows fabricated from delegate output; monitor caught at high severity.
Incremental by default. Only re-classify cards whose updated_at has changed since last classification. Use --force flags when a full refresh is needed.
Mutations. This skill reads Shortcut cards and writes to the local PostgreSQL priority_signals table. Step 6 sets the Shortcut Severity custom field on any bug that receives a severity score (routed through execute_approved). The Near Term refresh step (Step 7) moves cards between Backlog and Near Term (routed through execute_approved).
Batch files go in the project directory (.tmp-classify/), not /tmp. Subagents cannot access /tmp.
Data scope: The skill refreshes BOTH Backlog and Near Term states, since the Near Term sort needs DIC scores for the full groomed pool.
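
The doubling rule for spot-checks can be sketched as a loop. This is an illustration only: `spot_check` and the `verify` callback are hypothetical names, and `verify` stands in for a check grounded in a tool call (a Grep or Read of the batch file), which the sketch does not perform itself.

```python
import random
from typing import Callable, Sequence


def spot_check(cards: Sequence[dict],
               verify: Callable[[dict], bool],
               start: int = 3,
               seed: int = 0) -> bool:
    """Return True once a full sample passes clean.

    Returns False only when every card was sampled and errors remain.
    """
    rng = random.Random(seed)
    size = min(start, len(cards))
    while True:
        sample = rng.sample(list(cards), size)
        if all(verify(card) for card in sample):
            return True                      # full sample passed clean
        if size == len(cards):
            return False                     # nothing left to widen to
        size = min(size * 2, len(cards))     # any error doubles the sample
```

A clean sample does not prove the unsampled cards are correct, which is why the approval gate in Step 4 still presents the full set.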

Steps

0. Active Epics Review

Runs first, before any classification or scoring, since epic membership is the primary Near Term criterion and changes affect everything downstream.

Display current active epics:
```
python3 box/active-epics.py list
```

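Fetch all epics from the Shortcut API and show any non-archived epic that is in progress or was created in the last 30 days. The snippet does not filter out epics already in the active set, so compare its output against the list above: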

```
SHORTCUT_API_TOKEN=$(grep '^SHORTCUT_API_TOKEN=' /Users/paulyokota/Dev/FeedForward/.env | cut -d= -f2-)
curl -s -H "Shortcut-Token: $SHORTCUT_API_TOKEN" \
  "https://api.app.shortcut.com/api/v3/epics" | python3 -c "
import json, sys
from datetime import datetime, timedelta, timezone
cutoff = datetime.now(timezone.utc) - timedelta(days=30)
for e in json.load(sys.stdin):
    if e.get('archived'): continue
    created = datetime.fromisoformat(e['created_at'].replace('Z', '+00:00'))
    if e.get('state') == 'in progress' or created > cutoff:
        print(f'ID: {e[\"id\"]} | {e[\"name\"]} | State: {e[\"state\"]} | Created: {e[\"created_at\"][:10]}')
"
```

Ask: "Any epics to add or remove?" — wait for explicit confirmation before proceeding.

Apply changes:

```
python3 box/active-epics.py add <ID> "Epic Name"
python3 box/active-epics.py remove <ID>
```

0a. Preflight: reconcile live board with DB

Before refreshing anything, check for cards in the DB that have moved out of the groomed states since the last run. Run for BOTH states:

Fetch live card IDs from Shortcut for each groomed state:

```
python3 box/shortcut-cards.py --state "Backlog" --summary
python3 box/shortcut-cards.py --state "Near Term" --summary
```

Query the DB for card IDs currently stored with groomed states:

```
psql postgresql://localhost:5432/feedforward -t -A \
  -c "SELECT card_id, state FROM priority_signals WHERE state IN ('Backlog', 'Near Term')"
```

Diff the two sets. Cards in the DB but not on the live board have moved. For each moved card, re-extract to update its state in the DB:
```
python3 box/priority-signals.py extract --id <comma-separated-moved-ids>
```
Report what moved and where (the extract updates the state field).
Cards on the live board but not in the DB are new — they'll be picked up by the extract in Step 1.

This prevents stale cards from appearing in the signals table or framework ranking. Skip this step only if running with --force on a fresh DB.
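
The diff in this preflight step amounts to two set differences. A minimal sketch, assuming the live IDs and DB IDs have already been parsed into sets of integers; `diff_board` and `extract_id_arg` are hypothetical helper names, not part of the box/ scripts.

```python
def diff_board(live_ids: set, db_ids: set) -> tuple:
    """Return (moved, new) card IDs as sorted lists.

    moved: in the DB but no longer on the live groomed board.
    new:   on the live board but not yet in the DB.
    """
    moved = sorted(db_ids - live_ids)
    new = sorted(live_ids - db_ids)
    return moved, new


def extract_id_arg(moved: list) -> str:
    """Build the comma-separated value passed to `extract --id`."""
    return ",".join(str(card_id) for card_id in moved)
```

Only the moved list needs an explicit re-extract; the new list is informational because Step 1 picks those cards up.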

1. Refresh metadata from Shortcut API

Run for BOTH groomed states (extract upserts by card_id, no conflicts):

```
python3 box/priority-signals.py extract --state "Backlog"
python3 box/priority-signals.py extract --state "Near Term"
```

This updates board position, state, product area, feature area users, and dates. Incremental: skips cards whose updated_at hasn't changed.

2. Check what needs classification

```
python3 box/priority-signals.py classify --state "Backlog" > .tmp-classify/classify-backlog.json
python3 box/priority-signals.py classify --state "Near Term" > .tmp-classify/classify-nearterm.json
```

Merge the two outputs into a single classify file (Step 4 passes .tmp-classify/classify-all.json to review-classifications). The classify command outputs classification inputs as JSON for cards that either:

Have never been classified
Have been updated in Shortcut since last classification

Each card includes:

description
raw external_links for provenance
bounded linked_context summaries for the first 2 resolvable Slack permalinks in Shortcut-provided order

Cards with only non-Slack external links remain provenance-only in this tranche: no linked-context fetch is attempted for them.

If Slack-linked context cannot be resolved for a card, the card is omitted from cards and surfaced under unresolved_cards. Do not treat unresolved cards as freshly classified; they need a rerun once enrichment succeeds.

If the count is 0, skip to Step 5.
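
The merge can be sketched as a simple concatenation. The `cards` and `unresolved_cards` keys come from the description above; everything else about the file layout is an assumption, and `merge_classify` and `merge_files` are hypothetical helpers rather than part of priority-signals.py.

```python
import json
from pathlib import Path


def merge_classify(outputs):
    """Concatenate cards and unresolved_cards from several classify outputs."""
    merged = {"cards": [], "unresolved_cards": []}
    for out in outputs:
        merged["cards"].extend(out.get("cards", []))
        merged["unresolved_cards"].extend(out.get("unresolved_cards", []))
    return merged


def merge_files(paths, dest):
    """Read each classify output, write the merged file, return the card count."""
    merged = merge_classify(json.loads(Path(p).read_text()) for p in paths)
    Path(dest).write_text(json.dumps(merged, indent=2))
    return len(merged["cards"])
```

A returned count of 0 corresponds to the skip-to-Step-5 case. Unresolved cards are carried through separately so they are not mistaken for classified ones.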

--bugs mode: Filter the classify output to bug cards only before delegating. Linked-context enrichment must happen after this filter so non-bug cards do not trigger Slack reads in bugs mode:

```
# data: the merged classify output, already loaded with json.load
bugs = [c for c in data['cards'] if c.get('story_type') == 'bug']
```

3. Delegate classification to sonnet agents

Split the classify output into batches of ~16 cards. Delegate each batch to agenterminal.delegate with model: "claude-sonnet-4-6". Don't set timeout_ms — the default is the maximum (30 min).
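
A minimal sketch of the batching step, assuming the merged classify output has been loaded into a list of card dicts. The `batch-NN.json` file naming and both function names are hypothetical; only the batch size of about 16 and the `.tmp-classify/` directory come from this document.

```python
import json
from pathlib import Path


def split_batches(cards, size=16):
    """Split cards into consecutive batches of at most `size` items."""
    if size < 1:
        raise ValueError("size must be at least 1")
    return [cards[i:i + size] for i in range(0, len(cards), size)]


def write_batches(cards, out_dir=".tmp-classify", size=16):
    """Write one JSON file per batch inside the project directory (not /tmp)."""
    out = Path(out_dir)
    out.mkdir(parents=True, exist_ok=True)
    paths = []
    for n, batch in enumerate(split_batches(cards, size), start=1):
        path = out / f"batch-{n:02d}.json"   # hypothetical naming scheme
        path.write_text(json.dumps({"cards": batch}, indent=2))
        paths.append(path)
    return paths
```

Each returned path is what would be substituted for {BATCH_FILE} in the prompt, and it is also the file to re-read during the per-batch spot-check.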

The classification prompt for each batch:

```
You are classifying Shortcut cards for priority signal extraction.
Read each card's description plus any linked-context summaries and return
structured classification data.

The cards are in {BATCH_FILE}. Read this file, then for each card extract:

- Treat `description` as the primary source.
- Use `linked_context` only when it adds evidence relevant to the existing
  classification fields.
- Do not mechanically count external links as demand.
- `external_links` without `linked_context` are provenance only in this
  tranche.

1. intercom_conversations: Total count of Intercom conversations referenced.
   Count BOTH linked conversations and stated totals (e.g. "31 conversations")
   when they appear in the description or linked context. Use the stated total
   when available. Return 0 if no Intercom evidence.

2. failure_volume_weekly: For bug cards, the headline weekly failure volume.
   If sub-categories are broken down, report the headline total, NOT the sum
   of sub-breakdowns. Return null for non-bug cards.

3. has_revenue_signal: true if evidence mentions cancellations, refunds, users
   leaving, declining to subscribe. Polite requests without leaving signals = false.

4. evidence_type: One of:
   - "direct_customer_pain": Intercom conversations show real users affected
   - "internal_metric": evidence from internal dashboards, not customer reports
   - "speculative": no evidence of current customer impact
   - "mixed": combination of customer and internal evidence
   - "implementation_step": sub-task of a larger initiative

5. sentiment: "high" | "medium" | "low" | "none"

6. linked_context_used: "yes" | "no"
   Return "yes" only if the proposed classification relied on linked-context
   summaries in a way that changed or supported the judgment. Return "no" if
   the description alone was sufficient.

7. notes: One sentence on the most important thing the numbers don't convey.

Return ONLY valid JSON:
{"classifications": [{"card_id": N, ...}, ...]}
```

Run all batches in parallel and collect the results.
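
Before the spot-check, it can help to reject malformed delegate output mechanically. The sketch below checks only the fields and allowed values listed in the default prompt above; `validate_classifications` is a hypothetical helper, and it does not replace the tool-call-grounded spot-check or the approval gate.

```python
import json

EVIDENCE_TYPES = ("direct_customer_pain", "internal_metric", "speculative",
                  "mixed", "implementation_step")
SENTIMENTS = ("high", "medium", "low", "none")


def validate_classifications(raw):
    """Return a list of problems; an empty list means the payload is well-formed."""
    try:
        payload = json.loads(raw)
    except json.JSONDecodeError as exc:
        return [f"invalid JSON: {exc}"]
    if not isinstance(payload, dict) or not isinstance(payload.get("classifications"), list):
        return ["missing top-level 'classifications' list"]
    problems = []
    for item in payload["classifications"]:
        cid = item.get("card_id")
        count = item.get("intercom_conversations")
        # bool is a subclass of int in Python, so exclude it explicitly
        if isinstance(count, bool) or not isinstance(count, int) or count < 0:
            problems.append(f"card {cid}: intercom_conversations must be a non-negative integer")
        if not isinstance(item.get("has_revenue_signal"), bool):
            problems.append(f"card {cid}: has_revenue_signal must be true or false")
        if item.get("evidence_type") not in EVIDENCE_TYPES:
            problems.append(f"card {cid}: unknown evidence_type")
        if item.get("sentiment") not in SENTIMENTS:
            problems.append(f"card {cid}: unknown sentiment")
        if item.get("linked_context_used") not in ("yes", "no"):
            problems.append(f"card {cid}: linked_context_used must be yes or no")
    return problems
```

This catches shape errors only. A structurally valid evidence_type can still be the wrong judgment, which is what the manual review is for.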

--bugs mode classification prompt (replaces the above):

```
You are classifying Shortcut bug cards for priority signal extraction.
Read each card's description plus any linked-context summaries from
{BATCH_FILE}, then for each card extract:

- Treat `description` as the primary source.
- Use `linked_context` only when it adds evidence relevant to the existing
  classification fields.
- Do not mechanically count external links as demand.
- `external_links` without `linked_context` are provenance only in this
  tranche.

1. intercom_conversations: Total count of Intercom conversations referenced.
   Count BOTH linked conversations and stated totals (e.g. "31 conversations")
   when they appear in the description or linked context. Use the stated total
   when available. Return 0 if no Intercom evidence.

2. failure_volume_weekly: The headline weekly failure volume from the card.
   If sub-categories are broken down, report the headline total, NOT the sum
   of sub-breakdowns. Return null if no failure volume is stated.

3. has_revenue_signal: true if evidence mentions cancellations, refunds, users
   leaving, declining to subscribe. Polite requests without leaving signals = false.

4. evidence_type: One of:
   - "direct_customer_pain": Intercom conversations show real users affected
   - "internal_metric": evidence from internal dashboards, not customer reports
   - "speculative": no evidence of current customer impact
   - "mixed": combination of customer and internal evidence

5. sentiment: "high" | "medium" | "low" | "none"

6. blast_radius: One sentence. Who is affected and how broadly? Include any
   stated user counts, percentages, or frequency data from the card.

7. workaround_exists: true | false | "partial". Is there a user-accessible
   workaround described or implied?

8. linked_context_used: "yes" | "no"
   Return "yes" only if the proposed classification relied on linked-context
   summaries in a way that changed or supported the judgment. Return "no" if
   the description alone was sufficient.

9. notes: One sentence on the most important thing the numbers don't convey.

Return ONLY valid JSON:
{"classifications": [{"card_id": N, "intercom_conversations": N, "failure_volume_weekly": N|null, "has_revenue_signal": bool, "evidence_type": "...", "sentiment": "...", "blast_radius": "...", "workaround_exists": "...", "linked_context_used": "yes|no", "notes": "..."}, ...]}
```

Bug cards are typically fewer than 20 — use a single delegate unless the count exceeds 16.

4. Review and import classification results

⚠ APPROVAL GATE. Before importing, present all proposed classifications to the user in a summary table (card ID, intercom_conversations, evidence_type, has_revenue_signal, linked_context_used, changed_fields, notes). This is the same gate applied at Step 6 for scoring. The spot-check (Step 3 constraint) validates delegate accuracy; the approval gate gives the user a chance to review the full set and catch judgment errors the spot-check didn't cover. Merge all batch results into a single JSON file, then run:

```
python3 box/priority-signals.py review-classifications .tmp-classify/classify-all.json .tmp-classify/results.json
```

Present that review output to the user. Do not run import-classifications until the user approves. Proved 2026-04-22: imported 9 classifications without presenting full set; Monitor flagged asymmetry with scoring gate.

Two columns in the review output need definition:

linked_context_used: yes if the proposed classification relied on any linked-context summary, else no
changed_fields: JSON array of classification fields that differ from the currently stored priority_signals row for that card

After the user approves, import the results:

```
python3 box/priority-signals.py import-classifications .tmp-classify/results.json
```
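
The changed_fields column described in this step can be thought of as a field-by-field comparison. A minimal sketch: the field list below is an assumption drawn from the classification prompt, not from the actual priority_signals schema, and `changed_fields` is a hypothetical function, not the one in priority-signals.py.

```python
from typing import Optional

# Assumed set of compared fields, taken from the classification prompt.
CLASSIFICATION_FIELDS = (
    "intercom_conversations",
    "failure_volume_weekly",
    "has_revenue_signal",
    "evidence_type",
    "sentiment",
)


def changed_fields(proposed: dict, stored: Optional[dict]) -> list:
    """List the fields whose proposed value differs from the stored row.

    A card with no stored row reports every proposed field as changed.
    """
    if stored is None:
        return [f for f in CLASSIFICATION_FIELDS if f in proposed]
    return [f for f in CLASSIFICATION_FIELDS
            if f in proposed and proposed[f] != stored.get(f)]
```

An empty list means the re-classification confirmed the stored values, so those rows need the least attention at the approval gate.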

5. Display the table

```
python3 box/priority-signals.py show --state "Near Term"
python3 box/priority-signals.py show --state "Backlog"
```

Near Term table is the primary deliverable. Backlog table shown for context. Sorted by board position, showing all priority signals in a scannable comparison format.

--bugs mode: Filter the output to bug rows only:

```
python3 box/priority-signals.py show --state "Near Term" 2>&1 | grep -E "bug|Rank|----"
```

6. Score unscored cards (default, skippable)

Run gaps to find cards missing judgment scores:

```
python3 box/framework-rank.py gaps --state Backlog --state "Near Term"
```

Always use --state to scope to the groomed pool. Without it, gaps reports across all states (Build, Test, Archived), inflating the count. Proved 2026-04-27: unfiltered gaps reported 125 needing scope; groomed pool had 1. Additionally, scope_score was missing from the JOIN query, causing every card to appear unscored.

If the gap count seems high relative to last session's scoring work — especially if 100% of cards appear unscored — investigate the tool output before delegating. Universal failure is more likely a broken query than universal truth.

If there are gaps, propose scores and apply them. If the user says to skip this step, go directly to Step 7 — mispriority will run on scored cards only.

Scoring rules:

Features need an Impact score (1-10). "How much would this move the needle if shipped?" Informed by revenue potential, retention effect, strategic value. This is product judgment, not mechanical.
Bugs need a Severity score (1-5):
- 5 = Financial harm / data loss / security
- 4 = Feature broken / workflow blocked
- 3 = Degraded experience / workaround exists
- 2 = Cosmetic / misleading display
- 1 = Edge case / negligible
All cards need a Scope score (1-5). This drives the "lowest scope code task" Near Term rule. Non-code chores (vendor config, manual processes) get null — they are ineligible for the lowest-scope slot.
- 1 = Trivial: single-file change, config update, copy/prompt tweak (<1hr)
- 2 = Small: few files, well-understood pattern (<1 day)
- 3 = Medium: multiple files/systems, some design decisions (1-3 days)
- 4 = Large: significant feature, multiple systems (1-2 weeks)
- 5 = XL: major initiative, new system/architecture (multi-week)

Process:

Read the classification data (D*C partial scores, evidence_type, notes) from the gaps output.
For each unscored card, propose an Impact or Severity score with a one-line rationale. Use the card description and classification notes — don't infer from titles alone.
For cards needing Scope scores: delegate to sonnet agents in batches of ~16 with card descriptions. The delegate prompt should include the 1-5 scale definitions and instruct returning null for non-code chores. Spot-check extremes (scope 1 and scope 5) against batch files before importing.
Present the full batch of proposed scores to the user for review. The user may adjust individual scores or approve the batch.

Apply approved scores:

```
python3 box/framework-rank.py score SC-NNN --impact N    # features
python3 box/framework-rank.py score SC-NNN --severity N   # bugs
python3 box/framework-rank.py score SC-NNN --scope N      # all code cards
```

If the batch is large (>10 cards), split into groups by product area for easier review. Don't score infra-track cards (chores, implementation_step) for Impact/Severity — they are capacity-allocated, not DIC-ranked. They DO get Scope scores (unless non-code).

--bugs mode: Skip Impact scoring entirely. Only propose Severity scores for unscored bugs:

```
python3 box/framework-rank.py gaps --state Backlog --state "Near Term" 2>&1 | grep -A 100 "Bugs needing Severity"
```

Shortcut Severity field sync (all modes):

After applying DIC severity scores, sync the Shortcut Severity custom field for any bugs that were just scored. The mapping is Shortcut Sev = 5 - DIC Sev (see reference/severity-framework.md for level definitions and field IDs).

For each bug that was just scored, include the proposed Shortcut Severity value in the same approval table (e.g., "DIC 4 -> Shortcut Sev 1 (Blocked)").
After approval, set the Shortcut Severity field via box/update-severity-field.py SC-NNN <sev-level>. Route through execute_approved (production mutation). The script reads existing custom_fields, merges severity, PUTs back the full array, and verifies via independent GET.
Verify ALL updated cards via the script's built-in verification (exit code 0 = verified, 1 = failed). If multiple cards, run one at a time.

This keeps the two representations in sync: one assessment, applied to both framework-rank.py (DIC Sev) and the Shortcut custom field (Sev 0-4). Proved 2026-04-28: default mode was not syncing to Shortcut; product area wipe recovery (Apr 27) destroyed severity fields that were never restored.
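
The severity mapping stated above (Shortcut Sev = 5 - DIC Sev) is small enough to write out. In this sketch, `shortcut_severity` and `approval_row` are hypothetical helpers. Level names such as "Blocked" live in reference/severity-framework.md and are deliberately not reproduced here.

```python
def shortcut_severity(dic_severity: int) -> int:
    """Map a DIC severity (1-5, 5 worst) to a Shortcut Sev level (0-4, 0 worst)."""
    if dic_severity not in range(1, 6):
        raise ValueError(f"DIC severity must be 1-5, got {dic_severity!r}")
    return 5 - dic_severity


def approval_row(card: str, dic_severity: int) -> str:
    """Format one line for the approval table."""
    return f"{card}: DIC {dic_severity} -> Shortcut Sev {shortcut_severity(dic_severity)}"
```

The two scales run in opposite directions, so a DIC score of 5 lands on Shortcut Sev 0.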

7. Near Term Refresh

Compute the baseline Near Term set and move cards between Backlog and Near Term as needed. This is the step that keeps the Near Term pool reflecting the active epics + top-DIC-ranked standalone cards.

7a. Reconcile manual changes:

Fetch current Near Term card IDs from Shortcut. Load last computed baseline from near_term_baseline table (most recent run_id batch).

First run: If no prior baseline exists (empty table), skip reconciliation. All current Near Term cards are treated as the initial computed set.
Cards in Near Term but NOT in last baseline and NOT already pinned -> candidate manual pin. Present to the user for confirmation.
Cards in last baseline but now in Backlog (not Build/Test/Released) and NOT already excluded -> candidate manual exclude. Present for confirmation.
Cards that moved out of the groomed pool entirely are ignored (not manual removals).
Auto-clear any override whose card is no longer in the groomed pool (archived, moved to Build/Test/Released). Set cleared_at=NOW(), cleared_by='auto'. Report auto-cleared overrides for awareness.

Write confirmed overrides to near_term_overrides.
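
The reconciliation rules in 7a reduce to set arithmetic. A minimal sketch, assuming all inputs are sets of card IDs already fetched from Shortcut and the database; the function names are hypothetical and the sketch only computes candidates, which still need user confirmation before anything is written.

```python
def reconcile_overrides(near_term_now, backlog_now, last_baseline, pinned, excluded):
    """Return (candidate_pins, candidate_excludes) as sets of card IDs."""
    if not last_baseline:
        # First run: no prior baseline, so skip reconciliation entirely.
        return set(), set()
    # In Near Term, not computed last time, not already pinned.
    candidate_pins = near_term_now - last_baseline - pinned
    # Was in the baseline, now sits in Backlog, not already excluded.
    # Cards that left the groomed pool are not in backlog_now, so they are ignored.
    candidate_excludes = (last_baseline & backlog_now) - excluded
    return candidate_pins, candidate_excludes


def stale_overrides(pinned, excluded, groomed_pool):
    """Overrides whose card has left the groomed pool; these get auto-cleared."""
    return (pinned | excluded) - groomed_pool
```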

7b. Compute new baseline:

```
python3 box/near-term-sort.py
```

The sort applies this precedence per card:

Card in active epic -> Near Term (overrides excludes)
Card has active exclude override -> Backlog
Card has active pin override -> Near Term
Card qualifies by baseline rules (top-N DIC) -> Near Term
Otherwise -> Backlog
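
The precedence list above reads naturally as an ordered chain of checks. This is a sketch of that ordering only, with a hypothetical `target_state` function and set-valued inputs; it is not the code in near-term-sort.py.

```python
def target_state(card_id, active_epic_cards, excluded, pinned, baseline_qualifiers):
    """Apply the precedence rules in order; the first match wins."""
    if card_id in active_epic_cards:
        return "Near Term"      # epic membership overrides excludes
    if card_id in excluded:
        return "Backlog"        # exclude beats pin and baseline rules
    if card_id in pinned:
        return "Near Term"
    if card_id in baseline_qualifiers:
        return "Near Term"      # qualified via the top-N baseline rules
    return "Backlog"
```

Because the checks are ordered, a card that is both excluded and pinned ends up in Backlog, and an excluded card inside an active epic still goes to Near Term.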

Baseline rules for standalone cards (not in active epics, not blocked):

Top 5 DIC-ranked bugs
Top 1 DIC-ranked bug per Shortcut severity level (critical/high/medium/low)
Top 5 DIC-ranked chores
Top 5 DIC-ranked features
Oldest card in the groomed pool (by created_at)
Lowest scope code task (by scope_score; non-code chores excluded). Tiebreak: DIC score desc, intercom desc, card age asc (stable, no state dependency to prevent oscillation). Proved 2026-04-23.
Tie-break (for top-N rules): intercom count desc, revenue signal, evidence type strength, card age asc
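
As an illustration of the lowest-scope rule and its tiebreak, the selection can be expressed as a single sort key. Several things here are assumptions: the `dic_score` and `age_days` field names are hypothetical, and "card age asc" is read as fewer days since creation first. The real logic lives in near-term-sort.py.

```python
def lowest_scope_card(cards):
    """Pick the lowest-scope code task, or None if no card has a scope score.

    Tiebreak order: DIC score desc, intercom count desc, card age asc.
    Non-code chores carry scope_score None and are skipped.
    """
    eligible = [c for c in cards if c.get("scope_score") is not None]
    if not eligible:
        return None
    return min(
        eligible,
        key=lambda c: (
            c["scope_score"],
            -c["dic_score"],
            -c["intercom_conversations"],
            c["age_days"],
        ),
    )
```

None of the key components depend on board position or current state, which matches the stated goal of preventing oscillation between runs.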

7c. Show proposed moves and reorder:

Present:

Cards entering Near Term (with reason: epic, top-5 bug, severity rep, etc.)
Cards leaving Near Term (with reason: no longer qualifies, not pinned)
Current active overrides for review/cleanup
#1 DIC bug board position: if the sort outputs a REORDER section, the #1 DIC bug is not at the top of the Near Term board. Include the reorder in the set of proposed changes.

Wait for user approval before executing moves.

7d. Execute moves and reorder via execute_approved:

State changes (Backlog ↔ Near Term): use the safe wrapper script:

```
python3 box/shortcut-mutate.py move SC-NNN "Near Term"
python3 box/shortcut-mutate.py move SC-NNN "Backlog"
```

The script handles the API call and verifies the result via independent GET. Route each call through execute_approved. Do NOT write ad-hoc API scripts for Shortcut mutations — wrapper scripts encode safety lessons (replace-all semantics, verification) that raw API calls bypass.

Board reorder (position within a state): use the reorder wrapper:

```
python3 box/reorder-nearterm.py --first 1643              # dry-run
python3 box/reorder-nearterm.py --first 1643 --execute     # real
python3 box/reorder-nearterm.py --before ANCHOR CARD_ID    # place before
python3 box/reorder-nearterm.py --after ANCHOR CARD_ID     # place after
```

Default is dry-run; pass --execute to mutate. Route --execute calls through execute_approved. Note: Shortcut's position field does not work for reordering — the script uses before_id/after_id. Proved 2026-04-23: position value set via API returned None and card did not move.

7e. Write new baseline:

Insert computed baseline to near_term_baseline with new run_id. Old baselines remain for audit; only the most recent is used for reconciliation.

⚠ POST-BASELINE CHECK. After --write-baseline, read the full sort output — specifically the PROMOTE, DEMOTE, and REORDER sections. If PROMOTE or DEMOTE is non-empty, the baseline is immediately unstable (the next run would propose changes). Surface the instability to the user before moving on. The tiebreak uses classification signals (intercom count, revenue signal, evidence type) which are stable across card moves — board-position-based oscillation was eliminated 2026-04-23.

7f. Report final Near Term composition.

8. Framework ranking comparison

```
python3 box/framework-rank.py mispriority -n 20
```

Compares framework rank (by DIC score) against live board position in Near Term. Positive gap = board has the card lower than the framework thinks it should be (underprioritized). Only includes scored cards — unscored cards from skipped Step 6 will not appear.

--bugs mode: Use the filtered rank output instead:

```
python3 box/framework-rank.py rank --state "Near Term" 2>&1 | grep "BUG"
```

9. Present mispriority flags concretely

After Step 8, propose specific board reorders for any card with a gap of +5 or more (framework rank significantly higher than board position). Name the card, the current position, the proposed position, and the anchor card. Don't ask generic adequacy questions — the mispriority table and proposed moves are the deliverable. Clean up temp files after the reorders land.
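
The +5 threshold can be sketched as follows. The gap formula (board position minus framework rank, both counted from 1 at the top) is an assumption consistent with the sign convention in Step 8, and `mispriority_flags` with its row fields is hypothetical, not the output format of framework-rank.py.

```python
def mispriority_flags(rows, threshold=5):
    """Return rows whose gap meets the threshold, largest gap first.

    gap = board_position - framework_rank, so a positive gap means the card
    sits lower on the board than the framework would place it.
    """
    flags = []
    for row in rows:
        gap = row["board_position"] - row["framework_rank"]
        if gap >= threshold:
            flags.append({**row, "gap": gap})
    return sorted(flags, key=lambda r: r["gap"], reverse=True)
```

Each flagged row still needs a concrete proposal (current position, proposed position, anchor card) before it goes to the user.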

Adapting to other states

The default targets the groomed pool (Backlog + Near Term). The classification and scoring workflow works for any single state by replacing the --state flags. The Near Term refresh step (7) only runs in default mode.

name	priority-review
description	Refresh priority signals, manage active epics, and refresh Near Term pool
disable-model-invocation	true

Priority Review

Modes

Default (no flag): Full refresh — all card types, Impact + Severity scoring.
--bugs: Bug cards only. Filters to story_type: bug at classification, display, scoring, and ranking. Uses a bug-tuned classification prompt that adds blast_radius and workaround_exists fields. Only Severity scoring (no Impact). Typically fits in a single delegate batch.

Constraints

Classification model: Use claude-sonnet-4-6 for agent classification. Bake-off showed haiku gets counts right but misclassifies evidence_type; sonnet and opus are equivalent on judgment, sonnet is cheaper.
Subagents are classifiers, not summarizers. They return structured JSON. Review evidence_type assignments if any look wrong — sonnet is good but not perfect.
Spot-check at least 3 classifications before importing. Verify intercom_conversations counts against the actual classification inputs: card descriptions plus any linked-context summaries included in the batch file. If any error is found, double the sample size before merging. Repeat until a full sample passes clean.
Spot-check verification must be grounded in tool calls. Grep or Read the source file to confirm specific values — do not verify from memory of a prior read, even if the file is still in context. Proved 2026-04-23: checkmark verification from 6-turn-old memory, monitor flagged twice.
Multi-batch spot-checks need one tool call per batch. When N delegates return, run N Bash reads (one per batch file). Do not construct verification table entries from delegate rationale — that is the delegate verifying itself. If a batch wasn't read, report it as unverified. Proved 2026-04-24: 7 scope batches, 5 verified by tool calls, 2 table rows fabricated from delegate output; monitor caught at high severity.
Incremental by default. Only re-classify cards whose updated_at has changed since last classification. Use --force flags when a full refresh is needed.
Mutations. This skill reads Shortcut cards and writes to the local PostgreSQL priority_signals table. Step 6 sets the Shortcut Severity custom field on any bug that receives a severity score (routed through execute_approved). The Near Term refresh step (Step 7) moves cards between Backlog and Near Term (routed through execute_approved).
Batch files go in the project directory (.tmp-classify/), not /tmp. Subagents cannot access /tmp.
Data scope: The skill refreshes BOTH Backlog and Near Term states, since the Near Term sort needs DIC scores for the full groomed pool.

Steps

0. Active Epics Review

Runs first, before any classification or scoring, since epic membership is the primary Near Term criterion and changes affect everything downstream.

Display current active epics:
```
python3 box/active-epics.py list
```

Fetch all epics from Shortcut API, show any not in the active set that have cards in progress or were created in the last 30 days:

SHORTCUT_API_TOKEN=$(grep '^SHORTCUT_API_TOKEN=' /Users/paulyokota/Dev/FeedForward/.env | cut -d= -f2-)
curl -s -H "Shortcut-Token: $SHORTCUT_API_TOKEN" \
  "https://api.app.shortcut.com/api/v3/epics" | python3 -c "
import json, sys
from datetime import datetime, timedelta, timezone
cutoff = datetime.now(timezone.utc) - timedelta(days=30)
for e in json.load(sys.stdin):
    if e.get('archived'): continue
    created = datetime.fromisoformat(e['created_at'].replace('Z', '+00:00'))
    if e.get('state') == 'in progress' or created > cutoff:
        print(f'ID: {e[\"id\"]} | {e[\"name\"]} | State: {e[\"state\"]} | Created: {e[\"created_at\"][:10]}')
"

Ask: "Any epics to add or remove?" — wait for explicit confirmation before proceeding.

Apply changes:

python3 box/active-epics.py add <ID> "Epic Name"
python3 box/active-epics.py remove <ID>

0a. Preflight: reconcile live board with DB

Before refreshing anything, check for cards in the DB that have moved out of the groomed states since the last run. Run for BOTH states:

Fetch live card IDs from Shortcut for each groomed state:

python3 box/shortcut-cards.py --state "Backlog" --summary
python3 box/shortcut-cards.py --state "Near Term" --summary

Query the DB for card IDs currently stored with groomed states:

psql postgresql://localhost:5432/feedforward -t -A \
  -c "SELECT card_id, state FROM priority_signals WHERE state IN ('Backlog', 'Near Term')"

Diff the two sets. Cards in the DB but not on the live board have moved. For each moved card, re-extract to update its state in the DB:
```
python3 box/priority-signals.py extract --id <comma-separated-moved-ids>
```
Report what moved and where (the extract updates the state field).
Cards on the live board but not in the DB are new — they'll be picked up by the extract in Step 1.

This prevents stale cards from appearing in the signals table or framework ranking. Skip this step only if running with --force on a fresh DB.

1. Refresh metadata from Shortcut API

Run for BOTH groomed states (extract upserts by card_id, no conflicts):

python3 box/priority-signals.py extract --state "Backlog"
python3 box/priority-signals.py extract --state "Near Term"

This updates board position, state, product area, feature area users, and dates. Incremental: skips cards whose updated_at hasn't changed.

2. Check what needs classification

python3 box/priority-signals.py classify --state "Backlog" > .tmp-classify/classify-backlog.json
python3 box/priority-signals.py classify --state "Near Term" > .tmp-classify/classify-nearterm.json

Merge the two outputs into a single classify file. Outputs classification inputs as JSON for cards that either:

Have never been classified
Have been updated in Shortcut since last classification

Each card includes:

description
raw external_links for provenance
bounded linked_context summaries for the first 2 resolvable Slack permalinks in Shortcut-provided order

Cards with only non-Slack external links remain provenance-only in this tranche: no linked-context fetch is attempted for them.

If the count is 0, skip to Step 5.

--bugs mode: Filter the classify output to bug cards only before delegating. Linked-context enrichment must happen after this filter so non-bug cards do not trigger Slack reads in bugs mode:

bugs = [c for c in data['cards'] if c.get('story_type') == 'bug']

3. Delegate classification to sonnet agents

Split the classify output into batches of ~16 cards. Delegate each batch to agenterminal.delegate with model: "claude-sonnet-4-6". Don't set timeout_ms — the default is the maximum (30 min).

The classification prompt for each batch:

You are classifying Shortcut cards for priority signal extraction.
Read each card's description plus any linked-context summaries and return
structured classification data.

The cards are in {BATCH_FILE}. Read this file, then for each card extract:

- Treat `description` as the primary source.
- Use `linked_context` only when it adds evidence relevant to the existing
  classification fields.
- Do not mechanically count external links as demand.
- `external_links` without `linked_context` are provenance only in this
  tranche.

1. intercom_conversations: Total count of Intercom conversations referenced.
   Count BOTH linked conversations and stated totals (e.g. "31 conversations")
   when they appear in the description or linked context. Use the stated total
   when available. Return 0 if no Intercom evidence.

2. failure_volume_weekly: For bug cards, the headline weekly failure volume.
   If sub-categories are broken down, report the headline total, NOT the sum
   of sub-breakdowns. Return null for non-bug cards.

3. has_revenue_signal: true if evidence mentions cancellations, refunds, users
   leaving, declining to subscribe. Polite requests without leaving signals = false.

4. evidence_type: One of:
   - "direct_customer_pain" — Intercom conversations show real users affected
   - "internal_metric" — evidence from internal dashboards, not customer reports
   - "speculative" — no evidence of current customer impact
   - "mixed" — combination of customer and internal evidence
   - "implementation_step" — sub-task of a larger initiative

5. sentiment: "high" | "medium" | "low" | "none"

6. linked_context_used: "yes" | "no"
   Return "yes" only if the proposed classification relied on linked-context
   summaries in a way that changed or supported the judgment. Return "no" if
   the description alone was sufficient.

7. notes: One sentence on the most important thing the numbers don't convey.

Return ONLY valid JSON:
{"classifications": [{"card_id": N, ...}, ...]}

Collect all batches in parallel.
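Before merging batch replies, it can help to check each one against the field definitions in the default prompt. The sketch below is an assumption about how such a check might look; `row_problems` and `batch_problems` are hypothetical helpers, not commands provided by box/priority-signals.py.

```python
import json

# Allowed values copied from the default classification prompt above.
EVIDENCE_TYPES = {
    "direct_customer_pain", "internal_metric", "speculative",
    "mixed", "implementation_step",
}
SENTIMENTS = {"high", "medium", "low", "none"}


def is_number(value, integer_only=False):
    """True for int (or float unless integer_only); bool is rejected."""
    if isinstance(value, bool):
        return False
    return isinstance(value, int) or (not integer_only and isinstance(value, float))


def is_choice(value, allowed):
    """True if value is a string drawn from the allowed set."""
    return isinstance(value, str) and value in allowed


def row_problems(row):
    """Return a list of problems in one classification row (empty if valid)."""
    problems = []
    if not is_number(row.get("card_id"), integer_only=True):
        problems.append("card_id must be an integer")
    count = row.get("intercom_conversations")
    if not is_number(count, integer_only=True) or count < 0:
        problems.append("intercom_conversations must be a non-negative integer")
    volume = row.get("failure_volume_weekly")
    if volume is not None and not is_number(volume):
        problems.append("failure_volume_weekly must be a number or null")
    if not isinstance(row.get("has_revenue_signal"), bool):
        problems.append("has_revenue_signal must be true or false")
    if not is_choice(row.get("evidence_type"), EVIDENCE_TYPES):
        problems.append("evidence_type is not an allowed value")
    if not is_choice(row.get("sentiment"), SENTIMENTS):
        problems.append("sentiment is not an allowed value")
    if not is_choice(row.get("linked_context_used"), {"yes", "no"}):
        problems.append("linked_context_used must be yes or no")
    if not isinstance(row.get("notes"), str):
        problems.append("notes must be a string")
    return problems


def batch_problems(raw_reply):
    """Parse one delegate reply and return {card_id: problems} for invalid rows."""
    rows = json.loads(raw_reply)["classifications"]
    report = {}
    for row in rows:
        problems = row_problems(row)
        if problems:
            report[row.get("card_id")] = problems
    return report
```

An empty result from `batch_problems` means every row in that reply matched the prompt's field definitions; any other result lists the cards worth re-delegating or fixing by hand.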

--bugs mode classification prompt (replaces the above):

You are classifying Shortcut bug cards for priority signal extraction.
Read each card's description plus any linked-context summaries from
{BATCH_FILE}, then for each card extract:

- Treat `description` as the primary source.
- Use `linked_context` only when it adds evidence relevant to the existing
  classification fields.
- Do not mechanically count external links as demand.
- `external_links` without `linked_context` are provenance only in this
  tranche.

1. intercom_conversations: Total count of Intercom conversations referenced.
   Count BOTH linked conversations and stated totals (e.g. "31 conversations")
   when they appear in the description or linked context. Use the stated total
   when available. Return 0 if no Intercom evidence.

2. failure_volume_weekly: The headline weekly failure volume from the card.
   If sub-categories are broken down, report the headline total, NOT the sum
   of sub-breakdowns. Return null if no failure volume is stated.

3. has_revenue_signal: true if evidence mentions cancellations, refunds, users
   leaving, declining to subscribe. Polite requests without leaving signals = false.

4. evidence_type: One of:
   - "direct_customer_pain" — Intercom conversations show real users affected
   - "internal_metric" — evidence from internal dashboards, not customer reports
   - "speculative" — no evidence of current customer impact
   - "mixed" — combination of customer and internal evidence

5. sentiment: "high" | "medium" | "low" | "none"

6. blast_radius: One sentence. Who is affected and how broadly? Include any
   stated user counts, percentages, or frequency data from the card.

7. workaround_exists: true | false | "partial". Is there a user-accessible
   workaround described or implied?

8. linked_context_used: "yes" | "no"
   Return "yes" only if the proposed classification relied on linked-context
   summaries in a way that changed or supported the judgment. Return "no" if
   the description alone was sufficient.

9. notes: One sentence on the most important thing the numbers don't convey.

Return ONLY valid JSON:
{"classifications": [{"card_id": N, "intercom_conversations": N, "failure_volume_weekly": N|null, "has_revenue_signal": bool, "evidence_type": "...", "sentiment": "...", "blast_radius": "...", "workaround_exists": "...", "linked_context_used": "yes|no", "notes": "..."}, ...]}

Bug cards are typically fewer than 20 — use a single delegate unless the count exceeds 16.

4. Review and import classification results

python3 box/priority-signals.py review-classifications .tmp-classify/classify-all.json .tmp-classify/results.json

The review reports two fields for each card:

linked_context_used: yes if the proposed classification relied on any linked-context summary, else no
changed_fields: JSON array of classification fields that differ from the currently stored priority_signals row for that card
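The changed_fields comparison can be illustrated with a small sketch. The field list and the `changed_fields` helper below are assumptions made for illustration; the actual set of compared fields is whatever review-classifications implements.

```python
import json

# Assumed set of compared fields, taken from the classification prompt.
CLASSIFICATION_FIELDS = (
    "intercom_conversations",
    "failure_volume_weekly",
    "has_revenue_signal",
    "evidence_type",
    "sentiment",
)


def changed_fields(proposed, stored):
    """List the fields whose proposed value differs from the stored row.

    `stored` is None when the card has no priority_signals row yet, in which
    case every proposed field counts as changed.
    """
    if stored is None:
        return [f for f in CLASSIFICATION_FIELDS if f in proposed]
    return [
        f for f in CLASSIFICATION_FIELDS
        if f in proposed and proposed[f] != stored.get(f)
    ]


proposed = {
    "intercom_conversations": 31,
    "failure_volume_weekly": None,
    "has_revenue_signal": True,
    "evidence_type": "direct_customer_pain",
    "sentiment": "high",
}
stored = dict(proposed, intercom_conversations=12, evidence_type="mixed")
print(json.dumps(changed_fields(proposed, stored)))
# → ["intercom_conversations", "evidence_type"]
```

Cards with an empty array are unchanged and need no attention during review.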

python3 box/priority-signals.py import-classifications .tmp-classify/results.json

5. Display the table

python3 box/priority-signals.py show --state "Near Term"
python3 box/priority-signals.py show --state "Backlog"

Near Term table is the primary deliverable. Backlog table shown for context. Sorted by board position, showing all priority signals in a scannable comparison format.

--bugs mode: Filter the output to bug rows only:

python3 box/priority-signals.py show --state "Near Term" 2>&1 | grep -E "bug|Rank|----"

6. Score unscored cards (default, skippable)

Run gaps to find cards missing judgment scores:

python3 box/framework-rank.py gaps --state Backlog --state "Near Term"

If there are gaps, propose scores and apply them. If the user says to skip this step, go directly to Step 7 — mispriority will run on scored cards only.

Scoring rules:

Features need an Impact score (1-10). "How much would this move the needle if shipped?" Informed by revenue potential, retention effect, strategic value. This is product judgment, not mechanical.
Bugs need a Severity score (1-5):
- 5 = Financial harm / data loss / security
- 4 = Feature broken / workflow blocked
- 3 = Degraded experience / workaround exists
- 2 = Cosmetic / misleading display
- 1 = Edge case / negligible
All cards need a Scope score (1-5). This drives the "lowest scope code task" Near Term rule. Non-code chores (vendor config, manual processes) get null — they are ineligible for the lowest-scope slot.
- 1 = Trivial: single-file change, config update, copy/prompt tweak (<1hr)
- 2 = Small: few files, well-understood pattern (<1 day)
- 3 = Medium: multiple files/systems, some design decisions (1-3 days)
- 4 = Large: significant feature, multiple systems (1-2 weeks)
- 5 = XL: major initiative, new system/architecture (multi-week)
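The score ranges above can be checked mechanically before anything is applied. This is a sketch under the stated scales only; `score_error` is a hypothetical helper and does not reflect how box/framework-rank.py validates its arguments.

```python
# Inclusive ranges from the scoring rules above.
SCORE_RANGES = {"impact": (1, 10), "severity": (1, 5), "scope": (1, 5)}


def score_error(kind, value):
    """Return an error message for an invalid score, or None if it is acceptable."""
    if kind not in SCORE_RANGES:
        return f"unknown score kind: {kind}"
    if value is None:
        # Only Scope may be null (non-code chores).
        return None if kind == "scope" else f"{kind} score is required"
    low, high = SCORE_RANGES[kind]
    if isinstance(value, bool) or not isinstance(value, int) or not low <= value <= high:
        return f"{kind} must be an integer from {low} to {high}"
    return None


print(score_error("severity", 6))  # → severity must be an integer from 1 to 5
print(score_error("scope", None))  # → None
```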

Process:

Read the classification data (D*C partial scores, evidence_type, notes) from the gaps output.
For each unscored card, propose an Impact or Severity score with a one-line rationale. Use the card description and classification notes — don't infer from titles alone.
For cards needing Scope scores: delegate to sonnet agents in batches of ~16 with card descriptions. The delegate prompt should include the 1-5 scale definitions and instruct returning null for non-code chores. Spot-check extremes (scope 1 and scope 5) against batch files before importing.
Present the full batch of proposed scores to the user for review. The user may adjust individual scores or approve the batch.

Apply approved scores:

python3 box/framework-rank.py score SC-NNN --impact N    # features
python3 box/framework-rank.py score SC-NNN --severity N   # bugs
python3 box/framework-rank.py score SC-NNN --scope N      # all code cards

--bugs mode: Skip Impact scoring entirely. Only propose Severity scores for unscored bugs:

python3 box/framework-rank.py gaps --state Backlog --state "Near Term" 2>&1 | grep -A 100 "Bugs needing Severity"

Shortcut Severity field sync (all modes):

For each bug that was just scored, include the proposed Shortcut Severity value in the same approval table (e.g., "DIC 4 -> Shortcut Sev 1 (Blocked)").
After approval, set the Shortcut Severity field via box/update-severity-field.py SC-NNN <sev-level>. Route through execute_approved (production mutation). The script reads existing custom_fields, merges severity, PUTs back the full array, and verifies via independent GET.
Verify ALL updated cards via the script's built-in verification (exit code 0 = verified, 1 = failed). If multiple cards, run one at a time.

7. Near Term Refresh

7a. Reconcile manual changes:

Fetch current Near Term card IDs from Shortcut. Load last computed baseline from near_term_baseline table (most recent run_id batch).

First run: If no prior baseline exists (empty table), skip reconciliation. All current Near Term cards are treated as the initial computed set.
Cards in Near Term but NOT in last baseline and NOT already pinned -> candidate manual pin. Present to the user for confirmation.
Cards in last baseline but now in Backlog (not Build/Test/Released) and NOT already excluded -> candidate manual exclude. Present for confirmation.
Cards that moved out of the groomed pool entirely are ignored (not manual removals).
Auto-clear any override whose card is no longer in the groomed pool (archived, moved to Build/Test/Released). Set cleared_at=NOW(), cleared_by='auto'. Report auto-cleared overrides for awareness.

Write confirmed overrides to near_term_overrides.
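The reconciliation rules in 7a reduce to set arithmetic over card IDs. The sketch below is illustrative only: the `reconcile` function and its argument names are hypothetical, and it assumes that a first run (no prior baseline) skips every rule, including auto-clear.

```python
def reconcile(near_term, backlog, last_baseline, pinned, excluded):
    """Compute candidate overrides from sets of card IDs.

    near_term / backlog: cards currently in each groomed-pool state.
    last_baseline: cards in the most recent computed baseline (empty on first run).
    pinned / excluded: cards with an active pin or exclude override.
    """
    if not last_baseline:
        # First run: no reconciliation (assumption: auto-clear is skipped too).
        return {"candidate_pins": set(), "candidate_excludes": set(), "auto_clear": set()}

    groomed_pool = near_term | backlog
    return {
        # In Near Term, not computed last time, not already pinned.
        "candidate_pins": near_term - last_baseline - pinned,
        # Computed last time, now in Backlog, not already excluded.
        # Cards that left the pool entirely drop out via the intersection.
        "candidate_excludes": (last_baseline & backlog) - excluded,
        # Overrides whose card is no longer in the groomed pool.
        "auto_clear": (pinned | excluded) - groomed_pool,
    }


result = reconcile(
    near_term={101, 102, 103, 104},
    backlog={201, 202},
    last_baseline={101, 102, 201, 301},
    pinned={103},
    excluded={401},
)
print(result["candidate_pins"])      # → {104}
print(result["candidate_excludes"])  # → {201}
print(result["auto_clear"])          # → {401}
```

Card 301 was in the last baseline but has left the groomed pool, so it appears in none of the three sets. Candidate pins and excludes still go to the user for confirmation before anything is written.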

7b. Compute new baseline:

python3 box/near-term-sort.py

The sort applies this precedence per card:

Card in active epic -> Near Term (overrides excludes)
Card has active exclude override -> Backlog
Card has active pin override -> Near Term
Card qualifies by baseline rules (top-N DIC) -> Near Term
Otherwise -> Backlog
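The precedence order above can be read as a chain of early returns. This is a minimal sketch of that ordering, not the code in box/near-term-sort.py; the function and parameter names are hypothetical.

```python
def place_card(in_active_epic, has_exclude, has_pin, qualifies_by_baseline):
    """Return the target state for one card, checking rules in precedence order."""
    if in_active_epic:
        return "Near Term"  # epic membership overrides excludes
    if has_exclude:
        return "Backlog"    # exclude beats pin and baseline rules
    if has_pin:
        return "Near Term"
    if qualifies_by_baseline:
        return "Near Term"
    return "Backlog"


# An excluded card in an active epic still lands in Near Term.
print(place_card(True, True, False, False))   # → Near Term
# An exclude wins over both a pin and a top-N qualification.
print(place_card(False, True, True, True))    # → Backlog
```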

Baseline rules for standalone cards (not in active epics, not blocked):

Top 5 DIC-ranked bugs
Top 1 DIC-ranked bug per Shortcut severity level (critical/high/medium/low)
Top 5 DIC-ranked chores
Top 5 DIC-ranked features
Oldest card in the groomed pool (by created_at)
Lowest scope code task (by scope_score; non-code chores excluded). Tiebreak: DIC score desc, intercom desc, card age asc (stable, no state dependency to prevent oscillation). Proved 2026-04-23.
Tie-break (for top-N rules): intercom count desc, revenue signal, evidence type strength, card age asc
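The lowest-scope rule and its tiebreak can be expressed as a single sort key. The sketch below assumes each card is a dict with the hypothetical keys `scope_score`, `dic_score`, `intercom_conversations`, and `age_days`, and it reads "card age asc" as smallest age first; the real field names and age definition live in box/near-term-sort.py.

```python
def lowest_scope_card(cards):
    """Pick the lowest-scope code task, or None if no card has a scope score.

    Tiebreak: DIC score desc, intercom count desc, card age asc. The key uses
    only card attributes, never current board state, so repeated runs agree.
    """
    code_cards = [c for c in cards if c["scope_score"] is not None]
    if not code_cards:
        return None
    return min(
        code_cards,
        key=lambda c: (
            c["scope_score"],
            -c["dic_score"],
            -c["intercom_conversations"],
            c["age_days"],
        ),
    )


cards = [
    {"id": 1, "scope_score": None, "dic_score": 9.0, "intercom_conversations": 40, "age_days": 10},
    {"id": 2, "scope_score": 1, "dic_score": 3.0, "intercom_conversations": 2, "age_days": 90},
    {"id": 3, "scope_score": 1, "dic_score": 5.0, "intercom_conversations": 0, "age_days": 30},
    {"id": 4, "scope_score": 2, "dic_score": 8.0, "intercom_conversations": 12, "age_days": 5},
]
print(lowest_scope_card(cards)["id"])  # → 3
```

Card 1 is a non-code chore (null scope) and is skipped; cards 2 and 3 tie on scope, and card 3 wins on the higher DIC score.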

7c. Show proposed moves and reorder:

Present:

Cards entering Near Term (with reason: epic, top-5 bug, severity rep, etc.)
Cards leaving Near Term (with reason: no longer qualifies, not pinned)
Current active overrides for review/cleanup
#1 DIC bug board position: if the sort outputs a REORDER section, the #1 DIC bug is not at the top of the Near Term board. Include the reorder in the set of proposed changes.

Wait for user approval before executing moves.

7d. Execute moves and reorder via execute_approved:

State changes (Backlog ↔ Near Term): use the safe wrapper script:

python3 box/shortcut-mutate.py move SC-NNN "Near Term"
python3 box/shortcut-mutate.py move SC-NNN "Backlog"

Board reorder (position within a state): use the reorder wrapper:

python3 box/reorder-nearterm.py --first 1643              # dry-run
python3 box/reorder-nearterm.py --first 1643 --execute     # real
python3 box/reorder-nearterm.py --before ANCHOR CARD_ID    # place before
python3 box/reorder-nearterm.py --after ANCHOR CARD_ID     # place after

7e. Write new baseline:

Insert computed baseline to near_term_baseline with new run_id. Old baselines remain for audit; only the most recent is used for reconciliation.

7f. Report final Near Term composition.

8. Framework ranking comparison

python3 box/framework-rank.py mispriority -n 20

--bugs mode: Use the filtered rank output instead:

python3 box/framework-rank.py rank --state "Near Term" 2>&1 | grep "BUG"