| name | unite-group-ci-recovery |
| description | How to ship PRs cleanly across the Unite-Group portfolio (Synthex, Pi-Dev-Ops, Disaster-Recovery, DR-NRPG, RestoreAssist, ATO, CARSI, CCW-CRM, Unite-Hub, synthex-mcp-app) without falling into the same CI / Vercel / Supabase / convention traps every session. Captures the recurring failure modes I hit 2026-05-25 across PRs |
Unite-Group CI Recovery Playbook
When to use
Always load this BEFORE opening a PR against any Unite-Group repo, and again any time you see:
- A check marked skipping that you expected to run
- A Vercel – *-sandbox failure
- A Supabase migration that won't apply
- An OOM (exit 137) build failure
- A pre-existing red check on a PR you didn't cause
0 — TL;DR pre-flight checklist (run before opening any PR)
| ✓ | Check |
|---|---|
| □ | Branch name starts with feature/agent- (NOT fix/, feat/, docs/, chore/) — see §1 |
| □ | PR body includes all 4 required fields: Agent ID:, Task ID:, Verifier ID:, Agentic Layer: — see §1 |
| □ | If migration: it has been applied to prod via the Supabase MCP apply_migration tool (NOT the CLI --linked -f) and verified before opening the PR — see §3 |
| □ | If migration touches a new table: cross-check that all FK targets exist on prod (e.g. via to_regclass()) — see §3.3 |
| □ | If schema.prisma touched: npx prisma format && npx prisma validate && npx prisma generate all green locally |
| □ | npm run type-check clean |
| □ | npm run build clean with JWT_SECRET=<placeholder> env (required by the OAuth route at build time — see §2.1) |
| □ | Relevant npx jest suites pass |
| □ | Linear issue ID referenced in the PR body (or "Resolves SYN-XXX") |
If any box is unchecked, you will hit one of the recurring failures below.
1 — Agent PR conventions (THIS IS WHY YOUR CHECKS KEEP SKIPPING)
Scope: these conventions apply where .github/workflows/agent-pr-checks.yml exists. Today that's only Synthex. Pi-Dev-Ops and other Unite-Group repos have their own CI workflow sets (e.g. Pi-Dev-Ops has Frontend (tsc + eslint + build), Pi CEO API smoke test, Python (pytest + ruff) — no agent-pr-checks gate). Following the convention on a Pi-Dev-Ops PR is harmless but won't unblock different checks there.
To check per repo:
ls .github/workflows/agent-pr-checks.yml 2>/dev/null && echo "agent-pr-checks present" || echo "no agent-pr-checks workflow"
When the workflow IS present, it runs 5 jobs (validate-agent-metadata, run-quality-checks, security-scan, update-pr-status, the detector itself), all gated on if: needs.detect-agent-pr.outputs.is_agent_pr == 'true'.
The detector ONLY returns true when:
if [[ "${{ github.head_ref }}" == feature/agent-* ]]; then
echo "is_agent=true"
else
echo "is_agent=false"
fi
This is intentional configuration, not a bug. The org wants autonomous-agent PRs to go through an explicit metadata + quality gate. If your branch doesn't match the pattern, ALL those checks return skipping and you lose the agent-specific QA coverage.
Required branch naming
| Branch prefix | Triggers agent-pr-checks? |
|---|---|
| feature/agent-<short-desc> | ✅ YES |
| feature/agent-<task-id>-<desc> | ✅ YES |
| fix/..., feat/..., docs/..., chore/... | ❌ NO: all 5 agent-PR jobs skip |
Use feature/agent- for every PR you open from an autonomous session. For bug-fix PRs include the nature in the description, not the prefix.
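The branch rule can be checked locally before pushing. The following is a minimal sketch that mirrors the detector's glob; the function name `is_agent_branch` is hypothetical and not part of any repo.

```shell
# Succeeds only for branches that start with "feature/agent-",
# matching the glob the detector applies to github.head_ref.
is_agent_branch() {
  case "$1" in
    feature/agent-*) return 0 ;;
    *) return 1 ;;
  esac
}

# Example use inside a git checkout (left commented so the sketch has no side effects):
# branch=$(git rev-parse --abbrev-ref HEAD)
# is_agent_branch "$branch" || echo "rename to feature/agent-<desc> before opening the PR"
```

A `case` pattern is used rather than `[[ ... ]]` so the check also runs under plain `sh`.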
Required PR body fields
validate-agent-metadata greps the PR body for these literal strings — all 4 are required or the job fails:
Agent ID: <slug or claude session id>
Task ID: <Linear ticket like SYN-975, or internal task id>
Verifier ID: <who reviewed the diff — for solo runs, this is "self-verified via /code-review">
Agentic Layer: <which layer wrote this — "planner", "builder", "verifier", or "ship-it">
Append these to every PR body — under ## Agent metadata works. Example:
## Agent metadata
- Agent ID: claude-opus-4-7-<session-id-prefix>
- Task ID: SYN-975
- Verifier ID: self-verified via /code-review (3 finder angles + 1 verifier per finding)
- Agentic Layer: builder
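A local pre-check for the four fields might look like the sketch below. It assumes the job does a plain literal match on each label; the real grep in validate-agent-metadata may differ (for example in case sensitivity), and `missing_agent_metadata` is a hypothetical helper name.

```shell
# Prints each required label that is absent from the PR body passed as $1.
# Prints nothing and returns 0 when all four labels are present.
missing_agent_metadata() (
  body=$1
  status=0
  for field in "Agent ID:" "Task ID:" "Verifier ID:" "Agentic Layer:"; do
    if ! printf '%s\n' "$body" | grep -qF -- "$field"; then
      echo "$field"
      status=1
    fi
  done
  return "$status"
)

# Example: missing_agent_metadata "$(cat pr-body.md)"   # pr-body.md is a placeholder file name
```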
Other intentional skips you can leave alone
CodeQL — skips when no scanned-language files changed; runs on push to main otherwise. Leave it.
Supabase Preview — only fires when Supabase preview-branch migration changes are detected. Skipping on a docs PR is correct.
run-quality-checks / validate-agent-metadata / security-scan / update-pr-status — fix by adopting branch + metadata convention above. NOT a workflow bug.
2 — Vercel sandbox patterns
The Unite-Group org has TWO Vercel projects per repo: <repo> (production) and <repo>-sandbox (preview/sandbox). Sandboxes have a documented history of brittleness — 5 of them were broken at the start of this session per memory item SBX.
2.1 JWT_SECRET required at build time
Synthex's /api/auth/oauth/github/callback route reads process.env.JWT_SECRET at module init. Next.js's "Collecting page data" phase executes that module, so npm run build will fail without JWT_SECRET set with:
Error: JWT_SECRET must be set in production environment
... in app/api/auth/oauth/github/callback/route.ts
Fixes:
- For local builds: JWT_SECRET=any-string npm run build (the value isn't validated, only its presence)
- For a Vercel project: add the JWT_SECRET env var via Vercel Dashboard → project → Settings → Environment Variables (Production scope). The user added it to synthex-sandbox on 2026-05-25; that was the fix that turned the sandbox green on PRs #296, #297, #299.
If a NEW Vercel project gets created (sandbox spin-up for a new repo), this gap will re-appear.
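For local builds, a small guard can supply a placeholder without clobbering a real secret. This is a sketch only; the helper name and the placeholder value are assumptions, chosen because the build checks presence rather than content.

```shell
# Exports a throwaway JWT_SECRET only when none is set,
# so a real value already in the environment is never overwritten.
ensure_build_jwt_secret() {
  if [ -z "${JWT_SECRET:-}" ]; then
    JWT_SECRET="local-build-placeholder"
    export JWT_SECRET
  fi
}

# Example: ensure_build_jwt_secret && npm run build
```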
2.2 OOM (exit 137) on sandbox builds
Symptom in Vercel logs:
Error: Command "npm run build:vercel" exited with 137
At least one "Out of Memory" ("OOM") event was detected during the build.
Cause: sandbox projects use the default build machine size which has less RAM headroom than production. NODE_OPTIONS=--max-old-space-size=7680 next build --webpack plus TypeScript checking on the full Synthex codebase can saturate it. Production builds succeed because they use a larger machine.
What to do:
- If sandbox OOMs and production passed: it's environmental — document in a PR comment + admin-merge if your PR is docs/test only
- Permanent fix is "Enable Enhanced Builds" in Vercel project settings (UI work, not code). Tracked in DR-852 per memory item SBX.
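Telling the two failure classes above apart from a build log can be sketched as a simple string match. The marker strings are taken from the error messages quoted in 2.1 and 2.2; the function name and the output labels are hypothetical.

```shell
# Classifies build log text passed as $1 by the error markers quoted in 2.1 and 2.2.
classify_build_log() (
  log=$1
  if printf '%s\n' "$log" | grep -qF 'JWT_SECRET must be set'; then
    echo "jwt-secret-missing"   # 2.1: env var absent at build time
  elif printf '%s\n' "$log" | grep -qE 'exited with 137|Out of Memory'; then
    echo "oom"                  # 2.2: build machine ran out of memory
  else
    echo "unknown"              # not a pattern documented here; read the log
  fi
)
```

An "unknown" result is a prompt to read the full log, not evidence that the failure is environmental.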
2.3 vercel pull skips Sensitive vars
GHA workflows that do vercel pull && vercel build will SILENTLY miss env vars marked Sensitive in the Vercel project. The Vercel native git integration handles them fine; the CLI doesn't. Often misdiagnosed as "missing env vars" — actually the workflow architecture is wrong. Per memory reference_vercel_pull_skips_sensitive.
2.4 vercel.json rootDirectory is ignored
rootDirectory is project-level only — must be set in Vercel UI, never in vercel.json. Sandbox projects often inherit a default rootDirectory that doesn't match the app subdirectory. Per memory reference_vercel_rootdirectory_project_level.
3 — Supabase migration playbook
3.1 Use the MCP apply_migration tool, NOT the CLI for multi-statement files
The Supabase CLI's supabase db query --linked -f <file> wraps the file body in JSON for the Management API. Multi-statement migrations (especially with DO $$ ... $$ blocks or PL/pgSQL functions) break with:
Failed to run sql query: ERROR: 42601: syntax error at or near "["
LINE 1: [
^
The [ is the start of the JSON wrapper hitting Postgres as raw SQL.
Correct path for multi-statement migrations:
mcp__21d7de5d-7115-4af8-a6c3-2b86769b05fb__apply_migration({
project_id: "znyjoyjsvjotlzjppzal",
name: "add_marketing_agent",
query: "<full multi-statement SQL>"
})
The CLI is fine for one-off inline SELECTs (supabase db query --linked "SELECT ..."). Avoid it for -f <file> unless the file is a single statement.
3.2 Standing Supabase auth (granted 2026-05-25)
The user granted standing authorization for SQL ops on linked projects. Per memory feedback_standing_supabase_auth:
- Routine reads, idempotent DDL with IF NOT EXISTS, RLS adds: proceed
- Show the SQL inline before applying any non-trivial migration
- Pause for genuinely destructive ops (DROP TABLE, TRUNCATE, DELETE without narrow WHERE)
- Pause for prod writes to customer-owned data
3.3 Prereq detection — always check FK targets exist before applying
The 2026-05-25 trap: PR #295's 20260525_add_marketing_agent had a FK to marketing_agency_qa_reports. That table was declared in prisma/schema.prisma but never had a CREATE TABLE in any migration file (latent gap going back months). Apply failed with:
ERROR: 42P01: relation "public.marketing_agency_qa_reports" does not exist
Before applying any migration with FKs to other tables, run a to_regclass() check to confirm targets exist on the target DB:
SELECT to_regclass('public.<parent_table>')::text AS parent;
If null, write a prereq migration that creates the missing table(s) first. See PR #297 for the template (20260524_add_marketing_agency_core_tables).
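When a migration references several parent tables, the same to_regclass() check can be generated for all of them at once. A minimal sketch, assuming the table names are trusted identifiers in the public schema (no quoting or escaping is attempted); `fk_target_check_sql` is a hypothetical helper.

```shell
# Prints one SELECT with a column per parent table.
# A NULL column in the query result means that table does not exist.
fk_target_check_sql() (
  cols=""
  for t in "$@"; do
    cols="${cols:+$cols, }to_regclass('public.$t')::text AS $t"
  done
  printf 'SELECT %s;\n' "$cols"
)

# Example (single inline statement, so the CLI path is fine):
# supabase db query --linked "$(fk_target_check_sql marketing_agency_qa_reports)"
```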
3.4 Project ID quick reference
Always verify the project ID matches the repo before any DDL. From memory reference_supabase_project_ids:
| Repo | Project ID |
|---|---|
| Pi-CEO / Pi-Dev-Ops | zbryrmxmgfmslqzizsto |
| Synthex | znyjoyjsvjotlzjppzal |
| Disaster-Recovery | zwzbglqzmpyfzdkblxyf |
| RestoreAssist | oxeiaavuspvpvanzcrjc |
| Unite-Group | lksfwktwtmyznckodsau |
| ATO | xwqymjisxmtcmaebcehw |
| DR-NRPG | lccqasmurmsisnnjqqmr (separate org jobkjtecrxliqfnrcssa) |
| CARSI | ofzafxvxobjggjisrbsa (separate org pmsatfzevrriaylbsifp) |
pwwwhoaxxtkmowifpuwf (NodeJS Starter V1) is NOT a real project — it's where the single-project MCP c879c796 is bound. Never write to it expecting it to be Pi-CEO. The 2026-05-24 incident applied a migration here by mistake.
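The "verify the project ID matches the repo" step can be made mechanical. The sketch below hard-codes the table above; both function names are hypothetical, and the separate-org details for DR-NRPG and CARSI are not encoded.

```shell
# Looks up the Supabase project ID for a repo, using the table above.
supabase_project_id() {
  case "$1" in
    Pi-CEO|Pi-Dev-Ops)  echo zbryrmxmgfmslqzizsto ;;
    Synthex)            echo znyjoyjsvjotlzjppzal ;;
    Disaster-Recovery)  echo zwzbglqzmpyfzdkblxyf ;;
    RestoreAssist)      echo oxeiaavuspvpvanzcrjc ;;
    Unite-Group)        echo lksfwktwtmyznckodsau ;;
    ATO)                echo xwqymjisxmtcmaebcehw ;;
    DR-NRPG)            echo lccqasmurmsisnnjqqmr ;;
    CARSI)              echo ofzafxvxobjggjisrbsa ;;
    *) echo "unknown repo: $1" >&2; return 1 ;;
  esac
}

# Succeeds only when the ID about to be used ($2) is the one listed for the repo ($1).
# The NodeJS Starter V1 ID is in no row, so it fails for every repo.
assert_project_matches_repo() {
  [ -n "$2" ] && [ "$(supabase_project_id "$1" 2>/dev/null)" = "$2" ]
}
```

Calling `assert_project_matches_repo Pi-Dev-Ops "$project_id"` before any DDL would have rejected the mistaken write described above.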
4 — Admin-merge discipline
When admin-merge IS OK
- The failing check has been proven environmental with actual log evidence, not by reflex/ticket-citation
- The same check is failing on main prior to your PR (proves it's pre-existing)
- Your PR is docs / test-only / migration-only with no production runtime change, OR
- The user has explicitly said "100% green merged" or "merge it" giving standing consent
Always:
- Post a PR comment documenting the failure analysis BEFORE admin-merging
- Reference the correct, specific Linear ticket — not a vaguely-related one
- Use gh pr merge <N> --squash --admin --delete-branch
Never admin-merge:
- Code changes (even tiny) when a Build / Type Check / Unit Tests check is red — that's your change failing
- When a security scan is red — investigate first
- Without leaving a comment explaining why
PRE-RULE — check branch protection FIRST (added 2026-05-25 after the rule-against-nothing incident)
Before invoking the STOP rule or any admin-merge logic, check whether the failing status check is actually a hard block. Many Unite-Group repos have zero required-status-checks configured — failing checks show as a GitHub UI warning ("yellow banner: 1 check failing, merge anyway?") but a plain gh pr merge --squash still works.
gh api repos/<owner>/<repo>/branches/main/protection 2>/dev/null \
| jq '{required: .required_status_checks.contexts, reviews: .required_pull_request_reviews.required_approving_review_count, enforce_admins: .enforce_admins.enabled}'
Also check newer-style rulesets (separate API):
gh api repos/<owner>/<repo>/rulesets --jq '[.[] | {id, name, enforcement}]'
Interpretation:
- required.contexts: [] AND no rulesets enforcing checks → the failing check is UI noise, not a hard block. A plain gh pr merge --squash will work. Do NOT use --admin. Inform the user the check is non-blocking but recommend fixing the underlying cause anyway (file the Linear ticket, but ship the PR).
- required.contexts: [<check-name>] → the check IS a hard block. Apply the STOP rule below.
- Rulesets present with enforcement: "active" → check the ruleset's contexts; same logic.
Confirmed today (2026-05-25):
- Pi-Dev-Ops main: required.contexts: [], no rulesets, no enforce_admins → all CI is UI-noise level. The 7-PR admin-merge cycle was theater.
- Synthex main: status not checked in this session — assume hard block until verified per-repo.
The STOP rule below ONLY applies when a hard block exists. When it's UI-noise, surface the failure honestly ("X check is red but doesn't block; recommend fixing via [ticket]") and proceed with normal merge — that's the honest, non-reflexive path.
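The interpretation step reduces to a membership test once the required contexts are known. A minimal sketch, assuming the contexts from the branch-protection call are passed as arguments; it does not inspect rulesets, which still need the separate check, and the function name and output labels are hypothetical.

```shell
# $1 = name of the failing check; remaining args = required status-check contexts.
# Prints "hard-block" if the failing check is required, otherwise "ui-noise".
check_block_level() (
  failing=$1
  shift
  for ctx in "$@"; do
    if [ "$ctx" = "$failing" ]; then
      echo "hard-block"
      return 0
    fi
  done
  echo "ui-noise"
)

# Example: check_block_level "build" "build" "lint"   # the check is in the required list
# Example: check_block_level "build"                  # empty required list
```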
STOP rule — chronic-broken-check escalation (added 2026-05-25 after 7-PR admin-merge incident)
Before admin-merging past any failing check, count consecutive failures of that exact check on that exact project.
- If the same check has failed on ≥5 consecutive deployments / PRs, you do NOT admin-merge with "environmental" reasoning. The check is chronically broken, not flaking. Doing so trains you (and the user) to ignore signal — the exact pathology this skill was meant to prevent.
- Instead, propose ONE of:
- Remove the check from required status checks on the branch protection rule (GitHub → Settings → Branches → main rule → required checks). Stops the false-block immediately. Re-add when the check is fixed.
- Disable / delete the broken Vercel project if it's no longer serving a purpose.
- Surface the actual root cause (not the symptom) and ship the fix.
- Open a Linear ticket separate from any "we'll get to it" parent ticket that captures THIS specific failure with logs + a UI-recipe fix. Do not cite a vaguely-related ticket (e.g. "DR-852 documented") if the specific failure mode is different.
Concrete example from 2026-05-25:
Pi-Dev-Ops Vercel – pi-dev-ops-sandbox failed 20 consecutive deployments (PRs #262–#268). All 7 PRs got admin-merged with "DR-852 documented" comments. The root cause was actually framework: null on the Vercel project settings (NOT one of DR-852's 5 sandboxes) — fixable in 2 minutes via Vercel UI. Should have been escalated at PR #262 (first failure). RA-5261 captures the specific fix; RA-5262 captures the systemic prevention (nightly CI script to detect framework-drift across all sandboxes).
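Counting the streak is easy to get wrong by eye, so here is a sketch of the STOP-rule threshold. It assumes you have already collected the conclusions for one check on one project, newest first; the function names and the "escalate" / "investigate" labels are hypothetical.

```shell
# Args: conclusions for one check on one project, newest first.
# Prints the length of the leading run of "failure" entries.
consecutive_failures() (
  n=0
  for result in "$@"; do
    [ "$result" = "failure" ] || break
    n=$((n + 1))
  done
  echo "$n"
)

# Applies the STOP-rule threshold: 5 or more consecutive failures means escalate.
stop_rule_verdict() (
  if [ "$(consecutive_failures "$@")" -ge 5 ]; then
    echo "escalate"
  else
    echo "investigate"
  fi
)
```

"investigate" only means the threshold was not reached; the admin-merge conditions listed earlier in this section still apply.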
How to use this skill agentically (added 2026-05-25)
The skill is not a label to slap on a PR comment — it's a callable procedure. Before admin-merging any failing check, spawn an investigator agent with this prompt template:
You are applying the unite-group-ci-recovery skill at
/Users/phillmcgurk/Pi-CEO/skills/unite-group-ci-recovery/SKILL.md.
Read it end-to-end, then for the failure at <PR URL>:
1. Use Vercel MCP get_project + list_deployments + get_deployment_build_logs
to determine the actual root cause (with log evidence)
2. Determine which class from §2 (JWT_SECRET / OOM / rootDirectory / framework=null / etc)
3. Count consecutive failures on the same project — apply §4 STOP rule if ≥5
4. Report: (a) fixable now via MCP, (b) fixable in <5 min by user UI work,
(c) needs escalation per STOP rule, (d) truly waits on backlog.
Do not cite any Linear ticket unless you've confirmed the ticket's failure mode
matches the current failure mode exactly.
This prevents the parent agent from reflexively pattern-matching "sandbox red → DR-852 → admin-merge".
5 — Migration prereq detection workflow
Before opening a PR that adds a Prisma migration:
supabase db query --linked "SELECT table_name FROM information_schema.tables WHERE table_schema='public' AND table_name LIKE 'marketing_agency_%' ORDER BY table_name;"
grep -E "^model MarketingAgency" prisma/schema.prisma | wc -l
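The two commands above give a table list from the database and a model count from the schema; comparing them by name is more reliable than comparing counts. A minimal sketch, assuming you have already turned the Prisma models into their table names (the mapping depends on each model's @@map and is not derived here); `missing_tables` is a hypothetical helper.

```shell
# $1 = whitespace-separated table names the schema expects
# $2 = table names that exist on the target DB (one per line)
# Prints each expected table that is absent from the DB list.
missing_tables() (
  for t in $1; do
    printf '%s\n' "$2" | grep -qFx -- "$t" || echo "$t"
  done
)

# Example: missing_tables "$expected" "$existing"
# where $existing holds the table_name column from the information_schema query above.
```

Any name it prints needs a prereq migration before the FK-bearing migration is applied.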
6 — Skipping checks reference
| Check | Repo where it lives | Why it skips | Fix |
|---|---|---|---|
| validate-agent-metadata | Synthex | Branch != feature/agent-* | Adopt convention (§1) |
| run-quality-checks | Synthex | Branch != feature/agent-* | Adopt convention (§1) |
| security-scan (lowercase, from agent-pr-checks) | Synthex | Branch != feature/agent-* | Adopt convention (§1). The Title-Case Security Scan from security.yml runs on all PRs; it is a different check |
| update-pr-status | Synthex | Branch != feature/agent-* | Adopt convention (§1) |
| Supabase Preview | Synthex | No supabase/migrations diff for preview branch | Intentional; leave alone unless you're touching supabase/migrations/*.sql |
| CodeQL | Synthex | Path filter or push-only trigger | Intentional; leave alone |
| Smoke test (prod) | Pi-Dev-Ops | Only fires on push to main (not on PRs) | Intentional; leave alone |
7 — Standing memories to reference
- [[feedback_standing_supabase_auth]] — SQL ops consent
- [[reference_supabase_project_ids]] — project ID table
- [[reference_dr_db_not_reachable]] — DR prod DB needs direct connection (DR-851)
- [[reference_vercel_rootdirectory_project_level]] — rootDirectory is UI-only
- [[reference_vercel_pull_skips_sensitive]] — sensitive env vars + GHA pitfall
- [[feedback_close_verification_gaps]] — don't surface "should I wait" when you can verify yourself
- [[feedback_act_on_own_recommendations]] — state the rec then execute
- [[feedback_verify_red_checks_before_dismissing]] — prove environmental before calling pass
- [[project_synthex_positioning]] — Synthex = Primary Marketing Agency; agentic features go INSIDE Synthex
8 — Worked example: applying this skill to "fix a small bug in Synthex"
- Open Linear, identify ticket (e.g. SYN-XXX)
- Cut branch:
git checkout -b feature/agent-syn-xxx-short-desc
- Make changes
- Run pre-flight checklist (§0)
- Open PR with full metadata block (§1)
- Watch CI via Monitor poll loop
- If sandbox fails: check §2 for known patterns, then apply the §4 pre-rule and STOP rule; document + admin-merge only if the failure is proven environmental and the check is a hard block
- If migration fails: §3 (run prereq detection + use MCP not CLI)
- Once green, squash-merge + delete branch
- Update Linear ticket to Done + drop comment with the PR URL
9 — Verification
This skill is working when:
- No PR opens with a non-feature/agent-* prefix
- No PR opens without all 4 metadata fields
- No agent re-runs supabase db query --linked -f on a multi-statement migration
- Migration prereq detection happens BEFORE the migration is applied
- Sandbox failures get one-comment-then-admin-merge treatment (not 3 round-trips of "should I?")