Read this skill BEFORE you start any of the following — it will save 1-3 hours of debugging on average:
If you are merely calling a pre-existing Zitadel deployment from your code, you probably only need references/token-validation.md and references/api-cheatsheet.md.
The SKILL.md body intentionally stays short. Drill into the relevant reference based on the immediate task:
The references are designed to be readable in isolation — open the one you need without slogging through the rest.
These are the issues that consistently bite first-time Zitadel integrators. Each has a dedicated section in the references; this list is the trigger map so you know which file to open.
- Quirk 1:
ZITADEL_FIRSTINSTANCE_* env vars must live on the zitadel service, not on zitadel-init — the init container only runs schema migrations; setup (which honors FirstInstance) runs from start-from-init on the main service. → docker-compose-bootstrap.md §1.
- Quirk 2:
The /current-dir volume must be writable by uid 1000 — Zitadel runs as a non-root user. A previous root-owned init leaves the volume unwritable and setup silently fails with permission denied in a restart loop. → docker-compose-bootstrap.md §2.
- Quirk 3:
ZITADEL_EXTERNALDOMAIN is enforced on every request via the Host header — calling http://127.0.0.1:8080 (the bind port) when external domain is 127.0.0.1.sslip.io returns Instance not found. Node fetch does not let you override the Host header at runtime — use the URL with the external domain literally. → docker-compose-bootstrap.md §3.
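
  A minimal sketch of the workaround, assuming a PAT in the environment and the sslip.io external domain from the example above (the base URL and env var names are illustrative):

  ```ts
  // Call Zitadel through the external domain so the Host header matches
  // ZITADEL_EXTERNALDOMAIN; http://127.0.0.1:8080 would answer "Instance not found".
  const base = "http://127.0.0.1.sslip.io:8080"; // same bind port, external hostname
  const res = await fetch(`${base}/management/v1/orgs/me`, {
    headers: { Authorization: `Bearer ${process.env.ZITADEL_PAT}` }, // placeholder PAT env var
  });
  console.log(res.status, await res.json());
  ```
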
- Quirk 4:
/admin/v1/orgs/_setup requires a human admin user; use /zitadel.org.v2.OrganizationService/AddOrganization to create an org without one — v2 endpoints accept JSON over Connect protocol. → api-cheatsheet.md §"Create org".
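
  A hedged sketch of that v2 call; the base URL and PAT env vars are placeholders, and the response field name should be verified against the cheatsheet:

  ```ts
  // Create an org without a human admin: plain JSON against the v2 Connect endpoint.
  async function createOrg(name: string): Promise<string> {
    const res = await fetch(
      `${process.env.ZITADEL_BASE_URL}/zitadel.org.v2.OrganizationService/AddOrganization`,
      {
        method: "POST",
        headers: {
          Authorization: `Bearer ${process.env.ZITADEL_PAT}`, // placeholder PAT env var
          "Content-Type": "application/json",
        },
        body: JSON.stringify({ name }),
      },
    );
    if (!res.ok) throw new Error(`AddOrganization ${res.status}: ${await res.text()}`);
    const body = (await res.json()) as { organizationId?: string; id?: string };
    return body.organizationId ?? body.id ?? ""; // response field name varies across versions
  }
  ```
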
- Quirk 5:
AddHumanUserRequest requires userName and profile.firstName/lastName (not givenName/familyName) — payload shape changed across versions; v4 also rejects clockSkew > 5s on OIDC apps. → api-cheatsheet.md §"Create human user" + §"Create OIDC app".
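
  A sketch of a payload in the accepted shape; the values are examples, and the email/password sub-fields are assumptions to check against the cheatsheet (v1 Management uses initialPassword instead):

  ```ts
  import { randomUUID } from "node:crypto";

  // userName + profile.firstName/lastName are required; givenName/familyName are rejected.
  // isVerified sidesteps the SMTP trap (quirk 35); a password avoids the
  // non-deactivatable "initial" state (quirk 6).
  const addHumanUserBody = {
    userName: "jane.doe@example.com",
    profile: { firstName: "Jane", lastName: "Doe" },
    email: { email: "jane.doe@example.com", isVerified: true },
    password: { password: "Aa1!" + randomUUID() },
  };
  ```
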
- Quirk 6:
Users created without initialPassword enter state initial and cannot be _deactivated — Zitadel rejects with COMMAND-ke0fw. In production the user exits this state by completing invite; in tests/seeds set an initialPassword. → troubleshooting.md.
- Quirk 7:
Domain tenantId (stable string like JRC) ≠ Zitadel orgId (numeric like 370503937624637443) — every Management API call needs the numeric orgId in x-zitadel-orgid. Build a translation layer in your adapter. → tenant-org-mapping.md.
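
  A minimal sketch of such a translation layer; the mapping values and loading strategy are placeholders:

  ```ts
  // Domain tenantId -> numeric Zitadel orgId, injected on every Management v1 call.
  const tenantToOrg = new Map<string, string>([
    ["JRC", "370503937624637443"], // example entry; load from config or bootstrap output in practice
  ]);

  export function mgmtHeaders(tenantId: string, pat: string): Record<string, string> {
    const orgId = tenantToOrg.get(tenantId);
    if (!orgId) throw new Error(`no Zitadel orgId mapped for tenant "${tenantId}"`);
    return {
      Authorization: `Bearer ${pat}`,
      "x-zitadel-orgid": orgId,
      "Content-Type": "application/json",
    };
  }
  ```
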
- Quirk 8:
User grants search is at /management/v1/users/grants/_search (global, filter by userIdQuery), not /management/v1/users/{id}/grants/_search — the latter returns 405 Method Not Allowed. assignRole / revokeRole should be implemented as search-then-PUT/DELETE for idempotency, not POST a fresh grant each time. → api-cheatsheet.md §"User grants".
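
  A sketch of the search-then-update pattern; the _search path is the one above, but the create/update paths and response fields are assumptions to verify against the cheatsheet:

  ```ts
  // Idempotent assignRole: look up the existing grant globally, then PUT or POST.
  async function assignRole(userId: string, projectId: string, role: string, headers: Record<string, string>) {
    const base = process.env.ZITADEL_BASE_URL; // placeholder
    const search = await fetch(`${base}/management/v1/users/grants/_search`, {
      method: "POST",
      headers,
      body: JSON.stringify({ queries: [{ userIdQuery: { userId } }] }),
    }).then((r) => r.json());

    const existing = (search.result ?? []).find((g: any) => g.projectId === projectId);
    if (existing) {
      const roleKeys = Array.from(new Set([...(existing.roleKeys ?? []), role]));
      await fetch(`${base}/management/v1/users/${userId}/grants/${existing.id}`, {
        method: "PUT", headers, body: JSON.stringify({ roleKeys }), // assumed update path
      });
    } else {
      await fetch(`${base}/management/v1/users/${userId}/grants`, {
        method: "POST", headers, body: JSON.stringify({ projectId, roleKeys: [role] }), // assumed create path
      });
    }
  }
  ```
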
- Quirk 9:
loginV2.required=true is the default instance feature flag in Zitadel ≥ v3, but Login UI v2 is a separate Next.js app you must deploy yourself — without it, every /oauth/v2/authorize redirects to /ui/v2/login/login and returns {"code":5,"message":"Not Found"}. App-level loginVersion: {loginV1: {}} alone does NOT override the instance flag — you must PUT /v2/features/instance {loginV2: {required: false}} or deploy v2. → troubleshooting.md.
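
  The flag flip as a one-off call, assuming an instance-level PAT (URL and env var names are placeholders):

  ```ts
  // Make /oauth/v2/authorize fall back to the embedded v1 login UI.
  await fetch(`${process.env.ZITADEL_BASE_URL}/v2/features/instance`, {
    method: "PUT",
    headers: {
      Authorization: `Bearer ${process.env.ZITADEL_PAT}`, // instance-level token
      "Content-Type": "application/json",
    },
    body: JSON.stringify({ loginV2: { required: false } }),
  });
  ```
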
- Quirk 10:
Silent-renew redirect URI must be in the OIDC app's redirectUris — most bootstrap scripts only register /auth/callback. Without /silent-renew, every prompt=none request returns 400 and the SPA loops forever in "verifying session…". Add it at bootstrap time, not after the first failure. → troubleshooting.md + api-cheatsheet.md §"Create OIDC application".
- Quirk 11:
JWT access tokens DO NOT carry profile claims (name, preferred_username, email, urn:zitadel:iam:org:id) — those live in the id_token and /oidc/v1/userinfo only. Backend mappers that hard-require them silently 401 every request with "Token inválido" even when iss/aud/exp are perfect. Always fall back to sub for operatorName and defaultTenant for tenantId. → token-validation.md §"Access token vs ID token".
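
  A defensive mapper sketch; the function and parameter names are illustrative, not prescribed by the references:

  ```ts
  import type { JWTPayload } from "jose";

  // Profile claims are usually absent from JWT access tokens, so fall back to
  // sub and a configured default tenant instead of rejecting the request.
  export function mapOperator(payload: JWTPayload, defaultTenant: string) {
    return {
      operatorName:
        (payload.name as string | undefined) ??
        (payload.preferred_username as string | undefined) ??
        payload.sub ??
        "unknown",
      tenantId: (payload["urn:zitadel:iam:org:id"] as string | undefined) ?? defaultTenant,
    };
  }
  ```
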
- Quirk 12:
Node backends fetching JWKS over HTTPS with a self-signed cert (mkcert/dev CA) need NODE_EXTRA_CA_CERTS — createRemoteJWKSet(new URL(jwksUrl)) does a normal Node fetch, which uses the OS trust store. With a local CA, the TLS handshake fails before signature check; jose surfaces it as a generic verification error. Symptom: 100% of /api requests return 401, JWT decoded by hand looks perfect, SPA falls into a silent-renew loop and rate-limit (429) follows. → token-validation.md §"Trusting a self-signed JWKS endpoint" + troubleshooting.md §"401 storm with apparently-valid JWT".
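
  A minimal verifier sketch; the issuer/audience env vars are placeholders and the JWKS path should be taken from the discovery document:

  ```ts
  import { createRemoteJWKSet, jwtVerify } from "jose";

  // jose fetches the JWKS with Node's fetch, which uses the OS trust store.
  // Export NODE_EXTRA_CA_CERTS=/path/to/dev-ca.pem BEFORE the process starts;
  // it is only read at startup.
  const issuer = process.env.OIDC_ISSUER!; // e.g. https://idp.example.com (placeholder)
  const jwks = createRemoteJWKSet(new URL(`${issuer}/oauth/v2/keys`)); // jwks_uri from discovery

  export async function verifyAccessToken(token: string) {
    const { payload } = await jwtVerify(token, jwks, {
      issuer,
      audience: process.env.AUTH_AUDIENCE, // projectId or clientId
    });
    return payload;
  }
  ```
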
- Quirk 13:
Zitadel volume reset (down -v / --reset-zitadel) regenerates projectId and clientId — any backend env (AUTH_AUDIENCE/OIDC_AUDIENCE/VITE_OIDC_CLIENT_ID) cached from a previous bootstrap goes stale silently. Same 401-storm symptom as quirk 12, no clear error. Fix: re-derive from bootstrap.json on every boot, never hardcode. → api-cheatsheet.md §"Re-reading bootstrap output after volume reset" + troubleshooting.md §"401 storm with apparently-valid JWT".
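
  A sketch of re-deriving the IDs at boot; the file path and field names are assumptions to match to your bootstrap output:

  ```ts
  import { readFileSync } from "node:fs";

  // Never cache projectId/clientId in env files: a volume reset regenerates them.
  const bootstrap = JSON.parse(readFileSync("zitadel/bootstrap.json", "utf8")) as {
    projectId: string;
    clientId: string;
  };

  export const AUTH_AUDIENCE = bootstrap.projectId;
  export const OIDC_CLIENT_ID = bootstrap.clientId;
  ```
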
- Quirk 14:
PUT /management/v1/projects/{p}/apps/{a}/oidc_config returns 400 COMMAND-1m88i "No changes" when the body matches current state — idempotent bootstrap scripts that always PUT the OIDC config crash on second run unless they catch this code. → api-cheatsheet.md §"Create OIDC application" + assets/bootstrap-zitadel.ts (idempotent template).
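
  A hedged sketch of swallowing that specific error so re-runs converge; matching on the response text is an assumption about the error format:

  ```ts
  // Treat Zitadel's "No changes" 400 as success on idempotent bootstrap re-runs.
  async function putOidcConfig(url: string, body: unknown, headers: Record<string, string>) {
    const res = await fetch(url, { method: "PUT", headers, body: JSON.stringify(body) });
    if (res.ok) return "updated";
    const text = await res.text();
    if (res.status === 400 && text.includes("COMMAND-1m88i")) return "unchanged"; // no-op, not an error
    throw new Error(`oidc_config update failed: ${res.status} ${text}`);
  }
  ```
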
- Quirk 15:
Running Zitadel behind a TLS-terminating reverse proxy (Caddy/NGINX/Traefik) requires THREE settings, not two — ZITADEL_EXTERNALSECURE=true + ZITADEL_TLS_ENABLED=false + the start flag --tlsMode external. Without --tlsMode external the binary still tries to bind a TLS listener on its internal port and refuses traffic. → docker-compose-bootstrap.md §"TLS terminated by reverse proxy".
- Quirk 16:
Browsers expose crypto.subtle (used by PKCE in oidc-client-ts) only in secure contexts — localhost/127.0.0.1 are the only HTTP exceptions — accessing the SPA via LAN IP / .sslip.io / custom hostname over HTTP makes signinRedirect() throw. react-oidc-context callers typically void the promise, so the failure is silent — clicking "Entrar" produces no console error, no navigation, no network request. Fix: serve everything via HTTPS even for dev/LAN testing (mkcert + reverse proxy). → troubleshooting.md §"Entrar / Login button does nothing".
- Quirk 17:
F5 with InMemoryWebStorage requires a boot-time signinSilent — and <AuthProvider> wrapping the silent-renew route makes that recursive — automaticSilentRenew only fires on accessTokenExpiring, which needs an existing User; in-memory storage has no User after F5, so the lib never recovers the session even with the IdP cookie alive. The fix is an active auth.signinSilent() at boot — but if your provider mounts above the <Routes> tree, the iframe loads /silent-renew, re-mounts the provider, fires another signinSilent, and the parent Promise never settles. Symptom: SPA stuck on "Verifying session…" indefinitely with no IdP-side error. Fix requires three guards (route check, iframe check, ref-gated useEffect) and a watchdog setTimeout — and crucially, no closure-scoped cancelled flag, because StrictMode flips it before the Promise resolves. → spa-recipes.md §"Recipe 1" + troubleshooting.md §"SPA stuck on Verifying session…".
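
  A compressed sketch of the recipe, assuming react-oidc-context; the route path, timeout, and component name are illustrative:

  ```tsx
  import { useEffect, useRef, useState, type ReactNode } from "react";
  import { useAuth } from "react-oidc-context";

  export function SessionBootstrap({ children }: { children: ReactNode }) {
    const auth = useAuth();
    const attempted = useRef(false); // guard 3: ref-gated effect (StrictMode double-mount)
    const [settled, setSettled] = useState(false);

    useEffect(() => {
      const onRenewRoute = window.location.pathname.startsWith("/silent-renew"); // guard 1
      const inIframe = window.self !== window.top;                               // guard 2
      if (onRenewRoute || inIframe || attempted.current || auth.isAuthenticated) {
        setSettled(true);
        return;
      }
      attempted.current = true;
      const watchdog = setTimeout(() => setSettled(true), 5_000); // never hang on the spinner
      auth.signinSilent().finally(() => {
        clearTimeout(watchdog);
        setSettled(true); // no closure-scoped "cancelled" flag; StrictMode would flip it too early
      });
    }, [auth]);

    if (!settled) return <p>Verifying session…</p>;
    return <>{children}</>;
  }
  ```
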
- Quirk 18:
post_logout_redirect_uri is byte-matched just like redirect_uri — registering ${WEB_BASE}/ while the SPA sends ${WEB_BASE}/login (a common drift introduced when the SPA's VITE_OIDC_POST_LOGOUT_REDIRECT_URI is set to /login for UX but the bootstrap registered just the root) makes Zitadel respond with {"error":"invalid_request","error_description":"post_logout_redirect_uri invalid"} and the user is stranded. Trailing slash, scheme, and port all matter. Fix: register both URIs the SPA might send, comma-joined: ${WEB_BASE}/login,${WEB_BASE}/. → troubleshooting.md §"Logout returns post_logout_redirect_uri invalid".
- Quirk 19: Branding applied on the org does not paint the Login UI without privateLabelingSetting on the project — POST /management/v1/policies/label + _activate on the JRC org returns 200, a subsequent GET confirms primaryColor: "#ed1c24", but the browser at /ui/login/login stays on the default Zitadel blue #5469d4. Cause: Zitadel only uses the label policy of the project's owner org if the project has PRIVATE_LABELING_SETTING_ENFORCE_PROJECT_RESOURCE_OWNER_POLICY. The default _UNSPECIFIED falls back to the instance label policy. Fix: set the flag in the POST/PUT /management/v1/projects/{p} payload and verify via GET. → branding.md §"Quirk 19".
- Quirk 20: POST vs PUT for the first label policy + two distinct error IDs for "no-op" — PUT /management/v1/policies/label while policy.isDefault === true returns 404 "Private Label Policy not found (Org-0K9dq)" (you must POST first to create the org override). Re-runs of an idempotent bootstrap trigger 400 "Private Label Policy has not been changed (Org-8nfSr)" — different from the generic COMMAND-1m88i (quirk 14). The bootstrap should: GET first → branch on isDefault → treat both error IDs as no-ops. → branding.md §"Quirk 20".
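
  A sketch of that branch-and-tolerate flow; the base URL and headers are placeholders, and matching errors on the response text is an assumption:

  ```ts
  // GET the current policy, POST to create the org override, PUT to update it,
  // and treat both "no-op" error IDs as success on re-runs.
  async function upsertLabelPolicy(base: string, headers: Record<string, string>, policy: unknown) {
    const current = await fetch(`${base}/management/v1/policies/label`, { headers }).then((r) => r.json());
    const method = current.policy?.isDefault ? "POST" : "PUT";
    const res = await fetch(`${base}/management/v1/policies/label`, {
      method, headers, body: JSON.stringify(policy),
    });
    if (res.ok) return; // remember to _activate separately after the first successful POST
    const text = await res.text();
    if (text.includes("Org-8nfSr") || text.includes("Org-0K9dq")) return; // treat as no-op
    throw new Error(`label policy upsert failed: ${res.status} ${text}`);
  }
  ```
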
- Quirk 21: The assets path changed in Zitadel v4 to /assets/v1/org/policy/label/... — singular org, no s/me. The /assets/v1/orgs/me/policy/label/logo path found in older Zitadel v1/v2/v3 docs/examples returns HTTP 405 Method Not Allowed on v4. Confusing because it looks like "the endpoint exists but the method is wrong" — in reality it exists at a different path. Correct endpoints: logo, logo/dark, icon, font. The x-zitadel-orgid header is still required. Multipart with field name file. → branding.md §"Quirk 21".
- Quirk 22: Custom login text uses /management/v1/text/login/{lang}, and Zitadel only accepts short language codes — not /policies/custom_login_text/{lang} (the old path, which returns 404). Only pt, en, de etc. are accepted — pt-BR returns 400 LANG-lg4DP "Language is not supported". Resolution of Accept-Language pt-BR → pt happens server-side. PUT is mergeable: fields you omit keep the default i18n text; to blank a field send an explicit empty string. → branding.md §"Quirk 22".
- Quirk 23: Single-app → multi-app YAML refactor: the dev.sh env vars must OVERRIDE the YAML — when you evolve the bootstrap to read declarative applications[].redirectUris from YAML, the YAML naturally lists canonical hosts (localhost:5173, app.example.com). But dev workflows with LAN HTTPS (mkcert + reverse proxy, dynamic host via a .sslip.io IP) populate OIDC_REDIRECT_URIS=https://<lan-ip>.sslip.io:5443/auth/callback,... — a host that changes with every IP and is never in the YAML. If the YAML wins, dev login hits {"error":"invalid_request","error_description":"The requested redirect_uri is missing in the client configuration"} on the Zitadel callback. Same family as quirks 10 (silent-renew in redirectUris) and 18 (byte-match), but with a different trigger: a refactor regression, not initial config. Canonical fix: precedence env > YAML > hardcoded fallback — the dev.sh env (when explicitly set) wins; the YAML is the production fallback (where the env is unset and the hosts are canonical). → troubleshooting.md §"redirect_uri missing in client configuration after multi-app refactor".
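
  A sketch of that precedence rule inside the bootstrap; the YAML shape and fallback URI are assumptions:

  ```ts
  // env (explicitly set by dev.sh) > YAML (canonical prod hosts) > hardcoded fallback.
  function resolveRedirectUris(yamlApp: { redirectUris?: string[] }): string[] {
    const fromEnv = process.env.OIDC_REDIRECT_URIS?.split(",").map((s) => s.trim()).filter(Boolean);
    if (fromEnv?.length) return fromEnv;                           // dev.sh / LAN HTTPS case
    if (yamlApp.redirectUris?.length) return yamlApp.redirectUris; // declarative prod hosts
    return ["http://localhost:5173/auth/callback"];                // last-resort fallback (example)
  }
  ```
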
- Quirk 24: Zitadel v2.66.x start-from-init: the ZITADEL_MASTERKEY env var is not read reliably — pass it via CLI flag — on v2.66.10 (and potentially other patches in the 2.66 line) the start-from-init subcommand fails with panic: No master key provided ... masterkey must either be provided by file path, value or environment variable even when ZITADEL_MASTERKEY is correctly injected into the container (verifiable via docker inspect <ctr> --format '{{range .Config.Env}}{{println .}}{{end}}' — exactly 32 chars, no CR, no leading whitespace, no null bytes). On v4.x the os.Getenv fallback works; on v2.66 it does not. Symptom: container in a restart loop with RestartCount climbing, exit 2, and the panic above on every cycle (~60s, due to the retry policy). Canonical fix: pass --masterkey ${ZITADEL_MASTERKEY} in the compose command: (the flag takes precedence over the env var). Trade-off: the flag shows up in docker inspect and ps aux, so the .env must be chmod 600 and the host dedicated/trusted. If that becomes a problem, migrate to --masterkeyFile /run/secrets/zitadel_masterkey + a Docker secret. → docker-compose-bootstrap.md §"Quirk 24 — masterkey via flag em v2.66.x".
- Quirk 25: Login UI v2 on v4 is a separate Next.js container — unlike Login UI v1, which lives embedded in the Zitadel binary at /ui/login/, Login UI v2 (/ui/v2/login) is served by the ghcr.io/zitadel/zitadel-login image. On v3+ the instance flag loginV2.required defaults to true — the SPA's signinRedirect lands on /ui/v2/login by default, and if your compose only has the zitadel container you get 404 {"code":5,"message":"Not Found"}. Two ways out: (A) deploy the zitadel-login container + a reverse proxy routing /ui/v2/login → zitadel-login:3000 and everything else → zitadel:8080; (B) PUT /v2/features/instance {"loginV2":{"required":false}} and stay on Login UI v1 indefinitely. Pick one — flip-flopping between the two without thinking produces sporadic 404s. → migration-v2-to-v4.md §3.2-3.3 + docker-compose-bootstrap.md §8.
- Quirk 26: Idempotency on v2 via deterministic IDs instead of search-then-create — on v1 the idiomatic pattern was POST /resource/_search filtering by name → if 0 hits, POST /resource (Zitadel generates the ID). On v2 you can pass your own userId/applicationId/projectId in the body; creating with an ID that already exists returns ALREADY_EXISTS, which you treat as success. Result: 1 round-trip instead of 2 per resource. Worth it for multi-app bootstraps (5 apps × 2 round-trips × N boots = measurable noise on hot-deploy). For resources with a human-readable name and no stable ID (org "JRC", project "ERP-JRC"), the v1 search-then-create pattern remains valid on v2 as well. → api-v1-to-v2-mapping.md §5.
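
  A sketch of the one-round-trip pattern; the concrete endpoint and the way ALREADY_EXISTS surfaces (status code vs. error text) are assumptions to verify per service:

  ```ts
  // Send a deterministic ID and accept "already exists" as success on re-runs.
  async function createWithDeterministicId(url: string, body: unknown, headers: Record<string, string>) {
    const res = await fetch(url, { method: "POST", headers, body: JSON.stringify(body) });
    if (res.ok) return "created";
    const text = await res.text();
    if (res.status === 409 || /already.?exists/i.test(text)) return "already-existed"; // idempotent re-run
    throw new Error(`create failed: ${res.status} ${text}`);
  }
  ```
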
- Quirk 27: The contextual orgId moved from header to body in v2 — v1 required x-zitadel-orgid: 379... on every org-scoped call. v2 puts the equivalent in the body, usually as organizationId (sometimes nested under org.id or resourceOwner). The most common failure mode during a refactor: dropping the header and forgetting to add the field to the body — the symptom is INVALID_ARGUMENT: missing organization_id. The header is harmless on v2 calls (it is ignored), so during the transition you can keep it set globally on the HTTP client without breaking anything. → api-v1-to-v2-mapping.md §3.
- Quirk 28:
Login UI v2 auto-provisioning is broken in v4.15.0 — the FirstInstance envs that should auto-create the IAM_LOGIN_CLIENT service user + write its PAT to a shared volume don't work in v4.15. Two failure modes, no winner: (A) setting only ZITADEL_FIRSTINSTANCE_LOGINCLIENTPATPATH is a no-op — the server has no service user to create, so the volume stays empty and the zitadel-login container loops on stderr ZITADEL_SERVICE_USER_TOKEN_FILE=/login-client/login-client.pat is set. Awaiting file and reading token. indefinitely; /ui/v2/login returns 404. (B) adding the ZITADEL_FIRSTINSTANCE_ORG_LOGINCLIENT_MACHINE_USERNAME / MACHINE_NAME / PAT_EXPIRATIONDATE envs to actually provision the service user causes setup migration 03_default_instance to fail with Errors.Instance.Domain.AlreadyExists / unique_constraints_pkey (instance domain reserved twice — Human admin + LoginClient race). Container enters restart loop and never becomes healthy. Tracked in zitadel/zitadel#8910 and #9293; partial fix in PR #10518 may not be in every v4.x patch. Pragmatic mitigations: (1) Path B of Quirk 25 — PUT /v2/features/instance {"loginV2":{"required":false}} and pin OIDC apps with loginVersion: { loginV1: {} }. Login UI v1 stays default with branding intact via label policy. The zitadel-login container can stay deployed (idle, looping benignly) so promoting it is a single config flip when upstream stabilizes. (2) Provision in bootstrap — after Zitadel is healthy, your bootstrap creates the machine user login-client + role IAM_LOGIN_CLIENT + PAT via the regular UserService/AuthorizationService API, then writes the PAT to a bind-mounted file the zitadel-login container reads. More work but independent of upstream. → troubleshooting.md §"zitadel-login: Awaiting file and reading token (forever)" + migration-v2-to-v4.md §3.4.
- Quirk 29:
OIDC client_id is the numeric clientId from oidcConfiguration, NOT the deterministic applicationId UUID you supplied — even when you pass your own applicationId (UUID v7) in CreateApplicationRequest for idempotency, Zitadel auto-generates a separate numeric clientId (e.g. 371898282416275459) for OAuth/OIDC. The applicationId is the resource handle for v2 APIs (UpdateApplication, GetApplication, DeleteApplication); the clientId is what /oauth/v2/authorize?client_id=… requires. Frontend VITE_OIDC_CLIENT_ID MUST be the numeric value. Symptom: OIDC authorize returns 400 invalid_request "Errors.App.NotFound" with the UUID; SPA login fails right after a clean cutover even though the bootstrap log shows [app] created appId=<uuid>. Fix in CD: bootstrap output exposes both (appId=… clientId=…); a CD step extracts the numeric clientId from oidcConfiguration of the bootstrap response and overwrites VITE_OIDC_CLIENT_ID (image-rebuild required — VITE_* are baked at build time). Backend AUTH_AUDIENCE for JWT validation can be the deterministic projectId (it appears in the JWT aud claim alongside the clientId). → troubleshooting.md §"OIDC client_id mismatch — Errors.App.NotFound".
- Quirk 30:
ZITADEL_BOOTSTRAP_ENV-style env that picks dev-vs-prod deterministic IDs from YAML must be set explicitly in CD — silent default to dev is a footgun — when your bootstrap script has applications[].ids.dev and applications[].ids.prod blocks and selects between them via process.env.ZITADEL_BOOTSTRAP_ENV ?? 'dev', a CD pipeline that doesn't set the env creates prod entities with dev IDs. The frontend secret has prod IDs, so the SPA's client_id doesn't match anything in the IdP — same Errors.App.NotFound symptom as Quirk 29 but with a different root cause and trickier fix (already-created entities have wrong IDs and must be wiped + recreated). Mitigations: (1) require the env at script start, throw on undefined rather than defaulting; (2) set ZITADEL_BOOTSTRAP_ENV: prod (or equivalent) explicitly on the bootstrap container in docker-compose.prod.yml; (3) document the env in the runbook. → troubleshooting.md §"Wrong-environment IDs in prod IdP".
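
  A sketch of mitigation (1); the accepted values are assumptions to match to your YAML blocks:

  ```ts
  // Fail fast instead of silently defaulting to dev IDs.
  const env = process.env.ZITADEL_BOOTSTRAP_ENV;
  if (env !== "dev" && env !== "prod") {
    throw new Error(`ZITADEL_BOOTSTRAP_ENV must be "dev" or "prod" (got "${env ?? "unset"}")`);
  }
  export const BOOTSTRAP_ENV: "dev" | "prod" = env;
  ```
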
- Quirk 31:
ZITADEL_DEFAULTINSTANCE_FEATURES_LOGINV2_REQUIRED=false at FirstInstance time breaks the chicken-and-egg of Quirks 25 + 28 in CD cutovers — when the upstream Login UI v2 auto-provisioning bug (Quirk 28) blocks /ui/v2/login, and the operator can't login to the console to create the IAM_OWNER PAT manually because the OIDC redirect to /ui/v2/login 404s — the cleanest break is to set the feature flag in DefaultInstance config (env on the zitadel server) so the instance is born with loginV2.required=false from boot zero. The OIDC authorize endpoint then redirects to /ui/login (v1, embedded in the binary) immediately, no PAT required. Bootstrap's PUT /v2/features/instance call against the same flag becomes a no-op idempotency check. Why this is better than waiting for bootstrap to flip the flag: bootstrap needs a PAT, PAT requires console login, console login redirects to broken /ui/v2/login → without DefaultInstance pre-config, you're stuck. With it, the loop opens at the right place. → docker-compose-bootstrap.md §"DefaultInstance feature flags pre-config".
- Quirk 32:
nginx-proxy: when 2 containers share a VIRTUAL_HOST and one declares VIRTUAL_PATH while the other doesn't, the no-VIRTUAL_PATH container is silently ignored — only the more-specific path-route is registered in the generated nginx.conf. Every request that doesn't match the prefix returns 404 (e.g., /.well-known/openid-configuration, /oauth/v2/*, /ui/console). Symptom in nginx logs: trailing "-" upstream means no route was matched. Fix: declare VIRTUAL_PATH=/ + VIRTUAL_DEST=/ on the "default" container (the one serving root) so nginx-proxy registers it as a less-specific location alongside the prefix-routed sibling. Discovered when adding the zitadel-login container (VIRTUAL_PATH=/ui/v2/login) to an idp.jrcbrasil.com host that previously had only zitadel (no VIRTUAL_PATH). Applies to any nginx-proxy split (e.g., API + admin UI on same host). → docker-compose-bootstrap.md §"nginx-proxy: split VIRTUAL_HOST + VIRTUAL_PATH".
- Quirk 33:
Console v4 "Add Human User" form auto-truncates Username to the email's local-part — typing the email into the E-Mail field auto-fills Username with everything before @ (e.g., tadeu.mezzavilla@jrcbrasil.com → tadeu.mezzavilla). Combined with userLoginMustBeDomain=false (the instance default), the canonical loginName becomes just the local-part. The end user then tries to login with their full email and the IdP responds "O usuário não pôde ser encontrado" / "User could not be found" — even though the user exists, has the right grants, and email is correct in the contact section. API creation does NOT trigger this (AddHumanUserRequest honors whatever username you pass). Operationally this is the #1 cause of "I created the user but they can't login" reports in self-hosted setups where TI provisions humans via the Console. Fix at creation time: after typing the e-mail, manually clear the Username field and re-type the full email. Retroactive fix: detail page → PROFILE → click modify next to Username → enter full email → Save. The loginName is recomputed on save. → troubleshooting.md §"User created via Console can't login by email".
- Quirk 34:
userLoginMustBeDomain=true (Settings → Domain settings → "Add Organization Domain as suffix to loginnames") stamps loginNames irreversibly on existing users — flipping the org-level checkbox rewrites every existing user's loginName to userName + "@" + orgPrimaryDomain. Disabling and clicking "Reset to Instance default" does NOT undo the stamp — affected users keep the suffix indefinitely. Doubly hostile in self-hosted Zitadel because the org's primaryDomain auto-generates as <orgSlug>.<externalDomain> (e.g., jrc.idp.jrcbrasil.com) — completely unrelated to the company's email domain (e.g., jrcbrasil.com). So enabling the flag without first adding a custom domain via Organization Domains → Add (and verifying it) produces nonsensical loginNames like user@jrc.idp.jrcbrasil.com, or worse, double-suffixed user@jrcbrasil.com@jrc.idp.jrcbrasil.com if the userName already had @. Recovery for stamped users: edit the affected user's userName to a temporary value, Save, then change it back to the desired value, Save again — this forces Zitadel to recompute the loginName under the (now disabled) policy. Last resort: delete + recreate. Lesson: do NOT enable this flag in production until you have added and verified a custom domain matching your users' email domain. The setting is only sensible as a multi-tenant disambiguator across orgs that share usernames. → troubleshooting.md §"Login Name stamped with org domain after policy toggle".
- Quirk 35:
Console "Add Human User" leaves "E-mail Verificado" / "Email verified" unchecked by default — first login then triggers an SMTP verification code that never arrives if SMTP isn't configured — the form has the verified checkbox immediately under the E-Mail field, and it's easy to miss. When unchecked, the freshly-created user logs in fine up to the password step, then gets stuck on a "Verifique seu e-mail / Code sent to your email" screen. In self-hosted deployments where SMTP is commonly deferred ("we'll wire it up later"), no code is ever delivered and the user is permanently stuck. API creation avoids this entirely — AddHumanUserRequest lets you set email.isVerified: true explicitly in the payload, which bootstrap scripts typically do. Fix retroactively: detail page → CONTACT INFORMATION → pencil icon next to email → modal opens → tick "Is verified" → Save. Do NOT use "Resend Code" — that hits the same broken SMTP path. Better long-term: configure SMTP (the entire invite-and-self-service flow depends on it: password reset, MFA enrollment notifications, email change verification). → troubleshooting.md §"User stuck on email verification code after manual creation".
- Quirk 36:
Backend container that validates JWT needs extra_hosts: idp.<domain>:host-gateway when IdP is on the same host with unreliable hairpin NAT — jose.createRemoteJWKSet(new URL(jwksUrl)) does an ordinary Node fetch on every JWKS reload (default cacheMaxAge: 600s). When the backend container resolves idp.<domain> via public DNS, it gets the host's external IP — and on VPS providers where hairpin NAT (DNS → external IP → loopback to local nginx-proxy) is flaky, every reload after the 10-minute TTL fails with a network error. The failure surfaces as JOSEError from the catch block, which the validator throws as InvalidTokenError → 401. Symptom: backend works perfectly for ~10 minutes after each restart, then 401-storms every authenticated request even though JWTs are decoded-by-hand correct (iss/aud/exp/kid all valid; the kid is verifiably present in the JWKS endpoint when fetched from outside the container). Family of quirks 12 (self-signed JWKS) and 13 (volume reset stale clientId) — same symptom, different cause; this is the third documented cause of "401-storm with apparently-valid JWT". Fix: add extra_hosts: ["idp.<domain>:host-gateway"] to the backend service in docker-compose.prod.yml so the container resolves the IdP hostname to the docker-bridge IP — same path external traffic takes. Diagnostic in 3 commands: docker exec <backend> getent hosts idp.<domain> (should be the bridge IP), docker exec <backend> wget -qO- $JWKS_URL | head -c 200 (should be valid JWKS), docker logs <backend> --tail 100 | grep '\[auth\] JOSE' (any code=ERR_JOSE_* confirms the JWKS-after-TTL failure mode). → token-validation.md §"Network reachability to the IdP from the JWT validator" + troubleshooting.md §"401 storm starting ~10 min after backend restart".
- Quirk 37:
Frontend defense-in-depth against the 401 → silent-renew → RT-reuse → session-revoke cascade requires THREE coordinated layers, not just dedupe — Zitadel has refresh-token rotation with reuse detection enabled by default; if the SPA fires signinSilent twice with the same RT (race between concurrent 401-retry callers OR race between 401-retry and addAccessTokenExpiring handler), Zitadel detects reuse and revokes the entire session — every access token AND refresh token in that session becomes invalid. After that, re-login creates a fresh session but if the underlying root cause persists, a new storm revokes that too → infinite loop. The three layers that close the cascade: (L1) Dedupe lives in ApiClient, not in the provider — instance field pendingRenew: Promise<string|null>|null cleared in .finally(); expose apiClient.refreshToken() public method that returns the deduped Promise; survives provider remount (test harnesses, Storybook), is testable in isolation, AND is shared between 401-retry path and addAccessTokenExpiring handler. Provider-level useRef dedupe is fragile because the expiring handler typically calls signinSilent() directly, bypassing the dedupe. (L2) TanStack Query retry predicate filters ApiError 401 — 401 is not transient (the user needs re-auth, not retry), and retrying amplifies the storm by N requests/refresh attempts. Predicate: (failureCount, error) => error instanceof ApiError && error.status === 401 ? false : failureCount < N. (L3) Listener for the apiclient:unauthorized CustomEvent in <AuthProvider> with isAuthRoute() early-return + state: { returnTo } on signinRedirect — without the guard, a 401 bubbling up from /auth/callback or /login itself triggers another signinRedirect → IdP rejects → 401 → loop; without state.returnTo the user lands on home// after re-auth instead of the page they were on. The three layers are independent and addressing only one of them leaves a known leak path. → spa-recipes.md §"Recipe — Defense in depth against 401-storm-revokes-session".
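
  A sketch of layers L1 and L2 only (L3 is plain event wiring in the provider); the ApiError class and the wiring around UserManager are simplified assumptions:

  ```ts
  import { QueryClient } from "@tanstack/react-query";
  import type { UserManager } from "oidc-client-ts";

  export class ApiError extends Error {
    constructor(public status: number) { super(`HTTP ${status}`); }
  }

  export class ApiClient {
    private pendingRenew: Promise<string | null> | null = null;
    constructor(private userManager: UserManager) {}

    // L1: every caller (401-retry and the accessTokenExpiring handler) shares one in-flight renew.
    refreshToken(): Promise<string | null> {
      this.pendingRenew ??= this.userManager
        .signinSilent()
        .then((user) => user?.access_token ?? null)
        .finally(() => { this.pendingRenew = null; });
      return this.pendingRenew;
    }
  }

  // L2: 401 is not transient; retrying it only amplifies the storm.
  export const queryClient = new QueryClient({
    defaultOptions: {
      queries: {
        retry: (failureCount, error) =>
          error instanceof ApiError && error.status === 401 ? false : failureCount < 3,
      },
    },
  });
  ```
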
- Quirk 38: CI bind mount perms make ZITADEL_FIRSTINSTANCE_PATPATH fail with EACCES, which then cascades into unique_constraints_pkey on retry — looks like Quirk 28 but isn't — ZITADEL_FIRSTINSTANCE_PATPATH=/current-dir/admin.pat writes via the Zitadel container's uid (default 1000) into a host bind mount (./zitadel/local:/current-dir:rw). On a GitHub Actions ubuntu-latest runner the host directory is created post-checkout owned by runner:docker (uid 1001) with mode 0755 — uid 1000 can't write. The first boot fails on 03_default_instance migration with open /current-dir/admin.pat: permission denied, but Zitadel's setup failed, skipping cleanup leaves partial state (the instance_domain=127.0.0.1.sslip.io record) in the eventstore. Every subsequent retry under restart: always then fails with unique_constraints_pkey on the same migration — looking exactly like Quirk 28 (#8910/#9293) at the bottom of the logs, but caused by a host-perm issue, not by the upstream LoginClient envs. Diagnose: ALWAYS scroll up to the FIRST migration attempt — if it failed with permission denied, this is Quirk 38, not 28; the cure is different. Fix (CI-only — dev's host folder gets your user's perms via your scaffolding):

  ```yaml
  - name: Pre-create writable bind mount for Zitadel admin.pat
    run: |
      mkdir -p infra/docker/zitadel/local
      chmod 0777 infra/docker/zitadel/local
  ```

  chmod 0777 is idiomatic for ephemeral CI bind mounts and avoids hardcoding the container uid. Don't try to swap to a named volume in CI without doing it in dev too — bootstrap.json and other host-readable artifacts depend on the bind mount layout. → troubleshooting.md §"migration failed name=03_default_instance err.parent=...permission denied" + docker-compose-bootstrap.md §"Smoke-e2e plumbing checklist for GHA".
- Quirk 39:
Default Zitadel password policy requires all FOUR character classes — openssl rand -hex is lowercase-only and trips it deterministically — fresh instances enforce upper + lower + digit + symbol on AddHumanUser (and any password-bearing API). Common CI shortcut openssl rand -hex 16 only emits [0-9a-f] and dies with 400 invalid_argument: Password must contain upper case (COMMA-VoaRj) on the seed user step. "Fix" by adding only an uppercase prefix and you trip the next class missing. Fix: structured generator with a 4-class prefix — Aa1! covers all four classes deterministically; the random tail brings entropy:

  ```bash
  RAND_TAIL="$(LC_ALL=C tr -dc 'A-Za-z0-9' </dev/urandom | head -c 28)"
  export ZITADEL_SEED_USER_PASSWORD="Aa1!${RAND_TAIL}"
  ```

  The same pattern works against any password-policy-shaped service. Avoid openssl rand -base64 (padding = and slash variability) — alphanum + structured prefix is more portable. If you've changed the default policy, mirror that in the script too. → troubleshooting.md §"Bootstrap fails with COMMAND-VoaRj" + docker-compose-bootstrap.md §"Smoke-e2e plumbing checklist for GHA".
- Quirk 40:
zitadel-login (Login UI v2 Next.js container) takes 90+ seconds to first healthcheck render on small CI runners — up --wait for the WHOLE stack times out before bootstrap or REST tests get to start — its healthcheck (wget --spider http://localhost:3000/) only passes after Next.js bootstrap completes; on ubuntu-latest shared (2 vCPU) that's ~90s+. With --wait-timeout 120 and start_period: 30s the wait abandons before the container goes Healthy — even though zitadel, zitadel-init, and zitadel-db were Healthy long before. Bootstrap and integration tests usually don't need Login UI — bootstrap-zitadel.ts reads login-client.pat from the named volume populated by the zitadel service, and REST tests hit zitadel:8080 directly. Fix — scope --wait to only the services tests actually exercise:

  ```yaml
  - name: Boot Zitadel stack
    run: |
      docker compose -f infra/docker/docker-compose.zitadel.yml up -d --wait --wait-timeout 120 \
        zitadel-db zitadel-init zitadel
  ```

  Compose only waits for the listed services to be Healthy; zitadel-login and mailpit are simply not started. Dev's default up keeps everything available for browser smoke. If your CI specifically needs Login UI healthy (e.g., a Playwright login spec), bump --wait-timeout to ~240s — same wait, more headroom. Always include zitadel-login in your on-failure log dump (docker compose logs zitadel-login || true) for debugging, even when you don't --wait for it. → troubleshooting.md §"zitadel-login container never goes Healthy in CI" + docker-compose-bootstrap.md §"Smoke-e2e plumbing checklist for GHA".
When the references in this skill diverge from upstream, trust the upstream docs but raise a note — Zitadel evolves quickly and these references will drift.