debug-deploy
| name | debug-deploy |
| description | Debug GitHub Actions workflow failures and Terraform errors. Use when deployment failed, Terraform state lock, CI/CD pipeline errors, or troubleshooting deploy.yml. |
This skill helps diagnose and fix issues with the GitHub Actions deploy workflow for the Learn to Cloud app.
gh run list --workflow=deploy.yml --limit 5
gh run view <run-id> --log-failed
Or use the debug script in this skill's folder:
.github/skills/debug-deploy/debug-deploy.sh logs
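The two gh commands above can be chained so the run ID does not have to be copied by hand. This is a minimal sketch, not the contents of debug-deploy.sh: the function names are hypothetical, and it assumes gh is authenticated for this repository.

```bash
# Sketch only: function names are hypothetical, the gh subcommands and flags are real.

# Print the ID of the newest failed deploy.yml run (prints nothing if there is none).
latest_failed_run_id() {
  gh run list --workflow=deploy.yml --status failure --limit 1 \
    --json databaseId --jq '.[0].databaseId // empty'
}

# Show only the failed steps' logs for that run.
show_latest_failed_logs() {
  run_id="$(latest_failed_run_id)"
  if [ -z "$run_id" ]; then
    echo "No failed deploy.yml runs found." >&2
    return 1
  fi
  gh run view "$run_id" --log-failed
}
```

After sourcing the snippet, calling show_latest_failed_logs prints the failed-step logs of the most recent failed run.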
Look for these common patterns in the logs:
Pattern: Error acquiring the state lock or state blob is already locked
Cause: Previous workflow was cancelled mid-execution, leaving the state locked.
Fix:
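1. Find the lock ID in the error message (a UUID such as efd4cede-d5a2-61c3-31db-462852989510).
2. Force-unlock the state: cd infra && terraform force-unlock -force <lock-id>
3. Re-run the workflow: gh run rerun <run-id>
Pattern: AuthorizationFailed, AADSTS, unauthorized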
Fix: Check the OIDC deployment configuration: AZURE_CLIENT_ID and AZURE_TENANT_ID secrets, plus the AZURE_SUBSCRIPTION_ID repository variable. The federated credential or Azure RBAC assignment may need updating.
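A quick way to confirm that the names listed above exist is to compare them against the output of gh secret list and gh variable list. The sketch below assumes the secrets are stored at repository level (environment-scoped secrets would need gh secret list --env <name>) and that the name is the first column of each output line. The function names are hypothetical, and the check verifies only that the names exist, not that their values are correct.

```bash
# Sketch only: function names are hypothetical.

# Read a listing on stdin (name in the first column) and print every
# required name, passed as arguments, that is missing from it.
missing_names() {
  listing="$(cat)"
  for name in "$@"; do
    printf '%s\n' "$listing" | awk '{print $1}' | grep -qxF -- "$name" || echo "$name"
  done
}

# Report which OIDC-related secrets or variables are not configured.
check_oidc_config() {
  gh secret list | missing_names AZURE_CLIENT_ID AZURE_TENANT_ID
  gh variable list | missing_names AZURE_SUBSCRIPTION_ID
}
```

If check_oidc_config prints nothing, all three names are present; the federated credential and the Azure RBAC assignment still need to be checked on the Azure side.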
Pattern: ResourceNotFound or does not exist
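Fix: The resource was deleted outside Terraform. Run terraform apply -refresh-only (the replacement for the deprecated terraform refresh) to reconcile the state, or re-import the resource.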
Pattern: QuotaExceeded
Fix: Request quota increase in Azure portal or clean up unused resources.
Pattern: The "Run database migrations" step fails, /ready returns 503 after deploy, or Alembic reports a database error.
Cause: The Azure Container Apps migration job failed before the API image was updated. Common causes are missing PostgreSQL role mapping, migration SQL errors, or job image/env override issues.
Fix:
1. Check that migration_identity_principal_id is mapped to the Terraform migration_postgres_role output in PostgreSQL.
2. Check that az containerapp job start passes the SHA image, alembic upgrade head, DB env vars, AZURE_CLIENT_ID, and --registry-identity.
3. Check that alembic/env.py still uses psycopg2 and acquires the advisory lock before migrations.
Pattern: FAILED, pytest, AssertionError
Fix: Run tests locally: (cd api && uv run pytest tests/ -v) and (cd packages/learn-to-cloud-shared && uv run pytest tests/ -v)
Pattern: ruff, lint error
Fix: Run linter locally: (cd api && uv run ruff check . ../packages/learn-to-cloud-shared)
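The patterns listed in this section can be checked mechanically against a saved log. The following is a minimal sketch of that idea, not the detection logic of debug-deploy.sh: the function names and category labels are hypothetical, and the regular expressions are taken directly from the patterns above.

```bash
# Sketch only: function names and labels are hypothetical.

# Return success if the log text ($1) matches the extended regex ($2).
log_has() {
  printf '%s\n' "$1" | grep -Eq -- "$2"
}

# Read log text on stdin and print one label per matching category.
classify_log() {
  log="$(cat)"
  log_has "$log" 'Error acquiring the state lock|state blob is already locked' && echo "state-lock"
  log_has "$log" 'AuthorizationFailed|AADSTS|unauthorized' && echo "auth"
  log_has "$log" 'ResourceNotFound|does not exist' && echo "missing-resource"
  log_has "$log" 'QuotaExceeded' && echo "quota"
  log_has "$log" 'Run database migrations|[Aa]lembic' && echo "migration"
  log_has "$log" 'FAILED|pytest|AssertionError' && echo "tests"
  log_has "$log" 'ruff|lint error' && echo "lint"
  return 0
}
```

For example, gh run view <run-id> --log-failed | classify_log prints the categories worth checking first. A log can match more than one category, so the labels are hints rather than a diagnosis.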
After fixing the issue:
gh run rerun <run-id>
Or watch the progress:
gh run watch <run-id>
| Command | Description |
|---|---|
| ./.github/skills/debug-deploy/debug-deploy.sh status | Show recent workflow runs |
| ./.github/skills/debug-deploy/debug-deploy.sh logs | View and analyze failed logs |
| ./.github/skills/debug-deploy/debug-deploy.sh logs <id> | View specific run's failed logs |
| ./.github/skills/debug-deploy/debug-deploy.sh unlock | Fix Terraform state lock |
| ./.github/skills/debug-deploy/debug-deploy.sh rerun | Re-run most recent failed workflow |
| ./.github/skills/debug-deploy/debug-deploy.sh watch | Watch running workflow |
The debug-deploy.sh script automates these steps and detects common failure patterns in the logs.
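To illustrate the kind of step that can be automated for the state-lock fix, the sketch below pulls the lock ID out of a failed log. It is not taken from debug-deploy.sh. The function name is hypothetical, and the code assumes Terraform's usual lock error layout, in which a "Lock Info:" line is followed by an "ID:" line carrying the UUID.

```bash
# Sketch only: extract_lock_id is a hypothetical helper name.
# Assumes the log contains "Lock Info:" followed by a line with "ID: <uuid>".

# Read log text on stdin and print the first lock ID found (nothing if absent).
extract_lock_id() {
  awk '
    /Lock Info:/ { in_lock = 1; next }
    in_lock && /ID:/ {
      line = $0
      sub(/.*ID:[ \t]*/, "", line)   # drop any log prefix and the "ID:" label
      if (match(line, /^[0-9a-f]+(-[0-9a-f]+)+/)) {
        print substr(line, RSTART, RLENGTH)
        exit
      }
    }
  '
}
```

The result can be passed to the unlock command described earlier, for example lock_id="$(gh run view <run-id> --log-failed | extract_lock_id)" followed by (cd infra && terraform force-unlock -force "$lock_id"). Confirm the ID against the error message before forcing the unlock.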
The workflows are configured with these safeguards:
- cancel-in-progress: true - Cancels in-progress runs when a new push arrives (in deploy.yml)
- -lock-timeout=120s - Waits for locks instead of failing immediately (in deploy.yml terraform job)