debug-deploy
| Field | Value |
|---|---|
| name | debug-deploy |
| description | Debug GitHub Actions workflow failures and Terraform errors. Use when deployment failed, Terraform state lock, CI/CD pipeline errors, or troubleshooting deploy.yml. |
This skill helps diagnose and fix issues with the GitHub Actions deploy workflow for the Learn to Cloud app.
List recent runs of the deploy workflow, then view the failed logs for a specific run:
gh run list --workflow=deploy.yml --limit 5
gh run view <run-id> --log-failed
Or use the debug script in this skill's folder:
.github/skills/debug-deploy/debug-deploy.sh logs
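
To avoid copying a run ID by hand, the two gh commands above can be chained. The following is a minimal sketch, assuming the most recent failed run of deploy.yml is the one of interest. The function names and the GH_BIN override are hypothetical and introduced here for illustration (the override only exists so the functions can be exercised without network access); they are not taken from debug-deploy.sh.

```shell
# Print the database ID of the most recent failed deploy.yml run.
# GH_BIN is a hypothetical override for dry runs; it defaults to gh.
latest_failed_run_id() {
  "${GH_BIN:-gh}" run list --workflow=deploy.yml --status failure \
    --limit 1 --json databaseId --jq '.[0].databaseId'
}

# Show only the failed steps' logs for that run.
show_latest_failed_logs() {
  run_id="$(latest_failed_run_id)"
  if [ -z "$run_id" ] || [ "$run_id" = "null" ]; then
    echo "no failed deploy.yml runs found" >&2
    return 1
  fi
  "${GH_BIN:-gh}" run view "$run_id" --log-failed
}
```

Calling show_latest_failed_logs from the repository root would then print the same output as gh run view <run-id> --log-failed for the newest failed run.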
Look for these common patterns in the logs:
Pattern: Error acquiring the state lock or state blob is already locked
Cause: Previous workflow was cancelled mid-execution, leaving the state locked.
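Fix:
1. Find the lock ID in the error message (a UUID such as efd4cede-d5a2-61c3-31db-462852989510).
2. After confirming that no other run is still using the state, force-unlock it: cd infra && terraform force-unlock -force <lock-id>
3. Re-run the workflow: gh run rerun <run-id>

Pattern: AuthorizationFailed, AADSTS, unauthorized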
Fix: Check the OIDC deployment configuration: AZURE_CLIENT_ID and AZURE_TENANT_ID secrets, plus the AZURE_SUBSCRIPTION_ID repository variable. The federated credential or Azure RBAC assignment may need updating.
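
Returning to the state lock case: the lock ID that terraform force-unlock needs can be pulled out of the failed log instead of being copied by hand. This is a rough sketch, assuming the error output contains Terraform's usual "Lock Info" block with an "ID:" line followed by a 36-character ID. The function name extract_lock_id is hypothetical, and logs containing colour codes may need cleaning first.

```shell
# Read Terraform error output on stdin and print the lock ID.
# Assumes a line whose whitespace-separated fields include "ID:" followed
# by a 36-character hex-and-dash token, as in Terraform's "Lock Info" block.
extract_lock_id() {
  awk '{
    for (i = 1; i < NF; i++) {
      if ($i == "ID:" && $(i + 1) ~ /^[0-9a-f-]+$/ && length($(i + 1)) == 36) {
        print $(i + 1)
        exit
      }
    }
  }'
}
```

For example, gh run view <run-id> --log-failed | extract_lock_id would print the ID to pass to terraform force-unlock. Matching the exact field "ID:" keeps names that merely end in ID, such as AZURE_SUBSCRIPTION_ID:, from being picked up by mistake.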
Pattern: ResourceNotFound or does not exist
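Fix: The resource was deleted outside Terraform. Run terraform refresh (or terraform apply -refresh-only, which replaces the deprecated refresh command in Terraform 0.15.4 and later) to reconcile the state, or re-import the resource with terraform import.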
Pattern: QuotaExceeded
Fix: Request quota increase in Azure portal or clean up unused resources.
Pattern: The Run database migrations step fails, /ready returns 503 after deploy, or Alembic reports a database error.
Cause: The Azure Container Apps migration job failed before the API image was updated. Common causes are missing PostgreSQL role mapping, migration SQL errors, or job image/env override issues.
Fix: Confirm each of the following:
- migration_identity_principal_id is mapped to the Terraform migration_postgres_role output in PostgreSQL.
- az containerapp job start passes the SHA image, alembic upgrade head, DB env vars, AZURE_CLIENT_ID, and --registry-identity.
- alembic/env.py still uses psycopg2 and acquires the advisory lock before migrations.

Pattern: FAILED, pytest, AssertionError
Fix: Run tests locally: (cd api && uv run pytest tests/ ../packages/learn-to-cloud-shared/tests -v)
Pattern: ruff, lint error
Fix: Run linter locally: (cd api && uv run ruff check . ../packages/learn-to-cloud-shared)
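
The pattern list above lends itself to a simple first-pass triage. The sketch below reads a failed-run log on stdin and prints one label per matching category. The label names and both function names are hypothetical, and the matching is plain case-sensitive substring search, so it will over-match on noisy logs; treat the output as a hint about where to look, not a diagnosis.

```shell
# Succeeds if the text held in $log matches the extended regex in $1.
log_matches() {
  printf '%s\n' "$log" | grep -qE "$1"
}

# Read a failed-run log on stdin; print one label per matching category.
classify_failure() {
  log="$(cat)"
  log_matches 'Error acquiring the state lock|state blob is already locked' && echo "terraform-state-lock"
  log_matches 'AuthorizationFailed|AADSTS|unauthorized' && echo "azure-auth"
  log_matches 'ResourceNotFound|does not exist' && echo "missing-resource"
  log_matches 'QuotaExceeded' && echo "quota"
  log_matches 'Run database migrations|[Aa]lembic' && echo "migration"
  log_matches 'FAILED|pytest|AssertionError' && echo "tests"
  log_matches 'ruff|lint error' && echo "lint"
  return 0
}
```

A typical invocation would be gh run view <run-id> --log-failed | classify_failure. Printing every matching label instead of stopping at the first is deliberate, since one failure can trip more than one pattern.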
After fixing the issue:
gh run rerun <run-id>
Then watch the progress:
gh run watch <run-id>
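
The re-run and watch steps can be combined so that the shell exit code reflects the outcome of the new attempt. This is a sketch only: rerun_and_watch and the GH_BIN override are hypothetical names, and gh run watch may need a short delay after the re-run is requested before it picks up the new attempt.

```shell
# Re-run a workflow run, then follow it until it finishes.
# GH_BIN is a hypothetical override for dry runs; it defaults to gh.
rerun_and_watch() {
  run_id="$1"
  if [ -z "$run_id" ]; then
    echo "usage: rerun_and_watch <run-id>" >&2
    return 2
  fi
  # --exit-status makes gh run watch return non-zero if the run fails.
  "${GH_BIN:-gh}" run rerun "$run_id" &&
    "${GH_BIN:-gh}" run watch "$run_id" --exit-status
}
```

Because the watch step is chained with &&, it is skipped when the re-run request itself is rejected.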
| Command | Description |
|---|---|
| ./.github/skills/debug-deploy/debug-deploy.sh status | Show recent workflow runs |
| ./.github/skills/debug-deploy/debug-deploy.sh logs | View and analyze failed logs |
| ./.github/skills/debug-deploy/debug-deploy.sh logs <id> | View specific run's failed logs |
| ./.github/skills/debug-deploy/debug-deploy.sh unlock | Fix Terraform state lock |
| ./.github/skills/debug-deploy/debug-deploy.sh rerun | Re-run most recent failed workflow |
| ./.github/skills/debug-deploy/debug-deploy.sh watch | Watch running workflow |
The debug-deploy.sh script automates these debugging steps and includes automated issue detection.
The workflows are configured with these safeguards:
- cancel-in-progress: true: Cancels in-progress runs when a new push arrives (in deploy.yml)
- -lock-timeout=120s: Waits for locks instead of failing immediately (in the deploy.yml terraform job)
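
When running Terraform by hand while debugging, the same lock-waiting behaviour can be reproduced locally. This is a small sketch, assuming the 120s value from the workflow is also appropriate on a workstation; tf_with_lock_wait, TF_BIN, and LOCK_TIMEOUT are hypothetical names introduced for illustration.

```shell
# Run a Terraform subcommand (plan, apply, ...) that waits for the state
# lock instead of failing immediately.
# TF_BIN and LOCK_TIMEOUT are hypothetical overrides; defaults are
# terraform and 120s.
tf_with_lock_wait() {
  if [ "$#" -eq 0 ]; then
    echo "usage: tf_with_lock_wait <subcommand> [args...]" >&2
    return 2
  fi
  subcommand="$1"
  shift
  "${TF_BIN:-terraform}" "$subcommand" -lock-timeout="${LOCK_TIMEOUT:-120s}" "$@"
}
```

For instance, (cd infra && tf_with_lock_wait plan) would wait up to two minutes for a lock held by a still-running workflow.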