| name | condition-based-waiting |
| description | Polling, retry, and backoff patterns. |
| user-invocable | false |
| allowed-tools | ["Read","Write","Bash","Grep","Glob","Edit"] |
| routing | {"triggers":["exponential backoff","health check polling","retry pattern","wait for","keep trying until","poll until ready","retry until success"],"category":"process","pairs_with":["shell-process-patterns","service-health-check"]} |
Implement condition-based polling and retry patterns with bounded timeouts, jitter, and error classification. Select the right pattern for the scenario, implement it with safety bounds, and verify both success and failure paths.
| Pattern | Use When | Key Safety Bound |
|---|---|---|
| Simple Poll | Wait for condition to become true | Timeout + min poll interval |
| Exponential Backoff | Retry with increasing delays | Max retries + jitter + delay cap |
| Rate Limit Recovery | API returns 429 | Retry-After header + default fallback |
| Health Check | Wait for service(s) to be ready | All-pass requirement + per-check status |
| Circuit Breaker | Prevent cascade failures | Failure threshold + recovery timeout |
| Signal | Load These Files | Why |
|---|---|---|
| reviewing or auditing existing wait/retry code | preferred-patterns.md | Detection commands and fixes for common wait/retry mistakes. |
| implementing a pattern | implementation-patterns.md | Complete Python/Bash implementations for every pattern. |
| writing tests for polling or retry code | testing-patterns.md | pytest patterns for testing polling, backoff, and circuit breaker code. |
Before implementing any pattern, read the repository CLAUDE.md and search the codebase for existing wait/retry patterns to maintain consistency with what already exists.
Walk this decision tree to pick the right pattern. Only implement the pattern directly needed -- do not add circuit breakers when simple retries suffice, and do not add health checks when a single poll works.
1. Waiting for a condition to become true?
YES -> Simple Polling (Step 2)
NO -> Continue
2. Retrying a failing operation?
YES -> Rate-limited (429)?
YES -> Rate Limit Recovery (Step 5)
NO -> Exponential Backoff (Step 4)
NO -> Continue
3. Waiting for a service to start?
YES -> Health Check Waiting (Step 6)
NO -> Continue
4. Service frequently failing, need fast-fail?
YES -> Circuit Breaker (Step 7)
NO -> Simple Poll or Backoff
Wait for a condition to become true with bounded timeout.
Use time.monotonic() for elapsed time measurement -- never time.time(), which drifts with clock adjustments. Choose a poll interval appropriate to the target:
| Target Type | Min Interval | Typical Interval | Example |
|---|---|---|---|
| In-process state | 10ms | 50-100ms | Flag, queue, state machine |
| Local file/socket | 100ms | 500ms | File exists, port open |
| Local service | 500ms | 1-2s | Database, cache |
| Remote API | 1s | 5-10s | HTTP endpoint, cloud service |
Never busy-wait (a tight loop with no sleep). The minimum poll interval is 10ms for in-process operations and 100ms for anything outside the process (see the table above for per-target minimums). Tighter loops burn CPU, cause thermal throttling, and starve other processes.
```python
# Core pattern (full implementation in references/implementation-patterns.md)
import time

def wait_for(condition, description, timeout_seconds=30, poll_interval=0.5):
    """Poll until `condition` returns a truthy value or the deadline passes."""
    start = time.monotonic()
    deadline = start + timeout_seconds
    while time.monotonic() < deadline:
        result = condition()
        if result:
            return result
        time.sleep(poll_interval)
    raise TimeoutError(f"Timeout waiting for: {description}")
```
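For example, a minimal call site waiting for a file to appear (the path and intervals are illustrative):

```python
from pathlib import Path

# Wait up to 30s for an exported report to exist, polling every 0.5s.
report = Path("/tmp/report.json")  # illustrative path
wait_for(lambda: report.exists(), description="report.json to be written",
         timeout_seconds=30, poll_interval=0.5)
```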
After implementing any pattern from Steps 2-7, verify:
- Bounded timeout or maximum retry count on every loop
- No unconditional waits (replaced sleep(N) with condition-based polling)
- Both the success path and the failure/timeout path exercised
Retry failing operations with increasing delays and jitter.
Classify errors before implementing retries. Separate transient from permanent errors -- retrying permanent errors wastes time and quota.
Configure backoff parameters. Every retry loop must have a maximum retry count.
- max_retries: 3-5 for APIs, 5-10 for infrastructure
- initial_delay: 0.5-2s
- max_delay: 30-60s
- jitter_range: 0.5 (adds +/-50% randomness)
Implement with jitter. Jitter is mandatory on all exponential backoff -- without it, all clients retry at the same instant after an outage (thundering herd), amplifying the load spike that caused the failure.
```python
# Core pattern (full implementation in references/implementation-patterns.md)
import random
import time

def retry_with_backoff(operation, retryable_exceptions, max_retries=5,
                       initial_delay=1.0, max_delay=60.0, backoff_factor=2.0):
    """Retry `operation` on transient errors with exponential backoff and +/-50% jitter."""
    delay = initial_delay
    for attempt in range(max_retries + 1):
        try:
            return operation()
        except retryable_exceptions:
            if attempt >= max_retries:
                raise
            jitter = 1.0 + random.uniform(-0.5, 0.5)
            actual_delay = min(delay * jitter, max_delay)
            time.sleep(actual_delay)
            delay = min(delay * backoff_factor, max_delay)
```
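A hypothetical call site with errors classified as described above -- fetch_orders and the exception tuple are illustrative; substitute the transient errors of your client library:

```python
# Only transient errors are retried; permanent errors (bad request, bad
# credentials, missing resource) surface immediately instead of wasting retries.
orders = retry_with_backoff(
    fetch_orders,  # hypothetical operation
    retryable_exceptions=(ConnectionError, TimeoutError),
    max_retries=4,
    initial_delay=1.0,
)
```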
Handle HTTP 429 responses using Retry-After headers.
Parse the Retry-After header (seconds or HTTP-date format); when the header is absent, fall back to a sensible default delay. See references/implementation-patterns.md for the full RateLimitedClient class.
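A minimal sketch of the 429 path only, assuming the requests library; the function name get_with_rate_limit and its defaults are illustrative, not the RateLimitedClient API:

```python
import time
from datetime import datetime, timezone
from email.utils import parsedate_to_datetime

import requests  # assumed HTTP client; substitute your own

def get_with_rate_limit(url, max_retries=5, default_wait=60):
    """GET with 429 handling: honor Retry-After when present, else wait default_wait."""
    for attempt in range(max_retries + 1):
        response = requests.get(url)
        if response.status_code != 429:
            return response
        if attempt >= max_retries:
            response.raise_for_status()  # out of retries: surface the 429
        retry_after = response.headers.get("Retry-After")
        if retry_after is None:
            wait = default_wait
        elif retry_after.isdigit():
            wait = int(retry_after)  # delta-seconds form
        else:
            # HTTP-date form: wait until the indicated time
            resume_at = parsedate_to_datetime(retry_after)
            wait = max(0.0, (resume_at - datetime.now(timezone.utc)).total_seconds())
        time.sleep(wait)
```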
Wait for services to become healthy before proceeding.
| Type | Check | Example |
|---|---|---|
| TCP | Port accepting connections | localhost:5432 |
| HTTP | Endpoint returns 2xx | http://localhost:8080/health |
| Command | Exit code 0 | pgrep -f 'celery worker' |
See references/implementation-patterns.md for full wait_for_healthy() implementation.
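A condensed sketch using only the standard library; the helper names, defaults, and error handling are illustrative:

```python
import socket
import subprocess
import time
import urllib.request

def tcp_ok(host, port):
    """TCP check: the port accepts connections."""
    try:
        with socket.create_connection((host, port), timeout=2):
            return True
    except OSError:
        return False

def http_ok(url):
    """HTTP check: the endpoint answers with a 2xx status."""
    try:
        with urllib.request.urlopen(url, timeout=2) as resp:
            return 200 <= resp.status < 300
    except OSError:
        return False

def cmd_ok(command):
    """Command check: the command exits 0."""
    return subprocess.run(command, shell=True, capture_output=True).returncode == 0

def wait_for_healthy(checks, timeout_seconds=120, poll_interval=2):
    """Poll until every named check passes; report which checks failed on timeout."""
    status = {name: False for name in checks}
    deadline = time.monotonic() + timeout_seconds
    while time.monotonic() < deadline:
        status = {name: check() for name, check in checks.items()}
        if all(status.values()):
            return status
        time.sleep(poll_interval)
    failing = [name for name, ok in status.items() if not ok]
    raise TimeoutError(f"Not healthy after {timeout_seconds}s: {failing}")
```

For the Docker Compose scenario below, the checks would be something like `{"postgres": lambda: tcp_ok("postgres", 5432), "api": lambda: http_ok("http://api:8080/health")}` with `timeout_seconds=120`.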
Prevent cascade failures by failing fast after repeated errors.
Configure thresholds:
- failure_threshold: Number of failures before opening (typically 5)
- recovery_timeout: Time before testing recovery (typically 30s)
- half_open_max_calls: Successful calls needed to close (typically 3)
Implement state machine:
Add fallback behavior for OPEN state.
Test all state transitions:
CLOSED --(failure_threshold reached)--> OPEN
OPEN --(recovery_timeout elapsed)---> HALF_OPEN
HALF_OPEN --(success streak)----------> CLOSED
HALF_OPEN --(any failure)-------------> OPEN
See references/implementation-patterns.md for full CircuitBreaker class.
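A condensed, single-threaded sketch of that state machine; the class name, the RuntimeError used to fail fast, and the defaults are illustrative -- the full CircuitBreaker class in references/implementation-patterns.md is the reference implementation:

```python
import time

class SimpleCircuitBreaker:
    """Minimal three-state circuit breaker: CLOSED -> OPEN -> HALF_OPEN -> CLOSED."""

    def __init__(self, failure_threshold=5, recovery_timeout=30, half_open_max_calls=3):
        self.failure_threshold = failure_threshold
        self.recovery_timeout = recovery_timeout
        self.half_open_max_calls = half_open_max_calls
        self.state = "CLOSED"
        self.failures = 0
        self.successes = 0
        self.opened_at = 0.0

    def call(self, operation):
        if self.state == "OPEN":
            if time.monotonic() - self.opened_at < self.recovery_timeout:
                raise RuntimeError("Circuit open: failing fast")  # caller should fall back
            self.state = "HALF_OPEN"        # recovery_timeout elapsed: test the service
            self.successes = 0
        try:
            result = operation()
        except Exception:
            self.failures += 1
            if self.state == "HALF_OPEN" or self.failures >= self.failure_threshold:
                self.state = "OPEN"          # any half-open failure, or threshold reached
                self.opened_at = time.monotonic()
            raise
        if self.state == "HALF_OPEN":
            self.successes += 1
            if self.successes >= self.half_open_max_calls:
                self.state = "CLOSED"        # success streak: close the circuit
                self.failures = 0
        else:
            self.failures = 0                # consecutive-failure counter resets on success
        return result
```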
Flaky test with sleep(): User says "This test uses sleep(5) and sometimes fails in CI." Identify as Simple Polling (Step 2). Define what the test is actually waiting for, replace sleep(5) with wait_for(condition, description, timeout=30), run test 3 times to verify reliability, then force the condition to never be true and confirm TimeoutError.
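A sketch of the forced-failure verification from this scenario, assuming the wait_for helper from Step 2; the test name is illustrative, and references/testing-patterns.md has fuller pytest patterns:

```python
import pytest

def test_wait_for_times_out_when_condition_never_true():
    """Force the failure path: a condition that stays false must raise TimeoutError."""
    with pytest.raises(TimeoutError):
        wait_for(lambda: False, description="condition that never becomes true",
                 timeout_seconds=1, poll_interval=0.1)
```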
API integration with rate limits: User says "Our batch job hits 429 errors from the API." Classify errors: 429 is retryable, 400/401/404 are not (Step 4). Add Retry-After header parsing with 60s default fallback (Step 5). Add exponential backoff with jitter for non-429 transient errors (Step 4). Test: normal flow, 429 handling, exhausted retries, non-retryable errors.
Service startup in Docker Compose: User says "App crashes because it starts before the database is ready." Define checks: TCP on postgres:5432, HTTP on api:8080/health (Step 6). Set 120s timeout with 2s poll interval. Implement wait_for_healthy() with all-pass requirement. Verify: services start within timeout, timeout fires when service is down.
Cause: Condition never became true within the timeout window. Solution: confirm the condition can actually become true (check the awaited resource directly), then raise the timeout only if the operation legitimately needs more time.
Cause: Operation failed on every attempt, including retries. Solution: check whether the error is actually transient; permanent errors (bad credentials, malformed requests) should not be retried, and genuinely transient outages may need a larger max_retries or max_delay.
Cause: Failure threshold exceeded, circuit rejecting calls. Solution: fix or wait out the failing downstream service; the circuit re-tests automatically once recovery_timeout elapses, so avoid lowering the threshold just to silence the errors.
| Task Type | Signal Keywords | Load |
|---|---|---|
| Implementing any pattern | "implement", "add retry", "write polling", "create backoff" | implementation-patterns.md |
| Reviewing existing code | "review", "check for", "find issues", "audit", "detect" | preferred-patterns.md |
| Replacing sleep() in tests | "sleep", "flaky test", "CI hang", "test timeout" | preferred-patterns.md |
| Writing tests for retry/wait code | "test", "pytest", "mock", "verify retry", "unit test" | testing-patterns.md |
- ${CLAUDE_SKILL_DIR}/references/implementation-patterns.md: Complete Python/Bash implementations for all patterns
- ${CLAUDE_SKILL_DIR}/references/preferred-patterns.md: Detection commands and fixes for common wait/retry mistakes
- ${CLAUDE_SKILL_DIR}/references/testing-patterns.md: pytest patterns for testing polling, backoff, and circuit breaker code