// Set up observability platform configuration (Datadog, Prometheus, Splunk) with REQ-* dashboards and alerts. Creates monitors for each requirement with SLA tracking. Use when deploying to production or setting up monitoring.
| name | create-observability-config |
| description | Set up observability platform configuration (Datadog, Prometheus, Splunk) with REQ-* dashboards and alerts. Creates monitors for each requirement with SLA tracking. Use when deploying to production or setting up monitoring. |
| allowed-tools | ["Read","Write","Edit"] |
Skill Type: Actuator (Runtime Setup)
Purpose: Set up observability with REQ-* dashboards and alerts
Prerequisites: Code deployed or ready to deploy, telemetry tagged
You are setting up observability with requirement-level monitoring.
Create:
Dashboard per Requirement:
# datadog/dashboards/req-f-auth-001.json
{
  "title": "<REQ-ID>: User Login Monitoring",
  "description": "Real-time monitoring for user login functionality",
  "widgets": [
    {
      "title": "Login Success Rate",
      "definition": {
        "type": "timeseries",
        "requests": [{
          "q": "sum:auth.login.attempts{req:<REQ-ID>,success:true}.as_count() / sum:auth.login.attempts{req:<REQ-ID>}.as_count()"
        }]
      }
    },
    {
      "title": "Login Latency (p95)",
      "definition": {
        "type": "timeseries",
        "requests": [{
          "q": "p95:auth.login.duration{req:<REQ-ID>}"
        }],
        "markers": [{
          "value": "y = 500",
          "label": "REQ-NFR-PERF-001 threshold",
          "display_type": "error dashed"
        }]
      }
    },
    {
      "title": "Failed Login Reasons",
      "definition": {
        "type": "toplist",
        "requests": [{
          "q": "top(sum:auth.login.failures{req:<REQ-ID>} by {error}, 10, 'sum', 'desc')"
        }]
      }
    }
  ]
}
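Because each requirement gets its own dashboard file, the `<REQ-ID>` placeholder has to be filled in once per requirement. The following is a minimal Python sketch of that step, not a definitive implementation. It assumes the requirement ID is `REQ-F-AUTH-001` (inferred from the file name above); `render_dashboard` and `dashboard_path` are hypothetical helper names, and the template is trimmed to one widget for brevity.

```python
import json

# Trimmed copy of the dashboard template; <REQ-ID> is the placeholder to fill in.
TEMPLATE = """
{
  "title": "<REQ-ID>: User Login Monitoring",
  "widgets": [
    {
      "title": "Login Latency (p95)",
      "definition": {
        "type": "timeseries",
        "requests": [{"q": "p95:auth.login.duration{req:<REQ-ID>}"}]
      }
    }
  ]
}
"""


def render_dashboard(req_id: str) -> dict:
    """Substitute the requirement ID, then parse to confirm the result is valid JSON."""
    return json.loads(TEMPLATE.replace("<REQ-ID>", req_id))


def dashboard_path(req_id: str) -> str:
    """Hypothetical naming rule: lower-cased requirement ID under datadog/dashboards/."""
    return f"datadog/dashboards/{req_id.lower()}.json"


if __name__ == "__main__":
    dashboard = render_dashboard("REQ-F-AUTH-001")
    print(dashboard_path("REQ-F-AUTH-001"))
    print(json.dumps(dashboard, indent=2))
```

Parsing the rendered text with `json.loads` catches template mistakes (such as a stray comment) before the file is written.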
Alerts:
# datadog/monitors/req-f-auth-001-latency.json
{
  "name": "<REQ-ID>: Login latency exceeded",
  "type": "metric alert",
  "query": "avg(last_5m):p95:auth.login.duration{req:<REQ-ID>} > 500",
  "message": "Login latency exceeded 500ms threshold (REQ-NFR-PERF-001)\n\nRequirement: <REQ-ID> (User Login)\nSLA: < 500ms\nCurrent: {{value}}ms\n\n@slack-alerts",
  "tags": ["req:<REQ-ID>", "sla:performance"],
  "options": {
    "thresholds": {
      "critical": 500,
      "warning": 400
    },
    "notify_no_data": true,
    "no_data_timeframe": 10
  }
}
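The monitor repeats the 500 ms threshold in both the query and the `critical` option, so the two can drift apart when edited by hand. One way to keep them in step is to build the definition from a single value, as in this sketch. `build_latency_monitor` is a hypothetical helper, and the field names simply mirror the JSON above rather than a verified API schema.

```python
def build_latency_monitor(req_id: str, critical_ms: int, warning_ms: int) -> dict:
    """Build a latency monitor definition with one source of truth for thresholds."""
    if warning_ms >= critical_ms:
        raise ValueError("warning threshold must be below the critical threshold")
    return {
        "name": f"{req_id}: Login latency exceeded",
        "type": "metric alert",
        # The query threshold is taken from critical_ms so it cannot diverge.
        "query": f"avg(last_5m):p95:auth.login.duration{{req:{req_id}}} > {critical_ms}",
        "tags": [f"req:{req_id}", "sla:performance"],
        "options": {
            "thresholds": {"critical": critical_ms, "warning": warning_ms},
            "notify_no_data": True,
            "no_data_timeframe": 10,
        },
    }


if __name__ == "__main__":
    monitor = build_latency_monitor("REQ-F-AUTH-001", critical_ms=500, warning_ms=400)
    print(monitor["query"])
```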
Recording Rules:
# prometheus/rules/req-f-auth-001.yml
groups:
  - name: req_f_auth_001
    interval: 30s
    rules:
      # Success rate
      - record: req:auth_login_success_rate
        expr: |
          sum(rate(auth_login_attempts_total{req="<REQ-ID>",success="true"}[5m]))
          /
          sum(rate(auth_login_attempts_total{req="<REQ-ID>"}[5m]))
        labels:
          req: "<REQ-ID>"
      # Latency p95 (aggregated across instances, keeping the bucket label)
      - record: req:auth_login_duration_p95
        expr: histogram_quantile(0.95, sum by (le) (rate(auth_login_duration_seconds_bucket{req="<REQ-ID>"}[5m])))
        labels:
          req: "<REQ-ID>"
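To make the p95 recording rule less opaque, here is a simplified Python sketch of how a quantile can be estimated from cumulative histogram buckets by linear interpolation within the bucket that contains the target rank. It is an illustration of the idea only; Prometheus's own `histogram_quantile` handles further edge cases, and the bucket values below are made up for the example.

```python
import math


def histogram_quantile(q: float, buckets: list[tuple[float, float]]) -> float:
    """Estimate the q-quantile from (upper_bound, cumulative_count) pairs.

    The list must include a final +Inf bucket holding the total count.
    """
    if not 0 < q <= 1:
        raise ValueError("q must be in (0, 1]")
    buckets = sorted(buckets)
    total = buckets[-1][1]
    if total == 0:
        return math.nan
    rank = q * total
    prev_bound, prev_count = 0.0, 0.0
    for bound, count in buckets:
        if count >= rank:
            if math.isinf(bound):
                # The rank falls in the overflow bucket: report the last finite bound.
                return prev_bound
            # Interpolate linearly between the bucket's lower and upper bounds.
            fraction = (rank - prev_count) / (count - prev_count)
            return prev_bound + (bound - prev_bound) * fraction
        prev_bound, prev_count = bound, count
    return math.nan


if __name__ == "__main__":
    # Illustrative login durations in seconds (cumulative counts per bucket).
    example = [(0.1, 50), (0.25, 80), (0.5, 96), (1.0, 100), (math.inf, 100)]
    p95 = histogram_quantile(0.95, example)
    print(f"p95 = {p95:.3f}s, within 0.5s SLA: {p95 <= 0.5}")
```

Because the estimate is interpolated, its accuracy depends on the bucket boundaries; placing a boundary at the SLA value (0.5 s here) keeps the comparison against the threshold meaningful.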
Alerts:
# prometheus/alerts/req-f-auth-001.yml
groups:
  - name: req_f_auth_001_alerts
    rules:
      - alert: REQ_F_AUTH_001_LatencyHigh
        expr: req:auth_login_duration_p95{req="<REQ-ID>"} > 0.5
        for: 5m
        labels:
          severity: critical
          req: "<REQ-ID>"
          sla: performance
        annotations:
          summary: "Login latency exceeded (<REQ-ID>)"
          description: "p95 latency is {{ $value }}s (threshold: 0.5s)"
          requirement: "REQ-NFR-PERF-001: Login response < 500ms"
          runbook: "docs/runbooks/performance-degradation.md"
Log Search:
# Splunk saved search for <REQ-ID>
index=production sourcetype=app_logs req="<REQ-ID>"
| stats count AS total, count(eval(success="true")) AS successes
| eval success_rate = round(successes / total * 100, 2)
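For readers less familiar with SPL, the success-rate calculation amounts to counting matching events and dividing. The Python sketch below shows the same arithmetic on in-memory records; the event shape (dicts with `req` and `success` string fields) is an assumption made for illustration, not the actual log schema.

```python
from typing import Optional


def success_rate(events: list[dict], req_id: str) -> Optional[float]:
    """Percentage of events for req_id whose success field equals "true"."""
    matching = [e for e in events if e.get("req") == req_id]
    if not matching:
        return None  # No data for this requirement; avoid dividing by zero.
    successes = sum(1 for e in matching if e.get("success") == "true")
    return round(successes / len(matching) * 100, 2)


if __name__ == "__main__":
    sample = [
        {"req": "REQ-F-AUTH-001", "success": "true"},
        {"req": "REQ-F-AUTH-001", "success": "true"},
        {"req": "REQ-F-AUTH-001", "success": "false", "error": "bad_password"},
    ]
    print(success_rate(sample, "REQ-F-AUTH-001"))
```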
Dashboard:
<dashboard>
  <label><REQ-ID>: User Login</label>
  <row>
    <panel>
      <title>Login Success Rate</title>
      <single>
        <search>
          <query>
            index=production req="<REQ-ID>"
            | stats count AS total, count(eval(success="true")) AS successes
            | eval rate = round(successes / total * 100, 2)
          </query>
        </search>
      </single>
    </panel>
  </row>
</dashboard>
[TELEMETRY TAGGING - <REQ-ID>]
Platform: Datadog
Configuration Created:
Dashboards (1):
  ✓ datadog/dashboards/req-f-auth-001.json
    - Login success rate widget
    - Login latency (p95) widget with 500ms threshold
    - Failed login reasons widget
    - Active users widget
Monitors/Alerts (3):
  ✓ datadog/monitors/req-f-auth-001-latency.json
    - Alert: p95 latency > 500ms (REQ-NFR-PERF-001)
    - Warning: > 400ms
    - Critical: > 500ms
  ✓ datadog/monitors/req-f-auth-001-errors.json
    - Alert: Error rate > 5%
    - Links to: <REQ-ID>
  ✓ datadog/monitors/req-f-auth-001-lockouts.json
    - Alert: Lockout rate > 10%
    - Links to: BR-003
Logs:
  ✓ All log statements tagged with req="<REQ-ID>"
  ✓ Searchable: logs.req:<REQ-ID>
Metrics:
  ✓ auth.login.attempts{req:<REQ-ID>}
  ✓ auth.login.duration{req:<REQ-ID>}
  ✓ auth.login.lockouts{req:<REQ-ID>}
Traces:
  ✓ Span "auth.login" tagged with req="<REQ-ID>"
Backward Traceability Enabled:
  Alert → req:<REQ-ID> → docs/requirements/auth.md → INT-100 ✓
✅ Observability Setup Complete!
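Backward traceability starts from the `req:` tag carried by every alert. A small sketch of that first hop, pulling the requirement ID out of a tag list, might look like the following; `requirement_from_tags` is a hypothetical helper, and the lookup from ID to requirements document is left out because its format is not specified here.

```python
from typing import Optional


def requirement_from_tags(tags: list[str]) -> Optional[str]:
    """Return the value of the first "req:<id>" tag, or None if there is none."""
    for tag in tags:
        key, _, value = tag.partition(":")
        if key == "req" and value:
            return value
    return None


if __name__ == "__main__":
    print(requirement_from_tags(["req:REQ-F-AUTH-001", "sla:performance"]))
```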
Why observability per requirement?
Homeostasis Goal:
desired_state:
  all_requirements_monitored: true
  alerts_tagged_with_req: true
  dashboards_per_requirement: true
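The desired state can be checked mechanically by comparing the list of requirements against the generated config files. The sketch below reports which requirements still lack a dashboard or a monitor. It assumes the file layout used in the examples in this document (`datadog/dashboards/<id>.json` and `datadog/monitors/<id>-*.json`); `coverage_gaps` is a hypothetical helper name.

```python
def coverage_gaps(req_ids: list[str], existing_paths: set[str]) -> dict[str, list[str]]:
    """Map each requirement ID to the artifact kinds it is still missing."""
    gaps: dict[str, list[str]] = {}
    for req_id in req_ids:
        slug = req_id.lower()
        missing = []
        if f"datadog/dashboards/{slug}.json" not in existing_paths:
            missing.append("dashboard")
        monitor_prefix = f"datadog/monitors/{slug}-"
        if not any(path.startswith(monitor_prefix) for path in existing_paths):
            missing.append("monitor")
        if missing:
            gaps[req_id] = missing
    return gaps


if __name__ == "__main__":
    paths = {
        "datadog/dashboards/req-f-auth-001.json",
        "datadog/monitors/req-f-auth-001-latency.json",
    }
    # REQ-F-AUTH-002 is an illustrative second requirement with no config yet.
    print(coverage_gaps(["REQ-F-AUTH-001", "REQ-F-AUTH-002"], paths))
```

An empty result corresponds to `all_requirements_monitored: true` and `dashboards_per_requirement: true` for the Datadog layout assumed here.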
"Excellence or nothing" 🔥