prometheus-configuration-specialist
// Configure Prometheus with alerting, recording rules, service discovery (K8s, Consul, EC2), federation, PromQL optimization, and Alertmanager.
| name | Prometheus Configuration Specialist |
| slug | observability-prometheus-configurator |
| description | Configure Prometheus with alerting, recording rules, service discovery (K8s, Consul, EC2), federation, PromQL optimization, and Alertmanager. |
| capabilities | ["Prometheus scrape configuration with service discovery","Alerting rules with multi-window burn rate patterns","Recording rules for pre-computing expensive queries","Relabeling for metric filtering and label transformation","Federation for multi-DC and cross-service monitoring","PromQL query optimization and cardinality management","Alertmanager routing and notification configuration","Prometheus 3.0+ features (UTF-8, OTLP, Remote Write 2.0)"] |
| inputs | ["Service topology and scrape targets","Service discovery mechanism (Kubernetes, Consul, EC2, file_sd)","Alert definitions with severity levels","Recording rule requirements","Alertmanager notification channels (PagerDuty, Slack, email)","Federation topology (if multi-DC or cross-service)","Cardinality constraints and retention requirements"] |
| outputs | ["prometheus.yml configuration file","Alerting rules YAML files","Recording rules YAML files","Alertmanager configuration","Relabeling strategies for cardinality management","PromQL query optimization recommendations","Federation endpoint configuration","Service discovery relabel configs"] |
| keywords | ["prometheus","monitoring","observability","alerting","recording-rules","service-discovery","kubernetes-sd","promql","federation","alertmanager","metrics","relabeling","cardinality","burn-rate","slo"] |
| version | 1.0.0 |
| owner | cognitive-toolworks |
| license | MIT |
| security | No sensitive data allowed in metric labels. Use relabeling to drop secrets. Avoid high-cardinality labels (user IDs, request IDs). |
| links | [{"title":"Prometheus 3.0 Release (November 2024)","url":"https://prometheus.io/blog/2024/11/14/prometheus-3-0/","accessed":"2025-10-26"},{"title":"Prometheus Configuration Documentation","url":"https://prometheus.io/docs/prometheus/latest/configuration/configuration/","accessed":"2025-10-26"},{"title":"Prometheus Alerting Best Practices","url":"https://prometheus.io/docs/practices/alerting/","accessed":"2025-10-26"},{"title":"Prometheus Recording Rules","url":"https://prometheus.io/docs/prometheus/latest/configuration/recording_rules/","accessed":"2025-10-26"},{"title":"Prometheus Naming Conventions","url":"https://prometheus.io/docs/practices/naming/","accessed":"2025-10-26"}] |
Trigger conditions:
Complements:
- observability-stack-configurator: For overall observability stack design
- observability-unified-dashboard: For Grafana dashboard design with Prometheus datasources
- observability-slo-calculator: For SLO/error budget definitions that drive alerting rules
Out of scope:
Time normalization:
- Compute NOW_ET using NIST/time.gov semantics (America/New_York, ISO-8601)
- Use NOW_ET for all access dates in citations
Verify inputs:
Validate service discovery:
- kubernetes_sd_config: Verify Kubernetes API access and RBAC permissions
- consul_sd_config: Verify Consul agent accessibility and service catalog
- ec2_sd_config: Verify AWS credentials and EC2 instance tags
- file_sd_config: Verify JSON/YAML file path and refresh interval
Check cardinality constraints:
- Use metric_relabel_configs to drop high-cardinality labels
Source freshness:
- NOW_ET)
Abort if:
Scenario: Single service with static targets or file-based service discovery, basic alerting, no recording rules.
Steps:
Global Configuration:
- scrape_interval: 15s (balance between data freshness and storage)
- evaluation_interval: 15s (how often to evaluate alerting/recording rules)
- external_labels for federation or remote write (e.g., datacenter: us-east-1)
Scrape Configuration:
- job_name (logical grouping, e.g., api-service, postgres-exporter)
- static_configs with targets: ['localhost:9090']
- file_sd_configs with files: ['/etc/prometheus/targets/*.json']
- scrape_interval override if different from global
Basic Alerting Rules:
- alerts.yml with groups
- Severity label (severity: critical|warning|info)
Alertmanager Integration:
- alerting.alertmanagers with static_configs pointing to the Alertmanager instance
- send_resolved: true (set on the Alertmanager receiver) to notify when an alert resolves
Output:
- prometheus.yml with global config, single scrape job, alerting rules file reference
- alerts.yml with 2-5 basic alerts
Token budget: ≤2000 tokens
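As a rough illustration of the steps in this tier, the two output files might look like the sketch below. The job name api-service, the target localhost:8080, the Alertmanager address localhost:9093, and the InstanceDown alert are assumptions chosen for illustration; they are not values prescribed by this skill.

```yaml
# prometheus.yml (sketch; hostnames and ports are placeholders)
global:
  scrape_interval: 15s
  evaluation_interval: 15s
  external_labels:
    datacenter: us-east-1

alerting:
  alertmanagers:
    - static_configs:
        - targets: ['localhost:9093']

rule_files:
  - 'alerts.yml'

scrape_configs:
  - job_name: 'api-service'
    static_configs:
      - targets: ['localhost:8080']
---
# alerts.yml (sketch; InstanceDown is an illustrative alert)
groups:
  - name: basic_alerts
    rules:
      - alert: InstanceDown
        expr: up == 0
        for: 5m
        labels:
          severity: critical
        annotations:
          summary: "Instance {{ $labels.instance }} is down"
          description: "{{ $labels.job }} target {{ $labels.instance }} has been unreachable for 5 minutes"
```

Both files can be checked with promtool check config prometheus.yml, which also loads the referenced rule file.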
Scenario: Multiple services with Kubernetes/Consul/EC2 service discovery, recording rules for expensive queries, Alertmanager routing with grouping.
Steps:
Service Discovery Configuration:
Kubernetes Service Discovery:
- kubernetes_sd_configs with role: pod (discover all pods with prometheus.io/scrape: "true" annotation)
relabel_configs:
  - source_labels: [__meta_kubernetes_pod_annotation_prometheus_io_scrape]
    action: keep
    regex: true
  - source_labels: [__meta_kubernetes_pod_annotation_prometheus_io_path]
    action: replace
    target_label: __metrics_path__
    regex: (.+)
  - source_labels: [__address__, __meta_kubernetes_pod_annotation_prometheus_io_port]
    action: replace
    target_label: __address__
    regex: ([^:]+)(?::\d+)?;(\d+)
    replacement: $1:$2
  - source_labels: [__meta_kubernetes_namespace]
    action: replace
    target_label: kubernetes_namespace
  - source_labels: [__meta_kubernetes_pod_name]
    action: replace
    target_label: kubernetes_pod_name
- Available roles: node, pod, service, endpoints, ingress (accessed NOW_ET: https://prometheus.io/docs/prometheus/latest/configuration/configuration/#kubernetes_sd_config)
Consul Service Discovery:
- consul_sd_configs with server: 'consul.service.consul:8500'
- tags: ['production', 'monitoring-enabled']
EC2 Service Discovery:
- ec2_sd_configs with AWS region and filters
- __meta_ec2_tag_<tagkey>
Recording Rules:
- Naming convention: level:metric:operations (accessed NOW_ET: https://prometheus.io/docs/practices/naming/)
- level: the aggregation level (job, instance, cluster)
- operations: the operations applied (sum, avg, rate)
groups:
  - name: api_recording_rules
    interval: 30s
    rules:
      - record: job:http_requests_total:rate5m
        expr: sum(rate(http_requests_total[5m])) by (job)
      - record: job:http_request_duration_seconds:p95
        expr: histogram_quantile(0.95, sum(rate(http_request_duration_seconds_bucket[5m])) by (job, le))
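Recording rules like these can be exercised with promtool's rule unit tests before deployment. The sketch below assumes the rules above are saved as recording_rules.yml and that the test file is named recording_rules_test.yml; the input series, its labels, and the file names are hypothetical. The counter grows by 60 per minute, so the 5-minute rate is expected to come out at 1 request per second.

```yaml
# recording_rules_test.yml (sketch; run with: promtool test rules recording_rules_test.yml)
rule_files:
  - recording_rules.yml

evaluation_interval: 1m

tests:
  - interval: 1m
    input_series:
      # Synthetic counter: starts at 0 and gains 60 every minute for 10 minutes
      - series: 'http_requests_total{job="api-service", instance="a"}'
        values: '0+60x10'
    promql_expr_test:
      - expr: job:http_requests_total:rate5m
        eval_time: 10m
        exp_samples:
          - labels: 'job:http_requests_total:rate5m{job="api-service"}'
            value: 1
```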
Relabeling Strategies:
Metric Relabeling (metric_relabel_configs):
metric_relabel_configs:
  # labeldrop matches its regex against label names, so source_labels is not used
  - action: labeldrop
    regex: user_id
  - source_labels: [__name__]
    action: drop
    regex: 'expensive_metric_.*'
Target Relabeling (relabel_configs):
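Target relabeling runs before the scrape and works on the __meta_* labels that each discovery mechanism attaches. A minimal sketch for the Consul and EC2 cases described earlier follows; the job names, the region, the Environment tag filter, port 9100, and the target label names are assumptions for illustration.

```yaml
scrape_configs:
  - job_name: 'consul-services'
    consul_sd_configs:
      - server: 'consul.service.consul:8500'
        tags: ['production', 'monitoring-enabled']
    relabel_configs:
      # Use the Consul service name as the job label
      - source_labels: [__meta_consul_service]
        target_label: job

  - job_name: 'ec2-nodes'
    ec2_sd_configs:
      - region: us-east-1
        port: 9100
        filters:
          # Only discover instances tagged Environment=production
          - name: tag:Environment
            values: [production]
    relabel_configs:
      # Copy the EC2 Name tag and availability zone onto every series
      - source_labels: [__meta_ec2_tag_Name]
        target_label: instance_name
      - source_labels: [__meta_ec2_availability_zone]
        target_label: availability_zone
```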
Alerting Rules (Advanced):
Multi-Window Burn Rate Alerts:
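groups:
  - name: slo_alerts
    rules:
      - alert: ErrorBudgetBurn_Critical
        # Fires only when both the long (1h) and short (5m) windows exceed a 14.4x burn rate.
        # Aggregating by job keeps the job label available to the annotations.
        expr: |
          (
            sum by (job) (rate(http_requests_total{status=~"5.."}[1h]))
            /
            sum by (job) (rate(http_requests_total[1h]))
          ) > (14.4 * 0.001)
          and
          (
            sum by (job) (rate(http_requests_total{status=~"5.."}[5m]))
            /
            sum by (job) (rate(http_requests_total[5m]))
          ) > (14.4 * 0.001)
        for: 2m
        labels:
          severity: critical
        annotations:
          summary: "Error budget burning 14.4× faster than allowed"
          description: "{{ $labels.job }} has {{ $value | humanizePercentage }} error rate (SLO: 99.9%, budget exhausted in 2 days)"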
Symptom-Based Alerts:
- alert: HighLatency
  expr: histogram_quantile(0.95, sum(rate(http_request_duration_seconds_bucket[5m])) by (job, le)) > 0.5
  for: 5m
  labels:
    severity: warning
  annotations:
    summary: "High latency on {{ $labels.job }}"
    description: "p95 latency is {{ $value }}s (threshold: 0.5s)"
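The critical burn-rate alert above is usually paired with a slower tier that catches sustained, lower-level burn. The sketch below assumes the same 99.9% SLO measured over a 30-day window and uses the commonly cited 6x burn rate over 6h and 30m windows, which would exhaust the budget in about 5 days. The alert name, group name, and the for: 15m duration are illustrative assumptions.

```yaml
groups:
  - name: slo_alerts_slow_burn
    rules:
      - alert: ErrorBudgetBurn_Warning
        # Long window (6h) confirms sustained burn; short window (30m) confirms it is still happening
        expr: |
          (
            sum by (job) (rate(http_requests_total{status=~"5.."}[6h]))
            /
            sum by (job) (rate(http_requests_total[6h]))
          ) > (6 * 0.001)
          and
          (
            sum by (job) (rate(http_requests_total{status=~"5.."}[30m]))
            /
            sum by (job) (rate(http_requests_total[30m]))
          ) > (6 * 0.001)
        for: 15m
        labels:
          severity: warning
        annotations:
          summary: "Error budget burning 6x faster than allowed"
          description: "{{ $labels.job }} has {{ $value | humanizePercentage }} error rate over 6h (SLO: 99.9%)"
```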
Alertmanager Routing:
- Group alerts by cluster + alertname, wait 30s for batch
route:
  receiver: 'default-email'
  group_by: ['cluster', 'alertname']
  group_wait: 30s
  group_interval: 5m
  repeat_interval: 4h
  routes:
    - match:
        severity: critical
      receiver: 'pagerduty'
    - match:
        severity: warning
      receiver: 'slack'
receivers:
  - name: 'pagerduty'
    pagerduty_configs:
      - service_key: '<PD_SERVICE_KEY>'
  - name: 'slack'
    slack_configs:
      - api_url: '<SLACK_WEBHOOK_URL>'
        channel: '#alerts'
        title: '{{ .GroupLabels.alertname }}'
        text: '{{ range .Alerts }}{{ .Annotations.description }}{{ end }}'
  - name: 'default-email'
    email_configs:
      - to: 'ops@example.com'
inhibit_rules:
  - source_match:
      severity: critical
    target_match:
      severity: warning
    equal: ['cluster', 'alertname']
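Recent Alertmanager releases (0.22 and later, to the best of my knowledge) deprecate match, source_match, and target_match in favor of the matchers family. Assuming such a version, the routing and inhibition parts of the configuration above could be written as in this sketch; the receivers section is unchanged and omitted here.

```yaml
route:
  receiver: 'default-email'
  group_by: ['cluster', 'alertname']
  routes:
    # Each matcher is a string of the form: label <operator> "value"
    - matchers:
        - 'severity = "critical"'
      receiver: 'pagerduty'
    - matchers:
        - 'severity = "warning"'
      receiver: 'slack'

inhibit_rules:
  - source_matchers:
      - 'severity = "critical"'
    target_matchers:
      - 'severity = "warning"'
    equal: ['cluster', 'alertname']
```

Running amtool check-config alertmanager.yml against the chosen syntax confirms whether the installed version accepts it.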
Output:
prometheus.yml with Kubernetes/Consul/EC2 service discovery, relabeling configsrecording_rules.yml with 5-10 recording rules (level:metric:operations naming)alerts.yml with multi-window burn rate alerts and symptom-based alertsalertmanager.yml with routing tree, receivers, inhibition rulesToken budget: ≤6000 tokens
Scenario: Multi-datacenter federation, cardinality management, PromQL query optimization, Prometheus 3.0+ features (UTF-8, OTLP, Remote Write 2.0).
Steps:
Federation Configuration:
Hierarchical Federation (Multi-DC):
- (accessed NOW_ET: https://prometheus.io/docs/prometheus/latest/federation/)
scrape_configs:
  - job_name: 'federate-us-east-1'
    scrape_interval: 30s
    honor_labels: true
    metrics_path: '/federate'
    params:
      'match[]':
        - '{job="prometheus"}'
        - '{__name__=~"job:.*"}'  # Only federate aggregated recording rules
    static_configs:
      - targets:
          - 'prometheus-us-east-1:9090'
          - 'prometheus-us-west-2:9090'
Cross-Service Federation:
PromQL Optimization:
Query Performance Best Practices:
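- Slow: sum(http_requests_total) (aggregates 10k+ time series)
- Fast: sum(http_requests_total{job="api-service", status=~"5.."}) (aggregates 10-50 time series)
- Avoid querying a bare metric name (e.g., api_http_requests_total) without labels
- (accessed NOW_ET: https://prometheus.io/docs/prometheus/latest/querying/basics/)
# Compute error rate using pre-recorded job-level metrics (fast).
# Both sides are summed so the division matches on a single series;
# without sum(), each status would be divided by itself and yield 1.
sum(job:http_requests_total:rate5m{job="api-service", status=~"5.."})
/
sum(job:http_requests_total:rate5m{job="api-service"})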
Cardinality Management:
- Use topk(10, count by (__name__)({__name__=~".+"})) to find high-cardinality metrics
- Use metric_relabel_configs to remove high-cardinality labels
- Use metric_relabel_configs with action: drop to sample metrics
metric_relabel_configs:
  # Drop user_id label (high cardinality); labeldrop matches label names
  - action: labeldrop
    regex: user_id
  # Keep only 5xx series of http_requests_total. A bare keep on the status label
  # would also discard every metric that has no status label at all.
  - source_labels: [__name__, status]
    action: drop
    regex: 'http_requests_total;[1-4]..'
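Relabeling removes known offenders; per-job scrape limits act as a backstop against ones nobody anticipated. The sketch below uses the sample_limit, label_limit, and label_value_length_limit scrape options. The numbers, job name, and target are arbitrary assumptions and should be sized from observed series counts.

```yaml
scrape_configs:
  - job_name: 'api-service'
    # If a scrape exceeds any of these limits, the whole scrape is treated as failed
    sample_limit: 50000            # max samples accepted per scrape, after metric relabeling
    label_limit: 30                # max labels per sample
    label_value_length_limit: 200  # max length of any label value
    static_configs:
      - targets: ['localhost:8080']
```

Because a tripped limit fails the scrape, pairing this with an alert on up == 0 for the job makes the failure visible.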
Prometheus 3.0+ Features:
UTF-8 Support (Prometheus 3.0+):
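- (accessed NOW_ET: https://prometheus.io/blog/2024/11/14/prometheus-3-0/)
- Example: {"http.server.requests", "service.name"="api"} (metric and label names may now contain UTF-8 characters; label values such as endpoint="用户登录" were already valid before 3.0)
OpenTelemetry OTLP Receiver (Prometheus 3.0+):
- Endpoint: /api/v1/otlp/v1/metrics
- Enable the receiver with the --web.enable-otlp-receiver flag; it is served on the existing web port, so there is no separate listener to configure
otlp:
  # Resource attributes to copy onto every ingested series as labels
  promote_resource_attributes:
    - service.name
    - service.namespace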
Remote Write 2.0 (Prometheus 3.0+):
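A minimal sketch of opting a remote_write endpoint into the 2.0 message format, assuming the protobuf_message setting described in the Prometheus configuration reference and a receiver that supports the 2.0 protocol. The URL is a placeholder, and the setting should be checked against the documentation for the Prometheus version in use.

```yaml
remote_write:
  - url: 'https://remote-storage.example.com/api/v1/write'
    # Assumption: selects the Remote Write 2.0 message; omit to keep the 1.0 default
    protobuf_message: io.prometheus.write.v2.Request
```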
Advanced Relabeling Patterns:
Extract Kubernetes Annotations into Labels:
relabel_configs:
  - source_labels: [__meta_kubernetes_pod_annotation_app_version]
    action: replace
    target_label: version
  - source_labels: [__meta_kubernetes_pod_annotation_team]
    action: replace
    target_label: team
Drop Expensive Metrics Based on Name Pattern:
metric_relabel_configs:
  - source_labels: [__name__]
    action: drop
    regex: 'go_.*|process_.*'  # Drop Go runtime and process metrics to save storage
Recording Rules for Aggregation:
Multi-Level Aggregation:
groups:
  - name: instance_aggregation
    interval: 30s
    rules:
      # Level 1: Instance-level
      - record: instance:http_requests_total:rate5m
        expr: sum(rate(http_requests_total[5m])) by (instance, job, status)
      # Level 2: Job-level (aggregates Level 1)
      - record: job:http_requests_total:rate5m
        expr: sum(instance:http_requests_total:rate5m) by (job, status)
      # Level 3: Cluster-level (aggregates Level 2)
      - record: cluster:http_requests_total:rate5m
        expr: sum(job:http_requests_total:rate5m) by (status)
Alertmanager Advanced Features:
Time-Based Routing (Mute Alerts During Maintenance):
route:
  routes:
    - match:
        severity: warning
      mute_time_intervals:
        - weekends
        - maintenance_window
mute_time_intervals:
  - name: weekends
    time_intervals:
      - weekdays: ['saturday', 'sunday']
  - name: maintenance_window
    time_intervals:
      # A time range cannot cross midnight, so 23:00-01:00 is split into two ranges
      - times:
          - start_time: '23:00'
            end_time: '24:00'
          - start_time: '00:00'
            end_time: '01:00'
Grouping by Multiple Labels:
route:
  group_by: ['cluster', 'namespace', 'alertname']
  group_wait: 30s
  group_interval: 5m
  repeat_interval: 12h
Output:
- prometheus.yml with federation endpoints, OTLP receiver, Remote Write 2.0
Token budget: ≤12000 tokens
When to use federation vs remote write:
When to create recording rules:
Alert severity assignment:
Service discovery selection:
- kubernetes_sd_configs with role: pod for dynamic pod discovery
- consul_sd_configs for VM-based infrastructure with Consul service catalog
- ec2_sd_configs for AWS instances with consistent tagging
- file_sd_configs for static infrastructure or external service discovery
Cardinality limits:
- prometheus_tsdb_symbol_table_size_bytes >1GB or prometheus_tsdb_head_series >10M
Abort conditions:
prometheus.yml schema:
global:
  scrape_interval: <duration>
  evaluation_interval: <duration>
  external_labels:
    <label_name>: <label_value>
alerting:
  alertmanagers:
    - static_configs:
        - targets: ['<alertmanager_host>:<port>']
rule_files:
  - 'alerts.yml'
  - 'recording_rules.yml'
scrape_configs:
  - job_name: '<job_name>'
    kubernetes_sd_configs: [...]  # OR consul_sd_configs, ec2_sd_configs, static_configs
    relabel_configs: [...]
    metric_relabel_configs: [...]
alerts.yml schema:
groups:
  - name: <group_name>
    rules:
      - alert: <alert_name>
        expr: <promql_expression>
        for: <duration>
        labels:
          severity: critical|warning|info
        annotations:
          summary: <short_description>
          description: <detailed_description_with_templating>
recording_rules.yml schema:
groups:
  - name: <group_name>
    interval: <duration>
    rules:
      - record: <level>:<metric>:<operations>
        expr: <promql_expression>
        labels:
          <label_name>: <label_value>
alertmanager.yml schema:
route:
  receiver: <default_receiver>
  group_by: [<label_name>, ...]
  group_wait: <duration>
  group_interval: <duration>
  repeat_interval: <duration>
  routes:
    - match:
        <label_name>: <label_value>
      receiver: <receiver_name>
receivers:
  - name: <receiver_name>
    pagerduty_configs: [...]
    slack_configs: [...]
    email_configs: [...]
inhibit_rules:
  - source_match:
      <label_name>: <label_value>
    target_match:
      <label_name>: <label_value>
    equal: [<label_name>, ...]
Required fields:
- prometheus.yml: global.scrape_interval, scrape_configs[].job_name
- alerts.yml: alert, expr, labels.severity, annotations.summary
- recording_rules.yml: record, expr
- alertmanager.yml: route.receiver, receivers[].name
Validation:
- promtool check rules <file.yml>
- promtool check config prometheus.yml
- amtool check-config alertmanager.yml
Scenario: Scrape all pods with prometheus.io/scrape: "true" annotation, create recording rules for API latency.
prometheus.yml:
global:
  scrape_interval: 15s
  evaluation_interval: 15s
scrape_configs:
  - job_name: 'kubernetes-pods'
    kubernetes_sd_configs:
      - role: pod
    relabel_configs:
      - source_labels: [__meta_kubernetes_pod_annotation_prometheus_io_scrape]
        action: keep
        regex: true
      - source_labels: [__meta_kubernetes_pod_annotation_prometheus_io_path]
        action: replace
        target_label: __metrics_path__
        regex: (.+)
      - source_labels: [__address__, __meta_kubernetes_pod_annotation_prometheus_io_port]
        action: replace
        regex: ([^:]+)(?::\d+)?;(\d+)
        replacement: $1:$2
        target_label: __address__
recording_rules.yml:
groups:
  - name: api_latency
    interval: 30s
    rules:
      - record: job:http_request_duration_seconds:p95
        expr: histogram_quantile(0.95, sum(rate(http_request_duration_seconds_bucket[5m])) by (job, le))
      - record: job:http_request_duration_seconds:p99
        expr: histogram_quantile(0.99, sum(rate(http_request_duration_seconds_bucket[5m])) by (job, le))
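For the scrape configuration in this example to pick up a workload, the pod has to carry the annotations that the relabel rules read. A minimal sketch of such a pod follows; the pod name, container name, image, and port 8080 are hypothetical.

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: api-service
  annotations:
    prometheus.io/scrape: "true"    # matched by the keep rule
    prometheus.io/path: "/metrics"  # copied into __metrics_path__
    prometheus.io/port: "8080"      # combined with the pod IP into __address__
spec:
  containers:
    - name: api
      image: example.com/api-service:1.0.0
      ports:
        - containerPort: 8080
```

Annotation values must be quoted strings; an unquoted true or 8080 is rejected by the Kubernetes API.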
Token budgets:
Safety:
- promtool check rules
- Use metric_relabel_configs to drop secrets if accidentally exposed
Auditability:
- level:metric:operations convention
- summary and description with templating
Determinism:
- cluster + alertname produces predictable batches
Performance:
Official Documentation:
Tooling:
- promtool: Validate Prometheus configs and PromQL queries
- amtool: Validate Alertmanager configs and manage silences
Related Skills:
- observability-stack-configurator: Overall observability stack design
- observability-unified-dashboard: Grafana dashboard design with Prometheus datasources
- observability-slo-calculator: SLO/error budget definitions for alerting rules
- kubernetes-manifest-generator: Kubernetes deployment manifests for Prometheus + Alertmanager