prometheus-configuration-specialist
// Configure Prometheus with alerting, recording rules, service discovery (K8s, Consul, EC2), federation, PromQL optimization, and Alertmanager.
| name | Prometheus Configuration Specialist |
| slug | observability-prometheus-configurator |
| description | Configure Prometheus with alerting, recording rules, service discovery (K8s, Consul, EC2), federation, PromQL optimization, and Alertmanager. |
| capabilities | ["Prometheus scrape configuration with service discovery","Alerting rules with multi-window burn rate patterns","Recording rules for pre-computing expensive queries","Relabeling for metric filtering and label transformation","Federation for multi-DC and cross-service monitoring","PromQL query optimization and cardinality management","Alertmanager routing and notification configuration","Prometheus 3.0+ features (UTF-8, OTLP, Remote Write 2.0)"] |
| inputs | ["Service topology and scrape targets","Service discovery mechanism (Kubernetes, Consul, EC2, file_sd)","Alert definitions with severity levels","Recording rule requirements","Alertmanager notification channels (PagerDuty, Slack, email)","Federation topology (if multi-DC or cross-service)","Cardinality constraints and retention requirements"] |
| outputs | ["prometheus.yml configuration file","Alerting rules YAML files","Recording rules YAML files","Alertmanager configuration","Relabeling strategies for cardinality management","PromQL query optimization recommendations","Federation endpoint configuration","Service discovery relabel configs"] |
| keywords | ["prometheus","monitoring","observability","alerting","recording-rules","service-discovery","kubernetes-sd","promql","federation","alertmanager","metrics","relabeling","cardinality","burn-rate","slo"] |
| version | 1.0.0 |
| owner | cognitive-toolworks |
| license | MIT |
| security | No sensitive data allowed in metric labels. Use relabeling to drop secrets. Avoid high-cardinality labels (user IDs, request IDs). |
| links | [{"title":"Prometheus 3.0 Release (November 2024)","url":"https://prometheus.io/blog/2024/11/14/prometheus-3-0/","accessed":"2025-10-26"},{"title":"Prometheus Configuration Documentation","url":"https://prometheus.io/docs/prometheus/latest/configuration/configuration/","accessed":"2025-10-26"},{"title":"Prometheus Alerting Best Practices","url":"https://prometheus.io/docs/practices/alerting/","accessed":"2025-10-26"},{"title":"Prometheus Recording Rules","url":"https://prometheus.io/docs/prometheus/latest/configuration/recording_rules/","accessed":"2025-10-26"},{"title":"Prometheus Naming Conventions","url":"https://prometheus.io/docs/practices/naming/","accessed":"2025-10-26"}] |
Trigger conditions:
Complements:
- observability-stack-configurator: For overall observability stack design
- observability-unified-dashboard: For Grafana dashboard design with Prometheus datasources
- observability-slo-calculator: For SLO/error budget definitions that drive alerting rules
Out of scope:
Time normalization:
- Compute NOW_ET using NIST/time.gov semantics (America/New_York, ISO-8601)
- Use NOW_ET for all access dates in citations
Verify inputs:
Validate service discovery:
- kubernetes_sd_config: Verify Kubernetes API access and RBAC permissions
- consul_sd_config: Verify Consul agent accessibility and service catalog
- ec2_sd_config: Verify AWS credentials and EC2 instance tags
- file_sd_config: Verify JSON/YAML file path and refresh interval
Check cardinality constraints:
- Use metric_relabel_configs to drop high-cardinality labels
Source freshness:
- (accessed NOW_ET)
Abort if:
Scenario: Single service with static targets or file-based service discovery, basic alerting, no recording rules.
Steps:
Global Configuration:
- scrape_interval: 15s (balance between data freshness and storage)
- evaluation_interval: 15s (how often to evaluate alerting/recording rules)
- external_labels for federation or remote write (e.g., datacenter: us-east-1)
Scrape Configuration:
- job_name (logical grouping, e.g., api-service, postgres-exporter)
- static_configs with targets: ['localhost:9090']
- file_sd_configs with files: ['/etc/prometheus/targets/*.json']
- scrape_interval override if different from global
Basic Alerting Rules:
- alerts.yml with groups
- severity: critical|warning|info
Alertmanager Integration:
- alerting.alertmanagers in prometheus.yml (an alertmanager_config entry) with static_configs pointing to the Alertmanager instance
- send_resolved: true (set on the receiver in alertmanager.yml) to notify when an alert resolves
Output:
- prometheus.yml with global config, single scrape job, alerting rules file reference
- alerts.yml with 2-5 basic alerts
Token budget: ≤2000 tokens
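The Tier 1 steps above could be put together roughly as follows. This is a minimal sketch, not a definitive configuration: the Alertmanager address (localhost:9093), the group name basic_alerts, and the InstanceDown alert are assumptions chosen for illustration. The two files are shown in one block, separated by a YAML document marker, and would be saved separately.

```yaml
# prometheus.yml (sketch; addresses and names are placeholders)
global:
  scrape_interval: 15s
  evaluation_interval: 15s
  external_labels:
    datacenter: us-east-1

rule_files:
  - alerts.yml

alerting:
  alertmanagers:
    - static_configs:
        - targets: ['localhost:9093']   # assumed Alertmanager address

scrape_configs:
  - job_name: api-service
    static_configs:
      - targets: ['localhost:9090']
    # Alternative to static_configs: file-based discovery
    # file_sd_configs:
    #   - files: ['/etc/prometheus/targets/*.json']
---
# alerts.yml (sketch; one basic alert on the built-in "up" metric)
groups:
  - name: basic_alerts
    rules:
      - alert: InstanceDown
        expr: up == 0
        for: 5m
        labels:
          severity: critical
        annotations:
          summary: "Instance {{ $labels.instance }} is down"
          description: "{{ $labels.instance }} of job {{ $labels.job }} has been unreachable for more than 5 minutes."
```

Running promtool check config prometheus.yml against the first file also checks the rule files it references.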
Scenario: Multiple services with Kubernetes/Consul/EC2 service discovery, recording rules for expensive queries, Alertmanager routing with grouping.
Steps:
Service Discovery Configuration:
Kubernetes Service Discovery:
- kubernetes_sd_configs with role: pod (discover all pods with prometheus.io/scrape: "true" annotation)

relabel_configs:
  - source_labels: [__meta_kubernetes_pod_annotation_prometheus_io_scrape]
    action: keep
    regex: true
  - source_labels: [__meta_kubernetes_pod_annotation_prometheus_io_path]
    action: replace
    target_label: __metrics_path__
    regex: (.+)
  - source_labels: [__address__, __meta_kubernetes_pod_annotation_prometheus_io_port]
    action: replace
    target_label: __address__
    regex: ([^:]+)(?::\d+)?;(\d+)
    replacement: $1:$2
  - source_labels: [__meta_kubernetes_namespace]
    action: replace
    target_label: kubernetes_namespace
  - source_labels: [__meta_kubernetes_pod_name]
    action: replace
    target_label: kubernetes_pod_name

- Roles: node, pod, service, endpoints, ingress (accessed NOW_ET: https://prometheus.io/docs/prometheus/latest/configuration/configuration/#kubernetes_sd_config)
Consul Service Discovery:
- consul_sd_configs with server: 'consul.service.consul:8500'
- tags: ['production', 'monitoring-enabled']
EC2 Service Discovery:
- ec2_sd_configs with AWS region and filters
- __meta_ec2_tag_<tagkey>
Recording Rules:
- Naming convention: level:metric:operations (accessed NOW_ET: https://prometheus.io/docs/practices/naming/)
- level: aggregation level (job, instance, cluster)
- operations: operations applied (sum, avg, rate)

groups:
  - name: api_recording_rules
    interval: 30s
    rules:
      - record: job:http_requests_total:rate5m
        expr: sum(rate(http_requests_total[5m])) by (job)
      - record: job:http_request_duration_seconds:p95
        expr: histogram_quantile(0.95, sum(rate(http_request_duration_seconds_bucket[5m])) by (job, le))
Relabeling Strategies:
Metric Relabeling (metric_relabel_configs):
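metric_relabel_configs:
  # labeldrop matches label names against regex; it does not use source_labels
  - action: labeldrop
    regex: user_id
  # Drop whole series whose metric name matches the pattern
  - source_labels: [__name__]
    action: drop
    regex: 'expensive_metric_.*'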
Target Relabeling (relabel_configs):
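Target relabeling runs before the scrape and operates on discovery metadata, so it fits the EC2 labels mentioned earlier (__meta_ec2_tag_<tagkey>). The following is a sketch under stated assumptions: the job name, the region, the Monitoring tag, and port 9100 are hypothetical and would need to match the actual environment.

```yaml
scrape_configs:
  - job_name: ec2-node-exporter          # hypothetical job name
    ec2_sd_configs:
      - region: us-east-1                # assumed region
        port: 9100                       # assumed exporter port
        filters:
          - name: tag:Monitoring         # hypothetical tag key
            values: ['enabled']
    relabel_configs:
      # Keep only instances that are currently running
      - source_labels: [__meta_ec2_instance_state]
        action: keep
        regex: running
      # Copy the EC2 "Name" tag into the instance label
      - source_labels: [__meta_ec2_tag_Name]
        action: replace
        target_label: instance
      # Expose the availability zone as a regular label
      - source_labels: [__meta_ec2_availability_zone]
        action: replace
        target_label: availability_zone
```

Labels starting with __meta_ are discarded after target relabeling, so anything needed at query time has to be copied to a regular label here.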
Alerting Rules (Advanced):
Multi-Window Burn Rate Alerts:
groups:
  - name: slo_alerts
    rules:
      - alert: ErrorBudgetBurn_Critical
        # Aggregate by job so that {{ $labels.job }} is populated in the annotations
        expr: |
          (
            sum by (job) (rate(http_requests_total{status=~"5.."}[1h]))
            /
            sum by (job) (rate(http_requests_total[1h]))
          ) > (14.4 * 0.001)
        for: 2m
        labels:
          severity: critical
        annotations:
          summary: "Error budget burning 14.4× faster than allowed"
          description: "{{ $labels.job }} has {{ $value | humanizePercentage }} error rate (SLO: 99.9%, budget exhausted in 2 days)"
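The rule above evaluates a single 1h window. A multi-window variant, in the style described in the Google SRE Workbook, might additionally require a short window (5m here) to exceed the same threshold, so the alert stops firing soon after the errors stop. This is a sketch: the group and alert names are hypothetical, and the 99.9% SLO (error budget 0.001) is taken from the rule above.

```yaml
groups:
  - name: slo_alerts_multiwindow                      # hypothetical group name
    rules:
      - alert: ErrorBudgetBurn_Critical_MultiWindow   # hypothetical alert name
        # Fires only when both the long (1h) and short (5m) windows
        # burn budget faster than 14.4x the allowed rate.
        expr: |
          (
            sum by (job) (rate(http_requests_total{status=~"5.."}[1h]))
            /
            sum by (job) (rate(http_requests_total[1h]))
          ) > (14.4 * 0.001)
          and
          (
            sum by (job) (rate(http_requests_total{status=~"5.."}[5m]))
            /
            sum by (job) (rate(http_requests_total[5m]))
          ) > (14.4 * 0.001)
        labels:
          severity: critical
        annotations:
          summary: "Error budget burning 14.4× faster than allowed (1h and 5m windows)"
          description: "{{ $labels.job }} has {{ $value | humanizePercentage }} error rate over the last hour"
```

Both sides of the and operator carry only the job label, so they match one-to-one per job.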
Symptom-Based Alerts:
- alert: HighLatency
  expr: histogram_quantile(0.95, sum(rate(http_request_duration_seconds_bucket[5m])) by (job, le)) > 0.5
  for: 5m
  labels:
    severity: warning
  annotations:
    summary: "High latency on {{ $labels.job }}"
    description: "p95 latency is {{ $value }}s (threshold: 0.5s)"
Alertmanager Routing:
- Group by cluster + alertname, wait 30s for batch

route:
  receiver: 'default-email'
  group_by: ['cluster', 'alertname']
  group_wait: 30s
  group_interval: 5m
  repeat_interval: 4h
  routes:
    - match:
        severity: critical
      receiver: 'pagerduty'
    - match:
        severity: warning
      receiver: 'slack'
receivers:
  - name: 'pagerduty'
    pagerduty_configs:
      - service_key: '<PD_SERVICE_KEY>'
  - name: 'slack'
    slack_configs:
      - api_url: '<SLACK_WEBHOOK_URL>'
        channel: '#alerts'
        title: '{{ .GroupLabels.alertname }}'
        text: '{{ range .Alerts }}{{ .Annotations.description }}{{ end }}'
  - name: 'default-email'
    email_configs:
      - to: 'ops@example.com'
inhibit_rules:
  - source_match:
      severity: critical
    target_match:
      severity: warning
    equal: ['cluster', 'alertname']
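Recent Alertmanager releases deprecate match, source_match, and target_match in favor of matchers, source_matchers, and target_matchers. Assuming such a release is in use, the same routing tree and inhibition rule might be written as below; the receivers section is unchanged from the configuration above and is omitted here.

```yaml
route:
  receiver: 'default-email'
  group_by: ['cluster', 'alertname']
  group_wait: 30s
  group_interval: 5m
  repeat_interval: 4h
  routes:
    # Each matcher is a string of the form label<op>"value"
    - matchers:
        - 'severity="critical"'
      receiver: 'pagerduty'
    - matchers:
        - 'severity="warning"'
      receiver: 'slack'
inhibit_rules:
  # Suppress warnings while a critical alert with the same
  # cluster and alertname is firing.
  - source_matchers:
      - 'severity="critical"'
    target_matchers:
      - 'severity="warning"'
    equal: ['cluster', 'alertname']
```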
Output:
- prometheus.yml with Kubernetes/Consul/EC2 service discovery, relabeling configs
- recording_rules.yml with 5-10 recording rules (level:metric:operations naming)
- alerts.yml with multi-window burn rate alerts and symptom-based alerts
- alertmanager.yml with routing tree, receivers, inhibition rules
Token budget: ≤6000 tokens
Scenario: Multi-datacenter federation, cardinality management, PromQL query optimization, Prometheus 3.0+ features (UTF-8, OTLP, Remote Write 2.0).
Steps:
Federation Configuration:
Hierarchical Federation (Multi-DC):
- (accessed NOW_ET: https://prometheus.io/docs/prometheus/latest/federation/)

scrape_configs:
  - job_name: 'federate-us-east-1'
    scrape_interval: 30s
    honor_labels: true
    metrics_path: '/federate'
    params:
      'match[]':
        - '{job="prometheus"}'
        - '{__name__=~"job:.*"}' # Only federate aggregated recording rules
    static_configs:
      - targets:
          - 'prometheus-us-east-1:9090'
          - 'prometheus-us-west-2:9090'
Cross-Service Federation:
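In cross-service federation, the Prometheus server of one service scrapes selected series from the Prometheus server of another service, so that both datasets can be queried and alerted on together. A sketch, assuming a hypothetical upstream server prometheus-cluster-manager:9090 that exposes job-level recording rules for a job named cluster-manager:

```yaml
scrape_configs:
  - job_name: 'federate-cluster-manager'   # hypothetical job name
    scrape_interval: 30s
    honor_labels: true                      # keep the labels set by the source server
    metrics_path: '/federate'
    params:
      'match[]':
        # Pull only the aggregated series of the other service
        - '{job="cluster-manager", __name__=~"job:.*"}'
    static_configs:
      - targets:
          - 'prometheus-cluster-manager:9090'   # hypothetical upstream server
```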
PromQL Optimization:
Query Performance Best Practices:
- sum(http_requests_total) (aggregates 10k+ time series)
- sum(http_requests_total{job="api-service", status=~"5.."}) (aggregates 10-50 time series)
- (e.g., api_http_requests_total) without labels
- (accessed NOW_ET: https://prometheus.io/docs/prometheus/latest/querying/basics/)

# Compute error rate using pre-recorded job-level metrics (fast)
# Requires a recording rule that keeps the status label, e.g. sum(...) by (job, status).
# Both sides are summed so the division does not pair each status series with itself.
sum(job:http_requests_total:rate5m{job="api-service", status=~"5.."})
/
sum(job:http_requests_total:rate5m{job="api-service"})
Cardinality Management:
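- topk(10, count by (__name__)({__name__=~".+"})) to find high-cardinality metrics
- metric_relabel_configs to remove high-cardinality labels
- metric_relabel_configs with action: drop to sample metrics

metric_relabel_configs:
  # Drop user_id label (high cardinality); labeldrop matches label names against regex
  - action: labeldrop
    regex: user_id
  # Keep only 5xx errors of http_requests_total (reduce cardinality of status label).
  # A bare "keep" on status would also discard every metric that has no status label,
  # so the rule drops the non-5xx series of this one metric instead.
  - source_labels: [__name__, status]
    action: drop
    regex: 'http_requests_total;[1-4]..'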
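Relabeling removes cardinality after the fact. As a complementary guard, per-job scrape limits can make a scrape fail outright when a target starts exposing far more series or labels than expected. The values below are illustrative assumptions, not recommendations, and the target address is a placeholder.

```yaml
scrape_configs:
  - job_name: api-service
    # Fail the scrape if the target returns more samples than this
    # (counted after metric_relabel_configs are applied).
    sample_limit: 50000
    # Fail the scrape if any sample carries more labels than this.
    label_limit: 30
    # Fail the scrape if any label value is longer than this.
    label_value_length_limit: 200
    static_configs:
      - targets: ['localhost:9090']   # placeholder target
```

A failed scrape shows up as up == 0 for the target, so an existing availability alert would surface the limit being hit.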
Prometheus 3.0+ Features:
UTF-8 Support (Prometheus 3.0+):
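- (accessed NOW_ET: https://prometheus.io/blog/2024/11/14/prometheus-3-0/)
- http_requests_total{endpoint="用户登录"} (label values have always accepted UTF-8; Prometheus 3.0 extends UTF-8 to metric and label names, written with quoted syntax such as {"http.requests.total", "service.name"="api"})
OpenTelemetry OTLP Receiver (Prometheus 3.0+):
- /api/v1/otlp/v1/metrics (enabled with the --web.enable-otlp-receiver flag and served on the regular web port; there is no separate listener to configure)

otlp:
  # Resource attributes to copy onto every series as labels
  promote_resource_attributes:
    - service.name
    - service.namespace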
Remote Write 2.0 (Prometheus 3.0+):
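A remote write entry that opts into the 2.0 protocol might look like the sketch below. The URL is a hypothetical endpoint, and the receiving system is assumed to support the Remote Write 2.0 message; when protobuf_message is omitted, Prometheus uses the 1.0 message (prometheus.WriteRequest).

```yaml
remote_write:
  - url: https://remote-storage.example.com/api/v1/write   # hypothetical endpoint
    # Select the Remote Write 2.0 protobuf message
    protobuf_message: io.prometheus.write.v2.Request
```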
Advanced Relabeling Patterns:
Extract Kubernetes Annotations into Labels:
relabel_configs:
  - source_labels: [__meta_kubernetes_pod_annotation_app_version]
    action: replace
    target_label: version
  - source_labels: [__meta_kubernetes_pod_annotation_team]
    action: replace
    target_label: team
Drop Expensive Metrics Based on Name Pattern:
metric_relabel_configs:
  - source_labels: [__name__]
    action: drop
    regex: 'go_.*|process_.*' # Drop Go runtime metrics to save storage
Recording Rules for Aggregation:
Multi-Level Aggregation:
groups:
  - name: instance_aggregation
    interval: 30s
    rules:
      # Level 1: Instance-level
      - record: instance:http_requests_total:rate5m
        expr: sum(rate(http_requests_total[5m])) by (instance, job, status)
      # Level 2: Job-level (aggregates Level 1)
      - record: job:http_requests_total:rate5m
        expr: sum(instance:http_requests_total:rate5m) by (job, status)
      # Level 3: Cluster-level (aggregates Level 2)
      - record: cluster:http_requests_total:rate5m
        expr: sum(job:http_requests_total:rate5m) by (status)
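Because the job-level rule above keeps the status label, a further rule can derive an error ratio from it without touching the raw series. The sketch below assumes it is placed in the same group, after the rules above, so that each evaluation sees freshly recorded values; the record name is a hypothetical choice.

```yaml
groups:
  - name: instance_aggregation
    interval: 30s
    rules:
      # ... Level 1 to Level 3 rules from above go here ...
      # Share of 5xx responses per job, derived from the Level 2 series
      - record: job:http_requests:error_ratio_rate5m   # hypothetical record name
        expr: |
          sum by (job) (job:http_requests_total:rate5m{status=~"5.."})
          /
          sum by (job) (job:http_requests_total:rate5m)
```

Rules within one group are evaluated sequentially, which is why the ordering inside the group matters here.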
Alertmanager Advanced Features:
Time-Based Routing (Mute Alerts During Maintenance):
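route:
  routes:
    - match:
        severity: warning
      mute_time_intervals:
        - weekends
        - maintenance_window
mute_time_intervals:
  - name: weekends
    time_intervals:
      - weekdays: ['saturday', 'sunday']
  - name: maintenance_window
    time_intervals:
      # A time range must end after it starts, so a window that
      # crosses midnight (23:00 to 01:00) is split into two ranges.
      - times:
          - start_time: '23:00'
            end_time: '24:00'
          - start_time: '00:00'
            end_time: '01:00'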
Grouping by Multiple Labels:
route:
  group_by: ['cluster', 'namespace', 'alertname']
  group_wait: 30s
  group_interval: 5m
  repeat_interval: 12h
Output:
- prometheus.yml with federation endpoints, OTLP receiver, Remote Write 2.0
Token budget: ≤12000 tokens
When to use federation vs remote write:
When to create recording rules:
Alert severity assignment:
Service discovery selection:
- kubernetes_sd_configs with role: pod for dynamic pod discovery
- consul_sd_configs for VM-based infrastructure with Consul service catalog
- ec2_sd_configs for AWS instances with consistent tagging
- file_sd_configs for static infrastructure or external service discovery
Cardinality limits:
- prometheus_tsdb_symbol_table_size_bytes >1GB or prometheus_tsdb_head_series >10M
Abort conditions:
prometheus.yml schema:
global:
  scrape_interval: <duration>
  evaluation_interval: <duration>
  external_labels:
    <label_name>: <label_value>
alerting:
  alertmanagers:
    - static_configs:
        - targets: ['<alertmanager_host>:<port>']
rule_files:
  - 'alerts.yml'
  - 'recording_rules.yml'
scrape_configs:
  - job_name: '<job_name>'
    kubernetes_sd_configs: [...] # OR consul_sd_configs, ec2_sd_configs, static_configs
    relabel_configs: [...]
    metric_relabel_configs: [...]
alerts.yml schema:
groups:
  - name: <group_name>
    rules:
      - alert: <alert_name>
        expr: <promql_expression>
        for: <duration>
        labels:
          severity: critical|warning|info
        annotations:
          summary: <short_description>
          description: <detailed_description_with_templating>
recording_rules.yml schema:
groups:
  - name: <group_name>
    interval: <duration>
    rules:
      - record: <level>:<metric>:<operations>
        expr: <promql_expression>
        labels:
          <label_name>: <label_value>
alertmanager.yml schema:
route:
  receiver: <default_receiver>
  group_by: [<label_name>, ...]
  group_wait: <duration>
  group_interval: <duration>
  repeat_interval: <duration>
  routes:
    - match:
        <label_name>: <label_value>
      receiver: <receiver_name>
receivers:
  - name: <receiver_name>
    pagerduty_configs: [...]
    slack_configs: [...]
    email_configs: [...]
inhibit_rules:
  - source_match:
      <label_name>: <label_value>
    target_match:
      <label_name>: <label_value>
    equal: [<label_name>, ...]
Required fields:
- prometheus.yml: global.scrape_interval, scrape_configs[].job_name
- alerts.yml: alert, expr, labels.severity, annotations.summary
- recording_rules.yml: record, expr
- alertmanager.yml: route.receiver, receivers[].name
Validation:
- promtool check rules <file.yml>
- promtool check config prometheus.yml
- amtool check-config alertmanager.yml
Scenario: Scrape all pods with prometheus.io/scrape: "true" annotation, create recording rules for API latency.
prometheus.yml:
global:
  scrape_interval: 15s
  evaluation_interval: 15s
# Load the recording rules defined in recording_rules.yml
rule_files:
  - 'recording_rules.yml'
scrape_configs:
  - job_name: 'kubernetes-pods'
    kubernetes_sd_configs:
      - role: pod
    relabel_configs:
      # Keep only pods annotated with prometheus.io/scrape: "true"
      - source_labels: [__meta_kubernetes_pod_annotation_prometheus_io_scrape]
        action: keep
        regex: true
      # Use the annotated metrics path when one is set
      - source_labels: [__meta_kubernetes_pod_annotation_prometheus_io_path]
        action: replace
        target_label: __metrics_path__
        regex: (.+)
      # Rewrite the scrape address to use the annotated port
      - source_labels: [__address__, __meta_kubernetes_pod_annotation_prometheus_io_port]
        action: replace
        regex: ([^:]+)(?::\d+)?;(\d+)
        replacement: $1:$2
        target_label: __address__
recording_rules.yml:
groups:
  - name: api_latency
    interval: 30s
    rules:
      - record: job:http_request_duration_seconds:p95
        expr: histogram_quantile(0.95, sum(rate(http_request_duration_seconds_bucket[5m])) by (job, le))
      - record: job:http_request_duration_seconds:p99
        expr: histogram_quantile(0.99, sum(rate(http_request_duration_seconds_bucket[5m])) by (job, le))
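For the scrape configuration in this example to pick up a workload, the pod needs the matching annotations. A sketch of such a pod follows; the pod name, image, and port 8080 are hypothetical, and only the three prometheus.io/* annotations are what the relabeling rules read.

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: api-service                            # hypothetical pod name
  namespace: default
  annotations:
    prometheus.io/scrape: "true"               # matched by the keep rule
    prometheus.io/path: "/metrics"             # copied into __metrics_path__
    prometheus.io/port: "8080"                 # used to rewrite __address__
spec:
  containers:
    - name: api
      image: example.com/api-service:1.0.0     # hypothetical image
      ports:
        - containerPort: 8080
```

Annotation values must be strings, which is why "true" and "8080" are quoted.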
Token budgets:
Safety:
- promtool check rules
- metric_relabel_configs to drop secrets if accidentally exposed
Auditability:
- level:metric:operations convention
- summary and description with templating
Determinism:
- cluster + alertname produces predictable batches
Performance:
Official Documentation:
- (accessed NOW_ET)
- (accessed NOW_ET)
- (accessed NOW_ET)
- (accessed NOW_ET)
- (accessed NOW_ET)
- (accessed NOW_ET)
Tooling:
- promtool: Validate Prometheus configs and PromQL queries
- amtool: Validate Alertmanager configs and manage silences
Related Skills:
- observability-stack-configurator: Overall observability stack design
- observability-unified-dashboard: Grafana dashboard design with Prometheus datasources
- observability-slo-calculator: SLO/error budget definitions for alerting rules
- kubernetes-manifest-generator: Kubernetes deployment manifests for Prometheus + Alertmanager