| name | prometheus-configuration |
| description | Set up Prometheus for comprehensive metric collection, storage, and monitoring of infrastructure and applications. Use when implementing metrics collection, setting up monitoring infrastructure, or configuring alerting systems. |
Prometheus Configuration
Complete guide to Prometheus setup, metric collection, scrape configuration, and recording rules.
Purpose
Configure Prometheus for comprehensive metric collection, alerting, and monitoring of infrastructure and applications.
When to Use
- Set up Prometheus monitoring
- Configure metric scraping
- Create recording rules
- Design alert rules
- Implement service discovery
Prometheus Architecture
┌──────────────┐
│ Applications │ ← Instrumented with client libraries
└──────┬───────┘
│ /metrics endpoint
↓
┌──────────────┐
│ Prometheus │ ← Scrapes metrics periodically
│ Server │
└──────┬───────┘
│
├─→ AlertManager (alerts)
├─→ Grafana (visualization)
└─→ Long-term storage (Thanos/Cortex)
Installation
Kubernetes with Helm
helm repo add prometheus-community https://prometheus-community.github.io/helm-charts
helm repo update
helm install prometheus prometheus-community/kube-prometheus-stack \
--namespace monitoring \
--create-namespace \
--set prometheus.prometheusSpec.retention=30d \
--set prometheus.prometheusSpec.storageVolumeSize=50Gi
Docker Compose
version: "3.8"
services:
prometheus:
image: prom/prometheus:latest
ports:
- "9090:9090"
volumes:
- ./prometheus.yml:/etc/prometheus/prometheus.yml
- prometheus-data:/prometheus
command:
- "--config.file=/etc/prometheus/prometheus.yml"
- "--storage.tsdb.path=/prometheus"
- "--storage.tsdb.retention.time=30d"
volumes:
prometheus-data:
Configuration File
prometheus.yml:
global:
scrape_interval: 15s
evaluation_interval: 15s
external_labels:
cluster: "production"
region: "us-west-2"
alerting:
alertmanagers:
- static_configs:
- targets:
- alertmanager:9093
rule_files:
- /etc/prometheus/rules/*.yml
scrape_configs:
- job_name: "prometheus"
static_configs:
- targets: ["localhost:9090"]
- job_name: "node-exporter"
static_configs:
- targets:
- "node1:9100"
- "node2:9100"
- "node3:9100"
relabel_configs:
- source_labels: [__address__]
target_label: instance
regex: "([^:]+)(:[0-9]+)?"
replacement: "${1}"
- job_name: "kubernetes-pods"
kubernetes_sd_configs:
- role: pod
relabel_configs:
- source_labels: [__meta_kubernetes_pod_annotation_prometheus_io_scrape]
action: keep
regex: true
- source_labels: [__meta_kubernetes_pod_annotation_prometheus_io_path]
action: replace
target_label: __metrics_path__
regex: (.+)
- source_labels:
[__address__, __meta_kubernetes_pod_annotation_prometheus_io_port]
action: replace
regex: ([^:]+)(?::\d+)?;(\d+)
replacement: $1:$2
target_label: __address__
- source_labels: [__meta_kubernetes_namespace]
action: replace
target_label: namespace
- source_labels: [__meta_kubernetes_pod_name]
action: replace
target_label: pod
- job_name: "my-app"
static_configs:
- targets:
- "app1.example.com:9090"
- "app2.example.com:9090"
metrics_path: "/metrics"
scheme: "https"
tls_config:
ca_file: /etc/prometheus/ca.crt
cert_file: /etc/prometheus/client.crt
key_file: /etc/prometheus/client.key
Reference: See assets/prometheus.yml.template
Scrape Configurations
Static Targets
scrape_configs:
- job_name: "static-targets"
static_configs:
- targets: ["host1:9100", "host2:9100"]
labels:
env: "production"
region: "us-west-2"
File-based Service Discovery
scrape_configs:
- job_name: "file-sd"
file_sd_configs:
- files:
- /etc/prometheus/targets/*.json
- /etc/prometheus/targets/*.yml
refresh_interval: 5m
targets/production.json:
[
{
"targets": ["app1:9090", "app2:9090"],
"labels": {
"env": "production",
"service": "api"
}
}
]
Kubernetes Service Discovery
scrape_configs:
- job_name: "kubernetes-services"
kubernetes_sd_configs:
- role: service
relabel_configs:
- source_labels:
[__meta_kubernetes_service_annotation_prometheus_io_scrape]
action: keep
regex: true
- source_labels:
[__meta_kubernetes_service_annotation_prometheus_io_scheme]
action: replace
target_label: __scheme__
regex: (https?)
- source_labels: [__meta_kubernetes_service_annotation_prometheus_io_path]
action: replace
target_label: __metrics_path__
regex: (.+)
Reference: See references/scrape-configs.md
Recording Rules
Create pre-computed metrics for frequently queried expressions:
groups:
- name: api_metrics
interval: 15s
rules:
- record: job:http_requests:rate5m
expr: sum by (job) (rate(http_requests_total[5m]))
- record: job:http_requests_errors:rate5m
expr: sum by (job) (rate(http_requests_total{status=~"5.."}[5m]))
- record: job:http_requests_error_rate:percentage
expr: |
(job:http_requests_errors:rate5m / job:http_requests:rate5m) * 100
- record: job:http_request_duration:p95
expr: |
histogram_quantile(0.95,
sum by (job, le) (rate(http_request_duration_seconds_bucket[5m]))
)
- name: resource_metrics
interval: 30s
rules:
- record: instance:node_cpu:utilization
expr: |
100 - (avg by (instance) (rate(node_cpu_seconds_total{mode="idle"}[5m])) * 100)
- record: instance:node_memory:utilization
expr: |
100 - ((node_memory_MemAvailable_bytes / node_memory_MemTotal_bytes) * 100)
- record: instance:node_disk:utilization
expr: |
100 - ((node_filesystem_avail_bytes / node_filesystem_size_bytes) * 100)
Reference: See references/recording-rules.md
Alert Rules
groups:
- name: availability
interval: 30s
rules:
- alert: ServiceDown
expr: up{job="my-app"} == 0
for: 1m
labels:
severity: critical
annotations:
summary: "Service {{ $labels.instance }} is down"
description: "{{ $labels.job }} has been down for more than 1 minute"
- alert: HighErrorRate
expr: job:http_requests_error_rate:percentage > 5
for: 5m
labels:
severity: warning
annotations:
summary: "High error rate for {{ $labels.job }}"
description: "Error rate is {{ $value }}% (threshold: 5%)"
- alert: HighLatency
expr: job:http_request_duration:p95 > 1
for: 5m
labels:
severity: warning
annotations:
summary: "High latency for {{ $labels.job }}"
description: "P95 latency is {{ $value }}s (threshold: 1s)"
- name: resources
interval: 1m
rules:
- alert: HighCPUUsage
expr: instance:node_cpu:utilization > 80
for: 5m
labels:
severity: warning
annotations:
summary: "High CPU usage on {{ $labels.instance }}"
description: "CPU usage is {{ $value }}%"
- alert: HighMemoryUsage
expr: instance:node_memory:utilization > 85
for: 5m
labels:
severity: warning
annotations:
summary: "High memory usage on {{ $labels.instance }}"
description: "Memory usage is {{ $value }}%"
- alert: DiskSpaceLow
expr: instance:node_disk:utilization > 90
for: 5m
labels:
severity: critical
annotations:
summary: "Low disk space on {{ $labels.instance }}"
description: "Disk usage is {{ $value }}%"
Validation
promtool check config prometheus.yml
promtool check rules /etc/prometheus/rules/*.yml
promtool query instant http://localhost:9090 'up'
Reference: See scripts/validate-prometheus.sh
Best Practices
- Use consistent naming for metrics (prefix_name_unit)
- Set appropriate scrape intervals (15-60s typical)
- Use recording rules for expensive queries
- Implement high availability (multiple Prometheus instances)
- Configure retention based on storage capacity
- Use relabeling for metric cleanup
- Monitor Prometheus itself
- Implement federation for large deployments
- Use Thanos/Cortex for long-term storage
- Document custom metrics
Troubleshooting
Check scrape targets:
curl http://localhost:9090/api/v1/targets
Check configuration:
curl http://localhost:9090/api/v1/status/config
Test query:
curl 'http://localhost:9090/api/v1/query?query=up'
Related Skills
grafana-dashboards - For visualization
slo-implementation - For SLO monitoring
distributed-tracing - For request tracing