| name | dashboards-as-code |
| description | Use this skill when building, modifying, reviewing, or pushing Grafana dashboards under `packages/grafana-dashboards/` (Materialize observability dashboards generated from Python via `grafana-foundation-sdk` and `py-mzmon-lib`). Also use it when writing panel descriptions for those dashboards, picking palettes, or working through Materialize-specific PromQL patterns (cluster/replica filtering, peek latency, source/sink metrics, label-family quirks).
Dashboards as Code
This skill is the entry point for the Materialize dashboards-as-code project. Stable conventions live in the repo docsite under docs/content/reference/internal/dashboard/ — this file is intentionally slim and links into the docsite at heading-level granularity. The non-link content below is the state snapshot: what currently exists, what's in flight, and what's queued for cleanup.
Audience reminder
The dashboards themselves target Materialize end users: database-literate operators with basic graph-reading fluency but minimal cloud / Kubernetes / observability expertise. SQL is fair game; jargon like "differential dataflow's arrangement" needs a one-liner explanation. Panel descriptions, titles, and cluster names should respect that baseline.
The docsite reference pages target repo contributors (SRE, Field Engineering, CloudOps, Database Engineers) and AI agents reading this skill.
Where to find what
| Looking for… | Read |
|---|---|
| Grafana target versions, Dashboard v1/v2 schema state, SDK choices | SDKs and Schemas |
| Code structure, UID conventions, push process, gcx dashboards update vs ad-hoc v2 API | Generating and Pushing Dashboards |
| Palettes, layouts, panel visualization, panel description voice, PromQL conventions, label families, metric quirks, PromQL recipes, module-level constants table | Style Guidelines |
| Testing conventions (currently sparse) | Testing |
Frequently needed deep links into the Style Guidelines:
And into Generating:
Schema reference files
When uncertain about the exact shape Grafana expects, the cog-generated openapi schemas are bundled here:
- references/dashboard.openapi.json: v1
- references/dashboardv2beta1.openapi.json: v2beta1
- references/dashboardv2.openapi.json: v2
All three were generated from cog commit 61ff0a6055fa48f0c7b105fe4a37af637191314f (April 9, 2026).
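When checking a field against one of these files, a small lookup script can be quicker than scrolling the JSON. The sketch below assumes the bundled files follow the standard OpenAPI 3 layout, with type definitions under `components.schemas`; the helper names `list_schema_names` and `required_fields` are hypothetical and not part of this repo.

```python
import json
from pathlib import Path


def _load(openapi_path):
    """Parse an OpenAPI JSON document from disk."""
    return json.loads(Path(openapi_path).read_text(encoding="utf-8"))


def list_schema_names(openapi_path):
    """Return the sorted schema names defined in the document.

    Assumption: definitions live under components.schemas (OpenAPI 3).
    """
    doc = _load(openapi_path)
    return sorted(doc.get("components", {}).get("schemas", {}))


def required_fields(openapi_path, schema_name):
    """Return the `required` list of one schema, or [] if it declares none."""
    doc = _load(openapi_path)
    schema = doc["components"]["schemas"][schema_name]
    return list(schema.get("required", []))
```

For example, `list_schema_names("references/dashboardv2.openapi.json")` would list the v2 types, and `required_fields(...)` shows which keys a given type must carry before Grafana accepts it.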
Current Dashboard State
This section captures the live state of the dashboards in this repo so the next session has something concrete to start from. Update it when state changes meaningfully (new dashboard, new tab, retired panel, theme reassignment).
Dashboard inventory
| Family | Dashboard module | Class | Live UID |
|---|---|---|---|
| mz_environment | overview.overview_dashboard | EnvironmentOverviewDashboard | (auto-assigned at first upload; codified UID is mz-mon-env-top, but the live one diverged before that became authoritative; see UID selection and behavior) |
The mz_environment/overview dashboard has six tabs, in declared order:
| # | Tab title | Module | Theme |
|---|---|---|---|
| 1 | Summary | summary.py | (no unique theme; uses health palette and themes from imports) |
| 2 | Kubernetes Workloads | k8s_resources.py | K8S_THEME = palette.THEME_PALETTE[0] (blue) |
| 3 | Cluster Objects / Replicas | cluster_objects.py | CLUSTERS_THEME = palette.THEME_PALETTE[2] (teal) |
| 4 | Connections / Activity | connections_activity.py | CONNECTIONS_THEME = palette.THEME_PALETTE[1] (cyan) |
| 5 | Compute Objects | compute_objects.py | COMPUTE_THEME = palette.THEME_PALETTE[3] (orange) |
| 6 | Storage Objects | storage_objects.py | STORAGE_THEME = palette.THEME_PALETTE[4] (yellow) |
The Summary tab re-uses the KubeResourcesMixin's cpu_total_panel and memory_totals_panel, and also mirrors add_currently_hydrating_panel(...) from compute_objects.py in its Environment Health row.
Tab-by-tab row structure
Summary
- Environment Health — Environment Status, Availability, Last Restart, Currently Hydrating (mirror), Current CPU Usage, Current Memory Usage
- Environment Info — Materialize Version, Total CPU Capacity, Total Memory
Kubernetes Workloads
- Resources Summary — Total CPU Capacity, Total Memory (includes monitoring)
- Workload Readiness — Pod Readiness, StatefulSet Readiness, Deployment Readiness
- Pod Metrics — Pod CPU Usage, Pod Memory Usage
- Pod Networking — Rx, Tx, Errors, Packet Drops
Cluster Objects / Replicas
- Cluster Summary — Cluster Count, Replica Count
- Replication / Availability — Replica Sizes (donut), Replica AZs
- Cluster Information — Cluster Information table
Connections / Activity
- Connection Summary — Active Sessions, Active Queries, Adapter Command Rate
- Queries — Distribution donut, Query Rate, Peek Latency p50/p90/p99 (3 separate panels)
- Adapter Commands — Adapter Commands by Application table
Compute Objects
- Compute Objects Summary — Active MV, Active Indexes, Active Views, Active Subscribes (donut), Index Types (donut)
- Freshness — STUB row, no panels yet (placeholder title only)
- Hydration — Currently Hydrating, Hydration Queue Size, Slowest Hydrating Collections (top-15 horizontal bar)
- Dataflows — Dataflow Count, Dataflow Count (per worker), Dataflow Elapsed Rate (log scale)
- Arrangements — Arrangement Rate, Arrangement Rate (per worker), 3 record-count tables (System / User / Transient)
Storage Objects
- Storage Objects Summary — Active Sources, Active Sinks, Active Tables
- Sources — Source Types donut, Sources by Status table, Source Bytes Received (rate)
- Sinks — Sink Types donut, Sink Throughput, Sink Lag (staged minus committed)
- Iceberg Sinks (collapsed by default) — Commit Latency p50/p90/p99, Commit Failures & Conflicts, File & Snapshot Rate
- Kafka Sinks (collapsed by default) — TX Error Rate, Output Buffer, Connect / Disconnect Rate
Known stubs and orphans
- compute_objects.py Freshness row: title-only, reserved for end-to-end freshness/lag metrics. Pick a freshness signal (mz_internal.mz_materialized_view_refreshes?) when filling it in.
- dataflows.py: orphaned after Dataflows became a row inside Compute Objects rather than its own tab. Safe to delete; its only reference was an import in overview_dashboard.py, which has since been removed.
Reference environments
Materialize developers may have access to an internal shared Grafana with multiple test environments. It can be useful to look at queries in live environments when building dashboards. Do not use these environments without explicit permission.
Always scope investigative queries with materialize_cloud_organization_id="..." when testing. These are shared environments, and an unscoped query mixes data across them.
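One way to make the scoping hard to forget is to build selectors through a helper that always injects the organization matcher. This is a minimal sketch, not an existing function in this repo or in py-mzmon-lib: `scope_selector`, the metric name `mz_example_metric`, and the organization id `org-123` are all hypothetical placeholders.

```python
import re

# Label name taken from the scoping rule above.
ORG_LABEL = "materialize_cloud_organization_id"


def scope_selector(metric, org_id, **matchers):
    """Build a PromQL selector that is always pinned to one organization.

    Extra keyword arguments become additional equality matchers.
    """
    if not re.fullmatch(r"[a-zA-Z_:][a-zA-Z0-9_:]*", metric):
        raise ValueError(f"not a valid metric name: {metric!r}")
    labels = {ORG_LABEL: org_id, **matchers}
    parts = []
    for name, value in labels.items():
        # Escape backslashes and double quotes so the value stays one string.
        escaped = value.replace("\\", "\\\\").replace('"', '\\"')
        parts.append(f'{name}="{escaped}"')
    return f"{metric}{{{', '.join(parts)}}}"


# Hypothetical usage: wrap the scoped selector in a rate() for ad-hoc testing.
query = f"rate({scope_selector('mz_example_metric', 'org-123')}[5m])"
```

Because the organization matcher is added inside the helper, an investigative query cannot be assembled without it.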
Cleanup / refactor candidates
Tracked items that are working but could be tidier:
- ENV_SCOPED_NOTE is duplicated in compute_objects.py and storage_objects.py. Lift it to visualization.py (or a sibling _messages.py if it grows).
- _COMPUTE_FILTER and _ARRANGEMENT_FILTER are the same string in two modules. Lift them to a shared place and rename to something neutral like _LONGFORM_CLUSTER_FILTER.
- dataflows.py is orphaned. Safe to rm.
- The Compute Objects "Freshness" row is a title-only stub. Pick a freshness signal and fill it in (mz_materialized_view_lag_seconds in newer Materialize versions, or a derived metric from frontier metrics).
- The mz-mon- prefix isn't enforced in MzDashboard.UID values today (the class has UID = "env-top" and MzDashboard.__init__ prefixes it). This is consistent across all current dashboards (there is only one), but it is worth a validator if more dashboards land.
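A validator for the UID convention could be as small as the sketch below. It is a standalone illustration and does not reproduce the real MzDashboard code: `full_uid` is a hypothetical helper, and the 40-character cap reflects Grafana's documented dashboard UID limit, which should be confirmed against the target Grafana version.

```python
UID_PREFIX = "mz-mon-"
MAX_UID_LENGTH = 40  # assumed Grafana dashboard UID length limit


def full_uid(short_uid):
    """Return the prefixed UID, rejecting values that break the convention."""
    if not short_uid:
        raise ValueError("UID must not be empty")
    if short_uid.startswith(UID_PREFIX):
        # The prefix is added centrally, so a class-level value that already
        # carries it would come out as "mz-mon-mz-mon-...".
        raise ValueError(f"{short_uid!r} already has the {UID_PREFIX!r} prefix")
    uid = UID_PREFIX + short_uid
    if len(uid) > MAX_UID_LENGTH:
        raise ValueError(f"{uid!r} is longer than {MAX_UID_LENGTH} characters")
    return uid
```

Calling something like this from the place where the prefix is applied would catch both a double prefix and an over-long UID at generation time rather than at push time.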