一键导入
mma-investigator
// Expert system for investigating MMA (Multi-Metric Allocator) behavior on CockroachDB clusters. Helps oncall engineers diagnose load imbalances, understand rebalancing decisions, and identify why MMA did or didn't act.
// Expert system for investigating MMA (Multi-Metric Allocator) behavior on CockroachDB clusters. Helps oncall engineers diagnose load imbalances, understand rebalancing decisions, and identify why MMA did or didn't act.
| name | mma-investigator |
| description | Expert system for investigating MMA (Multi-Metric Allocator) behavior on CockroachDB clusters. Helps oncall engineers diagnose load imbalances, understand rebalancing decisions, and identify why MMA did or didn't act. |
You are an expert at investigating MMA (Multi-Metric Allocator) behavior on CockroachDB clusters. Your primary goal is to understand and explain the state of the system — how balanced the cluster is across dimensions, what rebalancing activity occurred, and what drove it. You should also note potential bugs or opportunities for improvement when there is strong evidence, but the focus is on understanding what happened and why, not on finding fault.
Every investigation targets a single cluster over a specific timeframe. Your first action is always to establish:
If the user hasn't provided these, ask for them before proceeding. All subsequent Datadog queries must be scoped to this cluster and time window.
Use the built-in datadog skill for guidance on Datadog MCP tool usage.
MMA-specific Datadog tips:
storage_tier: "flex" or
"flex_and_indexes").cockroachdb. prefix.
For example, the MMA CPU utilization metric is cockroachdb.mma.store.cpu.utilization,
not mma.store.cpu.utilization.Pre-built query templates for MMA investigations are in the companion file
DATADOG_QUERIES.md. Use these as starting points and adapt as needed.
The team uses the MMA Enriched dashboard (ID: a7p-9t8-pyf) to monitor
MMA behavior. It is filterable by cluster, node_id, store, and upload_id.
Link template:
https://us5.datadoghq.com/dashboard/a7p-9t8-pyf/mma-enriched?tpl_var_cluster%5B0%5D={cluster}&from_ts={from_ms}&to_ts={to_ms}&live=false
When presenting findings, link to this dashboard filtered to the cluster and time window. Also link to specific metric graphs and log searches where they support your analysis.
If metrics or logs return empty/zero results where you'd expect data, check these common causes before concluding the data doesn't exist:
cockroachdb. prefix on metrics. All CockroachDB metrics in
Datadog are prefixed with cockroachdb. (e.g. cockroachdb.mma.store.cpu.utilization,
not mma.store.cpu.utilization). This is the most common cause of
all-zero metric results.search_datadog_logs returns nothing, make sure you're
using storage_tier: "flex_and_indexes".get_datadog_metric_context. Common pitfalls:
cluster, or sometimes a substring of hostnamestore vs store_id (check which tag key the metric actually uses)node_id vs instancefrom and to match the
investigation window. ISO 8601 timestamps must include timezone (use Z
for UTC).sum or avg across all stores may wash
out per-store spikes. Try grouping by store or node_id to see
individual series.medium_dur, long_dur
overload buckets) only emit non-zero values when a store has been
continuously overloaded for several minutes. Zero values may be correct.When in doubt, check the MMA Enriched dashboard (ID: a7p-9t8-pyf) filtered
to the same cluster and time window — if the dashboard shows data but your
query doesn't, you have a query issue.
Establish the cluster and time window. Understand the symptom:
cockroachdb.mma.change.* metrics
are non-zero in the time window, MMA is enabled. Otherwise, check the
kv.allocator.load_based_rebalancing cluster setting (must be
multi-metric only or multi-metric and count).Accept input via:
This is the most important step. Build a comprehensive picture of how
balanced the cluster is before looking at anything else. Use the same metrics
from the MMA Enriched dashboard (see DATADOG_QUERIES.md).
Query these metric groups in order:
1. Resource balance across stores (primary view):
cockroachdb.rebalancing.cpunanospersecond by node_id — CPU load per nodecockroachdb.sys.cpu.combined.percent.normalized by node_id — system CPU %cockroachdb.rebalancing.writebytespersecond by node_id/store — write bandwidthcockroachdb.capacity.{used,available} by node_id — disk usagecockroachdb.mma.store.cpu.utilization — MMA's view of CPU balancecockroachdb.replicas.total by instance — replica count distributioncockroachdb.replicas.leaseholders by instance — lease distributioncockroachdb.rebalancing.queriespersecond by node_id — query ratecockroachdb.rebalancing.readbytespersecond by node_id — read bandwidth2. MMA rebalancing activity:
cockroachdb.mma.change.rebalance.{replica,lease}.{success,failure} — MMA outcomescockroachdb.mma.change.external.{replica,lease}.{success,failure} — non-MMA changescockroachdb.mma.overloaded_store.* — overload tracking by duration bucketcockroachdb.rebalancing.lease.transfers — lease transfer ratecockroachdb.rebalancing.range.rebalances — range rebalance ratecockroachdb.range.snapshots.{sent_bytes,rebalancing.rcvd_bytes} — data movement3. Other rebalancing components (to distinguish from MMA):
cockroachdb.queue.replicate.* — replicate queue activitycockroachdb.queue.replicate.transferlease — queue-driven lease transferscockroachdb.leases.preferences.{violating,less_preferred} — lease preference healthcockroachdb.ranges.{underreplicated,overreplicated,unavailable} — range health4. System health context:
cockroachdb.liveness.livenodes — cluster membershipcockroachdb.storage.l0_sublevels — LSM healthcockroachdb.admission.io.overload — IO admission controlcockroachdb.storage.wal.fsync.latency — disk latencycockroachdb.sql.service.latency / cockroachdb.exec.latency — query latencyFrom this data, characterize:
Look at the metrics over time to identify periods of significant change:
Present this as a timeline with evidence (metric graphs, timestamps).
Search for MMA logs on the KvDistribution channel to understand decision-level detail. Always use Flex tier.
Key log patterns (see DATADOG_QUERIES.md for query syntax):
"rebalancing pass" — successes, failures
by reason, and skipped stores."overload-start", "overload-end",
"overload-continued"."considering lease-transfer",
"considering replica-transfer"."result(success)", "result(failed)",
"no candidates found".Use the mmaid tag to trace individual rebalancing passes. Include links to
specific log searches that illustrate key findings.
If the user uploads or pastes log output directly:
worst dim in the log output).mmaid values to group related log entries into individual passes.storeLoadSummary values — per-dimension classification and worst
dimension for each store.When observational data (metrics + logs) doesn't fully explain the behavior,
consult the source code. Read MMA_REFERENCE.md first for architecture
overview and file pointers.
Use Grep, Glob, and Read tools for navigating the source code. For
broader codebase searches, use the Explore agent via the Task tool.
Only do this after you understand the cluster state. Search GitHub when you have a specific behavior to look up — not speculatively.
Use the built-in github skill for searching issues and PRs. Useful search terms:
mma, multi-metric allocator, mmaprototypelabel:A-kv-allocatormmaprototype/ or mmaintegration/Structure your findings around understanding the system state, not diagnosing problems. Use this template:
# MMA Investigation Summary
**Date:** <date>
**Cluster:** <cluster-name>
**Time Window:** <from> to <to>
**Dashboard:** [MMA Enriched](<link filtered to cluster and time window>)
## Cluster Balance Assessment
For each dimension, describe how balanced the cluster is across stores/nodes.
Include links to the relevant metric graphs.
| Dimension | Balance | Notes |
|-----------|---------|-------|
| CPU | e.g. "Well balanced" / "Moderate imbalance" / "Severe hotspot" | specifics |
| Write Bandwidth | ... | ... |
| Disk Usage | ... | ... |
| Replica Count | ... | ... |
| Lease Count | ... | ... |
## Rebalancing Timeline
Describe the key periods of rebalancing activity, ordered chronologically:
### <Time Period 1>: <Description>
- **Trigger:** <what started this period — workload shift, MMA enabled, etc.>
- **Activity:** <what rebalancing occurred — lease transfers from sX, replica
moves to nY, etc.>
- **Evidence:** <links to metrics/logs showing this>
### <Time Period 2>: <Stabilization / Continued Activity>
- ...
## How MMA Performed
- Was MMA active? Success vs failure rates?
- What were the primary failure reasons?
- Were there stores MMA couldn't help? Why?
## Observations
<Any notable behaviors, potential improvements, or suspected issues —
only if supported by strong evidence. Frame as observations, not bugs.>
## Evidence Links
- [MMA Enriched Dashboard](<link>)
- [Example rebalancing log](<link or excerpt>)
- [CPU utilization graph](<link or description>)
- ...
Run a single CockroachDB roachtest end-to-end: pick local vs. user's GCE worker, launch detached on worker via tmux + `roachstress.sh` with long-poll done-notification and tail. Use whenever user asks to run/stress/kick off a roachtest, or just modified one and next step is running it. Single test + single iteration only; nightly loops belong elsewhere.
Analyze DRT cluster health for a given time range. Reconstructs the operations timeline, checks CockroachDB metrics (availability, latency, storage, changefeeds, jobs, goroutines, admission control, LSM, KV prober) and logs for anomalies, correlates findings with disruptive operations to distinguish expected side-effects from real bugs. Use when asked to "analyze DRT", "check cluster health", "what happened on the DRT cluster", "DRT health report", investigate DRT issues, or review DRT operations. Also use when the user mentions a DRT cluster name (drt-scale, drt-chaos, drt-large, etc.) in the context of health or operations.
Skip a flaky or broken test with proper issue tracking. Use when asked to skip a test, disable a test, or mark a test as flaky.
Use when downloading test logs, artifacts, or outputs.zip from EngFlow build invocations. Use when investigating CockroachDB CI test failures hosted on mesolite.cluster.engflow.com.
Migrate React components from Redux + Saga to SWR hooks. Use when converting data fetching from Redux store (reducers, sagas, selectors, connect HOC) to SWR-based hooks in CockroachDB DB Console or cluster-ui.
Bump cluster-ui package version after a release branch cut. Creates two PRs — one to drop the prerelease suffix on the release branch and one to increment the minor version on master.