| name | ops-query |
| description | Internal — for Boundless team members only. Cross-reference Boundless indexer API data, broker telemetry, and service logs to investigate production and staging operational issues. Use when the user wants to understand why slashings happened on prod/staging, diagnose prover or service failures in deployed environments, correlate market events with broker behavior, investigate fulfillment rate drops, look at prover/service logs, or perform any analysis that requires combining on-chain indexer data with off-chain broker telemetry and CloudWatch logs. Also use when the user asks to "investigate", "diagnose", or "find root cause" for prover, service, or market issues on live networks. Do NOT use for debugging local code changes, reviewing PRs, or investigating issues in the codebase itself. |
Query
Combine on-chain indexer data, off-chain broker telemetry, and CloudWatch service logs to investigate operational issues and find insights.
Setup
Set up the data sources needed for the investigation. Not all sources are needed for every query -- use the ones relevant to the task.
- Read network_secrets.toml from the repo root. If it exists, it contains credentials for all environments (indexer API keys, telemetry DB URLs/passwords, AWS creds). Also read network_address_labels.json (same directory) for labelling addresses -- it is plain JSON ({"0xaddr": "label", ...}). If network_secrets.toml is not present, recommend the user create it -- instructions and credentials are in the Boundless runbook. If network_address_labels.json is not present, recommend the user create it -- the canonical address mapping is in the Boundless runbook.
- Read and follow the ops-indexer-query skill at .claude/skills/ops-indexer-query/SKILL.md to set up indexer access (MARKET_INDEXER_URL, ZKC_INDEXER_URL, optional INDEXER_API_KEY, and the indexer_get helper function).
- Read and follow the ops-telemetry-query skill at .claude/skills/ops-telemetry-query/SKILL.md to set up Redshift access (REDSHIFT_URL). Before writing any telemetry SQL, read crates/boundless-market/src/telemetry.rs for exact column names and enum values.
- Read and follow the ops-logs-query skill at .claude/skills/ops-logs-query/SKILL.md to set up CloudWatch log access (AWS credentials, log group discovery). Use when the investigation benefits from raw service logs -- especially for our own operated provers, or for other services like indexer, order stream, slasher, etc.
- Ask the user which network they want to investigate (determines which market indexer + ZKC indexer to use).
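As a quick pre-flight for the first setup step, the presence check could be sketched as below. The function name check_ops_files is hypothetical; only the two file names come from this document, and the check looks at existence only, not at file contents.

```shell
# Hypothetical helper: report which of the two setup files exist in a repo root.
# Prints one line per file and returns non-zero if either file is missing.
check_ops_files() {
  root="${1:-.}"
  missing=0
  for f in network_secrets.toml network_address_labels.json; do
    if [ -f "$root/$f" ]; then
      echo "found: $f"
    else
      echo "missing: $f (see the Boundless runbook)"
      missing=1
    fi
  done
  return "$missing"
}
```

Calling check_ops_files with no argument checks the current directory, which is expected to be the repo root.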
Data Source Overview
| Source | What it knows | Join fields |
|---|---|---|
| Market Indexer | On-chain request lifecycle: submitted, locked, fulfilled, slashed, expired. Pricing, collateral, tx hashes, timestamps. | request_id, request_digest, prover/requestor addresses |
| Telemetry (Redshift) | Broker-side operational data: evaluation decisions, skip reasons, proving durations, error codes, queue depths, estimated vs actual proving times. | request_id, request_digest, broker_address, order_id |
| CloudWatch Logs | Raw service logs for provers we operate (and other infra services). Detailed error messages, stack traces, runtime behavior. | request_id, request_digest, timestamps |
The key join fields between the indexer and telemetry are request_id and request_digest. The indexer's lock_prover_address or fulfill_prover_address corresponds to telemetry's broker_address. Logs can be correlated by searching for the same request_id or request_digest within the relevant time window.
Telemetry is opt-in -- not all brokers send telemetry. If a prover address has no telemetry data, note this to the user. CloudWatch logs are only available for services we operate.
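A minimal sketch of the join and the opt-in caveat, assuming both sources have already been exported to tab-separated files. The file layout (column 1 = request_id, column 2 = address) and the function name are illustrative assumptions, not formats the indexer or Redshift produce by default. Addresses are compared case-insensitively, which is also an assumption.

```shell
# Hypothetical helper: mark each indexer row with whether matching telemetry exists.
# $1 = indexer export   (request_id <TAB> lock_prover_address)
# $2 = telemetry export (request_id <TAB> broker_address)
flag_missing_telemetry() {
  awk -F '\t' '
    # Telemetry file is read first: remember every (request_id, address) pair.
    FILENAME == ARGV[1] { seen[$1 SUBSEP tolower($2)] = 1; next }
    # Indexer file: look up the same pair and append a status column.
    {
      key = $1 SUBSEP tolower($2)
      status = (key in seen) ? "telemetry" : "no-telemetry"
      print $1 "\t" $2 "\t" status
    }
  ' "$2" "$1"
}
```

Rows marked no-telemetry are the ones to call out to the user as coming from brokers that do not report telemetry.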
Investigation Workflow
All investigations follow the same pattern:
- Identify targets -- Use the indexer to find the relevant requests/addresses/time periods.
- Correlate with telemetry -- Use the request IDs, digests, or broker address + time window to look up telemetry data. Telemetry provides a pre-processed view with structured skip reasons, error codes, proving durations, and estimation accuracy -- this should answer most questions about why orders were skipped, failed, or slashed without needing raw logs.
- Dig into logs (last resort) -- Only go to CloudWatch logs if the indexer and telemetry data are insufficient. Logs are useful when you need raw error messages, stack traces, or runtime details that telemetry doesn't capture (e.g. infrastructure-level failures, panics, or non-prover service issues). One particularly useful check: look for recent deployments in bento prover logs. Nightly deployments restart Docker Compose and can explain gaps in telemetry, sudden behavior changes, or outages. New code deployed can also introduce bugs. See the "Checking for Recent Deployments" section in the ops-logs-query skill.
- Analyze and synthesize -- Combine findings from all sources into a coherent narrative.
Rate limit all sources:
- Indexer: sleep 1 between requests, sleep 2 between pagination pages.
- Redshift: No rate limit, but use LIMIT on exploratory queries.
- CloudWatch: sleep 1 between paginated log queries.
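The pacing rules above can be sketched as a small loop. Here paced_pages and the fetch command it receives are hypothetical stand-ins (the real indexer_get helper is defined in the ops-indexer-query skill and may have a different interface). The fetch command is assumed to take a page number, print that page, and return non-zero when there are no more pages.

```shell
# Hypothetical helper: call a page-fetching command with a pause between pages.
# $1 = fetch command (receives the page number), $2 = maximum pages (default 10)
# PAGE_PAUSE = seconds between pages; defaults to 2, the indexer pagination pause.
paced_pages() {
  fetch_cmd="$1"
  max_pages="${2:-10}"
  page=1
  while [ "$page" -le "$max_pages" ]; do
    # Pause before every page except the first.
    if [ "$page" -gt 1 ]; then
      sleep "${PAGE_PAUSE:-2}"
    fi
    # Stop as soon as the fetch command reports there is nothing left.
    "$fetch_cmd" "$page" || break
    page=$((page + 1))
  done
}
```

For CloudWatch pagination, setting PAGE_PAUSE=1 matches the pause listed above.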
Pre-Built Investigations
Before starting any work, check if the user's question matches a pre-built investigation. These are tested playbooks with the right queries, presentation format, and step-by-step instructions. Using them produces consistent, comprehensive results. Each lives in its own file under references/:
If the user's question clearly maps to one of these, read the corresponding file and follow it step by step. If it doesn't fit any pre-built investigation, fall back to the general Investigation Workflow above and build a custom investigation.
Aurora DB Instances
When investigating RDS/Aurora issues (storage, CPU, connections, etc.), never assume the DB identifier reflects the actual instance role. Instance identifiers containing reader or writer may be mislabeled -- the name is set at creation time and does not update if Aurora promotes/demotes instances.
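Always determine the actual role from AWS metadata rather than from the name. describe-db-instances does not report a writer/reader role, so use it only to list the matching instances and the cluster each belongs to:
aws rds describe-db-instances \
--query 'DBInstances[?contains(DBInstanceIdentifier, `prod-8453-indexer`)].{id: DBInstanceIdentifier, cluster: DBClusterIdentifier}' \
--output table
Then check that cluster's member list, which carries the role: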
aws rds describe-db-clusters \
--db-cluster-identifier "CLUSTER_ID" \
--query 'DBClusters[0].DBClusterMembers[].{id: DBInstanceIdentifier, isWriter: IsClusterWriter}' \
--output table
Use IsClusterWriter: true/false as the source of truth. When reporting findings, always state the actual role alongside the identifier, e.g. "instance *-reader-v19 (actual role: writer)".
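To produce the reporting format above consistently, the member list can be piped through a small formatter. This is a sketch under assumptions: it expects the describe-db-clusters query run with --output text, yielding one "id<TAB>True|False" row per member in that column order. Verify the column order against a real run before relying on it. The function name is hypothetical.

```shell
# Hypothetical helper: turn "id<TAB>True|False" rows into report lines.
# Anything other than true/false (case-insensitive) is reported as unknown.
report_actual_roles() {
  while IFS="$(printf '\t')" read -r id is_writer; do
    case "$(printf '%s' "$is_writer" | tr '[:upper:]' '[:lower:]')" in
      true)  role=writer ;;
      false) role=reader ;;
      *)     role=unknown ;;
    esac
    echo "instance $id (actual role: $role)"
  done
}
```

Usage would look like `aws rds describe-db-clusters ... --output text | report_actual_roles`, with the same --query expression as above.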
Presenting Results
Addresses
Always show the full address when displaying broker/prover addresses. Do not truncate to 0x8305...04b5. If a label exists in network_address_labels.json, show both: 0x83052f16a84e6f2cec4bf3beda45c40c800904b5 (BP1).
Our Provers
Provers we operate are labeled with a BP prefix in network_address_labels.json (e.g. BP1, BP2, BPNightlyAWS). When investigating any issue, always highlight what our provers are doing -- did they skip the order, did they fail, did they drop it, what error codes are they hitting? This should be called out explicitly in every investigation, even when the issue is not specifically about our provers.
Prover Summary Tables
When showing prover activity, pivot telemetry outcomes into columns so each prover is one row. Include fulfilled, failures, and skips as separate columns. By default, summary tables should cover the top 5 provers by volume plus all provers we operate (from address labels).
Failure and Skip Breakdowns
After the summary table, include three separate breakdown sections:
- Failure breakdown: For each prover, show a table of outcome, error_code, summarized error_reason, and count. Group by error pattern, not by individual request.
- Skip breakdown: Same structure -- per-prover table of skip_code, example reason, and count, sorted by count descending.
- Drop breakdown: Same structure -- per-prover table of commitment_skip_code, reason, and count, sorted by count descending.
Alerts and Error Codes
Alerts always match on error codes (e.g. [B-PRO-501]), not on string patterns. Seeing a string like ProvingFailed in a log message does NOT mean it counts toward the proving-failed metric or alert. Only entries with the corresponding error code (e.g. [B-PRO-501]) are counted. When investigating alert triggers or counting occurrences for a specific alert, always filter by the error code, not by keyword/string matching.
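A minimal sketch of counting by error code rather than by keyword, assuming the log lines have already been saved to a local file and that the code appears in square brackets as in the example above. The function name is hypothetical.

```shell
# Hypothetical helper: count log lines carrying a bracketed error code.
# $1 = code without brackets (e.g. B-PRO-501), $2 = log file
count_error_code() {
  # -F matches the bracketed code literally, so "[" and "]" are not treated as
  # a regex character class. grep exits 1 on zero matches; treat that as success.
  grep -cF -- "[$1]" "$2" || [ "$?" -eq 1 ]
}
```

A keyword search such as `grep -c ProvingFailed` over the same file can return a different number, which is why it should not be used to explain an alert.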
Telemetry Terminology
- Locked: Order was priced and the broker decided to try locking it on-chain.
- Skipped: Order was rejected during pricing in the OrderPicker (e.g. unprofitable, wrong image, over capacity). It never reached the OrderMonitor.
- Committed: Order was successfully committed to the proving pipeline (lock tx succeeded or immediate commitment for FulfillAfterLockExpire).
- Dropped: Order reached the OrderMonitor but was NOT committed to the proving pipeline. Reasons include: lock tx failed, order was fulfilled/expired/locked by another prover before we could act, insufficient deadline remaining, or insufficient balance. Check commitment_skip_code for specifics.
- Cancelled: Completion outcome meaning the broker finished proving but another prover fulfilled the order first. This is a race loss (wasted proving work), NOT an error. Do not count Cancelled as a failure in summary tables -- show it as a separate column.
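The one-row-per-prover pivot, with Cancelled kept in its own column, could be sketched as follows. It assumes telemetry rows have been exported as "prover<TAB>outcome" lines. The outcome names passed in are placeholders, apart from Cancelled, and the exact enum values should be taken from crates/boundless-market/src/telemetry.rs.

```shell
# Hypothetical helper: pivot "prover<TAB>outcome" rows into one row per prover.
# $1 = comma-separated outcome names to use as columns, in display order
pivot_outcomes() {
  awk -F '\t' -v cols="$1" '
    BEGIN {
      n = split(cols, col, ",")
      header = "prover"
      for (i = 1; i <= n; i++) header = header "\t" col[i]
      print header
    }
    {
      # Keep provers in order of first appearance.
      if (!($1 in seen)) { seen[$1] = 1; order[++np] = $1 }
      count[$1, $2]++
    }
    END {
      for (p = 1; p <= np; p++) {
        row = order[p]
        # "+ 0" prints 0 for outcomes a prover never had.
        for (i = 1; i <= n; i++) row = row "\t" (count[order[p], col[i]] + 0)
        print row
      }
    }
  '
}
```

Because Cancelled is its own column rather than being folded into a failure column, race losses stay visible without inflating failure counts.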
Secondary Fulfillment
When a prover locks an order but fails to fulfill it before the lock expires, the order becomes available for secondary fulfillment by any other prover in the network. The secondary fulfiller earns the slash collateral as a reward. In telemetry, secondary fulfillments are identified by fulfillment_type = 'FulfillAfterLockExpire' (in both evaluations and completions). In the indexer, a secondary fulfillment shows as fulfill_prover_address differing from lock_prover_address, and market aggregates include total_secondary_fulfillments.
When investigating expired or slashed requests, always check whether secondary fulfillment was attempted -- especially by our BP provers. Did they see the opportunity? Did they skip it, and why? Did they attempt it but fail?
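The indexer-side check described above (fulfill_prover_address differing from lock_prover_address) can be sketched as a small predicate. The function name is hypothetical, and comparing addresses case-insensitively is an assumption made here to tolerate checksummed versus lowercase hex.

```shell
# Hypothetical helper: succeed when a request looks like a secondary fulfillment.
# $1 = lock_prover_address, $2 = fulfill_prover_address
is_secondary_fulfillment() {
  lock=$(printf '%s' "$1" | tr '[:upper:]' '[:lower:]')
  fulfill=$(printf '%s' "$2" | tr '[:upper:]' '[:lower:]')
  # Both addresses must be present: a request that was never locked, or never
  # fulfilled, is not a secondary fulfillment.
  [ -n "$lock" ] && [ -n "$fulfill" ] && [ "$lock" != "$fulfill" ]
}
```

For requests where this returns success, cross-check telemetry for fulfillment_type = 'FulfillAfterLockExpire' before drawing conclusions.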