Exécutez n'importe quel Skill dans Manus
en un clic

Exécutez n'importe quel Skill dans Manus en un clic

$pwd:

observe

Name: Observe
Author: elastic

// The agent's Elastic-access primitive. Four modes: wait for an ML anomaly to fire, poll an ES|QL metric (live-sample or wait for a threshold), read a single-instance scalar value, or return a full ES|QL table. Use when the user says "tell me when...", "let me know if...", "wait until X drops below Y", "watch for anything unusual", "monitor for the next N minutes", "poll until stable", "what is X right now", "list …", "which … are …", or wants transient (session-scoped) monitoring or ad-hoc querying without creating a persistent Kibana rule. Also trigger for "keep an eye on" and post-remediation validation.

Exécuter dans Manus

$ git log --oneline --stat

stars:7

forks:6

updated:7 mai 2026 à 16:26

SKILL.md

readonly

related-skills.json

même dépôt

mcp-app-dev-setup.md

from "elastic/example-mcp-app-observability"

Bootstrap or repair a development environment for the Elastic Observability MCP App with Forge as the data driver. Use when the user says "set up Forge for me", "get me ready to work on this MCP app", "run the validation suite", "I just cloned this repo, what now", or wants the dev environment refreshed after a long gap. Verifies sibling Forge clone, Python venv, cluster credentials, MCP harness, and runs a smoke test against the canonical validation suite.

2026-05-017

apm-health-summary.md

from "elastic/example-mcp-app-observability"

Get a cluster-level rollup of service health from APM telemetry — the "how's my environment right now?" entry point for observability investigations. Use whenever the user asks about HEALTH, STATUS, or general wellbeing of an environment / cluster / namespace ("how's my cluster", "status of the X env", "what's broken", "any issues", "show me the health of …", "give me a status report", "what should I look at", "things feel slow"). This applies regardless of any time qualifier — "show me the health of X over the past hour" still routes here (with lookback="1h"), NOT to observe. observe is for raw-metric queries; this tool is for the rollup. Gracefully degrades: layers in Kubernetes pod data and ML anomaly context when those backends are present, but still returns useful APM-only output if they aren't. Do not use for log-only or metrics-only customers — this tool requires Elastic APM.

2026-05-017

apm-service-dependencies.md

from "elastic/example-mcp-app-observability"

Map the application topology from APM telemetry — which services call which, over what protocols, with what call volume and latency. Use when the user asks "what calls X", "what depends on X", "show me the topology", "what are the upstream/downstream services", "where does this service fit", or is doing root-cause investigation and needs to trace how a problem propagates through the call graph. Also trigger for "service map", "dependency graph", "blast radius of service X", or "who's the dependency of Y". Requires Elastic APM — do not trigger for log-only or metrics-only customers.

2026-05-017

ml-anomalies.md

from "elastic/example-mcp-app-observability"

Query Elastic ML anomaly detection results to understand what's behaving unusually, why, and how badly. Use when the user asks "what's anomalous", "is anything unusual happening", "why is X slow/spiking", "show me the weirdness", or mentions memory growth, CPU spikes, restart patterns, unusual latency, unexpected error rates, or drift from typical behavior. Also trigger for "ML anomalies", "anomaly detection", "Elastic ML", "what does ML think", or when the user wants to understand behavior that deviates from baseline. The tool opens an inline explainer view with a severity gauge, plain-English narrative, and per-entity deviation breakdown — so the agent should USE the visualization, not just dump JSON.

2026-05-017

manage-alerts.md

from "elastic/example-mcp-app-observability"

CRUD for Kibana alerting rules — create, list, get, or delete custom-threshold rules. Use when the user says "alert me when", "create a rule for", "page me if", "set up an alert", "show me my rules", "what alerts do I have", "delete that alert", "remove the rule". Backend-agnostic — works on any metric field in any index pattern (metrics-*, logs-*, traces-apm*, custom). For transient session-scoped monitoring use `observe` instead. Requires Kibana with the Alerting feature enabled — the tool is auto-disabled when no Kibana URL is configured.

2026-04-307

k8s-blast-radius.md

from "elastic/example-mcp-app-observability"

Assess the impact of a Kubernetes node going offline — which deployments lose all replicas (full outage), which lose partial capacity (degraded), which are unaffected, and whether the cluster has enough spare capacity to reschedule the lost pods. Use when the user asks "what happens if node X goes down", "what's the blast radius of draining this node", "can I safely maintain node Y", "what's running on this node", "if I evict this node what breaks", or is planning node maintenance, a cluster upgrade, or investigating an actual node failure. Requires Kubernetes (kubeletstats metrics) and Elastic APM for downstream service impact — do not trigger for non-K8s deployments.

2026-04-307

package.json

"author": "elastic"

"repository": "elastic/example-mcp-app-observability"

Ouvrir le dépôt GitHub Voir les dépôts du créateur

$ install --global

$ download --local

Exécuter dans Manus

$ useful --forSOC

Administrateurs de réseaux et de systèmes informatiquesProfessions informatiques et mathématiques15-1244L4

name

observe

description

The agent's Elastic-access primitive. Four modes: wait for an ML anomaly to fire, poll an ES|QL metric (live-sample or wait for a threshold), read a single-instance scalar value, or return a full ES|QL table. Use when the user says "tell me when...", "let me know if...", "wait until X drops below Y", "watch for anything unusual", "monitor for the next N minutes", "poll until stable", "what is X right now", "list …", "which … are …", or wants transient (session-scoped) monitoring or ad-hoc querying without creating a persistent Kibana rule. Also trigger for "keep an eye on" and post-remediation validation.

Observe

Transient, session-scoped monitoring and ad-hoc querying. Unlike manage-alerts (which creates a durable saved object in Kibana), observe polls in-process and returns once fired, once its window closes, or — in now / table mode — immediately.

Modes

Decision tree — pick based on tense FIRST, then on shape

Does the user phrase it as past or future?

PAST tense / windowed lookback                    FUTURE tense / live monitoring
("what WAS / over the past N /                    ("watch / monitor / wait until /
 in the last N / how did X look")                  live-sample / for the NEXT N")
       │                                                  │
       ▼                                                  ▼
Is it a single number or a series?                Is there a threshold to fire on?
       │                                                  │
   ┌───┴────┐                                          ┌──┴──┐
single   series                                       yes    no
   │        │                                          │      │
   ▼        ▼                                          ▼      ▼
 now     table                                      metric  metric
                                                    (with    (live-
                                                  condition) sample)

Other:
  • "tell me when anything unusual fires" → anomaly (default)
  • "list X / which X are Y / top N" (any tense) → table
  • "page me whenever..." → use manage-alerts instead

Mode	When to pick it	Blocks?
`now`	Past-tense windowed scalar. "What was X right now / over the past 60 seconds / in the last 15 minutes / average X for the last hour". Put the window inside the ES\|QL via `WHERE @timestamp > NOW() - <window>`. Single number out.	Returns immediately
`table`	Past-tense time series OR group-by. "How did X look over the past hour" (with `BUCKET()` for the chart), "list X", "which X are Y", any group-by / top-N. The view auto-charts 2-column time-series tables.	Returns immediately
`metric`	Forward-looking only. "Watch X", "wait until X drops", "live-sample X for the next 60s", "wake me when X exceeds Y". Polls live for `max_wait` seconds. Never use this for past-tense / "what was X" queries — it will block for `max_wait` seconds before returning a value, which is the opposite of what the user asked.	Polls for `max_wait` seconds (default 60s)
`anomaly` (default)	"Tell me when anything unusual fires", open-ended monitoring	Until an anomaly fires or `max_wait` elapses

⚠️ Most common mistake: picking metric for a past-tense query. A prompt like "what was the frontend memory over the past 60 seconds?" is asking about data that already exists — it's a windowed lookup, not a request to wait 60 more seconds and watch. Always inspect the verb before choosing metric:

was / were / averaged / hit / spiked / over the past / in the last / for the last → now or table. NEVER metric.

watch / monitor / poll / wait until / wake me / for the next / until X happens / live-sample → metric.

If the user wants durable alerting ("page me whenever..."), use manage-alerts instead.

⚠️ Don't use observe for health rollups. "Show me the health of X", "status of the X environment", "how is X doing" — even with a time qualifier like "over the past hour" — should route to apm-health-summary, not observe. observe is for raw-metric / single-query investigations; apm-health-summary returns the full rollup (degraded services, anomalies, pod resources). Pick apm-health-summary whenever the user is asking a HEALTH question rather than a specific metric question.

Prerequisites

Mode	Requires
`anomaly`	Elastic ML anomaly detection jobs
`metric`	Any ES\|QL-queryable numeric field
`now`	Any ES\|QL-queryable numeric field
`table`	Any ES\|QL-queryable data

How to call observe

Anomaly mode (default)

{
  "mode": "anomaly",
  "min_score": 75,
  "max_wait": 600,
  "namespace": "otel-demo"
}

min_score: 75 default (major+), 50 for minor inclusion, 90 for critical-only.
max_wait: generous (600s default). Returns immediately on trigger — long waits are free.
namespace: only if the user scopes to a K8s namespace.

Metric mode — threshold condition

{
  "mode": "metric",
  "esql": "FROM metrics-kubeletstatsreceiver.otel* | WHERE resource.attributes.k8s.pod.name == \"frontend-7d4b8f9c5-x2k9m\" | STATS v = AVG(metrics.k8s.pod.memory.working_set)",
  "condition": "< 80000000",
  "description": "frontend pod memory working set",
  "max_wait": 300
}

Condition format: <comparator> <threshold> — valid comparators: <, <=, >, >=, ==.

Metric mode — live sample (no threshold)

Omit condition and the tool live-samples for the full max_wait window — use for "show me a live chart of X" prompts. The view renders an accumulating sparkline.

{
  "mode": "metric",
  "esql": "FROM metrics-kubeletstatsreceiver.otel* | WHERE resource.attributes.k8s.namespace.name == \"oteldemo-esyox-default\" | STATS v = AVG(metrics.k8s.pod.memory.working_set)",
  "description": "oteldemo-esyox-default avg pod memory",
  "max_wait": 60
}

Now mode — single read

{
  "mode": "now",
  "esql": "FROM metrics-kubeletstatsreceiver.otel* | WHERE resource.attributes.k8s.namespace.name == \"oteldemo-esyox-default\" | STATS v = AVG(metrics.k8s.pod.memory.working_set)",
  "description": "current avg pod memory in oteldemo-esyox-default"
}

Past-tense windowed read — when the user asks "what was X over the past N seconds/minutes/hours", put the window in the ES|QL WHERE clause, not in max_wait. The tool returns immediately with the aggregate; nothing is polled.

{
  "mode": "now",
  "esql": "FROM metrics-kubeletstatsreceiver.otel* | WHERE resource.attributes.k8s.pod.name == \"frontend-7d4b8f9c5-x2k9m\" AND @timestamp > NOW() - 60 seconds | STATS v = AVG(metrics.k8s.pod.memory.working_set)",
  "description": "frontend memory, last 60 seconds (avg)"
}

If the user wants the time series (not just one number), use table mode with BUCKET() instead — that returns rows the view can chart.

Table mode — full ES|QL rows and columns

Use when the query groups, lists, or returns mixed-type rows (strings + numbers + dates). now mode discards everything except the first numeric cell — table mode preserves the whole result.

{
  "mode": "table",
  "esql": "FROM metrics-kubeletstatsreceiver.otel* | WHERE metrics.k8s.pod.memory.working_set IS NOT NULL | STATS avg_mem = AVG(metrics.k8s.pod.memory.working_set) BY resource.attributes.k8s.pod.name, resource.attributes.k8s.namespace.name | SORT avg_mem DESC | LIMIT 10",
  "description": "top 10 pods by memory"
}

Past-tense time series — use table with BUCKET() to return one row per time slice the view can chart:

{
  "mode": "table",
  "esql": "FROM metrics-kubeletstatsreceiver.otel* | WHERE resource.attributes.k8s.pod.name == \"frontend-7d4b8f9c5-x2k9m\" AND @timestamp > NOW() - 60 seconds | STATS v = AVG(metrics.k8s.pod.memory.working_set) BY bucket = BUCKET(@timestamp, 5 second) | SORT bucket ASC",
  "description": "frontend memory · 60s · 5s buckets"
}

Rows are capped at 50 by default. Prefer tightening the ES|QL with LIMIT / SORT over raising row_cap — very wide tables clog the context window.

Picking the right index pattern

Fields live where the data is emitted — ES|QL rejects queries that reference a field the target index doesn't map (verification_exception). Before writing the query, match the user's question to the right layer:

User asks about…	Index	Carries
Node / pod / namespace topology, resource usage	`metrics-kubeletstatsreceiver.otel` or `metrics-`	`k8s.node.name`, `k8s.pod.name`, `k8s.namespace.name`, `service.name` (via resource attrs), CPU/memory/fs gauges
Service behavior — latency, errors, throughput, spans	`traces-.otel-` or `traces-apm*`	`service.name`, `transaction.duration.us`, `event.outcome`, `span.*`
Log rate / log content	`logs-*`	`message`, `log.level`, `service.name`
ML anomalies	`.ml-anomalies-*`	`record_score`, `by_field_value`, `partition_field_value`

Cross-layer questions ("which node runs the most services") need the index that carries both fields — that's almost always metrics-*, because OTel resource attributes propagate through the Collector, so metrics docs carry k8s.node.name and service.name. Trace indices (traces-apm*, traces-*.otel-*) don't carry infra attributes like k8s.node.name — don't reach for them when the question is about nodes.

Example: "which node is running the most services"

FROM metrics-*
| WHERE @timestamp > NOW() - 5m AND k8s.node.name IS NOT NULL AND service.name IS NOT NULL
| STATS service_count = COUNT_DISTINCT(service.name) BY k8s.node.name
| SORT service_count DESC
| LIMIT 20

Common query patterns

These are the field paths this deployment's data actually uses — prefer them over guessing.

OTel Kubernetes (kubeletstats receiver)

Index: metrics-kubeletstatsreceiver.otel*

Each kubeletstats scrape emits separate documents per metric — a CPU doc, a memory doc, a network doc, etc. Always filter WHERE <field> IS NOT NULL for the field you're aggregating, otherwise most rows carry nulls for it.

Gauge fields — use AVG / MAX / MIN, never SUM:

Signal	Field	Type
Pod memory working set	`k8s.pod.memory.working_set`	`long` (bytes)
Pod memory RSS	`k8s.pod.memory.rss`	`long` (bytes)
Pod memory available	`k8s.pod.memory.available`	`long` (bytes)
Pod CPU usage	`k8s.pod.cpu.usage`	`double` (cores — 1.0 = one full core)
Pod filesystem usage	`k8s.pod.filesystem.usage`	`long` (bytes)
Node memory working set	`k8s.node.memory.working_set`	`long` (bytes)
Node memory available	`k8s.node.memory.available`	`long` (bytes)
Node CPU usage	`k8s.node.cpu.usage`	`double` (cores)
Node filesystem usage	`k8s.node.filesystem.usage`	`long` (bytes)

For counter fields (network I/O, network errors, uptime), see the "Counter fields" section below — these require TS + RATE().

Fields that may NOT exist in every cluster's kubeletstats schema — depends on whether the OTel collector / kubelet is configured to scrape pod-spec data:

Field	When it's missing	What to do
`metrics.k8s.pod.cpu.limit`	Pods don't declare CPU limits, OR the collector isn't scraping pod-spec	Don't reference it — use raw `metrics.k8s.pod.cpu.usage` and report cores instead of % utilization
`metrics.k8s.pod.memory.limit`	Same as above for memory	Same — report bytes instead of %
`metrics.k8s.container.restart_count`	Collector isn't scraping pod-status (this is COMMON)	Don't reference it. Use APM error rates or alert reasons as a proxy for "things crashing." There is no canonical alternative field name (`k8s.pod.restarts` is wrong — don't try it)

Schema-aware error recovery. When ESQL returns Unknown column [X], did you mean [Y, Z]?, read the suggestion and retry with one of Y or Z — don't blindly retry the same query or guess a different name from memory.

Dimension fields — use for filtering and BY grouping:

Dimension	Unprefixed	Prefixed (equivalent)
Pod name	`k8s.pod.name`	`resource.attributes.k8s.pod.name`
Namespace	`k8s.namespace.name`	`resource.attributes.k8s.namespace.name`
Node name	`k8s.node.name`	`resource.attributes.k8s.node.name`
Cluster name	`k8s.cluster.name`	(same)

Both forms work on metrics-kubeletstatsreceiver.otel*. Prefer the unprefixed form — it's shorter and also works on counter-field queries via TS.

Common recipes:

Top pods by memory (last 5m, across all namespaces):

FROM metrics-kubeletstatsreceiver.otel*
| WHERE @timestamp > NOW() - 5 minutes AND k8s.pod.memory.working_set IS NOT NULL
| STATS avg_mem = AVG(k8s.pod.memory.working_set),
        max_mem = MAX(k8s.pod.memory.working_set)
  BY k8s.pod.name, k8s.namespace.name
| SORT max_mem DESC
| LIMIT 20

Which pods are on a specific node:

FROM metrics-kubeletstatsreceiver.otel*
| WHERE @timestamp > NOW() - 5 minutes
  AND k8s.node.name == "<node>" AND k8s.pod.name IS NOT NULL
| STATS last_seen = MAX(@timestamp) BY k8s.pod.name, k8s.namespace.name
| SORT last_seen DESC

Namespace-wide memory average (single scalar — works in now/metric mode):

FROM metrics-kubeletstatsreceiver.otel*
| WHERE k8s.namespace.name == "oteldemo-esyox-default"
  AND k8s.pod.memory.working_set IS NOT NULL
| STATS v = AVG(k8s.pod.memory.working_set)

Is this node under memory pressure (working-set vs available):

FROM metrics-kubeletstatsreceiver.otel*
| WHERE @timestamp > NOW() - 5 minutes
  AND k8s.node.name == "<node>" AND k8s.node.memory.working_set IS NOT NULL
| STATS working_set = AVG(k8s.node.memory.working_set),
        available = AVG(k8s.node.memory.available)

Counter fields — require `TS` + `RATE()`

Network I/O, network errors, and uptime fields are stored as monotonically-increasing counters (counter_long), not instantaneous gauges. FROM + MAX/AVG/SUM/VALUES on a counter field is a hard error — ES|QL returns argument of [...] must be [...numeric except counter types].

Counter fields in this deployment:

Field	Notes
`k8s.pod.network.io`	bytes, carries `direction` attribute (`transmit` / `receive`) — emitted as separate docs per direction
`k8s.pod.network.errors`	error count, also carries `direction`
`k8s.node.network.io`, `k8s.node.network.errors`	node-level equivalents
`k8s.node.uptime`, `k8s.pod.uptime`	seconds since start

Correct pattern: use TS as the source command, wrap RATE() in an aggregation, filter the counter field IS NOT NULL, and group by direction whenever you query network fields.

Network throughput by cluster, last 15m (result in bytes/sec):

TS metrics-kubeletstatsreceiver.otel*
| WHERE @timestamp > NOW() - 15 minutes
  AND k8s.pod.network.io IS NOT NULL
| STATS rate_bps = AVG(RATE(k8s.pod.network.io))
  BY k8s.cluster.name, direction
| SORT rate_bps DESC

Rules:

TS, not FROM. FROM will be rejected.
Wrap RATE() in AVG() (or similar) when grouping — bare RATE(...) BY ... is rejected.
Network counters are emitted as separate docs per direction. Without BY direction or a direction == "..." filter, transmit and receive aggregate into a meaningless combined number.
Without IS NOT NULL the query spans many kubeletstats docs that carry a different metric — you get nulls, not errors.

Escape hatch — raw counter snapshot: if you want the current counter value (e.g. "how long has node X been up"), cast first. TO_LONG strips the counter type and unlocks standard aggregations:

FROM metrics-kubeletstatsreceiver.otel*
| WHERE @timestamp > NOW() - 5 minutes AND k8s.node.uptime IS NOT NULL
| EVAL u = TO_LONG(k8s.node.uptime)
| STATS uptime_s = MAX(u) BY k8s.node.name

APM traces

Primary index: traces-*.otel-* (OTel-native). Fallback: traces-apm* (classic APM — only if the OTel path returns empty).

In EDOT-ingested clusters, traces-*.otel-* carries both OTel-native fields (duration, kind, status.code) and classic-APM-compatible fields (processor.event, event.outcome, transaction.duration.us on transaction-level docs). The cluster's "APM-ness" isn't determined by the index — it's determined by which field shape you query.

Signal	OTel-native (preferred)	Classic APM
Duration	`duration` (nanoseconds, long, populated on every span)	`transaction.duration.us` (microseconds, populated only on `processor.event == "transaction"` docs)
Error signal	`event.outcome == "failure"` — use this, 100% populated	`status.code == "Error"` (sparse; only set when instrumentation explicitly calls `SetStatus`)
Error message / type / stacktrace	`exception.message`, `exception.type`, `exception.stacktrace`	`error.message`, `error.exception.type`, `error.stack_trace`
Span kind	`kind` — values `Server`, `Internal`, `Client`, `Producer`, `Consumer` (title case, not `SERVER`/`CLIENT`)	`transaction.type`
Scope filter	`kind == "Server"` isolates incoming requests	`processor.event == "transaction"`
Service name	`service.name`	`service.name`

Unit warning. OTel duration is in nanoseconds. Divide by 1,000,000 for milliseconds. Classic transaction.duration.us is in microseconds — divide by 1,000. Mixing these across a comparison produces wildly wrong numbers.

Error-field warning. On traces-*.otel-* the exception attributes use the exception.* namespace, not error.*. Querying error.message / error.type against an OTel-native index returns verification_exception: Unknown column [error.message], did you mean any of [exception.message, message]?. The error.* family belongs only to classic-APM traces-apm* documents. When you see the user ask "show me the error messages from X", reach for exception.message first.

Where exception data actually lives — three deployment shapes. OTel records exceptions as Span Events; where they end up indexed depends on the exporter pipeline:

Flattened onto the trace doc — the parent span carries exception.message, exception.type, exception.stacktrace. Common with EDOT + APM Server. Query traces-*.otel-*.

Exported as separate log records — logs-*.otel-* carries exception.message / exception.type, correlated to the trace via trace.id and span.id. This is the default OpenTelemetry Collector pipeline. The trace doc itself only carries status.message / status.code (Error on failed spans).

Classic APM error events — traces-apm* with processor.event == "error" carries error.message and error.exception.type. Only present when classic-APM agents are in use.

The trace doc always carries event.outcome == "failure" and status.message regardless. If your exception.message query against traces-*.otel-* returns Unknown column [exception.message], the deployment is shape 2 — switch to logs-*.otel-* with WHERE exception.message IS NOT NULL and join back to the trace via trace.id if needed. Don't assume the original index has the field just because the user said "from traces".

Service p95 latency (OTel-native), last 15m — result in ms:

FROM traces-*.otel-*
| WHERE service.name == "checkout" AND @timestamp > NOW() - 15 minutes
  AND kind == "Server"
| STATS p95_ms = PERCENTILE(duration, 95) / 1000000

Error rate for a service — event.outcome is reliable here:

FROM traces-*.otel-*
| WHERE service.name == "checkout" AND @timestamp > NOW() - 15 minutes
  AND kind == "Server"
| STATS errors = COUNT(*) WHERE event.outcome == "failure", total = COUNT(*)
| EVAL error_rate_pct = ROUND(errors * 100.0 / total, 2)
| KEEP error_rate_pct, errors, total

Recent exception messages — try the trace index first (flattened-event pipelines), then fall back to logs (default OTel Collector pipeline):

1) Trace doc carries the exception (EDOT-flattened pipelines):

FROM traces-*.otel-*
| WHERE service.name == "checkout" AND @timestamp > NOW() - 15 minutes
  AND event.outcome == "failure"
  AND exception.message IS NOT NULL
| KEEP @timestamp, exception.type, exception.message
| SORT @timestamp DESC
| LIMIT 50

2) Exception lives in logs (default OTel pipeline) — use this when (1) returns Unknown column [exception.message] or empty:

FROM logs-*.otel-*
| WHERE service.name == "checkout" AND @timestamp > NOW() - 15 minutes
  AND exception.message IS NOT NULL
| KEEP @timestamp, exception.type, exception.message, trace.id, span.id
| SORT @timestamp DESC
| LIMIT 50

3) Classic-APM equivalent — only when both OTel paths return empty:

FROM traces-apm*
| WHERE service.name == "checkout" AND @timestamp > NOW() - 15 minutes
  AND processor.event == "error"
| KEEP @timestamp, error.exception.type, error.message
| SORT @timestamp DESC
| LIMIT 50

If you only need a coarse signal (no message text), every shape carries event.outcome == "failure" on the trace doc and status.message / status.code == "Error" on OTel traces — count or sample those when neither exception.* nor error.* resolves.

If traces-*.otel-* returns empty, the deployment is classic-APM-only — fall back to traces-apm* with processor.event == "transaction" and transaction.duration.us.

Throughput trend — use the pre-aggregated rollup when possible. metrics-service_summary.1m.otel-* carries per-minute request counts in service_summary (a regular long, designed to SUM). Cheaper and faster than scanning raw traces for "how many requests/min over the last hour":

FROM metrics-service_summary.1m.otel-*
| WHERE service.name == "frontend" AND @timestamp > NOW() - 1 hour
| STATS throughput = SUM(service_summary)
  BY bucket = BUCKET(@timestamp, 1 minute)
| SORT bucket ASC

Log rate

Index: logs-*

FROM logs-*
| WHERE service.name == "cartservice" AND @timestamp > NOW() - 5m
| STATS v = COUNT(*)

Query-construction rules

For now and metric mode, the query must return a single row with a numeric first column — the tool reads the first numeric cell. For table mode this restriction doesn't apply: any shape is fine.
Scope with @timestamp > NOW() - <window> when the user implies "right now" (default 5m is usually fine; let the window match the user's language).
When the user names a namespace, match it exactly (e.g. oteldemo-esyox-default, not otel-demo). If unsure, call apm-health-summary first — its namespace_candidates field surfaces fuzzy matches.
Match the aggregation to the field's storage shape. Three shapes to recognize:
- Gauges (memory.working_set, memory.available, cpu.usage, filesystem.usage in metrics-kubeletstatsreceiver.otel*): use AVG / MAX / MIN. Do not SUM a gauge — it will add every ~15s kubelet sample over your window and inflate the value by hundreds or thousands.
- Counters (k8s.pod.network.io, k8s.node.uptime, etc. — counter_long type): require TS + RATE(). See the "Counter fields" section above. FROM + MAX/AVG/SUM on a counter is a hard error, not a silent wrong number.
- Pre-aggregated rollups (service_summary on metrics-service_summary.1m.otel-*, span.destination.service.response_time.count on metrics-service_destination.1m.otel-*): designed for SUM across the window. Each doc is already a per-minute bucket count.

After the tool returns

_setup_notice is view-side chrome — don't echo or summarize it in chat. There are three variants:

welcome and skill-gap are user-facing nudges to install the skill packs. Ignore them.
schema-hint is an actionable retry instruction the server emits when your query failed in a way the latest skill expects on this deployment shape (e.g. exceptions live in logs-*.otel-* instead of on the trace doc). Read its message and adapt your next call accordingly — don't surface the banner text in chat, but do follow its guidance.

The observe MCP App view renders inline in one of several modes, picked automatically from the result:

Now mode (status: NOW) — compact card: big unit-formatted number, ES|QL subtitle, "evaluated Xs ago" stamp, and three follow-up actions (re-check, escalate to live observation, create alert rule).
Metric mode — area + line + dots sparkline with optional threshold line; stat cards for current / threshold / peak / baseline. Covers CONDITION_MET, TIMEOUT, and SAMPLED.
Anomaly mode — severity-scored trigger card with affected entities and click-to-send investigation prompts.
Table mode (status: TABLE) — styled HTML table with column headers, type-aware alignment (numeric right, text left), and zebra-striped rows. Row count + truncation notice in the subtitle.
Error (status: ERROR) — red-toned card with the ES|QL failure message verbatim. Surfaces instead of throwing when the query is bad (unknown field, index missing, syntax error).

All modes surface an investigation_actions list as buttons. Follow up in chat too — don't rely on the buttons alone.

Status-by-status guidance

NOW — State the value plainly. Offer to escalate to a live observation if the user seems to want ongoing visibility.
TABLE — Summarize what the rows show (top entity, total count, any outliers). Don't just dump the full table back — the user can read the widget. If the result was truncated, say so and offer to tighten the ES|QL.
ERROR — Read the error message, explain what likely went wrong (unknown field, index pattern, syntax), and propose a corrected query. Don't retry blindly.
ALERT (anomaly fired) — The response includes affected entities, affected services, top anomalies, and investigation_hints naming the next tool to reach for. Follow those hints immediately — don't just report the alert, start investigating and narrate your reasoning.
CONDITION_MET (metric threshold satisfied) — Confirm to the user and describe the trend from the returned history. If this was post-remediation validation, explicitly state the fix has been validated. Offer to graduate the condition into a durable rule via manage-alerts.
SAMPLED (live sample completed without a condition) — Summarize the trend (trending up / down / flat, peak, typical). Offer "keep observing" (extend window) or graduate to an alert rule.
TIMEOUT (metric condition never met) — Tell the user the metric didn't stabilize. Suggest follow-ups: check ml-anomalies, persist as alert rule, re-examine the ES|QL.
QUIET (anomaly mode, nothing fired) — Suggest adjustments: lower min_score, widen lookback, verify ML jobs are running.

Accumulating timelines

Every metric-mode response includes an observe_key derived from esql + condition. When Claude re-invokes observe with the same ES|QL (e.g. via the "Extend observation (+60s)" button), the view merges the new samples into the existing sparkline instead of resetting — so the user sees a continuous timeline across multiple tool calls. To keep this continuity, reuse the exact same ES|QL string and condition when extending. Capped at 240 points to keep the chart readable.

Tools

Tool	Purpose
`observe`	Polls and blocks. Four modes: `anomaly`, `metric`, `now`, `table`.
`ml-anomalies`	Follow-up: deeper look at the anomaly that fired.
`apm-service-dependencies`	Follow-up: topology of affected services (if APM available).
`apm-health-summary`	Follow-up: cluster-wide context, and useful for discovering which namespaces actually have data.
`k8s-blast-radius`	Follow-up: infra impact if a node is implicated.
`manage-alerts`	Graduate to persistent alerting once the pattern is well-understood.

Key principles

Observe is transient. Nothing is saved. If the user wants an ongoing rule, use manage-alerts.
Pick the mode from the user's phrasing. "What is X right now" (scalar) → now. "Show me a live chart of X" or "watch X for 60s" → metric (no condition). "Wait until X drops below Y" → metric (with condition). "Tell me when anything unusual fires" → anomaly. "List …", "which pods are on node X", "top N by Y" → table. If a user asks "what is X" and X is actually a list or grouping (not a single number), pick table, not now.
Use the known field paths. Don't probe generic metrics-* patterns when the deployment indexes under metrics-kubeletstatsreceiver.otel*. The cheat sheet above is authoritative for this environment.
On ALERT, start investigating immediately. The investigation_hints are suggestions — follow them and narrate your reasoning.
Don't start with observe for vague triage. If the user reports a symptom without naming a specific metric ("something feels slow", "what's wrong with prod"), reach for apm-health-summary first — it surfaces the worst-offender services without needing a query. observe needs a target metric to poll; use it to drill in after the rollup names something.
Don't over-tune min_score. 75 catches the important stuff; dropping below 50 produces noise.

Investigation discipline

One tool call per turn. After this tool returns, narrate the headline finding before the next call. Each call adds a widget — chaining several after one "yes" looks like a runaway agent.
Sequential offers, not OR. "Want me to check X or Y?" → "Want me to start with X? If it's inconclusive I can follow up with Y." The user's "yes" then maps to one call, not both.
Build queries from known fields. Use the cheat-sheet above — don't guess field names. If an "Unknown column" error returns with a "did you mean…" suggestion, read it and adjust; don't blindly retry.

observe

Plus depuis ce dépôt

Plus depuis ce dépôt

Observe

Modes

Decision tree — pick based on tense FIRST, then on shape

Prerequisites

How to call observe

Anomaly mode (default)

Metric mode — threshold condition

Metric mode — live sample (no threshold)

Now mode — single read

Table mode — full ES|QL rows and columns

Picking the right index pattern

Common query patterns

OTel Kubernetes (kubeletstats receiver)

Counter fields — require TS + RATE()

APM traces

Log rate

Query-construction rules

After the tool returns

Status-by-status guidance

Accumulating timelines

Tools

Key principles

Investigation discipline

Observe

Modes

Decision tree — pick based on tense FIRST, then on shape

Prerequisites

How to call observe

Anomaly mode (default)

Metric mode — threshold condition

Metric mode — live sample (no threshold)

Now mode — single read

Table mode — full ES|QL rows and columns

Picking the right index pattern

Common query patterns

OTel Kubernetes (kubeletstats receiver)

Counter fields — require TS + RATE()

APM traces

Log rate

Query-construction rules

After the tool returns

Status-by-status guidance

Accumulating timelines

Tools

Key principles

Investigation discipline

Counter fields — require `TS` + `RATE()`

Counter fields — require `TS` + `RATE()`