agent-observability

Comprehensive skill for tracing reasoning paths, debugging non-deterministic agent loops, and monitoring agent behavior in production systems. Covers reasoning trace visualization, OpenTelemetry integration for agent systems, distributed tracing across multi-agent chains, decision audit logging, performance profiling, anomaly detection, cost tracking and optimization, and latency analysis for AI agent deployments.

Updated: June 5, 2026 at 09:02

Source: j4flmao/agent-skills

name	agent-observability
description	Comprehensive skill for tracing reasoning paths, debugging non-deterministic agent loops, and monitoring agent behavior in production systems. Covers reasoning trace visualization, OpenTelemetry integration for agent systems, distributed tracing across multi-agent chains, decision audit logging, performance profiling, anomaly detection, cost tracking and optimization, and latency analysis for AI agent deployments.
version	2.0.0
author	j4flmao
license	MIT
type	skill
compatibility	{"claude-code":true,"cursor":true,"codex":true,"windsurf":true}
tags	["harness-engineering","observability","tracing","monitoring","opentelemetry","debugging","performance","cost-optimization","anomaly-detection"]

Agent Observability

Purpose

This skill provides the knowledge required to build production-grade observability systems for AI agent deployments. Unlike traditional software, where execution paths are largely deterministic and can be debugged with standard tools, agent systems exhibit non-deterministic reasoning loops, branching decision trees, variable-length tool call chains, and stochastic output variations that demand specialized observability infrastructure.

The skill covers every dimension of agent observability:

Capturing and visualizing reasoning traces that show why an agent made each decision
Integrating OpenTelemetry to instrument every LLM call and tool invocation
Propagating distributed trace context across multi-agent chains
Maintaining tamper-evident decision audit logs
Profiling performance bottlenecks in agent pipelines
Detecting anomalous behavior patterns that indicate drift or failure
Tracking costs across model providers and optimizing token usage
Analyzing latency to identify and eliminate bottlenecks in the agent execution critical path

Core Principles

Trace Everything, Sample Intelligently: Every LLM call, tool invocation, and decision point must be instrumentable. In production, use head-based and tail-based sampling to control volume while guaranteeing capture of errors and anomalies.
Reasoning is the Primary Signal: Traditional metrics (latency, throughput, errors) are necessary but insufficient. The agent's reasoning trace (the chain of observations, thoughts, and actions) is the primary debugging and auditing artifact.
Correlation Across Boundaries: A single user request may traverse multiple agents, tools, sandboxes, and external APIs. Trace context must propagate across all boundaries to enable end-to-end visibility.
Cost is a First-Class Metric: Token consumption, API call counts, and dollar costs must be tracked with the same rigor as latency and error rates. Cost anomalies often indicate reasoning loops or prompt inefficiencies.
Detect Drift Before Failure: Anomaly detection on agent behavior distributions (response length, tool call frequency, reasoning depth) catches degradation before it manifests as user-visible failures.

Agent Protocol

Triggers

Agent system deployed to production requiring monitoring
Debugging non-deterministic agent behavior or reasoning failures
Cost overruns detected in agent API usage
Compliance requirement for decision audit trails
Performance degradation in agent response times
New agent chain requiring end-to-end trace instrumentation

Input Context Required

Agent system architecture (single agent, chain, graph, swarm)
Current instrumentation state (none, partial, full OpenTelemetry)
Observability backend (Jaeger, Tempo, Datadog, Honeycomb, custom)
Compliance requirements (audit retention period, tamper evidence)
Cost tracking requirements (per-request, per-agent, per-model)
Alert thresholds (latency P99, error rate, cost per request)

Output Artifacts

Instrumentation configuration (OpenTelemetry SDK setup)
Dashboard definitions (Grafana JSON, Datadog monitors)
Alert rules (Prometheus alerting rules, PagerDuty integrations)
Reasoning trace schema (structured JSON for trace storage)
Cost attribution report (per-agent, per-model breakdown)

Response Formats

{
  "trace_id": "4bf92f3577b34da6a3ce929d0e0e4736",
  "span_id": "00f067aa0ba902b7",
  "agent_id": "agent-planner-01",
  "request_id": "req-7f8a9b0c",
  "reasoning_trace": {
    "steps": [
      {
        "step_id": 1,
        "type": "observation",
        "content": "User requests weather forecast for Tokyo",
        "timestamp": "2026-06-04T09:00:00.123Z",
        "tokens_in": 47,
        "tokens_out": 0
      },
      {
        "step_id": 2,
        "type": "thought",
        "content": "Need to call weather API tool with location=Tokyo",
        "timestamp": "2026-06-04T09:00:00.456Z",
        "tokens_in": 0,
        "tokens_out": 32,
        "model": "claude-sonnet-4-20250514",
        "cost_usd": 0.00048
      },
      {
        "step_id": 3,
        "type": "action",
        "tool": "weather_api",
        "input": {"location": "Tokyo", "units": "metric"},
        "output": {"temp_c": 22, "condition": "partly_cloudy"},
        "latency_ms": 340,
        "timestamp": "2026-06-04T09:00:00.796Z"
      }
    ],
    "total_steps": 5,
    "total_tokens": 847,
    "total_cost_usd": 0.00127,
    "total_latency_ms": 2340
  },
  "metrics": {
    "llm_calls": 2,
    "tool_calls": 1,
    "retries": 0,
    "cache_hits": 1,
    "reasoning_depth": 5
  }
}

Decision Matrix

START: Observability requirement identified
│
├─ What is the primary goal?
│  ├─ DEBUGGING → Focus on reasoning trace capture
│  │  ├─ Single agent? → Instrument with local span collection
│  │  └─ Multi-agent? → Deploy distributed tracing with context propagation
│  │
│  ├─ COMPLIANCE → Focus on decision audit logging
│  │  ├─ Tamper-evident required? → Use append-only log with Merkle tree
│  │  └─ Standard audit? → Structured JSON logs with retention policy
│  │
│  ├─ COST CONTROL → Focus on cost tracking & optimization
│  │  ├─ Per-request attribution? → Tag spans with cost metadata
│  │  └─ Aggregate trends? → Build cost dashboards with model breakdown
│  │
│  └─ PERFORMANCE → Focus on latency analysis
│     ├─ Identify bottleneck first → Capture critical path analysis
│     ├─ LLM latency dominant? → Optimize prompts, enable caching
│     └─ Tool latency dominant? → Parallelize tool calls, add timeouts
│
├─ What sampling strategy?
│  ├─ Development → 100% sampling (capture everything)
│  ├─ Staging → Head-based 10% + tail-based on errors
│  └─ Production → Head-based 1% + tail-based on errors/anomalies
│
└─ What alerting is needed?
   ├─ Latency P99 > threshold → PagerDuty critical alert
   ├─ Error rate > 5% over 5min → Slack warning + auto-investigation
   ├─ Cost per request > 2x baseline → Cost alert + reasoning review
   └─ Anomaly score > 3σ → Anomaly alert + full trace capture
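
The sampling branch of the matrix can be sketched in a few lines of Python. This is a minimal illustration, not the skill's implementation: `TraceSummary`, `head_sample`, and `should_keep` are hypothetical names, and the head-based decision is derived deterministically from the trace ID so that every service seeing the same trace reaches the same verdict.

```python
from dataclasses import dataclass


@dataclass
class TraceSummary:
    """Hypothetical summary available once a trace has finished."""
    trace_id: str          # 32 lowercase hex characters
    had_error: bool
    anomaly_score: float   # distance from baseline, in standard deviations


def head_sample(trace_id: str, ratio: float) -> bool:
    """Head-based decision: keep roughly `ratio` of all trace IDs."""
    # Compare the low 64 bits of the trace ID against a threshold so the
    # decision is reproducible wherever the same trace ID is seen.
    low_bits = int(trace_id[-16:], 16)
    return low_bits < int(ratio * (1 << 64))


def should_keep(trace: TraceSummary, head_ratio: float,
                anomaly_sigma: float = 3.0) -> bool:
    """Tail-based override: errors and anomalies are always kept."""
    if trace.had_error or trace.anomaly_score > anomaly_sigma:
        return True
    return head_sample(trace.trace_id, head_ratio)
```

With `head_ratio=0.01` this corresponds to the production row of the matrix (1% head-based plus tail-based capture of errors and anomalies); `head_ratio=1.0` corresponds to the development row.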

Detailed Architectural Overview

┌───────────────────────────────────────────────────────────────────┐
│                       AGENT RUNTIME                               │
│  ┌──────────────────────────────────────────────────────────────┐ │
│  │                 INSTRUMENTATION LAYER                         │ │
│  │  ┌────────────┐  ┌────────────┐  ┌────────────────────────┐  │ │
│  │  │ LLM Call   │  │ Tool Call  │  │ Reasoning Step         │  │ │
│  │  │ Interceptor│  │ Interceptor│  │ Recorder               │  │ │
│  │  └─────┬──────┘  └─────┬──────┘  └──────────┬─────────────┘  │ │
│  │        │               │                     │                │ │
│  │        ▼               ▼                     ▼                │ │
│  │  ┌──────────────────────────────────────────────────────────┐ │ │
│  │  │          OPENTELEMETRY SDK (Traces + Metrics)            │ │ │
│  │  │  ┌──────────┐  ┌───────────┐  ┌───────────────────────┐ │ │ │
│  │  │  │ Tracer   │  │ Meter     │  │ Context Propagator    │ │ │ │
│  │  │  │ Provider │  │ Provider  │  │ (W3C TraceContext)    │ │ │ │
│  │  │  └──────────┘  └───────────┘  └───────────────────────┘ │ │ │
│  │  └──────────────────────┬───────────────────────────────────┘ │ │
│  └─────────────────────────┼────────────────────────────────────┘ │
│                            │                                      │
└────────────────────────────┼──────────────────────────────────────┘
                             │
              ┌──────────────┼──────────────┐
              ▼              ▼              ▼
┌──────────────────┐ ┌─────────────┐ ┌──────────────────┐
│  TRACE BACKEND   │ │  METRICS    │ │  LOG AGGREGATOR  │
│  ┌────────────┐  │ │  BACKEND    │ │  ┌────────────┐  │
│  │ Jaeger /   │  │ │ ┌─────────┐│ │  │ Loki /     │  │
│  │ Tempo /    │  │ │ │Prometheus││ │  │ Elastic /  │  │
│  │ Honeycomb  │  │ │ │/ Mimir  ││ │  │ CloudWatch │  │
│  └────────────┘  │ │ └─────────┘│ │  └────────────┘  │
└──────────────────┘ └─────────────┘ └──────────────────┘
              │              │              │
              ▼              ▼              ▼
┌─────────────────────────────────────────────────────────┐
│                   ANALYSIS LAYER                         │
│  ┌─────────────┐  ┌──────────────┐  ┌────────────────┐ │
│  │ Reasoning   │  │ Anomaly      │  │ Cost           │ │
│  │ Trace       │  │ Detection    │  │ Attribution    │ │
│  │ Visualizer  │  │ Engine       │  │ Engine         │ │
│  └─────────────┘  └──────────────┘  └────────────────┘ │
│  ┌─────────────┐  ┌──────────────┐  ┌────────────────┐ │
│  │ Latency     │  │ Decision     │  │ Alert          │ │
│  │ Critical    │  │ Audit        │  │ Manager        │ │
│  │ Path        │  │ Explorer     │  │ (PagerDuty)    │ │
│  └─────────────┘  └──────────────┘  └────────────────┘ │
└─────────────────────────────────────────────────────────┘

Observability Data Flow

  LLM Call ──► Span Created ──► Attributes Added ──► Span Exported
      │              │                │                    │
      ▼              ▼                ▼                    ▼
  Token Count    Trace Context   Model, Tokens,       OTLP Collector
  Recorded       Propagated     Cost Metadata          │
                                                       ├──► Traces → Jaeger
                                                       ├──► Metrics → Prometheus
                                                       └──► Logs → Loki

Workflow Steps

Phase 1: Instrumentation Setup

Install OpenTelemetry SDK and configure tracer/meter providers for the agent runtime.
Implement LLM call interceptors that capture model, tokens, latency, and cost per call.
Implement tool call interceptors that capture tool name, input/output, and latency.
Configure W3C TraceContext propagation for cross-agent and cross-service boundaries.
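
To make the propagation step concrete, here is a standard-library sketch of creating, parsing, and continuing a W3C `traceparent` header (`version-traceid-parentid-flags`). In practice the OpenTelemetry SDK's propagators handle this; the helper names below are hypothetical and the parser is simplified to the four-field layout.

```python
import re
import secrets
from typing import Optional

# version (2 hex) - trace-id (32 hex) - parent-id (16 hex) - flags (2 hex)
TRACEPARENT_RE = re.compile(
    r"^([0-9a-f]{2})-([0-9a-f]{32})-([0-9a-f]{16})-([0-9a-f]{2})$"
)


def new_traceparent(sampled: bool = True) -> str:
    """Start a new trace with random trace and span IDs."""
    flags = "01" if sampled else "00"
    return f"00-{secrets.token_hex(16)}-{secrets.token_hex(8)}-{flags}"


def parse_traceparent(header: str) -> Optional[dict]:
    """Return the trace context, or None if the header is malformed."""
    match = TRACEPARENT_RE.match(header.strip())
    if match is None:
        return None
    version, trace_id, span_id, flags = match.groups()
    # Version ff and all-zero IDs are invalid in the W3C format.
    if version == "ff" or trace_id == "0" * 32 or span_id == "0" * 16:
        return None
    return {
        "trace_id": trace_id,
        "parent_span_id": span_id,
        "sampled": bool(int(flags, 16) & 0x01),
    }


def child_traceparent(incoming: str) -> str:
    """Continue an incoming trace: same trace ID, fresh span ID."""
    ctx = parse_traceparent(incoming)
    if ctx is None:
        return new_traceparent()  # broken context: start a new trace
    flags = "01" if ctx["sampled"] else "00"
    return f"00-{ctx['trace_id']}-{secrets.token_hex(8)}-{flags}"
```

An agent that receives a request would call `child_traceparent` on the incoming header and attach the result to every outgoing HTTP call, gRPC call, or queue message it makes for that request.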

Phase 2: Reasoning Trace Capture

Define the reasoning step schema (observation, thought, action, result) with unique step IDs.
Instrument the agent's reasoning loop to emit structured trace events at each step.
Attach reasoning metadata (confidence scores, alternative paths considered) to trace spans.
Configure trace sampling strategy appropriate to the deployment environment.
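
One possible shape for the step schema and recorder is sketched below, using field names from the response format shown earlier. `ReasoningStep` and `ReasoningTrace` are hypothetical classes, and free-form metadata such as confidence scores is passed as keyword arguments.

```python
import json
from dataclasses import asdict, dataclass, field
from datetime import datetime, timezone

STEP_TYPES = ("observation", "thought", "action", "result")


@dataclass
class ReasoningStep:
    step_id: int
    type: str
    content: str
    timestamp: str
    tokens_in: int = 0
    tokens_out: int = 0
    metadata: dict = field(default_factory=dict)


class ReasoningTrace:
    """Collects the steps of one agent request in order."""

    def __init__(self, trace_id: str, agent_id: str):
        self.trace_id = trace_id
        self.agent_id = agent_id
        self.steps = []

    def record(self, step_type, content, tokens_in=0, tokens_out=0, **metadata):
        """Append a step; step IDs are assigned sequentially from 1."""
        if step_type not in STEP_TYPES:
            raise ValueError(f"unknown step type: {step_type!r}")
        step = ReasoningStep(
            step_id=len(self.steps) + 1,
            type=step_type,
            content=content,
            timestamp=datetime.now(timezone.utc).isoformat(timespec="milliseconds"),
            tokens_in=tokens_in,
            tokens_out=tokens_out,
            metadata=metadata,
        )
        self.steps.append(step)
        return step

    def to_json(self) -> str:
        """Serialize the trace with totals derived from the recorded steps."""
        return json.dumps({
            "trace_id": self.trace_id,
            "agent_id": self.agent_id,
            "reasoning_trace": {
                "steps": [asdict(s) for s in self.steps],
                "total_steps": len(self.steps),
                "total_tokens": sum(s.tokens_in + s.tokens_out for s in self.steps),
            },
        })
```

Calling `trace.record("thought", "Need to call weather API tool", tokens_out=32, confidence=0.9)` inside the agent loop would emit one structured event per step; the totals are computed at serialization time so they cannot disagree with the step list.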

Phase 3: Metrics & Dashboards

Define key metrics: LLM call latency (P50/P95/P99), tool call success rate, tokens per request, cost per request, reasoning depth distribution.
Configure metric exporters to the chosen backend (Prometheus, Datadog, CloudWatch).
Build dashboards showing real-time agent health, cost trends, and performance distributions.
Set up SLO-based monitoring with error budget tracking and burn-rate alerts.
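
Two of the calculations behind these steps are simple enough to show directly. The sketch below computes a nearest-rank percentile (for P50/P95/P99 latency) and an error-budget burn rate, defined here as the observed error rate divided by the error rate the SLO allows. Both function names are illustrative; a metrics backend would normally compute these from histograms.

```python
import math


def percentile(samples, p: int):
    """Nearest-rank percentile; p is an integer between 1 and 100."""
    if not samples:
        raise ValueError("no samples")
    ordered = sorted(samples)
    rank = math.ceil(p * len(ordered) / 100)  # 1-based rank
    return ordered[max(rank, 1) - 1]


def burn_rate(errors: int, total: int, slo_target: float) -> float:
    """How fast the error budget is being consumed in the observed window.

    1.0 means the budget is spent exactly at the allowed pace; values
    above 1.0 mean it will run out before the SLO period ends.
    """
    if not 0.0 < slo_target < 1.0:
        raise ValueError("slo_target must be strictly between 0 and 1")
    if total == 0:
        return 0.0
    error_budget = 1.0 - slo_target
    return (errors / total) / error_budget
```

For example, 5 failed requests out of 1,000 against a 99.9% success SLO gives a burn rate of about 5, meaning the budget is being consumed five times faster than the SLO permits.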

Phase 4: Decision Audit System

Design the audit log schema with request context, agent identity, decision rationale, and outcome.
Implement append-only audit log storage with cryptographic integrity verification.
Build an audit log query interface for compliance investigators and debugging workflows.
Configure retention policies aligned with regulatory requirements (30 days to 7 years).
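
The append-only log with integrity verification can be illustrated with a SHA-256 hash chain, in which each entry's hash covers the previous entry's hash plus its own canonical JSON payload. This is an in-memory sketch with hypothetical names (`AuditLog`, `entry_hash`); a real deployment would write to durable, access-controlled storage.

```python
import hashlib
import json

GENESIS_HASH = "0" * 64  # placeholder "previous hash" for the first entry


def entry_hash(prev_hash: str, payload: dict) -> str:
    """Hash the previous entry's hash together with this entry's payload."""
    # Canonical JSON (sorted keys, no extra whitespace) keeps hashes stable.
    body = json.dumps(payload, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256((prev_hash + body).encode("utf-8")).hexdigest()


class AuditLog:
    """Append-only log; there is deliberately no update or delete method."""

    def __init__(self):
        self._entries = []

    def append(self, payload: dict) -> str:
        prev = self._entries[-1]["hash"] if self._entries else GENESIS_HASH
        digest = entry_hash(prev, payload)
        # Store a copy so later mutation of the caller's dict cannot alter the log.
        stored = json.loads(json.dumps(payload))
        self._entries.append({"prev_hash": prev, "payload": stored, "hash": digest})
        return digest

    def verify(self) -> bool:
        """Recompute every hash; any edited or removed entry breaks the chain."""
        prev = GENESIS_HASH
        for entry in self._entries:
            if entry["prev_hash"] != prev:
                return False
            if entry_hash(prev, entry["payload"]) != entry["hash"]:
                return False
            prev = entry["hash"]
        return True
```

Because each hash depends on all earlier entries, changing one historical record invalidates every hash after it. Publishing the latest hash to a separate system periodically makes silent truncation of the tail detectable as well.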

Phase 5: Anomaly Detection

Establish behavioral baselines for key agent metrics (response length, tool call frequency, reasoning depth, cost per request).
Deploy statistical anomaly detection (z-score, IQR, isolation forest) on rolling windows.
Configure anomaly-triggered actions: increase sampling rate, capture full traces, alert on-call.
Build feedback loops where confirmed anomalies update detection thresholds.
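
As one example of the statistical detectors listed above, the following sketch applies a z-score test over a rolling window. `RollingZScore` is a hypothetical class; the window size, threshold, and the choice to exclude flagged values from the baseline are assumptions that would need tuning for a real workload.

```python
from collections import deque
from statistics import mean, stdev


class RollingZScore:
    """Flags values that sit far from the mean of a rolling baseline."""

    def __init__(self, window: int = 100, threshold: float = 3.0,
                 min_samples: int = 10):
        if min_samples < 2:
            raise ValueError("min_samples must be at least 2")
        self.values = deque(maxlen=window)
        self.threshold = threshold
        self.min_samples = min_samples

    def score(self, value: float) -> float:
        """Z-score of `value` against the current baseline (0.0 if unknown)."""
        if len(self.values) < self.min_samples:
            return 0.0  # not enough data to judge yet
        sigma = stdev(self.values)
        if sigma == 0:
            return 0.0  # flat baseline: this simple test cannot score it
        return (value - mean(self.values)) / sigma

    def observe(self, value: float) -> bool:
        """Score a new value, then add it to the baseline unless it was flagged."""
        is_anomaly = abs(self.score(value)) > self.threshold
        if not is_anomaly:
            # Keeping outliers out of the window stops one spike from
            # widening the baseline and masking the next one.
            self.values.append(value)
        return is_anomaly
```

One detector instance per metric (reasoning depth, tool call frequency, cost per request) is a reasonable starting point. Note that excluding flagged values means a genuine, lasting shift in behavior will keep alerting until the baseline is reset or retrained.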

Phase 6: Cost & Latency Optimization

Implement per-request cost attribution by tagging each LLM span with model pricing metadata.
Build cost breakdown reports by agent, model, tool, and customer/tenant.
Perform critical path analysis on latency traces to identify sequential bottlenecks.
Implement optimization recommendations: prompt caching, parallel tool calls, model downgrades.
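
Per-request cost attribution reduces to a lookup in a pricing table plus a group-by. The sketch below is illustrative: the prices are placeholders (the output price is chosen so that 32 output tokens reproduce the `cost_usd` of 0.00048 in the response format example), `small-model-example` is an invented model name, and unknown models raise an error rather than silently costing $0.

```python
from collections import defaultdict

# Placeholder prices in USD per million tokens; not current vendor pricing.
PRICING_USD_PER_MTOK = {
    "claude-sonnet-4-20250514": {"input": 3.00, "output": 15.00},
    "small-model-example": {"input": 0.25, "output": 1.25},
}


def span_cost(model: str, tokens_in: int, tokens_out: int) -> float:
    """Cost in USD of a single LLM span."""
    try:
        price = PRICING_USD_PER_MTOK[model]
    except KeyError:
        # Failing loudly avoids the "every request costs $0" failure mode.
        raise KeyError(f"no pricing configured for model {model!r}") from None
    return (tokens_in * price["input"] + tokens_out * price["output"]) / 1_000_000


def cost_by(spans, key: str) -> dict:
    """Total cost grouped by a span attribute such as 'agent_id' or 'model'."""
    totals = defaultdict(float)
    for span in spans:
        totals[span[key]] += span_cost(
            span["model"], span["tokens_in"], span["tokens_out"]
        )
    return dict(totals)
```

Calling `cost_by(spans, "agent_id")` or `cost_by(spans, "model")` on the same span list yields the per-agent and per-model breakdowns; a tenant breakdown only requires that spans carry a tenant attribute.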

Extended Troubleshooting Guide

Symptom	Primary Cause	Mitigation Action
Traces missing for some agent steps	Async context lost in callback chains	Use OpenTelemetry context propagation utilities; ensure async hooks are instrumented
High cardinality metric explosion	Unique trace/request IDs used as metric labels	Use bounded label sets; move unique IDs to trace attributes, not metric labels
Reasoning trace shows infinite loop	Agent re-evaluating same observation without progress	Add loop detection middleware; set max_reasoning_steps per request
Cost tracking shows $0 for all requests	Model pricing metadata not configured	Configure per-model token pricing in the cost attribution engine
Latency P99 10x higher than P50	Tail latency from cold LLM inference or rate limiting	Implement request hedging; add LLM response caching; monitor rate limit headers
Audit log gaps during high traffic	Log buffer overflow dropping entries	Use durable queue (Kafka) between agent and audit log; increase buffer capacity
Anomaly detector fires too many false positives	Baselines computed during non-representative period	Retrain baselines on 7+ days of production data; implement adaptive thresholds
Distributed trace context not propagating	Missing W3C TraceContext headers in HTTP/gRPC calls	Verify auto-instrumentation covers all HTTP clients; add manual propagation for custom protocols
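
The loop-detection mitigation in the table can be sketched as a small guard that the agent loop calls before executing each step. `LoopGuard` and `ReasoningLoopError` are hypothetical names, and the default limits are arbitrary; the guard enforces both a step budget (`max_reasoning_steps`) and a cap on identical repeated steps.

```python
import hashlib
import json
from collections import Counter


class ReasoningLoopError(RuntimeError):
    """Raised when a request exceeds its step budget or repeats itself."""


class LoopGuard:
    def __init__(self, max_reasoning_steps: int = 25, max_repeats: int = 3):
        self.max_reasoning_steps = max_reasoning_steps
        self.max_repeats = max_repeats
        self.steps = 0
        self.seen = Counter()

    def check(self, step_type: str, payload) -> None:
        """Call once per step, before the step is executed."""
        self.steps += 1
        if self.steps > self.max_reasoning_steps:
            raise ReasoningLoopError(
                f"step budget of {self.max_reasoning_steps} exceeded"
            )
        # Fingerprint the step so identical observations or tool calls
        # are recognized regardless of dict key order.
        fingerprint = hashlib.sha256(
            json.dumps([step_type, payload], sort_keys=True, default=str).encode("utf-8")
        ).hexdigest()
        self.seen[fingerprint] += 1
        if self.seen[fingerprint] > self.max_repeats:
            raise ReasoningLoopError(
                f"identical {step_type} step repeated {self.seen[fingerprint]} times"
            )
```

Create one guard per request. When it raises, the surrounding runtime can record the error on the active span and force full trace capture so the loop is available for debugging.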

Complete Execution Scenario

User Request: "Summarize the Q3 financial report and compare to Q2"
│
▼
┌──────────────────────────────────────────────────────────────┐
│ 1. TRACE INITIATED                                            │
│    trace_id: 4bf92f3577b34da6                                 │
│    root_span: "user_request" (agent-planner-01)               │
│    Sampling decision: RECORD (matches 1% head-based sample)   │
└─────────────────────┬────────────────────────────────────────┘
                      ▼
┌──────────────────────────────────────────────────────────────┐
│ 2. REASONING TRACE CAPTURED                                   │
│    Step 1: [observation] Parse user intent → "compare Q3/Q2"  │
│    Step 2: [thought] Need to retrieve both reports first      │
│    Step 3: [action] tool=doc_retriever input={q: "Q3 report"} │
│    Step 4: [action] tool=doc_retriever input={q: "Q2 report"} │
│    Step 5: [thought] Both retrieved, now analyze differences  │
│    Step 6: [action] tool=llm_analyze input={docs: [q3, q2]}   │
│    Step 7: [result] Summary generated with comparison table   │
└─────────────────────┬────────────────────────────────────────┘
                      ▼
┌──────────────────────────────────────────────────────────────┐
│ 3. METRICS EMITTED                                            │
│    llm_call_count: 3 | tool_call_count: 2                     │
│    total_tokens: 12,847 | total_cost: $0.0193                 │
│    total_latency: 4,230ms | reasoning_depth: 7                │
│    Critical path: LLM analyze (2,100ms) → doc_retriever (890ms)│
└─────────────────────┬────────────────────────────────────────┘
                      ▼
┌──────────────────────────────────────────────────────────────┐
│ 4. AUDIT LOG ENTRY                                            │
│    request_id: req-7f8a9b0c | agent: agent-planner-01        │
│    decision: "Retrieved Q3+Q2 reports, performed comparative  │
│    analysis using claude-sonnet-4-20250514"                  │
│    outcome: SUCCESS | confidence: 0.92                        │
│    hash_chain: sha256(prev_entry + this_entry)                │
└─────────────────────┬────────────────────────────────────────┘
                      ▼
┌──────────────────────────────────────────────────────────────┐
│ 5. ANOMALY CHECK                                              │
│    Reasoning depth 7: within 1σ of baseline (μ=6.2, σ=2.1)  │
│    Cost $0.0193: within normal range                          │
│    Latency 4,230ms: slightly above P90 (3,800ms) → MONITOR   │
│    No anomaly alert triggered                                 │
└──────────────────────────────────────────────────────────────┘
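
The critical path reported in step 3 can be derived by treating the request as a dependency graph and finding the longest path by duration. The sketch below assumes the two retrievals run in parallel and both feed the analysis step; the step names and the 640 ms duration for the Q2 retrieval are invented for illustration, while 2,100 ms and 890 ms come from the scenario.

```python
from typing import Dict, List, Tuple


def critical_path(durations_ms: Dict[str, int],
                  deps: Dict[str, List[str]]) -> Tuple[int, List[str]]:
    """Longest chain of dependent steps, assuming the graph is acyclic.

    `deps` maps each step to the steps that must finish before it starts.
    Returns (total duration in ms, step names in execution order).
    """
    memo: Dict[str, Tuple[int, List[str]]] = {}

    def longest_ending_at(node: str) -> Tuple[int, List[str]]:
        if node in memo:
            return memo[node]
        best_total, best_path = 0, []
        for dep in deps.get(node, []):
            total, path = longest_ending_at(dep)
            if total > best_total:
                best_total, best_path = total, path
        memo[node] = (best_total + durations_ms[node], best_path + [node])
        return memo[node]

    return max((longest_ending_at(n) for n in durations_ms), key=lambda r: r[0])


# Hypothetical durations; only 2100 and 890 appear in the scenario.
DURATIONS_MS = {
    "doc_retriever_q3": 890,
    "doc_retriever_q2": 640,
    "llm_analyze": 2100,
}
DEPENDENCIES = {"llm_analyze": ["doc_retriever_q3", "doc_retriever_q2"]}
```

Under these assumptions, `critical_path(DURATIONS_MS, DEPENDENCIES)` returns `(2990, ["doc_retriever_q3", "llm_analyze"])`: speeding up the Q2 retrieval would not shorten the request, whereas speeding up the Q3 retrieval or the analysis call would.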

Rules and Guidelines

Never log raw prompts or completions in metrics: Prompts and completions contain sensitive data and have unbounded cardinality. Store them in traces with appropriate access controls, never as metric labels.
Propagate trace context through every boundary: Every HTTP call, gRPC call, queue message, and sandbox invocation must carry W3C TraceContext headers. Broken trace context creates observability blind spots.
Cost attribution must be real-time: Cost data older than 5 minutes is too stale for anomaly detection. Compute cost at span-completion time using pre-configured pricing tables.
Audit logs are append-only and immutable: Once written, audit entries must never be modified or deleted within the retention period. Use cryptographic hash chains to detect tampering.
Alert on absence, not just presence: Missing traces, gaps in metric streams, and silent agents are often more dangerous than explicit errors. Monitor heartbeats and expected event rates.
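
The "alert on absence" rule can be illustrated with a heartbeat monitor that reports agents that have gone quiet. `HeartbeatMonitor` is a hypothetical class; timestamps are passed in explicitly (in seconds) so the logic stays independent of any particular clock or scheduler.

```python
from typing import Dict, List


class HeartbeatMonitor:
    """Tracks the last time each agent emitted any telemetry."""

    def __init__(self, max_silence_s: float):
        self.max_silence_s = max_silence_s
        self.last_seen: Dict[str, float] = {}

    def beat(self, agent_id: str, now: float) -> None:
        """Record that `agent_id` emitted a span, metric, or log at `now`."""
        self.last_seen[agent_id] = now

    def silent_agents(self, now: float) -> List[str]:
        """Agents whose most recent signal is older than the allowed silence."""
        return sorted(
            agent_id
            for agent_id, seen_at in self.last_seen.items()
            if now - seen_at > self.max_silence_s
        )
```

A periodic job would call `silent_agents(time.time())` and raise an alert for each returned ID. Agents that have never reported are invisible to this sketch, so the expected agent inventory should be registered up front.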

Reference Guides

Reasoning Trace Visualization — Visualizing and exploring agent reasoning paths
OpenTelemetry Agent Integration — Full OpenTelemetry setup for agent systems
Distributed Tracing for Agents — Cross-agent and cross-service trace propagation
Decision Audit Logging — Tamper-evident audit log architecture
Performance Profiling — Profiling agent performance and identifying bottlenecks
Anomaly Detection for Agents — Statistical and ML-based anomaly detection
Cost Tracking & Optimization — Token cost tracking and optimization strategies
Latency Analysis & Optimization — Critical path analysis and latency reduction

Handoff

sandbox-execution: Sandbox telemetry is a key input to the observability pipeline
prompt-engineering: Reasoning traces inform prompt optimization and debugging
safety-guardrails: Anomaly detection feeds into safety monitoring systems