| name | observability |
| description | Use when diagnosing operation failures, stuck or slow operations, querying Jaeger traces, working with Grafana dashboards, debugging distributed system issues, or investigating worker selection and service communication problems. |
Load this skill when users report issues with operations. Use Jaeger first, not logs: KTRDR has comprehensive OpenTelemetry instrumentation that provides complete visibility into distributed operations, enabling first-response diagnosis instead of iterative detective work.
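Before querying for a specific operation, it can help to confirm that Jaeger is reachable and that KTRDR services are reporting spans. A minimal sketch, assuming the standard Jaeger query API on localhost:16686 used throughout this skill:

```bash
# List services that have reported spans to Jaeger (standard /api/services endpoint).
curl -s "http://localhost:16686/api/services" | jq -r '.data[]'
```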
Query Jaeger when user reports:
| Symptom | What Jaeger Shows |
|---|---|
| "Operation stuck" | Which phase is stuck and why |
| "Operation failed" | Exact error with full context |
| "Operation slow" | Bottleneck span immediately |
| "No workers selected" | Worker selection decision |
| "Missing data" | Data flow from IB to cache |
| "Service not responding" | HTTP call attempt and result |
Get the operation ID from the CLI output or API response (e.g., `op_training_20251113_123456_abc123`), then find its trace:

```bash
OPERATION_ID="op_training_20251113_123456_abc123"
curl -s "http://localhost:16686/api/traces?tag=operation.id:$OPERATION_ID&limit=1" | jq
```
```bash
# Get span summary with durations
curl -s "http://localhost:16686/api/traces?tag=operation.id:$OPERATION_ID" | jq '
  .data[0].spans[] |
  {
    span: .operationName,
    service: .process.serviceName,
    duration_ms: (.duration / 1000),
    error: ([.tags[] | select(.key == "error" and .value == "true")] | length > 0)
  }' | jq -s 'sort_by(.duration_ms) | reverse'
```
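On large traces the summary can be noisy; a sketch that keeps only slow spans (the 1-second threshold is an arbitrary choice):

```bash
# Same summary, limited to spans slower than 1 second.
curl -s "http://localhost:16686/api/traces?tag=operation.id:$OPERATION_ID" | jq '
  .data[0].spans[] |
  {span: .operationName, duration_ms: (.duration / 1000)} |
  select(.duration_ms > 1000)' | jq -s 'sort_by(.duration_ms) | reverse'
```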
curl -s "http://localhost:16686/api/traces?tag=operation.id:$OPERATION_ID" | jq '
.data[0].spans[] |
{
span: .operationName,
attributes: (.tags | map({key: .key, value: .value}) | from_entries)
}'
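When the full dump is too verbose, narrow it to a single span by name. A sketch; `training.training_loop` is only an example name, taken from the bottleneck list further down:

```bash
# Show attributes for one named span only.
SPAN_NAME="training.training_loop"
curl -s "http://localhost:16686/api/traces?tag=operation.id:$OPERATION_ID" | jq --arg name "$SPAN_NAME" '
  .data[0].spans[] |
  select(.operationName == $name) |
  (.tags | map({key: .key, value: .value}) | from_entries)'
```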
```bash
# Check for worker selection and dispatch
curl -s "http://localhost:16686/api/traces?tag=operation.id:$OPERATION_ID" | jq '
  .data[0].spans[] |
  select(.operationName == "worker_registry.select_worker") |
  .tags[] |
  select(.key | startswith("worker_registry.")) |
  {key: .key, value: .value}'
```
Look for:
- `worker_registry.total_workers: 0` → No workers started
- `worker_registry.capable_workers: 0` → No capable workers
- `worker_registry.selection_status: NO_WORKERS_AVAILABLE` → All busy
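If none of these attributes appear, check whether the selection span exists at all; if it is missing, the operation likely never reached worker selection. A sketch reusing the span name from the query above:

```bash
# Count worker_registry.select_worker spans in the trace (0 = selection never ran).
curl -s "http://localhost:16686/api/traces?tag=operation.id:$OPERATION_ID" | jq '
  [.data[0].spans[] | select(.operationName == "worker_registry.select_worker")] | length'
```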
curl -s "http://localhost:16686/api/traces?tag=operation.id:$OP_ID" | jq '
.data[0].spans[] |
select(.tags[] | select(.key == "error" and .value == "true")) |
{
span: .operationName,
service: .process.serviceName,
exception_type: (.tags[] | select(.key == "exception.type") | .value),
exception_message: (.tags[] | select(.key == "exception.message") | .value)
}'
Common errors:
- `ConnectionRefusedError` → Service not running (check `http.url`)
- `ValueError` → Invalid input parameters
- `DataNotFoundError` → Data not loaded (check `data.symbol`, `data.timeframe`)
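When the exception message alone is not enough, the full stack trace recorded on the span (the `exception.stacktrace` attribute listed in the reference below) can be printed directly. A sketch mirroring the error-span selection above:

```bash
# Print the recorded stack trace for each failing span.
curl -s "http://localhost:16686/api/traces?tag=operation.id:$OPERATION_ID" | jq -r '
  .data[0].spans[] |
  select(.tags[] | select(.key == "error" and .value == "true")) |
  (.tags[] | select(.key == "exception.stacktrace") | .value)'
```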
curl -s "http://localhost:16686/api/traces?tag=operation.id:$OP_ID" | jq '
.data[0].spans[] |
{
span: .operationName,
duration_ms: (.duration / 1000)
}' | jq -s 'sort_by(.duration_ms) | reverse | .[0]'
Common bottlenecks:
- `training.training_loop` → Check `training.device` (GPU vs CPU)
- `data.fetch` → Check `ib.latency_ms`
- `ib.fetch_historical` → Check `data.bars_requested`
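To judge whether the slowest span really dominates the operation, compare it to total trace wall-clock time. A sketch, assuming the Jaeger API's microsecond `startTime` and `duration` fields:

```bash
# Report the slowest span and its share of total trace time.
curl -s "http://localhost:16686/api/traces?tag=operation.id:$OPERATION_ID" | jq '
  .data[0].spans as $s |
  (($s | map(.startTime + .duration) | max) - ($s | map(.startTime) | min)) as $total_us |
  ($s | max_by(.duration)) |
  {
    slowest_span: .operationName,
    duration_ms: (.duration / 1000),
    share_of_trace_percent: ((.duration / $total_us * 100) | round)
  }'
```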
curl -s "http://localhost:16686/api/traces?tag=operation.id:$OP_ID" | jq '
.data[0].spans[] |
select(.operationName | startswith("POST") or startswith("GET")) |
{
http_call: .operationName,
url: (.tags[] | select(.key == "http.url") | .value),
status: (.tags[] | select(.key == "http.status_code") | .value),
error: (.tags[] | select(.key == "error.type") | .value)
}'
Look for:
- `http.status_code: null` → Connection failed
- `error.type: ConnectionRefusedError` → Target service not running
- `http.url` → Shows which service was being called

Key span attributes:
- `operation.id` — Operation identifier
- `operation.type` — TRAINING, BACKTESTING, DATA_DOWNLOAD
- `operation.status` — PENDING, RUNNING, COMPLETED, FAILED
- `worker_registry.total_workers` — Total registered workers
- `worker_registry.available_workers` — Available workers
- `worker_registry.capable_workers` — Capable workers for this operation
- `worker_registry.selected_worker_id` — Which worker was chosen
- `worker_registry.selection_status` — SUCCESS, NO_WORKERS_AVAILABLE, NO_CAPABLE_WORKERS
- `progress.percentage` — Current progress (0-100)
- `progress.phase` — Current execution phase
- `operations_service.instance_id` — OperationsService instance (check for mismatches)
- `exception.type` — Python exception class
- `exception.message` — Error message
- `exception.stacktrace` — Full stack trace
- `error.symbol`, `error.strategy` — Business context
- `http.status_code` — HTTP response status
- `http.url` — Target URL for HTTP calls
- `ib.latency_ms` — IB Gateway latency
- `training.device` — cuda:0 or cpu
- `gpu.utilization_percent` — GPU usage

When diagnosing with observability, use this structure:
🔍 **Trace Analysis for operation_id: {operation_id}**
**Trace Summary**:
- Trace ID: {trace_id}
- Total Duration: {duration_ms}ms
- Services: {list of services}
- Status: {OK/ERROR}
**Execution Flow**:
1. {span_name} ({service}) - {duration_ms}ms
2. {span_name} ({service}) - {duration_ms}ms
...
**Diagnosis**:
{identified_issue_with_evidence_from_spans}
**Root Cause**:
{root_cause_explanation_with_span_attributes}
**Solution**:
{recommended_fix_with_commands}
Check Grafana for quick diagnostics before diving into traces.
| Dashboard | Path | Use Case |
|---|---|---|
| System Overview | /d/ktrdr-system-overview | Service health, error rates, latency |
| Worker Status | /d/ktrdr-worker-status | Worker capacity, resource usage |
| Operations | /d/ktrdr-operations | Operation counts, success rates |
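To jump to a dashboard from the shell, a sketch; the Grafana base URL (localhost:3000) is an assumption for a default local deployment, and `/api/health` is Grafana's standard health endpoint:

```bash
# Adjust GRAFANA_URL to your deployment.
GRAFANA_URL="http://localhost:3000"
curl -s "$GRAFANA_URL/api/health" | jq           # confirm Grafana is responding
echo "$GRAFANA_URL/d/ktrdr-system-overview"      # System Overview (path from the table above)
echo "$GRAFANA_URL/d/ktrdr-worker-status"        # Worker Status
echo "$GRAFANA_URL/d/ktrdr-operations"           # Operations
```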
For comprehensive workflows and scenarios, see `docs/debugging/observability-debugging-workflows.md`.