wide-event-observability
Design and implement wide-event logging with tail sampling for context-rich, queryable observability
| name | wide-event-observability |
| description | Design and implement wide-event logging with tail sampling for context-rich, queryable observability |
| when_to_use | When designing logging/observability systems, debugging production incidents with inadequate context, or replacing scattered log statements with canonical log lines |
You are an observability architect implementing wide events / canonical log lines with tail sampling to transform logging from "grep text files" to "query structured events with business context."
Traditional logging is broken because:
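The solution: emit ONE comprehensive event per request per service, containing correlation IDs, service context, request details, business context, feature flags, domain-specific context, and error details: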
interface WideEvent {
// Correlation & Identity
timestamp: string; // ISO 8601
request_id: string; // Correlation across services
trace_id?: string; // Distributed tracing
span_id?: string;
// Service Context
service: string; // "checkout-api"
version: string; // "2.1.0"
deployment_id: string; // "deploy_abc123"
region: string; // "us-east-1"
// Request Details
method: string; // "POST"
path: string; // "/api/checkout"
status_code: number; // 200
duration_ms: number; // 245
outcome: 'success' | 'error';
// Business Context (HIGH VALUE)
user: {
id: string;
subscription: 'free' | 'premium' | 'enterprise';
account_age_days: number;
lifetime_value_cents: number;
};
// Feature Flags (for rollout debugging)
feature_flags: {
new_checkout_flow?: boolean;
beta_payment_ui?: boolean;
};
// Domain-Specific Context
cart?: {
total_cents: number;
item_count: number;
currency: string;
};
payment?: {
provider: 'stripe' | 'paypal';
method: 'card' | 'bank';
latency_ms: number;
attempt: number;
};
// Error Details (when applicable)
error?: {
type: string; // "PaymentDeclinedError"
code: string; // "card_declined"
message: string;
retriable: boolean;
provider_code?: string; // Stripe/PayPal specific
};
}
Sampling decision happens AFTER request completes:
function shouldSample(event: WideEvent): boolean {
// ALWAYS keep errors (100%)
if (event.status_code >= 500) return true;
if (event.error) return true;
// ALWAYS keep slow requests (tune threshold to your p99)
if (event.duration_ms > 2000) return true;
// ALWAYS keep VIPs / important cohorts
if (event.user?.subscription === 'enterprise') return true;
if (event.user?.lifetime_value_cents > 10000_00) return true;
// ALWAYS keep feature-flagged traffic (for rollout debugging)
if (event.feature_flags?.new_checkout_flow) return true;
// Randomly sample the rest (1-5%)
return Math.random() < 0.05;
}
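Why this works: the decision is made after the outcome is known, so every error and slow request is kept, and only routine successful traffic is sampled down.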
Replace "diary logs" with a request-scoped event builder that accumulates context during handling and emits exactly once when the request completes (in a finally block or a response finish hook).
BAD: Scattered logs
app.post('/checkout', async (req, res) => {
logger.info('Checkout started');
logger.info(`User: ${req.user.id}`);
const cart = await getCart(req.user.id);
logger.info(`Cart total: ${cart.total}`);
try {
const payment = await processPayment(cart);
logger.info(`Payment successful: ${payment.id}`);
res.json({ ok: true });
} catch (err) {
logger.error(`Payment failed: ${err.message}`);
throw err;
}
});
GOOD: Wide event
app.post('/checkout', async (req, res) => {
const event = req.wideEvent; // Request-scoped builder
const cart = await getCart(req.user.id);
event.cart = {
total_cents: cart.total,
item_count: cart.items.length,
currency: cart.currency
};
try {
const paymentStart = Date.now();
const payment = await processPayment(cart);
event.payment = {
provider: payment.provider,
method: payment.method, // required by the WideEvent payment shape
latency_ms: Date.now() - paymentStart,
attempt: payment.attempt
};
res.json({ ok: true });
} catch (err: any) {
event.error = {
type: err.name,
code: err.code,
message: err.message,
retriable: err.retriable
};
throw err; // Express 5 forwards this to the error middleware; on Express 4, pass it to next(err) instead
}
// Event emitted automatically in middleware's res.on('finish')
});
Do not log internal step-by-step narration unless absolutely required. The wide event is the authoritative record.
Exception: Infrastructure-level events (service startup, shutdown, health checks) can still be separate structured logs.
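The "accumulate context, then emit once" pattern can also be sketched without a web framework. The helper below is a minimal illustration, not the middleware used elsewhere in this document: `withWideEvent`, `EventFields`, and the `emit` callback are hypothetical names introduced here.

```typescript
// Hypothetical helper: only the "accumulate, then emit once in finally" shape comes from the text above.
type EventFields = Record<string, unknown>;

async function withWideEvent<T>(
  base: EventFields,
  emit: (event: EventFields) => void,
  handler: (event: EventFields) => Promise<T>,
): Promise<T> {
  const start = Date.now();
  const event: EventFields = { ...base, timestamp: new Date(start).toISOString() };
  try {
    // The handler enriches `event` with business context as it learns it.
    const result = await handler(event);
    if (event.outcome === undefined) event.outcome = 'success';
    return result;
  } catch (err) {
    event.outcome = 'error';
    // Only fill in error details if the handler did not already attach richer ones.
    if (event.error === undefined) {
      const e = err instanceof Error ? err : new Error(String(err));
      event.error = { type: e.name, message: e.message };
    }
    throw err;
  } finally {
    // Runs exactly once per call, on both the success and the failure path.
    event.duration_ms = Date.now() - start;
    emit(event);
  }
}
```

Because the emit lives in `finally`, a handler that throws still produces one event with whatever context was gathered before the failure.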
If using OTel tracing, enrich spans/events with business fields explicitly:
import { trace } from '@opentelemetry/api';
const span = trace.getActiveSpan();
if (span) {
span.setAttributes({
'user.subscription': user.subscription,
'user.account_age_days': user.accountAgeDays,
'feature_flags.new_checkout_flow': flags.newCheckoutFlow,
'cart.total_cents': cart.totalCents
});
}
OTel is a delivery mechanism, not a decision-maker. You must instrument business context deliberately.
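Span attributes are flat key/value pairs, while the wide event is nested. One way to bridge the two is a small flattening helper that produces the dotted keys used in the snippet above. This is a sketch under assumptions: `flattenAttributes` and `AttributeValue` are hypothetical names, and arrays and null values are simply skipped.

```typescript
type AttributeValue = string | number | boolean;

// Flattens nested wide-event fields into dotted keys,
// e.g. { user: { id: 'u1' } } becomes { 'user.id': 'u1' }.
function flattenAttributes(
  input: Record<string, unknown>,
  prefix = '',
  out: Record<string, AttributeValue> = {},
): Record<string, AttributeValue> {
  for (const [key, value] of Object.entries(input)) {
    const name = prefix ? `${prefix}.${key}` : key;
    if (typeof value === 'string' || typeof value === 'number' || typeof value === 'boolean') {
      out[name] = value;
    } else if (value !== null && typeof value === 'object' && !Array.isArray(value)) {
      flattenAttributes(value as Record<string, unknown>, name, out);
    }
    // null, undefined, arrays, and functions are skipped in this sketch.
  }
  return out;
}
```

The result could then be passed to `span.setAttributes(...)` in place of a hand-written attribute map, which keeps span keys aligned with the wide-event schema.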
Define a stable schema (even if flexible) and normalize keys:
BAD: Inconsistent keys
{ userId: '123' } // One endpoint
{ user_id: '123' } // Another endpoint
{ id: '123' } // Yet another
GOOD: Consistent schema
{
user: {
id: '123',
subscription: 'premium',
account_age_days: 730
}
}
Use a TypeScript interface or JSON Schema to enforce consistency.
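While a codebase is migrating to the consistent schema, a normalizer can fold the legacy spellings from the BAD example into the canonical `user.id` field. The sketch below is illustrative only: `normalizeUserKeys` and the `LEGACY_USER_ID_KEYS` list are hypothetical, and treating a bare top-level `id` as a user ID is an assumption that will not hold for every endpoint.

```typescript
// Legacy spellings taken from the BAD example; extend or trim for your codebase.
const LEGACY_USER_ID_KEYS = ['userId', 'user_id', 'id'] as const;

function normalizeUserKeys(raw: Record<string, unknown>): Record<string, unknown> {
  const out: Record<string, unknown> = { ...raw }; // shallow copy, input is not mutated
  for (const key of LEGACY_USER_ID_KEYS) {
    const value = out[key];
    if (typeof value !== 'string') continue;
    const user = { ...((out['user'] as Record<string, unknown> | undefined) ?? {}) };
    // An existing canonical user.id wins over any legacy spelling.
    if (user['id'] === undefined) user['id'] = value;
    out['user'] = user;
    delete out[key];
  }
  return out;
}
```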
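Never log: passwords, credit card numbers, SSNs, auth tokens, API keys, authorization headers, or cookies.
Prefer: removing sensitive fields before the event is emitted, and hashing PII (such as email addresses) when you still need to correlate on it.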
function redactSensitive(event: WideEvent): WideEvent {
const redacted = { ...event };
// Remove sensitive fields
delete redacted.password;
delete redacted.creditCard;
delete redacted.ssn;
// Hash PII if needed
if (redacted.email) {
redacted.email_hash = hashEmail(redacted.email);
delete redacted.email;
}
return redacted;
}
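The function above only inspects top-level keys, so a sensitive value nested under `user` or `payment` would pass through. A recursive variant is one way to close that gap. This is a sketch for plain JSON-like data: `redactDeep` and `SENSITIVE_KEYS` are hypothetical names, and the key list is an assumption to adapt.

```typescript
// Key names to strip at any depth (assumed list; adjust to your own schema).
const SENSITIVE_KEYS = new Set([
  'password', 'creditCard', 'ssn', 'authToken', 'apiKey', 'authorization', 'cookie',
]);

// Returns a redacted deep copy; the input object is left untouched.
// Intended for plain JSON-like data (objects, arrays, primitives).
function redactDeep(value: unknown): unknown {
  if (Array.isArray(value)) return value.map((item) => redactDeep(item));
  if (value !== null && typeof value === 'object') {
    const out: Record<string, unknown> = {};
    for (const [key, child] of Object.entries(value as Record<string, unknown>)) {
      if (SENSITIVE_KEYS.has(key)) continue; // drop the field entirely
      out[key] = redactDeep(child);
    }
    return out;
  }
  return value;
}
```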
// observability/wideEvent.ts
import type { Request, Response, NextFunction } from 'express';
import crypto from 'crypto';
export interface WideEvent {
timestamp: string;
request_id: string;
trace_id?: string;
service: string;
version: string;
deployment_id: string;
region: string;
method: string;
path: string;
status_code?: number;
duration_ms?: number;
outcome?: 'success' | 'error';
user?: {
id: string;
subscription: string;
account_age_days: number;
lifetime_value_cents: number;
};
feature_flags?: Record<string, boolean>;
error?: {
type: string;
code: string;
message: string;
retriable: boolean;
};
[key: string]: any;
}
function getOrCreateRequestId(req: Request): string {
const existing = req.header('x-request-id');
return existing ?? crypto.randomUUID();
}
export function wideEventMiddleware(
logger: { info: (obj: any, msg?: string) => void }
) {
return function (req: Request, res: Response, next: NextFunction) {
const start = Date.now();
const request_id = getOrCreateRequestId(req);
// Build initial event
const event: WideEvent = {
timestamp: new Date().toISOString(),
request_id,
service: process.env.SERVICE_NAME || 'unknown',
version: process.env.SERVICE_VERSION || '0.0.0',
deployment_id: process.env.DEPLOYMENT_ID || 'local',
region: process.env.REGION || 'local',
method: req.method,
path: req.path,
};
// Attach to request for handlers to enrich
(req as any).wideEvent = event;
// Include request id in response headers
res.setHeader('x-request-id', request_id);
// Emit event when response finishes
res.on('finish', () => {
event.status_code = res.statusCode;
event.duration_ms = Date.now() - start;
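// Treat handled failures (for example a declined payment returned as 4xx) as errors too
event.outcome = (res.statusCode >= 500 || event.error) ? 'error' : 'success';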
// Tail sampling decision
if (shouldSample(event)) {
logger.info(event, 'request_complete');
}
});
next();
};
}
function shouldSample(event: WideEvent): boolean {
// Always keep errors
if (event.status_code && event.status_code >= 500) return true;
if (event.error) return true;
// Always keep slow requests (tune to your p99)
if (event.duration_ms && event.duration_ms > 2000) return true;
// Always keep VIPs
if (event.user?.subscription === 'enterprise') return true;
if (event.user?.lifetime_value_cents && event.user.lifetime_value_cents > 10000_00) return true;
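// Always keep traffic with at least one feature flag enabled
// (checking only for keys would keep everything once a flags object is attached)
if (event.feature_flags && Object.values(event.feature_flags).some(Boolean)) return true;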
// Sample the rest (5%)
return Math.random() < 0.05;
}
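`Math.random()` makes an independent decision in every service, so a request kept by one service may be dropped by another, which weakens cross-service lookups by `request_id`. One option is to derive the random part of the decision from the request ID itself, so every service that sees the same ID reaches the same verdict. The sketch below assumes an FNV-1a-style hash over UTF-16 code units; `hashToUnitInterval` and `keepByRequestId` are hypothetical names.

```typescript
// FNV-1a-style 32-bit hash of a string, mapped to a fraction in [0, 1).
function hashToUnitInterval(input: string): number {
  let hash = 0x811c9dc5; // FNV offset basis
  for (let i = 0; i < input.length; i++) {
    hash ^= input.charCodeAt(i);
    hash = Math.imul(hash, 0x01000193); // FNV prime, 32-bit multiply
  }
  return (hash >>> 0) / 0x100000000;
}

// Deterministic replacement for `Math.random() < rate`:
// the same request ID always yields the same decision.
function keepByRequestId(requestId: string, rate: number): boolean {
  return hashToUnitInterval(requestId) < rate;
}
```

In `shouldSample`, the final line could then become `return keepByRequestId(event.request_id, 0.05);`, with the always-keep rules still evaluated first.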
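import express from 'express';
import { wideEventMiddleware } from './observability/wideEvent';
import logger from './observability/logger'; // assumed path for the pino logger module
const app = express();
app.use(wideEventMiddleware(logger));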
app.post('/api/checkout', async (req, res) => {
const event = (req as any).wideEvent;
// Add user context
const user = req.user; // From auth middleware
event.user = {
id: user.id,
subscription: user.subscription,
account_age_days: daysSince(user.createdAt),
lifetime_value_cents: user.lifetimeValueCents,
};
// Add feature flags
event.feature_flags = req.featureFlags; // From feature flag middleware
// Add business context
const cart = await getCart(user.id);
event.cart = {
total_cents: cart.totalCents,
item_count: cart.items.length,
currency: cart.currency,
};
try {
const paymentStart = Date.now();
const payment = await processPayment(cart, user);
event.payment = {
provider: payment.provider,
method: payment.method,
latency_ms: Date.now() - paymentStart,
attempt: payment.attempt,
};
res.json({ ok: true, orderId: payment.orderId });
} catch (err: any) {
event.error = {
type: err.name,
code: err.code || 'unknown',
message: err.message,
retriable: err.retriable ?? false,
provider_code: err.providerCode,
};
res.status(err.statusCode || 500).json({
error: err.code,
message: err.message,
});
}
});
import pino from 'pino';
const logger = pino({
level: process.env.LOG_LEVEL || 'info',
formatters: {
level: (label) => ({ level: label }),
},
redact: {
paths: [
'password',
'creditCard',
'ssn',
'authToken',
'apiKey',
'authorization',
'cookie',
],
remove: true,
},
// Pretty-print in development; otherwise write JSON to stdout for CloudWatch/Datadog ingestion
transport: process.env.NODE_ENV === 'development'
? { target: 'pino-pretty' }
: undefined,
});
export default logger;
On the client, use the same philosophy: emit fewer, better events with business context.
// observability/clientLogger.ts
interface ClientWideEvent {
timestamp: string;
session_id: string;
user_id?: string;
route: string;
build_version: string;
// Correlate with backend
request_id?: string;
// Device/Network context
device: {
type: 'mobile' | 'tablet' | 'desktop';
os: string;
browser: string;
};
network: {
effectiveType: string; // 4g, 3g, etc.
downlink?: number;
};
// Feature flags
feature_flags?: Record<string, boolean>;
// Error details
error?: {
type: string;
message: string;
stack?: string;
componentStack?: string;
};
// Business context
action?: string; // "checkout_submit", "payment_attempt"
outcome?: 'success' | 'error';
duration_ms?: number;
}
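The document refers to a `clientLogger` with a `logError` method but does not show it. A minimal sketch of what such a logger might look like follows. Everything here is an assumption: `createClientLogger`, its options, and the batching behaviour are hypothetical, and the network call is injected as `send` so the sketch stays independent of any browser API.

```typescript
type ClientError = { type: string; message: string; stack?: string; componentStack?: string | null };
type ClientEvent = Record<string, unknown> & { timestamp: string; session_id: string; route: string };

function createClientLogger(opts: {
  sessionId: string;
  getRoute: () => string;
  send: (batch: ClientEvent[]) => void; // e.g. a wrapper around fetch or a beacon call
  maxBatch?: number;
}) {
  const buffer: ClientEvent[] = [];
  const maxBatch = opts.maxBatch ?? 10;

  // Hands the buffered events to `send` and empties the buffer.
  function flush(): void {
    if (buffer.length === 0) return;
    opts.send(buffer.splice(0, buffer.length));
  }

  // Stamps shared context onto the event and buffers it.
  function log(fields: Record<string, unknown>): void {
    buffer.push({
      ...fields,
      timestamp: new Date().toISOString(),
      session_id: opts.sessionId,
      route: opts.getRoute(),
    });
    if (buffer.length >= maxBatch) flush();
  }

  return {
    log,
    flush,
    // Errors are flushed immediately, since the page may be about to break or unload.
    logError(error: ClientError): void {
      log({ outcome: 'error', error });
      flush();
    },
  };
}
```

A module could then export `clientLogger = createClientLogger({ ... })` with a `send` implementation suited to the app's ingestion endpoint.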
import React from 'react';
import { clientLogger } from './observability/clientLogger';
class ErrorBoundary extends React.Component<
{ children: React.ReactNode },
{ hasError: boolean }
> {
constructor(props: any) {
super(props);
this.state = { hasError: false };
}
static getDerivedStateFromError() {
return { hasError: true };
}
componentDidCatch(error: Error, errorInfo: React.ErrorInfo) {
clientLogger.logError({
type: 'ReactError',
message: error.message,
stack: error.stack,
componentStack: errorInfo.componentStack,
});
}
render() {
if (this.state.hasError) {
return <div>Something went wrong. Please refresh the page.</div>;
}
return this.props.children;
}
}
These demonstrate the shift from grep to analytics on structured events.
-- CloudWatch Insights / DataDog / Elastic
SELECT
error.code,
COUNT(*) as count,
AVG(duration_ms) as avg_duration
FROM events
WHERE
path = '/api/checkout'
AND outcome = 'error'
AND user.subscription = 'premium'
AND feature_flags.new_checkout_flow = true
AND timestamp > NOW() - INTERVAL '1 hour'
GROUP BY error.code
ORDER BY count DESC
Before wide events: Multiple searches across logs, manual correlation, no way to filter by feature flag.
After wide events: Single query with all context.
SELECT
payment.provider,
region,
PERCENTILE(payment.latency_ms, 95) as p95,
PERCENTILE(payment.latency_ms, 99) as p99
FROM events
WHERE
path = '/api/checkout'
AND payment.provider IS NOT NULL
AND timestamp > NOW() - INTERVAL '24 hours'
GROUP BY payment.provider, region
Insight: Identify if Stripe is slower in eu-west-1 than us-east-1.
SELECT
feature_flags.new_checkout_flow as has_flag,
AVG(duration_ms) as avg_duration,
SUM(CASE WHEN error IS NOT NULL THEN 1.0 ELSE 0 END) / COUNT(*) as error_rate -- 1.0 avoids integer division
FROM events
WHERE
path = '/api/checkout'
AND timestamp > NOW() - INTERVAL '1 hour'
GROUP BY has_flag
Insight: Compare error rate and latency between flag enabled vs disabled cohorts.
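One caveat when computing rates over tail-sampled data: every error is kept, but only a fraction of ordinary successes, so a raw ratio over stored events overstates the error rate. A common correction is to record the keep probability on each event and weight each event by its inverse. The sketch below illustrates the arithmetic; the `sample_rate` field, the flattened `SampledEvent` shape, and the function name are assumptions, not part of the schema in this document.

```typescript
interface SampledEvent {
  has_error: boolean;
  flag_enabled: boolean;
  sample_rate: number; // assumed field: probability this event was kept (1 for always-kept events)
}

// Returns the estimated true error rate per flag cohort.
function weightedErrorRateByFlag(events: SampledEvent[]): Map<boolean, number> {
  const totals = new Map<boolean, { weight: number; errorWeight: number }>();
  for (const e of events) {
    const w = 1 / e.sample_rate; // each kept event stands in for 1/sample_rate original requests
    const t = totals.get(e.flag_enabled) ?? { weight: 0, errorWeight: 0 };
    t.weight += w;
    if (e.has_error) t.errorWeight += w;
    totals.set(e.flag_enabled, t);
  }
  const rates = new Map<boolean, number>();
  totals.forEach((t, flag) => rates.set(flag, t.errorWeight / t.weight));
  return rates;
}
```

For this to work, the sampling function would need to report which rule kept the event (1 for always-keep rules, 0.05 for the random sample) so the emitter can store it alongside the other fields.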
SELECT *
FROM events
WHERE
request_id = 'req_abc123'
ORDER BY timestamp
Insight: See complete request journey across all services with one query using request_id.
Track these to measure improvement:
Response: Structured logging (JSON) is necessary but not sufficient. You must also:
Response: OTel is a delivery mechanism. It doesn't decide which business context is worth capturing.
You must instrument business context explicitly.
Response: Context must be captured at the moment it's available:
Adding context retroactively is impossible.
Response: Tail sampling keeps 100% of errors, slow requests, VIP and enterprise traffic, and feature-flagged traffic.
You only sample the noise (successful fast requests from regular users).
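To get a feel for the resulting volume, the kept fraction can be estimated from two numbers: the share of traffic matched by the always-keep rules and the random sample rate for the rest. The helper and the figures in the usage note are illustrative assumptions, not measurements.

```typescript
// Expected fraction of events kept by tail sampling.
// alwaysKeepShare: fraction of requests matched by any always-keep rule (errors, slow, VIP, flagged).
// randomRate: sample rate applied to everything else.
function expectedKeepFraction(alwaysKeepShare: number, randomRate: number): number {
  if (alwaysKeepShare < 0 || alwaysKeepShare > 1 || randomRate < 0 || randomRate > 1) {
    throw new RangeError('both arguments must be between 0 and 1');
  }
  return alwaysKeepShare + (1 - alwaysKeepShare) * randomRate;
}
```

For example, if 10% of requests hit an always-keep rule and the rest are sampled at 5%, roughly 0.10 + 0.90 × 0.05 = 0.145 of all events are stored, while every error and slow request is still retained.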
Do not complete until: