wide-event-observability
Design and implement wide-event logging with tail sampling for context-rich, queryable observability
| name | wide-event-observability |
| description | Design and implement wide-event logging with tail sampling for context-rich, queryable observability |
| when_to_use | When designing logging/observability systems, debugging production incidents with inadequate context, or replacing scattered log statements with canonical log lines |
You are an observability architect implementing wide events / canonical log lines with tail sampling to transform logging from "grep text files" to "query structured events with business context."
Traditional logging is broken because:
The Solution: Emit ONE comprehensive event per request per service containing:
interface WideEvent {
// Correlation & Identity
timestamp: string; // ISO 8601
request_id: string; // Correlation across services
trace_id?: string; // Distributed tracing
span_id?: string;
// Service Context
service: string; // "checkout-api"
version: string; // "2.1.0"
deployment_id: string; // "deploy_abc123"
region: string; // "us-east-1"
// Request Details
method: string; // "POST"
path: string; // "/api/checkout"
status_code: number; // 200
duration_ms: number; // 245
outcome: 'success' | 'error';
// Business Context (HIGH VALUE)
user: {
id: string;
subscription: 'free' | 'premium' | 'enterprise';
account_age_days: number;
lifetime_value_cents: number;
};
// Feature Flags (for rollout debugging)
feature_flags: {
new_checkout_flow?: boolean;
beta_payment_ui?: boolean;
};
// Domain-Specific Context
cart?: {
total_cents: number;
item_count: number;
currency: string;
};
payment?: {
provider: 'stripe' | 'paypal';
method: 'card' | 'bank';
latency_ms: number;
attempt: number;
};
// Error Details (when applicable)
error?: {
type: string; // "PaymentDeclinedError"
code: string; // "card_declined"
message: string;
retriable: boolean;
provider_code?: string; // Stripe/PayPal specific
};
}
Sampling decision happens AFTER request completes:
function shouldSample(event: WideEvent): boolean {
// ALWAYS keep errors (100%)
if (event.status_code >= 500) return true;
if (event.error) return true;
// ALWAYS keep slow requests (tune threshold to your p99)
if (event.duration_ms > 2000) return true;
// ALWAYS keep VIPs / important cohorts
if (event.user?.subscription === 'enterprise') return true;
if (event.user?.lifetime_value_cents > 10000_00) return true;
// ALWAYS keep feature-flagged traffic (for rollout debugging)
if (event.feature_flags?.new_checkout_flow) return true;
// Randomly sample the rest (1-5%)
return Math.random() < 0.05;
}
Why this works:
Replace "diary logs" with a request-scoped event builder that accumulates context during handling and emits once in finally.
❌ BAD: Scattered logs
app.post('/checkout', async (req, res) => {
logger.info('Checkout started');
logger.info(`User: ${req.user.id}`);
const cart = await getCart(req.user.id);
logger.info(`Cart total: ${cart.total}`);
try {
const payment = await processPayment(cart);
logger.info(`Payment successful: ${payment.id}`);
res.json({ ok: true });
} catch (err) {
logger.error(`Payment failed: ${err.message}`);
throw err;
}
});
✅ GOOD: Wide event
app.post('/checkout', async (req, res) => {
const event = req.wideEvent; // Request-scoped builder
const cart = await getCart(req.user.id);
event.cart = {
total_cents: cart.total,
item_count: cart.items.length,
currency: cart.currency
};
try {
const paymentStart = Date.now();
const payment = await processPayment(cart);
event.payment = {
provider: payment.provider,
method: payment.method, // Required by the WideEvent payment schema
latency_ms: Date.now() - paymentStart,
attempt: payment.attempt
};
res.json({ ok: true });
} catch (err: any) {
event.error = {
type: err.name,
code: err.code,
message: err.message,
retriable: err.retriable
};
throw err; // Assumes Express 5 or an async wrapper that forwards the rejection to the error handler
}
// Event emitted automatically in middleware's res.on('finish')
});
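The "accumulate context, emit once" idea does not depend on Express. As a minimal sketch, assuming a hypothetical `withWideEvent` helper and a caller-supplied `emit` function (neither is part of the original skill), the same pattern can be written with a plain `try/catch/finally`:

```typescript
type EventFields = Record<string, unknown>;

// Hypothetical helper: runs a handler with a mutable event and emits that
// event exactly once, whether the handler resolves or throws.
async function withWideEvent<T>(
  base: EventFields,
  emit: (event: EventFields) => void,
  handler: (event: EventFields) => Promise<T>
): Promise<T> {
  const start = Date.now();
  const event: EventFields = { ...base, timestamp: new Date(start).toISOString() };
  try {
    const result = await handler(event);
    event.outcome = 'success';
    return result;
  } catch (err) {
    const e = err instanceof Error ? err : new Error(String(err));
    event.outcome = 'error';
    event.error = { type: e.name, message: e.message };
    throw err; // the caller still sees the original failure
  } finally {
    event.duration_ms = Date.now() - start;
    emit(event); // single emission point
  }
}
```

Because the emission sits in `finally`, a handler that throws still produces one event carrying whatever context it had accumulated up to the failure.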
Do not log internal step-by-step narration unless absolutely required. The wide event is the authoritative record.
Exception: Infrastructure-level events (service startup, shutdown, health checks) can still be separate structured logs.
If using OTel tracing, enrich spans/events with business fields explicitly:
import { trace } from '@opentelemetry/api';
const span = trace.getActiveSpan();
if (span) {
span.setAttributes({
'user.subscription': user.subscription,
'user.account_age_days': user.accountAgeDays,
'feature_flags.new_checkout_flow': flags.newCheckoutFlow,
'cart.total_cents': cart.totalCents
});
}
OTel is a delivery mechanism, not a decision-maker. You must instrument business context deliberately.
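OpenTelemetry span attributes are flat and take primitive values (or arrays of primitives), so a nested wide event has to be flattened before it is attached to a span. The following is a rough sketch, assuming a hypothetical `flattenAttributes` helper and dot-separated key names in the style of the example above:

```typescript
type AttributeValue = string | number | boolean;

// Flattens nested plain objects into dot-separated keys with primitive values.
// null and undefined are skipped; arrays are JSON-encoded as a simplification.
function flattenAttributes(
  input: Record<string, unknown>,
  prefix = ''
): Record<string, AttributeValue> {
  const out: Record<string, AttributeValue> = {};
  for (const [key, value] of Object.entries(input)) {
    const path = prefix ? `${prefix}.${key}` : key;
    if (value === null || value === undefined) continue;
    if (typeof value === 'string' || typeof value === 'number' || typeof value === 'boolean') {
      out[path] = value;
    } else if (Array.isArray(value)) {
      out[path] = JSON.stringify(value);
    } else if (typeof value === 'object') {
      Object.assign(out, flattenAttributes(value as Record<string, unknown>, path));
    }
    // Other types (functions, symbols, bigint) are dropped.
  }
  return out;
}

const attrs = flattenAttributes({
  user: { subscription: 'premium', account_age_days: 730 },
  feature_flags: { new_checkout_flow: true },
  cart: { total_cents: 4999 },
});
// attrs can then be passed to span.setAttributes(attrs).
```

Flattening in one place keeps attribute names identical between the log event and the span, which makes the two easier to join later.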
Define a stable schema (even if flexible) and normalize keys:
❌ BAD: Inconsistent keys
{ userId: '123' } // One endpoint
{ user_id: '123' } // Another endpoint
{ id: '123' } // Yet another
✅ GOOD: Consistent schema
{
user: {
id: '123',
subscription: 'premium',
account_age_days: 730
}
}
Use a TypeScript interface or JSON Schema to enforce consistency.
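A type only helps code that is compiled against it. For events assembled from mixed sources, a small runtime normalizer can catch casing drift before emit. This is a sketch under the assumption that the target convention is snake_case and that events contain only JSON-like data; `toSnakeCase` and `normalizeKeys` are illustrative names:

```typescript
// Converts one camelCase key to snake_case: "accountAgeDays" -> "account_age_days".
function toSnakeCase(key: string): string {
  return key.replace(/([a-z0-9])([A-Z])/g, '$1_$2').toLowerCase();
}

// Recursively rewrites object keys; assumes plain JSON-like values
// (objects, arrays, primitives), not class instances such as Date.
function normalizeKeys(value: unknown): unknown {
  if (Array.isArray(value)) return value.map(normalizeKeys);
  if (value !== null && typeof value === 'object') {
    const out: Record<string, unknown> = {};
    for (const [k, v] of Object.entries(value as Record<string, unknown>)) {
      out[toSnakeCase(k)] = normalizeKeys(v);
    }
    return out;
  }
  return value;
}

const normalized = normalizeKeys({ userId: '123', cart: { totalCents: 4999 } });
```

This fixes casing only. Structural drift, such as a bare `id` where `user.id` is expected, still needs the shared interface or a schema check.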
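Never log: passwords, credit card numbers, SSNs, auth tokens, API keys, authorization headers, or cookies.
Prefer: removing sensitive fields before emit, and hashing PII such as email addresses when a stable identifier is still needed.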
function redactSensitive(event: WideEvent): WideEvent {
const redacted = { ...event };
// Remove sensitive fields
delete redacted.password;
delete redacted.creditCard;
delete redacted.ssn;
// Hash PII if needed
if (redacted.email) {
redacted.email_hash = hashEmail(redacted.email);
delete redacted.email;
}
return redacted;
}
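The function above only inspects top-level keys, so a field such as `user.email` would pass through untouched. A recursive variant might look like the sketch below. The deny-list, the `redactDeep` name, and the choice of SHA-256 are assumptions for illustration. An unsalted hash of an email address can be reversed by guessing, so a keyed HMAC may be a better fit where that matters.

```typescript
import { createHash } from 'node:crypto';

// Assumption: this deny-list mirrors the field names used in this document.
const SENSITIVE_KEYS = new Set(['password', 'creditCard', 'ssn', 'authToken', 'apiKey']);

function hashValue(value: string): string {
  return createHash('sha256').update(value.trim().toLowerCase()).digest('hex');
}

// Returns a redacted deep copy; the input is not mutated.
function redactDeep(value: unknown): unknown {
  if (Array.isArray(value)) return value.map(redactDeep);
  if (value === null || typeof value !== 'object') return value;
  const out: Record<string, unknown> = {};
  for (const [key, v] of Object.entries(value as Record<string, unknown>)) {
    if (SENSITIVE_KEYS.has(key)) continue; // drop the field entirely
    if (key === 'email' && typeof v === 'string') {
      out.email_hash = hashValue(v); // keep a join key, not the address
      continue;
    }
    out[key] = redactDeep(v);
  }
  return out;
}
```

Running this once, immediately before emit, means handlers cannot bypass it by attaching nested objects.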
// observability/wideEvent.ts
import type { Request, Response, NextFunction } from 'express';
import crypto from 'crypto';
export interface WideEvent {
timestamp: string;
request_id: string;
trace_id?: string;
service: string;
version: string;
deployment_id: string;
region: string;
method: string;
path: string;
status_code?: number;
duration_ms?: number;
outcome?: 'success' | 'error';
user?: {
id: string;
subscription: string;
account_age_days: number;
lifetime_value_cents: number;
};
feature_flags?: Record<string, boolean>;
error?: {
type: string;
code: string;
message: string;
retriable: boolean;
};
[key: string]: any;
}
function getOrCreateRequestId(req: Request): string {
const existing = req.header('x-request-id');
return existing ?? crypto.randomUUID();
}
export function wideEventMiddleware(
logger: { info: (obj: any, msg?: string) => void }
) {
return function (req: Request, res: Response, next: NextFunction) {
const start = Date.now();
const request_id = getOrCreateRequestId(req);
// Build initial event
const event: WideEvent = {
timestamp: new Date().toISOString(),
request_id,
service: process.env.SERVICE_NAME || 'unknown',
version: process.env.SERVICE_VERSION || '0.0.0',
deployment_id: process.env.DEPLOYMENT_ID || 'local',
region: process.env.REGION || 'local',
method: req.method,
path: req.path,
};
// Attach to request for handlers to enrich
(req as any).wideEvent = event;
// Include request id in response headers
res.setHeader('x-request-id', request_id);
// Emit event when response finishes
res.on('finish', () => {
event.status_code = res.statusCode;
event.duration_ms = Date.now() - start;
// Treat handled failures (e.g. a 4xx with event.error set) as errors too
event.outcome = (event.error || res.statusCode >= 500) ? 'error' : 'success';
// Tail sampling decision
if (shouldSample(event)) {
logger.info(event, 'request_complete');
}
});
next();
};
}
function shouldSample(event: WideEvent): boolean {
// Always keep errors
if (event.status_code && event.status_code >= 500) return true;
if (event.error) return true;
// Always keep slow requests (tune to your p99)
if (event.duration_ms && event.duration_ms > 2000) return true;
// Always keep VIPs
if (event.user?.subscription === 'enterprise') return true;
if (event.user?.lifetime_value_cents && event.user.lifetime_value_cents > 10000_00) return true;
// Always keep traffic with at least one feature flag enabled
// (checking only for the presence of keys would keep every request)
if (event.feature_flags && Object.values(event.feature_flags).some(Boolean)) return true;
// Sample the rest (5%)
return Math.random() < 0.05;
}
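Attaching the event to `req` works inside route handlers, but code deeper in the call stack (repositories, API clients) usually has no access to `req`. One option is Node's `AsyncLocalStorage`. The sketch below assumes hypothetical `runWithEvent` and `enrich` helpers. In the middleware, `next()` would be wrapped as `runWithEvent(event, () => next())`.

```typescript
import { AsyncLocalStorage } from 'node:async_hooks';

type EventFields = Record<string, unknown>;

const eventStore = new AsyncLocalStorage<EventFields>();

// Runs `fn` with `event` as the current request's wide event.
function runWithEvent<T>(event: EventFields, fn: () => T): T {
  return eventStore.run(event, fn);
}

// Adds fields to the current wide event; a no-op outside a request scope.
function enrich(fields: EventFields): void {
  const event = eventStore.getStore();
  if (event) Object.assign(event, fields);
}
```

With this in place, a payment client can call `enrich({ payment_attempt: 2 })` without receiving the request object as a parameter.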
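import express from 'express';
import { wideEventMiddleware } from './observability/wideEvent';
import logger from './logger'; // Path is an assumption; point this at your logger module
const app = express();
app.use(wideEventMiddleware(logger));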
app.post('/api/checkout', async (req, res) => {
const event = (req as any).wideEvent;
// Add user context
const user = (req as any).user; // From auth middleware
event.user = {
id: user.id,
subscription: user.subscription,
account_age_days: daysSince(user.createdAt),
lifetime_value_cents: user.lifetimeValueCents,
};
// Add feature flags
event.feature_flags = (req as any).featureFlags; // From feature flag middleware
// Add business context
const cart = await getCart(user.id);
event.cart = {
total_cents: cart.totalCents,
item_count: cart.items.length,
currency: cart.currency,
};
try {
const paymentStart = Date.now();
const payment = await processPayment(cart, user);
event.payment = {
provider: payment.provider,
method: payment.method,
latency_ms: Date.now() - paymentStart,
attempt: payment.attempt,
};
res.json({ ok: true, orderId: payment.orderId });
} catch (err: any) {
event.error = {
type: err.name,
code: err.code || 'unknown',
message: err.message,
retriable: err.retriable ?? false,
provider_code: err.providerCode,
};
res.status(err.statusCode || 500).json({
error: err.code,
message: err.message,
});
}
});
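The handler above calls `daysSince`, which is not defined anywhere in this document. One plausible implementation, offered as a guess at the intended behavior rather than the original helper:

```typescript
const MS_PER_DAY = 24 * 60 * 60 * 1000;

// Whole days elapsed between `date` and `now`, clamped at zero.
// `now` is a parameter so the function can be tested deterministically.
function daysSince(date: Date | string, now: Date = new Date()): number {
  const then = typeof date === 'string' ? new Date(date) : date;
  const elapsedMs = now.getTime() - then.getTime();
  return Math.max(0, Math.floor(elapsedMs / MS_PER_DAY));
}
```

Storing the derived `account_age_days` on the event, rather than the raw creation date, keeps cohort queries simple at read time.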
import pino from 'pino';
const logger = pino({
level: process.env.LOG_LEVEL || 'info',
formatters: {
level: (label) => ({ level: label }),
},
redact: {
paths: [
'password',
'creditCard',
'ssn',
'authToken',
'apiKey',
'authorization',
'cookie',
],
remove: true,
},
// Send to stdout for CloudWatch/Datadog ingestion
transport: process.env.NODE_ENV === 'development'
? { target: 'pino-pretty' }
: undefined,
});
export default logger;
Use the same philosophy: emit fewer, better events with business context.
// observability/clientLogger.ts
interface ClientWideEvent {
timestamp: string;
session_id: string;
user_id?: string;
route: string;
build_version: string;
// Correlate with backend
request_id?: string;
// Device/Network context
device: {
type: 'mobile' | 'tablet' | 'desktop';
os: string;
browser: string;
};
network: {
effectiveType: string; // 4g, 3g, etc.
downlink?: number;
};
// Feature flags
feature_flags?: Record<string, boolean>;
// Error details
error?: {
type: string;
message: string;
stack?: string;
componentStack?: string;
};
// Business context
action?: string; // "checkout_submit", "payment_attempt"
outcome?: 'success' | 'error';
duration_ms?: number;
cart?: { total_cents: number; item_count: number }; // Logged by the checkout example
}
class ClientLogger {
private sessionId: string;
constructor() {
this.sessionId = this.getOrCreateSessionId();
this.setupErrorHandlers();
}
private getOrCreateSessionId(): string {
let sessionId = sessionStorage.getItem('session_id');
if (!sessionId) {
sessionId = crypto.randomUUID();
sessionStorage.setItem('session_id', sessionId);
}
return sessionId;
}
private setupErrorHandlers() {
// Unhandled errors
window.addEventListener('error', (event) => {
this.logError({
type: 'UnhandledError',
message: event.message,
stack: event.error?.stack,
});
});
// Unhandled promise rejections
window.addEventListener('unhandledrejection', (event) => {
this.logError({
type: 'UnhandledRejection',
message: String(event.reason),
});
});
}
logEvent(event: Partial<ClientWideEvent>) {
const fullEvent: ClientWideEvent = {
timestamp: new Date().toISOString(),
session_id: this.sessionId,
user_id: this.getUserId(),
route: window.location.pathname,
build_version: process.env.REACT_APP_VERSION || 'unknown',
device: this.getDeviceContext(),
network: this.getNetworkContext(),
feature_flags: this.getFeatureFlags(),
...event,
};
// Send to backend or observability service
this.send(fullEvent);
}
logError(error: NonNullable<ClientWideEvent['error']>) {
this.logEvent({
error,
outcome: 'error',
});
}
private getUserId(): string | undefined {
// Get from auth context
return (window as any).__USER_ID__;
}
private getDeviceContext() {
const ua = navigator.userAgent;
return {
type: this.detectDeviceType(ua),
os: this.detectOS(ua),
browser: this.detectBrowser(ua),
};
}
private getNetworkContext() {
const connection = (navigator as any).connection;
return {
effectiveType: connection?.effectiveType || 'unknown',
downlink: connection?.downlink,
};
}
private getFeatureFlags(): Record<string, boolean> {
return (window as any).__FEATURE_FLAGS__ || {};
}
private send(event: ClientWideEvent) {
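// Send to backend or observability service.
// A Blob sets the Content-Type; a bare string would be sent as text/plain.
if (navigator.sendBeacon) {
const body = new Blob([JSON.stringify(event)], { type: 'application/json' });
navigator.sendBeacon('/api/events', body);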
} else {
fetch('/api/events', {
method: 'POST',
body: JSON.stringify(event),
headers: { 'Content-Type': 'application/json' },
}).catch(() => {
// Ignore errors in logging
});
}
}
}
export const clientLogger = new ClientLogger();
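The class above calls `detectDeviceType`, `detectOS`, and `detectBrowser` without defining them. The sketch below fills them in as standalone functions using rough user-agent heuristics. It is an assumption about the intent, not the original code, and user-agent sniffing is inherently approximate: iPadOS Safari, for example, can report itself as a Mac, and Chrome on iOS would be reported as Safari here.

```typescript
type DeviceType = 'mobile' | 'tablet' | 'desktop';

function detectDeviceType(ua: string): DeviceType {
  if (/iPad|Tablet/i.test(ua)) return 'tablet';
  // Android tablets usually omit the "Mobile" token.
  if (/Android/i.test(ua) && !/Mobile/i.test(ua)) return 'tablet';
  if (/Mobi|iPhone|iPod|Android/i.test(ua)) return 'mobile';
  return 'desktop';
}

function detectOS(ua: string): string {
  if (/Windows/i.test(ua)) return 'windows';
  if (/iPhone|iPad|iPod/i.test(ua)) return 'ios'; // before macOS: iOS UAs contain "Mac OS X"
  if (/Android/i.test(ua)) return 'android'; // before Linux: Android UAs contain "Linux"
  if (/Mac OS X|Macintosh/i.test(ua)) return 'macos';
  if (/Linux/i.test(ua)) return 'linux';
  return 'unknown';
}

function detectBrowser(ua: string): string {
  // Order matters: Edge and Opera UAs also contain "Chrome/",
  // and Chrome UAs also contain "Safari/".
  if (/Edg\//.test(ua)) return 'edge';
  if (/OPR\//.test(ua)) return 'opera';
  if (/Firefox\//.test(ua)) return 'firefox';
  if (/Chrome\//.test(ua)) return 'chrome';
  if (/Safari\//.test(ua)) return 'safari';
  return 'unknown';
}
```

Inside `ClientLogger` these would be private methods or imported helpers. Coarse buckets are usually enough for grouping queries by device class.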
import React from 'react';
import { clientLogger } from './observability/clientLogger';
class ErrorBoundary extends React.Component<
{ children: React.ReactNode },
{ hasError: boolean }
> {
constructor(props: any) {
super(props);
this.state = { hasError: false };
}
static getDerivedStateFromError() {
return { hasError: true };
}
componentDidCatch(error: Error, errorInfo: React.ErrorInfo) {
clientLogger.logError({
type: 'ReactError',
message: error.message,
stack: error.stack,
componentStack: errorInfo.componentStack ?? undefined, // May be null in newer React typings
});
}
render() {
if (this.state.hasError) {
return <div>Something went wrong. Please refresh the page.</div>;
}
return this.props.children;
}
}
function CheckoutButton({ cart }: { cart: Cart }) {
const handleCheckout = async () => {
const start = Date.now();
// Declared outside try so the error event can carry it too
let requestId: string | undefined;
try {
const response = await fetch('/api/checkout', {
method: 'POST',
headers: { 'Content-Type': 'application/json' },
body: JSON.stringify(cart),
});
requestId = response.headers.get('x-request-id') ?? undefined;
if (!response.ok) {
throw new Error('Checkout failed');
}
clientLogger.logEvent({
action: 'checkout_submit',
outcome: 'success',
duration_ms: Date.now() - start,
request_id: requestId,
cart: {
total_cents: cart.totalCents,
item_count: cart.items.length,
},
});
} catch (error: any) {
clientLogger.logEvent({
action: 'checkout_submit',
outcome: 'error',
duration_ms: Date.now() - start,
request_id: requestId, // Lets the failure be joined to the backend event
error: {
type: error.name,
message: error.message,
},
});
}
};
return <button onClick={handleCheckout}>Checkout</button>;
}
These demonstrate the shift from grep → analytics on structured events.
-- Illustrative SQL; adapt the syntax to your backend (CloudWatch Logs Insights, Datadog, Elastic)
SELECT
error.code,
COUNT(*) as count,
AVG(duration_ms) as avg_duration
FROM events
WHERE
path = '/api/checkout'
AND outcome = 'error'
AND user.subscription = 'premium'
AND feature_flags.new_checkout_flow = true
AND timestamp > NOW() - INTERVAL '1 hour'
GROUP BY error.code
ORDER BY count DESC
Before wide events: Multiple searches across logs, manual correlation, no way to filter by feature flag.
After wide events: Single query with all context.
SELECT
payment.provider,
region,
PERCENTILE(payment.latency_ms, 95) as p95,
PERCENTILE(payment.latency_ms, 99) as p99
FROM events
WHERE
path = '/api/checkout'
AND payment.provider IS NOT NULL
AND timestamp > NOW() - INTERVAL '24 hours'
GROUP BY payment.provider, region
Insight: Identify if Stripe is slower in eu-west-1 than us-east-1.
SELECT
feature_flags.new_checkout_flow as has_flag,
AVG(duration_ms) as avg_duration,
AVG(CASE WHEN error IS NOT NULL THEN 1.0 ELSE 0.0 END) as error_rate -- AVG over 1.0/0.0 avoids integer division
FROM events
WHERE
path = '/api/checkout'
AND timestamp > NOW() - INTERVAL '1 hour'
GROUP BY has_flag
Insight: Compare error rate and latency between flag enabled vs disabled cohorts.
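One caveat when computing rates on tail-sampled data: the `shouldSample` function keeps every error and every flagged request but only about 5% of the remaining traffic, so a raw error ratio overstates the error rate, and more so for the unflagged cohort. A common correction is to weight each event by the inverse of the probability with which it was kept. The sketch below assumes a hypothetical `sample_rate` field written at emit time (not part of the schema in this document) and uses made-up data:

```typescript
// Assumption: each stored event records the probability with which it was
// kept (1 for always-kept events, 0.05 for the random baseline).
interface StoredEvent {
  has_error: boolean;
  sample_rate: number; // in (0, 1]
}

// Each kept event stands in for 1 / sample_rate original requests.
function weightedErrorRate(events: StoredEvent[]): number {
  let total = 0;
  let errors = 0;
  for (const e of events) {
    const weight = 1 / e.sample_rate;
    total += weight;
    if (e.has_error) errors += weight;
  }
  return total === 0 ? 0 : errors / total;
}

// Illustrative data: 10 errors kept at 100%, 5 successes kept at 5%.
const kept: StoredEvent[] = [
  ...Array.from({ length: 10 }, () => ({ has_error: true, sample_rate: 1 })),
  ...Array.from({ length: 5 }, () => ({ has_error: false, sample_rate: 0.05 })),
];

const naiveRate = kept.filter((e) => e.has_error).length / kept.length; // 10 / 15
const correctedRate = weightedErrorRate(kept); // 10 / (10 + 5 * 20)
```

In a query language the same idea becomes `SUM(error_weight) / SUM(weight)` with `weight = 1 / sample_rate`. The naive ratio here is about 67%, while the weighted estimate is about 9%.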
SELECT *
FROM events
WHERE
request_id = 'req_abc123'
ORDER BY timestamp
Insight: See complete request journey across all services with one query using request_id.
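That query only works if every service in the call chain sees the same `request_id`. The middleware reuses an incoming `x-request-id`, so the remaining piece is sending that header on outgoing calls. This is a minimal sketch in which `withRequestId`, `buildInventoryRequest`, and the inventory URL are placeholders:

```typescript
const REQUEST_ID_HEADER = 'x-request-id';

// Returns a copy of `headers` that carries the current request id.
function withRequestId(
  requestId: string,
  headers: Record<string, string> = {}
): Record<string, string> {
  return { ...headers, [REQUEST_ID_HEADER]: requestId };
}

// Hypothetical downstream call from checkout-api to an inventory service.
function buildInventoryRequest(requestId: string, sku: string) {
  return {
    url: `http://inventory.internal/reserve/${encodeURIComponent(sku)}`, // placeholder URL
    init: {
      method: 'POST',
      headers: withRequestId(requestId, { 'Content-Type': 'application/json' }),
    },
  };
}

const outgoing = buildInventoryRequest('req_abc123', 'sku 42');
// fetch(outgoing.url, outgoing.init) would then carry the same request id.
```

The downstream service's own middleware picks the id up from the header, so both wide events share one correlation key.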
Track these to measure improvement:
Response: Structured logging (JSON) is necessary but not sufficient. You must also:
Response: OTel is a delivery mechanism. It doesn't decide:
You must instrument business context explicitly.
Response: Context must be captured at the moment it's available:
Adding context retroactively is impossible.
Response: Tail sampling keeps 100% of errors, slow requests, high-value users, and feature-flagged traffic.
You only sample the noise (successful fast requests from regular users).
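The stored volume can be estimated with simple arithmetic: the always-kept share is stored in full, and the remainder is stored at the baseline rate. A small sketch, with a hypothetical `expectedKeptFraction` helper and illustrative numbers rather than measurements:

```typescript
// keptFraction = alwaysKeep + (1 - alwaysKeep) * baselineRate
function expectedKeptFraction(alwaysKeepFraction: number, baselineRate: number): number {
  for (const v of [alwaysKeepFraction, baselineRate]) {
    if (v < 0 || v > 1) throw new RangeError('fractions must be between 0 and 1');
  }
  return alwaysKeepFraction + (1 - alwaysKeepFraction) * baselineRate;
}

// Illustrative: if 2% of requests match an always-keep rule and the rest are
// sampled at 5%, roughly 6.9% of events are stored.
const keptFraction = expectedKeptFraction(0.02, 0.05);
```

The always-keep share dominates as it grows, so broad rules (for example, keeping all traffic for a widely enabled flag) are worth revisiting once a rollout completes.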
Do not complete until: