name	ML Workflow
department	alchemist
description	Design ML workflows — experiment tracking, feature stores, model training, serving, monitoring for drift
version	1
triggers	["ML","machine learning","model","training","feature store","MLflow","experiment","drift","serving","inference","W&B","Weights & Biases","model deployment"]

ML Workflow

Purpose

Design end-to-end ML workflows covering experiment tracking, feature engineering and storage, model training pipelines, model serving and deployment, A/B testing for models, and monitoring for data and model drift. Produces a workflow architecture, tool selection rationale, and operational runbook.

Inputs

ML problem type (classification, regression, ranking, recommendation, NLP, CV)
Data sources and feature candidates
Model complexity range (linear/tree-based vs deep learning)
Serving requirements (batch predictions, real-time inference, edge deployment)
Team size and ML maturity (first model vs established ML platform)
Infrastructure constraints (cloud provider, GPU availability, budget)

Process

Step 1: Define the ML Problem Clearly

Before any tooling decisions, formalize:

What is the prediction target? What does "correct" look like?
What is the business metric this model optimizes? (Not just accuracy — revenue, conversion, engagement)
What is the baseline? (Rule-based heuristic, current model, random chance)
What is the minimum viable performance to ship?

Document the problem statement, target variable, evaluation metric, and success threshold.
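
The documented problem statement can also be kept as a small, checkable record. The following is a minimal sketch, assuming a hypothetical `ProblemSpec` class; the field names, the example target, and all scores are illustrative and do not come from any real project.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ProblemSpec:
    """Hypothetical record of the Step 1 decisions."""
    target: str               # what the model predicts
    business_metric: str      # what the business actually cares about
    evaluation_metric: str    # offline metric used during development
    baseline_score: float     # score of the heuristic or current model
    success_threshold: float  # minimum score required to ship
    higher_is_better: bool = True

    def clears_bar(self, candidate_score: float) -> bool:
        """True only if the candidate beats the baseline and meets the threshold."""
        if self.higher_is_better:
            return (candidate_score > self.baseline_score
                    and candidate_score >= self.success_threshold)
        return (candidate_score < self.baseline_score
                and candidate_score <= self.success_threshold)


# Illustrative values only.
spec = ProblemSpec(
    target="churned_within_30d",
    business_metric="retained revenue",
    evaluation_metric="PR-AUC",
    baseline_score=0.31,
    success_threshold=0.40,
)
print(spec.clears_bar(0.37))  # beats the baseline but misses the threshold
print(spec.clears_bar(0.42))  # beats the baseline and meets the threshold
```

Keeping the baseline and the threshold in one place makes it harder to ship a model that merely beats a weak baseline.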

Step 2: Design the Feature Engineering Pipeline

Map raw data to model-ready features:

Feature identification: Which raw fields become features? What transformations are needed (encoding, scaling, windowing, embedding)?
Temporal features: Aggregations over time windows (last 7 days, last 30 days). Guard against leakage: compute each feature only from data that was available at prediction time, never from events that occurred after it.
Feature store evaluation: Does this project warrant a feature store (Feast, Tecton, Hopsworks)? Feature stores add value when: features are shared across models, real-time features are needed, or training-serving skew is a risk.
Feature documentation: Each feature should have: name, description, data type, source, transformation logic, and expected distribution.
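
The leakage guard for temporal features can be sketched as a point-in-time aggregation. This is one possible approach, assuming events arrive as `(timestamp, amount)` pairs; `trailing_sum`, the feature names, and the sample events are hypothetical.

```python
from datetime import datetime, timedelta


def trailing_sum(events, as_of, window_days):
    """Sum event amounts in [as_of - window_days, as_of).

    The upper bound is exclusive, so nothing recorded at or after the
    prediction time `as_of` can leak into the feature.
    """
    start = as_of - timedelta(days=window_days)
    return sum(amount for ts, amount in events if start <= ts < as_of)


# Illustrative events, not real data.
events = [
    (datetime(2024, 1, 1), 10.0),
    (datetime(2024, 1, 20), 5.0),
    (datetime(2024, 1, 27), 7.0),
    (datetime(2024, 2, 2), 100.0),  # after the prediction time: must be ignored
]
as_of = datetime(2024, 1, 30)

features = {
    "spend_last_7d": trailing_sum(events, as_of, 7),
    "spend_last_30d": trailing_sum(events, as_of, 30),
}
print(features)
```

Using the same function for training and serving, with `as_of` set to the label time in training and to the request time in serving, is one way to reduce training-serving skew.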

Step 3: Design Experiment Tracking

Set up reproducible experiment management:

Tool selection: MLflow (open-source, commonly self-hosted), Weights & Biases (commonly used as a managed service, rich visualization), Neptune, or ClearML.
What to track: Hyperparameters, metrics (train/val/test), dataset version, code version (git SHA), environment (dependencies), artifacts (model files, plots).
Experiment organization: Project → Experiment group → Individual runs. Name runs meaningfully (not "run_42").
Comparison workflow: How does the team compare runs? Dashboard? Automated reports?
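
Whatever tool is chosen, the tracked fields listed above can be assembled into one record per run. The sketch below uses only the Python standard library and is not the API of MLflow, Weights & Biases, or any other tracker; `build_run_record`, `dataset_fingerprint`, and the example values are assumptions made for illustration.

```python
import hashlib
import json
import platform
import subprocess
import sys
from datetime import datetime, timezone


def git_sha():
    """Return the current commit SHA, or "unknown" outside a git checkout."""
    try:
        out = subprocess.run(
            ["git", "rev-parse", "HEAD"],
            capture_output=True, text=True, check=True,
        )
        return out.stdout.strip()
    except (OSError, subprocess.CalledProcessError):
        return "unknown"


def dataset_fingerprint(rows):
    """Hash the training rows so a run can be tied to an exact dataset version."""
    payload = json.dumps(rows, sort_keys=True).encode("utf-8")
    return hashlib.sha256(payload).hexdigest()[:12]


def build_run_record(name, params, metrics, rows):
    """Collect hyperparameters, metrics, and version info for one run."""
    return {
        "run_name": name,
        "started_at": datetime.now(timezone.utc).isoformat(),
        "params": params,
        "metrics": metrics,
        "dataset_version": dataset_fingerprint(rows),
        "code_version": git_sha(),
        "environment": {
            "python": sys.version.split()[0],
            "platform": platform.platform(),
        },
    }


sample_rows = [{"x": 1, "y": 0}, {"x": 2, "y": 1}]  # stand-in for a real dataset
record = build_run_record(
    name="gbm-depth6-lr0.05",                  # descriptive, unlike "run_42"
    params={"max_depth": 6, "learning_rate": 0.05},
    metrics={"val_auc": 0.0},                  # placeholder, not a real result
    rows=sample_rows,
)
print(json.dumps(record, indent=2))
```

A record like this maps directly onto the parameter, metric, and tag concepts that most tracking tools expose, so it can serve as a checklist when configuring the chosen tool.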

Step 4: Design the Training Pipeline

Build a reproducible, automated training workflow:

Data split strategy: Time-based splits for temporal data, stratified splits for imbalanced classes. Never random-split time-series data.
Training orchestration: Single script, or DAG-based (Airflow, Kubeflow Pipelines, SageMaker Pipelines)?
Hyperparameter tuning: Grid search, random search, Bayesian optimization (Optuna, Ray Tune)?
Validation strategy: Cross-validation, holdout, or time-series walk-forward?
Model registry: Where are trained models stored? How are they versioned? Who approves promotion to production?
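
For temporal data, the time-based split and walk-forward validation mentioned above can be sketched as follows. This assumes rows are already sorted by time and uses an expanding training window; `walk_forward_splits` and its parameters are hypothetical names.

```python
def walk_forward_splits(n_samples, n_folds, min_train):
    """Yield (train_indices, test_indices) for time-ordered data.

    Each fold trains on everything before its test block, so the model is
    never evaluated on data older than what it was trained on.
    """
    if min_train >= n_samples:
        raise ValueError("min_train must be smaller than n_samples")
    test_size = (n_samples - min_train) // n_folds
    if test_size == 0:
        raise ValueError("not enough samples for the requested folds")
    for fold in range(n_folds):
        train_end = min_train + fold * test_size
        test_end = train_end + test_size
        yield list(range(train_end)), list(range(train_end, test_end))


splits = list(walk_forward_splits(n_samples=10, n_folds=3, min_train=4))
for train_idx, test_idx in splits:
    print(f"train 0..{train_idx[-1]}  test {test_idx[0]}..{test_idx[-1]}")
```

An expanding window is only one option; a fixed-length sliding window may suit problems where old data stops being representative.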

Step 5: Design Model Serving

Plan how predictions reach users:

Batch serving: Run predictions on a schedule, store results in a table. Best for recommendations, risk scores, daily reports.
Real-time serving: Model behind an API endpoint. Best for search ranking, fraud detection, dynamic pricing.
Streaming serving: Model embedded in a stream processor. Best for event-driven predictions on Kafka/Kinesis streams.
Edge serving: Model deployed to device/browser. Best for latency-critical or offline-capable applications.

For real-time serving, specify: latency SLA (p50/p99), throughput (requests/second), scaling strategy (auto-scale triggers), and fallback behavior (what happens if the model is unavailable?).
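
The fallback behavior can be sketched as a wrapper that enforces a latency budget and falls back to a rule-based heuristic when the model fails or is too slow. This is a minimal illustration, not a production serving layer; `predict_with_fallback`, the stand-in model functions, and the rule are all hypothetical.

```python
import time
from concurrent.futures import ThreadPoolExecutor
from concurrent.futures import TimeoutError as FutureTimeout

_pool = ThreadPoolExecutor(max_workers=4)


def predict_with_fallback(model_fn, features, fallback_fn, timeout_s):
    """Return (prediction, source), using the fallback on error or timeout."""
    future = _pool.submit(model_fn, features)
    try:
        return future.result(timeout=timeout_s), "model"
    except FutureTimeout:
        future.cancel()  # best effort; a call that already started keeps running
        return fallback_fn(features), "fallback:timeout"
    except Exception:
        return fallback_fn(features), "fallback:error"


# Stand-ins for a real model client.
def healthy_model(features):
    return 0.8


def broken_model(features):
    raise RuntimeError("model server unavailable")


def slow_model(features):
    time.sleep(0.5)
    return 0.9


def rule_based_score(features):
    """Hypothetical heuristic used when the model cannot answer in time."""
    return 1.0 if features["amount"] > 1000 else 0.0


request = {"amount": 50}
ok = predict_with_fallback(healthy_model, request, rule_based_score, timeout_s=0.2)
failed = predict_with_fallback(broken_model, request, rule_based_score, timeout_s=0.2)
timed_out = predict_with_fallback(slow_model, request, rule_based_score, timeout_s=0.05)
print(ok, failed, timed_out)
```

Returning the source alongside the prediction lets operational monitoring count how often the fallback path is taken.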

Step 6: Design A/B Testing for Models

Plan controlled rollout of model changes:

Traffic splitting: How is traffic divided between control (current model) and treatment (new model)?
Metric selection: Primary metric (business KPI), guardrail metrics (latency, error rate), and minimum detectable effect.
Duration calculation: How long must the test run to collect the sample size needed to detect the minimum detectable effect at the chosen significance level and power?
Rollback criteria: What triggers an automatic rollback?
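
Traffic splitting and duration can be estimated with a short calculation. The sketch below assumes a conversion-style primary metric, a two-sided two-proportion z-test, and hash-based user bucketing; the function names, the default significance level and power, and the example rates and traffic are assumptions to replace with project-specific values.

```python
import hashlib
import math
from statistics import NormalDist


def assign_variant(user_id, experiment, treatment_share=0.5):
    """Deterministically bucket a user so they always see the same model."""
    digest = hashlib.sha256(f"{experiment}:{user_id}".encode()).hexdigest()
    bucket = int(digest[:8], 16) / 0x100000000  # uniform in [0, 1)
    return "treatment" if bucket < treatment_share else "control"


def samples_per_arm(baseline_rate, min_detectable_effect, alpha=0.05, power=0.8):
    """Approximate per-arm sample size for a two-sided two-proportion z-test."""
    p1 = baseline_rate
    p2 = baseline_rate + min_detectable_effect
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)
    z_beta = NormalDist().inv_cdf(power)
    variance = p1 * (1 - p1) + p2 * (1 - p2)
    return math.ceil((z_alpha + z_beta) ** 2 * variance / (p2 - p1) ** 2)


def min_duration_days(n_per_arm, daily_users, treatment_share=0.5):
    """Days until the smaller arm has collected n_per_arm users."""
    smaller_share = min(treatment_share, 1 - treatment_share)
    return math.ceil(n_per_arm / (daily_users * smaller_share))


# Illustrative inputs: 10% baseline conversion, 1 point absolute lift, 2000 users/day.
n = samples_per_arm(baseline_rate=0.10, min_detectable_effect=0.01)
days = min_duration_days(n, daily_users=2000)
print(n, days, assign_variant("user-123", "ranker-v2"))
```

Halving the minimum detectable effect roughly quadruples the required sample size, which is why the effect size should be agreed before the test starts rather than after.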

Step 7: Design Monitoring and Drift Detection

Plan ongoing model health monitoring:

Data drift: Monitor input feature distributions for shifts. Tool options: Evidently, WhyLabs, or Great Expectations (data validation checks on incoming batches).
Model drift: Monitor prediction distribution and performance metrics over time. Alert when performance degrades below threshold.
Concept drift: Monitor the relationship between features and target. Trigger retraining when that relationship changes (seasonality, market shifts).
Operational monitoring: Latency, error rates, throughput, GPU utilization for serving infrastructure.

Define retraining policy: scheduled (weekly/monthly), triggered (drift detected), or continuous (online learning).
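
A data drift check that feeds a triggered retraining policy can be sketched with the Population Stability Index (PSI), one of several possible drift statistics. The code below is a simplified standalone version, not how Evidently or WhyLabs compute drift; the 0.1 and 0.25 cut-offs are a commonly quoted rule of thumb used here as assumptions, and `drift_action` is a hypothetical helper.

```python
import math


def psi(reference, current, n_bins=10, eps=1e-6):
    """Population Stability Index between two samples of one numeric feature."""
    lo, hi = min(reference), max(reference)
    if hi == lo:
        raise ValueError("reference sample has no spread")
    width = (hi - lo) / n_bins

    def bin_shares(values):
        counts = [0] * n_bins
        for v in values:
            idx = int((v - lo) / width)
            counts[min(max(idx, 0), n_bins - 1)] += 1  # clamp out-of-range values
        # eps keeps empty bins from producing log(0) or division by zero
        return [max(c / len(values), eps) for c in counts]

    ref, cur = bin_shares(reference), bin_shares(current)
    return sum((c - r) * math.log(c / r) for r, c in zip(ref, cur))


def drift_action(score, warn=0.1, alert=0.25):
    """Map a PSI score to an action; thresholds are assumptions to tune."""
    if score >= alert:
        return "trigger retraining"
    if score >= warn:
        return "investigate"
    return "no action"


# Synthetic samples for illustration only.
training_sample = [i / 100 for i in range(100)]
same_distribution = list(training_sample)
shifted_sample = [0.5 + i / 200 for i in range(100)]

stable_score = psi(training_sample, same_distribution)
shifted_score = psi(training_sample, shifted_sample)
print(drift_action(stable_score), drift_action(shifted_score))
```

Bin edges come from the reference sample only, so the same edges can be stored at training time and reused for every monitoring window.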

Output Format

# ML Workflow: [Project/Model Name]

## Problem Definition

| Aspect | Detail |
|--------|--------|
| Problem type | ... |
| Target variable | ... |
| Business metric | ... |
| Evaluation metric | ... |
| Baseline performance | ... |
| Success threshold | ... |

## Feature Engineering

| Feature | Source | Transformation | Type | Leakage Risk |
|---------|--------|---------------|------|-------------|
| ...     | ...    | ...           | ...  | Low/Med/High |

**Feature store:** [Yes/No — tool choice and rationale]

## Experiment Tracking

| Aspect | Choice | Rationale |
|--------|--------|-----------|
| Tool | ... | ... |
| What's tracked | ... | ... |
| Organization | ... | ... |

## Training Pipeline

[ASCII diagram showing data → features → train → evaluate → register]


| Stage | Tool/Method | Notes |
|-------|------------|-------|
| Data split | ... | ... |
| Training | ... | ... |
| Tuning | ... | ... |
| Validation | ... | ... |
| Registry | ... | ... |

## Model Serving

| Aspect | Detail |
|--------|--------|
| Serving mode | Batch / Real-time / Streaming / Edge |
| Latency SLA | ... |
| Throughput | ... |
| Scaling | ... |
| Fallback | ... |

## A/B Testing

| Aspect | Detail |
|--------|--------|
| Traffic split | ... |
| Primary metric | ... |
| Guardrail metrics | ... |
| Min duration | ... |
| Rollback criteria | ... |

## Monitoring and Drift

| Monitor | Tool | Threshold | Action |
|---------|------|-----------|--------|
| Data drift | ... | ... | ... |
| Model drift | ... | ... | ... |
| Concept drift | ... | ... | ... |
| Operational | ... | ... | ... |

**Retraining policy:** [Scheduled / Triggered / Continuous — details]

Quality Checks

Evolution Notes
