name	ai-ml-engineering
description	Build ML systems with disciplined training, evaluation, deployment, and safety practices
difficulty	staff
domains	["ai-ml"]

Overview

ML engineering failures are silent and delayed. A model that scores well on the benchmark can fail badly in production. This skill enforces the practices that catch these failures before they reach users: proper evaluation harnesses, data leakage detection, distribution shift monitoring, and safety checks.

When to Use

Process

Step 1: Define the task and success metric precisely

Before any code: what is the exact prediction task? What metric proves the model is good enough? What metric proves it is safe enough? Document these as your evaluation contract.
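One way to make the evaluation contract concrete is to write it as a small, immutable data structure that later steps can check against. The sketch below is only an illustration: the class names, the metric names (`recall_urgent`, `false_urgent_rate`), and the threshold values are hypothetical and do not come from this skill.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class MetricRequirement:
    """One line of the contract: a metric and the bound it must satisfy."""
    name: str
    threshold: float
    higher_is_better: bool = True

    def is_met(self, value: float) -> bool:
        if self.higher_is_better:
            return value >= self.threshold
        return value <= self.threshold


@dataclass(frozen=True)
class EvaluationContract:
    """Written down before any modeling code, then treated as fixed."""
    task: str
    quality: MetricRequirement  # what proves the model is good enough
    safety: MetricRequirement   # what proves the model is safe enough

    def unmet(self, measured: dict[str, float]) -> list[str]:
        """Return the names of requirements that are missing or not met."""
        failures = []
        for req in (self.quality, self.safety):
            value = measured.get(req.name)
            if value is None or not req.is_met(value):
                failures.append(req.name)
        return failures


# Hypothetical example values, not recommendations.
contract = EvaluationContract(
    task="Classify support tickets as urgent or not urgent",
    quality=MetricRequirement("recall_urgent", 0.90),
    safety=MetricRequirement("false_urgent_rate", 0.05, higher_is_better=False),
)
```

Freezing the dataclasses is a deliberate choice here: it makes it harder to quietly relax a threshold after the results are in.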

Step 2: Establish the baseline

Compute a simple baseline (majority class, rule-based system, GPT-4 zero-shot). Your model must beat this baseline by a meaningful margin to justify the complexity.
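As a minimal sketch of the simplest baseline named above (majority class), assuming a classification task scored by accuracy: the function names and the `min_margin` parameter are illustrative, and what counts as a "meaningful margin" is left to the evaluation contract.

```python
from collections import Counter


def majority_class_baseline(train_labels, test_labels):
    """Accuracy of always predicting the most frequent training label."""
    if not train_labels or not test_labels:
        raise ValueError("both label lists must be non-empty")
    # The majority label is chosen from training data only, so the
    # baseline never sees the test labels it is scored on.
    majority_label, _ = Counter(train_labels).most_common(1)[0]
    correct = sum(1 for label in test_labels if label == majority_label)
    return majority_label, correct / len(test_labels)


def beats_baseline(model_score, baseline_score, min_margin):
    """True only if the model clears the baseline by the agreed margin."""
    return model_score - baseline_score >= min_margin


# Hypothetical labels for illustration.
train = ["neg"] * 8 + ["pos"] * 2
test = ["neg"] * 3 + ["pos"] * 1
label, baseline_accuracy = majority_class_baseline(train, test)
```

On imbalanced data this baseline can look strong on accuracy alone, which is one reason to fix the metric in Step 1 before comparing anything.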

Step 3: Audit the training data

Step 4: Implement a reproducible training pipeline

Step 5: Build the evaluation harness before training

Step 6: Train with monitoring

Track: training loss, validation loss, gradient norms. Flag: loss spikes, NaN gradients, overfitting (train loss << val loss), underfitting (train and val loss both stay high).
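The flags above can be sketched as a small framework-agnostic monitor that the training loop calls with plain floats. This is an illustration under stated assumptions: the class name, the window size, and the `spike_factor` and `overfit_ratio` defaults are hypothetical, losses are assumed to be positive, and underfitting is not detected here because it needs a task-specific target loss.

```python
import math
from collections import deque


class TrainingMonitor:
    """Flags non-finite values, loss spikes, and a widening train/val gap."""

    def __init__(self, window=20, spike_factor=2.0, overfit_ratio=1.5):
        self.recent_losses = deque(maxlen=window)
        self.spike_factor = spike_factor
        self.overfit_ratio = overfit_ratio

    def check_step(self, train_loss, grad_norm):
        """Call once per optimizer step; returns a list of flag names."""
        if not math.isfinite(train_loss) or not math.isfinite(grad_norm):
            # Return early so NaN/inf never enters the running window.
            return ["non_finite"]
        flags = []
        if self.recent_losses:
            mean_recent = sum(self.recent_losses) / len(self.recent_losses)
            if train_loss > self.spike_factor * mean_recent:
                flags.append("loss_spike")
        self.recent_losses.append(train_loss)
        return flags

    def check_epoch(self, train_loss, val_loss):
        """Call once per validation pass; returns a list of flag names."""
        if val_loss > self.overfit_ratio * train_loss:
            return ["possible_overfitting"]
        return []


monitor = TrainingMonitor(window=3)
step_flags = [
    monitor.check_step(1.0, 0.5),
    monitor.check_step(0.9, 0.5),
    monitor.check_step(5.0, 0.5),           # far above the recent mean
    monitor.check_step(float("nan"), 0.5),  # non-finite loss
]
```

The gradient norm would come from whatever the training framework reports; the monitor only needs the resulting number.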

Step 7: Run the full evaluation suite

Compare against: baseline, previous model version, human performance (if applicable). Document the result on every dimension. Declare the threshold each dimension must meet for deployment.
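A minimal sketch of such a comparison, assuming each model's results are available as a plain dictionary of metric values. The function name, metric names, and numbers are hypothetical; a human-performance entry could be added to `references` in the same way.

```python
def deployment_gate(candidate, references, thresholds, higher_is_better):
    """Return a list of failure messages; an empty list means the gate passes."""
    failures = []
    for metric, threshold in thresholds.items():
        value = candidate.get(metric)
        if value is None:
            failures.append(f"{metric}: not measured")
            continue
        better_when_higher = higher_is_better[metric]
        meets = value >= threshold if better_when_higher else value <= threshold
        if not meets:
            failures.append(f"{metric}: {value} misses threshold {threshold}")
        # Also check for regressions against every reference that has this metric.
        for ref_name, ref_metrics in references.items():
            ref_value = ref_metrics.get(metric)
            if ref_value is None:
                continue
            regressed = value < ref_value if better_when_higher else value > ref_value
            if regressed:
                failures.append(f"{metric}: {value} is worse than {ref_name} ({ref_value})")
    return failures


# Hypothetical numbers for illustration.
failures = deployment_gate(
    candidate={"accuracy": 0.91, "p95_latency_ms": 140.0},
    references={
        "baseline": {"accuracy": 0.75},
        "previous_model": {"accuracy": 0.89, "p95_latency_ms": 120.0},
    },
    thresholds={"accuracy": 0.90, "p95_latency_ms": 150.0},
    higher_is_better={"accuracy": True, "p95_latency_ms": False},
)
```

In this example the candidate clears both thresholds but is still flagged, because its latency is worse than the previous model's. Whether a regression like that blocks deployment is a policy decision this sketch does not make.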

Step 8: Safety evaluation

Step 9: Production readiness

Step 10: Staged rollout

Deploy to 1% of traffic. Monitor key metrics for 24 hours. If they hold, roll out to 10%, then 100%. Have a rollback procedure ready before the first stage.
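One common way to implement percentage-based rollout is deterministic hash bucketing, so that a given user sees the same model at every stage. The sketch below assumes that approach; the function names and the `salt` value are hypothetical, and the health signal is assumed to come from the monitoring described above.

```python
import hashlib

ROLLOUT_STAGES = (1, 10, 100)  # percent of traffic per stage


def bucket_for(user_id: str, salt: str = "model-rollout") -> int:
    """Map a user ID to a stable bucket in the range 0..99."""
    digest = hashlib.sha256(f"{salt}:{user_id}".encode("utf-8")).hexdigest()
    return int(digest, 16) % 100


def serves_new_model(user_id: str, rollout_percent: int) -> bool:
    """Users in buckets below the rollout percentage get the new model."""
    return bucket_for(user_id) < rollout_percent


def next_stage(current_percent: int, metrics_healthy: bool) -> int:
    """Advance one stage if metrics are healthy, otherwise roll back to 0%."""
    if not metrics_healthy:
        return 0
    for stage in ROLLOUT_STAGES:
        if stage > current_percent:
            return stage
    return current_percent  # already at full rollout
```

Because buckets are stable, every user included at 1% is still included at 10%, which keeps the monitored population consistent as the rollout widens.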

Anti-Rationalizations

"The eval numbers look good": Eval numbers on a curated test set are necessary but not sufficient. Production distribution ≠ test distribution.

"We'll add safety checks after launch": Safety issues discovered after launch are incidents. Safety checks added before launch are requirements.

"The model improved so we should ship it": Improved on which metric? Under which conditions? Improvements in accuracy can come with regressions in latency, fairness, or safety.

Red Flags

Verification Requirements
