원클릭으로
mle-workflow
// Production machine-learning engineering workflow for data contracts, reproducible training, model evaluation, deployment, monitoring, and rollback. Use when building, reviewing, or hardening ML systems beyond one-off notebooks.
// Production machine-learning engineering workflow for data contracts, reproducible training, model evaluation, deployment, monitoring, and rollback. Use when building, reviewing, or hardening ML systems beyond one-off notebooks.
[HINT] SKILL.md 및 모든 관련 파일을 포함한 전체 스킬 디렉토리를 다운로드합니다
| name | mle-workflow |
| description | Production machine-learning engineering workflow for data contracts, reproducible training, model evaluation, deployment, monitoring, and rollback. Use when building, reviewing, or hardening ML systems beyond one-off notebooks. |
| allowed-tools | Read, Write, Edit, Bash, Grep, Glob |
Use this skill to turn model work into a production ML system with clear data contracts, repeatable training, measurable quality gates, deployable artifacts, and operational monitoring.
Use only the lanes that fit the system in front of you. This skill is useful for ranking, search, recommendations, classifiers, forecasting, embeddings, LLM workflows, anomaly detection, and batch analytics, but it should not force one architecture onto all of them.
python-patterns and python-testing for Python implementation and pytest coveragepytorch-patterns for deep learning models, data loaders, device handling, and training loopseval-harness and ai-regression-testing for promotion gates and agent-assisted regression checksdatabase-migrations, postgres-patterns, and clickhouse-io for data storage and analytics surfacesdeployment-patterns, docker-patterns, and security-review for serving, secrets, containers, and production hardeningDo not treat MLE as separate from software engineering. Most ECC SWE workflows apply directly to ML systems, often with stricter failure modes:
The recommended minimal --with capability:machine-learning install keeps the core agent surface available alongside this skill. For skill-only or agent-limited harnesses, pair skill:mle-workflow with agent:mle-reviewer where the target supports agents.
| SWE surface | MLE use |
|---|---|
product-capability / architecture-decision-records | Turn model work into explicit product contracts and record irreversible data, model, and rollout choices |
repo-scan / codebase-onboarding / code-tour | Find existing training, feature, serving, eval, and monitoring paths before introducing a parallel ML stack |
plan / feature-dev | Scope model changes as product capabilities with data, eval, serving, and rollback phases |
tdd-workflow / python-testing | Test feature transforms, split logic, metric calculations, artifact loading, and inference schemas before implementation |
code-reviewer / mle-reviewer | Review code quality plus ML-specific leakage, reproducibility, promotion, and monitoring risks |
build-fix / pr-test-analyzer | Diagnose broken CI, flaky evals, missing fixtures, and environment-specific model or dependency failures |
quality-gate / test-coverage | Require automated evidence for transforms, metrics, inference contracts, promotion gates, and rollback behavior |
eval-harness / verification-loop | Turn offline metrics, slice checks, latency budgets, and rollback drills into repeatable gates |
ai-regression-testing | Preserve every production bug as a regression: missing feature, stale label, bad artifact, schema drift, or serving mismatch |
api-design / backend-patterns | Design prediction APIs, batch jobs, idempotent retraining endpoints, and response envelopes |
database-migrations / postgres-patterns / clickhouse-io | Version labels, feature snapshots, prediction logs, experiment metrics, and drift analytics |
deployment-patterns / docker-patterns | Package reproducible training and serving images with health checks, resource limits, and rollback |
canary-watch / dashboard-builder | Make rollout health visible with model-version, slice, drift, latency, cost, and delayed-label dashboards |
security-review / security-scan | Check model artifacts, notebooks, prompts, datasets, and logs for secrets, PII, unsafe deserialization, and supply-chain risk |
e2e-testing / browser-qa / accessibility | Test critical product flows that consume predictions, including explainability and fallback UI states |
benchmark / performance-optimizer | Measure throughput, p95 latency, memory, GPU utilization, and cost per prediction or retrain |
cost-aware-llm-pipeline / token-budget-advisor | Route LLM/embedding workloads by quality, latency, and budget instead of defaulting to the largest model |
documentation-lookup / search-first | Verify current library behavior for model serving, feature stores, vector DBs, and eval tooling before coding |
git-workflow / github-ops / opensource-pipeline | Package MLE changes for review with crisp scope, generated artifacts excluded, and reproducible test evidence |
strategic-compact / dmux-workflows | Split long ML work into parallel tracks: data contract, eval harness, serving path, monitoring, and docs |
Use these simulations as coverage checks when planning or reviewing MLE work. A strong MLE workflow should reduce each task to explicit contracts, reusable SWE surfaces, automated evidence, and a reviewable artifact.
| ID | Common MLE task | Streamlined ECC path | Required output | Pipeline lanes covered |
|---|---|---|---|---|
| MLE-01 | Frame an ambiguous prediction, ranking, recommender, classifier, embedding, or forecast capability | product-capability, plan, architecture-decision-records, mle-workflow | Iteration Compact naming who cares, decision owner, success metric, unacceptable mistakes, assumptions, constraints, and first experiment | product contract, stakeholder loss, risk, rollout |
| MLE-02 | Define metric goals, labels, data sources, and the mistake budget | repo-scan, database-reviewer, database-migrations, postgres-patterns, clickhouse-io | Data and metric contract with entity grain, label timing, label confidence, feature timing, point-in-time joins, split policy, and dataset snapshot | data contract, metric design, leakage, reproducibility |
| MLE-03 | Build a baseline model and scoring path before adding complexity | tdd-workflow, python-testing, python-patterns, code-reviewer | Baseline scorer with confusion matrix, calibration notes, latency/cost estimate, known weaknesses, and tests for score shape and determinism | baseline, scoring, testing, serving parity |
| MLE-04 | Generate features from hypotheses about what separates outcomes | python-patterns, pytorch-patterns, docker-patterns, deployment-patterns | Feature plan and transform module covering signal source, missing values, outliers, correlations, leakage checks, and train/serve equivalence | feature pipeline, leakage, training, artifacts |
| MLE-05 | Tune thresholds, configs, and model complexity under tradeoffs | eval-harness, ai-regression-testing, quality-gate, test-coverage | Threshold/config report comparing precision, recall, F1, AUC, calibration, group slices, latency, cost, complexity, and acceptable error classes | evaluation, threshold, promotion, regression |
| MLE-06 | Run error analysis and turn mistakes into the next experiment | eval-harness, ai-regression-testing, mle-reviewer, silent-failure-hunter | Error cluster report for false positives, false negatives, ambiguous labels, stale features, missing signals, and bug traces with lessons captured | error analysis, bug trace, iteration, regression |
| MLE-07 | Package a model artifact for batch or online inference | api-design, backend-patterns, security-review, security-scan | Versioned artifact bundle with preprocessing, config, dependency constraints, schema validation, safe loading, and PII-safe logs | artifact, security, inference contract |
| MLE-08 | Ship online serving or batch scoring with feedback capture | api-design, backend-patterns, e2e-testing, browser-qa, accessibility | Prediction endpoint or batch job with response envelope, timeout, batching, fallback, model version, confidence, feedback logging, and product-flow tests | serving, batch inference, fallback, user workflow |
| MLE-09 | Roll out a model with shadow traffic, canary, A/B test, or rollback | canary-watch, dashboard-builder, verification-loop, performance-optimizer | Rollout plan naming traffic split, dashboards, p95 latency, cost, quality guardrails, rollback artifact, and rollback trigger | deployment, canary, rollback |
| MLE-10 | Operate, debug, and refresh a production model after launch | silent-failure-hunter, dashboard-builder, mle-reviewer, doc-updater, github-ops | Observation ledger and refresh plan with drift checks, delayed-label health, alert owners, runbook updates, retrain criteria, and PR evidence | monitoring, incident response, retraining |
Before touching model code, compress the work into one reviewable artifact. This should be short enough to fit in a PR description and precise enough that another engineer can challenge the tradeoffs.
Goal:
Who cares:
Decision owner:
User or system action changed by the model:
Success metric:
Guardrail metrics:
Mistake budget:
Unacceptable mistakes:
Acceptable mistakes:
Assumptions:
Constraints:
Labels and data snapshot:
Baseline:
Candidate signals:
Threshold or config plan:
Eval slices:
Known risks:
Next experiment:
Rollback or fallback:
This compact is the MLE equivalent of a strong SWE design note. It keeps the team from optimizing a metric no one trusts, adding features that do not address the real error mode, or shipping complexity without a rollback.
Use this loop whenever the task is ambiguous, high-impact, or metric-heavy:
(probability, confidence) x (cost, severity, importance, impact).Choose metrics from failure costs, not habit:
Every metric choice should state which mistake it makes cheaper, which mistake it makes more likely, and who absorbs that cost.
Features should come from a theory of separation:
Do not add model complexity until error analysis shows that the baseline is failing for a reason additional signal or capacity can plausibly fix.
After each baseline, training run, threshold change, or config change:
The strongest MLE loop is not train -> metric -> ship. It is mistake -> cluster -> hypothesis -> experiment -> evidence -> simpler system.
Keep a compact decision and evidence trail beside the code, PR, experiment report, or runbook:
Iteration:
Change:
Why this mattered:
Metric movement:
Slice movement:
False positives:
False negatives:
Unexpected errors:
Decision:
Tradeoff accepted:
Lesson captured:
Regression added:
Debt created:
Next iteration:
Use the ledger to make model work cumulative. The goal is for each iteration to make the next decision easier, not merely to produce another artifact.
Capture the product-level contract before writing model code:
Do not accept "improve the model" as a requirement. Tie the model to an observable product behavior and a measurable acceptance gate.
Every ML task needs an explicit data contract:
Guard against leakage first. If a feature is not available at prediction time, or is joined using future information, remove it or move it to an analysis-only path.
Training code should be runnable by another engineer without hidden notebook state:
Prefer immutable values and pure transformation functions. Avoid mutating shared data frames or global config during feature generation.
import hashlib
from dataclasses import dataclass
from pathlib import Path
@dataclass(frozen=True)
class TrainingConfig:
dataset_uri: str
model_dir: Path
seed: int
learning_rate: float
batch_size: int
def artifact_name(config: TrainingConfig, code_sha: str) -> str:
config_key = f"{config.dataset_uri}:{config.seed}:{config.learning_rate}:{config.batch_size}"
config_hash = hashlib.sha256(config_key.encode("utf-8")).hexdigest()[:12]
return f"{code_sha[:12]}-{config_hash}"
Promotion criteria should be declared before training finishes:
PROMOTION_GATES = {
"auc": ("min", 0.82),
"calibration_error": ("max", 0.04),
"p95_latency_ms": ("max", 80),
}
def assert_promotion_ready(metrics: dict[str, float]) -> None:
missing = sorted(name for name in PROMOTION_GATES if name not in metrics)
if missing:
raise ValueError(f"Model promotion metrics missing required gates: {missing}")
failures = {
name: value
for name, (direction, threshold) in PROMOTION_GATES.items()
for value in [metrics[name]]
if (direction == "min" and value < threshold)
or (direction == "max" and value > threshold)
}
if failures:
raise ValueError(f"Model failed promotion gates: {failures}")
Use offline metrics as gates, not guarantees. When the model changes product behavior, plan shadow evaluation, canary rollout, or A/B testing before full rollout.
An ML artifact is production-ready only when the serving contract is testable:
Never let training-only feature code diverge from serving feature code without a test that proves equivalence.
Model monitoring needs both system and quality signals:
Every deployment should have a rollback plan that names the previous artifact, config, data dependency, and traffic-switch mechanism.
When using this skill, return concrete artifacts: data contract, promotion gates, pipeline steps, test plan, deployment plan, or review findings. Call out unknowns that block production readiness instead of filling them with assumptions.