ai-ml-engineering
Build ML systems with disciplined training, evaluation, deployment, and safety practices
| name | ai-ml-engineering |
| description | Build ML systems with disciplined training, evaluation, deployment, and safety practices |
| difficulty | staff |
| domains | ["ai-ml"] |
ML engineering failures are silent and delayed. A model that scores well on the benchmark can fail badly in production. This skill enforces the practices that catch these failures before they reach users: proper evaluation harnesses, data leakage detection, distribution shift monitoring, and safety checks.
Before any code: what is the exact prediction task? What metric proves the model is good enough? What metric proves it is safe enough? Document these as your evaluation contract.
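One way to make the evaluation contract concrete is to write it down as data that a pipeline can check. The sketch below is a minimal illustration, not a prescribed format: the class names, the example task, the metric names, and the thresholds are all hypothetical.

```python
from dataclasses import dataclass
from typing import Dict


@dataclass(frozen=True)
class MetricRequirement:
    """One metric plus the threshold it must reach (hypothetical structure)."""
    name: str
    threshold: float
    higher_is_better: bool = True

    def is_met(self, value: float) -> bool:
        if self.higher_is_better:
            return value >= self.threshold
        return value <= self.threshold


@dataclass(frozen=True)
class EvaluationContract:
    """The prediction task, the 'good enough' metric, and the 'safe enough' metric."""
    task: str
    quality: MetricRequirement
    safety: MetricRequirement

    def check(self, results: Dict[str, float]) -> Dict[str, bool]:
        # Report each requirement separately so a safety failure is never
        # hidden behind a strong quality number.
        return {
            req.name: req.is_met(results[req.name])
            for req in (self.quality, self.safety)
        }


# Illustrative values only; real thresholds come from the product requirements.
contract = EvaluationContract(
    task="Classify support tickets as urgent or not urgent",
    quality=MetricRequirement("recall_urgent", 0.90),
    safety=MetricRequirement("harmful_output_rate", 0.001, higher_is_better=False),
)
status = contract.check({"recall_urgent": 0.93, "harmful_output_rate": 0.002})
```

Here `status` reports the quality requirement as met and the safety requirement as not met, which by this contract means the model is not ready.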
Compute a simple baseline (majority class, rule-based system, GPT-4 zero-shot). Your model must beat this baseline by a margin you set in advance to justify the added complexity.
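As a rough sketch of the simplest of these, the majority-class baseline can be computed with the standard library alone. The function names and the `min_margin` default are assumptions chosen for illustration, and accuracy stands in for whatever metric the task actually uses.

```python
from collections import Counter


def majority_class_baseline(train_labels, test_labels):
    """Accuracy of always predicting the most frequent training label."""
    majority_label, _ = Counter(train_labels).most_common(1)[0]
    correct = sum(1 for y in test_labels if y == majority_label)
    return majority_label, correct / len(test_labels)


def beats_baseline(model_score, baseline_score, min_margin=0.05):
    """True only if the model clears the baseline by the agreed margin.

    min_margin is a placeholder; set it from the evaluation contract.
    """
    return model_score - baseline_score >= min_margin


train = ["ham"] * 8 + ["spam"] * 2
test = ["ham", "ham", "ham", "spam"]
label, baseline_acc = majority_class_baseline(train, test)
```

On this toy data the baseline already reaches 0.75 accuracy, so a model scoring 0.78 would not clear a 0.05 margin.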
Write your evaluation pipeline before training. Evaluations should be:
Track: training loss, validation loss, gradient norms. Flag: loss spikes, NaN gradients, overfitting (train loss << val loss), underfitting (train and validation loss both stay high).
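The flags above can be sketched as a per-step check over the tracked values. This is a framework-agnostic illustration: `check_training_step` and every threshold in it (`spike_factor`, `overfit_gap`, `window`, `underfit_loss`) are hypothetical and need tuning per model and loss scale.

```python
import math


def check_training_step(train_losses, val_losses, grad_norm,
                        spike_factor=2.0, overfit_gap=0.5,
                        window=5, underfit_loss=None):
    """Return warning flags for the most recent step."""
    flags = []
    latest_train = train_losses[-1]
    latest_val = val_losses[-1]

    # NaN or infinite values make every other check meaningless, so stop here.
    if not math.isfinite(grad_norm) or not math.isfinite(latest_train):
        return ["non_finite"]

    # Loss spike: latest loss far above the mean of the preceding window.
    previous = train_losses[-(window + 1):-1]
    if previous:
        recent_mean = sum(previous) / len(previous)
        if latest_train > spike_factor * recent_mean:
            flags.append("loss_spike")

    # Overfitting: validation loss well above training loss.
    if latest_val - latest_train > overfit_gap:
        flags.append("overfitting")

    # Underfitting: both losses still above a level the task should reach.
    if underfit_loss is not None and min(latest_train, latest_val) > underfit_loss:
        flags.append("underfitting")

    return flags
```

In a real training loop the returned flags would typically be logged alongside the step number and, for `non_finite`, used to halt the run.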
Compare against: baseline, previous model version, human performance (if applicable). Document every dimension. Declare the threshold required for deployment.
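A minimal sketch of such a comparison is shown below. It assumes each model's results are a plain dictionary of metric values; the function name, metric names, and numbers are illustrative only. Checking every metric against every reference makes regressions on secondary dimensions visible even when the headline metric improves.

```python
def compare_to_references(candidate, references, higher_is_better):
    """Compute per-metric deltas against each reference and mark regressions."""
    report = {}
    for ref_name, ref_metrics in references.items():
        for metric, ref_value in ref_metrics.items():
            delta = candidate[metric] - ref_value
            improved = delta >= 0 if higher_is_better[metric] else delta <= 0
            report[(ref_name, metric)] = {"delta": delta, "regressed": not improved}
    return report


# Illustrative numbers, not real results.
candidate = {"accuracy": 0.91, "p95_latency_ms": 140.0}
references = {
    "baseline": {"accuracy": 0.80, "p95_latency_ms": 20.0},
    "previous_model": {"accuracy": 0.89, "p95_latency_ms": 120.0},
}
higher_is_better = {"accuracy": True, "p95_latency_ms": False}

report = compare_to_references(candidate, references, higher_is_better)
regressions = sorted(key for key, row in report.items() if row["regressed"])
```

In this example accuracy improves against both references while latency regresses against both, so the deployment decision depends on the latency threshold declared in advance.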
For LLM applications:
Deploy to 1% of traffic. Monitor key metrics for 24 hours. Roll out to 10%, then 100%. Have a rollback procedure.
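The staged rollout can be sketched as a loop that advances only while the monitored metrics stay healthy. The three callbacks are hypothetical stand-ins for whatever the deployment platform provides; in practice `metrics_healthy` would block for the full monitoring window (for example, the 24 hours mentioned above) before returning.

```python
def staged_rollout(stages, set_traffic_percent, metrics_healthy, rollback):
    """Advance through traffic stages; roll back on the first unhealthy check."""
    for percent in stages:
        set_traffic_percent(percent)
        if not metrics_healthy(percent):
            rollback()
            return False
    return True


# Simulated run: the health check fails once traffic reaches 10%.
events = []
succeeded = staged_rollout(
    stages=(1, 10, 100),
    set_traffic_percent=lambda p: events.append(f"traffic={p}%"),
    metrics_healthy=lambda p: p < 10,
    rollback=lambda: events.append("rollback"),
)
```

The simulated run stops at the 10% stage and calls the rollback callback without ever reaching 100%.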
| "The eval numbers look good" | Eval numbers on a curated test set are necessary but not sufficient. Production distribution ≠ test distribution. |
| "We'll add safety checks after launch" | Safety issues discovered after launch are incidents. Safety checks added before launch are requirements. |
| "The model improved so we should ship it" | Improved on which metric? Under which conditions? Improvements in accuracy can come with regressions in latency, fairness, or safety. |
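The gap between test and production distributions can be monitored per feature. The sketch below uses the population stability index (PSI), a technique swapped in here for illustration and not named in the source. The function name, bin count, and the example alert threshold of 0.2 are all assumptions.

```python
import math


def population_stability_index(expected, actual, n_bins=10, eps=1e-6):
    """PSI between a reference sample (e.g. test set) and a production sample."""
    lo, hi = min(expected), max(expected)
    width = (hi - lo) / n_bins or 1.0  # avoid zero width for constant features

    def bin_fractions(values):
        counts = [0] * n_bins
        for v in values:
            idx = int((v - lo) / width)
            # Values outside the reference range go into the edge bins.
            counts[min(max(idx, 0), n_bins - 1)] += 1
        # eps keeps empty bins from producing log(0) or division by zero.
        return [max(c / len(values), eps) for c in counts]

    e = bin_fractions(expected)
    a = bin_fractions(actual)
    return sum((ai - ei) * math.log(ai / ei) for ei, ai in zip(e, a))


test_feature = [i / 100 for i in range(100)]        # roughly uniform on [0, 1)
prod_feature = [0.5 + i / 200 for i in range(100)]  # concentrated on [0.5, 1)

psi_same = population_stability_index(test_feature, test_feature)
psi_shifted = population_stability_index(test_feature, prod_feature)
drift_alert = psi_shifted > 0.2  # illustrative threshold, tune per feature
```

Identical samples give a PSI of zero, while the shifted production sample produces a large value and trips the alert.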