| name | e2e-ml-problem-solver |
| description | Walk through an end-to-end machine learning problem from a vague business prompt all the way to implementation, deployment, and monitoring — narrating your thought process at every step like a senior ML engineer in an interview or design review. Use this skill whenever the user gives a business problem, product scenario, or open-ended prompt that could benefit from an ML solution (e.g., 'How would you build a recommendation system for X?', 'Design a churn prediction model', 'How would you detect fraud at Y?', 'Build me a pricing algorithm'). Also trigger when the user asks for help with ML system design interviews, case study walkthroughs, ML project planning, or any request that involves taking a fuzzy real-world problem and turning it into a working ML pipeline. Even partial triggers like 'How would you approach this ML problem?' or 'Walk me through building a model for...' should activate this skill. This skill produces narrated, structured walkthroughs with working code — not just theory. |
End-to-End ML Problem Solver
Purpose
This skill turns a vague business prompt into a fully narrated, structured ML solution — covering problem framing, data strategy, feature engineering, model selection, evaluation, deployment, monitoring, and iteration. The narration style mirrors how a senior ML engineer would talk through a case study interview or present a design review: thinking out loud, justifying decisions, acknowledging trade-offs, and showing pragmatic judgment.
The output is both educational (narrated reasoning) and practical (working code). The user should feel like they are pair-programming with someone who has shipped ML systems at scale.
How This Skill Works
When triggered, follow the 10-step workflow below. For each step:
- Narrate your reasoning before writing any code — explain why you're making each choice, what alternatives exist, and what trade-offs you're accepting.
- Write working code that demonstrates the step concretely.
- Flag decision points where the user's specific context would change your approach.
The narration should feel conversational and opinionated — not like a textbook. Use phrases like "I'd start by...", "The reason I'd pick X over Y here is...", "A common mistake here is...", "In production, you'd want to...".
The 10-Step End-to-End ML Workflow
Step 1: Clarify the Problem and Constraints
Before touching any data or models, frame the business problem correctly. This is the most important step — getting this wrong wastes everything downstream.
What to narrate:
- What is the dependent variable / outcome we're predicting or optimizing?
- How has this problem been approached before? Is there a baseline to beat?
- Is ML even needed, or would a heuristic / rules-based system work?
- Are there legal, ethical, or regulatory constraints on what data or models you can use?
- Who are the end users and how will they consume the model's output?
- What's the cost of wrong predictions? (A spam email in your inbox vs. a bad loan approval are very different.)
- Can the problem be decomposed into smaller sub-problems?
Technical constraints to surface:
- Latency requirements (real-time inference vs. batch)
- Throughput (predictions per second)
- Deployment target (cloud, on-device, edge)
- Budget and compute constraints
The mindset: Pretend you've already been hired. You're scoping this with a PM in a meeting. Ask the questions a thoughtful engineer would ask — don't just jump to "let's use XGBoost."
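The cost-of-errors question above can be made concrete with a small calculation. This is a minimal sketch, assuming the model outputs calibrated probabilities and that correct decisions cost nothing; the function name `break_even_threshold` and all cost figures are hypothetical, chosen only to contrast the spam and loan cases.

```python
def break_even_threshold(cost_fp: float, cost_fn: float) -> float:
    """Probability above which flagging a case is cheaper in expectation.

    Flagging a case with positive-class probability p costs (1 - p) * cost_fp
    in expectation; not flagging it costs p * cost_fn. Flagging wins when
    p * cost_fn > (1 - p) * cost_fp, i.e. p > cost_fp / (cost_fp + cost_fn).
    """
    if cost_fp < 0 or cost_fn < 0 or cost_fp + cost_fn == 0:
        raise ValueError("costs must be non-negative and not both zero")
    return cost_fp / (cost_fp + cost_fn)


# Hypothetical costs, for illustration only.
# Spam filter: positive = spam. Hiding a real email (FP) hurts more than
# letting one spam email through (FN), so we flag only when very sure.
spam_threshold = break_even_threshold(cost_fp=10.0, cost_fn=1.0)

# Loan default: positive = will default. Approving a defaulter (FN) costs far
# more than rejecting a good borrower (FP), so we flag at a low probability.
loan_threshold = break_even_threshold(cost_fp=1.0, cost_fn=20.0)

print(f"spam: flag when p > {spam_threshold:.3f}")
print(f"loan: flag when p > {loan_threshold:.3f}")
```

The point of the sketch is that the same model score leads to very different decisions once the asymmetry in error costs is written down, which is why this belongs in scoping rather than in tuning.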
Step 2: Establish Metrics
Pick metrics that are simple, observable, and attributable. Align model metrics with business outcomes.
What to narrate:
- Start with a single north star metric, then mention secondary metrics
- Explain how model performance (e.g., precision/recall) translates to business impact (e.g., "90% accuracy means 50% fewer misrouted tickets, saving 10% resolution time")
- Mention guardrail / counter metrics — metrics that must NOT degrade while optimizing the primary one
- Consider satisficing: optimize one metric subject to a constraint on another (e.g., "maximize precision at recall ≥ 0.95")
- Establish what "good enough" looks like — what baseline do we need to beat?
What makes a good metric: meaningful (tied to business goals, not easily gamed), measurable (simple to track), understandable (stakeholders get it), timely (collectible in a reasonable timeframe).
What makes a bad metric: vanity metrics (sound nice, mean nothing), irrelevant metrics (not tied to the goal), impractical metrics (can't actually measure them), overly complex metrics, delayed metrics (takes too long to observe).
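The satisficing idea from the list above ("maximize precision at recall ≥ 0.95") can be sketched as a threshold sweep. This is an illustrative helper, not a library API: the name `max_precision_at_recall` and the toy labels and scores are assumptions, and the recall floor is set to 0.75 here only because the toy sample is tiny.

```python
def max_precision_at_recall(y_true, scores, min_recall):
    """Return (threshold, precision, recall) with the highest precision among
    thresholds whose recall is at least min_recall, or None if none qualifies.
    A case is predicted positive when its score is >= threshold."""
    total_pos = sum(y_true)
    if total_pos == 0:
        raise ValueError("need at least one positive label")
    best = None
    for t in sorted(set(scores)):
        preds = [s >= t for s in scores]
        tp = sum(1 for p, y in zip(preds, y_true) if p and y == 1)
        fp = sum(1 for p, y in zip(preds, y_true) if p and y == 0)
        recall = tp / total_pos
        precision = tp / (tp + fp)  # tp + fp >= 1 because t is one of the scores
        if recall >= min_recall and (best is None or precision > best[1]):
            best = (t, precision, recall)
    return best


# Toy data, for illustration only.
y_true = [0, 0, 1, 0, 1, 1, 0, 1]
scores = [0.1, 0.2, 0.35, 0.4, 0.6, 0.7, 0.8, 0.9]

print(max_precision_at_recall(y_true, scores, min_recall=0.75))
print(max_precision_at_recall(y_true, scores, min_recall=1.0))
```

Tightening the recall constraint from 0.75 to 1.0 forces a lower threshold and a lower precision, which is the trade-off the guardrail is meant to make explicit.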
Step 3: Understand Your Data Sources
The model is only as good as the data. Think creatively about what data to use.
What to narrate:
- What internal data is available and relevant?
- Can you augment with external data (crowdsourcing, purchased datasets, scraped data, user-provided data during onboarding)?
- For edge cases and rare events, consider data augmentation and synthetic data generation
- Data freshness: how often is data updated?
- Data provenance: how was it collected? Any sampling, selection, or response bias?
- Talk to domain experts about what the columns mean
Step 4: Explore Your Data (EDA)
Profile the data before doing anything else.
What to narrate:
- Column-level profiling: useful vs. low-variance vs. noisy vs. lots of missing values
- Summary statistics (mean, median, quantiles)
- Distributions and skewness
- Correlation matrix to spot relationships and multicollinearity
- Visualizations that reveal structure
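Code pattern: Use pandas for profiling (summary statistics, missing-value rates, or a profiling library such as ydata-profiling, formerly pandas-profiling), plus histograms, correlation heatmaps, and scatter matrices.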
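A first profiling pass along these lines might look like the sketch below. The data is synthetic and the column names (`tenure_months`, `monthly_spend`, `region`, `constant_flag`) are hypothetical; plots are left out so the snippet runs headless, but `numeric.hist()` and a seaborn heatmap of `corr` would be the natural next calls.

```python
import numpy as np
import pandas as pd

# Synthetic stand-in data; the column names are hypothetical.
rng = np.random.default_rng(0)
n = 500
df = pd.DataFrame({
    "tenure_months": rng.integers(1, 60, size=n).astype(float),
    "monthly_spend": rng.lognormal(mean=3.0, sigma=1.0, size=n),  # right-skewed
    "region": rng.choice(["north", "south", "west"], size=n),
    "constant_flag": np.ones(n),  # zero-variance column
})
# Knock out 10% of one column to simulate missing values.
df.loc[rng.choice(n, size=50, replace=False), "tenure_months"] = np.nan

# Column-level profile: type, missing rate, cardinality.
profile = pd.DataFrame({
    "dtype": df.dtypes.astype(str),
    "missing_frac": df.isna().mean(),
    "n_unique": df.nunique(),
})
numeric = df.select_dtypes(include="number")
profile["skew"] = numeric.skew()  # stays NaN for non-numeric columns
print(profile)

# Summary statistics for the numeric columns.
print(numeric.describe().T[["mean", "50%", "min", "max"]])

# Flag columns that carry no signal, then correlate what is left.
low_variance = [c for c in numeric.columns if numeric[c].nunique() <= 1]
corr = numeric.drop(columns=low_variance).corr()
print("low-variance columns:", low_variance)
print(corr)
```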
Step 5: Clean Your Data
The unglamorous but critical step.
What to narrate:
- Drop irrelevant or duplicated rows/columns
- Handle incorrect values and schema mismatches
- Missing data strategy: understand why data is missing (MCAR, MAR, MNAR) before choosing imputation
  - Simple imputation (mean/median/mode)
  - Model-based imputation (predict missing values from other features)
  - Dropping rows (last resort)
- Outlier handling: investigate first, then decide
  - Remove, winsorize/cap, transform (log), or leave as-is depending on source and business implications
  - Consider multivariate outliers, not just univariate
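The cleaning choices above can be sketched on a toy frame. This is one reasonable ordering under stated assumptions, not the only one: the columns `income` and `age` are hypothetical, the values are made up, and the 5th/95th percentile caps are an arbitrary illustrative choice.

```python
import numpy as np
import pandas as pd

# Toy data, for illustration only.
df = pd.DataFrame({
    "income": [42_000, 51_000, np.nan, 48_000, 1_200_000, 45_000, np.nan, 50_000],
    "age": [34, 45, 29, 45, 52, 38, 41, 45],
})
df = pd.concat([df, df.iloc[[1]]], ignore_index=True)  # inject a duplicated row

# 1. Drop exact duplicate rows.
clean = df.drop_duplicates().copy()

# 2. Keep a missingness indicator before imputing, so the model can still
#    learn from the fact that a value was absent (useful if data is not MCAR).
clean["income_missing"] = clean["income"].isna().astype(int)

# 3. Simple imputation with the median, which is robust to the extreme value.
clean["income"] = clean["income"].fillna(clean["income"].median())

# 4. Winsorize: cap at the 5th and 95th percentiles instead of dropping rows.
low, high = clean["income"].quantile([0.05, 0.95])
clean["income_capped"] = clean["income"].clip(lower=low, upper=high)

print(clean)
```

In a real pipeline the median and the caps would be computed on the training split only and then applied to validation and test data, to avoid leakage.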
Step 6: Feature Engineering
The art of presenting data to models in the best way possible.
What to narrate:
- Feature selection based on domain knowledge
- For numerical data: transformations (log, capping), binning/discretization, dimensionality reduction (PCA), scaling/normalization (min-max, z-score)
- For categorical data: one-hot encoding, hashing (for high cardinality), target encoding
- For text: stemming, lemmatization, stop-word removal, bag-of-words, TF-IDF, n-grams, word embeddings (word2vec, GloVe), or transformer embeddings
- Feature interactions and domain-specific features
- Always explain why a feature transformation helps the model
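Several of the transformations above can be bundled into one preprocessing object, so the same steps run at training and at inference. The sketch below uses scikit-learn's `ColumnTransformer`; the column names and values are hypothetical.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import FunctionTransformer, OneHotEncoder, StandardScaler

# Toy data, for illustration only.
df = pd.DataFrame({
    "monthly_spend": [12.0, 250.0, 33.0, 1800.0, 75.0, 40.0],
    "tenure_months": [3, 24, 12, 48, 6, 18],
    "plan": ["basic", "pro", "basic", "enterprise", "pro", "basic"],
})

preprocess = ColumnTransformer(
    [
        # Log-transform the right-skewed column so a few large spenders
        # do not dominate, then scale it.
        ("skewed", Pipeline([
            ("log1p", FunctionTransformer(np.log1p)),
            ("scale", StandardScaler()),
        ]), ["monthly_spend"]),
        # Z-score scaling helps linear and distance-based models.
        ("numeric", StandardScaler(), ["tenure_months"]),
        # One-hot encode a low-cardinality categorical; unseen categories
        # at inference time become all zeros instead of raising.
        ("categorical", OneHotEncoder(handle_unknown="ignore"), ["plan"]),
    ],
    sparse_threshold=0,  # always return a dense array, for readability
)

X = preprocess.fit_transform(df)
print(X.shape)

# A new row with a plan value never seen during fitting.
new_row = pd.DataFrame({"monthly_spend": [60.0], "tenure_months": [9], "plan": ["trial"]})
X_new = preprocess.transform(new_row)
print(X_new)
```

For high-cardinality categoricals, hashing or target encoding would replace the one-hot step; target encoding in particular needs to be fit inside cross-validation folds to avoid leaking the label.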
Step 7: Model Selection
Choose models based on constraints, not hype.
What to narrate:
- Start simple (Occam's Razor) — linear/logistic regression as a baseline
- Factors to consider: training speed, prediction speed, budget, data volume/dimensionality, feature types (categorical vs. numerical), explainability requirements
- Common progression: linear model → tree-based ensemble (Random Forest, XGBoost) → neural network (only if data is abundant and interpretability isn't critical)
- Mention alternatives you'd consider and why you'd pick one over another
- For unsupervised problems: clustering (k-means, DBSCAN, GMMs), dimensionality reduction (PCA), or topic modeling (LDA, i.e. latent Dirichlet allocation), depending on the goal
Step 8: Model Training & Evaluation
Train, validate, and assess whether the model is good enough to ship.
What to narrate:
- Train/validation/test split strategy
- Cross-validation approach
- Hyperparameter tuning method (grid search, random search, Bayesian optimization)
- How to compare models
- Handling biased training data and class imbalance (resampling, SMOTE, class weights, adjusted thresholds)
- Feature importance analysis
- Learning curves to diagnose overfitting/underfitting
- Regularization strategy (L1/L2)
- For large datasets: sampling strategies (random, stratified, under/oversampling)
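A compact version of this step, assuming an imbalanced binary problem, could look like the following. The data comes from `make_classification`, so the numbers mean nothing beyond illustrating the mechanics: a stratified hold-out, stratified cross-validation, class weights for the imbalance, and a trivial baseline to beat.

```python
from sklearn.datasets import make_classification
from sklearn.dummy import DummyClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import average_precision_score
from sklearn.model_selection import StratifiedKFold, cross_val_score, train_test_split

# Synthetic data with roughly 10% positives.
X, y = make_classification(
    n_samples=2000, n_features=10, n_informative=5,
    weights=[0.9, 0.1], random_state=0,
)

# Hold out a test set that is touched exactly once, at the end.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=0
)

cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)

# Baseline: always predicts the class prior. Any real model must beat this.
baseline = DummyClassifier(strategy="prior")
# L2-regularized logistic regression with class weights for the imbalance.
model = LogisticRegression(class_weight="balanced", max_iter=1000)

# Average precision (area under the PR curve) instead of accuracy,
# because accuracy is misleading at a 90/10 class split.
baseline_ap = cross_val_score(baseline, X_train, y_train, cv=cv,
                              scoring="average_precision").mean()
model_ap = cross_val_score(model, X_train, y_train, cv=cv,
                           scoring="average_precision").mean()
print(f"baseline AP: {baseline_ap:.3f}  model AP: {model_ap:.3f}")

# Final check on the untouched test set.
model.fit(X_train, y_train)
test_ap = average_precision_score(y_test, model.predict_proba(X_test)[:, 1])
print(f"test AP: {test_ap:.3f}")
```

Hyperparameter search (for example `RandomizedSearchCV` over `C`) would slot in where `cross_val_score` is called, still using only the training split.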
Step 9: Deployment
Operationalizing the model — where MLOps meets reality.
What to narrate:
- Online (real-time) vs. batch vs. hybrid deployment
  - Online: low latency, needs caching layer for features, robust monitoring, more infrastructure cost
  - Batch: periodic predictions, good for non-urgent use cases (recommendations), can't handle brand-new data until next batch
  - Hybrid: batch for most cases, online for time-sensitive predictions
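- Model degradation from drift: the underlying data distribution changes over time (e.g., a model trained on winter data recommending jackets in July). Separately, watch for training-serving skew, where features are computed differently at training time and at serving time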
- How often to retrain, what triggers a model refresh, how much new vs. historical data to use
- Logging and monitoring for catching degradation
- A/B testing the new model against the baseline before full rollout
  - Run for at least 2 weeks to account for day-of-week effects
  - Check counter metrics, not just the primary metric
  - Statistical significance ≠ practical significance: check effect size
  - Consider holdout groups for measuring long-term lift
- Canary deployments and gradual rollouts
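The gap between statistical and practical significance in the A/B testing notes above is easy to show with a two-proportion z-test. This is a minimal sketch using only the standard library; the function name, the conversion counts, and the `min_practical_lift` bar are all hypothetical.

```python
import math


def two_proportion_ztest(conv_a, n_a, conv_b, n_b):
    """Two-sided z-test for a difference in conversion rates.

    Returns (absolute_lift, z, p_value), where lift = rate_b - rate_a.
    """
    p_a, p_b = conv_a / n_a, conv_b / n_b
    pooled = (conv_a + conv_b) / (n_a + n_b)
    se = math.sqrt(pooled * (1 - pooled) * (1 / n_a + 1 / n_b))
    z = (p_b - p_a) / se
    p_value = math.erfc(abs(z) / math.sqrt(2))  # two-sided normal tail
    return p_b - p_a, z, p_value


# Hypothetical experiment: control converts at 1.00%, new model at 1.03%.
lift, z, p_value = two_proportion_ztest(
    conv_a=10_000, n_a=1_000_000,
    conv_b=10_300, n_b=1_000_000,
)

# Hypothetical bar set with the PM before the test started.
min_practical_lift = 0.001  # 0.1 percentage points

print(f"lift={lift:.4%}  z={z:.2f}  p={p_value:.3f}")
print("statistically significant:", p_value < 0.05)
print("practically significant:", lift >= min_practical_lift)
```

With a million users per arm, a 0.03 percentage point lift clears p < 0.05 yet falls well short of the agreed bar, so the right call may be not to ship.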
Step 10: Iterate
Deployment is not the end. Plan for continuous improvement.
What to narrate:
- Error analysis: manually inspect wrong predictions, bucket them by failure type, prioritize fixes
- Feedback loops: can user actions (clicks, ratings, corrections) become training data?
- When to collect more data vs. when to engineer better features vs. when to try a new model architecture
- Monitor for data drift, concept drift, and feature drift
- Set up alerts for metric degradation
- Plan the next iteration based on the highest-leverage improvement
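One common way to monitor a single feature for drift is the Population Stability Index (PSI), which compares binned distributions between training data and live data. The sketch below is a plain-Python illustration on synthetic data; the helper name `psi`, the bin count, and the `eps` floor are assumptions, and the frequently quoted cut-offs (below 0.1 stable, above 0.25 worth investigating) are rules of thumb rather than guarantees.

```python
import bisect
import math
import random


def psi(reference, current, n_bins=10, eps=1e-6):
    """Population Stability Index between two samples of one numeric feature."""
    ref_sorted = sorted(reference)
    # Interior bin edges at the reference quantiles, so each bin holds
    # roughly 1 / n_bins of the reference data.
    edges = [ref_sorted[len(ref_sorted) * i // n_bins] for i in range(1, n_bins)]

    def bin_fractions(values):
        counts = [0] * n_bins
        for v in values:
            counts[bisect.bisect_right(edges, v)] += 1
        # Floor at eps so an empty bin does not produce log(0).
        return [max(c / len(values), eps) for c in counts]

    ref_frac = bin_fractions(reference)
    cur_frac = bin_fractions(current)
    return sum((c - r) * math.log(c / r) for r, c in zip(ref_frac, cur_frac))


# Synthetic feature values, for illustration only.
random.seed(0)
train_feature = [random.gauss(0.0, 1.0) for _ in range(5000)]
same_dist = [random.gauss(0.0, 1.0) for _ in range(5000)]
shifted = [random.gauss(0.75, 1.0) for _ in range(5000)]

print(f"PSI, same distribution: {psi(train_feature, same_dist):.4f}")
print(f"PSI, shifted mean:      {psi(train_feature, shifted):.4f}")
```

PSI only catches changes in the input distribution. Concept drift, where the relationship between features and label changes, still needs label-based monitoring of the primary metric.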
Output Format
For each problem, produce a single cohesive artifact (Python file or Jupyter-style script) that walks through all 10 steps with:
- Markdown-style comments narrating the reasoning at each step (using `# ---` section headers)
- Working code demonstrating each step with synthetic or realistic sample data
- Decision annotations marked with `# DECISION:` explaining why a particular choice was made
- Trade-off callouts marked with `# TRADE-OFF:` when there's a meaningful alternative
- Production notes marked with `# IN PRODUCTION:` for things you'd do differently at scale
The code should be runnable end-to-end. Use common libraries: pandas, numpy, scikit-learn, matplotlib, seaborn. For deep learning problems, use PyTorch or TensorFlow. For NLP, use Hugging Face transformers or spaCy.
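A fragment of such an artifact might look like the sketch below. It only illustrates the annotation conventions; the scenario, function names, data, and threshold are all hypothetical, and a real walkthrough would cover all 10 steps.

```python
# --- Step 2: Establish Metrics ---
# I'd start by agreeing on one north star metric before touching any model.
# DECISION: use recall on the positive class, because in this hypothetical
# scenario a missed positive costs more than a false alarm.
# TRADE-OFF: precision would matter more if false alarms were expensive.
def recall(y_true, y_pred):
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    return tp / (tp + fn) if (tp + fn) else 0.0


# --- Step 7: Model Selection ---
# DECISION: begin with a one-rule heuristic so there is a baseline to beat.
# IN PRODUCTION: the threshold would live in config and be reviewed with
# domain experts, not hard-coded.
def heuristic_baseline(amounts, threshold=100.0):
    return [1 if a > threshold else 0 for a in amounts]


# Toy data, for illustration only.
amounts = [20.0, 150.0, 90.0, 300.0]
labels = [0, 1, 1, 1]

baseline_recall = recall(labels, heuristic_baseline(amounts))
print(f"baseline recall: {baseline_recall:.2f}")
```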
Adaptation Rules
- If the user gives a specific business problem: Tailor every step to that problem. Use realistic feature names, domain-appropriate metrics, and plausible data shapes.
- If the user asks for interview prep: Add interviewer-style follow-up questions and how to handle them at the end of each step.
- If the user wants just the code: Reduce narration but keep the `# DECISION:` and `# TRADE-OFF:` annotations.
- If the user wants just the framework: Skip the code, produce a structured narrative walkthrough.
- If the problem is clearly not suited for ML: Say so! Suggest heuristics or rules-based approaches. Showing judgment about when NOT to use ML is a sign of seniority.
Common Pitfalls to Flag
Throughout the walkthrough, proactively call out common mistakes:
- Jumping to model selection before understanding the problem
- Using accuracy as a metric for imbalanced datasets
- Not establishing a baseline before building complex models
- Ignoring the cost asymmetry of different types of errors
- Overfitting to the validation set through excessive hyperparameter tuning
- Not thinking about deployment constraints during model selection
- Treating model deployment as the finish line instead of the starting line
- Not monitoring for data/concept drift in production
- Cherry-picking metrics or stopping A/B tests early when results look good
For Deeper Reference
For detailed guidance on specific sub-topics, read the reference files:
- references/product-sense-and-metrics.md: Deep dive on metric selection, AARRR framework, diagnosing metric changes, A/B testing pitfalls, and building product/business intuition. Read this when the problem involves product metrics, A/B test design, or metric trade-off decisions.
- references/case-study-patterns.md: Patterns and anti-patterns from real ML case study interviews (recommendation systems, pricing algorithms, fraud detection, NLP sentiment analysis, social graph features, revenue prediction). Read this when you need concrete examples of how senior engineers approach specific problem domains.