sota-data-cleaning-feature-selection-eda
// Master SOTA data prep for Kaggle comps: automated EDA (Sweetviz), cleaning (Pyjanitor), and feature selection (Polars + XGBoost) for medium datasets (100MB–5GB) in Colab.
| name | sota-data-cleaning-feature-selection-eda |
| description | Master SOTA data prep for Kaggle comps: automated EDA (Sweetviz), cleaning (Pyjanitor), and feature selection (Polars + XGBoost) for medium datasets (100MB–5GB) in Colab. |
| Benefit | Impact |
|---|---|
| Faster iteration | Automated EDA (Sweetviz) reveals patterns 50% quicker than manual inspection. |
| Accuracy boost | Feature selection cuts 1000s of features to dozens; model accuracy gains 5–15% on tabular tasks. |
| Fits Colab | Polars processes medium data 2× faster than pandas; stays within free tier RAM (12–16GB). |
| Prevents drudgery | Automated cleaning (Pyjanitor, DataPrep) handles inconsistencies 3× faster. |
| Catches bias early | EDA visuals (heatmaps, correlations) spot data leakage and imbalances before training. |
| Future-proof | Integrates AI-assisted cleaning (e.g., COMET), aligned with 2025 ML trends. |
| Good fit | Bad fit |
|---|---|
| Medium tabular datasets (100MB–5GB) with noise, missing values, or high dimensionality, where automated cleaning and selection trim the data to fit Colab. | Massive datasets (>10GB) that need distributed tools such as Dask or Spark. |
| Imbalanced or skewed data in comps, solved via EDA-driven resampling and statistical selection for better generalization. | Unstructured data (images/text) without prior vectorization; these require domain-specific preprocessing. |
| Quick iteration in time-limited comps, using SOTA automation to sense-make before training. | Pure inference optimization; focus here is pre-training prep, not deployment. |
Real-world scenarios
Core primitives: EDA (visualize distributions/relations to understand data), cleaning (fix errors/missing via automation), feature selection (prune via statistical/embedded methods).
They interact sequentially: EDA informs cleaning targets. Clean data enables accurate selection. Selected features feed training without overfitting.
flowchart TD
A[Load\nPolars/pandas] --> B[EDA\nSweetviz / seaborn]
B --> C[Clean\nPyjanitor / DataPrep]
C --> D[Feature selection\nFilter + Embedded]
D --> E[Train\nXGBoost + CV]
E --> F[Iterate\nerror analysis + new hypotheses]
F --> B
Pipeline: Load (Polars/pandas) --> EDA (Sweetviz/seaborn) --> Clean (Pyjanitor/impute) --> Select (Dynamic threshold/Featuretools) --> Train (XGBoost with CV)
flowchart TD
S{Dataset size?} -->|< 100MB| P1[pandas OK]
S -->|100MB–5GB| P2[Prefer Polars]
S -->|> 10GB| P3[Spark / distributed]
P2 --> R{RAM pressure?}
R -->|Yes| L1[Use lazy scans\npl.scan_csv + filters]
R -->|No| L2[Use eager load\npl.read_csv + immediate ops]
.fit() / .predict()).
Prioritized checklist
!pip install). Load a medium Kaggle dataset. Run a basic EDA report.
20% of features for 80% results
Polars for loading and EDA. Pyjanitor for cleaning. scikit-learn’s SelectKBest plus embedded (XGBoost importances) for selection.
Common pitfalls + avoidance
Debugging / observability tips
sv.compare() for pre/post-clean diffs.
print(df.memory_usage()).
Performance gotchas
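As a rough illustration of the memory check mentioned above, the sketch below measures a pandas frame before and after downcasting numeric columns. The frame is synthetic and the column names are placeholders. Whether float32 precision is acceptable depends on the data.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
df = pd.DataFrame(
    {
        "a": rng.normal(size=10_000),                # float64 by default
        "b": rng.integers(0, 100, size=10_000),      # int64 by default
    }
)

# deep=True also counts the contents of object columns, not just pointers.
before = df.memory_usage(deep=True).sum()

df_small = df.copy()
df_small["a"] = df_small["a"].astype("float32")
# downcast="integer" picks the smallest integer dtype that holds the values.
df_small["b"] = pd.to_numeric(df_small["b"], downcast="integer")

after = df_small.memory_usage(deep=True).sum()
print(f"before: {before / 1e3:.1f} kB, after: {after / 1e3:.1f} kB")
```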
Security: Sanitize user inputs in pipelines to prevent injection.
Example 1: Hello, core primitive (Basic EDA on medium data)
Problem: Understand distributions in a 500MB CSV for patterns.
!pip install sweetviz polars pyarrow -q
import polars as pl
import sweetviz as sv
# Load fast with Polars
df = pl.read_csv('train.csv')
# Sweetviz expects a pandas DataFrame, so convert first (to_pandas() needs pyarrow)
report = sv.analyze(df.to_pandas(), target_feat='target')
report.show_notebook()  # Shows distributions, correlations, missing patterns
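When a full report is too heavy or the notebook cannot render it, a few of the same signals (missingness, skew, class balance) can be pulled out by hand. The sketch below runs on synthetic data with made-up column names. It is a quick complement to the report, not a replacement for it.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
n = 1_000
df = pd.DataFrame(
    {
        "normal": rng.normal(size=n),
        "skewed": rng.exponential(size=n),
        "target": rng.integers(0, 2, size=n),
    }
)
df.loc[df.index[:100], "normal"] = np.nan  # inject 10% missing values

# One row per column: share of missing values, skewness, distinct values.
summary = pd.DataFrame(
    {
        "missing_frac": df.isna().mean(),
        "skew": df.skew(numeric_only=True),
        "n_unique": df.nunique(),
    }
)

# Class balance of the target, as proportions.
class_balance = df["target"].value_counts(normalize=True)

print(summary)
print(class_balance)
```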
Example 2: Typical workflow (Automated cleaning post-EDA)
Problem: Fix missing/inconsistent values identified in EDA.
!pip install pyjanitor -q
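import janitor  # Importing registers .clean_names(), .remove_empty(), etc. on pandas DataFrames
import pandas as pd
df = pd.read_csv('train.csv')
# Chain cleaning operations. The median below is computed on the full frame;
# to avoid leakage, compute imputation statistics on the training split only.
df = (
    df
    .clean_names()    # lowercase + underscores
    .remove_empty()   # drop all-null rows and columns
    .pipe(lambda d: d.fillna(d.median(numeric_only=True)))  # numeric median fill on the cleaned frame
)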
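One way to keep imputation free of leakage is to fit the imputer on the training split and reuse its statistics on the validation split. The sketch below uses scikit-learn's `SimpleImputer` on a tiny made-up frame. The column names and split ratio are illustrative assumptions.

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer
from sklearn.model_selection import train_test_split

df = pd.DataFrame(
    {
        "x1": [1.0, 2.0, np.nan, 4.0, 5.0, 6.0, np.nan, 8.0],
        "x2": [10.0, np.nan, 30.0, 40.0, 50.0, np.nan, 70.0, 80.0],
        "target": [0, 1, 0, 1, 0, 1, 0, 1],
    }
)
X = df.drop(columns="target")
y = df["target"]
X_train, X_valid, y_train, y_valid = train_test_split(
    X, y, test_size=0.25, random_state=42
)

# Medians are learned from the training rows only...
imputer = SimpleImputer(strategy="median")
X_train_imp = pd.DataFrame(
    imputer.fit_transform(X_train), columns=X.columns, index=X_train.index
)
# ...and then applied unchanged to the validation rows.
X_valid_imp = pd.DataFrame(
    imputer.transform(X_valid), columns=X.columns, index=X_valid.index
)

print(dict(zip(X.columns, imputer.statistics_)))
```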
Example 3: Production-ish pattern (SOTA hybrid feature selection)
Problem: Prune 1k features to avoid overfitting on cleaned data.
from sklearn.feature_selection import SelectKBest, f_regression
from xgboost import XGBRegressor
from sklearn.model_selection import train_test_split, cross_val_score
import numpy as np
# Assumes df holds numeric features with no missing values plus a 'target' column
X = df.drop('target', axis=1)
y = df['target']
X_train, X_valid, y_train, y_valid = train_test_split(
    X, y, test_size=0.2, random_state=42
)
# Hybrid: Filter + Embedded
# 1) Fast statistical filter (reduces to at most 100 features)
kbest = SelectKBest(score_func=f_regression, k=min(100, X_train.shape[1]))
X_train_filtered = kbest.fit_transform(X_train, y_train)
X_valid_filtered = kbest.transform(X_valid)
# 2) Embedded selection via XGBoost importance
xgb = XGBRegressor(n_estimators=100, random_state=42, max_depth=5)
xgb.fit(X_train_filtered, y_train, eval_set=[(X_valid_filtered, y_valid)], verbose=False)
# Keep the top 50 features by importance (indices into the filtered matrix)
top_idx = np.argsort(xgb.feature_importances_)[-50:]
X_train_final = X_train_filtered[:, top_idx]
X_valid_final = X_valid_filtered[:, top_idx]
# Validate with CV. Selection already saw all of X_train, so this estimate is optimistic.
scores = cross_val_score(XGBRegressor(max_depth=5, random_state=42), X_train_final, y_train, cv=5, scoring='r2')
print(f"5-fold CV R²: {scores.mean():.4f} (+/- {scores.std():.4f})")
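If the selection steps run once on the full training set and cross-validation happens afterwards, every fold has already influenced which features were kept. A common remedy is to put the selectors inside a `Pipeline` so they are refit per fold. The sketch below shows that pattern on synthetic data. `GradientBoostingRegressor` stands in for XGBoost purely to keep the snippet dependency-light, and the feature counts are arbitrary.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.feature_selection import SelectFromModel, SelectKBest, f_regression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import Pipeline

# Synthetic stand-in: 60 features, only 8 of which carry signal.
X, y = make_regression(
    n_samples=300, n_features=60, n_informative=8, noise=5.0, random_state=0
)

pipe = Pipeline(
    [
        # 1) Filter: keep the 30 features with the highest F-scores.
        ("filter", SelectKBest(score_func=f_regression, k=30)),
        # 2) Embedded: keep the 10 most important features by tree importance.
        #    threshold=-np.inf makes max_features the only criterion.
        (
            "embedded",
            SelectFromModel(
                GradientBoostingRegressor(n_estimators=50, random_state=0),
                max_features=10,
                threshold=-np.inf,
            ),
        ),
        ("model", GradientBoostingRegressor(n_estimators=50, random_state=0)),
    ]
)

# Both selectors are refit inside every fold, so no fold sees its own
# validation rows during selection.
scores = cross_val_score(pipe, X, y, cv=5, scoring="r2")
print(f"5-fold CV R2: {scores.mean():.4f} (+/- {scores.std():.4f})")

pipe.fit(X, y)
n_kept = int(pipe.named_steps["embedded"].get_support().sum())
print(f"features kept after both stages: {n_kept}")
```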
Example 4: Advanced but common (Integrate with Featuretools auto-engineering)
Problem: Automated end-to-end for sense-making with relational data.
!pip install featuretools -q
import featuretools as ft
# customer_df and transaction_df are assumed to be pandas DataFrames loaded earlier
# For relational data: organize into an EntitySet
es = ft.EntitySet(id='retail_data')
es = es.add_dataframe(dataframe_name='customers', dataframe=customer_df, index='customer_id')
es = es.add_dataframe(dataframe_name='transactions', dataframe=transaction_df, index='trans_id')
# Define relationship (1 customer has many transactions): parent first, then child
es = es.add_relationship('customers', 'customer_id', 'transactions', 'customer_id')
# Auto-synthesize features (aggregations, transforms, etc.)
feature_matrix, feature_defs = ft.dfs(
    entityset=es,
    target_dataframe_name='customers',
    max_depth=2,  # Control feature complexity
    agg_primitives=['sum', 'mean', 'max'],  # Aggregations over transactions
)
print(f"Auto-generated {len(feature_defs)} features from relational structure.")
# Now proceed to selection on feature_matrix
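To make concrete what per-customer aggregation features look like, here is a hand-rolled pandas version of the same idea. It is not how Featuretools works internally. The two frames and the `amount` column are hypothetical toy data.

```python
import pandas as pd

# Hypothetical parent table: one row per customer.
customer_df = pd.DataFrame({"customer_id": [1, 2, 3]})

# Hypothetical child table: many transactions per customer.
transaction_df = pd.DataFrame(
    {
        "trans_id": [10, 11, 12, 13],
        "customer_id": [1, 1, 2, 2],
        "amount": [5.0, 15.0, 7.0, 3.0],
    }
)

# Aggregate the child table up to the parent key.
agg = (
    transaction_df.groupby("customer_id")["amount"]
    .agg(["sum", "mean", "max"])
    .add_prefix("amount_")
    .reset_index()
)

# Left join keeps customers without transactions (their features are NaN).
feature_matrix = customer_df.merge(agg, on="customer_id", how="left")
print(feature_matrix)
```

Customer 3 has no transactions, so its aggregate columns come out as missing values and need imputation or a sentinel before modeling.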
pl.read_csv('file.csv')
sv.analyze(df).show_notebook()
df.clean_names().fill_empty('col', 'mean')
df.null_count()
from statsmodels.stats.outliers_influence import variance_inflation_factor
SelectKBest(f_regression, k=100)
SelectFromModel(XGBClassifier(), threshold=0.01)
ft.dfs(entityset=es, target_dataframe_name='main')
df.corr().style.background_gradient()
import seaborn as sns; sns.boxplot(df)
df.cast(pl.Float32)
for batch in df.iter_slices(10000): process(batch)
from sklearn.model_selection import cross_val_score
model.feature_importances_; sns.barplot()
df.drop_duplicates(subset=['key'])
from sklearn.impute import KNNImputer
df.skew()
If you only remember 5 things
The "Gene Expression Cancer RNA-Seq" dataset on Kaggle is well suited to practicing feature elimination on high-dimensional data. Derived from TCGA data, it contains 801 samples across 5 cancer types (BRCA, KIRC, COAD, LUAD, PRAD) and approximately 20,531 features representing gene expression levels from RNA-Seq. The task is multi-class classification to predict tumor type, with a medium file size (~17MB CSV). Because features far exceed samples (the "curse of dimensionality"), it is a good test bed for techniques such as PCA, mutual information, or recursive feature elimination that identify predictive genes while managing multicollinearity and noise. Find it by searching "Gene Expression Cancer RNA-Seq" on Kaggle or directly at https://www.kaggle.com/datasets/waalbannyantudre/gene-expression-cancer-rna-seq-donated-on-682016.
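The features-far-exceed-samples setting can be rehearsed before downloading anything. The sketch below runs mutual-information selection inside a cross-validated pipeline on synthetic data. The shape (150 samples, 300 features, 5 classes) is an arbitrary small stand-in, not the real dataset, and `k=20` is an illustrative choice.

```python
from functools import partial

from sklearn.datasets import make_classification
from sklearn.feature_selection import SelectKBest, mutual_info_classif
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in with more features than samples and 5 classes.
X, y = make_classification(
    n_samples=150,
    n_features=300,
    n_informative=10,
    n_redundant=0,
    n_classes=5,
    n_clusters_per_class=1,
    random_state=0,
)

# Fix the estimator's random_state so the selected set is reproducible.
mi_score = partial(mutual_info_classif, random_state=0)

pipe = Pipeline(
    [
        ("scale", StandardScaler()),
        ("select", SelectKBest(score_func=mi_score, k=20)),
        ("clf", LogisticRegression(max_iter=1000)),
    ]
)

# Selection is refit in every fold, which matters a lot when p >> n.
scores = cross_val_score(pipe, X, y, cv=5, scoring="accuracy")
print(f"5-fold accuracy: {scores.mean():.3f} (+/- {scores.std():.3f})")

pipe.fit(X, y)
n_selected = int(pipe.named_steps["select"].get_support().sum())
print(f"selected {n_selected} of {X.shape[1]} features")
```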
In one sentence: for medium tabular datasets (roughly 100MB–5GB, often run in Colab), you load data efficiently (pandas or Polars), run EDA (summary stats and plots to surface distribution quirks, missingness, outliers, imbalance, and multicollinearity), apply automated data cleaning (deduplication, type fixes, consistent naming, and principled imputation) to reduce noise and bias, then perform feature selection (filter methods like F-regression or mutual information plus embedded methods like Lasso or XGBoost importances, sometimes with adaptive thresholds or representation reduction such as PCA) so the final model (validated with cross-validation) learns signal instead of overfitting and you can iterate faster with reliable, interpretable improvements.