sota-data-cleaning-feature-selection-eda

Name: SOTA Data Cleaning Feature Selection EDA
Author: raphaelmansuy

// Master SOTA data prep for Kaggle comps: automated EDA (Sweetviz), cleaning (Pyjanitor), and feature selection (Polars + XGBoost) for medium datasets (100MB–5GB) in Colab.

stars:0

forks:0

updated:December 26, 2025 at 05:41

package.json

"author": "raphaelmansuy"

"repository": "raphaelmansuy/machine-learning-feature-selection"

$ useful --forSOC

Data Scientists | Computer and Mathematical Occupations | 15-2051 | L4

| Benefit | Impact |
| --- | --- |
| Faster iteration | Automated EDA (Sweetviz) reveals patterns 50% quicker than manual inspection. |
| Accuracy boost | Feature selection cuts 1000s of features to dozens; model accuracy gains 5–15% on tabular tasks. |
| Fits Colab | Polars processes medium data 2× faster than pandas; stays within free tier RAM (12–16GB). |
| Prevents drudgery | Automated cleaning (Pyjanitor, DataPrep) handles inconsistencies 3× faster. |
| Catches bias early | EDA visuals (heatmaps, correlations) spot data leakage and imbalances before training. |
| Future-proof | Integrates AI-assisted cleaning (e.g., COMET), aligned with 2025 ML trends. |

| Good fit | Bad fit |
| --- | --- |
| Medium tabular datasets (100MB–5GB) with noise, missing values, or high dimensionality, where automated cleaning and selection trim the data enough to fit in Colab. | Massive datasets (>10GB) that need distributed tools such as Dask or Spark instead. |
| Imbalanced or skewed competition data, addressed with EDA-driven resampling and statistical selection for better generalization. | Unstructured data (images/text) without prior vectorization; these require domain-specific preprocessing. |
| Quick iteration in time-limited competitions, using automation to make sense of the data before training. | Pure inference optimization; the focus here is pre-training prep, not deployment. |
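One quick way to confirm the "imbalanced data" case before reaching for resampling is to count the classes. The sketch below uses only the standard library; the `imbalance_ratio` helper and the example labels are hypothetical, not part of the skill.

```python
from collections import Counter


def imbalance_ratio(labels):
    """Return (majority_count / minority_count, per-class counts)."""
    counts = Counter(labels)
    if len(counts) < 2:
        raise ValueError("need at least two classes to measure imbalance")
    majority = max(counts.values())
    minority = min(counts.values())
    return majority / minority, dict(counts)


# Hypothetical binary target: 95 negatives, 5 positives
labels = [0] * 95 + [1] * 5
ratio, counts = imbalance_ratio(labels)
print(f"class counts: {counts}, imbalance ratio: {ratio:.1f}:1")
```

A large ratio is a hint to look at stratified splits or resampling; where to draw the line depends on the competition metric.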

flowchart TD
    A[Load\nPolars/pandas] --> B[EDA\nSweetviz / seaborn]
    B --> C[Clean\nPyjanitor / DataPrep]
    C --> D[Feature selection\nFilter + Embedded]
    D --> E[Train\nXGBoost + CV]
    E --> F[Iterate\nerror analysis + new hypotheses]
    F --> B

flowchart TD
    S{Dataset size?} -->|< 100MB| P1[pandas OK]
    S -->|100MB–5GB| P2[Prefer Polars]
    S -->|> 10GB| P3[Spark / distributed]
    P2 --> R{RAM pressure?}
    R -->|Yes| L1[Use lazy scans\npl.scan_csv + filters]
    R -->|No| L2[Use eager load\npl.read_csv + immediate ops]
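The size-based decision in the chart can be written as a small helper. This is a minimal sketch, not part of the skill: the function names and return labels are hypothetical, and because the chart does not say what to do between 5GB and 10GB, the sketch assumes Polars there as well.

```python
import os

MB = 1024 ** 2
GB = 1024 ** 3


def choose_loader(size_bytes: int, ram_pressure: bool = False) -> str:
    """Map a dataset size to the loading strategy from the decision chart."""
    if size_bytes < 100 * MB:
        return "pandas"
    if size_bytes > 10 * GB:
        return "spark"
    # 100MB–5GB per the chart; 5–10GB is unspecified there (assumed Polars here)
    return "polars-lazy" if ram_pressure else "polars-eager"


def choose_loader_for_file(path: str, ram_pressure: bool = False) -> str:
    """Same decision, using the file's size on disk."""
    return choose_loader(os.path.getsize(path), ram_pressure)


print(choose_loader(2 * GB, ram_pressure=True))
```

Here "polars-lazy" stands for `pl.scan_csv` plus filters and "polars-eager" for `pl.read_csv`, matching the two leaves of the chart.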

!pip install sweetviz polars pyarrow -q

import polars as pl
import sweetviz as sv

# Load fast with Polars
df = pl.read_csv('train.csv')

# Sweetviz works on pandas DataFrames, so convert first (to_pandas needs pyarrow)
report = sv.analyze(df.to_pandas(), target_feat='target')
report.show_notebook()  # Shows distributions, correlations, missing patterns

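!pip install pyjanitor -q

import janitor  # Importing registers .clean_names() and .remove_empty() on pandas DataFrames
import pandas as pd

df = pd.read_csv('train.csv')

# Chain cleaning operations
df = (
    df
    .clean_names()   # lowercase + underscores
    .remove_empty()  # drop all-null rows and columns
    # Median fill computed on the cleaned frame, so column names match.
    # To avoid leakage, compute the medians on the training split only
    # and reuse them for validation/test data.
    .pipe(lambda d: d.fillna(d.median(numeric_only=True)))
)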
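Median imputation is only leak-free when the medians come from the training rows alone. The sketch below shows that fit-then-apply pattern in plain Python so the logic stays visible; `fit_medians`, `apply_medians`, and the toy rows are hypothetical illustrations, not pyjanitor functions.

```python
from statistics import median


def fit_medians(rows, columns):
    """Compute per-column medians from training rows, ignoring missing values."""
    medians = {}
    for col in columns:
        observed = [r[col] for r in rows if r.get(col) is not None]
        if observed:
            medians[col] = median(observed)
    return medians


def apply_medians(rows, medians):
    """Fill missing values with previously fitted medians (no refitting)."""
    return [
        {k: (medians.get(k, v) if v is None else v) for k, v in r.items()}
        for r in rows
    ]


train = [
    {"age": 20, "fare": 7.0},
    {"age": None, "fare": 9.0},
    {"age": 40, "fare": None},
]
valid = [{"age": None, "fare": 100.0}]

medians = fit_medians(train, ["age", "fare"])  # fitted on train only
train_filled = apply_medians(train, medians)
valid_filled = apply_medians(valid, medians)   # reuses the train medians
print(medians, valid_filled)
```

With pandas the same idea is to compute `median(numeric_only=True)` on the training frame and pass that result to `fillna` on every split.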

from sklearn.feature_selection import SelectKBest, f_regression
from sklearn.model_selection import train_test_split, cross_val_score
from xgboost import XGBRegressor
import numpy as np

# Assumes df is a pandas DataFrame whose features are numeric and already imputed
X = df.drop('target', axis=1)
y = df['target']
X_train, X_valid, y_train, y_valid = train_test_split(
    X, y, test_size=0.2, random_state=42
)

# Hybrid: Filter + Embedded
# 1) Fast statistical filter (reduces to at most 100 features)
kbest = SelectKBest(score_func=f_regression, k=min(100, X_train.shape[1]))
X_train_filtered = kbest.fit_transform(X_train, y_train)
X_valid_filtered = kbest.transform(X_valid)

# 2) Embedded selection via XGBoost importance
model = XGBRegressor(n_estimators=100, random_state=42, max_depth=5)
model.fit(X_train_filtered, y_train,
          eval_set=[(X_valid_filtered, y_valid)], verbose=False)

# Keep the top 50 features by importance (fewer if fewer remain)
top_idx = np.argsort(model.feature_importances_)[-50:]
X_train_final = X_train_filtered[:, top_idx]
X_valid_final = X_valid_filtered[:, top_idx]  # same columns for validation

# Validate with CV. Both selectors were fit on all of X_train, so this
# score is somewhat optimistic; confirm it on X_valid_final.
score = cross_val_score(XGBRegressor(max_depth=5, random_state=42),
                        X_train_final, y_train, cv=5, scoring='r2')
print(f"5-fold CV R²: {score.mean():.4f} (+/- {score.std():.4f})")
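To see what the filter step is doing, it helps to rank features by hand. For a single feature, the F-statistic used by `f_regression` grows with the squared Pearson correlation, so ranking by absolute correlation gives the same order. The sketch below is a standard-library illustration with hypothetical helper names and toy data, not a replacement for `SelectKBest`.

```python
import math


def pearson(xs, ys):
    """Pearson correlation; returns 0.0 when either input is constant."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    vx = sum((x - mx) ** 2 for x in xs)
    vy = sum((y - my) ** 2 for y in ys)
    if vx == 0 or vy == 0:
        return 0.0
    return cov / math.sqrt(vx * vy)


def top_k_by_correlation(columns, target, k):
    """Return the k column names with the largest |correlation| to target."""
    scores = {name: abs(pearson(vals, target)) for name, vals in columns.items()}
    return sorted(scores, key=scores.get, reverse=True)[:k]


# Toy data: one strong feature, one noisier feature, one constant feature
columns = {
    "signal": [1.0, 2.0, 3.0, 4.0, 5.0],
    "weak": [2.0, 1.0, 4.0, 3.0, 5.0],
    "constant": [1.0, 1.0, 1.0, 1.0, 1.0],
}
target = [2.1, 3.9, 6.2, 8.0, 9.8]

selected = top_k_by_correlation(columns, target, k=2)
print(selected)
```

A univariate filter like this is cheap but blind to feature interactions, which is why the pipeline follows it with a model-based (embedded) step.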

!pip install featuretools -q

import featuretools as ft

# For relational data: organize into an EntitySet
# (customer_df and transaction_df are pandas DataFrames you have already loaded)
es = ft.EntitySet(id='retail_data')
es = es.add_dataframe(dataframe_name='customers', dataframe=customer_df, index='customer_id')
es = es.add_dataframe(dataframe_name='transactions', dataframe=transaction_df, index='trans_id')

# Define relationship (1 customer has many transactions):
# parent dataframe, parent column, child dataframe, child column
es = es.add_relationship('customers', 'customer_id', 'transactions', 'customer_id')

# Auto-synthesize features (aggregations, transforms, etc.)
feature_matrix, feature_defs = ft.dfs(
    entityset=es,
    target_dataframe_name='customers',
    max_depth=2,  # Control feature complexity
    agg_primitives=['sum', 'mean', 'max'],  # Aggregations over transactions
)
print(f"Auto-generated {len(feature_defs)} features from relational structure.")
# Now proceed to selection on feature_matrix
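The aggregation primitives roll child rows up to one row per parent. The sketch below reproduces sum, mean, and max per customer in plain Python to show the shape of the result; the helper, the feature names, and the toy transactions are hypothetical and do not mirror Featuretools' own naming or internals.

```python
from collections import defaultdict


def aggregate_per_parent(child_rows, key, value):
    """Group child rows by a parent key and compute sum/mean/max of one column."""
    groups = defaultdict(list)
    for row in child_rows:
        groups[row[key]].append(row[value])
    return {
        parent: {
            f"sum_{value}": sum(vals),
            f"mean_{value}": sum(vals) / len(vals),
            f"max_{value}": max(vals),
        }
        for parent, vals in groups.items()
    }


# Toy transactions table: customer "a" has two rows, "b" has one
transactions = [
    {"trans_id": 1, "customer_id": "a", "amount": 10.0},
    {"trans_id": 2, "customer_id": "a", "amount": 30.0},
    {"trans_id": 3, "customer_id": "b", "amount": 5.0},
]

features = aggregate_per_parent(transactions, "customer_id", "amount")
print(features)
```

Each parent ends up with one value per primitive and column, so the feature count grows quickly with more primitives and deeper `max_depth`, which is why selection comes next.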

sota-data-cleaning-feature-selection-eda

Quick Wins: Why This Skill Matters

What Problems It Solves (and What It Doesn't)

Mental Model & Key Concepts (The Minimum to Think Correctly)

Glossary (Concepts You Must Not Be Fuzzy About)

The Survival Kit: Actionable Fastest Path to Proficiency

Progressive Complexity Examples (High Value, Minimal but Real)

Cheat Sheet: One-Liners for Speed

Related Technologies & Concepts (Map of the Neighborhood)

Resources

name	sota-data-cleaning-feature-selection-eda
description	Master SOTA data prep for Kaggle comps: automated EDA (Sweetviz), cleaning (Pyjanitor), and feature selection (Polars + XGBoost) for medium datasets (100MB–5GB) in Colab.
