engineering-ml-features

Feature engineering for machine learning: encoding categorical variables, scaling numeric features, datetime transformations, text features, and leakage-safe preprocessing pipelines. Use when preparing data for modeling or improving model performance through better representations.

Install command

npx skills add https://github.com/legout/data-agent-skills --skill engineering-ml-features

Copy and paste this command into Claude Code to install the skill

Source

legout/data-agent-skills

Stars: 0

Forks: 0

Updated: March 11, 2026 at 17:53

name	engineering-ml-features
description	Feature engineering for machine learning: encoding categorical variables, scaling numeric features, datetime transformations, text features, and leakage-safe preprocessing pipelines. Use when preparing data for modeling or improving model performance through better representations.
dependsOn	["@analyzing-data","@building-data-pipelines"]

Engineering ML Features

Use this skill for creating, transforming, and selecting features that improve model performance. Covers categorical encoding, numeric scaling, datetime engineering, text features, and building leakage-safe pipelines.

When to use this skill

Categorical variables need encoding for ML algorithms
Numeric features require scaling or transformation
Datetime columns need conversion to meaningful features
Text data needs to be converted to numerical representations
Preventing data leakage during feature engineering
Selecting the most predictive features from a large set
Building reusable, production-ready preprocessing pipelines

When NOT to use this skill

General data exploration → use @analyzing-data
Model evaluation and selection → use @evaluating-ml-models
Building interactive data apps → use @building-data-apps
Notebook setup and workflows → use @working-in-notebooks

Quick tool selection

Task	Default choice	Notes
Categorical encoding	category_encoders	Beyond sklearn's limited options
Feature scaling	sklearn.preprocessing	Standard, Robust, Power transforms
Pipeline composition	sklearn.pipeline + ColumnTransformer	Reproducible, CV-safe
Text vectorization	sklearn.feature_extraction.text	TF-IDF, CountVectorizer
Text embeddings	sentence-transformers	Pre-trained semantic embeddings
Feature selection	sklearn.feature_selection	Mutual info, RFE, SelectFromModel

Feature engineering workflows

1. Categorical encoding

Low cardinality (< 10-15 categories): One-hot encoding
High cardinality (> 15-100 categories): Target encoding or frequency encoding
Ordinal: Ordinal encoding with explicit category order

from sklearn.preprocessing import OneHotEncoder, OrdinalEncoder
from category_encoders import TargetEncoder

# One-hot for low cardinality
ohe = OneHotEncoder(handle_unknown='ignore', sparse_output=False)

# Target encoding for high cardinality
te = TargetEncoder(smoothing=10)

# Ordinal for ordered categories
ord_enc = OrdinalEncoder(categories=[['low', 'medium', 'high']])
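Frequency encoding is listed above as an option for high-cardinality columns but has no example. A minimal sketch using pandas follows; the toy data and the column name `city` are assumptions for illustration. The mapping is learned from the training split only and then applied to the test split.

```python
import pandas as pd

# Hypothetical toy data; "city" is an illustrative column name.
train = pd.DataFrame({"city": ["a", "a", "b", "c", "a", "b"]})
test = pd.DataFrame({"city": ["a", "c", "z"]})  # "z" never appears in train

# Learn relative frequencies from the training split only.
freq_map = train["city"].value_counts(normalize=True)

# Apply the learned mapping; categories unseen in training get 0.0.
train["city_freq"] = train["city"].map(freq_map)
test["city_freq"] = test["city"].map(freq_map).fillna(0.0)

print(test)
```

Mapping unseen categories to 0.0 is one possible convention; a small constant or the minimum observed frequency would also work.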

2. Numeric scaling and transformation

Method	Use When	Algorithm Impact
StandardScaler	Features roughly normally distributed, outliers rare	Strongly recommended for SVM, neural nets, PCA
RobustScaler	Outliers present, want median/IQR centering	Same as Standard, more robust
MinMaxScaler	Need bounded range [0,1] or [-1,1]	Neural nets, image data
PowerTransformer	Skewed distributions, want normality	Improves linear model performance
QuantileTransformer	Heavy tails, want uniform/normal	Tree models unaffected, linear improves

from sklearn.preprocessing import StandardScaler, RobustScaler, PowerTransformer

# Standard scaling
scaler = StandardScaler()
X_scaled = scaler.fit_transform(X_train)

# Power transform for skewness
pt = PowerTransformer(method='yeo-johnson')
X_transformed = pt.fit_transform(X_train)
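To illustrate the StandardScaler vs. RobustScaler row in the table above, here is a small sketch on a made-up column containing one extreme outlier. The data is an assumption chosen for illustration only.

```python
import numpy as np
from sklearn.preprocessing import RobustScaler, StandardScaler

# Toy column: four ordinary values and one extreme outlier.
X = np.array([[1.0], [2.0], [3.0], [4.0], [1000.0]])

standard = StandardScaler().fit_transform(X)  # centers on mean, scales by std
robust = RobustScaler().fit_transform(X)      # centers on median, scales by IQR

# With StandardScaler the outlier inflates mean and std, so the four
# ordinary values are squeezed into a very narrow range.
inlier_spread_standard = standard[:4].max() - standard[:4].min()
inlier_spread_robust = robust[:4].max() - robust[:4].min()

print(inlier_spread_standard, inlier_spread_robust)
```

Neither scaler removes the outlier; RobustScaler only keeps it from dominating the scale of the remaining values.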

3. Datetime feature engineering

Extract components and encode cyclical patterns:

import numpy as np

# Component extraction
df['year'] = df['timestamp'].dt.year
df['month'] = df['timestamp'].dt.month
df['dayofweek'] = df['timestamp'].dt.dayofweek
df['hour'] = df['timestamp'].dt.hour

# Cyclical encoding (preserves circular nature)
df['month_sin'] = np.sin(2 * np.pi * df['month'] / 12)
df['month_cos'] = np.cos(2 * np.pi * df['month'] / 12)

# Duration features
df['days_since_start'] = (df['timestamp'] - df['timestamp'].min()).dt.days
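The same sin/cos idea extends to other periodic components. The sketch below wraps it in a helper; `add_cyclical` is a hypothetical name, not part of pandas or this skill. It then checks that 23:00 and 00:00 end up close together in the encoded space, which the raw integer hour does not give you.

```python
import numpy as np
import pandas as pd


def add_cyclical(df, col, period):
    """Return a copy of df with sin/cos columns for a periodic feature.

    Hypothetical helper for illustration; `period` is the cycle length
    (24 for hour, 7 for dayofweek, 12 for month).
    """
    angle = 2 * np.pi * df[col] / period
    out = df.copy()
    out[f"{col}_sin"] = np.sin(angle)
    out[f"{col}_cos"] = np.cos(angle)
    return out


df = pd.DataFrame({
    "timestamp": pd.to_datetime(
        ["2024-01-01 23:00", "2024-01-02 00:00", "2024-01-02 12:00"]
    )
})
df["hour"] = df["timestamp"].dt.hour
df = add_cyclical(df, "hour", period=24)

# Euclidean distances in the (sin, cos) plane.
pts = df[["hour_sin", "hour_cos"]].to_numpy()
d_23_0 = np.linalg.norm(pts[0] - pts[1])   # adjacent hours across midnight
d_0_12 = np.linalg.norm(pts[1] - pts[2])   # opposite sides of the clock

print(d_23_0, d_0_12)
```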

4. Text feature engineering

from sklearn.feature_extraction.text import TfidfVectorizer
from sentence_transformers import SentenceTransformer

# TF-IDF for classical NLP
vectorizer = TfidfVectorizer(max_features=1000, ngram_range=(1, 2))
X_tfidf = vectorizer.fit_transform(texts)

# Embeddings for semantic similarity
model = SentenceTransformer('all-MiniLM-L6-v2')
embeddings = model.encode(texts, show_progress_bar=True)

# Basic text statistics
df['text_length'] = df['text'].str.len()
df['word_count'] = df['text'].str.split().str.len()
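The TF-IDF snippet above fits on a single `texts` collection. When there is a train/test split, the vocabulary and IDF weights should come from the training texts only. A minimal sketch, using made-up sentences as an assumption:

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# Illustrative toy corpora.
train_texts = ["the cat sat", "the dog barked", "a cat and a dog"]
test_texts = ["the bird sang"]

vectorizer = TfidfVectorizer(ngram_range=(1, 2))

# Vocabulary and IDF weights are learned from the training texts only.
X_train_tfidf = vectorizer.fit_transform(train_texts)

# Test texts are projected onto the training vocabulary;
# terms never seen in training (here "bird", "sang") are dropped.
X_test_tfidf = vectorizer.transform(test_texts)

print(X_train_tfidf.shape, X_test_tfidf.shape)
```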

5. Leakage-safe pipelines

Critical rule: Fit every transformer on the training data only, then apply the fitted transformer to validation and test data.

from sklearn.compose import ColumnTransformer
from sklearn.ensemble import RandomForestClassifier
from sklearn.preprocessing import StandardScaler, OneHotEncoder
from sklearn.pipeline import Pipeline

# Define preprocessing for each column type
# (numerical_features and categorical_features are lists of column names)
preprocessor = ColumnTransformer([
    ('num', StandardScaler(), numerical_features),
    ('cat', OneHotEncoder(handle_unknown='ignore'), categorical_features)
])

# Full pipeline
pipeline = Pipeline([
    ('prep', preprocessor),
    ('model', RandomForestClassifier())
])

# Correct: fit on train only
pipeline.fit(X_train, y_train)

# predict() applies the already-fitted preprocessing to X_test; no manual transform needed
y_pred = pipeline.predict(X_test)

CV-safe cross-validation:

from sklearn.model_selection import cross_val_score

# Pipeline ensures preprocessing happens within each CV fold
scores = cross_val_score(pipeline, X, y, cv=5)
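The snippets above assume `X`, `y`, and the feature lists already exist. For reference, here is a self-contained sketch on synthetic data; the column names `age`, `income`, and `segment`, and the way the target is generated, are assumptions made up for this example.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Synthetic data for illustration only.
rng = np.random.default_rng(0)
n = 200
X = pd.DataFrame({
    "age": rng.normal(40, 10, n),
    "income": rng.lognormal(10, 1, n),
    "segment": rng.choice(["a", "b", "c"], n),
})
y = (X["age"] + rng.normal(0, 5, n) > 40).astype(int)

numerical_features = ["age", "income"]
categorical_features = ["segment"]

preprocessor = ColumnTransformer([
    ("num", StandardScaler(), numerical_features),
    ("cat", OneHotEncoder(handle_unknown="ignore"), categorical_features),
])

pipeline = Pipeline([
    ("prep", preprocessor),
    ("model", RandomForestClassifier(n_estimators=50, random_state=0)),
])

# Each fold re-fits the scaler and encoder on that fold's training part,
# so no statistics from the held-out part leak into preprocessing.
scores = cross_val_score(pipeline, X, y, cv=5)
print(scores.mean())
```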

6. Feature selection

Method	Description	Best For
Filter (mutual_info)	Statistical measure vs target	Quick screening, many features
Filter (correlation)	Linear correlation with target	Linear models, fast baseline
Wrapper (RFE)	Recursive feature elimination	Small-medium feature sets
Embedded (L1)	Lasso zeroes out features	Linear models with sparsity
Embedded (tree)	Feature importance from trees	Tree-based models

from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_selection import SelectKBest, mutual_info_classif, RFE
from sklearn.linear_model import Lasso

# Mutual information filter
selector = SelectKBest(mutual_info_classif, k=20)
X_selected = selector.fit_transform(X_train, y_train)

# Recursive feature elimination
rfe = RFE(estimator=RandomForestClassifier(), n_features_to_select=20)
X_rfe = rfe.fit_transform(X_train, y_train)

# L1 regularization (embedded)
# Lasso is a regressor: use it with a continuous target, on scaled features.
# For classification, an L1-penalized LogisticRegression plays the same role.
lasso = Lasso(alpha=0.01)
lasso.fit(X_train, y_train)
selected_features = X_train.columns[lasso.coef_ != 0]  # X_train must be a DataFrame
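The tool table lists SelectFromModel, which is not shown above. A minimal sketch of embedded L1 selection with it follows, on a synthetic regression problem. The dataset parameters and `alpha=1.0` are arbitrary assumptions, not tuned recommendations. Scaling and selection are chained so that both are fit on the training split only.

```python
from sklearn.datasets import make_regression
from sklearn.feature_selection import SelectFromModel
from sklearn.linear_model import Lasso
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic regression data: 20 features, 5 of them informative.
X, y = make_regression(
    n_samples=200, n_features=20, n_informative=5, noise=1.0, random_state=0
)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# Scale first (L1 penalties are scale-sensitive), then keep features
# whose Lasso coefficient is non-zero.
selector = make_pipeline(StandardScaler(), SelectFromModel(Lasso(alpha=1.0)))

X_train_sel = selector.fit_transform(X_train, y_train)  # fit on train only
X_test_sel = selector.transform(X_test)                 # reuse fitted steps

mask = selector[-1].get_support()  # boolean mask over the original columns
n_selected = int(mask.sum())
print(n_selected, "of", X.shape[1], "features kept")
```

In practice `alpha` would be chosen by cross-validation rather than fixed by hand.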

Core implementation rules

1. Prevent data leakage

❌ Wrong: Fitting encoders/scalers on full dataset
✅ Right: fit_transform() on train, transform() on test

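# Train: learn the statistics and apply them
X_train_scaled = scaler.fit_transform(X_train)
# Test - ONLY transform, reusing the statistics learned from train
X_test_scaled = scaler.transform(X_test)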
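One way to confirm the rule is being followed is to check that the fitted statistics match the training split and nothing else. A small sketch on random data, which is an assumption for illustration:

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Random toy features.
rng = np.random.default_rng(42)
X = rng.normal(loc=5.0, scale=2.0, size=(100, 3))
X_train, X_test = train_test_split(X, test_size=0.25, random_state=0)

scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)  # learns mean/std from train
X_test_scaled = scaler.transform(X_test)        # reuses those statistics

# The learned mean is the training mean, not the full-dataset mean.
learned_from_train = np.allclose(scaler.mean_, X_train.mean(axis=0))
print(learned_from_train)

# Scaled test data is generally NOT exactly zero-mean; that is expected.
print(X_test_scaled.mean(axis=0))
```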

2. Handle unknown categories

# Unknown categories become all zeros
OneHotEncoder(handle_unknown='ignore')

# Unknown categories grouped with rare ones
OneHotEncoder(handle_unknown='infrequent_if_exist', min_frequency=0.01)
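A short sketch of what `handle_unknown='ignore'` does at transform time, using made-up color values as an assumption:

```python
import numpy as np
from sklearn.preprocessing import OneHotEncoder

train = np.array([["red"], ["green"], ["blue"]])
test = np.array([["red"], ["purple"]])  # "purple" was never seen in training

enc = OneHotEncoder(handle_unknown="ignore")
enc.fit(train)

# Output columns follow the sorted training categories: blue, green, red.
encoded = enc.transform(test).toarray()
print(enc.categories_[0])
print(encoded)  # the "purple" row is all zeros
```

With the default `handle_unknown='error'`, the same `transform` call would raise instead of returning a zero row.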

3. Track feature names through pipelines

# Get feature names after ColumnTransformer
feature_names = preprocessor.get_feature_names_out()

4. Document feature importance

Track which features were created, why, and their expected impact on model performance.

Common anti-patterns

Anti-pattern	Solution
❌ Fitting preprocessors on full dataset	Use train/test split before any fitting
❌ One-hot encoding high-cardinality features (>100 categories)	Use target encoding or frequency encoding
❌ Ignoring scaling for distance-based models	Always scale for SVM, k-NN, neural nets, PCA
❌ Creating features without domain reasoning	Validate features make business sense
❌ Not validating feature distributions match between train/test	Use distribution tests or visual comparison
❌ Target encoding without smoothing	Use smoothing parameter to handle rare categories
❌ Forgetting cyclical encoding for time	Use sin/cos for hour, dayofweek, month
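For the train/test distribution check in the table above, one possible distribution test is the two-sample Kolmogorov-Smirnov test from SciPy. The sketch below uses synthetic samples; the sample sizes and the size of the shift are assumptions chosen to make the effect obvious.

```python
import numpy as np
from scipy.stats import ks_2samp

# Synthetic feature values.
rng = np.random.default_rng(0)
train_feature = rng.normal(0.0, 1.0, 1000)
test_same = rng.normal(0.0, 1.0, 1000)      # same distribution as train
test_shifted = rng.normal(1.0, 1.0, 1000)   # mean shifted by one std

# A small p-value suggests the two samples come from different distributions.
p_same = ks_2samp(train_feature, test_same).pvalue
p_shifted = ks_2samp(train_feature, test_shifted).pvalue

print(f"same: p={p_same:.3f}  shifted: p={p_shifted:.3g}")
```

With many features, some small p-values will appear by chance, so treat the test as a screening step and follow up with a visual comparison.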

Progressive disclosure

Reference guides for detailed implementations:

references/categorical-encoding.md — Comprehensive encoding strategies and selection guidance
references/datetime-features.md — Time-based feature patterns and cyclical encoding
references/text-features.md — NLP feature engineering with TF-IDF and embeddings
references/feature-selection.md — Selection strategies and implementation patterns

Related skills

@analyzing-data — Understand data before engineering features
@evaluating-ml-models — Validate feature impact on model performance
@building-data-pipelines — Data processing fundamentals and pipeline patterns
