with one click
data-cleaning
// Clean and preprocess datasets by handling missing values, removing duplicates, correcting types, resolving outliers, and enforcing validation schemas.
// Clean and preprocess datasets by handling missing values, removing duplicates, correcting types, resolving outliers, and enforcing validation schemas.
[HINT] Download the complete skill directory including SKILL.md and all related files
| name | data-cleaning |
| description | Clean and preprocess datasets by handling missing values, removing duplicates, correcting types, resolving outliers, and enforcing validation schemas. |
| license | MIT |
| metadata | {"author":"awesome-ai-agent-skills","version":"1.0.0"} |
This skill enables an AI agent to systematically clean and preprocess raw datasets into analysis-ready form. The agent handles missing values, duplicate records, data type mismatches, inconsistent formats, outlier treatment, and normalization. It can also enforce validation schemas to ensure ongoing data quality. The primary toolchain is pandas with support from pyjanitor and great_expectations for advanced validation.
Ingest and profile the raw data. Load the dataset and immediately generate a quality report: count nulls per column, identify duplicate rows, check data types against expected schema, and flag columns with mixed types. This profile drives every subsequent cleaning decision.
Handle missing values. Apply strategy per column based on data type and missingness pattern. For numeric columns with less than 5% missing, use median imputation. For categorical columns, use mode or a dedicated "Unknown" category. For columns missing more than 40%, flag them for potential removal and consult the user before dropping.
Remove duplicates and resolve conflicts. Identify exact duplicates and near-duplicates (e.g., rows differing only in whitespace or casing). For exact duplicates, keep the first occurrence. For near-duplicates, apply fuzzy matching with a configurable similarity threshold and merge conflicting values by recency or completeness.
Correct data types and standardize formats. Coerce columns to their intended types — parse date strings into datetime objects, convert numeric strings to floats, and normalize categorical values to a canonical form. Standardize formats such as phone numbers, postal codes, and currency representations.
Detect and treat outliers. Use the IQR method (1.5x) for symmetric distributions and z-scores for normally distributed data. Offer three treatment options: cap at boundary values (winsorization), replace with null for later imputation, or flag-only mode that annotates but preserves original values.
Validate the cleaned output. Run the cleaned dataset through validation rules — non-null constraints, range checks, uniqueness constraints, and referential integrity. Report any remaining violations and save the clean dataset alongside a cleaning log that documents every transformation applied.
Provide the agent with the file path to the raw dataset and optionally a schema definition specifying expected column types, valid ranges, and uniqueness constraints. The agent will produce a cleaned file and a transformation log.
import pandas as pd
import numpy as np
# Load raw data
df = pd.read_csv("messy_orders.csv")
print(f"Raw shape: {df.shape}") # (2340, 8)
print(df.isnull().sum())
# order_id 0
# customer_name 12
# email 45
# order_date 18
# amount 23
# status 0
# region 67
# discount 0
# 1. Fix data types — order_date has mixed formats
df["order_date"] = pd.to_datetime(df["order_date"], format="mixed", dayfirst=False)
df["amount"] = pd.to_numeric(df["amount"], errors="coerce")
# 2. Handle missing values
df["customer_name"] = df["customer_name"].fillna("Unknown")
df["email"] = df["email"].fillna("missing@placeholder.com")
df["amount"] = df["amount"].fillna(df["amount"].median())
df["region"] = df["region"].fillna(df["region"].mode()[0])
df["order_date"] = df["order_date"].fillna(method="ffill")
# 3. Remove duplicates
before = len(df)
df = df.drop_duplicates(subset=["order_id"], keep="first")
print(f"Removed {before - len(df)} duplicate orders") # Removed 34 duplicate orders
# 4. Standardize categorical values
df["status"] = df["status"].str.strip().str.lower().replace({
"shipped": "shipped", "ship": "shipped",
"cancelled": "cancelled", "canceled": "cancelled",
"pending": "pending", "pend": "pending"
})
df["region"] = df["region"].str.strip().str.title()
# 5. Outlier treatment — cap amounts at IQR bounds
Q1 = df["amount"].quantile(0.25)
Q3 = df["amount"].quantile(0.75)
IQR = Q3 - Q1
lower, upper = Q1 - 1.5 * IQR, Q3 + 1.5 * IQR
df["amount"] = df["amount"].clip(lower=lower, upper=upper)
print(f"Clean shape: {df.shape}") # (2306, 8)
df.to_csv("clean_orders.csv", index=False)
import great_expectations as gx
context = gx.get_context()
# Define a validation suite
suite = context.add_expectation_suite("orders_validation")
# Add expectations
suite.add_expectation(
gx.expectations.ExpectColumnValuesToNotBeNull(column="order_id")
)
suite.add_expectation(
gx.expectations.ExpectColumnValuesToBeBetween(
column="amount", min_value=0.01, max_value=50000.00
)
)
suite.add_expectation(
gx.expectations.ExpectColumnValuesToBeInSet(
column="status", value_set=["pending", "shipped", "delivered", "cancelled"]
)
)
suite.add_expectation(
gx.expectations.ExpectColumnValuesToBeUnique(column="order_id")
)
suite.add_expectation(
gx.expectations.ExpectColumnValuesToMatchRegex(
column="email", regex=r"^[^@]+@[^@]+\.[^@]+$"
)
)
# Run validation against cleaned data
results = context.run_validation(suite, batch=gx.read_csv("clean_orders.csv"))
print(f"Success: {results.success}")
print(f"Passed: {results.statistics['successful_expectations']}/{results.statistics['evaluated_expectations']}")
# Success: True
# Passed: 5/5
errors="coerce" with pd.to_numeric and pd.to_datetime to surface conversion failures as NaNs rather than crashing._1, _2) before any operations.read_csv raises a UnicodeDecodeError, retry with encoding="latin-1" then encoding="cp1252" and log which encoding succeeded.pd.to_datetime(col, format="mixed") and verify the parsed results with spot checks.$, €, commas, and whitespace before type coercion: df["price"].str.replace(r"[$€,\s]", "", regex=True).astype(float).