| name | python-regression-statistics |
| description | Expert guidance for regression analysis, statistical modeling, and outlier detection in Python using statsmodels, scikit-learn, scipy, and PyOD - includes model diagnostics, assumption checking, robust methods, and comprehensive outlier detection strategies |
| allowed-tools | ["*"] |
Python Regression & Statistical Analysis
Expert assistant for regression analysis, statistical modeling, and outlier detection in Python. This skill covers statsmodels for statistical regression, scikit-learn for machine learning approaches, scipy for statistical methods, and PyOD for comprehensive outlier detection.
When to Use This Skill
- Fitting linear and nonlinear regression models with proper statistical inference
- Checking and validating regression assumptions (LINE: Linearity, Independence, Normality, Equal variance)
- Performing model diagnostics: residual analysis, influential points, multicollinearity
- Detecting and handling outliers using statistical, proximity-based, and ensemble methods
- Comparing statistical vs machine learning regression approaches
- Building robust regression pipelines with proper validation
- Time series modeling and forecasting
- Generalized linear models (GLM) for non-normal response variables
Philosophy: Statistical vs Machine Learning Approaches
Statistical Regression (statsmodels, scipy.stats)
Use when:
- You need p-values, confidence intervals, and hypothesis testing
- Understanding which variables are significant is important
- You want to interpret coefficients causally
- Sample size is moderate and you need inference guarantees
- You need diagnostic tests for assumptions
Advantages:
- Rich statistical inference (t-tests, F-tests, likelihood ratios)
- Comprehensive diagnostics and assumption tests
- Formula interface for easy model specification
- Standard errors, confidence intervals, prediction intervals
- Well-understood theoretical properties
Limitations:
- Less flexible for complex nonlinear relationships
- Requires careful assumption checking
- May not handle high-dimensional data well
- Less automated feature selection
Machine Learning Regression (scikit-learn)
Use when:
- Prediction accuracy is the primary goal
- You have many features and complex interactions
- Relationships are highly nonlinear
- You don't need statistical inference (p-values)
- You have sufficient data for train/test splits
Advantages:
- Handles complex nonlinear patterns
- Built-in regularization (Ridge, Lasso, ElasticNet)
- Excellent for high-dimensional data
- Automated cross-validation and hyperparameter tuning
- Ensemble methods for improved accuracy
Limitations:
- No automatic p-values or statistical inference
- Less interpretable coefficients
- Requires more data for complex models
- Black-box nature for some algorithms
Recommended Workflow
Statistical Regression with statsmodels
Ordinary Least Squares (OLS)
import statsmodels.api as sm
import statsmodels.formula.api as smf
import numpy as np
import pandas as pd
X = sm.add_constant(X_raw)
model = sm.OLS(y, X)
results = model.fit()
print(results.summary())
df = pd.DataFrame({'y': y, 'x1': x1, 'x2': x2, 'category': cat})
model = smf.ols('y ~ x1 + x2 + C(category)', data=df)
results = model.fit()
results.params
results.pvalues
results.conf_int()
results.rsquared
results.rsquared_adj
results.aic, results.bic
results.resid
results.fittedvalues
Weighted Least Squares (WLS)
weights = 1 / variance_array
model = sm.WLS(y, X, weights=weights)
results = model.fit()
ols_results = sm.OLS(y, X).fit()
resid_squared = ols_results.resid**2
variance_model = sm.OLS(np.log(resid_squared), X).fit()
weights = 1 / np.exp(variance_model.fittedvalues)
wls_results = sm.WLS(y, X, weights=weights).fit()
Robust Regression (RLM)
import statsmodels.robust.robust_linear_model as rlm
model = rlm.RLM(y, X, M=sm.robust.norms.HuberT())
results = model.fit()
model = rlm.RLM(y, X, M=sm.robust.norms.TukeyBiweight())
results = model.fit()
weights = results.weights
Generalized Linear Models (GLM)
model = sm.GLM(y, X, family=sm.families.Poisson())
results = model.fit()
model = sm.GLM(y, X, family=sm.families.NegativeBinomial())
results = model.fit()
model = sm.GLM(y, X, family=sm.families.Gamma())
results = model.fit()
model = sm.GLM(y, X, family=sm.families.Binomial())
results = model.fit()
Time Series Models
from statsmodels.tsa.arima.model import ARIMA
from statsmodels.tsa.statespace.sarimax import SARIMAX
from statsmodels.tsa.vector_ar.var_model import VAR
model = ARIMA(y, order=(p, d, q))
results = model.fit()
model = SARIMAX(y, order=(p, d, q), seasonal_order=(P, D, Q, s))
results = model.fit()
forecast = results.forecast(steps=10)
forecast_with_ci = results.get_forecast(steps=10)
forecast_df = forecast_with_ci.summary_frame()
model = VAR(df[['y1', 'y2', 'y3']])
results = model.fit(maxlags=5)
results.forecast(df.values[-results.k_ar:], steps=10)
Model Diagnostics - statsmodels
from statsmodels.graphics.gofplots import qqplot
import matplotlib.pyplot as plt
plt.scatter(results.fittedvalues, results.resid)
plt.axhline(y=0, color='r', linestyle='--')
plt.xlabel('Fitted values')
plt.ylabel('Residuals')
qqplot(results.resid, line='s')
plt.scatter(results.fittedvalues, np.sqrt(np.abs(results.resid_pearson)))
plt.xlabel('Fitted values')
plt.ylabel('√|Standardized Residuals|')
from statsmodels.stats.outliers_influence import OLSInfluence
influence = OLSInfluence(results)
fig, ax = plt.subplots()
ax.scatter(influence.hat_matrix_diag, results.resid_pearson)
from statsmodels.stats.diagnostic import het_breuschpagan, acorr_ljungbox
from statsmodels.stats.stattools import durbin_watson
lm_stat, lm_pval, f_stat, f_pval = het_breuschpagan(results.resid, results.model.exog)
print(f"Breusch-Pagan p-value: {lm_pval}")
dw = durbin_watson(results.resid)
print(f"Durbin-Watson: {dw}")
lb_test = acorr_ljungbox(results.resid, lags=10)
from scipy.stats import shapiro
stat, pval = shapiro(results.resid)
print(f"Shapiro-Wilk p-value: {pval}")
Influential Points and Multicollinearity
from statsmodels.stats.outliers_influence import variance_inflation_factor, OLSInfluence
vif_data = pd.DataFrame()
vif_data["Variable"] = X.columns
vif_data["VIF"] = [variance_inflation_factor(X.values, i) for i in range(X.shape[1])]
print(vif_data)
influence = OLSInfluence(results)
cooks_d = influence.cooks_distance[0]
dffits = influence.dffits[0]
leverage = influence.hat_matrix_diag
student_resid = influence.resid_studentized_internal
influence_summary = influence.summary_frame()
print(influence_summary)
Machine Learning Regression with scikit-learn
Linear Models with Regularization
from sklearn.linear_model import (
LinearRegression, Ridge, Lasso, ElasticNet,
Lars, LassoLars, BayesianRidge, HuberRegressor,
RANSACRegressor, TheilSenRegressor
)
from sklearn.preprocessing import StandardScaler
from sklearn.pipeline import Pipeline
from sklearn.model_selection import cross_val_score, GridSearchCV
model = LinearRegression()
model.fit(X_train, y_train)
y_pred = model.predict(X_test)
model.coef_, model.intercept_
ridge = Ridge(alpha=1.0)
ridge.fit(X_train, y_train)
lasso = Lasso(alpha=0.1)
lasso.fit(X_train, y_train)
selected_features = X.columns[lasso.coef_ != 0]
elastic = ElasticNet(alpha=0.1, l1_ratio=0.5)
elastic.fit(X_train, y_train)
huber = HuberRegressor(epsilon=1.35)
huber.fit(X_train, y_train)
ransac = RANSACRegressor(min_samples=0.5, residual_threshold=2.0)
ransac.fit(X_train, y_train)
inlier_mask = ransac.inlier_mask_
theil = TheilSenRegressor()
theil.fit(X_train, y_train)
Cross-Validation and Hyperparameter Tuning
from sklearn.model_selection import (
cross_val_score, cross_validate,
GridSearchCV, RandomizedSearchCV,
KFold, RepeatedKFold
)
from sklearn.metrics import (
mean_squared_error, mean_absolute_error,
r2_score, mean_absolute_percentage_error
)
scores = cross_val_score(model, X, y, cv=5,
scoring='neg_mean_squared_error')
rmse_scores = np.sqrt(-scores)
print(f"RMSE: {rmse_scores.mean():.3f} (+/- {rmse_scores.std():.3f})")
cv_results = cross_validate(
model, X, y, cv=5,
scoring=['neg_mean_squared_error', 'r2', 'neg_mean_absolute_error'],
return_train_score=True
)
param_grid = {
'alpha': [0.001, 0.01, 0.1, 1.0, 10.0],
'l1_ratio': [0.1, 0.3, 0.5, 0.7, 0.9]
}
grid_search = GridSearchCV(
ElasticNet(),
param_grid,
cv=5,
scoring='neg_mean_squared_error',
n_jobs=-1
)
grid_search.fit(X_train, y_train)
print(f"Best parameters: {grid_search.best_params_}")
print(f"Best CV score: {np.sqrt(-grid_search.best_score_):.3f}")
best_model = grid_search.best_estimator_
Tree-Based and Ensemble Methods
from sklearn.ensemble import (
RandomForestRegressor, GradientBoostingRegressor,
AdaBoostRegressor, BaggingRegressor, ExtraTreesRegressor
)
from sklearn.tree import DecisionTreeRegressor
rf = RandomForestRegressor(
n_estimators=100,
max_depth=10,
min_samples_split=5,
min_samples_leaf=2,
random_state=42,
n_jobs=-1
)
rf.fit(X_train, y_train)
importance_df = pd.DataFrame({
'feature': X.columns,
'importance': rf.feature_importances_
}).sort_values('importance', ascending=False)
gb = GradientBoostingRegressor(
n_estimators=100,
learning_rate=0.1,
max_depth=3,
subsample=0.8,
random_state=42
)
gb.fit(X_train, y_train)
try:
import xgboost as xgb
xgb_model = xgb.XGBRegressor(
n_estimators=100,
learning_rate=0.1,
max_depth=5,
random_state=42
)
xgb_model.fit(X_train, y_train)
except ImportError:
print("XGBoost not installed. Use: pip install xgboost")
Other ML Regression Methods
from sklearn.svm import SVR
from sklearn.kernel_ridge import KernelRidge
from sklearn.neighbors import KNeighborsRegressor
from sklearn.gaussian_process import GaussianProcessRegressor
from sklearn.gaussian_process.kernels import RBF, WhiteKernel
svr = SVR(kernel='rbf', C=1.0, epsilon=0.1)
svr.fit(X_train, y_train)
kr = KernelRidge(alpha=1.0, kernel='rbf')
kr.fit(X_train, y_train)
knn = KNeighborsRegressor(n_neighbors=5, weights='distance')
knn.fit(X_train, y_train)
kernel = RBF(length_scale=1.0) + WhiteKernel(noise_level=1.0)
gp = GaussianProcessRegressor(kernel=kernel, random_state=42)
gp.fit(X_train, y_train)
y_pred, sigma = gp.predict(X_test, return_std=True)
Building Complete Pipelines
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler, PolynomialFeatures
from sklearn.compose import TransformedTargetRegressor
pipeline = Pipeline([
('scaler', StandardScaler()),
('model', Ridge(alpha=1.0))
])
pipeline.fit(X_train, y_train)
poly_pipeline = Pipeline([
('poly', PolynomialFeatures(degree=2, include_bias=False)),
('scaler', StandardScaler()),
('model', Ridge(alpha=1.0))
])
log_model = TransformedTargetRegressor(
regressor=Ridge(alpha=1.0),
func=np.log1p,
inverse_func=np.expm1
)
log_model.fit(X_train, y_train)
Outlier Detection
Statistical Methods (scipy, statsmodels)
import numpy as np
from scipy import stats
def detect_outliers_zscore(data, threshold=3):
"""
Outliers are points with |z-score| > threshold
Works well for normally distributed data
"""
z_scores = np.abs(stats.zscore(data))
return z_scores > threshold
def detect_outliers_mad(data, threshold=3.5):
"""
More robust than Z-score, uses median instead of mean
Recommended threshold: 3.5
"""
median = np.median(data)
mad = np.median(np.abs(data - median))
modified_z_scores = 0.6745 * (data - median) / mad
return np.abs(modified_z_scores) > threshold
def detect_outliers_iqr(data, factor=1.5):
"""
Outliers are below Q1 - factor*IQR or above Q3 + factor*IQR
factor=1.5: outliers, factor=3.0: extreme outliers
"""
q1 = np.percentile(data, 25)
q3 = np.percentile(data, 75)
iqr = q3 - q1
lower_bound = q1 - factor * iqr
upper_bound = q3 + factor * iqr
return (data < lower_bound) | (data > upper_bound)
def grubbs_test(data, alpha=0.05):
"""
Tests if maximum or minimum is an outlier
Assumes data is normally distributed
"""
from scipy.stats import t
n = len(data)
mean = np.mean(data)
std = np.std(data, ddof=1)
G_max = np.max(np.abs(data - mean)) / std
t_crit = t.ppf(1 - alpha / (2 * n), n - 2)
G_crit = ((n - 1) / np.sqrt(n)) * np.sqrt(t_crit**2 / (n - 2 + t_crit**2))
return G_max > G_crit
data = np.array([1, 2, 3, 4, 5, 6, 7, 8, 9, 100])
outliers_z = detect_outliers_zscore(data)
outliers_mad = detect_outliers_mad(data)
outliers_iqr = detect_outliers_iqr(data)
PyOD - Comprehensive Outlier Detection
PyOD (Python Outlier Detection) provides 40+ algorithms. Install with: pip install pyod
from pyod.models.knn import KNN
from pyod.models.lof import LOF
from pyod.models.iforest import IForest
from pyod.models.ocsvm import OCSVM
from pyod.models.pca import PCA as PCA_OD
from pyod.models.mcd import MCD
from pyod.models.abod import ABOD
from pyod.models.copod import COPOD
from pyod.models.ecod import ECOD
from pyod.models.combination import average, maximization
detector = KNN(contamination=0.1)
detector.fit(X)
outlier_labels = detector.labels_
outlier_scores = detector.decision_scores_
new_labels = detector.predict(X_new)
new_scores = detector.decision_function(X_new)
Proximity-Based Detectors
knn = KNN(contamination=0.1, n_neighbors=5, method='mean')
knn.fit(X)
lof = LOF(contamination=0.1, n_neighbors=20)
lof.fit(X)
from pyod.models.cof import COF
cof = COF(contamination=0.1, n_neighbors=20)
cof.fit(X)
Probabilistic Detectors
copod = COPOD(contamination=0.1)
copod.fit(X)
ecod = ECOD(contamination=0.1)
ecod.fit(X)
abod = ABOD(contamination=0.1)
abod.fit(X)
Linear Model Detectors
pca = PCA_OD(contamination=0.1, n_components=None)
pca.fit(X)
mcd = MCD(contamination=0.1)
mcd.fit(X)
ocsvm = OCSVM(contamination=0.1, kernel='rbf', nu=0.1)
ocsvm.fit(X)
Isolation-Based Detectors
iforest = IForest(
contamination=0.1,
n_estimators=100,
max_features=1.0,
random_state=42
)
iforest.fit(X)
Ensemble Methods
from pyod.models.lscp import LSCP
from pyod.models.feature_bagging import FeatureBagging
fb = FeatureBagging(
base_estimator=LOF(contamination=0.1),
n_estimators=10,
contamination=0.1,
random_state=42
)
fb.fit(X)
detectors = [KNN(), LOF(), IForest(), COPOD()]
scores_list = []
for detector in detectors:
detector.fit(X)
scores_list.append(detector.decision_scores_)
avg_scores = average(scores_list)
max_scores = maximization(scores_list)
scikit-learn Outlier Detection
from sklearn.ensemble import IsolationForest
from sklearn.neighbors import LocalOutlierFactor
from sklearn.covariance import EllipticEnvelope
from sklearn.svm import OneClassSVM
iso_forest = IsolationForest(
contamination=0.1,
random_state=42,
n_estimators=100
)
outlier_labels = iso_forest.fit_predict(X)
outlier_scores = iso_forest.score_samples(X)
lof = LocalOutlierFactor(
n_neighbors=20,
contamination=0.1,
novelty=True
)
lof.fit(X_train)
outlier_labels = lof.predict(X_test)
outlier_scores = lof.score_samples(X_test)
elliptic = EllipticEnvelope(contamination=0.1, random_state=42)
outlier_labels = elliptic.fit_predict(X)
oc_svm = OneClassSVM(kernel='rbf', gamma='auto', nu=0.1)
outlier_labels = oc_svm.fit_predict(X)
Choosing the Right Outlier Detection Method
Use IQR or Z-score when:
- Univariate data (single variable)
- Quick screening needed
- Interpretability is important
Use Modified Z-score (MAD) when:
- Data may already contain outliers (MAD is robust)
- Distribution is roughly symmetric
Use Isolation Forest when:
- High-dimensional data
- No assumptions about data distribution
- Fast computation needed
- Good all-purpose choice
Use LOF when:
- Varying density clusters in data
- Local structure is important
- Willing to tune n_neighbors parameter
Use COPOD/ECOD when:
- Very fast detection needed
- High-dimensional tabular data
- Don't want to tune parameters
Use PCA-based when:
- Data lies in lower-dimensional manifold
- Want to understand which features contribute to outlierness
Use MCD when:
- Multivariate Gaussian assumption reasonable
- Robust covariance estimation needed
Use ensemble methods when:
- Maximum robustness needed
- Can afford computational cost
- Different detector types disagree
Complete Workflow Examples
Example 1: Statistical Regression with Full Diagnostics
import pandas as pd
import numpy as np
import statsmodels.api as sm
import statsmodels.formula.api as smf
from statsmodels.graphics.gofplots import qqplot
from statsmodels.stats.outliers_influence import variance_inflation_factor, OLSInfluence
from statsmodels.stats.diagnostic import het_breuschpagan
from statsmodels.stats.stattools import durbin_watson
import matplotlib.pyplot as plt
df = pd.read_csv('data.csv')
print(df.describe())
df.hist(bins=30, figsize=(15, 10))
pd.plotting.scatter_matrix(df, figsize=(15, 15))
model = smf.ols('price ~ area + bedrooms + age + location', data=df)
results = model.fit()
print(results.summary())
fig, axes = plt.subplots(2, 2, figsize=(12, 10))
axes[0, 0].scatter(results.fittedvalues, results.resid)
axes[0, 0].axhline(y=0, color='r', linestyle='--')
axes[0, 0].set_xlabel('Fitted values')
axes[0, 0].set_ylabel('Residuals')
axes[0, 0].set_title('Residuals vs Fitted')
qqplot(results.resid, line='s', ax=axes[0, 1])
axes[0, 1].set_title('Normal Q-Q')
axes[1, 0].scatter(results.fittedvalues, np.sqrt(np.abs(results.resid_pearson)))
axes[1, 0].set_xlabel('Fitted values')
axes[1, 0].set_ylabel('√|Standardized Residuals|')
axes[1, 0].set_title('Scale-Location')
influence = OLSInfluence(results)
axes[1, 1].scatter(influence.hat_matrix_diag, results.resid_pearson)
axes[1, 1].set_xlabel('Leverage')
axes[1, 1].set_ylabel('Standardized Residuals')
axes[1, 1].set_title('Residuals vs Leverage')
plt.tight_layout()
lm_stat, lm_pval, f_stat, f_pval = het_breuschpagan(results.resid, results.model.exog)
print(f"\nBreusch-Pagan test p-value: {lm_pval:.4f}")
dw = durbin_watson(results.resid)
print(f"Durbin-Watson statistic: {dw:.4f}")
X = df[['area', 'bedrooms', 'age']]
X = sm.add_constant(X)
vif_data = pd.DataFrame()
vif_data["Variable"] = X.columns
vif_data["VIF"] = [variance_inflation_factor(X.values, i) for i in range(X.shape[1])]
print("\n", vif_data)
influence_df = influence.summary_frame()
print("\nInfluential points (Cook's D > 4/n):")
n = len(df)
influential = influence_df[influence_df['cooks_d'] > 4/n]
print(influential[['cooks_d', 'student_resid', 'hat_diag']])
if lm_pval < 0.05:
print("\nFitting robust regression (RLM)...")
from statsmodels.robust.robust_linear_model import RLM
robust_model = RLM(df['price'], X, M=sm.robust.norms.HuberT())
robust_results = robust_model.fit()
print(robust_results.summary())
Example 2: ML Regression with Cross-Validation
from sklearn.model_selection import train_test_split, cross_val_score, GridSearchCV
from sklearn.preprocessing import StandardScaler
from sklearn.pipeline import Pipeline
from sklearn.linear_model import Ridge, Lasso, ElasticNet
from sklearn.ensemble import RandomForestRegressor, GradientBoostingRegressor
from sklearn.metrics import mean_squared_error, r2_score, mean_absolute_error
import pandas as pd
import numpy as np
df = pd.read_csv('data.csv')
X = df.drop('target', axis=1)
y = df['target']
X_train, X_test, y_train, y_test = train_test_split(
X, y, test_size=0.2, random_state=42
)
models = {
'Ridge': Ridge(),
'Lasso': Lasso(),
'ElasticNet': ElasticNet(),
'RandomForest': RandomForestRegressor(random_state=42),
'GradientBoosting': GradientBoostingRegressor(random_state=42)
}
results = {}
for name, model in models.items():
pipeline = Pipeline([
('scaler', StandardScaler()),
('model', model)
])
scores = cross_val_score(
pipeline, X_train, y_train,
cv=5, scoring='neg_mean_squared_error'
)
rmse_scores = np.sqrt(-scores)
results[name] = {
'mean_rmse': rmse_scores.mean(),
'std_rmse': rmse_scores.std()
}
print(f"{name}: RMSE = {rmse_scores.mean():.3f} (+/- {rmse_scores.std():.3f})")
param_grid = {
'model__n_estimators': [50, 100, 200],
'model__max_depth': [5, 10, 15, None],
'model__min_samples_split': [2, 5, 10]
}
pipeline = Pipeline([
('scaler', StandardScaler()),
('model', RandomForestRegressor(random_state=42))
])
grid_search = GridSearchCV(
pipeline, param_grid, cv=5,
scoring='neg_mean_squared_error',
n_jobs=-1, verbose=1
)
grid_search.fit(X_train, y_train)
print(f"\nBest parameters: {grid_search.best_params_}")
print(f"Best CV RMSE: {np.sqrt(-grid_search.best_score_):.3f}")
best_model = grid_search.best_estimator_
y_pred = best_model.predict(X_test)
print(f"\nTest set performance:")
print(f"RMSE: {np.sqrt(mean_squared_error(y_test, y_pred)):.3f}")
print(f"MAE: {mean_absolute_error(y_test, y_pred):.3f}")
print(f"R²: {r2_score(y_test, y_pred):.3f}")
if hasattr(best_model.named_steps['model'], 'feature_importances_'):
importance_df = pd.DataFrame({
'feature': X.columns,
'importance': best_model.named_steps['model'].feature_importances_
}).sort_values('importance', ascending=False)
print("\nTop 10 features:")
print(importance_df.head(10))
Example 3: Outlier Detection Pipeline
import numpy as np
import pandas as pd
from pyod.models.iforest import IForest
from pyod.models.lof import LOF
from pyod.models.copod import COPOD
from sklearn.preprocessing import StandardScaler
import matplotlib.pyplot as plt
df = pd.read_csv('data.csv')
X = df.drop('target', axis=1).values
scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)
detectors = {
'Isolation Forest': IForest(contamination=0.1, random_state=42),
'LOF': LOF(contamination=0.1, n_neighbors=20),
'COPOD': COPOD(contamination=0.1)
}
scores_df = pd.DataFrame(index=df.index)
labels_df = pd.DataFrame(index=df.index)
for name, detector in detectors.items():
detector.fit(X_scaled)
scores_df[name] = detector.decision_scores_
labels_df[name] = detector.labels_
consensus_outliers = (labels_df.sum(axis=1) >= 2)
print(f"Outliers detected by consensus: {consensus_outliers.sum()}")
outlier_indices = df[consensus_outliers].index
print("\nOutlier summary statistics:")
print(df.loc[outlier_indices].describe())
print("\nNormal data summary statistics:")
print(df.loc[~consensus_outliers].describe())
fig, axes = plt.subplots(1, 3, figsize=(15, 4))
for i, (name, scores) in enumerate(scores_df.items()):
axes[i].hist(scores, bins=50, alpha=0.7)
threshold = np.percentile(scores, 90)
axes[i].axvline(threshold, color='r', linestyle='--', label='Threshold')
axes[i].set_title(f'{name} Scores')
axes[i].set_xlabel('Outlier Score')
axes[i].set_ylabel('Frequency')
axes[i].legend()
plt.tight_layout()
df_clean = df[~consensus_outliers].copy()
df['is_outlier'] = consensus_outliers
Example 4: Combined Regression + Outlier Detection
import numpy as np
import pandas as pd
import statsmodels.api as sm
from pyod.models.iforest import IForest
from sklearn.preprocessing import StandardScaler
import matplotlib.pyplot as plt
df = pd.read_csv('data.csv')
X = df[['x1', 'x2', 'x3']].values
y = df['y'].values
scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)
detector = IForest(contamination=0.05, random_state=42)
detector.fit(X_scaled)
outliers_X = detector.labels_ == 1
print(f"Outliers in predictor space: {outliers_X.sum()}")
X_with_const = sm.add_constant(X)
model = sm.OLS(y, X_with_const)
results = model.fit()
from statsmodels.stats.outliers_influence import OLSInfluence
influence = OLSInfluence(results)
studentized_resid = influence.resid_studentized_internal
outliers_y = np.abs(studentized_resid) > 3
print(f"Outliers in response space: {outliers_y.sum()}")
cooks_d = influence.cooks_distance[0]
n, p = X_with_const.shape
influential = cooks_d > 4/n
print(f"Influential points: {influential.sum()}")
df['outlier_X'] = outliers_X
df['outlier_y'] = outliers_y
df['influential'] = influential
df['any_flag'] = outliers_X | outliers_y | influential
fig, axes = plt.subplots(1, 2, figsize=(12, 5))
colors = ['red' if flag else 'blue' for flag in df['any_flag']]
axes[0].scatter(results.fittedvalues, results.resid, c=colors, alpha=0.6)
axes[0].axhline(y=0, color='black', linestyle='--')
axes[0].set_xlabel('Fitted values')
axes[0].set_ylabel('Residuals')
axes[0].set_title('Residuals vs Fitted (red = flagged)')
axes[1].stem(range(len(cooks_d)), cooks_d, markerfmt=',')
axes[1].axhline(y=4/n, color='r', linestyle='--', label='Threshold (4/n)')
axes[1].set_xlabel('Observation index')
axes[1].set_ylabel("Cook's distance")
axes[1].set_title("Cook's Distance")
axes[1].legend()
plt.tight_layout()
print("\n=== Original Model ===")
print(results.summary())
df_clean = df[~df['any_flag']].copy()
X_clean = df_clean[['x1', 'x2', 'x3']].values
y_clean = df_clean['y'].values
X_clean_const = sm.add_constant(X_clean)
model_clean = sm.OLS(y_clean, X_clean_const)
results_clean = model_clean.fit()
print("\n=== Model After Removing Flagged Points ===")
print(results_clean.summary())
from statsmodels.robust.robust_linear_model import RLM
robust_model = RLM(y, X_with_const, M=sm.robust.norms.HuberT())
robust_results = robust_model.fit()
print("\n=== Robust Regression (All Data) ===")
print(robust_results.summary())
Common Pitfalls and Solutions
1. P-hacking and Multiple Testing
Problem: Testing many models/features and reporting only significant ones.
Solution:
- Pre-specify your model based on theory/prior knowledge
- Use adjusted p-values (Bonferroni, Benjamini-Hochberg) for multiple comparisons
- Focus on effect sizes and confidence intervals, not just p-values
- Use train/test split or cross-validation for model selection
from statsmodels.stats.multitest import multipletests
p_values = [0.04, 0.03, 0.001, 0.15, 0.08]
reject, pvals_corrected, _, _ = multipletests(p_values, method='fdr_bh')
2. Overfitting with Many Features
Problem: Model fits training data perfectly but generalizes poorly.
Solution:
- Use regularization (Ridge, Lasso, ElasticNet)
- Cross-validation to estimate generalization performance
- Feature selection based on domain knowledge
- Simpler models when sample size is small
n_samples, n_features = X.shape
if n_samples < 10 * n_features:
print(f"Warning: Only {n_samples/n_features:.1f} samples per feature")
print("Consider: (1) regularization, (2) feature selection, (3) collect more data")
3. Ignoring Assumptions
Problem: Using OLS when assumptions are violated leads to invalid inference.
Solution:
- Always check diagnostics (residual plots, Q-Q plots, tests)
- Transform variables if needed (log, Box-Cox, square root)
- Use robust methods when outliers present
- Use WLS when heteroskedasticity detected
- Use GLM when response distribution is non-normal
from scipy.stats import boxcox
y_transformed, lambda_param = boxcox(y)
print(f"Optimal lambda: {lambda_param}")
4. Removing Outliers Without Justification
Problem: Arbitrarily removing points to improve fit.
Solution:
- Investigate WHY points are outliers (data errors vs real extremes)
- Document all data cleaning decisions
- Try robust methods before removing
- Compare results with/without outliers
- Never remove points just because they don't fit the model
def investigate_outliers(df, outlier_mask):
"""
Print detailed information about detected outliers
"""
outliers = df[outlier_mask]
normal = df[~outlier_mask]
print(f"Number of outliers: {len(outliers)} ({100*len(outliers)/len(df):.1f}%)")
print("\nOutlier characteristics:")
print(outliers.describe())
print("\nNormal data characteristics:")
print(normal.describe())
print("\nPotential data errors:")
for col in df.columns:
if df[col].dtype in [np.float64, np.int64]:
min_val, max_val = df[col].min(), df[col].max()
suspicious = (outliers[col] < min_val * 0.1) | (outliers[col] > max_val * 0.9)
if suspicious.any():
print(f" {col}: {suspicious.sum()} suspicious values")
5. Confusing Outliers vs Influential Points
Problem: Not all outliers are influential, and not all influential points are outliers.
Concepts:
- Outlier in X-space: Unusual predictor values (high leverage)
- Outlier in Y-space: Unusual response value (large residual)
- Influential point: Removing it significantly changes model (high Cook's D)
Solution:
- Check multiple diagnostics: leverage, residuals, Cook's D
- High leverage + small residual = not very influential
- Low leverage + large residual = not very influential
- High leverage + large residual = very influential
def categorize_points(influence, results, n, p):
"""
Categorize observations based on influence measures
"""
leverage = influence.hat_matrix_diag
studentized_resid = influence.resid_studentized_internal
cooks_d = influence.cooks_distance[0]
high_leverage = leverage > 2 * p / n
outlier_y = np.abs(studentized_resid) > 3
influential = cooks_d > 4 / n
categories = []
for i in range(n):
if influential[i]:
categories.append('Influential')
elif high_leverage[i] and outlier_y[i]:
categories.append('High Leverage Outlier')
elif high_leverage[i]:
categories.append('High Leverage')
elif outlier_y[i]:
categories.append('Outlier')
else:
categories.append('Normal')
return pd.Series(categories, name='Category')
6. Not Checking Prediction Intervals
Problem: Reporting point predictions without uncertainty.
Solution:
- Use prediction intervals (statistical models) or quantile regression
- Report confidence intervals for coefficients
- Visualize uncertainty in predictions
predictions = results.get_prediction(X_new)
pred_df = predictions.summary_frame(alpha=0.05)
import matplotlib.pyplot as plt
plt.scatter(X_test, y_test, label='Actual')
plt.plot(X_test, pred_df['mean'], 'r-', label='Prediction')
plt.fill_between(X_test, pred_df['obs_ci_lower'], pred_df['obs_ci_upper'],
alpha=0.2, label='95% Prediction Interval')
plt.legend()
Visualization Best Practices
Diagnostic Plots
import matplotlib.pyplot as plt
import seaborn as sns
from statsmodels.graphics.gofplots import qqplot
def plot_diagnostics(results, figsize=(12, 10)):
"""
Create comprehensive diagnostic plots for regression
"""
fig, axes = plt.subplots(2, 2, figsize=figsize)
axes[0, 0].scatter(results.fittedvalues, results.resid, alpha=0.6)
axes[0, 0].axhline(y=0, color='r', linestyle='--')
axes[0, 0].set_xlabel('Fitted values')
axes[0, 0].set_ylabel('Residuals')
axes[0, 0].set_title('Residuals vs Fitted')
from statsmodels.nonparametric.smoothers_lowess import lowess
smoothed = lowess(results.resid, results.fittedvalues, frac=0.3)
axes[0, 0].plot(smoothed[:, 0], smoothed[:, 1], 'g-', lw=2)
qqplot(results.resid, line='s', ax=axes[0, 1])
axes[0, 1].set_title('Normal Q-Q')
resid_std = np.sqrt(np.abs(results.resid_pearson))
axes[1, 0].scatter(results.fittedvalues, resid_std, alpha=0.6)
axes[1, 0].set_xlabel('Fitted values')
axes[1, 0].set_ylabel('√|Standardized Residuals|')
axes[1, 0].set_title('Scale-Location')
smoothed = lowess(resid_std, results.fittedvalues, frac=0.3)
axes[1, 0].plot(smoothed[:, 0], smoothed[:, 1], 'r-', lw=2)
from statsmodels.stats.outliers_influence import OLSInfluence
influence = OLSInfluence(results)
leverage = influence.hat_matrix_diag
cooks_d = influence.cooks_distance[0]
axes[1, 1].scatter(leverage, results.resid_pearson, alpha=0.6)
axes[1, 1].set_xlabel('Leverage')
axes[1, 1].set_ylabel('Standardized Residuals')
axes[1, 1].set_title('Residuals vs Leverage')
n = len(results.resid)
high_cooks = cooks_d > 4/n
axes[1, 1].scatter(leverage[high_cooks], results.resid_pearson[high_cooks],
color='red', s=100, alpha=0.8, label="High Cook's D")
axes[1, 1].legend()
plt.tight_layout()
return fig
Outlier Score Distributions
def plot_outlier_scores(detectors_dict, X, figsize=(15, 5)):
"""
Plot outlier score distributions for multiple detectors
"""
n_detectors = len(detectors_dict)
fig, axes = plt.subplots(1, n_detectors, figsize=figsize)
if n_detectors == 1:
axes = [axes]
for ax, (name, detector) in zip(axes, detectors_dict.items()):
detector.fit(X)
scores = detector.decision_scores_
labels = detector.labels_
ax.hist(scores[labels == 0], bins=50, alpha=0.6, label='Inliers')
ax.hist(scores[labels == 1], bins=50, alpha=0.6, label='Outliers')
ax.set_xlabel('Outlier Score')
ax.set_ylabel('Frequency')
ax.set_title(name)
ax.legend()
plt.tight_layout()
return fig
Quick Reference
When to Use Which Method
| Goal | Method | Library |
|---|
| Simple linear regression with inference | OLS | statsmodels |
| Regression with correlated predictors | Ridge | sklearn |
| Feature selection in regression | Lasso | sklearn |
| Robust to outliers | RLM, HuberRegressor, RANSACRegressor | statsmodels, sklearn |
| Count data | GLM(Poisson), GLM(NegativeBinomial) | statsmodels |
| Time series forecasting | ARIMA, SARIMAX | statsmodels |
| High-dimensional data | Lasso, ElasticNet, Ridge | sklearn |
| Complex nonlinear patterns | RandomForest, GradientBoosting | sklearn |
| Fast outlier detection | COPOD, ECOD, IsolationForest | PyOD, sklearn |
| Robust outlier detection | LOF, Ensemble methods | PyOD, sklearn |
| Univariate outliers | IQR, MAD, Z-score | scipy |
Key Diagnostics Checklist
Installation Commands
pip install statsmodels scipy
pip install scikit-learn
pip install pyod
pip install xgboost lightgbm
Additional Resources
Documentation
Key Statsmodels Pages
Key Sklearn Pages