---
name: statsmodels-statistical-modeling
description: "Python statistical modeling: regression (OLS, WLS, GLM), discrete outcomes (Logit, Poisson, NegBin), time series (ARIMA, SARIMAX, VAR), with rigorous inference, diagnostics, and hypothesis tests. Use scikit-learn for prediction-focused ML; use statistical-analysis for choosing the right statistical test."
license: BSD-3-Clause
---
# statsmodels

## Overview

Statsmodels provides classical statistical modeling with rigorous inference for Python. It covers linear models, generalized linear models, discrete choice, time series, and comprehensive diagnostics. Unlike scikit-learn (prediction-focused), statsmodels emphasizes coefficient interpretation, p-values, confidence intervals, and model diagnostics.
## When to Use

- Fitting linear regression (OLS, WLS, GLS) with detailed coefficient tables and diagnostics
- Running logistic regression with odds ratios and marginal effects for clinical/epidemiological studies
- Analyzing count data with Poisson or negative binomial regression
- Time series forecasting with ARIMA, SARIMAX, or exponential smoothing
- Performing ANOVA, t-tests, or non-parametric tests with proper corrections
- Comparing models using AIC/BIC or likelihood ratio tests
- Testing model assumptions (heteroskedasticity, autocorrelation, normality of residuals)
- Using the R-style formula interface (`y ~ x1 + x2 + C(group)`) for intuitive model specification
- For prediction-focused ML with cross-validation and hyperparameter tuning, use `scikit-learn` instead
- For Bayesian modeling with posterior inference, use `pymc` instead
## Prerequisites

- Python packages: `statsmodels`, `numpy`, `pandas`, `scipy`
- Optional: `matplotlib` (for diagnostic plots), `patsy` (for the formula API; installed with statsmodels)
- Data: tabular data as pandas DataFrames or NumPy arrays

```bash
pip install statsmodels numpy pandas matplotlib
```
## Quick Start

```python
import statsmodels.api as sm
import statsmodels.formula.api as smf
import pandas as pd
import numpy as np

np.random.seed(42)
n = 100
df = pd.DataFrame({
    "x1": np.random.randn(n),
    "x2": np.random.randn(n),
    "group": np.random.choice(["A", "B"], n),
})
df["y"] = 2 + 3 * df["x1"] - 1.5 * df["x2"] + np.random.randn(n)

results = smf.ols("y ~ x1 + x2 + C(group)", data=df).fit()
print(results.summary())
print(f"R²: {results.rsquared:.3f}, AIC: {results.aic:.1f}")
```
## Core API

### Module 1: Linear Regression (OLS, WLS, GLS)

Standard linear models with comprehensive diagnostics.

```python
import statsmodels.api as sm
import numpy as np

np.random.seed(42)
X = np.random.randn(200, 3)
y = 1 + 2 * X[:, 0] - 0.5 * X[:, 1] + np.random.randn(200)

# OLS does not add an intercept automatically; add a constant column
X_const = sm.add_constant(X)
results = sm.OLS(y, X_const).fit()
print(results.summary())
print(f"\nCoefficients: {results.params}")
print(f"P-values: {results.pvalues}")
print(f"R²: {results.rsquared:.4f}")

# Point predictions with confidence and prediction intervals
pred = results.get_prediction(X_const[:5])
print(pred.summary_frame())

# Heteroskedasticity-robust standard errors
results_robust = sm.OLS(y, X_const).fit(cov_type="HC3")
print("Robust SEs:", results_robust.bse)

# WLS with illustrative weights; in practice use inverse-variance weights
weights = 1 / np.abs(results.resid + 0.1)
results_wls = sm.WLS(y, X_const, weights=weights).fit()
print(f"WLS R²: {results_wls.rsquared:.4f}")
```
### Module 2: Generalized Linear Models (GLM)

Extend regression to non-normal outcomes (binary, count, positive continuous).

```python
import statsmodels.api as sm
import numpy as np

np.random.seed(42)
X = np.random.randn(200, 2)
X_const = sm.add_constant(X)
y_counts = np.random.poisson(np.exp(0.5 + 0.3 * X[:, 0]))

model = sm.GLM(y_counts, X_const, family=sm.families.Poisson())
results = model.fit()
print(results.summary())

# Exponentiated Poisson coefficients are rate ratios
rate_ratios = np.exp(results.params)
print(f"Rate ratios: {rate_ratios}")

# Pearson chi²/df well above 1 indicates overdispersion
overdispersion = results.pearson_chi2 / results.df_resid
print(f"Overdispersion ratio: {overdispersion:.2f}")
if overdispersion > 1.5:
    print("→ Consider a Negative Binomial model")
```
### Module 3: Discrete Choice Models (Logit, Probit, Count)

Binary, multinomial, and count outcome models.

```python
import statsmodels.api as sm
import numpy as np

np.random.seed(42)
X = np.random.randn(300, 2)
X_const = sm.add_constant(X)
prob = 1 / (1 + np.exp(-(0.5 + X[:, 0] - 0.5 * X[:, 1])))
y_binary = np.random.binomial(1, prob)

logit_results = sm.Logit(y_binary, X_const).fit()
print(logit_results.summary())

# Exponentiated logit coefficients are odds ratios
odds_ratios = np.exp(logit_results.params)
print(f"Odds ratios: {odds_ratios}")

# Average marginal effects: change in P(Y=1) per unit change in x
margeff = logit_results.get_margeff()
print(margeff.summary())

probs = logit_results.predict(X_const[:5])
print(f"Predicted P(Y=1): {probs}")
```
### Module 4: Time Series (ARIMA, SARIMAX)

Univariate and multivariate time series modeling and forecasting.

```python
import numpy as np
import pandas as pd
from statsmodels.tsa.arima.model import ARIMA
from statsmodels.tsa.statespace.sarimax import SARIMAX
from statsmodels.tsa.stattools import adfuller

np.random.seed(42)
dates = pd.date_range("2020-01-01", periods=200, freq="D")
y = np.cumsum(np.random.randn(200)) + 50
ts = pd.Series(y, index=dates)

# Augmented Dickey-Fuller test for stationarity
adf_result = adfuller(ts)
print(f"ADF statistic: {adf_result[0]:.4f}, p-value: {adf_result[1]:.4f}")
print("Stationary" if adf_result[1] < 0.05 else "Non-stationary → difference")

model = ARIMA(ts, order=(1, 1, 1))
results = model.fit()
print(results.summary())

forecast = results.get_forecast(steps=30)
forecast_df = forecast.summary_frame()  # mean, SE, and confidence bounds
print(f"30-day forecast:\n{forecast_df.head()}")

# Seasonal ARIMA; s=7 captures weekly seasonality in daily data
model_sarima = SARIMAX(ts, order=(1, 1, 1), seasonal_order=(1, 1, 1, 7))
results_sarima = model_sarima.fit(disp=False)
print(f"AIC: {results_sarima.aic:.1f}")
results_sarima.plot_diagnostics(figsize=(12, 8))
```
### Module 5: Statistical Tests and Diagnostics

Assumption tests, hypothesis tests, and model validation.

```python
import numpy as np
import pandas as pd
import statsmodels.api as sm
from statsmodels.stats.diagnostic import het_breuschpagan, acorr_ljungbox
from statsmodels.stats.outliers_influence import variance_inflation_factor
from statsmodels.stats.stattools import jarque_bera

np.random.seed(42)
X = sm.add_constant(np.random.randn(200, 2))
y = 1 + 2 * X[:, 1] + np.random.randn(200) * X[:, 1]  # heteroskedastic errors
results = sm.OLS(y, X).fit()

# Breusch-Pagan: H0 = homoskedastic residuals
bp_stat, bp_p, _, _ = het_breuschpagan(results.resid, X)
print(f"Breusch-Pagan p-value: {bp_p:.4f} {'→ heteroskedastic' if bp_p < 0.05 else '→ OK'}")

# Jarque-Bera: H0 = normally distributed residuals
jb_stat, jb_p, _, _ = jarque_bera(results.resid)
print(f"Jarque-Bera p-value: {jb_p:.4f} {'→ non-normal' if jb_p < 0.05 else '→ OK'}")

# Ljung-Box: H0 = no residual autocorrelation up to the given lag
lb_result = acorr_ljungbox(results.resid, lags=[10], return_df=True)
print(f"Ljung-Box p-value (lag 10): {lb_result['lb_pvalue'].values[0]:.4f}")

# Variance inflation factors (VIF > 10 suggests multicollinearity)
vif_data = pd.DataFrame({
    "Variable": ["const"] + [f"x{i}" for i in range(1, X.shape[1])],
    "VIF": [variance_inflation_factor(X, i) for i in range(X.shape[1])],
})
print(vif_data)
```
### Module 6: Formula API (R-style)

Intuitive model specification using formulas with automatic dummy coding.

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

np.random.seed(42)
df = pd.DataFrame({
    "y": np.random.randn(100),
    "x1": np.random.randn(100),
    "x2": np.random.randn(100),
    "group": np.random.choice(["A", "B", "C"], 100),
})

# C(group) expands to dummy variables automatically
res = smf.ols("y ~ x1 + x2 + C(group)", data=df).fit()
print(res.summary())

# x1 * x2 expands to x1 + x2 + x1:x2
res2 = smf.ols("y ~ x1 * x2", data=df).fit()
print(f"Interaction term p-value: {res2.pvalues['x1:x2']:.4f}")

df["binary"] = (df["y"] > 0).astype(int)
logit_res = smf.logit("binary ~ x1 + x2 + C(group)", data=df).fit()
print(f"Logit AIC: {logit_res.aic:.1f}")
```
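Formulas can also embed transforms: any `np.*` function works inline, and `I()` protects arithmetic from being read as formula operators. A sketch on simulated data where the true model is known, so the fitted coefficients should land near 1:

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

np.random.seed(42)
df = pd.DataFrame({"x": np.random.rand(100) + 0.1, "z": np.random.randn(100)})
df["y"] = np.log(df["x"]) + df["z"] ** 2 + np.random.randn(100) * 0.1

# np.log(x) is applied inline; I(z ** 2) keeps ** as arithmetic
res = smf.ols("y ~ np.log(x) + I(z ** 2)", data=df).fit()
print(res.params)
```

The coefficient names in `res.params` mirror the formula terms (`np.log(x)`, `I(z ** 2)`), which keeps the summary table self-documenting.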
## Common Workflows

### Workflow 1: Complete Regression Analysis

Goal: fit OLS, validate assumptions, and switch to robust SEs if needed.

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf
from statsmodels.stats.diagnostic import het_breuschpagan
from statsmodels.stats.outliers_influence import variance_inflation_factor

np.random.seed(42)
df = pd.DataFrame({"x1": np.random.randn(200), "x2": np.random.randn(200)})
df["y"] = 2 + 3 * df["x1"] - df["x2"] + np.random.randn(200)

results = smf.ols("y ~ x1 + x2", data=df).fit()

# Step 1: test for heteroskedasticity
bp_stat, bp_p, _, _ = het_breuschpagan(results.resid, results.model.exog)
print(f"Breusch-Pagan p: {bp_p:.4f}")
if bp_p < 0.05:
    # Step 2: refit with heteroskedasticity-robust standard errors
    results = smf.ols("y ~ x1 + x2", data=df).fit(cov_type="HC3")
    print("Using HC3 robust standard errors")

# Step 3: check multicollinearity (skip the constant at index 0)
X = results.model.exog
for i in range(1, X.shape[1]):
    print(f"VIF x{i}: {variance_inflation_factor(X, i):.2f}")

print(results.summary())
print(f"\nAIC: {results.aic:.1f}, BIC: {results.bic:.1f}")
```
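Once the diagnostics pass, predictions for new data should report both the confidence interval for the mean response and the wider prediction interval for individual observations. A sketch with `get_prediction` on a formula-based fit (new data goes in as a DataFrame with the original column names):

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

np.random.seed(42)
df = pd.DataFrame({"x1": np.random.randn(200), "x2": np.random.randn(200)})
df["y"] = 2 + 3 * df["x1"] - df["x2"] + np.random.randn(200)
res = smf.ols("y ~ x1 + x2", data=df).fit()

new = pd.DataFrame({"x1": [0.0, 1.0], "x2": [0.0, 0.0]})
pred = res.get_prediction(new).summary_frame(alpha=0.05)

# mean_ci_*: interval for the mean; obs_ci_*: interval for a new observation
print(pred[["mean", "mean_ci_lower", "mean_ci_upper", "obs_ci_lower", "obs_ci_upper"]])
```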
### Workflow 2: Model Comparison

Goal: compare nested and non-nested models using appropriate criteria.

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf
from scipy import stats

np.random.seed(42)
df = pd.DataFrame({"x1": np.random.randn(200), "x2": np.random.randn(200),
                   "x3": np.random.randn(200)})
df["y"] = 1 + 2 * df["x1"] - df["x2"] + 0.1 * df["x3"] + np.random.randn(200)

m1 = smf.ols("y ~ x1", data=df).fit()
m2 = smf.ols("y ~ x1 + x2", data=df).fit()
m3 = smf.ols("y ~ x1 + x2 + x3", data=df).fit()

# AIC/BIC also work for non-nested models; lower is better
comparison = pd.DataFrame({
    "R²": [m.rsquared for m in [m1, m2, m3]],
    "AIC": [m.aic for m in [m1, m2, m3]],
    "BIC": [m.bic for m in [m1, m2, m3]],
}, index=["y~x1", "y~x1+x2", "y~x1+x2+x3"])
print(comparison)

# Likelihood ratio test for nested models
lr_stat = 2 * (m3.llf - m2.llf)
p_val = stats.chi2.sf(lr_stat, df=m3.df_model - m2.df_model)
print(f"\nLR test (m3 vs m2): stat={lr_stat:.2f}, p={p_val:.4f}")
```
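The manual likelihood-ratio computation can also be done with the built-in `compare_lr_test`, which the larger model calls with the restricted model and which returns the statistic, p-value, and degrees-of-freedom difference in one step. A sketch:

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

np.random.seed(42)
df = pd.DataFrame({"x1": np.random.randn(200), "x2": np.random.randn(200)})
df["y"] = 1 + 2 * df["x1"] + np.random.randn(200)

m_small = smf.ols("y ~ x1", data=df).fit()
m_big = smf.ols("y ~ x1 + x2", data=df).fit()

# Larger model first; restricted model as the argument
lr_stat, p_value, df_diff = m_big.compare_lr_test(m_small)
print(f"LR stat={lr_stat:.2f}, p={p_value:.4f}, df={df_diff:.0f}")
```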
### Workflow 3: Time Series Forecasting Pipeline

Goal: test stationarity, identify model order, forecast.

```python
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from statsmodels.tsa.arima.model import ARIMA
from statsmodels.tsa.stattools import adfuller
from statsmodels.graphics.tsaplots import plot_acf, plot_pacf

np.random.seed(42)
ts = pd.Series(np.cumsum(np.random.randn(200)) + 100,
               index=pd.date_range("2020-01-01", periods=200, freq="D"))

# Step 1: stationarity check
adf_p = adfuller(ts)[1]
print(f"ADF p-value: {adf_p:.4f} → {'stationary' if adf_p < 0.05 else 'non-stationary'}")

# Step 2: ACF/PACF of the differenced series to guide (p, q) selection
fig, (ax1, ax2) = plt.subplots(2, 1, figsize=(10, 6))
plot_acf(ts.diff().dropna(), lags=20, ax=ax1)
plot_pacf(ts.diff().dropna(), lags=20, ax=ax2)
plt.savefig("acf_pacf.png", dpi=150, bbox_inches="tight")

# Step 3: fit on a training window and forecast
model = ARIMA(ts[:180], order=(1, 1, 1))
results = model.fit()
forecast = results.get_forecast(steps=20)
fc_df = forecast.summary_frame()
print(f"ARIMA AIC: {results.aic:.1f}")
print(f"Forecast (first 5 days):\n{fc_df.head()}")
```
## Key Parameters

| Parameter | Module | Default | Range / Options | Effect |
|---|---|---|---|---|
| `cov_type` | OLS/WLS/GLM | `"nonrobust"` | `"HC0"`-`"HC3"`, `"HAC"`, `"cluster"` | Robust covariance estimator |
| `family` | GLM | required | `Poisson()`, `Binomial()`, `Gamma()`, etc. | Distribution family |
| `order` | ARIMA | required | `(p, d, q)` tuple | AR order, differencing, MA order |
| `seasonal_order` | SARIMAX | `(0, 0, 0, 0)` | `(P, D, Q, s)` tuple | Seasonal ARIMA parameters |
| `alpha` | `summary()`, `conf_int()` | 0.05 | 0.01-0.10 | Significance level for CIs |
| `maxiter` | all `.fit()` | 35-100 (model-dependent) | 50-1000 | Max optimization iterations |
| `method` | `.fit()` | model-dependent | `"newton"`, `"bfgs"`, `"lbfgs"`, `"powell"` | Optimization algorithm |
| `lags` | ACF/PACF | None | 10-50 | Number of lags to display |
## Best Practices

- **Always add a constant for OLS/GLM**: use `sm.add_constant(X)`, or use the formula API (which adds an intercept automatically)
- **Match the model to the outcome type**: binary → Logit/Probit; counts → Poisson/NegBin; continuous → OLS/WLS; time series → ARIMA
- **Check diagnostics before interpreting**: run Breusch-Pagan (heteroskedasticity), Jarque-Bera (normality), and Ljung-Box (autocorrelation) on residuals
- **Use robust SEs when assumptions fail**: `results = model.fit(cov_type="HC3")` gives heteroskedasticity-robust inference
- **Report effect sizes, not just p-values**: include coefficients, confidence intervals, and R² alongside significance tests
- **Prefer the formula API for exploratory work**: `smf.ols("y ~ x1 * x2 + C(group)", data=df)` is more readable and handles categoricals automatically
- **Test stationarity before time series modeling**: use the ADF test; difference the series if it is non-stationary
## Common Recipes

### Recipe: ANOVA with Post-hoc Tests

When to use: comparing means across 3+ groups.

```python
import numpy as np
import pandas as pd
import statsmodels.api as sm
import statsmodels.formula.api as smf
from statsmodels.stats.multicomp import pairwise_tukeyhsd

np.random.seed(42)
df = pd.DataFrame({"value": np.concatenate([np.random.normal(m, 1, 30) for m in [5, 6, 7]]),
                   "group": np.repeat(["A", "B", "C"], 30)})

anova = smf.ols("value ~ C(group)", data=df).fit()
print(sm.stats.anova_lm(anova))

# Tukey HSD controls the family-wise error rate across all pairwise comparisons
tukey = pairwise_tukeyhsd(df["value"], df["group"], alpha=0.05)
print(tukey)
```
### Recipe: Power Analysis for Sample Size

When to use: determining the required sample size before a study.

```python
from statsmodels.stats.power import TTestIndPower

analysis = TTestIndPower()

# Solve for n given effect size (Cohen's d), alpha, and target power
n = analysis.solve_power(effect_size=0.5, alpha=0.05, power=0.8)
print(f"Required n per group: {n:.0f}")

# Solve for power given a fixed sample size
power = analysis.solve_power(effect_size=0.5, alpha=0.05, nobs1=50)
print(f"Power with n=50: {power:.3f}")
```
### Recipe: Mixed Effects Model

When to use: hierarchical/clustered data (patients within hospitals, students within schools).

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

np.random.seed(42)
df = pd.DataFrame({
    "y": np.random.randn(100),
    "x": np.random.randn(100),
    "group": np.repeat(range(10), 10),
})

# Random intercept per group; fixed effect for x
model = smf.mixedlm("y ~ x", data=df, groups=df["group"])
results = model.fit()
print(results.summary())
```
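If the effect of `x` is expected to vary across groups, `re_formula` adds a random slope alongside the random intercept. A sketch; note that convergence warnings are common when the true variance components are near zero, as in this simulated data:

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

np.random.seed(42)
df = pd.DataFrame({
    "x": np.random.randn(200),
    "group": np.repeat(range(20), 10),
})
df["y"] = 0.5 * df["x"] + np.random.randn(200)

# re_formula="~x" gives each group a random intercept AND a random slope for x
model = smf.mixedlm("y ~ x", data=df, groups=df["group"], re_formula="~x")
results = model.fit()
print(results.summary())
```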
## Troubleshooting

| Problem | Cause | Solution |
|---|---|---|
| `MissingDataError` | NaN values in data | Drop NAs (`df.dropna()`) or impute before fitting |
| No intercept in results | Forgot `sm.add_constant()` | Always add a constant, or use the `smf.ols()` formula API |
| `ConvergenceWarning` | Optimization failed | Increase `maxiter`, try a different `method`, or scale variables |
| Overdispersion in Poisson | Variance > mean | Switch to `NegativeBinomial` or use `GLM(family=NegativeBinomial())` |
| Non-stationary time series | Trend or unit root | Difference the series (`ts.diff()`) or increase `d` in ARIMA |
| Singular matrix error | Perfect multicollinearity | Remove redundant variables; check for VIF > 10 |
| Different results from R | Default settings differ | Check the constant term, link function, optimizer, and SE type |
| `PerfectSeparationError` in Logit | A predictor perfectly separates the classes | Use regularized logistic regression (penalized MLE) or Firth's method |