| name | ml-causal |
| description | Econometrics skill for machine learning methods in causal inference. Activates when the user asks about:
"causal forest", "generalized random forest", "GRF", "double machine learning", "DML",
"debiased machine learning", "LASSO for variable selection", "post-LASSO",
"heterogeneous treatment effects", "CATE", "conditional average treatment effect",
"BLP analysis", "CLAN analysis", "causal tree", "honest estimation",
"因果森林", "双重机器学习", "异质性处理效应", "条件平均处理效应",
"LASSO变量选择", "机器学习因果推断", "去偏机器学习"
|
Machine Learning for Causal Inference Skill
This skill covers modern ML-based causal inference methods: Causal Forests (GRF) for heterogeneous treatment effects, Double/Debiased Machine Learning (DML) for partially linear models, and LASSO-based variable selection. These methods combine the flexibility of ML with the rigor of econometric identification.
When to Use ML Causal Methods
| Goal | Method |
|---|
| Estimate average treatment effect with many controls | Double ML (DML) |
| Discover treatment effect heterogeneity | Causal Forest (GRF) |
| Variable selection for high-dimensional controls | Post-LASSO |
| Best linear predictor of CATE | BLP analysis |
| Subgroup with largest/smallest effects | CLAN analysis |
Key principle: ML is used for nuisance parameter estimation (predicting Y and D), not for identifying causal effects directly. Identification still requires valid research design (RCT, IV, DID, etc.).
Double/Debiased Machine Learning (DML)
Chernozhukov, Chetverikov, Demirer, Duflo, Hansen, Newey & Robins (2018)
Partially Linear Model
Y = θ·D + g(X) + ε (structural equation)
D = m(X) + v (treatment equation)
θ = causal parameter of interest
g(X), m(X) = unknown nuisance functions estimated by ML
DML Procedure
- Cross-fitting: Split sample into K folds (typically K=5)
- Nuisance estimation: On each fold k, use remaining folds to estimate ĝ(X) and m̂(X) using ML
- Residualize: Compute Ỹ = Y − ĝ(X) and D̃ = D − m̂(X)
- Final estimation: Regress Ỹ on D̃ to obtain θ̂
R — DoubleML
library(DoubleML)
library(mlr3)
library(mlr3learners)
dml_data <- DoubleMLData$new(
data = df,
y_col = "outcome",
d_cols = "treatment",
x_cols = c("x1", "x2", "x3", "x4", "x5")
)
ml_g <- lrn("regr.ranger", num.trees = 500)
ml_m <- lrn("classif.ranger", num.trees = 500)
dml_plr <- DoubleMLPLR$new(dml_data, ml_g, ml_m,
n_folds = 5, n_rep = 10)
dml_plr$fit()
print(dml_plr)
dml_irm <- DoubleMLIRM$new(dml_data, ml_g, ml_m,
n_folds = 5, score = "ATE")
dml_irm$fit()
print(dml_irm)
Python — DoubleML
import doubleml as dml
from sklearn.ensemble import RandomForestRegressor, RandomForestClassifier
dml_data = dml.DoubleMLData(
df, y_col='outcome', d_cols='treatment',
x_cols=['x1', 'x2', 'x3', 'x4', 'x5']
)
ml_g = RandomForestRegressor(n_estimators=500, max_depth=5)
ml_m = RandomForestClassifier(n_estimators=500, max_depth=5)
dml_plr = dml.DoubleMLPLR(dml_data, ml_g, ml_m,
n_folds=5, n_rep=10)
dml_plr.fit()
print(dml_plr.summary)
print(dml_plr.confint())
Stata — ddml
* Stata — ddml (Ahrens et al. 2024)
ssc install ddml
ssc install pystacked
* Partially linear model with cross-fitting
ddml init partial, kfolds(5) reps(10)
ddml E[outcome]: pystacked outcome x1 x2 x3 x4 x5, ///
type(reg) methods(rf gradboost lassocv)
ddml E[treatment]: pystacked treatment x1 x2 x3 x4 x5, ///
type(class) methods(rf gradboost lassocv)
ddml crossfit
ddml estimate, robust
Causal Forest (Generalized Random Forest)
Athey, Tibshirani & Wager (2019)
Estimates Conditional Average Treatment Effects (CATE): τ(x) = E[Y(1) − Y(0) | X = x]
R — grf
library(grf)
X <- as.matrix(df[, c("x1", "x2", "x3", "x4", "x5")])
Y <- df$outcome
W <- df$treatment
cf <- causal_forest(X, Y, W,
num.trees = 4000,
honesty = TRUE,
tune.parameters = "all")
ate <- average_treatment_effect(cf, target.sample = "all")
cat("ATE:", ate["estimate"], "SE:", ate["std.err"], "\n")
att <- average_treatment_effect(cf, target.sample = "treated")
cat("ATT:", att["estimate"], "SE:", att["std.err"], "\n")
cate <- predict(cf, estimate.variance = TRUE)
df$cate_hat <- cate$predictions
df$cate_se <- sqrt(cate$variance.estimates)
varimp <- variable_importance(cf)
names(varimp) <- colnames(X)
sort(varimp, decreasing = TRUE)
Python — econml / grf
from econml.dml import CausalForestDML
cf = CausalForestDML(
model_y=RandomForestRegressor(n_estimators=500),
model_t=RandomForestClassifier(n_estimators=500),
n_estimators=4000,
cv=5,
random_state=42
)
cf.fit(Y=df['outcome'].values,
T=df['treatment'].values,
X=df[['x1', 'x2', 'x3', 'x4', 'x5']].values)
ate = cf.ate_inference()
print(f"ATE: {ate.mean_point:.4f} (SE: {ate.stderr_mean:.4f})")
cate = cf.effect(df[['x1', 'x2', 'x3', 'x4', 'x5']].values)
df['cate_hat'] = cate
BLP Analysis (Best Linear Predictor)
Tests whether CATE varies with observables. From Chernozhukov, Demirer, Duflo & Fernandez-Val (2020).
library(grf)
blp <- best_linear_projection(cf, A = X)
print(blp)
CLAN Analysis (Classification Analysis)
Identifies subgroups with highest/lowest treatment effects.
df$cate_quartile <- cut(df$cate_hat,
breaks = quantile(df$cate_hat, c(0, 0.25, 0.5, 0.75, 1)),
labels = c("Q1 (lowest)", "Q2", "Q3", "Q4 (highest)"),
include.lowest = TRUE)
library(fixest)
gates <- feols(outcome ~ i(cate_quartile, treatment),
data = df, vcov = "HC1")
summary(gates)
clan_table <- df %>%
group_by(cate_quartile) %>%
summarise(across(c(x1, x2, x3, age, income), mean))
print(clan_table)
AIPW / Augmented IPW (Doubly Robust Estimator)
Combines outcome regression and propensity score weighting. Doubly robust: consistent if either the outcome model or the propensity score model is correctly specified (but not necessarily both). Particularly natural for binary treatment and binary/continuous outcomes.
Estimator:
τ_AIPW = E[ μ₁(X) − μ₀(X) + D(Y − μ₁(X))/e(X) − (1−D)(Y − μ₀(X))/(1−e(X)) ]
where μ_d(X) = E[Y|D=d, X] and e(X) = P(D=1|X) are estimated by ML.
Difference from DML: AIPW is more natural for binary treatment/outcome; DML is better suited for continuous treatment or partially linear structural models. Both use cross-fitting.
from econml.dr import DRLearner
from sklearn.ensemble import RandomForestRegressor, RandomForestClassifier
from sklearn.linear_model import LogisticRegressionCV
import numpy as np
Y = df['outcome'].values
T = df['treatment'].values
X = df[['x1', 'x2', 'x3']].values
dr_learner = DRLearner(
model_propensity=LogisticRegressionCV(cv=5),
model_regression=RandomForestRegressor(n_estimators=200),
model_final=RandomForestRegressor(n_estimators=200),
cv=5,
random_state=42
)
dr_learner.fit(Y, T, X=X)
ate = dr_learner.ate_inference(X=X)
print(f"AIPW ATE: {ate.mean_point:.4f} (SE: {ate.stderr_mean:.4f})")
print(f"95% CI: {ate.conf_int_mean()}")
cate = dr_learner.effect(X)
library(AIPW)
library(SuperLearner)
aipw_obj <- AIPW$new(
Y = df$outcome,
A = df$treatment,
W = df[, c("x1", "x2", "x3")],
Q.SL.library = c("SL.ranger", "SL.glm"),
g.SL.library = c("SL.ranger", "SL.glm"),
k_split = 5,
verbose = FALSE
)
aipw_obj$fit()
aipw_obj$summary()
Meta-Learners for CATE Estimation
Meta-learners are general frameworks for estimating CATE that wrap any base ML model. They differ in how they use the treatment variable.
| Learner | Approach | Best When |
|---|
| S-Learner | Single model: fit μ(X, D), then CATE = μ(X,1) − μ(X,0) | Simple baseline; may shrink CATE to zero if D is weak signal |
| T-Learner | Two separate models: μ₁(X) and μ₀(X) | Unequal sample sizes; less regularization shrinkage on treatment |
| X-Learner | Imputes counterfactuals, then fits CATE on imputed residuals | Unbalanced treatment (very few treated or controls) |
| R-Learner | Residualizes Y and D, then fits CATE on residuals | High confounding; closely related to DML |
from econml.metalearners import TLearner, SLearner, XLearner
from econml.dml import LinearDML
from sklearn.ensemble import RandomForestRegressor, RandomForestClassifier
import numpy as np
Y = df['outcome'].values
T = df['treatment'].values
X = df[['x1', 'x2', 'x3']].values
t_learner = TLearner(models=RandomForestRegressor(n_estimators=500))
t_learner.fit(Y, T, X=X)
cate_t = t_learner.effect(X)
print(f"T-Learner ATE: {np.mean(cate_t):.4f}")
s_learner = SLearner(overall_model=RandomForestRegressor(n_estimators=500))
s_learner.fit(Y, T, X=X)
cate_s = s_learner.effect(X)
x_learner = XLearner(
models=RandomForestRegressor(n_estimators=500),
propensity_model=RandomForestClassifier(n_estimators=500)
)
x_learner.fit(Y, T, X=X)
cate_x = x_learner.effect(X)
r_learner = LinearDML(
model_y=RandomForestRegressor(n_estimators=500),
model_t=RandomForestClassifier(n_estimators=500),
cv=5, random_state=42
)
r_learner.fit(Y, T, X=X, W=None)
cate_r = r_learner.effect(X)
print(f"S: {np.mean(cate_s):.4f} | T: {np.mean(cate_t):.4f} | "
f"X: {np.mean(cate_x):.4f} | R: {np.mean(cate_r):.4f}")
library(grf)
X_mat <- as.matrix(df[, c("x1", "x2", "x3")])
Y <- df$outcome; W <- df$treatment
rf1 <- regression_forest(X_mat[W==1, ], Y[W==1])
rf0 <- regression_forest(X_mat[W==0, ], Y[W==0])
mu1 <- predict(rf1, X_mat)$predictions
mu0 <- predict(rf0, X_mat)$predictions
D1 <- Y[W==1] - predict(rf0, X_mat[W==1,])$predictions
D0 <- predict(rf1, X_mat[W==0,])$predictions - Y[W==0]
tau1 <- regression_forest(X_mat[W==1,], D1)
tau0 <- regression_forest(X_mat[W==0,], D0)
e_hat <- regression_forest(X_mat, W)$predictions
cate_x <- e_hat * predict(tau0, X_mat)$predictions +
(1 - e_hat) * predict(tau1, X_mat)$predictions
cat("X-Learner ATE:", mean(cate_x), "\n")
LASSO for Variable Selection
Post-LASSO (Belloni, Chernozhukov & Hansen 2014)
Use LASSO to select controls, then run OLS with selected variables.
library(hdm)
pds <- rlassoEffect(x = X, y = Y, d = W, method = "double selection")
summary(pds)
from sklearn.linear_model import LassoCV
import statsmodels.api as sm
lasso_y = LassoCV(cv=5).fit(X, Y)
selected_y = np.where(lasso_y.coef_ != 0)[0]
lasso_d = LassoCV(cv=5).fit(X, W)
selected_d = np.where(lasso_d.coef_ != 0)[0]
selected = np.union1d(selected_y, selected_d)
X_selected = sm.add_constant(np.column_stack([W, X[:, selected]]))
ols_result = sm.OLS(Y, X_selected).fit(cov_type='HC1')
print(f"Post-LASSO ATE: {ols_result.params[1]:.4f} (SE: {ols_result.bse[1]:.4f})")
* Stata — Post-double-selection LASSO
ssc install lassopack
* pdslasso: post-double-selection
pdslasso outcome treatment (x1-x50), robust
Diagnostics and Validation
Calibration Test for Causal Forest
calibration <- test_calibration(cf)
print(calibration)
Cross-Validated Performance
oob_predictions <- predict(cf)$predictions
cor(oob_predictions, df$true_cate)
Reporting Standards
- Method description: State the ML method used for nuisance estimation (RF, LASSO, boosting)
- Cross-fitting: Report number of folds (K) and repetitions
- ATE with CI: Report point estimate, SE, 95% CI
- Heterogeneity evidence: BLP table, GATES plot, variable importance
- Robustness: Compare DML with different ML methods; compare GRF with traditional subgroup analysis
Key sentence template (DML):
"We estimate the treatment effect using Double Machine Learning (Chernozhukov et al. 2018) with random forests for nuisance estimation, 5-fold cross-fitting, and 10 repetitions. The estimated ATE is [β] (SE = [se], 95% CI: [lb, ub])."
Key sentence template (Causal Forest):
"We estimate heterogeneous treatment effects using a causal forest (Athey et al. 2019) with [N] trees and honest splitting. The calibration test confirms significant heterogeneity (p = [p]). Units in the top quartile of predicted CATE have an estimated effect of [β_Q4] compared to [β_Q1] in the bottom quartile."
Common Pitfalls
- Using ML for identification: ML estimates nuisance parameters, not causal effects. You still need exogenous variation (RCT, IV, etc.)
- Overfitting CATE: Always use honest estimation (separate splitting and estimation samples)
- Interpreting variable importance causally: Variable importance in GRF shows predictive power for heterogeneity, not causal mediation
- Ignoring cross-fitting: Without cross-fitting, DML estimates are biased
See references/ml-causal-reference.md for IV-based causal forests, DML with IV, and simulation studies.