Collaborators: Victor Chernozhukov (MIT), Malte Kurz (TUM)
Present a simple yet general framework for learning and bounding causal effects that utilizes machine learning (a.k.a. adaptive statistical learning methods)
List of examples all for general, nonseparable models:
(weighted) average potential outcomes; e.g. policy values
average treatment effects, including subgroup effects such as for the treated,
(weighted) average derivatives
average effects from transporting covariates
distributional changes in covariates
Many other examples fall in or extend this framework (mediators, surrogates, dynamic effects)
Machine learning is attractive here because it can learn regression functions very well
However, since ML must "shrink, chop, and throw out" variables to predict well in high-dimensional settings, the learners are biased. These biases transmit into the estimation of the main causal effects
The set-up uses the potential outcomes framework (Imbens and Rubin, 2015). Let Y(d) denote the potential outcome in policy state d. The chosen policy D is assumed to be independent of the potential outcomes conditional on controls X and A: Y(d) \perp D \mid X, A. The observed outcome is generated via Y := Y(D).
Under this conditional exogeneity condition, E[Y(d) \mid X, A] = E[Y \mid D = d, X, A] =: g(d, X, A), that is, the conditional average potential outcome coincides with the regression function
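As a short check (added here for completeness), the identity follows from conditional exogeneity together with Y := Y(D):
E[Y(d) \mid X, A] = E[Y(d) \mid D = d, X, A] = E[Y \mid D = d, X, A] =: g(d, X, A),
where the first equality uses Y(d) \perp D \mid X, A and the second uses Y := Y(D) and D = d on the conditioning event.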
The key examples of causal parameters include the average causal effect (ACE):
\theta = E[Y(1) - Y(0)] = E[g(1, X, A) - g(0, X, A)]
for the case of binary d, and the average causal derivative (ACD), for the case of continuous d:
\theta = E\big[\partial_d Y(d)\big|_{d=D}\big] = E[\partial_d g(D, X, A)]
Average Incremental Effect (AIE): \theta = E[Y(D+1) - Y(D)] = E[g(D+1, X, A)] - E[Y] (a short check follows below).
Average Policy Effect from Covariate shift (APEC): \theta = \int E[Y(d) \mid X = x]\, d(F_1(x) - F_0(x)) = \int E[g(d, x, A) \mid X = x]\, d(F_1(x) - F_0(x)).
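As a check of the AIE formula above (a short derivation added here), conditional exogeneity gives
E[Y(D+1) \mid D = d, X, A] = E[Y(d+1) \mid D = d, X, A] = E[Y(d+1) \mid X, A] = g(d+1, X, A),
so E[Y(D+1)] = E[g(D+1, X, A)] by iterated expectations, and hence \theta = E[g(D+1, X, A)] - E[Y].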
See others in Chernozhukov et al. (2018b), Chernozhukov et al. (2020), Chernozhukov et al. (2021a), Chernozhukov et al. (2022), Singh (2021)
Let W:=(D,X,A) be all observed variables.
Assumption (Target Parameter): The target parameter can be expressed as a continuous linear functional of the long regression, \theta := E\, m(W, g), where g \mapsto E\, m(W, g) is continuous in g with respect to the L^2(P) norm
In the working examples above,
m(W, g) = g(1, X, A) - g(0, X, A) for the ACE and
m(W, g) = \partial_d g(D, X, A) for the ACD.
Weak overlap conditions are required for this continuity to hold
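For instance, for the ACE (a short bound added here for intuition): if P(D = 1 \mid X, A) \geq \epsilon > 0 and P(D = 0 \mid X, A) \geq \epsilon, then
|E[g(1, X, A)]| = \left|E\left[\frac{1(D=1)\, g(W)}{P(D=1 \mid X, A)}\right]\right| \leq \epsilon^{-1} E|g(W)| \leq \epsilon^{-1} \|g\|_{L^2(P)},
and similarly for E[g(0, X, A)], so g \mapsto E\, m(W, g) is continuous with respect to the L^2(P) norm.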
Lemma: Riesz Representation (Chernozhukov et al. 2021a, 2021b)
There exists a unique square-integrable random variable \alpha(W) such that
E\, m(W, g) = E[g(W)\alpha(W)] for all square-integrable g
(Frisch-Waugh) For partially linear models, g(W) = \theta D + f(X, A), for either the ACE or the ACD we have
\alpha(W) = \frac{D - E[D \mid X, A]}{E\left[(D - E[D \mid X, A])^2\right]},
(Horvitz-Thompson) In the case of the ACE,
\alpha(W) = \frac{1(D=1)}{P(D=1 \mid X, A)} - \frac{1(D=0)}{P(D=0 \mid X, A)}
(verified in the short calculation below).
(Powell-Stock-Stoker) For the case of the ACD, \alpha(W) = -\partial_d \log f(D \mid X, A), where f(\cdot \mid X, A) is the conditional density of D.
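As a quick verification of the Horvitz-Thompson case (a short calculation added here): for any square-integrable g,
E\left[g(W)\, \frac{1(D=1)}{P(D=1 \mid X, A)}\right] = E\left[\frac{E[1(D=1)\, g(D, X, A) \mid X, A]}{P(D=1 \mid X, A)}\right] = E[g(1, X, A)],
and the second term similarly yields E[g(0, X, A)], so E[g(W)\alpha(W)] = E[g(1, X, A) - g(0, X, A)] = E\, m(W, g).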
Lemma: Auto Characterization for Representers (Chernozhukov et al. 2021a, 2021b)
\alpha = \arg\min_{a \in L^2(P)} E\left[a(W)^2 - 2\, m(W, a)\right]
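This follows from a one-line calculation (added here for completeness) using the Riesz representation E\, m(W, a) = E[a(W)\alpha(W)]:
E[a(W)^2 - 2\, m(W, a)] = E[a(W)^2 - 2\, a(W)\alpha(W)] = E[(a(W) - \alpha(W))^2] - E[\alpha(W)^2],
which is uniquely minimized over a \in L^2(P) at a = \alpha.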
There are three potential representations of the target parameter \theta: \theta = E\, m(W, g) = E\, Y\alpha = E\, g\alpha, corresponding to the "regression/matching", "propensity score", and "mixed" approaches, respectively
Which one should we use?
In parametric models, the path is wide, because we can learn g or \alpha parametrically and plug either into the expressions above
In low-dimensional nonparametric models, the path is still wide, using flexible parametric approximations (series and sieve methods)
What about modern high-dimensional nonparametric problems, where we are forced to use machine learning to learn g or \alpha?
Narrow the Path: Use all three of the learning approaches: \theta = E m(W,g) + E Y \alpha - E g \alpha.
(Intuitively, each part corrects the bias in the other.)
Narrow the Path Even More: Use Cross-Fitting to Eliminate Overfitting Biases (Entropy Bias)
Debiased machine learning is a generic recipe that isolates the narrow path. It is a
method-of-moments estimator
that utilizes any debiased/orthogonal moment scores
together with cross-fitting
automatic learning of representers aids the construction
Delivers standard, approximately normal inference on the main parameters
Applies more broadly than the setting discussed here, for example to economic models identified through conditional moment restrictions (Chamberlain, 1987), although much more work is needed in this area.
For debiased machine learning we use the representation \theta = E [m(W, g) + (Y- g) \alpha]. For any fixed candidate nuisances \bar g and \bar \alpha we have
E [m(W, \bar g) + (Y- \bar g) \bar \alpha] - \theta = - E [(\bar g- g) (\bar \alpha - \alpha)]. Therefore, this representation has the Neyman orthogonality property:
\partial_{\bar g, \bar \alpha} E [m(W, \bar g) + (Y- \bar g) \bar \alpha] \Big |_{\bar \alpha = \alpha, \bar g = g} =0,
where \partial is the Gateaux (pathwise) derivative operator.
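As a check (a short calculation added here), perturb the nuisances in directions \delta_g and \delta_\alpha:
\partial_t E\left[m(W, g + t\delta_g) + (Y - g - t\delta_g)(\alpha + t\delta_\alpha)\right]\Big|_{t=0} = E\, m(W, \delta_g) - E[\delta_g \alpha] + E[(Y - g)\delta_\alpha] = 0,
since E\, m(W, \delta_g) = E[\delta_g \alpha] by the Riesz representation and E[(Y - g)\delta_\alpha] = 0 by iterated expectations.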
Therefore the estimator is defined as \widehat \theta := \mathrm{DML}(\psi_{\theta}) for the score
\psi_{\theta} (Z; \theta, g, \alpha) := \theta - m(W, g) - (Y- g) \alpha(W).
Generic DML is a method-of-moments estimator that utilizes any Neyman orthogonal score, together with cross-fitting.
R notebook (http://www.kaggle.com/r4hu15in9h/auto-dml)
The regression g is learned on the observations outside fold I_\ell by penalized least squares, for example the Lasso:
\min \sum_{i \notin I_\ell} (Y_i - g(W_i))^2 + \mathrm{pen} (g): \quad g(W_i)= b(W_i)'\gamma, \ \ \mathrm{pen} (g) = \lambda_g \sum_j | \gamma_j|,
where b(W_i) is a dictionary of transformations of W_i, for example polynomials and interactions, and \lambda_g is the penalty level.
The Riesz representer \alpha is learned automatically on the same observations, by minimizing the penalized empirical analogue of E[a(W)^2 - 2\, m(W, a)]:
\min \sum_{i \notin I_\ell} \left[ a^2(W_i) - 2\, m(W_i, a) \right] + \mathrm{pen} (a): \quad a(W_i)= b(W_i)'\rho, \ \ \mathrm{pen} (a) = \lambda_a \sum_j | \rho_j|, where \lambda_a is the penalty level.
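A minimal numerical sketch of these two learners for the ACE, in Python with numpy only (an illustration added here, not the implementation in the linked R notebook): the toy data-generating process (with no unobserved A), the dictionary b, and the use of a ridge (\ell_2) penalty in place of the \ell_1 penalties above are simplifying assumptions, and cross-fitting is omitted for brevity.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 2000

# Toy data (hypothetical DGP): binary treatment D, controls X, outcome Y with true ACE = 1.
X = rng.normal(size=(n, 3))
D = rng.binomial(1, 1 / (1 + np.exp(-X[:, 0])))
Y = D * 1.0 + X @ np.array([0.5, -0.3, 0.2]) + rng.normal(size=n)

def b(d, x):
    """Dictionary of transformations of W = (D, X): intercept, D, X, and D*X interactions."""
    return np.column_stack([np.ones(len(d)), d, x, d[:, None] * x])

B = b(D, X)                                   # b(W_i) at the observed data
B1, B0 = b(np.ones(n), X), b(np.zeros(n), X)  # b(1, X_i) and b(0, X_i)
lam_g, lam_a = 0.1, 0.1                       # penalty levels (ridge here, for simplicity)
k = B.shape[1]

# Learn g: penalized least squares of Y on b(W).
gamma = np.linalg.solve(B.T @ B / n + lam_g * np.eye(k), B.T @ Y / n)

# Learn alpha automatically: minimize the empirical analogue of E[a(W)^2 - 2 m(W, a)]
# with a(W) = b(W)'rho and, for the ACE, m(W, a) = a(1, X) - a(0, X).
M_hat = (B1 - B0).mean(axis=0)                # empirical analogue of E[b(1, X) - b(0, X)]
rho = np.linalg.solve(B.T @ B / n + lam_a * np.eye(k), M_hat)

g_hat, alpha_hat = B @ gamma, B @ rho
g1_hat, g0_hat = B1 @ gamma, B0 @ gamma

# Debiased estimate of the ACE: empirical analogue of E[m(W, g) + (Y - g) alpha].
theta_hat = np.mean(g1_hat - g0_hat + (Y - g_hat) * alpha_hat)
print(theta_hat)  # close to 1 in this toy DGP
```

With an \ell_1 penalty one would solve both problems with a Lasso-type solver instead of the closed-form ridge solutions used here.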
We say that an estimator \hat \beta of \beta is asymptotically linear and Gaussian with the centered influence function \psi^o(Z) if
\sqrt{n} (\hat \beta - \beta) = \frac{1}{\sqrt{n}} \sum_{i=1}^n \psi^o(Z_i) + o_{P} (1) \leadsto N(0, E [\psi^o(Z)^2]).
The application of the results in Chernozhukov et al. (2018a) for linear score functions yields the following result.
Theorem: DML for CEs
Suppose that we can learn g and \alpha sufficiently well, at o_P(n^{-1/4}) rates in L^2(P) norm. Then the DML estimator \hat \theta is asymptotically linear and Gaussian with influence functions:
\psi^o_\theta(Z) := \psi_{\theta} (Z; \theta, g, \alpha), evaluated at the true parameter values. Efficiency follows from Newey (1994). The variance E[\psi^o_\theta(Z)^2] can be estimated by its empirical analogue using the estimated scores.
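For concreteness, a minimal sketch (an illustration added here; the function name and the variable psi_hat, the vector of estimated scores \widehat \psi^o(Z_i) from the DML fit, are hypothetical) of how the estimated scores translate into a standard error and confidence interval:

```python
import numpy as np

def dml_confidence_interval(theta_hat, psi_hat, z=1.96):
    """95% confidence interval from the estimated scores psi_hat (one entry per observation)."""
    n = len(psi_hat)
    se = np.sqrt(np.mean(psi_hat ** 2) / n)  # estimate of sqrt(E[psi^2] / n)
    return theta_hat - z * se, theta_hat + z * se
```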
Guido W Imbens and Donald B Rubin. Causal inference in statistics, social, and biomedical sciences. Cambridge University Press, 2015.
Victor Chernozhukov et al. ‘Automatic Debiased Machine Learning via Neural Nets for Generalized Linear Regression’. In: arXiv preprint arXiv:2104.14737 (2021a).
Victor Chernozhukov et al. ‘Adversarial Estimation of Riesz Representers’. In: arXiv preprint arXiv:2101.00009 (2020).
Rahul Singh. ‘A Finite Sample Theorem for Longitudinal Causal Inference with Machine Learning: Long Term, Dynamic, and Mediated Effects’. In: arXiv preprint arXiv:2112.14249 (2021).
Victor Chernozhukov, Whitney Newey, and Rahul Singh. ‘De-biased machine learning of global and local parameters using regularized Riesz representers’. In: arXiv preprint arXiv:1802.08667 (2018b).
Victor Chernozhukov et al. ‘Automatic Debiased Machine Learning for Dynamic Treatment Effects’. In: arXiv preprint arXiv:2203.13887 (2022).
Victor Chernozhukov et al. ‘RieszNet and ForestRiesz: Automatic Debiased Machine Learning with Neural Nets and Random Forests’ (2021b).
Victor Chernozhukov, Whitney K Newey, and Rahul Singh. ‘Automatic debiased machine learning of causal and structural effects’. In: arXiv preprint arXiv:1809.05224 (2018c).
G. Chamberlain. ‘Asymptotic Efficiency in Estimation with Conditional Moment Restrictions’. In: Journal of Econometrics 34 (1987), pp. 305–334.
Victor Chernozhukov et al. ‘Double/debiased machine learning for treatment and structural parameters’. In: The Econometrics Journal (2018a). ArXiv 2016; arXiv:1608.00060.
Victor Chernozhukov et al. ‘Long Story Short: Omitted Variable Bias in Causal Machine Learning’. In: arXiv preprint arXiv:2112.13398 (2021c).
Input the Neyman-orthogonal score \psi(Z; \beta, \eta), where \eta = (g, \alpha) are the nuisance parameters and \beta is the target parameter. Input a random sample (Z_i:=(Y_i,D_i, X_i,A_i))_{i=1}^n. Then:
Randomly partition \{1,...,n\} into folds (I_{\ell})_{\ell=1}^L of approximately equal size. For each \ell, estimate \widehat \eta_\ell = (\widehat{g}_{\ell}, \widehat{\alpha}_{\ell}) from observations excluding I_{\ell}.
Estimate \beta as a root of
0 = n^{-1} \sum_{\ell=1}^L \sum_{i \in I_\ell} \psi (Z_i; \beta, \widehat \eta_\ell)
Output \widehat \beta and the estimated scores \widehat \psi^o (Z_i) = \psi(Z_i; \widehat \beta, \widehat \eta_{\ell(i)}), where \ell(i) denotes the fold containing observation i.
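A minimal, self-contained sketch of this algorithm for the ACE, in Python with scikit-learn (an illustration added here, not the authors' implementation): the toy data-generating process, the random-forest learners, the trimming threshold, and the use of the Horvitz-Thompson form of \alpha through an estimated propensity score are all assumptions of this sketch.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier, RandomForestRegressor
from sklearn.model_selection import KFold

rng = np.random.default_rng(0)
n = 4000

# Toy DGP (hypothetical): confounded binary treatment with true ACE = 1 and no unobserved A.
X = rng.normal(size=(n, 5))
D = rng.binomial(1, 1 / (1 + np.exp(-X[:, 0] + 0.5 * X[:, 1])))
Y = D + X[:, 0] - X[:, 1] + rng.normal(size=n)

L = 5
m_plus_correction = np.zeros(n)  # m(W, g_hat) + (Y - g_hat) * alpha_hat, per observation

for train_idx, test_idx in KFold(n_splits=L, shuffle=True, random_state=0).split(X):
    # Estimate the nuisances eta = (g, alpha) on the observations excluding the fold I_l.
    DX_train = np.column_stack([D[train_idx], X[train_idx]])
    g_hat = RandomForestRegressor(n_estimators=200, min_samples_leaf=20,
                                  random_state=0).fit(DX_train, Y[train_idx])
    p_hat = RandomForestClassifier(n_estimators=200, min_samples_leaf=20,
                                   random_state=0).fit(X[train_idx], D[train_idx])

    # Evaluate the orthogonal score components on the held-out fold I_l.
    Xt, Dt, Yt = X[test_idx], D[test_idx], Y[test_idx]
    g1 = g_hat.predict(np.column_stack([np.ones(len(test_idx)), Xt]))
    g0 = g_hat.predict(np.column_stack([np.zeros(len(test_idx)), Xt]))
    gD = np.where(Dt == 1, g1, g0)
    ps = np.clip(p_hat.predict_proba(Xt)[:, 1], 0.01, 0.99)  # propensity, trimmed for overlap
    alpha = Dt / ps - (1 - Dt) / (1 - ps)                     # Horvitz-Thompson representer
    m_plus_correction[test_idx] = g1 - g0 + (Yt - gD) * alpha

# Root of 0 = mean(theta - m(W, g) - (Y - g) alpha), plus a standard error from the scores.
theta_hat = m_plus_correction.mean()
psi_hat = theta_hat - m_plus_correction
se = np.sqrt(np.mean(psi_hat ** 2) / n)
print(f"ACE estimate: {theta_hat:.3f} +/- {1.96 * se:.3f}")
```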
We often do not observe A, and therefore we can only identify the short regression:
g_s(D, X) := E [Y \mid D, X] = E [g(D,X,A) |D,X].
Given this short regression, we can compute "short" parameters (approximations) \theta_s of \theta: for the ACE, \theta_s = E [g_s(1,X) - g_s(0,X)], and for the ACD,
\theta_s = E [\partial_d g_s(D,X)].
Our goal is therefore to provide bounds on the omitted variable bias (OVB), \theta_s - \theta, under assumptions that limit the strength of confounding, and to provide DML inference on its size.
For this kind of work we refer to Chernozhukov et al. (2021c).
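A sketch of the structure behind such bounds (a short derivation added here, using the representations above; \alpha_s denotes the Riesz representer of the short functional): for ACE/ACD-type functionals,
\theta_s - \theta = -E[(g - g_s)(\alpha - \alpha_s)], \quad \text{so} \quad |\theta_s - \theta|^2 \leq E[(g - g_s)^2]\, E[(\alpha - \alpha_s)^2]
by the Cauchy-Schwarz inequality; assumptions limiting the strength of confounding bound the two factors on the right, and DML inference on the resulting bounds is developed in Chernozhukov et al. (2021c).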