
Double ML: Causal Inference based on ML

Part III: AutoDML

UAI 2022, August 1, 2022, Eindhoven

Philipp Bach, Martin Spindler (UHH & Economic AI)

Collaborators: Victor Chernozhukov (MIT), Malte Kurz (TUM)

1

Outline

  • Present a simple yet general framework for learning and bounding causal effects that utilizes machine learning (a.k.a. adaptive statistical learning methods)

  • List of examples all for general, nonseparable models:

    • (weighted) average potential outcomes; e.g. policy values

    • average treatment effects, including subgroup effects such as those for the treated

    • (weighted) average derivatives

    • average effects from transporting covariates

    • distributional changes in covariates

Many other examples fall within this framework or extend it (mediators, surrogates, dynamic effects)

2

Outline

  • Using machine learning is great, because we can learn regression functions very well

  • However, since ML must "shrink, chop, and throw out" variables to predict well in high-dimensional settings, the learners are biased. These biases propagate into the estimates of the main causal effects

  • The biases can be eliminated by using carefully crafted -- Neyman orthogonal -- score functions. Additional overfitting biases are dealt with by cross-fitting
  • Another source of bias is the presence of unobserved confounders. These biases cannot be eliminated, but we can bound them and perform inference on their size under hypotheses that limit the strength of confounding
3

Set-up: Causal Inference via Regression

The set-up uses the potential outcomes framework (Imbens and Rubin, 2015). Let Y(d) denote the potential outcome in policy state d. The chosen policy D is assumed to be independent of potential outcomes conditional on controls X and A: Y(d) \perp D \mid X, A. The observed outcome Y is generated via Y := Y(D).

Under the conditional exogeneity condition, E[Y(d) \mid X, A] = E[Y \mid D = d, X, A] =: g(d, X, A); that is, the conditional average potential outcome coincides with the regression function.

4

Running Example

The key examples of causal parameters include the average causal effect (ACE):

\theta = E[Y(1) - Y(0)] = E[g(1, X, A) - g(0, X, A)]

for the case of the binary d, and the average causal derivative (ACD), for the case of continuous d:

\theta = E\big[\partial_d Y(d) \big|_{d=D}\big] = E[\partial_d g(D, X, A)]

5

More Examples

  • Average Incremental Effect (AIE): \theta = E[Y(D+1) - Y(D)] = E[g(D+1, X, A)] - E[Y].

  • Average Policy Effect (APEC) from covariate shift: \theta = \int E[Y(d) \mid X = x] \, d(F_1(x) - F_0(x)) = \int E[g(d, x, A)] \, d(F_1(x) - F_0(x)).

  • See others in Chernozhukov et al. (2018b), Chernozhukov et al. (2020), Chernozhukov et al. (2021a), Chernozhukov et al. (2022), Singh (2021)

6

Case I: No Unobserved Confounders

Let W:=(D,X,A) be all observed variables.

Assumption: Target Parameter. The target parameter can be expressed as a continuous linear functional of the long regression: \theta := E[m(W; g)], where g \mapsto E[m(W; g)] is continuous in g with respect to the L_2(P) norm.

In working examples above

  • m(W, g(W)) = g(1, X, A) - g(0, X, A) for the ACE, and

  • m(W, g(W)) = \partial_d g(D, X, A) for the ACD.

Weak overlap conditions are required to make the continuity hold

7

Case I: No Unobserved Confounders

Lemma: Riesz Representation (Chernozhukov et al. 2021a, 2021b)

There exists a unique square-integrable random variable \alpha(W) such that

E[m(W, g)] = E[g(W) \alpha(W)] for all square-integrable g


Partially linear models

(Frisch-Waugh) For partially linear models, g(W) = \theta D + f(X, A); then for either the ACE or the ACD we have \alpha(W) = \frac{D - E[D \mid X, A]}{E\big[(D - E[D \mid X, A])^2\big]}.
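As a quick numerical sanity check (a hypothetical simulation, not from the slides), the Frisch-Waugh representer can be verified directly: in a simulated partially linear model with known E[D | X], the moment E[Y \alpha(W)] recovers \theta.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 200_000
theta = 2.0

# Partially linear model: D depends on X, Y = theta*D + f(X) + noise.
X = rng.normal(size=n)
D = 0.5 * X + rng.normal(size=n)
Y = theta * D + np.sin(X) + rng.normal(size=n)

# Frisch-Waugh representer: alpha = (D - E[D|X]) / E[(D - E[D|X])^2];
# here E[D|X] = 0.5*X is known by construction.
residual = D - 0.5 * X
alpha = residual / np.mean(residual**2)

theta_hat = np.mean(Y * alpha)   # approximately 2.0
```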

8

Case I: No Unobserved Confounders

General nonparametric models:

  • (Horvitz-Thompson) In the case of the ACE, \alpha(W) = \frac{1(D=1)}{P(D=1 \mid X, A)} - \frac{1(D=0)}{P(D=0 \mid X, A)}.

  • (Powell-Stock-Stoker) For the case of the ACD, \alpha(W) = -\partial_d \log f(D \mid X, A), where f(\cdot \mid X, A) is the conditional density of D.
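The Horvitz-Thompson representer can likewise be checked on simulated data with a known propensity score (a hypothetical example; the sigmoid propensity and linear outcome are assumptions for illustration only).

```python
import numpy as np

rng = np.random.default_rng(1)
n = 400_000

# Binary treatment with known propensity score p(X) = P(D = 1 | X).
X = rng.normal(size=n)
p = 1.0 / (1.0 + np.exp(-X))
D = (rng.uniform(size=n) < p).astype(float)

# Outcome with ACE = E[g(1, X) - g(0, X)] = 1.5.
Y = 1.5 * D + X + rng.normal(size=n)

# Horvitz-Thompson representer for the ACE.
alpha = D / p - (1 - D) / (1 - p)

theta_hat = np.mean(Y * alpha)   # approximately 1.5
```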

9

Case I: No Unobserved Confounders

  • It turns out we do not need closed-form solutions of the RRs for each new problem; we can obtain them automatically

Lemma: Auto Characterization for Representers (Chernozhukov et al. 2021a, 2021b)

\alpha = \arg\min_{a \in L_2(P)} E\big[a^2(W) - 2\, m(W, a)\big]

  • Chernozhukov et al. (2021a) and Chernozhukov et al. (2018c) employ this formulation to learn the Riesz representer without knowing the functional form
  • Chernozhukov et al. (2020) and Chernozhukov et al. (2018b) employ adversarial method of moments to learn the representer without knowing the functional form
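For a linear dictionary a(W) = b(W)'\rho, the empirical analogue of this objective has the closed-form minimizer \rho = E_n[b b']^{-1} E_n[m(W, b)]. The sketch below (a hypothetical dictionary and data-generating process) applies this to the ACD, where m(W, a) = \partial_d a(D, X), and recovers the Powell-Stock-Stoker representer without ever using its functional form.

```python
import numpy as np

rng = np.random.default_rng(2)
n = 200_000

# D | X ~ N(X, 1), so the true ACD representer (Powell-Stock-Stoker)
# is alpha(W) = -d/dD log f(D | X) = D - X.
X = rng.normal(size=n)
D = X + rng.normal(size=n)

# Linear dictionary b(W) and its derivative in d, since m(W, a) = da/dD.
B = np.column_stack([np.ones(n), D, X, D * X])
dB = np.column_stack([np.zeros(n), np.ones(n), np.zeros(n), X])

# argmin_rho E_n[(b'rho)^2 - 2 db'rho]  =>  rho = E_n[b b']^{-1} E_n[db].
rho = np.linalg.solve(B.T @ B / n, dB.mean(axis=0))
alpha_hat = B @ rho    # approximately D - X, i.e. rho ~ (0, 1, -1, 0)
```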
10

Case I: No Unobserved Confounders

Three potential representations for the target parameter \theta:

\theta = E[m(W, g)] = E[Y \alpha] = E[g \alpha],

the "regression matching", "propensity score", and "mixed" approaches, respectively.

Which one should we use?

  • In parametric models, the path is wide: we can use parametric learning of g or \alpha and plug it into any of the expressions above

  • In low-dimensional nonparametric models, still a wide path using flexible parametric approximations (series and sieve methods)

11

Case I: No Unobserved Confounders

What about modern high-dimensional nonparametric problems, when we are forced to use machine learning to learn g or learn \alpha?

  • Problem: shrinkage biases create inferential problems for either one of the approaches
12

Case I: No Unobserved Confounders

Debiased Learning: Narrow the Path

Narrow the Path: Use all three of the learning approaches: \theta = E m(W,g) + E Y \alpha - E g \alpha.

(Intuitively, each part corrects the bias in the other.)
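A small simulation (hypothetical; the shrunk learner mimics regularization bias) makes the correction concrete: a deliberately biased regression learner shifts the plug-in estimate, while the combined moment restores the target.

```python
import numpy as np

rng = np.random.default_rng(3)
n = 400_000

X = rng.normal(size=n)
p = 1.0 / (1.0 + np.exp(-X))
D = (rng.uniform(size=n) < p).astype(float)
g = lambda d, x: 1.5 * d + x            # true regression; ACE = 1.5
Y = g(D, X) + rng.normal(size=n)

g_hat = lambda d, x: 0.7 * g(d, x)      # deliberately shrunk ("biased") learner
alpha = D / p - (1 - D) / (1 - p)       # true Riesz representer (Horvitz-Thompson)

plug_in = np.mean(g_hat(1, X) - g_hat(0, X))             # biased: 0.7 * 1.5 = 1.05
debiased = plug_in + np.mean((Y - g_hat(D, X)) * alpha)  # corrected back to ~1.5
```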

13

Case I: No Unobserved Confounders

Narrow the Path Even More: Use Cross-Fitting to Eliminate Overfitting Biases (Entropy Bias)

14

Big Picture

Debiased machine learning is a generic recipe that isolates the narrow path. It is a

  • method-of-moments estimator

  • that utilizes any debiased/orthogonal moment scores

  • together with cross-fitting

  • automatic learning of representers aids the construction

  • Delivers standard approximately normal inference on main parameters

Applies more broadly than the setting discussed here, for example for economic models identified through conditional method of moments (Chamberlain, 1987), albeit much more work is needed in this area.

15

Theoretical Details

For debiased machine learning we use the representation \theta = E[m(W, g) + (Y - g) \alpha]. We have that

\theta - E[m(W, \bar g) + (Y - \bar g) \bar \alpha] = E[(\bar g - g)(\bar \alpha - \alpha)]. Therefore, this representation has the Neyman orthogonality property:

\partial_{\bar g, \bar \alpha} E [m(W, \bar g) + (Y- \bar g) \bar \alpha] \Big |_{\bar \alpha = \alpha, \bar g = g} =0,

where \partial is the Gateaux (pathwise derivative) operator.
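The orthogonality property can be checked numerically (a hypothetical simulation with known nuisances): perturbing g and \alpha by t in fixed directions moves the moment away from \theta only at order t^2, so doubling t roughly quadruples the error.

```python
import numpy as np

rng = np.random.default_rng(6)
n = 400_000

# Binary-treatment setting with known g, alpha, and theta = 1.5.
X = rng.normal(size=n)
p = 1.0 / (1.0 + np.exp(-X))
D = (rng.uniform(size=n) < p).astype(float)
g = lambda d, x: 1.5 * d + x
Y = g(D, X) + rng.normal(size=n)
alpha = D / p - (1 - D) / (1 - p)

def moment(t):
    """E_n[m(W, g_bar) + (Y - g_bar) * alpha_bar] with nuisances
    perturbed in the directions h(d, x) = x and k(w) = x."""
    g_bar = lambda d, x: g(d, x) + t * x
    alpha_bar = alpha + t * X
    m_term = g_bar(1, X) - g_bar(0, X)
    return np.mean(m_term + (Y - g_bar(D, X)) * alpha_bar)

# First-order insensitivity: the deviation grows quadratically in t.
d1 = moment(0.0) - moment(0.1)
d2 = moment(0.0) - moment(0.2)   # roughly 4 * d1
```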

16

Theoretical Details

Therefore the estimator is defined as \widehat \theta := \mathrm{DML}(\psi_{\theta}) for the score:

\psi_{\theta} (Z; \theta, g, \alpha) := \theta - m(W, g) - (Y - g) \alpha(W).

Generic DML is a method-of-moments estimator that utilizes any Neyman orthogonal score, together with cross-fitting.

17

Theoretical Details

R notebook (http://www.kaggle.com/r4hu15in9h/auto-dml)

Lasso Learner for Nuisance Parameters:

  • Regression Learner: Over a subset of data excluding I_{\ell}:

\min_{\gamma} \sum_{i \notin I_{\ell}} (Y_i - g(W_i))^2 + \mathrm{pen}(g), \qquad g(W_i) = b(W_i)'\gamma, \quad \mathrm{pen}(g) = \lambda_g \sum_j |\gamma_j|,

where b(W_i) is a dictionary of transformations of W_i (for example, polynomials and interactions) and \lambda_g is the penalty level.

18

Theoretical Details

  • Representer Learner: Over a subset of data excluding I_{\ell}:

\min_{\rho} \sum_{i \in I^c_{\ell}} \big( a^2(W_i) - 2\, m(W_i, a) \big) + \mathrm{pen}(a), \qquad a(W_i) = b(W_i)'\rho, \quad \mathrm{pen}(a) = \lambda_a \sum_j |\rho_j|,

where \lambda_a is the penalty level.

  • Can use any high-quality regression learner in place of lasso.
  • Can use random forest and neural network learners of RR. See Chernozhukov et al. (2018c)
19

Theoretical Details

We say that an estimator \hat \beta of \beta is asymptotically linear and Gaussian with the centered influence function \psi^o(Z) if

\sqrt{n} (\hat \beta - \beta) = \frac{1}{\sqrt{n}} \sum_{i=1}^n \psi^o(Z_i) + o_{P}(1) \leadsto N\big(0, E[\psi^o(Z)^2]\big).

The application of the results in Chernozhukov et al. (2018a) for linear score functions yields the following result.

Theorem: DML for CEs

Suppose that we can learn g and \alpha sufficiently well, at o_P(n^{-1/4}) rates in the L^2(P) norm. Then the DML estimator \hat \theta is asymptotically linear and Gaussian with influence function

\psi^o_\theta(Z) := \psi_{\theta}(Z; \theta, g, \alpha), evaluated at the true parameter values. Efficiency follows from Newey (1994). The covariance of the scores can be estimated by its empirical analogue using the estimated scores.

20

References

  • Guido W Imbens and Donald B Rubin. Causal inference in statistics, social, and biomedical sciences. Cambridge University Press, 2015.

  • Victor Chernozhukov et al. ‘Automatic Debiased Machine Learning via Neural Nets for Generalized Linear Regression’. In: arXiv preprint arXiv:2104.14737 (2021a).

  • Victor Chernozhukov et al. ‘Adversarial Estimation of Riesz Representers’. In: arXiv preprint arXiv:2101.00009 (2020).

  • Rahul Singh. ‘A Finite Sample Theorem for Longitudinal Causal Inference with Machine Learning: Long Term, Dynamic, and Mediated Effects’. In: arXiv preprint arXiv:2112.14249 (2021).

  • Victor Chernozhukov, Whitney Newey, and Rahul Singh. ‘De-biased machine learning of global and local parameters using regularized Riesz representers’. In: arXiv preprint arXiv:1802.08667 (2018b).

  • Victor Chernozhukov et al. ‘Automatic Debiased Machine Learning for Dynamic Treatment Effects’. In: arXiv preprint arXiv:2203.13887 (2022).

  • Victor Chernozhukov et al. RieszNet and ForestRiesz: Automatic Debiased Machine Learning with Neural Nets and Random Forests (2021b).

22

References

  • Victor Chernozhukov, Whitney K Newey, and Rahul Singh. ‘Automatic debiased machine learning of causal and structural effects’. In: arXiv preprint arXiv:1809.05224 (2018c).

  • G. Chamberlain. ‘Asymptotic Efficiency in Estimation with Conditional Moment Restrictions’. In: Journal of Econometrics 34 (1987), pp. 305–334.

  • Victor Chernozhukov et al. ‘Double/debiased machine learning for treatment and structural parameters’. In: The Econometrics Journal (2018a). ArXiv 2016; arXiv:1608.00060.

  • Victor Chernozhukov et al. ‘Long Story Short: Omitted Variable Bias in Causal Machine Learning’. In: arXiv preprint arXiv:2112.13398 (2021c)

23

Algorithm

Definition: DML ( \psi )

Input the Neyman-orthogonal score \psi(Z; \beta, \eta), where \eta = (g, \alpha) are nuisance parameters and \beta is the target parameter. Input random sample (Z_i:=(Y_i,D_i, X_i,A_i))_{i=1}^n. Then

  • Randomly partition \{1,...,n\} into folds (I_{\ell})_{\ell=1}^L of approximately equal size. For each \ell, estimate \widehat \eta_\ell = (\widehat{g}_{\ell}, \widehat{\alpha}_{\ell}) from observations excluding I_{\ell}.

  • Estimate \beta as a root of:

0 = n^{-1} \sum_{\ell=1}^L \sum_{i \in I_{\ell}} \psi(Z_i; \beta, \widehat \eta_{\ell})

Output \widehat \beta and the estimated scores \widehat \psi^o(Z_i) = \psi(Z_i; \widehat \beta, \widehat \eta_{\ell(i)}), where \ell(i) denotes the fold containing observation i
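A minimal sketch of this algorithm for the ACE, assuming scikit-learn learners, simulated data, and the closed-form Horvitz-Thompson representer (via an estimated propensity score) in place of an automatic representer learner.

```python
import numpy as np
from sklearn.linear_model import LinearRegression, LogisticRegression
from sklearn.model_selection import KFold

rng = np.random.default_rng(4)
n = 20_000

# Simulated data with a known ACE of 1.5.
X = rng.normal(size=(n, 3))
p = 1.0 / (1.0 + np.exp(-X[:, 0]))
D = (rng.uniform(size=n) < p).astype(float)
Y = 1.5 * D + X[:, 0] + 0.5 * X[:, 1] + rng.normal(size=n)

scores = np.empty(n)
for train, test in KFold(n_splits=5, shuffle=True, random_state=0).split(X):
    # Fit nuisance learners on the complement of the fold I_l.
    g = LinearRegression().fit(np.column_stack([D[train], X[train]]), Y[train])
    m = LogisticRegression().fit(X[train], D[train])

    Dt, Xt, Yt = D[test], X[test], Y[test]
    g1 = g.predict(np.column_stack([np.ones(len(test)), Xt]))
    g0 = g.predict(np.column_stack([np.zeros(len(test)), Xt]))
    gD = g.predict(np.column_stack([Dt, Xt]))
    pt = m.predict_proba(Xt)[:, 1]
    alpha = Dt / pt - (1 - Dt) / (1 - pt)

    # Orthogonal score evaluated on the held-out fold.
    scores[test] = g1 - g0 + (Yt - gD) * alpha

theta_hat = scores.mean()          # debiased ACE estimate
se = scores.std() / np.sqrt(n)     # standard error for normal inference
```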

25

Case II: Subset A of Confounders is not observed

We often do not observe A, and therefore we can only identify the short regression:

g_s(D, X) := E [Y \mid D, X] = E [g(D,X,A) |D,X].

Given this short regression, we can compute "short" parameters (approximations) \theta_s of \theta: for the ACE, \theta_s = E[g_s(1, X) - g_s(0, X)], and for the ACD,

\theta_s = E[\partial_d g_s(D, X)].

Our goal therefore is to bound the omitted variable bias \theta_s - \theta under assumptions that limit the strength of confounding, and to provide DML inference on its size.

For this kind of work we refer to Chernozhukov et al. (2021c).
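A hypothetical linear simulation illustrates the gap: when the confounder A is dropped, the short regression coefficient differs from the long one by the omitted variable bias.

```python
import numpy as np

rng = np.random.default_rng(5)
n = 400_000

# A confounds D and Y but is unobserved; only X is available.
X = rng.normal(size=n)
A = rng.normal(size=n)
D = X + A + rng.normal(size=n)
Y = 1.0 * D + X + 2.0 * A + rng.normal(size=n)   # true effect of D is 1.0

# Long regression (controls for X and A): coefficient on D recovers theta = 1.0.
theta_long = np.linalg.lstsq(
    np.column_stack([np.ones(n), D, X, A]), Y, rcond=None)[0][1]

# Short regression (A omitted): the "short" parameter theta_s is biased (~2.0 here).
theta_s = np.linalg.lstsq(
    np.column_stack([np.ones(n), D, X]), Y, rcond=None)[0][1]
```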

26
