class: center, middle, inverse, title-slide .title[ # Causal Machine Learning with DoubleML ] .subtitle[ ## Introduction to Double Machine Learning ] .author[ ### UseR!2022, June 20, 2022, online ] .date[ ### Philipp Bach, Martin Spindler, Oliver Schacht (Uni Hamburg) ] --- class: inverse center middle, hide-logo ## Motivation for Causal Machine Learning --- ## Motivation for Causal ML <center><img src="figures/causalml.png" height="550px" alt="An illustration of Causal Machine Learning based on three circles. The circle in the middle has the title 'Causal Machine Learning'. There are arrows going from the left and right circle pointing at the circle in the center. Below the left circle that carries the title 'Why' a box lists three points: Learning causal relationships, Going beyond correlations, and Pioneers: Pearl, Rubin, Imbens (Nobel Prize 2021). Below the circle on the right which has the title 'ML', three points are listed: Learning complex patterns in the data, correlation based and good at forecasting." /></center> --- ## Predictive vs. Causal ML .pull-left[ ### Predictive ML #### How can we build a good prediction rule, `\(f(X)\)`, that uses features `\(X\)` to predict `\(Y\)`? Example: **Customer Churn** > "*How well can we predict whether customers churn?*" ] .pull-right[ ### Causal ML #### What is the causal effect of a treatment `\(D\)` on an outcome `\(Y\)`? <br> > "*Why do customer churn?*" > "*How can we retain customers?*" ] --- ## Motivation for Causal ML ### Typical (research) questions in industry, business and economics: .pull-left[ * What is the **effect** of the **new website (feature)** on our **sales**? * Will our new **app design** **increase** the **time spent** in the app? ] .pull-right[ * How much **additional revenue** did our **latest newsletter** **generate**? * Which product would **benefit** most from a **marketing campaign**? ] <br> ### General Question #### What is the causal effect of a treatment `\(D\)` on an outcome `\(Y\)`? --- ## Application: Randomized Experiments .pull-left[ <center><img src="figures/abtest.png" height="165px" alt="An illustration of AB testing. A stylized browser window shows a double-headed rhino which is a variant of the DoubleML package logo. The screen is divided vertically in two parts. The left part of the screen has the tag 'A' and differs from the right part called 'B' in that the colors are inverted." /></center> <br> #### Problem in practice 1. No pure A/B-testing experiments possible 2. A/B test suffers from low power ] .pull-right[ * **General:** What is the **effect** of a **certain variable** `\(D\)` on a relevant **outcome variable** `\(Y\)`? * **Randomized experiments** are a direct way to estimate such effects (assuming they are conducted properly) <br> #### Solution with DoubleML 1. **Observational study**: Include control variables `\(X\)` which may also impact the variables `\(Y\)` or `\(D\)` 2. Include covariates `\(X\)` that help to predict the outcome `\(Y\)` using ML methods. ] --- ## Example: Price Elasticity of Demand #### **Price Elasticity of Demand: How does the price impact sales?** .pull-left[ <center><img src="figures/demand.png" height="165px" alt="An illustration of demand estimation. On the left hand side two hands are displayed on top of each other. Between the hands there are two arrows showing up and down. On the right hand side, there is a price tag with a dollar sign attached to a circular graph." /></center> ] .pull-right[ - Absolute change in price (EUR 100) and the resulting absolute change in sales (10 million units) can be difficult to interpret - **Price elasticity of demand**: Percentage change in quantity demanded `\(Q\)` when there is a one percent increase in price `\(P\)` ] <br> `$$E_d = \frac{\Delta Q / Q}{\Delta P / P} = \frac{-10 / 200}{100 / 1000} = \frac{-0.05}{0.1} = - 0.5$$` **Econometric model** for estimating the **price elasticity** `\(\theta_0\)`: `$$\log(Q) = \alpha +\log(P) \theta_0 + X \beta + \varepsilon,$$` where the vector of controls `\(X\)` can be very **high-dimensional** --- ## Motivation for Causal Machine Learning - Machine Learning methods usually tailored for **prediction** - In econometrics and industry both prediction (stock market, demand, ...) and learning of **causal relationship** is of interest - Here: **Focus on causal inference** with machine Learning methods - Examples for causal inference: - Effect of a new website, app design or the latest newsletter - Price elasticity of demand - General: What is the **effect** of a certain **treatment** on a relevant **outcome** variable? --- ## Motivation for Causal Machine Learning #### Challenge I: Identification of causal parameter **Typical problem** - Potential endogeneity of the treatment assignment * Potential sources - Optimizing behavior of the individuals with regard to the outcome - Simultaneity (price elasticity of demand) - Unobserved heterogeneity - Omitted variables - The treatment assignment is observed rather than randomly assigned - Example: Covid vaccination <br> * Possible Solutions - **Selection on observable characteristics (controls)** - Instrumental Variable (IV) estimation - ... --- ## Motivation for Causal Machine Learning #### Challenge II: "Big data" * **High-dimensional setting** with `\(p\)` (# of variables) even larger than `\(n\)` (# of observations), and / or * a highly **non-linear functional form** for the relationship between variables of interest #### Solution * Use ML methods for estimation that ... + ... perform regularization, e.g., variable selection + ... are able to model non-linearities --- ## Motivation for Causal Machine Learning #### Challenge I & II: Controls are needed for two reasons .pull-left[ #### **Identification:** We need them to ensure `\(D\)` is as good as randomly assigned (exogenous) *conditional on* `\(X\)`. Along these lines, we can think of the confounding equation: `$$D = m_0(X) + V, E[V|X] = 0.$$` `\(\rightarrow\)` Variation in `\(V\)` is quasi-experimental <img src="Lect1_Introduction_to_DML_files/figure-html/unnamed-chunk-1-1.png" title="Case I: Identification. A directed acyclical graph with nodes D, X, V and Y. The following edges connect the nodes. From V to D, from X to D, from X to Y, from D to Y" alt="Case I: Identification. A directed acyclical graph with nodes D, X, V and Y. The following edges connect the nodes. From V to D, from X to D, from X to Y, from D to Y" width="100%" /> ] .pull-right[ #### **Efficiency:** Some controls may explain part of the variation in the outcome `\(Y\)` ML methods may deliver more accurate results than OLS or `\(t\)`-tests <br> `\(\rightarrow\)` The effect can be estimated more accurately <img src="Lect1_Introduction_to_DML_files/figure-html/unnamed-chunk-2-1.png" title="Case II: Efficiency. A directed acyclical graph with nodes D, X, V and Y. The following edges connect the nodes. From V to D, from X to Y, from D to Y" alt="Case II: Efficiency. A directed acyclical graph with nodes D, X, V and Y. The following edges connect the nodes. From V to D, from X to Y, from D to Y" width="100%" /> ] --- class: inverse center middle, hide-logo ## What is Double/Debiased Machine Learning (DML)? --- ## What is Double/Debiased Machine Learning (DML)? - **Double/debiased machine learning (DML)** introduced by Chernozhukov et al. (2018) - General framework for causal inference and estimation of treatment effects based on machine learning tools using big data - Combines the strength of **machine learning** and **econometrics** - Our object-oriented implementation **DoubleML** (in R and Python) provides a general interface for the growing number of models and methods for DML - **Documentation** & **user guide**: https://docs.doubleml.org - The R package is available on CRAN ```r # latest CRAN release install.packages("DoubleML") # development version from GitHub remotes::install_github("DoubleML/doubleml-for-r") ``` --- ## What is Double/Debiased Machine Learning (DML)? - **Exploiting the strengths of two disciplines:** .center[ <center><img src="figures/doubleml_ml_eco_useR.png" height="200px" alt = "A diagram with three boxes. On top, there are two boxes. On the left, a box with title PREDICTION - MACHINE LEARNING; on the right a box with title INFERENCE - ECONOMETRICS AND STATISTICS. In the first box (prediction) it is written: Powerful methods in high-dimensional and non-linear settings, for example, lasso, ridge, regression trees, random forests and boosted trees. In the second box (inference) it is written: Statistical framework for estimation of causal effects, this is, structural equation models, identification, asymptotic theory, hypothesis tests and confidence intervals" /></center> ] - **Result / output** from the DML framework: - Estimate of the causal effect (with valid confidence intervals `\(\rightarrow\)` statistical tests for effects) - Good statistical properties ( `\(\sqrt{N}\)` rate of convergence; unbiased; approximately Gaussian) - Multiple treatment effects; heterogeneous treatment effects, ... --- class: inverse center middle, hide-logo ## A Motivating Example: Basics of Double Machine Learning --- ## Partially Linear Regression #### Partially linear regression (PLR) model `$$\begin{align*} &Y = D \theta_0 + g_0(X) + \zeta, & &\mathbb{E}[\zeta | D,X] = 0, \\ &D = m_0(X) + V, & &\mathbb{E}[V | X] = 0, \end{align*}$$` with * Outcome variable `\(Y\)` * Policy or treatment variable of interest `\(D\)` * High-dimensional vector of confounding covariates `\(X = (X_1, \ldots, X_p)\)` * Stochastic errors `\(\zeta\)` and `\(V\)` .center[ <img src="Lect1_Introduction_to_DML_files/figure-html/unnamed-chunk-4-1.png" title="A directed acyclical graph with nodes A, X, V and Y. The following edges connect the nodes. From V to A, from X to A, from X to Y, from A to Y" alt="A directed acyclical graph with nodes A, X, V and Y. The following edges connect the nodes. From V to A, from X to A, from X to Y, from A to Y" width="50%" /> ] --- ## Partially Linear Regression #### Partially linear regression (PLR) model `$$\begin{align*} &Y = D \theta_0 + g_0(X) + \zeta, & &\mathbb{E}[\zeta | D,X] = 0, \\ &D = m_0(X) + V, & &\mathbb{E}[V | X] = 0, \end{align*}$$` with * Outcome variable `\(Y\)` * Policy or treatment variable of interest `\(D\)` * High-dimensional vector of confounding covariates `\(X = (X_1, \ldots, X_p)\)` * Stochastic errors `\(\zeta\)` and `\(V\)` #### Problem of simple "plug-in" approaches: Regularization bias * If we use an ML model to estimate `\(\hat{g}\)` and simply plug in the predictions `\(\hat{g}\)`, the final estimate on `\(\theta_0\)` will not be unbiased and neither be asymptotically normal --- ## Partially Linear Regression <br> .center[ #### Illustration of naive approach: App Example based on Chernozhukov et al. (2018) and [https://docs.doubleml.org/stable/guide/basics.html](https://docs.doubleml.org/stable/guide/basics.html) App available via GitHub: https://github.com/DoubleML/BasicsDML ] <!-- TODO: Insert link to app --> --- ## Frisch-Waugh-Lovell Theorem ### Solution to regularization bias: Orthogonalization * Remember the Frisch-Waugh-Lovell (FWL) Theorem in a **linear regression model** `$$Y = D \theta_0 + X\beta + \varepsilon$$` * `\(\theta_0\)` can be consistently estimated by **partialling out** `\(X\)`, i.e, 1. OLS regression of `\(Y\)` on `\(X\)`: `\(\tilde{\beta} = (X'X)^{-1} X'Y\)` `\(\rightarrow\)` Residuals `\(\hat{\varepsilon}\)` 2. OLS regression of `\(D\)` on `\(X\)`: `\(\tilde{\gamma} = (X'X)^{-1} X'D\)` `\(\rightarrow\)` Residuals `\(\hat{\zeta}\)` 3. Final OLS regression of `\(\hat{\varepsilon}\)` on `\(\hat{\zeta}\)` * Orthogonalization: The idea of the FWL Theorem can be generalized to using ML estimators instead of OLS --- ## Partially Linear Regression <br> .center[ #### Illustration of naive approach: App Example based on Chernozhukov et al. (2018) and [https://docs.doubleml.org/stable/guide/basics.html](https://docs.doubleml.org/stable/guide/basics.html) App available via GitHub: https://github.com/DoubleML/BasicsDML ] --- class: inverse center middle, hide-logo ## The Key Ingredients of Double Machine Learning --- ## The Key Ingredients of DML > #### 1. Neyman Orthogonality > The inference is based on a score function `\(\psi(W; \theta, \eta)\)` that satisfies > `$$E[\psi(W; \theta_0, \eta_0)] = 0,$$` > where `\(W:=(Y,D,X,Z)\)` and with `\(\theta_0\)` being the unique solution that obeys the **Neyman orthogonality condition** > `$$\left.\partial_\eta \mathbb{E}[\psi(W; \theta_0, \eta] \right|_{\eta=\eta_0} = 0.$$` <br> - `\(\partial_{\eta}\)` denotes the pathwise (Gateaux) derivative operator - **Neyman orthogonality** ensures that the **moment condition** identifying `\(\theta_0\)` is **insensitive to small pertubations** of the nuisance function `\(\eta\)` around `\(\eta_0\)` - Using a Neyman-orthogonal score **eliminates the first order biases** arising from the replacement of `\(\eta_0\)` with a ML estimator `\(\hat{\eta}_0\)` --- ## The Key Ingredients of DML > #### 1. Neyman Orthogonality > The inference is based on a score function `\(\psi(W; \theta, \eta)\)` that satisfies > `$$E[\psi(W; \theta_0, \eta_0)] = 0,$$` > where `\(W:=(Y,D,X,Z)\)` and with `\(\theta_0\)` being the unique solution that obeys the **Neyman orthogonality condition** > `$$\left.\partial_\eta \mathbb{E}[\psi(W; \theta_0, \eta] \right|_{\eta=\eta_0} = 0.$$` <br> - For many models the Neyman orthogonal **score functions** are **linear** in `\(\theta\)` `$$\psi(W;\theta, \eta) = \psi_a(W; \eta) \theta + \psi_b(W; \eta)$$` - The estimator `\(\tilde{\theta}_{0}\)` then takes the form `$$\tilde{\theta}_0 = - \left(\mathbb{E}_N[\psi_a(W; \eta)]\right)^{-1}\mathbb{E}_N[\psi_b(W; \eta)]$$` --- ## The Key Ingredients of DML > #### 1. Neyman Orthogonality > The inference is based on a score function `\(\psi(W; \theta, \eta)\)` that satisfies > `$$E[\psi(W; \theta_0, \eta_0)] = 0,$$` > where `\(W:=(Y,D,X,Z)\)` and with `\(\theta_0\)` being the unique solution that obeys the **Neyman orthogonality condition** > `$$\left.\partial_\eta \mathbb{E}[\psi(W; \theta_0, \eta] \right|_{\eta=\eta_0} = 0.$$` <br> * PLR example: Orthogonality by including the first-stage regression, i.e., the regression relationship of the treatment variable `\(D\)` and the regressors `\(X\)`. * Orthogonal score function `\(\psi(\cdot)= (Y-g(x)-\theta D)(D-m(X)\)`. --- ## The Key Ingredients of DML > #### 2. High-Quality Machine Learning Estimators > The nuisance parameters are estimated with high-quality (fast-enough converging) machine learning methods. <br> - Different structural assumptions on `\(\eta_0\)` lead to the use of different machine-learning tools for estimating `\(\eta_0\)` (Chernozhukov et al., 2018, Chapter 3) <br> > #### 3. Sample Splitting > To avoid the biases arising from overfitting, a form of **sample splitting** is used at the stage of producing the estimator of the main parameter `\(\theta_0\)`. <br> - Cross-fitting performs well empirically (efficiency gain by switching roles) --- ## Key Ingredients of DML #### Illustration of the cross-fitting algorithm <!-- `\(\Rightarrow\)` [Video](https://www.youtube.com/watch?v=BMAr27rp4uA&t=1s) --> .center[ <iframe width="750" height="415" src="https://www.youtube.com/embed/BMAr27rp4uA" title="YouTube video player" frameborder="0" allow="accelerometer; autoplay; clipboard-write; encrypted-media; gyroscope; picture-in-picture" allowfullscreen></iframe> ] --- ## Partially Linear Regression <br> .center[ #### Illustration of DML approach: App Example based on Chernozhukov et al. (2018) and [https://docs.doubleml.org/stable/guide/basics.html](https://docs.doubleml.org/stable/guide/basics.html) App available via GitHub: https://github.com/DoubleML/BasicsDML ] --- class: inverse center middle, hide-logo ## References --- ## References #### Double Machine Learning Approach * Chernozhukov, V., Chetverikov, D., Demirer, M., Duflo, E., Hansen, C., Newey, W. and Robins, J. (2018), Double/debiased machine learning for treatment and structural parameters. The Econometrics Journal, 21: C1-C68, doi:10.1111/ectj.12097. * Chernozhukov, V., Hansen, C., Spindler, M., and Syrgkanis, V. (forthcoming), Applied Causal Inference Powered by ML and AI. #### DoubleML Package for Python and R * Bach, P., Chernozhukov, V., Kurz, M. S., and Spindler, M. (2021), DoubleML - An Object-Oriented Implementation of Double Machine Learning in R, [arXiv:2103.09603](https://arxiv.org/abs/2103.09603). * Bach, P., Chernozhukov, V., Kurz, M. S., and Spindler, M. (2022), DoubleML - An Object-Oriented Implementation of Double Machine Learning in Python, Journal of Machine Learning Research, 23(53): 1-6, https://www.jmlr.org/papers/v23/21-0862.html. --- class: inverse center middle, hide-logo ## Appendix --- ## Examples: Covid Vaccination #### Does the COVID-19 vaccine increase mortality? <center><img src="figures/covid_vaccination.png" height="500px" /></center> .footnote[Source: [Tagesschau.de](https://www.tagesschau.de/faktenfinder/impfquote-sterblichkeitsrate-101.html)] --- ## Examples: Covid Vaccination #### Does the COVID-19 vaccine increase mortality? .center[ <img src="Lect1_Introduction_to_DML_files/figure-html/unnamed-chunk-5-1.png" width="60%" /> ]