# Introduction to Causal ML and Double ML

Tools for Causality
Grenoble, Sept 25 - 29, 2023
Philipp Bach, Sven Klaassen

## Agenda

Welcome to our course on Double Machine Learning!

#### Day 1

• Introduction to Causal ML & DoubleML

• Introduction to DoubleML for Python

• Heterogeneous Treatment Effects

• Hands-On Examples

• Uplift modeling
• Demand estimation

#### Day 2

• Introduction to Sensitivity Analysis

• Hands-On Examples

• Lalonde data
• Demand estimation

# Basics: Causality and Causal ML

## Basics: Predictive vs. Causal Modelling

#### Predictive Modelling

How can we build a good prediction rule, $f(X)$, that uses features $X$ to predict $Y$?

Example: Customer Churn

How well can we predict whether customers churn?

Which of these variables are good predictors for churn (explainability)?

#### Causal Modelling

What is the causal effect of a treatment $A$ on an outcome $Y$?

Why do customers churn?

How can we retain customers?

Causal ML: How can we use state-of-the-art ML methods for causal inference?

# Causal Inference: A Brief Introduction

## Introduction to Causal Inference

#### How to assess causality with data?

• Approach

1. Define causal parameters of interest
2. State necessary assumptions for identification and valid estimation
• Helpful tools in causal inference

• Potential Outcomes (PO) framework
• Directed Acyclic Graphs (DAGs)
• General question

What is the causal effect of treatment $D$ on outcome $Y$?

## Introduction to Causal Inference

#### Example: Uplift Modeling

Key questions:

1. What is the causal effect of an email campaign (coupon) ($=D$) on product sales (conversion) ($=Y$)?
1. How to optimally target coupons ($=D$) towards newsletter subscribers?

## Definition: Causal Effects

• Binary treatment $D$ $$D = \begin{cases}1, & \text{if treated (= with coupon)}\\ 0, & \text{if not treated (= without coupon)}\end{cases}$$

#### Potential outcome framework

• Define the potential outcomes as

• $Y(1)$: Conversion if the customer receives the coupon

• $Y(0)$: Conversion if the customer does not receive the coupon

• Individual causal effect (ICE) $\Delta = Y(1) - Y(0).$

• Reference: Huber (2023)

## Graphical Representation

#### Directed Acyclic Graphs (DAGs)

• DAGs help to communicate/discuss causal problems

• DAGs can be used to assess statistical dependencies between variables ( $d$-separation)

• Some of these statistical relationships might sometimes be unexpected (bad controls)

• References: Glymour, Pearl, and Jewell (2016), Cinelli, Forney, and Pearl (2022)

Code
import networkx as nx
import matplotlib.pyplot as plt

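G = nx.DiGraph()
# Edge assumed from the slide's question: treatment D affects outcome Y
G.add_edges_from([("D", "Y")])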

# Draw the graph
plt.figure(figsize=(4, 3))
pos = {"D": (0, 0), "Y": (2, 0)}
nx.draw(G, pos, with_labels=True, node_size=800, node_color='lightblue')
plt.show()

## Fundamental Problem of Causal Inference

#### Individual causal effects cannot be identified, in general.

• We only observe one of the potential outcomes $\Rightarrow$ Factual

• The other outcome is unobserved $\Rightarrow$ Counterfactual

• Make assumptions, e.g., the stable unit treatment value assumption (SUTVA)

• Use data and estimate causal parameters
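The fundamental problem can be made concrete with a small simulation. This is a minimal sketch: all numbers are hypothetical, and both potential outcomes are visible only because we generate them ourselves.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 8  # a handful of hypothetical newsletter subscribers

# Both potential outcomes are known only because we simulate them.
y0 = rng.binomial(1, 0.3, size=n)                  # Y(0): conversion without coupon
y1 = np.maximum(y0, rng.binomial(1, 0.4, size=n))  # Y(1): conversion with coupon
ice = y1 - y0                                      # individual causal effects

# Treatment actually received and the resulting data.
d = rng.binomial(1, 0.5, size=n)
y_factual = np.where(d == 1, y1, y0)          # observed outcome
y_counterfactual = np.where(d == 1, y0, y1)   # never observed in real data

for i in range(n):
    print(f"unit {i}: D={d[i]}  Y={y_factual[i]}  "
          f"unobserved={y_counterfactual[i]}  ICE={ice[i]}")
```

In real data only the columns `D` and `Y` exist, so the ICE column cannot be computed for any single unit.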

#### Example/Question:

What would you do to estimate the causal effect of the coupon on the conversion rate?

## Estimation of Causal Effects

• A/B test, Randomized Controlled Trial (RCT), experiment

#### Gold standard: Experiment, RCT, A/B test

Under certain assumptions, we can consistently estimate the Average Treatment Effect (ATE) of treatment $D$ on outcome $Y$, $ATE = E[Y(1)] - E[Y(0)].$

Basic idea:

• Use sample-based estimates of $E[Y|D=1]$ and $E[Y|D=0]$ for $E[Y(1)]$ and $E[Y(0)]$, respectively
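A minimal sketch of this idea on simulated data. The conversion rates of 10% and 15% are hypothetical and chosen so that the true ATE is 0.05.

```python
import numpy as np

rng = np.random.default_rng(42)
n = 100_000

# Hypothetical potential outcomes: 10% conversion without, 15% with coupon.
y0 = rng.binomial(1, 0.10, size=n)
y1 = rng.binomial(1, 0.15, size=n)
true_ate = 0.15 - 0.10

# Randomized assignment (coin flip), independent of the potential outcomes.
d = rng.binomial(1, 0.5, size=n)
y = np.where(d == 1, y1, y0)

# Sample analogues of E[Y|D=1] and E[Y|D=0].
ate_hat = y[d == 1].mean() - y[d == 0].mean()
print(f"true ATE: {true_ate:.3f}, difference in means: {ate_hat:.3f}")
```

Because assignment is random here, the two group means are computed on comparable customers, which is what makes the simple difference informative about the ATE.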

## Estimation of Causal Effects

#### Why does it work?

In an RCT, the treatment is assigned to individuals at random

$\Rightarrow$ Treatment assignment is independent of potential outcomes

• Individuals do not self-select into the treatment, i.e., they cannot pick that value of the treatment that is best for them (in terms of their potential outcomes)

## Estimation of the ATE based on Data

#### How to estimate the ATE using data?

1. Run a two-sample $t$-test, or,

2. Use linear regression $$E[Y|D] = \underbrace{\alpha}_{ \leadsto E[Y|D=0]} + \underbrace {\beta}_{\leadsto ATE} \cdot D.$$
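The two options lead to the same point estimate. The sketch below uses simulated data with a hypothetical true ATE of 0.5, computes a Welch $t$-statistic by hand, and fits the regression with plain least squares.

```python
import numpy as np

rng = np.random.default_rng(1)
n = 5_000
d = rng.binomial(1, 0.5, size=n)            # randomized treatment
y = 2.0 + 0.5 * d + rng.normal(size=n)      # hypothetical outcome, true ATE = 0.5

# Option 1: two-sample comparison (Welch t-statistic).
y_treated, y_control = y[d == 1], y[d == 0]
diff_means = y_treated.mean() - y_control.mean()
se_diff = np.sqrt(y_treated.var(ddof=1) / len(y_treated)
                  + y_control.var(ddof=1) / len(y_control))
t_stat = diff_means / se_diff

# Option 2: least-squares regression of Y on a constant and D.
design = np.column_stack([np.ones(n), d])
(alpha_hat, beta_hat), *_ = np.linalg.lstsq(design, y, rcond=None)

print(f"difference in means: {diff_means:.4f} (t = {t_stat:.1f})")
print(f"OLS: alpha = {alpha_hat:.4f}, beta = {beta_hat:.4f}")
```

With a single binary regressor, `beta_hat` coincides with the difference in group means and `alpha_hat` with the control-group mean.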

## Estimation of the ATE with Covariates

• Why account for covariates/individual characteristics $X$ in an RCT?

• Uplift example: Individual characteristics, gender, shopping history $\ldots$

• Accounting for covariates $X$ that help to explain $Y$ can reduce the unexplained variation (increase power / efficiency).

• Randomization has to hold with respect to the covariates as well
$\Rightarrow$ Pre-treatment covariates

• Scope for more sophisticated evaluation: Conditional Average Treatment Effect (CATE) $\Rightarrow$ Heterogeneity/Personalization $\ldots$

## Estimation of the ATE with Covariates

Code
import networkx as nx
import matplotlib.pyplot as plt

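G = nx.DiGraph()
# Edges assumed from the slide text: in an RCT, X affects Y but not the randomized D
G.add_edges_from([("D", "Y"), ("X", "Y")])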

# Draw the graph
plt.figure(figsize=(4, 3))
pos = {"D": (0, 0), "Y": (2, 0), "X": (1,1)}
nx.draw(G, pos, with_labels=True, node_size=800, node_color='lightblue')
plt.show()

## Estimation of the ATE with Covariates

Estimation with linear regression

\begin{align*}E[Y|D,X] = \underbrace{\alpha}_{ \leadsto E[Y|D=0]} + \underbrace{\beta}_{\leadsto ATE} \cdot D + \gamma X. \end{align*}
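The efficiency gain can be illustrated on simulated RCT data. This is a sketch with a hypothetical data-generating process (true ATE 0.5, one strongly predictive covariate) and classical homoskedastic standard errors computed by a small helper `ols`, which is not part of any library.

```python
import numpy as np

def ols(design, y):
    """Return OLS coefficients and classical (homoskedastic) standard errors."""
    coef, *_ = np.linalg.lstsq(design, y, rcond=None)
    resid = y - design @ coef
    sigma2 = resid @ resid / (len(y) - design.shape[1])
    cov = sigma2 * np.linalg.inv(design.T @ design)
    return coef, np.sqrt(np.diag(cov))

rng = np.random.default_rng(2)
n = 10_000
x = rng.normal(size=n)                  # pre-treatment covariate
d = rng.binomial(1, 0.5, size=n)        # randomized, independent of x
y = 1.0 + 0.5 * d + 2.0 * x + rng.normal(size=n)

ones = np.ones(n)
coef_unadj, se_unadj = ols(np.column_stack([ones, d]), y)
coef_adj, se_adj = ols(np.column_stack([ones, d, x]), y)

print(f"without X: beta = {coef_unadj[1]:.3f} (se {se_unadj[1]:.3f})")
print(f"with X:    beta = {coef_adj[1]:.3f} (se {se_adj[1]:.3f})")
```

Both regressions target the same effect in this setup; the adjusted one has a smaller standard error because $X$ absorbs part of the variation in $Y$.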

## Estimation of Causal Effects … without Experiments?

• In an RCT, we can credibly justify that individuals do not self-select into the treatment status.

• But, what if an RCT is not feasible/available?

• Examples: Randomization too costly (e.g., credit lines) or infeasible (e.g., removing standard services, enforcing memberships not possible)

• Observational studies

• The treatment assignment might be confounded by observable variables $X$ and unobservable variables $U$ $\Rightarrow$ The independence assumption might be violated.
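A minimal simulation of confounding by an observed variable. The setup is hypothetical: a binary $X$ (say, frequent shopper) raises both the chance of treatment and the outcome, and the true ATE is 1.

```python
import numpy as np

rng = np.random.default_rng(3)
n = 50_000
x = rng.binomial(1, 0.5, size=n)             # observed confounder
d = rng.binomial(1, 0.2 + 0.6 * x)           # treatment is more likely when x == 1
y = 1.0 * d + 2.0 * x + rng.normal(size=n)   # true ATE = 1

# Naive comparison of treated and untreated units ignores X.
naive = y[d == 1].mean() - y[d == 0].mean()

# Adjustment: compare within levels of X, then average over the distribution of X.
adjusted = 0.0
for value in (0, 1):
    in_stratum = x == value
    diff = y[in_stratum & (d == 1)].mean() - y[in_stratum & (d == 0)].mean()
    adjusted += in_stratum.mean() * diff

print(f"naive: {naive:.2f}, adjusted for X: {adjusted:.2f}, truth: 1.00")
```

The naive difference mixes the treatment effect with the effect of $X$. The adjustment only works here because the confounder is observed; an unobserved $U$ could not be handled this way.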

## Estimation of Causal Effects … without Experiments?

Code
import networkx as nx
import matplotlib.pyplot as plt

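G = nx.DiGraph()
# Edges assumed from the slide text: X confounds both D and Y
edges = [("D", "Y"), ("X", "D"), ("X", "Y")]
G.add_edges_from(edges)

# Draw the graph
plt.figure(figsize=(4, 3))
pos = {"D": (0, 0), "Y": (2, 0), "X": (1,1)}
edge_colors = ['black', 'red', 'red']  # one color per entry of `edges`
nx.draw(G, pos, with_labels=True, node_size=800, node_color='lightblue',
        edgelist=edges, edge_color=edge_colors)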
plt.show()

## Uplift Example (continued)

#### Experimental data

• Randomize the treatment $D$ (coupon) to newsletter subscribers

• Independence assumption is satisfied

#### Observational data

• Use historic sales data

• Observe who has been treated (i.e., who used an email coupon) and who has not

• Account for confounding variables $X$, e.g., shopping history

• Conditional independence is assumed

## Challenges to Observational Studies

• If we can control for all confounding variables $X$, the treatment is as good as randomly assigned conditional on $X$ $\Rightarrow$ We can estimate the $ATE$

• Intuition: Matching on observables
• Assumption: Unconfoundedness/Selection on observables

• Assumption (I): Independence of PO and treatment conditional on $X$’s

• Assumption (II): Common support/overlap

• (+ Assumptions that depend on the specific estimation framework used)

## Estimation of the ATE with Confounders

• Estimation based on linear regression with covariates $X$

$$E[Y|D,X] = \underbrace{\alpha}_{ \leadsto E[Y|D=0]} + \underbrace{\beta}_{\leadsto ATE} \cdot D + \gamma X.$$

• $\alpha = E[Y|D=0]$, $\beta = ATE$

• There are alternative estimation approaches available (imposing different assumptions)

• Matching on observables / estimation in subgroups
• Propensity score matching
• Inverse probability weighting / standardization
• Doubly robust approach / AIPW
• Causal/Double Machine learning
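As one example from this list, a sketch of inverse probability weighting on simulated data. The data-generating process is hypothetical (binary confounder, true ATE 1), and the propensity score is estimated by the treated share within each $X$ cell.

```python
import numpy as np

rng = np.random.default_rng(4)
n = 100_000
x = rng.binomial(1, 0.5, size=n)
m0 = 0.2 + 0.6 * x                           # true propensity score P(D=1|X)
d = rng.binomial(1, m0)
y = 1.0 * d + 2.0 * x + rng.normal(size=n)   # true ATE = 1

# Estimate the propensity score by the treated share within each X cell.
m_hat = np.where(x == 1, d[x == 1].mean(), d[x == 0].mean())

# Weight each observed outcome by the inverse probability of its treatment status.
ate_ipw = np.mean(d * y / m_hat - (1 - d) * y / (1 - m_hat))
print(f"IPW estimate: {ate_ipw:.3f} (truth: 1.000)")
```

The weights become large when `m_hat` is close to 0 or 1, which is one reason the overlap assumption matters in practice.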

## So should we simply include all variables in the data?

• Be careful! Don’t include bad controls/colliders.

• Collider bias/selection bias $\Rightarrow$ Biases and spurious correlations

• Post-treatment covariates

• Model specification based on domain-specific expertise

Code
import networkx as nx
import matplotlib.pyplot as plt

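G = nx.DiGraph()
# Edges assumed from the slide text: X confounds D and Y, C is a collider
edges = [("D", "Y"), ("X", "D"), ("X", "Y"), ("D", "C"), ("Y", "C")]
G.add_edges_from(edges)

# Draw the graph
plt.figure(figsize=(4, 3))
pos = {"D": (0, 0), "Y": (2, 0), "X": (1,1), "C": (1, -1)}
edge_colors = ['black', 'red', 'red', 'black', 'black']  # one color per entry of `edges`
nx.draw(G, pos, with_labels=True, node_size=800, node_color='lightblue',
        edgelist=edges, edge_color=edge_colors)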
plt.show()
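Collider bias can be reproduced in a few lines. In this hypothetical setup the treatment has no effect on the outcome at all, and $C$ is caused by both $D$ and $Y$; `coef_on_d` is a small helper written for this sketch.

```python
import numpy as np

rng = np.random.default_rng(5)
n = 50_000
d = rng.binomial(1, 0.5, size=n)
y = rng.normal(size=n)                   # D has no effect on Y: true effect = 0
c = d + y + 0.5 * rng.normal(size=n)     # collider: caused by both D and Y

def coef_on_d(*controls):
    """OLS coefficient on D in a regression of Y on a constant, D and controls."""
    design = np.column_stack([np.ones(n), d, *controls])
    coef, *_ = np.linalg.lstsq(design, y, rcond=None)
    return coef[1]

beta_without_c = coef_on_d()     # correct specification
beta_with_c = coef_on_d(c)       # "controls" for the collider

print(f"without C: {beta_without_c:.3f}, with C: {beta_with_c:.3f}")
```

Adding $C$ to the regression creates a clearly negative coefficient on $D$ although no causal effect exists, which is the spurious correlation the slide warns about.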

# Introduction to Double Machine Learning

## Introduction to Double Machine Learning

#### Why ML for causal analysis?

• Machine Learning (ML) algorithms are powerful tools for predictions

• Use Case continued: Uplift modeling

• Experiment, A/B test: Improve precision/efficiency/power
• Account for confounding in observational studies
• High-dimensional vector of covariates or complex relationship of $Y$, $D$ and $X$.

## Introduction to Double Machine Learning

#### Double Machine Learning (DML) approach (Chernozhukov et al. 2018)

• General framework for ML-based inference on a causal parameter, $\theta_0$, for example, $\theta_0 = ATE = E[Y(1) - Y(0)]$

• Use Case: With DML, we can estimate the ATE by using ML learners, such as gradient boosting, random forests, $\ldots$

• First, we need a formal causal model and a definition of the causal quantity of interest (=ATE)

## Interactive Regression Model (IRM)

\begin{align}\begin{aligned}Y = g_0(D, X) + U, & &\mathbb{E}(U | X, D) = 0,\end{aligned}\end{align}

with the ATE being

$\theta_0 = \mathbb{E}[g_0(1, X) - g_0(0,X)].$

Here, $\mathbb{E}[g_0(1, X)]$ and $\mathbb{E}[g_0(0,X)]$ denote the expected value of the outcome variable with treatment status $D=1$ and $D=0$, respectively.

## Interactive Regression Model (IRM)

#### Why ML for causal analysis?

\begin{align}\begin{aligned}Y = \underbrace{g_0(D, X)}_{\text{ML Learner}} + U, & &\mathbb{E}(U | X, D) = 0.\end{aligned}\end{align}

• In this model, we want to use flexible ML methods to estimate $g_0(D, X)$ in order to
• Abstract from potentially restrictive assumptions, such as linearity or additivity
• Model heterogeneous treatment effects using nonlinear learners
• Handle high-dimensional data in terms of $X$

## Introduction to Double Machine Learning

#### Why Double Machine Learning?

\begin{align}\begin{aligned}Y = \underbrace{g_0(D, X)}_{\text{ML Learner}} + U, & &\mathbb{E}(U | X, D) = 0.\end{aligned}\end{align}

• Why not simply estimate $g_0(D,X)$ with ML and plug in predictions $\hat{g}_0(D,X)$?
• ML methods address the bias-variance trade-off by introducing some kind of regularization
• This will translate into a bias of the causal estimate

## Uplift modeling example: Interactive Regression Model (IRM)

#### Challenges in causal machine learning

• Causal estimation based on ML methods requires adjustments of the estimation framework $\Rightarrow$ Chernozhukov et al. (2018)

#### The Key Ingredients of DML

1. Neyman orthogonality
1. High-quality ML estimation
1. Sample splitting

## Neyman Orthogonality: Motivation

• We want to make estimation of the ATE robust against the regularization bias of ML learners

• This can be achieved by using an orthogonal estimation framework

• Neyman orthogonality means that the estimation errors that arise due to regularization have no first-order effect on the causal estimate

## Neyman Orthogonality: Formal Definition

Technically, the inference framework is built on a moment condition that satisfies the property of Neyman orthogonality, i.e.,

$\mathbb{E}[\underbrace{\psi(W; \theta_0, \eta_0)}_{\text{score function}}] = 0,$ with $W$ denoting the data, $\theta_0$ the causal parameter of interest (ATE), and $\eta$ the nuisance part.

Neyman orthogonality ensures that the moment condition identifying $\theta_0$ is insensitive to small perturbations of the nuisance function $\eta$ around $\eta_0$:

$\left.\partial_\eta \mathbb{E}[\psi(W; \theta_0, \eta)] \right|_{\eta=\eta_0} = 0.$

## Neyman Orthogonality: IRM Example

#### Plug-in approach (not orthogonal)

\begin{align} \psi (W; \theta, \eta) = & g(1, X) - g(0,X) - \theta, \end{align}

with the nuisance parameter being

\begin{align} \eta &= (g(1, X), g(0,X)), \\ \eta_0 &= (g_0(1, X), g_0(0,X)) \end{align}
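A short derivation sketches why the plug-in score is not orthogonal. Perturb the nuisance part along the path $\eta_0 + r(\eta - \eta_0)$ and differentiate with respect to $r$ (the path parameter $r$ is notation introduced here for illustration):

\begin{align*} \partial_r \mathbb{E}\big[\psi(W; \theta_0, \eta_0 + r(\eta - \eta_0))\big]\Big|_{r=0} &= \partial_r \mathbb{E}\big[g_0(1,X) + r\,(g(1,X) - g_0(1,X)) \\ &\qquad - g_0(0,X) - r\,(g(0,X) - g_0(0,X)) - \theta_0\big]\Big|_{r=0} \\ &= \mathbb{E}\big[(g(1,X) - g_0(1,X)) - (g(0,X) - g_0(0,X))\big]. \end{align*}

This expectation is nonzero in general, so a first-order error in the estimated outcome regression passes directly into the estimate of $\theta_0$.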

## Neyman Orthogonality: IRM Example

#### Neyman-orthogonal score (doubly robust score)

\begin{align} \psi (W; \theta, \eta) = & g(1,X) - g(0,X) \\ & + \frac{D (Y - g(1,X))}{m(X)} - \frac{(1 - D)(Y - g(0,X))}{1 - m(X)} \\ & - \theta, \end{align}

with the nuisance parameter

\begin{align} \eta &= (g(1, X), g(0, X), m(X)), \\ \eta_0 &= (g_0(1, X), g_0(0, X), m_0(X)), \end{align}

with the propensity score $m_0(X) = \mathbf{P}(D = 1 | X)$, which we have to estimate in order to achieve orthogonality; see the DoubleML User Guide for more details.
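The difference between the two scores can be checked numerically. The sketch below uses a hypothetical data-generating process with known nuisance functions and mimics regularization by shrinking the outcome predictions by 20%; `plug_in` and `orthogonal` are helper names chosen for this illustration.

```python
import numpy as np

rng = np.random.default_rng(6)
n = 200_000
x = rng.uniform(0, 1, size=n)
m0 = 0.3 + 0.4 * x                # true propensity score
g0_1 = 1.0 + 3.0 * x              # true g_0(1, X)
g0_0 = 2.0 * x                    # true g_0(0, X)
d = rng.binomial(1, m0)
y = np.where(d == 1, g0_1, g0_0) + rng.normal(size=n)   # true ATE = E[1 + X] = 1.5

def plug_in(g1, g0):
    """Estimate solving the empirical plug-in moment condition."""
    return np.mean(g1 - g0)

def orthogonal(g1, g0, m):
    """Estimate solving the empirical doubly robust moment condition."""
    return np.mean(g1 - g0 + d * (y - g1) / m - (1 - d) * (y - g0) / (1 - m))

# Mimic regularization bias: outcome predictions shrunk towards zero.
g1_shrunk, g0_shrunk = 0.8 * g0_1, 0.8 * g0_0

theta_plug_in = plug_in(g1_shrunk, g0_shrunk)
theta_orth = orthogonal(g1_shrunk, g0_shrunk, m0)
# Correct outcome regressions, but a distorted propensity score.
theta_orth_wrong_m = orthogonal(g0_1, g0_0, np.clip(m0 + 0.1, 0.01, 0.99))

print(f"plug-in, shrunken g:        {theta_plug_in:.3f}")
print(f"orthogonal, shrunken g:     {theta_orth:.3f}")
print(f"orthogonal, distorted m:    {theta_orth_wrong_m:.3f}")
```

The plug-in estimate inherits the shrinkage, while the orthogonal score stays close to the true value as long as one of the two nuisance parts is correct. When both are estimated with error, the remaining bias depends on the product of the two errors.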

## Neyman Orthogonality: Use Case Example

• Non-orthogonal score in uplift example
import numpy as np
import doubleml as dml
from sklearn.base import clone
from lightgbm import LGBMClassifier

def dml_score_nonorth(y, d, g_hat0, g_hat1, m_hat, smpls):
    # g_hat1, g_hat0 are the outcome predictions under treatment and control, respectively
    psi_b = g_hat1 - g_hat0

    # m_hat is the internally estimated propensity score (only its shape is used here)
    psi_a = np.full_like(m_hat, -1.0)

    return psi_a, psi_b

ml_g_boost = LGBMClassifier(n_estimators=1000, learning_rate=0.005)
ml_m_boost = LGBMClassifier(n_estimators=1000, learning_rate=0.005)

# dml_data is the data object of the uplift example (created beforehand, not shown)
np.random.seed(3141)
dml_obj_nonorth = dml.DoubleMLIRM(obj_dml_data=dml_data,
                                  ml_g=clone(ml_g_boost),
                                  ml_m=clone(ml_m_boost),
                                  n_folds=2,
                                  n_rep=1,
                                  trimming_threshold=0.01,
                                  score=dml_score_nonorth)
dml_obj_nonorth.fit()

## Neyman Orthogonality: Use Case Example

• Comparison of non-orthogonal score to orthogonal score

## Orthogonality: How Does it Work?

• The plug-in approach allows for valid estimation of the ATE, $\theta_0$, whenever we can precisely estimate $g_0(1,X)$ and $g_0(0,X)$

• Example: $g_0(1,X)$ and $g_0(0,X)$ are simply linear functions of the covariates $X_1$ and $X_2$ and the causal effect is constant and additive, such that

\begin{align} g_0(1,X) &= \alpha + \theta_0 + \gamma_1 X_1 + \gamma_2 X_2,\\ g_0(0,X) &= \alpha + \gamma_1 X_1 + \gamma_2 X_2. \end{align}

• Then we could estimate $\theta_0$ by simply using a correctly specified linear regression model

• However, this would not necessarily work if we used Lasso. Can you see why?

## Orthogonality: How Does it Work?

• The $\ell_1$-penalization of Lasso would shrink some of the coefficients in $\gamma = (\gamma_1, \gamma_2)$ towards zero and might set some of them exactly to zero, say $\gamma_2$ for the confounder $X_2$

• This is equivalent to removing the confounder $X_2$ from the causal model
$\Rightarrow$ Omitted variable bias / confounding bias

Code
import networkx as nx
import matplotlib.pyplot as plt

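G = nx.DiGraph()
# Edges assumed from the slide text: X1 and X2 both confound D and Y
edges = [("D", "Y"), ("X1", "D"), ("X1", "Y"), ("X2", "D"), ("X2", "Y")]
G.add_edges_from(edges)

# Draw the graph
plt.figure(figsize=(4, 3))
pos = {"D": (0, 0), "Y": (2, 0), "X1": (1,1), "X2": (1, -1)}
edge_colors = ['black', 'black', 'black', 'red', 'red']  # one color per entry of `edges`
nx.draw(G, pos, with_labels=True, node_size=800, node_color='lightblue',
        edgelist=edges, edge_color=edge_colors)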
plt.show()
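The consequence of a coefficient being set to zero can be sketched without running Lasso itself: the code below simply drops $X_2$ from a least-squares fit, which corresponds to the case $\hat{\gamma}_2 = 0$. The data-generating process is hypothetical, with a constant effect $\theta_0 = 1$.

```python
import numpy as np

rng = np.random.default_rng(7)
n = 20_000
x1 = rng.normal(size=n)
x2 = rng.normal(size=n)
# Treatment depends on both confounders.
d = (0.5 * x1 + x2 + rng.normal(size=n) > 0).astype(float)
theta0 = 1.0
y = 0.5 + theta0 * d + 1.0 * x1 + 1.0 * x2 + rng.normal(size=n)

def coef_on_d(*covariates):
    """OLS coefficient on D in a regression of Y on a constant, D and covariates."""
    design = np.column_stack([np.ones(n), d, *covariates])
    coef, *_ = np.linalg.lstsq(design, y, rcond=None)
    return coef[1]

theta_full = coef_on_d(x1, x2)      # correctly specified model
theta_drop_x2 = coef_on_d(x1)       # gamma_2 forced to zero, as Lasso might do

print(f"with X1 and X2: {theta_full:.3f}, without X2: {theta_drop_x2:.3f}")
```

Dropping the confounder roughly doubles the estimated effect in this setup, even though the remaining part of the model is correctly specified.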

## Orthogonality: How Does it Work?

• Using the orthogonal score makes estimation of the ATE robust against such estimation errors, if they are moderate

• In the IRM, this is done by including the propensity score in the orthogonal score $\psi(\cdot)$

• We can estimate $\theta_0$ consistently if the nuisance parameters $g_0(1,X)$, $g_0(0,X)$ and $m_0(X)$ are estimated well enough

We will now see what "well enough" means…

## High-Quality ML Estimation

• The nuisance parameters are estimated with high-quality (fast-enough converging) machine learning methods.

• Different structural assumptions on $\eta_0$ lead to the use of different machine-learning tools for estimating $\eta_0$

• Rate requirement in the IRM for estimation of ATE: $\lVert \hat{m}_0 - m_0 \rVert_{P,2} \times \lVert \hat{g}_0 - g_0\rVert _{P,2} \le \delta_N N^{-1/2}$ for a sequence $\delta_N \to 0$

• $\Rightarrow$ We can use ML methods, which typically converge more slowly than the parametric rate $N^{-1/2}$ (attained, e.g., by a correctly specified linear regression model)
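A small worked case shows what the rate requirement permits. Suppose, as a simplifying assumption, that both nuisance estimation errors in the product above shrink at the same rate $N^{-a}$ (the exponent $a$ is notation introduced here). Then

\begin{align*} O(N^{-a}) \times O(N^{-a}) = O(N^{-2a}), \qquad \text{and} \qquad N^{-2a} = o(N^{-1/2}) \iff a > 1/4. \end{align*}

So it is enough that each learner converges slightly faster than $N^{-1/4}$, which is far slower than the parametric rate. A slow learner for one nuisance part can also be offset by a fast learner for the other, since only the product matters.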

## High-Quality ML Estimation: Use Case Example

• Comparison of the performance of the ML-based causal model to estimation based on (unpenalized) logistic regression
# Nuisance learner performance for the model with logistic regression learners
# (first call: default metric, RMSE; second call: user-supplied balanced accuracy)
print(dml_obj_linear.evaluate_learners())
print(dml_obj_linear.evaluate_learners(metric=balanced_accuracy))
{'ml_g0': array([[0.45735802]]), 'ml_g1': array([[0.44600344]]), 'ml_m': array([[0.48082561]])}
{'ml_g0': array([[0.49937559]]), 'ml_g1': array([[0.51025686]]), 'ml_m': array([[0.6107002]])}

# Same evaluation for the ML-based model with the orthogonal score
print(dml_obj_orth.evaluate_learners())
print(dml_obj_orth.evaluate_learners(metric=balanced_accuracy))
{'ml_g0': array([[0.46770032]]), 'ml_g1': array([[0.44701511]]), 'ml_m': array([[0.4792336]])}
{'ml_g0': array([[0.51123643]]), 'ml_g1': array([[0.53925407]]), 'ml_m': array([[0.61656226]])}

## High-Quality ML Estimation: Use Case Example

• Comparison of the performance of the ML-based causal model to estimation based on (unpenalized) logistic regression

## Sample-Splitting

• To avoid the biases arising from overfitting, a form of sample splitting is used at the stage of producing the estimator of the main parameter $\theta_0$.

• Informal example:

• Split data 50/50 into train (T1) and test (T2) samples
• Fit ML algorithms on T1
• Generate predictions for T2
• Plug predictions into score function and solve for $\theta$
• Apply DML results to obtain confidence intervals, standard errors and $p$-values
• Efficiency gains by using cross-fitting (swapping roles of samples for train / hold-out)

## Sample-Splitting

• To avoid the biases arising from overfitting, a form of sample splitting is used at the stage of producing the estimator of the main parameter $\theta_0$.

• Informal example:

• Split data 50/50 into train (T1) and test (T2) samples
• Swap the roles of the samples: fit ML algorithms on T2
• Generate predictions for T1
• Plug predictions into score function and solve for $\theta$
• Apply DML results to obtain confidence intervals, standard errors and $p$-values

## High-Quality ML Estimation: Use Case Example

• Comparison of estimation without sample-splitting to simple train-test-split and cross-fitting (all models with orthogonal score)
np.random.seed(3141)
dml_obj_no_split = dml.DoubleMLIRM(obj_dml_data=dml_data,
                                   ml_g=clone(ml_g_boost),
                                   ml_m=clone(ml_m_boost),
                                   n_folds=1,
                                   n_rep=1,
                                   trimming_threshold=0.01,
                                   apply_cross_fitting=False)

dml_obj_no_split.fit()

## High-Quality ML Estimation: Use Case Example

• Comparison of estimation without sample-splitting and cross-fitting (all models with orthogonal score)

## Main Result: Double Machine Learning

#### Main result in Chernozhukov et al. (2018)

There exist regularity conditions such that the DML estimator $\tilde{\theta}_0$ concentrates in a $1/\sqrt{N}$-neighborhood of $\theta_0$ and the sampling error is approximately normal,

$\sqrt{N}(\tilde{\theta}_0 - \theta_0) \sim N(0, \sigma^2),$

with

\begin{align}\begin{aligned}\sigma^2 := J_0^{-2} \mathbb{E}(\psi^2(W; \theta_0, \eta_0)),\\J_0 = \mathbb{E}(\psi_a(W; \eta_0)).\end{aligned}\end{align}
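A sketch of how this result turns into a standard error and a confidence interval. It assumes the score is written in the linear form $\psi = \psi_a \theta + \psi_b$ with $\psi_a = -1$ for the IRM, and, to keep the example short, it plugs in the true nuisance functions of a simulated data set instead of estimated ones.

```python
import numpy as np

rng = np.random.default_rng(8)
n = 50_000
x = rng.uniform(-1, 1, size=n)
m0 = 0.5 + 0.3 * x                     # true propensity score
g1_x, g0_x = 1.0 + 2.0 * x, x          # true g_0(1, X) and g_0(0, X); ATE = 1
d = rng.binomial(1, m0)
y = np.where(d == 1, g1_x, g0_x) + rng.normal(size=n)

# Components of the linear score psi = psi_a * theta + psi_b.
psi_a = -np.ones(n)
psi_b = g1_x - g0_x + d * (y - g1_x) / m0 - (1 - d) * (y - g0_x) / (1 - m0)

theta_hat = -psi_b.mean() / psi_a.mean()    # solves mean(psi) = 0
psi = psi_a * theta_hat + psi_b

# Sample analogues of J_0 and sigma^2.
J_hat = psi_a.mean()
sigma2_hat = np.mean(psi ** 2) / J_hat ** 2
se = np.sqrt(sigma2_hat / n)
ci = (theta_hat - 1.96 * se, theta_hat + 1.96 * se)

print(f"theta_hat = {theta_hat:.3f}, se = {se:.4f}, "
      f"95% CI = [{ci[0]:.3f}, {ci[1]:.3f}]")
```

In an actual DML analysis the nuisance functions are replaced by cross-fitted ML predictions; the variance formula stays the same.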

## Examples

You can play around with the three key ingredients in this

🦏 Shiny App 🦏

# References

## References

Bach, Philipp, Victor Chernozhukov, Malte S Kurz, and Martin Spindler. 2022. “DoubleML - An Object-Oriented Implementation of Double Machine Learning in Python.” Journal of Machine Learning Research 23 (53): 1–6.
Bach, Philipp, Victor Chernozhukov, Malte S Kurz, Martin Spindler, and Sven Klaassen. 2021. “DoubleML - An Object-Oriented Implementation of Double Machine Learning in R.” https://arxiv.org/abs/2103.09603.
Chernozhukov, Victor, Denis Chetverikov, Mert Demirer, Esther Duflo, Christian Hansen, Whitney Newey, and James Robins. 2018. “Double/Debiased Machine Learning for Treatment and Structural Parameters.” The Econometrics Journal 21 (1): C1–68. https://onlinelibrary.wiley.com/doi/abs/10.1111/ectj.12097.
Chernozhukov, Victor, Christian Hansen, Nathan Kallus, Martin Spindler, and Vasilis Syrgkanis. forthcoming. Applied Causal Inference Powered by ML and AI. online.
Cinelli, Carlos, Andrew Forney, and Judea Pearl. 2022. “A Crash Course in Good and Bad Controls.” Sociological Methods & Research, 00491241221099552.
Facure, Matheus, and Michell Germano. 2021. “matheusfacure/python-causality-handbook: First Edition.” Zenodo. https://doi.org/10.5281/zenodo.4445778.
Glymour, Madelyn, Judea Pearl, and Nicholas P Jewell. 2016. Causal Inference in Statistics: A Primer. John Wiley & Sons.
Huber, Martin. 2023. Causal Analysis: Impact Evaluation and Causal Machine Learning with Applications in R. MIT Press.