In this notebook, we demontrate exemplarily how the DoubleML package can be used to estimate the causal effect of seeing a new ad design on customers' purchases in a webshop. We base the estimation steps of our analysis according to the DoubleML workflow.

Let's consider the following stylized scenario. The manager of a webshop performs an A/B test to estimate the effect a new ad design $A$ has on customers' purchases (in $100\$$), $Y$, on average. This effect is called the **A**verage **T**reatment **E**ffect (**ATE**). The treatment is assigned randomly conditional on the visitors' characteristics, which we call $V$. Such characteristics could be collected from a customer's shoppers account, for example. These might include the number of previous purchases, time since the last purchase, length of stay on a page as well as whether a customer has a rewards card, among other characteristics.

In the following, we use a **D**irected **A**cyclical **G**raph (DAG) to illustrate our assumptions on the causal structure of the scenario. As not only the outcome, but also the treatment is dependent on the individual characteristics, there are arrows going from $V$ to both $A$ and $Y$. In our example, we also assume that the treatment $A$ is a direct cause of the customers' purchases $Y$.

Let's assume the conditional randomization has been conducted properly, such that a tidy data set has been collected. Now, a data scientist wants to evaluate whether the new ad design causally affected the sales, by using the DoubleML package.

Before we start the case study, let us briefly address the question why we need to include individual characteristics in our analysis at all. There are mainly two reasons why we want to control for observable characteristics. First, so-called confounders, i.e., variables that have a causal effect on both the treatment variable and the outcome variable, possibly create a bias in our estimate. In order to uncover the true causal effect of the treatment, it is necessary that our causal framework takes all confounding variables into account. Otherwise, the average causal effect of the treatment on the outcome is not identified. A second reason to include individual characteristics is efficiency. The more variation can be explained within our causal framework, the more precise will be the resulting estimate. In practical terms, greater efficiency leads to tighter confidence intervals and smaller standard errors and p-values. This might help to improve the power of A/B tests even if the treatment variable is unconditionally assigned to individuals.

ML methods have turned out to be very flexible in terms of modeling complex relationships of explanatory variables and dependent variables and, thus, have exhibited a great predictive performance in many applications. In the double machine learning approach (Chernozhukov et al. (2018)), ML methods are used for modelling so-called nuisance functions. In terms of the A/B case study considered here, ML tools can be used to flexibly control for confounding variables. For example, a linear parametric specification as in a standard linear regression model might not be correct and, hence, not sufficient to account for the underlying confounding. Moreover, by using powerful ML techniques, the causal model will likely be able to explain a greater share of the total variation and, hence, lead to more precise estimation.

As an illustrative example we use a data set from the ACIC 2019 Data Challenge. In this challenge, a great number of data sets have been generated in a way that they mimic distributional relationships that are found in many economic real data applications. Although the data have not been generated explicitly to address an A/B testing case study, they are well-suited for demonstration purposes. We will focus on one of the many different data genereting processes (DGP) that we picked at random, in this particualar case a data set called `high42`

. An advantage of using the synthetic ACIC 2019 data is that we know the true average treatment effect which is 0.8 in our data set.

In [1]:

```
# Load required modules
import numpy as np
import pandas as pd
import doubleml as dml
from sklearn.linear_model import LinearRegression
from sklearn.linear_model import LogisticRegression
from sklearn.linear_model import LassoCV, LogisticRegressionCV
from sklearn.ensemble import RandomForestClassifier, RandomForestRegressor
from sklearn.preprocessing import StandardScaler
from sklearn.pipeline import make_pipeline
from sklearn.metrics import mean_squared_error
from sklearn.metrics import log_loss
from xgboost import XGBClassifier, XGBRegressor
import matplotlib.pyplot as plt
import scipy.stats as stats
```

First we load the data.

In [2]:

```
import pandas as pd
# Load data set from url (internet connection required)
url = 'https://raw.githubusercontent.com/DoubleML/doubleml-docs/master/doc/examples/data/high42.CSV'
df = pd.read_csv(url)
```

In [3]:

```
print(df.shape)
```

(1000, 202)

In [4]:

```
df.head()
```

Out[4]:

Y | A | V1 | V2 | V3 | V4 | V5 | V6 | V7 | V8 | ... | V191 | V192 | V193 | V194 | V195 | V196 | V197 | V198 | V199 | V200 | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|

0 | 7.358185 | 1 | 10 | 0 | 0 | 7 | 192.793769 | 23.676950 | 8 | 0.185443 | ... | 1.462837 | 1 | 1627.274196 | 0 | 0 | 4.683956 | 0.565667 | 0 | 3 | 0.024338 |

1 | 8.333672 | 1 | 12 | 0 | 1 | 4 | 199.653596 | 19.281270 | 7 | 0.514842 | ... | 1.330522 | 1 | 1661.484439 | 1 | 0 | 6.766661 | -0.395402 | 0 | 4 | 0.056518 |

2 | 7.472758 | 0 | 14 | 1 | 1 | 2 | 194.207792 | 24.589331 | 5 | 0.309199 | ... | 1.384151 | 1 | 1658.939293 | 0 | 0 | 5.647794 | 1.112766 | 0 | 0 | 0.013442 |

3 | 6.502319 | 1 | 0 | 1 | 0 | 9 | 201.838024 | 25.513918 | 4 | 0.160160 | ... | 1.220303 | 1 | 1650.801625 | 0 | 0 | 5.370363 | -0.305842 | 0 | 4 | 0.034632 |

4 | 7.043758 | 1 | 12 | 0 | 0 | 9 | 201.360443 | 31.160641 | 6 | 0.291976 | ... | 1.170094 | 1 | 1676.818876 | 0 | 0 | 3.446532 | 2.440661 | 0 | 1 | 0.017514 |

5 rows × 202 columns

We see that the data set consists of 1000 observations (= website visitors) and 202 variables:

`Y`

: A customer's purchases (in $100\$$)`A`

: Binary treatment variable with a value 1 indicating that a customer has been exposed to the new ad design (and value 0 otherwise).`V1`

,...,`V200`

: The remaining 200 columns $V$ represent individual characteristics of the customers (=confounders).

`y_col`

, the treatment variable $A$ via `d_cols`

and the confounding variables $V$ via `x_cols`

.

In [5]:

```
# Specify explanatory variables for data-backend
features_base = list(df.columns.values)[2:]
# TODO: Initialize DoubleMLData (data-backend of DoubleML)
data_dml = dml.DoubleMLData(df,
y_col='Y',
d_cols='A',
x_cols=features_base)
```

In [6]:

```
# TODO: print data backend
print(data_dml)
```

The inference problem is to determine the causal effect of seeing the new ad design $A$ on customers' purchases $Y$ once we control for individual characteristics $V$. In our example, we are interested in the average treatment effect. Basically, there are two causal models available in DoubleML that can be used to estimate the ATE.

The so-called **interactive regression model** (IRM) called by DoubleMLIRM is a flexible (nonparametric) model to estimate this causal quantity. The model does not impose functional form restrictions on the underlying regression relationships, for example, linearity or additivity as in a standard linear regression model. This means that the model hosts heterogeneous treatment effects, i.e., account for variation in the effect of the new ad design across customers. Moreover, it is possible to also estimate other causal parameters with the IRM, for example, the average treatment effect on the treated (= those customers who have been exposed to the new ad), which might be of interest too.

We briefly introduce the interactive regression model where the main regression relationship of interest is provided by

$$Y = g_0(A, V) + U_1, \quad E(U_1 | V, A) = 0,$$where the treatment variable is binary, $A \in \lbrace 0,1 \rbrace$. We consider estimation of the average treatment effect (ATE):

$$\theta_0 = \mathbb{E}[g_0(1, V) - g_0(0,V)],$$when treatment effects are heterogeneous. In order to be able to use ML methods, the estimation framework generally requires a property called "double robustness" or "Neyman orthogonality". In the IRM, double robustness can be achieved by including the first-stage estimation

$$A = m_0(V) + U_2, \quad E(U_2| V) = 0,$$which amounts to estimation of the propensity score, i.e., the probability that a customer is exposed to the treatment provided her observed characteristics. Both predictions are then combined in the doubly robust score for the average treatment effect which is given by

$$\psi(W; \theta, \eta) := g(1,V) - g(0,V) + \frac{A (Y - g(1,V))}{m(V)} - \frac{(1 - A)(Y - g(0,V))}{1 - m(V)} - \theta.$$In [7]:

```
# TODO: Calculate unconditional average treatment effect
df[['A', 'Y']].groupby('A').mean()
```

Out[7]:

Y | |
---|---|

A | |

0 | 6.836141 |

1 | 7.953744 |

In [8]:

```
df[['A', 'Y']].groupby('A').mean().diff()
```

Out[8]:

Y | |
---|---|

A | |

0 | NaN |

1 | 1.117603 |

In [9]:

```
# TODO: Initialize Linear and Logistic Regression learners
linreg = make_pipeline(StandardScaler(), LinearRegression())
logreg_class = make_pipeline(StandardScaler(), LogisticRegression(penalty="none"))
```

In [10]:

```
# TODO: Initialize one ML learner of your choice
# Initialize Lasso learners
lasso = make_pipeline(StandardScaler(), LassoCV(cv=5, max_iter=20000))
lasso_class = make_pipeline(StandardScaler(),
LogisticRegressionCV(cv=5, penalty='l1', solver='liblinear',
Cs = 4, max_iter=1000))
```

In [11]:

```
# TODO: Initialize a second ML learner of your choice
# (proceed as long as you like)
# Initialize Random Forest learners
randomForest = RandomForestRegressor()
randomForest_class = RandomForestClassifier()
```

At this stage, we instantiate a causal model object of the class DoubleMLIRM. Provide the learners via parameters `ml_g`

and `ml_m`

. You can either stick with the default setting or change the parameters. The API documentation is available here.

**Hint**: Use numpy.random.seed to set a random seed prior to your initialization. This makes the sample splits of the different models comparable. Also try to use the same DML specifications in all models to attain some comparability.

In [12]:

```
# TODO: Initialize benchmark DoubleMLIRM model
np.random.seed(1234)
dml_irm_regression = dml.DoubleMLIRM(data_dml,
ml_g = linreg,
ml_m = logreg_class,
trimming_threshold = 0.025,
n_folds = 3,
n_rep = 3)
```

In [13]:

```
# TODO: Initialize a DoubleMLIRM model using the ML learners of your choice
np.random.seed(1234)
dml_irm_lasso = dml.DoubleMLIRM(data_dml,
ml_g = lasso,
ml_m = lasso_class,
trimming_threshold = 0.025,
n_folds = 3,
n_rep = 3)
```

Proceed with the models using the other ML learners.

In [14]:

```
np.random.seed(1234)
dml_irm_forest = dml.DoubleMLIRM(data_dml,
ml_g = randomForest,
ml_m = randomForest_class,
trimming_threshold = 0.025,
n_folds = 3,
n_rep = 3)
```

In [15]:

```
# TODO: Fit benchmark DoubleMLIRM model using the fit() method
# HINT: set parameter 'store_predictions = True' for later model diagnostics
dml_irm_regression.fit(store_predictions=True)
```

Out[15]:

<doubleml.double_ml_irm.DoubleMLIRM at 0x141afad2130>

In [16]:

```
def pred_acc_irm(DoubleML, prop):
"""
A function to calculate prediction accuracy values for every repetition
of a Double Machine Learning model using IRM, DoubleMLIRM
...
Parameters
----------
DoubleML : doubleml.double_ml_irm.DoubleMLIRM
The IRM Double Machine Learning model
prop : bool
Indication if RMSE values have to be computed for main regression or
log loss values for propensity score
"""
# export data and predictions of the DoubleML model
y = DoubleML._dml_data.y
d = DoubleML._dml_data.d
g0 = DoubleML.predictions.get('ml_g0')
g1 = DoubleML.predictions.get('ml_g1')
m = DoubleML.predictions.get('ml_m')
# dimensions of prediction array
h = g0.shape[0]
w = DoubleML.n_rep
# check whether treatment is binary
if np.isin(d, [0,1]).all() == False:
raise ValueError("Treatment must be a binary variable.")
# prepare array to store prediction accuracy measure values
pred_acc_array = np.zeros((w,))
# check whether to assess main regression or propensity score accuracy:
if prop == False:
# evaluate main regression accuracy
# export an array with correctly picked prediction values
export_pred_array = np.zeros((h, w))
for i in range(w):
for j in range(h):
if d[j] == 0:
export_pred_array[j,i] = g0[j,i]
else:
export_pred_array[j,i] = g1[j,i]
# fill array that contains rmse of each repetition
for i in range(w):
pred_acc_array[i] = mean_squared_error(y, export_pred_array[:,i], squared=False)
else:
# evaluate propensity score accuracy
# fill array that contains log loss of each repetition
for i in range(w):
pred_acc_array[i] = log_loss(d, m[:,i], eps=0.025)
return pred_acc_array
```

In [17]:

```
# TODO: Evaluate the predictive performance for `ml_g` and `ml_m` using the
# helper function `pred_acc_irm()`.
# calculate mean and standard deviation of repetition RMSE's to evaluate main regression accuracy
rmse_main_linlog_irm = pred_acc_irm(dml_irm_regression, prop=False)
rmse_main_linlog_irm_mean = np.mean(rmse_main_linlog_irm)
rmse_main_linlog_irm_std = np.std(rmse_main_linlog_irm)
# calculate mean and standard deviation of repetition log losses to evaluate propensity score accuracy
logloss_prop_linlog_irm = pred_acc_irm(dml_irm_regression, prop=True)
logloss_prop_linlog_irm_mean = np.mean(logloss_prop_linlog_irm)
logloss_prop_linlog_irm_std = np.std(logloss_prop_linlog_irm)
```

In [18]:

```
print("Mean of the main regression RMSE across 3 repetitions is", rmse_main_linlog_irm_mean)
print("Standard deviation of the RMSE is", rmse_main_linlog_irm_std)
```

In [19]:

```
print("Mean of the propensity score's log loss across 3 repetitions is", logloss_prop_linlog_irm_mean)
print("Standard deviation of log loss is", logloss_prop_linlog_irm_std)
```

In [20]:

```
def rep_propscore_plot(DoubleML):
"""
A function to create histograms as sublots for every repetition's propensity score density
of a Double Machine Learning model
...
Parameters
----------
DoubleML : doubleml
The Double Machine Learning model
"""
#export nuisance part from the DoubleML model
m = DoubleML.predictions.get('ml_m')
# dimensions of nuisance array
h = m.shape[0]
rep = DoubleML.n_rep
i = 0
# create histograms as subplots covering the propensity score densities of all repetitions
if rep > 1:
fig, ax = plt.subplots(1, rep, figsize=[20,4.8])
for i in range(rep):
ax[i].hist(np.reshape(m[:,i], h), range=[0,1], bins=25, density=False)
ax[i].set_title('repetition ' + str(i+1))
ax[i].set_xlabel("prop_score")
ax[i].set_ylabel("count")
else:
fig, ax = plt.subplots(figsize=[20,4.8])
ax.hist(np.reshape(m[:,i], h), range=[0,1], bins=25, density=False)
ax.hist(np.reshape(m[:,i], h), range=[0,1], bins=25, density=False)
ax.set_title('repetition ' + str(i+1))
ax.set_xlabel("prop_score")
ax.set_ylabel("count")
plt.show()
```

In [21]:

```
# (TODO): Summarize the propensity score estimates
rep_propscore_plot(dml_irm_regression)
```

In [22]:

```
# TODO: Fit the ML DoubleMLIRM model using the fit() method
dml_irm_lasso.fit(store_predictions=True)
```

Out[22]:

<doubleml.double_ml_irm.DoubleMLIRM at 0x141afad20a0>

In [23]:

```
# TODO: Evaluate the predictive performance for `ml_g` and `ml_m` using the
# helper function `pred_acc_irm()`.
# calculate mean and standard deviation of repetition RMSE's to evaluate main regression accuracy
rmse_main_lasso_irm = pred_acc_irm(dml_irm_lasso, prop=False)
rmse_main_lasso_irm_mean = np.mean(rmse_main_lasso_irm)
rmse_main_lasso_irm_std = np.std(rmse_main_lasso_irm)
# calculate mean and standard deviation of repetition log losses to evaluate propensity score accuracy
logloss_prop_lasso_irm = pred_acc_irm(dml_irm_lasso, prop=True)
logloss_prop_lasso_irm_mean = np.mean(logloss_prop_lasso_irm)
logloss_prop_lasso_irm_std = np.std(logloss_prop_lasso_irm)
```

In [24]:

```
print("Mean of the main regression RMSE across 3 repetitions is", rmse_main_lasso_irm_mean)
print("Standard deviation of RMSE is", rmse_main_lasso_irm_std)
```

In [25]:

```
print("Mean of the propensity score's log loss across 3 repetitions is", logloss_prop_lasso_irm_mean)
print("Standard deviation of log loss is", logloss_prop_lasso_irm_std)
```

In [26]:

```
# (TODO): Summarize the propensity score estimates
rep_propscore_plot(dml_irm_lasso)
```

Proceed with the models using the other ML learners.

In [27]:

```
# Initialize DoubleMLIRM model
np.random.seed(1234)
dml_irm_forest = dml.DoubleMLIRM(data_dml,
ml_g = randomForest,
ml_m = randomForest_class,
trimming_threshold = 0.025,
n_folds = 3,
n_rep = 3)
# Set nuisance-part specific parameters
dml_irm_forest.set_ml_nuisance_params('ml_g0', 'A', {
'max_features': 200, 'n_estimators': 250})
dml_irm_forest.set_ml_nuisance_params('ml_g1', 'A', {
'max_features': 200, 'n_estimators': 250})
dml_irm_forest.set_ml_nuisance_params('ml_m', 'A', {
'max_features': 200, 'n_estimators': 250})
dml_irm_forest.fit(store_predictions=True)
```

Out[27]:

<doubleml.double_ml_irm.DoubleMLIRM at 0x141b2da4f70>

In [28]:

```
# calculate mean and standard deviation of repetition RMSE's to evaluate main regression accuracy
rmse_main_forest_irm = pred_acc_irm(dml_irm_forest, prop=False)
rmse_main_forest_irm_mean = np.mean(rmse_main_forest_irm)
rmse_main_forest_irm_std = np.std(rmse_main_forest_irm)
# calculate mean and standard deviation of repetition log losses to evaluate propensity score accuracy
logloss_prop_forest_irm = pred_acc_irm(dml_irm_forest, prop=True)
logloss_prop_forest_irm_mean = np.mean(logloss_prop_forest_irm)
logloss_prop_forest_irm_std = np.std(logloss_prop_forest_irm)
```

In [29]:

```
print("Mean of the main regression RMSE across 3 repetitions is", rmse_main_forest_irm_mean)
print("Standard deviation of RMSE is", rmse_main_forest_irm_std)
```

In [30]:

```
print("Mean of the propensity score's log loss across 3 repetitions is", logloss_prop_forest_irm_mean)
print("Standard deviation of log loss is", logloss_prop_forest_irm_std)
```

In [31]:

```
rep_propscore_plot(dml_irm_forest)
```

Provide a brief summary of your estimation results, for example by creating a table or figure.

Summarize your results on the **coefficient estimate** for $\theta_0$ as well as the **standard errors** and / or **confidence intervals**, respectively. You can create a table or a figure illustrating your findings.

Try to answer the following questions:

- Can you reject the $H_0$ that the new add ($A$) has no effect on sales ($Y$) at common significance levels?
- How close is your estimate to the true value of $\theta_0=0.8$?

In [32]:

```
## TODO: After calling fit(), access the coefficient parameter,
## the standard error and confidence interval accessing the fiels
## `coef` and `summary`.
```

In [33]:

```
# TODO: Summarize your results
reg_summary = dml_irm_regression.summary
reg_summary
```

Out[33]:

coef | std err | t | P>|t| | 2.5 % | 97.5 % | |
---|---|---|---|---|---|---|

A | 1.057833 | 0.624814 | 1.693035 | 0.090449 | -0.166781 | 2.282446 |

In [34]:

```
## TODO: After calling fit(), access the coefficient parameter,
## the standard error and confidence interval accessing the fiels
## `coef` and `summary`.
```

In [35]:

```
# TODO: Summarize your results
lasso_summary = dml_irm_lasso.summary
lasso_summary
```

Out[35]:

coef | std err | t | P>|t| | 2.5 % | 97.5 % | |
---|---|---|---|---|---|---|

A | 0.855559 | 0.070928 | 12.062417 | 1.668137e-33 | 0.716544 | 0.994575 |

Proceed with the models using the other ML learners.

In [36]:

```
forest_summary = dml_irm_forest.summary
forest_summary
```

Out[36]:

coef | std err | t | P>|t| | 2.5 % | 97.5 % | |
---|---|---|---|---|---|---|

A | 0.893176 | 0.077421 | 11.53655 | 8.631791e-31 | 0.741433 | 1.044919 |

In [37]:

```
irm_summary = pd.concat((reg_summary, lasso_summary, forest_summary))
irm_summary.index = ['regression','lasso', 'forest']
irm_summary = irm_summary[['coef', 'std err', '2.5 %', '97.5 %']]
irm_summary.round(3)
```

Out[37]:

coef | std err | 2.5 % | 97.5 % | |
---|---|---|---|---|

regression | 1.058 | 0.625 | -0.167 | 2.282 |

lasso | 0.856 | 0.071 | 0.717 | 0.995 |

forest | 0.893 | 0.077 | 0.741 | 1.045 |

In [38]:

```
errors = np.full((2, irm_summary.shape[0]), np.nan)
errors[0, :] = irm_summary['coef'] - irm_summary['2.5 %']
errors[1, :] = irm_summary['97.5 %'] - irm_summary['coef']
plt.errorbar(irm_summary.index, irm_summary.coef, fmt='o', yerr=errors)
plt.axhline(y=0.8, color='r', linestyle='--', label="true value")
plt.title('Interactive Regression Model (IRM)')
plt.xlabel('ML method')
_ = plt.ylabel('Coefficients and 95%-CI')
```

**Notes and Acknowledgement**

We would like to thank the organizers of the ACIC 2019 Data Challenge for setting up this data challenge and making the numerous synthetic data examples publicly available. Although the data examples in the ACIC 2019 Data Challenge do not explicitly adress A/B testing, we put the data example here in this context to give a tractable example on the use of causal machine learning in practice. The parameters for the random forests and extreme gradient boosting learners have been tuned externally. The corresponding tuning notebook will be uploaded in the examples gallery in the future.

Chernozhukov, V., Chetverikov, D., Demirer, M., Duflo, E., Hansen, C., Newey, W. and Robins, J. (2018), Double/debiased machine learning for treatment and structural parameters. The Econometrics Journal, 21: C1-C68. doi:10.1111/ectj.12097.