1. The basics of double/debiased machine learning#

In the following, we provide a brief summary of and motivation for the double machine learning (DML) framework and show how the corresponding methods provided by the DoubleML package can be applied. For details, we refer to Chernozhukov et al. (2018).

Note

Detailed notebooks containing the complete code for the examples can be found in the Example Gallery.

1.1. Data generating process#

We consider the following partially linear model

\[
\begin{aligned}
y_i &= \theta_0 d_i + g_0(x_i) + \zeta_i, & \zeta_i &\sim \mathcal{N}(0,1),\\
d_i &= m_0(x_i) + v_i, & v_i &\sim \mathcal{N}(0,1),
\end{aligned}
\]

with covariates \(x_i \sim \mathcal{N}(0, \Sigma)\), where \(\Sigma\) is a matrix with entries \(\Sigma_{kj} = 0.7^{|j-k|}\). We are interested in performing valid inference on the causal parameter \(\theta_0\). The true parameter \(\theta_0\) is set to \(0.5\) in our simulation experiment.

The nuisance functions are given by

\[
\begin{aligned}
m_0(x_i) &= x_{i,1} + \frac{1}{4} \frac{\exp(x_{i,3})}{1+\exp(x_{i,3})},\\
g_0(x_i) &= \frac{\exp(x_{i,1})}{1+\exp(x_{i,1})} + \frac{1}{4} x_{i,3}.
\end{aligned}
\]
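To make the formulas concrete, the following is a minimal NumPy sketch of this data generating process (our own illustration; the simulation study below relies on the package function make_plr_CCDDHNR2018 instead):

import numpy as np

# Hand-rolled sketch of the partially linear DGP described above.
def make_plr_by_hand(n_obs=500, dim_x=20, theta=0.5, seed=None):
    rng = np.random.default_rng(seed)
    # Covariates x_i ~ N(0, Sigma) with Sigma_kj = 0.7^{|j-k|}
    idx = np.arange(dim_x)
    sigma = 0.7 ** np.abs(idx[:, None] - idx[None, :])
    x = rng.multivariate_normal(np.zeros(dim_x), sigma, size=n_obs)
    # Nuisance functions (x_{i,1} and x_{i,3} are the first and third covariates)
    m_0 = x[:, 0] + 0.25 * np.exp(x[:, 2]) / (1 + np.exp(x[:, 2]))
    g_0 = np.exp(x[:, 0]) / (1 + np.exp(x[:, 0])) + 0.25 * x[:, 2]
    d = m_0 + rng.standard_normal(n_obs)                # treatment equation
    y = theta * d + g_0 + rng.standard_normal(n_obs)    # outcome equation
    return x, y, d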

The following code generates the n_rep simulated datasets for the simulation study.

Python:
import numpy as np
from doubleml.datasets import make_plr_CCDDHNR2018

np.random.seed(1234)
n_rep = 1000
n_obs = 500
n_vars = 20
alpha = 0.5

data = list()

for i_rep in range(n_rep):
    (x, y, d) = make_plr_CCDDHNR2018(alpha=alpha, n_obs=n_obs, dim_x=n_vars, return_type='array')
    data.append((x, y, d))
R:

library(DoubleML)
set.seed(1234)
n_rep = 1000
n_obs = 500
n_vars = 20
alpha = 0.5

data = list()
for (i_rep in seq_len(n_rep)) {
    data[[i_rep]] = make_plr_CCDDHNR2018(alpha=alpha, n_obs=n_obs, dim_x=n_vars,
                                        return_type="data.frame")
}

1.2. Regularization bias in simple ML-approaches#

Naive inference based on a direct application of machine learning methods to estimate the causal parameter \(\theta_0\) is generally invalid: machine learning methods introduce a bias that arises from regularization. A simple ML approach proceeds by randomly splitting the sample into two parts. On the auxiliary sample, indexed by \(i \in I^C\), the nuisance function \(g_0(X)\) is estimated with an ML method, for example a random forest learner. Given the estimate \(\hat{g}_0(X)\), the final estimate of \(\theta_0\) is obtained on the other half of the observations, indexed by \(i \in I\) (with \(n=N/2\)), as

\[\hat{\theta}_0 = \left(\frac{1}{n} \sum_{i\in I} D_i^2\right)^{-1} \frac{1}{n} \sum_{i\in I} D_i (Y_i - \hat{g}_0(X_i)).\]
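As an illustration, here is a minimal Python sketch of this naive split-sample estimator. The function name and learner choice are ours, and for concreteness \(\hat{g}_0\) is approximated by a random-forest regression of \(Y\) on \(X\) on the auxiliary sample; the notebooks in the Example Gallery set up the nuisance estimation in their own way.

import numpy as np
from sklearn.ensemble import RandomForestRegressor

def naive_theta(x, y, d, seed=0):
    # Randomly split the sample into a main half I and an auxiliary half I^C.
    n_obs = len(y)
    idx = np.random.default_rng(seed).permutation(n_obs)
    main_idx, aux_idx = idx[:n_obs // 2], idx[n_obs // 2:]
    # Crude proxy for g_0: regress Y on X on the auxiliary sample.
    g_hat = RandomForestRegressor().fit(x[aux_idx], y[aux_idx])
    # Plug the predictions into the simple estimating equation on I.
    resid = y[main_idx] - g_hat.predict(x[main_idx])
    return np.sum(d[main_idx] * resid) / np.sum(d[main_idx] ** 2)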

The following figure shows the distribution of the resulting estimates \(\hat{\theta}_0\) for the simple ML approach (the corresponding notebooks are available in the Example Gallery).

Distribution with non-orthogonal score

The regularization bias in the simple ML-approach leads to slow convergence of \(\hat{\theta}_0\):

\[|\sqrt{n} (\hat{\theta}_0 - \theta_0) | \rightarrow_{P} \infty\]

i.e., \(\hat{\theta}_0\) converges at a rate slower than \(1/\sqrt{n}\). The driving factor is the bias that arises from learning \(g_0\) with a random forest or any other ML technique. A heuristic illustration is given by

\[\sqrt{n}(\hat{\theta}_0 - \theta_0) = \underbrace{\left(\frac{1}{n} \sum_{i\in I} D_i^2\right)^{-1} \frac{1}{\sqrt{n}} \sum_{i\in I} D_i \zeta_i}_{=:a} + \underbrace{\left(\frac{1}{n} \sum_{i\in I} D_i^2\right)^{-1} \frac{1}{\sqrt{n}} \sum_{i\in I} D_i (g_0(X_i) - \hat{g}_0(X_i))}_{=:b}.\]
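This decomposition follows from substituting the model equation \(Y_i = \theta_0 D_i + g_0(X_i) + \zeta_i\) into the naive estimator (a short derivation added here for completeness):

\[
\begin{aligned}
\hat{\theta}_0
&= \left(\frac{1}{n} \sum_{i\in I} D_i^2\right)^{-1} \frac{1}{n} \sum_{i\in I} D_i \left(\theta_0 D_i + g_0(X_i) + \zeta_i - \hat{g}_0(X_i)\right)\\
&= \theta_0 + \left(\frac{1}{n} \sum_{i\in I} D_i^2\right)^{-1} \frac{1}{n} \sum_{i\in I} D_i \zeta_i + \left(\frac{1}{n} \sum_{i\in I} D_i^2\right)^{-1} \frac{1}{n} \sum_{i\in I} D_i \left(g_0(X_i) - \hat{g}_0(X_i)\right),
\end{aligned}
\]

and multiplying by \(\sqrt{n}\) yields the terms \(a\) and \(b\) above.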

The first term \(a\) is approximately Gaussian under mild conditions. The second term \(b\), however, carries the regularization bias: since \(\hat{g}_0\) converges to \(g_0\) at a rate slower than \(1/\sqrt{n}\), \(b\) diverges in general.

1.3. Overcoming regularization bias by orthogonalization#

To overcome the regularization bias, we can partial out the effect of \(X\) from \(D\) to obtain the orthogonalized regressor \(V = D - m(X)\), with empirical counterpart \(\hat{V}_i = D_i - \hat{m}(X_i)\), where \(\hat{m}\) is an ML estimate of \(m_0\). The final estimate then becomes

\[\check{\theta}_0 = \left(\frac{1}{n} \sum_{i\in I} \hat{V}_i D_i\right)^{-1} \frac{1}{n} \sum_{i\in I} \hat{V}_i (Y_i - \hat{g}_0(X_i)).\]
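A minimal sketch of this estimating equation in Python (our own illustration; \(\hat{g}_0\) and \(\hat{m}\) are taken as already-fitted prediction functions, however they were obtained):

import numpy as np

def orthogonalized_theta(x, y, d, g_hat, m_hat, idx):
    # g_hat and m_hat are callables returning predictions of g_0(X) and m_0(X);
    # idx selects the observations i in I used for the final estimate.
    v_hat = d[idx] - m_hat(x[idx])        # V_hat_i = D_i - m_hat(X_i)
    resid = y[idx] - g_hat(x[idx])        # Y_i - g_hat(X_i)
    return np.sum(v_hat * resid) / np.sum(v_hat * d[idx])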

The following figure shows the distribution of the resulting estimates \(\check{\theta}_0\) without sample-splitting (the corresponding notebooks are available in the Example Gallery).

Distribution without sample-splitting

If the nuisance models \(\hat{g}_0()\) and \(\hat{m}()\) are estimated on the whole dataset, which is also used for obtaining the final estimate \(\check{\theta}_0\), another bias, induced by overfitting, is observed.

1.4. Sample splitting to remove bias induced by overfitting#

Using sample splitting, i.e., estimating the nuisance models \(\hat{g}_0()\) and \(\hat{m}()\) on one part of the data (training data) and computing \(\check{\theta}_0\) on the other part of the data (test data), overcomes the bias induced by overfitting. We can exploit the benefits of cross-fitting by switching the roles of the training and test samples and combining the resulting estimates. Cross-fitting performs well empirically because the entire sample is used for estimation.
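A minimal sketch of two-fold cross-fitting for this estimator (a simplified illustration with random forests as nuisance learners and a plain regression of \(Y\) on \(X\) standing in for \(\hat{g}_0\); DoubleML handles the fold construction and aggregation internally):

import numpy as np
from sklearn.ensemble import RandomForestRegressor

def cross_fitted_theta(x, y, d, seed=0):
    # Split the sample into two folds, fit the nuisance models on one fold and
    # evaluate the orthogonalized estimating equation on the other, then swap.
    n_obs = len(y)
    idx = np.random.default_rng(seed).permutation(n_obs)
    folds = [idx[:n_obs // 2], idx[n_obs // 2:]]
    estimates = []
    for train, test in [(folds[0], folds[1]), (folds[1], folds[0])]:
        m_hat = RandomForestRegressor().fit(x[train], d[train])
        g_hat = RandomForestRegressor().fit(x[train], y[train])  # stand-in for g_0
        v_hat = d[test] - m_hat.predict(x[test])
        resid = y[test] - g_hat.predict(x[test])
        estimates.append(np.sum(v_hat * resid) / np.sum(v_hat * d[test]))
    # Average the two fold-wise estimates.
    return np.mean(estimates)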

The following figure shows the distribution of the resulting estimates \(\check{\theta}_0\) with orthogonal score and sample-splitting (the corresponding notebooks are available in the Example Gallery).

Distribution with orthogonal scores and sample-splitting

1.5. Double/debiased machine learning#

To illustrate the benefits of the auxiliary prediction step in the DML framework, we write the estimation error as

\[\sqrt{n}(\check{\theta}_0 - \theta_0) = a^* + b^* + c^*.\]

Chernozhukov et al. (2018) argue that:

The first term

\[a^* := (EV^2)^{-1} \frac{1}{\sqrt{n}} \sum_{i\in I} V_i \zeta_i\]

will be asymptotically normally distributed.

The second term

\[b^* := (EV^2)^{-1} \frac{1}{\sqrt{n}} \sum_{i\in I} (\hat{m}(X_i) - m(X_i)) (\hat{g}_0(X_i) - g_0(X_i))\]

vanishes asymptotically for many data generating processes.
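A short heuristic for why \(b^*\) vanishes (our own summary of the standard Cauchy-Schwarz argument in Chernozhukov et al. (2018)):

\[|b^*| \leq (EV^2)^{-1} \sqrt{n} \left(\frac{1}{n} \sum_{i\in I} (\hat{m}(X_i) - m(X_i))^2\right)^{1/2} \left(\frac{1}{n} \sum_{i\in I} (\hat{g}_0(X_i) - g_0(X_i))^2\right)^{1/2},\]

which tends to zero in probability as soon as the product of the two nuisance estimation errors is of smaller order than \(n^{-1/2}\), e.g. if both \(\hat{m}\) and \(\hat{g}_0\) converge at the rate \(o(n^{-1/4})\).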

The third term \(c^*\) vanishes in probability if sample splitting is applied. Finally, let us compare all distributions.

All distributions

The DoubleML package implements several orthogonal score functions and applies cross-fitting by default. The complete code is available in the Example Gallery.

Python:

from doubleml import DoubleMLData, DoubleMLPLR
from lightgbm import LGBMRegressor

theta_dml = np.full(n_rep, np.nan)
se_dml = np.full(n_rep, np.nan)

for i_rep in range(n_rep):
    (x, y, d) = data[i_rep]
    obj_dml_data = DoubleMLData.from_arrays(x, y, d)
    obj_dml_plr = DoubleMLPLR(
        obj_dml_data,
        ml_l=LGBMRegressor(n_estimators=300, learning_rate=0.1),
        ml_m=LGBMRegressor(n_estimators=200, learning_rate=0.1),
        ml_g=LGBMRegressor(n_estimators=300, learning_rate=0.1),
        n_folds=2,
        score='IV-type')
    obj_dml_plr.fit()
    theta_dml[i_rep] = obj_dml_plr.coef[0]
    se_dml[i_rep] = obj_dml_plr.se[0]
R:

library(mlr3)
library(mlr3learners)
# The nuisance learners ml_l, ml_m and ml_g must be mlr3 regression learners;
# random forests via ranger are used here for illustration (our choice, any
# regression learner can be plugged in).
ml_l = lrn("regr.ranger")
ml_m = lrn("regr.ranger")
ml_g = lrn("regr.ranger")

theta_dml = rep(NA, n_rep)
se_dml = rep(NA, n_rep)

for (i_rep in seq_len(n_rep)) {
    df = data[[i_rep]]
    obj_dml_data = double_ml_data_from_data_frame(df, y_col = "y", d_cols = "d")
    obj_dml_plr = DoubleMLPLR$new(
        obj_dml_data,
        ml_l,
        ml_m,
        ml_g,
        n_folds=2,
        score='IV-type')
    obj_dml_plr$fit()
    theta_dml[i_rep] = obj_dml_plr$coef
    se_dml[i_rep] = obj_dml_plr$se
}
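The stored estimates and standard errors can then be used to evaluate the simulation, for example by computing the empirical coverage of the 95% confidence intervals (a hypothetical follow-up in Python, not part of the original snippet; the histograms shown above are produced in the Example Gallery notebooks):

from scipy.stats import norm

# Fraction of repetitions in which the 95% confidence interval covers alpha = 0.5.
crit = norm.ppf(0.975)
coverage = np.mean(np.abs(theta_dml - alpha) <= crit * se_dml)
print(f"Empirical coverage of the 95% confidence intervals: {coverage:.3f}")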

1.6. Partialling out score#

Another debiased estimator, based on the partialling-out approach of Robinson (1988), is

\[\check{\theta}_0 = \left(\frac{1}{n} \sum_{i\in I} \hat{V}_i \hat{V}_i \right)^{-1} \frac{1}{n} \sum_{i\in I} \hat{V}_i (Y_i - \hat{\ell}_0(X_i)),\]

with \(\ell_0(X_i) = E(Y_i|X_i)\). All nuisance parameters for the estimator with score='partialling out' are conditional mean functions, which can be directly estimated using ML methods. This is a minor advantage over the estimator with score='IV-type'. In the following, we repeat the above analysis with score='partialling out'. In the first part of the analysis, we estimate \(\theta_0\) without sample splitting; again, we observe a bias from overfitting.
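Before turning to the DoubleML implementation, here is a minimal sketch of this partialling-out estimator (our own illustration, with random forests fitting the two conditional means):

import numpy as np
from sklearn.ensemble import RandomForestRegressor

def partialling_out_theta(x, y, d, train_idx, eval_idx):
    # Fit the conditional means ell_0(X) = E[Y|X] and m_0(X) = E[D|X] on the
    # training indices; passing eval_idx == train_idx reproduces the
    # "no sample-splitting" variant discussed next.
    ell_hat = RandomForestRegressor().fit(x[train_idx], y[train_idx])
    m_hat = RandomForestRegressor().fit(x[train_idx], d[train_idx])
    v_hat = d[eval_idx] - m_hat.predict(x[eval_idx])       # D - m_hat(X)
    resid = y[eval_idx] - ell_hat.predict(x[eval_idx])     # Y - ell_hat(X)
    return np.sum(v_hat * resid) / np.sum(v_hat ** 2)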

The following figure shows the distribution of the resulting estimates \(\check{\theta}_0\) without sample-splitting (the corresponding notebooks are available in the Example Gallery).

Distribution without sample splitting

Using sample splitting again overcomes the bias induced by overfitting.

Distribution with orthogonal score and sample-splitting

Again, the implementation automatically applies cross-fitting. The complete code is available in the Example Gallery.

Python:

theta_dml_po = np.full(n_rep, np.nan)
se_dml_po = np.full(n_rep, np.nan)

for i_rep in range(n_rep):
    (x, y, d) = data[i_rep]
    obj_dml_data = DoubleMLData.from_arrays(x, y, d)
    obj_dml_plr = DoubleMLPLR(
        obj_dml_data,
        ml_l=LGBMRegressor(n_estimators=300, learning_rate=0.1),
        ml_m=LGBMRegressor(n_estimators=200, learning_rate=0.1),
        n_folds=2,
        score='partialling out')
    obj_dml_plr.fit()
    theta_dml_po[i_rep] = obj_dml_plr.coef[0]
    se_dml_po[i_rep] = obj_dml_plr.se[0]
R:

theta_dml_po = rep(NA, n_rep)
se_dml_po = rep(NA, n_rep)

for (i_rep in seq_len(n_rep)) {
    df = data[[i_rep]]
    obj_dml_data = double_ml_data_from_data_frame(df, y_col = "y", d_cols = "d")
    obj_dml_plr = DoubleMLPLR$new(
        obj_dml_data,
        ml_l,
        ml_m,
        n_folds=2,
        score='partialling out')
    obj_dml_plr$fit()
    theta_dml_po[i_rep] = obj_dml_plr$coef
    se_dml_po[i_rep] = obj_dml_plr$se
}

Finally, let us compare all distributions.

All distributions with partialling-out score

1.7. References#

Chernozhukov, V., Chetverikov, D., Demirer, M., Duflo, E., Hansen, C., Newey, W. and Robins, J. (2018), Double/debiased machine learning for treatment and structural parameters. The Econometrics Journal, 21: C1-C68. doi:10.1111/ectj.12097.

Robinson, P. M. (1988). Root-N-consistent semiparametric regression. Econometrica, 56, 931-954. doi:10.2307/1912705.