Note
-
Download Jupyter notebook:
https://docs.doubleml.org/stable/examples/py_double_ml_firststage.ipynb.
Python: First Stage and Causal Estimation#
This notebook illustrates the results from a simulation study. It shows insights on the relationship between the first stage ML predictive quality and the performance of the corresponding causal estimator. The data generating process (DGP) is based on Belloni et al. (2013). This DGP implements a high-dimensional sparse and linear model. We consider the case of \(n=100\) observations and \(p=200\) covariates. The covariates are correlated via a Toeplitz covariate structures. More details and the code are available from the GitHub repository DoubleML/DML-Hyperparameter-Tuning-Replication.
We employ a lasso learner for the first stage predictions, i.e., to predict the outcome variable \(Y\) as based on \(X\) in a partially linear model as well as predicting the (continuous) treatment variable \(D\) based on \(X\). As we employ a linear learner, this is equivalent to a linear model.
We are interested to what extent, the choice of the lasso penalty affects the first-stage predictions, e.g, as measured by the root mean squared error for the corresponding nuisance parameter, and the combined loss. Moreover, we would like to investigate how the predictive quality for the first stage translates into estimation quality of the causal parameter \(\theta_0\).
[1]:
import doubleml as dml
from doubleml import DoubleMLData
import numpy as np
import pandas as pd
from itertools import product
from importlib.machinery import SourceFileLoader
from sklearn.linear_model import Lasso
from sklearn.linear_model import LassoCV
import plotly.express as px
import plotly.graph_objects as go
# Load result files
path_to_res = "https://raw.githubusercontent.com/DoubleML/DML-Hyperparameter-Tuning-Replication/main/motivation_example_BCH/simulation_run/results/raw_res_manual_lasso_R_100_n_100_p_200_rho_0.6_R2d_0.6_R2y_0.6_design_1a.csv"
df_results = pd.read_csv(path_to_res, index_col=0)
Causal estimation vs. lasso penalty \(\lambda\)#
Estimation quality vs. \(\lambda\)#
We plot the mean squared error for the causal estimator \(MSE(\theta) = \frac{1}{R} (\hat{\theta}_0 - \theta_0)^2\) over a grid of \(\lambda=(\lambda_{\ell_0}, \lambda_{m_0})\) values for the nuisance component \(\ell_0(X) = E[Y|X]\) (predict \(Y\) based on \(X\)) and \(m_0(X) = E[D|X]\) (predict \(D\) based on \(X\)). \(R\) is the number of repetitions.
[2]:
df_agg = df_results.groupby(['alpha_ml_l', 'alpha_ml_m']).mean(numeric_only = True).reset_index()
# Take logarithm of alphas
df_agg['ln_alpha_ml_l'] = np.log(df_agg['alpha_ml_l'])
df_agg['ln_alpha_ml_m'] = np.log(df_agg['alpha_ml_m'])
measure_col = 'sq_error'
# Choose columns for x-axis and y-axis (either alpha values or nuisance error)
(x_col, y_col) = 'ln_alpha_ml_m', 'ln_alpha_ml_l'
camera = dict(
up=dict(x=0, y=0, z=1),
center=dict(x=0, y=0, z=-0.28),
eye=dict(x=-1.25, y=-1.25, z=1.25)
)
layout = go.Layout(autosize = False, width = 700, height = 700)
this_df = df_agg
val_list = this_df.loc[:, [x_col, y_col, measure_col]].pivot(index=x_col, columns=y_col, values=measure_col).values
fig = go.Figure(data=[go.Surface(contours = {"z": {"show": True, "start": 0.95, "end": 1, "size": 0.05}},
z=val_list, showscale = False)])
fig.update_layout(
scene = dict(xaxis_title=f"-{x_col}",
yaxis_title=f"-{y_col}",
zaxis_title=measure_col), scene_camera=camera, title=f'Mean squared error and choice of lasso penalty')
fig.show()
Empirical coverage vs. lasso penalty \(\lambda\)#
[3]:
df_agg = df_results.groupby(['alpha_ml_l', 'alpha_ml_m']).mean(numeric_only = True).reset_index()
# Take logarithm of alphas
df_agg['ln_alpha_ml_l'] = np.log(df_agg['alpha_ml_l'])
df_agg['ln_alpha_ml_m'] = np.log(df_agg['alpha_ml_m'])
measure_col = 'cover'
# Choose columns for x-axis and y-axis (either alpha values or nuisance error)
(x_col, y_col) = 'ln_alpha_ml_m', 'ln_alpha_ml_l'
camera = dict(
up=dict(x=0, y=0, z=1),
center=dict(x=0, y=0, z=-0.28),
eye=dict(x=-1.25, y=-1.25, z=1.25)
)
layout = go.Layout(autosize = False, width = 700, height = 700)
this_df = df_agg
val_list = this_df.loc[:, [x_col, y_col, measure_col]].pivot(index=x_col, columns=y_col, values=measure_col).values
fig = go.Figure(data=[go.Surface(contours = {"z": {"show": True, "start": 0.95, "end": 1, "size": 0.05}},
z=val_list, showscale = False)])
fig.update_layout(
scene = dict(xaxis_title=f"-{x_col}",
yaxis_title=f"-{y_col}",
zaxis_title=measure_col), scene_camera=camera, title=f'Empirical Coverage and choice of lasso penalty')
fig.show()
Combined loss vs. lasso penalty \(\lambda\)#
In the partially linear regression model and the partialling-out score, the combined loss can be defined as
The combined loss tries to measure the rate of the bias (except for additive constants)
which by the results in Chernozhukov et al. (2018), has to be estimated at a rate faster than \(N^{-1/2}\).
[4]:
df_agg = df_results.groupby(['alpha_ml_l', 'alpha_ml_m']).mean(numeric_only = True).reset_index()
# Take logarithm of alphas
df_agg['ln_alpha_ml_l'] = np.log(df_agg['alpha_ml_l'])
df_agg['ln_alpha_ml_m'] = np.log(df_agg['alpha_ml_m'])
df_agg['combined_loss'] = (df_agg['nuis_rmse_ml_l'] + df_agg['nuis_rmse_ml_m']) * df_agg['nuis_rmse_ml_m']
measure_col = 'combined_loss'
# Choose columns for x-axis and y-axis (either alpha values or nuisance error)
(x_col, y_col) = 'ln_alpha_ml_m', 'ln_alpha_ml_l'
camera = dict(
up=dict(x=0, y=0, z=1),
center=dict(x=0, y=0, z=-0.28),
eye=dict(x=-1.25, y=-1.25, z=1.25)
)
layout = go.Layout(autosize = False, width = 700, height = 700)
this_df = df_agg
val_list = this_df.loc[:, [x_col, y_col, measure_col]].pivot(index=x_col, columns=y_col, values=measure_col).values
fig = go.Figure(data=[go.Surface(contours = {"z": {"show": True, "start": 0.95, "end": 1, "size": 0.05}},
z=val_list, showscale = False)])
fig.update_layout(
scene = dict(xaxis_title=f"-{x_col}",
yaxis_title=f"-{y_col}",
zaxis_title=measure_col), scene_camera=camera, title=f'Combined loss and choice of lasso penalty')
fig.show()
The results show that an appropriate choice for \(\lambda_{\ell_0}\) and \(\lambda_{m_0}\) is crucial to obtain a good causal estimator for \(\theta_0\) in terms of the estimation error (i.e., \(MSE(\theta_0)\)) and the empirical coverage. A lower combined loss is found to be associated with a lower \(MSE(\theta_0)\) and a higher empirical coverage.
References#
Alexandre Belloni, Victor Chernozhukov, and Christian Hansen. Inference on Treatment Effects after Selection among High-Dimensional Controls. The Review of Economic Studies, 81(2):608–650, 11 2013. ISSN 0034-6527. doi: 10.1093/restud/rdt044. URL https://doi.org/10.1093/restud/rdt044.
Chernozhukov, V., Chetverikov, D., Demirer, M., Duflo, E., Hansen, C., Newey, W. and Robins, J. (2018), Double/debiased machine learning for treatment and structural parameters. The Econometrics Journal, 21: C1-C68, doi:10.1111/ectj.12097.