{ "cells": [ { "attachments": {}, "cell_type": "markdown", "metadata": {}, "source": [ "# Python: First Stage and Causal Estimation\n", "\n", "This notebook illustrates the results from a simulation study. It shows insights on the relationship between the first stage ML predictive quality and the performance of the corresponding causal estimator. The data generating process (DGP) is based on [Belloni et al. (2013)](https://doi.org/10.1093/restud/rdt044). This DGP implements a high-dimensional sparse and linear model. We consider the case of $n=100$ observations and $p=200$ covariates. The covariates are correlated via a Toeplitz covariate structures. More details and the code are available from the GitHub repository .\n", "\n", "We employ a lasso learner for the first stage predictions, i.e., to predict the outcome variable $Y$ as based on $X$ in a [partially linear model](https://docs.doubleml.org/stable/guide/models.html#partially-linear-regression-model-plr) as well as predicting the (continuous) treatment variable $D$ based on $X$. As we employ a linear learner, this is equivalent to a linear model.\n", "\n", "We are interested to what extent, the choice of the lasso penalty affects the first-stage predictions, e.g, as measured by the root mean squared error for the corresponding nuisance parameter, and the combined loss. Moreover, we would like to investigate how the predictive quality for the first stage translates into estimation quality of the causal parameter $\\theta_0$.\n" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "import doubleml as dml\n", "from doubleml import DoubleMLData\n", "import numpy as np\n", "import pandas as pd\n", "from itertools import product\n", "\n", "from importlib.machinery import SourceFileLoader\n", "from sklearn.linear_model import Lasso\n", "from sklearn.linear_model import LassoCV\n", "\n", "import plotly.express as px\n", "import plotly.graph_objects as go\n", "\n", "# Load result files\n", "path_to_res = \"https://raw.githubusercontent.com/DoubleML/DML-Hyperparameter-Tuning-Replication/main/motivation_example_BCH/simulation_run/results/raw_res_manual_lasso_R_100_n_100_p_200_rho_0.6_R2d_0.6_R2y_0.6_design_1a.csv\"\n", "\n", "df_results = pd.read_csv(path_to_res, index_col=0)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Causal estimation vs. lasso penalty $\\lambda$ \n", "\n", "\n", "### Estimation quality vs. $\\lambda$\n", "\n", "We plot the mean squared error for the causal estimator $MSE(\\theta) = \\frac{1}{R} (\\hat{\\theta}_0 - \\theta_0)^2$ over a grid of $\\lambda=(\\lambda_{\\ell_0}, \\lambda_{m_0})$ values for the nuisance component $\\ell_0(X) = E[Y|X]$ (predict $Y$ based on $X$) and $m_0(X) = E[D|X]$ (predict $D$ based on $X$). $R$ is the number of repetitions.\n" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "df_agg = df_results.groupby(['alpha_ml_l', 'alpha_ml_m']).mean(numeric_only = True).reset_index()\n", "# Take logarithm of alphas\n", "df_agg['ln_alpha_ml_l'] = np.log(df_agg['alpha_ml_l'])\n", "df_agg['ln_alpha_ml_m'] = np.log(df_agg['alpha_ml_m'])\n", "\n", "measure_col = 'sq_error'\n", "\n", "# Choose columns for x-axis and y-axis (either alpha values or nuisance error)\n", "(x_col, y_col) = 'ln_alpha_ml_m', 'ln_alpha_ml_l'\n", "\n", "camera = dict(\n", " up=dict(x=0, y=0, z=1),\n", " center=dict(x=0, y=0, z=-0.28),\n", " eye=dict(x=-1.25, y=-1.25, z=1.25)\n", ")\n", "\n", "layout = go.Layout(autosize = False, width = 700, height = 700)\n", "\n", "\n", "this_df = df_agg\n", "val_list = this_df.loc[:, [x_col, y_col, measure_col]].pivot(index=x_col, columns=y_col, values=measure_col).values\n", "\n", "fig = go.Figure(data=[go.Surface(contours = {\"z\": {\"show\": True, \"start\": 0.95, \"end\": 1, \"size\": 0.05}},\n", " z=val_list, showscale = False)])\n", "fig.update_layout(\n", " scene = dict(xaxis_title=f\"-{x_col}\",\n", " yaxis_title=f\"-{y_col}\",\n", " zaxis_title=measure_col), scene_camera=camera, title=f'Mean squared error and choice of lasso penalty')\n", "\n", "fig.show()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Empirical coverage vs. lasso penalty $\\lambda$ " ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "df_agg = df_results.groupby(['alpha_ml_l', 'alpha_ml_m']).mean(numeric_only = True).reset_index()\n", "# Take logarithm of alphas\n", "df_agg['ln_alpha_ml_l'] = np.log(df_agg['alpha_ml_l'])\n", "df_agg['ln_alpha_ml_m'] = np.log(df_agg['alpha_ml_m'])\n", "\n", "measure_col = 'cover'\n", "\n", "# Choose columns for x-axis and y-axis (either alpha values or nuisance error)\n", "(x_col, y_col) = 'ln_alpha_ml_m', 'ln_alpha_ml_l'\n", "\n", "camera = dict(\n", " up=dict(x=0, y=0, z=1),\n", " center=dict(x=0, y=0, z=-0.28),\n", " eye=dict(x=-1.25, y=-1.25, z=1.25)\n", ")\n", "\n", "layout = go.Layout(autosize = False, width = 700, height = 700)\n", "\n", "\n", "this_df = df_agg\n", "val_list = this_df.loc[:, [x_col, y_col, measure_col]].pivot(index=x_col, columns=y_col, values=measure_col).values\n", "\n", "fig = go.Figure(data=[go.Surface(contours = {\"z\": {\"show\": True, \"start\": 0.95, \"end\": 1, \"size\": 0.05}},\n", " z=val_list, showscale = False)])\n", "fig.update_layout(\n", " scene = dict(xaxis_title=f\"-{x_col}\",\n", " yaxis_title=f\"-{y_col}\",\n", " zaxis_title=measure_col), scene_camera=camera, title=f'Empirical Coverage and choice of lasso penalty')\n", "\n", "fig.show()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Combined loss vs. lasso penalty $\\lambda$\n", "\n", "In the partially linear regression model and the *partialling-out* score, the combined loss can be defined as\n", "\n", "$$\\text{Comb. loss} := \\lVert \\hat{m}_0 - D \\rVert_{P,2} \\times \\big( \\lVert \\hat{m}_0 - D \\rVert_{P,2} + \\lVert \\hat{\\ell}_0 - Y\\rVert _{P,2}\\big),$$\n", "\n", "The combined loss tries to measure the rate of the bias (except for additive constants)\n", "\n", "$$\\lVert \\hat{m}_0 - m_0 \\rVert_{P,2} \\times \\big( \\lVert \\hat{m}_0 - m_0 \\rVert_{P,2} + \\lVert \\hat{\\ell}_0 - \\ell_0\\rVert _{P,2}\\big)$$\n", "\n", "which by the results in Chernozhukov et al. (2018), has to be estimated at a rate faster than $N^{-1/2}$." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "df_agg = df_results.groupby(['alpha_ml_l', 'alpha_ml_m']).mean(numeric_only = True).reset_index()\n", "# Take logarithm of alphas\n", "df_agg['ln_alpha_ml_l'] = np.log(df_agg['alpha_ml_l'])\n", "df_agg['ln_alpha_ml_m'] = np.log(df_agg['alpha_ml_m'])\n", "\n", "df_agg['combined_loss'] = (df_agg['nuis_rmse_ml_l'] + df_agg['nuis_rmse_ml_m']) * df_agg['nuis_rmse_ml_m']\n", "measure_col = 'combined_loss'\n", "\n", "# Choose columns for x-axis and y-axis (either alpha values or nuisance error)\n", "(x_col, y_col) = 'ln_alpha_ml_m', 'ln_alpha_ml_l'\n", "\n", "camera = dict(\n", " up=dict(x=0, y=0, z=1),\n", " center=dict(x=0, y=0, z=-0.28),\n", " eye=dict(x=-1.25, y=-1.25, z=1.25)\n", ")\n", "\n", "layout = go.Layout(autosize = False, width = 700, height = 700)\n", "\n", "\n", "this_df = df_agg\n", "val_list = this_df.loc[:, [x_col, y_col, measure_col]].pivot(index=x_col, columns=y_col, values=measure_col).values\n", "\n", "fig = go.Figure(data=[go.Surface(contours = {\"z\": {\"show\": True, \"start\": 0.95, \"end\": 1, \"size\": 0.05}},\n", " z=val_list, showscale = False)])\n", "fig.update_layout(\n", " scene = dict(xaxis_title=f\"-{x_col}\",\n", " yaxis_title=f\"-{y_col}\",\n", " zaxis_title=measure_col), scene_camera=camera, title=f'Combined loss and choice of lasso penalty')\n", "\n", "fig.show()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The results show that an appropriate choice for $\\lambda_{\\ell_0}$ and $\\lambda_{m_0}$ is crucial to obtain a *good* causal estimator for $\\theta_0$ in terms of the estimation error (i.e., $MSE(\\theta_0)$) and the empirical coverage. A lower combined loss is found to be associated with a lower $MSE(\\theta_0)$ and a higher empirical coverage." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "\n", "\n", "## References\n", "\n", "Alexandre Belloni, Victor Chernozhukov, and Christian Hansen. Inference on Treatment Effects after Selection among High-Dimensional Controls. *The Review of Economic Studies*, 81(2):608–650, 11 2013. ISSN 0034-6527. doi: 10.1093/restud/rdt044. URL https://doi.org/10.1093/restud/rdt044.\n", "\n", "\n", "Chernozhukov, V., Chetverikov, D., Demirer, M., Duflo, E., Hansen, C., Newey, W. and Robins, J. (2018), Double/debiased machine learning for treatment and structural parameters. *The Econometrics Journal*, 21: C1-C68, doi:10.1111/ectj.12097." ] } ], "metadata": { "kernelspec": { "display_name": "doubleml", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.11.5" }, "orig_nbformat": 4 }, "nbformat": 4, "nbformat_minor": 2 }