{
"cells": [
{
"attachments": {},
"cell_type": "markdown",
"metadata": {},
"source": [
"# Python: First Stage and Causal Estimation\n",
"\n",
"This notebook illustrates the results of a simulation study. It examines the relationship between the predictive quality of the first-stage ML learners and the performance of the corresponding causal estimator. The data generating process (DGP) is based on [Belloni et al. (2013)](https://doi.org/10.1093/restud/rdt044) and implements a high-dimensional, sparse, linear model. We consider the case of $n=100$ observations and $p=200$ covariates. The covariates are correlated via a Toeplitz covariance structure. More details and the code are available from the [GitHub repository](https://github.com/DoubleML/DML-Hyperparameter-Tuning-Replication).\n",
"\n",
"We employ a lasso learner for the first-stage predictions, i.e., to predict the outcome variable $Y$ based on $X$ in a [partially linear model](https://docs.doubleml.org/stable/guide/models.html#partially-linear-regression-model-plr), as well as to predict the (continuous) treatment variable $D$ based on $X$. Because the lasso is a linear learner, the partially linear model is equivalent to a linear model here.\n",
"\n",
"We are interested in the extent to which the choice of the lasso penalty affects the first-stage predictions, e.g., as measured by the root mean squared error for the corresponding nuisance parameter and by the combined loss. Moreover, we would like to investigate how the predictive quality of the first stage translates into the estimation quality of the causal parameter $\\theta_0$.\n"
]
},
{
"cell_type": "code",
"execution_count": 1,
"metadata": {
"execution": {
"iopub.execute_input": "2024-06-18T07:04:48.182384Z",
"iopub.status.busy": "2024-06-18T07:04:48.182190Z",
"iopub.status.idle": "2024-06-18T07:04:49.418443Z",
"shell.execute_reply": "2024-06-18T07:04:49.417840Z"
}
},
"outputs": [],
"source": [
"import doubleml as dml\n",
"from doubleml import DoubleMLData\n",
"import numpy as np\n",
"import pandas as pd\n",
"from itertools import product\n",
"\n",
"from importlib.machinery import SourceFileLoader\n",
"from sklearn.linear_model import Lasso\n",
"from sklearn.linear_model import LassoCV\n",
"\n",
"import plotly.express as px\n",
"import plotly.graph_objects as go\n",
"\n",
"# Load result files\n",
"path_to_res = \"https://raw.githubusercontent.com/DoubleML/DML-Hyperparameter-Tuning-Replication/main/motivation_example_BCH/simulation_run/results/raw_res_manual_lasso_R_100_n_100_p_200_rho_0.6_R2d_0.6_R2y_0.6_design_1a.csv\"\n",
"\n",
"df_results = pd.read_csv(path_to_res, index_col=0)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Causal estimation vs. lasso penalty $\\lambda$\n",
"\n",
"\n",
"### Estimation quality vs. $\\lambda$\n",
"\n",
"We plot the mean squared error of the causal estimator, $MSE(\\hat{\\theta}_0) = \\frac{1}{R} \\sum_{r=1}^{R} (\\hat{\\theta}_{0,r} - \\theta_0)^2$, over a grid of $\\lambda=(\\lambda_{\\ell_0}, \\lambda_{m_0})$ values for the nuisance components $\\ell_0(X) = E[Y|X]$ (predict $Y$ based on $X$) and $m_0(X) = E[D|X]$ (predict $D$ based on $X$). Here $\\hat{\\theta}_{0,r}$ is the estimate obtained in repetition $r$ and $R$ is the number of repetitions.\n"
]
},
{
"cell_type": "code",
"execution_count": 2,
"metadata": {
"execution": {
"iopub.execute_input": "2024-06-18T07:04:49.421438Z",
"iopub.status.busy": "2024-06-18T07:04:49.421228Z",
"iopub.status.idle": "2024-06-18T07:04:49.554673Z",
"shell.execute_reply": "2024-06-18T07:04:49.554065Z"
}
},
"outputs": [
{
"data": {
"text/html": [
"