{ "cells": [ { "attachments": {}, "cell_type": "markdown", "metadata": { "collapsed": false }, "source": [ "# Python: Conditional Average Treatment Effects (CATEs) for IRM models\n", "\n", "In this simple example, we illustrate how the [DoubleML](https://docs.doubleml.org/stable/index.html) package can be used to estimate conditional average treatment effects with B-splines for one or two-dimensional effects in the [DoubleMLIRM](https://docs.doubleml.org/stable/guide/models.html#interactive-regression-model-irm) model." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "import numpy as np\n", "import pandas as pd\n", "import doubleml as dml\n", "\n", "from doubleml.datasets import make_heterogeneous_data" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Data\n", "\n", "We define a data generating process to create synthetic data to compare the estimates to the true effect. The data generating process is based on the Monte Carlo simulation from [Oprescu et al. (2019)](http://proceedings.mlr.press/v97/oprescu19a.html).\n", "\n", "The documentation of the data generating process can be found [here](https://docs.doubleml.org/dev/api/api.html#dataset-generators)." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## One-dimensional Example\n", "\n", "We start with an one-dimensional effect and create our training data. In this example the true effect depends only the first covariate $X_0$ and takes the following form\n", "\n", "$$\n", "\\theta_0(X) = \\exp(2X_0) + 3\\sin(4X_0).\n", "$$" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The generated dictionary also contains a callable with key `treatment_effect` to calculate the true treatment effect for new observations." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "np.random.seed(42)\n", "data_dict = make_heterogeneous_data(\n", " n_obs=2000,\n", " p=10,\n", " support_size=5,\n", " n_x=1,\n", " binary_treatment=True,\n", ")\n", "treatment_effect = data_dict['treatment_effect']\n", "data = data_dict['data']\n", "print(data.head())" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "First, define the ``DoubleMLData`` object." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "data_dml_base = dml.DoubleMLData(\n", " data,\n", " y_col='y',\n", " d_cols='d'\n", ")" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Next, define the learners for the nuisance functions and fit the [IRM Model](https://docs.doubleml.org/stable/api/generated/doubleml.DoubleMLIRM.html). Remark that linear learners would usually be optimal due to the data generating process." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# First stage estimation\n", "from sklearn.ensemble import RandomForestClassifier, RandomForestRegressor\n", "randomForest_reg = RandomForestRegressor(n_estimators=500)\n", "randomForest_class = RandomForestClassifier(n_estimators=500)\n", "\n", "np.random.seed(42)\n", "\n", "dml_irm = dml.DoubleMLIRM(data_dml_base,\n", " ml_g=randomForest_reg,\n", " ml_m=randomForest_class,\n", " trimming_threshold=0.05,\n", " n_folds=5)\n", "print(\"Training IRM Model\")\n", "dml_irm.fit()\n", "\n", "print(dml_irm.summary)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "To estimate the CATE, we rely on the best-linear-predictor of the linear score as in [Semenova et al. 
{ "cell_type": "markdown", "metadata": {}, "source": [ "To estimate the CATE, we rely on the best linear predictor of the linear score as in [Semenova et al. (2021)](https://doi.org/10.1093/ectj/utaa027). To approximate the target function $\\theta_0(x)$ with a linear form, we have to define a data frame of basis functions. Here, we rely on [patsy](https://patsy.readthedocs.io/en/latest/) to construct a suitable basis of [B-splines](https://en.wikipedia.org/wiki/B-spline)." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "import patsy\n", "design_matrix = patsy.dmatrix(\"bs(x, df=5, degree=2)\", {\"x\": data[\"X_0\"]})\n", "spline_basis = pd.DataFrame(design_matrix)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "To estimate the parameters of the CATE model, call the ``cate()`` method and supply the dataframe of basis elements." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "cate = dml_irm.cate(spline_basis)\n", "print(cate)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "To obtain the confidence intervals for the CATE, we have to call the ``confint()`` method and supply a dataframe of basis elements.\n", "This could be the same basis as for fitting the CATE model or a new basis, e.g. to evaluate the CATE model on a grid.\n", "Here, we will evaluate the CATE on a grid from 0.1 to 0.9 to plot the final results.\n", "Further, we construct uniform confidence intervals by setting the option ``joint=True`` and providing the number of bootstrap repetitions ``n_rep_boot``." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "new_data = {\"x\": np.linspace(0.1, 0.9, 100)}\n", "spline_grid = pd.DataFrame(patsy.build_design_matrices([design_matrix.design_info], new_data)[0])\n", "df_cate = cate.confint(spline_grid, level=0.95, joint=True, n_rep_boot=2000)\n", "print(df_cate)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Finally, we can plot our results and compare them with the true effect." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "from matplotlib import pyplot as plt\n", "plt.rcParams['figure.figsize'] = 10., 7.5\n", "\n", "df_cate['x'] = new_data['x']\n", "df_cate['true_effect'] = treatment_effect(new_data[\"x\"].reshape(-1, 1))\n", "fig, ax = plt.subplots()\n", "ax.plot(df_cate['x'],df_cate['effect'], label='Estimated Effect')\n", "ax.plot(df_cate['x'],df_cate['true_effect'], color=\"green\", label='True Effect')\n", "ax.fill_between(df_cate['x'], df_cate['2.5 %'], df_cate['97.5 %'], color='b', alpha=.3, label='Confidence Interval')\n", "\n", "plt.legend()\n", "plt.title('CATE')\n", "plt.xlabel('x')\n", "_ = plt.ylabel('Effect and 95%-CI')" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "If the effect is not one-dimensional, the estimate still corresponds to the projection of the true effect on the basis functions." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Two-Dimensional Example\n", "\n", "It is also possible to estimate multi-dimensional conditional effects. We will use a similar data generating process, but now the effect depends on the first two covariates $X_0$ and $X_1$ and takes the following form\n", "\n", "$$\n", "\\theta_0(X) = \\exp(2X_0) + 3\\sin(4X_1).\n", "$$" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "With the argument ``n_x=2``, we can specify the effect to be two-dimensional."
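, "\n", "\n", "As in the one-dimensional case, the returned dictionary also contains the callable ``treatment_effect``. Since the effect now depends on $X_0$ and $X_1$, the callable expects an array with two columns. A minimal sketch (assuming the data has been generated with ``n_x=2`` as in the next cell):\n", "\n", "```python\n", "# Sketch: evaluate the true effect at two example points (X_0, X_1),\n", "# assuming `treatment_effect` comes from make_heterogeneous_data with n_x=2\n", "example_points = np.array([[0.2, 0.4], [0.7, 0.1]])\n", "print(treatment_effect(example_points))\n", "```"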
] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "np.random.seed(42)\n", "data_dict = make_heterogeneous_data(\n", " n_obs=5000,\n", " p=10,\n", " support_size=5,\n", " n_x=2,\n", " binary_treatment=True,\n", ")\n", "treatment_effect = data_dict['treatment_effect']\n", "data = data_dict['data']\n", "print(data.head())" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "As univariate example estimate the [IRM Model](https://docs.doubleml.org/stable/api/generated/doubleml.DoubleMLIRM.html)." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "data_dml_base = dml.DoubleMLData(\n", " data,\n", " y_col='y',\n", " d_cols='d'\n", ")" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# First stage estimation\n", "from sklearn.ensemble import RandomForestClassifier, RandomForestRegressor\n", "randomForest_reg = RandomForestRegressor(n_estimators=500)\n", "randomForest_class = RandomForestClassifier(n_estimators=500)\n", "\n", "np.random.seed(42)\n", "\n", "dml_irm = dml.DoubleMLIRM(data_dml_base,\n", " ml_g=randomForest_reg,\n", " ml_m=randomForest_class,\n", " trimming_threshold=0.05,\n", " n_folds=5)\n", "print(\"Training IRM Model\")\n", "dml_irm.fit()\n", "\n", "print(dml_irm.summary)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "As above, we will rely on the [patsy](https://patsy.readthedocs.io/en/latest/) package to construct the basis elements.\n", "In the two-dimensional case, we will construct a tensor product of B-splines (for more information see [here](https://patsy.readthedocs.io/en/latest/spline-regression.html#tensor-product-smooths))." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "design_matrix = patsy.dmatrix(\"te(bs(x_0, df=7, degree=3), bs(x_1, df=7, degree=3))\", {\"x_0\": data[\"X_0\"], \"x_1\": data[\"X_1\"]})\n", "spline_basis = pd.DataFrame(design_matrix)\n", "\n", "cate = dml_irm.cate(spline_basis)\n", "print(cate)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Finally, we create a new grid to evaluate and plot the effects." 
] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "grid_size = 100\n", "x_0 = np.linspace(0.1, 0.9, grid_size)\n", "x_1 = np.linspace(0.1, 0.9, grid_size)\n", "x_0, x_1 = np.meshgrid(x_0, x_1)\n", "\n", "new_data = {\"x_0\": x_0.ravel(), \"x_1\": x_1.ravel()}" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "spline_grid = pd.DataFrame(patsy.build_design_matrices([design_matrix.design_info], new_data)[0])\n", "df_cate = cate.confint(spline_grid, joint=True, n_rep_boot=2000)\n", "print(df_cate)" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "import plotly.graph_objects as go\n", "\n", "grid_array = np.array(list(zip(x_0.ravel(), x_1.ravel())))\n", "true_effect = treatment_effect(grid_array).reshape(x_0.shape)\n", "effect = np.asarray(df_cate['effect']).reshape(x_0.shape)\n", "lower_bound = np.asarray(df_cate['2.5 %']).reshape(x_0.shape)\n", "upper_bound = np.asarray(df_cate['97.5 %']).reshape(x_0.shape)\n", "\n", "fig = go.Figure(data=[\n", " go.Surface(x=x_0,\n", " y=x_1,\n", " z=true_effect),\n", " go.Surface(x=x_0,\n", " y=x_1,\n", " z=upper_bound, showscale=False, opacity=0.4,colorscale='purp'),\n", " go.Surface(x=x_0,\n", " y=x_1,\n", " z=lower_bound, showscale=False, opacity=0.4,colorscale='purp'),\n", "])\n", "fig.update_traces(contours_z=dict(show=True, usecolormap=True,\n", " highlightcolor=\"limegreen\", project_z=True))\n", "\n", "fig.update_layout(scene = dict(\n", " xaxis_title='X_0',\n", " yaxis_title='X_1',\n", " zaxis_title='Effect'),\n", " width=700,\n", " margin=dict(r=20, b=10, l=10, t=10))\n", "\n", "fig.show()" ] } ], "metadata": { "kernelspec": { "display_name": "Python 3.10.6 64-bit", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.11.2" }, "vscode": { "interpreter": { "hash": "ac5e9af40c2048901fb5e070f7bbe2ca12417b0669992742e66f016e0e17b88e" } } }, "nbformat": 4, "nbformat_minor": 0 }