{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "# Python: Group Average Treatment Effects (GATEs) for PLR models\n", "\n", "In this simple example, we illustrate how the [DoubleML](https://docs.doubleml.org/stable/index.html) package can be used to estimate group average treatment effects in the [DoubleMLPLR](https://docs.doubleml.org/stable/guide/models.html#partially-linear-regression-model-plr) model." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "import numpy as np\n", "import pandas as pd\n", "import doubleml as dml\n", "\n", "from doubleml.datasets import make_heterogeneous_data" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Data\n", "\n", "We define a data generating process to create synthetic data to compare the estimates to the true effect. The data generating process is based on the Monte Carlo simulation from [Oprescu et al. (2019)](http://proceedings.mlr.press/v97/oprescu19a.html).\n", "\n", "The documentation of the data generating process can be found [here](https://docs.doubleml.org/dev/api/api.html#dataset-generators). In this example the true effect depends only the first covariate $X_0$ and takes the following form\n", "\n", "$$\n", "\\theta_0(X) = \\exp(2X_0) + 3\\sin(4X_0).\n", "$$" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "np.random.seed(42)\n", "data_dict = make_heterogeneous_data(\n", " n_obs=500,\n", " p=10,\n", " support_size=5,\n", " n_x=1,\n", ")\n", "data = data_dict['data']\n", "print(data.head())" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The generated dictionary also contains the true individual effects saved in the key `effects`." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "ite = data_dict['effects']\n", "print(ite[:5])" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The goal is to estimate the average treatment effect for different groups based on the covariate $X_0$. The groups can be specified as [DataFrame](https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.html) with boolean columns. We consider the following three groups" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "groups = pd.DataFrame(\n", " np.column_stack((data['X_0'] <= 0.3,\n", " (data['X_0'] > 0.3) & (data['X_0'] <= 0.7),\n", " data['X_0'] > 0.7)),\n", " columns=['Group 1', 'Group 2', 'Group 3'])\n", "print(groups.head())" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The true effects (still including sampling uncertainty) are given by" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "true_effects = [ite[groups[group]].mean() for group in groups.columns]\n", "print(true_effects)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Partially Linear Regression Model (PLR)\n", "The first step is to fit a [DoubleML PLR Model](https://docs.doubleml.org/stable/guide/models.html#partially-linear-regression-model-plr) to the data." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "data_dml_base = dml.DoubleMLData(\n", " data,\n", " y_col='y',\n", " d_cols='d'\n", ")" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# First stage estimation\n", "from sklearn.ensemble import RandomForestClassifier, RandomForestRegressor\n", "ml_l = RandomForestRegressor(n_estimators=500)\n", "ml_m = RandomForestRegressor(n_estimators=500)\n", "\n", "np.random.seed(42)\n", "\n", "dml_plr = dml.DoubleMLPLR(data_dml_base,\n", " ml_l=ml_l,\n", " ml_m=ml_m,\n", " n_folds=5)\n", "print(\"Training PLR Model\")\n", "dml_plr.fit()\n", "\n", "print(dml_plr.summary)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Group Average Treatment Effects (GATEs)\n", "To calculate GATEs just call the ``gate()`` method and supply the DataFrame with the group definitions and the ``level`` (with default of ``0.95``). Remark that for straightforward interpretation of the GATEs the groups should be mutually exclusive." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "gate = dml_plr.gate(groups=groups)\n", "print(gate.confint(level=0.95))" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The confidence intervals above are point-wise, but by setting the option ``joint`` and providing a number of bootstrap repetitions ``n_rep_boot``." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "ci = gate.confint(level=0.95, joint=True, n_rep_boot=1000)\n", "print(ci)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Finally, let us plot the estimates together with the true effect within each group.\n" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "import matplotlib.pyplot as plt\n", "plt.rcParams['figure.figsize'] = 10., 7.5\n", "\n", "errors = np.full((2, ci.shape[0]), np.nan)\n", "errors[0, :] = ci['effect'] - ci['2.5 %']\n", "errors[1, :] = ci['97.5 %'] - ci['effect']\n", "\n", "plt.errorbar(ci.index, ci.effect, fmt='o', yerr=errors, label='Estimated Effect (with joint CI)')\n", "\n", "#add true effect\n", "ax = plt.subplot(1, 1, 1)\n", "ax.scatter(x=['Group 1', 'Group 2', 'Group 3'], y=true_effects, c='red', label='True Effect')\n", "\n", "plt.title('GATEs')\n", "plt.xlabel('Groups')\n", "plt.legend()\n", "_ = plt.ylabel('Effect and 95%-CI')" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "It is also possible to supply disjoint groups as a single vector (still as a data frame). Remark the slightly different name." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "groups = pd.DataFrame(columns=['Group'], index=range(data['X_0'].shape[0]), dtype=str)\n", "for i, x_i in enumerate(data['X_0']):\n", " if x_i <= 0.3:\n", " groups['Group'][i] = '1'\n", " elif (x_i > 0.3) & (x_i <= 0.7):\n", " groups['Group'][i] = '2'\n", " else:\n", " groups['Group'][i] = '3'\n", "\n", "print(groups.head())" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "This time lets consider pointwise confidence intervals." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "gate = dml_plr.gate(groups=groups)\n", "ci = gate.confint()\n", "print(ci)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The coefficients of the best linear predictor can be seen via the summary (the values can be accessed through the underlying model ``.blp_model``)." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "print(gate.summary)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Remark that the confidence intervals in the summary are slightly smaller, since they are not based on the White's heteroskedasticity robus standard errors." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "errors = np.full((2, ci.shape[0]), np.nan)\n", "errors[0, :] = ci['effect'] - ci['2.5 %']\n", "errors[1, :] = ci['97.5 %'] - ci['effect']\n", "\n", "#add true effect\n", "ax = plt.subplot(1, 1, 1)\n", "ax.scatter(x=['Group_1', 'Group_2', 'Group_3'], y=true_effects, c='red', label='True Effect')\n", "\n", "plt.errorbar(ci.index, ci.effect, fmt='o', yerr=errors)\n", "plt.title('GATEs')\n", "plt.xlabel('Groups')\n", "_ = plt.ylabel('Effect and 95%-CI')" ] } ], "metadata": { "kernelspec": { "display_name": "dml_dev", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.11.2" } }, "nbformat": 4, "nbformat_minor": 2 }