{
 "cells": [
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# Python: Group Average Treatment Effects (GATEs) for PLR models\n",
    "\n",
    "In this simple example, we illustrate how the [DoubleML](https://docs.doubleml.org/stable/index.html) package can be used to estimate group average treatment effects in the [DoubleMLPLR](https://docs.doubleml.org/stable/guide/models.html#partially-linear-regression-model-plr) model."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "import numpy as np\n",
    "import pandas as pd\n",
    "import doubleml as dml\n",
    "\n",
    "from doubleml.irm.datasets import make_heterogeneous_data"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Data\n",
    "\n",
    "We define a data generating process to create synthetic data to compare the estimates to the true effect. The data generating process is based on the Monte Carlo simulation from [Oprescu et al. (2019)](http://proceedings.mlr.press/v97/oprescu19a.html).\n",
    "\n",
    "The documentation of the data generating process can be found [here](https://docs.doubleml.org/stable/api/datasets.html#dataset-generators). In this example the true effect depends only the first covariate $X_0$ and takes the following form\n",
    "\n",
    "$$\n",
    "\\theta_0(X) = \\exp(2X_0) + 3\\sin(4X_0).\n",
    "$$"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "np.random.seed(42)\n",
    "data_dict = make_heterogeneous_data(\n",
    "    n_obs=500,\n",
    "    p=10,\n",
    "    support_size=5,\n",
    "    n_x=1,\n",
    ")\n",
    "data = data_dict['data']\n",
    "print(data.head())"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "The generated dictionary also contains the true individual effects saved in the key `effects`."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "ite = data_dict['effects']\n",
    "print(ite[:5])"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "The goal is to estimate the average treatment effect for different groups based on the covariate $X_0$. The groups can be specified as [DataFrame](https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.html) with boolean columns. We consider the following three groups"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "groups = pd.DataFrame(\n",
    "    np.column_stack((data['X_0'] <= 0.3,\n",
    "                     (data['X_0'] > 0.3) & (data['X_0'] <= 0.7),\n",
    "                     data['X_0'] > 0.7)),\n",
    "    columns=['Group 1', 'Group 2', 'Group 3'])\n",
    "print(groups.head())"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "The true effects (still including sampling uncertainty) are given by"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "true_effects = [ite[groups[group]].mean() for group in groups.columns]\n",
    "print(true_effects)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Partially Linear Regression Model (PLR)\n",
    "The first step is to fit a [DoubleML PLR Model](https://docs.doubleml.org/stable/guide/models.html#partially-linear-regression-model-plr) to the data."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "data_dml_base = dml.DoubleMLData(\n",
    "    data,\n",
    "    y_col='y',\n",
    "    d_cols='d'\n",
    ")"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# First stage estimation\n",
    "from sklearn.ensemble import RandomForestClassifier, RandomForestRegressor\n",
    "ml_l = RandomForestRegressor(n_estimators=500)\n",
    "ml_m = RandomForestRegressor(n_estimators=500)\n",
    "\n",
    "np.random.seed(42)\n",
    "\n",
    "dml_plr = dml.DoubleMLPLR(data_dml_base,\n",
    "                          ml_l=ml_l,\n",
    "                          ml_m=ml_m,\n",
    "                          n_folds=5)\n",
    "print(\"Training PLR Model\")\n",
    "dml_plr.fit()\n",
    "\n",
    "print(dml_plr.summary)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Group Average Treatment Effects (GATEs)\n",
    "To calculate GATEs just call the ``gate()`` method and supply the DataFrame with the group definitions and the ``level`` (with default of ``0.95``). Remark that for straightforward interpretation of the GATEs the groups should be mutually exclusive."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "gate = dml_plr.gate(groups=groups)\n",
    "print(gate.confint(level=0.95))"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "The confidence intervals above are point-wise, but by setting the option ``joint`` and providing a number of bootstrap repetitions ``n_rep_boot``."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "ci = gate.confint(level=0.95, joint=True, n_rep_boot=1000)\n",
    "print(ci)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Finally, let us plot the estimates together with the true effect within each group.\n"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "import matplotlib.pyplot as plt\n",
    "plt.rcParams['figure.figsize'] = 10., 7.5\n",
    "\n",
    "errors = np.full((2, ci.shape[0]), np.nan)\n",
    "errors[0, :] = ci['effect'] - ci['2.5 %']\n",
    "errors[1, :] = ci['97.5 %'] - ci['effect']\n",
    "\n",
    "plt.errorbar(ci.index, ci.effect, fmt='o', yerr=errors, label='Estimated Effect (with joint CI)')\n",
    "\n",
    "#add true effect\n",
    "ax = plt.subplot(1, 1, 1)\n",
    "ax.scatter(x=['Group 1', 'Group 2', 'Group 3'], y=true_effects, c='red', label='True Effect')\n",
    "\n",
    "plt.title('GATEs')\n",
    "plt.xlabel('Groups')\n",
    "plt.legend()\n",
    "_ = plt.ylabel('Effect and 95%-CI')"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "It is also possible to supply disjoint groups as a single vector (still as a data frame). Remark the slightly different name."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "groups =  pd.DataFrame(columns=['Group'], index=range(data['X_0'].shape[0]), dtype=str)\n",
    "for i, x_i in enumerate(data['X_0']):\n",
    "    if x_i <= 0.3:\n",
    "         groups.loc[i, 'Group'] = '1'\n",
    "    elif (x_i > 0.3) & (x_i <= 0.7):\n",
    "         groups.loc[i, 'Group'] = '2'\n",
    "    else:\n",
    "         groups.loc[i, 'Group'] = '3'\n",
    "\n",
    "print(groups.head())"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "This time lets consider pointwise confidence intervals."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "gate = dml_plr.gate(groups=groups)\n",
    "ci = gate.confint()\n",
    "print(ci)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "The coefficients of the best linear predictor can be seen via the summary (the values can be accessed through the underlying model ``.blp_model``)."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "print(gate.summary)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Remark that the confidence intervals in the summary are slightly smaller, since they are not based on the White's heteroskedasticity robus standard errors."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "errors = np.full((2, ci.shape[0]), np.nan)\n",
    "errors[0, :] = ci['effect'] - ci['2.5 %']\n",
    "errors[1, :] = ci['97.5 %'] - ci['effect']\n",
    "\n",
    "#add true effect\n",
    "ax = plt.subplot(1, 1, 1)\n",
    "ax.scatter(x=['Group_1', 'Group_2', 'Group_3'], y=true_effects, c='red', label='True Effect')\n",
    "\n",
    "plt.errorbar(ci.index, ci.effect, fmt='o', yerr=errors)\n",
    "plt.title('GATEs')\n",
    "plt.xlabel('Groups')\n",
    "_ =  plt.ylabel('Effect and 95%-CI')"
   ]
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": ".venv",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.12.3"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 2
}