{
 "cells": [
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# Python: Panel Data Introduction\n",
    "\n",
    "In this example, we replicate the results from the guide [Getting Started with the did Package](https://bcallaway11.github.io/did/articles/did-basics.html) of the [did-R-package](https://bcallaway11.github.io/did/index.html).\n",
    "\n",
    "As the [did-R-package](https://bcallaway11.github.io/did/index.html) the implementation of [DoubleML](https://docs.doubleml.org/stable/index.html) is based on [Callaway and Sant'Anna(2021)](https://doi.org/10.1016/j.jeconom.2020.12.001).\n",
    "\n",
    "The notebook requires the following packages:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "import pandas as pd\n",
    "import numpy as np\n",
    "\n",
    "from sklearn.linear_model import LinearRegression, LogisticRegression\n",
    "\n",
    "from doubleml.data import DoubleMLPanelData\n",
    "from doubleml.did import DoubleMLDIDMulti"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Data\n",
    "\n",
    "The data we will use is simulated and part of the [CSDID-Python-Package](https://d2cml-ai.github.io/csdid/index.html).\n",
    "\n",
    "A description of the data generating process can be found at the [CSDID-documentation](https://d2cml-ai.github.io/csdid/examples/csdid_basic.html#Examples-with-simulated-data).\n"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "dta = pd.read_csv(\"https://raw.githubusercontent.com/d2cml-ai/csdid/main/data/sim_data.csv\")\n",
    "dta.head()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "To work with the [DoubleML-package](https://docs.doubleml.org/stable/index.html), we initialize a ``DoubleMLPanelData`` object.\n",
    "\n",
    "Therefore, we set the *never-treated* units in group column `G` to `np.inf` (we have to change the datatype to `float`)."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# set dtype for G to float\n",
    "dta[\"G\"] = dta[\"G\"].astype(float)\n",
    "dta.loc[dta[\"G\"] == 0, \"G\"] = np.inf\n",
    "dta.head()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Now, we can initialize the ``DoubleMLPanelData`` object, specifying\n",
    "\n",
    " - `y_col` : the outcome\n",
    " - `d_cols`: the group variable indicating the first treated period for each unit\n",
    " - `id_col`: the unique identification column for each unit\n",
    " - `t_col` : the time column\n",
    " - `x_cols`: the additional pre-treatment controls\n"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "dml_data = DoubleMLPanelData(\n",
    "    data=dta,\n",
    "    y_col=\"Y\",\n",
    "    d_cols=\"G\",\n",
    "    id_col=\"id\",\n",
    "    t_col=\"period\",\n",
    "    x_cols=[\"X\"]\n",
    ")\n",
    "print(dml_data)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## ATT Estimation\n",
    "\n",
    "The [DoubleML-package](https://docs.doubleml.org/stable/index.html) implements estimation of group-time average treatment effect via the `DoubleMLDIDMulti` class (see [model documentation](https://docs.doubleml.org/stable/guide/models.html#difference-in-differences-models-did)).\n",
    "\n",
    "The class basically behaves like other `DoubleML` classes and requires the specification of two learners (for more details on the regression elements, see [score documentation](https://docs.doubleml.org/stable/guide/scores.html#difference-in-differences-models)). The model will be estimated using the `fit()` method."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "dml_obj = DoubleMLDIDMulti(\n",
    "    obj_dml_data=dml_data,\n",
    "    ml_g=LinearRegression(),\n",
    "    ml_m=LogisticRegression(),\n",
    "    control_group=\"never_treated\",\n",
    ")\n",
    "\n",
    "dml_obj.fit()\n",
    "print(dml_obj)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "The summary displays estimates of the $ATT(g,t_\\text{eval})$ effects for different combinations of $(g,t_\\text{eval})$ via $\\widehat{ATT}(\\mathrm{g},t_\\text{pre},t_\\text{eval})$, where\n",
    " - $\\mathrm{g}$ specifies the group\n",
    " - $t_\\text{pre}$ specifies the corresponding pre-treatment period\n",
    " - $t_\\text{eval}$ specifies the evaluation period\n",
    "\n",
    "This corresponds to the estimates given in `att_gt` function in the [did-R-package](https://bcallaway11.github.io/did/index.html), where the standard choice is $t_\\text{pre} = \\min(\\mathrm{g}, t_\\text{eval}) - 1$ (without anticipation).\n",
    "\n",
    "Remark that this includes pre-tests effects if $\\mathrm{g} > t_{eval}$, e.g. $ATT(4,2)$."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "As usual for the DoubleML-package, you can obtain joint confidence intervals via bootstrap."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "level = 0.95\n",
    "\n",
    "ci = dml_obj.confint(level=level)\n",
    "dml_obj.bootstrap(n_rep_boot=5000)\n",
    "ci_joint = dml_obj.confint(level=level, joint=True)\n",
    "ci_joint"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "A visualization of the effects can be obtained via the `plot_effects()` method.\n",
    "\n",
    "Remark that the plot used joint confidence intervals per default. "
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "tags": [
     "nbsphinx-thumbnail"
    ]
   },
   "outputs": [],
   "source": [
    "fig, ax = dml_obj.plot_effects()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Effect Aggregation\n",
    "\n",
    "As the [did-R-package](https://bcallaway11.github.io/did/index.html), the $ATT$'s can be aggregated to summarize multiple effects.\n",
    "For details on different aggregations and details on their interpretations see [Callaway and Sant'Anna(2021)](https://doi.org/10.1016/j.jeconom.2020.12.001).\n",
    "\n",
    "The aggregations are implemented via the `aggregate()` method."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Group Aggregation\n",
    "\n",
    "To obtain group-specific effects it is possible to aggregate several $\\widehat{ATT}(\\mathrm{g},t_\\text{pre},t_\\text{eval})$ values based on the group $\\mathrm{g}$ by setting the `aggregation=\"group\"` argument."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "aggregated = dml_obj.aggregate(aggregation=\"group\")\n",
    "print(aggregated)\n",
    "_ = aggregated.plot_effects()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "The output is a `DoubleMLDIDAggregation` object which includes an overall aggregation summary based on group size."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Time Aggregation\n",
    "\n",
    "This aggregates $\\widehat{ATT}(\\mathrm{g},t_\\text{pre},t_\\text{eval})$, based on $t_\\text{eval}$, but weighted with respect to group size. Corresponds to *Calendar Time Effects* from the [did-R-package](https://bcallaway11.github.io/did/index.html).\n",
    "\n",
    "For calendar time effects set `aggregation=\"time\"`."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "aggregated_time = dml_obj.aggregate(\"time\")\n",
    "print(aggregated_time)\n",
    "fig, ax = aggregated_time.plot_effects()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Event Study Aggregation\n",
    "\n",
    "Finally, `aggregation=\"eventstudy\"` aggregates $\\widehat{ATT}(\\mathrm{g},t_\\text{pre},t_\\text{eval})$ based on exposure time $e = t_\\text{eval} - \\mathrm{g}$ (respecting group size)."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "aggregated_eventstudy = dml_obj.aggregate(\"eventstudy\")\n",
    "print(aggregated_eventstudy)\n",
    "fig, ax = aggregated_eventstudy.plot_effects()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Aggregation Details\n",
    "\n",
    "The `DoubleMLDIDAggregation` objects include several `DoubleMLFrameworks` which support methods like `bootstrap()` or `confint()`.\n",
    "Further, the weights can be accessed via the properties\n",
    "\n",
    " - ``overall_aggregation_weights``: weights for the overall aggregation\n",
    " - ``aggregation_weights``: weights for the aggregation\n",
    "\n",
    "To clarify, e.g. for the eventstudy aggregation"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "print(aggregated_eventstudy)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Here, the overall effect aggregation aggregates each effect with positive exposure"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "print(aggregated_eventstudy.overall_aggregation_weights)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "If one would like to consider how the aggregated effect with $e=0$ is computed, one would have to look at the third set of weights within the ``aggregation_weights`` property"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "aggregated_eventstudy.aggregation_weights[2]"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Taking a look at the original `dml_obj`, one can see that this combines the following estimates:\n",
    "\n",
    " - $\\widehat{ATT}(2,1,2)$\n",
    " - $\\widehat{ATT}(3,2,3)$\n",
    " - $\\widehat{ATT}(4,3,4)$"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "print(dml_obj.summary)"
   ]
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": ".venv",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.12.3"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 2
}