{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "# Flexible Covariate Adjustment in Regression Discontinuity Designs (RDD)\n", "\n", "This notebook demonstrates how to use RDD within ``DoubleML``. Our implementation, ``RDFlex``, is based on the paper _\"Flexible Covariate Adjustments in Regression Discontinuity Designs\"_ by [Noack, Olma and Rothe (2024)](https://arxiv.org/abs/2107.07942). \n", "\n", "In regression discontinuity designs (RDD), treatment assignment is determined by a continuous running variable $S$ (or \"score\") crossing a known threshold $c$ (or \"cutoff\"). We aim to estimate the average treatment effect locally at the cutoff,\n", "\n", "$$\\tau_{0} = \\mathbb{E}[Y_i(1)-Y_i(0)\\mid S_i = c]$$\n", "\n", "with $Y_i(1)$ and $Y_i(0)$ denoting the potential outcomes of an individual with and without treatment, respectively." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "**Note:** The dependencies for the module ``doubleml.rdd`` can be installed seperately via ``pip install rdrobust``, if necessary." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "import numpy as np\n", "import pandas as pd\n", "\n", "import plotly.express as px\n", "import plotly.graph_objs as go\n", "from statsmodels.nonparametric.kernel_regression import KernelReg\n", "\n", "from lightgbm import LGBMRegressor, LGBMClassifier\n", "\n", "from rdrobust import rdrobust\n", "\n", "import doubleml as dml\n", "from doubleml.rdd import RDFlex\n", "from doubleml.rdd.datasets import make_simple_rdd_data" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Sharp RDD\n", "\n", "In the sharp design, the treatment assignment is deterministic given the score. Namely, all the individuals with a score higher than the cutoff, receive the treatment $$D_i = \\mathbb{I}[S_i \\geq c].$$\n", "\n", "Without loss of generality, for the whole example we consider the cutoff to be normalized to $c=0$ and formulas are given accordingly.\n", "\n", "In sharp RDD, the treatment effect defined above is identified by\n", "\n", "$$\\tau_0 = \\lim_{s \\to c^+} \\mathbb{E}[Y_i \\mid S_i = s] - \\lim_{s \\to c^-} \\mathbb{E}[Y_i \\mid S_i = s].$$\n", "\n", "A key assumption for this identification is the **continuity** of the conditional expectations of the potential outcomes $\\mathbb{E}[Y_i(d)\\mid S_i=c]$ for $d \\in \\{0, 1\\}$.\n", " \n", "This implies that units cannot perfectly manipulate their score to either receive or avoid treatment exactly at the cutoff.\n", "\n", "### Generate Sharp Data\n", "\n", "The function ``make_simple_rdd_data()`` can be used to generate data of a rather standard RDD setting. If we set ``fuzzy = False``, the generated data follows a sharp RDD. We also generate covariates $X$ that can be used to adjust the estimation at a later stage.\n", "By default, the cutoff is normalized to ``c = 0``. The true RDD effect can be controlled by ``tau`` and is set to a value of $2.0$ in this example." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "np.random.seed(1234)\n", "\n", "true_tau = 2.0\n", "data_dict = make_simple_rdd_data(n_obs=1000, fuzzy=False, tau=true_tau)\n", "\n", "cov_names = ['x' + str(i) for i in range(data_dict['X'].shape[1])]\n", "df = pd.DataFrame(\n", " np.column_stack((data_dict['Y'], data_dict['D'], data_dict['score'], data_dict['X'])),\n", " columns=['y', 'd', 'score'] + cov_names,\n", ")\n", "df.head()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Comparing the observed outcomes, we can clearly see a discontinuity at the cutoff value of $c = 0$." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "fig = px.scatter(\n", " x=df['score'],\n", " y=df['y'],\n", " color=df['d'].astype(bool),\n", " labels={\n", " \"x\": \"Score\", \n", " \"y\": \"Outcome\",\n", " \"color\": \"Treatment\"\n", " },\n", " title=\"Scatter Plot of Outcome vs. Score by Treatment Status\"\n", ")\n", "\n", "fig.update_layout(\n", " xaxis_title=\"Score\",\n", " yaxis_title=\"Outcome\"\n", ")\n", "fig.show()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Sharp RDD Without Adjustment\n", "\n", "The standard RDD estimator for the sharp design takes the form \n", "\n", "$$\\hat{\\tau}_{\\text{base}}(h) = \\sum_{i=1}^n w_i(h)Y_i,$$\n", "\n", "where the $w_i(h)$ are local linear regression weights that depend on the data through the realizations of the running variable only and $h > 0$ is a bandwidth.\n", "\n", "The packages ``rdrobust`` implements this estimation." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "rdrobust_sharp_noadj = rdrobust(y=df['y'], x=df['score'], fuzzy=df['d'], c=0.0)\n", "rdrobust_sharp_noadj" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Sharp RDD with Linear Adjustment\n", "\n", "The linearly adjusted RDD estimator for the sharp design takes the form \n", "\n", "$$\\hat{\\tau}_{lin}(h) = \\sum_{i=1}^n w_i(h)(Y_i-X_i^T\\hat{\\gamma}_h)$$\n", "\n", "where $w_i(h)$ are local linear regression weights that depend on the data through the realizations of the running variable $S_i$ only and $h>0$ is a bandwidth. $\\hat{\\gamma}_h$ is a minimizer from the regression\n", "\n", "$$\\underset{\\beta,\\gamma}{\\mathrm{arg\\,min}} \\, \\sum K_h(S_i) (Y_i - Q_i^\\top\\beta- X_i^{\\top}\\gamma )^2.$$\n", "\n", "with $Q_i =(D_i, S_i, D_i S_i,1)^T$ (for more details, see our [User Guide](https://docs.doubleml.org/stable/guide/models.html)), $K_h(v)=K(v/h)/h$ with $K(\\cdot)$ a kernel function.\n", "\n", "The packages ``rdrobust`` implements this estimation with a linear adjustment." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "rdrobust_sharp = rdrobust(y=df['y'], x=df['score'], fuzzy=df['d'], covs=df[cov_names], c=0.0)\n", "rdrobust_sharp" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Sharp RDD with Flexible Adjustment\n", "\n", "[Noack, Olma and Rothe (2024)](https://arxiv.org/abs/2107.07942) propose an estimator that reduces the variance of the above esimator, using a flexible adjustment of the outcome by machine learning. For more details, see our [User Guide](https://docs.doubleml.org/stable/guide/models.html). The estimator here takes the form \n", "\n", "$$\\hat{\\tau}_{\\text{RDFlex}}(h;\\eta) = \\sum_{i=1}^n w_i(h)M_i(\\eta),\\quad M_i(\\eta) = Y_i - \\eta(X_i),$$\n", "\n", "with $\\eta(\\cdot)$ being potentially nonlinear adjustment functions.\n", "\n", "We initialize a `DoubleMLData` object using the usual package syntax:\n", "\n", " - `y_col` refers to the observed outcome, on which we want to estimate the effect at the cutoff\n", " - `s_col` refers to the score\n", " - `x_cols` refers to the covariates to be adjusted for\n", " - `d_cols` is an indicator whether an observation is treated or not. In the sharp design, this should be identical to an indicator whether an observation is left or right of the cutoff ($D_i = \\mathbb{I}[S_i \\geq c]$)" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "dml_data_sharp = dml.DoubleMLData(df, y_col='y', d_cols='d', x_cols=cov_names, s_col='score')" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The ``RDFlex`` object is intialized with only one learner, that adjusts the outcome based on the covariates." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "ml_g = LGBMRegressor(n_estimators=500, learning_rate=0.01, verbose=-1)\n", "\n", "rdflex_sharp = RDFlex(dml_data_sharp,\n", " ml_g,\n", " cutoff=0,\n", " fuzzy=False,\n", " n_folds=5,\n", " n_rep=1)\n", "rdflex_sharp.fit(n_iterations=2)\n", "\n", "print(rdflex_sharp)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "It is visible that the flexible adjustment decreases the standard error in the estimation and therefore provides tighter confidence intervals. For coverage simulations, see the [DoubleML Coverage Repository](https://docs.doubleml.org/doubleml-coverage/dev/rdd/rdd.html).\n", "\n", "`RDFlex` uses an iterative fitting approach to determine a preliminary bandwidth selections for the local adjustments. The default number of iterations is `n_iterations=2`, according to [Noack, Olma and Rothe (2024)](https://arxiv.org/abs/2107.07942)." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Fuzzy RDD\n", "\n", "In the fuzzy RDDs, the treatment assignment is still deterministic given the score $\\left(T_i = \\mathbb{I}[S_i \\geq c]\\right)$.\n", "However, in the neighborhood of the cutoff, there is a probability of non-compliance. Thus, the treatment received might differ from the assigned one $(D_i \\neq T_i)$ for some units. These observations cause the jump in the probability of treatment at the cutoff to be smaller than 1 but larger than 0. In other words, around the cutoff there can be treatment randomization on both sides.\n", "\n", "The parameter of interest in the Fuzzy RDD is the average treatment effect at the cutoff, for all individuals that comply with the assignment\n", "\n", "$$\\theta_{0} = \\mathbb{E}[Y_i(1)-Y_i(0)\\mid S_i = c, \\{i\\in \\text{compliers}\\}].$$\n", "\n", "This effect can be identified by\n", "\n", "$$\\theta_{0} = \\frac{\\lim_{s \\to c^+} \\mathbb{E}[Y_i \\mid S_i = s] - \\lim_{s \\to c^-} \\mathbb{E}[Y_i \\mid S_i = s]}{\\lim_{s \\to c^+} \\mathbb{E}[D_i \\mid S_i = s] - \\lim_{s \\to c^-} \\mathbb{E}[D_i \\mid S_i = s]}.$$\n", "\n", "### Generate Fuzzy Data\n", "\n", "The function ``make_simple_rdd_data()`` with ``fuzzy = True`` generates basic data for the fuzzy case. The cutoff is still set to $c = 0$ and we set the true effect to be ``tau = 2.0`` again." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "np.random.seed(1234)\n", "\n", "data_dict = make_simple_rdd_data(n_obs=1000, fuzzy=True, tau=true_tau)\n", "\n", "cov_names = ['x' + str(i) for i in range(data_dict['X'].shape[1])]\n", "df = pd.DataFrame(\n", " np.column_stack((data_dict['Y'], data_dict['D'], data_dict['score'], data_dict['X'])),\n", " columns=['y', 'd', 'score'] + cov_names,\n", ")\n", "df.head()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Comparing the observed outcomes, the discontinuity is less pronounced than in the sharp case. We see some degree of randomization left and right of the cutoff." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "fig = px.scatter(\n", " x=df['score'],\n", " y=df['y'],\n", " color=df['d'].astype(bool),\n", " labels={\n", " \"x\": \"Score\", \n", " \"y\": \"Outcome\",\n", " \"color\": \"Treatment\"\n", " },\n", " title=\"Scatter Plot of Outcome vs. Score by Treatment Status\"\n", ")\n", "\n", "fig.update_layout(\n", " xaxis_title=\"Score\",\n", " yaxis_title=\"Outcome\"\n", ")\n", "fig.show()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Fuzzy RDD Without Adjustment\n", "\n", "The standard RDD estimator for the fuzzy design takes the form \n", "\n", "$$\\hat{\\theta}_{base}(h) = \\frac{\\hat{\\tau}_{\\text{Y}, base}(h)}{\\hat{\\tau}_{\\text{D}, base}(h)} = \\frac{\\sum_{i=1}^n w_i(h)Y_i}{\\sum_{i=1}^n w_i(h)D_i}$$\n", "\n", "The packages ``rdrobust`` implements this estimation." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "rdrobust_fuzzy_noadj = rdrobust(y=df['y'], x=df['score'], fuzzy=df['d'], c=0.0)\n", "rdrobust_fuzzy_noadj" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Fuzzy RDD with Linear Adjustment\n", "\n", "The linearly adjusted RDD estimator for the fuzzy design takes the form \n", "\n", "$$\\hat{\\theta}_{lin}(h) = \\frac{\\hat{\\tau}_{\\text{Y}, lin}(h)}{\\hat{\\tau}_{\\text{D}, lin}(h)} = \\frac{\\sum_{i=1}^n w_i(h)(Y_i-X_i^T\\hat{\\gamma}_{Y, h})}{\\sum_{i=1}^n w_i(h)(D_i-X_i^T\\hat{\\gamma}_{D, h})}$$\n", "\n", "Under similar assumptions as in the sharp case and that there are no *Defiers* (= individuals that would always pick the opposite treatment of their assigned one), this effect estimates the average treatment effect at the cutoff. The package ``rdrobust`` implements this estimation with a linear adjustment." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "rdrobust_fuzzy = rdrobust(y=df['y'], x=df['score'], fuzzy=df['d'], covs=df[cov_names], c=0.0)\n", "rdrobust_fuzzy" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The fuzzy design usually has much larger standard errors than the sharp design, as the jump in treatment probability adds further estimation uncertainty." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Fuzzy RDD with Flexible Adjustment\n", "\n", "[Noack, Olma and Rothe (2024)](https://arxiv.org/abs/2107.07942) propose an estimator that reduces the variance of the above esimator, using a flexible adjustment of the outcome by ML. For more details, see our [User Guide](https://docs.doubleml.org/stable/guide/models.html). The estimator here takes the form \n", "\n", "$$\\hat{\\theta}_{\\text{RDFlex}}(h; \\eta) = \\frac{\\sum_{i=1}^n w_i(h)(Y_i - \\hat{\\eta}_Y(X_i))}{\\sum_{i=1}^n w_i(h)(D_i - \\hat{\\eta}_D(X_i))},$$\n", "\n", "\n", "with $\\eta_Y(\\cdot), \\eta_D(\\cdot)$ being potentially nonlinear adjustment functions.\n", "\n", "We initialize a `DoubleMLData` object using the usual package syntax:\n", "\n", " - `y_col` refers to the observed outcome, on which we want to estimate the effect at the cutoff\n", " - `s_col` refers to the score\n", " - `x_cols` refers to the covariates to be adjusted for\n", " - `d_cols` is an indicator whether an observation is treated or not. In the fuzzy design, this should __not__ be identical to an indicator whether an observation is left or right of the cutoff ($D_i \\neq \\mathbb{I}[S_i \\geq c]$)" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "dml_data_fuzzy = dml.DoubleMLData(df, y_col='y', d_cols='d', x_cols=cov_names, s_col='score')" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "This time, we also have to specify a classifier that adjusts the treatment assignment probabilities." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "ml_g = LGBMRegressor(n_estimators=500, learning_rate=0.01, verbose=-1)\n", "ml_m = LGBMClassifier(n_estimators=500, learning_rate=0.01, verbose=-1)\n", "\n", "rdflex_fuzzy = RDFlex(dml_data_fuzzy,\n", " ml_g,\n", " ml_m,\n", " cutoff=0,\n", " fuzzy=True,\n", " n_folds=5,\n", " n_rep=1)\n", "rdflex_fuzzy.fit(n_iterations=2)\n", "\n", "print(rdflex_fuzzy)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Also for the fuzzy case, we observe a significant decrease in estimation standard error." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Advanced: Global and Local Learners, Stacked Ensembles\n", "\n", "By default, ``RDFlex`` fits ML methods using kernel weights, resulting in a \"local\" fit around the cutoff. If the adjustment should also include \"global\" information from the full support of $S$ available in the data, the use of a the ``GlobalLearner`` wrapper is recommended.\n", "\n", "The ``GlobalLearner`` allows to ignore the weights and fit the ML method on the full support of the data, even if weights are provided.\n", "\n", "The learners can also be stacked. All learners have to support the `sample_weight` in their `fit` method. By stacking and using local and global learners, it is possible to further tune the estimation and potentially reduce standard errors even further." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "from doubleml.utils import GlobalRegressor, GlobalClassifier\n", "\n", "from sklearn.linear_model import LinearRegression, LogisticRegression\n", "from sklearn.ensemble import RandomForestRegressor, RandomForestClassifier\n", "from sklearn.ensemble import StackingClassifier, StackingRegressor" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "reg_estimators = [\n", " ('lr local', LinearRegression()),\n", " ('rf local', RandomForestRegressor()),\n", " ('lr global', GlobalRegressor(base_estimator=LinearRegression())),\n", " ('rf global', GlobalRegressor(base_estimator=RandomForestRegressor()))\n", "]\n", "\n", "class_estimators = [\n", " ('lr local', LogisticRegression()),\n", " ('rf local', RandomForestClassifier()),\n", " ('lr global', GlobalClassifier(base_estimator=LogisticRegression())),\n", " ('rf global', GlobalClassifier(base_estimator=RandomForestClassifier()))\n", "]\n", "\n", "ml_g = StackingRegressor(\n", " estimators=reg_estimators,\n", " final_estimator=LinearRegression(),\n", ")\n", "\n", "ml_m = StackingClassifier(\n", " estimators=class_estimators,\n", " final_estimator=LogisticRegression(),\n", ")" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "We first repeat the estimation of the sharp design and observe an even smaller standard error." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "rdflex_sharp_stack = RDFlex(dml_data_sharp,\n", " ml_g,\n", " fuzzy=False,\n", " n_folds=5,\n", " n_rep=1)\n", "rdflex_sharp_stack.fit(n_iterations=2)\n", "\n", "print(rdflex_sharp_stack)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The same applies for the fuzzy case." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "rdflex_fuzzy_stack = RDFlex(dml_data_fuzzy,\n", " ml_g,\n", " ml_m,\n", " fuzzy=True,\n", " n_folds=5,\n", " n_rep=1)\n", "rdflex_fuzzy_stack.fit(n_iterations=2)\n", "\n", "print(rdflex_fuzzy_stack)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "To conclude, we look at a visualization of the estimated coefficient and the confidence intervals. We see that by using the flexible adjustment, it is possible to shrink confidence intervals." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "df_sharp = pd.DataFrame({\"coef\": [rdrobust_sharp_noadj.coef.values[0][0], rdrobust_sharp.coef.values[0][0], rdflex_sharp.coef[0], rdflex_sharp_stack.coef[0]],\n", " \"CI lower\": [rdrobust_sharp_noadj.ci.values[0][0], rdrobust_sharp.ci.values[0][0], rdflex_sharp.confint().values[0][0], rdflex_sharp_stack.confint().values[0][0]],\n", " \"CI upper\": [rdrobust_sharp_noadj.ci.values[0][1], rdrobust_sharp.ci.values[0][1], rdflex_sharp.confint().values[0][1], rdflex_sharp_stack.confint().values[0][1]],\n", " \"method\": [\"No Adj.\", \"Linear Adj.\", \"Flexible Adj.\", \"Flexible Adj. (Stacked)\"]})\n", "df_fuzzy = pd.DataFrame({\"coef\": [rdrobust_fuzzy_noadj.coef.values[0][0], rdrobust_fuzzy.coef.values[0][0], rdflex_fuzzy.coef[0], rdflex_fuzzy_stack.coef[0]],\n", " \"CI lower\": [rdrobust_fuzzy_noadj.ci.values[0][0], rdrobust_fuzzy.ci.values[0][0], rdflex_fuzzy.confint().values[0][0], rdflex_fuzzy_stack.confint().values[0][0]],\n", " \"CI upper\": [rdrobust_fuzzy_noadj.ci.values[0][1], rdrobust_fuzzy.ci.values[0][1], rdflex_fuzzy.confint().values[0][1], rdflex_fuzzy_stack.confint().values[0][1]],\n", " \"method\": [\"No Adj.\", \"Linear Adj.\", \"Flexible Adj.\", \"Flexible Adj. (Stacked)\"]})" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "import matplotlib.pyplot as plt\n", "fig, axes = plt.subplots(nrows=1, ncols=2, figsize=(10, 4))\n", "\n", "axes[0].errorbar(\n", " df_sharp['method'],\n", " df_sharp['coef'],\n", " yerr=(df_sharp['coef'] - df_sharp['CI lower'], df_sharp['CI upper'] - df_sharp['coef']),\n", " fmt='o',\n", " capsize=5,\n", " capthick=2\n", ")\n", "axes[0].set_title('Sharp Design')\n", "axes[0].set_ylabel('Coefficient')\n", "axes[0].set_xlabel('Method')\n", "axes[0].axhline(true_tau, linestyle=\"--\", color=\"r\")\n", "axes[0].tick_params(axis='x', rotation=30)\n", "\n", "axes[1].errorbar(\n", " df_fuzzy['method'],\n", " df_fuzzy['coef'],\n", " yerr=(df_fuzzy['coef'] - df_fuzzy['CI lower'], df_fuzzy['CI upper'] - df_fuzzy['coef']),\n", " fmt='o',\n", " capsize=5,\n", " capthick=2\n", ")\n", "axes[1].set_title('Fuzzy Design')\n", "axes[1].set_ylabel('Coefficient') \n", "axes[1].set_xlabel('Method')\n", "axes[1].axhline(true_tau, linestyle=\"--\", color=\"r\")\n", "axes[1].tick_params(axis='x', rotation=30)\n", "\n", "plt.tight_layout()" ] } ], "metadata": { "kernelspec": { "display_name": "didnotebook", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.11.5" } }, "nbformat": 4, "nbformat_minor": 2 }