{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "# Python: Real-Data Example for Multi-Period Difference-in-Differences\n", "\n", "In this example, we replicate a [real-data demo notebook](https://bcallaway11.github.io/did/articles/did-basics.html#an-example-with-real-data) from the [did-R-package](https://bcallaway11.github.io/did/index.html) to illustrate the use of `DoubleML` for multi-period difference-in-differences (DiD) models.\n", "\n", "The notebook requires the following packages:" ] }, { "cell_type": "code", "execution_count": 1, "metadata": {}, "outputs": [], "source": [ "import pyreadr\n", "import pandas as pd\n", "import numpy as np\n", "\n", "from sklearn.linear_model import LinearRegression, LogisticRegression\n", "from sklearn.dummy import DummyRegressor, DummyClassifier\n", "from sklearn.linear_model import LassoCV, LogisticRegressionCV\n", "\n", "from doubleml.data import DoubleMLPanelData\n", "from doubleml.did import DoubleMLDIDMulti" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Causal Research Question\n", "\n", "[Callaway and Sant'Anna (2021)](https://doi.org/10.1016/j.jeconom.2020.12.001) study the causal effect of raising the minimum wage on teen employment in the US using county-level data from 2001 to 2007. A county is defined as treated if the minimum wage in that county is above the federal minimum wage. We focus on a preprocessed balanced panel data set provided by the [did-R-package](https://bcallaway11.github.io/did/index.html). The corresponding documentation for the `mpdta` data is available from the [did package website](https://bcallaway11.github.io/did/reference/mpdta.html). We use this data solely as a demonstration example to help readers understand the differences between the `DoubleML` and `did` packages. 
An analogous notebook using the same data is available from the [did documentation](https://bcallaway11.github.io/did/articles/did-basics.html#an-example-with-real-data).\n", "\n", "We follow the original notebook and report results under identification based on both unconditional and conditional parallel trends. For the Double Machine Learning (DML) difference-in-differences estimator, we demonstrate two specifications: one based on linear and logistic regression, and one based on their $\\ell_1$-penalized variants, Lasso and logistic regression, with the penalty chosen by cross-validation. The results for the former are expected to be very similar to those in the [did data example](https://bcallaway11.github.io/did/articles/did-basics.html#an-example-with-real-data). Minor differences can arise from the sample splitting used in DML estimation.\n", "\n", "\n", "## Data\n", "\n", "We download and read the preprocessed data file provided by the [did-R-package](https://bcallaway11.github.io/did/index.html).\n" ] }, { "cell_type": "code", "execution_count": 2, "metadata": {}, "outputs": [ { "data": { "application/vnd.microsoft.datawrangler.viewer.v0+json": { "columns": [ { "name": "index", "rawType": "int64", "type": "integer" }, { "name": "year", "rawType": "int32", "type": "integer" }, { "name": "countyreal", "rawType": "float64", "type": "float" }, { "name": "lpop", "rawType": "float64", "type": "float" }, { "name": "lemp", "rawType": "float64", "type": "float" }, { "name": "first.treat", "rawType": "float64", "type": "float" }, { "name": "treat", "rawType": "float64", "type": "float" } ], "conversionMethod": "pd.DataFrame", "ref": "5e2a60ba-8445-46e1-b56a-ecf010c51d7d", "rows": [ [ "0", "2003", "8001.0", "5.896760933305299", "8.461469042643875", "2007.0", "1.0" ], [ "1", "2004", "8001.0", "5.896760933305299", "8.336869637284956", "2007.0", "1.0" ], [ "2", "2005", "8001.0", "5.896760933305299", "8.340217320947035", "2007.0", "1.0" ], [ "3", "2006", 
"8001.0", "5.896760933305299", "8.37816098272068", "2007.0", "1.0" ], [ "4", "2007", "8001.0", "5.896760933305299", "8.487352349405215", "2007.0", "1.0" ] ], "shape": { "columns": 6, "rows": 5 } }, "text/html": [ "
<table>\n", "<thead>\n", "<tr><th></th><th>year</th><th>countyreal</th><th>lpop</th><th>lemp</th><th>first.treat</th><th>treat</th></tr>\n", "</thead>\n", "<tbody>\n",
"<tr><th>0</th><td>2003</td><td>8001.0</td><td>5.896761</td><td>8.461469</td><td>2007.0</td><td>1.0</td></tr>\n",
"<tr><th>1</th><td>2004</td><td>8001.0</td><td>5.896761</td><td>8.336870</td><td>2007.0</td><td>1.0</td></tr>\n",
"<tr><th>2</th><td>2005</td><td>8001.0</td><td>5.896761</td><td>8.340217</td><td>2007.0</td><td>1.0</td></tr>\n",
"<tr><th>3</th><td>2006</td><td>8001.0</td><td>5.896761</td><td>8.378161</td><td>2007.0</td><td>1.0</td></tr>\n",
"<tr><th>4</th><td>2007</td><td>8001.0</td><td>5.896761</td><td>8.487352</td><td>2007.0</td><td>1.0</td></tr>\n",
"</tbody>\n", "</table>\n", "