Python: Real-Data Example for Multi-Period Difference-in-Differences#

In this example, we replicate a real-data demo notebook from the did-R-package in order to illustrate the use of DoubleML for multi-period difference-in-differences (DiD) models.

The notebook requires the following packages:

[1]:
import pyreadr
import pandas as pd
import numpy as np

from sklearn.linear_model import LinearRegression, LogisticRegression
from sklearn.dummy import DummyRegressor, DummyClassifier
from sklearn.linear_model import LassoCV, LogisticRegressionCV

from doubleml.data import DoubleMLPanelData
from doubleml.did import DoubleMLDIDMulti

Causal Research Question#

Callaway and Sant’Anna (2021) study the causal effect of raising the minimum wage on teen employment in the US using county data over a period from 2001 to 2007. A county is defined as treated if the minimum wage in that county is above the federal minimum wage. We focus on a preprocessed balanced panel data set as provided by the did-R-package. The corresponding documentation for the mpdta data is available from the did package website. We use this data solely as a demonstration example to help readers understand differences between the DoubleML and did packages. An analogous notebook using the same data is available from the did documentation.

We follow the original notebook and provide results under identification based on unconditional and conditional parallel trends. For the Double Machine Learning (DML) Difference-in-Differences estimator, we demonstrate two different specifications: one based on linear and logistic regression, and one based on their \(\ell_1\)-penalized variants, i.e., Lasso and logistic regression with cross-validated penalty choice. The results for the former are expected to be very similar to those in the did data example. Minor differences might arise due to the use of sample-splitting in the DML estimation.
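
The two specifications can be sketched with scikit-learn as follows (a minimal sketch; the variable names are illustrative, and the learners are later passed to the DML estimator):

```python
from sklearn.linear_model import (
    LinearRegression, LogisticRegression, LassoCV, LogisticRegressionCV
)

# Specification 1: plain linear and logistic regression
ml_g_lin = LinearRegression()       # outcome regression nuisance
ml_m_log = LogisticRegression()     # treatment (propensity) nuisance

# Specification 2: l1-penalized variants with cross-validated penalty choice
ml_g_lasso = LassoCV()
ml_m_lasso = LogisticRegressionCV(penalty="l1", solver="liblinear")
```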

Data#

We will download and read a preprocessed data file as provided by the did-R-package.

[2]:
# download file from did package for R
url = "https://github.com/bcallaway11/did/raw/refs/heads/master/data/mpdta.rda"
pyreadr.download_file(url, "mpdta.rda")

mpdta = pyreadr.read_r("mpdta.rda")["mpdta"]
mpdta.head()

[2]:
   year  countyreal      lpop      lemp  first.treat  treat
0  2003      8001.0  5.896761  8.461469       2007.0    1.0
1  2004      8001.0  5.896761  8.336870       2007.0    1.0
2  2005      8001.0  5.896761  8.340217       2007.0    1.0
3  2006      8001.0  5.896761  8.378161       2007.0    1.0
4  2007      8001.0  5.896761  8.487352       2007.0    1.0

To work with DoubleML, we initialize a DoubleMLPanelData object. The input data has to satisfy some requirements: it should be in long format, with every row containing the information on one unit at one time period, and it should contain a column with the unit identifier and a column with the time period. These requirements are virtually identical to those of the did-R-package, as listed in their data example. In line with the naming conventions of DoubleML, the treatment group indicator is passed to DoubleMLPanelData via the d_cols argument. To flexibly handle different time period formats, the time variable t_col supports float, int and datetime formats. More information is available in the user guide. To indicate never-treated units, we set their value for the treatment group variable to np.inf.
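
As a small illustration of the flexible time formats (example values only): an integer year column like the one in mpdta could equivalently be converted to datetime before initializing the data backend.

```python
import pandas as pd

# An integer year column, as in the mpdta data
years = pd.Series([2003, 2004, 2005])

# Optional conversion to datetime; int or float time indices work as well
years_dt = pd.to_datetime(years.astype(str), format="%Y")
print(years_dt.dt.year.tolist())  # -> [2003, 2004, 2005]
```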

Now, we can initialize the DoubleMLPanelData object, specifying

  • y_col : the outcome

  • d_cols: the group variable indicating the first treated period for each unit

  • id_col: the unique identification column for each unit

  • t_col : the time column

  • x_cols: the additional pre-treatment controls

[3]:
# Set values for treatment group indicator for never-treated to np.inf
mpdta.loc[mpdta['first.treat'] == 0, 'first.treat'] = np.inf

dml_data = DoubleMLPanelData(
    data=mpdta,
    y_col="lemp",
    d_cols="first.treat",
    id_col="countyreal",
    t_col="year",
    x_cols=['lpop']
)
print(dml_data)
================== DoubleMLPanelData Object ==================

------------------ Data summary      ------------------
Outcome variable: lemp
Treatment variable(s): ['first.treat']
Covariates: ['lpop']
Instrument variable(s): None
Time variable: year
Id variable: countyreal
No. Observations: 500

------------------ DataFrame info    ------------------
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 2500 entries, 0 to 2499
Columns: 6 entries, year to treat
dtypes: float64(5), int32(1)
memory usage: 107.6 KB

Note that we specified a pre-treatment confounding variable lpop through the x_cols argument. To consider cases under unconditional parallel trends, we can use dummy learners to ignore the pre-treatment confounding variable. This is illustrated below.
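
A minimal sketch of why dummy learners implement the unconditional case: they ignore the covariates entirely and predict the unconditional mean (regression) or the marginal class shares (classification). The toy data below is purely illustrative.

```python
import numpy as np
from sklearn.dummy import DummyRegressor, DummyClassifier

X = np.ones((4, 1))                 # covariates are simply ignored
y = np.array([1.0, 2.0, 3.0, 4.0])

ml_g = DummyRegressor()             # predicts the unconditional mean of y
ml_g.fit(X, y)
print(ml_g.predict(X[:1]))          # -> [2.5]

d = np.array([0, 0, 1, 1])
ml_m = DummyClassifier(strategy="prior")  # predicts marginal class shares
ml_m.fit(X, d)
print(ml_m.predict_proba(X[:1]))    # -> [[0.5 0.5]]
```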

Effect Aggregation#

As in the did-R-package, the \(ATT\)s can be aggregated to summarize multiple effects. For details on the different aggregation schemes and their interpretation, see Callaway and Sant’Anna (2021).

The aggregations are implemented via the aggregate() method. We follow the structure of the did package notebook and start with an aggregation relative to the treatment timing.

Event Study Aggregation#

We can aggregate the \(ATT\)s relative to the treatment timing by setting aggregation="eventstudy" in the aggregate() method. This aggregates \(\widehat{ATT}(\mathrm{g},t_\text{pre},t_\text{eval})\) based on the exposure time \(e = t_\text{eval} - \mathrm{g}\) (respecting group size).
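
The exposure-time index can be illustrated with a short numpy sketch (example group and evaluation periods only):

```python
import numpy as np

# Exposure time e = t_eval - g for some example (group, evaluation) pairs
g = np.array([2004, 2006, 2007, 2007])       # first treatment period g
t_eval = np.array([2005, 2006, 2006, 2007])  # evaluation period t_eval
e = t_eval - g
print(e.tolist())  # -> [1, 0, -1, 0]; e >= 0 marks post-treatment periods
```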

[7]:
# rerun bootstrap for valid simultaneous inference (as bootstrap draws are not saved)
dml_obj.bootstrap(n_rep_boot=5000)
aggregated_eventstudy = dml_obj.aggregate("eventstudy")
print(aggregated_eventstudy)
fig, ax = aggregated_eventstudy.plot_effects()
================== DoubleMLDIDAggregation Object ==================
 Event Study Aggregation

------------------ Overall Aggregated Effects ------------------
     coef  std err         t    P>|t|     2.5 %    97.5 %
-0.077214 0.019951 -3.870174 0.000109 -0.116317 -0.038111
------------------ Aggregated Effects         ------------------
          coef   std err         t     P>|t|     2.5 %    97.5 %
-3.0  0.030446  0.015075  2.019662  0.043418  0.000900  0.059992
-2.0 -0.000549  0.013317 -0.041223  0.967118 -0.026650  0.025552
-1.0 -0.024393  0.014200 -1.717808  0.085832 -0.052226  0.003439
0.0  -0.019919  0.011816 -1.685694  0.091855 -0.043079  0.003241
1.0  -0.050930  0.016783 -3.034679  0.002408 -0.083824 -0.018037
2.0  -0.137238  0.036342 -3.776254  0.000159 -0.208467 -0.066008
3.0  -0.100768  0.034297 -2.938126  0.003302 -0.167989 -0.033548
------------------ Additional Information     ------------------
Score function: observational
Control group: never_treated
Anticipation periods: 0

C:\Users\bachp\Documents\Promotion\DissundPapers\Software\DoubleML\doubleml-for-py\doubleml\did\did_aggregation.py:368: UserWarning: Joint confidence intervals require bootstrapping which hasn't been performed yet. Automatically applying '.aggregated_frameworks.bootstrap(method="normal", n_rep_boot=500)' with default values. For different bootstrap settings, call bootstrap() explicitly before plotting.
  warnings.warn(
(Figure: plot of the aggregated event-study effects by exposure time.)

Alternatively, the \(ATT\)s could also be aggregated by (calendar) time period or by treatment group; see the user guide.

Aggregation Details#


The DoubleMLDIDAggregation objects include several DoubleMLFramework objects, which support methods like bootstrap() or confint(). Further, the aggregation weights can be accessed via the properties

  • overall_aggregation_weights: weights for the overall aggregation

  • aggregation_weights: weights for each of the aggregated effects

To clarify, consider the event-study aggregation as an example: to see how the aggregated effect for exposure time \(e=-1\) is computed, one has to look at the third set of weights within the aggregation_weights property.

[8]:
aggregated_eventstudy.aggregation_weights[2]
[8]:
array([0.        , 0.        , 0.        , 0.        , 0.        ,
       0.23391813, 0.        , 0.        , 0.        , 0.        ,
       0.76608187, 0.        ])
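
Each set of aggregation weights defines the corresponding aggregated effect as a weighted average of the underlying group-time \(ATT\)s. A minimal numpy sketch using the weights printed above, with hypothetical \(ATT\) values (the real values come from the fitted estimator):

```python
import numpy as np

# The set of aggregation weights printed above: only two group-time
# ATTs receive non-zero weight, proportional to group size
weights = np.zeros(12)
weights[5] = 0.23391813
weights[10] = 0.76608187

# Hypothetical group-time ATT estimates (illustration only)
atts = np.linspace(-0.11, 0.0, 12)

# The aggregated effect is the weighted average of the group-time ATTs
agg_effect = weights @ atts
print(agg_effect)
```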