3.2.13. doubleml.did.datasets.make_did_CS2021

doubleml.did.datasets.make_did_CS2021(n_obs=1000, dgp_type=1, include_never_treated=True, time_type='datetime', **kwargs)

Generate synthetic panel data for difference-in-differences analysis based on Callaway and Sant’Anna (2021).

This function creates panel data with heterogeneous treatment effects across time periods and groups. The data includes pre-treatment periods, multiple treatment groups that receive treatment at different times, and optionally a never-treated group that serves as a control. The true average treatment effect on the treated (ATT) has a heterogeneous structure dependent on covariates and exposure time.

The data generating process offers six variations (dgp_type 1-6) that differ in how the regression features and propensity score features are derived:

DGP 1: Outcome and propensity score are linear (in Z)
DGP 2: Outcome is linear, propensity score is nonlinear
DGP 3: Outcome is nonlinear, propensity score is linear
DGP 4: Outcome and propensity score are nonlinear
DGP 5: Outcome is linear, propensity score is constant (experimental setting)
DGP 6: Outcome is nonlinear, propensity score is constant (experimental setting)

Let \(X= (X_1, X_2, X_3, X_4)^T \sim \mathcal{N}(0, \Sigma)\), where \(\Sigma\) is a matrix with entries \(\Sigma_{kj} = c^{|j-k|}\). The default value is \(c = 0\), corresponding to the identity matrix.
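As an illustration of the correlation structure, the matrix \(\Sigma\) can be built directly from its entry formula. The following is a minimal sketch in plain Python; the helper name `toeplitz_covariance` is hypothetical and not part of the library.

```python
def toeplitz_covariance(dim=4, c=0.0):
    """Return Sigma as a nested list with entries Sigma[k][j] = c ** |j - k|.

    Hypothetical helper for illustration only.
    """
    # Python evaluates 0.0 ** 0 as 1.0, so c = 0 yields the identity matrix.
    return [[float(c) ** abs(j - k) for j in range(dim)] for k in range(dim)]


sigma_default = toeplitz_covariance(dim=4, c=0.0)     # identity matrix
sigma_correlated = toeplitz_covariance(dim=4, c=0.5)  # entries decay with |j - k|
```

With \(c = 0.5\), neighbouring covariates have correlation 0.5, and the correlation halves with each additional step between indices.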

Further, define \(Z_j = (\tilde{Z}_j - \mathbb{E}[\tilde{Z}_j]) / \sqrt{\text{Var}(\tilde{Z}_j)}\), where \(\tilde{Z}_1 = \exp(0.5 \cdot X_1)\), \(\tilde{Z}_2 = 10 + X_2/(1 + \exp(X_1))\), \(\tilde{Z}_3 = (0.6 + X_1 \cdot X_3 / 25)^3\) and \(\tilde{Z}_4 = (20 + X_2 + X_4)^2\).
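The transformations can be sketched as follows. The function names `z_tilde` and `standardize` are hypothetical. The definition above uses the population moments \(\mathbb{E}[\tilde{Z}_j]\) and \(\text{Var}(\tilde{Z}_j)\); how the library obtains them is not stated here, so this sketch substitutes sample moments as an assumption.

```python
import math
import statistics


def z_tilde(x):
    """Apply the four nonlinear transformations to one draw x = (x1, x2, x3, x4)."""
    x1, x2, x3, x4 = x
    return (
        math.exp(0.5 * x1),
        10 + x2 / (1 + math.exp(x1)),
        (0.6 + x1 * x3 / 25) ** 3,
        (20 + x2 + x4) ** 2,
    )


def standardize(values):
    """Center and scale one transformed feature across units.

    Assumption: sample mean and sample (population-formula) standard
    deviation stand in for the true moments.
    """
    mean = statistics.fmean(values)
    std = statistics.pstdev(values)
    return [(v - mean) / std for v in values]
```

For example, `z_tilde((0.0, 0.0, 0.0, 0.0))` evaluates each transformation at the origin, giving approximately `(1.0, 10.0, 0.216, 400.0)`.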

For a feature vector \(W=(W_1, W_2, W_3, W_4)^T\) (either X or Z based on dgp_type), the core functions are:

Time-varying outcome regression function for each time period \(t\):

\[f_{reg,t}(W) = 210 + \frac{t}{T} \cdot (27.4 \cdot W_1 + 13.7 \cdot W_2 + 13.7 \cdot W_3 + 13.7 \cdot W_4)\]
Group-specific propensity function for each treatment group \(g\):

\[f_{ps,g}(W) = \xi \cdot \left(1-\frac{g}{G}\right) \cdot (-W_1 + 0.5 \cdot W_2 - 0.25 \cdot W_3 - 0.2\cdot W_4)\]

where \(T\) is the number of time periods, \(G\) is the number of treatment groups, and \(\xi\) is a scale parameter (default: 0.9).
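The two core functions translate almost literally into code. The sketch below uses hypothetical names (`f_reg`, `f_ps`, `n_periods`, `n_groups`) and assumes that `t` and `g` are passed as plain numbers; whether the library indexes periods and groups from 0 or 1 is not stated above.

```python
def f_reg(w, t, n_periods):
    """Outcome regression function f_{reg,t}(W) for period t out of T = n_periods."""
    w1, w2, w3, w4 = w
    return 210 + (t / n_periods) * (27.4 * w1 + 13.7 * w2 + 13.7 * w3 + 13.7 * w4)


def f_ps(w, g, n_groups, xi=0.9):
    """Propensity function f_{ps,g}(W) for group g out of G = n_groups."""
    w1, w2, w3, w4 = w
    return xi * (1 - g / n_groups) * (-w1 + 0.5 * w2 - 0.25 * w3 - 0.2 * w4)
```

Two properties follow directly from the formulas: the covariate contribution to the outcome grows linearly in \(t/T\), and the propensity index shrinks toward zero as \(g\) approaches \(G\), so `f_ps(w, g=n_groups, n_groups=n_groups)` is zero for any `w`.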

The panel data model is defined with the following components:

Time effects: \(\delta_t = t\) for time period \(t\)
Individual effects: \(\eta_i \sim \mathcal{N}(g_i, 1)\) where \(g_i\) is unit \(i\)’s treatment group
Treatment effects: For a unit in treatment group \(g\), the effect in period \(t\) is:

\[\theta_{i,t,g} = \max(t - t_g + 1, 0) + 0.1 \cdot X_{i,1} \cdot \max(t - t_g + 1, 0)\]

where \(t_g\) is the first treatment period for group \(g\), \(X_{i,1}\) is the first covariate for unit \(i\), and \(\max(t - t_g + 1, 0)\) represents the exposure time (0 for pre-treatment periods).
Potential outcomes for unit \(i\) in period \(t\):

\[ \begin{align}\begin{aligned}Y_{i,t}(0) &= f_{reg,t}(W_{reg}) + \delta_t + \eta_i + \varepsilon_{i,0,t}\\Y_{i,t}(1) &= Y_{i,t}(0) + \theta_{i,t,g} + (\varepsilon_{i,1,t} - \varepsilon_{i,0,t})\end{aligned}\end{align} \]

where \(\varepsilon_{i,0,t}, \varepsilon_{i,1,t} \sim \mathcal{N}(0, 1)\).
Observed outcomes:

\[Y_{i,t} = Y_{i,t}(1) \cdot 1\{t \geq t_g\} + Y_{i,t}(0) \cdot 1\{t < t_g\}\]
Treatment assignment:

For non-experimental settings (DGP 1-4), the probability of being in treatment group \(g\) is:

\[P(G_i = g) = \frac{\exp(f_{ps,g}(W_{ps}))}{\sum_{g'} \exp(f_{ps,g'}(W_{ps}))}\]

For experimental settings (DGP 5-6), each treatment group (including never-treated) has equal probability:

\[P(G_i = g) = \frac{1}{G} \text{ for all } g\]
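The model components above can be sketched for a single unit and period as follows. All function names are hypothetical, the noise terms are passed in explicitly so the arithmetic is reproducible, and representing a never-treated unit by `t_g = math.inf` is a convention assumed for this sketch rather than something the source specifies.

```python
import math


def exposure_time(t, t_g):
    """Exposure time max(t - t_g + 1, 0); zero in pre-treatment periods."""
    return max(t - t_g + 1, 0)


def treatment_effect(t, t_g, x1):
    """theta_{i,t,g} = e + 0.1 * x1 * e, where e is the exposure time."""
    e = exposure_time(t, t_g)
    return e + 0.1 * x1 * e


def potential_outcomes(f_reg_value, delta_t, eta_i, theta, eps0, eps1):
    """Return (Y(0), Y(1)) given the regression value, effects, and noise draws."""
    y0 = f_reg_value + delta_t + eta_i + eps0
    y1 = y0 + theta + (eps1 - eps0)
    return y0, y1


def observed_outcome(y0, y1, t, t_g):
    """Y(1) is observed from the first treatment period on, Y(0) before."""
    return y1 if t >= t_g else y0


def group_probabilities(scores):
    """Softmax over the group scores f_{ps,g}(W_ps), one score per group."""
    shift = max(scores)  # subtracting the maximum avoids overflow in exp
    weights = [math.exp(s - shift) for s in scores]
    total = sum(weights)
    return [w / total for w in weights]
```

For a unit first treated in period 3 with \(X_{i,1} = 0.5\), `treatment_effect(t=4, t_g=3, x1=0.5)` has exposure time 2 and evaluates to about 2.1. When all scores are equal, `group_probabilities` returns the uniform distribution, which matches the experimental settings.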

The variables \(W_{reg}\) and \(W_{ps}\) are selected based on the DGP type:

\[ \begin{align}\begin{aligned}DGP1:\quad W_{reg} &= Z \quad W_{ps} = Z\\DGP2:\quad W_{reg} &= Z \quad W_{ps} = X\\DGP3:\quad W_{reg} &= X \quad W_{ps} = Z\\DGP4:\quad W_{reg} &= X \quad W_{ps} = X\\DGP5:\quad W_{reg} &= Z \quad W_{ps} = 0\\DGP6:\quad W_{reg} &= X \quad W_{ps} = 0\end{aligned}\end{align} \]

where settings 5-6 correspond to experimental designs with equal probability across treatment groups.
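The selection table can be expressed as a small lookup. This is an illustrative sketch with hypothetical names (`DGP_FEATURES`, `select_features`); it is not the library's internal code.

```python
# Maps dgp_type to (source of W_reg, source of W_ps); None means W_ps = 0.
DGP_FEATURES = {
    1: ("Z", "Z"),
    2: ("Z", "X"),
    3: ("X", "Z"),
    4: ("X", "X"),
    5: ("Z", None),
    6: ("X", None),
}


def select_features(dgp_type, x, z):
    """Return (w_reg, w_ps) for one unit, given its raw x and transformed z."""
    if dgp_type not in DGP_FEATURES:
        raise ValueError(f"dgp_type must be one of 1-6, got {dgp_type!r}")
    reg_name, ps_name = DGP_FEATURES[dgp_type]
    sources = {"X": x, "Z": z}
    w_reg = sources[reg_name]
    # A zero feature vector makes every propensity score equal across groups.
    w_ps = sources[ps_name] if ps_name is not None else [0.0] * len(x)
    return w_reg, w_ps
```

With \(W_{ps} = 0\), the propensity function evaluates to zero for every group, so the softmax in the assignment rule reduces to equal probabilities.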

Parameters:

n_obs (int, default=1000) – The number of observations to simulate.
dgp_type (int, default=1) – The data generating process to be used (1-6).
include_never_treated (bool, default=True) – Whether to include units that are never treated.
time_type (str, default="datetime") – Type of time variable. Either "datetime" or "float".
**kwargs –
Additional keyword arguments. Accepts the following parameters:

c (float, default=0.0):
Parameter for correlation structure in X.

dim_x (int, default=4):
Dimension of feature vectors.

xi (float, default=0.9):
Scale parameter for the propensity score function.

n_periods (int, default=5):
Number of time periods.

anticipation_periods (int, default=0):
Number of periods before treatment where anticipation effects occur.

n_pre_treat_periods (int, default=2):
Number of pre-treatment periods.

start_date (str, default="2025-01"):
Start date for datetime time variables.

Returns:

DataFrame containing the simulated panel data.

Return type:

pandas.DataFrame
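A minimal usage sketch, based only on the signature and keyword arguments documented above. The wrapper `simulate_panel` is hypothetical, and seeding NumPy's global generator assumes that the function draws its random numbers from it, which is not stated on this page.

```python
def simulate_panel(n_obs=500, dgp_type=4, seed=42):
    """Generate one simulated panel with a numeric time variable."""
    # Local imports keep this sketch loadable without numpy/doubleml installed.
    import numpy as np
    from doubleml.did.datasets import make_did_CS2021

    # Assumption: randomness comes from NumPy's global generator.
    np.random.seed(seed)
    return make_did_CS2021(
        n_obs=n_obs,
        dgp_type=dgp_type,            # nonlinear outcome and propensity score
        include_never_treated=True,
        time_type="float",
        n_periods=5,                  # passed through **kwargs
        n_pre_treat_periods=2,        # passed through **kwargs
    )
```

Calling `simulate_panel().head()` then shows the first rows of the returned `pandas.DataFrame`; inspect `df.columns` to see which columns are present rather than relying on assumed names.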

References

Callaway, B. and Sant’Anna, P. H. (2021), Difference-in-Differences with multiple time periods. Journal of Econometrics, 225(2), 200-230. doi:10.1016/j.jeconom.2020.12.001.