doubleml.datasets.make_did_SZ2020(n_obs=500, dgp_type=1, cross_sectional_data=False, return_type='DoubleMLData', **kwargs)

Generates data from a difference-in-differences model used in Sant’Anna and Zhao (2020). The data generating process is defined as follows. For a generic \(W=(W_1, W_2, W_3, W_4)^T\), let

\[ \begin{align}\begin{aligned}f_{reg}(W) &= 210 + 27.4 \cdot W_1 +13.7 \cdot (W_2 + W_3 + W_4),\\f_{ps}(W) &= 0.75 \cdot (-W_1 + 0.5 \cdot W_2 -0.25 \cdot W_3 - 0.1 \cdot W_4).\end{aligned}\end{align} \]
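The two index functions translate directly into code. The following is a minimal sketch in plain Python, not the library's implementation; the function names `f_reg` and `f_ps` simply mirror the notation above, and the argument `w` is assumed to be a sequence of four numbers.

```python
def f_reg(w):
    """Outcome regression index f_reg(W) for a length-4 sequence w."""
    w1, w2, w3, w4 = w
    return 210 + 27.4 * w1 + 13.7 * (w2 + w3 + w4)


def f_ps(w):
    """Propensity score index f_ps(W) for a length-4 sequence w."""
    w1, w2, w3, w4 = w
    return 0.75 * (-w1 + 0.5 * w2 - 0.25 * w3 - 0.1 * w4)


# At W = 0 only the intercept of f_reg survives and f_ps vanishes.
print(f_reg((0, 0, 0, 0)))  # 210.0
print(f_ps((0, 0, 0, 0)))   # 0.0
```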

Let \(X = (X_1, X_2, X_3, X_4)^T \sim \mathcal{N}(0, \Sigma)\), where \(\Sigma\) is a matrix with entries \(\Sigma_{kj} = c^{|j-k|}\). The default value is \(c = 0\), corresponding to the identity matrix. Further, define \(Z_j = (\tilde{Z}_j - \mathbb{E}[\tilde{Z}_j]) / \sqrt{\text{Var}(\tilde{Z}_j)}\), where \(\tilde{Z}_1 = \exp(0.5 \cdot X_1)\), \(\tilde{Z}_2 = 10 + X_2/(1 + \exp(X_1))\), \(\tilde{Z}_3 = (0.6 + X_1 \cdot X_3 / 25)^3\) and \(\tilde{Z}_4 = (20 + X_2 + X_4)^2\). First, define

\[ \begin{align}\begin{aligned}Y_0(0) &= f_{reg}(W_{reg}) + \nu(W_{reg}, D) + \varepsilon_0,\\Y_1(d) &= 2 \cdot f_{reg}(W_{reg}) + \nu(W_{reg}, D) + \varepsilon_1(d),\\p(W_{ps}) &= \frac{\exp(f_{ps}(W_{ps}))}{1 + \exp(f_{ps}(W_{ps}))},\\D &= 1\{p(W_{ps}) \ge U\},\end{aligned}\end{align} \]

where \(\varepsilon_0\) and \(\varepsilon_1(d)\), \(d = 0, 1\), are independent standard normal random variables, \(U \sim \mathcal{U}[0, 1]\) is an independent standard uniform random variable and \(\nu(W_{reg}, D)\sim \mathcal{N}(D \cdot f_{reg}(W_{reg}),1)\). The different data generating processes are defined via

\[ \begin{align}\begin{aligned}DGP1:\quad W_{reg} &= Z, \quad W_{ps} = Z\\DGP2:\quad W_{reg} &= Z, \quad W_{ps} = X\\DGP3:\quad W_{reg} &= X, \quad W_{ps} = Z\\DGP4:\quad W_{reg} &= X, \quad W_{ps} = X\\DGP5:\quad W_{reg} &= Z, \quad W_{ps} = 0\\DGP6:\quad W_{reg} &= X, \quad W_{ps} = 0,\end{aligned}\end{align} \]

such that the last two settings correspond to an experimental setting with treatment probability \(P(D=1) = \frac{1}{2}.\) For panel data, the outcome is defined directly as the difference \(Y = Y_1(D) - Y_0(0)\). For cross-sectional data, the flag cross_sectional_data has to be set to True. The outcome is then defined as

\[Y = T \cdot Y_1(D) + (1-T) \cdot Y_0(0),\]

where \(T = 1\{U_T\le \lambda_T \}\) with \(U_T\sim \mathcal{U}[0, 1]\) and \(\lambda_T=0.5\). The true average treatment effect on the treated is zero for all data generating processes.
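The steps above can be sketched end to end for one configuration: DGP1, panel data, and the default \(c = 0\). The sketch below uses only the Python standard library and is not the library's implementation. The helper name `simulate_panel_dgp1` is hypothetical, the standardization of \(\tilde{Z}\) uses sample moments as a stand-in for \(\mathbb{E}[\tilde{Z}_j]\) and \(\text{Var}(\tilde{Z}_j)\), and the same draw of \(\nu(W_{reg}, D)\) is assumed to enter both periods, as the shared notation suggests. Under that last assumption \(\nu\) cancels in the panel outcome, so \(Y = f_{reg}(W_{reg}) + \varepsilon_1(D) - \varepsilon_0\), and \(Y_1(1) - Y_1(0) = \varepsilon_1(1) - \varepsilon_1(0)\) has mean zero, in line with the stated treatment effect of zero.

```python
import math
import random
import statistics


def f_reg(w):
    return 210 + 27.4 * w[0] + 13.7 * (w[1] + w[2] + w[3])


def f_ps(w):
    return 0.75 * (-w[0] + 0.5 * w[1] - 0.25 * w[2] - 0.1 * w[3])


def simulate_panel_dgp1(n_obs=500, seed=0):
    """Hypothetical helper: draw (z, y, d) for DGP1 with panel data and c = 0."""
    rng = random.Random(seed)
    # c = 0 makes Sigma the identity, so the four components of X are independent.
    x = [[rng.gauss(0.0, 1.0) for _ in range(4)] for _ in range(n_obs)]
    z_tilde = [
        [
            math.exp(0.5 * x1),
            10 + x2 / (1 + math.exp(x1)),
            (0.6 + x1 * x3 / 25) ** 3,
            (20 + x2 + x4) ** 2,
        ]
        for x1, x2, x3, x4 in x
    ]
    # Assumption: standardize with sample moments instead of population moments.
    cols = list(zip(*z_tilde))
    means = [statistics.fmean(col) for col in cols]
    sds = [statistics.pstdev(col) for col in cols]
    z = [[(v - m) / s for v, m, s in zip(row, means, sds)] for row in z_tilde]

    y, d = [], []
    for w in z:  # DGP1: W_reg = W_ps = Z
        p = 1 / (1 + math.exp(-f_ps(w)))  # equals exp(f) / (1 + exp(f))
        d_i = 1 if p >= rng.random() else 0
        nu = rng.gauss(d_i * f_reg(w), 1.0)  # shared by both periods (assumption)
        y0 = f_reg(w) + nu + rng.gauss(0.0, 1.0)
        # Only eps_1(d) for the realized treatment status is drawn here.
        y1 = 2 * f_reg(w) + nu + rng.gauss(0.0, 1.0)
        y.append(y1 - y0)
        d.append(d_i)
    return z, y, d


z, y, d = simulate_panel_dgp1(n_obs=2000, seed=42)
residual = [y_i - f_reg(z_i) for y_i, z_i in zip(y, z)]
print("share treated:", sum(d) / len(d))
print("mean of Y - f_reg(Z):", statistics.fmean(residual))
```

The printed residual mean should be close to zero, since \(Y - f_{reg}(Z)\) reduces to the difference of two independent standard normal draws. A cross-sectional variant would additionally draw \(T\) and return \(T \cdot Y_1(D) + (1-T) \cdot Y_0(0)\) instead of the difference.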

  • n_obs – The number of observations to simulate.

  • dgp_type – The DGP to be used. Default value is 1 (integer).

  • cross_sectional_data – Indicates whether the setting uses cross-sectional or panel data. Default value is False.

  • return_type

    If 'DoubleMLData' or DoubleMLData, returns a DoubleMLData object.

    If 'DataFrame', 'pd.DataFrame' or pd.DataFrame, returns a pd.DataFrame.

    If 'array', 'np.ndarray', 'np.array' or np.ndarray, returns np.ndarray objects (x, y, d) for panel data or (x, y, d, t) for cross-sectional data.

  • **kwargs – Additional keyword arguments to set non-default values for the parameters \(\xi=0.75\) (the scaling factor in \(f_{ps}\)), \(c=0.0\) and \(\lambda_T=0.5\).
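
To illustrate what the keyword \(c\) controls, the covariance matrix \(\Sigma_{kj} = c^{|j-k|}\) can be built in a few lines. This is a sketch in plain Python; the helper name `make_sigma` is hypothetical and not part of the library.

```python
def make_sigma(c=0.0, dim=4):
    """Hypothetical helper: covariance matrix with entries c ** |j - k|."""
    return [[c ** abs(j - k) for k in range(dim)] for j in range(dim)]


# The default c = 0 gives the identity matrix, because 0.0 ** 0 == 1.0 in Python.
print(make_sigma(0.0)[0])  # [1.0, 0.0, 0.0, 0.0]

# A positive c makes neighbouring covariates correlated, decaying with distance.
print(make_sigma(0.5)[0])  # [1.0, 0.5, 0.25, 0.125]
```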


Sant’Anna, P. H. and Zhao, J. (2020), Doubly robust difference-in-differences estimators. Journal of Econometrics, 219(1), 101-122. doi:10.1016/j.jeconom.2020.06.003.