# doubleml.datasets.make_did_SZ2020

doubleml.datasets.make_did_SZ2020(n_obs=500, dgp_type=1, cross_sectional_data=False, return_type='DoubleMLData', **kwargs)

Generates data from a difference-in-differences model used in Sant’Anna and Zhao (2020). The data generating process is defined as follows. For a generic $$W=(W_1, W_2, W_3, W_4)^T$$, let

\begin{align}\begin{aligned}f_{reg}(W) &= 210 + 27.4 \cdot W_1 +13.7 \cdot (W_2 + W_3 + W_4),\\f_{ps}(W) &= 0.75 \cdot (-W_1 + 0.5 \cdot W_2 -0.25 \cdot W_3 - 0.1 \cdot W_4).\end{aligned}\end{align}
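As a minimal sketch, the two index functions above can be written in NumPy as follows. The function names `f_reg` and `f_ps` and the row-wise `(n_obs, 4)` array layout are illustrative assumptions and are not taken from the library's source.

```python
import numpy as np


def f_reg(w):
    """Outcome regression index; w has shape (n_obs, 4)."""
    w = np.asarray(w, dtype=float)
    return 210 + 27.4 * w[:, 0] + 13.7 * (w[:, 1] + w[:, 2] + w[:, 3])


def f_ps(w):
    """Propensity score index; w has shape (n_obs, 4)."""
    w = np.asarray(w, dtype=float)
    return 0.75 * (-w[:, 0] + 0.5 * w[:, 1] - 0.25 * w[:, 2] - 0.1 * w[:, 3])


# Two example rows: W = (0, 0, 0, 0) and W = (1, 1, 1, 1)
w = np.array([[0.0, 0.0, 0.0, 0.0],
              [1.0, 1.0, 1.0, 1.0]])
reg_values = f_reg(w)
ps_values = f_ps(w)
```

For $$W = (1, 1, 1, 1)^T$$ these evaluate to $$210 + 27.4 + 3 \cdot 13.7 = 278.5$$ and $$0.75 \cdot (-0.85) = -0.6375$$.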

Let $$X= (X_1, X_2, X_3, X_4)^T \sim \mathcal{N}(0, \Sigma)$$, where $$\Sigma$$ is a matrix with entries $$\Sigma_{kj} = c^{|j-k|}$$. The default value is $$c = 0$$, corresponding to the identity matrix. Further, define $$Z_j = (\tilde{Z}_j - \mathbb{E}[\tilde{Z}_j]) / \sqrt{\text{Var}(\tilde{Z}_j)}$$, where $$\tilde{Z}_1 = \exp(0.5 \cdot X_1)$$, $$\tilde{Z}_2 = 10 + X_2/(1 + \exp(X_1))$$, $$\tilde{Z}_3 = (0.6 + X_1 \cdot X_3 / 25)^3$$ and $$\tilde{Z}_4 = (20 + X_2 + X_4)^2$$. First, define

\begin{align}\begin{aligned}Y_0(0) &= f_{reg}(W_{reg}) + \nu(W_{reg}, D) + \varepsilon_0,\\Y_1(d) &= 2 \cdot f_{reg}(W_{reg}) + \nu(W_{reg}, D) + \varepsilon_1(d),\\p(W_{ps}) &= \frac{\exp(f_{ps}(W_{ps}))}{1 + \exp(f_{ps}(W_{ps}))},\\D &= 1\{p(W_{ps}) \ge U\},\end{aligned}\end{align}

where $$\varepsilon_0, \varepsilon_1(d), d=0, 1$$ are independent standard normal random variables, $$U \sim \mathcal{U}[0, 1]$$ is an independent standard uniform random variable and $$\nu(W_{reg}, D)\sim \mathcal{N}(D \cdot f_{reg}(W_{reg}),1)$$. The different data generating processes are defined via

\begin{align}\begin{aligned}DGP1:\quad W_{reg} &= Z \quad W_{ps} = Z\\DGP2:\quad W_{reg} &= Z \quad W_{ps} = X\\DGP3:\quad W_{reg} &= X \quad W_{ps} = Z\\DGP4:\quad W_{reg} &= X \quad W_{ps} = X\\DGP5:\quad W_{reg} &= Z \quad W_{ps} = 0\\DGP6:\quad W_{reg} &= X \quad W_{ps} = 0,\end{aligned}\end{align}

such that the last two settings correspond to an experimental setting with treatment probability $$P(D=1) = \frac{1}{2}$$. For panel data, the outcome is defined directly as the difference $$Y = Y_1(D) - Y_0(0)$$. For cross-sectional data, the flag cross_sectional_data has to be set to True. The outcome is then defined as

$Y = T \cdot Y_1(D) + (1-T) \cdot Y_0(0),$

where $$T = 1\{U_T\le \lambda_T \}$$ with $$U_T\sim \mathcal{U}[0, 1]$$ and $$\lambda_T=0.5$$. The true average treatment effect on the treated is zero for all data generating processes.
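The data generating process described above can be sketched in NumPy as follows. This is an illustrative re-implementation of the formulas, not the library's code: the function name `simulate_did_sketch`, the `seed` argument, the returned dictionary, and the use of sample moments in place of $$\mathbb{E}[\tilde{Z}_j]$$ and $$\text{Var}(\tilde{Z}_j)$$ are all assumptions. The argument `xi` is assumed to be the leading factor 0.75 in $$f_{ps}$$.

```python
import numpy as np


def simulate_did_sketch(n_obs=500, dgp_type=1, cross_sectional_data=False,
                        xi=0.75, c=0.0, lambda_t=0.5, seed=None):
    """Illustrative simulation of the formulas above (hypothetical helper)."""
    rng = np.random.default_rng(seed)

    # X ~ N(0, Sigma) with Sigma_kj = c^{|j-k|}; c = 0 gives the identity
    idx = np.arange(4)
    sigma = np.power(float(c), np.abs(idx[:, None] - idx[None, :]))
    x = rng.multivariate_normal(np.zeros(4), sigma, size=n_obs)

    # Transformed covariates, standardized with sample moments (assumption)
    z_tilde = np.column_stack([
        np.exp(0.5 * x[:, 0]),
        10 + x[:, 1] / (1 + np.exp(x[:, 0])),
        (0.6 + x[:, 0] * x[:, 2] / 25) ** 3,
        (20 + x[:, 1] + x[:, 3]) ** 2,
    ])
    z = (z_tilde - z_tilde.mean(axis=0)) / z_tilde.std(axis=0)

    # DGP 1 to 6: choose (W_reg, W_ps)
    zeros = np.zeros_like(x)
    w_reg, w_ps = {1: (z, z), 2: (z, x), 3: (x, z),
                   4: (x, x), 5: (z, zeros), 6: (x, zeros)}[dgp_type]

    reg = 210 + 27.4 * w_reg[:, 0] + 13.7 * w_reg[:, 1:].sum(axis=1)
    ps_index = xi * (-w_ps[:, 0] + 0.5 * w_ps[:, 1]
                     - 0.25 * w_ps[:, 2] - 0.1 * w_ps[:, 3])

    # Treatment: D = 1{p(W_ps) >= U} with a logistic propensity score
    p = 1.0 / (1.0 + np.exp(-ps_index))
    d = (p >= rng.uniform(size=n_obs)).astype(float)

    # Potential outcomes; nu ~ N(D * f_reg(W_reg), 1)
    nu = rng.normal(loc=d * reg, scale=1.0)
    y0 = reg + nu + rng.standard_normal(n_obs)
    y1_d0 = 2 * reg + nu + rng.standard_normal(n_obs)
    y1_d1 = 2 * reg + nu + rng.standard_normal(n_obs)
    y1 = np.where(d == 1, y1_d1, y1_d0)

    out = {"x": x, "z": z, "d": d, "effect": y1_d1 - y1_d0}
    if cross_sectional_data:
        # T = 1{U_T <= lambda_T} selects which period is observed
        t = (rng.uniform(size=n_obs) <= lambda_t).astype(float)
        out["t"] = t
        out["y"] = t * y1 + (1 - t) * y0
    else:
        out["y"] = y1 - y0
    return out


panel = simulate_did_sketch(n_obs=20_000, dgp_type=4, seed=1)
cross = simulate_did_sketch(n_obs=20_000, dgp_type=4,
                            cross_sectional_data=True, seed=1)

# Sample analogue of the ATT, using the (normally unobserved) unit-level effect
att_hat = panel["effect"][panel["d"] == 1].mean()
```

Under these formulas $$Y_1(1) - Y_1(0) = \varepsilon_1(1) - \varepsilon_1(0)$$, which has mean zero, so `att_hat` should be close to zero in large samples.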

Parameters:
• n_obs – The number of observations to simulate.

• dgp_type – The DGP to be used. Default value is 1 (integer).

• cross_sectional_data – Indicates whether the setting uses cross-sectional or panel data. Default value is False.

• return_type

If 'DoubleMLData' or DoubleMLData, returns a DoubleMLData object.

If 'DataFrame', 'pd.DataFrame' or pd.DataFrame, returns a pd.DataFrame.

If 'array', 'np.ndarray', 'np.array' or np.ndarray, returns np.ndarrays (x, y, d) or, for cross-sectional data, (x, y, d, t).

• **kwargs – Additional keyword arguments to set non-default values for the parameters $$\xi=0.75$$ (the scaling factor in $$f_{ps}$$), $$c=0.0$$ and $$\lambda_T=0.5$$.

References

Sant’Anna, P. H. and Zhao, J. (2020), Doubly robust difference-in-differences estimators. Journal of Econometrics, 219(1), 101-122. doi:10.1016/j.jeconom.2020.06.003.