doubleml.datasets.make_confounded_plr_data

doubleml.datasets.make_confounded_plr_data(n_obs=500, theta=5.0, cf_y=0.04, cf_d=0.04, **kwargs)

Generates confounded data from a partially linear regression model.

The data generating process is defined as follows (similar to the Monte Carlo simulation used in Sant’Anna and Zhao (2020)). Let \(X= (X_1, X_2, X_3, X_4, X_5)^T \sim \mathcal{N}(0, \Sigma)\), where \(\Sigma\) is a matrix with entries \(\Sigma_{kj} = c^{|j-k|}\). The default value is \(c = 0\), corresponding to the identity matrix. Further, define \(Z_j = (\tilde{Z}_j - \mathbb{E}[\tilde{Z}_j]) / \sqrt{\text{Var}(\tilde{Z}_j)}\), where

\[ \begin{align}\begin{aligned}\tilde{Z}_1 &= \exp(0.5 \cdot X_1)\\\tilde{Z}_2 &= 10 + X_2/(1 + \exp(X_1))\\\tilde{Z}_3 &= (0.6 + X_1 \cdot X_3 / 25)^3\\\tilde{Z}_4 &= (20 + X_2 + X_4)^2.\end{aligned}\end{align} \]
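The covariate step above can be sketched in NumPy. This is an illustrative reading of the formulas, not the library's implementation: the helper name `simulate_covariates`, its `seed` and `dim_x` arguments, and the use of sample moments in place of \(\mathbb{E}\) and \(\text{Var}\) are assumptions.

```python
import numpy as np


def simulate_covariates(n_obs=500, c=0.0, dim_x=5, seed=None):
    """Draw X ~ N(0, Sigma) with Sigma_kj = c**|j-k| and build standardized Z.

    Hypothetical helper for illustration only.
    """
    rng = np.random.default_rng(seed)
    idx = np.arange(dim_x)
    # Toeplitz-type covariance; c = 0 yields the identity because 0.0**0 == 1.0
    sigma = np.power(float(c), np.abs(idx[:, None] - idx[None, :]))
    x = rng.multivariate_normal(np.zeros(dim_x), sigma, size=n_obs)

    # The four nonlinear transformations Z~_1, ..., Z~_4
    z_tilde = np.column_stack([
        np.exp(0.5 * x[:, 0]),
        10 + x[:, 1] / (1 + np.exp(x[:, 0])),
        (0.6 + x[:, 0] * x[:, 2] / 25) ** 3,
        (20 + x[:, 1] + x[:, 3]) ** 2,
    ])
    # Assumption: sample mean and standard deviation stand in for E and Var
    z = (z_tilde - z_tilde.mean(axis=0)) / z_tilde.std(axis=0)
    return x, z


x, z = simulate_covariates(n_obs=1000, seed=0)
print(x.shape, z.shape)
```

With in-sample standardization, every column of `z` has mean zero and unit standard deviation by construction.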

Additionally, generate an unobserved confounder \(A \sim \mathcal{U}[-1, 1]\). First, define the treatment as

\[D = -Z_1 + 0.5 \cdot Z_2 - 0.25 \cdot Z_3 - 0.1 \cdot Z_4 + \gamma_A \cdot A + \varepsilon_D\]

with \(\varepsilon_D \sim \mathcal{N}(0,1)\). Since \(A\) is independent of \(X\), the long and short forms of the treatment regression are given as

\[ \begin{align}\begin{aligned}\mathbb{E}[D|X,A] &= -Z_1 + 0.5 \cdot Z_2 - 0.25 \cdot Z_3 - 0.1 \cdot Z_4 + \gamma_A \cdot A\\\mathbb{E}[D|X] &= -Z_1 + 0.5 \cdot Z_2 - 0.25 \cdot Z_3 - 0.1 \cdot Z_4.\end{aligned}\end{align} \]

Further, generate the outcome of interest \(Y\) as

\[ \begin{align}\begin{aligned}Y &= \theta \cdot D + g(Z) + \beta_A \cdot A + \varepsilon\\g(Z) &= 210 + 27.4 \cdot Z_1 +13.7 \cdot (Z_2 + Z_3 + Z_4)\end{aligned}\end{align} \]

where \(\varepsilon \sim \mathcal{N}(0,5)\). This implies an average treatment effect of \(\theta\). The long and short forms of the conditional expectation are

\[ \begin{align}\begin{aligned}\mathbb{E}[Y|D, X, A] &= \theta \cdot D + g(Z) + \beta_A \cdot A\\\mathbb{E}[Y|D, X] &= (\theta + \gamma_A\beta_A \frac{\mathrm{Var}(A)}{\mathrm{Var}(D)}) \cdot D + g(Z).\end{aligned}\end{align} \]
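The effect of omitting \(A\) can be illustrated with a small simulation. The sketch below is not the library's code: the function names, the fixed values `gamma_a=1.0` and `beta_a=2.0`, the standard-normal stand-in for \(Z\), and the reading of \(\mathcal{N}(0,5)\) as variance 5 are all assumptions made for illustration.

```python
import numpy as np


def simulate_treatment_and_outcome(z, theta=5.0, gamma_a=1.0, beta_a=2.0, seed=None):
    """Generate (Y, D, A) from a standardized Z following the equations above.

    gamma_a and beta_a are fixed, hypothetical values here.
    """
    rng = np.random.default_rng(seed)
    n_obs = z.shape[0]
    a = rng.uniform(-1.0, 1.0, size=n_obs)               # unobserved confounder
    eps_d = rng.normal(0.0, 1.0, size=n_obs)             # treatment noise
    eps_y = rng.normal(0.0, np.sqrt(5.0), size=n_obs)    # assumption: 5 is a variance

    m_short = -z[:, 0] + 0.5 * z[:, 1] - 0.25 * z[:, 2] - 0.1 * z[:, 3]
    d = m_short + gamma_a * a + eps_d
    g = 210 + 27.4 * z[:, 0] + 13.7 * (z[:, 1] + z[:, 2] + z[:, 3])
    y = theta * d + g + beta_a * a + eps_y
    return y, d, a


def ols_coef_on_d(y, d, controls):
    """Least-squares coefficient on d when regressing y on d, controls and a constant."""
    design = np.column_stack([d, controls, np.ones_like(d)])
    coef, *_ = np.linalg.lstsq(design, y, rcond=None)
    return coef[0]


rng = np.random.default_rng(1)
z = rng.standard_normal((20_000, 4))  # stand-in for the standardized Z
y, d, a = simulate_treatment_and_outcome(z, seed=2)

theta_long = ols_coef_on_d(y, d, np.column_stack([z, a]))  # controls for A
theta_short = ols_coef_on_d(y, d, z)                       # omits A
print(f"long: {theta_long:.3f}, short: {theta_short:.3f}")
```

Because \(g\) is linear in \(Z\) here, ordinary least squares suffices: the long regression recovers a value close to \(\theta\), while the short regression's coefficient on \(D\) is shifted by the omitted confounder.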

Consequently, the strength of confounding is determined by \(\gamma_A\) and \(\beta_A\). Both are calibrated in sample so that the confounding of the outcome and of the Riesz representer matches the targets `cf_y` and `cf_d`.
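As a rough illustration of what an in-sample calibration could look like, the sketch below picks \(\beta_A\) so that the confounder accounts for a target share of the residual variation of the outcome. The definition of that share as `mean((beta_a * a)**2) / mean((beta_a * a + eps_y)**2)` and the helper `calibrate_beta_a` are assumptions for this example; the library may define and solve the problem differently.

```python
import numpy as np


def calibrate_beta_a(a, eps_y, cf_y=0.04):
    """Solve mean((b*a)^2) / mean((b*a + eps_y)^2) = cf_y for the positive root b.

    Hypothetical helper; the share definition is an assumption.
    """
    m_aa = np.mean(a * a)
    m_ae = np.mean(a * eps_y)
    m_ee = np.mean(eps_y * eps_y)
    # Rearranged: (1 - cf_y) * m_aa * b^2 - 2 * cf_y * m_ae * b - cf_y * m_ee = 0
    qa = (1.0 - cf_y) * m_aa
    qb = -2.0 * cf_y * m_ae
    qc = -cf_y * m_ee
    return (-qb + np.sqrt(qb**2 - 4.0 * qa * qc)) / (2.0 * qa)


rng = np.random.default_rng(0)
a = rng.uniform(-1.0, 1.0, size=500)
eps_y = rng.normal(0.0, np.sqrt(5.0), size=500)  # assumption: 5 is a variance

beta_a = calibrate_beta_a(a, eps_y, cf_y=0.04)
share = np.mean((beta_a * a) ** 2) / np.mean((beta_a * a + eps_y) ** 2)
print(f"beta_a = {beta_a:.3f}, achieved share = {share:.4f}")
```

Solving the quadratic directly hits the target exactly in the given sample; a numerical optimizer would be an equally reasonable choice.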

The observed data are given as \(W = (Y, D, X)\). Further, oracle values of the confounder \(A\), the transformed covariates \(Z\), the effect \(\theta\), the coefficients \(\gamma_A\) and \(\beta_A\), and the long and short forms of the main regression and the treatment regression are returned in a dictionary.

Parameters:
  • n_obs (int) – The number of observations to simulate. Default is 500.

  • theta (float or int) – Average treatment effect. Default is 5.0.

  • cf_y (float) – Share of the residual variation of the outcome explained by the latent confounding variable, given as a fraction. Default is 0.04.

  • cf_d (float) – Relative gain in the variation of the Riesz representer generated by the latent confounding variable, given as a fraction. Default is 0.04.

Returns:

res_dict – Dictionary with entries x, y, d and oracle_values.

Return type:

dictionary

References

Sant’Anna, P. H. and Zhao, J. (2020), Doubly robust difference-in-differences estimators. Journal of Econometrics, 219(1), 101-122. doi:10.1016/j.jeconom.2020.06.003.