3.2.11. doubleml.plm.datasets.make_confounded_plr_data
- doubleml.plm.datasets.make_confounded_plr_data(n_obs=500, theta=5.0, cf_y=0.04, cf_d=0.04, **kwargs)
- Generates confounded data from a partially linear regression model.

The data generating process is defined as follows (similar to the Monte Carlo simulation used in Sant’Anna and Zhao (2020)). Let \(X = (X_1, X_2, X_3, X_4, X_5)^T \sim \mathcal{N}(0, \Sigma)\), where \(\Sigma\) is a matrix with entries \(\Sigma_{kj} = c^{|j-k|}\). The default value is \(c = 0\), corresponding to the identity matrix. Further, define \(Z_j = (\tilde{Z}_j - \mathbb{E}[\tilde{Z}_j]) / \sqrt{\text{Var}(\tilde{Z}_j)}\), where

\[
\begin{aligned}
\tilde{Z}_1 &= \exp(0.5 \cdot X_1)\\
\tilde{Z}_2 &= 10 + X_2/(1 + \exp(X_1))\\
\tilde{Z}_3 &= (0.6 + X_1 \cdot X_3 / 25)^3\\
\tilde{Z}_4 &= (20 + X_2 + X_4)^2.
\end{aligned}
\]

Additionally, generate a confounder \(A \sim \mathcal{U}[-1, 1]\). First, define the treatment as

\[D = -Z_1 + 0.5 \cdot Z_2 - 0.25 \cdot Z_3 - 0.1 \cdot Z_4 + \gamma_A \cdot A + \varepsilon_D\]

with \(\varepsilon_D \sim \mathcal{N}(0,1)\). Since \(A\) is independent of \(X\), the long and short forms of the treatment regression are given as

\[
\begin{aligned}
\mathbb{E}[D|X,A] &= -Z_1 + 0.5 \cdot Z_2 - 0.25 \cdot Z_3 - 0.1 \cdot Z_4 + \gamma_A \cdot A\\
\mathbb{E}[D|X] &= -Z_1 + 0.5 \cdot Z_2 - 0.25 \cdot Z_3 - 0.1 \cdot Z_4.
\end{aligned}
\]

Further, generate the outcome of interest \(Y\) as

\[
\begin{aligned}
Y &= \theta \cdot D + g(Z) + \beta_A \cdot A + \varepsilon\\
g(Z) &= 210 + 27.4 \cdot Z_1 + 13.7 \cdot (Z_2 + Z_3 + Z_4)
\end{aligned}
\]

where \(\varepsilon \sim \mathcal{N}(0,5)\). This implies an average treatment effect of \(\theta\).
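The process above can be sketched in plain Python to make the individual steps concrete. This is an illustrative sketch using only the standard library, not the library's implementation: the function name `simulate_confounded_plr` and the helper `standardize` are hypothetical, \(\gamma_A\) and \(\beta_A\) are passed in as fixed values instead of being derived from `cf_d` and `cf_y`, the standardization of \(\tilde{Z}_j\) uses sample moments, \(c = 0\) is assumed, and \(\mathcal{N}(0,5)\) is read as variance 5.

```python
import math
import random
import statistics


def standardize(values):
    """Center to mean 0 and scale to unit standard deviation (sample moments)."""
    mean = statistics.fmean(values)
    std = statistics.pstdev(values)
    return [(v - mean) / std for v in values]


def simulate_confounded_plr(n_obs=500, theta=5.0, gamma_a=0.5, beta_a=2.0, seed=42):
    """Draw (Y, D, X, A, Z) following the description above with c = 0.

    gamma_a and beta_a are hypothetical fixed inputs here; they are not
    calibrated from cf_d and cf_y as in the documented function.
    """
    rng = random.Random(seed)

    # With c = 0 the covariance is the identity: five independent N(0, 1) draws.
    x = [[rng.gauss(0.0, 1.0) for _ in range(5)] for _ in range(n_obs)]

    # Nonlinear transformations of X, standardized column by column.
    z1 = standardize([math.exp(0.5 * r[0]) for r in x])
    z2 = standardize([10 + r[1] / (1 + math.exp(r[0])) for r in x])
    z3 = standardize([(0.6 + r[0] * r[2] / 25) ** 3 for r in x])
    z4 = standardize([(20 + r[1] + r[3]) ** 2 for r in x])

    # Unobserved confounder A ~ U[-1, 1], independent of X.
    a = [rng.uniform(-1.0, 1.0) for _ in range(n_obs)]

    d, y = [], []
    for i in range(n_obs):
        m_short = -z1[i] + 0.5 * z2[i] - 0.25 * z3[i] - 0.1 * z4[i]  # E[D|X]
        d_i = m_short + gamma_a * a[i] + rng.gauss(0.0, 1.0)
        g = 210 + 27.4 * z1[i] + 13.7 * (z2[i] + z3[i] + z4[i])
        # Assumption: N(0, 5) denotes variance 5, so the std. dev. is sqrt(5).
        y_i = theta * d_i + g + beta_a * a[i] + rng.gauss(0.0, math.sqrt(5.0))
        d.append(d_i)
        y.append(y_i)

    return {"x": x, "y": y, "d": d, "a": a, "z": list(zip(z1, z2, z3, z4))}


data = simulate_confounded_plr(n_obs=200, seed=1)
```

Only `x`, `y` and `d` would be observed in practice; `a` and `z` are returned here purely so the sketch can be inspected.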
Additionally, the long and short forms of the conditional expectation take the following forms

\[
\begin{aligned}
\mathbb{E}[Y|D, X, A] &= \theta \cdot D + g(Z) + \beta_A \cdot A\\
\mathbb{E}[Y|D, X] &= \left(\theta + \gamma_A \beta_A \frac{\mathrm{Var}(A)}{\mathrm{Var}(D)}\right) \cdot D + g(Z).
\end{aligned}
\]

Consequently, the strength of confounding is determined via \(\gamma_A\) and \(\beta_A\). Both are chosen to obtain the desired confounding of the outcome and Riesz Representer (in sample).

The observed data is given as \(W = (Y, D, X)\). Further, oracle values of the confounder \(A\), the transformed covariates \(Z\), the effect \(\theta\), the coefficients \(\gamma_A\), \(\beta_A\), the long and short forms of the main regression and the propensity score are returned in a dictionary.
- Parameters:
- n_obs (int) – The number of observations to simulate. Default is 500.
- theta (float or int) – Average treatment effect. Default is 5.0.
- cf_y (float) – Percentage of the residual variation of the outcome explained by the latent/confounding variable. Default is 0.04.
- cf_d (float) – Percentage gains in the variation of the Riesz Representer generated by the latent/confounding variable. Default is 0.04.
 
- Returns:
- res_dict – Dictionary with entries x, y, d and oracle_values.
- Return type:
- dictionary 
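To get a feel for how \(\gamma_A\) and \(\beta_A\) translate into bias, the short-form coefficient described above can be evaluated numerically. The sketch below is an illustration under stated assumptions, not part of the library: the function names are hypothetical, \(\mathrm{Var}(A) = 1/3\) follows from \(A \sim \mathcal{U}[-1, 1]\), and the variance in the denominator is assumed to be that of the treatment after partialling out \(X\), i.e. \(\gamma_A^2 \mathrm{Var}(A) + \mathrm{Var}(\varepsilon_D)\). A small simulation of the residualized variables is included as a plausibility check.

```python
import random
import statistics


def short_form_coefficient(theta, gamma_a, beta_a, var_a=1.0 / 3.0, var_eps_d=1.0):
    """theta plus the confounding bias term of the short-form regression.

    Assumption: the denominator is the variance of D - E[D|X], which equals
    gamma_a**2 * Var(A) + Var(eps_D) in the process described above.
    """
    var_d_resid = gamma_a ** 2 * var_a + var_eps_d
    return theta + gamma_a * beta_a * var_a / var_d_resid


def simulated_short_slope(theta, gamma_a, beta_a, n_obs=20000, seed=0):
    """Slope from regressing the X-residualized outcome on the X-residualized treatment."""
    rng = random.Random(seed)
    d_res, y_res = [], []
    for _ in range(n_obs):
        a = rng.uniform(-1.0, 1.0)
        d_tilde = gamma_a * a + rng.gauss(0.0, 1.0)  # D - E[D|X]
        # Y - E[Y|X]; the outcome noise is taken to have variance 5 (assumption).
        y_tilde = theta * d_tilde + beta_a * a + rng.gauss(0.0, 5.0 ** 0.5)
        d_res.append(d_tilde)
        y_res.append(y_tilde)
    mean_d = statistics.fmean(d_res)
    mean_y = statistics.fmean(y_res)
    cov = sum((d - mean_d) * (y - mean_y) for d, y in zip(d_res, y_res))
    var = sum((d - mean_d) ** 2 for d in d_res)
    return cov / var


# Hypothetical values for gamma_a and beta_a, chosen only for illustration.
formula_value = short_form_coefficient(theta=5.0, gamma_a=0.5, beta_a=2.0)
simulated_value = simulated_short_slope(theta=5.0, gamma_a=0.5, beta_a=2.0)
```

With these illustrative inputs the formula gives roughly 5.31 instead of the true effect 5.0, and the simulated slope should land close to that value. Setting either coefficient to zero removes the bias term.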
- References:
- Sant’Anna, P. H. and Zhao, J. (2020). Doubly robust difference-in-differences estimators. Journal of Econometrics, 219(1), 101-122. doi:10.1016/j.jeconom.2020.06.003.
 
    
  
  
    