# doubleml.datasets.make_confounded_plr_data

doubleml.datasets.make_confounded_plr_data(n_obs=500, theta=5.0, cf_y=0.04, cf_d=0.04, **kwargs)

Generates confounded data from a partially linear regression model.

The data generating process is defined as follows (similar to the Monte Carlo simulation used in Sant’Anna and Zhao (2020)). Let $$X= (X_1, X_2, X_3, X_4, X_5)^T \sim \mathcal{N}(0, \Sigma)$$, where $$\Sigma$$ is a matrix with entries $$\Sigma_{kj} = c^{|j-k|}$$. The default value is $$c = 0$$, corresponding to the identity matrix. Further, define $$Z_j = (\tilde{Z}_j - \mathbb{E}[\tilde{Z}_j]) / \sqrt{\text{Var}(\tilde{Z}_j)}$$, where

\begin{align}\begin{aligned}\tilde{Z}_1 &= \exp(0.5 \cdot X_1)\\\tilde{Z}_2 &= 10 + X_2/(1 + \exp(X_1))\\\tilde{Z}_3 &= (0.6 + X_1 \cdot X_3 / 25)^3\\\tilde{Z}_4 &= (20 + X_2 + X_4)^2.\end{aligned}\end{align}
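The covariate construction above can be sketched in NumPy. This is an illustration rather than the library's implementation: the helper names `draw_covariates` and `transform_covariates` are hypothetical, and sample means and standard deviations stand in for the population moments $$\mathbb{E}[\tilde{Z}_j]$$ and $$\text{Var}(\tilde{Z}_j)$$.

```python
import numpy as np


def draw_covariates(n_obs, c=0.0, dim_x=5, seed=0):
    """Draw X ~ N(0, Sigma) with Sigma_{kj} = c^{|j-k|} (hypothetical helper)."""
    rng = np.random.default_rng(seed)
    idx = np.arange(dim_x)
    # NumPy evaluates 0.0 ** 0 as 1.0, so c = 0 yields the identity matrix.
    sigma = np.power(float(c), np.abs(idx[:, None] - idx[None, :]))
    x = rng.multivariate_normal(np.zeros(dim_x), sigma, size=n_obs)
    return x, sigma


def transform_covariates(x):
    """Compute the four transformed covariates and standardize them column-wise."""
    z_tilde = np.column_stack([
        np.exp(0.5 * x[:, 0]),
        10 + x[:, 1] / (1 + np.exp(x[:, 0])),
        (0.6 + x[:, 0] * x[:, 2] / 25) ** 3,
        (20 + x[:, 1] + x[:, 3]) ** 2,
    ])
    # Assumption: sample moments replace the population mean and variance.
    return (z_tilde - z_tilde.mean(axis=0)) / z_tilde.std(axis=0)


x, sigma = draw_covariates(n_obs=1000)
z = transform_covariates(x)
```

Note that only $$X_1, \dots, X_4$$ enter the transformations, so `z` has four columns while `x` has five.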

Additionally, generate a confounder $$A \sim \mathcal{U}[-1, 1]$$. First, define the treatment as

$D = -Z_1 + 0.5 \cdot Z_2 - 0.25 \cdot Z_3 - 0.1 \cdot Z_4 + \gamma_A \cdot A + \varepsilon_D$

with $$\varepsilon_D \sim \mathcal{N}(0,1)$$. Since $$A$$ is independent of $$X$$, the long and short forms of the treatment regression are given by

\begin{align}\begin{aligned}\mathbb{E}[D|X,A] &= -Z_1 + 0.5 \cdot Z_2 - 0.25 \cdot Z_3 - 0.1 \cdot Z_4 + \gamma_A \cdot A\\\mathbb{E}[D|X] &= -Z_1 + 0.5 \cdot Z_2 - 0.25 \cdot Z_3 - 0.1 \cdot Z_4.\end{aligned}\end{align}

Further, generate the outcome of interest $$Y$$ as

\begin{align}\begin{aligned}Y &= \theta \cdot D + g(Z) + \beta_A \cdot A + \varepsilon\\g(Z) &= 210 + 27.4 \cdot Z_1 +13.7 \cdot (Z_2 + Z_3 + Z_4)\end{aligned}\end{align}

where $$\varepsilon \sim \mathcal{N}(0,5)$$. This implies an average treatment effect of $$\theta$$. Additionally, the long and short forms of the conditional expectation of $$Y$$ are given by

\begin{align}\begin{aligned}\mathbb{E}[Y|D, X, A] &= \theta \cdot D + g(Z) + \beta_A \cdot A\\\mathbb{E}[Y|D, X] &= (\theta + \gamma_A\beta_A \frac{\mathrm{Var}(A)}{\mathrm{Var}(D)}) \cdot D + g(Z).\end{aligned}\end{align}

Consequently, the strength of confounding is determined via $$\gamma_A$$ and $$\beta_A$$. Both are chosen to obtain the desired confounding of the outcome and Riesz Representer (in sample).
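The bias term in the short form can be checked numerically. The sketch below is a deliberately simplified setting, not the generator itself: it sets $$g(Z) = 0$$ and drops the covariates, so that $$\mathrm{Var}(D) = \gamma_A^2 \mathrm{Var}(A) + 1$$, uses unit-variance noise in both equations, and picks illustrative values for $$\gamma_A$$ and $$\beta_A$$ instead of calibrating them from `cf_y` and `cf_d`.

```python
import numpy as np

rng = np.random.default_rng(42)
n_obs = 200_000
# Illustrative values only; the library calibrates gamma_A and beta_A itself.
theta, gamma_a, beta_a = 5.0, 1.0, 2.0

a = rng.uniform(-1.0, 1.0, size=n_obs)            # confounder, Var(A) = 1/3
d = gamma_a * a + rng.normal(size=n_obs)          # treatment without covariates
y = theta * d + beta_a * a + rng.normal(size=n_obs)

# Short regression: OLS slope of Y on D, omitting A.
short_slope = np.cov(d, y)[0, 1] / np.var(d, ddof=1)

# Coefficient implied by the short form of E[Y | D, X].
var_a = 1.0 / 3.0
var_d = gamma_a**2 * var_a + 1.0
implied = theta + gamma_a * beta_a * var_a / var_d

print(f"short-regression slope: {short_slope:.3f}, implied: {implied:.3f}")
```

With these values the implied coefficient is 5.5 rather than $$\theta = 5$$, and the estimated slope should land close to it for a sample of this size.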

The observed data are given as $$W = (Y, D, X)$$. Further, oracle values of the confounder $$A$$, the transformed covariates $$Z$$, the effect $$\theta$$, the coefficients $$\gamma_A$$, $$\beta_A$$, the long and short forms of the main regression and the propensity score are returned in a dictionary.

Parameters:
• n_obs (int) – The number of observations to simulate. Default is 500.

• theta (float or int) – Average treatment effect. Default is 5.0.

• cf_y (float) – Share of the residual variation of the outcome explained by the latent/confounding variable, given as a fraction (0.04 corresponds to 4%). Default is 0.04.

• cf_d (float) – Relative gain in the variation of the Riesz Representer generated by the latent/confounding variable, given as a fraction (0.04 corresponds to 4%). Default is 0.04.

Returns:

res_dict – Dictionary with entries x, y, d and oracle_values.

Return type:

dictionary
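A minimal usage sketch follows. It assumes that the installed DoubleML version exposes the function under `doubleml.datasets` as documented here, and that seeding NumPy's global generator makes the draw reproducible. The helper `summarize_result` is hypothetical and relies only on the documented entries x, y, d and oracle_values.

```python
import numpy as np


def summarize_result(res_dict):
    """Return array shapes of x, y, d and the sorted oracle entry names."""
    shapes = {key: np.shape(res_dict[key]) for key in ("x", "y", "d")}
    oracle = res_dict["oracle_values"]
    oracle_keys = sorted(oracle) if isinstance(oracle, dict) else []
    return shapes, oracle_keys


try:
    # Assumption: this import path matches the installed DoubleML version.
    from doubleml.datasets import make_confounded_plr_data
except ImportError:
    make_confounded_plr_data = None

if make_confounded_plr_data is not None:
    np.random.seed(3141)  # assumed to control the generator's randomness
    res_dict = make_confounded_plr_data(n_obs=500, theta=5.0, cf_y=0.04, cf_d=0.04)
    shapes, oracle_keys = summarize_result(res_dict)
    print(shapes)
    print(oracle_keys)
```

Printing the oracle entry names is a quick way to see which quantities are available for comparing an estimate against the true $$\theta$$.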

References

Sant’Anna, P. H. and Zhao, J. (2020), Doubly robust difference-in-differences estimators. Journal of Econometrics, 219(1), 101-122. doi:10.1016/j.jeconom.2020.06.003.