# doubleml.datasets.make_confounded_irm_data

doubleml.datasets.make_confounded_irm_data(n_obs=500, theta=5.0, cf_y=0.04, cf_d=0.04)

Generates confounded data from an interactive regression model.

The data generating process is defined as follows (similar to the Monte Carlo simulation used in Sant’Anna and Zhao (2020)).

Let $$X = (X_1, X_2, X_3, X_4, X_5)^T \sim \mathcal{N}(0, \Sigma)$$, where $$\Sigma$$ is the identity matrix. Further, define $$Z_j = (\tilde{Z}_j - \mathbb{E}[\tilde{Z}_j]) / \sqrt{\text{Var}(\tilde{Z}_j)}$$, where

\begin{align}\begin{aligned}\tilde{Z}_1 &= \exp(0.5 \cdot X_1)\\\tilde{Z}_2 &= 10 + X_2/(1 + \exp(X_1))\\\tilde{Z}_3 &= (0.6 + X_1 \cdot X_3 / 25)^3\\\tilde{Z}_4 &= (20 + X_2 + X_4)^2\\\tilde{Z}_5 &= X_5.\end{aligned}\end{align}
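The transformation above can be sketched in plain Python. This is a minimal illustration, not the library's implementation: the helper name `transform_covariates` is hypothetical, and the sketch assumes that sample means and standard deviations stand in for the population moments $$\mathbb{E}[\tilde{Z}_j]$$ and $$\text{Var}(\tilde{Z}_j)$$.

```python
import math
import random
import statistics


def transform_covariates(x_rows):
    """Map rows (X_1, ..., X_5) to standardized covariates (Z_1, ..., Z_5).

    Assumption: standardization uses sample moments, not population moments.
    """
    z_tilde = [
        [
            math.exp(0.5 * x1),
            10 + x2 / (1 + math.exp(x1)),
            (0.6 + x1 * x3 / 25) ** 3,
            (20 + x2 + x4) ** 2,
            x5,
        ]
        for x1, x2, x3, x4, x5 in x_rows
    ]
    # Column-wise sample mean and (population-style) standard deviation
    columns = list(zip(*z_tilde))
    means = [statistics.fmean(col) for col in columns]
    stds = [statistics.pstdev(col) for col in columns]
    return [
        [(value - mean) / std for value, mean, std in zip(row, means, stds)]
        for row in z_tilde
    ]


# X ~ N(0, I): five independent standard normal covariates per observation
rng = random.Random(42)
x = [[rng.gauss(0.0, 1.0) for _ in range(5)] for _ in range(500)]
z = transform_covariates(x)
```

With sample moments, every column of `z` has mean zero and unit standard deviation within the simulated sample.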

Additionally, generate an unobserved confounder $$A \sim \mathcal{U}[-1, 1]$$. First, define the propensity score as

$m(X, A) = P(D=1|X,A) = 0.5 + \gamma_A \cdot A$

and generate the treatment $$D = 1\{m(X, A) \ge U\}$$ with $$U \sim \mathcal{U}[0, 1]$$. Since $$A$$ is independent of $$X$$, the short form of the propensity score is given as

$P(D=1|X) = 0.5.$
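The treatment assignment can be simulated as follows. This is a sketch under stated assumptions: the helper `draw_treatment` is hypothetical, and `gamma_a=0.3` is an illustrative value chosen by hand, whereas the function itself derives $$\gamma_A$$ from `cf_d`.

```python
import random


def draw_treatment(n_obs, gamma_a, seed=0):
    """Draw A ~ U[-1, 1] and D = 1{m(X, A) >= U} with m(X, A) = 0.5 + gamma_a * A."""
    if abs(gamma_a) > 0.5:
        raise ValueError("gamma_a must keep the propensity score inside [0, 1]")
    rng = random.Random(seed)
    a = [rng.uniform(-1.0, 1.0) for _ in range(n_obs)]
    m = [0.5 + gamma_a * a_i for a_i in a]
    # U ~ U[0, 1] is drawn independently for every observation
    d = [1 if m_i >= rng.random() else 0 for m_i in m]
    return a, m, d


# gamma_a = 0.3 is an assumed, illustrative confounding strength
a, m, d = draw_treatment(n_obs=20000, gamma_a=0.3, seed=0)

n_treated = sum(d)
share_treated = n_treated / len(d)
mean_a_treated = sum(a_i for a_i, d_i in zip(a, d) if d_i == 1) / n_treated
mean_a_control = sum(a_i for a_i, d_i in zip(a, d) if d_i == 0) / (len(d) - n_treated)

print(f"share treated: {share_treated:.3f}")
print(f"mean of A, treated vs. control: {mean_a_treated:.3f} vs. {mean_a_control:.3f}")
```

The share of treated units should be close to 0.5, in line with the short form of the propensity score, while $$A$$ is on average larger among treated units when $$\gamma_A > 0$$. That imbalance is what makes $$A$$ a confounder once it also enters the outcome.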

Further, generate the outcome of interest $$Y$$ as

\begin{align}\begin{aligned}Y &= \theta \cdot D (Z_5 + 1) + g(Z) + \beta_A \cdot A + \varepsilon\\g(Z) &= 210 + 27.4 \cdot Z_1 +13.7 \cdot (Z_2 + Z_3 + Z_4)\end{aligned}\end{align}

where $$\varepsilon \sim \mathcal{N}(0,5)$$. Since $$\mathbb{E}[Z_5] = 0$$, this implies an average treatment effect of $$\theta$$. Additionally, the long and short forms of the conditional expectation take the following forms

\begin{align}\begin{aligned}\mathbb{E}[Y|D, X, A] &= \theta \cdot D (Z_5 + 1) + g(Z) + \beta_A \cdot A\\\mathbb{E}[Y|D, X] &= (\theta + \beta_A \frac{\mathrm{Cov}(A, D(Z_5 + 1))}{\mathrm{Var}(D(Z_5 + 1))}) \cdot D (Z_5 + 1) + g(Z).\end{aligned}\end{align}
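The outcome equation and its potential outcomes can be sketched as below. This is an illustration rather than the library's code: `simulate_outcomes` and `standardize` are hypothetical helpers, `gamma_a` and `beta_a` are set by hand (the function derives them from `cf_d` and `cf_y`), covariates are standardized with sample moments, and $$\mathcal{N}(0,5)$$ is read as variance 5.

```python
import math
import random
import statistics


def standardize(values):
    """Center and scale with sample moments (an assumption of this sketch)."""
    mean = statistics.fmean(values)
    std = statistics.pstdev(values)
    return [(v - mean) / std for v in values]


def simulate_outcomes(n_obs, theta, gamma_a, beta_a, seed=0):
    rng = random.Random(seed)
    x = [[rng.gauss(0.0, 1.0) for _ in range(5)] for _ in range(n_obs)]
    z1 = standardize([math.exp(0.5 * r[0]) for r in x])
    z2 = standardize([10 + r[1] / (1 + math.exp(r[0])) for r in x])
    z3 = standardize([(0.6 + r[0] * r[2] / 25) ** 3 for r in x])
    z4 = standardize([(20 + r[1] + r[3]) ** 2 for r in x])
    z5 = standardize([r[4] for r in x])

    # Unobserved confounder and treatment D = 1{0.5 + gamma_a * A >= U}
    a = [rng.uniform(-1.0, 1.0) for _ in range(n_obs)]
    d = [1 if 0.5 + gamma_a * a_i >= rng.random() else 0 for a_i in a]

    y0, y1 = [], []
    for i in range(n_obs):
        g = 210 + 27.4 * z1[i] + 13.7 * (z2[i] + z3[i] + z4[i])
        # Assumption: N(0, 5) denotes variance 5, so the scale is sqrt(5)
        eps = rng.gauss(0.0, math.sqrt(5.0))
        base = g + beta_a * a[i] + eps
        y0.append(base)                           # potential outcome Y(0)
        y1.append(theta * (z5[i] + 1) + base)     # potential outcome Y(1)
    y = [y1_i if d_i == 1 else y0_i for y0_i, y1_i, d_i in zip(y0, y1, d)]
    return {"y": y, "d": d, "y0": y0, "y1": y1}


# gamma_a and beta_a are assumed, illustrative values
sim = simulate_outcomes(n_obs=2000, theta=5.0, gamma_a=0.3, beta_a=2.0, seed=1)
ate = statistics.fmean(y1_i - y0_i for y0_i, y1_i in zip(sim["y0"], sim["y1"]))
print(f"in-sample average treatment effect: {ate:.6f}")
```

The individual effect is $$Y(1) - Y(0) = \theta (Z_5 + 1)$$, so its average equals $$\theta$$ whenever $$Z_5$$ has mean zero. With in-sample standardization this holds exactly, up to floating-point error.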

Consequently, the strength of confounding is determined via $$\gamma_A$$ and $$\beta_A$$. Both are chosen to obtain the desired confounding of the outcome and Riesz Representer (in sample).

The observed data is given as $$W = (Y, D, X)$$. Further, oracle values of the confounder $$A$$, the transformed covariates $$Z$$, the potential outcomes of $$Y$$, the coefficients $$\gamma_A$$ and $$\beta_A$$, the long and short forms of the main regression and the propensity score are returned in a dictionary.

Parameters:
• n_obs (int) – The number of observations to simulate. Default is 500.

• theta (float or int) – Average treatment effect. Default is 5.0.

• cf_y (float) – Percentage of the residual variation of the outcome explained by the latent/confounding variable. Default is 0.04.

• cf_d (float) – Percentage gain in the variation of the Riesz Representer generated by the latent/confounding variable. Default is 0.04.

Returns:

res_dict – Dictionary with entries x, y, d and oracle_values.

Return type:

dictionary

References

Sant’Anna, P. H. and Zhao, J. (2020), Doubly robust difference-in-differences estimators. Journal of Econometrics, 219(1), 101-122. doi:10.1016/j.jeconom.2020.06.003.