doubleml.datasets.make_confounded_irm_data

doubleml.datasets.make_confounded_irm_data(n_obs=500, theta=0.0, gamma_a=0.127, beta_a=0.58, linear=False, **kwargs)

Generates confounded data from an interactive regression model.

The data generating process is defined as follows (inspired by the Monte Carlo simulation used in Sant’Anna and Zhao (2020)).

Let \(X = (X_1, X_2, X_3, X_4, X_5)^T \sim \mathcal{N}(0, \Sigma)\), where \(\Sigma\) is the identity matrix. Further, define \(Z_j = (\tilde{Z}_j - \mathbb{E}[\tilde{Z}_j]) / \sqrt{\text{Var}(\tilde{Z}_j)}\) for \(j = 1, \dots, 5\), where

\[ \begin{align}\begin{aligned}\tilde{Z}_1 &= \exp(0.5 \cdot X_1)\\\tilde{Z}_2 &= 10 + X_2/(1 + \exp(X_1))\\\tilde{Z}_3 &= (0.6 + X_1 \cdot X_3 / 25)^3\\\tilde{Z}_4 &= (20 + X_2 + X_4)^2\\\tilde{Z}_5 &= X_5.\end{aligned}\end{align} \]
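The transformation above can be sketched in plain Python. This is an illustrative sketch, not the library's implementation: the helper name `transform_covariates` is hypothetical, and the standardization uses sample means and standard deviations as a stand-in for \(\mathbb{E}[\tilde{Z}_j]\) and \(\text{Var}(\tilde{Z}_j)\), which is an assumption.

```python
import math
import random
import statistics


def transform_covariates(x_rows):
    """Map rows (X_1, ..., X_5) to standardized (Z_1, ..., Z_5).

    Assumption: sample moments replace the population moments in the text.
    """
    z_tilde_rows = []
    for x1, x2, x3, x4, x5 in x_rows:
        z_tilde_rows.append([
            math.exp(0.5 * x1),            # Z~_1
            10 + x2 / (1 + math.exp(x1)),  # Z~_2
            (0.6 + x1 * x3 / 25) ** 3,     # Z~_3
            (20 + x2 + x4) ** 2,           # Z~_4
            x5,                            # Z~_5
        ])
    # Standardize each column to mean zero and unit standard deviation.
    columns = list(zip(*z_tilde_rows))
    means = [statistics.fmean(col) for col in columns]
    stds = [statistics.pstdev(col) for col in columns]
    return [
        [(value - mean) / std for value, mean, std in zip(row, means, stds)]
        for row in z_tilde_rows
    ]


rng = random.Random(42)
# X ~ N(0, I): five independent standard normal components per observation.
x_rows = [[rng.gauss(0.0, 1.0) for _ in range(5)] for _ in range(1000)]
z_rows = transform_covariates(x_rows)
```

After this step every column of `z_rows` has sample mean zero and unit sample variance, which is what makes the coefficients in the formulas that follow comparable across covariates.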

Additionally, generate a confounder \(A \sim \mathcal{U}[-1, 1]\). First, define the propensity score as

\[m(X, A) = P(D=1|X,A) = p(Z) + \gamma_A \cdot A\]

where

\[ \begin{align}\begin{aligned}p(Z) &= \frac{\exp(f_{ps}(Z))}{1 + \exp(f_{ps}(Z))},\\f_{ps}(Z) &= 0.75 \cdot (-Z_1 + 0.1 \cdot Z_2 -0.25 \cdot Z_3 - 0.1 \cdot Z_4).\end{aligned}\end{align} \]

Then generate the treatment \(D = 1\{m(X, A) \ge U\}\) with \(U \sim \mathcal{U}[0, 1]\). Since \(A\) is independent of \(X\), the short form of the propensity score is given as

\[P(D=1|X) = p(Z).\]
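As a rough illustration of the treatment assignment, the sketch below evaluates \(p(Z)\) and \(m(X, A)\) and draws \(D\). The function names are hypothetical, and for brevity \(Z\) is drawn as standard normal (the \(Z = X\) case). Values of \(m(X, A)\) outside \([0, 1]\) are not clipped here; how the library treats them is not stated in this text.

```python
import math
import random


def propensity_short(z):
    """p(Z): logistic transform of f_ps(Z)."""
    f_ps = 0.75 * (-z[0] + 0.1 * z[1] - 0.25 * z[2] - 0.1 * z[3])
    # 1 / (1 + exp(-f)) is algebraically equal to exp(f) / (1 + exp(f)).
    return 1.0 / (1.0 + math.exp(-f_ps))


def propensity_long(z, a, gamma_a=0.127):
    """m(X, A) = p(Z) + gamma_A * A (no clipping in this sketch)."""
    return propensity_short(z) + gamma_a * a


rng = random.Random(0)
n_obs = 500
# Assumption: Z = X, i.e. standard normal covariates.
z_rows = [[rng.gauss(0.0, 1.0) for _ in range(5)] for _ in range(n_obs)]
a_vals = [rng.uniform(-1.0, 1.0) for _ in range(n_obs)]  # confounder A
m_vals = [propensity_long(z, a) for z, a in zip(z_rows, a_vals)]
# D = 1{m(X, A) >= U} with U ~ U[0, 1].
d_vals = [1 if m >= rng.uniform(0.0, 1.0) else 0 for m in m_vals]
```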

Further, generate the outcome of interest \(Y\) as

\[ \begin{align}\begin{aligned}Y &= \theta \cdot D (Z_5 + 1) + g(Z) + \beta_A \cdot A + \varepsilon\\g(Z) &= 2.5 + 0.74 \cdot Z_1 + 0.25 \cdot Z_2 + 0.137 \cdot (Z_3 + Z_4)\end{aligned}\end{align} \]

where \(\varepsilon \sim \mathcal{N}(0,5)\). This implies an average treatment effect of \(\theta\). Additionally, the long and short forms of the conditional expectation are given by

\[ \begin{align}\begin{aligned}\mathbb{E}[Y|D, X, A] &= \theta \cdot D (Z_5 + 1) + g(Z) + \beta_A \cdot A\\\mathbb{E}[Y|D, X] &= (\theta + \beta_A \frac{\mathrm{Cov}(A, D(Z_5 + 1))}{\mathrm{Var}(D(Z_5 + 1))}) \cdot D (Z_5 + 1) + g(Z).\end{aligned}\end{align} \]
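The average treatment effect of \(\theta\) can be checked from the outcome equation. Write \(Y(d)\) for the outcome under treatment level \(d\) (notation introduced only for this derivation) and use that \(Z_5\) is standardized to mean zero:

\[ \begin{align}\begin{aligned}Y(1) - Y(0) &= \theta \cdot (Z_5 + 1)\\\mathbb{E}[Y(1) - Y(0)] &= \theta \cdot (\mathbb{E}[Z_5] + 1) = \theta.\end{aligned}\end{align} \]

The individual effect \(\theta \cdot (Z_5 + 1)\) therefore varies with \(Z_5\), while its average equals \(\theta\).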

Consequently, the strength of confounding is determined via \(\gamma_A\) and \(\beta_A\), which can be set via the parameters gamma_a and beta_a.
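To see how \(\gamma_A\) creates confounding, the following sketch compares the mean of \(A\) among treated and untreated units. It is a simplified simulation under stated assumptions, not the library's code: `confounder_imbalance` is a hypothetical helper, and \(Z_1, \dots, Z_4\) are drawn as standard normal.

```python
import math
import random
import statistics


def confounder_imbalance(gamma_a, n_obs=20000, seed=1):
    """Mean of A among treated units minus mean of A among untreated units."""
    rng = random.Random(seed)
    a_treated, a_control = [], []
    for _ in range(n_obs):
        # Assumption: standard normal stand-ins for Z_1, ..., Z_4.
        z = [rng.gauss(0.0, 1.0) for _ in range(4)]
        a = rng.uniform(-1.0, 1.0)
        f_ps = 0.75 * (-z[0] + 0.1 * z[1] - 0.25 * z[2] - 0.1 * z[3])
        m = 1.0 / (1.0 + math.exp(-f_ps)) + gamma_a * a
        d = 1 if m >= rng.uniform(0.0, 1.0) else 0
        (a_treated if d == 1 else a_control).append(a)
    return statistics.fmean(a_treated) - statistics.fmean(a_control)


imbalance_default = confounder_imbalance(gamma_a=0.127)
imbalance_none = confounder_imbalance(gamma_a=0.0)
```

With a positive `gamma_a`, treated units carry larger values of \(A\) on average. Multiplied by \(\beta_A\), this imbalance is the contribution of \(A\) to a naive treated-versus-untreated comparison of \(Y\). With `gamma_a=0.0` the imbalance is close to zero.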

The observed data is given as \(W = (Y, D, Z)\). Further, oracle values of the confounder \(A\), the transformed covariates \(Z\), the potential outcomes of \(Y\), the long and short forms of the main regression and the propensity score, and in-sample versions of the confounding parameters \(cf_d\) and \(cf_y\) (for ATE and ATTE) are returned in a dictionary.

Parameters:
  • n_obs (int) – The number of observations to simulate. Default is 500.

  • theta (float or int) – Average treatment effect. Default is 0.0.

  • gamma_a (float) – Coefficient of the unobserved confounder in the propensity score. Default is 0.127.

  • beta_a (float) – Coefficient of the unobserved confounder in the outcome regression. Default is 0.58.

  • linear (bool) – If True, Z is set to X, such that the underlying (short) models are linear/logistic. Default is False.

Returns:

res_dict – Dictionary with entries x, y, d and oracle_values.

Return type:

dictionary

References

Sant’Anna, P. H. and Zhao, J. (2020), Doubly robust difference-in-differences estimators. Journal of Econometrics, 219(1), 101-122. doi:10.1016/j.jeconom.2020.06.003.