doubleml.datasets.make_confounded_irm_data(n_obs=500, theta=5.0, cf_y=0.04, cf_d=0.04)

Generates confounded data from an interactive regression model.

The data generating process is defined as follows (similar to the Monte Carlo simulation used in Sant’Anna and Zhao (2020)).

Let \(X= (X_1, X_2, X_3, X_4, X_5)^T \sim \mathcal{N}(0, \Sigma)\), where \(\Sigma\) is the identity matrix. Further, define \(Z_j = (\tilde{Z}_j - \mathbb{E}[\tilde{Z}_j]) / \sqrt{\text{Var}(\tilde{Z}_j)}\), where

\[ \begin{align}\begin{aligned}\tilde{Z}_1 &= \exp(0.5 \cdot X_1)\\\tilde{Z}_2 &= 10 + X_2/(1 + \exp(X_1))\\\tilde{Z}_3 &= (0.6 + X_1 \cdot X_3 / 25)^3\\\tilde{Z}_4 &= (20 + X_2 + X_4)^2\\\tilde{Z}_5 &= X_5.\end{aligned}\end{align} \]
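As an illustration, the transformations above can be sketched with NumPy. This is a minimal sketch rather than the library's implementation: the seed, the variable names, and the use of sample moments in place of \(\mathbb{E}[\tilde{Z}_j]\) and \(\text{Var}(\tilde{Z}_j)\) are assumptions made for the example.

```python
import numpy as np

rng = np.random.default_rng(42)  # arbitrary seed, chosen for reproducibility
n_obs = 500

# X ~ N(0, I_5): five independent standard normal covariates
x = rng.standard_normal(size=(n_obs, 5))

# Nonlinear transformations Z~_1, ..., Z~_5 of the covariates
z_tilde = np.column_stack([
    np.exp(0.5 * x[:, 0]),
    10 + x[:, 1] / (1 + np.exp(x[:, 0])),
    (0.6 + x[:, 0] * x[:, 2] / 25) ** 3,
    (20 + x[:, 1] + x[:, 3]) ** 2,
    x[:, 4],
])

# Assumption: standardize column-wise with the sample mean and standard
# deviation as stand-ins for the population moments in the definition of Z_j
z = (z_tilde - z_tilde.mean(axis=0)) / z_tilde.std(axis=0)
```

After this step each column of `z` has sample mean zero and unit sample variance, which is what makes \(Z_5 + 1\) average to one in the outcome equation.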

Additionally, generate a confounder \(A \sim \mathcal{U}[-1, 1]\). First, define the propensity score as

\[m(X, A) = P(D=1|X,A) = 0.5 + \gamma_A \cdot A\]

and generate the treatment \(D = 1\{m(X, A) \ge U\}\) with \(U \sim \mathcal{U}[0, 1]\). Since \(A\) is independent of \(X\), the short form of the propensity score is given as

\[P(D=1|X) = 0.5.\]
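The treatment assignment can be sketched as follows. The helper `simulate_treatment` and the value `gamma_a = 0.2` are hypothetical and chosen only for illustration; in the actual function \(\gamma_A\) is derived from the requested confounding strength.

```python
import numpy as np

def simulate_treatment(n_obs, gamma_a, rng):
    """Draw the confounder A and the treatment D = 1{m(X, A) >= U}."""
    a = rng.uniform(-1.0, 1.0, size=n_obs)   # unobserved confounder
    m_long = 0.5 + gamma_a * a               # long-form propensity score
    u = rng.uniform(0.0, 1.0, size=n_obs)
    d = (m_long >= u).astype(float)
    return a, m_long, d

rng = np.random.default_rng(0)
a, m_long, d = simulate_treatment(100_000, gamma_a=0.2, rng=rng)

# Marginally the treated share should be close to 0.5, while it differs
# between units with positive and negative values of the confounder.
print(d.mean())
print(d[a > 0].mean(), d[a < 0].mean())
```

Averaging over \(A\) removes the dependence on the confounder, which is why the short-form propensity score is constant.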

Further, generate the outcome of interest \(Y\) as

\[ \begin{align}\begin{aligned}Y &= \theta \cdot D (Z_5 + 1) + g(Z) + \beta_A \cdot A + \varepsilon\\g(Z) &= 210 + 27.4 \cdot Z_1 +13.7 \cdot (Z_2 + Z_3 + Z_4)\end{aligned}\end{align} \]

where \(\varepsilon \sim \mathcal{N}(0,5)\). Because \(Z_5\) is standardized, \(\mathbb{E}[Z_5 + 1] = 1\), so the average treatment effect equals \(\theta\). Additionally, the long and short forms of the conditional expectation take the following forms

\[ \begin{align}\begin{aligned}\mathbb{E}[Y|D, X, A] &= \theta \cdot D (Z_5 + 1) + g(Z) + \beta_A \cdot A\\\mathbb{E}[Y|D, X] &= (\theta + \beta_A \frac{\mathrm{Cov}(A, D(Z_5 + 1))}{\mathrm{Var}(D(Z_5 + 1))}) \cdot D (Z_5 + 1) + g(Z).\end{aligned}\end{align} \]

Consequently, the strength of confounding is determined by \(\gamma_A\) and \(\beta_A\). Both are chosen so that the confounding of the outcome and of the Riesz representer reaches the desired level (in sample).
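To make the role of \(\gamma_A\) and \(\beta_A\) concrete, the short-form coefficient on \(D(Z_5 + 1)\) from the expression above can be approximated by simulation. This is a sketch under stated assumptions: the helper `short_form_coefficient` and the values `gamma_a = 0.2`, `beta_a = 3.0` are hypothetical, and \(Z_5\) is drawn directly as a standard normal variable.

```python
import numpy as np

def short_form_coefficient(theta, gamma_a, beta_a, n_obs, rng):
    """Estimate theta + beta_A * Cov(A, W) / Var(W) with W = D * (Z_5 + 1)."""
    z5 = rng.standard_normal(n_obs)           # Z_5 = X_5 ~ N(0, 1)
    a = rng.uniform(-1.0, 1.0, size=n_obs)    # unobserved confounder
    u = rng.uniform(0.0, 1.0, size=n_obs)
    d = (0.5 + gamma_a * a >= u).astype(float)
    w = d * (z5 + 1.0)
    cov_aw = np.cov(a, w)[0, 1]               # sample covariance of A and W
    return theta + beta_a * cov_aw / np.var(w, ddof=1)

rng = np.random.default_rng(1)
confounded = short_form_coefficient(5.0, gamma_a=0.2, beta_a=3.0, n_obs=200_000, rng=rng)
unconfounded = short_form_coefficient(5.0, gamma_a=0.0, beta_a=3.0, n_obs=200_000, rng=rng)
print(confounded, unconfounded)
```

Under these assumptions \(\mathrm{Cov}(A, D(Z_5 + 1)) = \gamma_A / 3\) and \(\mathrm{Var}(D(Z_5 + 1)) = 0.75\), so the population shift is \(4 \gamma_A \beta_A / 9\). It vanishes as soon as either coefficient is zero.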

The observed data is given as \(W = (Y, D, X)\). Further, oracle values of the confounder \(A\), the transformed covariates \(Z\), the potential outcomes of \(Y\), the coefficients \(\gamma_A\), \(\beta_A\), the long and short forms of the main regression and the propensity score are returned in a dictionary.
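The following sketch puts the pieces of the data generating process together and builds the potential outcomes explicitly. It is not the library's implementation: the helper `simulate_outcomes`, the fixed values of `gamma_a` and `beta_a`, the sample-based standardization, and the reading of \(\mathcal{N}(0,5)\) as a variance of 5 are all assumptions made for the example.

```python
import numpy as np

def simulate_outcomes(n_obs, theta, gamma_a, beta_a, rng):
    """Simulate (x, y, d) together with the potential outcomes y0 and y1."""
    x = rng.standard_normal(size=(n_obs, 5))
    z_tilde = np.column_stack([
        np.exp(0.5 * x[:, 0]),
        10 + x[:, 1] / (1 + np.exp(x[:, 0])),
        (0.6 + x[:, 0] * x[:, 2] / 25) ** 3,
        (20 + x[:, 1] + x[:, 3]) ** 2,
        x[:, 4],
    ])
    # Assumption: standardize with sample moments
    z = (z_tilde - z_tilde.mean(axis=0)) / z_tilde.std(axis=0)

    a = rng.uniform(-1.0, 1.0, size=n_obs)                 # unobserved confounder
    d = (0.5 + gamma_a * a >= rng.uniform(size=n_obs)).astype(float)

    g = 210 + 27.4 * z[:, 0] + 13.7 * (z[:, 1] + z[:, 2] + z[:, 3])
    eps = rng.normal(0.0, np.sqrt(5.0), size=n_obs)        # assumption: variance 5

    y0 = g + beta_a * a + eps                              # outcome without treatment
    y1 = theta * (z[:, 4] + 1.0) + y0                      # outcome with treatment
    y = d * y1 + (1.0 - d) * y0                            # observed outcome
    return x, y, d, y0, y1

rng = np.random.default_rng(7)
x, y, d, y0, y1 = simulate_outcomes(500, theta=5.0, gamma_a=0.2, beta_a=3.0, rng=rng)
print(np.mean(y1 - y0))  # in-sample average treatment effect
```

Because \(Z_5\) is standardized in sample here, the in-sample average of `y1 - y0` coincides with `theta` up to floating-point error, whereas an analyst only observes `x`, `y` and `d`.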

Parameters:

  • n_obs (int) – The number of observations to simulate. Default is 500.

  • theta (float or int) – Average treatment effect. Default is 5.0.

  • cf_y (float) – Percentage of the residual variation of the outcome explained by the latent confounding variable. Default is 0.04.

  • cf_d (float) – Percentage gains in the variation of the Riesz representer generated by the latent confounding variable. Default is 0.04.


Returns:

res_dict – Dictionary with entries x, y, d and oracle_values.

Return type:

dict

Sant’Anna, P. H. and Zhao, J. (2020), Doubly robust difference-in-differences estimators. Journal of Econometrics, 219(1), 101-122. doi:10.1016/j.jeconom.2020.06.003.