doubleml.datasets.make_heterogeneous_data

doubleml.datasets.make_heterogeneous_data(n_obs=200, p=30, support_size=5, n_x=1, binary_treatment=False)

Creates a simple synthetic example for heterogeneous treatment effects. The data generating process is based on the Monte Carlo simulation from Oprescu et al. (2019).

The data is generated as

\[ \begin{align}\begin{aligned}Y_i & = \theta_0(X_i)D_i + \langle X_i,\gamma_0\rangle + \epsilon_i\\D_i & = \langle X_i,\beta_0\rangle + \eta_i,\end{aligned}\end{align} \]

where \(X_i\sim\mathcal{U}[0,1]^{p}\) and \(\epsilon_i,\eta_i \sim\mathcal{U}[-1,1]\). If binary_treatment is True, the treatment is instead generated as

\[D_i = 1\{\langle X_i,\beta_0\rangle \ge \eta_i\}.\]

The coefficient vectors \(\gamma_0\) and \(\beta_0\) share the same small random support of size support_size. Their nonzero values are drawn independently from \(\mathcal{U}[0,1]\) and \(\mathcal{U}[0,0.3]\), respectively. Further, \(\theta_0(x)\) denotes the conditional treatment effect, whose form depends on the dimension of the heterogeneity (n_x).
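
As a rough illustration, the equations above can be sketched in plain Python. This is not the library's implementation: the function name `simulate_dgp`, its `theta` and `seed` arguments, and the returned tuple layout are assumptions made for this example only. The conditional treatment effect is passed in as a callable, so a constant effect is used here as a placeholder.

```python
import random


def simulate_dgp(n_obs, p, support_size, theta, binary_treatment=False, seed=0):
    """Hypothetical sketch of the data generating process (not the library code)."""
    rng = random.Random(seed)

    # gamma_0 and beta_0 share one random support of size support_size.
    support = rng.sample(range(p), support_size)
    gamma = [0.0] * p
    beta = [0.0] * p
    for j in support:
        gamma[j] = rng.uniform(0.0, 1.0)  # gamma_0 values ~ U[0, 1]
        beta[j] = rng.uniform(0.0, 0.3)   # beta_0 values ~ U[0, 0.3]

    rows = []
    for _ in range(n_obs):
        x = [rng.uniform(0.0, 1.0) for _ in range(p)]  # X_i ~ U[0, 1]^p
        eps = rng.uniform(-1.0, 1.0)                   # epsilon_i ~ U[-1, 1]
        eta = rng.uniform(-1.0, 1.0)                   # eta_i ~ U[-1, 1]
        x_beta = sum(xj * bj for xj, bj in zip(x, beta))
        x_gamma = sum(xj * gj for xj, gj in zip(x, gamma))
        if binary_treatment:
            d = 1.0 if x_beta >= eta else 0.0          # D_i = 1{<X_i, beta_0> >= eta_i}
        else:
            d = x_beta + eta                           # D_i = <X_i, beta_0> + eta_i
        y = theta(x) * d + x_gamma + eps               # outcome equation
        rows.append((y, d, x))
    return rows, gamma, beta


# Placeholder constant effect; any callable mapping x to a float works here.
rows, gamma, beta = simulate_dgp(n_obs=50, p=10, support_size=3, theta=lambda x: 1.0)
```

Each element of `rows` is a tuple `(y, d, x)`; `gamma` and `beta` are returned so the shared support can be inspected.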

If the heterogeneity is univariate (n_x=1), the conditional treatment effect takes the following form

\[\theta_0(x) = \exp(2x_0) + 3\sin(4x_0),\]

whereas for the two-dimensional case (n_x=2) the conditional treatment effect is defined as

\[\theta_0(x) = \exp(2x_0) + 3\sin(4x_1).\]
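
The two formulas can be written as a single function. The sketch below is illustrative only: the name `theta_0` and its signature are assumptions for this example, and it assumes that \(x_0\) and \(x_1\) refer to the first two entries of the covariate vector.

```python
import math


def theta_0(x, n_x=1):
    """Hypothetical helper evaluating the conditional treatment effect at x."""
    if n_x == 1:
        # Univariate heterogeneity: both terms depend on x_0.
        return math.exp(2 * x[0]) + 3 * math.sin(4 * x[0])
    if n_x == 2:
        # Two-dimensional heterogeneity: the sine term depends on x_1.
        return math.exp(2 * x[0]) + 3 * math.sin(4 * x[1])
    raise ValueError("n_x must be either 1 or 2")


effect_1d = theta_0([0.5], n_x=1)
effect_2d = theta_0([0.5, 0.0], n_x=2)
```

With `x[1] = 0` the sine term vanishes, so `effect_2d` reduces to `exp(1)`.
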
Parameters:
  • n_obs (int) – Number of observations to simulate. Default is 200.

  • p (int) – Dimension of covariates. Default is 30.

  • support_size (int) – Number of relevant (confounding) covariates. Default is 5.

  • n_x (int) – Dimension of the heterogeneity. Can be either 1 or 2. Default is 1.

  • binary_treatment (bool) – Indicates whether the treatment is binary. Default is False.

Returns:

res_dict – Dictionary with entries data, effects, treatment_effect.

Return type:

dictionary