# doubleml.datasets.make_heterogeneous_data

doubleml.datasets.make_heterogeneous_data(n_obs=200, p=30, support_size=5, n_x=1, binary_treatment=False)

Creates a simple synthetic example for heterogeneous treatment effects. The data generating process is based on the Monte Carlo simulation from Oprescu et al. (2019).

The data is generated as

\begin{align}\begin{aligned}Y_i & = \theta_0(X_i)D_i + \langle X_i,\gamma_0\rangle + \epsilon_i\\D_i & = \langle X_i,\beta_0\rangle + \eta_i,\end{aligned}\end{align}

where $$X_i\sim\mathcal{U}[0,1]^{p}$$ and $$\epsilon_i,\eta_i \sim\mathcal{U}[-1,1]$$. If binary_treatment is True, the treatment is instead generated as

$D_i = 1\{\langle X_i,\beta_0\rangle \ge \eta_i\}.$

The coefficient vectors $$\gamma_0$$ and $$\beta_0$$ share the same small, randomly chosen support of size support_size. Their nonzero values are drawn independently from $$\mathcal{U}[0,1]$$ and $$\mathcal{U}[0,0.3]$$, respectively. Further, $$\theta_0(x)$$ denotes the conditional treatment effect, whose form depends on the dimension of the heterogeneity (n_x).
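The covariate, coefficient, and treatment steps described above can be sketched directly in NumPy. This is a minimal illustration under stated assumptions, not the library's own code: the variable names and the use of `numpy.random.default_rng` are choices made here for the example.

```python
import numpy as np

# Illustrative settings mirroring the documented defaults
rng = np.random.default_rng(0)
n_obs, p, support_size = 200, 30, 5

# Covariates: X_i ~ U[0, 1]^p
x = rng.uniform(0.0, 1.0, size=(n_obs, p))

# One random support, shared by gamma_0 and beta_0
support = rng.choice(p, size=support_size, replace=False)
gamma_0 = np.zeros(p)
beta_0 = np.zeros(p)
gamma_0[support] = rng.uniform(0.0, 1.0, size=support_size)  # outcome coefficients
beta_0[support] = rng.uniform(0.0, 0.3, size=support_size)   # treatment coefficients

# Treatment noise: eta_i ~ U[-1, 1]
eta = rng.uniform(-1.0, 1.0, size=n_obs)

# Continuous treatment: D_i = <X_i, beta_0> + eta_i
d_continuous = x @ beta_0 + eta

# Binary treatment: D_i = 1{<X_i, beta_0> >= eta_i}
d_binary = (x @ beta_0 >= eta).astype(float)
```

Because both coefficient vectors are nonzero on the same columns, every covariate that drives the treatment also drives the outcome, which is what makes those covariates confounders.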

If the heterogeneity is univariate (n_x=1), the conditional treatment effect takes the form

$\theta_0(x) = \exp(2x_0) + 3\sin(4x_0),$

whereas for the two-dimensional case (n_x=2) the conditional treatment effect is defined as

$\theta_0(x) = \exp(2x_0) + 3\sin(4x_1).$
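The two forms of $$\theta_0(x)$$ differ only in which covariate enters the sine term. The sketch below evaluates both on a few hand-picked points. The function name `theta_0` and its `n_x` argument are hypothetical helpers for this example, not part of the DoubleML API.

```python
import numpy as np


def theta_0(x, n_x=1):
    """Conditional treatment effect for a 2D array x of shape (n_obs, p)."""
    x = np.asarray(x, dtype=float)
    if n_x == 1:
        # Univariate heterogeneity: both terms depend on x_0
        return np.exp(2 * x[:, 0]) + 3 * np.sin(4 * x[:, 0])
    if n_x == 2:
        # Two-dimensional heterogeneity: the sine term depends on x_1
        return np.exp(2 * x[:, 0]) + 3 * np.sin(4 * x[:, 1])
    raise ValueError("n_x must be 1 or 2")


# Three example covariate rows (only the first two columns matter here)
x = np.array([[0.0, 0.0], [0.5, 0.0], [0.5, 1.0]])
effects_1d = theta_0(x, n_x=1)
effects_2d = theta_0(x, n_x=2)
```

The second and third rows share the same $$x_0$$, so `effects_1d` is identical for them, while `effects_2d` changes with $$x_1$$.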

Parameters:

• n_obs (int) – Number of observations to simulate. Default is 200.

• p (int) – Dimension of covariates. Default is 30.

• support_size (int) – Number of relevant (confounding) covariates. Default is 5.

• n_x (int) – Dimension of the heterogeneity. Can be either 1 or 2. Default is 1.

• binary_treatment (bool) – Indicates whether the treatment is binary. Default is False.

Returns:

res_dict – Dictionary with entries data, effects, treatment_effect.

Return type:

dictionary