doubleml.datasets.make_heterogeneous_data
- doubleml.datasets.make_heterogeneous_data(n_obs=200, p=30, support_size=5, n_x=1, binary_treatment=False)
Creates a simple synthetic example for heterogeneous treatment effects. The data generating process is based on the Monte Carlo simulation from Oprescu et al. (2019).
The data is generated as
\[\begin{aligned}Y_i & = \theta_0(X_i)D_i + \langle X_i,\gamma_0\rangle + \epsilon_i\\D_i & = \langle X_i,\beta_0\rangle + \eta_i,\end{aligned}\]
where \(X_i\sim\mathcal{U}[0,1]^{p}\) and \(\epsilon_i,\eta_i \sim\mathcal{U}[-1,1]\). If the treatment is set to be binary, it is instead generated as
\[D_i = 1\{\langle X_i,\beta_0\rangle \ge \eta_i\}.\]
The coefficient vectors \(\gamma_0\) and \(\beta_0\) share the same small, randomly chosen support; their nonzero entries are drawn independently from \(\mathcal{U}[0,1]\) and \(\mathcal{U}[0,0.3]\), respectively. Further, \(\theta_0(x)\) denotes the conditional treatment effect, whose form depends on the dimension of the heterogeneity.
If the heterogeneity is univariate, the conditional treatment effect takes the following form
\[\theta_0(x) = \exp(2x_0) + 3\sin(4x_0),\]
whereas for the two-dimensional case the conditional treatment effect is defined as
\[\theta_0(x) = \exp(2x_0) + 3\sin(4x_1).\]
- Parameters:
  - n_obs (int) – Number of observations to simulate. Default is 200.
  - p (int) – Dimension of covariates. Default is 30.
  - support_size (int) – Number of relevant (confounding) covariates. Default is 5.
  - n_x (int) – Dimension of the heterogeneity. Can be either 1 or 2. Default is 1.
  - binary_treatment (bool) – Indicates whether the treatment is binary. Default is False.
- Returns:
  res_dict – Dictionary with entries data, effects, and treatment_effect.
- Return type:
  dictionary
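
The data generating process described above can be sketched in plain Python to make the formulas concrete. This is an illustrative sketch only, not the library's implementation: the names `theta_0` and `sketch_heterogeneous_data`, the `seed` argument, and the row layout (`y`, `d`, `x`, `effect`) are hypothetical, and it assumes that \(\gamma_0\) takes its values from \(\mathcal{U}[0,1]\) and \(\beta_0\) from \(\mathcal{U}[0,0.3]\). It uses only the standard library so that it runs without any dependencies.

```python
import math
import random


def theta_0(x, n_x=1):
    """Conditional treatment effect, following the two formulas above."""
    if n_x == 1:
        return math.exp(2 * x[0]) + 3 * math.sin(4 * x[0])
    if n_x == 2:
        return math.exp(2 * x[0]) + 3 * math.sin(4 * x[1])
    raise ValueError("n_x must be 1 or 2")


def sketch_heterogeneous_data(n_obs=200, p=30, support_size=5, n_x=1,
                              binary_treatment=False, seed=None):
    """Hypothetical stand-in for the DGP; not the library's own code."""
    rng = random.Random(seed)

    # gamma_0 and beta_0 share one randomly chosen support.
    support = rng.sample(range(p), support_size)
    # Assumption: gamma_0 ~ U[0, 1] and beta_0 ~ U[0, 0.3] on that support.
    gamma = [rng.uniform(0.0, 1.0) for _ in support]
    beta = [rng.uniform(0.0, 0.3) for _ in support]

    rows = []
    for _ in range(n_obs):
        x = [rng.uniform(0.0, 1.0) for _ in range(p)]  # X_i ~ U[0, 1]^p
        eps = rng.uniform(-1.0, 1.0)                   # outcome noise
        eta = rng.uniform(-1.0, 1.0)                   # treatment noise
        x_beta = sum(b * x[j] for b, j in zip(beta, support))
        x_gamma = sum(g * x[j] for g, j in zip(gamma, support))

        if binary_treatment:
            d = 1.0 if x_beta >= eta else 0.0          # D_i = 1{<X_i, beta_0> >= eta_i}
        else:
            d = x_beta + eta                           # D_i = <X_i, beta_0> + eta_i

        effect = theta_0(x, n_x)
        y = effect * d + x_gamma + eps                 # Y_i = theta_0(X_i) D_i + <X_i, gamma_0> + eps_i
        rows.append({"y": y, "d": d, "x": x, "effect": effect})
    return rows


rows = sketch_heterogeneous_data(n_obs=5, seed=42)
print(len(rows), len(rows[0]["x"]))  # prints: 5 30
```

In the sketch, each row's `effect` corresponds to what the documented function reports under effects, and `theta_0` corresponds to treatment_effect. The actual function returns these inside res_dict together with data.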