doubleml.datasets.make_irm_data_discrete_treatments

doubleml.datasets.make_irm_data_discrete_treatments(n_obs=200, n_levels=3, linear=False, random_state=None, **kwargs)

Generates data from an interactive regression model (IRM) with multiple treatment levels (based on an underlying continuous treatment).

The data generating process is defined as follows (similar to the Monte Carlo simulation used in Sant’Anna and Zhao (2020)).

Let \(X = (X_1, X_2, X_3, X_4, X_5)^T \sim \mathcal{N}(0, \Sigma)\), where \(\Sigma\) corresponds to the identity matrix. Further, define \(Z_j = (\tilde{Z}_j - \mathbb{E}[\tilde{Z}_j]) / \sqrt{\text{Var}(\tilde{Z}_j)}\), where

\[ \begin{align}\begin{aligned}\tilde{Z}_1 &= \exp(0.5 \cdot X_1)\\\tilde{Z}_2 &= 10 + X_2/(1 + \exp(X_1))\\\tilde{Z}_3 &= (0.6 + X_1 \cdot X_3 / 25)^3\\\tilde{Z}_4 &= (20 + X_2 + X_4)^2\\\tilde{Z}_5 &= X_5.\end{aligned}\end{align} \]
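The covariate construction above can be sketched with NumPy. This is an illustrative sketch, not the library's implementation: the seed, the sample size and the use of sample moments in place of \(\mathbb{E}[\tilde{Z}_j]\) and \(\text{Var}(\tilde{Z}_j)\) are assumptions made for the example.

```python
import numpy as np

rng = np.random.default_rng(0)  # seed chosen only for illustration
n_obs = 1000

# X ~ N(0, I_5): five independent standard normal covariates
X = rng.standard_normal((n_obs, 5))

# Nonlinear transformations Z~_1, ..., Z~_5 from the display above
Z_tilde = np.column_stack([
    np.exp(0.5 * X[:, 0]),
    10 + X[:, 1] / (1 + np.exp(X[:, 0])),
    (0.6 + X[:, 0] * X[:, 2] / 25) ** 3,
    (20 + X[:, 1] + X[:, 3]) ** 2,
    X[:, 4],
])

# Standardize each column. Assumption: sample mean and sample standard
# deviation stand in for the population moments used in the definition.
Z = (Z_tilde - Z_tilde.mean(axis=0)) / Z_tilde.std(axis=0)

print(Z.shape)
```

After standardization every column of `Z` has sample mean 0 and sample standard deviation 1, whatever the scale of the underlying transformation.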

A continuous treatment \(D_{\text{cont}}\) is generated as

\[D_{\text{cont}} = \xi (-Z_1 + 0.5 Z_2 - 0.25 Z_3 - 0.1 Z_4) + \varepsilon_D,\]

where \(\varepsilon_D \sim \mathcal{N}(0,1)\) and \(\xi=0.3\). The corresponding treatment effect is defined as

\[\theta (d) = 0.1 \exp(d) + 10 \sin(0.7 d) + 2 d - 0.2 d^2.\]
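As a small sketch, the treatment effect function can be written directly from the formula above. The function name `theta` is chosen here for illustration and is not taken from the library.

```python
import numpy as np

def theta(d):
    """Treatment effect theta(d), evaluated elementwise."""
    d = np.asarray(d, dtype=float)
    return 0.1 * np.exp(d) + 10 * np.sin(0.7 * d) + 2 * d - 0.2 * d**2

# Evaluate the effect on a small grid of continuous treatment values
grid = np.linspace(-2.0, 2.0, 5)
effects = theta(grid)
```

At \(d = 0\) only the exponential term contributes, so \(\theta(0) = 0.1\).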

Based on the continuous treatment, a discrete treatment \(D\) is generated with a baseline level of \(D=0\) and additional levels based on the quantiles of \(D_{\text{cont}}\). The number of levels is defined by \(n_{\text{levels}}\). Each level is chosen to have the same probability of being selected.
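The following sketch shows one possible reading of this discretization step; the library's exact randomization may differ. Assumptions made for the example: `Z` is replaced by a standard normal stand-in, \(n_{\text{levels}}\) counts the levels above the baseline, and a unit falls back to the baseline with probability \(1/(n_{\text{levels}}+1)\) so that all levels are equally likely.

```python
import numpy as np

rng = np.random.default_rng(1)  # seed chosen only for illustration
n_obs, n_levels, xi = 3000, 3, 0.3

# Stand-in for the standardized covariates Z_1, ..., Z_4 (assumption)
Z = rng.standard_normal((n_obs, 4))

# Continuous treatment: D_cont = xi * (-Z_1 + 0.5 Z_2 - 0.25 Z_3 - 0.1 Z_4) + eps_D
eps_d = rng.standard_normal(n_obs)
d_cont = xi * (-Z[:, 0] + 0.5 * Z[:, 1] - 0.25 * Z[:, 2] - 0.1 * Z[:, 3]) + eps_d

# Quantile bounds that split D_cont into n_levels equally likely bins
level_bounds = np.quantile(d_cont, np.linspace(0, 1, n_levels + 1))

# Potential level in {1, ..., n_levels}: index of the bin containing D_cont
potential_level = np.searchsorted(level_bounds[1:-1], d_cont, side="right") + 1

# Assumption: independent draw that sends a unit to the baseline level 0
is_baseline = rng.uniform(size=n_obs) < 1 / (n_levels + 1)
d = np.where(is_baseline, 0, potential_level)

print(np.bincount(d, minlength=n_levels + 1))
```

Because the bounds are empirical quantiles, the potential levels split the sample into bins of equal size; the observed level counts vary only through the random baseline draw.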

The potential outcomes are defined as

\[ \begin{align}\begin{aligned}Y(0) &= 210 + 27.4 Z_1 + 13.7 (Z_2 + Z_3 + Z_4) + \varepsilon_Y\\Y(1) &= \theta (D_{\text{cont}}) 1\{D_{\text{cont}} > 0\} + Y(0),\end{aligned}\end{align} \]

where \(\varepsilon_Y \sim \mathcal{N}(0,5)\). Further, the observed outcome is defined as

\[Y = Y(1) 1\{D > 0\} + Y(0) 1\{D = 0\}.\]
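The outcome equations can be sketched as follows. The inputs `Z`, `d_cont` and `d` are random stand-ins so that the block runs on its own, and the second argument of \(\mathcal{N}(0,5)\) is read as a variance; both are assumptions of this example, not statements about the library's code.

```python
import numpy as np

def theta(d):
    """Treatment effect theta(d) as defined above."""
    return 0.1 * np.exp(d) + 10 * np.sin(0.7 * d) + 2 * d - 0.2 * d**2

rng = np.random.default_rng(2)  # seed chosen only for illustration
n_obs = 1000

# Stand-ins for quantities generated in the earlier steps (assumptions)
Z = rng.standard_normal((n_obs, 4))   # standardized covariates Z_1..Z_4
d_cont = rng.standard_normal(n_obs)   # continuous treatment
d = rng.integers(0, 4, size=n_obs)    # discrete treatment levels 0..3

# Outcome noise. Assumption: 5 is the variance, so the scale is sqrt(5).
eps_y = rng.normal(0.0, np.sqrt(5.0), size=n_obs)

# Potential outcomes Y(0) and Y(1)
y0 = 210 + 27.4 * Z[:, 0] + 13.7 * (Z[:, 1] + Z[:, 2] + Z[:, 3]) + eps_y
y1 = theta(d_cont) * (d_cont > 0) + y0

# Observed outcome: Y(1) for treated units (D > 0), Y(0) for the baseline
y = np.where(d > 0, y1, y0)

# Individual treatment effect implied by the two potential outcomes
ite = y1 - y0
```

By construction, the individual effect is zero whenever \(D_{\text{cont}} \le 0\), and baseline units always reveal \(Y(0)\).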

The data is returned as a dictionary with the entries x, y, d and oracle_values.

Parameters:
  • n_obs (int) – The number of observations to simulate. Default is 200.

  • n_levels (int) – The number of treatment levels. Default is 3.

  • linear (bool) – Indicates whether the true underlying regression is linear. Default is False.

  • random_state (int) – Random seed for reproducibility. Default is None.

Returns:

res_dict – Dictionary with entries x, y, d and oracle_values. The oracle values contain the continuous treatment, the level bounds, the potential level, ITE and the potential outcome without treatment.

Return type:

dictionary