doubleml.datasets.make_ssm_data

doubleml.datasets.make_ssm_data(n_obs=8000, dim_x=100, theta=1, mar=True, return_type='DoubleMLData')

Generates data from a sample selection model (SSM). The data generating process is defined as

\[ \begin{align}\begin{aligned}y_i &= \theta d_i + x_i' \beta d_i + u_i,\\s_i &= 1\left\lbrace d_i + \gamma z_i + x_i' \beta + v_i > 0 \right\rbrace,\\d_i &= 1\left\lbrace x_i' \beta + w_i > 0 \right\rbrace,\end{aligned}\end{align} \]

with \(y_i\) being observed if \(s_i = 1\) and covariates \(x_i \sim \mathcal{N}(0, \Sigma^2_x)\), where \(\Sigma^2_x\) is a matrix with entries \(\Sigma_{kj} = 0.5^{|j-k|}\). \(\beta\) is a dim_x-vector with entries \(\beta_j=\frac{0.4}{j^2}\). The remaining terms are drawn as \(z_i \sim \mathcal{N}(0, 1)\), \((u_i,v_i) \sim \mathcal{N}(0, \Sigma^2_{u,v})\), and \(w_i \sim \mathcal{N}(0, 1)\).
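The equations above can be sketched directly in NumPy. This is an illustration of the stated model, not the library's implementation. The helper name simulate_ssm is hypothetical. The text does not state the value of \(\gamma\) or the entries of \(\Sigma^2_{u,v}\), so gamma and rho (the correlation between \(u_i\) and \(v_i\), with unit variances) are placeholder assumptions. Marking unobserved outcomes as NaN is likewise a choice made for this sketch.

```python
import numpy as np


def simulate_ssm(n_obs=1000, dim_x=5, theta=1.0, gamma=1.0, rho=0.8, seed=0):
    """Draw one sample from the selection model written out above.

    gamma and rho are illustrative assumptions, not values taken from the text.
    """
    rng = np.random.default_rng(seed)

    # Covariate covariance matrix with entries 0.5^{|j-k|}
    idx = np.arange(dim_x)
    sigma_x = 0.5 ** np.abs(idx[:, None] - idx[None, :])
    x = rng.multivariate_normal(np.zeros(dim_x), sigma_x, size=n_obs)

    # beta_j = 0.4 / j^2 for j = 1, ..., dim_x
    beta = 0.4 / np.arange(1, dim_x + 1) ** 2
    xb = x @ beta

    # (u, v) jointly normal with assumed correlation rho; w and z standard normal
    sigma_uv = np.array([[1.0, rho], [rho, 1.0]])
    u, v = rng.multivariate_normal(np.zeros(2), sigma_uv, size=n_obs).T
    w = rng.standard_normal(n_obs)
    z = rng.standard_normal(n_obs)

    # Treatment, selection and outcome equations
    d = (xb + w > 0).astype(int)
    s = (d + gamma * z + xb + v > 0).astype(int)
    y = theta * d + xb * d + u

    # The outcome is only observed when s_i = 1 (NaN marks unobserved values here)
    y_obs = np.where(s == 1, y, np.nan)
    return x, y_obs, d, z, s


x, y, d, z, s = simulate_ssm(n_obs=200, dim_x=4, seed=1)
print(x.shape, d.mean(), s.mean())
```

In this sketch, a nonzero rho makes selection depend on the outcome's unobservables and gamma controls how strongly \(z_i\) shifts selection.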

The data generating process is inspired by a process used in the simulation study (see Appendix E) of Bia, Huber and Lafférs (2023).

Parameters:
  • n_obs – The number of observations to simulate.

  • dim_x – The number of covariates.

  • theta – The value of the causal parameter.

  • mar – Boolean. Indicates whether missingness at random holds.

  • return_type

    If 'DoubleMLData' or DoubleMLData, returns a DoubleMLData object.

    If 'DataFrame', 'pd.DataFrame' or pd.DataFrame, returns a pd.DataFrame.

    If 'array', 'np.ndarray', 'np.array' or np.ndarray, returns a tuple of np.ndarray objects (x, y, d, z, s).

References

Michela Bia, Martin Huber & Lukáš Lafférs (2023) Double Machine Learning for Sample Selection Models, Journal of Business & Economic Statistics, DOI: 10.1080/07350015.2023.2271071