doubleml.datasets.make_ssm_data
- doubleml.datasets.make_ssm_data(n_obs=8000, dim_x=100, theta=1, mar=True, return_type='DoubleMLData')
Generates data from a sample selection model (SSM). The data generating process is defined as
\[ \begin{align}\begin{aligned}y_i &= \theta d_i + x_i' \beta d_i + u_i,\\s_i &= 1\left\lbrace d_i + \gamma z_i + x_i' \beta + v_i > 0 \right\rbrace,\\d_i &= 1\left\lbrace x_i' \beta + w_i > 0 \right\rbrace,\end{aligned}\end{align} \]
where the outcome \(y_i\) is observed only if \(s_i = 1\). The covariates are \(x_i \sim \mathcal{N}(0, \Sigma^2_x)\), where \(\Sigma^2_x\) is a matrix with entries \(\Sigma_{kj} = 0.5^{|j-k|}\), and \(\beta\) is a dim_x-vector with entries \(\beta_j=\frac{0.4}{j^2}\). The remaining variables are \(z_i \sim \mathcal{N}(0, 1)\), \((u_i,v_i) \sim \mathcal{N}(0, \Sigma^2_{u,v})\) and \(w_i \sim \mathcal{N}(0, 1)\).
The data generating process is inspired by a process used in the simulation study (see Appendix E) of Bia, Huber and Lafférs (2023).
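The equations above can be sketched directly in NumPy. This is an illustrative sketch only, not the library's implementation: the function name `simulate_ssm_sketch`, the values of `gamma` and `rho`, and the unit variances in the covariance of \((u_i, v_i)\) are assumptions, since this page does not specify \(\gamma\) or \(\Sigma^2_{u,v}\). Unobserved outcomes are marked with NaN here, which is also an assumption.

```python
import numpy as np


def simulate_ssm_sketch(n_obs=1000, dim_x=5, theta=1.0, gamma=1.0, rho=0.8, seed=0):
    """Hypothetical helper: draw one sample following the equations above.

    gamma and rho are illustrative choices, not values taken from DoubleML.
    """
    rng = np.random.default_rng(seed)

    # Covariates with covariance entries 0.5^{|j-k|}
    idx = np.arange(dim_x)
    cov_x = 0.5 ** np.abs(idx[:, None] - idx[None, :])
    x = rng.multivariate_normal(np.zeros(dim_x), cov_x, size=n_obs)

    # Coefficients beta_j = 0.4 / j^2 for j = 1, ..., dim_x
    beta = 0.4 / (idx + 1.0) ** 2
    x_beta = x @ beta

    # Errors: (u, v) jointly normal with assumed correlation rho; w and z standard normal
    cov_uv = np.array([[1.0, rho], [rho, 1.0]])
    u, v = rng.multivariate_normal(np.zeros(2), cov_uv, size=n_obs).T
    w = rng.standard_normal(n_obs)
    z = rng.standard_normal(n_obs)

    # Treatment, selection and outcome equations
    d = (x_beta + w > 0).astype(int)
    s = (d + gamma * z + x_beta + v > 0).astype(int)
    y = theta * d + x_beta * d + u

    # The outcome is observed only when s == 1
    y_obs = np.where(s == 1, y, np.nan)
    return x, y_obs, d, z, s
```

Calling `simulate_ssm_sketch(n_obs=200, dim_x=4)` returns five arrays in the order `(x, y, d, z, s)`, mirroring the array return order documented for `make_ssm_data`.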
- Parameters:
n_obs – The number of observations to simulate.
dim_x – The number of covariates.
theta – The value of the causal parameter.
mar – Boolean. Indicates whether missingness at random holds.
return_type –
If 'DoubleMLData' or DoubleMLData, returns a DoubleMLData object.
If 'DataFrame', 'pd.DataFrame' or pd.DataFrame, returns a pd.DataFrame.
If 'array', 'np.ndarray', 'np.array' or np.ndarray, returns np.ndarrays (x, y, d, z, s).
References
Michela Bia, Martin Huber & Lukáš Lafférs (2023) Double Machine Learning for Sample Selection Models, Journal of Business & Economic Statistics, DOI: 10.1080/07350015.2023.2271071