3.2.6. doubleml.irm.datasets.make_ssm_data
- doubleml.irm.datasets.make_ssm_data(n_obs=8000, dim_x=100, theta=1, mar=True, return_type='DoubleMLSSMData')
- Generates data from a sample selection model (SSM). The data generating process is defined as

  \[ \begin{aligned} y_i &= \theta d_i + x_i' \beta d_i + u_i, \\ s_i &= 1\left\lbrace d_i + \gamma z_i + x_i' \beta + v_i > 0 \right\rbrace, \\ d_i &= 1\left\lbrace x_i' \beta + w_i > 0 \right\rbrace, \end{aligned} \]

  with \(y_i\) being observed only if \(s_i = 1\), and covariates \(x_i \sim \mathcal{N}(0, \Sigma^2_x)\), where \(\Sigma^2_x\) is a matrix with entries \(\Sigma_{kj} = 0.5^{|j-k|}\). \(\beta\) is a dim_x-vector with entries \(\beta_j=\frac{0.4}{j^2}\), \(z_i \sim \mathcal{N}(0, 1)\), \((u_i,v_i) \sim \mathcal{N}(0, \Sigma^2_{u,v})\), and \(w_i \sim \mathcal{N}(0, 1)\).

- The data generating process is inspired by a process used in the simulation study (see Appendix E) of Bia, Huber and Lafférs (2023).
- Parameters:
- n_obs – The number of observations to simulate. 
- dim_x – The number of covariates. 
- theta – The value of the causal parameter. 
- mar – Boolean. Indicates whether missingness at random holds. 
- return_type –
  - If 'DoubleMLData' or DoubleMLData, returns a DoubleMLData object.
  - If 'DataFrame', 'pd.DataFrame' or pd.DataFrame, returns a pd.DataFrame.
  - If 'array', 'np.ndarray', 'np.array' or np.ndarray, returns np.ndarrays (x, y, d, z, s).
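
The three equations of the data generating process can be sketched directly in NumPy. This is an illustrative sketch only, not the library's implementation. The function name `simulate_ssm` and the arguments `gamma`, `rho_uv` and `seed` are hypothetical: the description does not state the value of \(\gamma\), the entries of \(\Sigma^2_{u,v}\), or how either depends on `mar`, so they are exposed here as free parameters with arbitrary defaults. Marking unobserved outcomes as NaN is also an assumption of this sketch.

```python
import numpy as np


def simulate_ssm(n_obs=1000, dim_x=5, theta=1.0, gamma=1.0, rho_uv=0.8, seed=0):
    """Illustrative draw from the selection model (not the library code).

    gamma and rho_uv are hypothetical knobs; their values are assumptions.
    """
    rng = np.random.default_rng(seed)

    # Covariate covariance matrix with entries 0.5^{|j-k|}.
    idx = np.arange(dim_x)
    cov_x = 0.5 ** np.abs(idx[:, None] - idx[None, :])
    x = rng.multivariate_normal(np.zeros(dim_x), cov_x, size=n_obs)

    # beta_j = 0.4 / j^2 for j = 1, ..., dim_x.
    beta = 0.4 / np.arange(1, dim_x + 1) ** 2
    xb = x @ beta

    # z and w are standard normal; (u, v) are jointly normal with
    # an assumed correlation rho_uv and unit variances.
    z = rng.standard_normal(n_obs)
    w = rng.standard_normal(n_obs)
    cov_uv = np.array([[1.0, rho_uv], [rho_uv, 1.0]])
    u, v = rng.multivariate_normal(np.zeros(2), cov_uv, size=n_obs).T

    # Treatment, selection and outcome equations, as written above.
    d = (xb + w > 0).astype(float)
    s = (d + gamma * z + xb + v > 0).astype(float)
    y = theta * d + xb * d + u

    # y is observed only when s == 1; unobserved outcomes are set to NaN here.
    y_obs = np.where(s == 1, y, np.nan)
    return x, y_obs, d, z, s
```

The return order mirrors the `(x, y, d, z, s)` tuple documented for the array return type, which makes it easy to compare the sketch against the arrays returned by `make_ssm_data`.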
 
- References
  - Michela Bia, Martin Huber & Lukáš Lafférs (2023). Double Machine Learning for Sample Selection Models. Journal of Business & Economic Statistics. DOI: 10.1080/07350015.2023.2271071
 
    
  
  
    