3.2.9. doubleml.plm.datasets.make_lplr_LZZ2020
- doubleml.plm.datasets.make_lplr_LZZ2020(n_obs=500, dim_x=20, alpha=0.5, return_type='DoubleMLData', balanced_r0=True, treatment='continuous')
Generates synthetic data for a logistic partially linear regression model, as in Liu et al. (2021), designed for use in double/debiased machine learning applications.
The data generating process is defined as follows:
Covariates \(x_i \sim \mathcal{N}(0, \Sigma)\), where \(\Sigma_{kj} = 0.2^{|j-k|}\).
Treatment \(d_i = a_0(x_i)\) (or a binary transformation thereof, depending on the treatment parameter).
Propensity score \(p_i = \sigma(\alpha d_i + r_0(x_i))\), where \(\sigma(\cdot)\) is the logistic function.
Outcome \(y_i \sim \text{Bernoulli}(p_i)\).
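The four steps above can be sketched for a single observation using only the Python standard library. This is an illustration, not the library's implementation: `draw_covariates` and `simulate_one` are hypothetical helpers, the covariates are drawn through an AR(1) recursion (which yields unit variances and \(\Sigma_{kj} = 0.2^{|j-k|}\)) rather than a multivariate normal sampler, and the nuisance functions passed in at the end are toy stand-ins.

```python
import math
import random


def sigmoid(t):
    """Logistic function sigma(t) = 1 / (1 + exp(-t))."""
    return 1.0 / (1.0 + math.exp(-t))


def draw_covariates(dim_x, rng, rho=0.2):
    """Draw one x ~ N(0, Sigma) with Sigma_kj = rho**|j-k| (hypothetical helper).

    Uses the recursion x_k = rho * x_{k-1} + sqrt(1 - rho^2) * z_k with
    independent standard normal z_k, which reproduces this covariance.
    """
    scale = math.sqrt(1.0 - rho**2)
    x = [rng.gauss(0.0, 1.0)]
    for _ in range(dim_x - 1):
        x.append(rho * x[-1] + scale * rng.gauss(0.0, 1.0))
    return x


def simulate_one(a0, r0, alpha, dim_x, rng):
    """Draw (x, d, p, y) for one observation with a continuous treatment."""
    x = draw_covariates(dim_x, rng)
    d = a0(x)                           # treatment d_i = a_0(x_i)
    p = sigmoid(alpha * d + r0(x))      # p_i = sigma(alpha * d_i + r_0(x_i))
    y = 1 if rng.random() < p else 0    # y_i ~ Bernoulli(p_i)
    return x, d, p, y


rng = random.Random(0)
# Toy nuisance functions for illustration only; they are not a_0 and r_0 of this DGP.
x, d, p, y = simulate_one(
    a0=lambda x: x[0], r0=lambda x: 0.5 * x[1], alpha=0.5, dim_x=20, rng=rng
)
```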
The nuisance functions are defined as:
\[
\begin{aligned}
a_0(x_i) &= \frac{2}{1 + \exp(x_{i,1})} - \frac{2}{1 + \exp(x_{i,2})} + \sin(x_{i,3}) + \cos(x_{i,4}) \\
&\quad + 0.5 \cdot \mathbb{1}(x_{i,5} > 0) - 0.5 \cdot \mathbb{1}(x_{i,6} > 0) + 0.2\, x_{i,7} x_{i,8} - 0.2\, x_{i,9} x_{i,10} \\
r_0(x_i) &= 0.1\, x_{i,1} x_{i,2} x_{i,3} + 0.1\, x_{i,4} x_{i,5} + 0.1\, x_{i,6}^3 - 0.5 \sin^2(x_{i,7}) \\
&\quad + 0.5 \cos(x_{i,8}) + \frac{1}{1 + x_{i,9}^2} - \frac{1}{1 + \exp(x_{i,10})} \\
&\quad + 0.25 \cdot \mathbb{1}(x_{i,11} > 0) - 0.25 \cdot \mathbb{1}(x_{i,13} > 0)
\end{aligned}
\]
- Parameters:
n_obs (int, default=500) – Number of observations to simulate.
dim_x (int, default=20) – Number of covariates.
alpha (float, default=0.5) – Value of the causal parameter.
return_type (str, default="DoubleMLData") –
Determines the return format. One of:
'DoubleMLData' or DoubleMLData: returns a DoubleMLData object.
'DataFrame', 'pd.DataFrame' or pd.DataFrame: returns a pandas.DataFrame.
'array', 'np.ndarray', 'np.array' or np.ndarray: returns a tuple of numpy arrays (x, y, d, p).
balanced_r0 (bool, default=True) – If True, uses the “balanced” r_0 specification (smaller magnitude / more balanced heterogeneity). If False, uses an “unbalanced” r_0 specification that produces a larger share of Y=0.
treatment (str, default="continuous") –
Type of treatment variable. One of "continuous", "binary", or "binary_unbalanced". Determines how the treatment d is generated from a_0(x):
"continuous": d = a_0(x) (continuous treatment).
"binary": d ~ Bernoulli(sigmoid(a_0(x) - mean(a_0(x)))).
"binary_unbalanced": d ~ Bernoulli(sigmoid(a_0(x))).
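The three treatment options can be illustrated with a short standard-library sketch. `make_treatment` is a hypothetical helper, not part of doubleml, and it assumes that mean(a_0(x)) refers to the sample mean over the generated observations.

```python
import math
import random


def sigmoid(t):
    """Logistic function sigma(t) = 1 / (1 + exp(-t))."""
    return 1.0 / (1.0 + math.exp(-t))


def make_treatment(a0_values, treatment, rng):
    """Map a list of a_0(x_i) values to treatments d_i (hypothetical helper)."""
    if treatment == "continuous":
        return list(a0_values)                      # d = a_0(x)
    if treatment == "binary":
        # Assumption: centre by the sample mean of a_0(x).
        center = sum(a0_values) / len(a0_values)
    elif treatment == "binary_unbalanced":
        center = 0.0                                # no centring
    else:
        raise ValueError(f"unknown treatment: {treatment!r}")
    # d ~ Bernoulli(sigmoid(a_0(x) - center))
    return [1 if rng.random() < sigmoid(a - center) else 0 for a in a0_values]


rng = random.Random(0)
a0_values = [-1.0, 0.0, 2.0]
d_cont = make_treatment(a0_values, "continuous", rng)
d_bin = make_treatment(a0_values, "binary", rng)
d_unb = make_treatment(a0_values, "binary_unbalanced", rng)
```

Centring shifts the average logit to zero, so the two treatment groups tend to be of similar size. Without it, the group sizes depend on the level of a_0(x).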
- Returns:
The generated data in the specified format.
- Return type:
Union[DoubleMLData, pd.DataFrame, Tuple[np.ndarray, np.ndarray, np.ndarray, np.ndarray]]
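As a reading aid, the nuisance functions a_0 and r_0 can be transcribed from the formulas displayed above into plain Python. This sketch is not taken from the doubleml source: the `indicator` helper is hypothetical, the functions act on one covariate vector at a time, and indices are 0-based, so `x[0]` corresponds to \(x_{i,1}\). Because r_0 references \(x_{i,13}\), the vector needs at least 13 entries.

```python
import math


def indicator(condition):
    """Return 1.0 if the condition holds, else 0.0 (hypothetical helper)."""
    return 1.0 if condition else 0.0


def a_0(x):
    """a_0(x_i) for one covariate vector x, following the displayed formula."""
    return (
        2.0 / (1.0 + math.exp(x[0]))
        - 2.0 / (1.0 + math.exp(x[1]))
        + math.sin(x[2])
        + math.cos(x[3])
        + 0.5 * indicator(x[4] > 0)
        - 0.5 * indicator(x[5] > 0)
        + 0.2 * x[6] * x[7]
        - 0.2 * x[8] * x[9]
    )


def r_0(x):
    """r_0(x_i) for one covariate vector x, following the displayed formula."""
    return (
        0.1 * x[0] * x[1] * x[2]
        + 0.1 * x[3] * x[4]
        + 0.1 * x[5] ** 3
        - 0.5 * math.sin(x[6]) ** 2
        + 0.5 * math.cos(x[7])
        + 1.0 / (1.0 + x[8] ** 2)
        - 1.0 / (1.0 + math.exp(x[9]))
        + 0.25 * indicator(x[10] > 0)
        - 0.25 * indicator(x[12] > 0)
    )


# At x = 0 both functions evaluate to 1.0, which is a quick sanity check.
x_zero = [0.0] * 20
a_at_zero = a_0(x_zero)
r_at_zero = r_0(x_zero)
```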
References
Liu, Molei, Yi Zhang, and Doudou Zhou. 2021. “Double/Debiased Machine Learning for Logistic Partially Linear Model.” The Econometrics Journal 24 (3): 559–88. doi:10.1093/ectj/utab019.