3.2.9. doubleml.plm.datasets.make_lplr_LZZ2020

doubleml.plm.datasets.make_lplr_LZZ2020(n_obs=500, dim_x=20, alpha=0.5, return_type='DoubleMLData', balanced_r0=True, treatment='continuous')

Generates synthetic data for a logistic partially linear regression model, as in Liu et al. (2021), designed for use in double/debiased machine learning applications.

The data generating process is defined as follows:

  • Covariates \(x_i \sim \mathcal{N}(0, \Sigma)\), where \(\Sigma_{kj} = 0.2^{|j-k|}\).

  • Treatment \(d_i = a_0(x_i)\) (or a binary transformation thereof, depending on the treatment parameter).

  • Outcome probability \(p_i = \sigma(\alpha d_i + r_0(x_i))\), where \(\sigma(\cdot)\) is the logistic function.

  • Outcome \(y_i \sim \text{Bernoulli}(p_i)\).
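
The covariance structure in the first bullet can be illustrated with the standard library alone. The following is a minimal sketch, not the library's implementation: it builds \(\Sigma\) explicitly and draws one covariate vector with an AR(1) recursion, which yields the same covariance \(0.2^{|j-k|}\) as a direct multivariate normal draw. The helper names `toeplitz_cov` and `draw_covariates` are hypothetical and exist only for this illustration.

```python
import math
import random


def toeplitz_cov(dim_x, rho=0.2):
    """Return Sigma as a nested list with Sigma[k][j] = rho ** |j - k|."""
    return [[rho ** abs(j - k) for j in range(dim_x)] for k in range(dim_x)]


def draw_covariates(dim_x, rng, rho=0.2):
    """Draw one x ~ N(0, Sigma) via an AR(1) recursion (illustrative only).

    With z_k iid N(0, 1), setting x_1 = z_1 and
    x_k = rho * x_{k-1} + sqrt(1 - rho^2) * z_k
    gives Var(x_k) = 1 and Cov(x_j, x_k) = rho ** |j - k|.
    """
    scale = math.sqrt(1.0 - rho ** 2)
    x = [rng.gauss(0.0, 1.0)]
    for _ in range(1, dim_x):
        x.append(rho * x[-1] + scale * rng.gauss(0.0, 1.0))
    return x


rng = random.Random(42)          # seeded for reproducibility
sigma = toeplitz_cov(dim_x=4)    # small matrix, easy to inspect
x_i = draw_covariates(dim_x=20, rng=rng)
```

The recursion is a stand-in chosen to avoid a matrix factorization; a numerical library's multivariate normal sampler applied to `sigma` would serve the same purpose.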

The nuisance functions are defined as:

\[\begin{split}\begin{aligned} a_0(x_i) &= \frac{2}{1 + \exp(x_{i,1})} - \frac{2}{1 + \exp(x_{i,2})} + \sin(x_{i,3}) + \cos(x_{i,4}) \\ &\quad + 0.5 \cdot \mathbb{1}(x_{i,5} > 0) - 0.5 \cdot \mathbb{1}(x_{i,6} > 0) + 0.2\, x_{i,7} x_{i,8} - 0.2\, x_{i,9} x_{i,10} \\ r_0(x_i) &= 0.1\, x_{i,1} x_{i,2} x_{i,3} + 0.1\, x_{i,4} x_{i,5} + 0.1\, x_{i,6}^3 - 0.5 \sin^2(x_{i,7}) \\ &\quad + 0.5 \cos(x_{i,8}) + \frac{1}{1 + x_{i,9}^2} - \frac{1}{1 + \exp(x_{i,10})} \\ &\quad + 0.25 \cdot \mathbb{1}(x_{i,11} > 0) - 0.25 \cdot \mathbb{1}(x_{i,13} > 0) \end{aligned}\end{split}\]
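
As a reading aid, the two nuisance functions can be transcribed term by term into plain Python. This is a sketch under stated assumptions, not the library's source: indices are shifted to zero-based (so `x[0]` stands for \(x_{i,1}\)), the vector must have at least 13 entries, the treatment is taken to be continuous (\(d_i = a_0(x_i)\)), and `outcome_probability` is a hypothetical helper name. The text does not say which setting of balanced_r0 the displayed \(r_0\) corresponds to, so the code makes no claim about that either.

```python
import math


def a_0(x):
    """Transcription of a_0; x[0] corresponds to x_{i,1}."""
    return (
        2.0 / (1.0 + math.exp(x[0]))
        - 2.0 / (1.0 + math.exp(x[1]))
        + math.sin(x[2])
        + math.cos(x[3])
        + 0.5 * (x[4] > 0)   # indicator 1(x_{i,5} > 0)
        - 0.5 * (x[5] > 0)   # indicator 1(x_{i,6} > 0)
        + 0.2 * x[6] * x[7]
        - 0.2 * x[8] * x[9]
    )


def r_0(x):
    """Transcription of r_0 exactly as displayed above."""
    return (
        0.1 * x[0] * x[1] * x[2]
        + 0.1 * x[3] * x[4]
        + 0.1 * x[5] ** 3
        - 0.5 * math.sin(x[6]) ** 2
        + 0.5 * math.cos(x[7])
        + 1.0 / (1.0 + x[8] ** 2)
        - 1.0 / (1.0 + math.exp(x[9]))
        + 0.25 * (x[10] > 0)  # indicator 1(x_{i,11} > 0)
        - 0.25 * (x[12] > 0)  # indicator 1(x_{i,13} > 0)
    )


def outcome_probability(x, alpha=0.5):
    """p = sigma(alpha * d + r_0(x)), assuming continuous d = a_0(x)."""
    d = a_0(x)
    return 1.0 / (1.0 + math.exp(-(alpha * d + r_0(x))))


x_zero = [0.0] * 20
p_zero = outcome_probability(x_zero)  # a_0 = 1 and r_0 = 1 at the origin
```

At the origin both functions evaluate to 1, so with the default alpha = 0.5 the outcome probability is \(\sigma(1.5) \approx 0.8176\).
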
Parameters:
  • n_obs (int, default=500) – Number of observations to simulate.

  • dim_x (int, default=20) – Number of covariates.

  • alpha (float, default=0.5) – Value of the causal parameter.

  • return_type (str, default="DoubleMLData") –

    Determines the return format. One of:

    • 'DoubleMLData' or DoubleMLData: returns a DoubleMLData object.

    • 'DataFrame', 'pd.DataFrame' or pd.DataFrame: returns a pandas.DataFrame.

    • 'array', 'np.ndarray', 'np.array' or np.ndarray: returns a tuple of numpy arrays (x, y, d, p).

  • balanced_r0 (bool, default=True) – If True, uses the "balanced" r_0 specification (smaller magnitude / more balanced heterogeneity). If False, uses an "unbalanced" r_0 specification that yields a larger share of Y=0 outcomes.

  • treatment (str, default="continuous") –

    Type of treatment variable. One of "continuous", "binary", or "binary_unbalanced". Determines how the treatment d is generated from a_0(x):

    • "continuous": d = a_0(x) (continuous treatment).

    • "binary": d ~ Bernoulli(sigmoid(a_0(x) - mean(a_0(x)))).

    • "binary_unbalanced": d ~ Bernoulli(sigmoid(a_0(x))).
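
The three treatment modes can be sketched in a few lines of standard-library Python. This is an illustration rather than the library's code: `treatment_probabilities` and `make_treatment` are hypothetical helper names, and mean(a_0(x)) is assumed here to be the sample mean over the simulated observations.

```python
import math
import random


def sigmoid(t):
    """Logistic function."""
    return 1.0 / (1.0 + math.exp(-t))


def treatment_probabilities(a0_values, treatment):
    """Return P(d_i = 1) for the two binary modes."""
    if treatment == "binary":
        # Assumption: centering uses the sample mean of a_0(x).
        center = sum(a0_values) / len(a0_values)
        return [sigmoid(a - center) for a in a0_values]
    if treatment == "binary_unbalanced":
        return [sigmoid(a) for a in a0_values]
    raise ValueError(f"no treatment probabilities for mode {treatment!r}")


def make_treatment(a0_values, treatment, rng):
    """Map a_0(x) values to a treatment vector d for the given mode."""
    if treatment == "continuous":
        return list(a0_values)
    probs = treatment_probabilities(a0_values, treatment)
    # Bernoulli draw: 1 with probability p, otherwise 0.
    return [1 if rng.random() < p else 0 for p in probs]


a0_values = [0.0, 2.0]
rng = random.Random(0)
d_cont = make_treatment(a0_values, "continuous", rng)
d_bin = make_treatment(a0_values, "binary", rng)
p_bin = treatment_probabilities(a0_values, "binary")
p_unb = treatment_probabilities(a0_values, "binary_unbalanced")
```

In the "binary" mode the centered logits average to zero, so the treatment probabilities scatter around 0.5. Without centering, a nonzero mean of a_0(x) shifts them away from 0.5, which is presumably why that mode is labeled unbalanced.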

Returns:

The generated data in the specified format.

Return type:

Union[DoubleMLData, pd.DataFrame, Tuple[np.ndarray, np.ndarray, np.ndarray, np.ndarray]]

References

Liu, Molei, Yi Zhang, and Doudou Zhou. 2021. “Double/Debiased Machine Learning for Logistic Partially Linear Model.” The Econometrics Journal 24 (3): 559–88. doi:10.1093/ectj/utab019.