3.2.10. doubleml.plm.datasets.make_plpr_CP2025

doubleml.plm.datasets.make_plpr_CP2025(num_id=250, num_t=10, dim_x=30, theta=0.5, dgp_type='dgp1', time_type='int')

Generates synthetic data for a partially linear panel regression model, based on Clarke and Polselli (2025). The data generating process is defined as

\[ \begin{align}\begin{aligned}Y_{it} &= D_{it} \theta + g_0(X_{it}) + \alpha_i + U_{it}, & &U_{it} \sim \mathcal{N}(0,1),\\D_{it} &= m_0(X_{it}) + c_i + V_{it}, & &V_{it} \sim \mathcal{N}(0,1),\end{aligned}\end{align} \]

with

\[\alpha_i = 0.25 \left(\frac{1}{T} \sum_{t=1}^{T} D_{it} - \bar{D} \right) + 0.25 \frac{1}{T} \sum_{t=1}^{T} \sum_{k \in \mathcal{K}} X_{it,k} + a_i\]

and \(a_i \sim \mathcal{N}(0,0.95)\), \(X_{it,p} \sim \mathcal{N}(0,5)\), \(c_i \sim \mathcal{N}(0,1)\). Here \(\mathcal{K} = \{1,3\}\) is the index set of the relevant (non-zero) confounding variables, and \(p\) indexes all confounding variables, whose total number is set by dim_x.
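The formula for \(\alpha_i\) can be traced by hand for a single unit. The following is a minimal sketch in plain Python, not the library's implementation. The helper name `alpha_i` and the toy inputs are hypothetical, and \(\bar{D}\) is passed in as an argument because the text above does not define it (it is assumed here to be the overall sample mean of \(D\)).

```python
from statistics import fmean

# Index set K = {1, 3} of relevant confounders, 1-based as in the text.
RELEVANT = (1, 3)


def alpha_i(d_unit, x_unit, d_bar, a_i):
    """Fixed effect of one unit, following the formula for alpha_i.

    d_unit : list of D_it for t = 1..T
    x_unit : list over t of covariate lists [X_it,1, X_it,2, ...]
    d_bar  : assumed to be the overall mean of D (not defined in the text)
    a_i    : the unit-specific random draw
    """
    num_t = len(d_unit)
    d_term = 0.25 * (fmean(d_unit) - d_bar)
    # Sum the relevant covariates over all periods, then average over T.
    x_sum = sum(x_t[k - 1] for x_t in x_unit for k in RELEVANT)
    x_term = 0.25 * x_sum / num_t
    return d_term + x_term + a_i


# Toy unit with T = 2 periods and 3 covariates (illustrative values only).
alpha = alpha_i(
    d_unit=[1.0, 3.0],
    x_unit=[[1.0, 9.0, 2.0], [3.0, 9.0, 4.0]],
    d_bar=1.0,
    a_i=0.1,
)
print(alpha)
```

In this toy case the treatment term is 0.25 · (2 − 1) = 0.25 and the covariate term is 0.25 · (1 + 2 + 3 + 4) / 2 = 1.25, so together with \(a_i = 0.1\) the fixed effect is 1.6 up to floating-point rounding. The second covariate is ignored because its index is not in \(\mathcal{K}\).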

Clarke and Polselli (2025) consider three functional forms of the confounders to model the nuisance functions \(g_0\) and \(m_0\) with varying levels of non-linearity and non-smoothness:

Design 1. (dgp1): Linear in the nuisance parameters

\[ \begin{align}\begin{aligned}g_0(X_{it}) &= a X_{it,1} + X_{it,3}\\m_0(X_{it}) &= a X_{it,1} + X_{it,3}\end{aligned}\end{align} \]

Design 2. (dgp2): Non-linear and smooth in the nuisance parameters

\[ \begin{align}\begin{aligned}g_0(X_{it}) &= \frac{\exp(X_{it,1})}{1 + \exp(X_{it,1})} + a \cos(X_{it,3})\\m_0(X_{it}) &= \cos(X_{it,1}) + a \frac{\exp(X_{it,3})}{1 + \exp(X_{it,3})}\end{aligned}\end{align} \]

Design 3. (dgp3): Non-linear and discontinuous in the nuisance parameters

\[ \begin{align}\begin{aligned}g_0(X_{it}) &= b (X_{it,1} \cdot X_{it,3}) + a (X_{it,3} \cdot 1\{X_{it,3} > 0\})\\m_0(X_{it}) &= a (X_{it,1} \cdot 1\{X_{it,1} > 0\}) + b (X_{it,1} \cdot X_{it,3}),\end{aligned}\end{align} \]

where \(a = 0.25\), \(b = 0.5\).
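The three designs can be written out as ordinary functions of the two relevant confounders. This is a minimal sketch in plain Python that transcribes the formulas above for scalar inputs. The function names `g0` and `m0` and the helper names are hypothetical and are not part of the DoubleML API.

```python
import math

A = 0.25  # a in the text
B = 0.5   # b in the text


def logistic(z):
    """exp(z) / (1 + exp(z)), written as in the formulas above."""
    return math.exp(z) / (1.0 + math.exp(z))


def positive_part(z):
    """z * 1{z > 0}."""
    return z if z > 0 else 0.0


def g0(x1, x3, dgp_type="dgp1"):
    """Outcome nuisance function g_0 for the given design."""
    if dgp_type == "dgp1":
        return A * x1 + x3
    if dgp_type == "dgp2":
        return logistic(x1) + A * math.cos(x3)
    if dgp_type == "dgp3":
        return B * (x1 * x3) + A * positive_part(x3)
    raise ValueError(f"unknown dgp_type: {dgp_type!r}")


def m0(x1, x3, dgp_type="dgp1"):
    """Treatment nuisance function m_0 for the given design."""
    if dgp_type == "dgp1":
        return A * x1 + x3
    if dgp_type == "dgp2":
        return math.cos(x1) + A * logistic(x3)
    if dgp_type == "dgp3":
        return A * positive_part(x1) + B * (x1 * x3)
    raise ValueError(f"unknown dgp_type: {dgp_type!r}")


# Evaluate each design at one illustrative point.
for design in ("dgp1", "dgp2", "dgp3"):
    print(design, g0(2.0, -1.0, design), m0(2.0, -1.0, design))
```

Only \(X_{it,1}\) and \(X_{it,3}\) enter the nuisance functions, so the remaining confounders act as pure noise variables in all three designs.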

Parameters:
  • num_id (int) – The number of units in the panel.

  • num_t (int) – The number of time periods in the panel.

  • dim_x (int) – The number of confounding variables.

  • theta (float) – The value of the causal parameter.

  • dgp_type (str) – The type of DGP design to be used. Default is 'dgp1'; other options are 'dgp2' and 'dgp3'.

  • time_type (str) – The data type of the time variable. Default is 'int'; other options are 'float' and 'datetime'.

Returns:

DataFrame containing the simulated static panel data.

Return type:

pandas.DataFrame

References

Clarke, P. S. and Polselli, A. (2025), Double machine learning for static panel models with fixed effects. The Econometrics Journal, utaf011, doi:10.1093/ectj/utaf011.