Python: Cluster Robust Double Machine Learning

Motivation

In many empirical applications, errors exhibit a clustered structure such that the usual i.i.d. assumption does not hold anymore. In order to perform valid statistical inference, researchers have to account for clustering. In this notebook, we will shortly emphasize the consequences of clustered data on inference based on the double machine learning (DML) approach as has been considered in Chiang et al. (2021). We will demonstrate how users of the DoubleML package can account for one- and two-way clustering in their analysis.

Clustered errors in terms of one or multiple dimensions might arise in many empirical applications. For example, in a cross-sectional study, errors might be correlated (i) within regions (one-way clustering) or (ii) within regions and industries at the same time (two-way clustering). Another example for two-way clustering, discussed in Chiang et al. (2021), refers to market share data with market shares being subject to shocks on the market and product level at the same time. We refer to Cameron et al. (2011) for an introduction to multiway clustering and a illustrative list of empirical examples.

Clustering and double machine learning

Clustering creates a challenge to the double machine learning (DML) approach in terms of

  1. a necessary adjustment of the formulae used for estimation of the variance covariance matrix, standard errors, p-values etc., and,

  2. an adjusted resampling scheme for the cross-fitting algorithm.

The first point equally applies to classical statistical models, for example a linear regression model (see, for example Cameron et al. 2011). The second point arises because the clustering implies a correlation of errors from train and test samples if the standard cross-fitting procedure suggested in Chernozhukov et al. (2018) was employed. The DML approach builds on independent sample splits into partitions that are used for training of the machine learning (ML) model learners and generation of predictions that are eventually used for solving the score function. For a motivation of the necessity of sample splitting, we refer to the illustration example in the user guide as well as to the explanation in Chernozhukov et al. (2018) .

In order to achieve independent data splits in a setting with one-way or multi-way clustering, Chiang et al. (2021) develop an updated \(K\)-fold sample splitting procedure that ensures independent sample splits: The data set is split into disjoint partitions in terms of all clustering dimensions. For example, in a situation with two-way clustering, the data is split into \(K^2\) folds. The machine learning models are then trained on a specific fold and used for generation of predictions in hold-out samples. Thereby, the sample splitting procedure ensures that the hold-out samples do not contain observations of the same clusters as used for training.

[1]:
import numpy as np
import pandas as pd

import matplotlib.pyplot as plt
from matplotlib.colors import ListedColormap
import seaborn as sns

from sklearn.model_selection import KFold, RepeatedKFold
from sklearn.base import clone

from sklearn.linear_model import LassoCV

from doubleml import DoubleMLClusterData, DoubleMLData, DoubleMLPLIV

from doubleml.datasets import make_pliv_multiway_cluster_CKMS2021

A Motivating Example: Two-Way Cluster Robust DML

In a first part, we show how the two-way cluster robust double machine learning (DML) (Chiang et al. 2021) can be implemented with the DoubleML package. Chiang et al. (2021) consider double-indexed data

\begin{equation} \lbrace W_{ij}: i \in \lbrace 1, \ldots, N \rbrace, j \in \lbrace 1, \ldots, M \rbrace \rbrace \end{equation}

and the partially linear IV regression model (PLIV)

\[\begin{split}\begin{aligned} Y_{ij} = D_{ij} \theta_0 + g_0(X_{ij}) + \epsilon_{ij}, & &\mathbb{E}(\epsilon_{ij} | X_{ij}, Z_{ij}) = 0, \\ Z_{ij} = m_0(X_{ij}) + v_{ij}, & &\mathbb{E}(v_{ij} | X_{ij}) = 0. \end{aligned}\end{split}\]

Simulate two-way cluster data

We use the PLIV data generating process described in Section 4.1 of Chiang et al. (2021). The DGP is defined as

\[\begin{split}\begin{aligned} Z_{ij} &= X_{ij}' \xi_0 + V_{ij}, \\ D_{ij} &= Z_{ij}' \pi_{10} + X_{ij}' \pi_{20} + v_{ij}, \\ Y_{ij} &= D_{ij} \theta + X_{ij}' \zeta_0 + \varepsilon_{ij}, \end{aligned}\end{split}\]

with

\[\begin{split}\begin{aligned} X_{ij} &= (1 - \omega_1^X - \omega_2^X) \alpha_{ij}^X + \omega_1^X \alpha_{i}^X + \omega_2^X \alpha_{j}^X, \\ \varepsilon_{ij} &= (1 - \omega_1^\varepsilon - \omega_2^\varepsilon) \alpha_{ij}^\varepsilon + \omega_1^\varepsilon \alpha_{i}^\varepsilon + \omega_2^\varepsilon \alpha_{j}^\varepsilon, \\ v_{ij} &= (1 - \omega_1^v - \omega_2^v) \alpha_{ij}^v + \omega_1^v \alpha_{i}^v + \omega_2^v \alpha_{j}^v, \\ V_{ij} &= (1 - \omega_1^V - \omega_2^V) \alpha_{ij}^V + \omega_1^V \alpha_{i}^V + \omega_2^V \alpha_{j}^V, \end{aligned}\end{split}\]

and \(\alpha_{ij}^X, \alpha_{i}^X, \alpha_{j}^X \sim \mathcal{N}(0, \Sigma)\) where \(\Sigma\) is a \(p_x \times p_x\) matrix with entries \(\Sigma_{kj} = s_X^{|j-k|}\). Further

\[\begin{split}\begin{aligned} \left(\begin{matrix} \alpha_{ij}^\varepsilon \\ \alpha_{ij}^v \end{matrix}\right), \left(\begin{matrix} \alpha_{i}^\varepsilon \\ \alpha_{i}^v \end{matrix}\right), \left(\begin{matrix} \alpha_{j}^\varepsilon \\ \alpha_{j}^v \end{matrix}\right) \sim \mathcal{N}\left(0, \left(\begin{matrix} 1 & s_{\varepsilon v} \\ s_{\varepsilon v} & 1 \end{matrix} \right) \right) \end{aligned}\end{split}\]

and \(\alpha_{ij}^V, \alpha_{i}^V, \alpha_{j}^V \sim \mathcal{N}(0, 1)\).

Data from this DGP can be generated with the make_pliv_multiway_cluster_CKMS2021() function from DoubleML. Analogously to Chiang et al. (2021, Section 5) we use the following parameter setting: \(\theta=1.0\), \(N=M=25\), \(p_x=100\), \(\pi_{10}=1.0\), \(\omega_X = \omega_{\varepsilon} = \omega_V = \omega_v = (0.25, 0.25)\), \(s_X = s_{\varepsilon v} = 0.25\) and the \(j\)-th entries of the \(p_x\)-vectors \(\zeta_0 = \pi_{20} = \xi_0\) are \((\zeta_{0})_j = 0.5^j\). This are also the default values of make_pliv_multiway_cluster_CKMS2021().

[2]:
# Set the simulation parameters
N = 25  # number of observations (first dimension)
M = 25  # number of observations (second dimension)
dim_X = 100  # dimension of X
np.random.seed(3141) # set seed

obj_dml_data = make_pliv_multiway_cluster_CKMS2021(N, M, dim_X)

Data-Backend for Cluster Data

The implementation of cluster robust double machine learning is based on a special data-backend called DoubleMLClusterData. As compared to the standard data-backend DoubleMLData, users can specify the clustering variables during instantiation of a DoubleMLClusterData object. The estimation framework will subsequently account for the provided clustering options.

[3]:
# The simulated data is of type DoubleMLClusterData
print(obj_dml_data)
================== DoubleMLClusterData Object ==================

------------------ Data summary      ------------------
Outcome variable: Y
Treatment variable(s): ['D']
Cluster variable(s): ['cluster_var_i', 'cluster_var_j']
Covariates: ['X1', 'X2', 'X3', 'X4', 'X5', 'X6', 'X7', 'X8', 'X9', 'X10', 'X11', 'X12', 'X13', 'X14', 'X15', 'X16', 'X17', 'X18', 'X19', 'X20', 'X21', 'X22', 'X23', 'X24', 'X25', 'X26', 'X27', 'X28', 'X29', 'X30', 'X31', 'X32', 'X33', 'X34', 'X35', 'X36', 'X37', 'X38', 'X39', 'X40', 'X41', 'X42', 'X43', 'X44', 'X45', 'X46', 'X47', 'X48', 'X49', 'X50', 'X51', 'X52', 'X53', 'X54', 'X55', 'X56', 'X57', 'X58', 'X59', 'X60', 'X61', 'X62', 'X63', 'X64', 'X65', 'X66', 'X67', 'X68', 'X69', 'X70', 'X71', 'X72', 'X73', 'X74', 'X75', 'X76', 'X77', 'X78', 'X79', 'X80', 'X81', 'X82', 'X83', 'X84', 'X85', 'X86', 'X87', 'X88', 'X89', 'X90', 'X91', 'X92', 'X93', 'X94', 'X95', 'X96', 'X97', 'X98', 'X99', 'X100']
Instrument variable(s): ['Z']
No. Observations: 625

------------------ DataFrame info    ------------------
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 625 entries, 0 to 624
Columns: 105 entries, cluster_var_i to Z
dtypes: float64(103), int64(2)
memory usage: 512.8 KB

[4]:
# The cluster variables are part of the DataFrame
obj_dml_data.data.head()
[4]:
cluster_var_i cluster_var_j X1 X2 X3 X4 X5 X6 X7 X8 ... X94 X95 X96 X97 X98 X99 X100 Y D Z
0 0 0 -0.261903 -0.195564 -0.118952 0.508459 0.226598 -0.544555 0.183888 1.694919 ... -0.205938 -0.996934 -1.136836 0.010269 -0.396985 -0.161141 -0.614188 0.256567 -0.113780 0.104787
1 0 1 -1.082973 -0.303835 0.760778 -0.542671 -0.601598 -0.201768 0.234910 0.212844 ... 0.427573 0.303324 0.247826 0.109273 -0.410795 -0.128408 0.633433 -1.816318 -1.002983 -0.942661
2 0 2 -0.110359 -0.679539 0.491245 -0.309772 -0.552727 0.036729 -0.673302 -1.024604 ... 0.094381 -0.922996 -2.054068 -0.477474 -0.543380 -0.720664 -0.332996 -0.851366 -0.302648 0.628069
3 0 3 -0.010940 1.003427 0.412653 0.784238 0.014637 0.611269 0.323679 -0.557999 ... 0.698694 1.342992 -1.136089 -0.632058 -0.509958 -0.456370 -0.557595 1.081488 0.438960 0.054162
4 0 4 -0.073207 -0.370736 -0.005857 -0.833907 -0.096337 -0.714240 0.094026 0.435401 ... -0.149714 -0.164864 -0.756805 0.518175 0.510385 -0.681176 -1.020271 -1.708190 -1.805007 -0.850321

5 rows × 105 columns

Initialize the objects of class DoubleMLPLIV

[5]:
# Set machine learning methods for m, g & r
learner = LassoCV()
ml_g = clone(learner)
ml_m = clone(learner)
ml_r = clone(learner)

# initialize the DoubleMLPLIV object
dml_pliv_obj = DoubleMLPLIV(obj_dml_data,
                            ml_g, ml_m, ml_r,
                            n_folds=3)
[6]:
print(dml_pliv_obj)
================== DoubleMLPLIV Object ==================

------------------ Data summary      ------------------
Outcome variable: Y
Treatment variable(s): ['D']
Covariates: ['X1', 'X2', 'X3', 'X4', 'X5', 'X6', 'X7', 'X8', 'X9', 'X10', 'X11', 'X12', 'X13', 'X14', 'X15', 'X16', 'X17', 'X18', 'X19', 'X20', 'X21', 'X22', 'X23', 'X24', 'X25', 'X26', 'X27', 'X28', 'X29', 'X30', 'X31', 'X32', 'X33', 'X34', 'X35', 'X36', 'X37', 'X38', 'X39', 'X40', 'X41', 'X42', 'X43', 'X44', 'X45', 'X46', 'X47', 'X48', 'X49', 'X50', 'X51', 'X52', 'X53', 'X54', 'X55', 'X56', 'X57', 'X58', 'X59', 'X60', 'X61', 'X62', 'X63', 'X64', 'X65', 'X66', 'X67', 'X68', 'X69', 'X70', 'X71', 'X72', 'X73', 'X74', 'X75', 'X76', 'X77', 'X78', 'X79', 'X80', 'X81', 'X82', 'X83', 'X84', 'X85', 'X86', 'X87', 'X88', 'X89', 'X90', 'X91', 'X92', 'X93', 'X94', 'X95', 'X96', 'X97', 'X98', 'X99', 'X100']
Instrument variable(s): ['Z']
Cluster variable(s): ['cluster_var_i', 'cluster_var_j']
No. Observations: 625

------------------ Score & algorithm ------------------
Score function: partialling out
DML algorithm: dml2

------------------ Machine learner   ------------------
Learner ml_g: LassoCV()
Learner ml_m: LassoCV()
Learner ml_r: LassoCV()

------------------ Resampling        ------------------
No. folds per cluster: 3
No. folds: 9
No. repeated sample splits: 1
Apply cross-fitting: True

------------------ Fit summary       ------------------
Empty DataFrame
Columns: [coef, std err, t, P>|t|]
Index: []

Cluster Robust Cross Fitting

A key element of cluster robust DML (Chiang et al. 2021) is a special sample splitting used for the cross-fitting. In case of two-way clustering, we assume \(N\) clusters in the first dimension and \(M\) clusters in the second dimension.

For \(K\)-fold cross-fitting, Chiang et al. (2021) proposed to randomly partition \([N]:=\{1,\ldots,N\}\) into \(K\) subsets \(\{I_1, \ldots, I_K\}\) and \([M]:=\{1,\ldots,N\}\) into \(K\) subsets \(\{J_1, \ldots, J_K\}\). Effectively, one then considers \(K^2\) folds. Basically for each \((k, \ell) \in \{1, \ldots, K\} \times \{1, \ldots, K\}\), the nuisance functions are estimated for all double-indexed observations in \(([N]\setminus I_K) \times ([M]\setminus J_\ell)\), i.e.,

\[\hat{\eta}_{k\ell} = \hat{\eta}\left((W_{ij})_{(i,j)\in ([N]\setminus I_K) \times ([M]\setminus J_\ell)}\right)\]

The causal parameter is then estimated as usual by solving a moment condition with a Neyman orthogonal score function. For two-way cluster robust double machine learning with algorithm DML2 this results in solving

\[\frac{1}{K^2} \sum_{k=1}^{K} \sum_{\ell=1}^{K} \frac{1}{|I_k| |J_\ell|} \sum_{(i,j) \in I_K \times J_\ell} \psi(W_{ij}, \tilde{\theta}_0, \hat{\eta}_{k\ell}) = 0\]

for \(\tilde{\theta}_0\). Here \(|I_k|\) denotes the cardinality, i.e., the number of clusters in the \(k\)-th fold for the first cluster variable.

We can visualize the sample splitting of the \(N \cdot M = 625\) observations into \(K \cdot K = 9\) folds. The following heat map illustrates the partitioned data set that is split into \(K=9\) folds. The horizontal axis corresponds to the fold indices and the vertical axis to the indices of the observations. A blue field indicates that the observation \(i\) is used for fitting the nuisance part, red indicates that the fold is used for prediction generation and white means that an observation is left out from the sample splitting.

For example, the first observation as displayed on the very bottom of the figure (using Python indexing starting at 0) is used for training of the nuisance parts in the first (0), the third (2), fourth (3) and sixth (5) fold and used for generation of the predictions in fold eight (7). At the same time the observation is left out from the sample splitting procedure in folds two (1), five (4), seven (6) and nine (8).

[8]:
# The function plt_smpls is defined at the end of the Notebook
plt_smpls(dml_pliv_obj.smpls[0], dml_pliv_obj._n_folds_per_cluster)
../_images/examples_py_double_ml_multiway_cluster_14_0.png

If we visualize the sample splitting in terms of the cluster variables, the partitioning of the data into \(9\) folds \(I_k \times J_\ell\) becomes clear. The identifiers for the first cluster variable \([N]:=\{1,\ldots,N\}\) have been randomly partioned into \(K=3\) folds denoted by \(\{I_1, I_2, I_3\}\) and the identifiers for the second cluster variable \([M]:=\{1,\ldots,M\}\) have also been randomly partioned into \(K=3\) folds denoted by \(\{J_1, J_2, J_3\}\). By considering every combination \(I_k \times J_\ell\) for \(1 \leq k, \ell \leq K = 3\) we effectively base the cross-fitting on \(9\) folds.

We now want to focus on the top-left sub-plot showing the partitioning of the cluster data for the first fold. The \(x\)-axis corresponds to the first cluster variable and the \(y\)-axis to the second cluster variable. Observations with cluster variables \((i,j) \in I_K \times J_\ell\) are used for estimation of the target parameter \(\tilde{\theta}_0\) by solving a Neyman orthogonal score function. For estimation of the nuisance function, we only use observation where neither the first cluster variable is in \(I_K\) nor the second cluster variable is in \(J_\ell\), i.e., we use observations indexed by \((i,j)\in ([N]\setminus I_K) \times ([M]\setminus J_\ell)\) to estimate the nuisance functions

\[\hat{\eta}_{k\ell} = \hat{\eta}\left((W_{ij})_{(i,j)\in ([N]\setminus I_K) \times ([M]\setminus J_\ell)}\right).\]

This way we guarantee that there are never observations from the same cluster (first and/or second cluster dimension) in the sample for the nuisance function estimation (blue) and at the same time in the sample for solving the score function (red). As a result of this special sample splitting proposed by Chiang et al. (2021), the observations in the score (red) and nuisance (blue) sample can be considered independent and the standard cross-fitting approach for double machine learning can be applied.

[9]:
# The function plt_smpls_cluster is defined at the end of the Notebook
plt_smpls_cluster(dml_pliv_obj.smpls_cluster[0], dml_pliv_obj._n_folds_per_cluster)
../_images/examples_py_double_ml_multiway_cluster_16_0.png

Cluster Robust Standard Errors

In the abstract base class DoubleML the estimation of cluster robust standard errors is implemented for all supported double machine learning models. It is based on the assumption of a linear Neyman orthogonal score function. We use the notation \(n \wedge m := \min\{n,m\}\). For the the asymptotic variance of \(\sqrt{\underline{C}}(\tilde{\theta_0} - \theta_0)\) with \(\underline{C} := N \wedge M\) Chiang et al. (2021) then propose the following estimator

\[\hat{\sigma}^2 = \hat{J}^{-1} \hat{\Gamma} \hat{J}^{-1}\]

where

\[\begin{split}\begin{aligned} \hat{\Gamma} = \frac{1}{K^2} \sum_{(k, \ell) \in[K]^2} \Bigg[ \frac{|I_k| \wedge |J_\ell|}{(|I_k||J_\ell|)^2} \bigg(&\sum_{i \in I_k} \sum_{j \in J_\ell} \sum_{j' \in J_\ell} \psi(W_{ij}; \tilde{\theta}, \hat{\eta}_{k \ell}) \psi(W_{ij'}; \tilde{\theta}_0, \hat{\eta}_{k \ell}) \\ &+ \sum_{i \in I_k} \sum_{i' \in I_k} \sum_{j \in J_\ell} \psi(W_{ij}; \tilde{\theta}, \hat{\eta}_{k \ell}) \psi(W_{i'j}; \tilde{\theta}_0, \hat{\eta}_{k \ell}) \bigg) \Bigg] \end{aligned}\end{split}\]

and

\[\begin{aligned} \hat{J} = \frac{1}{K^2} \sum_{(k, \ell) \in[K]^2} \frac{1}{|I_k||J_\ell|} \sum_{i \in I_k} \sum_{j \in J_\ell} \psi_a(W_{ij}; \tilde{\theta}_0, \hat{\eta}_{k \ell}). \end{aligned}\]

A \((1-\alpha)\) confidence interval is then given by (Chiang et al. 2021)

\[\begin{aligned} \left[ \tilde{\theta} \pm \Phi^{-1}(1-\alpha/2) \sqrt{\hat{\sigma}^2 / \underline{C}} \right] \end{aligned}\]

with \(\underline{C} = N \wedge M\).

[10]:
# Estimate the PLIV model with cluster robust double machine learning
dml_pliv_obj.fit()
dml_pliv_obj.summary
[10]:
coef std err t P>|t| 2.5 % 97.5 %
D 1.135871 0.118601 9.577271 9.964261e-22 0.903418 1.368324

(One-Way) Cluster Robust Double Machine Learing

We again use the PLIV data generating process described in Section 4.1 of Chiang et al. (2021). To obtain one-way clustered data, we set the following weights to zero

\[\omega_2^X = \omega_2^\varepsilon = \omega_2^v = \omega_2^V = 0.\]

Again we can simulate this data with make_pliv_multiway_cluster_CKMS2021(). To prepare the data-backend for one-way clustering, we only have to alter the cluster_cols to be 'cluster_var_i'.

[11]:
obj_dml_data = make_pliv_multiway_cluster_CKMS2021(N, M, dim_X,
                                                   omega_X=np.array([0.25, 0]),
                                                   omega_epsilon=np.array([0.25, 0]),
                                                   omega_v=np.array([0.25, 0]),
                                                   omega_V=np.array([0.25, 0]))
[12]:
obj_dml_data.cluster_cols = 'cluster_var_i'
print(obj_dml_data)
================== DoubleMLClusterData Object ==================

------------------ Data summary      ------------------
Outcome variable: Y
Treatment variable(s): ['D']
Cluster variable(s): ['cluster_var_i']
Covariates: ['X1', 'X2', 'X3', 'X4', 'X5', 'X6', 'X7', 'X8', 'X9', 'X10', 'X11', 'X12', 'X13', 'X14', 'X15', 'X16', 'X17', 'X18', 'X19', 'X20', 'X21', 'X22', 'X23', 'X24', 'X25', 'X26', 'X27', 'X28', 'X29', 'X30', 'X31', 'X32', 'X33', 'X34', 'X35', 'X36', 'X37', 'X38', 'X39', 'X40', 'X41', 'X42', 'X43', 'X44', 'X45', 'X46', 'X47', 'X48', 'X49', 'X50', 'X51', 'X52', 'X53', 'X54', 'X55', 'X56', 'X57', 'X58', 'X59', 'X60', 'X61', 'X62', 'X63', 'X64', 'X65', 'X66', 'X67', 'X68', 'X69', 'X70', 'X71', 'X72', 'X73', 'X74', 'X75', 'X76', 'X77', 'X78', 'X79', 'X80', 'X81', 'X82', 'X83', 'X84', 'X85', 'X86', 'X87', 'X88', 'X89', 'X90', 'X91', 'X92', 'X93', 'X94', 'X95', 'X96', 'X97', 'X98', 'X99', 'X100']
Instrument variable(s): ['Z']
No. Observations: 625

------------------ DataFrame info    ------------------
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 625 entries, 0 to 624
Columns: 105 entries, cluster_var_i to Z
dtypes: float64(103), int64(2)
memory usage: 512.8 KB

[13]:
# Set machine learning methods for m & g
learner = LassoCV()
ml_g = clone(learner)
ml_m = clone(learner)
ml_r = clone(learner)

# initialize the DoubleMLPLIV object
dml_pliv_obj = DoubleMLPLIV(obj_dml_data,
                            ml_g, ml_m, ml_r,
                            n_folds=3)
[14]:
dml_pliv_obj.fit()
dml_pliv_obj.summary
[14]:
coef std err t P>|t| 2.5 % 97.5 %
D 0.951532 0.047954 19.842625 1.276189e-87 0.857544 1.04552

Real-Data Application

As a real-data application we revist the consumer demand example from Chiang et al. (2021). The U.S. automobile data of Berry, Levinsohn, and Pakes (1995) is obtained from the R package hdm. In this example, we consider different specifications for the cluster dimensions.

Load and Process Data

[15]:
from sklearn.preprocessing import PolynomialFeatures
from rpy2.robjects.packages import importr, data
from rpy2.robjects import r, pandas2ri
pandas2ri.activate()
[16]:
hdm = importr('hdm')
blp_data = data(hdm).fetch('BLP')['BLP'][0]
[17]:
x_cols = ['hpwt', 'air', 'mpd', 'space']
blp_data
[17]:
model.name model.id firm.id cdid id price mpd air mpg space hpwt trend share outshr y
1 AMGREM 23.0 15.0 1.0 129.0 -6.825617 1.888146 0.0 1.697 1.150200 0.528997 0.0 0.001051 0.880106 0.820366
2 AMHORN 24.0 15.0 1.0 130.0 -6.245370 1.935989 0.0 1.740 1.278000 0.494324 0.0 0.000670 0.880106 0.369981
3 AMJAVL 25.0 15.0 1.0 132.0 -4.652778 1.716799 0.0 1.543 1.459200 0.467613 0.0 0.000341 0.880106 -0.306915
4 AMMATA 26.0 15.0 1.0 134.0 -4.921913 1.687871 0.0 1.517 1.606800 0.426540 0.0 0.000522 0.880106 0.120721
5 AMAMBS 18.0 15.0 1.0 136.0 -2.833024 1.504286 0.0 1.352 1.645800 0.452489 0.0 0.000442 0.880106 -0.045144
... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
2213 VV740 526.0 6.0 20.0 5584.0 4.378596 2.639135 1.0 2.100 1.305612 0.385917 19.0 0.000488 0.907801 0.022915
2214 VV760G 529.0 6.0 20.0 5585.0 14.225574 2.136442 1.0 1.700 1.305612 0.435967 19.0 0.000091 0.907801 -1.652350
2215 YGGVPL 555.0 23.0 20.0 5589.0 -8.368152 3.518846 0.0 2.800 0.843730 0.358289 19.0 0.000067 0.907801 -1.957375
2216 PS911C 425.0 12.0 20.0 5590.0 32.997571 3.016154 1.0 2.400 1.093950 0.814913 19.0 0.000039 0.907801 -2.512519
2217 PS944 438.0 12.0 20.0 5592.0 20.296729 3.267500 1.0 2.600 1.153587 0.693796 19.0 0.000025 0.907801 -2.953683

2217 rows × 15 columns

[18]:
def construct_iv(blp_data, x_cols=['hpwt', 'air', 'mpd', 'space']):
    n = blp_data.shape[0]
    p = len(x_cols)

    firmid = blp_data['firm.id'].values
    cdid = blp_data['cdid'].values
    id_var = blp_data['id'].values
    X = blp_data[x_cols]

    sum_other = pd.DataFrame(columns=['sum.other.' + var for var in x_cols],
                             index=blp_data.index)
    sum_rival = pd.DataFrame(columns=['sum.rival.' + var for var in x_cols],
                             index=blp_data.index)

    for i in range(n):
        other_ind = (firmid == firmid[i]) & (cdid == cdid[i]) & (id_var != id_var[i])
        rival_ind = (firmid != firmid[i]) & (cdid == cdid[i])
        for j in range(p):
            sum_other.iloc[i, j] = X.iloc[:,j][other_ind].sum()
            sum_rival.iloc[i, j] = X.iloc[:,j][rival_ind].sum()

    return pd.concat((sum_other, sum_rival), axis=1)
[19]:
iv_vars = construct_iv(blp_data, x_cols=['hpwt', 'air', 'mpd', 'space'])
[20]:
poly = PolynomialFeatures(degree=3, include_bias=False)
data_transf = poly.fit_transform(blp_data[x_cols])
x_cols_poly = poly.get_feature_names(x_cols)
data_transf = pd.DataFrame(data_transf, columns=x_cols_poly)
data_transf.index = blp_data.index
[21]:
sel_cols_chiang = list(np.setdiff1d(data_transf.columns,
                                    ['hpwt air mpd', 'hpwt air space',
                                     'hpwt mpd space', 'air mpd space']))
sel_cols_chiang
[21]:
['air',
 'air mpd',
 'air mpd^2',
 'air space',
 'air space^2',
 'air^2',
 'air^2 mpd',
 'air^2 space',
 'air^3',
 'hpwt',
 'hpwt air',
 'hpwt air^2',
 'hpwt mpd',
 'hpwt mpd^2',
 'hpwt space',
 'hpwt space^2',
 'hpwt^2',
 'hpwt^2 air',
 'hpwt^2 mpd',
 'hpwt^2 space',
 'hpwt^3',
 'mpd',
 'mpd space',
 'mpd space^2',
 'mpd^2',
 'mpd^2 space',
 'mpd^3',
 'space',
 'space^2',
 'space^3']
[22]:
blp_data['log_p'] = np.log(blp_data['price'] + 11.761)

y_col = 'y'
d_col = 'log_p'
cluster_cols = ['model.id', 'cdid']
all_z_cols = ['sum.other.hpwt', 'sum.other.mpd', 'sum.other.space']
z_col = all_z_cols[0]
[23]:
dml_df = pd.concat((blp_data[[y_col] + [d_col] + cluster_cols],
                    data_transf[sel_cols_chiang],
                    iv_vars[all_z_cols]),
                   axis=1)
[24]:
dml_df.shape
[24]:
(2217, 37)

Initialize DoubleMLClusterData object

[25]:
dml_data = DoubleMLClusterData(dml_df,
                               y_col=y_col,
                               d_cols=d_col,
                               z_cols=z_col,
                               cluster_cols=cluster_cols,
                               x_cols=sel_cols_chiang)
[26]:
print(dml_data)
================== DoubleMLClusterData Object ==================

------------------ Data summary      ------------------
Outcome variable: y
Treatment variable(s): ['log_p']
Cluster variable(s): ['model.id', 'cdid']
Covariates: ['air', 'air mpd', 'air mpd^2', 'air space', 'air space^2', 'air^2', 'air^2 mpd', 'air^2 space', 'air^3', 'hpwt', 'hpwt air', 'hpwt air^2', 'hpwt mpd', 'hpwt mpd^2', 'hpwt space', 'hpwt space^2', 'hpwt^2', 'hpwt^2 air', 'hpwt^2 mpd', 'hpwt^2 space', 'hpwt^3', 'mpd', 'mpd space', 'mpd space^2', 'mpd^2', 'mpd^2 space', 'mpd^3', 'space', 'space^2', 'space^3']
Instrument variable(s): ['sum.other.hpwt']
No. Observations: 2217

------------------ DataFrame info    ------------------
<class 'pandas.core.frame.DataFrame'>
Index: 2217 entries, 1 to 2217
Columns: 37 entries, y to sum.other.space
dtypes: float64(34), object(3)
memory usage: 658.2+ KB

[27]:
learner = LassoCV(max_iter=50000)
[28]:
res_df = pd.DataFrame()
n_rep = 10

Two-Way Clustering with Respect to Product and Market

[29]:
dml_data.z_cols = z_col
dml_data.cluster_cols = ['model.id', 'cdid']
dml_pliv = DoubleMLPLIV(dml_data,
                        clone(learner), clone(learner), clone(learner),
                        n_folds=2, n_rep=n_rep)
dml_pliv.fit()
res = dml_pliv.summary.reset_index(drop=True)
res['z_col'] = dml_data.z_cols[0]
res['clustering'] = 'two-way'
res_df = res_df.append(res)
/opt/hostedtoolcache/Python/3.8.12/x64/lib/python3.8/site-packages/sklearn/linear_model/_coordinate_descent.py:526: ConvergenceWarning: Objective did not converge. You might want to increase the number of iterations. Duality gap: 3.515478273505323, tolerance: 1.1187339840850312
  model = cd_fast.enet_coordinate_descent_gram(

One-Way Clustering with Respect to the Product

[30]:
dml_data.z_cols = z_col
dml_data.cluster_cols = 'model.id'
dml_pliv = DoubleMLPLIV(dml_data,
                        clone(learner), clone(learner), clone(learner),
                        n_folds=4, n_rep=n_rep)
dml_pliv.fit()
res = dml_pliv.summary.reset_index(drop=True)
res['z_col'] = dml_data.z_cols[0]
res['clustering'] = 'one-way-product'
res_df = res_df.append(res)

One-Way Clustering with Respect to the Market

[31]:
dml_data.z_cols = z_col
dml_data.cluster_cols = 'cdid'
dml_pliv = DoubleMLPLIV(dml_data,
                        clone(learner), clone(learner), clone(learner),
                        n_folds=4, n_rep=n_rep)
dml_pliv.fit()
res = dml_pliv.summary.reset_index(drop=True)
res['z_col'] = dml_data.z_cols[0]
res['clustering'] = 'one-way-market'
res_df = res_df.append(res)

No Clustering / Zero-Way Clustering

[32]:
dml_data = DoubleMLData(dml_df,
                        y_col=y_col,
                        d_cols=d_col,
                        z_cols=z_col,
                        x_cols=sel_cols_chiang)
[33]:
print(dml_data)
================== DoubleMLData Object ==================

------------------ Data summary      ------------------
Outcome variable: y
Treatment variable(s): ['log_p']
Covariates: ['air', 'air mpd', 'air mpd^2', 'air space', 'air space^2', 'air^2', 'air^2 mpd', 'air^2 space', 'air^3', 'hpwt', 'hpwt air', 'hpwt air^2', 'hpwt mpd', 'hpwt mpd^2', 'hpwt space', 'hpwt space^2', 'hpwt^2', 'hpwt^2 air', 'hpwt^2 mpd', 'hpwt^2 space', 'hpwt^3', 'mpd', 'mpd space', 'mpd space^2', 'mpd^2', 'mpd^2 space', 'mpd^3', 'space', 'space^2', 'space^3']
Instrument variable(s): ['sum.other.hpwt']
No. Observations: 2217

------------------ DataFrame info    ------------------
<class 'pandas.core.frame.DataFrame'>
Index: 2217 entries, 1 to 2217
Columns: 37 entries, y to sum.other.space
dtypes: float64(34), object(3)
memory usage: 658.2+ KB

[34]:
dml_data.z_cols = z_col
dml_pliv = DoubleMLPLIV(dml_data,
                        clone(learner), clone(learner), clone(learner),
                        n_folds=4, n_rep=n_rep)
dml_pliv.fit()
res = dml_pliv.summary.reset_index(drop=True)
res['z_col'] = dml_data.z_cols[0]
res['clustering'] = 'zero-way'
res_df = res_df.append(res)

Application Results

[35]:
res_df
[35]:
coef std err t P>|t| 2.5 % 97.5 % z_col clustering
0 -5.450152 1.262621 -4.316540 1.584942e-05 -7.924843 -2.975461 sum.other.hpwt two-way
0 -5.757819 0.928947 -6.198218 5.710586e-10 -7.578523 -3.937116 sum.other.hpwt one-way-product
0 -5.811825 0.784483 -7.408479 1.277561e-13 -7.349383 -4.274267 sum.other.hpwt one-way-market
0 -5.754870 0.464284 -12.395136 2.776728e-35 -6.664850 -4.844889 sum.other.hpwt zero-way

References

Berry, S., Levinsohn, J., and Pakes, A. (1995), Automobile Prices in Market Equilibrium, Econometrica: Journal of the Econometric Society, 63, 841-890, doi: 10.2307/2171802.

Cameron, A. C., Gelbach, J. B. and Miller, D. L. (2011), Robust Inference with Multiway Clustering, Journal of Business & Economic Statistics, 29:2, 238-249, doi: 10.1198/jbes.2010.07136.

Chernozhukov, V., Chetverikov, D., Demirer, M., Duflo, E., Hansen, C., Newey, W. and Robins, J. (2018), Double/debiased machine learning for treatment and structural parameters. The Econometrics Journal, 21: C1-C68, doi: 10.1111/ectj.12097.

Chiang, H. D., Kato K., Ma, Y. and Sasaki, Y. (2021), Multiway Cluster Robust Double/Debiased Machine Learning, Journal of Business & Economic Statistics, doi: 10.1080/07350015.2021.1895815, arXiv: 1909.03489.

Define Helper Functions for Plotting

[36]:
#discrete color scheme
x = sns.color_palette("RdBu_r", 7)
cMap = ListedColormap([x[0], x[3], x[6]])
plt.rcParams['figure.figsize'] = 15, 12
sns.set(font_scale=1.3)

def plt_smpls(smpls, n_folds_per_cluster):
    df = pd.DataFrame(np.zeros([N*M, n_folds_per_cluster*n_folds_per_cluster]))
    for i_split, this_split_ind in enumerate(smpls):
        df.loc[this_split_ind[0], i_split] = -1.
        df.loc[this_split_ind[1], i_split] = 1.

    ax = sns.heatmap(df, cmap=cMap);
    ax.invert_yaxis();
    ax.set_ylim([0, N*M]);
    ax.set_xlabel('Fold')
    ax.set_ylabel('Observation')
    colorbar = ax.collections[0].colorbar
    colorbar.set_ticks([-0.667, 0, 0.667])
    colorbar.set_ticklabels(['Nuisance', '', 'Score'])

def plt_smpls_cluster(smpls_cluster, n_folds_per_cluster):
    for i_split in range(len(smpls_cluster)):
        plt.subplot(n_folds_per_cluster, n_folds_per_cluster, i_split + 1)
        df = pd.DataFrame(np.zeros([N*M, 1]),
                      index = pd.MultiIndex.from_product([range(N), range(M)]),
                      columns=['value'])

        df.loc[pd.MultiIndex.from_product(smpls_cluster[i_split][0]), :] = -1.
        df.loc[pd.MultiIndex.from_product(smpls_cluster[i_split][1]), :] = 1.

        df_wide = df.reset_index().pivot(index="level_0", columns="level_1", values="value")
        df_wide.index.name=''
        df_wide.columns.name=''

        ax = sns.heatmap(df_wide, cmap=cMap);
        ax.invert_yaxis();
        ax.set_ylim([0, M]);
        colorbar = ax.collections[0].colorbar
        colorbar.set_ticks([-0.667, 0, 0.667])

        l = i_split % n_folds_per_cluster + 1
        k = np.floor_divide(i_split, n_folds_per_cluster) + 1
        title = f'Nuisance: $I_{{{k}}}^C \\times J_{{{l}}}^C$; Score: $I_{{{k}}} \\times J_{{{l}}}$'
        ax.set_title(title)
        if l == n_folds_per_cluster:
            colorbar.set_ticklabels(['Nuisance', '', 'Score'])
        else:
            colorbar.set_ticklabels(['', '', ''])
        if l == 1:
            ax.set_ylabel('First Cluster Variable $k$')
        if k == 3:
            ax.set_xlabel('Second Cluster Variable $\ell$')
    plt.tight_layout()