DoubleML meets FLAML - How to tune learners automatically within DoubleML#

Recent advances in automated machine learning make it easier to tune hyperparameters of ML estimators automatically. These optimized learners can be used for the estimation part within DoubleML. In this notebook we are going to explore how to tune learners with AutoML for the DoubleML framework.

This notebook uses `FLAML <https://github.com/microsoft/FLAML>`__, but there are many other AutoML frameworks. Particularly useful for DoubleML are packages that can export the tuned model as an sklearn-style estimator.

Examples include `TPOT <https://epistasislab.github.io/tpot/>`__, `auto-sklearn <https://automl.github.io/auto-sklearn/master/>`__, `H2O AutoML <https://docs.h2o.ai/h2o/latest-stable/h2o-docs/automl.html>`__, and `GAMA <https://openml-labs.github.io/gama/master/>`__.

Data Generation#

We create synthetic data using the `make_plr_CCDDHNR2018() <https://docs.doubleml.org/stable/api/generated/doubleml.datasets.make_plr_CCDDHNR2018.html>`__ process, with \(1000\) observations, \(50\) covariates, one treatment variable, and an outcome. We calibrate the process such that hyperparameter tuning becomes more important.

[1]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

from doubleml.datasets import make_plr_CCDDHNR2018
import doubleml as dml
from flaml import AutoML
from xgboost import XGBRegressor

# Generate synthetic data
data = make_plr_CCDDHNR2018(alpha=0.5, n_obs=1000, dim_x=50, return_type="DataFrame", a0=0, a1=1, s1=0.25, s2=0.25)
data.head()
[1]:
X1 X2 X3 X4 X5 X6 X7 X8 X9 X10 ... X43 X44 X45 X46 X47 X48 X49 X50 y d
0 1.065368 1.162593 1.089964 0.824657 0.157733 -1.228404 -0.675775 -0.223928 0.166238 0.124480 ... -2.021823 -1.662975 -2.100385 -1.225670 -1.223158 0.397536 -0.450031 0.511257 0.845534 -0.784792
1 0.214458 1.699616 3.222882 3.550242 2.692460 1.821970 1.223617 -0.100154 -0.234431 0.375844 ... -0.695711 -0.819507 -1.465424 -0.341472 -0.023537 0.436016 -0.503374 -1.342632 1.987307 0.835035
2 0.725820 -0.310145 -0.586921 -0.879058 0.239267 0.638461 0.131024 0.459436 -1.140081 -0.583692 ... -0.002388 0.716801 0.075942 1.439958 0.674747 -0.268343 0.682122 0.978303 0.154890 -0.168089
3 0.265744 0.479655 0.013313 1.417736 0.908767 1.786090 0.996892 -0.026822 -0.867201 0.433753 ... -0.482616 -0.172628 -0.309539 -0.609522 -0.830263 -0.883953 -1.249986 -2.688641 1.254035 0.161288
4 1.581827 0.926901 2.302382 0.803112 -0.152896 -0.389164 -0.569590 -0.124306 0.055439 -0.383531 ... 0.048220 -0.698751 -0.754678 -0.689600 0.726658 0.780068 1.475517 0.777718 1.773769 1.786563

5 rows × 52 columns

Tuning on the Full Sample#

In this section, we manually tune two XGBoost models using FLAML for a Partially Linear Regression (PLR) model. In the PLR model (using the default partialling-out score) we have to estimate the nuisance parameter \(\eta\) consisting of

\[\eta := \{m_0(X), \ell_0(X)\} = \{\mathbb{E}[D|X], \mathbb{E}[Y|X]\}.\]
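
This nuisance structure follows from the partially linear regression model of Chernozhukov et al. (2018) that underlies the data generating process:

\[Y = D\theta_0 + g_0(X) + \zeta, \qquad \mathbb{E}[\zeta|D,X] = 0,\]

\[D = m_0(X) + V, \qquad \mathbb{E}[V|X] = 0,\]

so that \(\ell_0(X) = \mathbb{E}[Y|X] = \theta_0 m_0(X) + g_0(X)\).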

We initialize two FLAML AutoML objects and fit them accordingly. Once the tuning has been completed, we pass the learners to DoubleML.

Step 1: Initialize and Train the AutoML Models#

Note: This cell will optimize the nuisance models for 4 minutes in total (a 120-second budget per model).

[2]:
# Initialize AutoML for outcome model (ml_l): Predict Y based on X
automl_l = AutoML()
settings_l = {
    "time_budget": 120,
    "metric": 'rmse',
    "estimator_list": ['xgboost'],
    "task": 'regression',
}
automl_l.fit(X_train=data.drop(columns=["y", "d"]).values, y_train=data["y"].values, verbose=2, **settings_l)

# Initialize AutoML for treatment model (ml_m): Predict D based on X
automl_m = AutoML()
settings_m = {
    "time_budget": 120,
    "metric": 'rmse',
    "estimator_list": ['xgboost'],
    "task": 'regression',
}
automl_m.fit(X_train=data.drop(columns=["y", "d"]).values, y_train=data["d"].values, verbose=2, **settings_m)
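
After fitting, we can inspect which configuration FLAML selected; best_estimator and best_config are attributes of the fitted AutoML object, analogous to the best_loss attribute used below:

# Inspect the winning learner and its tuned hyperparameters
print(automl_l.best_estimator)  # always 'xgboost' here, since the estimator list contains only XGBoost
print(automl_l.best_config)     # tuned hyperparameters of the outcome model
print(automl_m.best_config)     # tuned hyperparameters of the treatment model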

Step 2: Evaluate the Tuned Models#

FLAML reports the best validation loss found during tuning in the best_loss attribute. For more details, we refer to the FLAML documentation.

[3]:
rmse_oos_ml_m = automl_m.best_loss
rmse_oos_ml_l = automl_l.best_loss
print("Best RMSE during tuning (ml_m):", rmse_oos_ml_m)
print("Best RMSE during tuning (ml_l):", rmse_oos_ml_l)
Best RMSE during tuning (ml_m): 1.0078540263583833
Best RMSE during tuning (ml_l): 1.1155142425200442

Step 3: Create and Fit DoubleML Model#

We create a DoubleMLData object with the dataset, specifying \(y\) as the outcome variable and \(d\) as the treatment variable. We then initialize a DoubleMLPLR model using the tuned FLAML estimators for both the treatment and outcome nuisance components. DoubleML uses copies with identical configurations on each fold.
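
Internally, DoubleML relies on sklearn's clone to create these per-fold copies. A minimal sketch of the mechanism (fold_copy is an illustrative name, not part of the DoubleML API):

from sklearn.base import clone

# The estimator extracted from FLAML is a plain sklearn-style XGBRegressor, so
# cloning yields an unfitted copy that keeps the tuned hyperparameters.
fold_copy = clone(automl_l.model.estimator)
print(fold_copy.get_params()["n_estimators"])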

[4]:
obj_dml_data = dml.DoubleMLData(data, "y", "d")

obj_dml_plr_fullsample = dml.DoubleMLPLR(obj_dml_data, ml_m=automl_m.model.estimator,
                                           ml_l=automl_l.model.estimator)

print(obj_dml_plr_fullsample.fit().summary)
       coef   std err          t         P>|t|     2.5 %    97.5 %
d  0.498286  0.032738  15.220407  2.589147e-52  0.434121  0.562452

DoubleML’s built-in learner evaluation reports the out-of-sample error during cross-fitting. We can compare this measure to the best loss during training from above.

[19]:
learner_eval_fullsample = obj_dml_plr_fullsample.evaluate_learners()
rmse_dml_ml_l_fullsample = learner_eval_fullsample['ml_l'][0][0]
rmse_dml_ml_m_fullsample = learner_eval_fullsample['ml_m'][0][0]

print("RMSE evaluated by DoubleML (ml_m):", rmse_dml_ml_m_fullsample)
print("RMSE evaluated by DoubleML (ml_l):", rmse_dml_ml_l_fullsample)

RMSE evaluated by DoubleML (ml_m): 1.0156853566737638
RMSE evaluated by DoubleML (ml_l): 1.1309844442144665

The best RMSE during automated tuning and the out-of-sample error in nuisance prediction are similar, which suggests that the learners do not overfit. We would not expect severe overfitting here anyway, since FLAML uses cross-validation internally and reports the best loss on a hold-out sample.
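By default, evaluate_learners reports the root mean squared error of the nuisance predictions. A custom metric can be supplied as a callable; a small sketch following the pattern from the DoubleML documentation (mae is our own helper, not a DoubleML function):

from sklearn.metrics import mean_absolute_error

def mae(y_true, y_pred):
    # the stored prediction arrays may contain NaNs, which we drop before scoring
    subset = np.logical_not(np.isnan(y_true))
    return mean_absolute_error(y_true[subset], y_pred[subset])

print(obj_dml_plr_fullsample.evaluate_learners(metric=mae))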

Tuning on the Folds#

Instead of tuning the FLAML learners externally, it is also possible to tune the AutoML learners internally. To do so, we define custom classes that integrate FLAML into DoubleML. The tuning starts automatically when DoubleML's fit() method is called. Training then occurs \(K\) times, so each fold gets its own individually optimized set of hyperparameters.

Step 1: Custom API for FLAML Models within DoubleML#

The following API is designed to facilitate automated machine learning model tuning for both regression and classification tasks. In this example, however, we only need the regressor API, as the treatment is continuous.

[6]:
from sklearn.utils.multiclass import unique_labels

class FlamlRegressorDoubleML:
    """Wraps a FLAML AutoML regressor behind an sklearn-style interface for DoubleML."""
    _estimator_type = 'regressor'

    def __init__(self, time, estimator_list, metric, *args, **kwargs):
        self.auto_ml = AutoML(*args, **kwargs)
        self.time = time
        self.estimator_list = estimator_list
        self.metric = metric

    def set_params(self, **params):
        self.auto_ml.set_params(**params)
        return self

    def get_params(self, deep=True):
        # Expose the AutoML parameters plus our own constructor arguments,
        # so that sklearn-style cloning can recreate the wrapper on each fold.
        params = self.auto_ml.get_params(deep=deep)
        params["time"] = self.time
        params["estimator_list"] = self.estimator_list
        params["metric"] = self.metric
        return params

    def fit(self, X, y):
        # The tuning happens here, i.e. on the training split of each fold.
        self.auto_ml.fit(X, y, task="regression", time_budget=self.time,
                         estimator_list=self.estimator_list, metric=self.metric, verbose=False)
        self.tuned_model = self.auto_ml.model.estimator
        return self

    def predict(self, X):
        return self.tuned_model.predict(X)

class FlamlClassifierDoubleML:
    """Wraps a FLAML AutoML classifier behind an sklearn-style interface for DoubleML."""
    _estimator_type = 'classifier'

    def __init__(self, time, estimator_list, metric, *args, **kwargs):
        self.auto_ml = AutoML(*args, **kwargs)
        self.time = time
        self.estimator_list = estimator_list
        self.metric = metric

    def set_params(self, **params):
        self.auto_ml.set_params(**params)
        return self

    def get_params(self, deep=True):
        params = self.auto_ml.get_params(deep=deep)
        params["time"] = self.time
        params["estimator_list"] = self.estimator_list
        params["metric"] = self.metric
        return params

    def fit(self, X, y):
        # Store the class labels, as expected from sklearn-style classifiers.
        self.classes_ = unique_labels(y)
        self.auto_ml.fit(X, y, task="classification", time_budget=self.time,
                         estimator_list=self.estimator_list, metric=self.metric, verbose=False)
        self.tuned_model = self.auto_ml.model.estimator
        return self

    def predict_proba(self, X):
        return self.tuned_model.predict_proba(X)
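
Although the classifier wrapper is not used in this notebook, it would be employed analogously whenever a nuisance component is a classification task, e.g. for a binary treatment in an interactive regression model. A hypothetical sketch (the IRM example data and the 10-second budgets are illustrative choices, not part of the analysis above):

from doubleml.datasets import make_irm_data

# Simulate example data with a binary treatment
data_irm = make_irm_data(theta=0.5, n_obs=500, dim_x=20, return_type="DataFrame")
obj_dml_data_irm = dml.DoubleMLData(data_irm, "y", "d")

# The IRM model needs a regressor for E[Y|X, D] and a classifier for E[D|X]
ml_g_irm = FlamlRegressorDoubleML(time=10, estimator_list=['xgboost'], metric='rmse')
ml_m_irm = FlamlClassifierDoubleML(time=10, estimator_list=['xgboost'], metric='log_loss')
print(dml.DoubleMLIRM(obj_dml_data_irm, ml_g_irm, ml_m_irm).fit().summary)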

Step 2: Using the API when calling DoubleML’s .fit() Method#

We initialize a FlamlRegressorDoubleML and pass it, still unfitted, into the DoubleML object. When calling .fit() on the DoubleML object, a copy of the wrapper is created on each fold and a separate set of hyperparameters is tuned for it. Since we fit \(K=5\) times, we reduce the per-fold time budget to \(120/5=24\) seconds to ensure comparability with the full-sample case.

[7]:
# Define the FlamlRegressorDoubleML
ml_l = FlamlRegressorDoubleML(time=24, estimator_list=['xgboost'], metric='rmse')
ml_m = FlamlRegressorDoubleML(time=24, estimator_list=['xgboost'], metric='rmse')

# Create DoubleMLPLR object using the new regressors
dml_plr_obj_onfolds = dml.DoubleMLPLR(obj_dml_data, ml_m, ml_l)

# Fit the DoubleMLPLR model
print(dml_plr_obj_onfolds.fit(store_models=True).summary)
       coef   std err          t         P>|t|     2.5 %    97.5 %
d  0.502016  0.033265  15.091263  1.848688e-51  0.436817  0.567215
Since we fitted with store_models=True, the per-fold wrapper objects are stored and we can average the best losses found by FLAML on each fold:

[9]:
n_folds = dml_plr_obj_onfolds.n_folds
rmse_oos_onfolds_ml_l = np.mean([dml_plr_obj_onfolds.models["ml_l"]["d"][0][i].auto_ml.best_loss for i in range(n_folds)])
rmse_oos_onfolds_ml_m = np.mean([dml_plr_obj_onfolds.models["ml_m"]["d"][0][i].auto_ml.best_loss for i in range(n_folds)])
print("Best RMSE during tuning (ml_m):", rmse_oos_onfolds_ml_m)
print("Best RMSE during tuning (ml_l):", rmse_oos_onfolds_ml_l)

learner_eval_onfolds = dml_plr_obj_onfolds.evaluate_learners()
rmse_dml_ml_l_onfolds = learner_eval_onfolds['ml_l'][0][0]
rmse_dml_ml_m_onfolds = learner_eval_onfolds['ml_m'][0][0]

print("RMSE evaluated by DoubleML (ml_m):", rmse_dml_ml_m_onfolds)
print("RMSE evaluated by DoubleML (ml_l):", rmse_dml_ml_l_onfolds)
Best RMSE during tuning (ml_m): 1.0068101213851626
Best RMSE during tuning (ml_l): 1.1151610541568202
RMSE evaluated by DoubleML (ml_m): 1.0084871742256079
RMSE evaluated by DoubleML (ml_l): 1.1272404618426184

Similar to the full-sample case above, we see no signs of overfitting.

Comparison to AutoML with Less Computation Time and Untuned XGBoost Learners#

AutoML with Less Computation Time#

As a baseline, we compare the learners above, each tuned with a two-minute budget, to learners tuned for only ten seconds.

Note: These tuning times are examples. For this setting, we found 10 seconds to be insufficient and 120 seconds to be sufficient. In general, the necessary tuning time depends on data complexity, data set size, the computational power of the machine used, etc. For more information on how to use FLAML properly, please refer to its documentation and the accompanying paper.

[10]:
# Initialize AutoML for outcome model similar to above, but use a smaller time budget.
automl_l_lesstime = AutoML()
settings_l = {
    "time_budget": 10,
    "metric": 'rmse',
    "estimator_list": ['xgboost'],
    "task": 'regression',
}
automl_l_lesstime.fit(X_train=data.drop(columns=["y", "d"]).values, y_train=data["y"].values, verbose=2, **settings_l)

# Initialize AutoML for treatment model similar to above, but use a smaller time budget.
automl_m_lesstime = AutoML()
settings_m = {
    "time_budget": 10,
    "metric": 'rmse',
    "estimator_list": ['xgboost'],
    "task": 'regression',
}
automl_m_lesstime.fit(X_train=data.drop(columns=["y", "d"]).values, y_train=data["d"].values, verbose=2, **settings_m)
[11]:
obj_dml_plr_lesstime = dml.DoubleMLPLR(obj_dml_data, ml_m=automl_m_lesstime.model.estimator,
                                           ml_l=automl_l_lesstime.model.estimator)

print(obj_dml_plr_lesstime.fit().summary)
       coef   std err          t         P>|t|     2.5 %    97.5 %
d  0.436394  0.031007  14.073929  5.493102e-45  0.375621  0.497168

We can check the performance again.

[33]:
learner_eval_lesstime = obj_dml_plr_lesstime.evaluate_learners()
rmse_dml_ml_l_lesstime = learner_eval_lesstime['ml_l'][0][0]
rmse_dml_ml_m_lesstime = learner_eval_lesstime['ml_m'][0][0]


print("Best RMSE during tuning (ml_m):", automl_m_lesstime.best_loss)
print("Best RMSE during tuning (ml_l):", automl_l_lesstime.best_loss)
print("RMSE evaluated by DoubleML (ml_m):", rmse_dml_ml_m_lesstime)
print("RMSE evaluated by DoubleML (ml_l):", rmse_dml_ml_l_lesstime)
Best RMSE during tuning (ml_m): 0.9158080176561963
Best RMSE during tuning (ml_l): 1.2197237644227434
RMSE evaluated by DoubleML (ml_m): 1.0739130271918385
RMSE evaluated by DoubleML (ml_l): 1.1362430723104844

The difference between the out-of-sample RMSE reported during AutoML tuning and the one evaluated by DoubleML is now more pronounced. This could hint that the learners underfit, i.e. the training time was not sufficient.

Untuned (default parameter) XGBoost#

As another baseline, we set up DoubleML with an XGBoost learner that has not been tuned at all, i.e. using the default set of hyperparameters.

[12]:
xgb_untuned_m, xgb_untuned_l = XGBRegressor(), XGBRegressor()
[13]:
# Create DoubleMLPLR object using the untuned XGBoost learners
dml_plr_obj_untuned = dml.DoubleMLPLR(obj_dml_data, xgb_untuned_l, xgb_untuned_m)
print(dml_plr_obj_untuned.fit().summary)

learner_eval_untuned = dml_plr_obj_untuned.evaluate_learners()
rmse_dml_ml_l_untuned = learner_eval_untuned['ml_l'][0][0]
rmse_dml_ml_m_untuned = learner_eval_untuned['ml_m'][0][0]
       coef  std err          t         P>|t|     2.5 %    97.5 %
d  0.431253  0.03258  13.236884  5.373218e-40  0.367398  0.495108

Comparison and summary#

We combine the summaries from the various models: full-sample and on-the-folds tuned AutoML, untuned XGBoost, and AutoML with the reduced time budget.

[24]:
summary = pd.concat([obj_dml_plr_fullsample.summary, dml_plr_obj_onfolds.summary, dml_plr_obj_untuned.summary, obj_dml_plr_lesstime.summary],
                     keys=['Full Sample', 'On the folds', 'Default', 'Less time'])
summary.index.names = ['Model Type', 'Metric']

summary
[24]:
coef std err t P>|t| 2.5 % 97.5 %
Model Type Metric
Full Sample d 0.498286 0.032738 15.220407 2.589147e-52 0.434121 0.562452
On the folds d 0.502016 0.033265 15.091263 1.848688e-51 0.436817 0.567215
Default d 0.431253 0.032580 13.236884 5.373218e-40 0.367398 0.495108
Less time d 0.436394 0.031007 14.073929 5.493102e-45 0.375621 0.497168
Plot Coefficients and 95% Confidence Intervals#

This section generates a plot comparing the coefficients and 95% confidence intervals for each model type.

[25]:
# Extract model labels and coefficient values
model_labels = summary.index.get_level_values('Model Type')
coef_values = summary['coef'].values

# Calculate errors
errors = np.full((2, len(coef_values)), np.nan)
errors[0, :] = summary['coef'] - summary['2.5 %']
errors[1, :] = summary['97.5 %'] - summary['coef']

# Plot Coefficients and 95% Confidence Intervals
plt.figure(figsize=(10, 6))
plt.errorbar(model_labels, coef_values, fmt='o', yerr=errors, capsize=5)
plt.axhline(0.5, color='red', linestyle='--')
plt.xlabel('Model')
plt.ylabel('Coefficients and 95%-CI')
plt.title('Comparison of Coefficients and 95% Confidence Intervals')
plt.xticks(rotation=45)
plt.tight_layout()
plt.show()

../_images/examples_py_double_ml_meets_flaml_31_0.png
Compare Metrics for Nuisance Estimation#

In this section, we compare metrics for different models and plot a bar chart to visualize the differences in their performance.

[32]:
fig, axs = plt.subplots(1,2,figsize=(10,4))
axs = axs.flatten()

axs[0].bar(x = ['Full Sample', 'On the folds', 'Default', 'Less time'],
           height=[rmse_dml_ml_m_fullsample, rmse_dml_ml_m_onfolds, rmse_dml_ml_m_untuned, rmse_dml_ml_m_lesstime])

axs[1].bar(x = ['Full Sample', 'On the folds', 'Default', 'Less time'],
           height=[rmse_dml_ml_l_fullsample, rmse_dml_ml_l_onfolds, rmse_dml_ml_l_untuned, rmse_dml_ml_l_lesstime])

axs[0].set_xlabel("Tuning Method")
axs[0].set_ylim((1,1.12))
axs[0].set_ylabel("RMSE")
axs[0].set_title("OOS RMSE for Different Tuning Methods (ml_m)")

axs[1].set_xlabel("Tuning Method")
axs[1].set_ylim((1.1,1.22))
axs[1].set_ylabel("RMSE")
axs[1].set_title("OOS RMSE for Different Tuning Methods (ml_l)")

fig.suptitle("Out of Sample RMSE in Nuisance Estimation by Tuning Method")
fig.tight_layout()
../_images/examples_py_double_ml_meets_flaml_33_0.png

Conclusion#

This notebook highlights that hyperparameter tuning plays an important role and can easily be automated with FLAML. In our recent study we provide more evidence for tuning with AutoML; in particular, tuning on the folds performed similarly to tuning on the full sample in all investigated cases, so tuning time and complexity can be saved by tuning externally on the full sample.

See also our fully automated API for tuning DoubleML objects using AutoML, called `AutoDoubleML <https://github.com/OliverSchacht/AutoDoubleML>`__, which can be installed from GitHub for Python.
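
A minimal install sketch, assuming the package can be installed straight from the GitHub repository (the command is an assumption, not taken from the package documentation):

!pip install git+https://github.com/OliverSchacht/AutoDoubleML.git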