Note: Download the Jupyter notebook from https://docs.doubleml.org/stable/examples/py_double_ml_meets_flaml.ipynb.
DoubleML meets FLAML - How to tune learners automatically within DoubleML#
Recent advances in automated machine learning (AutoML) make it easy to tune the hyperparameters of ML estimators automatically. These optimized learners can then be used for the estimation part within DoubleML. In this notebook, we explore how to tune learners with AutoML for the DoubleML framework.
This notebook will use FLAML, but there are also many other AutoML frameworks. Particularly useful for DoubleML are packages that provide some way to export the tuned model in sklearn-style. Examples are TPOT, auto-sklearn, H2O, or GAMA.
Data Generation#
We create synthetic data using the make_plr_CCDDHNR2018() data generating process, with \(1000\) observations of \(50\) covariates, \(1\) treatment variable, and an outcome. We calibrate the process such that hyperparameter tuning becomes more important.
[1]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from doubleml.datasets import make_plr_CCDDHNR2018
import doubleml as dml
from flaml import AutoML
from xgboost import XGBRegressor
# Generate synthetic data
data = make_plr_CCDDHNR2018(alpha=0.5, n_obs=1000, dim_x=50, return_type="DataFrame", a0=0, a1=1, s1=0.25, s2=0.25)
data.head()
[1]:
|   | X1 | X2 | X3 | X4 | X5 | X6 | X7 | X8 | X9 | X10 | ... | X43 | X44 | X45 | X46 | X47 | X48 | X49 | X50 | y | d |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1.065368 | 1.162593 | 1.089964 | 0.824657 | 0.157733 | -1.228404 | -0.675775 | -0.223928 | 0.166238 | 0.124480 | ... | -2.021823 | -1.662975 | -2.100385 | -1.225670 | -1.223158 | 0.397536 | -0.450031 | 0.511257 | 0.845534 | -0.784792 |
| 1 | 0.214458 | 1.699616 | 3.222882 | 3.550242 | 2.692460 | 1.821970 | 1.223617 | -0.100154 | -0.234431 | 0.375844 | ... | -0.695711 | -0.819507 | -1.465424 | -0.341472 | -0.023537 | 0.436016 | -0.503374 | -1.342632 | 1.987307 | 0.835035 |
| 2 | 0.725820 | -0.310145 | -0.586921 | -0.879058 | 0.239267 | 0.638461 | 0.131024 | 0.459436 | -1.140081 | -0.583692 | ... | -0.002388 | 0.716801 | 0.075942 | 1.439958 | 0.674747 | -0.268343 | 0.682122 | 0.978303 | 0.154890 | -0.168089 |
| 3 | 0.265744 | 0.479655 | 0.013313 | 1.417736 | 0.908767 | 1.786090 | 0.996892 | -0.026822 | -0.867201 | 0.433753 | ... | -0.482616 | -0.172628 | -0.309539 | -0.609522 | -0.830263 | -0.883953 | -1.249986 | -2.688641 | 1.254035 | 0.161288 |
| 4 | 1.581827 | 0.926901 | 2.302382 | 0.803112 | -0.152896 | -0.389164 | -0.569590 | -0.124306 | 0.055439 | -0.383531 | ... | 0.048220 | -0.698751 | -0.754678 | -0.689600 | 0.726658 | 0.780068 | 1.475517 | 0.777718 | 1.773769 | 1.786563 |

5 rows × 52 columns
Tuning on the Full Sample#
In this section, we manually tune two XGBoost models using FLAML for a Partially Linear Regression (PLR) model. In the PLR (using the default partialling-out score) we have to estimate a nuisance component \(\eta_0 = (\ell_0, m_0)\) consisting of the conditional expectations \(\ell_0(X) = \mathbb{E}[Y \mid X]\) and \(m_0(X) = \mathbb{E}[D \mid X]\).
We initialize two FLAML AutoML objects and fit them accordingly. Once the tuning has been completed, we pass the learners to DoubleML.
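For reference, the PLR model behind this example reads (this is the standard formulation from the DoubleML documentation, restated here for context):

\[Y = D \theta_0 + g_0(X) + \zeta, \quad \mathbb{E}[\zeta \mid D, X] = 0,\]
\[D = m_0(X) + V, \quad \mathbb{E}[V \mid X] = 0,\]

where \(\theta_0\) is the causal parameter of interest; the partialling-out score works with \(\ell_0(X) = \mathbb{E}[Y \mid X] = m_0(X)\theta_0 + g_0(X)\) in place of \(g_0\).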
Step 1: Initialize and Train the AutoML Models#
Note: This cell will optimize the nuisance models for 4 minutes in total (2 minutes per learner).
[2]:
# Initialize AutoML for outcome model (ml_l): Predict Y based on X
automl_l = AutoML()
settings_l = {
"time_budget": 120,
"metric": 'rmse',
"estimator_list": ['xgboost'],
"task": 'regression',
}
automl_l.fit(X_train=data.drop(columns=["y", "d"]).values, y_train=data["y"].values, verbose=2, **settings_l)
# Initialize AutoML for treatment model (ml_m): Predict D based on X
automl_m = AutoML()
settings_m = {
"time_budget": 120,
"metric": 'rmse',
"estimator_list": ['xgboost'],
"task": 'regression',
}
automl_m.fit(X_train=data.drop(columns=["y", "d"]).values, y_train=data["d"].values, verbose=2, **settings_m)
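Optionally, we can inspect what the search selected; best_estimator and best_config are documented FLAML attributes (this check is a small addition for illustration):

print("Best learner (ml_l):", automl_l.best_estimator)  # e.g. 'xgboost'
print("Best config (ml_l):", automl_l.best_config)  # tuned hyperparameters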
Step 2: Evaluate the Tuned Models#
FLAML reports the best loss during training via the best_loss attribute. For more details, we refer to the FLAML documentation.
[3]:
rmse_oos_ml_m = automl_m.best_loss
rmse_oos_ml_l = automl_l.best_loss
print("Best RMSE during tuning (ml_m):",rmse_oos_ml_m)
print("Best RMSE during tuning (ml_l):",rmse_oos_ml_l)
Best RMSE during tuning (ml_m): 1.0078540263583833
Best RMSE during tuning (ml_l): 1.1155142425200442
Step 3: Create and Fit DoubleML Model#
We create a DoubleMLData object with the dataset, specifying \(y\) as the outcome variable and \(d\) as the treatment variable. We then initialize a DoubleMLPLR model using the tuned FLAML estimators for both the treatment and outcome components. DoubleML will use copies with identical configurations on each fold.
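As a small illustrative sketch (not part of the original workflow): DoubleML creates those per-fold copies with sklearn's clone(), so the exported estimator only needs to support get_params()/set_params(), which the XGBRegressor returned by automl_l.model.estimator does.

from sklearn.base import clone
# An unfitted copy of the tuned XGBRegressor with identical hyperparameters
ml_l_copy = clone(automl_l.model.estimator)
print(ml_l_copy.get_params()["n_estimators"])  # same tuned value as the original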
[4]:
obj_dml_data = dml.DoubleMLData(data, "y", "d")
obj_dml_plr_fullsample = dml.DoubleMLPLR(obj_dml_data, ml_m=automl_m.model.estimator,
ml_l=automl_l.model.estimator)
print(obj_dml_plr_fullsample.fit().summary)
coef std err t P>|t| 2.5 % 97.5 %
d 0.498286 0.032738 15.220407 2.589147e-52 0.434121 0.562452
DoubleML's built-in learner evaluation reports the out-of-sample error during cross-fitting. We can compare this measure to the best loss during training from above.
[19]:
learner_rmses_fullsample = obj_dml_plr_fullsample.evaluate_learners()  # evaluate once, reuse below
rmse_dml_ml_l_fullsample = learner_rmses_fullsample['ml_l'][0][0]
rmse_dml_ml_m_fullsample = learner_rmses_fullsample['ml_m'][0][0]
print("RMSE evaluated by DoubleML (ml_m):", rmse_dml_ml_m_fullsample)
print("RMSE evaluated by DoubleML (ml_l):", rmse_dml_ml_l_fullsample)
RMSE evaluated by DoubleML (ml_m): 1.0156853566737638
RMSE evaluated by DoubleML (ml_l): 1.1309844442144665
The best RMSE during automated tuning and the out-of-sample error in nuisance prediction are similar, which hints that there is no overfitting; the sketch below makes the comparison concrete. We do not expect large amounts of overfitting anyway, since FLAML uses cross-validation internally and reports the best loss on a hold-out sample.
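Here is a minimal sketch (using only the variables defined above) that computes the relative gap between the two error measures; a large positive gap would hint at overfitting during the search:

rel_gap_ml_l = (rmse_dml_ml_l_fullsample - rmse_oos_ml_l) / rmse_oos_ml_l
rel_gap_ml_m = (rmse_dml_ml_m_fullsample - rmse_oos_ml_m) / rmse_oos_ml_m
print(f"Relative gap ml_l: {rel_gap_ml_l:.1%}, ml_m: {rel_gap_ml_m:.1%}")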
Tuning on the Folds#
Instead of externally tuning the FLAML learners, it is also possible to tune the AutoML learners internally. We have to define custom classes to integrate FLAML into DoubleML. The tuning will be started automatically when calling DoubleML's fit() method. Training will occur \(K\) times, so each fold will have an individually optimized set of hyperparameters.
Step 1: Custom API for FLAML Models within DoubleML#
The following API is designed to facilitate automated machine learning model tuning for both regression and classification tasks. In this example, however, we will only need the regressor API, as the treatment is continuous.
[6]:
from sklearn.utils.multiclass import unique_labels

class FlamlRegressorDoubleML:
    _estimator_type = 'regressor'

    def __init__(self, time, estimator_list, metric, *args, **kwargs):
        self.auto_ml = AutoML(*args, **kwargs)
        self.time = time
        self.estimator_list = estimator_list
        self.metric = metric

    def set_params(self, **params):
        self.auto_ml.set_params(**params)
        return self

    def get_params(self, deep=True):
        # Expose the AutoML parameters together with the wrapper's own settings.
        params = self.auto_ml.get_params(deep)
        params["time"] = self.time
        params["estimator_list"] = self.estimator_list
        params["metric"] = self.metric
        return params

    def fit(self, X, y):
        # Tuning runs here, i.e. on the training folds handed over by DoubleML.
        self.auto_ml.fit(X, y, task="regression", time_budget=self.time,
                         estimator_list=self.estimator_list, metric=self.metric, verbose=False)
        self.tuned_model = self.auto_ml.model.estimator
        return self

    def predict(self, X):
        return self.tuned_model.predict(X)

class FlamlClassifierDoubleML:
    _estimator_type = 'classifier'

    def __init__(self, time, estimator_list, metric, *args, **kwargs):
        self.auto_ml = AutoML(*args, **kwargs)
        self.time = time
        self.estimator_list = estimator_list
        self.metric = metric

    def set_params(self, **params):
        self.auto_ml.set_params(**params)
        return self

    def get_params(self, deep=True):
        params = self.auto_ml.get_params(deep)
        params["time"] = self.time
        params["estimator_list"] = self.estimator_list
        params["metric"] = self.metric
        return params

    def fit(self, X, y):
        self.classes_ = unique_labels(y)
        self.auto_ml.fit(X, y, task="classification", time_budget=self.time,
                         estimator_list=self.estimator_list, metric=self.metric, verbose=False)
        self.tuned_model = self.auto_ml.model.estimator
        return self

    def predict_proba(self, X):
        return self.tuned_model.predict_proba(X)
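As a quick illustration (a sketch with a deliberately tiny tuning budget, not part of the original workflow), the regressor wrapper behaves like any sklearn-style learner:

X_demo = data.drop(columns=["y", "d"]).values
demo_reg = FlamlRegressorDoubleML(time=5, estimator_list=['xgboost'], metric='rmse')
demo_reg.fit(X_demo, data["y"].values)  # tunes XGBoost for 5 seconds, then refits
print(demo_reg.predict(X_demo[:3]))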
Step 2: Using the API when calling DoubleML's .fit() Method#
We initialize a FlamlRegressorDoubleML for each nuisance and hand it, without fitting, into the DoubleML object. When calling .fit() on the DoubleML object, copies of the API object are created on the folds and a separate set of hyperparameters is tuned for each fold. Since we fit \(K\) times, we reduce the computation time accordingly to ensure comparability to the full-sample case.
[7]:
# Define the FlamlRegressorDoubleML learners (24s per fold x 5 folds = 120s per nuisance)
ml_l = FlamlRegressorDoubleML(time=24, estimator_list=['xgboost'], metric='rmse')
ml_m = FlamlRegressorDoubleML(time=24, estimator_list=['xgboost'], metric='rmse')
# Create DoubleMLPLR object using the new regressors
dml_plr_obj_onfolds = dml.DoubleMLPLR(obj_dml_data, ml_l=ml_l, ml_m=ml_m)
# Fit the DoubleMLPLR model
print(dml_plr_obj_onfolds.fit(store_models=True).summary)
coef std err t P>|t| 2.5 % 97.5 %
d 0.502016 0.033265 15.091263 1.848688e-51 0.436817 0.567215
[9]:
n_folds = dml_plr_obj_onfolds.n_folds
rmse_oos_onfolds_ml_l = np.mean([dml_plr_obj_onfolds.models["ml_l"]["d"][0][i].auto_ml.best_loss for i in range(n_folds)])
rmse_oos_onfolds_ml_m = np.mean([dml_plr_obj_onfolds.models["ml_m"]["d"][0][i].auto_ml.best_loss for i in range(n_folds)])
print("Best RMSE during tuning (ml_m):", rmse_oos_onfolds_ml_m)
print("Best RMSE during tuning (ml_l):", rmse_oos_onfolds_ml_l)
learner_rmses_onfolds = dml_plr_obj_onfolds.evaluate_learners()
rmse_dml_ml_l_onfolds = learner_rmses_onfolds['ml_l'][0][0]
rmse_dml_ml_m_onfolds = learner_rmses_onfolds['ml_m'][0][0]
print("RMSE evaluated by DoubleML (ml_m):", rmse_dml_ml_m_onfolds)
print("RMSE evaluated by DoubleML (ml_l):", rmse_dml_ml_l_onfolds)
Best RMSE during tuning (ml_m): 1.0068101213851626
Best RMSE during tuning (ml_l): 1.1151610541568202
RMSE evaluated by DoubleML (ml_m): 1.0084871742256079
RMSE evaluated by DoubleML (ml_l): 1.1272404618426184
Similar to the case above, we see no signs of overfitting.
Comparison to AutoML with Less Computation Time and Untuned XGBoost Learners#
AutoML with Less Computation Time#
As a baseline, we compare the learners above, which were tuned with two minutes of training time each, to learners that only get ten seconds.
Note: These tuning times are examples. For this setting, we found 10s to be insufficient and 120s to be sufficient. In general, the necessary tuning time can depend on data complexity, data set size, computational power of the machine used, etc. For more information on how to use FLAML properly, please refer to the documentation and the paper.
[10]:
# Initialize AutoML for outcome model similar to above, but use a smaller time budget.
automl_l_lesstime = AutoML()
settings_l = {
"time_budget": 10,
"metric": 'rmse',
"estimator_list": ['xgboost'],
"task": 'regression',
}
automl_l_lesstime.fit(X_train=data.drop(columns=["y", "d"]).values, y_train=data["y"].values, verbose=2, **settings_l)
# Initialize AutoML for treatment model similar to above, but use a smaller time budget.
automl_m_lesstime = AutoML()
settings_m = {
"time_budget": 10,
"metric": 'rmse',
"estimator_list": ['xgboost'],
"task": 'regression',
}
automl_m_lesstime.fit(X_train=data.drop(columns=["y", "d"]).values, y_train=data["d"].values, verbose=2, **settings_m)
[11]:
obj_dml_plr_lesstime = dml.DoubleMLPLR(obj_dml_data, ml_m=automl_m_lesstime.model.estimator,
ml_l=automl_l_lesstime.model.estimator)
print(obj_dml_plr_lesstime.fit().summary)
coef std err t P>|t| 2.5 % 97.5 %
d 0.436394 0.031007 14.073929 5.493102e-45 0.375621 0.497168
We can check the performance again.
[33]:
learner_rmses_lesstime = obj_dml_plr_lesstime.evaluate_learners()
rmse_dml_ml_l_lesstime = learner_rmses_lesstime['ml_l'][0][0]
rmse_dml_ml_m_lesstime = learner_rmses_lesstime['ml_m'][0][0]
print("Best RMSE during tuning (ml_m):", automl_m_lesstime.best_loss)
print("Best RMSE during tuning (ml_l):", automl_l_lesstime.best_loss)
print("RMSE evaluated by DoubleML (ml_m):", rmse_dml_ml_m_lesstime)
print("RMSE evaluated by DoubleML (ml_l):", rmse_dml_ml_l_lesstime)
Best RMSE during tuning (ml_m): 0.9158080176561963
Best RMSE during tuning (ml_l): 1.2197237644227434
RMSE evaluated by DoubleML (ml_m): 1.0739130271918385
RMSE evaluated by DoubleML (ml_l): 1.1362430723104844
We see a more pronounced difference in out-of-sample RMSE between the AutoML tuning estimate and the DoubleML evaluation. This could hint that the learner underfits, i.e., the training time was not sufficient.
Untuned (default parameter) XGBoost#
As another baseline, we set up DoubleML with an XGBoost learner that has not been tuned at all, i.e. using the default set of hyperparameters.
[12]:
xgb_untuned_m, xgb_untuned_l = XGBRegressor(), XGBRegressor()
[13]:
# Create DoubleMLPLR object using the untuned XGBoost learners
dml_plr_obj_untuned = dml.DoubleMLPLR(obj_dml_data, ml_l=xgb_untuned_l, ml_m=xgb_untuned_m)
print(dml_plr_obj_untuned.fit().summary)
learner_rmses_untuned = dml_plr_obj_untuned.evaluate_learners()
rmse_dml_ml_l_untuned = learner_rmses_untuned['ml_l'][0][0]
rmse_dml_ml_m_untuned = learner_rmses_untuned['ml_m'][0][0]
coef std err t P>|t| 2.5 % 97.5 %
d 0.431253 0.03258 13.236884 5.373218e-40 0.367398 0.495108
Comparison and Summary#
We combine the summaries of the various models: full-sample and on-the-folds tuned AutoML, untuned XGBoost (default parameters), and reduced-time AutoML.
[24]:
summary = pd.concat([obj_dml_plr_fullsample.summary, dml_plr_obj_onfolds.summary, dml_plr_obj_untuned.summary, obj_dml_plr_lesstime.summary],
keys=['Full Sample', 'On the folds', 'Default', 'Less time'])
summary.index.names = ['Model Type', 'Metric']
summary
[24]:
| Model Type | Metric | coef | std err | t | P>\|t\| | 2.5 % | 97.5 % |
|---|---|---|---|---|---|---|---|
| Full Sample | d | 0.498286 | 0.032738 | 15.220407 | 2.589147e-52 | 0.434121 | 0.562452 |
| On the folds | d | 0.502016 | 0.033265 | 15.091263 | 1.848688e-51 | 0.436817 | 0.567215 |
| Default | d | 0.431253 | 0.032580 | 13.236884 | 5.373218e-40 | 0.367398 | 0.495108 |
| Less time | d | 0.436394 | 0.031007 | 14.073929 | 5.493102e-45 | 0.375621 | 0.497168 |
Plot Coefficients and 95% Confidence Intervals#
This section generates a plot comparing the coefficients and 95% confidence intervals for each model type.
[25]:
# Extract model labels and coefficient values
model_labels = summary.index.get_level_values('Model Type')
coef_values = summary['coef'].values
# Calculate errors
errors = np.full((2, len(coef_values)), np.nan)
errors[0, :] = summary['coef'] - summary['2.5 %']
errors[1, :] = summary['97.5 %'] - summary['coef']
# Plot Coefficients and 95% Confidence Intervals
plt.figure(figsize=(10, 6))
plt.errorbar(model_labels, coef_values, fmt='o', yerr=errors, capsize=5)
plt.axhline(0.5, color='red', linestyle='--')
plt.xlabel('Model')
plt.ylabel('Coefficients and 95%-CI')
plt.title('Comparison of Coefficients and 95% Confidence Intervals')
plt.xticks(rotation=45)
plt.tight_layout()
plt.show()
Compare Metrics for Nuisance Estimation#
In this section, we compare metrics for different models and plot a bar chart to visualize the differences in their performance.
[32]:
fig, axs = plt.subplots(1,2,figsize=(10,4))
axs = axs.flatten()
axs[0].bar(x = ['Full Sample', 'On the folds', 'Default', 'Less time'],
height=[rmse_dml_ml_m_fullsample, rmse_dml_ml_m_onfolds, rmse_dml_ml_m_untuned, rmse_dml_ml_m_lesstime])
axs[1].bar(x = ['Full Sample', 'On the folds', 'Default', 'Less time'],
height=[rmse_dml_ml_l_fullsample, rmse_dml_ml_l_onfolds, rmse_dml_ml_l_untuned, rmse_dml_ml_l_lesstime])
axs[0].set_xlabel("Tuning Method")
axs[0].set_ylim((1,1.12))
axs[0].set_ylabel("RMSE")
axs[0].set_title("OOS RMSE for Different Tuning Methods (ml_m)")
axs[1].set_xlabel("Tuning Method")
axs[1].set_ylim((1.1,1.22))
axs[1].set_ylabel("RMSE")
axs[1].set_title("OOS RMSE for Different Tuning Methods (ml_l)")
fig.suptitle("Out of Sample RMSE in Nuisance Estimation by Tuning Method")
fig.tight_layout()
Conclusion#
This notebook highlights that tuning plays an important role and can easily be done using FLAML AutoML. In our recent study we provide more evidence for tuning with AutoML; in particular, tuning on the folds performed similarly to tuning on the full sample in all investigated cases, so tuning time and complexity can be saved by tuning externally on the full sample.
See also our fully automated API for tuning DoubleML objects using AutoML, called AutoDoubleML, which can be installed from GitHub for Python.