Description
Requests:
1. Parameter to specify control variables and to remove effects of these variables for prediction and calculation of model metrics
Control variables are sometimes used in our model builds in addition to an offset. These variables are not among the predictors we are modeling, but they have effects that we would like to control for in the model. We would therefore like to request a parameter that does the following:
When fitting the GLM, control variables are also fitted in the model, same way as a regular predictor.
After the model is fitted, when predicting with the model and calculating metrics, the control variables effects are removed.
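The two steps above can be illustrated with a minimal NumPy sketch. It is not H2O code: it assumes a Gaussian model with an identity link, and it assumes that "removing the effect" means dropping the control column's term from the linear predictor at scoring time. The helper names `fit_linear` and `predict` are hypothetical.

```python
import numpy as np

def fit_linear(X, y):
    """Least-squares fit with an intercept; control columns are fitted like any predictor."""
    A = np.column_stack([np.ones(len(X)), X])
    beta, *_ = np.linalg.lstsq(A, y, rcond=None)
    return beta[0], beta[1:]

def predict(X, intercept, coefs, control_idx=()):
    """Linear prediction that skips the contribution of the control columns (assumed semantics)."""
    keep = np.ones(len(coefs), dtype=bool)
    keep[np.asarray(control_idx, dtype=int)] = False
    return intercept + X[:, keep] @ coefs[keep]

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 3))  # columns 0 and 1 are predictors, column 2 is a control variable
y = 1.0 + 2.0 * X[:, 0] - 1.0 * X[:, 1] + 3.0 * X[:, 2] + rng.normal(scale=0.1, size=200)

intercept, coefs = fit_linear(X, y)                         # all three columns are fitted
unrestricted = predict(X, intercept, coefs)                 # control effect included
restricted = predict(X, intercept, coefs, control_idx=[2])  # control effect removed
```

In this sketch the two predictions differ by exactly the fitted control term, `coefs[2] * X[:, 2]`, while the coefficients of the other predictors are unchanged.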
2. Additional option to remove effects of the offset column for prediction and calculation of model metrics
H2O already has an option to specify an offset column during model fit. We would like to request an additional option to remove the offset effect during prediction and calculation of model metrics. To clarify, the offset will still be included during the model fit as it is today, but its effect will be removed during prediction and calculation of model metrics. It would be great if this could be a toggle that the user can turn on and off.
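As a rough illustration of the requested toggle, the sketch below fits a Gaussian, identity-link model with an offset and then predicts with or without it. It is plain NumPy, not the H2O implementation; `fit_with_offset` and `predict` are hypothetical helpers, and the keyword simply mirrors the toggle described above.

```python
import numpy as np

def fit_with_offset(X, y, offset):
    """Identity-link fit: the offset enters the linear predictor with a fixed coefficient of 1."""
    A = np.column_stack([np.ones(len(X)), X])
    beta, *_ = np.linalg.lstsq(A, y - offset, rcond=None)
    return beta

def predict(X, beta, offset, remove_offset_effect=False):
    """Return the linear predictor, optionally leaving the offset out (assumed semantics)."""
    eta = beta[0] + X @ beta[1:]
    return eta if remove_offset_effect else eta + offset

rng = np.random.default_rng(1)
X = rng.normal(size=(100, 2))
offset = rng.uniform(0.0, 5.0, size=100)
y = 0.5 + X @ np.array([1.5, -2.0]) + offset + rng.normal(scale=0.1, size=100)

beta = fit_with_offset(X, y, offset)  # the offset is still used during the fit
with_offset = predict(X, beta, offset)
without_offset = predict(X, beta, offset, remove_offset_effect=True)
```

For non-identity links (for example binomial with a logit link) the offset would be dropped from the linear predictor before the inverse link is applied; that case is not shown here.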
H2O.ai Devs only
https://support.h2o.ai/a/tickets/110095
The issue is implemented in these PRs:
- GH-16524 GLM - control variables - Regression, Binomial #16601
- GH-16524 GLM - control variables Multinomial #16646
- GLM - remove offset effects from prediction and metrics calculation - Gaussian, Binomial
- GLM - remove offset effects from prediction and metrics calculation - Multinomial
API design:
The user trains the model with control variables using the control_variables parameter, which lists the columns whose effects are excluded from metric calculation and prediction.
restricted_glm = H2OGeneralizedLinearEstimator(control_variables=[...])
restricted_glm.fit(..)
During training, scoring history is calculated with and without control variables. The header of the scoring history will look like this:
restricted_glm.scoring_history()
Scoring History:
timestamp | duration | iterations | Unrestricted negative_log_likelihood | Unrestricted objective | negative_log_likelihood | objective | Training RMSE | Training LogLoss | Training r2 | Training AUC | Training pr_auc | Training Lift | Training Classification Error | Validation RMSE | Validation LogLoss | Validation r2 | Validation AUC | Validation pr_auc | Validation Lift | Validation Classification Error | Unrestricted Training AUC | Unrestricted Validation AUC
We are including "Unrestricted" metrics because they are used for training and early stopping, so it is important to see their values in the scoring history.
Other metrics, such as restricted_glm.auc(), will return a value where control variables are excluded from the calculation.
When the user runs model.predict(), the model excludes control variables from the prediction, just as it does in metric calculation.
If the user wants to know unrestricted model metrics or predictions, the user can get the unrestricted model like this:
unrestricted_glm = restricted_glm.get_unrestricted_model()
And then the user can get unrestricted metrics like this:
unrestricted_glm.auc()
This approach is easy and user-friendly because:
- We keep the current metrics API
- We don’t break training and early stopping functionality
- We offer the unrestricted model, which is calculated anyway
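The restricted/unrestricted split described above can be pictured with a small toy class. This is a hypothetical stand-in, not an H2O class: only the method name `get_unrestricted_model` comes from the design above, and `ToyLinearModel`, its fields, and its row-wise `predict` are assumptions made for illustration.

```python
from dataclasses import dataclass, replace

@dataclass(frozen=True)
class ToyLinearModel:
    """Hypothetical stand-in for a fitted GLM; not part of the H2O API."""
    intercept: float
    coefs: dict                    # column name -> fitted coefficient
    control_variables: tuple = ()  # columns skipped at scoring time

    def predict(self, row):
        """Score one row (dict of column -> value), skipping control variables."""
        return self.intercept + sum(
            coef * row[name]
            for name, coef in self.coefs.items()
            if name not in self.control_variables
        )

    def get_unrestricted_model(self):
        """Same coefficients, but nothing is excluded at scoring time."""
        return replace(self, control_variables=())

restricted = ToyLinearModel(1.0, {"age": 0.5, "region": 2.0}, control_variables=("region",))
unrestricted = restricted.get_unrestricted_model()
row = {"age": 10.0, "region": 1.0}
```

Both objects share one set of fitted coefficients, which is why the unrestricted model costs nothing extra to provide: `restricted.predict(row)` skips the `region` term, while `unrestricted.predict(row)` includes it.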
To remove the offset effect from metric calculation and prediction, the user sets the remove_offset_effect parameter to True. The two parameters can be combined. The control variables will be implemented first, and the offset part will follow.
restricted_glm = H2OGeneralizedLinearEstimator(control_variables=[...], remove_offset_effect=True)  # set to False to keep the offset effect
restricted_glm.fit(..)
Why did we decide to change the API like this?
- In the previous proposal, we assumed we could save training information and calculate the restricted model values from it. That turned out to be harder than expected.
- It is more user-friendly to train a restricted model directly and, if the user wants an unrestricted model, to provide it on request.