Controls and Offset variables in GLM #16524

@arunaryasomayajula

Requests:

1. Parameter to specify control variables and to remove effects of these variables for prediction and calculation of model metrics

Control variables are sometimes used in our model builds in addition to an offset. These variables are not among the predictors we are modeling, but they have effects that we would like to control for in the model. Thus, we would like to request a parameter that does the following:
When fitting the GLM, control variables are fitted in the model in the same way as regular predictors.
After the model is fitted, the effects of the control variables are removed when predicting with the model and calculating metrics.
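
The request above can be illustrated with a minimal sketch in plain Python. It assumes that "removing the effect" of a control variable means dropping its term from the linear predictor at scoring time; the issue does not spell out the exact mechanism, so this is one reading and not H2O's implementation. The coefficients, column names, and helper functions are hypothetical.

```python
import math

def linear_predictor(coefs, intercept, row, exclude=()):
    """Return intercept + sum(coef * value), skipping columns listed in `exclude`."""
    return intercept + sum(
        beta * row[name] for name, beta in coefs.items() if name not in exclude
    )

def sigmoid(eta):
    """Inverse logit link for a binomial GLM."""
    return 1.0 / (1.0 + math.exp(-eta))

# Hypothetical coefficients from a fit that included every column.
# "region_score" plays the role of a control variable (assumption for illustration).
coefs = {"age": 0.5, "income": -0.25, "region_score": 2.0}
intercept = -1.0
row = {"age": 2.0, "income": 4.0, "region_score": 1.0}

# Unrestricted: all fitted terms contribute to the prediction.
unrestricted_eta = linear_predictor(coefs, intercept, row)
# Restricted: the control variable's term is left out at scoring time.
restricted_eta = linear_predictor(coefs, intercept, row, exclude={"region_score"})

unrestricted_p = sigmoid(unrestricted_eta)
restricted_p = sigmoid(restricted_eta)
```

The fitted coefficients are the same in both cases; only the terms used at scoring time differ.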

2. Additional option to remove effects of the offset column for prediction and calculation of model metrics

H2O already has an option to specify an offset column during model fit. We would like to request an additional option to remove the offset effect during prediction and calculation of model metrics. To clarify, the offset will still be included during the model fit as it is today, but its effect will be removed during prediction and calculation of model metrics. It would be great if this could be a toggle that the user can turn on and off.
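
As a rough sketch of what such a toggle could mean, consider a Poisson GLM with a log link where the offset is log(exposure). The function below is hypothetical and only mirrors the requested behaviour under the assumption that removing the offset effect means leaving the offset term out of the linear predictor at scoring time; the coefficients and column name are made up.

```python
import math

def poisson_mean(coefs, intercept, row, offset=0.0, remove_offset_effect=False):
    """Predicted mean of a log-link Poisson GLM, optionally ignoring the offset."""
    eta = intercept + sum(beta * row[name] for name, beta in coefs.items())
    if not remove_offset_effect:
        eta += offset  # the offset enters the linear predictor with a fixed coefficient of 1
    return math.exp(eta)

# Hypothetical fitted values; the offset was used during the fit in both cases.
coefs = {"vehicle_age": 0.1}
intercept = -2.0
row = {"vehicle_age": 5.0}
exposure = 0.5
offset = math.log(exposure)

with_offset = poisson_mean(coefs, intercept, row, offset)
without_offset = poisson_mean(coefs, intercept, row, offset, remove_offset_effect=True)
```

With a log link the two predictions differ by exactly the exposure factor, so `with_offset` equals `exposure * without_offset` up to floating-point error.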

H2O.ai Devs only
https://support.h2o.ai/a/tickets/110095

Issue is implemented in this PR:

API design:

The user trains the model with control variables using the control_variables parameter, which lists the columns that are excluded from metric calculation and prediction.

restricted_glm = H2OGeneralizedLinearEstimator(control_variables=[...])

restricted_glm.fit(..)

During training, scoring history is calculated with and without control variables. The header of the scoring history will look like this:

restricted_glm.scoring_history()

Scoring History:

            timestamp    duration  iterations  Unrestricted negative_log_likelihood  Unrestricted objective  negative_log_likelihood  objective  Training RMSE  Training LogLoss  Training r2  Training AUC  Training pr_auc  Training Lift  Training Classification Error Validation RMSE  Validation LogLoss  Validation r2  Validation AUC  Validation pr_auc  Validation Lift  Validation Classification Error Unrestricted Training AUC Unrestricted Validation AUC

We are including "Unrestricted" metrics because they are used for training and early stopping, so it is important to see their values in the scoring history.

Other metrics like restricted_glm.auc() will return a value where control variables are excluded from the calculation.

When the user runs model.predict(), the model excludes control variables from the prediction, just as it does in metric calculation.

If the user wants to know unrestricted model metrics or predictions, the user can get the unrestricted model like this:

unrestricted_glm = restricted_glm.get_unrestricted_model()

And then the user can get unrestricted metrics like this:

unrestricted_glm.auc()
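
To see why the restricted and unrestricted models can report different AUC values, here is a small self-contained sketch. It does not call H2O; the `auc` and `score` helpers, the coefficients, and the toy rows are all hypothetical, and the restricted score simply drops the control variable's term as an assumption about what "excluded from the calculation" means.

```python
def auc(labels, scores):
    """Probability that a random positive outranks a random negative (ties count half)."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

# Hypothetical fitted coefficients; "control" stands in for a control variable.
coefs = {"x": 1.0, "control": 3.0}
rows = [
    {"x": 0.2, "control": 1.0},
    {"x": 0.9, "control": 0.0},
    {"x": 0.4, "control": 1.0},
    {"x": 0.1, "control": 0.0},
]
labels = [1, 0, 1, 0]

def score(row, exclude=()):
    """Linear score; the logistic link is monotonic, so ranking (and AUC) is unchanged by it."""
    return sum(b * row[k] for k, b in coefs.items() if k not in exclude)

unrestricted_auc = auc(labels, [score(r) for r in rows])
restricted_auc = auc(labels, [score(r, exclude={"control"}) for r in rows])
```

In this toy data the control variable carries most of the signal, so the unrestricted AUC is 1.0 while the restricted AUC drops to 0.5.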

This approach is easy and user-friendly because:

  1. We keep the current metrics API

  2. We don’t break training and early stopping functionality

  3. We offer the unrestricted model, which is calculated anyway

To remove the offset effect from metric calculation and prediction, the user sets the remove_offset_effect parameter to True. The two parameters can be combined. The control variables will be implemented first, and then I will implement the offset part.

restricted_glm = H2OGeneralizedLinearEstimator(control_variables=[...], remove_offset_effect=True/False)

restricted_glm.fit(..)

Why did we decide to change the API like this?

  1. In the previous suggestion, we thought we could save training information and calculate the restricted model values from it. That turned out not to be so easy.

  2. It looks more user-friendly to directly train a restricted model, and in case the user wants an unrestricted model, we can provide it to them.
