Controls and Offset variables in GLM

Requests:

**1. Parameter to specify control variables and remove their effects from predictions and model metrics**

Our model builds sometimes use control variables in addition to an offset. These variables are not among the predictors we are modeling, but they have effects that we want to control for in the model. We would therefore like to request a parameter that does the following:

- When the GLM is fitted, control variables are fitted in the model in the same way as regular predictors.
- After the model is fitted, the control variables' effects are removed when predicting with the model and when calculating metrics.
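One possible reading of "removing the control variables' effects" is to drop their terms from the linear predictor at scoring time while keeping every fitted coefficient unchanged. The sketch below illustrates that reading for a binomial model in plain Python. The coefficient values, the column names, and the `linear_predictor` helper are hypothetical and are not part of H2O; the actual implementation may differ.

```python
import math

def linear_predictor(row, coefs, intercept, exclude=()):
    """Return intercept + sum(coef * value), skipping columns named in `exclude`."""
    return intercept + sum(
        beta * row[name] for name, beta in coefs.items() if name not in exclude
    )

def sigmoid(eta):
    """Inverse logit link for a binomial GLM."""
    return 1.0 / (1.0 + math.exp(-eta))

# Hypothetical fitted coefficients; "region_score" plays the control variable.
coefs = {"age": 0.5, "income": -0.25, "region_score": 2.0}
intercept = -1.0
row = {"age": 2.0, "income": 4.0, "region_score": 1.0}

# Unrestricted: every fitted term contributes.
eta_full = linear_predictor(row, coefs, intercept)
# Restricted: the control variable's term is left out (assumed interpretation).
eta_restricted = linear_predictor(row, coefs, intercept, exclude={"region_score"})

p_full = sigmoid(eta_full)
p_restricted = sigmoid(eta_restricted)
```

The other coefficients are not refitted here: they still come from the model that included the control variable, which is what distinguishes this from simply training without that column.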

 
**2. Additional option to remove the effect of the offset column from predictions and model metrics**

H2O already has an option to specify an offset column during model fit. We would like to request an additional option to remove the offset effect during prediction and calculation of model metrics. To clarify, the offset will still be included during the model fit as it is today, but its effect will be removed from predictions and model metrics. It would be great if this could be a toggle that the user can turn on and off.
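As a rough illustration of the requested toggle, the sketch below scores a single-predictor binomial model with and without the offset added to the linear predictor. It assumes that "removing the offset effect" means omitting the offset term at scoring time. The `predict_binomial` function and all numbers are hypothetical; only the flag name `remove_offset_effect` comes from this issue.

```python
import math

def predict_binomial(x, beta, intercept, offset, remove_offset_effect=False):
    """Score one row; the offset enters the linear predictor with a fixed coefficient of 1."""
    eta = intercept + beta * x
    if not remove_offset_effect:
        # Current behavior: the offset is part of the linear predictor.
        eta += offset
    return 1.0 / (1.0 + math.exp(-eta))

# Hypothetical fitted values; the fit itself would always use the offset.
p_with_offset = predict_binomial(x=1.0, beta=0.5, intercept=-0.5, offset=2.0)
p_without_offset = predict_binomial(
    x=1.0, beta=0.5, intercept=-0.5, offset=2.0, remove_offset_effect=True
)
```

With these toy numbers the linear predictor is 2.0 with the offset and 0.0 without it, so the second probability is exactly 0.5.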

**H2O.ai Devs only**
https://support.h2o.ai/a/tickets/110095

This issue is implemented in the following PRs:
- [ ] #16601 
- [ ] #16646
- [ ] GLM - remove offset effects from prediction and metrics calculation - Gaussian, Binomial 
- [ ] GLM - remove offset effects from prediction and metrics calculation - Multinomial


**API design:**

The user trains the model with the `control_variables` parameter, which lists the columns whose effects are excluded from metric calculation and prediction.

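```
from h2o.estimators import H2OGeneralizedLinearEstimator

# control_variables is the parameter proposed in this issue
restricted_glm = H2OGeneralizedLinearEstimator(control_variables=[...])

restricted_glm.fit(...)
```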

During training, scoring history is calculated with and without control variables. The header of the scoring history will look like this:

`restricted_glm.scoring_history()`

```
Scoring History:

            timestamp  duration  iterations  Unrestricted negative_log_likelihood  Unrestricted objective  negative_log_likelihood  objective  Training RMSE  Training LogLoss  Training r2  Training AUC  Training pr_auc  Training Lift  Training Classification Error  Validation RMSE  Validation LogLoss  Validation r2  Validation AUC  Validation pr_auc  Validation Lift  Validation Classification Error  Unrestricted Training AUC  Unrestricted Validation AUC
```

We are including "Unrestricted" metrics because they are used for training and early stopping, so it is important to see their values in the scoring history.  

Other metrics like `restricted_glm.auc()` will return a value where control variables are excluded from the calculation. 

When the user runs `restricted_glm.predict()`, the model excludes control variables from the prediction, just as it does for metric calculation.
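To make the difference between the two sets of metrics concrete, the sketch below computes log loss on a two-row toy dataset, once with all terms and once with the control variable's term left out. It assumes that exclusion means dropping the term from the linear predictor. The data, coefficients, and helper functions are hypothetical and do not reflect H2O internals.

```python
import math

def sigmoid(eta):
    return 1.0 / (1.0 + math.exp(-eta))

def predict(rows, coefs, intercept, control_variables=()):
    """Binomial probabilities, skipping columns listed in control_variables."""
    return [
        sigmoid(intercept + sum(b * row[n] for n, b in coefs.items()
                                if n not in control_variables))
        for row in rows
    ]

def log_loss(labels, probs):
    """Mean negative log-likelihood for 0/1 labels."""
    return -sum(
        y * math.log(p) + (1 - y) * math.log(1 - p)
        for y, p in zip(labels, probs)
    ) / len(labels)

# Hypothetical data and fitted coefficients; "c" is the control variable.
rows = [{"x": 1.0, "c": 1.0}, {"x": -1.0, "c": -1.0}]
labels = [1, 0]
coefs = {"x": 1.0, "c": 1.0}

# Unrestricted metric: what training and early stopping would see.
unrestricted_ll = log_loss(labels, predict(rows, coefs, 0.0))
# Restricted metric: what the regular metric accessors would report.
restricted_ll = log_loss(labels, predict(rows, coefs, 0.0, control_variables=("c",)))
```

On this toy data the restricted log loss is higher (roughly 0.31 versus 0.13), because the control variable carried part of the signal. On real data the two values can differ in either direction.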

To obtain unrestricted metrics or predictions, the user can retrieve the unrestricted model:

`unrestricted_glm = restricted_glm.get_unrestricted_model()`

The unrestricted metrics are then available from that model:

`unrestricted_glm.auc()`
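The relationship between the two models can be sketched with a toy class: both share the same fitted coefficients, and the unrestricted one simply has no excluded columns. `ToyGLM` is a hypothetical stand-in written for illustration, not the H2O class. Only the method name `get_unrestricted_model` is taken from the design above.

```python
from dataclasses import dataclass

@dataclass
class ToyGLM:
    """Hypothetical stand-in for a fitted GLM; not the H2O implementation."""
    coefs: dict
    intercept: float
    control_variables: tuple = ()

    def linear_predictor(self, row):
        # Terms of control variables are skipped (assumed meaning of "restricted").
        return self.intercept + sum(
            beta * row[name]
            for name, beta in self.coefs.items()
            if name not in self.control_variables
        )

    def get_unrestricted_model(self):
        # Same coefficients, nothing excluded: no refit is needed.
        return ToyGLM(self.coefs, self.intercept, control_variables=())

restricted = ToyGLM({"x": 2.0, "c": 3.0}, intercept=1.0, control_variables=("c",))
unrestricted = restricted.get_unrestricted_model()

row = {"x": 1.0, "c": 1.0}
eta_restricted = restricted.linear_predictor(row)
eta_unrestricted = unrestricted.linear_predictor(row)
```

Because the unrestricted model reuses the fitted coefficients, producing it costs nothing extra, which matches the point that it "is calculated anyway".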

**This approach is easy and user-friendly because:**

1) We keep the current metrics API

2) We don’t break training and early stopping functionality

3) We offer the unrestricted model, which is calculated anyway

To remove the offset effect from metric calculation and prediction, the user sets the `remove_offset_effect` parameter to `True`. The two parameters can be combined. Control variables will be implemented first, followed by the offset part.

```
restricted_glm = H2OGeneralizedLinearEstimator(control_variables=[...], remove_offset_effect=True)  # or False

restricted_glm.fit(...)
```

**Why did we decide to change the API like this?**

1) In the previous suggestion, we thought we could save training information and use it afterwards to calculate the restricted model values, but this turned out not to be so easy.

2) It is more user-friendly to train a restricted model directly and to provide the unrestricted model if the user wants it.

Controls and Offset variables in GLM #16524
