
Commit 8070ffa

improve EBM default parameters
1 parent b17f9b5 commit 8070ffa

3 files changed: +52 -50 lines changed
(README.md, docs/interpret/hyperparameters.md, python/interpret-core/interpret/glassbox/_ebm/_ebm.py)

README.md

Lines changed: 2 additions & 0 deletions

@@ -623,6 +623,7 @@ We also build on top of many great packages. Please check them out!
 # Papers that use or compare EBMs
 
 - [Challenging the Performance-Interpretability Trade-off: An Evaluation of Interpretable Machine Learning Models](https://arxiv.org/pdf/2409.14429)
+- [GAMFORMER: In-context Learning for Generalized Additive Models](https://arxiv.org/pdf/2410.04560v1)
 - [Data Science with LLMs and Interpretable Models](https://arxiv.org/pdf/2402.14474v1.pdf)
 - [DimVis: Interpreting Visual Clusters in Dimensionality Reduction With Explainable Boosting Machine](https://arxiv.org/pdf/2402.06885.pdf)
 - [Distill knowledge of additive tree models into generalized linear models](https://detralytics.com/wp-content/uploads/2023/10/Detra-Note_Additive-tree-ensembles.pdf)
@@ -688,6 +689,7 @@ We also build on top of many great packages. Please check them out!
 - [Explainable Boosting Machines for Slope Failure Spatial Predictive Modeling](https://www.mdpi.com/2072-4292/13/24/4991/htm)
 - [Micromodels for Efficient, Explainable, and Reusable Systems: A Case Study on Mental Health](https://arxiv.org/pdf/2109.13770.pdf)
 - [Identifying main and interaction effects of risk factors to predict intensive care admission in patients hospitalized with COVID-19](https://www.medrxiv.org/content/10.1101/2020.06.30.20143651v1.full.pdf)
+- [Leveraging interpretable machine learning in intensive care](https://link.springer.com/article/10.1007/s10479-024-06226-8#Tab10)
 - [Development of prediction models for one-year brain tumour survival using machine learning: a comparison of accuracy and interpretability](https://www.pure.ed.ac.uk/ws/portalfiles/portal/343114800/1_s2.0_S0169260723001487_main.pdf)
 - [Using Interpretable Machine Learning to Predict Maternal and Fetal Outcomes](https://arxiv.org/pdf/2207.05322.pdf)
 - [Calibrate: Interactive Analysis of Probabilistic Model Output](https://arxiv.org/pdf/2207.13770.pdf)

docs/interpret/hyperparameters.md

Lines changed: 21 additions & 21 deletions
@@ -6,12 +6,19 @@ The parameters below are ordered by tuning importance, with the most important h
 
 
 ## smoothing_rounds
-default: 200
+default: 100
 
-hyperparameters: [0, 50, 100, 200, 500, 1000, 2000, 4000]
+hyperparameters: [0, 50, 100, 200, 500, 1000]
 
 guidance: This is an important hyperparameter to tune. The optimal smoothing_rounds value will vary depending on the dataset's characteristics. Adjust based on the prevalence of smooth feature response curves.
 
+## learning_rate
+default: 0.01
+
+hyperparameters: [0.2, 0.1, 0.05, 0.025, 0.01, 0.005, 0.0025]
+
+guidance: This is an important hyperparameter to tune. The conventional wisdom is that a lower learning rate is generally better, but we have found the relationship to be more complex. In general, regression seems to prefer a higher learning rate, binary classification seems to prefer a lower learning rate, and multiclass is in-between.
+
 ## interactions
 default: 0.9
 
@@ -45,18 +52,18 @@ hyperparameters: [8, 16, 32, 64, 128, 256]
 guidance: For max_interaction_bins, more is not necessarily better, unlike with max_bins. A good value on many datasets seems to be 32, but it's worth trying higher and lower values.
 
 ## greedy_ratio
-default: 1.5
+default: 12.0
 
-hyperparameters: [0.0, 0.25, 0.5, 0.75, 1.0, 1.25, 1.5, 1.75, 2.0, 4.0]
+hyperparameters: [0.0, 1.0, 2.0, 5.0, 12.0, 20.0]
 
 guidance: greedy_ratio is a good candidate for hyperparameter tuning as the best value is dataset dependent.
 
 ## cyclic_progress
-default: 1.0
+default: 0.0
 
-hyperparameters: [0.0, 0.5, 1.0]
+hyperparameters: [0.0, 1.0]
 
-guidance: cyclic_progress is a good candidate for hyperparameter tuning as the best value is dataset dependent.
+guidance: Try both.
 
 ## outer_bags
 default: 14
@@ -74,31 +81,24 @@ hyperparameters: [0, 50, 100, 500]
 
 guidance: interaction_smoothing_rounds appears to have only a minor impact on model accuracy. 0 is often the most accurate choice, but the interaction shape plots will be smoother and easier to interpret with more interaction_smoothing_rounds.
 
-## learning_rate
-default: 0.01
-
-hyperparameters: [0.1, 0.025, 0.01, 0.005, 0.0025]
-
-guidance: A smaller learning_rate promotes finer model adjustments during fitting, but may require more iterations. Generally, we believe a smaller learning_rate should improve the model, but sometimes hyperparameter tuning seems to be needed to select the best value.
-
 ## max_leaves
-default: 3
+default: 2
 
 hyperparameters: [2, 3, 4]
 
-guidance: Generally, the default setting is effective, but it's worth checking if changing to either 2 or 4 can offer better accuracy on your specific data. The max_leaves parameter only applies to main effects.
+guidance: Generally, the default setting is effective, but it's worth checking if changing to either 3 or 4 can offer better accuracy on your specific data. The max_leaves parameter only applies to main effects.
 
 ## min_samples_leaf
-default: 2
+default: 4
 
-hyperparameters: [2, 3, 4]
+hyperparameters: [2, 3, 4, 5, 6]
 
 guidance: The default value usually works well, however experimenting with slightly higher values could potentially enhance generalization on certain datasets.
 
 ## min_hessian
-default: 0.0001
+default: 1e-5
 
-hyperparameters: [0.1, 0.01, 0.001, 0.0001, 0.00001, 0.000001]
+hyperparameters: [1e-2, 1e-3, 1e-4, 1e-5, 1e-6, 1e-7, 1e-8]
 
 guidance: The default min_hessian is a solid starting point.
 
@@ -112,7 +112,7 @@ hyperparameters: [1000000000]
 guidance: The max_rounds parameter serves as a limit to prevent excessive training on datasets where improvements taper off. Set this parameter sufficiently high to avoid premature early stopping. Consider increasing it if small yet consistent gains are observed in longer trainings.
 
 ## early_stopping_rounds
-default: 50
+default: 100
 
 guidance: We typically do not advise changing early_stopping_rounds. The default is appropriate for most cases, adequately capturing the optimal model without incurring unnecessary computational costs.
 
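The guidance above maps directly onto a search grid. Below is a minimal tuning sketch, not part of the commit: it assumes an interpret build containing these defaults, the dataset and CV settings are illustrative placeholders, and the grids are trimmed versions of the ranges documented above. EBMs follow the scikit-learn estimator API, so GridSearchCV can drive them directly.

```python
# Minimal tuning sketch based on the updated hyperparameters.md guidance.
# Assumptions: an interpret build with this commit's defaults; the dataset
# and CV settings are illustrative, and the grids are trimmed for brevity.
from interpret.glassbox import ExplainableBoostingClassifier
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import GridSearchCV

X, y = load_breast_cancer(return_X_y=True)

param_grid = {
    "smoothing_rounds": [50, 100, 500],     # most important knob to tune
    "learning_rate": [0.05, 0.01, 0.0025],  # second most important
    "cyclic_progress": [0.0, 1.0],          # per the guidance: "Try both."
}

search = GridSearchCV(
    ExplainableBoostingClassifier(),  # unlisted parameters keep the new defaults
    param_grid,
    cv=3,
    n_jobs=-1,
)
search.fit(X, y)
print(search.best_params_)
```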

python/interpret-core/interpret/glassbox/_ebm/_ebm.py

Lines changed: 29 additions & 29 deletions
@@ -396,7 +396,7 @@ def fit(self, X, y, sample_weight=None, bags=None, init_score=None):
         # with 64 bytes per tensor cell, a 2^20 tensor would be 1/16 gigabyte.
         max_cardinality = 1048576
         nominal_smoothing = True
-        # In the future we might replace min_samples_leaf=2 with min_samples_bin=3 so
+        # In the future we might replace min_samples_leaf=4 with min_samples_bin=3 so
         # that we don't need to have the count when boosting or for interaction
         # detection. Benchmarking indicates switching these would decrease the accuracy
         # slightly, but it might be worth the speedup. Unfortunately, with outer bags
@@ -2448,10 +2448,10 @@ class ExplainableBoostingClassifier(EBMModel, ClassifierMixin, ExplainerMixin):
         Number of inner bags. 0 turns off inner bagging.
     learning_rate : float, default=0.01
         Learning rate for boosting.
-    greedy_ratio : float, default=1.5
+    greedy_ratio : float, default=12.0
         The proportion of greedy boosting steps relative to cyclic boosting steps.
         A value of 0 disables greedy boosting, effectively turning it off.
-    cyclic_progress : bool or float, default=True
+    cyclic_progress : bool or float, default=False
         This parameter specifies the proportion of the boosting cycles that will
         actively contribute to improving the model's performance. It is expressed
         as a bool or float between 0 and 1, with the default set to True(1.0), meaning 100% of
@@ -2460,13 +2460,13 @@ class ExplainableBoostingClassifier(EBMModel, ClassifierMixin, ExplainerMixin):
         it will be used to update internal gain calculations related to how effective
         each feature is in predicting the target variable. Setting this parameter
         to a value less than 1.0 can be useful for preventing overfitting.
-    smoothing_rounds : int, default=200
+    smoothing_rounds : int, default=100
         Number of initial highly regularized rounds to set the basic shape of the main effect feature graphs.
     interaction_smoothing_rounds : int, default=50
         Number of initial highly regularized rounds to set the basic shape of the interaction effect feature graphs during fitting.
     max_rounds : int, default=25000
        Total number of boosting rounds with n_terms boosting steps per round.
-    early_stopping_rounds : int, default=50
+    early_stopping_rounds : int, default=100
         Number of rounds with no improvement to trigger early stopping. 0 turns off
         early stopping and boosting will occur for exactly max_rounds.
     early_stopping_tolerance : float, default=1e-5
@@ -2485,17 +2485,17 @@ class ExplainableBoostingClassifier(EBMModel, ClassifierMixin, ExplainerMixin):
         tradeoff for the ensemble of models --- not the individual models --- a small
         amount of overfitting of the individual models can improve the accuracy of
         the ensemble as a whole.
-    min_samples_leaf : int, default=2
+    min_samples_leaf : int, default=4
         Minimum number of samples allowed in the leaves.
-    min_hessian : float, default=1e-4
+    min_hessian : float, default=1e-5
         Minimum hessian required to consider a potential split valid.
     reg_alpha : float, default=0.0
         L1 regularization.
     reg_lambda : float, default=0.0
         L2 regularization.
     max_delta_step : float, default=0.0
         Used to limit the max output of tree leaves. <=0.0 means no constraint.
-    max_leaves : int, default=3
+    max_leaves : int, default=2
         Maximum number of leaves allowed in each tree.
     monotone_constraints: list of int, default=None
@@ -2641,20 +2641,20 @@ def __init__(
         inner_bags: Optional[int] = 0,
         # Boosting
         learning_rate: float = 0.01,
-        greedy_ratio: Optional[float] = 1.5,
-        cyclic_progress: Union[bool, float, int] = True,  # noqa: PYI041
-        smoothing_rounds: Optional[int] = 200,
+        greedy_ratio: Optional[float] = 12.0,
+        cyclic_progress: Union[bool, float, int] = False,  # noqa: PYI041
+        smoothing_rounds: Optional[int] = 100,
         interaction_smoothing_rounds: Optional[int] = 50,
         max_rounds: Optional[int] = 25000,
-        early_stopping_rounds: Optional[int] = 50,
+        early_stopping_rounds: Optional[int] = 100,
         early_stopping_tolerance: Optional[float] = 1e-5,
         # Trees
-        min_samples_leaf: Optional[int] = 2,
-        min_hessian: Optional[float] = 1e-4,
+        min_samples_leaf: Optional[int] = 4,
+        min_hessian: Optional[float] = 1e-5,
         reg_alpha: Optional[float] = 0.0,
         reg_lambda: Optional[float] = 0.0,
         max_delta_step: Optional[float] = 0.0,
-        max_leaves: int = 3,
+        max_leaves: int = 2,
         monotone_constraints: Optional[Sequence[int]] = None,
         objective: str = "log_loss",
         # Overall
@@ -2794,10 +2794,10 @@ class ExplainableBoostingRegressor(EBMModel, RegressorMixin, ExplainerMixin):
         Number of inner bags. 0 turns off inner bagging.
     learning_rate : float, default=0.01
         Learning rate for boosting.
-    greedy_ratio : float, default=1.5
+    greedy_ratio : float, default=12.0
         The proportion of greedy boosting steps relative to cyclic boosting steps.
         A value of 0 disables greedy boosting, effectively turning it off.
-    cyclic_progress : bool or float, default=True
+    cyclic_progress : bool or float, default=False
         This parameter specifies the proportion of the boosting cycles that will
         actively contribute to improving the model's performance. It is expressed
         as a bool or float between 0 and 1, with the default set to True(1.0), meaning 100% of
@@ -2806,13 +2806,13 @@ class ExplainableBoostingRegressor(EBMModel, RegressorMixin, ExplainerMixin):
         it will be used to update internal gain calculations related to how effective
         each feature is in predicting the target variable. Setting this parameter
         to a value less than 1.0 can be useful for preventing overfitting.
-    smoothing_rounds : int, default=200
+    smoothing_rounds : int, default=100
         Number of initial highly regularized rounds to set the basic shape of the main effect feature graphs.
     interaction_smoothing_rounds : int, default=50
         Number of initial highly regularized rounds to set the basic shape of the interaction effect feature graphs during fitting.
     max_rounds : int, default=25000
         Total number of boosting rounds with n_terms boosting steps per round.
-    early_stopping_rounds : int, default=50
+    early_stopping_rounds : int, default=100
         Number of rounds with no improvement to trigger early stopping. 0 turns off
         early stopping and boosting will occur for exactly max_rounds.
     early_stopping_tolerance : float, default=1e-5
@@ -2831,17 +2831,17 @@ class ExplainableBoostingRegressor(EBMModel, RegressorMixin, ExplainerMixin):
         tradeoff for the ensemble of models --- not the individual models --- a small
         amount of overfitting of the individual models can improve the accuracy of
         the ensemble as a whole.
-    min_samples_leaf : int, default=2
+    min_samples_leaf : int, default=4
         Minimum number of samples allowed in the leaves.
-    min_hessian : float, default=1e-4
+    min_hessian : float, default=1e-5
         Minimum hessian required to consider a potential split valid.
     reg_alpha : float, default=0.0
         L1 regularization.
     reg_lambda : float, default=0.0
         L2 regularization.
     max_delta_step : float, default=0.0
         Used to limit the max output of tree leaves. <=0.0 means no constraint.
-    max_leaves : int, default=3
+    max_leaves : int, default=2
         Maximum number of leaves allowed in each tree.
     monotone_constraints: list of int, default=None
@@ -2987,20 +2987,20 @@ def __init__(
         inner_bags: Optional[int] = 0,
         # Boosting
         learning_rate: float = 0.01,
-        greedy_ratio: Optional[float] = 1.5,
-        cyclic_progress: Union[bool, float, int] = True,  # noqa: PYI041
-        smoothing_rounds: Optional[int] = 200,
+        greedy_ratio: Optional[float] = 12.0,
+        cyclic_progress: Union[bool, float, int] = False,  # noqa: PYI041
+        smoothing_rounds: Optional[int] = 100,
         interaction_smoothing_rounds: Optional[int] = 50,
         max_rounds: Optional[int] = 25000,
-        early_stopping_rounds: Optional[int] = 50,
+        early_stopping_rounds: Optional[int] = 100,
         early_stopping_tolerance: Optional[float] = 1e-5,
         # Trees
-        min_samples_leaf: Optional[int] = 2,
-        min_hessian: Optional[float] = 1e-4,
+        min_samples_leaf: Optional[int] = 4,
+        min_hessian: Optional[float] = 1e-5,
         reg_alpha: Optional[float] = 0.0,
         reg_lambda: Optional[float] = 0.0,
         max_delta_step: Optional[float] = 0.0,
-        max_leaves: int = 3,
+        max_leaves: int = 2,
         monotone_constraints: Optional[Sequence[int]] = None,
         objective: str = "rmse",
         # Overall
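At the call site, the net effect of the Python changes is simply different constructor defaults. A short sketch, assuming a build that includes this commit, with all parameter names and values taken from the diff above:

```python
# Sketch: the retuned defaults vs. explicitly pinning the pre-commit values.
# Assumes an interpret build containing this commit.
from interpret.glassbox import ExplainableBoostingRegressor

# With no arguments, the new defaults apply: greedy_ratio=12.0,
# cyclic_progress=False, smoothing_rounds=100, early_stopping_rounds=100,
# min_samples_leaf=4, min_hessian=1e-5, max_leaves=2.
ebm_new = ExplainableBoostingRegressor()

# Passing the old defaults explicitly reproduces the previous behavior.
ebm_old = ExplainableBoostingRegressor(
    greedy_ratio=1.5,
    cyclic_progress=True,
    smoothing_rounds=200,
    early_stopping_rounds=50,
    min_samples_leaf=2,
    min_hessian=1e-4,
    max_leaves=3,
)

# scikit-learn convention: constructor arguments are stored as attributes.
print(ebm_new.greedy_ratio, ebm_old.greedy_ratio)  # 12.0 1.5
```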
