
add xgb.Booster methods to feature_effects() and partial_dependence() #60


Open
btupper wants to merge 1 commit into main

Conversation

@btupper commented Feb 27, 2025

Hello,

We have been using your nice package in a white shark forecasting study, where we use tidymodels to shape our workflow. We bumped into issues using effectplots with tidymodels workflows built with xgboost. This pull request offers feature_effects() and partial_dependence() methods for xgb.Booster class objects. We didn't write a test for it, but we can add one if you like. We hope you might consider this addition to effectplots.

Thank you,

Ben and Kyle (@kolive4)

suppressPackageStartupMessages({
  library(xgboost)
  library(dplyr)
  library(effectplots)
})

# Load iris data as a tibble
load_iris = function(){
  iris |>
    dplyr::as_tibble()
}

# split input data into the desired proportions (training/testing)
do_split = function(x = load_iris(), prop = 0.75, ...){
  len = nrow(x)
  ix = sample(len, len * prop, replace = FALSE)
  index = rep(FALSE, len)
  index[ix] = TRUE
  
  label = dplyr::select(x, dplyr::all_of("Species"))
  data = dplyr::select(x, -dplyr::all_of("Species"))
  
  list(
    training = list(data = dplyr::filter(data, index),
                    label = dplyr::filter(label, index) |> 
                      dplyr::pull() |>
                      as.numeric() - 1),
    testing = list(data = dplyr::filter(data, !index),
                   label = dplyr::filter(label, !index) |> 
                     dplyr::pull() |>
                     as.numeric() - 1),
    levels = (as.numeric(x$Species) - 1) |>
      unique() |> 
      sort() |>
      rlang::set_names(levels(x$Species)),
    orig = index)
}


# load and split the data
data = load_iris() |>
  do_split()


# build the model - note casting to matrix data type
bst = xgboost(data = data$training$data |> as.matrix(), 
              label = data$training$label, 
              max.depth = 2, 
              eta = 1, 
              nrounds = 2,
              nthread = 2, 
              num_class = length(data$levels),
              objective = "multi:softmax")

# prediction - again note casting to matrix
pred <- predict(bst, data$testing$data |> as.matrix())

# here we see the added method for the `xgb.Booster` class; note we do
# not need to explicitly cast the input data to matrix (the method handles that)
fe = effectplots::feature_effects(bst,
                     v = colnames(data$training$data),
                     data = data$testing$data)
plot(fe)

# and the same is true for partial_dependence
pd = effectplots::partial_dependence(bst,
                                     v = colnames(data$training$data),
                                     data = data$testing$data)
plot(pd)

[Plots: feature_effects and partial_dependence output]

> sessionInfo()
R version 4.4.2 (2024-10-31)
Platform: x86_64-redhat-linux-gnu
Running under: Rocky Linux 8.10 (Green Obsidian)

Matrix products: default
BLAS/LAPACK: /usr/lib64/libopenblaso-r0.3.15.so;  LAPACK version 3.9.0

locale:
 [1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C               LC_TIME=en_US.UTF-8       
 [4] LC_COLLATE=en_US.UTF-8     LC_MONETARY=en_US.UTF-8    LC_MESSAGES=en_US.UTF-8   
 [7] LC_PAPER=en_US.UTF-8       LC_NAME=C                  LC_ADDRESS=C              
[10] LC_TELEPHONE=C             LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C       

time zone: America/New_York
tzcode source: system (glibc)

attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods   base     

loaded via a namespace (and not attached):
 [1] digest_0.6.37     R6_2.5.1          fastmap_1.2.0     xfun_0.47        
 [5] tidyselect_1.2.1  magrittr_2.0.3    glue_1.8.0        tibble_3.2.1     
 [9] knitr_1.48        htmltools_0.5.8.1 pkgconfig_2.0.3   rmarkdown_2.28   
[13] dplyr_1.1.4       generics_0.1.3    lifecycle_1.0.4   ps_1.7.5         
[17] cli_3.6.3         processx_3.8.1    callr_3.7.3       reprex_2.0.2     
[21] vctrs_0.6.5       withr_3.0.2       compiler_4.4.2    rstudioapi_0.17.1
[25] tools_4.4.2       evaluate_0.24.0   pillar_1.10.1     yaml_2.3.10      
[29] rlang_1.1.5       fs_1.6.4   

@mayer79 (Owner) commented Feb 27, 2025

Thanks for digging into this.

But: feature_effects() works well with matrix input, which is the native API for XGBoost. It would be unnatural to fit on matrix data and then apply the model to data frames. Do you see what I mean?
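
For instance, with the objects from your reprex (a minimal sketch, assuming feature_effects() takes the matrix directly via its default method):

# same matrix type that was used for fitting, passed straight through
X_test = as.matrix(data$testing$data)
fe = effectplots::feature_effects(bst, v = colnames(X_test), data = X_test)
plot(fe)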

PS: XGBoost is currently working on a big release where the xgboost() function will accept data.frames!

PPS: I am currently working on a new release of {effectplots}, getting rid of the update() function.

@btupper (Author) commented Feb 27, 2025

Thanks for the speedy response! Those updates sound great and we are happy to wait.

I'm pretty sure I don't understand "It would be unnatural to fit on matrix data and then apply the model to data frames." They are both just arrays with column variables to my pea-brain. We often fit models with data frames and then apply with matrices (well, raster data). That said, it might not be important for me to understand.

So, if feature_effects() works well with matrices, I wonder where our hang-up occurs. I know that somewhere down in its innards xgboost prefers sparse matrices, and we found ourselves needing to cast those as matrix to get to the effects goodies. So, we decided that feature_effects() just needed an xgboost-centric method, as we see for ranger and others. It's a little tricky because tidymodels (happily) hides many details from us. We'll dig back into our workflow to see where that crops up.
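
For context, here is a rough sketch of the kind of workaround we landed on (wf_fit and new_data are placeholders for our fitted tidymodels workflow and prediction frame, not real objects):

# pull the raw xgb.Booster out of the fitted tidymodels workflow
bst = workflows::extract_fit_engine(wf_fit)

# cast the prediction frame to matrix before handing it to feature_effects()
fe = effectplots::feature_effects(bst,
                                  v = colnames(new_data),
                                  data = as.matrix(new_data))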

@mayer79 (Owner) commented Feb 28, 2025

I see. Let's craft an example where the problem pops up.

In many situations, one can solve smallish problems with the pred_fun argument, e.g., pred_fun = function(model, data, ...) predict(model, data.matrix(data), ...).
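
Applied to your reprex, that would look something like this (a sketch; the data stays a data frame and is converted inside pred_fun):

fe = effectplots::feature_effects(
  bst,
  v = colnames(data$testing$data),
  data = data$testing$data,  # data frame; cast to matrix inside pred_fun
  pred_fun = function(model, data, ...) predict(model, data.matrix(data), ...)
)
plot(fe)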

This said: the methods for {ranger} objects are also not really necessary as one could pass pred_fun = function(model, data, ...) predict(model, data, ...)$predictions.

Those for DALEX explainers or h2o models are more relevant in this respect.
