
add xgb.Booster methods to feature_effects() and partial_dependence() #60


Open
btupper wants to merge 1 commit into main

Conversation

@btupper commented Feb 27, 2025

Hello,

We have been using your nice package in a white shark forecasting study, where we use tidymodels to shape our workflow. We bumped into issues using effectplots with tidymodels workflows built with xgboost. This pull request offers feature_effects() and partial_dependence() methods for xgb.Booster class objects. We didn't write a test for it, but we can add one if you like. We hope you might consider this addition to effectplots.

Thank you,

Ben and Kyle (@kolive4)

suppressPackageStartupMessages({
  library(xgboost)
  library(dplyr)
  library(effectplots)
})

# Load iris data as a tibble
load_iris = function(){
  iris |>
    dplyr::as_tibble()
}

# split input data into the desired proportions (training/testing)
do_split = function(x = load_iris(), prop = 0.75, ...){
  len = nrow(x)
  ix = sample(len, len * prop, replace = FALSE)
  index = rep(FALSE, len)
  index[ix] = TRUE
  
  label = dplyr::select(x, dplyr::all_of("Species"))
  data = dplyr::select(x, -dplyr::all_of("Species"))
  
  list(
    training = list(data = dplyr::filter(data, index),
                    label = dplyr::filter(label, index) |> 
                      dplyr::pull() |>
                      as.numeric() - 1),
    testing = list(data = dplyr::filter(data, !index),
                   label = dplyr::filter(label, !index) |> 
                     dplyr::pull() |>
                     as.numeric() - 1),
    levels = (as.numeric(x$Species) - 1) |>
      unique() |> 
      sort() |>
      rlang::set_names(levels(x$Species)),
    orig = index)
}


# load and split the data
data = load_iris() |>
  do_split()


# build the model - note casting to matrix data type
bst = xgboost(data = data$training$data |> as.matrix(), 
              label = data$training$label, 
              max.depth = 2, 
              eta = 1, 
              nrounds = 2,
              nthread = 2, 
              num_class = length(data$levels),
              objective = "multi:softmax")

# prediction - again note casting to matrix
pred <- predict(bst, data$testing$data |> as.matrix())

# here we see the added method for the `xgb.Booster` class; note we do
# not need to explicitly cast the input data to matrix (the method handles that)
fe = effectplots::feature_effects(bst,
                     v = colnames(data$training$data),
                     data = data$testing$data)
plot(fe)

# and the same is true for partial_dependence
pd = effectplots::partial_dependence(bst,
                                     v = colnames(data$training$data),
                                     data = data$testing$data)
plot(pd)

[Plots: feature_effects and partial_dependence output]

> sessionInfo()
R version 4.4.2 (2024-10-31)
Platform: x86_64-redhat-linux-gnu
Running under: Rocky Linux 8.10 (Green Obsidian)

Matrix products: default
BLAS/LAPACK: /usr/lib64/libopenblaso-r0.3.15.so;  LAPACK version 3.9.0

locale:
 [1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C               LC_TIME=en_US.UTF-8       
 [4] LC_COLLATE=en_US.UTF-8     LC_MONETARY=en_US.UTF-8    LC_MESSAGES=en_US.UTF-8   
 [7] LC_PAPER=en_US.UTF-8       LC_NAME=C                  LC_ADDRESS=C              
[10] LC_TELEPHONE=C             LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C       

time zone: America/New_York
tzcode source: system (glibc)

attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods   base     

loaded via a namespace (and not attached):
 [1] digest_0.6.37     R6_2.5.1          fastmap_1.2.0     xfun_0.47        
 [5] tidyselect_1.2.1  magrittr_2.0.3    glue_1.8.0        tibble_3.2.1     
 [9] knitr_1.48        htmltools_0.5.8.1 pkgconfig_2.0.3   rmarkdown_2.28   
[13] dplyr_1.1.4       generics_0.1.3    lifecycle_1.0.4   ps_1.7.5         
[17] cli_3.6.3         processx_3.8.1    callr_3.7.3       reprex_2.0.2     
[21] vctrs_0.6.5       withr_3.0.2       compiler_4.4.2    rstudioapi_0.17.1
[25] tools_4.4.2       evaluate_0.24.0   pillar_1.10.1     yaml_2.3.10      
[29] rlang_1.1.5       fs_1.6.4   

@mayer79 (Owner) commented Feb 27, 2025

Thanks for digging into this.

But: feature_effects() works well with matrix input, which is the native API for XGBoost. It would be unnatural to fit on matrix data and then apply the model to data frames. Do you see what I mean?
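
For instance, with the objects from your reprex (a minimal sketch, assuming feature_effects() takes the matrix directly via its default method):

# same matrix type that was used for fitting, passed straight through
X_test = as.matrix(data$testing$data)
fe = effectplots::feature_effects(bst, v = colnames(X_test), data = X_test)
plot(fe)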

PS: XGBoost is currently working on a big release where the xgboost() function will accept data.frames!

PPS: I am currently working on a new release of {effectplots}, getting rid of the update() function.

@btupper (Author) commented Feb 27, 2025

Thanks for the speedy response! Those updates sound great and we are happy to wait.

I'm pretty sure I don't understand "It would be unnatural to fit on matrix data and then apply the model to data frames." They are both just arrays with column variables to my pea-brain. We often fit models with data frames and then apply with matrices (well, raster data). That said, it might not be important for me to understand.

So, if feature_effects() works well with matrices, I wonder where our hang-up occurs. I know that somewhere down in its innards xgboost prefers sparse matrices, and we found ourselves needing to cast those as matrix to get to the effects goodies. So, we decided that feature_effects() just needed an xgboost-centric method, as we see for ranger and others. It's a little tricky because tidymodels (happily) hides many details from us. We'll dig back into our workflow to see where that crops up.
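
For context, here is a rough sketch of the kind of workaround we landed on (wf_fit and new_data are placeholders for our fitted tidymodels workflow and prediction frame, not real objects):

# pull the raw xgb.Booster out of the fitted tidymodels workflow
bst = workflows::extract_fit_engine(wf_fit)

# cast the prediction frame to matrix before handing it to feature_effects()
fe = effectplots::feature_effects(bst,
                                  v = colnames(new_data),
                                  data = as.matrix(new_data))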

@mayer79 (Owner) commented Feb 28, 2025

I see. Let's craft an example where the problem pops up.

In many situations, one can solve smallish problems with the pred_fun argument, e.g., pred_fun = function(model, data, ...) predict(model, data.matrix(data), ...).
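
Applied to your reprex, that would look something like this (a sketch; the data stays a data frame and is converted inside pred_fun):

fe = effectplots::feature_effects(
  bst,
  v = colnames(data$testing$data),
  data = data$testing$data,  # data frame; cast to matrix inside pred_fun
  pred_fun = function(model, data, ...) predict(model, data.matrix(data), ...)
)
plot(fe)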

This said: the methods for {ranger} objects are also not really necessary as one could pass pred_fun = function(model, data, ...) predict(model, data, ...)$predictions.

Those for DALEX explainers or h2o models are more relevant in this respect.
