Explaining variable importance to a physicist #458

SamuelHodges · 2024-05-14T10:24:24Z

SamuelHodges
May 14, 2024

Hi all, I'm a spatial ecologist on a multidisciplinary project with some atmospheric scientists, where I'm using biomod2 to bring in niche models to look at insect response to weather systems.
I'm currently writing up a paper of our early work, but a recurrent problem is that the atmospheric scientists (the ones I've met at least) don't deal with statistics very well, and I'm often on the spot to explain in mathematical detail what the metrics are doing. In this case, they're sceptical about biomod2's variable importance metric, and want it explained in detail in the paper. A citation alone is not enough.

I've read the CRAN documentation and Thuiller et al. (2009) and I'm pretty sure I understand what it's doing, but I want to check my understanding with the community before I send this paper for review.

So the variable importance metric checks the response of predictions (probability of presence) to the variable of interest, after having fixed the values of every other variable (e.g. median, mean). Then another random variable is 'shuffled' with the variable of interest, and the response of predictions to this variable is then correlated with the response to the original, using Pearson's correlation. The correlation coefficient is then subtracted from one to give the var.imp. So the more unique the response to the independent variable, the higher its importance.

But what does 'shuffle' mean in this case?
Is it just checking the model response to the random variable? Or is it something more complicated, where the values of the random variable are fed into the modelled relationship for the original variable somehow, to get the new response?

Thank you for any help you can provide.
Sam Hodges

Answered by MayaGueguen

MayaGueguen · 2024-05-14T11:58:15Z

MayaGueguen
May 14, 2024
Maintainer

Hello Samuel,

I'll try and help make things clearer 🙂

⚠️ First of all, be careful not to confuse this with the principle of response curves.
For response curves, you want to check how the predicted value changes as a function of one variable.
So when moving along the range of your variable of interest, the other variables are fixed to an average or median value, so the variation you see in the predicted value is only due to the variation in your variable of interest.

➡️ Here, the principle is the same only in the sense that the aim is to check the effect of one variable on the predicted values.
➡️ But what we want to check here goes further: how much impact a variable has on the predictions.

Here is a little illustration that might help you understand.

Upper left : the presence / absence points of my data
Lower left : 3 explanatory variables used to build the model
Upper right : the predicted map with original variables

Now, I want to compute variable importance.

I shuffle (randomize) the variable C, and keep the variables A and B the same.
I use the same model (formula) used to obtain the previous predictions to predict a new map (the orange below ⬇️).
I compute the Pearson correlation between the original predictions (yellow map) and the predictions with variable C shuffled (orange map).
The importance of the variable is returned as 1 minus this correlation.

The higher the value, the less the reference and shuffled predictions are correlated, and the more influence the variable has on the model. A value of 0 means no influence.

⚠️ Note that this calculation does not account for variables' interactions, and importances do not sum to 1 between variables. As it is independent of the model, it enables direct comparison across algorithms.
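The steps above can be sketched in a few lines of Python. This is only an illustration of the principle, not biomod2's actual code: `toy_model`, its coefficients, and the simulated data are all made up for the example, and the only assumption is that a fitted model can be called on a table of variable values to return predictions.

```python
import math
import random


def pearson(xs, ys):
    """Pearson correlation between two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)


def toy_model(row):
    """Hypothetical fitted model: uses A and C, ignores B entirely."""
    z = 0.8 * row["A"] + 1.5 * row["C"] - 6.0
    return 1.0 / (1.0 + math.exp(-z))  # probability of presence


def variable_importance(model, rows, var, rng):
    """1 - Pearson correlation between reference and shuffled predictions."""
    reference = [model(row) for row in rows]
    # Shuffle only the values of `var`; every other column keeps its order.
    values = [row[var] for row in rows]
    rng.shuffle(values)
    shuffled_rows = [{**row, var: value} for row, value in zip(rows, values)]
    shuffled = [model(row) for row in shuffled_rows]
    return 1.0 - pearson(reference, shuffled)


rng = random.Random(42)
# Simulated sites with three explanatory variables (made-up values).
rows = [
    {"A": rng.uniform(0, 5), "B": rng.uniform(0, 100), "C": rng.uniform(0, 5)}
    for _ in range(500)
]
importance = {
    var: variable_importance(toy_model, rows, var, rng) for var in ("A", "B", "C")
}
print(importance)
```

Because the toy model never uses B, shuffling B leaves the predictions unchanged and its importance comes out at (numerically) zero, while C, which has the larger coefficient, gets a higher importance than A.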

This method was used within the randomForest package, so you might want to check whether their documentation gives further details 👀

Maya

1 reply

SamuelHodges May 15, 2024
Author

Hi Maya,

Ah ok, I think I see... So the shuffling is done between the variable values and their spatial positions in the map? And the following correlation is a paired test between the old prediction with true data and the new prediction with randomised positions?

You mentioned that it is important to not become confused with the response curves - this touches on another point: is there a difference between the 'pred' value in the projections and the probability of presence (pred.val) returned by single variable response curves?

Thank you for taking the time to help! The figure also helped considerably.

Best,
Sam Hodges

MayaGueguen · 2024-05-16T08:40:28Z

MayaGueguen
May 16, 2024
Maintainer

Hello Samuel,

> Ah ok, I think I see... So the shuffling is done between the variable values and their spatial positions in the map? And the following correlation is a paired test between the old prediction with true data and the new prediction with randomised positions?

Yes 🙂 Let's say your data were a data.frame rather than rasters, and looked like this:

x	y	VarA	VarB	VarC
1.2	5.6	1	85	3.4
1.5	5.6	2	45	8.2
1.7	5.6	3	76	4.3
..	..	..	..	..

The idea, when looking at variable C, is to randomize the values contained in column VarC while keeping all other values in the same order, and then to use this table to make predictions:

x	y	VarA	VarB	VarC
1.2	5.6	1	85	7.1
1.5	5.6	2	45	3.4
1.7	5.6	3	76	9.8
..	..	..	..	..
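As a small sketch of that shuffling step (in Python rather than R, purely for illustration): the first three rows are the ones shown above, the last two are made-up stand-ins for the elided rows, and `shuffle_column` is a hypothetical helper, not a biomod2 function.

```python
import random

rows = [
    {"x": 1.2, "y": 5.6, "VarA": 1, "VarB": 85, "VarC": 3.4},
    {"x": 1.5, "y": 5.6, "VarA": 2, "VarB": 45, "VarC": 8.2},
    {"x": 1.7, "y": 5.6, "VarA": 3, "VarB": 76, "VarC": 4.3},
    {"x": 1.9, "y": 5.6, "VarA": 4, "VarB": 60, "VarC": 7.1},  # made-up row
    {"x": 2.1, "y": 5.6, "VarA": 5, "VarB": 52, "VarC": 9.8},  # made-up row
]


def shuffle_column(rows, column, rng):
    """Return a copy of the table with one column's values permuted across rows."""
    values = [row[column] for row in rows]
    rng.shuffle(values)
    # Every other column stays attached to its original row.
    return [{**row, column: value} for row, value in zip(rows, values)]


shuffled = shuffle_column(rows, "VarC", random.Random(1))
for row in shuffled:
    print(row)
```

The shuffled table contains exactly the same set of VarC values as the original; only their assignment to locations (rows) has changed.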

> You mentioned that it is important to not become confused with the response curves - this touches on another point: is there a difference between the 'pred' value in the projections and the probability of presence (pred.val) returned by single variable response curves?

When calculating response curves, it is a bit different, in the sense that you don't keep the exact values of your variables but only their ranges and their mean (or median, for example) values.
Let's say I'm still focusing on C, and the values in my observed table range from 1.5 to 10.0 for this variable. The mean values for variables A and B are 12 and 55 respectively. So I fix all values for A and B to these mean values, I sample variable C regularly along its range, and I use this table to make predictions:

x	y	VarA	VarB	VarC
1.2	5.6	12	55	1.5
1.5	5.6	12	55	1.6
1.7	5.6	12	55	1.7
..	..	..	..	..
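A minimal sketch of how such a table could be built, again in Python and only as an illustration: `response_curve_table` is a hypothetical helper, the three observed rows are made up so that the means of A and B are 12 and 55 and C ranges from 1.5 to 10.0, and the coordinates are left out since they are not predictors.

```python
from statistics import mean

# Made-up observations matching the summary values quoted above.
observed = [
    {"VarA": 10, "VarB": 50, "VarC": 1.5},
    {"VarA": 12, "VarB": 55, "VarC": 6.0},
    {"VarA": 14, "VarB": 60, "VarC": 10.0},
]


def response_curve_table(rows, focal, n_points=10):
    """Fix every non-focal variable at its mean and sample the focal one regularly."""
    others = [name for name in rows[0] if name != focal]
    fixed = {name: mean(row[name] for row in rows) for name in others}
    low = min(row[focal] for row in rows)
    high = max(row[focal] for row in rows)
    step = (high - low) / (n_points - 1)
    return [{**fixed, focal: low + i * step} for i in range(n_points)]


table = response_curve_table(observed, "VarC", n_points=10)
for row in table:
    print(row)
```

Feeding this table to the fitted model gives the predicted value along the range of C with A and B held constant, which is what a response curve plots. The median could be used in place of `mean` in the same way.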

Is it clearer that way ? 👀

Maya

3 replies

SamuelHodges May 28, 2024
Author

Hi Maya,

Thanks, that makes sense for the sampling differences in the 'Pred' and 'Pred.Val'.
I'm not sure why the range for 'Pred' is so much higher than 'Pred.Val' though? On spatial projections 'Pred' ranges from 0-1000, whereas 'Pred.Val' in a response curve ranges from 0-1.

I assumed that 'Pred' is on essentially the same scale as 'Pred.Val', but is transformed to a larger range for convenience. Is this actually the case or is 'Pred' a different type of variable?

MayaGueguen May 29, 2024
Maintainer

Hello Samuel,

The difference in range between the predictions and the scale shown on the response curve plots is indeed purely for convenience.
I have just explained this in more detail to Loïc in issue #463 🙂
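For illustration only, and assuming the two scales differ by a simple linear factor of 1000 (which is what the 0-1000 and 0-1 ranges suggest), converting a projected value back to a probability would look like this in Python; `to_probability` is a hypothetical helper, not a biomod2 function.

```python
def to_probability(pred, scale=1000):
    """Convert a prediction stored on a 0-`scale` range to a 0-1 probability."""
    return pred / scale


# Example projected values on the 0-1000 scale (made up).
projected = [0, 250, 732, 1000]
probabilities = [to_probability(value) for value in projected]
print(probabilities)
```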

Maya

SamuelHodges May 29, 2024
Author

Hi Maya,

Ah brilliant, thanks again for your support!

Sam
