Description
Currently, Shapash provides a correlation plot that displays relationships between variables in the training dataset. This intelligent plot highlights only highly correlated variables, grouping them into specific zones for better readability. The visualization is a two-dimensional matrix, where each cell represents the correlation between two variables using color intensity.
Feature Proposal
We propose an enhanced correlation matrix computed on Shapley values instead of raw feature values. This would let users analyze correlations between the features' contributions to the predictions rather than between the feature values themselves.
Key Enhancements
- Shapley Value Correlation Matrix
  - Instead of computing correlations on feature values, correlations will be computed on their Shapley values.
  - This captures relationships between features based on their impact on the model’s predictions, rather than on the raw statistical correlation of their values.
- Shapley-Weighted Correlations
  - The correlation computation should be weighted by the absolute values of the Shapley attributions.
  - If two features behave identically on 99% of the samples but their Shapley values are close to zero on those samples, that agreement is irrelevant to the model and should carry little weight.
  - Only the remaining 1% of samples, where the Shapley values are significant, should meaningfully contribute to the correlation score.
- Consistent Aesthetics & UX
  - The visualization should maintain the same look and feel as the existing correlation plot.
  - Color mapping should be adjusted to reflect the new correlation computation method.
  - The user should be able to interact with the visualization in the same way as the original plot.
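To make the first two enhancements concrete, here is a minimal sketch of how both matrices could be computed. It is not Shapash code: the function names, the input format (a DataFrame of per-sample Shapley values with one column per feature), and in particular the pairwise weight `|phi_i| + |phi_j|` are assumptions made for illustration, since the proposal does not fix an exact weighting formula.

```python
import numpy as np
import pandas as pd


def shap_correlation(contributions: pd.DataFrame) -> pd.DataFrame:
    """Plain Pearson correlation between columns of per-sample Shapley values."""
    return contributions.corr(method="pearson")


def weighted_shap_correlation(contributions: pd.DataFrame) -> pd.DataFrame:
    """Weighted Pearson correlation between Shapley value columns.

    Assumption: for a pair (i, j), sample k gets weight |phi_ki| + |phi_kj|,
    so samples where both attributions are near zero barely count.
    """
    phi = contributions.to_numpy(dtype=float)
    abs_phi = np.abs(phi)
    n_features = phi.shape[1]
    corr = np.eye(n_features)  # diagonal is set to 1 by convention
    for i in range(n_features):
        for j in range(i + 1, n_features):
            w = abs_phi[:, i] + abs_phi[:, j]
            total = w.sum()
            if total == 0:
                # Both features never contribute: correlation is undefined.
                corr[i, j] = corr[j, i] = np.nan
                continue
            w = w / total
            # Weighted means, deviations, covariance and variances.
            di = phi[:, i] - np.sum(w * phi[:, i])
            dj = phi[:, j] - np.sum(w * phi[:, j])
            cov = np.sum(w * di * dj)
            denom = np.sqrt(np.sum(w * di**2) * np.sum(w * dj**2))
            corr[i, j] = corr[j, i] = cov / denom if denom > 0 else np.nan
    cols = contributions.columns
    return pd.DataFrame(corr, index=cols, columns=cols)


# Synthetic Shapley values (not real model output): 990 samples with small,
# identical attributions for "a" and "b", and 10 samples with large,
# opposite attributions.
small = np.linspace(-0.1, 0.1, 990)
large = np.array([0.5] * 5 + [-0.5] * 5)
contributions = pd.DataFrame(
    {
        "a": np.concatenate([small, large]),
        "b": np.concatenate([small, -large]),
    }
)

plain = shap_correlation(contributions)
weighted = weighted_shap_correlation(contributions)
print(plain.round(3))
print(weighted.round(3))
```

On this synthetic data the unweighted correlation between `a` and `b` is positive, because the many low-impact samples agree, while the weighted version turns negative, because the few high-impact samples disagree. Other weighting choices (for example the product of the absolute attributions) would be equally plausible and could be compared in the same way.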
Expected Benefits
- Helps understand which features influence predictions similarly, rather than just being statistically correlated.
- Avoids misleading correlations based on raw feature values by focusing on impact correlations.
- Provides better insights into feature interactions in the context of model interpretability.
This feature would enhance Shapash’s explainability tools by allowing users to visualize correlations in a way that aligns more closely with model decision-making, rather than just dataset structure.
Looking forward to feedback and suggestions! 🚀