-
I went through the comments and I think that each piece of feedback should indeed be addressed. Some do not require changes, but they mean that we need to document better. Some are trickier and longer-term and would require a bit of discussion, so I'll take the time to split them into sub-issues. I'll quickly answer some of the items here.
-
If one is passing a test set, we check for consistency by hashing the input. We need to document this.
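For illustration, such a hash-based consistency check could look roughly like this (a sketch, not skore's actual implementation):

```python
import hashlib
import numpy as np

# Illustrative sketch only: two test sets are considered consistent
# if the hashes of their raw bytes match.
def data_hash(X):
    return hashlib.md5(np.ascontiguousarray(X).tobytes()).hexdigest()

X_test_a = np.arange(12, dtype=float).reshape(4, 3)
X_test_b = X_test_a.copy()
assert data_hash(X_test_a) == data_hash(X_test_b)
```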
-
Not yet, but it is to come.
-
It is already possible with these lines, if I'm not mistaken:

```python
from sklearn.datasets import make_classification
from sklearn.dummy import DummyClassifier
from sklearn.metrics import RocCurveDisplay

X, y = make_classification()
dummy = DummyClassifier(strategy="most_frequent").fit(X, y)
RocCurveDisplay.from_estimator(dummy, X, y)
```
-
Nope, we let you do what you want. We compute the hash, so we could potentially check the hash of the internal data. But if it is only a portion of the data, then we are not going to detect it. Also, you can have potential duplicates in terms of features, and I'm not sure that we should raise. Potentially, I would prefer to have a
Yep, potentially it could be part of this.
-
I think that we should persist the cache by default. We have a way to clear the cache if we want to. The location of the cache (i.e. RAM vs. disk) could also be discussed. My thought is that it could be set via the configuration, and therefore we could add the option that you mentioned regarding the size as well. I assume that it could go in the direction of the decoupling. The only thing that I want to be sure of is to not add any overhead to the general behaviour for people who don't know about a cache, while offering the flexibility for others.
-
I think that we should use a better hash function like XXH3 (https://xxhash.com/). Since the hash computes at ~30 GB/s, it should not be a bottleneck anymore. MD5 is indeed slow.
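For instance, a quick sanity check of the speed difference could look like this (an illustrative sketch assuming the `xxhash` package; timings depend on the machine):

```python
import hashlib
import time

import numpy as np
import xxhash  # pip install xxhash

# Illustrative micro-timing, not a rigorous benchmark: hash ~100 MB
# of random bytes with MD5 and with XXH3.
data = np.random.default_rng(0).bytes(100_000_000)

for name, fn in [
    ("md5", lambda b: hashlib.md5(b).hexdigest()),
    ("xxh3_64", lambda b: xxhash.xxh3_64(b).hexdigest()),
]:
    start = time.perf_counter()
    fn(data)
    print(f"{name}: {time.perf_counter() - start:.3f}s")
```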
-
I agree with all the other mentions regarding improving the documentation.
-
Thank you for all your remarks @fcharras! I hadn't seen them before; they are now documented in issues :)
-
So I just wanted to discover what skore is about, went over the user guide, and took some notes along the way. I don't have a particular task in mind at the moment (and didn't run code), so these notes are based only on the user guide and some discussion with @auguste-probabl, who told me that I could give feedback here.
General impression
Browsing the guide made me think that I could use it and find it handy, especially for quick prototyping and inspection, establishing first baselines, etc.
It also seems to be a good basis for a reasonable API that could extend to more comprehensive features and integrations in the future.
The part that I find exciting is the ambition of the caching. In particular, as a user/developer, I like that the API is only about the business part (the model, the data, the model parameters, the choice of metric, etc.), and that the compute-related worries (and especially where to store and retrieve things) are completely invisible and do not pollute the important parts.
However, the current state of caching raised some questions in my mind that I address at the end. Before that, here is some feedback on nitpicks and some other questions.
Feedback (nitpicks) on user guide
It's nice to discover the library through a transversal use case, with examples that are gradually introduced; it made clear to me what skore is about.
Some parts are a bit repetitive, I think? I had the impression that I went over the same examples of how the cache makes the workflow faster several times.
I regret that the Cache mechanism tutorial is swept under a final Technical details section. I liked this shorter guide format that targets a particular aspect, and this one looks really interesting and important, so I feel it could be higher in the list of guides, with full visibility.
Usage of pd.to_datetime
This section displays raw usage of `pd.to_datetime` without the `format` argument, which is not good regarding date validation. In this case pandas uses heuristics to try to guess the proper format and applies it to the whole column. But this is not reliable: it can make mistakes, and the data might not be clean (e.g. dates with different formats, invalid dates, etc.) and require ad-hoc inspection. Better practice is to know the expected format ahead of time and enforce it with the `format` argument, and then explicitly add proper error management on top in case some dates do not comply (e.g. by using the other arguments of `to_datetime`). Since skore seems to be a lot about good practices (cf. the `train_test_split` warnings), I thought the `format` argument could be added there so that it looks better in this regard.
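For illustration, a sketch of the suggested practice (the column values here are hypothetical):

```python
import pandas as pd

# Enforce the expected format instead of letting pandas guess it.
s = pd.Series(["2024-01-31", "2024-02-29", "not a date"])

# errors="coerce" turns non-conforming values into NaT, so they can be
# inspected explicitly rather than silently mis-parsed.
parsed = pd.to_datetime(s, format="%Y-%m-%d", errors="coerce")
print(parsed.isna().sum(), "value(s) failed to parse")
```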
Complex model before simple model
In the model evaluation guide, why is it that the first model is complex, but the second model is the simple one? It would be clearer for readers of the guide, and would also display a better practice to data scientists (start data science with simple baselines and increase complexity), to have the first model be the simple one and the second the complex one.
Plot estimator report user guide
Here one can read: "The ultimate goal is to detect potential conflict of interest when it comes to the actual problem that we want to solve." I don't understand what it means?
User guide using sklearn.train_test_split
In this section, `train_test_split` is imported from `sklearn`, while the introduction mentions that `skore.train_test_split` is going to be presented; it is a bit disorienting. Maybe add an explanation of why the guide starts with `sklearn.train_test_split` rather than `skore.train_test_split`?
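For what it's worth, a short contrast like this could make the transition explicit (a sketch; `skore.train_test_split`'s keyword usage is my assumption from the guide):

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split as sklearn_split
import skore

X, y = make_classification(random_state=0)

# Plain scikit-learn split...
X_train, X_test, y_train, y_test = sklearn_split(X, y, random_state=0)
# ...versus skore's wrapper, which returns the same split but can also
# emit methodological warnings (e.g. about class imbalance).
X_train, X_test, y_train, y_test = skore.train_test_split(X=X, y=y, random_state=0)
```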
Questions and feature requests
About test set control in the comparison report
It is mentioned that the comparison report enables comparing on the same test set. Does skore check that the two test sets are indeed identical? Maybe the datasets are stored in the `EstimatorReport` and their hashes are compared? Maybe not? It would be nice to have the guide tell me about it, and showcase what happens if I try to use `ComparisonReport` when the datasets are in fact different.
About using ComparisonReport with CrossValidationReport
IIRC from the user guide, this is not currently supported?
About the ROC curves
The ROC curves that skore automatically generates display the performance of a random classifier. Would it make sense to also add a curve for a dummy classifier (one that always predicts the most common class)?
Controlling the "new" in "new data"
The `report_metrics` "X_y" syntax enables computing metrics on new data. Does skore check that the data is actually new? Can skore help me check that the new data does not intersect with previous data (especially train data)? A report on how two datasets differ could be super useful (number of rows/columns, but also how the data differs on specific data points with similar IDs, if that makes sense...).
More about the caching mechanisms
Persistence
So my first expectation was that the cache is going to persist between two sessions, hence will be written to and retrieved from disk. I took this for granted, but @auguste-probabl informed me that currently things are just stored in memory. This is a bit of an issue if I want to demo my notebook the day after and I've restarted my session in-between (which has every reason to happen).
However, if I understand correctly, if I `Project.put` a report after some cache has been created, then the cache will be saved too? So, if I want to save all intermediary assets so that I can retrieve them from disk easily, I have to remember to `Project.put` my report after the cache has been populated (see the sketch after the remarks below). I have two remarks about that:
- the user guide could be more in-depth about this behavior;
- also, I don't think having to `put` to retain the cache is a good mechanism. Quoting myself earlier:

> the compute-related worries (and especially where to store and retrieve things) are completely invisible and do not pollute the important parts

but with this state of caching, the user has to remember to `Project.put` to retain the cache, so the caching mechanism is less invisible, and more halfway in-between a cache and an object store.

My suggestion would be to consider decoupling cache storage and its persistence from the object management in the Project store.
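To make that workflow concrete, here is a rough sketch of how I understand it (the exact skore API, e.g. the `Project` constructor, may differ):

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from skore import EstimatorReport, Project  # API names approximate

X, y = make_classification(random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

report = EstimatorReport(
    LogisticRegression(),
    X_train=X_train, y_train=y_train, X_test=X_test, y_test=y_test,
)
report.metrics.accuracy()         # populates the report's in-memory cache
project = Project("my_project")   # exact constructor may differ
project.put("my_report", report)  # the report, cache included, is persisted
```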
How much space does the cache take?
A worry that comes to mind about this cache is how much (RAM/disk) space it is going to occupy. What are the typical sizes of datasets that skore is designed for?
I would like to have information about the size of the objects that I have stored/cached, and to have tools to clean them easily based on some filters (a bit like Delta Lake does with its Parquet files).
Should the cache have a max size, and automatically evict the oldest entries to make room for new ones?
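Purely as an illustration of the eviction policy I have in mind (a toy sketch, nothing skore-specific):

```python
from collections import OrderedDict

# Toy size-bounded cache: least-recently-used entries are evicted once
# the total stored size exceeds max_bytes.
class BoundedCache:
    def __init__(self, max_bytes):
        self.max_bytes = max_bytes
        self._data = OrderedDict()

    def put(self, key, value):
        self._data[key] = value
        self._data.move_to_end(key)  # mark as most recently used
        while sum(len(v) for v in self._data.values()) > self.max_bytes:
            self._data.popitem(last=False)  # evict the oldest entry

    def get(self, key):
        if key not in self._data:
            return None
        self._data.move_to_end(key)
        return self._data[key]

cache = BoundedCache(max_bytes=1_000_000)
cache.put("report/accuracy", b"0.93")
print(cache.get("report/accuracy"))
```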
Some more questions about scaling and hashing
I understand that hashes are computed on the full datasets. I was a bit worried that it could be slow for large datasets (ranging from 100 MBs to GBs?) and that the hash step could itself be memory-intensive. Did you benchmark the hash functions that are used? Do they scale to bigger datasets?
I was thinking that it could be useful later on to offer users the possibility to use their own custom hashing function, especially if some datasets (e.g. a pandas dataframe with an object column) are not hashable.
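For example, a user-supplied hash for a dataframe with object columns might look like this (a hypothetical hook; nothing like it exists in skore as far as I know):

```python
import pandas as pd

# Object columns are not hashable as raw numpy bytes: stringify values
# first, then use pandas' stable per-row hashing and combine.
def custom_hash(df):
    return int(pd.util.hash_pandas_object(df.astype(str), index=False).sum())

df = pd.DataFrame({"a": [1, 2], "b": [["x"], ["y"]]})  # "b" holds lists
print(custom_hash(df))
```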
Final words
That's it, thank you for scrolling down this far, good work everybody :-)