-
I went through the comments and I think that each piece of feedback should indeed be addressed. Some do not require changes, but they mean that we need to document better. Some are trickier and longer-term and would require a bit of discussion, so I'll take the time to split them into sub-issues. I'll quickly answer some of the items here.
-
If one is passing a test set, we check for consistency by hashing the input. We need to document this.
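For illustration, such a hash-based consistency check could look roughly like this (a sketch, not skore's actual implementation):

```python
import hashlib
import numpy as np

# Illustrative sketch only: two test sets are considered consistent
# if the hashes of their raw bytes match.
def data_hash(X):
    return hashlib.md5(np.ascontiguousarray(X).tobytes()).hexdigest()

X_test_a = np.arange(12, dtype=float).reshape(4, 3)
X_test_b = X_test_a.copy()
assert data_hash(X_test_a) == data_hash(X_test_b)
```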
-
Not yet, but it is to come.
-
It is already possible with these lines, if I'm not mistaken:

```python
from sklearn.datasets import make_classification
from sklearn.dummy import DummyClassifier
from sklearn.metrics import RocCurveDisplay

X, y = make_classification()
dummy = DummyClassifier(strategy="most_frequent").fit(X, y)
RocCurveDisplay.from_estimator(dummy, X, y)
```
-
Nope, we let you do what you want. We compute the hash, so we could potentially check the hash of the internal data. But if it is only a portion of the data, then we are not going to detect it. Also, you can have potential duplicates in terms of features, and I'm not sure that we should raise. Potentially, I would prefer to have a
Yep, potentially it could be part of this.
-
I think that we should persist the cache by default. We have a way to clear the cache if we want to. The location of the cache (i.e. RAM vs. disk) could also be discussed. My thought is that it could be set via the configuration, and therefore we could add the option that you mentioned regarding the size as well. I assume that it could go in the direction of the decoupling. The only thing that I want to be sure of is to not add any overhead to the general behaviour for people who don't know about a cache, while offering the flexibility for others.
-
I think that we should use a better hash function like XXH3 (https://xxhash.com/). Since the hash computes at ~30 GB/s, it should not be a bottleneck anymore. MD5 is indeed slow.
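For instance, a quick sanity check of the speed difference could look like this (an illustrative sketch assuming the `xxhash` package; timings depend on the machine):

```python
import hashlib
import time

import numpy as np
import xxhash  # pip install xxhash

# Illustrative micro-timing, not a rigorous benchmark: hash ~100 MB
# of random bytes with MD5 and with XXH3.
data = np.random.default_rng(0).bytes(100_000_000)

for name, fn in [
    ("md5", lambda b: hashlib.md5(b).hexdigest()),
    ("xxh3_64", lambda b: xxhash.xxh3_64(b).hexdigest()),
]:
    start = time.perf_counter()
    fn(data)
    print(f"{name}: {time.perf_counter() - start:.3f}s")
```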
-
I agree with all the other mentions regarding improving the documentation.
-
Thank you for all your remarks @fcharras! I hadn't seen them before; they are now documented in issues :)
-
So I just wanted to discover what skore is about, went over the user guide, and took some notes along the way. I don't have a particular task in mind at the moment (and didn't run code), so these notes are based only on the user guide and some discussion with @auguste-probabl, who told me that I could give feedback here.
General impression
Browsing the guide made me think that I could use it and find it handy, especially for quick prototyping and inspection, establishing first baselines, etc.
It also seems to be a good basis for a reasonable API that could extend to more comprehensive features and integrations in the future.
The part that I find exciting is the ambition of the caching. In particular, as a user/developer, I like that the API is only about the business part (the model, the data, the model parameters, the choice of metric, etc.), and that the compute-related worries (and especially where to store and retrieve things) are completely invisible and do not pollute the important parts.
However, the current state of caching raised some questions in my mind that I address at the end. Before that, here is some feedback on nitpicks and some other questions.
Feedback (nitpicks) on user guide
It's nice to discover the library through a transversal use case, with examples that are gradually introduced; it made clear to me what skore is about.
Some parts are a bit repetitive, I think? I had the impression that I went over the same examples of how the cache makes the workflow faster several times.
I regret that the Cache mechanism tutorial is swept under a final Technical details section. I liked this shorter guide format that targets a particular aspect, and this one looks really interesting and important, so I feel it could be higher in the list of guides, with full visibility.
Usage of pd.to_datetime
This section displays raw usage of `pd.to_datetime` without the `format` argument, which is not good regarding date validation. In this case pandas uses heuristics to try to guess the proper format and applies it to the whole column. But this is not reliable: it can make mistakes, and the data might not be clean (e.g. dates with different formats, invalid dates, etc.) and require ad-hoc inspection. Better practice is to know the expected format ahead of time and enforce it with the `format` argument, and then explicitly add proper error management on top in case some dates do not comply (e.g. by using the other arguments of `to_datetime`). Since skore seems to be a lot about good practices (cf. the `train_test_split` warnings), I thought the `format` argument could be added there so that it looks better in this regard.
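For illustration, a sketch of the suggested practice (the column values here are hypothetical):

```python
import pandas as pd

# Enforce the expected format instead of letting pandas guess it.
s = pd.Series(["2024-01-31", "2024-02-29", "not a date"])

# errors="coerce" turns non-conforming values into NaT, so they can be
# inspected explicitly rather than silently mis-parsed.
parsed = pd.to_datetime(s, format="%Y-%m-%d", errors="coerce")
print(parsed.isna().sum(), "value(s) failed to parse")
```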
Complex model before simple model
In the model evaluation guide, why is it that the first model is complex, but the second model is the simple one? It would be clearer for readers of the guide, and would also display a better practice to data scientists (start data science with simple baselines and increase complexity), to have the first model be the simple one and the second the complex one.
Plot estimator report user guide
Here one can read: "The ultimate goal is to detect potential conflict of interest when it comes to the actual problem that we want to solve." I don't understand what it means?
User guide using sklearn.train_test_split
In this section, `train_test_split` is imported from `sklearn`, while the introduction mentions that `skore.train_test_split` is going to be presented; it is a bit disorienting. Maybe add an explanation of why the guide starts with `sklearn.train_test_split` rather than `skore.train_test_split`?
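For what it's worth, a short contrast like this could make the transition explicit (a sketch; `skore.train_test_split`'s keyword usage is my assumption from the guide):

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split as sklearn_split
import skore

X, y = make_classification(random_state=0)

# Plain scikit-learn split...
X_train, X_test, y_train, y_test = sklearn_split(X, y, random_state=0)
# ...versus skore's wrapper, which returns the same split but can also
# emit methodological warnings (e.g. about class imbalance).
X_train, X_test, y_train, y_test = skore.train_test_split(X=X, y=y, random_state=0)
```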
Questions and feature requests
About test set control in the comparison report
It is mentioned that the comparison report enables comparing on the same test set. Does skore check that the two test sets are indeed identical? Maybe the datasets are stored in the `EstimatorReport` and their hashes are compared? Maybe not? It would be nice to have the guide tell me about it, and showcase what happens if I try to use `ComparisonReport` when the datasets are in fact different.
About using ComparisonReport with CrossValidationReport
IIRC from the user guide, this is not currently supported?
About the ROC curves
The ROC curves that skore automatically generates display the performance of a random classifier. Would it make sense to also add a curve for a dummy classifier (one that always predicts the most common class)?
Controlling the "new" in "new data"
The `report_metrics` "X_y" syntax enables computing metrics on new data. Does skore check that the data is actually new? Can skore help me check that the new data does not intersect with previous data (especially train data)? A report on how two datasets differ could be super useful (number of rows/columns, but also how the data differs on specific data points with similar IDs, if that makes sense...).
More about the caching mechanisms
Persistence
So my first expectation was that the cache is going to persist between two sessions, hence will be written to and retrieved from disk. I took this for granted, but @auguste-probabl informed me that currently things are just stored in memory. This is a bit of an issue if I want to demo my notebook the day after and I've restarted my session in-between (which has every reason to happen).
However, if I understand correctly, if I `Project.put` a report after some cache has been created, then the cache will be saved too? So, if I want to save all intermediary assets so that I can retrieve them from disk easily, I have to remember to `Project.put` my report after the cache has been populated (see the sketch after the remarks below). I have two remarks about that:
- the user guide could be more in-depth about this behavior;
- also, I don't think having to `put` to retain the cache is a good mechanism. Quoting myself earlier:

> the compute-related worries (and especially where to store and retrieve things) are completely invisible and do not pollute the important parts

but with this state of caching, the user has to remember to `Project.put` to retain the cache, so the caching mechanism is less invisible, and more halfway in-between a cache and an object store.

My suggestion would be to consider decoupling cache storage and its persistence from the object management in the Project store.
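To make that workflow concrete, here is a rough sketch of how I understand it (the exact skore API, e.g. the `Project` constructor, may differ):

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from skore import EstimatorReport, Project  # API names approximate

X, y = make_classification(random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

report = EstimatorReport(
    LogisticRegression(),
    X_train=X_train, y_train=y_train, X_test=X_test, y_test=y_test,
)
report.metrics.accuracy()         # populates the report's in-memory cache
project = Project("my_project")   # exact constructor may differ
project.put("my_report", report)  # the report, cache included, is persisted
```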
How much space does the cache take?
A worry that comes to mind about this cache is how much (RAM/disk) space it is going to occupy. What are the typical sizes of datasets that skore is designed for?
I would like to have information about the size of the objects that I have stored/cached, and to have tools to clean them easily based on some filters (a bit like Delta Lake does with its Parquet files).
Should the cache have a max size, and automatically evict the oldest entries to make room for new ones?
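Purely as an illustration of the eviction policy I have in mind (a toy sketch, nothing skore-specific):

```python
from collections import OrderedDict

# Toy size-bounded cache: least-recently-used entries are evicted once
# the total stored size exceeds max_bytes.
class BoundedCache:
    def __init__(self, max_bytes):
        self.max_bytes = max_bytes
        self._data = OrderedDict()

    def put(self, key, value):
        self._data[key] = value
        self._data.move_to_end(key)  # mark as most recently used
        while sum(len(v) for v in self._data.values()) > self.max_bytes:
            self._data.popitem(last=False)  # evict the oldest entry

    def get(self, key):
        if key not in self._data:
            return None
        self._data.move_to_end(key)
        return self._data[key]

cache = BoundedCache(max_bytes=1_000_000)
cache.put("report/accuracy", b"0.93")
print(cache.get("report/accuracy"))
```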
Some more questions about scaling and hashing
I understand that hashes are computed on the full datasets. I was a bit worried that it could be slow for large datasets (ranging from 100 MBs to GBs?) and that the hash step could itself be memory-intensive. Did you benchmark the hash functions that are used? Do they scale to bigger datasets?
I was thinking that it could be useful later on to offer users the possibility to use their own custom hashing function, especially if some datasets (e.g. a pandas dataframe with an object column) are not hashable.
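For example, a user-supplied hash for a dataframe with object columns might look like this (a hypothetical hook; nothing like it exists in skore as far as I know):

```python
import pandas as pd

# Object columns are not hashable as raw numpy bytes: stringify values
# first, then use pandas' stable per-row hashing and combine.
def custom_hash(df):
    return int(pd.util.hash_pandas_object(df.astype(str), index=False).sum())

df = pd.DataFrame({"a": [1, 2], "b": [["x"], ["y"]]})  # "b" holds lists
print(custom_hash(df))
```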
Final words
That's it, thank you for scrolling down this far, good work everybody :-)