Why is repeated random sampling insufficient? #9
-
Hi, thanks for cross-posting! ChemRxiv deleted my formatting, so I will add it back here.

First, thanks for putting this together. It is long overdue, and I am excited to move away from the dreaded bold table. I'm new to rigorous stats-world, so please forgive me if the below is totally off-base.

My question is related to the suggestion that repeated random sampling is undesirable. I prefer this method since (I believe) it rigorously permits parametric testing for comparisons, and because it allows using more advanced splitting methods (fingerprint-based clustering and partitioning, for example) without having to worry about rigorously 'striping' through the data.

From section 3.1.2 (v2): "Commonly used alternatives to CV like bootstrapping and repeated random splits of the data have also been shown to result in strong dependency between samples and are generally not recommended [13]." Reference 13 is Bates, S., Hastie, T. & Tibshirani, R. Cross-validation: What does it estimate and how well does it do it? Journal of the American Statistical Association 119, 1434–1445 (2023). URL http://dx.doi.org/10.1080/01621459.2023.2197686

(1) Where in this paper is this claim? (2) I find it unintuitive that repeated random splits would result in strong dependency, especially given that the suggested Repeated CV is very similar. Repeated random sampling is basically just Repeated CV (5x2) but without the x2 part (?).
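For concreteness, here is a rough scikit-learn sketch of the two schemes I am comparing; the data are placeholders, not anything from the paper:

```python
# Rough sketch of the comparison above: 5x2 repeated CV vs. five random
# 50/50 splits. X is a placeholder feature matrix, not data from the paper.
import numpy as np
from sklearn.model_selection import RepeatedKFold, ShuffleSplit

X = np.random.default_rng(0).random((100, 8))

# 5x2 CV: 5 repeats of a 2-fold split; within each repeat the two test
# halves are disjoint and together cover every sample.
five_by_two = RepeatedKFold(n_splits=2, n_repeats=5, random_state=0)

# Repeated random sampling: five independent 50/50 splits; the test halves
# of different iterations are free to overlap.
repeated_sampling = ShuffleSplit(n_splits=5, test_size=0.5, random_state=0)

for name, splitter in [("5x2 CV", five_by_two),
                       ("repeated 50/50 sampling", repeated_sampling)]:
    n_estimates = sum(1 for _ in splitter.split(X))
    print(f"{name}: {n_estimates} train/test splits")
```

Either way each split yields one performance estimate per model; my question is whether those estimates really differ in how independent they are.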
-
Hi @JacksonBurns, thanks for the feedback! Could you provide more explanation of why you think that repeated sampling would accommodate advanced splits more easily than repeated CV? For most cases I can think of, I don't see repeated CV as much more complicated coding-wise: it is just 5 repeats of 5-fold CV. Variability in fold sizes should be fine as long as you don't have any extremely small folds. In the example from the paper we perform a fingerprint-based clustering split. CV adds slightly more complexity by imposing non-overlap in the test sets, but this is what helps ensure the dependency assumption isn't violated in statistical tests.
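As a rough sketch (not the code from the paper), 5 repeats of a 5-fold CV that keeps whole fingerprint clusters together could look like the following, where `cluster_labels` is a hypothetical cluster assignment, e.g. from clustering fingerprints:

```python
# Sketch of 5 repeats of 5-fold CV where whole clusters stay in one fold,
# so the test folds within a repeat are disjoint. cluster_labels is a
# hypothetical assignment, not the clustering used in the paper.
import numpy as np

rng = np.random.default_rng(0)
n_samples, n_folds, n_repeats = 200, 5, 5
cluster_labels = rng.integers(0, 40, size=n_samples)  # placeholder clusters

for repeat in range(n_repeats):
    # Shuffle the clusters, then deal them round-robin into folds so that
    # every cluster lands in exactly one test fold per repeat.
    clusters = rng.permutation(np.unique(cluster_labels))
    fold_of_cluster = {c: i % n_folds for i, c in enumerate(clusters)}
    fold_ids = np.array([fold_of_cluster[c] for c in cluster_labels])

    for fold in range(n_folds):
        test_idx = np.flatnonzero(fold_ids == fold)
        train_idx = np.flatnonzero(fold_ids != fold)
        # fit on train_idx, score on test_idx: one performance estimate
        # per fold, 25 estimates in total
```

Fold sizes vary with the cluster sizes, which is fine as long as no fold ends up extremely small.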
This was actually a misreading of the paper on my part, which I appreciate you pointing out. The paper refers to an 80-20 split as "data splitting" and has results showing that coverage probabilities are too low compared to standard CV in tables 1-3. I had assumed this was a repeated data split, but now that I read the paper more carefully I realize that it is only one split that uses the error variance in the validation set to form confidence intervals. However, there have been other papers that demonstrate strong evidence for this claim. From Dietterich's original paper (citation below): "Two widely used statistical tests are shown to have high probability of type I error in certain situations and should never be used: a test for the difference of two proportions and a paired-differences t test based on taking several random train-test splits." I'll add this citation instead.
5x2 CV does a 50/50 train/test split, so it would be comparable to repeated 50/50 sampling. For our guidelines, something like repeated 80-20 splits would be the more relevant comparison to 5x5 CV. You would need enough samples to perform a statistical test, so say 25 repeats. This is close to the settings that Dietterich used in his simulations, where he shows a high type I error rate for repeated sampling. The major difference is that for repeated sampling there is overlap in the test sets across iterations, whereas within each 5-fold CV there is no overlap between the test sets. Since the test set is used to compute performance estimates, this non-overlap is important to guarantee sufficient independence between samples from the performance sampling distribution. See the Dietterich paper for more explanation.

Since repeated CV is not much more complicated than repeated sampling coding-wise, and it has been shown to perform better in terms of statistical tests, we recommend using repeated CV in general if you can. I may add repeated sampling to our simulation, since we have been asked about this several times now.

Dietterich, T. G. Approximate statistical tests for comparing supervised classification learning algorithms. Neural Computation 10, 1895–1923 (1998).
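To make the overlap point concrete, here is a quick sketch with scikit-learn splitters and placeholder sizes (illustrative only, not our simulation code):

```python
# Test-set overlap under 25 random 80/20 splits vs. 5x5 repeated CV,
# on 100 placeholder samples; only the split indices matter here.
import numpy as np
from sklearn.model_selection import RepeatedKFold, ShuffleSplit

X = np.zeros((100, 1))

sampling = ShuffleSplit(n_splits=25, test_size=0.2, random_state=0)
sampling_tests = [set(test) for _, test in sampling.split(X)]
# Test sets from repeated sampling routinely share samples across iterations.
print("samples shared by the first two sampled test sets:",
      len(sampling_tests[0] & sampling_tests[1]))

cv = RepeatedKFold(n_splits=5, n_repeats=5, random_state=0)
cv_tests = [set(test) for _, test in cv.split(X)]
first_repeat = cv_tests[:5]  # RepeatedKFold yields one full 5-fold CV first
# Within a single repeat the five test folds are disjoint and cover the data.
pairwise_overlap = sum(len(a & b)
                       for i, a in enumerate(first_repeat)
                       for b in first_repeat[i + 1:])
print("samples shared between test folds within one CV repeat:",
      pairwise_overlap)  # prints 0
```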
-
I noticed this comment from @JacksonBurns on ChemRxiv:
Porting it here to centralize all feedback.