Why is repeated random sampling insufficient? #9
-
Hi, thanks for cross-posting! ChemRxiv deleted my formatting, so I will add it back here.

First, thanks for putting this together. It is long overdue, and I am excited to move away from the dreaded bold table. I'm new to rigorous stats-world, so please forgive me if the below is totally off-base.

My question is related to the suggestion that repeated random sampling is undesirable. I prefer this method since (I believe) it rigorously permits parametric testing for comparisons, and because it allows using more advanced splitting methods (fingerprint-based clustering and partitioning, for example) without having to worry about rigorously 'striping' through the data.

From section 3.1.2 (v2): "Commonly used alternatives to CV like bootstrapping and repeated random splits of the data have also been shown to result in strong dependency between samples and are generally not recommended [13]." Reference 13 is Bates, S., Hastie, T. & Tibshirani, R. Cross-validation: What does it estimate and how well does it do it? Journal of the American Statistical Association 119, 1434–1445 (2023). URL http://dx.doi.org/10.1080/01621459.2023.2197686

(1) Where in this paper is this claim? (2) I find it unintuitive that repeated random splits would result in strong dependency, especially given that the suggested Repeated CV is very similar. Repeated random sampling is basically just Repeated CV (5x2) but without the x2 part (?).
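For concreteness, here is a rough scikit-learn sketch of the two schemes I am comparing; the data are placeholders, not anything from the paper:

```python
# Rough sketch of the comparison above: 5x2 repeated CV vs. five random
# 50/50 splits. X is a placeholder feature matrix, not data from the paper.
import numpy as np
from sklearn.model_selection import RepeatedKFold, ShuffleSplit

X = np.random.default_rng(0).random((100, 8))

# 5x2 CV: 5 repeats of a 2-fold split; within each repeat the two test
# halves are disjoint and together cover every sample.
five_by_two = RepeatedKFold(n_splits=2, n_repeats=5, random_state=0)

# Repeated random sampling: five independent 50/50 splits; the test halves
# of different iterations are free to overlap.
repeated_sampling = ShuffleSplit(n_splits=5, test_size=0.5, random_state=0)

for name, splitter in [("5x2 CV", five_by_two),
                       ("repeated 50/50 sampling", repeated_sampling)]:
    n_estimates = sum(1 for _ in splitter.split(X))
    print(f"{name}: {n_estimates} train/test splits")
```

Either way each split yields one performance estimate per model; my question is whether those estimates really differ in how independent they are.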
-
Hi @JacksonBurns, thanks for the feedback! Could you provide more explanation of why you think that repeated sampling would accommodate advanced splits more easily than repeated CV? For most cases I can think of, I don't see repeated CV as much more complicated coding-wise: it is just 5 repeats of 5-fold CV. Variability in fold sizes should be fine as long as you don't have any extremely small folds. In the example from the paper we perform a fingerprint-based clustering split. CV adds slightly more complexity by imposing non-overlap in the test sets, but this is what helps ensure the dependency assumption isn't violated in statistical tests.
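As a rough sketch (not the code from the paper), 5 repeats of a 5-fold CV that keeps whole fingerprint clusters together could look like the following, where `cluster_labels` is a hypothetical cluster assignment, e.g. from clustering fingerprints:

```python
# Sketch of 5 repeats of 5-fold CV where whole clusters stay in one fold,
# so the test folds within a repeat are disjoint. cluster_labels is a
# hypothetical assignment, not the clustering used in the paper.
import numpy as np

rng = np.random.default_rng(0)
n_samples, n_folds, n_repeats = 200, 5, 5
cluster_labels = rng.integers(0, 40, size=n_samples)  # placeholder clusters

for repeat in range(n_repeats):
    # Shuffle the clusters, then deal them round-robin into folds so that
    # every cluster lands in exactly one test fold per repeat.
    clusters = rng.permutation(np.unique(cluster_labels))
    fold_of_cluster = {c: i % n_folds for i, c in enumerate(clusters)}
    fold_ids = np.array([fold_of_cluster[c] for c in cluster_labels])

    for fold in range(n_folds):
        test_idx = np.flatnonzero(fold_ids == fold)
        train_idx = np.flatnonzero(fold_ids != fold)
        # fit on train_idx, score on test_idx: one performance estimate
        # per fold, 25 estimates in total
```

Fold sizes vary with the cluster sizes, which is fine as long as no fold ends up extremely small.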
This was actually a misreading of the paper on my part, which I appreciate you pointing out. The paper refers to an 80-20 split as "data splitting" and has results showing that coverage probabilities are too low compared to standard CV in tables 1-3. I had assumed this was a repeated data split, but now that I read the paper more carefully I realize that it is only one split that uses the error variance in the validation set to form confidence intervals. However, there have been other papers that demonstrate strong evidence for this claim. From Dietterich's original paper (citation below): "Two widely used statistical tests are shown to have high probability of type I error in certain situations and should never be used: a test for the difference of two proportions and a paired-differences t test based on taking several random train-test splits." I'll add this citation instead.
5x2 CV does a 50/50 train/test split, so it would be comparable to repeated 50/50 sampling. For our guidelines, something like repeated 80-20 splits would be the more relevant comparison to 5x5 CV. You would need enough samples to perform a statistical test, so say 25 repeats. This is close to the settings that Dietterich used in his simulations, where he shows a high type I error rate for repeated sampling. The major difference is that for repeated sampling there is overlap in the test sets across iterations, whereas within each 5-fold CV there is no overlap between the test sets. Since the test set is used to compute performance estimates, this non-overlap is important to guarantee sufficient independence between samples from the performance sampling distribution. See the Dietterich paper for more explanation.

Since repeated CV is not much more complicated than repeated sampling coding-wise, and it has been shown to perform better in terms of statistical tests, we recommend using repeated CV in general if you can. I may add repeated sampling to our simulation, since we have been asked about this several times now.

Dietterich, T. G. Approximate statistical tests for comparing supervised classification learning algorithms. Neural Computation 10, 1895–1923 (1998).
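To make the overlap point concrete, here is a quick sketch with scikit-learn splitters and placeholder sizes (illustrative only, not our simulation code):

```python
# Test-set overlap under 25 random 80/20 splits vs. 5x5 repeated CV,
# on 100 placeholder samples; only the split indices matter here.
import numpy as np
from sklearn.model_selection import RepeatedKFold, ShuffleSplit

X = np.zeros((100, 1))

sampling = ShuffleSplit(n_splits=25, test_size=0.2, random_state=0)
sampling_tests = [set(test) for _, test in sampling.split(X)]
# Test sets from repeated sampling routinely share samples across iterations.
print("samples shared by the first two sampled test sets:",
      len(sampling_tests[0] & sampling_tests[1]))

cv = RepeatedKFold(n_splits=5, n_repeats=5, random_state=0)
cv_tests = [set(test) for _, test in cv.split(X)]
first_repeat = cv_tests[:5]  # RepeatedKFold yields one full 5-fold CV first
# Within a single repeat the five test folds are disjoint and cover the data.
pairwise_overlap = sum(len(a & b)
                       for i, a in enumerate(first_repeat)
                       for b in first_repeat[i + 1:])
print("samples shared between test folds within one CV repeat:",
      pairwise_overlap)  # prints 0
```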
-
I noticed this comment from @JacksonBurns on ChemRxiv:
Porting it here to centralize all feedback.