Relevant article: Does cross-validation work in telling rankings apart? [10.1007/s10100-024-00932-1] #4
Replies: 3 comments
-
@jrash I think you would be best suited to answer this one.
-
@cwognum I'm new to this space, but I am very supportive of considering this issue. I've worked on rank-aggregation problems in biological network science, and using the relevant tests here will definitely matter. I would love to be part of the discussions and am happy to prepare material I'm familiar with as well. Hoeffding's D and the recent discussion on nonparametric tests of dependence (using generalizations of ranking) seem relevant to what might be optimal.
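For reference, a minimal sketch of Hoeffding's D as mentioned above, using the classical (SAS-style) formula and assuming no ties and n ≥ 5. SciPy has no built-in for this statistic, so it is computed directly from ranks; the data below are hypothetical and only illustrate the calculation.

```python
import numpy as np
from scipy.stats import rankdata

def hoeffding_d(x, y):
    """Hoeffding's D for paired samples x, y (classical/SAS formula).
    Assumes no ties and n >= 5; ranges from -0.5 to 1, with 1 = perfect dependence."""
    x, y = np.asarray(x, float), np.asarray(y, float)
    n = len(x)
    r, s = rankdata(x), rankdata(y)                    # univariate ranks R_i, S_i
    # Q_i: bivariate rank = 1 + number of points with both coordinates strictly smaller
    q = 1 + np.array([np.sum((x < xi) & (y < yi)) for xi, yi in zip(x, y)])
    d1 = np.sum((q - 1) * (q - 2))
    d2 = np.sum((r - 1) * (r - 2) * (s - 1) * (s - 2))
    d3 = np.sum((r - 2) * (s - 2) * (q - 1))
    return 30.0 * ((n - 2) * (n - 3) * d1 + d2 - 2 * (n - 2) * d3) / (
        n * (n - 1) * (n - 2) * (n - 3) * (n - 4))

# Hypothetical example: per-target scores from two methods whose dependence we want to test
rng = np.random.default_rng(0)
scores_a = rng.normal(size=30)
scores_b = 0.6 * scores_a + rng.normal(scale=0.5, size=30)
print(hoeffding_d(scores_a, scores_b))
```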
-
The paper shows that Dietterich's 5x2cv test had near-zero power in simulated cases using a ranking-based performance metric the authors developed, called Sum of Ranking Differences. The simulations were run with sample sizes of n = 7, 13, and 32. These sample sizes are unusually small for drug discovery, where typical data sets contain at least thousands of compounds; the authors provide an example from analytical chemistry. I would be surprised if you were to find a significant difference using a CV procedure with a data set that small, and in general I would not recommend a CV procedure for data sets of this size. This does raise a useful issue though @cwognum: perhaps we should advise against performing CV-based statistical testing when data sets are too small. I hadn't considered that people might try this with a data set this small…

Also, I noticed that the authors followed Dietterich's implementation exactly. He advocates using the difference from a single repeat in the numerator, instead of taking the difference of means across all repeats as is typically done in a t-test (see the sketch below). This could also lead to a loss of power, and this approach isn't used by other CV-based tests. When developing chemmodlab, we found it unconventional, so we perform a repeated-measures ANOVA and Tukey HSD in the standard way with all samples. We may want to stress this difference from Dietterich's t-test somewhere, even though it is somewhat implied by the recommended tests.

@mnarayan I am not aware of these types of rank aggregation metrics being used commonly in drug discovery. We expect our procedure to work for the most commonly used metrics in drug discovery, but I could see how these rank aggregation cases (where the number of variables is much larger than the number of samples) could become difficult. Good to be aware of.
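To make the distinction above concrete, here is a minimal sketch, in Python with hypothetical fold-level score differences, of Dietterich's 5x2cv paired t statistic (whose numerator uses only the difference from a single fold of the first replication) alongside a conventional paired t-test on all ten fold-level differences. It illustrates the two formulas only; chemmodlab's actual procedure is a repeated-measures ANOVA followed by Tukey's HSD, which is not shown here.

```python
import numpy as np
from scipy import stats

def dietterich_5x2cv_t(diffs):
    """Dietterich's 5x2cv paired t-test.
    diffs: shape (5, 2), the performance difference (model A - model B) on each
    fold of each of the 5 replications of 2-fold CV. Returns (t, two-sided p),
    referred to a t distribution with 5 degrees of freedom."""
    diffs = np.asarray(diffs, float)
    p_bar = diffs.mean(axis=1)                          # mean difference per replication
    s2 = ((diffs - p_bar[:, None]) ** 2).sum(axis=1)    # variance estimate per replication
    t = diffs[0, 0] / np.sqrt(s2.mean())                # numerator: a single fold difference
    return t, 2 * stats.t.sf(abs(t), df=5)

def mean_difference_t(diffs):
    """Conventional paired t-test using all 10 fold-level differences."""
    d = np.asarray(diffs, float).ravel()
    return stats.ttest_1samp(d, popmean=0.0)

# Hypothetical fold-level score differences from 5 replications of 2-fold CV
rng = np.random.default_rng(1)
diffs = rng.normal(loc=0.02, scale=0.05, size=(5, 2))
print(dietterich_5x2cv_t(diffs))
print(mean_difference_t(diffs))
```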
-
I'm posting this on behalf of Károly Héberger, who doesn't use GitHub.
The preprint [DOI: 10.26434/chemrxiv-2024-6dbwv-v2] touches on an important aspect, but the protocol is far from perfect.
Our paper [https://doi.org/10.1007/s10100-024-00932-1] examined the three relevant statistical tests for cross-validation (Wilcoxon, Dietterich, and Alpaydin). We established that Dietterich's test is the worst option, and none of the tests performs well in Type I error situations. Seven criteria showed the unambiguous superiority of the Wilcoxon test in Type II error situations.
I am not familiar with GitHub discussions. In any case, it requires signing up, which I resist. I would prefer a Zoom or Teams discussion with properly prepared discussion partners.
As we elaborated all the (main) scenarios, the new perspectives are obvious. Known statistical solutions (e.g., the Mallows model) are not better. The practical examples are also convincing.
As all three tests fail to reject H0 in Type I error situations, the elaboration of a new test is warranted, etc.
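Following up on the comparison above, here is a minimal sketch of the Wilcoxon signed-rank test applied to paired fold-level CV scores from two models. The scores are hypothetical and only illustrate the mechanics of the paired, nonparametric comparison, not the full evaluation protocol of the paper.

```python
import numpy as np
from scipy.stats import wilcoxon

# Hypothetical per-fold CV scores for two models evaluated on the same folds
model_a = np.array([0.712, 0.680, 0.741, 0.702, 0.695, 0.732, 0.717, 0.702, 0.752, 0.708])
model_b = np.array([0.698, 0.671, 0.725, 0.707, 0.669, 0.721, 0.704, 0.690, 0.748, 0.700])

# Paired, nonparametric comparison: tests whether the median fold-level
# difference is zero, without assuming normality of the differences
stat, p_value = wilcoxon(model_a, model_b)
print(f"W = {stat:.1f}, p = {p_value:.3f}")
```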