Feedback on the manuscript #8
agamemnonc started this conversation in Feedback on Guidelines
Replies: 1 comment
Thanks for this great feedback @agamemnonc! We will work on incorporating this; I will respond once we have figured all of that out.
Once again, congrats on this valuable collection of guidelines, which I believe is much needed by our community. Below you will find a list of suggestions for improvement. Although this may look a bit like a paper review -- perhaps because I am used to providing feedback in this form -- please note that all comments are intended to improve the clarity of the manuscript, and you can consider or disregard any points as you see fit. I would be more than happy to provide feedback again on future iterations if you wish, or even to contribute to drafting some of these suggestions if you decide to take them on board.
Main improvement suggestions
I would suggest explicitly disambiguating the cross-validation terms “fold”, “split”, and “CV iteration” (e.g. Fig. 2). These terms are often used interchangeably, which may sometimes lead to confusion. For example, in standard KFold splitting the number of folds is equal to the number of CV iterations by definition, since every split / CV iteration ID corresponds to the same fold ID being used as the test set, so the two terms may indeed be used interchangeably there. However, this is not always the case: in other CV strategies the number of splits / CV iterations need not equal the number of folds. One such example is GroupKFold, where the number of groups (i.e. data folds) does not necessarily equal the number of splits. In those cases it is important to keep the two concepts distinct. In the context of this manuscript, I believe this is particularly important to clarify because the number of samples in the performance score distribution does not depend on the number of folds (i.e. data partitions), but rather on the number of splits / CV iterations.
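To make the distinction concrete, here is a minimal scikit-learn sketch (my own illustration, not taken from the manuscript) where the number of data folds differs from the number of CV iterations:

```python
import numpy as np
from sklearn.model_selection import KFold, GroupKFold

X = np.arange(20).reshape(-1, 1)
y = np.zeros(20)

# Standard KFold: 5 folds -> 5 CV iterations, each fold used exactly once as the test set.
kf = KFold(n_splits=5)
print(kf.get_n_splits(X))  # 5

# GroupKFold: 10 groups (data folds) but only 5 CV iterations,
# so each test set is made up of 2 folds.
groups = np.repeat(np.arange(10), 2)
gkf = GroupKFold(n_splits=5)
for i, (train_idx, test_idx) in enumerate(gkf.split(X, y, groups)):
    print(f"CV iteration {i}: test groups = {sorted(set(groups[test_idx]))}")
```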
Therefore, I would suggest using the following convention: “fold” refers to a data partition (i.e. each square in Fig. 2); “split” refers to the kind of splitting (e.g. random, scaffold-based, similarity-based, etc.); “CV iteration” refers to an iteration of the evaluation procedure (i.e. each row and subrow in Fig. 2); and “set” refers to a collection of folds comprising a subset of the data that is used in a specific manner (i.e. “training set” and “test set”).
Currently, in the main text, folds are indeed defined as disjoint sets (or partitions) of a dataset (p. 8). However, in Fig. 2 it seems that “Fold” in 2(a) refers to a CV iteration, and so does “Split” in 2(b). Would it perhaps make sense to rename these as “outer” and “inner” CV iterations?
I acknowledge that there is no standardised terminology agreed by the community and the choice of terminology may come down to personal preference. Regardless of the choice made, I would invite the authors to formally introduce the terminology in the text and be consistent with it throughout the manuscript.
I would also encourage the authors to be explicit about the number of samples generated by the suggested repeated CV (in Fig. 2 and the main text), as this was not clear to me until I reached the appendix (Section C3 → “25 samples”). Since the motivation here is to end up with training sets that overlap less across CV iterations, one may wrongly assume that some sort of aggregation (e.g. the mean) is applied to one of the two layers of the repeated CV.
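As an illustration of what I mean (a sketch under my own assumption of a 5 x 5 repeated KFold, not necessarily the authors' exact protocol), repeated CV simply yields one score per CV iteration, with no aggregation across repeats:

```python
from sklearn.datasets import make_regression
from sklearn.linear_model import Ridge
from sklearn.model_selection import RepeatedKFold, cross_val_score

# Toy data standing in for a molecular property dataset.
X, y = make_regression(n_samples=200, n_features=10, noise=10.0, random_state=0)

# 5 folds x 5 repeats -> 25 CV iterations -> 25 samples of the score distribution.
cv = RepeatedKFold(n_splits=5, n_repeats=5, random_state=0)
scores = cross_val_score(Ridge(), X, y, cv=cv, scoring="neg_root_mean_squared_error")
print(scores.shape)  # (25,) -- no per-repeat averaging is applied
```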
Additionally, it seems that Fig. 2 currently couples the evaluation protocol with hyper-parameter tuning (HPT) by introducing a “validation set” (Fig. 2(c)). In my opinion, HPT is orthogonal to model evaluation and should be applicable to all CV / evaluation strategies. To avoid confusing the reader here, I would suggest reconsidering whether hyper-parameter optimization should be introduced at this stage or later in the text, in a separate section.
Related to the above, Section 3.1.3 introduces the choice of splitting method. I believe a qualitative schematic showing the different orthogonal dimensions of the experimental design and their possible values / implementations would be very valuable here. This could take the form of a three-axis coordinate system, for instance, where the axes are: 1) splitting strategy (random, scaffold-based, similarity clustering, etc.); 2) CV strategy for reporting results and benchmarking competing methods (e.g. KFold, repeated KFold, etc.); and 3) whether hyper-parameter optimization is performed or not. I believe such a diagram would make it clear to the reader that valid experimental designs may be constructed by combining choices from each of these three orthogonal dimensions.
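In code form, the point I am trying to make is simply a Cartesian product over the three dimensions; the values below are hypothetical placeholders, not an exhaustive list:

```python
from itertools import product

# Hypothetical values for the three orthogonal design dimensions.
splitting_strategies = ["random", "scaffold-based", "similarity-clustering"]
cv_strategies = ["k-fold", "repeated k-fold"]
hyperparameter_optimization = [False, True]

# Any combination of the three choices is, in principle, a valid experimental design.
for design in product(splitting_strategies, cv_strategies, hyperparameter_optimization):
    print(design)
```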
I would appreciate a comment from the authors on how to handle multiple comparisons when performance is assessed with several metrics (e.g. Fig. 7). Should one take any additional steps when the number of statistical tests increases because several metrics are used (as opposed to increasing the number of groups that are compared against each other)? Or is the Tukey HSD handling of multiple comparisons adequate? What about non-parametric tests, where the authors suggest the Holm-Bonferroni correction for multiple testing? Should the denominator in that case be just the number of pairwise comparisons, or should it perhaps be multiplied by the number of performance metrics?
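To illustrate what I have in mind (a sketch of one possible convention, not a claim about what the correct procedure is), pooling the pairwise p-values across metrics before applying Holm-Bonferroni would look like this; the p-values below are made up:

```python
import numpy as np
from statsmodels.stats.multitest import multipletests

# Hypothetical raw p-values: 3 pairwise model comparisons x 2 metrics = 6 tests.
pvals = np.array([0.001, 0.020, 0.300,   # metric 1 (e.g. RMSE)
                  0.004, 0.045, 0.250])  # metric 2 (e.g. MAE)

# If all tests are pooled, the effective number of comparisons in the Holm
# step-down procedure becomes (pairwise comparisons) x (number of metrics).
reject, p_adjusted, _, _ = multipletests(pvals, alpha=0.05, method="holm")
print(p_adjusted)
print(reject)
```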
Regarding statistical testing, there are often cases where one may not be interested in all pairwise comparisons, e.g. when we only wish to compare a novel method against existing benchmarks / SOTA methods and not necessarily compare the SOTA methods with each other. Similarly, when several simple baselines assessing random-level performance in different ways are included in the benchmark, we may not be interested in comparing those baselines with each other. I would appreciate a comment from the authors on how to handle such cases, especially regarding multiple comparison correction.
I would recommend including an additional suggestion to practitioners regarding hyper-parameter tuning. It is often the case that, when a new method is introduced, its proposers spend a lot of time and compute optimizing the architecture and hyper-parameters of that method, but use default hyper-parameter configurations for the competing methods, or simply the hyper-parameters specified in the original publications. This creates an unfair advantage for the novel method, especially considering that the optimal selection of hyper-parameters is dataset specific: if results are reported on a different dataset from the one used in the original publication, the baselines' hyper-parameters need to be systematically re-tuned. I would therefore suggest encouraging practitioners to balance both the effort spent and the computational budget evenly across the different methods when it comes to HPT.
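One simple way to operationalize the "equal budget" suggestion (a sketch with hypothetical search spaces, using scikit-learn's RandomizedSearchCV) is to fix the same number of sampled configurations and the same CV scheme for every method:

```python
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.linear_model import Ridge
from sklearn.model_selection import RandomizedSearchCV

X, y = make_regression(n_samples=300, n_features=20, noise=5.0, random_state=0)

# Hypothetical search spaces; the point is the shared budget, not these exact values.
candidates = {
    "ridge": (Ridge(), {"alpha": [0.01, 0.1, 1.0, 10.0, 100.0]}),
    "random_forest": (RandomForestRegressor(random_state=0),
                      {"n_estimators": [100, 300, 500],
                       "max_depth": [None, 5, 10, 20]}),
}

N_ITER = 5  # identical tuning budget (sampled configurations) for every method
for name, (estimator, param_distributions) in candidates.items():
    search = RandomizedSearchCV(estimator, param_distributions,
                                n_iter=N_ITER, cv=5, random_state=0)
    search.fit(X, y)
    print(name, search.best_params_, round(search.best_score_, 3))
```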
Once a model has been developed and evaluated using some form of cross-validation (e.g. one of the choices in Fig. 2), one may wish to finally deploy it to make predictions in the real world. When using CV, there is always the question of which model to deploy. There are a few options here, each with its own pros and cons. The most common choices are: i) retrain the final model on the entire dataset (the caveat being that it is then not possible to report an evaluation score for the deployed model); ii) use a model ensemble (one problem here may be the linear increase in inference time); and iii) use one of the models developed as part of CV (the issue here is that it is not straightforward to select which model to use, as the performance metrics have been evaluated on different partitions of the data). Perhaps the authors may wish to include a brief discussion of this aspect? I acknowledge, though, that this is not directly relevant to benchmarking / method comparison, which is the main scope of this paper.
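For completeness, the three options could be sketched as follows (my own illustration with a toy Ridge model, not a recommendation of any particular choice):

```python
import numpy as np
from sklearn.base import clone
from sklearn.datasets import make_regression
from sklearn.linear_model import Ridge
from sklearn.model_selection import KFold, cross_validate

X, y = make_regression(n_samples=200, n_features=10, noise=10.0, random_state=0)
model = Ridge(alpha=1.0)

cv_results = cross_validate(model, X, y, cv=KFold(n_splits=5), return_estimator=True)
fold_models = cv_results["estimator"]
X_new = np.random.default_rng(0).normal(size=(3, 10))  # hypothetical new compounds

# i) Retrain on the full dataset (no held-out score exists for this exact model).
final_model = clone(model).fit(X, y)
pred_retrained = final_model.predict(X_new)

# ii) Ensemble the CV models (inference time grows linearly with the number of models).
pred_ensemble = np.mean([m.predict(X_new) for m in fold_models], axis=0)

# iii) Deploy a single CV model (each model's score was obtained on a different
#      data partition, so "picking the best one" is not straightforward).
pred_single = fold_models[0].predict(X_new)
```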
In “Guidelines 2 (Statistical testing)” it is suggested that normality assumptions should in most cases be met in small molecule property modelling. I believe this depends on the choice of performance metric. Some metrics, e.g. RMSE, R2, or the enrichment factor in biological activity detection tasks, are single-bounded and typically do not follow Gaussian distributions (i.e. they have long tails). I would therefore suggest commenting on the link between the choice of performance metric and the normality assumption. This comment also applies to the second point in the “Conclusion” section and to Section C.1 in the Appendix.
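As a quick sanity check (a sketch with simulated scores, not real results), one could apply a normality test to the per-CV-iteration values of each metric before deciding between the parametric and non-parametric workflows:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)

# Simulated per-CV-iteration scores for two metrics (25 CV iterations each).
r2_scores = 1.0 - 0.1 * stats.lognorm(s=0.75).rvs(size=25, random_state=0)  # long left tail
mae_scores = rng.normal(loc=0.50, scale=0.03, size=25)                      # roughly Gaussian

for name, scores in [("R2", r2_scores), ("MAE", mae_scores)]:
    stat, p = stats.shapiro(scores)
    verdict = "questionable" if p < 0.05 else "not rejected"
    print(f"{name}: Shapiro-Wilk p = {p:.3f} -> normality {verdict}")
```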
Section A.2: “We recommend using the Conover-Friedman test for pairwise comparisons and the Holm-Bonferroni correction for multiple testing”. Would it be worth recommending here a first-level assessment of the effect of the model on performance, e.g. using the Friedman test, which plays a similar role to repeated-measures ANOVA in the parametric case? Provided that the Friedman test produces a significant result, one may then proceed with post-hoc pairwise tests (e.g. using either the Conover or the Kruskal-Wallis test).
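A sketch of that two-stage workflow with simulated scores (the post-hoc step is only indicated in a comment) could look like this:

```python
import numpy as np
from scipy.stats import friedmanchisquare

# Simulated per-CV-iteration scores for three models evaluated on the same
# 25 splits (i.e. a repeated-measures / blocked design).
rng = np.random.default_rng(0)
base = rng.normal(loc=0.70, scale=0.05, size=25)
scores_a = base + rng.normal(0.00, 0.01, size=25)
scores_b = base + rng.normal(0.02, 0.01, size=25)
scores_c = base + rng.normal(0.05, 0.01, size=25)

# First-level (omnibus) test for any effect of the choice of model on performance.
stat, p = friedmanchisquare(scores_a, scores_b, scores_c)
print(f"Friedman chi2 = {stat:.2f}, p = {p:.4f}")

# Only if this is significant would one proceed to post-hoc pairwise comparisons
# (e.g. Conover tests with a Holm-Bonferroni correction), as already suggested in A.2.
```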
Section D1. Unfortunately, it is a common misconception that R2 = (Pearson’s r) ** 2. Since the target audience of this paper is beginner practitioners, I would explicitly state here that this is not the case. I would also recommend clarifying that the two metrics have different bounds: -inf < R2 <= 1, whereas -1 <= r <= 1. It is perhaps also worth adding that R2 is often called “variance accounted for”. More importantly, I would suggest adding that Pearson’s r is scale and offset invariant, which means that a perfect score of 1 can be achieved even when the target and predicted values do not match (i.e. when y_pred = a*y + b), whereas this is not the case with R2. Therefore, I would personally recommend against using this metric in regression (p. 27). Finally, some challenges / leaderboards (e.g. TDC) often use ranking metrics (e.g. Spearman’s r) for evaluating performance in regression tasks. This is also problematic, as these metrics are likewise scale and offset invariant. Although they may be appropriate for ranking tasks, they should not be used when the prediction error matters. It is perhaps also worth clarifying this point here, or in the later section where ranking metrics are discussed.
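A minimal numerical illustration of the scale- and offset-invariance point (again my own sketch, with simulated values):

```python
import numpy as np
from scipy.stats import pearsonr, spearmanr
from sklearn.metrics import r2_score

rng = np.random.default_rng(0)
y_true = rng.normal(loc=5.0, scale=1.0, size=100)
y_pred = 2.0 * y_true + 3.0  # perfectly correlated, yet systematically wrong

print(pearsonr(y_true, y_pred)[0])   # 1.0 -> "perfect" despite large errors
print(spearmanr(y_true, y_pred)[0])  # 1.0 -> ranking metrics share the same blind spot
print(r2_score(y_true, y_pred))      # strongly negative -> penalizes the mismatch
```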
Related to the above, Section D1 divides regression metrics into error and correlation metrics. This is an interesting classification; however, I would personally disagree with strictly classifying R2 as a correlation metric. The reason is that the main term of this metric is the residual / error term SS_res = \sum ( y_i - f_i )^2. In fact, R2 corresponds to this prediction error rescaled (normalized by the variance of the measured data and negated) plus an offset (the “1 -” part). In my opinion, R2 belongs to both categories at the same time.
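For reference, and using the notation of the previous point:

$$ R^2 = 1 - \frac{SS_\mathrm{res}}{SS_\mathrm{tot}} = 1 - \frac{\sum_i (y_i - f_i)^2}{\sum_i (y_i - \bar{y})^2} $$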
Minor improvement suggestions
p2: “The stochasticity in modeling methods necessitates the comparison of populations of models different methods generate (e.g. through cross-validation)”. Perhaps also comment here that the stochasticity also originates from the nature of typical molecular datasets (i.e. small sizes, label noise etc.), which makes these scores sensitive to addition / removal of a few data points.
p9. I believe there is a syntax error in the sentence starting with: “This was addressed in a recent paper by Bates et al.”
Fig. 3. I find it slightly confusing that the significance level is marked as an area under the curve in this diagram, as opposed to being shown as two single values on the x-axis where the dashed lines are drawn (α and 1-α). Similarly, the p-value should in my opinion be shown as a separate horizontal line, as opposed to a region in the 2D space.
p10, Section 3.2: I believe that the sentence “However, we can hypothesize that the two samples come from distributions…” should read “However, we can hypothesize that the two populations of samples come from distributions…”.
p12, last sentence: “However this correction is known to have low statistical power when the number of comparisons is large.”. In statistics the term “conservative” is often used in this setting, so perhaps worth adding it here for clarity.
p13, syntax error in first sentence: “The recommended Tukey HSD test, which is specifically designed for pairwise comparisons and incorporates a correction for multiple testing.”
p.14, Fig 4. Perhaps worth clarifying in the text why the axes are reversed as compared to common practice (i.e. measured (y-axis) vs. predicted (x-axis) as opposed to predicted (y-axis) vs. measured (x-axis))?
p15. “If one used these models as a compound filter at 100 μM, lightGBM would thus reject more molecules with good solubility.” For clarity, it is perhaps worth adding that this corresponds to the number of blue dots in the top left panel of all three subplots in Fig. 4.
p18: “If the experimental variability of the underlying assay is known, it can be used to estimate the maximum expected performance”. Perhaps worth adding this is often referred to as “aleatoric” or “data uncertainty”.
p18. “a ML method” → “an ML method”
p21: “While confidence intervals can be easily calculated for parametric methods, they are not straightforward to obtain with the non-parametric workflow.” Does it make sense here to add that, for non-parametric tests, the entire ranges of the differences may instead be reported?
p.29: “In these cases only the extreme left of the curve is of interest and the AUROC has limited utility.”. For clarity, I would add here that this corresponds to higher threshold values, which may be counterintuitive for beginners in the field wrongly associating the left part of the curve with lower values.
Related to the above, I think it is worth adding in this section a reference to the highly-cited paper below, which shows that AUROC is not an appropriate metric for datasets that are highly skewed towards the negative class and that AUPRC should be preferred in such cases. It is perhaps also worth adding a brief note that, as shown in the same paper, calculating the area under the PR curve with the trapezoidal rule and linear interpolation may lead to erroneous results, and the average precision formula should be used instead, as is currently implemented in scikit-learn.
Davis, J. and Goadrich, M., 2006, June. The relationship between Precision-Recall and ROC curves. In Proceedings of the 23rd international conference on Machine learning (pp. 233-240).
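To make the last point concrete (a sketch on simulated imbalanced data, not results from the manuscript), scikit-learn exposes both the generic trapezoidal auc and the step-wise average_precision_score:

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import (auc, average_precision_score,
                             precision_recall_curve, roc_auc_score)
from sklearn.model_selection import train_test_split

# Hypothetical, heavily imbalanced binary task (~2% positives).
X, y = make_classification(n_samples=5000, n_features=20, weights=[0.98, 0.02],
                           random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, stratify=y, random_state=0)
probs = LogisticRegression(max_iter=1000).fit(X_tr, y_tr).predict_proba(X_te)[:, 1]

print("AUROC:", roc_auc_score(y_te, probs))  # can look flattering on skewed data

precision, recall, _ = precision_recall_curve(y_te, probs)
print("Trapezoidal PR AUC:", auc(recall, precision))               # linear interpolation
print("Average precision:", average_precision_score(y_te, probs))  # step-wise formula
```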