Comments on Guidelines 1 and 4 from the Inductive Bio team #5
6 comments · 4 replies
-
Hi Inductive Bio team, love the detailed and constructive feedback! Thank you for taking the time to write this up.
I can be brief here: I think this is an excellent suggestion!
Same here! I think this is a great suggestion too. For classification problems, it would instead be a confusion matrix. I'll respond to your last suggestion in its own thread so we can discuss it in more detail!
-
This is an interesting one! @jrash and I actually discussed this as we were writing a first draft of the paper. Three thoughts come to mind here.
Does that make sense? Do you agree?
This is an interesting argument that reminds me of Sutton's The Bitter Lesson. On alternatives to 5x5 CV, I'll let @jrash comment!
-
Thanks for taking the time to read the paper carefully and to provide extensive feedback, we appreciate it! First, there are several questions about alternate approaches, so I want to be clear about the goal of the paper. We wanted to provide a clear set of recommendations that we expect to work well in general. We also wanted to avoid listing off options, which could further confuse the reader. Instead, we state that you may decide to deviate based on your needs, and that is fine as long as you are transparent about what you are doing. There is a lot to unpack here, but let me try to answer the two main questions you’re asking.

(1) Is 5-fold CV “good enough”? We argue in the paper that it is not. We use CV to sample performance distributions, which we then compare with statistical tests. For such a use case, 25 samples is the minimum sample size typically recommended for a t-test to perform well in terms of type I and type II error rates. The Tukey HSD test we recommend is an extension of the t-test. Clearly 5 samples is not enough for a statistical test, but maybe you are saying statistical testing is not necessary? If so, that is a separate discussion. We discuss in Section 2 why statistical testing is necessary for drug discovery data sets.

(2) If 5-fold CV is not “good enough,” is 5x5 CV the right alternative? We believe it is, in general. We describe this in Section 3.1.2, but to summarize the key points:
With respect to the added complexity, 5x5 CV is 2.5 times the cost of 10-fold CV, which is also common practice. This is a reasonable compute increase to obtain statistical rigor, and we think it is worth reducing the hyperparameter search or the number of training epochs to save compute budget for a rigorous methods comparison. That said, we do appreciate that this is a departure from current practice and will take some getting used to, which is why we provided code examples to make the transition as easy as possible. We will also touch on this again in our upcoming paper on generalization and splitting. I hope that helps! What I take away from this is that we can improve the writing to bring these points across more clearly, and we are open to suggestions. I will address some other questions in detail in follow-up posts.
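For readers following along, here is a minimal sketch of the kind of workflow we have in mind: 5x5 repeated CV to collect 25 per-fold scores per model, followed by a Tukey HSD comparison. The data, featurization, and models below are placeholders, and the code examples that accompany the paper may differ in the details.

```python
import numpy as np
from scipy.stats import tukey_hsd
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.linear_model import Ridge
from sklearn.metrics import mean_absolute_error
from sklearn.model_selection import RepeatedKFold

# Placeholder data standing in for a small-molecule property dataset.
X, y = make_regression(n_samples=1000, n_features=50, noise=10.0, random_state=0)

models = {
    "ridge": Ridge(alpha=1.0),
    "random_forest": RandomForestRegressor(n_estimators=100, random_state=0),
}

# 5 repeats x 5 folds = 25 per-fold scores per model.
cv = RepeatedKFold(n_splits=5, n_repeats=5, random_state=0)
scores = {name: [] for name in models}
for train_idx, test_idx in cv.split(X):
    for name, model in models.items():
        model.fit(X[train_idx], y[train_idx])
        scores[name].append(mean_absolute_error(y[test_idx], model.predict(X[test_idx])))

# Tukey HSD across the models' 25-score samples.
print(tukey_hsd(*[np.asarray(s) for s in scores.values()]))
```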
-
The Wager paper is about the asymptotic consistency of CV for estimating performance differences between models. It doesn’t go into estimating the uncertainty of performance estimates or statistical testing, so we need to be careful drawing conclusions from it about the method-comparison workflow in our paper. Bates et al. discuss the Wager paper in their asymptotics section. They say: “Wager (2020) show that for CV, comparing two models is a statistically easier task than estimating the prediction error, in some sense.” Wager shows that the difference in model performance in CV is a consistent estimator of the performance difference on a new test set, meaning that the estimator is unbiased and converges to the test-set difference as sample size increases. Interestingly, CV is not consistent for individual model performances. This makes inference for differences simpler in some ways (in terms of deriving confidence intervals, statistical tests, etc.). The Wager paper stops at asymptotics and does not go into statistical testing. It is certainly not saying that 5-fold CV is good enough for statistical tests. With a typical drug discovery data set size and a small number of CV samples, you can still easily see performance differences by chance. If Wager were to follow the progression of the Bates paper and derive a nested-CV test, that would likely require 200 repeats of nested CV, as Bates did.
You are right that the Bates paper and our simulation focus on estimates of individual model performance. We address this briefly in Appendix B; we should probably elaborate more, but we were trying to keep things accessible to our audience. Note that the last sentence of the Bates paper is: “We anticipate that nested CV can be extended to give valid confidence intervals for the difference in prediction error between two models.” Results for the comparison of methods in a t-test framework are actually implied by their paper. The paired t-test statistic is computed using the variance of the performance difference between models, which will be underestimated if the variance of the individual performances is underestimated. Our Tukey HSD procedure is an extension of the paired t-test. The results from the Bates paper (and our simulation) show that the variance for individual methods is underestimated by standard CV. This implies that the variance of the difference between models will also be underestimated, resulting in a poor statistical test with an elevated type I error rate. Since underestimation of variance is the main problem for statistical testing with CV, the Bates approach should lead to a valid statistical test, but such a test has not been derived yet. The focus of the simulation was to show that repeated CV can be used to approximate the results of Bates. Since statistical testing has already been worked out for repeated CV, and this generalizes easily to different performance metrics and splitting methods, we proceed with repeated CV. Since the testing results are implied by our simulation, I didn’t run a testing simulation, but perhaps we should run one to make things concrete for readers. I also thought we had more discussion of the above points in the simulation write-up, but I guess that got cut at some point. I will add more discussion there and perhaps some to the main text.
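To make the paired-variance point concrete, here is a rough, illustrative sketch (placeholder data and models, assuming scikit-learn and scipy). The paired t statistic is mean(diff) / (sd(diff) / sqrt(n)), so a downward-biased sd(diff) inflates the statistic and the type I error rate.

```python
import numpy as np
from scipy.stats import ttest_rel
from sklearn.datasets import make_regression
from sklearn.linear_model import Lasso, Ridge
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import RepeatedKFold

# Placeholder regression task; in practice these would be per-fold metrics from
# the actual benchmark models.
X, y = make_regression(n_samples=800, n_features=30, noise=15.0, random_state=1)
cv = RepeatedKFold(n_splits=5, n_repeats=5, random_state=1)

rmse_a, rmse_b = [], []
for train_idx, test_idx in cv.split(X):
    for model, out in ((Ridge(alpha=1.0), rmse_a), (Lasso(alpha=0.1), rmse_b)):
        model.fit(X[train_idx], y[train_idx])
        out.append(mean_squared_error(y[test_idx], model.predict(X[test_idx])) ** 0.5)

diffs = np.array(rmse_a) - np.array(rmse_b)
# The paired t-test is built on the variance of these per-fold differences; if that
# variance is underestimated (as with standard K-fold CV), the test rejects too often.
print("mean diff:", diffs.mean(), "sd of diff:", diffs.std(ddof=1))
print(ttest_rel(rmse_a, rmse_b))
```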
Drug program splitting: our example in the paper demonstrates handling clustered data where the clusters are defined by chemical scaffold. The same approach can be applied to grouping by drug program, though I would probably group by scaffold or cluster.

Time splitting: the main focus of this paper is on benchmarking in the literature. Most benchmarking data sets in the public domain are not time-stamped, but we often do have timestamps in industry. The time-split case is a special case where you might want to deviate from the guidelines somewhat; we have an upcoming paper on splitting where we will address this. For statistical testing, the key is to get at least 25 approximately independent samples that capture the data structure of interest. You can do this with a rolling time split and further divide the splits into folds for additional samples, for example.
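In scikit-learn terms, the two patterns above might look something like the sketch below. The group and timestamp columns are placeholders; this is illustrative rather than the exact procedure from the paper.

```python
import numpy as np
from sklearn.model_selection import GroupKFold, TimeSeriesSplit

rng = np.random.default_rng(0)
n = 1000
X = rng.normal(size=(n, 16))
groups = rng.integers(0, 40, size=n)     # e.g. scaffold/cluster or drug-program id
order = np.argsort(rng.uniform(size=n))  # stand-in for sorting rows by timestamp

# Grouped CV: no group appears in both train and test within a split.
for train_idx, test_idx in GroupKFold(n_splits=5).split(X, groups=groups):
    assert not set(groups[train_idx]) & set(groups[test_idx])

# Rolling time split: each test window is later in time than its training data;
# dividing each window into 5 sub-folds would give 5 x 5 = 25 test samples.
for train_idx, test_idx in TimeSeriesSplit(n_splits=5).split(X[order]):
    sub_folds = np.array_split(test_idx, 5)
```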
-
Thanks for the simulation suggestions; there were some good ideas in there. However, I do want to clarify that our goal is not to write a comprehensive simulation paper like Bates et al. did. We are primarily relying on others in the literature who have already run such experiments for statistical testing with repeated cross-validation (cited below). The simulation is intended more to show how the results translate to a drug discovery data set.
Addressed in the previous post.
Figures 1 and 2 are intended to demonstrate characteristics of the intervals in a way that is easy for the reader to visualize and understand. Figure 3 is the main result.
It is interesting that Bates et al. didn’t reach the target coverage for some metrics. In the end, though, our paper isn’t proposing nested CV; we only compare against it because it is the current SOTA. We are proposing repeated CV, and we show that we can replicate the Bates results with it, so we didn’t look into this further. Maybe we will if we have time. We provide the code for the simulation. Also, I’ll add more details on the data set; that was an oversight. References:
-
Hi @jrash, apologies for the late response here! This got lost in the shuffle of end-of-year busyness. I appreciate the thorough and thoughtful responses to our comments. If there's one thing on the cross-validation side that I'd still love a deeper dive into in the final paper, it's why 5x5 CV is expected to work much better than repeated random splitting. This is essentially the same question raised by @JacksonBurns in #9. If I'm following the 5x5 method right, then within each 5-fold CV there is no overlap across test sets, but across the different CV repeats there is overlap. And then, to my understanding, the 25 sets of metrics are pooled together for statistical analysis. So it seems like for any given sample you have 4 other samples where you achieve independence, but then 20 samples where you do not have any more independence than you would have in 25x random re-sampling of the data. It's not obvious to me why that would lead to much better statistical behavior. This is a purely intuitive argument, for which I've done no formal analysis, but if there were a way for this to be made clear and compelling in the paper, I think that would be a great addition!
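To spell out the intuition (making no claim about the resulting statistics), here's a quick, hypothetical check of the overlap structure with scikit-learn's RepeatedKFold: test folds within one repeat are disjoint, while test folds from different repeats overlap substantially.

```python
import numpy as np
from sklearn.model_selection import RepeatedKFold

n = 1000
X = np.zeros((n, 1))  # only the indices matter here
cv = RepeatedKFold(n_splits=5, n_repeats=5, random_state=0)
test_sets = [set(test_idx) for _, test_idx in cv.split(X)]  # 25 test sets

def jaccard(a, b):
    return len(a & b) / len(a | b)

within, across = [], []
for i in range(len(test_sets)):
    for j in range(i + 1, len(test_sets)):
        same_repeat = (i // 5) == (j // 5)
        (within if same_repeat else across).append(jaccard(test_sets[i], test_sets[j]))

print("mean Jaccard overlap within a repeat:", np.mean(within))  # 0.0
print("mean Jaccard overlap across repeats:", np.mean(across))   # roughly 0.11
```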
-
Hi Polaris team, I’m sharing these comments on behalf of myself and the team at Inductive Bio, including @gunjanbaid, @pmaher86, @BryceEakin, and @benb111.
We’re excited to see the first paper take shape from the Polaris collaboration. These types of guideline papers can be really valuable for a field. We particularly loved the focus on relating model performance to specific decisions, looking at effect size, and considering the role of dynamic range in perceived model performance. It’s also fantastic to see a paper taking this approach of collecting community comments before going to publication.
We wanted to share two comments related to Guideline 1 (Performance sampling distribution) and one related to Guideline 4 (presenting the result).
We’ve shared some in-depth notes below - we hope these are helpful as the paper progresses into the peer review process, and we are always happy to have continued discussion either via GitHub or other venues!
Importance of appropriate non-random splitting approach
The manuscript currently addresses alternative (non-random) splitting approaches in the section “3.1.3 Cross-validation with advanced splits,” and indicates that this will be the focus of a future manuscript. We understand that it can be impossible to cover everything in one paper, but we think there should be a stronger disclaimer about how important this issue is.
To our minds, the use of random and scaffold splitting is one of the most important reasons that modeling results in the small-molecule ML literature can be so misleading. As we know this group of authors is aware, random and scaffold splitting can leave highly similar pairs of molecules split across train and test, and this often leads to performance metrics on a test set that are much higher than can be expected on truly novel compounds. We’ve written about this a bit here (but certainly aren’t the first).
Readers coming from the broader non-chemistry ML field may already have a good understanding of cross-validation and statistical testing but not the highly chemistry-specific issues with random splitting of datasets and how to address them. Without a clear disclaimer, there’s a risk that by following the guidelines “to a T”, practitioners could form a misplaced, statistically-strengthened confidence in how well their models are performing, only to have their models fail in practice due to their reliance on a misleading splitting approach.
Our thought would be to include this disclaimer directly in Guideline 1, e.g. (added text shown in bold):
We recommend using a 5x5 repeated cross-validation procedure to sample the performance distribution. This procedure suits typical dataset sizes used in small molecule property modeling (e.g., 500 - 100,000). The training set can be further split into a training and validation set if needed. **Care should be taken to consider how the choice of data-splitting approach might systematically over-estimate or under-estimate model performance.**
And then adding another paragraph or two to section 3.1.3 that specifically cites papers such as Sheridan (2013), Landrum et al. (2023), and Guo et al. (2024) so readers are aware of potential issues and can familiarize themselves with proposed solutions.
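As one hypothetical illustration of the concern (not taken from any of the cited papers), a quick diagnostic is to look at each test molecule's nearest-neighbor Tanimoto similarity to the training set; high values suggest the split will overstate performance on genuinely novel chemistry. The SMILES and split below are placeholders, assuming RDKit is available.

```python
from rdkit import Chem, DataStructs
from rdkit.Chem import AllChem

# Placeholder molecules standing in for an actual train/test split.
train_smiles = ["CCO", "CCN", "c1ccccc1O", "CC(=O)OC1=CC=CC=C1C(=O)O"]
test_smiles = ["CCOC", "c1ccccc1N"]

def to_fps(smiles):
    mols = [Chem.MolFromSmiles(s) for s in smiles]
    return [AllChem.GetMorganFingerprintAsBitVect(m, 2, nBits=2048) for m in mols]

train_fps, test_fps = to_fps(train_smiles), to_fps(test_smiles)

# Nearest-neighbor similarity of each test molecule to the training set.
for smi, fp in zip(test_smiles, test_fps):
    nn_sim = max(DataStructs.BulkTanimotoSimilarity(fp, train_fps))
    print(f"{smi}: nearest-neighbor Tanimoto to train = {nn_sim:.2f}")
```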
Questions and comments regarding 5x5 repeated CV
Compared to 5-fold CV, the recommendation of 5x5 repeated CV represents a 5x increase in computational complexity for all experiments run in small-molecule ML, and is also a divergence from standard practice in other ML subfields. We’d like to see either a stronger justification for it or a simpler method proposed. Our comment on 5x5 repeated CV boils down to two questions: (1) is it possible that standard 5-fold CV, while imperfect, is “good enough” for model comparison? (2) if 5-fold CV is not good enough, is 5x5 repeated CV the best alternative?
CV Question 1: is 5-fold CV “good enough”?
The guidelines manuscript points to an interesting paper by Bates et al. that dives into the tendency for standard K-fold CV to have poor coverage in its estimates of model performance. But it’s unclear how damning this really is for standard K-fold CV as a tool for model comparison. The Bates paper notes that they are specifically interested in evaluating the error distribution of a particular trained model (what they call ErrXY) versus evaluating the differences between two types of models.
For small-molecule ML papers, readers are usually most interested in differences between models - e.g., they’re more interested in whether Model Type A is better than Model Type B at predicting solubility than on the exact error rate on the test set (in part because practitioners know that exact error rates will probably be different for their project due to data distribution shifts). Bates et al. note that this is a distinct use case and refer to this recent paper by Wager. The Wager paper performs an analysis showing that while standard CV is quite poor at estimating the error distribution of a single model, it’s actually quite good at telling you which of two models is better.
Based on this reading of the literature, it seems like standard CV may still be “good enough” as a tool to compare models, particularly if a standardized set of folds is published for each dataset so that all researchers can perform consistent comparisons. To convince the field that standard CV is not good enough, we’d like to see analyses showing that it does a poor job of picking the best choice among competing models, rather than that it has poor coverage for the error rate of a single model.
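For concreteness, the kind of analysis we have in mind looks roughly like the toy sketch below: simulate datasets where one model is truly better by a modest margin and count how often a single 5-fold CV ranks the two models correctly. Everything here (data generator, models, effect size) is invented purely to illustrate the experimental design, not a result.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import Ridge
from sklearn.metrics import mean_absolute_error
from sklearn.model_selection import KFold

def cv_mae(model, X, y, seed):
    """Mean absolute error averaged over a single 5-fold CV."""
    errs = []
    for tr, te in KFold(n_splits=5, shuffle=True, random_state=seed).split(X):
        model.fit(X[tr], y[tr])
        errs.append(mean_absolute_error(y[te], model.predict(X[te])))
    return float(np.mean(errs))

correct, n_trials = 0, 50
for seed in range(n_trials):
    X, y = make_regression(n_samples=500, n_features=40, noise=20.0, random_state=seed)
    model_a = Ridge(alpha=1.0)    # assumed to be the truly better model here
    model_b = Ridge(alpha=500.0)  # heavily over-regularized, so truly worse on average
    correct += cv_mae(model_a, X, y, seed) < cv_mae(model_b, X, y, seed)

print(f"5-fold CV picked the better model in {correct}/{n_trials} trials")
```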
CV Question 2: if 5-fold CV is not “good enough,” is 5x5 CV the right alternative?
If 5-fold CV is shown to be not good enough, then we’d like to see more extensive comparison of alternatives before determining a new standard approach for the field. A few possibilities to consider are:
To test these alternatives, we’d like to see more thorough testing than is currently provided in the supplementary analysis. Some things we think could strengthen the supplemental analysis are:
The choice of the best cross-validation approach is a really difficult one, and it will become even more difficult when moving outside of random splitting. In a dataset containing timestamps to enable time splitting, or in a dataset consisting of several distinct drug programs that should be separated between train and test, it is not immediately clear how one would apply 5x5 repeated CV. Given this, our recommendation is to bias towards simplicity and towards CV approaches that other subfields of ML have aligned on. However, we hope the above questions will be helpful for strengthening the Guideline 1 section of the paper no matter what direction is chosen.
Inclusion of a predicted vs measured scatterplot in reporting results
We think the guidelines should recommend that a scatterplot of predicted vs measured values be included in paper results for at least the best-performing model, on either a single test-set split or on the aggregated test sets from cross-validation, and using an appropriate scale for the property, e.g. log scale for many chemical properties. While performance metrics and statistical tests are immensely useful, there are aspects of the model’s performance that are missed by these summaries but are made clear in this more “raw” representation of the model’s performance (consider the classic Anscombe’s quartet). As a few examples: are there notable outliers, or even a cluster of outliers? Is there a meaningful nonlinearity between predicted and measured values? Is there heteroskedasticity in the predictions? Do the predictions show low dynamic range relative to the measured values?
Figure 3 of the guidelines preprint is a great example of how useful this kind of scatterplot can be. The figure makes it immediately obvious that the experimental data is highly concentrated between 100 and 300 µM, in a way that none of the metrics show. Seeing this raises relevant follow-up questions, like “do model predictions around 1 µM tend to have larger errors than predictions around 100 µM?” We think that recommending that papers show these sorts of scatterplots as standard practice will strengthen the field.
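As a sketch of what the recommendation could look like in practice (placeholder data and model, assuming scikit-learn and matplotlib): pool the out-of-fold predictions from CV and plot predicted vs measured on a log scale with an identity line.

```python
import matplotlib.pyplot as plt
import numpy as np
from sklearn.datasets import make_regression
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.model_selection import KFold, cross_val_predict

# Placeholder property spanning several orders of magnitude (modeled in log space).
X, y_raw = make_regression(n_samples=500, n_features=20, noise=5.0, random_state=0)
y = 10 ** (y_raw / y_raw.std())

# Out-of-fold predictions pooled across the CV test sets.
pred_log = cross_val_predict(
    GradientBoostingRegressor(random_state=0), X, np.log10(y),
    cv=KFold(n_splits=5, shuffle=True, random_state=0),
)

fig, ax = plt.subplots(figsize=(4, 4))
ax.scatter(y, 10 ** pred_log, s=10, alpha=0.5)
lims = [y.min(), y.max()]
ax.plot(lims, lims, "k--", lw=1)  # identity line for reference
ax.set_xscale("log")
ax.set_yscale("log")
ax.set_xlabel("measured")
ax.set_ylabel("predicted")
plt.show()
```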