Adapt ZairaChem to regression tasks

## Motivation

At the moment, ZairaChem only works with **binary classification** tasks. In real-world scenarios, however, we often encounter **regression tasks**, for example predicting IC50 or pChEMBL values. We would like to extend ZairaChem to support regression tasks.

## Suggested approach

We see two possible approaches to the problem:

- **Extend ZairaChem with AutoML regression modules**: The natural approach would be to extend ZairaChem with AutoML regression modules, such as the ones provided by FLAML or AutoGluon. While this sounds very reasonable, it may present additional challenges, such as new validation metrics, harmonization of the _y_ variable, etc.
- **Divide the regression problem into _n_ classification tasks**: An alternative solution would be to divide the regression problem into _n_ classification tasks, for example by cutting at different percentiles. Then, for each percentile, we would have a binary classification problem that vanilla ZairaChem can handle. At the end of the procedure, we could train a meta-regressor on the predicted probabilities at each cutoff. This approach would be much slower, since it requires one full ZairaChem run per cutoff, but it may be robust and easier to implement.

It is not clear yet which approach is best. I am personally inclined towards the second option, although it may end up being too computationally demanding. In the roadmap below, I assume we take this option.
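
To make the second option concrete, here is a minimal sketch of the binarization step, assuming NumPy. The function names, the choice of percentiles, and the convention that a label is 1 when the value is at or above the cutoff are all illustrative assumptions, not part of ZairaChem. The toy values are made up.

```python
import numpy as np


def percentile_cutoffs(y, percentiles=(25, 50, 75)):
    """Return the y values found at the requested percentiles."""
    return np.percentile(np.asarray(y, dtype=float), percentiles)


def binarize_at_cutoffs(y, cutoffs):
    """Build one binary label column per cutoff.

    Assumption: label is 1 when y >= cutoff, 0 otherwise. For measures
    where lower means more active (e.g. raw IC50), flip the comparison.
    Returns an integer array of shape (n_samples, n_cutoffs).
    """
    y = np.asarray(y, dtype=float)
    cutoffs = np.asarray(cutoffs, dtype=float)
    return (y[:, None] >= cutoffs[None, :]).astype(int)


# Toy pChEMBL-like values, invented for illustration only
y = [4.2, 5.0, 5.5, 6.1, 6.8, 7.3, 8.0, 9.1]
cutoffs = percentile_cutoffs(y)
labels = binarize_at_cutoffs(y, cutoffs)
```

Each column of `labels` would then be the target of one independent binary classification run.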

## Roadmap

- [ ] Harmonize _y_ data for a given regression task. Sometimes, regression values are awkwardly distributed and need to be cleaned up before training. For example, we may want to log-transform values, power-transform them, or simply remove outliers. While this has been partially implemented in ZairaChem already, a production-ready module is not available yet.
- [ ] Parallelize or, at least, organize multiple ZairaChem runs (for each binary classification cutoff) in a centralized manner, including a shared folder.
- [ ] Write a meta-regressor that takes the output probabilities at each cutoff as input features and returns a regression value. The architecture of the meta-regressor should be as simple as possible, ideally a linear regression or a support vector regressor (SVR).
- [ ] Extend default ZairaChem plots to illustrate performance in a regression scenario.
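
As a rough illustration of the first and third roadmap items, the sketch below harmonizes _y_ and fits a linear meta-regressor, assuming NumPy only. Everything here is hypothetical: `harmonize_y`, `fit_linear_meta_regressor`, and `predict_meta` are invented names, the percentile-based outlier filter is just one possible choice, and the per-cutoff probabilities are simulated with a sigmoid as a stand-in for real ZairaChem outputs.

```python
import numpy as np


def harmonize_y(y, log_transform=True, lower_pct=1.0, upper_pct=99.0):
    """Optionally log10-transform y, then drop values outside a percentile range.

    Returns the cleaned values and the boolean mask of kept samples.
    """
    y = np.asarray(y, dtype=float)
    if log_transform:
        if np.any(y <= 0):
            raise ValueError("log transform requires strictly positive values")
        y = np.log10(y)
    lo, hi = np.percentile(y, [lower_pct, upper_pct])
    keep = (y >= lo) & (y <= hi)
    return y[keep], keep


def fit_linear_meta_regressor(probs, y):
    """Ordinary least squares of y on per-cutoff probabilities plus an intercept."""
    X = np.column_stack([np.ones(len(probs)), probs])
    coef, *_ = np.linalg.lstsq(X, y, rcond=None)
    return coef


def predict_meta(coef, probs):
    """Apply the fitted coefficients to new per-cutoff probabilities."""
    X = np.column_stack([np.ones(len(probs)), probs])
    return X @ coef


rng = np.random.default_rng(0)

# Made-up raw values spanning several orders of magnitude
y_raw = 10.0 ** rng.uniform(0.0, 5.0, size=200)
y_clean, keep = harmonize_y(y_raw)

# Simulated classifier outputs: one probability column per cutoff.
# In practice these would come from one ZairaChem run per cutoff.
cutoffs = np.percentile(y_clean, [25, 50, 75])
probs = 1.0 / (1.0 + np.exp(-2.0 * (y_clean[:, None] - cutoffs[None, :])))
probs = np.clip(probs + rng.normal(0.0, 0.05, size=probs.shape), 0.0, 1.0)

coef = fit_linear_meta_regressor(probs, y_clean)
pred = predict_meta(coef, probs)
```

A plain least-squares fit keeps the meta-regressor easy to inspect. An SVR could be swapped in behind the same two functions if the relationship between probabilities and _y_ turns out to be clearly non-linear.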

