Replies: 1 comment 11 replies
Hi, there is no simple recipe for this, I think. With only 500K compounds you may struggle to create a prior that generates a sufficiently high percentage of valid SMILES. This assumes that you would be using the same network size as the ChEMBL prior; we have not tested what the LSTM hyperparameters should be for "small" training sets. What you could do is use the current ChEMBL prior and apply TL with your own dataset.

Your reported memory footprint does not make sense to me. As you are running on a GPU, the main determining factor is GPU memory. You should be able to train a prior of your size with less than 10 GB of GPU memory.

Cheers,
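For reference, a TL run of this kind is driven by the same kind of TOML input file mentioned below. The following is only a minimal sketch of such an input, assuming the transfer-learning layout used by the example configs shipped with REINVENT 4; the key names, values, and file paths here are assumptions and should be checked against the distributed example configs before use.

```toml
# Minimal sketch of a transfer-learning input (assumed layout; verify the
# key names against the example configs shipped with REINVENT 4).
run_type = "transfer_learning"
device = "cuda:0"                 # GPU memory, not host RAM, is the main constraint

[parameters]
num_epochs = 10                   # assumed/illustrative values
save_every_n_epochs = 2
batch_size = 100                  # reduce if GPU memory becomes a bottleneck

input_model_file = "priors/reinvent.prior"   # start TL from the shipped ChEMBL prior
smiles_file = "my_dataset.smi"               # placeholder: the ~500K-compound set
output_model_file = "TL_reinvent.model"      # placeholder output name
```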
Hello,
Since REINVENT was trained on the ChEMBL dataset, I would like to retrain the REINVENT model with a new dataset (~500K SMILES) that covers a different chemical space than ChEMBL. Is transfer learning the right way to do this? For now I am testing TL from the provided reinvent.prior model, and I plan to do TL from an empty model as discussed in #87.
However, I am facing an issue where a huge amount of memory is needed, causing my job to be killed by the system. The job report from the qacct command shows:
Is this normal behavior for my dataset? Since ChEMBL is much bigger than my dataset, how much memory was needed for the original REINVENT training? Below is my input TOML file: