This environment variable is for inference, not for training. The input JSON file controls the batch size during training.
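For example, the training batch size lives in the `training_data` section of the input JSON. The fragment below is a minimal sketch with placeholder paths and a placeholder target; if I read the docs right, `"auto:N"` picks the number of frames per batch so that frames × atoms-per-frame is at least `N`:

```json
{
  "training": {
    "training_data": {
      "systems": ["./path/to/system_a", "./path/to/system_b"],
      "batch_size": "auto:256"
    }
  }
}
```

A plain integer is also accepted, in which case it is a fixed number of frames per batch regardless of how many atoms each frame contains.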
I set the batch size with DP_INFER_BATCH_SIZE=40000, so each batch should feed the model about 40,000 atoms of data for training. The GPU memory usage should then stay roughly the same, right? But when I train on a larger system (700 atoms per frame, previously 400), I get an OOM error. Could it be that I have misunderstood how deepmd-kit uses GPU memory?

Besides, on the same data the PyTorch backend does not run out of memory, but the TensorFlow backend does, even though TF runs much faster. Could TF be trading GPU memory for speed?
I would like to understand how GPU memory usage relates to system size (number of atoms) and batch_size.
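To make the question concrete, here is the back-of-envelope model I have in mind (my assumption, not confirmed deepmd-kit behavior): if batch_size counts frames, memory grows roughly with the total number of atoms per batch.

```python
# My mental model (an assumption, not confirmed deepmd-kit internals):
# activation memory scales roughly with total atoms per training batch,
# i.e. frames per batch times atoms per frame. Neighbor-list size would
# add a further per-atom factor on top of this.
def atoms_per_batch(frames: int, atoms_per_frame: int) -> int:
    """Total atoms the model processes in one training batch."""
    return frames * atoms_per_frame

small = atoms_per_batch(frames=8, atoms_per_frame=400)  # 3200 atoms
large = atoms_per_batch(frames=8, atoms_per_frame=700)  # 5600 atoms
print(large / small)  # 1.75 -> ~1.75x the memory, if this model holds
```

Is this roughly the right way to think about it, or does something else dominate?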