This environment variable is for inference, not for training. The input JSON file controls the batch size during training.
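For example, the training batch size lives in the `training_data` section of the input JSON. The fragment below is a minimal sketch with placeholder paths and a placeholder target; if I read the docs right, `"auto:N"` picks the number of frames per batch so that frames × atoms-per-frame is at least `N`:

```json
{
  "training": {
    "training_data": {
      "systems": ["./path/to/system_a", "./path/to/system_b"],
      "batch_size": "auto:256"
    }
  }
}
```

A plain integer is also accepted, in which case it is a fixed number of frames per batch regardless of how many atoms each frame contains.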
I set the batch size with DP_INFER_BATCH_SIZE=40000, so each batch should feed the model about 40,000 atoms of data for training. The GPU memory usage should then stay roughly the same, right? But when I train on a larger system (700 atoms per frame, previously 400), I get an OOM error. Could it be that I have misunderstood how deepmd-kit uses GPU memory?

Besides, on the same data the PyTorch backend does not run out of memory, but the TensorFlow backend does, even though TF runs much faster. Could TF be trading GPU memory for speed?
I would like to understand how GPU memory usage relates to system size (number of atoms) and batch_size.
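To make the question concrete, here is the back-of-envelope model I have in mind (my assumption, not confirmed deepmd-kit behavior): if batch_size counts frames, memory grows roughly with the total number of atoms per batch.

```python
# My mental model (an assumption, not confirmed deepmd-kit internals):
# activation memory scales roughly with total atoms per training batch,
# i.e. frames per batch times atoms per frame. Neighbor-list size would
# add a further per-atom factor on top of this.
def atoms_per_batch(frames: int, atoms_per_frame: int) -> int:
    """Total atoms the model processes in one training batch."""
    return frames * atoms_per_frame

small = atoms_per_batch(frames=8, atoms_per_frame=400)  # 3200 atoms
large = atoms_per_batch(frames=8, atoms_per_frame=700)  # 5600 atoms
print(large / small)  # 1.75 -> ~1.75x the memory, if this model holds
```

Is this roughly the right way to think about it, or does something else dominate?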