
Model Training ‐ Comparison ‐ [Precision]


Models | Logs | Graphs | Configs


Precision determines the format used to represent floating-point numbers. We can choose a different precision both for training the model and for saving it.

Floating-point numbers have 3 parts: sign, exponent, and mantissa. Depending on the number format, a different number of bits is allocated to the exponent (E) and the mantissa (M):

  • fp32 - 8E, 23M;

  • fp16 - 5E, 10M;

  • bf16 - 8E, 7M.
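
To make these bit budgets concrete, here is a minimal PyTorch sketch (purely illustrative, not part of the trainer) that prints each format's numeric limits and shows the practical consequences of the exponent/mantissa split:

```python
import torch

# Compare the numeric limits implied by each format's exponent/mantissa split.
for dtype in (torch.float32, torch.float16, torch.bfloat16):
    info = torch.finfo(dtype)
    print(dtype, "| eps:", info.eps, "| max:", float(info.max))

# bf16 keeps fp32's 8-bit exponent (same range) but only 7 mantissa bits,
# so small increments near 1.0 are lost:
print(torch.tensor(1.0, dtype=torch.bfloat16) + 0.001)  # rounds back to 1.0

# fp16 has 10 mantissa bits (finer resolution) but only a 5-bit exponent,
# so large values overflow:
print(torch.tensor(70000.0, dtype=torch.float16))  # inf (fp16 max is 65504)
```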

When you select the bf16 or fp16 format, the model is trained on a mixture of 32-bit and 16-bit data. However, you can train the model exclusively on 16-bit data by enabling the Full fp16 Training or Full bf16 Training setting, depending on the selected format.
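
For reference, here is roughly what "a mixture of 32-bit and 16-bit data" means in practice, as a minimal PyTorch sketch of mixed-precision training in general (not the trainer's actual code; the model, data, and learning rate are placeholders):

```python
import torch

model = torch.nn.Linear(128, 128).cuda()     # placeholder model
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-4)
scaler = torch.cuda.amp.GradScaler()         # loss scaling protects fp16 gradients

def train_step(batch, target):
    optimizer.zero_grad()
    # Mixed precision: ops run in 16-bit inside autocast where it is safe,
    # while the master weights and optimizer state remain in fp32.
    with torch.autocast(device_type="cuda", dtype=torch.float16):
        loss = torch.nn.functional.mse_loss(model(batch), target)
    scaler.scale(loss).backward()   # scale gradients to avoid fp16 underflow
    scaler.step(optimizer)          # unscale, then update the fp32 weights
    scaler.update()
```

Full fp16/bf16 training, by contrast, keeps the weights themselves in 16-bit, which is where the VRAM savings shown below come from.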


So, the compared settings are:

  • bf16,

  • fp16,

  • fp32,

  • fp16 + full fp16,

  • bf16 + full bf16.


DLR(step)

The very first graph already tells us that with fp16 + full fp16 the model was not trained at all, while in all other cases the DLR is approximately the same.


Loss(epoch)

Strangely, despite this, the loss curve of that run looks quite nice. The remaining four curves merge into a single lower curve.

This parameter also affects VRAM consumption and training time:

  • bf16 - 8.4 GB, 17 min;

  • fp16 - 8.4 GB, 16 min;

  • fp32 - 11.2 GB, 42 min;

  • fp16 + full fp16 - 7.3 GB, 15 min;

  • bf16 + full bf16 - 7.3 GB, 16 min.

Also, with fp32 the saved model file is twice as large as in the other cases.
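
The doubling is just byte arithmetic: fp32 stores 4 bytes per weight, while fp16 and bf16 store 2. A quick sketch (the parameter count is a made-up placeholder, not measured from this test):

```python
# Hypothetical parameter count -- purely for illustration.
params = 100_000_000

print(f"fp32:      {params * 4 / 2**20:.0f} MiB")  # 4 bytes per weight -> ~381 MiB
print(f"fp16/bf16: {params * 2 / 2**20:.0f} MiB")  # 2 bytes per weight -> ~191 MiB
```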



Since the fp16 + full fp16 model was not trained at all, I excluded it from the last grids.

The results in all cases are almost identical. As you can see, fp32 gives absolutely nothing over fp16 while significantly increasing VRAM consumption and training time.


CONCLUSION

If your GPU supports bf16, use it. If not, fp16 is no worse. You can turn on Full bf16 Training if you have a VRAM bottleneck. However, if you have enough VRAM, it is safer to leave it turned off, although it had almost no effect on the results in this test.
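
If you are not sure whether your GPU supports bf16 (NVIDIA Ampere and newer cards do), you can check from PyTorch. A small illustrative snippet, not part of the trainer:

```python
import torch

# Pick a training dtype based on hardware support.
if torch.cuda.is_available() and torch.cuda.is_bf16_supported():
    dtype = torch.bfloat16   # fp32-like range at 16-bit memory cost
elif torch.cuda.is_available():
    dtype = torch.float16    # fine with mixed precision + loss scaling
else:
    dtype = torch.float32
print("recommended training dtype:", dtype)
```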


Next - Model Training ‐ Comparison - [Number of CPU Threads per Core]
