Model Training ‐ Comparison ‐ [Precision]
Models | Logs | Graphs | Configs
Precision determines the format used to represent floating-point numbers. We can choose different precision both for training the model and for saving it.
Floating-point numbers have three parts: sign, exponent, and mantissa. Depending on the number format, a different number of bits is allocated to the exponent and mantissa:
- fp32 - 8 exponent bits, 23 mantissa bits;
- fp16 - 5 exponent bits, 10 mantissa bits;
- bf16 - 8 exponent bits, 7 mantissa bits.
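As a hedged illustration (this is not part of the training scripts), a few lines of PyTorch show what these splits mean in practice: bf16 keeps fp32's range but with coarser precision, while fp16 is more precise but has a much smaller range.

```python
import torch

# Compare the numeric properties of the three formats.
for dtype in (torch.float32, torch.float16, torch.bfloat16):
    info = torch.finfo(dtype)
    print(f"{str(dtype):15} bits={info.bits:2}  max={info.max:.3e}  eps={info.eps:.2e}")

# float32:  max ≈ 3.403e+38, eps ≈ 1.19e-07
# float16:  max ≈ 6.550e+04, eps ≈ 9.77e-04  (narrow range -> overflow/underflow risk)
# bfloat16: max ≈ 3.390e+38, eps ≈ 7.81e-03  (fp32-like range, coarser precision)
```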
When you select the bf16 or fp16 format, the model is trained on a mixture of 32-bit and 16-bit data. However, you can train the model exclusively on 16-bit data by enabling the Full fp16 Training or Full bf16 Training setting, depending on the selected format.
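For intuition, here is a minimal plain-PyTorch sketch of the difference (it is not the trainer's actual code path): mixed precision keeps fp32 master weights and only autocasts the computation, whereas full 16-bit training stores the weights themselves in the 16-bit format.

```python
import torch

model = torch.nn.Linear(1024, 1024).cuda()
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-4)
x = torch.randn(8, 1024, device="cuda")

# Mixed precision: weights stay fp32, the forward pass runs in 16-bit.
# fp16 needs a GradScaler to avoid gradient underflow; bf16 usually does not.
scaler = torch.cuda.amp.GradScaler()
with torch.autocast(device_type="cuda", dtype=torch.float16):
    loss = model(x).pow(2).mean()
scaler.scale(loss).backward()
scaler.step(optimizer)
scaler.update()

# "Full" 16-bit training: weights, activations, and gradients are all 16-bit.
model16 = torch.nn.Linear(1024, 1024).cuda().half()
loss16 = model16(x.half()).pow(2).mean()
loss16.backward()
```

This also hints at why pure fp16 training tends to be fragile: without fp32 master weights, fp16's narrow range makes gradients prone to underflow or overflow, while bf16's fp32-like range tolerates it better.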
So, the compared values are:

- bf16;
- fp16;
- fp32;
- fp16 + full fp16;
- bf16 + full bf16.
DLR(step)
The very first graph shows right away that in the case of fp16 + full fp16 the model was not trained at all, while in all other cases the DLR is approximately the same.
Loss(epoch)
Despite this, the loss graph of this model looks quite clean, which is strange. The remaining four curves merge into the single lower graph.
This parameter also affects VRAM and model training time:
- bf16 - 8.4 GB, 17 min;
- fp16 - 8.4 GB, 16 min;
- fp32 - 11.2 GB, 42 min;
- fp16 + full fp16 - 7.3 GB, 15 min;
- bf16 + full bf16 - 7.3 GB, 16 min.
Also, when choosing fp32, the model file size is 2 times larger than in the other cases.
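That factor is simple arithmetic: fp32 stores 4 bytes per weight versus 2 bytes for fp16/bf16. A quick sketch (using a hypothetical stand-in module, not the actual LoRA save code) makes the point:

```python
import torch

lora = torch.nn.Linear(1024, 1024)   # hypothetical stand-in for a trained module
state_fp32 = lora.state_dict()
state_fp16 = {k: v.to(torch.float16) for k, v in state_fp32.items()}

def size_bytes(sd):
    return sum(t.numel() * t.element_size() for t in sd.values())

print(size_bytes(state_fp32) / size_bytes(state_fp16))   # -> 2.0

torch.save(state_fp16, "lora_fp16.pt")   # roughly half the size on disk
```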
Since in the case of fp16 + full fp16 the model was not trained, I excluded it from the last grids.
The results in all cases are almost identical. As you can see, fp32 is no different from fp16, while significantly increasing VRAM consumption and training time.
If your GPU supports bf16, use it; if not, fp16 is no worse. You can turn on Full bf16 Training if you have a VRAM bottleneck. However, if you have enough VRAM, it is safer to leave it turned off, although it had almost no effect on the results in this case.
Next - Model Training ‐ Comparison - [Number of CPU Threads per Core]