
Model Training ‐ Comparison ‐ [Resolution]

Nikita K edited this page Sep 29, 2023 · 5 revisions

Models | Logs | Graphs | Configs


So, we can train the model at different resolutions. Model resolution affects the minimum resolution of the training images.


Compared values:

  • 512x512,

  • 768x768,

  • 1024x1024,

  • 1280x1280.


And let's immediately look at one nuance that will arise with this parameter, as well as with some others.

Loss(epoch)

If we look at the loss(epoch) graphs, we will see that the number of epochs changes depending on the resolution. But we didn't change it, did we? The answer is Batch Size and Buckets.

A Bucket is a group of images with the same resolution.

The principle of how they work is well described in the Kohya GUI documentation:

Each training image will be sorted into separate buckets by {Bucket Resolution Steps} (default is 64) pixels according to their size. This sorting is done for each vertical and horizontal.

If the image size does not fit the specified size of the bucket, the protruding part will be cut off.

For example, if the maximum resolution is 512 pixels and the bucket step size is 64 pixels, the buckets will be 512, 448, 384, and so on. A 500-pixel image will be put into the 448-pixel bucket, with the extra 52 pixels clipped.
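The clipping rule from the quote above can be sketched in a few lines. This is an illustrative reimplementation of the described behavior, not the actual kohya-ss code:

```python
def bucket_side(image_side: int, max_resolution: int = 512, step: int = 64) -> int:
    """Round an image dimension down to the nearest bucket size.

    Larger images are capped at the maximum resolution; anything that
    doesn't land exactly on a bucket boundary is rounded down, and the
    remainder is what gets clipped off the image.
    """
    capped = min(image_side, max_resolution)
    return (capped // step) * step

# The 500-pixel example from the quote:
side = bucket_side(500)
print(side, 500 - side)  # 448 52 — the image goes into the 448 bucket, 52 px clipped
```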

Batch size is the number of images processed in parallel.

Regardless of the chosen resolution, all larger images are downscaled to it, while smaller images are either upscaled to it (if the corresponding setting is enabled) or, when cropped, fall into the group of images with lower resolution according to the step size.

Now, let's break it down with our example. All images in the dataset are initially cropped to a 1:1 aspect ratio, and the Batch Size is set to 3. The minimum image resolution in the dataset is 862x862. Essentially, this is the maximum model resolution we can choose to ensure all images fall into one group. So, when we select 512x512 and 768x768 resolutions, the number of epochs remains unchanged because all the images fit into one corresponding group with larger resolution. However, when we choose a resolution larger than the resolution of the smallest image in the dataset, such as 1024x1024 and 1280x1280 in our case, all the images with resolutions smaller than the selected one start to split into groups. In general, the more of these groups there are, the worse the training becomes.
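The splitting effect described above can be sketched as follows. The image sizes and the helper below are illustrative (only the 862-pixel minimum comes from the wiki's dataset), assuming 1:1 images and the bucketing rules quoted earlier:

```python
def assign_bucket(side: int, train_res: int, step: int = 64, upscale: bool = False) -> int:
    """Return the bucket an image of the given side lands in."""
    if side >= train_res:
        return train_res          # larger images are downscaled to the training resolution
    if upscale:
        return train_res          # smaller images are upscaled if upscaling is enabled
    return (side // step) * step  # otherwise they drop into a smaller bucket

# Hypothetical 1:1 image sides; the smallest image is 862x862 as in the text
sizes = [862, 900, 1024, 1500]

for res in (512, 768, 1024, 1280):
    buckets = {assign_bucket(s, res) for s in sizes}
    print(res, sorted(buckets))

# 512  [512]                    <- everything fits in one bucket
# 768  [768]                    <- still one bucket
# 1024 [832, 896, 1024]         <- images below 1024 split into smaller buckets
# 1280 [832, 896, 1024, 1280]   <- even more groups
```

As soon as the training resolution exceeds the smallest image, the dataset fragments into multiple buckets, which is exactly the effect seen on the loss(epoch) graphs.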

And this is how Batch Size is involved:

Note that we always load images from the same bucket for each batch, so having too few images in a bucket will unintentionally reduce the number of batches.

Since all the images read at the same time for each batch must be the same size, if the sizes of the training images are different, the number of images that are processed simultaneously may be less than the specified number of batches.

So, if the number of images in a group is not a multiple of the Batch Size, the effective batch size is reduced. This reduction leads to a decrease in the number of epochs, even though the total number of steps remains the same. A large number of image groups introduces instability into the training process. Batch Size, as well as the Aspect Ratio of the training images, will be discussed in more detail a bit further on.
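Here is a rough sketch of why the epoch count shrinks when a fixed step budget is spread over more buckets. Since each batch is drawn from a single bucket, a bucket with n images contributes ceil(n / batch_size) steps per epoch; the image counts below are illustrative, not the wiki's actual dataset:

```python
import math

def steps_per_epoch(bucket_sizes: list[int], batch_size: int) -> int:
    """Steps needed to see every image once, batching within each bucket."""
    return sum(math.ceil(n / batch_size) for n in bucket_sizes)

batch_size = 3
total_steps = 300  # fixed step budget

one_bucket = [30]            # all 30 images land in a single bucket
split_buckets = [13, 9, 8]   # the same 30 images split across three buckets

for buckets in (one_bucket, split_buckets):
    per_epoch = steps_per_epoch(buckets, batch_size)
    print(per_epoch, total_steps // per_epoch)

# 10 30  <- one bucket: 10 steps/epoch, 30 epochs
# 11 27  <- split buckets: padded partial batches cost extra steps, so fewer epochs
```

The partial batches at the end of each bucket (13 and 8 are not multiples of 3) waste steps, so the same step budget covers fewer full passes over the data.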


DLR(step)

With GR = 1.02, as the resolution increases, DLR also slightly increases. However, with GR = ∞ the logic changes, and for 1280x1280, DLR is lower than for 1024x1024, which is unusual. Perhaps in this case, the Buckets and Batch Size played a role.


Loss(step)

The loss increases with resolution. Since SD1.5 was originally trained on 512x512 images, and some checkpoints based on it were fine-tuned on 768x768, those resolutions achieve the lowest loss.

Additionally, resolution is the first among the parameters we've discussed that directly impacts training time and VRAM:

  • 512x512 - 11 min, 8.6 GB;

  • 768x768 - 17 min, 9.7 GB;

  • 1024x1024 - 25 min, 10.9 GB;

  • 1280x1280 - 39 min, 13.1 GB.



The grids clearly show that the checkpoint has a significant impact on the results. On EpicRealism, higher-resolution models are much less stable and lower in quality than on DreamShaper. Some results on DS at 1024x1024 and 1280x1280 are notably better than those at lower resolutions. With ER, however, the difference is not as clear: the 1280x1280 model often produces deformations, and the 512x512 results appear less detailed and of lower quality.


The question arises: "If we've trained the model at a higher resolution, should we generate the images at a higher resolution as well?" Let's try increasing the resolution to 1024x1024.

So, even models trained at a higher resolution don't guarantee good results when generating at a higher resolution. In the case of DS, you can still notice some improvement in quality, and the percentage of deformations is not very high. With ER, however, all images generated at high resolution immediately get worse. In other words, we introduce even more randomness into the result, depending on the specific checkpoint.


CONCLUSION

Here's a summary:

  1. 512x512 occasionally suffers from quality issues and deformations, so it's recommended only if you're limited by VRAM.

  2. 768x768 produces the most stable and high-quality results, both at 768x768 and 1024x1024.

  3. 1024x1024 may improve results on some checkpoints while increasing the chance of deformations but not too high. You can train models at this resolution in addition to 768x768, but it's unlikely to replace it.

  4. 1280x1280 significantly increases the deformation rate without substantial improvements in results. Given the vastly increased training time and VRAM consumption, there's no practical reason to train models at this resolution.

I think it is better to leave high resolutions to SDXL.


Next - Model Training ‐ Comparison - [Aspect Ratio]
