SDNQ Quantization

SD.Next Quantization (SDNQ) provides cross-platform quantization that reduces memory usage and can increase performance on any device.
SDNQ was originally based on NNCF, but SD.Next has since expanded, optimized and re-implemented all of it, so it is now its own quantization method.

Usage

Go into Quantization Settings
Enable the desired Quantization options under the SDNQ menu
(Model, Transformer, TE and LLM are the main targets for most use cases)
Reload the model

Note: VAE Upcast has to be set to false if you use the VAE option.
If you get black images with SDXL models, use the FP16 Fixed VAE.

Features

Supports int8, uint8, int4, uint4, float8_e4m3fn and float8_e5m2 quantization schemes
- int8 is very close to the original 16 bit quality
Supports compute optimizations using Triton via torch.compile
Supports quantized MatMul with significant speedups on GPUs with INT8 support
Supports on the fly quantization during model load with DiT models (called pre mode)
Supports quantization for the convolutional layers with UNet models
Supports post load quantization for any model
Supports on the fly usage of LoRA models
Supports balanced offload

Options

Quantization enabled

Used to decide which parts of the model will get quantized.
Recommended options are Model and TE with post mode, or Transformer, TE and LLM with pre mode.
Default is none.

Model is used to quantize the UNet on post mode or every model part on pre mode.
Transformer is used to quantize the DiT models.
VAE is used to quantize the VAE. Using the VAE option is not recommended.
TE is used to quantize the Text Encoders.
Video is used to quantize the Video models.
LLM is used to quantize the LLM part of models that use LLMs as Text Encoders.
ControlNet is used to quantize ControlNets.

Quantization mode

Used to decide when the quantization step will happen on model load.
Pre mode will quantize the model while the model is loading. Reduces system RAM usage.
Post mode will quantize the model after the model is loaded into system RAM.
Pre mode is compatible with DiT and Video models like Flux, but older UNet models like SDXL are only compatible with post mode.
Default is pre.

Quantization type

Used to decide the data type used to store the model weights.
Recommended types are int8 for 8 bit, float8_e4m3fn for fp8 and uint4 for 4 bit.
Default is int8.

INT8 and FP8 quants have very similar quality to full 16 bit precision while using half the memory.
INT4 quants have lower quality and lower performance but use a quarter of the memory.
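
As a rough illustration of these ratios, the weight footprint can be estimated from the parameter count and the bits stored per weight. This is only a sketch: the helper name and the 1 billion parameter model are assumptions for illustration, and real usage also includes scales, zero points and any layers left unquantized.

```python
# Bits per stored weight for each quantization type listed above.
BITS_PER_WEIGHT = {
    "int8": 8, "uint8": 8,
    "float8_e4m3fn": 8, "float8_e5m2": 8,
    "int4": 4, "uint4": 4,
}

def estimate_weight_mb(num_params: int, bits: int) -> float:
    """Approximate weight storage in MiB, ignoring scales and zero points."""
    return num_params * bits / 8 / (1024 ** 2)

num_params = 1_000_000_000  # hypothetical 1B-parameter model
base = estimate_weight_mb(num_params, 16)
for name, bits in BITS_PER_WEIGHT.items():
    size = estimate_weight_mb(num_params, bits)
    print(f"{name}: {size:.0f} MiB ({base / size:.0f}x smaller than 16 bit)")
```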

Unsigned quants have an extra u at the start of their name, while symmetric quants have no prefix.
Unsigned (asymmetric) types: uint8 and uint4
Symmetric types: int8, int4, float8_e4m3fn and float8_e5m2

Unsigned quants use unsigned integers, meaning they can't store negative values and use another variable called zero point for this purpose.
Symmetric quants can store both negative and positive values, so they don't need the extra zero point value and run faster than unsigned quants because of this.
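
The difference between the two schemes can be sketched in plain Python. This is a generic illustration, not SDNQ's actual implementation: the function names and the choice of mapping the smallest value to 0 are assumptions.

```python
def quantize_symmetric(values, qmax=127):
    """Symmetric int8: q = round(x / scale), no zero point needed."""
    scale = max(abs(v) for v in values) / qmax or 1.0  # avoid a zero scale
    q = [max(-qmax - 1, min(qmax, round(v / scale))) for v in values]
    return q, scale

def dequantize_symmetric(q, scale):
    return [x * scale for x in q]

def quantize_asymmetric(values, qmax=255):
    """Asymmetric uint8: q = round((x - zero_point) / scale)."""
    lo, hi = min(values), max(values)
    scale = (hi - lo) / qmax or 1.0  # avoid a zero scale
    zero_point = lo  # assumption: the smallest value maps to 0
    q = [max(0, min(qmax, round((v - zero_point) / scale))) for v in values]
    return q, scale, zero_point

def dequantize_asymmetric(q, scale, zero_point):
    # The extra addition is the cost of storing only non-negative values.
    return [x * scale + zero_point for x in q]

weights = [-0.5, -0.1, 0.0, 0.25, 1.0]
q_sym, s_sym = quantize_symmetric(weights)
q_asym, s_asym, zp = quantize_asymmetric(weights)
print(q_sym, dequantize_symmetric(q_sym, s_sym))
print(q_asym, dequantize_asymmetric(q_asym, s_asym, zp))
```

The asymmetric path carries one more value per scale and one more operation on dequantization, which is in line with the note above that symmetric quants run faster.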

int8 is stored as int8 and has a range of -128 to 127.
uint8 is stored as uint8 and has a range of 0 to 255.
int4 is stored as two int4 values packed into a single uint8 value and has a range of -8 to 7.
uint4 is stored as two uint4 values packed into a single uint8 value and has a range of 0 to 15.
float8_e4m3fn is stored as float8_e4m3fn and has a range of -448 to 448.
float8_e5m2 is stored as float8_e5m2 and has a range of -57344 to 57344.
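
Packing two 4 bit values into one uint8 can be sketched as follows. The nibble order (low nibble first) and the function names are assumptions made for this example. SDNQ's actual layout may differ.

```python
def pack_uint4(values):
    """Pack pairs of 4 bit values (0..15) into bytes, low nibble first."""
    if len(values) % 2:
        raise ValueError("need an even number of values")
    return bytes(
        (values[i] & 0x0F) | ((values[i + 1] & 0x0F) << 4)
        for i in range(0, len(values), 2)
    )

def unpack_uint4(packed):
    out = []
    for byte in packed:
        out.append(byte & 0x0F)  # low nibble
        out.append(byte >> 4)    # high nibble
    return out

def pack_int4(values):
    """Signed values (-8..7) are stored as their two's complement nibble."""
    return pack_uint4([v & 0x0F for v in values])

def unpack_int4(packed):
    # Nibbles 8..15 represent -8..-1.
    return [n - 16 if n >= 8 else n for n in unpack_uint4(packed)]

packed = pack_int4([-8, 7, -1, 0])
print(len(packed), unpack_int4(packed))
```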

Group size

Used to decide how many elements of a tensor will share the same quantization group.
Higher values give better performance but lower quality.
Default is 0, meaning it will decide the group size based on your quantization type setting.
INT4 quants will use group size 64 by default.
INT8 and FP8 quants won't use any grouping by default.
Setting the group size to -1 will disable grouping.
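
To show why smaller groups preserve more quality, here is a minimal sketch of group-wise symmetric 4 bit quantization with one scale per group. The function names are hypothetical, and for simplicity this sketch treats any non-positive group size as "no grouping", whereas in SDNQ a value of 0 selects a default based on the quantization type.

```python
def quantize_grouped(row, group_size, qmax=7):
    """Symmetric 4 bit quantization with one scale per group of elements."""
    if group_size <= 0:
        group_size = len(row)  # no grouping: one scale for the whole row
    quantized, scales = [], []
    for start in range(0, len(row), group_size):
        group = row[start:start + group_size]
        scale = max(abs(v) for v in group) / qmax or 1.0  # avoid a zero scale
        scales.append(scale)
        quantized.extend(max(-qmax - 1, min(qmax, round(v / scale))) for v in group)
    return quantized, scales

def dequantize_grouped(quantized, scales, group_size):
    if group_size <= 0:
        group_size = len(quantized)
    return [q * scales[i // group_size] for i, q in enumerate(quantized)]

def mean_error(row, group_size):
    q, s = quantize_grouped(row, group_size)
    restored = dequantize_grouped(q, s, group_size)
    return sum(abs(a - b) for a, b in zip(row, restored)) / len(row)

# One large outlier forces a coarse scale on everything sharing its group.
row = [0.01, -0.02, 0.03, 0.015, 4.0, -0.01, 0.02, -0.03]
print(mean_error(row, group_size=4))   # the outlier only affects its own group
print(mean_error(row, group_size=-1))  # the outlier affects the whole row
```

Smaller groups store more scales and add more work on decompression, which matches the performance and quality trade-off described above.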

Quantize the convolutional layers

Enabling this option will quantize the convolutional layers in UNet models too.
Has better memory savings but lower quality.
Quantizing the VAE is not recommended with this option.
Using 4 bit quants is not recommended with this option.
Group sizes are not supported on convolutions.
Disabled by default.

Decompress using full precision

Enabling this option will use FP32 on the decompression step.
Has higher quality outputs but lower performance.
Disabled by default.

Decompress using torch.compile

Uses Triton via torch.compile on the decompression step.
Has significantly higher performance.
This setting requires a full restart of the webui to apply.
Enabled by default if Triton is available.

Use quantized MatMul

Enabling this option will use direct INT8 MatMul instead of BF16 / FP16.
Has significantly higher performance on GPUs with INT8 support but has lower quality.
Direct INT8 MatMul is only compatible with int8 and int4 quants.
Group sizes will be disabled when direct INT8 MatMul is enabled.
Convolutions won't use direct INT8 MatMul.
Disabled by default.
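
The general idea behind an integer MatMul can be sketched as below: both operands are quantized to int8, the products are accumulated as integers, and the result is rescaled once at the end. This is a generic illustration in plain Python, not SDNQ's kernel. Quantizing the activations with one scale per row, and all function names, are assumptions.

```python
def quantize_rows(matrix, qmax=127):
    """Symmetric int8 quantization with one scale per row."""
    quantized, scales = [], []
    for row in matrix:
        scale = max(abs(v) for v in row) / qmax or 1.0  # avoid a zero scale
        scales.append(scale)
        quantized.append([round(v / scale) for v in row])
    return quantized, scales

def int8_matmul(x, w):
    """Compute x @ w.T with integer multiply-accumulate, then rescale."""
    xq, x_scales = quantize_rows(x)  # activations: one scale per input row
    wq, w_scales = quantize_rows(w)  # weights: one scale per output feature
    out = []
    for xi, xs in zip(xq, x_scales):
        out_row = []
        for wj, ws in zip(wq, w_scales):
            acc = sum(a * b for a, b in zip(xi, wj))  # integer accumulation
            out_row.append(acc * xs * ws)             # single rescale at the end
        out.append(out_row)
    return out

def float_matmul(x, w):
    return [[sum(a * b for a, b in zip(xi, wj)) for wj in w] for xi in x]

x = [[0.5, -1.0, 0.25], [1.5, 0.75, -0.5]]
w = [[0.2, -0.4, 0.1], [-0.3, 0.6, 0.9]]
print(int8_matmul(x, w))
print(float_matmul(x, w))
```

Because the scale is applied only after the whole row has been accumulated, a scheme like this needs one scale per row rather than per group, and the extra rounding of the activations is one source of the quality loss mentioned above.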

Quantize with the GPU

Enabling this option will use the GPU for the quantization calculations on model load.
Can be faster with weak CPUs but can also be slower because of GPU to CPU communication overhead.

When Model load device map in the Models & Loading settings is set to default or cpu, this option will send a part of the model weights to the GPU, quantize it, and then send it back to the CPU right away.
If device map is set to gpu, model weights will be loaded directly into the GPU and the quantized model weights will be kept in the GPU until the quantization of the current model part is over.

If Model offload mode is set to none, quantized model weights will be sent to the GPU regardless of this setting and will stay in the GPU.
If Model offload mode is set to model, quantized model weights will be sent to the GPU regardless of this setting and will be sent back to the CPU after the quantization of the current model part is over.

Memory usage results

These results compare SDNQ int8 to 16 bit.
For performance results, please check out the benchmarks on the Quantization Wiki.

Model:
Compresses the UNet or Transformer part of the model.
This is where most of the memory savings happen for Stable Diffusion.

SDXL: ~2500 MB memory savings.
SD 1.5: ~750 MB memory savings.
PixArt-XL-2: ~600 MB memory savings.

Text Encoder:
Compresses the Text Encoder parts of the model.
This is where most of the memory savings happen for PixArt.

PixArt-XL-2: ~4750 MB memory savings.
SDXL: ~750 MB memory savings.
SD 1.5: ~120 MB memory savings.

VAE:
Compresses the VAE part of the model.
Memory savings from compressing the VAE are pretty small.

SD 1.5 / SDXL / PixArt-XL-2: ~75 MB memory savings.
