Open
Description
Feature request
Currently, bnb only supports block sizes of 64 and above. It would be great if it could also support a block size of 32, as llama.cpp does.
Motivation
A smaller block size gives better output quality after quantization.
Your contribution
It seems this change will not be straightforward: each CUDA warp expects 32 threads, so with a block size of 32 the current implementation would end up with 1 element per thread, which prevents the implementation from packing two 4-bit quants into a single 8-bit value. I currently don't know how to solve this.
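To illustrate the packing constraint: a minimal sketch (in Python, purely for illustration; `pack_nibbles`/`unpack_nibbles` are hypothetical names, not bnb APIs) of how two 4-bit quantization codes occupy one byte. In the current layout each thread quantizes two adjacent elements and writes one packed byte; with a block size of 32, a thread would hold only one 4-bit code and could not emit a full byte on its own without coordinating with a neighboring thread.

```python
# Illustration only: two 4-bit codes share a single byte in 4-bit
# quantized storage. Function names here are hypothetical, not bnb APIs.
def pack_nibbles(hi: int, lo: int) -> int:
    """Pack two 4-bit codes (each in 0..15) into one byte."""
    assert 0 <= hi < 16 and 0 <= lo < 16
    return (hi << 4) | lo

def unpack_nibbles(b: int) -> tuple[int, int]:
    """Recover the two 4-bit codes from a packed byte."""
    return (b >> 4) & 0xF, b & 0xF

packed = pack_nibbles(0xA, 0x3)
print(packed)                   # 163
print(unpack_nibbles(packed))   # (10, 3)
```

Because the byte is the smallest addressable unit, a thread holding a single nibble cannot store its result independently; this is why the one-element-per-thread mapping conflicts with the packed 4-bit format.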