
Support block size of 32 #986

Open
@puyuanOT

Description


Feature request

Currently, bnb only supports block sizes of 64 and above. It would be great if it could also support a block size of 32, like llama.cpp does.

Motivation

A smaller block size means finer-grained scaling, which gives better output quality after quantization.

Your contribution

It seems this change will not be straightforward: each CUDA warp expects 32 threads, so with a block size of 32 the current implementation ends up with one element per thread, which prevents it from packing two 4-bit quants into a single 8-bit value. I currently don't know how to solve this.
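One possible way around the one-element-per-thread issue might be to keep one element per lane but let adjacent lanes exchange their 4-bit codes with a warp shuffle, so that only the even lanes write the packed byte. Below is a minimal sketch (not the actual bitsandbytes kernel), assuming the input length n is a multiple of 32 (so every warp is either fully active or fully idle), a per-block absmax computed elsewhere, and a hypothetical quantize_to_4bit() helper standing in for the NF4/FP4 codebook lookup:

```cuda
#include <cuda_runtime.h>
#include <stdint.h>

// Stand-in for the real NF4/FP4 codebook lookup (hypothetical helper).
__device__ __forceinline__ uint8_t quantize_to_4bit(float v)
{
    // v is already scaled into [-1, 1]; map it to a signed 4-bit integer.
    int q = __float2int_rn(v * 7.0f);
    q = max(-8, min(7, q));
    return (uint8_t)(q & 0x0F);
}

// One input value per lane, quantization block size of 32 (one block per warp).
// Even lanes pull the odd lane's 4-bit code via a warp shuffle and store both
// nibbles in a single byte, so the output stays packed at 2 values per byte.
__global__ void quantize_blockwise_32(const float* __restrict__ in,
                                      uint8_t* __restrict__ out,
                                      const float* __restrict__ absmax,
                                      int n)
{
    int idx = blockIdx.x * blockDim.x + threadIdx.x;
    if (idx >= n) return;                  // assumes n is a multiple of 32

    int qblock = idx / 32;                 // which 32-element quantization block
    float scale = absmax[qblock];          // per-block absmax computed elsewhere
    uint8_t code = quantize_to_4bit(in[idx] / scale);

    // Pull the 4-bit code from the next lane; only even lanes use the result,
    // so the undefined value at lane 31 never gets written.
    uint8_t neighbor = (uint8_t)__shfl_down_sync(0xFFFFFFFFu, (int)code, 1);

    if ((threadIdx.x & 1) == 0)            // even lanes write the packed byte
        out[idx / 2] = (uint8_t)((code << 4) | neighbor);
}
```

An alternative might be to keep two elements per thread and use only 16 active lanes per 32-element block, which avoids the shuffle but wastes half the warp. I'm not sure which fits better with how the existing kernels are structured.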
