
[Triton/XPU] Support 4bit dequantization logic on Triton #1629

Open
Devjiu wants to merge 7 commits into main from dmitriim/add_xpu_triton_kernel

Conversation

@Devjiu commented May 8, 2025

This PR adds an XPU backend and a Triton kernel for dequantizing the NF4 dtype.
Triton is used as an optional import.
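
For context, NF4 dequantization is a 16-entry codebook lookup scaled by a per-block absmax. Below is a minimal PyTorch sketch of the semantics only, not the PR's Triton kernel; the function name, nibble order, and shape handling are illustrative assumptions (the code values are the standard NF4 table from the QLoRA paper, as used in bitsandbytes):

```python
import torch

# The 16 NF4 code values (normal-float quantiles from the QLoRA paper).
NF4_CODE = torch.tensor([
    -1.0, -0.6961928009986877, -0.5250730514526367, -0.39491748809814453,
    -0.28444138169288635, -0.18477343022823334, -0.09105003625154495, 0.0,
    0.07958029955625534, 0.16093020141124725, 0.24611230194568634,
    0.33791524171829224, 0.44070982933044434, 0.5626170039176941,
    0.7229568362236023, 1.0,
])

def dequantize_nf4_reference(packed: torch.Tensor, absmax: torch.Tensor,
                             blocksize: int) -> torch.Tensor:
    """Reference semantics: unpack two 4-bit indices per byte, look them up
    in the codebook, and scale each block by its absmax. Assumes the number
    of unpacked elements is a multiple of blocksize and that the first
    element of each byte sits in the high nibble."""
    hi = (packed >> 4).long()
    lo = (packed & 0x0F).long()
    idx = torch.stack((hi, lo), dim=-1).flatten()  # interleave hi/lo nibbles
    values = NF4_CODE.to(packed.device)[idx]
    return values.view(-1, blocksize) * absmax.view(-1, 1)
```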

Tests:

  • tests/test_functional.py::TestQuantize4BitFunctional covers the supported nf4/fp4 cases
  • tests/test_functional.py::Test8BitBlockwiseQuantizeFunctional exercises quantize_blockwise, implemented with a binary search that runs faster on XPU (see the sketch after this list)
  • tests/test_linear4bit.py
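
The binary-search idea can be sketched in plain PyTorch: each block is scaled by its absmax, and every value is mapped to the nearest entry of a sorted quantization code via torch.searchsorted (a vectorized binary search) instead of a full distance scan over all 256 levels. This is an illustrative sketch, not the PR's Triton kernel; names and padding behavior are assumptions:

```python
import torch

def quantize_blockwise_binary_search(x: torch.Tensor, code: torch.Tensor,
                                     blocksize: int = 256):
    """Blockwise 8-bit quantization sketch. `code` is a sorted 1-D tensor of
    256 quantization levels. Returns uint8 code indices and per-block absmax."""
    flat = x.flatten()
    pad = (-flat.numel()) % blocksize
    flat = torch.nn.functional.pad(flat, (0, pad))   # pad tail block with zeros
    blocks = flat.view(-1, blocksize)
    absmax = blocks.abs().amax(dim=1, keepdim=True).clamp(min=1e-12)
    scaled = (blocks / absmax).flatten()             # values now in [-1, 1]

    # searchsorted gives the insertion point; compare the two neighboring
    # code entries to pick the nearest one.
    idx = torch.searchsorted(code, scaled).clamp(1, code.numel() - 1)
    left, right = code[idx - 1], code[idx]
    nearest = torch.where(scaled - left <= right - scaled, idx - 1, idx)
    return nearest.to(torch.uint8).view(blocks.shape), absmax.squeeze(1)
```

A kernel version does the same per block, with the searchsorted step replaced by a binary search over the code table inside the kernel.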

Contains a workaround (WA) for gemv_4bit on XPU; for an as-yet-unknown reason, the directly passed code causes errors in several tests.
For example: `tests/test_functional.py::TestQuantize4BitFunctional::test_gemv_4bit[dim=128-uint8-fp32-fc1-fp4-DQ_True-xpu]`

Signed-off-by: Dmitrii Makarenko dmitrii.makarenko@intel.com

@Devjiu force-pushed the dmitriim/add_xpu_triton_kernel branch 5 times, most recently from a1faeb4 to 679cedc on May 14, 2025 16:49
@Devjiu (Author) commented May 14, 2025

```
BNB_TEST_DEVICE="xpu" pytest -s tests/test_linear4bit.py
88 passed in 11.91s

BNB_TEST_DEVICE="xpu" pytest -s tests/test_functional.py
953 passed, 170 skipped, 9 deselected, 37 warnings in 235.89s (0:03:55)
```

@Devjiu force-pushed the dmitriim/add_xpu_triton_kernel branch from 679cedc to ea15027 on May 14, 2025 16:56
@Devjiu marked this pull request as ready for review on May 14, 2025 16:59
@Devjiu changed the title from "[xpu/triton] Add trtion dequantization kernel" to "[Triton/XPU] Support 4bit dequantization logic on Triton" on May 14, 2025
@jiqing-feng (Contributor) commented

Thanks for your contribution, but this PR seems to conflict with bitsandbytes-intel. We might need further discussion to determine the priority.

@Devjiu (Author) commented May 15, 2025

> Thanks for your contribution, but this PR seems to conflict with bitsandbytes-intel. We might need further discussion to determine the priority.

Roughly speaking, this is not a conflict. It is a different implementation that can be used depending on the availability of IPEX.

@Egor-Krivov (Contributor) commented

> Thanks for your contribution, but this PR seems to conflict with bitsandbytes-intel. We might need further discussion to determine the priority.

Could you clarify the nature of the conflict? This PR provides a 4-bit implementation for users who just install bitsandbytes without any additional plugins or libraries like IPEX or bitsandbytes-intel. For example, installing PEFT will pull in only bitsandbytes.

With the current implementation, if the user additionally installs bitsandbytes-intel, it should simply replace the kernels defined in the main repo.

@jiqing-feng (Contributor) commented May 16, 2025

When @matthewdouglas says we'd like to enable CPU without the IPEX path, that's because non-Intel CPUs do not support IPEX. But XPU is an Intel-specific device, so all XPUs support IPEX. We'd better install IPEX on XPU by default so we can get a significant speed-up.

More specifically, not all ops on XPU have an IPEX optimization. I can see that most of the ops in this PR duplicate those in my PR (since they were the same as the CPU implementation, I was wondering whether we could just move them to the default ops). So the design is a little confusing to me. Should we keep both repos implementing XPU ops?

Anyway, the PEFT example is a good point. Let's sync on it offline. I would like to hear your opinion. :)

@yao-matrix commented

Since Triton is platform agnostic, is it possible to upstream your ops to the bitsandbytes/triton folder?

@Devjiu (Author) commented May 16, 2025

> Since Triton is platform agnostic, is it possible to upstream your ops to the bitsandbytes/triton folder?

@matthewdouglas Please take a look.
@yao-matrix For my approach I got approval to use Triton for XPU, but maybe we can share the code base. As you know, though, in Triton different hardware requires slightly different kernels to be efficient, so ultimately it is not completely platform agnostic.
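
To make that last point concrete: a Triton kernel body can be shared across backends, while the profitable block sizes and warp counts differ per device and are usually expressed as autotuning configs. A generic, illustrative sketch (the kernel and config values below are not from this PR):

```python
import triton
import triton.language as tl

@triton.autotune(
    configs=[
        # Illustrative configs only: the efficient block size and warp
        # count differ between GPUs and Intel XPUs.
        triton.Config({"BLOCK": 256}, num_warps=4),
        triton.Config({"BLOCK": 512}, num_warps=8),
        triton.Config({"BLOCK": 1024}, num_warps=16),
    ],
    key=["n_elements"],
)
@triton.jit
def copy_kernel(src_ptr, dst_ptr, n_elements, BLOCK: tl.constexpr):
    # One program copies one BLOCK-sized tile; masking handles the tail.
    pid = tl.program_id(axis=0)
    offsets = pid * BLOCK + tl.arange(0, BLOCK)
    mask = offsets < n_elements
    tl.store(dst_ptr + offsets, tl.load(src_ptr + offsets, mask=mask), mask=mask)
```

Only the config list would need to be adapted per backend, which is why sharing the code base is plausible even if the tuning is not.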

@Devjiu force-pushed the dmitriim/add_xpu_triton_kernel branch from ea15027 to fbb2d00 on May 16, 2025 13:38

The docs for this PR live here. All of your documentation changes will be reflected on that endpoint. The docs are available until 30 days after the last update.

@matthewdouglas added this to the v0.47.0 milestone on May 20, 2025
@Devjiu (Author) commented May 22, 2025

Local test run on PVC:

```
BNB_TEST_DEVICE="xpu" pytest -rf --ignore test_optim.py --ignore test_triton.py --ignore test_cuda_setup_evaluator.py
2196 passed, 1555 skipped, 178 deselected, 33 xfailed, 189 warnings in 357.17s (0:05:57)
```

@yao-matrix commented

@matthewdouglas, could you please take a look at it? The background is: we'd like to contribute Triton ops to bnb and make XPU support the bnb Triton backend. Thanks very much.
