[Triton/XPU] Support 4bit dequantization logic on Triton #1629
Conversation
Force-pushed from a1faeb4 to 679cedc.
BNB_TEST_DEVICE="xpu" pytest -s tests/test_linear4bit.py
88 passed in 11.91s
BNB_TEST_DEVICE="xpu" pytest -s tests/test_functional.py
953 passed, 170 skipped, 9 deselected, 37 warnings in 235.89s (0:03:55)
Force-pushed from 679cedc to ea15027.
Thanks for your contribution, but this PR seems to have a conflict with bitsandbytes-intel. We might need to discuss further to determine the priority.
Roughly speaking, this is not a conflict. It is a different implementation that can be used depending on the availability of ipex.
Could you clarify the nature of the conflict? This PR provides a 4bit implementation for users that just install bitsandbytes without any additional plugins or libraries like ipex. Given the current implementation, if the user additionally installs ipex, that implementation is used instead.
When @matthewdouglas says we'd like to enable CPU without the IPex path, that's because non-Intel CPUs do not support IPex. But XPU is an Intel-specific device, so all XPUs support IPex. We'd better install IPex on XPU by default so we can get a significant speed-up. More specifically, not all ops on XPU have an ipex optimization. I can see most of the ops in this PR are duplicated with my PR (as they were the same as the CPU implementation, I was thinking we could just move these ops to the default op?). So the design is a little confusing to me. Should we keep both repos to implement XPU ops? Anyway, the example of PEFT is a good point. Let's sync it offline. Would like to hear your opinion. :)
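To make the alternatives under discussion concrete, here is a hypothetical sketch of availability-based backend selection. The package probing is plain Python, but the preference order, the module names, and the `backend` variable are illustrative assumptions, not bitsandbytes' actual dispatch code.

```python
# Hypothetical sketch of availability-based backend selection; the preference
# order and the names below are illustrative, not bitsandbytes' actual code.
import importlib.util

def _available(pkg: str) -> bool:
    # True when the package can be imported in this environment
    return importlib.util.find_spec(pkg) is not None

# e.g. prefer IPEX-optimized ops on XPU, fall back to the Triton kernels,
# then to the pure-PyTorch default ops
if _available("intel_extension_for_pytorch"):
    backend = "ipex"
elif _available("triton"):
    backend = "triton"
else:
    backend = "pytorch"
```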
@matthewdouglas Please take a look. |
Force-pushed from ea15027 to fbb2d00.
The docs for this PR live here. All of your documentation changes will be reflected on that endpoint. The docs are available until 30 days after the last update.
Local test run on PVC: BNB_TEST_DEVICE="xpu" pytest -rf --ignore test_optim.py --ignore test_triton.py --ignore test_cuda_setup_evaluator.py
2196 passed, 1555 skipped, 178 deselected, 33 xfailed, 189 warnings in 357.17s (0:05:57)
@matthewdouglas, could you please take a look at it? The background is: we'd like to contribute Triton ops to bnb and make XPU support the bnb Triton backend. Thanks very much.
This PR adds an XPU backend and a Triton kernel for dequantization of the nf4 dtype.
Triton is used as an optional import.
Tests:
- tests/test_functional.py::TestQuantize4BitFunctional: supported nf4/fp4 cases
- tests/test_functional.py::Test8BitBlockwiseQuantizeFunctional: implemented quantize_blockwise with a binary search that runs faster on XPU (see the sketch below)
- tests/test_linear4bit.py

Contains a workaround (WA) for gemv_4bit on XPU; for some reason the directly passed code causes errors in several tests. For example:
`tests/test_functional.py::TestQuantize4BitFunctional::test_gemv_4bit[dim=128-uint8-fp32-fc1-fp4-DQ_True-xpu]`
Signed-off-by: Dmitrii Makarenko <dmitrii.makarenko@intel.com>
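For readers unfamiliar with the approach, below is a minimal sketch of what a Triton NF4 dequantization kernel with a guarded optional import can look like. The packing convention (high nibble first), the function names, the fp32 output dtype, and the block sizes are assumptions for illustration, not the PR's actual code.

```python
# Minimal sketch, not the PR's kernel: dequantize NF4 data packed two 4-bit
# indices per byte. Packing convention and names here are assumptions.
import torch

try:  # Triton as an optional import, as the PR describes
    import triton
    import triton.language as tl
    HAS_TRITON = True
except ImportError:
    HAS_TRITON = False

if HAS_TRITON:
    @triton.jit
    def _dequant_nf4_kernel(packed_ptr, code_ptr, absmax_ptr, out_ptr, n_packed,
                            QBLOCK: tl.constexpr, BLOCK: tl.constexpr):
        pid = tl.program_id(axis=0)
        offs = pid * BLOCK + tl.arange(0, BLOCK)   # byte offsets into packed data
        mask = offs < n_packed
        byte = tl.load(packed_ptr + offs, mask=mask, other=0)
        hi = ((byte >> 4) & 0xF).to(tl.int32)      # first element of the pair
        lo = (byte & 0xF).to(tl.int32)             # second element of the pair
        v_hi = tl.load(code_ptr + hi)              # 16-entry NF4 lookup table
        v_lo = tl.load(code_ptr + lo)
        # map element index -> quantization-block index for the per-block scale
        s_hi = tl.load(absmax_ptr + (offs * 2) // QBLOCK, mask=mask, other=0.0)
        s_lo = tl.load(absmax_ptr + (offs * 2 + 1) // QBLOCK, mask=mask, other=0.0)
        tl.store(out_ptr + offs * 2, v_hi * s_hi, mask=mask)
        tl.store(out_ptr + offs * 2 + 1, v_lo * s_lo, mask=mask)

def dequantize_nf4(packed, code, absmax, n, qblock=64):
    # packed: uint8, n // 2 bytes; code: 16 fp32 NF4 levels; absmax: fp32 per block
    assert HAS_TRITON, "triton is not installed"
    out = torch.empty(n, dtype=torch.float32, device=packed.device)
    grid = (triton.cdiv(packed.numel(), 1024),)
    _dequant_nf4_kernel[grid](packed, code, absmax, out, packed.numel(),
                              QBLOCK=qblock, BLOCK=1024)
    return out
```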
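The binary-search idea behind the faster quantize_blockwise mentioned in the test list can be sketched as follows. This assumes a sorted 256-entry code in [-1, 1] and an input length that is a multiple of the block size; torch.searchsorted stands in here for the in-kernel binary search, and the function name is hypothetical.

```python
# Sketch of blockwise 8-bit quantization via binary search over a sorted code;
# this is an illustration of the idea, not the PR's XPU kernel.
import torch

def quantize_blockwise_bsearch(x: torch.Tensor, code: torch.Tensor, blocksize: int = 256):
    blocks = x.reshape(-1, blocksize)                 # assumes numel % blocksize == 0
    absmax = blocks.abs().max(dim=1, keepdim=True).values
    normed = (blocks / absmax).reshape(-1)            # values now in [-1, 1]
    # binary search: index of the first code entry >= value
    idx = torch.searchsorted(code, normed).clamp(max=code.numel() - 1)
    lower = (idx - 1).clamp(min=0)
    # searchsorted returns the upper neighbor; pick whichever level is nearer
    take_lower = (code[idx] - normed).abs() > (normed - code[lower]).abs()
    idx = torch.where(take_lower, lower, idx)
    return idx.to(torch.uint8), absmax.reshape(-1)
```

Because the code is sorted, the search needs only O(log 256) comparisons per element instead of scanning all 256 levels, which is where the speed-up on XPU comes from.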