Building on Jetson AGX Xavier Development Kit fails #221

Closed · g588928812 opened this issue Mar 26, 2023 · 50 comments

@g588928812

Hi,

I am trying to build bitsandbytes on an NVIDIA Jetson AGX Xavier Development Kit, but the build fails because it cannot find emmintrin.h:

/home/g/bitsandbytes# CUDA_VERSION=114 make cuda11x_nomatmul

ENVIRONMENT
============================
CUDA_VERSION: 114
============================
NVCC path: /usr/local/cuda/bin/nvcc
GPP path: /usr/bin/g++ VERSION: g++ (Ubuntu 9.4.0-1ubuntu1~20.04.1) 9.4.0
CUDA_HOME: /usr/local/cuda
CONDA_PREFIX:
PATH: /usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin:/usr/games:/usr/local/games:/snap/bin:/usr/local/cuda/bin/
LD_LIBRARY_PATH:
============================
/usr/local/cuda/bin/nvcc -gencode arch=compute_50,code=sm_50 -gencode arch=compute_52,code=sm_52 -gencode arch=compute_60,code=sm_60 -gencode arch=compute_61,code=sm_61 -gencode arch=compute_70,code=sm_70 -gencode arch=compute_72,code=sm_72 -gencode arch=compute_75,code=sm_75 -gencode arch=compute_80,code=sm_80 -gencode arch=compute_86,code=sm_86 -gencode arch=compute_35,code=sm_35 -gencode arch=compute_37,code=sm_37 -Xcompiler '-fPIC' --use_fast_math -Xptxas=-v -dc /home/g/bitsandbytes/csrc/ops.cu /home/g/bitsandbytes/csrc/kernels.cu -I /home/g/sse2neon -I /usr/local/cuda/include -I /home/g/bitsandbytes/csrc -I /include -I /home/g/bitsandbytes/include -L /usr/local/cuda/lib64 -lcudart -lcublas -lcublasLt -lcurand -lcusparse -L /lib --output-directory /home/g/bitsandbytes/build -D NO_CUBLASLT
nvcc warning : The 'compute_35', 'compute_37', 'compute_50', 'sm_35', 'sm_37' and 'sm_50' architectures are deprecated, and may be removed in a future release (Use -Wno-deprecated-gpu-targets to suppress warning).
In file included from /home/g/bitsandbytes/include/BinSearch.h:5,
from /home/g/bitsandbytes/csrc/ops.cu:10:
/home/g/bitsandbytes/include/SIMD.h:32:10: fatal error: emmintrin.h: No such file or directory
32 | #include <emmintrin.h>
| ^~~~~~~~~~~~~
compilation terminated.
make: *** [Makefile:83: cuda11x_nomatmul] Error 1

I did a bit of research and, not really knowing what I was doing, changed SIMD.h to include sse2neon.h instead of emmintrin.h. Now the build fails again, catastrophically, with undefined built-in identifiers:

/usr/lib/gcc/aarch64-linux-gnu/9/include/arm_neon.h(38): error: identifier "__Int8x8_t" is undefined
/usr/lib/gcc/aarch64-linux-gnu/9/include/arm_neon.h(39): error: identifier "__Int16x4_t" is undefined
/usr/lib/gcc/aarch64-linux-gnu/9/include/arm_neon.h(40): error: identifier "__Int32x2_t" is undefined
/usr/lib/gcc/aarch64-linux-gnu/9/include/arm_neon.h(41): error: identifier "__Int64x1_t" is undefined
/usr/lib/gcc/aarch64-linux-gnu/9/include/arm_neon.h(42): error: identifier "__Float16x4_t" is undefined
/usr/lib/gcc/aarch64-linux-gnu/9/include/arm_neon.h(43): error: identifier "__Float32x2_t" is undefined
/usr/lib/gcc/aarch64-linux-gnu/9/include/arm_neon.h(44): error: identifier "__Poly8x8_t" is undefined
/usr/lib/gcc/aarch64-linux-gnu/9/include/arm_neon.h(45): error: identifier "__Poly16x4_t" is undefined
/usr/lib/gcc/aarch64-linux-gnu/9/include/arm_neon.h(46): error: identifier "__Uint8x8_t" is undefined
/usr/lib/gcc/aarch64-linux-gnu/9/include/arm_neon.h(47): error: identifier "__Uint16x4_t" is undefined
/usr/lib/gcc/aarch64-linux-gnu/9/include/arm_neon.h(48): error: identifier "__Uint32x2_t" is undefined
/usr/lib/gcc/aarch64-linux-gnu/9/include/arm_neon.h(49): error: identifier "__Float64x1_t" is undefined
/usr/lib/gcc/aarch64-linux-gnu/9/include/arm_neon.h(50): error: identifier "__Uint64x1_t" is undefined
/usr/lib/gcc/aarch64-linux-gnu/9/include/arm_neon.h(51): error: identifier "__Int8x16_t" is undefined
/usr/lib/gcc/aarch64-linux-gnu/9/include/arm_neon.h(52): error: identifier "__Int16x8_t" is undefined
/usr/lib/gcc/aarch64-linux-gnu/9/include/arm_neon.h(53): error: identifier "__Int32x4_t" is undefined
/usr/lib/gcc/aarch64-linux-gnu/9/include/arm_neon.h(54): error: identifier "__Int64x2_t" is undefined
/usr/lib/gcc/aarch64-linux-gnu/9/include/arm_neon.h(55): error: identifier "__Float16x8_t" is undefined
/usr/lib/gcc/aarch64-linux-gnu/9/include/arm_neon.h(56): error: identifier "__Float32x4_t" is undefined
/usr/lib/gcc/aarch64-linux-gnu/9/include/arm_neon.h(57): error: identifier "__Float64x2_t" is undefined
/usr/lib/gcc/aarch64-linux-gnu/9/include/arm_neon.h(58): error: identifier "__Poly8x16_t" is undefined
/usr/lib/gcc/aarch64-linux-gnu/9/include/arm_neon.h(59): error: identifier "__Poly16x8_t" is undefined
/usr/lib/gcc/aarch64-linux-gnu/9/include/arm_neon.h(60): error: identifier "__Poly64x2_t" is undefined
/usr/lib/gcc/aarch64-linux-gnu/9/include/arm_neon.h(61): error: identifier "__Poly64x1_t" is undefined
/usr/lib/gcc/aarch64-linux-gnu/9/include/arm_neon.h(62): error: identifier "__Uint8x16_t" is undefined
/usr/lib/gcc/aarch64-linux-gnu/9/include/arm_neon.h(63): error: identifier "__Uint16x8_t" is undefined
/usr/lib/gcc/aarch64-linux-gnu/9/include/arm_neon.h(64): error: identifier "__Uint32x4_t" is undefined
/usr/lib/gcc/aarch64-linux-gnu/9/include/arm_neon.h(65): error: identifier "__Uint64x2_t" is undefined
/usr/lib/gcc/aarch64-linux-gnu/9/include/arm_neon.h(67): error: identifier "__Poly8_t" is undefined
/usr/lib/gcc/aarch64-linux-gnu/9/include/arm_neon.h(68): error: identifier "__Poly16_t" is undefined
/usr/lib/gcc/aarch64-linux-gnu/9/include/arm_neon.h(69): error: identifier "__Poly64_t" is undefined
/usr/lib/gcc/aarch64-linux-gnu/9/include/arm_neon.h(70): error: identifier "__Poly128_t" is undefined
/usr/lib/gcc/aarch64-linux-gnu/9/include/arm_neon.h(72): error: identifier "__fp16" is undefined
/usr/lib/gcc/aarch64-linux-gnu/9/include/arm_neon.h(795): error: identifier "__builtin_aarch64_saddlv8qi" is undefined
/usr/lib/gcc/aarch64-linux-gnu/9/include/arm_neon.h(802): error: identifier "__builtin_aarch64_saddlv4hi" is undefined
/usr/lib/gcc/aarch64-linux-gnu/9/include/arm_neon.h(809): error: identifier "__builtin_aarch64_saddlv2si" is undefined
/usr/lib/gcc/aarch64-linux-gnu/9/include/arm_neon.h(816): error: identifier "__builtin_aarch64_uaddlv8qi" is undefined
/usr/lib/gcc/aarch64-linux-gnu/9/include/arm_neon.h(824): error: identifier "__builtin_aarch64_uaddlv4hi" is undefined
/usr/lib/gcc/aarch64-linux-gnu/9/include/arm_neon.h(832): error: identifier "__builtin_aarch64_uaddlv2si" is undefined
/usr/lib/gcc/aarch64-linux-gnu/9/include/arm_neon.h(840): error: identifier "__builtin_aarch64_saddl2v16qi" is undefined
/usr/lib/gcc/aarch64-linux-gnu/9/include/arm_neon.h(847): error: identifier "__builtin_aarch64_saddl2v8hi" is undefined
/usr/lib/gcc/aarch64-linux-gnu/9/include/arm_neon.h(854): error: identifier "__builtin_aarch64_saddl2v4si" is undefined
/usr/lib/gcc/aarch64-linux-gnu/9/include/arm_neon.h(861): error: identifier "__builtin_aarch64_uaddl2v16qi" is undefined
/usr/lib/gcc/aarch64-linux-gnu/9/include/arm_neon.h(869): error: identifier "__builtin_aarch64_uaddl2v8hi" is undefined
/usr/lib/gcc/aarch64-linux-gnu/9/include/arm_neon.h(877): error: identifier "__builtin_aarch64_uaddl2v4si" is undefined
/usr/lib/gcc/aarch64-linux-gnu/9/include/arm_neon.h(885): error: identifier "__builtin_aarch64_saddwv8qi" is undefined
/usr/lib/gcc/aarch64-linux-gnu/9/include/arm_neon.h(892): error: identifier "__builtin_aarch64_saddwv4hi" is undefined
/usr/lib/gcc/aarch64-linux-gnu/9/include/arm_neon.h(899): error: identifier "__builtin_aarch64_saddwv2si" is undefined
/usr/lib/gcc/aarch64-linux-gnu/9/include/arm_neon.h(906): error: identifier "__builtin_aarch64_uaddwv8qi" is undefined
/usr/lib/gcc/aarch64-linux-gnu/9/include/arm_neon.h(914): error: identifier "__builtin_aarch64_uaddwv4hi" is undefined
/usr/lib/gcc/aarch64-linux-gnu/9/include/arm_neon.h(922): error: identifier "__builtin_aarch64_uaddwv2si" is undefined
/usr/lib/gcc/aarch64-linux-gnu/9/include/arm_neon.h(930): error: identifier "__builtin_aarch64_saddw2v16qi" is undefined
/usr/lib/gcc/aarch64-linux-gnu/9/include/arm_neon.h(937): error: identifier "__builtin_aarch64_saddw2v8hi" is undefined
/usr/lib/gcc/aarch64-linux-gnu/9/include/arm_neon.h(944): error: identifier "__builtin_aarch64_saddw2v4si" is undefined
/usr/lib/gcc/aarch64-linux-gnu/9/include/arm_neon.h(951): error: identifier "__builtin_aarch64_uaddw2v16qi" is undefined
/usr/lib/gcc/aarch64-linux-gnu/9/include/arm_neon.h(959): error: identifier "__builtin_aarch64_uaddw2v8hi" is undefined

SETUP:

nvcc: NVIDIA (R) Cuda compiler driver
Copyright (c) 2005-2022 NVIDIA Corporation
Built on Sun_Oct_23_22:16:07_PDT_2022
Cuda compilation tools, release 11.4, V11.4.315
Build cuda_11.4.r11.4/compiler.31964100_0

Flashed using JetPack 5.1 (Ubuntu 20.04)

R35 (release), REVISION: 2.1, GCID: 32413640, BOARD: t186ref, EABI: aarch64, DATE: Tue Jan 24 23:38:33 UTC 2023
Linux ubuntu 5.10.104-tegra #1 SMP PREEMPT Tue Jan 24 15:09:44 PST 2023 aarch64 aarch64 aarch64 GNU/Linux

full_output_nvcc-verbose.txt

Any help would be greatly appreciated, thank you!

@janrinze

Same issue here. Apparently nvcc does not support NEON.

@g588928812
Author

OK, it works now: I included the arm_neon intrinsics and moved the parts using those instructions to a separate .cpp file.

The compiled library loads correctly in Python and does not complain, but produces wrong results (I think).

Here's what happens when quantizing a very small model (code from https://huggingface.co/blog/hf-bitsandbytes-integration):

import torch
import torch.nn as nn

import bitsandbytes as bnb
from bitsandbytes.nn import Linear8bitLt

fp16_model = nn.Sequential(
    nn.Linear(64, 64),
    nn.Linear(64, 64)
)

torch.save(fp16_model.state_dict(), "model.pt")

int8_model = nn.Sequential(
    Linear8bitLt(64, 64, has_fp16_weights=False),
    Linear8bitLt(64, 64, has_fp16_weights=False)
)

int8_model.load_state_dict(torch.load("model.pt"))

int8_model[0].weight

Output (before quantization):

Parameter containing:
Parameter(Int8Params([[-0.0271,  0.0165,  0.0010,  ...,  0.0662,  0.1013, -0.1198],
            [-0.0512,  0.0790,  0.0168,  ...,  0.0641,  0.0217, -0.0149],
            [ 0.0428, -0.0957,  0.0995,  ...,  0.0269,  0.1067,  0.0185],
            ...,
            [ 0.0833, -0.0097,  0.0922,  ..., -0.0274,  0.0309, -0.0728],
            [-0.0933, -0.0087,  0.0862,  ..., -0.1061, -0.0052, -0.1229],
            [ 0.0371,  0.0090,  0.1147,  ...,  0.0673,  0.0801, -0.0073]]))

Now calling int8_model = int8_model.to(0) to quantize to 8-bit, and looking at int8_model[0].weight again:

Parameter containing:
Parameter(Int8Params([[  0,  17,   1,  ...,  67, 103,   0],
            [  0,  81,  17,  ...,  65,  22,   0],
            [ 44,   0, 101,  ...,  27, 109,  19],
            ...,
            [ 85,   0,  94,  ...,   0,  32,   0],
            [  0,   0,  88,  ...,   0,   0,   0],
            [ 38,   9, 119,  ...,  70,  83,   0]], device='cuda:0',
           dtype=torch.int8))

It seems like all the negative values have been set to zero. Running the same thing in Google Colab produces negative integers, no zeros.

Any ideas why that is? (Note: I did not modify any of the .cpp/.cu functions, I just moved stuff around)
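
For reference, LLM.int8()-style quantization scales each row of the weight matrix by its absolute maximum, so negative weights should map to negative int8 values. A rough stand-alone sketch of that arithmetic (my own illustration using only the visible entries of the first row above, not the bitsandbytes kernel, which quantizes whole 64-element rows):

// absmax_quant.cpp - rough sketch of row-wise absmax int8 quantization:
// x_int8 = round(127 * x / absmax(row)). Illustration only.
#include <cmath>
#include <cstdint>
#include <cstdio>

static int8_t quantize_absmax(float x, float absmax) {
    return (int8_t)std::lroundf(127.0f * x / absmax);
}

int main() {
    // Visible entries of the first weight row above; its absmax is ~0.1198.
    const float row[] = {-0.0271f, 0.0165f, 0.0010f, 0.0662f, 0.1013f, -0.1198f};
    for (float x : row)
        printf("%d ", quantize_absmax(x, 0.1198f));
    printf("\n"); // -29 17 1 70 107 -127: negatives survive as negatives
    return 0;
}

The positive entries line up roughly with the output above (17, 1, ...), while the negatives come out as zero, which is consistent with the char signedness issue discussed further down rather than a quantization difference.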

@ghost

ghost commented Apr 3, 2023

Can you share the modified code? We are facing the same problem and can debug together.

@janrinze

janrinze commented Apr 3, 2023

OK, it works now: I included the arm_neon intrinsics and moved the parts using those instructions to a separate .cpp file. […]

@g588928812 If you create a fork in your GitHub account, you can make the changes there. That way it is possible to collaborate and also to send a pull request.

@janrinze

janrinze commented Apr 3, 2023

Is the issue of no negative numbers related to this: pytorch/pytorch#52146?

@g588928812
Author

Can you share the modified code? We are facing the same problem and can debug together.

Sure, here's the fork: https://github.com/g588928812/bitsandbytes_jetsonX

@g588928812
Author

Is the issue of no negative numbers related to this: pytorch/pytorch#52146?

I'm using PyTorch 1.14, and if I understand correctly this should already have been fixed in 1.8.1.

@g588928812
Author

Is the issue of no negative numbers related to this: pytorch/pytorch#52146?

You were right! I systematically replaced all chars with int8_t and it works now; it was somewhere in kernels.cu. I will find out exactly which change did it and update the repository later.

@ghost

ghost commented Apr 3, 2023

Thanks for your repo.

The test with torch 2.0 has a similar effect.

Now that you have located the problem in kernels.cu, I will also try to modify it.

@janrinze

janrinze commented Apr 3, 2023

Would -fsigned-char suffice? Only on the affected C/C++ files, though; -fsigned-char can also make some code slower, so use it with caution. The best way is always to use int8_t if that is how the math is supposed to work: plain char in arithmetic is ambiguous, since its signedness is implementation-defined (and unsigned by default on AArch64).
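
To illustrate the ambiguity (a minimal stand-alone sketch, not bitsandbytes code; assumes g++ with default flags on both platforms):

// signedness.cpp - why plain char breaks int8 math on AArch64
#include <cstdint>
#include <cstdio>

int main() {
    char c = -27;    // signed on x86-64 (c == -27), unsigned on AArch64 (c == 229)
    int8_t i = -27;  // explicitly signed on every platform
    printf("char: %d  int8_t: %d\n", (int)c, (int)i);
    return 0;
}
// x86-64:  char: -27  int8_t: -27
// AArch64: char: 229  int8_t: -27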

@g588928812
Author

Thanks for your repo.

The test with torch 2.0 has a similar effect.

Now that you have located the problem in kernels.cu, I will also try to modify it.

It's done, the repository is updated. But I don't know yet whether the rest of the library works fine; I will run the pytest suite and see what it produces.

@janrinze

janrinze commented Apr 5, 2023

Support for Apple silicon (#252) shows another AArch64 approach. It would be a good idea to merge these efforts.

@rickardp
Contributor

rickardp commented Apr 5, 2023

Another thing that needs looking into is building proper platform-specific wheels. I've started investigating this in #257, but it is not 100% working yet; I don't know if someone else has started looking into it. As I don't have access to CUDA on Linux currently (my NVIDIA PC is hijacked by my kids for gaming and runs Win11 right now :) ) I've also tried setting it up on GitHub pipelines.

+1 on janrinze's suggestion, let's coordinate efforts. I am happy to adapt to whatever is preferred and pitch in where needed! I can probably contribute the most around the build and packaging right now, as I've never used the Neon instructions (though I coded my fair share of SSE back in the day).

@rickardp
Contributor

rickardp commented Apr 10, 2023

I think I might have solved this in #257, but I have no hardware to test it on so I can't verify. The wheel is built by https://github.com/rickardp/bitsandbytes/actions/runs/4653237487/jobs/8233937589 (go to Summary, then Artifacts, then download the bdist_wheel zip and get the correct one).

I had to move some #ifdefs and includes around. Then it turned out nvcc doesn't like Neon intrinsics, so I had to compile the CUDA version without Neon support. If anyone has the hardware to try, please check out this build.

@g588928812
Author

g588928812 commented Apr 10, 2023

Would -fsigned-char suffice?

I tried it, and it seems like it doesn't do anything (?). Changing char to int8_t in various places fixes some of the unit tests.

Some of the unit tests (test_autograd) still fail, however, and I'm not sure why. Apart from the tests, inference works now and I've got Open Assistant running. There still seems to be an issue though: it only runs with an llm_int8_threshold of <= 1.8; if I use any value higher than that (including the default of 6.0) it fails with RuntimeError: probability tensor contains either inf, nan or element < 0. I'm not sure whether this is related to running bitsandbytes on ARM or is something totally unrelated. If it is related to the char vs int8_t issue, then I have NO idea how to fix it, because I already tried replacing all instances of char with int8_t (excluding signed char and unsigned char) and nothing changed. 🤯

@g588928812
Author

nvcc doesn't like Neon intrinsics

This can be fixed by separating the CUDA and Neon calls in such a way that nvcc-compiled code does not use Neon instructions. I've done this by removing the BinSearch.h and common.h includes from ops.cu (see diff), and SIMD.h was modified to include sse2neon.h (diff).
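
For illustration, the include switch amounts to something like this (a sketch of the idea, not the exact diff; sse2neon translates SSE2 intrinsics to NEON but must only ever be seen by the host compiler, never by nvcc):

// SIMD.h (sketch): pick the intrinsics header per architecture
#if defined(__aarch64__)
  #include "sse2neon.h"   // maps _mm_* SSE2 intrinsics onto NEON equivalents
#else
  #include <emmintrin.h>  // native SSE2 intrinsics on x86-64
#endif

Keeping all SIMD-using code in g++-compiled .cpp translation units is what makes this guard sufficient: the .cu files never include the header at all.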

@rickardp
Contributor

nvcc doesn't like Neon intrinsics

This can be fixed by separating the CUDA and Neon calls in such a way that nvcc-compiled code does not use Neon instructions. I've done this by removing the BinSearch.h and common.h includes from ops.cu (see diff), and SIMD.h was modified to include sse2neon.h (diff).

Yes, sorry, this is what I meant: no code that runs through nvcc can use Neon; the g++-compiled code can.

@androiddrew

@g588928812

I tried compiling from your fork on a Jetson Orin with JetPack 5.0.2, using

CUDA_VERSION=114 make

And

CUDA_VERSION=114 make cuda11x

It still said the bitsandbytes lib was compiled without GPU support.

Do I need to use

CUDA_VERSION=114 make cuda11x_nomatmul instead?

@g588928812
Author

g588928812 commented Apr 13, 2023

Do I need to use

CUDA_VERSION=114 make cuda11x_nomatmul instead?

make cuda11x should be fine; the compute capability of the Orin is > 7.5.

Could you post the output of make cuda11x, please?

Also, you're using CUDA 11.4, right? Just making sure.

@androiddrew

androiddrew commented Apr 13, 2023

Do I need to use
CUDA_VERSION=114 make cuda11x_nomatmul instead?

make cuda11x should be fine; the compute capability of the Orin is > 7.5.

Could you post the output of make cuda11x, please?

Also, you're using CUDA 11.4, right? Just making sure.

Ok, I started over from scratch and documented what I am doing in a gist here: https://gist.github.com/androiddrew/9470fc5cfde190a71a5971abc7c2aa9f

It appears that I do have the correct binary built now, libbitsandbytes_cuda114.so, but it raises an error when I try to use it:

python ~/check_bits_and_bytes.py 

===================================BUG REPORT===================================
Welcome to bitsandbytes. For bug reports, please submit your error trace to: https://github.com/TimDettmers/bitsandbytes/issues
================================================================================
CUDA SETUP: CUDA runtime path found: /usr/local/cuda-11.4/lib64/libcudart.so
CUDA SETUP: Highest compute capability among GPUs detected: 8.7
CUDA SETUP: Detected CUDA version 114
CUDA SETUP: Loading binary /home/toor/workspace/bitsandbytes_jetsonX/env2/lib/python3.8/site-packages/bitsandbytes/libbitsandbytes_cuda114.so...
Error invalid device function at line 119 in file /home/toor/workspace/bitsandbytes_jetsonX/csrc/ops.cu

ops.cu

114                                 CUDA_CHECK_RETURN(cudaMemset(unorm, 0, 1*sizeof(float)));
115         kPreconditionOptimizer32bit2State<T, OPTIMIZER, 4096, 8><<<num_blocks, 512>>>(g, p, state1, state2, unorm, beta1, beta2, eps, weight_decay, step, lr, gnorm_scale, n);
116         CUDA_CHECK_RETURN(cudaPeekAtLastError());
117       }
118                         kOptimizer32bit2State<T, OPTIMIZER><<<num_blocks, 1024>>>(g, p, state1, state2, unorm, max_unorm, param_norm, beta1, beta2, eps, weight_decay, step, lr, gnorm_scale, skip_zeros, n);
119       CUDA_CHECK_RETURN(cudaPeekAtLastError());
120                         break;
121                 case MOMENTUM:
122     case RMSPROP:
123     case ADAGRAD:
124 

It seems not to like this line: CUDA_CHECK_RETURN(cudaPeekAtLastError());

Could I be missing some apt packages that were required for your fork to work?

@foldericon

Same here, Jetson Orin Jetpack 5.1

@g588928812
Author

Same here, Jetson Orin Jetpack 5.1

No additional packages should be needed. The problem is that I'm working on the Xavier, building and using the cuda11x_nomatmul version, and it works for me. Could you please try make cuda11x_nomatmul, just to see if that works?

@androiddrew

androiddrew commented Apr 13, 2023

I think I have it fixed using the https://github.com/g588928812/bitsandbytes_jetsonX fork! I added sm_87 in the Makefile.

diff --git a/Makefile b/Makefile
index 7bee7ef..0285514 100644
--- a/Makefile
+++ b/Makefile
@@ -46,7 +46,7 @@ CC_CUDA110 += -gencode arch=compute_80,code=sm_80
 CC_CUDA11x := -gencode arch=compute_75,code=sm_75
 CC_CUDA11x += -gencode arch=compute_80,code=sm_80
 CC_CUDA11x += -gencode arch=compute_86,code=sm_86
-
+CC_CUDA11x += -gencode arch=compute_87,code=sm_87
 
 CC_cublasLt110 := -gencode arch=compute_75,code=sm_75
 CC_cublasLt110 += -gencode arch=compute_80,code=sm_80
@@ -54,6 +54,7 @@ CC_cublasLt110 += -gencode arch=compute_80,code=sm_80
 CC_cublasLt111 := -gencode arch=compute_75,code=sm_75
 CC_cublasLt111 += -gencode arch=compute_80,code=sm_80
 CC_cublasLt111 += -gencode arch=compute_86,code=sm_86
+CC_cublasLt111 += -gencode arch=compute_87,code=sm_87
 
 CC_ADA_HOPPER := -gencode arch=compute_89,code=sm_89
 CC_ADA_HOPPER += -gencode arch=compute_90,code=sm_90
@@ -129,7 +130,7 @@ $(ROOT_DIR)/dependencies/cub:
        cd dependencies/cub; git checkout 1.11.0
 
 clean:
-       rm build/*
+       rm -rf build/*

Following the same workflow as in https://gist.github.com/androiddrew/9470fc5cfde190a71a5971abc7c2aa9f, I was able to use CUDA_VERSION=114 make cuda11x and python -m build.

python3 ~/check_bits_and_bytes.py 

===================================BUG REPORT===================================
Welcome to bitsandbytes. For bug reports, please submit your error trace to: https://github.com/TimDettmers/bitsandbytes/issues
================================================================================
CUDA SETUP: CUDA runtime path found: /usr/local/cuda-11.4/lib64/libcudart.so
CUDA SETUP: Highest compute capability among GPUs detected: 8.7
CUDA SETUP: Detected CUDA version 114
CUDA SETUP: Loading binary /home/toor/workspace/bitsandbytes_jetsonX2/env/lib/python3.8/site-packages/bitsandbytes/libbitsandbytes_cuda114.so...
SUCCESS!
Installation was successful!
(env) toor@orin:~/workspace/bitsandbytes_jetsonX$ pip freeze
bitsandbytes @ file:///home/toor/workspace/bitsandbytes_jetsonX2/dist/bitsandbytes-0.37.2-py3-none-any.whl
build==0.10.0
numpy==1.24.2
packaging==23.1
pkg_resources==0.0.0
pyproject_hooks==1.0.0
tomli==2.0.1
torch @ file:///home/toor/jetpack_5_0_wheels/torch-1.13.0a0%2B340c4120.nv22.06-cp38-cp38-linux_aarch64.whl
typing_extensions==4.5.0
(env) toor@orin:~/workspace/bitsandbytes_jetsonX$ 

The PyTorch wheel is from the Jetson Zoo and was compiled for my version of JetPack (5.0.2).
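
For context on why the missing sm_87 entry produced "Error invalid device function" earlier: the CUDA runtime reports that error when the loaded binary contains no kernel image (cubin or JIT-able PTX) usable by the running GPU. A minimal stand-alone repro, hypothetical and independent of bitsandbytes:

// arch_check.cu - sketch: kernel launch on a GPU arch the binary wasn't built for
#include <cstdio>

__global__ void noop() {}

int main() {
    noop<<<1, 1>>>();
    cudaError_t err = cudaPeekAtLastError();
    // Prints "invalid device function" when no matching kernel image exists.
    printf("launch: %s\n", cudaGetErrorString(err));
    return err == cudaSuccess ? 0 : 1;
}
// nvcc -gencode arch=compute_87,code=sm_87 arch_check.cu  -> launch works on Orin
// Built only with sm_80/sm_86 cubins and no PTX (as in the unpatched Makefile),
// the launch fails on the Orin's sm_87, exactly as seen above.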

@g588928812
Author

I think I have it fixed using the https://github.com/g588928812/bitsandbytes_jetsonX fork! I added sm_87 in the Makefile.

True! The Jetson Orin has CC 8.7. Thanks, I've updated the repository.

@g588928812
Author

Now that the library loads successfully on the Orin, could you guys please check if the unit tests work?

@foldericon

I started the tests a while ago; there is one error in tests/test_cuda_setup_evaluator.py and 4 more in tests/test_functional.py. Now it seems to be stuck at 85%. I will report back tomorrow.

@hillct

hillct commented Apr 22, 2023

I, too, experienced this problem compiling the main branch on an AGX Orin 32GB. Trace below:

================================================================================
bin /usr/local/lib/python3.8/dist-packages/bitsandbytes/libbitsandbytes_cuda114.so
CUDA SETUP: CUDA runtime path found: /usr/local/cuda/lib64/libcudart.so
CUDA SETUP: Highest compute capability among GPUs detected: 8.7
CUDA SETUP: Detected CUDA version 114
CUDA SETUP: Loading binary /usr/local/lib/python3.8/dist-packages/bitsandbytes/libbitsandbytes_cuda114.so...
/usr/local/lib/python3.8/dist-packages/bitsandbytes/libbitsandbytes_cuda114.so: cannot open shared object file: No such file or directory
CUDA SETUP: Something unexpected happened. Please compile from source:
git clone git@github.com:TimDettmers/bitsandbytes.git
cd bitsandbytes
CUDA_VERSION=114 make cuda11x
python setup.py install
Traceback (most recent call last):
File "/usr/lib/python3.8/runpy.py", line 185, in _run_module_as_main
mod_name, mod_spec, code = _get_module_details(mod_name, _Error)
File "/usr/lib/python3.8/runpy.py", line 144, in _get_module_details
return _get_module_details(pkg_main_name, error)
File "/usr/lib/python3.8/runpy.py", line 111, in _get_module_details
__import__(pkg_name)
File "/usr/local/lib/python3.8/dist-packages/bitsandbytes/__init__.py", line 6, in <module>
from . import cuda_setup, utils, research
File "/usr/local/lib/python3.8/dist-packages/bitsandbytes/research/__init__.py", line 1, in <module>
from . import nn
File "/usr/local/lib/python3.8/dist-packages/bitsandbytes/research/nn/__init__.py", line 1, in <module>
from .modules import LinearFP8Mixed, LinearFP8Global
File "/usr/local/lib/python3.8/dist-packages/bitsandbytes/research/nn/modules.py", line 8, in <module>
from bitsandbytes.optim import GlobalOptimManager
File "/usr/local/lib/python3.8/dist-packages/bitsandbytes/optim/__init__.py", line 6, in <module>
from bitsandbytes.cextension import COMPILED_WITH_CUDA
File "/usr/local/lib/python3.8/dist-packages/bitsandbytes/cextension.py", line 20, in <module>
raise RuntimeError('''
RuntimeError:
CUDA Setup failed despite GPU being available. Please run the following command to get more information:

    python -m bitsandbytes

    Inspect the output of the command and see if you can locate CUDA libraries. You might need to add them
    to your LD_LIBRARY_PATH. If you suspect a bug, please take the information from python -m bitsandbytes
    and open an issue at: https://github.com/TimDettmers/bitsandbytes/issues

exit

========================

I'll be testing @g588928812's branch later this evening and will update with results

UPDATE:

Using the branch that would appear to support compute capability 8.7, the following errors are returned:

================================================================================
bin /usr/local/lib/python3.8/dist-packages/bitsandbytes/libbitsandbytes_cuda114.so
CUDA SETUP: CUDA runtime path found: /usr/local/cuda/lib64/libcudart.so.11.0
CUDA SETUP: Highest compute capability among GPUs detected: 8.7
CUDA SETUP: Detected CUDA version 114
CUDA SETUP: Loading binary /usr/local/lib/python3.8/dist-packages/bitsandbytes/libbitsandbytes_cuda114.so...
/usr/local/lib/python3.8/dist-packages/bitsandbytes/libbitsandbytes_cuda114.so: cannot open shared object file: No such file or directory
CUDA SETUP: Something unexpected happened. Please compile from source:
git clone git@github.com:TimDettmers/bitsandbytes.git
cd bitsandbytes
CUDA_VERSION=114 make cuda11x
python setup.py install
Traceback (most recent call last):
File "webapp.py", line 1, in <module>
from llama import ModelArgs, Transformer, Tokenizer, LLaMA, default_quantize
File "/app/llama/__init__.py", line 4, in <module>
from .generation import LLaMA
File "/app/llama/generation.py", line 9, in <module>
from llama.model import Transformer
File "/app/llama/model.py", line 13, in <module>
import bitsandbytes as bnb
File "/usr/local/lib/python3.8/dist-packages/bitsandbytes/__init__.py", line 6, in <module>
from . import cuda_setup, utils, research
File "/usr/local/lib/python3.8/dist-packages/bitsandbytes/research/__init__.py", line 1, in <module>
from . import nn
File "/usr/local/lib/python3.8/dist-packages/bitsandbytes/research/nn/__init__.py", line 1, in <module>
from .modules import LinearFP8Mixed, LinearFP8Global
File "/usr/local/lib/python3.8/dist-packages/bitsandbytes/research/nn/modules.py", line 8, in <module>
from bitsandbytes.optim import GlobalOptimManager
File "/usr/local/lib/python3.8/dist-packages/bitsandbytes/optim/__init__.py", line 6, in <module>
from bitsandbytes.cextension import COMPILED_WITH_CUDA
File "/usr/local/lib/python3.8/dist-packages/bitsandbytes/cextension.py", line 20, in <module>
raise RuntimeError('''
RuntimeError:
CUDA Setup failed despite GPU being available. Please run the following command to get more information:

    python -m bitsandbytes

============
UPDATE 2:
The libbitsandbytes_cuda114.so library is definitely in place, with the proper permissions, but the behavior remains - as if the file were still inaccessible.

@g588928812
Author

The libbitsandbytes_cuda114.so library is definitely in place, with the proper permissions, but the behavior remains - as if the file were still inaccessible.

So, bitsandbytes_jetsonX builds without complaints, but the error is still the same?

Did you install the built library using python setup.py install? Alternatively, you can check whether the new library works (without installing it) by starting Python in the source directory and importing bitsandbytes, i.e.:

bash$: cd bitsandbytes_jetsonX
bash$: CUDA_VERSION=114 make cuda11x
blabla .. 
(build successful)
bash$: python3
> import bitsandbytes

@g588928812
Author

The libbitsandbytes_cuda114.so library is definitely in place, with the proper permissions, but the behavior remains - as if the file were still inaccessible.

The error message is misleading: from what I understand, this error is thrown when 1) the file does not exist, OR 2) the file exists but is incompatible.
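
One way to see the loader's actual complaint is to dlopen() the library directly; a small hypothetical helper (my own sketch, not part of bitsandbytes):

// load_check.cpp - print the dynamic loader's real error for a .so
#include <dlfcn.h>
#include <cstdio>

int main(int argc, char** argv) {
    const char* path = argc > 1 ? argv[1]
        : "/usr/local/lib/python3.8/dist-packages/bitsandbytes/libbitsandbytes_cuda114.so";
    void* handle = dlopen(path, RTLD_NOW);
    if (!handle) {
        // dlerror() is more specific than the Python-side wrapper: a missing
        // file, a wrong-architecture binary, and an undefined symbol are all
        // reported with different messages.
        fprintf(stderr, "dlopen failed: %s\n", dlerror());
        return 1;
    }
    puts("library loaded fine");
    dlclose(handle);
    return 0;
}
// g++ load_check.cpp -ldl -o load_check && ./load_check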

@janrinze

What does ldd /usr/local/lib/python3.8/dist-packages/bitsandbytes/libbitsandbytes_cuda114.so say?

@hillct

hillct commented Apr 22, 2023

What does ldd /usr/local/lib/python3.8/dist-packages/bitsandbytes/libbitsandbytes_cuda114.so say?

As you suspected, it's not being built correctly.

root@8b298b4eea43:/usr/local/lib/python3.8/dist-packages/bitsandbytes# ldd libbitsandbytes_cuda114.so
	not a dynamic executable
root@8b298b4eea43:/usr/local/lib/python3.8/dist-packages/bitsandbytes# 

I noticed in the build that it compiles for all available/supported CUDA 11.x compute capabilities, but I didn't see an existing option in the Makefile to specify only a single compute capability value. In my case above, I used (in a Dockerfile build):

RUN git clone https://github.com/g588928812/bitsandbytes_jetsonX.git && cd bitsandbytes_jetsonX \ 
          && CUDA_VERSION=114 make cuda11x && python setup.py install && cd ..

The ldd error seems to suggest a larger issue, though. At second glance, it looks like the sm_87 args aren't being passed in the case of the cuda11x target, so I'll retry with the cuda11x_nomatmul target and the CC_cuda11x args, although I'm not clear why they're not defined with the rest in a consolidated COMPUTE_CAPABILITIES section, as both are being passed: https://github.com/g588928812/bitsandbytes_jetsonX/blob/main/Makefile#L29
Will report back with results.

UPDATE:
When built with target cuda11x_nomatmul:

bin /usr/local/lib/python3.8/dist-packages/bitsandbytes/libbitsandbytes_cuda114.so
CUDA SETUP: CUDA runtime path found: /usr/local/cuda/lib64/libcudart.so.11.0
CUDA SETUP: Highest compute capability among GPUs detected: 8.7
CUDA SETUP: Detected CUDA version 114
CUDA SETUP: Loading binary /usr/local/lib/python3.8/dist-packages/bitsandbytes/libbitsandbytes_cuda114.so...
/usr/local/lib/python3.8/dist-packages/bitsandbytes/libbitsandbytes_cuda114.so: cannot open shared object file: No such file or directory
CUDA SETUP: Something unexpected happened. Please compile from source:
git clone git@github.com:TimDettmers/bitsandbytes.git
cd bitsandbytes
CUDA_VERSION=114 make cuda11x
python setup.py install
Traceback (most recent call last):
File "webapp.py", line 1, in <module>
from llama import ModelArgs, Transformer, Tokenizer, LLaMA, default_quantize
File "/app/llama/__init__.py", line 4, in <module>
from .generation import LLaMA
File "/app/llama/generation.py", line 9, in <module>
from llama.model import Transformer
File "/app/llama/model.py", line 13, in <module>
import bitsandbytes as bnb
File "/usr/local/lib/python3.8/dist-packages/bitsandbytes/__init__.py", line 6, in <module>
from . import cuda_setup, utils, research
File "/usr/local/lib/python3.8/dist-packages/bitsandbytes/research/__init__.py", line 1, in <module>
from . import nn
File "/usr/local/lib/python3.8/dist-packages/bitsandbytes/research/nn/__init__.py", line 1, in <module>
from .modules import LinearFP8Mixed, LinearFP8Global
File "/usr/local/lib/python3.8/dist-packages/bitsandbytes/research/nn/modules.py", line 8, in <module>
from bitsandbytes.optim import GlobalOptimManager
File "/usr/local/lib/python3.8/dist-packages/bitsandbytes/optim/__init__.py", line 6, in <module>
from bitsandbytes.cextension import COMPILED_WITH_CUDA
File "/usr/local/lib/python3.8/dist-packages/bitsandbytes/cextension.py", line 20, in <module>
raise RuntimeError('''
RuntimeError:
CUDA Setup failed despite GPU being available. Please run the following command to get more information:

@hillct

hillct commented Apr 22, 2023

Strangely, while ldd chokes on it as above, file seems to think it's fine (if only I weren't on ARM hardware)...

# file /usr/local/lib/python3.8/dist-packages/bitsandbytes/libbitsandbytes_cuda114.so
/usr/local/lib/python3.8/dist-packages/bitsandbytes/libbitsandbytes_cuda114.so: ELF 64-bit LSB shared object, x86-64, version 1 (SYSV), dynamically linked, BuildID[sha1]=f8d245eb366661652901cfa310c3cb4503a9868d, not stripped

So it's compiling for x86-64, not ARM64 (aarch64), as it should be...
Now where is that flag being set...?

UPDATE:
It's not in my environment, nor in the Makefile... nvcc shouldn't be cross-compiling by default... I'm not sure where else to look.

Sanity check - I'm using this build container: NGC PyTorch

@g588928812
Author

file /usr/local/lib/python3.8/dist-packages/bitsandbytes/libbitsandbytes_cuda114.so

Are you sure this is the file you've built rather than a file that was there before?

@hillct

hillct commented Apr 22, 2023

/usr/local/lib/python3.8/dist-packages/bitsandbytes/libbitsandbytes_cuda114.so

You're right. Silly me for thinking an installation script might actually install the artifacts from a build/compile operation...
It turns out the recently compiled library wasn't being copied from /app/bitsandbytes_jetsonX/build/lib/bitsandbytes at all...

@hillct

hillct commented Apr 22, 2023

When the newly compiled library is manually copied into the proper production location and python -m bitsandbytes is executed, the following error is displayed:

AttributeError: /usr/local/lib/python3.8/dist-packages/bitsandbytes/libbitsandbytes_cuda114.so: 
undefined symbol: clion32bit_g32

I'm not familiar enough with the bitsandbytes codebase to say what's going on here, but I found others on Windows seeing the same issue, which is apparently resolved in version 0.37.2: oobabooga/text-generation-webui#1193

It'll take some branch merging to test that, but it's also worth noting that when building the cuda11x_nomatmul target, the resulting codebase still tries to load libbitsandbytes_cuda114.so, so when copying the artifact to the proper production location I renamed it as well, to ensure the codebase would use it (rather than updating the underlying calls to it). This yielded the same clion32bit_g32 symbol error as above.

@g588928812
Author

@hillct did you manage to fix this?

@hillct

hillct commented Apr 25, 2023

@hillct did you manage to fix this?

I haven't moved past the clion32bit_g32 undefined symbol / function not found error. Again, I don't know the codebase, but I presume something just needs to be moved between the NEON and non-NEON related headers (without having checked whether that makes any sense at all). If I can circle back to it this week I will, but I haven't been able to so far. The one hint I saw was that when compiled with cuda11x_nomatmul it threw "undefined symbol", and when compiled with cuda11x it threw "function not found", which is why I'm guessing it's a header problem. While only a slight clarification of my earlier comment, it does seem rather significant.

@hillct

hillct commented Apr 25, 2023

It turns out I may have had a cache issue or an otherwise unclean environment. I'm now able to build using:
CUDA_VERSION=114 make cuda11x && python setup.py install
Tests:

  • test_autograd.py: 832 passed, 976 warnings (deprecation and casting)
  • test_linear8bitlt.py: 2 passed
  • test_modules.py: 6 passed, 2 warnings (deprecation)
  • test_optim.py: 88 passed, 414 warnings (deprecation)
  • test_functional.py: Someone interpret this for me... 44% something...

F................................................................................................FFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFssssssss
root@08c36af24c07:/app/tests#

@hillct

hillct commented Apr 25, 2023

Finally, test_cuda_setup_evaluator.py fails completely because it tries to determine the CUDA version in a profoundly stupid way involving parsing LD_LIBRARY_PATH.

As for interpreting the test_functional.py results: there are 344 tests but only 298 characters there, so one can't map them to test success/failure.

In case anyone wants it, a trivial Dockerfile for testing in a clean environment https://gist.github.com/hillct/42045e9664834f8c666017d82fa276dd

@g588928812
Author

Thanks! That doesn't look too bad.

  • test_functional.py: Someone interpret this for me... 44% something...

Try running pytest with the -v (verbose) flag; it will hopefully be more informative.

@g588928812
Author

It'll take some branch merging to test that, but it's also worth noting that when building the cuda11x_nomatmul target, the resulting codebase still tries to load libbitsandbytes_cuda114.so, so when copying the artifact to the proper production location I renamed it as well, to ensure the codebase would use it (rather than updating the underlying calls to it). This yielded the same clion32bit_g32 symbol error as above.

I've merged bitsandbytes 0.38 into my fork; if you have some time, please check whether this resolves the clion32bit_g32 errors.

@hillct

hillct commented Apr 26, 2023

Thanks! That doesn't look too bad.

  • test_functional.py: Someone interpret this for me... 44% something...

Try running pytest with the -v (verbose) flag; it will hopefully be more informative.

python -m pytest test_functional.py -v:

  • 336 passed, 8 skipped, 275 warnings (deprecation)

@hillct

hillct commented Apr 26, 2023

I've merged bitsandbytes 0.38 into my fork; if you have some time, please check whether this resolves the clion32bit_g32 errors.

These are resolved. They were a symptom of my unclean build environment (which is what prompted the creation of the Docker image for cleaner testing).

@deep-pipeline

Hi - this issue is referenced from another issue relating to Apple Silicon, presumably because if an ARM-architecture, CPU-only version of bitsandbytes could work on the Jetson, then it might be able to compile and work on a Mac. However, the Jetson clearly has on-board CUDA, so:

Can I ask (please) anyone who has this working on a Jetson: is it running with the CPU only, or is it running using the on-board CUDA?

Many thanks in advance.

@androiddrew

androiddrew commented Jun 7, 2023 via email

@g588928812
Author

Can I ask (please) anyone who has this working on a Jetson: is it running with the CPU only, or is it running using the on-board CUDA?

You should be able to compile a CPU-only version, though, using make cpuonly.

@rickardp
Contributor

rickardp commented Jun 7, 2023

The CPU-only version is not feature-complete.

@WarrenDream

@g588928812 is it working on Jetson with CUDA now, with the bitsandbytes_jetsonX fork?

@g588928812
Author

@g588928812 is it working on Jetson with CUDA now, with the bitsandbytes_jetsonX fork?

It (my fork) compiles, but some tests fail; you would simply have to try it and see. I'm not working on it anymore though, sorry.


This issue has been automatically marked as stale because it has not had recent activity. If you think this still needs to be addressed please comment on this thread.
