Replies: 2 comments
- Support for ngqa is being added; see #860.
- Waiting for docs.
- Whenever I run any Llama 2 70B model, I get this error:
4:15AM DBG GRPC(llama-2-70b.ggmlv3.q6_K.bin-127.0.0.1:37993): stderr ggml_init_cublas: found 1 CUDA devices:
4:15AM DBG GRPC(llama-2-70b.ggmlv3.q6_K.bin-127.0.0.1:37993): stderr Device 0: NVIDIA GeForce RTX 4090, compute capability 8.9
4:15AM DBG GRPC(llama-2-70b.ggmlv3.q6_K.bin-127.0.0.1:37993): stderr llama.cpp: loading model from /root/Local/cublas/models/llama-2-70b.ggmlv3.q6_K.bin
4:15AM DBG GRPC(llama-2-70b.ggmlv3.q6_K.bin-127.0.0.1:37993): stderr llama_model_load_internal: format = ggjt v3 (latest)
4:15AM DBG GRPC(llama-2-70b.ggmlv3.q6_K.bin-127.0.0.1:37993): stderr llama_model_load_internal: n_vocab = 32000
4:15AM DBG GRPC(llama-2-70b.ggmlv3.q6_K.bin-127.0.0.1:37993): stderr llama_model_load_internal: n_ctx = 4096
4:15AM DBG GRPC(llama-2-70b.ggmlv3.q6_K.bin-127.0.0.1:37993): stderr llama_model_load_internal: n_embd = 8192
4:15AM DBG GRPC(llama-2-70b.ggmlv3.q6_K.bin-127.0.0.1:37993): stderr llama_model_load_internal: n_mult = 4096
4:15AM DBG GRPC(llama-2-70b.ggmlv3.q6_K.bin-127.0.0.1:37993): stderr llama_model_load_internal: n_head = 64
4:15AM DBG GRPC(llama-2-70b.ggmlv3.q6_K.bin-127.0.0.1:37993): stderr llama_model_load_internal: n_head_kv = 64
4:15AM DBG GRPC(llama-2-70b.ggmlv3.q6_K.bin-127.0.0.1:37993): stderr llama_model_load_internal: n_layer = 80
4:15AM DBG GRPC(llama-2-70b.ggmlv3.q6_K.bin-127.0.0.1:37993): stderr llama_model_load_internal: n_rot = 128
4:15AM DBG GRPC(llama-2-70b.ggmlv3.q6_K.bin-127.0.0.1:37993): stderr llama_model_load_internal: n_gqa = 1
4:15AM DBG GRPC(llama-2-70b.ggmlv3.q6_K.bin-127.0.0.1:37993): stderr llama_model_load_internal: n_ff = 24576
4:15AM DBG GRPC(llama-2-70b.ggmlv3.q6_K.bin-127.0.0.1:37993): stderr llama_model_load_internal: freq_base = 10000.0
4:15AM DBG GRPC(llama-2-70b.ggmlv3.q6_K.bin-127.0.0.1:37993): stderr llama_model_load_internal: freq_scale = 1
4:15AM DBG GRPC(llama-2-70b.ggmlv3.q6_K.bin-127.0.0.1:37993): stderr llama_model_load_internal: ftype = 18 (mostly Q6_K)
4:15AM DBG GRPC(llama-2-70b.ggmlv3.q6_K.bin-127.0.0.1:37993): stderr llama_model_load_internal: model size = 65B
4:15AM DBG GRPC(llama-2-70b.ggmlv3.q6_K.bin-127.0.0.1:37993): stderr llama_model_load_internal: ggml ctx size = 53965.41 MB
4:15AM DBG GRPC(llama-2-70b.ggmlv3.q6_K.bin-127.0.0.1:37993): stderr llama_model_load_internal: using CUDA for GPU acceleration
4:15AM DBG GRPC(llama-2-70b.ggmlv3.q6_K.bin-127.0.0.1:37993): stderr error loading model: llama.cpp: tensor 'layers.0.attention.wk.weight' has wrong shape; expected 8192 x 8192, got 8192 x 1024
4:15AM DBG GRPC(llama-2-70b.ggmlv3.q6_K.bin-127.0.0.1:37993): stderr llama_load_model_from_file: failed to load model
4:15AM DBG GRPC(llama-2-70b.ggmlv3.q6_K.bin-127.0.0.1:37993): stderr llama_init_from_gpt_params: error: failed to load model '/root/Local/cublas/models/llama-2-70b.ggmlv3.q6_K.bin'
4:15AM DBG GRPC(llama-2-70b.ggmlv3.q6_K.bin-127.0.0.1:37993): stderr load_binding_model: error: unable to load model
I see "n_gqa" parameter in the llama-cpp-python 0.1.77.
I wanna know how can I select this parameter in the Local_AI ?
Thanks!!
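For reference, here is a minimal sketch of how that parameter is passed when using llama-cpp-python directly, assuming version 0.1.77 where the `Llama` constructor accepts `n_gqa`. Llama 2 70B uses grouped-query attention (64 attention heads, 8 KV heads), so the failing tensor shape in the log above (8192 x 1024 instead of 8192 x 8192) corresponds to a grouped-query factor of 8. The path and `n_gpu_layers` value below are placeholders; how LocalAI itself will expose this option is what #860 tracks, so treat the LocalAI side as unresolved until the docs land.

```python
# Minimal llama-cpp-python (0.1.77) sketch: loading a GGML v3 Llama 2 70B file
# with the grouped-query attention override it needs.
from llama_cpp import Llama

llm = Llama(
    model_path="/root/Local/cublas/models/llama-2-70b.ggmlv3.q6_K.bin",  # placeholder path
    n_ctx=4096,
    n_gqa=8,          # Llama 2 70B: 64 attention heads / 8 KV heads -> grouped-query factor 8
    n_gpu_layers=35,  # placeholder: however many layers fit on your GPU
)

out = llm("Q: What is grouped-query attention? A:", max_tokens=64)
print(out["choices"][0]["text"])
```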