llm_load_tensors: VRAM used: 7337 MB? I have an RTX 4060 laptop GPU with 8188 MiB; the system is using only 77 MB, so how can it OOM at only ~7400 MB? #3190
-
I finally got CUDA offload working on Ubuntu 22.04 without Docker, BUT I can only offload 40 of the 43 total layers. How do I offload all 43 without crashing?
At 41/43 layers it's still OOM. Maybe I should ask: how do I make sure my Ubuntu is not using 77 MB (or anything) of the GPU? I'm using an HP Victus laptop with the "hybrid" mode BIOS setting. I thought discrete mode means the GPU is used 100%, so I went with hybrid. Please correct me if I am wrong, and I'd appreciate any help getting this working without OOM.
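(A quick way to see what is actually holding VRAM is `nvidia-smi`; a minimal check, assuming the standard NVIDIA driver tools are installed:)

```sh
# Total vs. used VRAM on the card.
nvidia-smi --query-gpu=memory.total,memory.used --format=csv

# Full summary; the process table at the bottom shows which processes
# (e.g. Xorg / gnome-shell on a hybrid laptop) are holding GPU memory.
nvidia-smi
```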
-
If you want to "free" that 77M, you have to stop the desktop environment service and work from physical console, because "desktop" is rendered through the GPU too, and it takes away from its memory. I believe the You seem to be very close to fitting the whole thing on the GPU, check if |
Beta Was this translation helpful? Give feedback.
If you want to "free" that 77M, you have to stop the desktop environment service and work from physical console, because "desktop" is rendered through the GPU too, and it takes away from its memory.
I believe the `VRAM used` value only shows the memory used by the model itself; additional memory is used for a couple of other things (e.g. the KV cache and scratch buffers). You seem to be very close to fitting the whole thing on the GPU. Check if `-b 1` helps (it will make prompt processing slower; you can fiddle with this value to see how high you can go before OOM). You can also try reducing the context size, and if none of that helps, use a smaller quant of the model.
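Putting those suggestions together, a hypothetical invocation (the model path, prompt, and context size are placeholders; `-ngl`, `-b`, and `-c` are the llama.cpp flags for offloaded GPU layers, batch size, and context length):

```sh
# Offload all 43 layers with the smallest batch and a reduced context.
# Lower -b trades prompt-processing speed for less VRAM; lower -c
# shrinks the KV cache, which also consumes GPU memory.
./main -m ./models/model-q4_k_m.gguf -ngl 43 -b 1 -c 2048 -p "Hello"
```

If that still OOMs, the remaining lever is a smaller quant of the same model (e.g. Q4_K_S or Q3_K_M instead of Q4_K_M).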