Question Regarding Assigning Backend Memory for loading Data on Tensor for llama.cpp model #11993
akapoor3518
started this conversation in
General
Hi @ggerganov,

I am trying to run TinyLlama, which has around 75 tensors. My custom backend (custom GPU hardware) currently supports only ADD and MUL (scalar or vector), so I am running two backends: my GPU backend and the CPU backend. Any tensor that runs on my hardware must use my custom (GPU) memory, not CPU memory, because each such tensor is stored as a custom header followed by the data, with `tensor->data` pointing at the data. My kernels need that header, so in my backend's `init_tensor` I do:

```c
tensor->data = (void *)((char *)tensor->data + sizeof(tensor_data_header));
```

Then in my backend's graph compute I do the following:

```c
void *p = tensor->data;
custom_header *hp = (custom_header *)p;
--hp; // hp now points at my header, where I fill in the custom details
```

But since this memory (`tensor->data`) was allocated by the CPU, I crash here. How do I make sure `tensor->data` is allocated from GPU memory for the ops my backend supports, i.e. for a given node tensor and its leaf tensors (`node->src[0]`, `node->src[1]`)?
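To make the layout concrete, here is a standalone sketch of the header-before-data scheme described above (the `tensor_data_header` fields here are made up for illustration; the real layout is backend-specific):

```c
#include <stddef.h>

// hypothetical header layout; the real tensor_data_header is backend-specific
typedef struct {
    unsigned magic;
    size_t   nbytes;
} tensor_data_header;

// what init_tensor does: move the data pointer just past the header
static void *data_of(void *base) {
    return (char *)base + sizeof(tensor_data_header);
}

// what graph compute does: step back from the data pointer to the header
static tensor_data_header *header_of(void *data) {
    tensor_data_header *hp = (tensor_data_header *)data;
    --hp; // one header-sized step back, same as the --hp above
    return hp;
}
```

This arithmetic only works if the header and the data came from the same allocation, which is exactly why the buffer has to come from the custom backend rather than the CPU one.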
There is no documentation for this, so I have been going through the code, but it would help if you could explain how to enforce that a tensor (and its related leaf nodes) uses GPU memory for the ops my backend supports.
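For context, my understanding from reading the code is that the scheduler (`ggml_backend_sched`) decides which backend owns each node by asking the backends whether they support the op, and then allocates that node's buffers from the owning backend's buffer type. A mocked standalone sketch of the kind of check my backend would report (the op enum here is a stand-in, not the real `ggml_op`):

```c
#include <stdbool.h>

// mocked subset of the op enum, for illustration only
typedef enum { DEMO_OP_ADD, DEMO_OP_MUL, DEMO_OP_MUL_MAT, DEMO_OP_SOFT_MAX } demo_op;

// the custom backend claims only the ops its kernels implement;
// every other node should then fall back to the CPU backend
static bool custom_supports_op(demo_op op) {
    switch (op) {
        case DEMO_OP_ADD:
        case DEMO_OP_MUL:
            return true;
        default:
            return false;
    }
}
```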
Some more information: I have parts of the following API implemented. In `ggml_backend_custom_buffer_type_get_alloc_size` I return:

```c
return sizeof(tensor_data_header) + ggml_nbytes(tensor);
```
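One aside on this scheme (my own note, not from the code above): if `sizeof(tensor_data_header)` is not a multiple of the tensor alignment the buffer type reports, offsetting `tensor->data` by the raw header size will misalign the payload. A standalone sketch of padding the header up to the alignment:

```c
#include <stddef.h>

// round x up to the next multiple of align (align must be a power of two)
static size_t align_up(size_t x, size_t align) {
    return (x + align - 1) & ~(align - 1);
}

// hypothetical helper: pad the header so the data that follows stays aligned
static size_t padded_header_size(size_t header_size, size_t alignment) {
    return align_up(header_size, alignment);
}
```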
## Backend setup

```c
model.backend = ggml_backend_custom_init();

// load the data into the tensors
ggml_backend_tensor_set(model.a, a, 0, ggml_nbytes(model.a));
ggml_backend_tensor_set(model.b, b, 0, ggml_nbytes(model.b));

// build the operation nodes
ggml_build_forward_expand(gf, result);

// ... then compute the graph and validate the result
```
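One detail about `ggml_backend_tensor_set` with this layout: because `tensor->data` already points past the header, the buffer's copy path needs no extra adjustment, and copying at `data + offset` never touches the header bytes. A standalone sketch of that invariant (names hypothetical):

```c
#include <stdlib.h>
#include <string.h>

// hypothetical header; the real layout is backend-specific
typedef struct { unsigned magic; } hdr_t;

// hypothetical copy body: data already points past the header,
// so writing at data + offset lands in the payload, not the header
static void demo_set_tensor(void *data, const void *src, size_t offset, size_t size) {
    memcpy((char *)data + offset, src, size);
}

// allocate a header+payload block and return the payload (data) pointer
static void *demo_alloc(size_t nbytes) {
    char *base = calloc(1, sizeof(hdr_t) + nbytes);
    ((hdr_t *)base)->magic = 0xC0DEu;
    return base + sizeof(hdr_t);
}
```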
Thanks in advance,
Anoop