Replies: 5 comments 2 replies
-
To use AMX, the weights need to be stored in an AMX buffer type. If you are experimenting, I suggest making … The more portable way to do this would be to use …
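For anyone landing here later, a minimal sketch of one way this can look (not from the thread, and assuming `ggml_backend_amx_buffer_type()` is visible outside the CPU backend, which is exactly what the next reply addresses): create the weight tensor in a `no_alloc` context, then allocate every tensor in that context from the AMX buffer type.

```c
// Sketch only: allocate a weight tensor from the AMX buffer type instead of
// the default CPU buffer type. Assumes ggml_backend_amx_buffer_type() has
// been made visible to the caller (it is internal to the CPU backend).
#include "ggml.h"
#include "ggml-alloc.h"
#include "ggml-backend.h"

static ggml_backend_buffer_t alloc_weights_in_amx(struct ggml_context ** out_ctx) {
    struct ggml_init_params params = {
        /*.mem_size   =*/ ggml_tensor_overhead() * 8,
        /*.mem_buffer =*/ NULL,
        /*.no_alloc   =*/ true,   // tensor data will live in the backend buffer
    };
    struct ggml_context * ctx = ggml_init(params);

    // 64x64 quantized weight; the AMX path only handles quantized src0
    struct ggml_tensor * w = ggml_new_tensor_2d(ctx, GGML_TYPE_Q8_0, 64, 64);
    ggml_set_name(w, "weight");

    // allocate every tensor in this context from the AMX buffer type
    ggml_backend_buffer_t buf =
        ggml_backend_alloc_ctx_tensors_from_buft(ctx, ggml_backend_amx_buffer_type());

    *out_ctx = ctx;
    return buf;
}
```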
-
Ok, I've made that change. Specifically, I changed the line
…
to
…
and, in order to make `ggml_backend_amx_buffer_type()` "public", I added …
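For reference, a sketch of what exposing it might look like; the header and export macro here are assumptions, only the shape of the signature follows the existing buffer-type getters such as `ggml_backend_cpu_buffer_type()`:

```c
// Hypothetical: declaration added to a header the application can include
// (e.g. ggml-cpu.h) so the AMX buffer type can be requested directly; the
// exact export macro may differ between ggml versions.
GGML_API ggml_backend_buffer_type_t ggml_backend_amx_buffer_type(void);
```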
-
Ok, I've taken the weight tensor, converted it to int8, and updated the tensor loading code accordingly. My input vector is still fp32. Now, when I call … I tried using int8 for the input tensor as well, but that causes an assertion failure in … What would be the easiest way to debug this?
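As an aside, a small sketch of the type combination that path expects (the names are illustrative, not from the original program): quantized weights as src0 and an fp32 input as src1, which is why making the input int8 as well trips a type assertion.

```c
#include "ggml.h"

// Illustrative sketch: quantized weight (src0) times F32 input (src1).
// The CPU mul_mat path converts the F32 src1 to the kernel's vec_dot type
// internally, so passing an int8 input tensor directly is what asserts.
static struct ggml_tensor * build_mul_mat(struct ggml_context * ctx) {
    struct ggml_tensor * w = ggml_new_tensor_2d(ctx, GGML_TYPE_Q8_0, 64, 64); // quantized weights
    struct ggml_tensor * x = ggml_new_tensor_2d(ctx, GGML_TYPE_F32,  64, 64); // fp32 input
    return ggml_mul_mat(ctx, w, x);                                           // fp32 result
}
```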
-
Alright, I've found where the segfault is occurring. It had nothing to do with allocating AMX tensors, as I reverted my code to the original, which simply calls … So, to be clear, I am simply trying to matrix-multiply a 64x64 int8 ("weight") matrix with a 64x64 fp32 ("input") matrix for now. I debugged the segfault down to
…
and gdb says that …
-
Alright, I think I figured everything out. I'll put my observations here in case this is useful to anyone later on. The first issue was this line:
…
This creates a total of 3 tensors: the output of …, but the two intermediate tensors are just the standard CPU backend type. That means that, internally, the AMX matmul wasn't running, since its weight tensor was not AMX type; instead, the standard CPU backend matmul was running. The CPU matmul uses a function pointer … I still ran into this error: … Before I close this out, I just had a couple more questions.
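Putting the observations above together, a hedged sketch of the arrangement being described: the weight is allocated from the AMX buffer type in its own context, while the input and the graph's intermediate tensors stay in a plain CPU buffer; only then does the CPU backend dispatch the AMX mul_mat kernel. The sizes, names, and the Q8_0 weight type are assumptions for illustration, not the thread's actual code.

```c
#include "ggml.h"
#include "ggml-alloc.h"
#include "ggml-backend.h"
#include "ggml-cpu.h"   // ggml_backend_cpu_init() in recent ggml trees

static void run_amx_mul_mat(void) {
    // context for the weight, allocated from the AMX buffer type
    // (assumes ggml_backend_amx_buffer_type() is exposed, as above)
    struct ggml_init_params wp = { ggml_tensor_overhead() * 4, NULL, /*no_alloc=*/ true };
    struct ggml_context * wctx = ggml_init(wp);
    struct ggml_tensor  * w    = ggml_new_tensor_2d(wctx, GGML_TYPE_Q8_0, 64, 64);
    ggml_backend_alloc_ctx_tensors_from_buft(wctx, ggml_backend_amx_buffer_type());

    // context for the input, output, and graph, allocated from the CPU buffer type
    struct ggml_init_params gp = { ggml_tensor_overhead() * 16 + ggml_graph_overhead(), NULL, true };
    struct ggml_context * gctx = ggml_init(gp);
    struct ggml_tensor  * x    = ggml_new_tensor_2d(gctx, GGML_TYPE_F32, 64, 64);
    struct ggml_tensor  * y    = ggml_mul_mat(gctx, w, x);

    struct ggml_cgraph * gf = ggml_new_graph(gctx);
    ggml_build_forward_expand(gf, y);
    ggml_backend_alloc_ctx_tensors_from_buft(gctx, ggml_backend_cpu_buffer_type());

    // ... load the quantized weight data and the fp32 input here ...

    ggml_backend_t cpu = ggml_backend_cpu_init();
    ggml_backend_graph_compute(cpu, gf);
    ggml_backend_free(cpu);
}
```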
-
I am trying to execute a CNN explicitly using Intel AMX as much as possible for some performance evaluation. I already have the entire CNN implemented in ggml and some model parameters trained with a pytorch program and exported to the ggml format, but all the weights and biases are FP32s.
I looked through the codebase and it looks like all AMX operations in ggml are in the `tinygemm_kernel_amx` function, which is called by the `ggml_backend_amx_mul_mat` function located in `src/ggml-cpu/amx/mmq.cpp`, and that only signed-signed int8 operations (`_tile_dpbssd`) are supported by it. This is fine; I am able to quantize my model down to int8.

To simplify the problem for now, I've created a simple ggml program that does a single 64x64 by 64x64 matrix multiplication and added some print statements to `ggml_backend_amx_mul_mat` to check if it's being called, but it looks like that function is never called. The note above the definition of `ggml_backend_amx_mul_mat` says that src0 must be quantized in some way (I just did fp16, which according to my understanding of the code will execute AVX512 and not AMX, but I'm just trying to figure out how to get this function to execute for now), src1 must be fp32, and the destination must be fp32. Despite all this, the function never executes, but the program successfully does the matrix multiplication.

I've confirmed that ggml was built with AMX and AVX512 support enabled and that the `ggml_cpu_has_amx_int8`, `ggml_cpu_has_avx512`, `ggml_cpu_has_avx512_vnni`, `ggml_cpu_has_avx512_vbmi`, and `ggml_cpu_has_avx512_bf16` functions all return true. I also explicitly use the CPU backend with the line `model.backend = ggml_backend_cpu_init();`.
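The check amounts to something like this sketch (the header these predicates are declared in varies between ggml versions):

```c
#include <stdio.h>
#include "ggml-cpu.h"   // CPU feature predicates (declared in ggml.h in older trees)

int main(void) {
    printf("amx_int8:    %d\n", ggml_cpu_has_amx_int8());
    printf("avx512:      %d\n", ggml_cpu_has_avx512());
    printf("avx512_vnni: %d\n", ggml_cpu_has_avx512_vnni());
    printf("avx512_vbmi: %d\n", ggml_cpu_has_avx512_vbmi());
    printf("avx512_bf16: %d\n", ggml_cpu_has_avx512_bf16());
    return 0;
}
```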
For reference, here is the entire code. It is based off of this tutorial: https://balisujohn.github.io/converting-pytorch-to-ggml/