About fp16 mat_mul #3757

Zhikaiiii · 2023-10-24T03:40:10Z


In llama.cpp, when the weight A is fp16 and the activation B is fp32, mat_mul first converts B to fp16 and stores the result in wdata:

  if (params->type == GGML_TASK_INIT) {
      if (src1->type != vec_dot_type) {
          char * wdata = params->wdata;
          const size_t row_size = ne10*ggml_type_size(vec_dot_type)/ggml_blck_size(vec_dot_type);

          for (int64_t i13 = 0; i13 < ne13; ++i13) {
              for (int64_t i12 = 0; i12 < ne12; ++i12) {
                  for (int64_t i11 = 0; i11 < ne11; ++i11) {
                      from_float_to_vec_dot((float *)((char *) src1->data + i13*nb13 + i12*nb12 + i11*nb11), (void *) wdata, ne10);
                      wdata += row_size;
                  }
              }
          }
      }

      return;
  }
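For the fp16 case, the INIT step above amounts to walking the rows of B by their byte stride and appending each converted row to a contiguous buffer. Below is a minimal sketch of that idea, reduced to two dimensions. It is not the ggml implementation: `float_to_half` and `pack_rows_f16` are hypothetical names introduced here, and the conversion assumes IEEE 754 binary16 with round-to-nearest-even.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical helper (not a ggml function): fp32 -> binary16 bit pattern,
 * rounding to nearest even. */
static uint16_t float_to_half(float f) {
    uint32_t x;
    memcpy(&x, &f, sizeof x);
    const uint16_t sign = (uint16_t)((x >> 16) & 0x8000u);
    const uint32_t mag  = x & 0x7FFFFFFFu;

    if (mag > 0x7F800000u)  return (uint16_t)(sign | 0x7E00u); /* NaN */
    if (mag >= 0x477FF000u) return (uint16_t)(sign | 0x7C00u); /* inf or overflow */

    if (mag < 0x38800000u) {
        /* Below 2^-14: the result is a half subnormal (or zero).
         * Scale by 2^24, then add 2^23 so the FPU rounds to an integer. */
        float a;
        memcpy(&a, &mag, sizeof a);
        volatile float biased = a * 16777216.0f + 8388608.0f;
        const float r = biased;
        uint32_t b;
        memcpy(&b, &r, sizeof b);
        return (uint16_t)(sign | (b - 0x4B000000u));
    }

    /* Normal range: rebias the exponent (127 -> 15) and drop 13 mantissa bits. */
    uint32_t h = (mag - 0x38000000u) >> 13;
    const uint32_t rem = mag & 0x1FFFu;
    if (rem > 0x1000u || (rem == 0x1000u && (h & 1u))) {
        h++;
    }
    return (uint16_t)(sign | h);
}

/* Hypothetical 2D version of the packing loop: src_row_stride is in bytes,
 * playing the role of nb11 in the snippet above. */
static void pack_rows_f16(const void *src, size_t src_row_stride,
                          int64_t n_rows, int64_t n_cols, uint16_t *wdata) {
    assert(src != NULL && wdata != NULL);
    /* For fp16 the element size is 2 bytes and the block size is 1. */
    const size_t row_size = (size_t)n_cols * sizeof(uint16_t);
    char *dst = (char *)wdata;

    for (int64_t i = 0; i < n_rows; ++i) {
        const float *row = (const float *)((const char *)src + (size_t)i * src_row_stride);
        uint16_t *out = (uint16_t *)dst;
        for (int64_t j = 0; j < n_cols; ++j) {
            out[j] = float_to_half(row[j]);
        }
        dst += row_size;
    }
}
```

Because the stride is passed separately from the column count, the source rows may be padded or non-contiguous while the packed buffer stays dense, which is what lets the later dot product read each row of B linearly.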

But when the actual vec_dot is computed, both A and B are converted back to fp32 for the calculation:

    float sumf = 0.0;

    for (int i = 0; i < nb; i++) {
        int sumi = 0;

        for (int j = 0; j < qk/2; ++j) {
            const int v0 = (x[i].qs[j] & 0x0F);
            const int v1 = (x[i].qs[j] >>   4);

            sumi += (v0 * y[i].qs[j]) + (v1 * y[i].qs[j + qk/2]);
        }

        sumf += (GGML_FP16_TO_FP32(x[i].d)*y[i].d)*sumi + GGML_FP16_TO_FP32(x[i].m)*y[i].s;
    }

    *s = sumf;
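To make the pattern in question concrete for plain fp16 operands, here is a minimal scalar sketch: both vectors are stored as fp16 bit patterns, each element is widened to fp32, and the products are accumulated in fp32. It is an illustration under stated assumptions, not the ggml kernel: `half_to_float` and `dot_f16_as_f32` are hypothetical names, and real builds may use SIMD paths that behave differently.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical helper (not a ggml function): binary16 bit pattern -> fp32.
 * The conversion is exact, since every fp16 value is representable in fp32. */
static float half_to_float(uint16_t h) {
    const uint32_t sign = (uint32_t)(h & 0x8000u) << 16;
    const uint32_t exp  = (h >> 10) & 0x1Fu;
    const uint32_t man  = h & 0x3FFu;
    uint32_t bits;

    if (exp == 0) {
        /* Zero or subnormal: value is man * 2^-24. */
        const float f = (float)man * (1.0f / 16777216.0f);
        memcpy(&bits, &f, sizeof bits);
        bits |= sign;
    } else if (exp == 31) {
        /* Infinity or NaN. */
        bits = sign | 0x7F800000u | (man << 13);
    } else {
        /* Normal: rebias the exponent (15 -> 127) and widen the mantissa. */
        bits = sign | ((exp + 112u) << 23) | (man << 13);
    }

    float out;
    memcpy(&out, &bits, sizeof out);
    return out;
}

/* Dot product over fp16 storage with fp32 arithmetic and accumulation. */
static float dot_f16_as_f32(const uint16_t *x, const uint16_t *y, int n) {
    assert(x != NULL && y != NULL);
    float sumf = 0.0f;
    for (int i = 0; i < n; ++i) {
        sumf += half_to_float(x[i]) * half_to_float(y[i]);
    }
    return sumf;
}
```

In this form, fp16 storage halves the memory traffic for the operands, but every multiply-add still pays for a widening conversion, which is the overhead the question below is about.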

Why is it done this way? And is this the reason why fp16 inference is slower than the original fp32 and int8?
Or am I missing something?
