Key and Query are multiplied in an opposite direction #30

@baehyunsol

Hello, I love your implementation, thanks so much! While trying to implement a KV cache in femtoGPT, I found something odd that I'd like to share and get your thoughts on.

The implementation transposes the q matrix and multiplies the k matrix by q^t (k * q^t). Here is the code:

femtoGPT/src/gpt.rs

Lines 206 to 207 in f0afe9e

let q_t = g.call(Transpose::new(), &[q])?;
let kq = g.call(MatMul::new(), &[k, q_t])?;

But in the paper, it is q * k^t.

Image

nanoGPT also computes q * k^t (see its code).


As far as I know, using k * q^t instead of q * k^t doesn't degrade the model's performance. K in femtoGPT plays the role that Q plays in other GPT implementations, and Q in femtoGPT plays the role that K plays elsewhere. The names are simply swapped.
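To make the relationship concrete, here is a minimal sketch in plain Rust (no femtoGPT types; `Matrix`, `transpose`, and `matmul` are hypothetical helpers written for this illustration). It relies only on the identity k * q^t = (q * k^t)^t: entry (i, j) of one product is entry (j, i) of the other, so the two score matrices hold the same dot products with the row and column roles exchanged.

```rust
// Hypothetical helper type for illustration: a row-major dense matrix.
type Matrix = Vec<Vec<f32>>;

/// Returns the transpose of `m` (assumes all rows have the same length).
fn transpose(m: &Matrix) -> Matrix {
    if m.is_empty() {
        return Vec::new();
    }
    let (rows, cols) = (m.len(), m[0].len());
    (0..cols)
        .map(|c| (0..rows).map(|r| m[r][c]).collect())
        .collect()
}

/// Returns `a * b`, where `a` is (n x d) and `b` is (d x m).
fn matmul(a: &Matrix, b: &Matrix) -> Matrix {
    // Transposing `b` lets each output entry be a dot product of two rows.
    let b_t = transpose(b);
    a.iter()
        .map(|row| {
            b_t.iter()
                .map(|col| row.iter().zip(col).map(|(x, y)| x * y).sum::<f32>())
                .collect()
        })
        .collect()
}
```

For any `q` and `k` with matching inner dimension, `matmul(&k, &transpose(&q))` should equal `transpose(&matmul(&q, &transpose(&k)))`, which is why relabeling the two projections recovers the conventional form.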

There's a dilemma:

  1. If we don't fix it, femtoGPT and other GPT implementations use the names in opposite ways, which might confuse newcomers.
  2. If we do fix it, all previously trained models become incompatible with the new version of femtoGPT. We could manually swap the head_{}_{}_q and head_{}_{}_k tensors, but that would be a lot of work.
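The manual swap in option 2 could perhaps be automated with a small migration step. The sketch below is only an illustration under stated assumptions: it treats a checkpoint as a `HashMap<String, Vec<f32>>` keyed by tensor name, which may not match how femtoGPT actually stores its tensors, and it assumes the affected names start with `head_` and end in exactly `_q` or `_k`, following the `head_{}_{}_q` pattern above. `swapped_name` and `swap_q_and_k` are hypothetical helpers, not part of femtoGPT.

```rust
use std::collections::HashMap;

/// Hypothetical helper: exchanges a trailing `_q` and `_k` on names that
/// start with `head_`; every other name is returned unchanged.
fn swapped_name(name: &str) -> String {
    if !name.starts_with("head_") {
        return name.to_string();
    }
    if let Some(stem) = name.strip_suffix("_q") {
        format!("{}_k", stem)
    } else if let Some(stem) = name.strip_suffix("_k") {
        format!("{}_q", stem)
    } else {
        name.to_string()
    }
}

/// Hypothetical migration: renames every tensor so that the data stored
/// under `head_.._q` ends up under `head_.._k` and vice versa.
/// The tensor data itself is moved, not copied or modified.
fn swap_q_and_k(tensors: HashMap<String, Vec<f32>>) -> HashMap<String, Vec<f32>> {
    tensors
        .into_iter()
        .map(|(name, data)| (swapped_name(&name), data))
        .collect()
}
```

Because the renaming is its own inverse, running the migration twice should restore the original checkpoint, which makes it easy to sanity-check before overwriting any saved model.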
