Hello, I love your implementation, thanks so much! While trying to implement a KV cache in femtoGPT, I found something odd that I'd like to share and ask about.
In the implementation, the q matrix is transposed and k is multiplied by q^t (i.e. k * q^t). Here is the code, lines 206 to 207 in f0afe9e:

```rust
let q_t = g.call(Transpose::new(), &[q])?;
let kq = g.call(MatMul::new(), &[k, q_t])?;
```
But in the paper, it is q * k^t.

Also, in nanoGPT, it's q * k^t (code).
As far as I know, using k * q^t instead of q * k^t doesn't degrade the model's performance: K in femtoGPT plays the role that Q plays in the other GPTs, and Q in femtoGPT plays the role of K. It's as if the two names were simply swapped.
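To make the "only the names are swapped" claim concrete: k * q^t is exactly the transpose of q * k^t, so the two orderings produce the same attention scores up to a transpose of the score matrix. A minimal self-contained sketch (toy matrices and hand-rolled `matmul`/`transpose` helpers, not femtoGPT's actual tensor code):

```rust
// Demonstrates that k * q^T == (q * k^T)^T, so swapping the names of
// Q and K only transposes the attention-score matrix.
fn matmul(a: &[Vec<f32>], b: &[Vec<f32>]) -> Vec<Vec<f32>> {
    let (n, m, p) = (a.len(), b.len(), b[0].len());
    let mut out = vec![vec![0.0; p]; n];
    for i in 0..n {
        for j in 0..p {
            for k in 0..m {
                out[i][j] += a[i][k] * b[k][j];
            }
        }
    }
    out
}

fn transpose(a: &[Vec<f32>]) -> Vec<Vec<f32>> {
    let (n, m) = (a.len(), a[0].len());
    let mut out = vec![vec![0.0; n]; m];
    for i in 0..n {
        for j in 0..m {
            out[j][i] = a[i][j];
        }
    }
    out
}

fn main() {
    // Toy Q and K: two tokens, head dimension 3.
    let q = vec![vec![1.0, 2.0, 3.0], vec![4.0, 5.0, 6.0]];
    let k = vec![vec![7.0, 8.0, 9.0], vec![1.0, 0.0, 2.0]];

    let qkt = matmul(&q, &transpose(&k)); // q * k^T (paper / nanoGPT)
    let kqt = matmul(&k, &transpose(&q)); // k * q^T (femtoGPT)

    // Element by element, (q * k^T)^T equals k * q^T.
    assert_eq!(transpose(&qkt), kqt);
    println!("k * q^T == (q * k^T)^T: verified");
}
```

Since the softmax-and-weighting steps downstream can be written against either orientation, the trained weights simply end up with Q and K playing each other's roles.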
There's a dilemma:
- If we don't fix it, femtoGPT and the other GPTs use the names oppositely, which might confuse newcomers.
- If we do fix it, all previously trained models become incompatible with the new version of femtoGPT. We could manually swap the `head_{}_{}_q` and `head_{}_{}_k` tensors, but that would be a lot of work.
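For the second option, the manual swap could in principle be automated with a one-time migration pass over a loaded checkpoint. This is a hypothetical sketch only: it assumes the parameters are available as a name-to-tensor map with keys ending in `_q` and `_k`, which is an assumption about the storage layout, not femtoGPT's actual format.

```rust
use std::collections::HashMap;

// Hypothetical migration: swap every "..._q" tensor with its matching
// "..._k" tensor in a name -> tensor map. The map layout and key suffixes
// are assumptions for illustration, not femtoGPT's real checkpoint format.
fn swap_qk_names(params: &mut HashMap<String, Vec<f32>>) {
    let q_keys: Vec<String> = params
        .keys()
        .filter(|name| name.ends_with("_q"))
        .cloned()
        .collect();
    for q_key in q_keys {
        // Build the sibling key by replacing the "_q" suffix with "_k".
        let k_key = format!("{}_k", &q_key[..q_key.len() - 2]);
        // Only swap when the matching _k tensor exists.
        if params.contains_key(&k_key) {
            let q_val = params.remove(&q_key).unwrap();
            let k_val = params.insert(k_key, q_val).unwrap();
            params.insert(q_key, k_val);
        }
    }
}
```

If something like this worked, old checkpoints could be converted once instead of retraining, though naming collisions or a different on-disk format would need checking first.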