Hello, I love your implementation, thanks so much! While trying to implement a KV cache in femtoGPT, I found something odd that I'd like to share and ask about.
In the implementation, the q matrix is transposed and k is multiplied by q^t (i.e. k * q^t). Here is the code, lines 206 to 207 in f0afe9e:

```rust
let q_t = g.call(Transpose::new(), &[q])?;
let kq = g.call(MatMul::new(), &[k, q_t])?;
```
But in the paper, it is q * k^t.

Also, in nanoGPT, it's q * k^t (code).
As far as I know, using k * q^t instead of q * k^t doesn't degrade the model's performance: K in femtoGPT plays the role that Q plays in the other GPTs, and Q in femtoGPT plays the role of K. It's as if the two names were simply swapped.
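To make the "only the names are swapped" claim concrete: k * q^t is exactly the transpose of q * k^t, so the two orderings produce the same attention scores up to a transpose of the score matrix. A minimal self-contained sketch (toy matrices and hand-rolled `matmul`/`transpose` helpers, not femtoGPT's actual tensor code):

```rust
// Demonstrates that k * q^T == (q * k^T)^T, so swapping the names of
// Q and K only transposes the attention-score matrix.
fn matmul(a: &[Vec<f32>], b: &[Vec<f32>]) -> Vec<Vec<f32>> {
    let (n, m, p) = (a.len(), b.len(), b[0].len());
    let mut out = vec![vec![0.0; p]; n];
    for i in 0..n {
        for j in 0..p {
            for k in 0..m {
                out[i][j] += a[i][k] * b[k][j];
            }
        }
    }
    out
}

fn transpose(a: &[Vec<f32>]) -> Vec<Vec<f32>> {
    let (n, m) = (a.len(), a[0].len());
    let mut out = vec![vec![0.0; n]; m];
    for i in 0..n {
        for j in 0..m {
            out[j][i] = a[i][j];
        }
    }
    out
}

fn main() {
    // Toy Q and K: two tokens, head dimension 3.
    let q = vec![vec![1.0, 2.0, 3.0], vec![4.0, 5.0, 6.0]];
    let k = vec![vec![7.0, 8.0, 9.0], vec![1.0, 0.0, 2.0]];

    let qkt = matmul(&q, &transpose(&k)); // q * k^T (paper / nanoGPT)
    let kqt = matmul(&k, &transpose(&q)); // k * q^T (femtoGPT)

    // Element by element, (q * k^T)^T equals k * q^T.
    assert_eq!(transpose(&qkt), kqt);
    println!("k * q^T == (q * k^T)^T: verified");
}
```

Since the softmax-and-weighting steps downstream can be written against either orientation, the trained weights simply end up with Q and K playing each other's roles.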
There's a dilemma:
- If we don't fix it, femtoGPT and the other GPTs use the names oppositely, which might confuse newcomers.
- If we do fix it, all previously trained models become incompatible with the new version of femtoGPT. We could manually swap the `head_{}_{}_q` and `head_{}_{}_k` tensors, but that would be a lot of work.
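For the second option, the manual swap could in principle be automated with a one-time migration pass over a loaded checkpoint. This is a hypothetical sketch only: it assumes the parameters are available as a name-to-tensor map with keys ending in `_q` and `_k`, which is an assumption about the storage layout, not femtoGPT's actual format.

```rust
use std::collections::HashMap;

// Hypothetical migration: swap every "..._q" tensor with its matching
// "..._k" tensor in a name -> tensor map. The map layout and key suffixes
// are assumptions for illustration, not femtoGPT's real checkpoint format.
fn swap_qk_names(params: &mut HashMap<String, Vec<f32>>) {
    let q_keys: Vec<String> = params
        .keys()
        .filter(|name| name.ends_with("_q"))
        .cloned()
        .collect();
    for q_key in q_keys {
        // Build the sibling key by replacing the "_q" suffix with "_k".
        let k_key = format!("{}_k", &q_key[..q_key.len() - 2]);
        // Only swap when the matching _k tensor exists.
        if params.contains_key(&k_key) {
            let q_val = params.remove(&q_key).unwrap();
            let k_val = params.insert(k_key, q_val).unwrap();
            params.insert(q_key, k_val);
        }
    }
}
```

If something like this worked, old checkpoints could be converted once instead of retraining, though naming collisions or a different on-disk format would need checking first.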