Support for per-head scales in cosine-sim attention

Usually with cosine-sim models I'd train with learned per-head scales for the attention logits. I guess I can get this by multiplying `q` & `k` by `sqrt(scales)` before the dot product, but that's probably less stable.
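
To make the request concrete, here is a minimal sketch in plain Python comparing the two approaches for a single head: scaling the logits directly versus folding `sqrt(scale)` into `q` and `k`. The function names, `head_scales`, and the toy inputs are all hypothetical and not taken from any library. In a real model the scales would be learned parameters, and the workaround assumes each scale is positive.

```python
import math

def l2_normalize(v):
    # Scale a vector to unit length (assumes v is not all zeros).
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]

def cosine_sim_logits(q, k, scale):
    # q: [n_q][d], k: [n_k][d] for one head; scale: that head's scalar.
    # The scale is applied to the cosine similarities directly.
    qn = [l2_normalize(row) for row in q]
    kn = [l2_normalize(row) for row in k]
    return [[scale * sum(a * b for a, b in zip(qi, kj)) for kj in kn] for qi in qn]

def cosine_sim_logits_workaround(q, k, scale):
    # Workaround: multiply normalized q and k by sqrt(scale) before the dot product.
    # Requires scale > 0.
    s = math.sqrt(scale)
    qn = [[s * x for x in l2_normalize(row)] for row in q]
    kn = [[s * x for x in l2_normalize(row)] for row in k]
    return [[sum(a * b for a, b in zip(qi, kj)) for kj in kn] for qi in qn]

# Hypothetical per-head scales and toy inputs (the same q/k are reused for both heads).
head_scales = [8.0, 12.5]
q = [[1.0, 2.0, 3.0], [0.5, -1.0, 2.0]]
k = [[-1.0, 0.0, 1.0], [2.0, 2.0, 2.0]]

for head, scale in enumerate(head_scales):
    direct = cosine_sim_logits(q, k, scale)
    workaround = cosine_sim_logits_workaround(q, k, scale)
    max_diff = max(
        abs(a - b)
        for row_d, row_w in zip(direct, workaround)
        for a, b in zip(row_d, row_w)
    )
    print(f"head {head}: max |direct - workaround| = {max_diff:.3e}")
```

The two give the same logits up to floating-point rounding in this toy setup. Whether they also behave the same in low precision or inside a fused attention kernel is the open question here, which is why a native per-head scale argument would be nicer.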