I’m digging into Section 2.1 of the V3.2-Exp technical report and noticed the term “attention score” is used in several places.
Could you please confirm its precise definition in this context? Is it:
- the raw pre-softmax query-key dot product,
- the post-softmax probability weight, or
- something else (e.g., a scaled/renormalized value introduced in V3.2)?
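
To make the first two options concrete, here is a minimal sketch of the quantities I have in mind, using textbook scaled dot-product attention for a single query. The function name, variable names, and toy vectors are my own. Nothing here is taken from the report, and it is not meant to describe what V3.2-Exp actually computes.

```python
import math

def attention_quantities(q, keys):
    """Return (raw, scaled, weights) for one query vector against a list of keys.

    Hypothetical helper for illustration only; textbook attention, not the report's method.
    """
    d = len(q)
    # Option 1: raw pre-softmax query-key dot products.
    raw = [sum(qi * ki for qi, ki in zip(q, k)) for k in keys]
    # Same logits after the usual 1/sqrt(d) scaling (still pre-softmax).
    scaled = [s / math.sqrt(d) for s in raw]
    # Option 2: post-softmax weights (max subtracted for numerical stability).
    m = max(scaled)
    exps = [math.exp(s - m) for s in scaled]
    total = sum(exps)
    weights = [e / total for e in exps]
    return raw, scaled, weights

q = [1.0, 0.0, 1.0, 0.0]
keys = [
    [1.0, 0.0, 1.0, 0.0],
    [0.0, 1.0, 0.0, 1.0],
    [1.0, 1.0, 1.0, 1.0],
]
raw, scaled, weights = attention_quantities(q, keys)
print(raw)      # pre-softmax dot products
print(scaled)   # dot products divided by sqrt(d)
print(weights)  # post-softmax weights, summing to 1
```

In this sketch `raw` and `scaled` are unbounded logits, while `weights` is a probability distribution over keys, so knowing which of these the report means changes how any threshold or top-k selection on the "attention score" should be reproduced.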
A short clarification would help a lot for reproducing the experiments. Thanks!