How does launch_transform4d_0213 work inside Transformer Kernel #1415
Unanswered · li-yi-dong asked this question in Q&A
I'm confused by the function `launch_transform4d_0213` inside the Transformer Kernel. Its input is the output of `_attn_context.Forward(buf_1)`, whose size is `hidden_size * seq_length * batch_size`. But inside `launch_transform4d_0213`, it launches a CUDA kernel with `dim3 grid_dims(batch_size, heads * ((seq_length - 1) / 8 + 1), trans_count)`.
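For reference, here is a sketch of that launch configuration. The grid dims are as quoted above; the block shape is my assumption, consistent with the `/ 8` term in the grid's y dimension:

```cuda
// Grid dims as quoted in the question; block dims are an assumption
// (the 8 in blockDim.y would match the "(seq_length - 1) / 8 + 1" term).
dim3 grid_dims(batch_size, heads * ((seq_length - 1) / 8 + 1), trans_count);
dim3 block_dims(hidden_dim / heads, 8);
transform4d_0213<<<grid_dims, block_dims, 0, stream>>>(
    out, in, heads, seq_length, hidden_dim);
```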
Notice that `blockIdx.y` can easily be larger than `heads`. Inside the CUDA kernel, `d0_stride = hidden_dim * seq_length`, and the input is accessed by `float4 vals_vec = in_vec[cnt * d0_stride * gridDim.x + d0 * d0_stride + d1 * d1_stride + d2 * d2_stride + d3]`, which means `d1 * d1_stride` should never be larger than `d0_stride`. Since `d1_stride = d0_stride / heads`, that in turn means `d1` should never exceed `heads`.
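To make that bound concrete, here is the arithmetic with some hypothetical sizes (none of these numbers come from the original code):

```cuda
// Hypothetical sizes, chosen only to illustrate the bound:
const int hidden_dim = 1024, seq_length = 128, heads = 16;

const int d0_stride = hidden_dim * seq_length;  // 131072 elements: one batch item
const int d1_stride = d0_stride / heads;        // 8192 elements: one head's slice
const int d2_stride = hidden_dim / heads;       // 64 elements: one head's row

// d1 * d1_stride < d0_stride holds exactly when d1 < heads (d1 < 16 here);
// any d1 >= heads walks past this batch item's data.
```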
But in the kernel, `d1 = blockIdx.y / ((seq_length - 1) / blockDim.y + 1)`. For `heads` > 8, `((seq_length - 1) / blockDim.y + 1)` = 1 and `d1` would be equal to `blockIdx.y`, which may exceed `heads`. This would push the input index past the meaningful range. Am I misunderstanding the output of `_attn_context.Forward`, or `launch_transform4d_0213` itself? Below are the source codes:
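(The full listing is not reproduced here; the following is a minimal sketch reassembled from the fragments quoted above. The `d2`, `cnt`, and `d3` assignments, and the output write, are my assumptions rather than verbatim source.)

```cuda
// Sketch reassembled from the quoted fragments -- not verbatim source.
// Assumed launch: grid (batch_size, heads * ((seq_length - 1) / 8 + 1), trans_count),
// block (hidden_dim / heads, 8), with hidden_dim counted in float4 units.
__global__ void transform4d_0213(float* out, const float* in,
                                 int heads, int seq_length, int hidden_dim)
{
    int d0_stride = hidden_dim * seq_length;  // one batch item
    int d1_stride = d0_stride / heads;        // one head's slice
    int d2_stride = hidden_dim / heads;       // one head's row

    int d0 = blockIdx.x;                                            // batch index
    int d1 = blockIdx.y / ((seq_length - 1) / blockDim.y + 1);      // head index
    int d2 = (threadIdx.y + blockDim.y * blockIdx.y) % seq_length;  // sequence index (assumed)
    int cnt = blockIdx.z;                                           // q/k/v slot over trans_count (assumed)
    int d3 = threadIdx.x;                                           // float4 lane within a head

    const float4* in_vec = reinterpret_cast<const float4*>(in);
    float4* out_vec = reinterpret_cast<float4*>(out);

    // Input access exactly as quoted in the question:
    float4 vals_vec = in_vec[cnt * d0_stride * gridDim.x + d0 * d0_stride +
                             d1 * d1_stride + d2 * d2_stride + d3];

    // Output side (assumed): the same element written at the
    // (0, 2, 1, 3)-permuted offset.
    out_vec[cnt * d0_stride * gridDim.x + d0 * d0_stride +
            d1 * d2_stride + d2 * hidden_dim + d3] = vals_vec;
}
```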