Simple Block Reduce Fails when using while loops #330

Open
@anicusan

Hi, thank you for developing this library - I would like to write optimised kernels for common GPU algorithms such as reduce, scan, radix sort, etc. similar to CUB but available on all KernelAbstractions platforms. The resulting "KA standard library" (KALib? Caleb?) could be used as a benchmark for future KA development & optimisation - and I can use the lessons along the way to populate the "Writing Kernels" section in the documentation. Big plans, but...

I'm implementing the block-wise reduce following this tutorial with this simple-looking code:

using KernelAbstractions
using CUDA
using CUDAKernels


@kernel function block_reduce(out, in)

    # Get block / workgroup size
    bs = @uniform @groupsize()[1]

    # Block / group index, thread index within block, global thread index
    bi = @index(Group, Linear)
    ti = @index(Local, Linear)
    gi = @index(Global, Linear)

    # Copy each thread's corresponding item from global to shared memory
    cache = @localmem eltype(out) (bs,)
    cache[ti] = in[gi]
    @synchronize

    # Reduce elements in shared memory using sequential addressing following
    # https://developer.download.nvidia.com/assets/cuda/files/reduction.pdf
    @private s = bs ÷ 2
    while s > 0
        if ti <= s  # ti is 1-based, so thread s must take part too
            cache[ti] += cache[ti + s]
        end

        @synchronize

        s = s >> 1
    end

    # Copy result back to global memory
    if ti == 1
        out[bi] = cache[1]
    end
end


num_blocks = 10
block_size = 32
num_elements = num_blocks * block_size

in = rand(1:10, num_elements) |> CuArray
out = zeros(num_blocks) |> CuArray


kernel_reduce = block_reduce(CUDADevice(), block_size)
ev = kernel_reduce(out, in, ndrange=num_elements)

wait(ev)
println(out)

It shouldn't be more exotic than the example code in the docs - however, these two lines:

    @private s = bs ÷ 2
    while s > 0

Produce the following errors:

Reason: unsupported use of an undefined name (use of 'bs')
Stacktrace:
 [1] macro expansion
   @ ~/Prog/Julia/KALib/prototype/reduce.jl:31
 [2] gpu_block_reduce
   @ ~/.julia/packages/KernelAbstractions/DqITC/src/macros.jl:81
 [3] gpu_block_reduce
   @ ./none:0
Reason: unsupported dynamic function invocation (call to div)
Stacktrace:
 [1] macro expansion
   @ ~/Prog/Julia/KALib/prototype/reduce.jl:31
 [2] gpu_block_reduce
   @ ~/.julia/packages/KernelAbstractions/DqITC/src/macros.jl:81
 [3] gpu_block_reduce
   @ ./none:0
Reason: unsupported use of an undefined name (use of 's')
Stacktrace:
 [1] macro expansion
   @ ~/Prog/Julia/KALib/prototype/reduce.jl:32
 [2] gpu_block_reduce
   @ ~/.julia/packages/KernelAbstractions/DqITC/src/macros.jl:81
 [3] gpu_block_reduce
   @ ./none:0
Reason: unsupported dynamic function invocation (call to >)

[...Stacktrace...]

I tried following the code using Cthulhu.jl, but the errors appear simple: it's calling div(::Any, ::Int64) and >(::Any, ::Int64), so I assume the bs = @uniform @groupsize()[1] and @private s = bs ÷ 2 are not inferred as being integers.

If I switch the arrays and device to CPU() I get the following error:

    nested task error: MethodError: no method matching isless(::Int64, ::NTuple{32, Int64})
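
As far as I can tell, the CPU message is what plain Julia produces when an integer is compared against a tuple. The sketch below assumes (I have not confirmed this in the KernelAbstractions source) that on the CPU `s` ends up holding one value per work-item, i.e. an `NTuple{32, Int64}` for `block_size = 32`; the tuple here is a hypothetical stand-in, not something KernelAbstractions hands out directly.

```julia
# Hypothetical stand-in for what `s` seems to hold on the CPU backend:
# one Int64 per work-item, i.e. an NTuple{32, Int64} (assumption).
s = ntuple(_ -> 16, 32)

# `s > 0` is evaluated as `0 < s`, which falls back to `isless(0, s)`.
# No such method exists for (Int64, Tuple), so a MethodError is thrown.
err = try
    s > 0
    nothing
catch e
    e
end

println(typeof(err))  # type of the caught exception
println(err.f)        # function for which no method matched
```

If that assumption holds, the `while s > 0` condition is seeing the whole per-work-item tuple rather than a single integer.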

Would you know why these errors appear or how I could investigate (and fix..) them?
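
For reference, this is the result I expect from the kernel, sketched in plain Julia without KernelAbstractions. It runs the same sequential-addressing loop serially over the "threads" of each block. `reference_block_reduce` is a hypothetical helper name of mine, and the sketch assumes `block_size` is a power of two and that the input length is a multiple of it.

```julia
# Serial reference for the block-wise reduce (not a GPU kernel).
# Assumes block_size is a power of two and divides length(input).
function reference_block_reduce(input::AbstractVector, block_size::Integer)
    num_blocks = length(input) ÷ block_size
    out = zeros(eltype(input), num_blocks)

    for bi in 1:num_blocks
        # Copy of this block's elements, playing the role of shared memory
        cache = input[(bi - 1) * block_size + 1 : bi * block_size]

        # Sequential addressing: halve the active range each step
        s = block_size ÷ 2
        while s > 0
            for ti in 1:s          # 1-based "thread" indices 1..s are active
                cache[ti] += cache[ti + s]
            end
            s >>= 1
        end

        out[bi] = cache[1]
    end
    return out
end

input = collect(1:320)
println(reference_block_reduce(input, 32))
```

Each entry of the returned vector should equal the plain `sum` of the corresponding 32-element slice, which gives something to compare the kernel output against once it compiles.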

Thanks,
Leonard
