A generic parallel reduction using a single block (implemented in CUDA).
# run from repo dir
nvcc -o out/generic-reduction generic-reduction.cu
- push the integer version
- write a generic kernel (using C++ templates and functors)
- write a shared memory version
- write an efficient segmented reduction
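
As a starting point for the generic-kernel and shared-memory items above, here is a minimal sketch of what a templated single-block reduction could look like. The names `Sum` and `reduce_block` are hypothetical and are not taken from generic-reduction.cu. The sketch assumes the block size is a power of two and that the functor exposes an `identity()` member. It does not attempt the segmented case.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Hypothetical functor: a binary, associative operation plus its identity.
template <typename T>
struct Sum {
    __host__ __device__ T operator()(T a, T b) const { return a + b; }
    __host__ __device__ T identity() const { return T(0); }
};

// Single-block tree reduction in shared memory.
// Assumes blockDim.x is a power of two and gridDim.x == 1.
template <typename T, typename Op>
__global__ void reduce_block(const T* in, T* out, int n, Op op) {
    // Dynamic shared memory is declared untyped, then cast, so the
    // kernel can be instantiated for more than one T.
    extern __shared__ __align__(16) unsigned char smem_raw[];
    T* smem = reinterpret_cast<T*>(smem_raw);

    // Each thread first folds a strided slice of the input.
    T acc = op.identity();
    for (int i = threadIdx.x; i < n; i += blockDim.x) {
        acc = op(acc, in[i]);
    }
    smem[threadIdx.x] = acc;
    __syncthreads();

    // Halve the number of active threads each step.
    for (unsigned int s = blockDim.x / 2; s > 0; s >>= 1) {
        if (threadIdx.x < s) {
            smem[threadIdx.x] = op(smem[threadIdx.x], smem[threadIdx.x + s]);
        }
        __syncthreads();
    }

    if (threadIdx.x == 0) {
        *out = smem[0];
    }
}

int main() {
    const int n = 1000;
    const int threads = 256;  // power of two, as the kernel assumes

    int h_in[n];
    for (int i = 0; i < n; ++i) h_in[i] = i + 1;  // 1 + 2 + ... + 1000 = 500500

    int *d_in = nullptr, *d_out = nullptr, h_out = 0;
    cudaMalloc(&d_in, n * sizeof(int));
    cudaMalloc(&d_out, sizeof(int));
    cudaMemcpy(d_in, h_in, n * sizeof(int), cudaMemcpyHostToDevice);

    // One block; third launch parameter is the shared memory size in bytes.
    reduce_block<<<1, threads, threads * sizeof(int)>>>(d_in, d_out, n, Sum<int>{});

    cudaMemcpy(&h_out, d_out, sizeof(int), cudaMemcpyDeviceToHost);
    printf("sum = %d\n", h_out);

    cudaFree(d_in);
    cudaFree(d_out);
    return 0;
}
```

Passing the operation as a functor lets the same kernel serve other associative operations (for example a max functor with the smallest representable value as its identity) without touching the kernel body. Error checking on the CUDA calls is omitted to keep the sketch short.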