Description
In JuliaGPU/CUDA.jl#1895, I made the size tuple of CuDeviceArray
32 bits so that we can emit better code (lowering register pressure, making it possible to execute compute & indexing instructions in parallel, etc) However, the NVPTX back-end defaults to using 64 bits for indexing pointers, resulting in 64-bits GEPs being introduced by the front-end and optimization. I tried to change that by specifying a 32-bit pointer index size in the data layout, #444, but that breaks 64-bits indices which can still get reintroduced by optimization (see e.g. #461).
Either we try this again on LLVM 17 (where a bug has been fixed that was introducing 64-bits GEP offsets), or we instead create an optimization pass that demotes GEP indices to 32 bits if possible (e.g., if they are constants, or come from the size field of a device array).