
[FEA]: cccl.c and cuda.parallel should support indirect_iterator_t which can be advanced on both host and device to support streaming algorithms #4148



Closed
1 task done
oleksandr-pavlyk opened this issue Mar 14, 2025 · 2 comments · Fixed by #4697
Labels
feature request New feature or request.

Comments

@oleksandr-pavlyk
Contributor

oleksandr-pavlyk commented Mar 14, 2025

Is this a duplicate?

Area

cuda.parallel (Python)

Is your feature request related to a problem? Please describe.

To attain optimal performance, kernels for some algorithms must use 32-bit types to store problem-size arguments.

Supporting these algorithms for problem sizes in excess of INT_MAX can be done with a streaming approach, with the streaming logic encoded in the algorithm's dispatcher. The dispatcher needs to increment the iterators on the host.
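
For illustration, a streaming dispatcher might look roughly like the sketch below. This is only a sketch under stated assumptions: invoke_pass stands in for a single launch of the algorithm with a 32-bit problem size, and the host-side operator+= on the iterators is exactly the capability discussed in this issue; neither name is existing cccl.c API.

#include <algorithm>
#include <climits>
#include <cstdint>

// Hypothetical per-pass launcher: runs the algorithm on `pass_size` items
// starting at the current positions of `in` and `out`.
template <typename InputIt, typename OutputIt>
void invoke_pass(InputIt& in, OutputIt& out, uint32_t pass_size);

template <typename InputIt, typename OutputIt>
void streaming_dispatch(InputIt in, OutputIt out, uint64_t total_size)
{
  constexpr uint64_t max_pass = static_cast<uint64_t>(INT_MAX);
  uint64_t done = 0;
  while (done < total_size)
  {
    const uint32_t pass_size = static_cast<uint32_t>(std::min(max_pass, total_size - done));
    invoke_pass(in, out, pass_size); // launch with a 32-bit problem size
    in  += pass_size;                // host-side advance of the input iterator
    out += pass_size;                // host-side advance of the output iterator
    done += pass_size;
  }
}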

This is presently not supported by cccl.c.parallel, since indirect_arg_t does not implement an increment operator.

Since indirect_arg_t is used to represent cccl_value_t, cccl_operation_t and cccl_iterator_t, and incrementing only makes sense for iterators, a dedicated type indirect_iterator_t must be introduced, which would implement operator+=.

If the entirety of the iterator state is user-defined, cuda.parallel must provide a host function pointer that increments the iterator's state, obtained by compiling the advance function for the host.

If instead we define the state as a struct that contains a size_t linear_id in addition to the user-defined state, we could get rid of the user-defined advance function altogether, but would need to provide access to linear_id to the dereference function.

The approaches need to be prototyped and compared.

Describe the solution you'd like

The solution should unblock #3764

Additional context

#3764 (comment)

@oleksandr-pavlyk
Contributor Author

Approach 1: augment the state of the iterator

CCCL_POINTER

It should be noted that for cccl_iterator_kind_t::CCCL_POINTER we can advance the state on the host, provided the host type representing the type-erased device pointer stores the value size in bytes.

struct indirect_pointer_t {
  void *ptr;          // points at the type-erased device pointer stored in it.state
  size_t value_size;  // size of the value type in bytes

  indirect_pointer_t(cccl_iterator_t &it) : ptr(&it.state), value_size(it.value_type.size) {
    assert(it.type == cccl_iterator_kind_t::CCCL_POINTER);
  }

  indirect_pointer_t& operator+=(uint64_t offset) {
    // advance the stored device pointer by `offset` elements
    char **ptr_ptr = reinterpret_cast<char **>(ptr);
    *ptr_ptr += (offset * value_size);
    return *this;
  }

  void* operator&() const {
    return ptr;
  }
};

With no modification to the device side, the launcher would only copy the pointer value pointed to by ptr.

CCCL_ITERATOR

For CCCL_ITERATOR we need to augment the state of the iterator with a difference_type offset struct member on both the host and the device sides.

struct indirect_iterator_t {
    // -----
    // type definitions
    // ----
    void *ptr;
    difference_type *offset_ptr;

    indirect_iterator_t(cccl_iterator_t &it) : ptr(nullptr), offset_ptr(nullptr) {
      size_t offset_offset = align_up(it.size, sizeof(difference_type));
      size_t combined_nbytes = offset_offset + sizeof(difference_type);
      // allocate zero-initialized memory for the user-defined state followed by the offset
      ptr = ::calloc(1, combined_nbytes);
      // copy content of state from `cccl_iterator_t` to the allocation
      ::memcpy(ptr, it.state, it.size);
      // the offset lives right past the (aligned) user-defined state; calloc left it at zero
      offset_ptr = reinterpret_cast<difference_type *>(static_cast<char *>(ptr) + offset_offset);
    }

    ~indirect_iterator_t() noexcept {
      // deallocate memory
      ::free(ptr);
    }

    indirect_iterator_t& operator+=(difference_type offset) {
      *offset_ptr += offset;
      return *this;
    }
};

On the device side, the make_input_iterator needs to be modified as suggested by @gevtushenko:

struct __align__(ALIGNMENT) input_iterator_t {
  // type definitions

  __device__ inline value_type operator*() const {
      // apply the host-maintained offset before dereferencing the user state
      input_iterator_t it = (*this + offset);
      return DEREF(it.data);
  }
  __device__ inline input_iterator_t& operator+=(difference_type diff) {
      ADVANCE(data, diff);
      return *this;
  }
  __device__ inline value_type operator[](difference_type diff) const {
      return *(*this + diff);
  }
  __device__ inline input_iterator_t operator+(difference_type diff) const {
      input_iterator_t result = *this;
      result += diff;
      return result;
  }

  char data[ STATE_SIZE ];
  int64_t offset;
};

Somehow these two structs have to be combined into a single type to service cccl_iterator_t regardless of the iterator kind it represents.
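
One possibility is a single host-side wrapper that stores the iterator kind and dispatches in operator+=. The sketch below is only illustrative; the member names, and the assumption that the pointer-advance and offset-advance data can coexist in one struct, are mine and not existing cccl.c code.

struct indirect_iterator_t {
  cccl_iterator_kind_t kind;     // CCCL_POINTER or CCCL_ITERATOR
  void *state_ptr;               // device pointer value (POINTER) or user-defined state (ITERATOR)
  size_t value_size;             // element size in bytes, used for CCCL_POINTER
  difference_type *offset_ptr;   // trailing offset, used for CCCL_ITERATOR (Approach 1)

  indirect_iterator_t& operator+=(difference_type n) {
    if (kind == cccl_iterator_kind_t::CCCL_POINTER) {
      // advance the stored device pointer by n elements
      char **p = reinterpret_cast<char **>(state_ptr);
      *p += n * value_size;
    } else {
      // record the pending offset; the device side applies it on dereference
      *offset_ptr += n;
    }
    return *this;
  }
};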

Approach 2: user-provided host function

We could augment cccl_iterator_t with a function pointer advance_host_fn(void *state, uint64_t offset). This pointer would only be used for cccl_iterator_kind_t::CCCL_ITERATOR.

Numba allows us to compile the advance function to obtain a native function pointer:

import ctypes

import numba

def advance(state, incr):
    state[0] = state[0] + incr

# signature: void advance(int64_t *state, int64_t incr)
numba_t = numba.types.CPointer(numba.types.int64)
sig = numba.void(numba_t, numba.int64)

# compile `advance` into a host-callable C function
c_advance_fn = numba.cfunc(sig)(advance)

state_ = ctypes.c_int64(73)
state_ptr = ctypes.pointer(state_)

c_advance_fn.ctypes(state_ptr, ctypes.c_int64(17))
assert state_.value == 73 + 17

c_advance_fn.ctypes(state_ptr, ctypes.c_int64(10))
assert state_.value == 100

# raw function pointer suitable for storing on the C side
raw_function_ptr = ctypes.cast(c_advance_fn.ctypes, ctypes.c_void_p)
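
On the cccl.c side, the host-side wrapper could then forward operator+= to this user-provided pointer. Below is a minimal sketch, assuming the pointer is stored on the wrapper as a member named host_advance_fn; the field and type names are illustrative, not the library's actual API.

#include <cstdint>
#include <stdexcept>

using host_advance_fn_t = void (*)(void *state, uint64_t offset);

struct indirect_iterator_t {
  void *state;                       // type-erased, user-defined iterator state
  host_advance_fn_t host_advance_fn; // null if no host-side advance was provided

  indirect_iterator_t& operator+=(uint64_t offset) {
    if (host_advance_fn == nullptr) {
      // surface a clear error when a streaming pass needs a host-side advance
      throw std::runtime_error("iterator does not support advancing on the host");
    }
    host_advance_fn(state, offset);
    return *this;
  }
};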

Approach 3: Make advance function internal

We could make linear_id a mandatory member of the indirect_iterator_t struct, and generate both the device and the host functions that mutate linear_id inside cccl.c.parallel.

The dereference and assign functions would then need to become like __getitem__ and __setitem__ in Python and take linear_id as an argument: instead of dereference(state) we would call dereference(state, linear_id), and instead of assign(state, val) we would call assign(state, linear_id, val).
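
Purely for illustration, the shape of the change to the generated entry points might look like the sketch below; the exact signatures emitted by cccl.c are not spelled out here, so treat the parameter lists as assumptions.

// today (sketch): dereference and assign see only the opaque state
__device__ value_type dereference(const void *state);
__device__ void assign(void *state, value_type value);

// Approach 3 (sketch): the position is passed in explicitly
__device__ value_type dereference(const void *state, size_t linear_id);
__device__ void assign(void *state, size_t linear_id, value_type value);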

oleksandr-pavlyk added a commit to oleksandr-pavlyk/cccl that referenced this issue Apr 15, 2025
oleksandr-pavlyk added a commit to oleksandr-pavlyk/cccl that referenced this issue Apr 17, 2025
oleksandr-pavlyk added a commit to oleksandr-pavlyk/cccl that referenced this issue Apr 18, 2025
oleksandr-pavlyk added a commit to oleksandr-pavlyk/cccl that referenced this issue Apr 21, 2025
oleksandr-pavlyk added a commit to oleksandr-pavlyk/cccl that referenced this issue Apr 21, 2025
oleksandr-pavlyk added a commit to oleksandr-pavlyk/cccl that referenced this issue Apr 21, 2025
oleksandr-pavlyk added a commit to oleksandr-pavlyk/cccl that referenced this issue Apr 22, 2025
oleksandr-pavlyk added a commit to oleksandr-pavlyk/cccl that referenced this issue Apr 22, 2025
@oleksandr-pavlyk oleksandr-pavlyk moved this from Todo to In Progress in CCCL Apr 23, 2025
oleksandr-pavlyk added a commit to oleksandr-pavlyk/cccl that referenced this issue Apr 29, 2025
oleksandr-pavlyk added a commit to oleksandr-pavlyk/cccl that referenced this issue Apr 29, 2025
oleksandr-pavlyk added a commit to oleksandr-pavlyk/cccl that referenced this issue May 9, 2025
oleksandr-pavlyk added a commit to oleksandr-pavlyk/cccl that referenced this issue May 9, 2025
@cccl-authenticator-app cccl-authenticator-app bot moved this from In Progress to In Review in CCCL May 14, 2025
oleksandr-pavlyk added a commit to oleksandr-pavlyk/cccl that referenced this issue May 14, 2025
oleksandr-pavlyk added a commit that referenced this issue May 20, 2025
* Checkpoint for implementing approach 2 in gh-4148

* Checkpoint 2

* Printing added for debugging

* Almost working version

* Remove superfluous import, increased num_segments to exceed INT_MAX

* Define segment size variable and use it in the test

* Restore INT_MAX increment in streaming iterations in segmented_reduce

* Remove stdio from indirect_arg, remove printf

* Test for large_num_segments is parametrized by three modulus params

The test cooks up an input sequence of the form F(i+1) - F(i) for
0 <= i < num_segments * segment_size.

Fixed-size segmented reduction is expected to be

[ F(segment_size * (k + 1)) - F(segment_size * k) for
     k in range(num_segments) ]

All operations (addition (+), inversion (-)) are assumed to take
place in a ring.

In this test the ring is the cyclic ring of integers modulo 7:
F(i) = g(i) * g(i+1) % 7, where g: i -> (i % 5 + i % 3).

Note that `my_add` computes addition modulo 7 too.

Choice of moduli 3, 5, 7 is arbitrary and may be changed. The test
is expected to pass so long as they are changed consistently throughout
the test.

* Add test_large_num_segments3

This example computes a segmented reduce of an input constant iterator
of int8(1) values. The offset iterators are crafted so that the segment
lengths are

1, 2, ..., (p - 2), (p - 1), 1, 2, ...,  (p - 2), (p - 1), 1, ...

The expected sums would equal these lengths.

The example is run for (2**15 + 2**3) * 2**16 segments, which exceeds
INT_MAX by about 2**19.

1. Input iterator changed to have type int16
2. Initial value changed to have type int16
3. Changed the range of segment lengths from [1, p] to
   [m0, m0 + p) and set m0, p as they are set in the native test.

   Specifically, m0, p = 265, 163

This means segment sizes vary from 265 elements to 428 elements,
which is an improvement over the previous situation where they varied
from 1 to 113 elements (and most threads in a 256-thread block were
unused).

* Change indirect_iterator_t::operator+= to throw if host_advance_fn is null

This ensures meaningful prescriptive cerr output.

* Enable large num_segments test in test_segmented_reduce.cpp

* Add docstrings to test_large_num_segments and test_large_num_segments3

* Fix typo in the test name in test_bindings

* Remove make_advance_iterator. Implement IteratorBase.__add__ to copy state and advance cvalue. We rely on cvalue being a ctype for which __add__ is defined

* Add host_advance_ property to iterators

For IteratorBase it is defined as None
For concrete iterators, it is defined as input_iterators, except
for transform iterator, where host_advance does not involve numba-cuda
compiled object.

* Add stubs for Iterators.host_advance_fn getter/setter

* Add function to set host_advance_fn member of bindings Iterator member

* segmented_reduce calls to set host_advance_fn on output and start/end offset iterators

This is done during the build step irrespective of the input size, but it should perhaps
be done only when needed, the first time num_segments > INT_MAX is encountered. TBD

* Remove band-aid setting of host_advance_fn.

The host_advance function is now generated and set during the segmented_reduce
build step.

* Use concept to narrow types of supported increment in indirect_iterator_t::operator+=

* Change validation of large num_segments tests in Python

Validation is now performed by constructing an iterator yielding
expected result. The allocating fixed size uint8 buffer and running
binary transform to compare actual to expected and write result
of comparison to the validation buffer. We then use
`cp.all(validation.view(cp.bool_))` to validate this chunk, and
move to the next.

* Mark large num_segments tests as large

* Run large tests, but do it sequentially (-n 0)
@github-project-automation github-project-automation bot moved this from In Review to Done in CCCL May 20, 2025
@oleksandr-pavlyk
Contributor Author

I should comment that since approaches 1 and 3 effectively augment the iterator state with a host-accessible counter that needs to be copied to the device, these approaches increase register usage in kernels. There might be kernels where such an increase causes a performance cliff due to register spilling.

For this reason, approach 2 was implemented and merged in gh-4697. It does require a host compiler (provided by Numba) and increases the cost of creating the algorithm, but this cost is only incurred for algorithms that employ the streaming approach, and so far there are only a few such algorithms.

The only issue is that we presently always compile the host-callable advance function. If some users know in advance that num_segments will never exceed INT_MAX, it might be useful for them to avoid this additional cost.

Perhaps a keyword argument to algorithms.segmented_reduce is in order.
