[FEA]: cccl.c and cuda.parallel should support indirect_iterator_t which can be advanced on both host and device to support streaming algorithms #4148
Comments
Approach 1: augment the state of the iterator
* Checkpoint for implementing approach 2 in gh-4148
* Checkpoint 2
* Printing added for debugging
* Almost working version
* Remove superfluous import, increase num_segments to exceed INT_MAX
* Define a segment size variable and use it in the test
* Restore INT_MAX increment in streaming iterations in segmented_reduce
* Remove stdio from indirect_arg, remove printf
* Test for large_num_segments is parametrized by three modulus params. The test cooks an input sequence of the form F(i+1) - F(i) for 0 <= i < num_segments * segment_size. Fixed-size segmented reduction is expected to be [ F(segment_size * (k + 1)) - F(segment_size * k) for k in range(num_segments) ]. All operations (addition (+), inversion (-)) are assumed to take place in a ring; in this test the ring is the ring of integers modulo 7, with F(i) = g(i) * g(i+1) % 7, where g: i -> (i % 5 + i % 3). Note that `my_add` computes addition modulo 7 too. The choice of moduli 3, 5, 7 is arbitrary and may be changed; the test is expected to pass so long as they are changed consistently throughout.
* Add test_large_num_segments3. This example computes a segmented reduce over an input constant iterator of int8(1) values. The offsets iterators are crafted so that segment lengths are 1, 2, ..., (p - 2), (p - 1), 1, 2, ..., (p - 2), (p - 1), 1, ..., so the expected sums equal these lengths. The example is run for (2**15 + 2**3) * 2**16 segments, which exceeds INT_MAX by about 2**19. Follow-up changes:
  1. Input iterator changed to have type int16
  2. Initial value changed to have type int16
  3. Changed the range of segment sizes from [1, p) to [m0, m0 + p), setting m0, p as in the native test, specifically m0, p = 265, 163. This means segment sizes vary from 265 to 428 elements, an improvement over the previous situation where they varied from 1 to 113 elements (and most threads in a 256-thread block were unused).
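The telescoping construction described above can be sketched in plain Python. This is a hypothetical standalone model of the test's data generation, not the actual cuda.parallel test code; the names `F`, `g`, `my_add` and the moduli 3, 5, 7 follow the commit message, while `segmented_reduce` is a host-side reference reduction standing in for the device algorithm.

```python
MOD = 7  # the ring of integers modulo 7, per the commit message

def g(i):
    return i % 5 + i % 3

def F(i):
    return (g(i) * g(i + 1)) % MOD

def make_input(num_segments, segment_size):
    # Input is the "discrete derivative" of F, reduced modulo 7.
    n = num_segments * segment_size
    return [(F(i + 1) - F(i)) % MOD for i in range(n)]

def expected_output(num_segments, segment_size):
    # Each fixed-size segment sum telescopes to a difference of two F values.
    return [(F(segment_size * (k + 1)) - F(segment_size * k)) % MOD
            for k in range(num_segments)]

def my_add(a, b):
    # The reduction operator: addition in the ring modulo 7.
    return (a + b) % MOD

def segmented_reduce(values, num_segments, segment_size):
    # Host-side reference for the fixed-size segmented reduction.
    out = []
    for k in range(num_segments):
        acc = 0
        for v in values[k * segment_size:(k + 1) * segment_size]:
            acc = my_add(acc, v)
        out.append(acc)
    return out
```

Because every input element is already reduced modulo 7, summing a segment with `my_add` yields exactly the telescoped difference modulo 7, which is what makes the expected output cheap to compute for arbitrarily many segments.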
* Change indirect_iterator_t::operator+= to throw if host_advance_fn is null. This ensures meaningful prescriptive cerr output.
* Enable the large num_segments test in test_segmented_reduce.cpp
* Add docstrings to test_large_num_segments and test_large_num_segments3
* Fix type in the test name in test_bindings
* Remove make_advance_iterator. Implement IteratorBase.__add__ to copy state and advance cvalue; we rely on cvalue being a ctype for which __add__ is defined.
* Add a host_advance_ property to iterators. For IteratorBase it is defined as None; for concrete iterators it is defined as input_iterators, except for the transform iterator, where host_advance does not involve a numba-cuda compiled object.
* Add stubs for Iterators.host_advance_fn getter/setter
* Add a function to set the host_advance_fn member of the bindings Iterator member
* segmented_reduce calls to set host_advance_fn on the output and start/end offset iterators. This is done during the build step irrespective of the input size, but it should perhaps be done as needed the first time num_segments > INT_MAX is encountered. TBD
* Remove band-aid setting of host_advance_fn. The host_advance function is now generated and set during the segmented_reduce build step.
* Use a concept to narrow the types of supported increment in indirect_iterator_t::operator+=
* Change validation of large num_segments tests in Python. Validation is now performed by constructing an iterator yielding the expected result, allocating a fixed-size uint8 buffer, and running a binary transform to compare actual to expected, writing the result of the comparison to the validation buffer. We then use `cp.all(validation.view(cp.bool_))` to validate the chunk and move on to the next one.
* Mark large num_segments tests as large
* Run large tests, but do it sequentially (-n 0)
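The `IteratorBase.__add__` item above can be illustrated with a minimal sketch. This is a hypothetical, heavily simplified stand-in for cuda.parallel's iterator class (the real class carries much more state); it shows the pattern of copying the host-visible state and advancing the copy without involving any device-compiled code.

```python
import ctypes

class IteratorBase:
    """Hypothetical, simplified stand-in for cuda.parallel's IteratorBase."""

    def __init__(self, cvalue):
        # cvalue holds the host-visible iterator state as a ctypes scalar
        # (e.g. a linear index); its .value supports ordinary addition.
        self.cvalue = cvalue

    @property
    def host_advance(self):
        # The base class has no separately compiled host advance function;
        # concrete iterators would override this.
        return None

    def __add__(self, offset):
        # Copy the state and advance the copy on the host;
        # the original iterator is left untouched.
        return type(self)(type(self.cvalue)(self.cvalue.value + offset))

it = IteratorBase(ctypes.c_size_t(0))
shifted = it + 5  # pure host-side advance: no kernel launch involved
```

The design point is that `__add__` returns a fresh iterator rather than mutating in place, so a streaming dispatcher can keep the original iterator valid while launching successive chunks.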
I should comment that since approaches 1 and 3 effectively augment the state of iterators with a host-accessible counter which needs to be copied to the device, these approaches lead to an increase in register usage by kernels. There might be kernels where such an increase causes a performance cliff due to register spilling. For this reason, approach 2 was implemented and merged in gh-4697. It does require a host compiler (provided by Numba) and increases the cost of creating the algorithm, but this cost is only incurred for those algorithms that employ the streaming approach, and so far there are only a few such algorithms. The only issue is that we presently always compile the host-callable
Perhaps a keyword argument to
Is this a duplicate?
Area
cuda.parallel (Python)
Is your feature request related to a problem? Please describe.
To attain optimal performance, kernels for some algorithms must use 32-bit types to store problem-size arguments. Supporting these algorithms for problem sizes in excess of `INT_MAX` can be done with a streaming approach, with the streaming logic encoded in the algorithm's dispatcher. The dispatcher needs to increment iterators on the host. This is presently not supported by `cccl.c.parallel`, since `indirect_arg_t` does not implement the increment operator.
Since `indirect_arg_t` is used to represent `cccl_value_t`, `cccl_operation_t` and `cccl_iterator_t`, and incrementing only makes sense for iterators, a dedicated type `indirect_iterator_t` must be introduced, which may implement `operator+=`.
If the entirety of the iterator state is user-defined, `cuda.parallel` must provide a host function pointer to increment the iterator's state by compiling the `advance` function for the host.
If we instead define the state as a struct that contains a `size_t linear_id` in addition to the user-defined state, we could get rid of the user-defined `advance` function altogether, but would need to provide access to `linear_id` to the `dereference` function.
Both approaches need to be prototyped and compared.
Describe the solution you'd like
The solution should unblock #3764
Additional context
#3764 (comment)