[FEA]: cccl.c and cuda.parallel should support indirect_iterator_t which can be advanced on both host and device to support streaming algorithms #4148
Is this a duplicate?
Area
cuda.parallel (Python)
Is your feature request related to a problem? Please describe.
To attain optimal performance, kernels for some algorithms must use 32-bit types to store problem-size arguments. Supporting these algorithms for problem sizes in excess of `INT_MAX` can be done with a streaming approach, with the streaming logic encoded in the algorithm's dispatcher. The dispatcher needs to increment iterators on the host. This is presently not supported by `cccl.c.parallel`, since `indirect_arg_t` does not implement an increment operator.

Since `indirect_arg_t` is used to represent `cccl_value_t`, `cccl_operation_t`, and `cccl_iterator_t`, and incrementing only makes sense for iterators, a dedicated type `indirect_iterator_t` must be introduced, which may implement `operator+=`.

If the entirety of the iterator state is user-defined, `cuda.parallel` must provide a host function pointer that increments the iterator's state, obtained by compiling the user's `advance` function for the host (see the first sketch below). Alternatively, if we define the state as a struct that contains a `size_t linear_id` in addition to the user-defined state, we could drop the user-defined `advance` function altogether, but would need to provide access to `linear_id` to the `dereference` function (see the second sketch below).

Both approaches need to be prototyped and compared.
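A minimal sketch of the first option, assuming a hypothetical `host_advance_fn` pointer produced by compiling the user's `advance` function for the host; the names and struct layout below are illustrative, not the existing `cccl.c.parallel` definitions:

```cpp
#include <cstdint>

// Assumed signature for the host-compiled user `advance`: it mutates the
// opaque, user-defined iterator state in place by `offset` elements.
using host_advance_fn = void (*)(void* state, uint64_t offset);

struct indirect_iterator_t
{
  void* state;                  // pointer to the user-defined iterator state
  host_advance_fn host_advance; // host function pointer supplied by cuda.parallel

  // Host-side increment used by the streaming dispatcher between partitions.
  indirect_iterator_t& operator+=(uint64_t offset)
  {
    host_advance(state, offset);
    return *this;
  }
};
```

The device side would keep using the user's `advance`/`dereference` as today; only the extra host-compiled `advance` entry point is new.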
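And a sketch of the second option, where the iterator state carries a `linear_id` so the host never needs a user-provided `advance`; again, the state layout and the `dereference` contract are assumptions for illustration:

```cpp
#include <cstddef>

struct indirect_iterator_t
{
  // The state now always begins with a linear_id that the host can advance
  // with plain integer arithmetic; the user-defined state follows it.
  struct state_t
  {
    size_t linear_id;
    void* user_state; // placeholder for the user-defined portion of the state
  };

  state_t state;

  // No host-compiled `advance` is required: operator+= just bumps linear_id.
  indirect_iterator_t& operator+=(size_t offset)
  {
    state.linear_id += offset;
    return *this;
  }
};

// The user's device-side dereference would then need to receive linear_id,
// e.g. dereference(linear_id, user_state) instead of dereference(user_state).
```

This removes the host compilation step but changes the `dereference` contract, which is why both approaches need to be prototyped and compared.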
Describe the solution you'd like
The solution should unblock #3764.
Additional context
#3764 (comment)