`kvikio::FileHandle` supports multiple APIs for reading and writing data. Of the APIs for reading (writing), `pread` (`pwrite`) supports both host and device data, while `read` and `read_async` (`write` and `write_async`) do not. Meanwhile, both `pread` and `read_async` (`pwrite` and `write_async`) are asynchronous, while `read` (`write`) is synchronous.

Because these APIs mix host and device asynchrony and support dynamically selecting between cuFile and POSIX reads, they also end up mixing host and device approaches to asynchrony. In particular, some APIs return `std::future` instances while others return `StreamFuture`. What's even more confusing is that, depending on a complex interplay between multiple system properties, the values of input parameters, and kvikio configuration, APIs that accept streams may or may not actually be stream-ordered. A more subtle issue is that, in order to support stream ordering, some use cases will trigger stream synchronization, but they will do so on specific host threads. There is still a degree of asynchrony in the form of a future on the host, but at some point the task running on a background thread could implicitly trigger a stream sync.

Two challenges are important to keep in mind. First, simply launching all CUDA APIs on different streams may not be sufficient to saturate the GPU (sometimes for reasons we don't fully understand involving overheads in the CUDA driver API), so we may need to launch CUDA memcpy operations on different threads even when they are also queued on different streams. Second, when we fall back to POSIX I/O we do a POSIX read to host followed by a CUDA memcpy, and there we definitely want to use multiple threads to coordinate the host reads.
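For concreteness, here is roughly what the two flavors of asynchrony look like from a caller's perspective (a sketch; exact signatures and defaults may differ from the current headers):

```cpp
#include <kvikio/file_handle.hpp>

#include <cuda.h>

#include <cstddef>
#include <future>

void demo(void* dev_buf, std::size_t size, CUstream stream)
{
  kvikio::FileHandle f("data.bin", "r");

  // Host-flavored asynchrony: pread accepts host or device buffers and
  // returns a std::future completed by a background thread pool.
  std::future<std::size_t> fut = f.pread(dev_buf, size);
  fut.get();  // host-side synchronization point

  // Device-flavored asynchrony: read_async takes a stream and returns a
  // StreamFuture, but it accepts only device buffers, and whether it is
  // truly stream-ordered depends on which cuFile/POSIX path is taken.
  kvikio::StreamFuture sfut = f.read_async(dev_buf, size, 0, 0, stream);
  sfut.check_bytes_done();  // may synchronize the stream internally
}
```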
I would like for kvikio to be more disciplined and consistent in its approach to asynchrony. Given the importance of having both multiple host threads and multiple streams, I would like to provide an API that exposes both forms of asynchrony but does not do any unnecessary synchronization a priori. I think this could be accomplished by implementing a parallelizable async read that accepts a stream.
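Concretely, the entry point might look something like the following (a hypothetical name and signature, just to fix ideas; nothing here is existing kvikio API):

```cpp
// Hypothetical: stream-ordered with respect to `stream`, internally
// parallelized across a host thread pool, and still returning a future
// so that host-side code can also wait without touching the stream.
std::future<std::size_t> parallel_read_async(void* dev_buf,
                                             std::size_t size,
                                             std::size_t file_offset,
                                             CUstream stream);
```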
- When not using GDS, the API would use a thread pool and a stream per thread, with each stream forked from the input stream by waiting on an event recorded on it. Each thread would do whatever it needed to, including synchronizing its worker stream where necessary to ensure a valid state when interacting with host data produced by a POSIX read. Since each task returns a future, the final step can use `cuLaunchHostFunc` to enqueue onto the input stream an operation that waits on those futures. That way everything remains asynchronous and stream-ordered with respect to the input stream while being suitably multithreaded and pipelined with respect to the parallelization of the host reads and device copies (see the first sketch after this list).
- When GDS is available, we can instead simply queue all the operations using cuFile's `ReadAsync` function and then add event waits to the input stream (see the second sketch after this list). If necessary, we can also use a thread pool to launch the `ReadAsync` calls; determining whether that is needed will require some additional benchmarking.
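A minimal sketch of the non-GDS fork/join pattern, assuming a hypothetical `posix_read_chunk` helper and using the runtime-API `cudaLaunchHostFunc` in place of `cuLaunchHostFunc` (error handling, remainder chunks, and pinned bounce buffers are omitted for brevity):

```cpp
#include <cuda_runtime_api.h>

#include <cstddef>
#include <future>
#include <vector>

// Hypothetical helper: a plain POSIX pread into a host buffer.
void posix_read_chunk(void* dst, std::size_t n, std::size_t file_offset);

// Runs on a CUDA-internal thread; blocks the input stream until every worker
// task (POSIX read + H2D copy) has finished. It makes no CUDA calls, which
// are forbidden inside host functions.
void CUDART_CB join_tasks(void* data)
{
  auto* tasks = static_cast<std::vector<std::future<void>>*>(data);
  for (auto& t : *tasks) { t.wait(); }
  delete tasks;
}

void parallel_read(void* dev_buf, std::size_t size, std::size_t nthreads,
                   cudaStream_t in_stream)
{
  // Fork: each worker stream must wait on work already queued on in_stream.
  cudaEvent_t forked;
  cudaEventCreateWithFlags(&forked, cudaEventDisableTiming);
  cudaEventRecord(forked, in_stream);

  auto* tasks = new std::vector<std::future<void>>;
  std::size_t const chunk = size / nthreads;  // remainder handling omitted
  for (std::size_t i = 0; i < nthreads; ++i) {
    tasks->push_back(std::async(std::launch::async, [=] {
      cudaStream_t s;
      cudaStreamCreateWithFlags(&s, cudaStreamNonBlocking);
      cudaStreamWaitEvent(s, forked, 0);
      // POSIX read into a host bounce buffer (a real implementation would
      // use pinned memory), then an async H2D copy on this worker's stream.
      std::vector<char> host(chunk);
      posix_read_chunk(host.data(), chunk, i * chunk);
      cudaMemcpyAsync(static_cast<char*>(dev_buf) + i * chunk, host.data(),
                      chunk, cudaMemcpyHostToDevice, s);
      // Synchronize so the future's completion means "data is on the device".
      cudaStreamSynchronize(s);
      cudaStreamDestroy(s);
    }));
  }

  // Join: later work queued on in_stream cannot run until all tasks finish.
  cudaLaunchHostFunc(in_stream, join_tasks, tasks);
  // (Cleanup of `forked` omitted; it would be destroyed after the join.)
}
```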
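And a corresponding sketch for the GDS path. `cuFileReadAsync` is the real cuFile stream API, but the chunking and ownership scaffolding around it is assumed (error checks and stream/event cleanup omitted):

```cpp
#include <cufile.h>

#include <cuda.h>

#include <sys/types.h>

#include <cstddef>
#include <vector>

// cuFileReadAsync takes its size/offset parameters by pointer, so they must
// stay alive until the reads complete; this struct owns them on behalf of
// the caller, who keeps it alive until in_stream has been synchronized.
struct GdsReadState {
  std::vector<std::size_t> sizes;
  std::vector<off_t> file_offsets;
  std::vector<off_t> buf_offsets;
  std::vector<ssize_t> bytes_read;
};

void gds_parallel_read(CUfileHandle_t fh, void* dev_buf, std::size_t size,
                       std::size_t nchunks, CUstream in_stream,
                       GdsReadState& state)
{
  std::size_t const chunk = size / nchunks;  // remainder handling omitted
  state.sizes.assign(nchunks, chunk);
  state.file_offsets.resize(nchunks);
  state.buf_offsets.resize(nchunks);
  state.bytes_read.resize(nchunks);

  for (std::size_t i = 0; i < nchunks; ++i) {
    CUstream s;
    cuStreamCreate(&s, CU_STREAM_NON_BLOCKING);
    state.file_offsets[i] = static_cast<off_t>(i * chunk);
    state.buf_offsets[i]  = static_cast<off_t>(i * chunk);
    cuFileReadAsync(fh, dev_buf, &state.sizes[i], &state.file_offsets[i],
                    &state.buf_offsets[i], &state.bytes_read[i], s);

    // Join: in_stream waits on this chunk's read before any later work runs.
    CUevent done;
    cuEventCreate(&done, CU_EVENT_DISABLE_TIMING);
    cuEventRecord(done, s);
    cuStreamWaitEvent(in_stream, done, 0);
  }
}
```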
Once all of that is done, we should consider deprecating, or at least simplifying, some of the existing APIs to reduce this complexity. Ideally we would use strictly stream-ordered operations to handle device synchronization while using `std::future`-based operations for host synchronization.