zuku builds on top of the Realm event-based runtime system to provide a futures-based API for expressing data-flow graphs. zuku provides two key features on top of Realm:
- Capturing data effects on arrays and automatically building execution graphs based on how data futures are used
- API for creating multi-GPU tasks (gang scheduling) and multi-GPU data movement (array resharding)
zuku provides a single-controller programming model. The underlying implementation can either rely on control replication (MPI style) to execute in a multi-controller fashion or execute with a true single controller (although true single-controller execution is not yet implemented).
zuku uses a standard CMake build. It must be pointed at a working Realm build or installation using the `-DLegion_ROOT` option.
A recommended build setup pointing to the Realm build tree would be:

```sh
legion=<Legion build directory>
cmake -S . -B build \
  -DLegion_ROOT=$legion \
  -DCMAKE_EXPORT_COMPILE_COMMANDS=ON \
  -DCMAKE_CXX_STANDARD=17
cmake --build build
```
For configuring Realm, a recommended configure would be:

```sh
cmake \
  -S . \
  -B build \
  -GNinja \
  -D Legion_USE_CUDA=ON \
  -D Legion_BUILD_RUST_PROFILER=OFF \
  -D CMAKE_CXX_STANDARD=17 \
  -D CUDA_NVCC_FLAGS="-std=c++17" \
  -D CMAKE_EXPORT_COMPILE_COMMANDS=ON
```
With this configure, `<Legion build directory>` would be `<Legion source directory>/build`.
zuku provides a futures library, built on the Realm runtime, for deferred execution. Rather than directly executing tasks, C++ lambdas are enqueued to run when their dependencies are ready. The return values of a lambda are wrapped in futures and can be passed to later tasks.
There are three main execution abstractions, which are largely inherited from Realm:
- Processor
- Memory
- Ordering tokens
There are three main data abstractions:
- `Future<T>`: an owning value holder for a deferred value
- `View<T>`: a read-only value holder for a deferred value
- `Store<T>`: a future with reference-like semantics for updating values in-place
Each of these objects is move-only and cannot be copied. Views of each object, though, can be created by calling their `.view()` function, which creates a new object holding read-only ownership of the underlying data. An arbitrary number of views can be created. All read-only views must be collected before a `Future` or `Store` is free to modify the data. This is all managed automatically when passing arguments to a `defer` function.
A future is created by a deferred operation:
```cpp
auto [f] = defer([]{
  return 42;
});
```
This `Future<int> f` can be passed to a further deferred task. (See below for the reasons behind the structured-binding syntax.)
```cpp
auto token = defer([](int x){
  std::cerr << "Got " << x << std::endl;
}, std::move(f));
```
Here execution is ordered by implicit data dependencies. The returned token can be used to wait on conditions or provide execution dependencies. Passing `std::move(f)` relinquishes ownership of `f` to the subtask. If the task instead wants to read the data without transferring ownership, a view can be created:
```cpp
auto token = defer([](int x){
  std::cerr << "Got " << x << std::endl;
}, f.view());
```
To wait for all operations to finish, the user can do:

```cpp
token.Wait();
```
The return type of each deferred operation is an ordering token that can enforce execution dependencies. If the execution token is not required, the user can unpack the token to the underlying future objects using structured bindings.
In contrast to `Future`, which acts like `std::unique_ptr` and must transfer ownership where used, `Store` provides something closer to `std::shared_ptr` or an object passed by reference into a function. A `Store` can be modified in-place by passing a child store to a deferred operation.
```cpp
Store<int> a_st = Store<int>::Create();
defer([](int& a){
  a = 42;
}, a_st);
```
The task operates on a reference, which essentially provides an in-place update. `Store` is move-only and cannot be copied; instead, a child store is implicitly created and passed to the deferred operation.
A `View` is a read-only handle for deferred data. A `View` can be created from a `Future`, `Store`, or another `View`.
```cpp
Store<int> a_st = Store<int>::Create();
defer([](const int& a){
  std::cerr << "Got " << a << std::endl;
}, a_st);
```
In contrast to `Store`, the lambda can only take a const reference (or a copy) since the data is read-only. The `defer` operation implicitly creates a subview of the store and passes it to the deferred operation.
The simplest way to start the runtime and execute a program is to use the `zuku::program` function.
```cpp
int main(int argc, char** argv){
  return zuku::program([]{
    ...
  }, argc, argv);
}
```
`zuku::program` takes a lambda as an argument that executes the program.
A full "hello world" program using deferred execution would be:
```cpp
int main(int argc, char** argv){
  return zuku::program([]{
    auto [msg] = defer([]{
      return "Hello World";
    });
    auto token = defer([](std::string name){
      std::cout << name << std::endl;
    }, std::move(msg));
    return token;
  }, argc, argv);
}
```
The `token` must be returned at the end to indicate the last asynchronous event in the program. If `token` is not returned, the program could end immediately without waiting for the subtasks.
zuku is designed to express bulk operations directly rather than having to express parallelism via single-GPU operations.
zuku provides a `DeviceList` that represents multiple processors:

```cpp
DeviceList devices{{.start = 0, .num_devices = 8}};
```
This is the key abstraction for creating sharded arrays or gang-scheduling tasks.
Arrays are created with sharded shapes that define how they should be arranged across devices:
```cpp
ShardedShape shape = {
  .type = zuku::SupportedType::S32,
  .sharding = {
    .dims = {
      { .size = 32, .sharding = 4, .permutation = 0},
      { .size = 64, .sharding = 2, .permutation = 1}
    },
    .devices = devices
  }
};
Store<ShardedArray> a = ShardedArray::Create(shape);
```
This creates an array with a 4x2 tiling across devices with no permutation on the tile indexing.
Tasks are gang-scheduled by specifying which devices to launch across:
```cpp
across(devices).defer([](const ShardedArray& a){
  ...
}, a);
```
The input array sharding is expected to match the devices. This launches a task on each participating device and passes the individual shard to each task.
Given two arrays `a` and `b` with different shardings, the values in `a` can be copied to `b` using:

```cpp
reshard(a, b);
```
Sharding encompasses both the tiling and the devices assigned. A `reshard` can involve a shape change, the same shape on different devices, or a combination of the two.
Zukunft: German for "future".