
feat: Nvidia GPU Direct Storage Support for reading RNTuple #1426


Open · wants to merge 30 commits into main

Conversation


@fstrug fstrug commented Apr 22, 2025

This PR adds support for reading RNTuple data from storage directly into GPU memory via RDMA on GPU Direct Storage (GDS) enabled systems. On systems without GDS, cuFile runs in compatibility mode and reads are performed by the CPU via POSIX. Currently, only RNTuple data compressed with the Zstandard algorithm is supported.

Changes are made to uproot.behaviors.RNTuple.HasFields.arrays() to steer array building via two new optional arguments, backend and use_GDS.
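The steering described above can be sketched as a small dispatch function. This is illustrative only, not uproot's actual implementation: the argument names backend and use_GDS come from this PR, but the bodies and return values are hypothetical.

```python
# Hypothetical sketch of how backend/use_GDS could steer the read path.
# Not the PR's code; only the argument names are taken from the PR.
def arrays(backend="cpu", use_GDS=False):
    if use_GDS and backend != "cuda":
        # GDS reads land in GPU memory, so a non-GPU backend is inconsistent
        raise ValueError("use_GDS=True requires backend='cuda'")
    if use_GDS:
        # GDS path: cuFile/kvikIO reads go directly into GPU buffers
        return "gds"
    # standard path: POSIX reads into host memory, then an optional
    # transfer to the requested backend
    return backend
```

A caller would then select the read path with, e.g., arrays(backend="cuda", use_GDS=True).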

There are known issues with nvcomp's implementation of the Zstandard algorithm that crash nvcomp (and uproot) when decompressing certain buffers. This behavior is non-deterministic and the cause is unknown. Fixes are expected in the next release of nvcomp.

@nsmith- nsmith- requested review from lgray and nsmith- April 22, 2025 15:03
@nsmith- (Member) left a comment

Some initial comments. I will look more closely at the implementation later.

@fstrug fstrug marked this pull request as draft April 24, 2025 15:30
fstrug commented May 14, 2025

Passing all RNTuple reading tests except four.

test_1411_rntuple_physlite_ATLAS.py::test_truth_muon_containers - a bug in awkward array, fixed in scikit-hep/awkward#3507. I had issues building awkward-cpp, so I haven't verified against the awkward main branch that the test passes; I got a missing awkward kernel error at the final assert.
test_1250_rntuple_improvements.py::test_iterate - failing at line 89. Iterate is not yet implemented for GDS, but this also fails with backend = "cpu" and use_GDS = False. The only change was to call HasFields._arrays() directly instead of the high-level HasFields.arrays(), so this seems unrelated to this PR right now.

I am still unsure how we want to solve the following tests.
test_0662_rntuple_stl_containers - CuPy does not support string dtypes. cuDF supports string manipulation, but string support in CuPy is not anticipated soon (cupy/cupy#8698 (comment)). Unsure how we might want to implement this.
test_1223_empty_struct.py::test_invalid_variant() - CuPy does not support dtype = object. Calling ak.to_backend(a, "cuda") works, but I need to find a way to build a.variant with the cuda backend via CuPy arrays and ak.from_buffers().

@fstrug fstrug marked this pull request as ready for review June 10, 2025 17:15
fstrug and others added 6 commits June 10, 2025 20:36
Fixed 
```
error: Failed to parse entry in group `test`: `kvikio-cu12>=25.02.01; platform_system == "Linux" & python_version >= 3.10`
  Caused by: Unexpected character '&', expected 'and', 'or' or end of input
kvikio-cu12>=25.02.01; platform_system == "Linux" & python_version >= 3.10
```
Require numpy < 2.3 due to a bug.
@@ -29,6 +29,7 @@ test = [
"rangehttpserver",
"requests",
"s3fs",
'kvikio-cu12>=25.02.01; platform_system == "Linux" and python_version >= "3.10"',
Collaborator

Maybe it makes sense to add a test-gpu group?
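A separate group might look like the following pyproject.toml fragment. This is a sketch: the group name test-gpu and the table placement are assumptions, while the marker string is the one added by this PR.

```
# Hypothetical pyproject.toml fragment; group name and table are assumed
[project.optional-dependencies]
test-gpu = [
    'kvikio-cu12>=25.02.01; platform_system == "Linux" and python_version >= "3.10"',
]
```

CI could then install it with something like pip install ".[test-gpu]" only on GPU runners.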

Collaborator

yes, then it's clear what is failing (if something is failing :-)

Comment on lines +59 to +68
# https://github.com/root-project/root/blob/6dc4ff848329eaa3ca433985e709b12321098fe2/core/zip/inc/Compression.h#L93-L105
compression_settings_dict = {
-1: "Inherit",
0: "UseGlobal",
1: "ZLIB",
2: "LZMA",
3: "deflate",
4: "LZ4",
5: "zstd",
}
Collaborator

This is already in uproot.const

kZLIB = 1
kLZMA = 2
kOldCompressionAlgo = 3
kLZ4 = 4
kZSTD = 5
kUndefinedCompressionAlgorithm = 6
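Following the reviewer's suggestion, the id-to-name mapping could be derived from these existing constants instead of redefining the numbers. A sketch, with the constant values quoted above inlined here rather than imported from uproot.const:

```python
# Constants as quoted from uproot.const above, inlined for this sketch
kZLIB = 1
kLZMA = 2
kOldCompressionAlgo = 3
kLZ4 = 4
kZSTD = 5
kUndefinedCompressionAlgorithm = 6

# Build the id -> name mapping on top of the existing constants, so the
# numeric values live in exactly one place
compression_algorithm_names = {
    kZLIB: "ZLIB",
    kLZMA: "LZMA",
    kOldCompressionAlgo: "deflate",
    kLZ4: "LZ4",
    kZSTD: "zstd",
}
```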


return Cluster_Contents

def Deserialize_decompressed_content(
Collaborator

We should stick with the convention of snake_case function names



# GDS Helper Dataclasses
class cupy: # to appease the linter
Collaborator

I think there's probably a better alternative

Member

We could use a type alias

CupyArray = Any

which won't offer any type checks but at least hints at what is there
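The suggested alias could look like this. A sketch: CupyArray is the alias proposed in the comment, while the function and its body are illustrative, not code from the PR.

```python
from typing import Any

# No real type checking, but the annotation documents intent without
# importing cupy at module scope (useful when cupy is an optional dep)
CupyArray = Any

def scale(buf: CupyArray, factor: float) -> CupyArray:
    # illustrative elementwise operation on a device array
    return buf * factor
```

This keeps the module importable on machines without CuPy while still hinting at what the parameters are expected to be.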

@ianna (Collaborator) left a comment

@fstrug - nice work! Just a few minor comments - please check. Thanks!

@@ -76,7 +77,7 @@ dependencies = [
"awkward>=2.4.6",
"cramjam>=2.5.0",
"xxhash",
"numpy",
"numpy < 2.3",
Collaborator

@fstrug - no need to pin it now. The CI should pick up the latest awkward release that fixes the issue.


@nsmith- (Member) left a comment

A few code quality comments, no showstoppers regarding functionality as far as I can see

interpretation_executor=None,
filter_branch=unset,
):
"""
Member

Since this is now a private function we can omit the duplicate docstring (to avoid potential accidents later in editing the private one instead of the public one)

)[entry_start:entry_stop]

arrays = uproot.extras.awkward().to_backend(arrays, backend=backend)
Member

Since this is a new line, do we have a test that exercises the cuda backend when not using GDS?

filter_branch=unset,
):
"""
Current GDS support is limited to nvidia GPUs. The python library kvikIO is
Member

As above, this is a private method so the docstring might better be reduced to implementation-specific details

Comment on lines +1045 to +1077
for key in target_cols:
if "column" in key and "union" not in key:
key_nr = int(key.split("-")[1])

dtype_byte = self.ntuple.column_records[key_nr].type
content = content_dict[key_nr]

if "cardinality" in key:
content = cupy.diff(content)

if dtype_byte == uproot.const.rntuple_col_type_to_num_dict["switch"]:
kindex, tags = uproot.models.RNTuple._split_switch_bits(content)
# Find invalid variants and adjust buffers accordingly
invalid = numpy.flatnonzero(tags == -1)
if len(invalid) > 0:
kindex = numpy.delete(kindex, invalid)
tags = numpy.delete(tags, invalid)
invalid -= numpy.arange(len(invalid))
optional_index = numpy.insert(
numpy.arange(len(kindex), dtype=numpy.int64), invalid, -1
)
else:
optional_index = numpy.arange(len(kindex), dtype=numpy.int64)
container_dict[f"{key}-index"] = cupy.array(optional_index)
container_dict[f"{key}-union-index"] = cupy.array(kindex)
container_dict[f"{key}-union-tags"] = cupy.array(tags)
else:
# don't distinguish data and offsets
container_dict[f"{key}-data"] = content
container_dict[f"{key}-offsets"] = content
cluster_offset = cluster_starts[start_cluster_idx]
entry_start -= cluster_offset
entry_stop -= cluster_offset
Member

It looks like a lot of this code is similar to the non-GPU _arrays method implementation. Is there a way to share more code between them rather than duplicating?



@dataclasses.dataclass
class ColBuffers_Cluster:
Member

For the most part, this repo is using CamelCase for types and snake_case for functions, so here also a symbol rename may be in order

import uproot


class Source_CuFile:
Member

CuFileSource

Comment on lines +1726 to +1727
# if self.columns == []:
# self.columns = Cluster.columns
Member

Remove?

Comment on lines +1683 to +1690
key = ColBuffers_Cluster.key
self.columns.append(key)
self.data_dict[key] = ColBuffers_Cluster
self.algorithms[key] = ColBuffers_Cluster.algorithm
if ColBuffers_Cluster.isCompressed:
self.data_dict_comp[key] = ColBuffers_Cluster
else:
self.data_dict_uncomp[key] = ColBuffers_Cluster
Member

If this dataclass has multiple containers all indexed by the same key, it may be a sign that they belong together. It looks like they start out this way with ColBuffers_Cluster having a complete record and this destructures it. Any reason not to just keep a dict[str, ColBuffers_Cluster] member?
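The consolidation suggested here could look roughly like the following. This is a sketch with hypothetical, simplified fields (the real ColBuffers_Cluster also carries the buffers themselves); it shows a single dict keyed by column replacing the parallel columns/data_dict/algorithms/data_dict_comp/data_dict_uncomp members.

```python
import dataclasses

@dataclasses.dataclass
class ColBuffersCluster:
    # simplified stand-in for the PR's ColBuffers_Cluster record
    key: str
    algorithm: str
    is_compressed: bool

@dataclasses.dataclass
class ClusterContents:
    # one container keyed by column, instead of several parallel dicts
    columns: dict = dataclasses.field(default_factory=dict)

    def add(self, col: ColBuffersCluster) -> None:
        self.columns[col.key] = col

    def compressed(self) -> dict:
        # derive the compressed/uncompressed views on demand
        return {k: v for k, v in self.columns.items() if v.is_compressed}
```

Keeping the record whole avoids the dicts drifting out of sync, and the compressed/uncompressed splits can be computed where they are needed.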

else:
self.data_dict_uncomp[key] = ColBuffers_Cluster

def _decompress(self):
Member

It will be good to free any memory occupied by the compressed buffers once they are decompressed
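One way to do this, sketched with plain dicts (a hypothetical helper, not the PR's code): popping each entry drops the last reference to the compressed buffer, so the (GPU) allocator can reclaim it as soon as that buffer is decompressed.

```python
def decompress_and_release(data_dict_comp, decompress):
    """Decompress each buffer and drop the compressed original."""
    out = {}
    for key in list(data_dict_comp):
        # pop removes our reference to the compressed buffer; once
        # decompress() returns, nothing holds it and it can be freed
        out[key] = decompress(data_dict_comp.pop(key))
    return out
```

With CuPy-backed buffers this matters because device memory is scarce; holding both compressed and decompressed copies of every column doubles peak usage.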
