Skip to content

[FEA] Add a method to cudf::table and cudf::column to get its size in bytes without kernel launches or d -> h memcpy #18462

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Closed
devavret opened this issue Apr 9, 2025 · 11 comments · Fixed by #18639
Assignees
Labels
feature request New feature or request libcudf Affects libcudf (C++/CUDA) code.

Comments

@devavret
Copy link
Contributor

devavret commented Apr 9, 2025

Is your feature request related to a problem? Please describe.
Currently, there is no clean way to get the size in bytes of all buffers owned by a const cudf::table or a cudf::column without requiring estimation and d->h copies.
We can use the following APIs which can also work with a cudf::table_view:

  • cudf::bitmask_allocation_size_bytes to estimate the nullmask buffer size.
  • cudf::strings_column_view::chars_size for char size of any string columns. However this needs to do a d->h copy to get a single element

Describe the solution you'd like
The desired API should be able to return the size in bytes value by summing the sizes of all device_buffers owned by all constituent columns of a cudf::table

Describe alternatives you've considered
Currently, we're able to workaround this by disassembling, inspecting, and reassembling the cudf::table and cudf::column like so:

std::pair<uint64_t, std::unique_ptr<cudf::column>> getColumnSize(
    std::unique_ptr<cudf::column> column) {
  // Store column metadata (type, null count, and size) before releasing it,
  // as the release() operation transfers ownership of the underlying buffers
  // and invalidates access to these properties.
  auto type = column->type();
  auto nullCount = column->null_count();
  auto size = column->size();

  auto contents = column->release();
  auto bytes = contents.data->size() + contents.null_mask->size();

  // Recursively get the size of the children columns.
  std::vector<std::unique_ptr<cudf::column>> children;
  for (auto& child : contents.children) {
    auto [childBytes, childColumn] = getColumnSize(std::move(child));
    bytes += childBytes;
    children.push_back(std::move(childColumn));
  }

  // Reassemble the column with the original metadata.
  auto reconstitutedColumn = std::make_unique<cudf::column>(
      type,
      size,
      std::move(*contents.data.release()),
      std::move(*contents.null_mask.release()),
      nullCount,
      std::move(children));

  return std::make_pair(bytes, std::move(reconstitutedColumn));
}

std::pair<uint64_t, std::unique_ptr<cudf::table>> getTableSize(
    std::unique_ptr<cudf::table>&& table) {
  auto columns = table->release();
  std::vector<std::unique_ptr<cudf::column>> columnsOut;
  uint64_t totalBytes = 0;

  for (auto& column : columns) {
    auto [bytes, columnOut] = getColumnSize(std::move(column));
    totalBytes += bytes;
    columnsOut.push_back(std::move(columnOut));
  }
  return std::make_pair(
      totalBytes, std::make_unique<cudf::table>(std::move(columnsOut)));
}
@devavret devavret added feature request New feature or request libcudf Affects libcudf (C++/CUDA) code. labels Apr 9, 2025
@davidwendt
Copy link
Contributor

Why is it important to not to perform device-to-host copies?

@devavret
Copy link
Contributor Author

devavret commented Apr 9, 2025

Why is it important to not to perform device-to-host copies?

For the velox-cudf use case, this will be called a lot, between almost every two other cudf actual compute calls. So it's more of a why do a copy when the information is readily available on host?

@Matt711 Matt711 added this to libcudf Apr 9, 2025
@davidwendt
Copy link
Contributor

For a large number of columns this could take awhile to add up on the host.
This is an unusual interface I suppose since we generally do not take cudf::column or cudf::table parameters as input but rather cudf::column_view or cudf::table_view which may or may not be backed by an actual cudf::column or cudf::table and also may be sliced.

@devavret
Copy link
Contributor Author

devavret commented Apr 9, 2025

For a large number of columns this could take awhile to add up on the host.

Yes, but so would adding up using the tools we already have. The scale of sum operations we'd need to do would be the same.
Come to think of it, this would be especially helpful for wide tables, for which we would need to do a d2h copy for each string column.

This is an unusual interface I suppose since we generally do not take cudf::column or cudf::table parameters as input but rather cudf::column_view or cudf::table_view which may or may not be backed by an actual cudf::column or cudf::table and also may be sliced.

It is indeed, but it's a different use case. We don't need this information to pass to other libcudf APIs, which are designed to work with sliced tables and views of tables on existing data pointers. We need this to be able to track the size of the output tables returned from cudf calls, for the engine's bookkeeping and potentially memory management purpose.

@davidwendt
Copy link
Contributor

I believe cudf::bit_row_count was written for this kind of thing and should help minimize the device-to-host copying.
https://docs.rapids.ai/api/libcudf/stable/group__transformation__transform.html#ga0fe5d7898ef1ceef3aaf9405611700ff

The end result is a column of integers which you could use cudf::reduce to get the final result in bytes with only one device-to-host copy I think.

@devavret
Copy link
Contributor Author

IIUC cudf::row_bit_count has several kernel calls in it. And then it needs one more kernel call with cudf::reduce.

@lamarrr
Copy link
Contributor

lamarrr commented Apr 30, 2025

I agree that adding APIs to get the allocated byte size of each column and its children should be implemented, and the information should be readily available on the host without performing memory copies.
For the string-sizes, It should just be a single copy of the run-end of the last string element? The string characters are contiguous and the indices are run-end encoded.

@davidwendt
Copy link
Contributor

I think we are talking about something like adding a size_bytes() member function to cudf::table and cudf::column that simply accumulates the sizes from the internal rmm::device_buffers that they own (data and bitmask). This would be host-only code and similar to the example code in the description. The function would need to include disclaimers about about possible padding, etc.
It makes me a bit nervous to support this forever but advertising it as an estimate (in the doyxgen) may be acceptable.

@lamarrr
Copy link
Contributor

lamarrr commented Apr 30, 2025

I think we are talking about something like adding a size_bytes() member function to cudf::table and cudf::column that simply accumulates the sizes from the internal rmm::device_buffers that they own (data and bitmask).

Yeah, I'm talking about that as well.

This would be host-only code and similar to the example code in the description. The function would need to include disclaimers about about possible padding, etc. It makes me a bit nervous to support this forever but advertising it as an estimate (in the doyxgen) may be acceptable.

The result would be fragile and can become out-of-sync, but if it's specified as an estimate (minimum), that would be sufficient.

@devavret
Copy link
Contributor Author

devavret commented May 1, 2025

I think we are talking about something like adding a size_bytes() member function to cudf::table and cudf::column that simply accumulates the sizes from the internal rmm::device_buffers that they own (data and bitmask).

That is exactly what we want as well.

The function would need to include disclaimers about about possible padding, etc.

That's actually the desired behavior because when managing memory, we'd need to know how much memory a table occupies, padding included. I thought of using rmm in some way to get this information by checking the usage before and after table creation but the situation was complicated by the multi-threaded nature of the engine. There's also an rmm resource that can track allocations using custom counters but it's advertised as a debugging tool with performance implications.

@devavret
Copy link
Contributor Author

devavret commented May 1, 2025

For the string-sizes, It should just be a single copy of the run-end of the last string element? The string characters are contiguous and the indices are run-end encoded.

@lamarrr That's also avoidable because we can get that information from the rmm::device_buffer that holds the chars

@davidwendt davidwendt self-assigned this May 2, 2025
rapids-bot bot pushed a commit that referenced this issue May 7, 2025
Adds a new member function to the `cudf::table` and `cudf::column` classes to report the device memory allocation size of the internal device-buffers of itself and its children.
The value will including padding and may not be computable from the logical column type and the number of elements and null-count.

Closes #18462

Authors:
  - David Wendt (https://github.com/davidwendt)

Approvers:
  - Shruti Shivakumar (https://github.com/shrshi)
  - Vukasin Milovanovic (https://github.com/vuule)
  - Bradley Dice (https://github.com/bdice)

URL: #18639
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
feature request New feature or request libcudf Affects libcudf (C++/CUDA) code.
Projects
Status: No status
Development

Successfully merging a pull request may close this issue.

3 participants