String Transform Examples #18616

Open
wants to merge 93 commits into
base: branch-25.06
from

Conversation

lamarrr
Contributor

@lamarrr lamarrr commented May 1, 2025

Description

Depends on: #18490
Follows up on #18023

Checklist

  • I am familiar with the Contributing Guidelines.
  • New or existing tests cover these changes.
  • The documentation is up to date with these changes.

lamarrr added 30 commits March 24, 2025 20:38
@github-actions github-actions bot added the Java (Affects Java cuDF API.) label May 15, 2025
@github-actions github-actions bot removed the Java (Affects Java cuDF API.) label May 15, 2025
@lamarrr lamarrr added the feature request (New feature or request) and non-breaking (Non-breaking change) labels May 15, 2025
@lamarrr lamarrr marked this pull request as ready for review May 15, 2025 16:43
@lamarrr lamarrr requested review from a team as code owners May 15, 2025 16:43
@lamarrr lamarrr requested review from vyasr and shrshi May 15, 2025 16:43
auto push = [&](cudf::string_view str) {
  auto const size = str.size_bytes();

  if ((it + size) > end) { return; }
Contributor

Would it be possible to set an "error" flag for this row as a second output, so we could know if the scratch space was insufficient?

Contributor

I'd rather not suggest error handling. It would require an extra column to hold the errors and I think would get in the way of what we are trying to show here.

Contributor Author

I agree with @davidwendt: this example assumes the buffer is sufficient for the entire operation, which I think is ideal, especially for database/Parquet-structured operations where we know the upper bound of memory usage.

Contributor

Sorry, I wasn’t clear in my question. I agree it is not suitable for this example from a teaching standpoint. I was trying to ask if it is possible, like if we have support for multiple column outputs.

Contributor Author

No, it is not possible currently, but if there's a consensus or user requirement for it I'll be happy to implement it. It should be easy to extend; I had that usage in mind when writing it but wanted to avoid feature creep.

I assume this should also be the same as a struct output?
Is there a way to slice a struct column into its components without copying?

Contributor

This is perhaps outside the scope of this PR. It would not be something to solve in the examples. I can think of many different ways to solve this with the current API.
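
For readers following the thread, here is a minimal host-side sketch of the idea being discussed: each row writes into a fixed slot of a shared scratch buffer, and a second per-row output records whether the slot was too small. It is plain C++ rather than libcudf code; `concat_result`, `concat_rows`, and the use of `std::string_view` in place of `cudf::string_view` are assumptions made for illustration, and the thread above confirms that the transform API in this PR does not currently produce such a second output.

```cpp
#include <cassert>
#include <cstddef>
#include <cstring>
#include <string>
#include <string_view>
#include <vector>

// Hypothetical result type: one output string plus one overflow flag per row.
struct concat_result {
  std::vector<std::string> rows;
  std::vector<bool> overflowed;
};

// Concatenates the pieces of each row into that row's slot of a shared scratch
// buffer. Every slot holds at most `maximum_size` bytes.
concat_result concat_rows(std::vector<std::vector<std::string_view>> const& input,
                          std::size_t maximum_size)
{
  assert(maximum_size > 0);

  // Same sizing rule as the example under review: maximum_size bytes per row.
  std::vector<char> scratch(maximum_size * input.size());

  concat_result result;
  for (std::size_t row = 0; row < input.size(); ++row) {
    char* const begin = scratch.data() + row * maximum_size;
    char* const end   = begin + maximum_size;
    char* it          = begin;
    bool overflow     = false;

    // Mirrors the reviewed `push` lambda, but records the failure instead of
    // dropping the piece silently. Later, smaller pieces may still fit.
    auto push = [&](std::string_view str) {
      auto const size = str.size();
      if (size > static_cast<std::size_t>(end - it)) {
        overflow = true;
        return;
      }
      std::memcpy(it, str.data(), size);
      it += size;
    };

    for (auto piece : input[row]) { push(piece); }

    result.rows.emplace_back(begin, it);
    result.overflowed.push_back(overflow);
  }
  return result;
}
```

With `maximum_size` set to 4, a row made of "ab" and "cd" fits exactly, while a row made of "abc" and "defg" keeps only "abc" and has its flag set. In a columnar setting the flags would be the extra output column the reviewers mention.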


The example source code loads a CSV file and produces a transformed column from the table using the values in that table.

Four examples are included:
Contributor

What happened to the other ones? I think it would be good to show an example just returning a fixed-width integer.

Contributor Author

I added it back. I was informed it wasn't that important for the release blog, but it's a valuable example anyway.

@lamarrr lamarrr requested review from davidwendt and bdice May 19, 2025 12:37
lamarrr and others added 3 commits May 19, 2025 13:48
Co-authored-by: David Wendt <45795991+davidwendt@users.noreply.github.com>
Co-authored-by: David Wendt <45795991+davidwendt@users.noreply.github.com>
Comment on lines +49 to +50
auto alt = cudf::make_column_from_scalar(
  cudf::string_scalar(cudf::string_view{"(unknown)", 9}, true, stream), 1, stream);
Contributor

This is great. How about a comment here about how a single-row column is treated as a constant scalar for all rows? Something worded better than that perhaps.
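
The broadcast behavior described in this comment can be illustrated with a small host-side sketch. It is not libcudf code and makes no claim about how the transform is implemented; `column`, `value_at`, and `fill_empty` are hypothetical names, and a `std::vector<std::string>` stands in for a strings column. The only point is the indexing rule: a one-row input is read as the same value for every output row.

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <vector>

// Hypothetical stand-in for a strings column.
using column = std::vector<std::string>;

// Reads row `i`, treating a single-row column as a constant for all rows.
std::string const& value_at(column const& col, std::size_t i)
{
  assert(!col.empty());
  return col.size() == 1 ? col[0] : col[i];
}

// Replaces empty names with the alternative value for that row.
// `alt` may have one row (broadcast) or as many rows as `names`.
column fill_empty(column const& names, column const& alt)
{
  assert(alt.size() == 1 || alt.size() == names.size());

  column out;
  out.reserve(names.size());
  for (std::size_t i = 0; i < names.size(); ++i) {
    out.push_back(names[i].empty() ? value_at(alt, i) : names[i]);
  }
  return out;
}
```

Passing a one-row `alt` holding "(unknown)" fills every empty name with that string, which is the role the one-row column built with `make_column_from_scalar` plays in the snippet above.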

rmm::device_uvector<char> scratch(maximum_size * num_rows, stream, mr);

auto size =
  cudf::make_column_from_scalar(cudf::numeric_scalar<int32_t>(maximum_size, true, stream, mr), 1);
Contributor

Suggested change
-   cudf::make_column_from_scalar(cudf::numeric_scalar<int32_t>(maximum_size, true, stream, mr), 1);
+   cudf::make_column_from_scalar(cudf::numeric_scalar<int32_t>(maximum_size, true, stream), 1, stream);

Labels
CMake: CMake build issue
feature request: New feature or request
libcudf: Affects libcudf (C++/CUDA) code.
non-breaking: Non-breaking change
Projects
Status: Burndown
4 participants