String Transform Examples #18616
base: branch-25.06
…rrr/cudf into column-device-view-refactor
…into string-jit-examples
auto push = [&](cudf::string_view str) {
  auto const size = str.size_bytes();

  if ((it + size) > end) { return; }
Would it be possible to set an "error" flag for this row as a second output, so we could know if the scratch space was insufficient?
I'd rather not suggest error handling. It would require an extra column to hold the errors and I think would get in the way of what we are trying to show here.
I agree with @davidwendt: this example assumes the buffer is sufficient for the entire operation. I think that is ideal, especially for database/Parquet-structured operations where we know the upper bound of memory usage.
Sorry, I wasn’t clear in my question. I agree it is not suitable for this example from a teaching standpoint. I was trying to ask whether it is possible at all, for example whether we support multiple column outputs.
No, it is not possible currently. But if there's a consensus or user requirement for it, I'll be happy to implement it. It should be easy to extend; I had that usage in mind when writing it but wanted to avoid feature creep.
I assume this should also be the same as a struct output?
Is there a way to slice a struct column into its components without copying?
This is perhaps outside the scope of this PR. It would not be something to solve in the examples. I can think of many different ways to solve this with the current API.
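For reference on the scratch-space question earlier in this thread, here is a rough host-side sketch of what a bounded `push` with an overflow flag could look like. It is not code from this PR: `bounded_writer` and `overflowed` are hypothetical names, `std::string_view` stands in for `cudf::string_view`, and how the flag would reach a second output column is left open because that is not supported currently.

```cpp
#include <cstddef>
#include <cstring>
#include <string_view>

// Hypothetical helper, not part of cudf: appends pieces into a fixed-size
// scratch region and records whether any piece had to be dropped.
struct bounded_writer {
  char* it;         // next write position
  char* end;        // one past the last usable byte
  bool overflowed;  // set when a push does not fit

  void push(std::string_view str)
  {
    auto const size = str.size();
    // Same bounds check as the example, but the failure is remembered
    // instead of being silently ignored.
    if (static_cast<std::size_t>(end - it) < size) {
      overflowed = true;
      return;
    }
    std::memcpy(it, str.data(), size);
    it += size;
  }
};
```

A piece that does not fit is skipped, as in the example, so later smaller pieces can still be written; the flag only reports that the row's output is incomplete.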
The example source code loads a CSV file and produces a transformed column from the table using the values from the table.

Four examples are included:
What happened to the other ones? I think it would be good to show an example just returning a fixed-width integer.
I added it back. I was informed it wasn't that important for the release blog, but it's a valuable example anyway.
Co-authored-by: David Wendt <45795991+davidwendt@users.noreply.github.com>
auto alt = cudf::make_column_from_scalar(
  cudf::string_scalar(cudf::string_view{"(unknown)", 9}, true, stream), 1, stream);
This is great. How about a comment here about how a single-row column is treated as a constant scalar for all rows? Something worded better than that perhaps.
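To illustrate the behavior that comment describes, here is a small host-side sketch of a one-row column being read as the same value for every output row. It assumes the broadcast semantics stated in the comment; `input_column` and `fill_empty` are hypothetical stand-ins and not cudf types or functions, and the "replace empty names" rule is only an illustrative use of the `alt` value.

```cpp
#include <cstddef>
#include <string>
#include <vector>

// Hypothetical stand-in for a column input; not a cudf type.
struct input_column {
  std::vector<std::string> values;

  // A one-row column is read as the same value for every output row;
  // any other column is indexed by row as usual.
  std::string const& at(std::size_t row) const
  {
    return values.size() == 1 ? values[0] : values[row];
  }
};

// Illustrative row-wise operation: use the alternative when a name is empty.
std::vector<std::string> fill_empty(input_column const& names,
                                    input_column const& alt,
                                    std::size_t num_rows)
{
  std::vector<std::string> out;
  out.reserve(num_rows);
  for (std::size_t row = 0; row < num_rows; ++row) {
    auto const& name = names.at(row);
    out.push_back(name.empty() ? alt.at(row) : name);
  }
  return out;
}
```

With this shape, the caller builds the constant once as a single-row column and passes it alongside the full-length inputs, which is what the `alt` column above appears to do.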
rmm::device_uvector<char> scratch(maximum_size * num_rows, stream, mr);

auto size =
  cudf::make_column_from_scalar(cudf::numeric_scalar<int32_t>(maximum_size, true, stream, mr), 1);
Suggested change:
- cudf::make_column_from_scalar(cudf::numeric_scalar<int32_t>(maximum_size, true, stream, mr), 1);
+ cudf::make_column_from_scalar(cudf::numeric_scalar<int32_t>(maximum_size, true, stream), 1, stream);
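As a side note on the scratch allocation in this snippet, here is a host-side sketch of how a buffer of `maximum_size * num_rows` bytes might be divided so that each row writes only into its own region. The layout is an assumption inferred from the allocation size, not something confirmed by this PR; `row_slice` and `slice_for_row` are hypothetical names, and `std::vector<char>` stands in for `rmm::device_uvector<char>`.

```cpp
#include <cstddef>
#include <vector>

// Hypothetical view of one row's region of a shared scratch buffer.
struct row_slice {
  char* begin;  // first byte owned by the row
  char* end;    // one past the last byte owned by the row
};

// Assumes row i owns bytes [i * maximum_size, (i + 1) * maximum_size).
row_slice slice_for_row(std::vector<char>& scratch, std::size_t maximum_size, std::size_t row)
{
  char* begin = scratch.data() + row * maximum_size;
  return row_slice{begin, begin + maximum_size};
}
```

Under that assumption the slices never overlap, so rows can be processed independently, and the `(it + size) > end` style of check only needs the row's own `end`.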
Description
Depends on: #18490
Follows up on #18023
Checklist