-
Notifications
You must be signed in to change notification settings - Fork 915
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Large strings support in cudf::merge #15374
Conversation
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Approving CMake changes
total_bytes < threshold, "Size of output exceeds the column size limit", std::overflow_error); | ||
} | ||
if (total_bytes >= get_offset64_threshold()) { | ||
// recompute as int64 offsets when above the threshold |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Once large strings are the default behavior, I wonder if we should do a first pass in int64 and then cast the offsets down if it fits in int32, rather than recomputing as int64 on overflow. I don't know how expensive the downcast would be compared to recomputing. Of course this also depends on whether we expect large strings to be common. I don't expect any action here, but would be open to your thoughts.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I don't really expect large strings to be very common. At least not in the near to mid future.
I'd rather force any performance hit into the int64 path right now if possible.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Looks good, just a single optional suggestion.
/merge |
Description
Enable large strings support in
cudf::merge
.Simplifies the strings specialization to use the gather-based strings factory function which is already optimized for long strings and is now appropriately enabled for large strings.
Also moved source from
include/cudf/strings/detail/merge.cuh
tosrc/strings/merge/merge.cu
file since the template implemenation was not actually required.Checklist