Add support for large string columns to Parquet reader and writer #15632
Conversation
cc @vuule @mhaseeb123 thoughts on how to test?

/ok to test

We can round trip a single string column with >2B characters. Maybe throw in a smaller string column in the same table just to make sure they can coexist.

I was wondering about setting the env var to turn on large strings support, but I see there are already methods in the test suite to enable and disable large strings. Maybe I'll also change the threshold so we don't have to write 2GB of data per test (and, yes, we should test the strings kernel, delta byte array, and delta length byte array).
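Roughly, the round-trip shape being discussed might look like the sketch below; the threshold value and column contents are illustrative, and the real test likely uses the suite's large-strings fixtures rather than raw setenv calls:

```cpp
// Hedged sketch of a large-strings round-trip test (not the PR's actual test).
#include <cudf_test/column_wrapper.hpp>
#include <cudf_test/table_utilities.hpp>
#include <cudf/io/parquet.hpp>
#include <cudf/table/table_view.hpp>
#include <cstdlib>
#include <string>
#include <vector>

void round_trip_large_strings(std::string const& filepath)
{
  // lower the threshold so the test does not need to write 2GB of data
  setenv("LIBCUDF_LARGE_STRINGS_THRESHOLD", "1048576", 1);  // 1 MiB

  // one column whose total character count exceeds the threshold, plus a
  // smaller strings column to check that the two paths coexist in one table
  std::vector<std::string> big(4096, std::string(512, 'x'));  // ~2 MiB of chars
  std::vector<std::string> tiny(4096, "abc");
  auto large = cudf::test::strings_column_wrapper(big.begin(), big.end());
  auto small = cudf::test::strings_column_wrapper(tiny.begin(), tiny.end());
  auto tbl   = cudf::table_view({large, small});

  auto out_opts =
    cudf::io::parquet_writer_options::builder(cudf::io::sink_info{filepath}, tbl).build();
  cudf::io::write_parquet(out_opts);

  auto in_opts =
    cudf::io::parquet_reader_options::builder(cudf::io::source_info{filepath}).build();
  auto result = cudf::io::read_parquet(in_opts);

  CUDF_TEST_EXPECT_TABLES_EQUAL(tbl, result.tbl->view());
  unsetenv("LIBCUDF_LARGE_STRINGS_THRESHOLD");
}
```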
```cpp
expected_metadata.column_metadata[2].set_encoding(cudf::io::column_encoding::DELTA_BYTE_ARRAY);

// set smaller threshold to reduce file size and execution time
setenv("LIBCUDF_LARGE_STRINGS_THRESHOLD", std::to_string(threshold).c_str(), 1);
```
Candidate for inclusion in StringsLargeTest?
That is a good idea.
I think I'd want it to behave like the CUDF_TEST_ENABLE_LARGE_STRINGS() macro, which automatically unsets the environment variable at the end of the scope. I can do this in a follow-on PR to keep this one more focused.
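For illustration, a minimal sketch of such a scope guard, assuming POSIX setenv/unsetenv; the scoped_env_var name and interface here are hypothetical, not cudf's actual test utility:

```cpp
// Minimal sketch of a guard that unsets the variable on scope exit.
#include <cstdlib>
#include <string>
#include <utility>

class scoped_env_var {
 public:
  scoped_env_var(std::string name, std::string const& value) : _name{std::move(name)}
  {
    setenv(_name.c_str(), value.c_str(), 1);  // 1 = overwrite any existing value
  }
  ~scoped_env_var() { unsetenv(_name.c_str()); }  // automatic cleanup

  scoped_env_var(scoped_env_var const&)            = delete;
  scoped_env_var& operator=(scoped_env_var const&) = delete;

 private:
  std::string _name;
};

// Usage inside a test body: the threshold is unset when the guard goes out of scope.
// scoped_env_var guard{"LIBCUDF_LARGE_STRINGS_THRESHOLD", std::to_string(threshold)};
```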
Co-authored-by: David Wendt <[email protected]>
/ok to test
cpp/src/io/parquet/writer_impl.cu (Outdated)

```cpp
return cudf::strings::detail::get_offset_value(scol.offsets(), column.size(), stream) -
       cudf::strings::detail::get_offset_value(scol.offsets(), 0, stream);
```
Is it possible that the input column could have been sliced? If so, then this would be more correct:

```cpp
return cudf::strings::detail::get_offset_value(scol.offsets(), column.size() + column.offset(), stream) -
       cudf::strings::detail::get_offset_value(scol.offsets(), column.offset(), stream);
```
Note that if the column has not been sliced then column.offset() == 0.
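To see why the column.offset() adjustment matters, here is a toy illustration; cudf::slice and strings_column_view are real APIs, but the column contents and numbers are made up:

```cpp
#include <cudf/copying.hpp>                      // cudf::slice
#include <cudf/strings/strings_column_view.hpp>  // cudf::strings_column_view

// parent strings column: {"aa", "bbb", "c", "dddd"} -> offsets {0, 2, 5, 6, 10}
// (assume `parent` is a cudf::column_view over that column)
auto sliced = cudf::slice(parent, {1, 3}).front();  // view of {"bbb", "c"}
auto scol   = cudf::strings_column_view(sliced);    // sliced.offset() == 1

// scol.offsets() still spans the unsliced parent, so the slice's byte count is
//   offsets[offset() + size()] - offsets[offset()] = 6 - 2 = 4,
// while the unadjusted offsets[size()] - offsets[0] = 5 - 0 = 5 would be wrong.
```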
Yes, IIRC it was originally written that way due to a concern @vuule had about sliced columns.
Doesn't get_offset_value already adjust for the column offset? AFAICT, it's using get_value, which uses data(), which is implemented as head<T>() + _offset.
No. get_offset_value() takes an offsets column, which does not reflect its parent's sliced offset/size.
Ah, thank you for clearing that up.
Co-authored-by: David Wendt <[email protected]>
```diff
@@ -1076,7 +1076,7 @@ CUDF_KERNEL void __launch_bounds__(decode_block_size)
     __syncwarp();
   } else if (use_char_ll) {
     __shared__ __align__(8) uint8_t const* pointers[warp_size];
-    __shared__ __align__(4) size_type offsets[warp_size];
+    __shared__ __align__(4) size_t offsets[warp_size];
```
Suggested change:

```diff
-    __shared__ __align__(4) size_t offsets[warp_size];
+    __shared__ __align__(8) size_t offsets[warp_size];
```

Are these align declarators needed?
tbh, I started using the __align__ as a monkey-see-monkey-do kind of thing 😅. I don't know if they're actually necessary at this point.
I suspect this one is the only one that may have a purpose:
https://github.com/rapidsai/cudf/pull/15632/files#diff-52e09ddca44181e11af56d8526360207906f5f25ba888cf51efbd2c1b15d775cR957

__shared__ __align__(16) page_state_s state_g;

Only because page_state_s is a structure, though it should probably have been declared with alignas(16) instead.
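A minimal sketch of the alignas(16) alternative mentioned above; the struct members and kernel name are placeholders, not the actual page_state_s or decode kernel:

```cpp
// Carry the alignment on the type itself so each __shared__ declaration
// no longer needs a per-variable __align__ qualifier.
struct alignas(16) page_state_s {
  int values[32];  // placeholder decoder state fields
};

__global__ void decode_kernel()
{
  __shared__ page_state_s state_g;  // 16-byte aligned via the type's alignas
  // ... decoding work ...
}
```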
I'll remove them since they're unnecessary.
Co-authored-by: Bradley Dice <[email protected]>
Co-authored-by: Bradley Dice <[email protected]>
I have no further comments at this time so I'll approve, but I know there are some unresolved questions for which I don't have great answers (e.g. whether we need alignment, large strings testing strategy, etc.). I think this PR will benefit from close eyes from @vuule and/or @davidwendt.
/ok to test
Sorry for the delayed response, but I would say we can test this with a table with only large string column(s) and a table with mixed columns, round-tripped with some encoding and compression. To keep testing time small, I think we shouldn't test with all/many encodings or compressions unless it's functionality-critical; instead, that could be made an example or benchmark.
The changes look good to me. Thanks for the effort Ed!
Amazing work!
Also, the (small) size of this PR shows how @davidwendt's utilities make a great foundation for large strings support.
/ok to test

/merge
Description
Part of #13733.
Adds support for reading and writing cuDF string columns where the string data exceeds 2GB. This is accomplished by skipping the final offsets calculation in the string decoding kernel when the 2GB threshold is exceeded, and instead using cudf::strings::detail::make_offsets_child_column(). This could lead to increased overhead with many columns (see #13024), so this will need some more benchmarking. But if there are many columns that exceed the 2GB limit, it's likely reads will have to be chunked to stay within the memory budget.
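For orientation, a rough sketch of where that fallback fits; the iterator names are placeholders, and the exact signature of make_offsets_child_column is recalled from cudf's detail headers, so it may not match the PR precisely:

```cpp
// Rough sketch: build the offsets child from per-row byte sizes instead of
// finalizing them inside the decode kernel. d_sizes_begin/d_sizes_end are
// placeholder iterators over each row's decoded size in bytes.
#include <cudf/strings/detail/strings_children.cuh>

auto [offsets, total_bytes] = cudf::strings::detail::make_offsets_child_column(
  d_sizes_begin, d_sizes_end, stream, mr);
// As I recall, when total_bytes exceeds the large-strings threshold the
// returned offsets column uses 64-bit offsets; otherwise 32-bit as before.
```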