
Add support for large string columns to Parquet reader and writer #15632

Merged · 22 commits into rapidsai:branch-24.06 · May 3, 2024

Conversation

@etseidl (Contributor) commented on May 1, 2024:

Description

Part of #13733.

Adds support for reading and writing cuDF string columns whose string data exceeds 2GB. This is accomplished by skipping the final offsets calculation in the string decoding kernel when the 2GB threshold is exceeded and instead using cudf::strings::detail::make_offsets_child_column(). This could add overhead when there are many columns (see #13024), so it will need more benchmarking. But if many columns exceed the 2GB limit, reads will likely have to be chunked anyway to stay within the memory budget.
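For illustration, a minimal sketch of the fallback path described above (sizes_begin, sizes_end, stream, and mr are assumed names for this example, not the PR's actual variables):

    // Sketch only: when the decoded character bytes exceed the large-strings
    // threshold, build the offsets child on the host side from the per-row
    // sizes instead of finalizing offsets inside the decode kernel.
    auto [offsets, total_bytes] = cudf::strings::detail::make_offsets_child_column(
      sizes_begin, sizes_end, stream, mr);
    // `offsets` becomes the strings column's offsets child; when the total
    // exceeds the threshold, the offsets are stored as 64-bit values.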

Checklist

  • I am familiar with the Contributing Guidelines.
  • New or existing tests cover these changes.
  • The documentation is up to date with these changes.

copy-pr-bot (bot) commented on May 1, 2024:

This pull request requires additional validation before any workflows can run on NVIDIA's runners.

Pull request vetters can view their responsibilities here.

Contributors can view more details about this message here.

@github-actions bot added the libcudf (Affects libcudf (C++/CUDA) code) and CMake (CMake build issue) labels on May 1, 2024.
@etseidl (Contributor, Author) commented on May 1, 2024:

cc @vuule @mhaseeb123 thoughts on how to test?

@davidwendt added the improvement (Improvement / enhancement to an existing function) and non-breaking (Non-breaking change) labels on May 1, 2024.
@davidwendt (Contributor) commented:

/ok to test

@vuule (Contributor) commented on May 1, 2024:

> cc @vuule @mhaseeb123 thoughts on how to test?

We can round trip a single string column with >2B characters. Maybe throw in a smaller string column in the same table just to make sure they can coexist.
Maybe I'm missing something with this simple answer. Do we need to test different encoding types?
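For concreteness, a rough sketch of such a round-trip test (row count, string lengths, and file name are illustrative assumptions, not the PR's actual test code):

    #include <cudf_test/column_wrapper.hpp>
    #include <cudf_test/table_utilities.hpp>
    #include <cudf/io/parquet.hpp>
    #include <string>
    #include <vector>

    // Illustrative only: one strings column whose character data exceeds the
    // large-strings threshold, plus a small strings column alongside it.
    constexpr int num_rows = 1'000'000;
    std::vector<std::string> big(num_rows, std::string(3000, 'x'));  // ~3GB of chars
    std::vector<std::string> small(num_rows, "abc");
    cudf::test::strings_column_wrapper large_col(big.begin(), big.end());
    cudf::test::strings_column_wrapper small_col(small.begin(), small.end());
    auto expected = cudf::table_view({large_col, small_col});

    // Round trip through Parquet and compare.
    auto const filepath = std::string{"large_strings.parquet"};
    auto out_opts = cudf::io::parquet_writer_options::builder(
                      cudf::io::sink_info{filepath}, expected).build();
    cudf::io::write_parquet(out_opts);
    auto in_opts = cudf::io::parquet_reader_options::builder(
                     cudf::io::source_info{filepath}).build();
    auto result = cudf::io::read_parquet(in_opts);
    CUDF_TEST_EXPECT_TABLES_EQUAL(expected, result.tbl->view());

With the threshold environment variable lowered, as discussed below, the same shape would work with far less data.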

@etseidl (Contributor, Author) commented on May 1, 2024:

> We can round trip a single string column with >2B characters. Maybe throw in a smaller string column in the same table just to make sure they can coexist. Maybe I'm missing something with this simple answer. Do we need to test different encoding types?

I was wondering about setting the env var to turn on large strings support, but I see there are already methods in the test suite to enable and disable large strings. Maybe I'll also change the threshold so we don't have to write 2GB of data per test (and, yes, we should test the strings kernel, delta byte array, and delta length byte array).

Review comment on the new test code:

    expected_metadata.column_metadata[2].set_encoding(cudf::io::column_encoding::DELTA_BYTE_ARRAY);

    // set smaller threshold to reduce file size and execution time
    setenv("LIBCUDF_LARGE_STRINGS_THRESHOLD", std::to_string(threshold).c_str(), 1);
@etseidl (Contributor, Author) commented:

candidate for inclusion in StringsLargeTest?

A contributor replied:

That is a good idea. I think I'd want it to behave like the CUDF_TEST_ENABLE_LARGE_STRINGS() macro, which automatically unsets the environment variable at the end of the scope. I can do this in a follow-on PR to keep this one more focused.
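For reference, one possible shape for such a scope-based helper (a sketch only; the struct name is hypothetical and the follow-on PR may implement it differently):

    #include <cstdlib>
    #include <string>

    // Sketch of a scoped guard: sets the threshold on construction and unsets
    // the environment variable when the scope ends, mirroring the cleanup
    // behavior of CUDF_TEST_ENABLE_LARGE_STRINGS().
    struct large_strings_threshold_guard {
      explicit large_strings_threshold_guard(std::size_t threshold)
      {
        setenv("LIBCUDF_LARGE_STRINGS_THRESHOLD", std::to_string(threshold).c_str(), 1);
      }
      ~large_strings_threshold_guard() { unsetenv("LIBCUDF_LARGE_STRINGS_THRESHOLD"); }
    };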

@davidwendt (Contributor) commented:

/ok to test

@etseidl marked this pull request as ready for review on May 2, 2024.
@etseidl requested review from a team as code owners.
@etseidl requested review from @bdice and @davidwendt.
Comment on lines 282 to 283:

    return cudf::strings::detail::get_offset_value(scol.offsets(), column.size(), stream) -
           cudf::strings::detail::get_offset_value(scol.offsets(), 0, stream);
A contributor commented:

Is it possible that the input column could have been sliced? If so, then this would be more correct.

Suggested change:

    - return cudf::strings::detail::get_offset_value(scol.offsets(), column.size(), stream) -
    -        cudf::strings::detail::get_offset_value(scol.offsets(), 0, stream);
    + return cudf::strings::detail::get_offset_value(scol.offsets(), column.size() + column.offset(), stream) -
    +        cudf::strings::detail::get_offset_value(scol.offsets(), column.offset(), stream);

Note that if the column has not been sliced, then column.offset() == 0.

@etseidl (Contributor, Author) replied:

Yes, IIRC it was originally written that way due to a concern @vuule had about sliced columns.

A contributor commented:

Doesn't get_offset_value already adjust for the column offset? AFAICT, it's using get_value, which uses data(), which is implemented as head<T>() + _offset.

A contributor replied:

No. get_offset_value() takes an offsets column, which does not include its parent's sliced offset/size.

@etseidl (Contributor, Author) replied:

Ah, thank you for clearing that up.
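To make the slicing behavior discussed above concrete, a small illustration (not code from the PR; the values are chosen for the example):

    #include <cudf_test/column_wrapper.hpp>
    #include <cudf/copying.hpp>

    // Slicing a strings column adjusts the parent's offset/size but leaves
    // the offsets child spanning the full, unsliced range.
    cudf::test::strings_column_wrapper full({"aa", "bbb", "c", "dddd"});
    auto piece = cudf::slice(full, {1, 3}).front();  // rows {"bbb", "c"}
    // piece.offset() == 1 and piece.size() == 2, but the offsets child still
    // holds all five entries {0, 2, 5, 6, 10}. The sliced char count must be
    // computed as offsets[offset + size] - offsets[offset] = 6 - 2 = 4 bytes,
    // which is why the suggestion above adds column.offset() to both indices.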

    @@ -1076,7 +1076,7 @@ CUDF_KERNEL void __launch_bounds__(decode_block_size)
         __syncwarp();
       } else if (use_char_ll) {
         __shared__ __align__(8) uint8_t const* pointers[warp_size];
    -    __shared__ __align__(4) size_type offsets[warp_size];
    +    __shared__ __align__(4) size_t offsets[warp_size];
A contributor commented:

Suggested change:

    - __shared__ __align__(4) size_t offsets[warp_size];
    + __shared__ __align__(8) size_t offsets[warp_size];

Are these align declarators needed?

@etseidl (Contributor, Author) replied:

tbh, I started using the __align__ as a monkey-see-monkey-do kind of thing 😅. I don't know if they're actually necessary at this point.

A contributor replied:

I suspect this is the only one that may have a purpose:
https://github.com/rapidsai/cudf/pull/15632/files#diff-52e09ddca44181e11af56d8526360207906f5f25ba888cf51efbd2c1b15d775cR957

    __shared__ __align__(16) page_state_s state_g;

and only because page_state_s is a structure. It should probably have been declared with alignas(16) instead.

@etseidl (Contributor, Author) replied:

I'll remove them since they're unnecessary.
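As an aside, a minimal CUDA illustration of the alignas(16) point above (the member list is a placeholder, not the real page_state_s, and decode_kernel is a hypothetical name):

    // Putting the alignment on the type means every declaration of the
    // struct, including shared-memory instances, gets 16-byte alignment
    // without a per-declaration __align__ qualifier.
    struct alignas(16) page_state_s {
      int dummy;  // placeholder; the real struct has many fields
    };

    __global__ void decode_kernel()
    {
      __shared__ page_state_s state_g;  // already 16-byte aligned via the type
      // ... kernel body elided ...
    }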

Resolved (outdated) review threads on:
  • cpp/src/io/parquet/page_string_decode.cu
  • cpp/src/io/parquet/parquet_gpu.hpp
  • cpp/src/io/parquet/page_delta_decode.cu
@bdice (Contributor) left a comment:

I have no further comments at this time, so I'll approve, but I know there are some unresolved questions for which I don't have great answers (e.g., whether we need the alignment, the large strings testing strategy, etc.). I think this PR will benefit from a close look from @vuule and/or @davidwendt.

@bdice (Contributor) commented on May 2, 2024:

/ok to test

@mhaseeb123 (Member) commented:

> cc @vuule @mhaseeb123 thoughts on how to test?

Sorry for the delayed response, but I would say we can test this with a table containing only large string column(s) and a table with mixed columns, round-tripped with some encoding and compression. To keep testing time small, I think we shouldn't test with all (or many) encodings and compressions unless it's functionality-critical; instead, that could be made an example or a benchmark.

@mhaseeb123 (Member) left a comment:

The changes look good to me. Thanks for the effort Ed!

@vuule (Contributor) left a comment:

Amazing work!
Also, the (small) size of this PR shows how @davidwendt's utilities make a great foundation for large strings support.

@vuule (Contributor) commented on May 2, 2024:

/ok to test

@davidwendt (Contributor) commented:

/merge

@rapids-bot merged commit b8503bc into rapidsai:branch-24.06 on May 3, 2024 (70 checks passed).
@etseidl deleted the large_strings branch on May 3, 2024.
Labels: CMake (CMake build issue), improvement (Improvement / enhancement to an existing function), libcudf (Affects libcudf (C++/CUDA) code), non-breaking (Non-breaking change)
Projects: None yet
5 participants