
Add support for large string columns to Parquet reader and writer #15632

Merged · 22 commits into rapidsai:branch-24.06 · May 3, 2024

Conversation

@etseidl (Contributor) commented on May 1, 2024:

Description

Part of #13733.

Adds support for reading and writing cuDF string columns whose string data exceeds 2GB. This is accomplished by skipping the final offsets calculation in the string decoding kernel when the 2GB threshold is exceeded and instead using cudf::strings::detail::make_offsets_child_column(). This could add overhead when there are many columns (see #13024), so it will need more benchmarking. But if many columns exceed the 2GB limit, reads will likely have to be chunked anyway to stay within the memory budget.
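For illustration, a minimal sketch of the fallback path described above (sizes_begin, sizes_end, stream, and mr are assumed names for this example, not the PR's actual variables):

    // Sketch only: when the decoded character bytes exceed the large-strings
    // threshold, build the offsets child on the host side from the per-row
    // sizes instead of finalizing offsets inside the decode kernel.
    auto [offsets, total_bytes] = cudf::strings::detail::make_offsets_child_column(
      sizes_begin, sizes_end, stream, mr);
    // `offsets` becomes the strings column's offsets child; when the total
    // exceeds the threshold, the offsets are stored as 64-bit values.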

Checklist

  • I am familiar with the Contributing Guidelines.
  • New or existing tests cover these changes.
  • The documentation is up to date with these changes.

copy-pr-bot (bot) commented on May 1, 2024:

This pull request requires additional validation before any workflows can run on NVIDIA's runners.

Pull request vetters can view their responsibilities here.

Contributors can view more details about this message here.

@github-actions bot added the libcudf (Affects libcudf (C++/CUDA) code) and CMake (CMake build issue) labels on May 1, 2024.
@etseidl (Contributor, Author) commented on May 1, 2024:

cc @vuule @mhaseeb123 thoughts on how to test?

@davidwendt added the improvement (Improvement / enhancement to an existing function) and non-breaking (Non-breaking change) labels on May 1, 2024.
@davidwendt (Contributor) commented:

/ok to test

@vuule (Contributor) commented on May 1, 2024:

> cc @vuule @mhaseeb123 thoughts on how to test?

We can round trip a single string column with >2B characters. Maybe throw in a smaller string column in the same table just to make sure they can coexist.
Maybe I'm missing something with this simple answer. Do we need to test different encoding types?
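For concreteness, a rough sketch of such a round-trip test (row count, string lengths, and file name are illustrative assumptions, not the PR's actual test code):

    #include <cudf_test/column_wrapper.hpp>
    #include <cudf_test/table_utilities.hpp>
    #include <cudf/io/parquet.hpp>
    #include <string>
    #include <vector>

    // Illustrative only: one strings column whose character data exceeds the
    // large-strings threshold, plus a small strings column alongside it.
    constexpr int num_rows = 1'000'000;
    std::vector<std::string> big(num_rows, std::string(3000, 'x'));  // ~3GB of chars
    std::vector<std::string> small(num_rows, "abc");
    cudf::test::strings_column_wrapper large_col(big.begin(), big.end());
    cudf::test::strings_column_wrapper small_col(small.begin(), small.end());
    auto expected = cudf::table_view({large_col, small_col});

    // Round trip through Parquet and compare.
    auto const filepath = std::string{"large_strings.parquet"};
    auto out_opts = cudf::io::parquet_writer_options::builder(
                      cudf::io::sink_info{filepath}, expected).build();
    cudf::io::write_parquet(out_opts);
    auto in_opts = cudf::io::parquet_reader_options::builder(
                     cudf::io::source_info{filepath}).build();
    auto result = cudf::io::read_parquet(in_opts);
    CUDF_TEST_EXPECT_TABLES_EQUAL(expected, result.tbl->view());

With the threshold environment variable lowered, as discussed below, the same shape would work with far less data.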

@etseidl (Contributor, Author) commented on May 1, 2024:

> We can round trip a single string column with >2B characters. Maybe throw in a smaller string column in the same table just to make sure they can coexist. Maybe I'm missing something with this simple answer. Do we need to test different encoding types?

I was wondering about setting the env var to turn on large strings support, but I see there are already methods in the test suite to enable and disable large strings. Maybe I'll also change the threshold so we don't have to write 2GB of data per test (and, yes, we should test the strings kernel, delta byte array, and delta length byte array).

Review comment on the new test code:

    expected_metadata.column_metadata[2].set_encoding(cudf::io::column_encoding::DELTA_BYTE_ARRAY);

    // set smaller threshold to reduce file size and execution time
    setenv("LIBCUDF_LARGE_STRINGS_THRESHOLD", std::to_string(threshold).c_str(), 1);
@etseidl (Contributor, Author) commented:

candidate for inclusion in StringsLargeTest?

A contributor replied:

That is a good idea. I think I'd want it to behave like the CUDF_TEST_ENABLE_LARGE_STRINGS() macro, which automatically unsets the environment variable at the end of the scope. I can do this in a follow-on PR to keep this one more focused.
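For reference, one possible shape for such a scope-based helper (a sketch only; the struct name is hypothetical and the follow-on PR may implement it differently):

    #include <cstdlib>
    #include <string>

    // Sketch of a scoped guard: sets the threshold on construction and unsets
    // the environment variable when the scope ends, mirroring the cleanup
    // behavior of CUDF_TEST_ENABLE_LARGE_STRINGS().
    struct large_strings_threshold_guard {
      explicit large_strings_threshold_guard(std::size_t threshold)
      {
        setenv("LIBCUDF_LARGE_STRINGS_THRESHOLD", std::to_string(threshold).c_str(), 1);
      }
      ~large_strings_threshold_guard() { unsetenv("LIBCUDF_LARGE_STRINGS_THRESHOLD"); }
    };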

@davidwendt (Contributor) commented:

/ok to test

@etseidl marked this pull request as ready for review on May 2, 2024.
@etseidl requested review from a team as code owners.
@etseidl requested review from @bdice and @davidwendt.
Comment on lines 282 to 283:

    return cudf::strings::detail::get_offset_value(scol.offsets(), column.size(), stream) -
           cudf::strings::detail::get_offset_value(scol.offsets(), 0, stream);
A contributor commented:

Is it possible that the input column could have been sliced? If so, then this would be more correct.

Suggested change:

    - return cudf::strings::detail::get_offset_value(scol.offsets(), column.size(), stream) -
    -        cudf::strings::detail::get_offset_value(scol.offsets(), 0, stream);
    + return cudf::strings::detail::get_offset_value(scol.offsets(), column.size() + column.offset(), stream) -
    +        cudf::strings::detail::get_offset_value(scol.offsets(), column.offset(), stream);

Note that if the column has not been sliced, then column.offset() == 0.

@etseidl (Contributor, Author) replied:

Yes, IIRC it was originally written that way due to a concern @vuule had about sliced columns.

A contributor commented:

Doesn't get_offset_value already adjust for the column offset? AFAICT, it's using get_value, which uses data(), which is implemented as head<T>() + _offset.

A contributor replied:

No. get_offset_value() takes an offsets column, which does not include its parent's sliced offset/size.

@etseidl (Contributor, Author) replied:

Ah, thank you for clearing that up.
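To make the slicing behavior discussed above concrete, a small illustration (not code from the PR; the values are chosen for the example):

    #include <cudf_test/column_wrapper.hpp>
    #include <cudf/copying.hpp>

    // Slicing a strings column adjusts the parent's offset/size but leaves
    // the offsets child spanning the full, unsliced range.
    cudf::test::strings_column_wrapper full({"aa", "bbb", "c", "dddd"});
    auto piece = cudf::slice(full, {1, 3}).front();  // rows {"bbb", "c"}
    // piece.offset() == 1 and piece.size() == 2, but the offsets child still
    // holds all five entries {0, 2, 5, 6, 10}. The sliced char count must be
    // computed as offsets[offset + size] - offsets[offset] = 6 - 2 = 4 bytes,
    // which is why the suggestion above adds column.offset() to both indices.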

    @@ -1076,7 +1076,7 @@ CUDF_KERNEL void __launch_bounds__(decode_block_size)
         __syncwarp();
       } else if (use_char_ll) {
         __shared__ __align__(8) uint8_t const* pointers[warp_size];
    -    __shared__ __align__(4) size_type offsets[warp_size];
    +    __shared__ __align__(4) size_t offsets[warp_size];
A contributor commented:

Suggested change:

    - __shared__ __align__(4) size_t offsets[warp_size];
    + __shared__ __align__(8) size_t offsets[warp_size];

Are these align declarators needed?

@etseidl (Contributor, Author) replied:

tbh, I started using the __align__ as a monkey-see-monkey-do kind of thing 😅. I don't know if they're actually necessary at this point.

A contributor replied:

I suspect this is the only one that may have a purpose:
https://github.com/rapidsai/cudf/pull/15632/files#diff-52e09ddca44181e11af56d8526360207906f5f25ba888cf51efbd2c1b15d775cR957

    __shared__ __align__(16) page_state_s state_g;

and only because page_state_s is a structure. It should probably have been declared with alignas(16) instead.

@etseidl (Contributor, Author) replied:

I'll remove them since they're unnecessary.
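As an aside, a minimal CUDA illustration of the alignas(16) point above (the member list is a placeholder, not the real page_state_s, and decode_kernel is a hypothetical name):

    // Putting the alignment on the type means every declaration of the
    // struct, including shared-memory instances, gets 16-byte alignment
    // without a per-declaration __align__ qualifier.
    struct alignas(16) page_state_s {
      int dummy;  // placeholder; the real struct has many fields
    };

    __global__ void decode_kernel()
    {
      __shared__ page_state_s state_g;  // already 16-byte aligned via the type
      // ... kernel body elided ...
    }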

Resolved (outdated) review threads on:
  • cpp/src/io/parquet/page_string_decode.cu
  • cpp/src/io/parquet/parquet_gpu.hpp
  • cpp/src/io/parquet/page_delta_decode.cu
@bdice (Contributor) left a comment:

I have no further comments at this time, so I'll approve, but I know there are some unresolved questions for which I don't have great answers (e.g., whether we need the alignment, the large strings testing strategy, etc.). I think this PR will benefit from a close look from @vuule and/or @davidwendt.

@bdice (Contributor) commented on May 2, 2024:

/ok to test

@mhaseeb123 (Member) commented:

> cc @vuule @mhaseeb123 thoughts on how to test?

Sorry for the delayed response, but I would say we can test this with a table containing only large string column(s) and a table with mixed columns, round-tripped with some encoding and compression. To keep testing time small, I think we shouldn't test with all (or many) encodings and compressions unless it's functionality-critical; instead, that could be made an example or a benchmark.

@mhaseeb123 (Member) left a comment:

The changes look good to me. Thanks for the effort Ed!

@vuule (Contributor) left a comment:

Amazing work!
Also, the (small) size of this PR shows how @davidwendt's utilities make a great foundation for large strings support.

@vuule (Contributor) commented on May 2, 2024:

/ok to test

@davidwendt (Contributor) commented:

/merge

@rapids-bot merged commit b8503bc into rapidsai:branch-24.06 on May 3, 2024 (70 checks passed).
@etseidl deleted the large_strings branch on May 3, 2024.
Labels: CMake (CMake build issue), improvement (Improvement / enhancement to an existing function), libcudf (Affects libcudf (C++/CUDA) code), non-breaking (Non-breaking change)
Projects: None yet
5 participants