Sort dictionary data alphabetically in the ORC writer #14295

vuule · 2023-10-17T21:46:24Z

Description

Strings in the dictionary data streams are now sorted alphabetically.
This reduces file size in some cases because compression can be more efficient on sorted data.

Reduces throughput by up to 22% when writing string columns (3% speedup when dictionary encoding is not used, though!).
Benchmark data does not demonstrate the compression difference, but we have some user data that compresses almost 30% better.
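
To make the idea in the description concrete, here is a minimal host-side sketch of sorting a string dictionary and remapping the row indices that point into it. It is not the libcudf implementation (which runs on the GPU); the `dictionary` struct and `sort_dictionary` function are hypothetical names used only for illustration.

```cpp
#include <algorithm>
#include <cstdint>
#include <numeric>
#include <string>
#include <vector>

// Hypothetical container for one dictionary: unique strings plus
// per-row indices that refer to positions in `entries`.
struct dictionary {
  std::vector<std::string> entries;
  std::vector<std::uint32_t> row_indices;
};

// Returns a copy whose entries are in ascending lexicographic order,
// with every row index remapped to the entry's new position.
dictionary sort_dictionary(dictionary const& in) {
  // order[new_pos] = old_pos
  std::vector<std::uint32_t> order(in.entries.size());
  std::iota(order.begin(), order.end(), 0u);
  std::sort(order.begin(), order.end(), [&](std::uint32_t a, std::uint32_t b) {
    return in.entries[a] < in.entries[b];
  });

  // new_pos_of[old_pos] = new_pos (the inverse of `order`)
  std::vector<std::uint32_t> new_pos_of(order.size());
  dictionary out;
  out.entries.reserve(order.size());
  for (std::uint32_t new_pos = 0; new_pos < order.size(); ++new_pos) {
    new_pos_of[order[new_pos]] = new_pos;
    out.entries.push_back(in.entries[order[new_pos]]);
  }

  out.row_indices.reserve(in.row_indices.size());
  for (auto old_idx : in.row_indices) {
    out.row_indices.push_back(new_pos_of[old_idx]);
  }
  return out;
}
```

The extra sort and the index remapping are the added work that the throughput numbers above account for.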

Checklist

I am familiar with the Contributing Guidelines.
New or existing tests cover these changes.
The documentation is up to date with these changes.

vuule · 2023-10-17T22:52:33Z

cpp/src/io/orc/writer_impl.cu

+        auto const is_str_dict =
+          ck.type_kind == TypeKind::STRING and ck.encoding_kind == DICTIONARY_V2;
+        ck.dict_index = is_str_dict ? column.host_stripe_dict(stripe.id).index.data() : nullptr;
+        ck.dict_data_order =
+          is_str_dict ? column.host_stripe_dict(stripe.id).data_order.data() : nullptr;
+        ck.dtype_len = (ck.type_kind == TypeKind::STRING) ? 1 : column.type_width();
+        ck.scale     = column.scale();
+        ck.decimal_offsets =
+          (ck.type_kind == TypeKind::DECIMAL) ? column.decimal_offsets() : nullptr;


Some of these fields were left uninitialized when unused; changed to always initialize them.


abellina · 2023-10-23T16:25:20Z

For my internal test, our diff vs the CPU went from 22% to 5%, which is really impressive. Thanks for working on this.

> 3% speedup when dictionary encoding is not used, though!

Do you expect a 3% slow down to the write because of the sort for dictionary encoded data?

vuule · 2023-10-23T16:50:22Z

> Do you expect a 3% slow down to the write because of the sort for dictionary encoded data?

The slowdown is up to 22% unfortunately. Sorting is not cheap :(
FWIW, performance-wise we are still in a much better spot than the starting point (before cuco was used).


divyegala

Looks good, just one question

divyegala · 2023-10-25T20:38:23Z

cpp/src/io/orc/writer_impl.cu

+  stripe_dicts.host_to_device_async(stream);
+
+  // Sort stripe dictionaries alphabetically
+  auto streams = cudf::detail::fork_streams(stream, std::min<size_t>(dict_order_owner.size(), 8));


Is 8 streams an empirical choice?

vuule · 2023-10-25

It is. I tried powers of two up to 32, and 8 was the fastest. There wasn't a big difference compared to other values of 4 or more, though.
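
For readers unfamiliar with the pattern, the cap in the diff above can be sketched in plain C++ as follows. The stream count mirrors the `std::min<size_t>(..., 8)` expression from the diff; the round-robin assignment of dictionaries to streams is an assumption made for illustration, and `num_sort_streams` and `assign_streams` are hypothetical helpers, not libcudf functions.

```cpp
#include <algorithm>
#include <cstddef>
#include <vector>

// Number of streams to fork: one per dictionary, capped at `max_streams`.
std::size_t num_sort_streams(std::size_t num_dicts, std::size_t max_streams = 8) {
  return std::min<std::size_t>(num_dicts, max_streams);
}

// Assumed round-robin mapping: dictionary i is sorted on stream (i % stream count).
std::vector<std::size_t> assign_streams(std::size_t num_dicts, std::size_t max_streams = 8) {
  auto const n = num_sort_streams(num_dicts, max_streams);
  std::vector<std::size_t> stream_of(num_dicts);
  for (std::size_t i = 0; i < num_dicts; ++i) {
    stream_of[i] = i % n;  // loop body never runs when num_dicts == 0, so n != 0 here
  }
  return stream_of;
}
```

Capping the count keeps small writes from forking more streams than there are dictionaries to sort.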

abellina · 2023-10-26T18:54:41Z

Before we merge, mind if I run orc benchmarks on our stuff? I should be able to get these back to you tomorrow. @vuule


abellina · 2023-10-27T21:54:45Z

I ran NDS at 3TB and, as expected, it didn't affect that benchmark at all. I compared the results for the tables and I don't see anything out of the ordinary.
I ran a simple transcode scenario (read all the source data, and write it out whole) with customer source data around 10GB and these were my findings:
- With this patch, we wrote 8.8GB instead of 11GB without (20% reduction in size)
- With this patch it took us 115 seconds wall clock vs 106 seconds without to carry out my experiment (7.8% slower)
- Isolating the write GPU time, the GPU spent ~12% more time (or 100ms per task more) with this patch, in order to do the sorting.
- I compared the outputs with this patch and without and found no differences (using except in Spark)
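
As a quick check of the arithmetic in the list above, the following sketch computes the change relative to the unpatched run; `percent_change` is a hypothetical helper. With the rounded figures quoted, 11GB to 8.8GB is a 20% reduction, and 106 s to 115 s is about 8.5% slower relative to the unpatched time. The quoted 7.8% matches the same 9-second difference divided by the patched time (115 s) instead.

```cpp
// Percentage change of `patched` relative to `baseline`.
// Negative values mean a reduction, positive values an increase.
double percent_change(double baseline, double patched) {
  return (patched - baseline) / baseline * 100.0;
}
```

For example, `percent_change(11.0, 8.8)` gives the file-size change and `percent_change(106.0, 115.0)` gives the wall-clock change.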

With the above we believe default=on makes sense but we really like having the flag you added @vuule, because it allows us to experiment easily and you never know what pathological cases we may run into.

Thank you!!

vuule · 2023-10-31T18:03:58Z

/merge

… to ORC (#14595)

Changes in #14295 introduced a synchronization issue in `build_dictionaries`. After stripe_dicts are initialized on the host, we copy them to the device and then launch kernels that read the dicts (device copy). However, after these kernels we deallocate buffers that are no longer needed and clear the dicts' views to these buffers on the host. The problem is that, without synchronization after the H2D copy, the host modification can be done before the H2D copy is performed, and we run the kernels with the altered state. This PR adds a sync point to make sure the copy is done before host-side modification.

Authors:
- Vukasin Milovanovic (https://github.com/vuule)

Approvers:
- Nghia Truong (https://github.com/ttnghia)
- Alessandro Bellina (https://github.com/abellina)
- Bradley Dice (https://github.com/bdice)
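
The ordering problem described in the #14595 message can be illustrated with a host-only analogy in standard C++. This is not CUDA code and not the actual fix: `copy_async` is a hypothetical stand-in for an asynchronous host-to-device copy that reads its source on another thread, and `wait()` plays the role of the added sync point.

```cpp
#include <future>
#include <vector>

// Stand-in for an async H2D copy: the source buffer is read on another
// thread at some later time, so the caller must keep it unchanged until then.
std::future<std::vector<int>> copy_async(std::vector<int> const& host) {
  return std::async(std::launch::async, [&host] { return std::vector<int>(host); });
}

// Copies `host`, then clears it. Without the wait(), clear() could run
// before (or while) the copy reads the buffer, and the "device" side would
// see the altered state.
std::vector<int> copy_then_clear(std::vector<int>& host) {
  auto pending = copy_async(host);
  pending.wait();  // sync point: the copy has finished reading `host`
  host.clear();    // host-side modification is now safe
  return pending.get();
}
```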

sort dictionary data alphabetically

83ae4d1

vuule added the cuIO, improvement, and non-breaking labels Oct 17, 2023

vuule self-assigned this Oct 17, 2023

github-actions bot added the libcudf label Oct 17, 2023

vuule added 6 commits October 18, 2023 11:11

Merge branch 'branch-23.12' into bug-sort-orc-dict

3000000

Merge branch 'branch-23.12' of https://github.com/rapidsai/cudf into …

35cea92

…bug-sort-orc-dict

qualify name

dd85893

Merge branch 'bug-sort-orc-dict' of https://github.com/vuule/cudf int…

6fd5901

…o bug-sort-orc-dict

Merge branch 'branch-23.12' into bug-sort-orc-dict

7bad218

Merge branch 'branch-23.12' into bug-sort-orc-dict

0c50417

vuule marked this pull request as ready for review October 23, 2023 16:46

vuule requested a review from a team as a code owner October 23, 2023 16:46

vuule requested review from divyegala and davidwendt October 23, 2023 16:46

davidwendt reviewed Oct 23, 2023

vuule added 2 commits October 23, 2023 10:17

Merge branch 'branch-23.12' of https://github.com/rapidsai/cudf into …

2582fb7

…bug-sort-orc-dict

don't use make_counting_iterator

aec0821

vuule requested a review from davidwendt October 25, 2023 16:48

davidwendt reviewed Oct 25, 2023

vuule and others added 3 commits October 25, 2023 11:56

Merge branch 'branch-23.12' of https://github.com/rapidsai/cudf into …

be9f3d3

…bug-sort-orc-dict

<

4fe94e9

Co-authored-by: David Wendt <[email protected]>

Merge branch 'branch-23.12' into bug-sort-orc-dict

1894a88

davidwendt approved these changes Oct 25, 2023

divyegala approved these changes Oct 25, 2023

vuule added the "5 - Ready to Merge" label Oct 25, 2023

Merge branch 'branch-23.12' into bug-sort-orc-dict

8b9abe8

vuule added the "DO NOT MERGE" label and removed the "5 - Ready to Merge" label Oct 26, 2023

vuule added 4 commits October 27, 2023 10:44

Merge branch 'branch-23.12' of https://github.com/rapidsai/cudf into …

0e21516

…bug-sort-orc-dict

API + test

4faca02

Merge branch 'bug-sort-orc-dict' of https://github.com/vuule/cudf int…

3e60991

…o bug-sort-orc-dict

Merge branch 'branch-23.12' into bug-sort-orc-dict

8056e4f

abellina approved these changes Oct 27, 2023

vuule added the "5 - Ready to Merge" label and removed the "DO NOT MERGE" label Oct 31, 2023

rapids-bot bot merged commit cb06c20 into rapidsai:branch-23.12 Oct 31, 2023

vuule deleted the bug-sort-orc-dict branch October 31, 2023 18:04

vuule mentioned this pull request Dec 7, 2023

Fix synchronization issue when writing string columns with dictionary to ORC #14595

Merged

3 tasks
