
Sort dictionary data alphabetically in the ORC writer #14295

Merged
merged 17 commits into from
Oct 31, 2023

Conversation

vuule
Contributor

@vuule vuule commented Oct 17, 2023

Description

Strings in the dictionary data streams are now sorted alphabetically.
This reduces file size in some cases because compression can be more efficient.

Reduces throughput by up to 22% when writing strings columns (3% speedup when dictionary encoding is not used, though!).
Benchmark data does not show the compression difference, but we have some user data that compresses almost 30% better.

Checklist

  • I am familiar with the Contributing Guidelines.
  • New or existing tests cover these changes.
  • The documentation is up to date with these changes.

@vuule vuule added cuIO cuIO issue improvement Improvement / enhancement to an existing function non-breaking Non-breaking change labels Oct 17, 2023
@vuule vuule self-assigned this Oct 17, 2023
@github-actions github-actions bot added the libcudf Affects libcudf (C++/CUDA) code. label Oct 17, 2023
Comment on lines +873 to +881
auto const is_str_dict =
ck.type_kind == TypeKind::STRING and ck.encoding_kind == DICTIONARY_V2;
ck.dict_index = is_str_dict ? column.host_stripe_dict(stripe.id).index.data() : nullptr;
ck.dict_data_order =
is_str_dict ? column.host_stripe_dict(stripe.id).data_order.data() : nullptr;
ck.dtype_len = (ck.type_kind == TypeKind::STRING) ? 1 : column.type_width();
ck.scale = column.scale();
ck.decimal_offsets =
(ck.type_kind == TypeKind::DECIMAL) ? column.decimal_offsets() : nullptr;
Contributor Author

Some of these were left uninitialized when unused; changed them to always be initialized.

@abellina
Contributor

abellina commented Oct 23, 2023

For my internal test, our diff vs the CPU went from 22% to 5%, which is really impressive. Thanks for working on this.

3% speedup when dictionary encoding is not used, though!

Do you expect a 3% slow down to the write because of the sort for dictionary encoded data?

@vuule vuule marked this pull request as ready for review October 23, 2023 16:46
@vuule vuule requested a review from a team as a code owner October 23, 2023 16:46
@vuule vuule requested review from divyegala and davidwendt October 23, 2023 16:46
@vuule
Contributor Author

vuule commented Oct 23, 2023

Do you expect a 3% slow down to the write because of the sort for dictionary encoded data?

The slowdown is up to 22% unfortunately. Sorting is not cheap :(
FWIW, performance-wise we are still in a much better spot than the starting point - pre-cuco use.

@vuule vuule requested a review from davidwendt October 25, 2023 16:48
Member

@divyegala divyegala left a comment


Looks good, just one question

stripe_dicts.host_to_device_async(stream);

// Sort stripe dictionaries alphabetically
auto streams = cudf::detail::fork_streams(stream, std::min<size_t>(dict_order_owner.size(), 8));
Member

Is 8 streams an empirical choice?

Contributor Author

It is. I tried powers of two up to 32, and 8 was the fastest. There wasn't a big difference among values of 4 and above, though.

@vuule vuule added the 5 - Ready to Merge Testing and reviews complete, ready to merge label Oct 25, 2023
@abellina
Contributor

Before we merge, mind if I run orc benchmarks on our stuff? I should be able to get these back to you tomorrow. @vuule

@vuule vuule added 5 - DO NOT MERGE Hold off on merging; see PR for details and removed 5 - Ready to Merge Testing and reviews complete, ready to merge labels Oct 26, 2023
@abellina
Contributor

abellina commented Oct 27, 2023

  1. I ran NDS at 3TB and, as expected, it didn't affect that benchmark at all. I compared the results for the tables and I don't see anything out of the ordinary.
  2. I ran a simple transcode scenario (read all the source data, and write it out whole) with customer source data around 10GB and these were my findings:
    • With this patch, we wrote 8.8GB instead of 11GB without (20% reduction in size)
  • With this patch, the experiment took 115 seconds wall clock, vs 106 seconds without (7.8% slower)
    • Isolating the write GPU time, the GPU spent ~12% more time (or 100ms per task more) with this patch, in order to do the sorting.
    • I compared the outputs with this patch and without and found no differences (using except in Spark)

With the above we believe default=on makes sense but we really like having the flag you added @vuule, because it allows us to experiment easily and you never know what pathological cases we may run into.

Thank you!!

@vuule vuule added 5 - Ready to Merge Testing and reviews complete, ready to merge and removed 5 - DO NOT MERGE Hold off on merging; see PR for details labels Oct 31, 2023
@vuule
Contributor Author

vuule commented Oct 31, 2023

/merge

@rapids-bot rapids-bot bot merged commit cb06c20 into rapidsai:branch-23.12 Oct 31, 2023
54 checks passed
@vuule vuule deleted the bug-sort-orc-dict branch October 31, 2023 18:04
raydouglass pushed a commit that referenced this pull request Dec 8, 2023
… to ORC (#14595)

Changes in #14295 introduced a synchronization issue in `build_dictionaries`. After `stripe_dicts` are initialized on the host, we copy them to the device and then launch kernels that read the dicts (the device copy). However, after these kernels we deallocate buffers that are no longer needed and clear the dicts' views of these buffers on the host. The problem is that, without synchronization after the H2D copy, the host-side modification can happen before the H2D copy is performed, and we then run the kernels with the altered state.
This PR adds a sync point to make sure the copy is done before host-side modification.

Authors:
   - Vukasin Milovanovic (https://github.com/vuule)

Approvers:
   - Nghia Truong (https://github.com/ttnghia)
   - Alessandro Bellina (https://github.com/abellina)
   - Bradley Dice (https://github.com/bdice)
karthikeyann pushed a commit to karthikeyann/cudf that referenced this pull request Dec 12, 2023
… to ORC (rapidsai#14595)

Changes in rapidsai#14295 introduced a synchronization issue in `build_dictionaries`. After stripe_dicts are initialized on the host, we copy them to the device and then launch kernels that read the dicts (device copy). However, after these kernels we deallocate buffers that are not longer needed and clear the dicts' views to these buffers on the host. The problem is that, without synchronization after the H2D copy, the host modification can be done before the H2D copy is performed, and we run the kernels with the altered state.
This PR adds a sync point to make sure the copy is done before host-side modification.

Authors:
   - Vukasin Milovanovic (https://github.com/vuule)

Approvers:
   - Nghia Truong (https://github.com/ttnghia)
   - Alessandro Bellina (https://github.com/abellina)
   - Bradley Dice (https://github.com/bdice)