Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Partial clean up of ORC writer #7324

Merged
merged 62 commits into from
Mar 4, 2021

Conversation

vuule
Copy link
Contributor

@vuule vuule commented Feb 5, 2021

Issue #6763

Clean up of the code surrounding the column data encode in the ORC writer:

  1. Add a 2D version of hostdevice_vector (single allocation);
  2. Add 2D versions of host_span and device_span;
  3. Add implicit conversions from hostdevice_vector to host_span and device_span.
  4. Use the new types to represent collections that currently use flattened hostdevice_vectors;
  5. Separated a part of EncChunk into a separate class, encoder_chunk_streams, as this is the only part used after data encode;
  6. Add orc_streams to represent per-column streams and compute offsets.
  7. Partial writer_impl.cu code "modernization".
  8. Removed redundant size parameters (since 2dspan and 2dvector hold the size info).
  9. use device_uvector instead of device_vector.

@vuule vuule added cuIO cuIO issue improvement Improvement / enhancement to an existing function non-breaking Non-breaking change labels Feb 5, 2021
@vuule vuule self-assigned this Feb 5, 2021
@codecov
Copy link

codecov bot commented Feb 5, 2021

Codecov Report

Merging #7324 (a490ff1) into branch-0.19 (43b44e1) will increase coverage by 0.05%.
The diff coverage is 89.94%.

Impacted file tree graph

@@               Coverage Diff               @@
##           branch-0.19    #7324      +/-   ##
===============================================
+ Coverage        81.80%   81.86%   +0.05%     
===============================================
  Files              101      101              
  Lines            16695    16884     +189     
===============================================
+ Hits             13658    13822     +164     
- Misses            3037     3062      +25     
Impacted Files Coverage Δ
python/cudf/cudf/utils/cudautils.py 50.38% <ø> (+5.56%) ⬆️
python/cudf/cudf/utils/docutils.py 97.36% <ø> (-2.64%) ⬇️
python/cudf/cudf/utils/dtypes.py 89.51% <ø> (+0.24%) ⬆️
python/cudf/cudf/utils/utils.py 85.43% <ø> (+0.23%) ⬆️
python/dask_cudf/dask_cudf/core.py 71.67% <ø> (-2.60%) ⬇️
python/cudf/cudf/testing/testing.py 80.00% <57.14%> (-1.04%) ⬇️
python/cudf/cudf/core/column/timedelta.py 88.23% <75.00%> (ø)
python/cudf/cudf/core/column/categorical.py 91.38% <77.14%> (-0.85%) ⬇️
python/cudf/cudf/core/column/datetime.py 89.09% <78.57%> (+0.04%) ⬆️
python/cudf/cudf/core/multiindex.py 82.28% <90.90%> (+0.12%) ⬆️
... and 26 more

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update e5d0ec9...a490ff1. Read the comment docs.

@github-actions github-actions bot added the libcudf Affects libcudf (C++/CUDA) code. label Feb 11, 2021
@vuule vuule requested a review from cwharris February 25, 2021 18:40
@vuule
Copy link
Contributor Author

vuule commented Feb 25, 2021

Added @cwharris in case he wants to review the span related changes.

cpp/src/io/orc/writer_impl.hpp Show resolved Hide resolved
cpp/src/io/orc/writer_impl.cu Outdated Show resolved Hide resolved
Comment on lines +378 to +379
std::accumulate(stripe_bounds.front().cbegin(),
stripe_bounds.back().cend(),
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This look odd, maybe I should rename these APIs.

add missing newline
@vuule vuule requested a review from rgsl888prabhu March 2, 2021 19:40
Copy link
Contributor

@kaatish kaatish left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks good to me.

cpp/src/io/orc/writer_impl.cu Outdated Show resolved Hide resolved
@vuule vuule mentioned this pull request Mar 3, 2021

operator cudf::detail::host_span<T>() { return {h_data, max_elements}; }
operator cudf::detail::host_span<T const>() const { return {h_data, max_elements}; }

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It might also make sense to store the device side data of hostdevice_vector in a rmm::device_uvector instead of a device_buffer. This will probably re-use the device_span constructor

  template <typename C, std::enable_if_t<is_device_span_supported_container<C>::value>* = nullptr>
  constexpr device_span(C& in) : base(thrust::raw_pointer_cast(in.data()), in.size())

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

AFAICT, we would still need the conversion operators to device_span here, but we wouldn't have to pass the size separately. I do like this change in general, we should look into polishing hostdevice_vector at some point.

@vuule vuule added the 5 - Ready to Merge Testing and reviews complete, ready to merge label Mar 3, 2021
@vuule
Copy link
Contributor Author

vuule commented Mar 4, 2021

@gpucibot merge

@rapids-bot rapids-bot bot merged commit d619f77 into rapidsai:branch-0.19 Mar 4, 2021
@vuule vuule deleted the refactor-orc-writer-host branch March 4, 2021 16:42
hyperbolic2346 pushed a commit to hyperbolic2346/cudf that referenced this pull request Mar 25, 2021
Issue rapidsai#6763

Clean up of the code surrounding the column data encode in the ORC writer:
1. Add a 2D version of `hostdevice_vector` (single allocation);
2. Add 2D versions of `host_span` and `device_span`;
3. Add implicit conversions from `hostdevice_vector` to `host_span` and `device_span`.
4. Use the new types to represent collections that currently use flattened `hostdevice_vectors`;
5. Separated a part of `EncChunk` into a separate class, `encoder_chunk_streams`, as this is the only part used after data encode;
6. Add `orc_streams` to represent per-column streams and compute offsets.
7. Partial `writer_impl.cu` code "modernization".
8. Removed redundant size parameters (since 2dspan and 2dvector hold the size info).
9. use `device_uvector` instead of `device_vector`.

Authors:
  - Vukasin Milovanovic (@vuule)

Approvers:
  - Jake Hemstad (@jrhemstad)
  - Kumar Aatish (@kaatish)
  - Ram (Ramakrishna Prabhu) (@rgsl888prabhu)

URL: rapidsai#7324
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
5 - Ready to Merge Testing and reviews complete, ready to merge cuIO cuIO issue improvement Improvement / enhancement to an existing function libcudf Affects libcudf (C++/CUDA) code. non-breaking Non-breaking change
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants