Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Adding byte array view structure #11322

Merged

Conversation

hyperbolic2346
Copy link
Contributor

@hyperbolic2346 hyperbolic2346 commented Jul 21, 2022

Description

I wanted to get this up in a PR of its own so we could get some discussion going if necessary. This adds a byte_array_view which is almost identical to a string_view. The goal is to be able to get these for list columns of bytes, so list<uint8> and list<int8>. I didn't template it on the type, but instead selected uint8_t because std::byte is a uint8_t. My PR for writing byte arrays in parquet will use this to get the rows of byte data for statistics and writing. That PR is forthcoming. I left this code down in cuio statistics due to the usage and the previous discussions regarding .element. I needed to wrap the device_span because I need comparison operators for the cub reduce.

Checklist

  • I am familiar with the Contributing Guidelines.
  • New or existing tests cover these changes.
  • The documentation is up to date with these changes.

@hyperbolic2346 hyperbolic2346 requested a review from a team as a code owner July 21, 2022 17:52
@github-actions github-actions bot added the libcudf Affects libcudf (C++/CUDA) code. label Jul 21, 2022
@hyperbolic2346 hyperbolic2346 self-assigned this Jul 21, 2022
@hyperbolic2346 hyperbolic2346 added feature request New feature or request 3 - Ready for Review Ready for review by team non-breaking Non-breaking change labels Jul 21, 2022
@codecov
Copy link

codecov bot commented Jul 21, 2022

Codecov Report

❗ No coverage uploaded for pull request base (branch-22.08@96f747b). Click here to learn what that means.
The diff coverage is n/a.

@@               Coverage Diff               @@
##             branch-22.08   #11322   +/-   ##
===============================================
  Coverage                ?   86.42%           
===============================================
  Files                   ?      143           
  Lines                   ?    22777           
  Branches                ?        0           
===============================================
  Hits                    ?    19686           
  Misses                  ?     3091           
  Partials                ?        0           

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update 96f747b...88f6f38. Read the comment docs.

@vyasr
Copy link
Contributor

vyasr commented Jul 21, 2022

The implementation generally looks fine, but I am curious about the scope of the utilization of this class. Where all do we expect it to be used? Will there be code paths that always assume List<int8> or List<uint8> are going to be byte streams? How will we, in general, indicate whether a List<int8> or List<uint8> should be treated this way if there are code paths that require handling both?

@hyperbolic2346
Copy link
Contributor Author

Will there be code paths that always assume List<int8> or List<uint8> are going to be byte streams?

No, it will be decided when writing a parquet column on a per-column basis.

How will we, in general, indicate whether a List<int8> or List<uint8> should be treated this way if there are code paths that require handling both?

This is all for parquet writing and not anything integral to cudf. In cudf it will always just be columns. I need this abstraction to help match up with string_view somewhat to simplify the implementation. When writing parquet, the writer takes a bool per column to indicate if a list of bytes or a string is represented as a byte array or a list. It is more efficient to store the list of bytes as a byte array and currently strings are stored as such. Reading it back in will also require a bool per column to indicate if reading a byte array containing a string comes in as a list or string column.

Copy link
Contributor

@vuule vuule left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Like the idea. My only reservation with proceeding with this PR is whether we should make this a part of hierarchy, given that there's a lot of overlap between this and classes like device_span and string_view.

* statistics. Otherwise, it is a device_span in all but name.
*
*/
class byte_array_view {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Maybe we should derive from device_span<uint8_t const> to inherit the basic members.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Alternatively, feels like there could be a base template class for string_view and byte_array_view, data access and comparison semantics seem to be the same.
Maybe these can be future enhancements.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We talked about this being derived and the pros/cons above. I debated that, but didn't want to give immediate access to all the members in span. I'm not against it and would encourage the debate and conversions that result, but would suggest a followup for that.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sorry, that conversation was here.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah, this can be changed later. Mostly wanted to share the ideas (was not aware of the previous convo).

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I need the comparison operators and it seemed nice, but it isn't strictly required.

Copy link
Contributor

@mythrocks mythrocks Jul 25, 2022

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

but didn't want to give immediate access to all the members in span.

Would private inheritance work, with some using declarations to expose the methods we'd like to?
(I'm on the fence, though. I'm preconditioned to prefer composition over inheritance.)
Edit: I see @vuule's comment on the previous PR as well. Valid point.

Copy link
Contributor

@jrhemstad jrhemstad Jul 25, 2022

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I need the comparison operators and it seemed nice, but it isn't strictly required.

You can always pass a custom comparison operator to an algorithm.

You should think very carefully about adding a whole new type vs just using a span directly.

When you use a common vocabulary type like a span, you know exactly what you are getting and what the behavior/contract of that type is. A new type incurs a bunch of cognitive overhead to learn all of its semantics.

Copy link
Contributor Author

@hyperbolic2346 hyperbolic2346 Jul 26, 2022

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It certainly does, but my concern from using a span was that I wanted to abstract away the internals of it and not require the user to know it is a span of uint8_t, but they do need to know that for the data pointer, etc so this could be a moot point. I also liked composition for not blanket allowing all span accessors, but look over it again, I don't see anything that would be an example of something I wouldn't want to be available. I'm not against doing this, but I need to get moving on it as this is the foundation of #11303, #11160, and #11328, which are all 22.08 PRs.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@mythrocks

Would private inheritance work, with some using declarations to expose the methods we'd like to?

This is nice. Less verbose, chances of logical mistakes will be less compared to composition. Since using declaration inherits all overloads, it needs some thought. when overloads of span changes or added, so does derived class overloads.

I hope, it's not an anti-pattern (@codereport)

cpp/src/io/statistics/byte_array_view.cuh Show resolved Hide resolved
cpp/src/io/statistics/byte_array_view.cuh Outdated Show resolved Hide resolved
cpp/src/io/statistics/byte_array_view.cuh Outdated Show resolved Hide resolved
cpp/src/io/statistics/byte_array_view.cuh Outdated Show resolved Hide resolved
Copy link
Contributor

@mythrocks mythrocks left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

👍, based on the conversations and changes in #11303.

@hyperbolic2346
Copy link
Contributor Author

It has been decided to be better to close this PR and pursue the option of using byte_array_view = device_span<uint8_t> and passing comparison operators to the cub reductions. Investigating that before closing.

@hyperbolic2346
Copy link
Contributor Author

I have a branch with the requested changes, but it is looking gnarly to me. I have it all compiling, but there are orc statistics errors that I haven't tracked down yet. I wanted to get opinions before I chase down this bug and merge this down.

@vuule
Copy link
Contributor

vuule commented Jul 27, 2022

I have a branch with the requested changes, but it is looking gnarly to me. I have it all compiling, but there are orc statistics errors that I haven't tracked down yet. I wanted to get opinions before I chase down this bug and merge this down.

Not saying no to this version, but it seems like a step back to me. The comparators are distributed across the code instead of being in a single class to make a useful abstraction. I still feel like the best option would be to derive from span and add the comparators (or stick with the original implementation). The original class was in cudf::io namespace under statistics, so I don't see it as a string_view equivalent that we should be wary of adding.

@hyperbolic2346 hyperbolic2346 requested a review from vuule July 27, 2022 17:40
@nvdbaranec
Copy link
Contributor

nvdbaranec commented Jul 27, 2022

Not saying no to this version, but it seems like a step back to me. The comparators are distributed across the code instead of being in a single class to make a useful abstraction. I still feel like the best option would be to derive from span and add the comparators (or stick with the original implementation). The original class was in cudf::io namespace under statistics, so I don't see it as a string_view equivalent that we should be wary of adding.

I agree. A centralized place for this is considerably cleaner.

This is a new class with some overlap with existing types so there's weight to the cognitive load argument. But, it's constrained to cuIO so this isn't something that's going to be widely in use, and there's precedence for this kind of thing (eg. hostdevice_vector). cuIO just has some custom needs sometimes.

I also agree with the idea of potentially making this some kind of derived class off of device_span, but that's probably a followup PR.

@GregoryKimball
Copy link
Contributor

I'm good with merging as long as we open an issue to work through the redesign ideas.

@hyperbolic2346
Copy link
Contributor Author

rerun tests

@hyperbolic2346
Copy link
Contributor Author

@gpucibot merge

@rapids-bot rapids-bot bot merged commit 92f1a0f into rapidsai:branch-22.08 Jul 28, 2022
@hyperbolic2346 hyperbolic2346 deleted the mwilson/byte_array_view branch July 28, 2022 01:43
@hyperbolic2346
Copy link
Contributor Author

I'm good with merging as long as we open an issue to work through the redesign ideas.

#11408

rapids-bot bot pushed a commit that referenced this pull request Aug 3, 2022
When reviewing PR #11322 it was noted that it would be preferable to use `std::byte` for the data type, but at the time that didn't work out, so the plan was to address it later and issue #11362 was created to track it.

Fixes #11362

Authors:
  - Mike Wilson (https://github.com/hyperbolic2346)

Approvers:
  - Tobias Ribizel (https://github.com/upsj)
  - Bradley Dice (https://github.com/bdice)
  - Nghia Truong (https://github.com/ttnghia)

URL: #11424
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
3 - Ready for Review Ready for review by team feature request New feature or request libcudf Affects libcudf (C++/CUDA) code. non-breaking Non-breaking change
Projects
None yet
Development

Successfully merging this pull request may close these issues.

8 participants