Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Removing int8 column option from parquet byte_array writing #11539

Conversation

hyperbolic2346
Copy link
Contributor

@hyperbolic2346 hyperbolic2346 commented Aug 16, 2022

Description

As suggested in #11526 and captured in issue #11536 the usage of both INT8 and UINT8 as supported types for byte_arrays is unnecessary and adds complexity to the code. This change removes INT8 as an option and only allows UINT8 columns to be written out as byte_arrays. This matches with cudf string columns which contain an INT8 column for data.

closes #11536

Checklist

  • I am familiar with the Contributing Guidelines.
  • New or existing tests cover these changes.
  • The documentation is up to date with these changes.

@hyperbolic2346 hyperbolic2346 added 3 - Ready for Review Ready for review by team code quality libcudf Affects libcudf (C++/CUDA) code. cuIO cuIO issue improvement Improvement / enhancement to an existing function breaking Breaking change labels Aug 16, 2022
@hyperbolic2346 hyperbolic2346 self-assigned this Aug 16, 2022
@hyperbolic2346 hyperbolic2346 requested a review from a team as a code owner August 16, 2022 00:44
cpp/tests/io/parquet_test.cpp Outdated Show resolved Hide resolved
cpp/src/lists/dremel.cu Outdated Show resolved Hide resolved
@codecov
Copy link

codecov bot commented Aug 16, 2022

Codecov Report

Base: 87.40% // Head: 88.37% // Increases project coverage by +0.96% 🎉

Coverage data is based on head (86942b2) compared to base (f72c4ce).
Patch coverage: 86.52% of modified lines in pull request are covered.

Additional details and impacted files
@@               Coverage Diff                @@
##           branch-22.12   #11539      +/-   ##
================================================
+ Coverage         87.40%   88.37%   +0.96%     
================================================
  Files               133      133              
  Lines             21833    22508     +675     
================================================
+ Hits              19084    19892     +808     
+ Misses             2749     2616     -133     
Impacted Files Coverage Δ
python/cudf/cudf/core/udf/__init__.py 97.05% <ø> (+47.05%) ⬆️
python/cudf/cudf/io/orc.py 92.94% <ø> (-0.09%) ⬇️
...thon/dask_cudf/dask_cudf/tests/test_distributed.py 18.86% <ø> (+4.94%) ⬆️
python/cudf/cudf/core/_base_index.py 82.20% <43.75%> (-3.35%) ⬇️
python/cudf/cudf/io/text.py 91.66% <66.66%> (-8.34%) ⬇️
python/strings_udf/strings_udf/__init__.py 86.27% <76.00%> (-10.61%) ⬇️
python/cudf/cudf/core/index.py 93.18% <95.16%> (+0.55%) ⬆️
python/cudf/cudf/__init__.py 90.69% <100.00%> (ø)
python/cudf/cudf/core/column/categorical.py 89.34% <100.00%> (ø)
python/cudf/cudf/core/groupby/groupby.py 91.51% <100.00%> (+0.29%) ⬆️
... and 13 more

Help us with your feedback. Take ten seconds to tell us how you rate us. Have a feature suggestion? Share it here.

☔ View full report at Codecov.
📢 Do you have feedback about the report comment? Let us know in this issue.

Copy link
Contributor

@upsj upsj left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM except for the aforementioned concerns about values outside int8's range.

@davidwendt
Copy link
Contributor

Based on the description, I would've expected this would strictly used UINT8 instead of INT8 since the sign bit has no meaning in plain binary data like this. Also, it would probably help with the boundary issues mentioned in the other review comments as well.
I think the consistency with strings argument is overly emphasized and should really not be a consideration in this implementation.

@hyperbolic2346
Copy link
Contributor Author

I think the consistency with strings argument is overly emphasized and should really not be a consideration in this implementation.

I would typically use unsigned int values for bytes as well and I would agree with you about the string argument if it weren't for the strong coupling between the two that exists. When reading binary, we read it as a string always and can decompose the column and built it again as a list of bytes if desired. The column type that we get from pulling apart a string column is an int8 column, for reasons unknown to me. This seemed to imply a precedence for int8 byte representations inside cudf and also made the easy path for conversion to be list. As a result, reading binary data from a parquet file returns a list type column right now.

That said, the easy path doesn't mean that it is the best path by a long shot. If we have consensus that uint8 should be the type we use and expect, I am ok with that, but it will require more than a few changes. It is certainly more natural to fit into the parquet statistics.

@etseidl
Copy link
Contributor

etseidl commented Aug 16, 2022

The column type that we get from pulling apart a string column is an int8 column, for reasons unknown to me.

Beyond scope for this PR, but I'd be interested in a discussion about this. Since strings in cuDF are UTF-8, not c-style char arrays, why not use UINT8 as the child column type for both binary and string data? Is it for ease of binding to python strings? string_view has to cast to unsigned char * to compare two strings already. Then again, casts are cheap and refactoring expenssive, so 🤷

@hyperbolic2346
Copy link
Contributor Author

why not use UINT8 as the child column type for both binary and string data

From what I can gather, when strings were implemented cudf didn't have a uint8 type. I agree that unsigned is a better option, but I wonder the impact that would have externally by users. I think the best options is to create an issue to discuss changing the behavior.

@hyperbolic2346 hyperbolic2346 requested a review from bdice August 17, 2022 18:39
@davidwendt
Copy link
Contributor

I don't think consistency with a different type (i.e. string_view) is a compelling reason to use INT8 here. The string_view has its own requirements and dependencies (and things that depend on it). The design and implementation of string_view can be discussed outside the scope of this PR. Regardless, I see no reason to prevent a byte-array to use UINT8.

@hyperbolic2346 hyperbolic2346 changed the title Removing uint8 column option from parquet byte_array writing Removing int8 column option from parquet byte_array writing Aug 23, 2022
@hyperbolic2346
Copy link
Contributor Author

Updated this to change the column type to uint8 instead of int8. Now binary writes only accept uint8 columns and binary reads come back as uint8 columns.

@rwlee
Copy link
Contributor

rwlee commented Oct 12, 2022

Also resolves NVIDIA/spark-rapids#6408 with the added changes that standardize casting to list of bytes. There should be no additional changes needed on the plugin side based on my testing + integration tests.

@rwlee rwlee requested review from etseidl and mythrocks and removed request for etseidl October 12, 2022 20:53
Copy link
Contributor

@bdice bdice left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Approving with one question. Happy to defer on that if you don't want to make those changes in this PR.

data->size(),
std::move(*data),
std::move(*null_mask),
UNKNOWN_NULL_COUNT);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should we fetch the null count before releasing the data, and pass it here? We should try to retain any information that may have been previously computed.

Generally, I think we're going to move away from lazily computing null counts as we work on adding streams to the public API. There are some hiccups with the lazy approach and streams, but primarily we aren't being as efficient as possible wrt avoiding calls to the null_count kernel because we aren't tracking information that we already may have computed once earlier.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I like the idea of moving away form lazily computed null counts and am happy to change here if we want to start now. In this case, it made sense for me to defer the null counting because byte cast is often use as an intermediate type for reading, writing, and hashing. Those are usually indifferent to null count and it made sense to keep it lazily computed while we still supported it.

@github-actions github-actions bot added CMake CMake build issue Python Affects Python cuDF API. gpuCI labels Oct 15, 2022
@rwlee rwlee added the 5 - DO NOT MERGE Hold off on merging; see PR for details label Oct 15, 2022
@rwlee
Copy link
Contributor

rwlee commented Oct 15, 2022

Adding do not merge because of some upmerge issues. Upmerged to branch-22.12 in order to resolve some python errors and tested locally successfully... Continuing to debug but i don't want this merged accidently.

@rwlee rwlee mentioned this pull request Oct 17, 2022
3 tasks
@rwlee rwlee changed the base branch from branch-22.12 to pull-request/11936 October 17, 2022 20:08
@rwlee rwlee changed the base branch from pull-request/11936 to branch-22.12 October 17, 2022 20:09
@rwlee rwlee removed the 5 - DO NOT MERGE Hold off on merging; see PR for details label Oct 17, 2022
@bdice
Copy link
Contributor

bdice commented Oct 17, 2022

rerun tests

@bdice
Copy link
Contributor

bdice commented Oct 18, 2022

@gpucibot merge

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
3 - Ready for Review Ready for review by team breaking Breaking change CMake CMake build issue cuIO cuIO issue improvement Improvement / enhancement to an existing function Java Affects Java cuDF API. libcudf Affects libcudf (C++/CUDA) code. Python Affects Python cuDF API.
Projects
None yet
9 participants