Removing int8 column option from parquet byte_array writing #11539
Conversation
Codecov Report: Base 87.40% // Head 88.37% // Increases project coverage by +0.96%.
Additional details and impacted files:
@@ Coverage Diff @@
## branch-22.12 #11539 +/- ##
================================================
+ Coverage 87.40% 88.37% +0.96%
================================================
Files 133 133
Lines 21833 22508 +675
================================================
+ Hits 19084 19892 +808
+ Misses 2749 2616 -133
☔ View full report at Codecov.
LGTM except for the aforementioned concerns about values outside int8's range.
Based on the description, I would have expected this to use UINT8 strictly instead of INT8, since the sign bit has no meaning in plain binary data like this. It would probably also help with the boundary issues mentioned in the other review comments.
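For illustration, a minimal standalone C++ sketch of the boundary problem: byte values above 127 fit a uint8, but wrap to negative values when treated as an int8.

```cpp
#include <cstdint>
#include <iostream>

int main()
{
  // A raw byte with the high bit set is 255 as an unsigned byte, but
  // reinterpreting it as int8 yields -1 -- values in [128, 255] fall
  // outside int8's [-128, 127] range, which is the boundary issue above.
  unsigned char const raw = 0xFF;
  std::cout << +static_cast<std::uint8_t>(raw) << '\n';  // prints 255
  std::cout << +static_cast<std::int8_t>(raw) << '\n';   // prints -1 (wraps)
  return 0;
}
```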
Co-authored-by: Bradley Dice <[email protected]>
I would typically use unsigned int values for bytes as well, and I would agree with you about the string argument if it weren't for the strong coupling that exists between the two. When reading binary, we always read it as a string, and we can decompose the column and build it again as a list of bytes if desired. The column type that we get from pulling apart a string column is an int8 column, for reasons unknown to me. This seemed to set a precedent for int8 byte representations inside cudf and also made list the easy conversion path. As a result, reading binary data from a parquet file currently returns a list type column. That said, the easy path is by no means the best path. If we have consensus that uint8 should be the type we use and expect, I am OK with that, but it will require more than a few changes. It certainly fits more naturally into the parquet statistics.
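A hedged sketch of the decomposition described above, assuming the 22.12-era strings_column_view accessors (the function name below is illustrative):

```cpp
#include <cudf/column/column_view.hpp>
#include <cudf/strings/strings_column_view.hpp>

// A strings column decomposes into (offsets, chars); at the time of
// this PR the chars child carried type_id::INT8.
void inspect_string_children(cudf::column_view const& col)
{
  cudf::strings_column_view const scv(col);
  auto const offsets = scv.offsets();  // INT32 offsets child
  auto const chars   = scv.chars();    // chars child; typed INT8 at the time
  // Re-wrapping these children as a LIST column yields list<int8>,
  // which is why binary reads surfaced as a list-typed column.
  (void)offsets;
  (void)chars;
}
```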
Beyond the scope of this PR, but I'd be interested in a discussion about this. Since strings in cuDF are UTF-8, not C-style char arrays, why not use UINT8 as the child column type for both binary and string data? Is it for ease of binding to Python strings?
From what I can gather, cudf didn't have a uint8 type when strings were implemented. I agree that unsigned is a better option, but I wonder what impact changing it would have on external users. I think the best option is to create an issue to discuss changing the behavior.
I don't think consistency with a different type (i.e.
…quet_binary_uint8_removal
Updated this to change the column type to uint8 instead of int8. Now binary writes only accept uint8 columns and binary reads come back as uint8 columns.
Also resolves NVIDIA/spark-rapids#6408 with the added changes that standardize casting to list of bytes. There should be no additional changes needed on the plugin side based on my testing + integration tests.
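As a hedged usage sketch of the read side, assuming the 22.12-era reader API from the binary read/write work in #11526 (reader_column_schema and set_convert_binary_to_strings; the function name and path are illustrative):

```cpp
#include <cudf/io/parquet.hpp>

#include <string>
#include <vector>

// Disabling binary-to-string conversion makes a binary column come
// back as list<uint8> after this PR.
cudf::io::table_with_metadata read_binary_column(std::string const& path)
{
  std::vector<cudf::io::reader_column_schema> schema{
    cudf::io::reader_column_schema{}.set_convert_binary_to_strings(false)};
  auto opts =
    cudf::io::parquet_reader_options::builder(cudf::io::source_info{path})
      .build();
  opts.set_column_schema(schema);
  return cudf::io::read_parquet(opts);
}
```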
Approving with one question. Happy to defer on that if you don't want to make those changes in this PR.
data->size(),
std::move(*data),
std::move(*null_mask),
UNKNOWN_NULL_COUNT);
Should we fetch the null count before releasing the data, and pass it here? We should try to retain any information that may have been previously computed.
Generally, I think we're going to move away from lazily computing null counts as we work on adding streams to the public API. There are some hiccups with the lazy approach and streams, but primarily we aren't being as efficient as possible wrt avoiding calls to the null_count kernel, because we aren't tracking information that we may have already computed earlier.
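A minimal sketch of the suggestion, assuming the column/contents API of the time (the function name is illustrative): capture the possibly already-computed null count before release(), instead of rebuilding with UNKNOWN_NULL_COUNT and recomputing lazily later.

```cpp
#include <cudf/column/column.hpp>
#include <cudf/types.hpp>

#include <memory>
#include <utility>

std::unique_ptr<cudf::column> rebuild_as_uint8(std::unique_ptr<cudf::column> col)
{
  auto const size       = col->size();
  auto const null_count = col->null_count();  // may run the kernel once, here
  auto contents         = col->release();     // hands back data + null mask
  return std::make_unique<cudf::column>(cudf::data_type{cudf::type_id::UINT8},
                                        size,
                                        std::move(*contents.data),
                                        std::move(*contents.null_mask),
                                        null_count);
}
```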
I like the idea of moving away from lazily computed null counts and am happy to change this here if we want to start now. In this case, it made sense to defer the null counting because byte cast is often used as an intermediate type for reading, writing, and hashing. Those are usually indifferent to the null count, so it made sense to keep it lazily computed while we still supported that.
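For reference, a hedged sketch of that byte-cast path (cudf::byte_cast from cudf/reshape.hpp; the wrapper name is illustrative):

```cpp
#include <cudf/column/column.hpp>
#include <cudf/column/column_view.hpp>
#include <cudf/reshape.hpp>

#include <memory>

// byte_cast turns each fixed-width value into a list of its bytes
// (UINT8 after this PR). Reading, writing, and hashing consumers of
// this intermediate rarely touch the null count, which is why
// deferring it was harmless here.
std::unique_ptr<cudf::column> to_byte_lists(cudf::column_view const& col)
{
  return cudf::byte_cast(col, cudf::flip_endianness::YES);
}
```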
… into branch-22.12
Adding do not merge because of some upmerge issues. Upmerged to branch-22.12.
rerun tests
@gpucibot merge
…apidsai#11539)" This reverts commit 1effe19.
Description
As suggested in #11526 and captured in issue #11536, the use of both INT8 and UINT8 as supported types for byte_arrays is unnecessary and adds complexity to the code. This change removes INT8 as an option and only allows UINT8 columns to be written out as byte_arrays.
This matches cudf string columns, which contain an INT8 column for data. Closes #11536.
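A minimal usage sketch of the write side, assuming the 22.12-era writer API (set_output_as_binary from the binary-write work in #11526; the function name, path, and column index are illustrative):

```cpp
#include <cudf/io/parquet.hpp>
#include <cudf/table/table_view.hpp>

#include <string>

// After this PR, only a LIST<UINT8> column flagged as binary is
// written out as a parquet byte_array.
void write_binary_column(cudf::table_view const& table, std::string const& path)
{
  cudf::io::table_input_metadata metadata(table);
  metadata.column_metadata[0].set_output_as_binary(true);  // list<uint8> column
  auto opts = cudf::io::parquet_writer_options::builder(
                cudf::io::sink_info{path}, table)
                .metadata(&metadata)
                .build();
  cudf::io::write_parquet(opts);
}
```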
Checklist