Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Include row group level stats when writing ORC files #10041

Merged
merged 19 commits into from
Jan 19, 2022

Conversation

vuule
Copy link
Contributor

@vuule vuule commented Jan 13, 2022

Closes #9964
Encodes row group level stats with the rest and writes the encoded blobs into the protobuf, at the start of each stripe (other stats are in the file footer).
Adds put_bytes to ProtobufWriter to optimize writing of buffers.
Adds new struct to represent the encoded ORC statistics so they are separated by granularity level (instead of using a single vector).

@vuule vuule added bug Something isn't working cuIO cuIO issue non-breaking Non-breaking change labels Jan 13, 2022
@vuule vuule self-assigned this Jan 13, 2022
@github-actions github-actions bot added the libcudf Affects libcudf (C++/CUDA) code. label Jan 13, 2022
@codecov
Copy link

codecov bot commented Jan 13, 2022

Codecov Report

Merging #10041 (fe56f23) into branch-22.02 (967a333) will decrease coverage by 0.07%.
The diff coverage is n/a.

❗ Current head fe56f23 differs from pull request most recent head 6ea0a50. Consider uploading reports for the commit 6ea0a50 to get more accurate results
Impacted file tree graph

@@               Coverage Diff                @@
##           branch-22.02   #10041      +/-   ##
================================================
- Coverage         10.49%   10.41%   -0.08%     
================================================
  Files               119      119              
  Lines             20305    20541     +236     
================================================
+ Hits               2130     2139       +9     
- Misses            18175    18402     +227     
Impacted Files Coverage Δ
python/custreamz/custreamz/kafka.py 29.16% <0.00%> (-0.63%) ⬇️
python/dask_cudf/dask_cudf/sorting.py 92.66% <0.00%> (-0.25%) ⬇️
python/dask_cudf/dask_cudf/core.py 70.85% <0.00%> (-0.17%) ⬇️
python/cudf/cudf/__init__.py 0.00% <0.00%> (ø)
python/cudf/cudf/api/types.py 0.00% <0.00%> (ø)
python/cudf/cudf/core/frame.py 0.00% <0.00%> (ø)
python/cudf/cudf/core/index.py 0.00% <0.00%> (ø)
python/cudf/cudf/io/parquet.py 0.00% <0.00%> (ø)
python/cudf/cudf/core/dtypes.py 0.00% <0.00%> (ø)
python/cudf/cudf/core/scalar.py 0.00% <0.00%> (ø)
... and 31 more

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update 512e161...6ea0a50. Read the comment docs.

@vuule
Copy link
Contributor Author

vuule commented Jan 14, 2022

Measured a small performance regression due to additional stats encode. Difference is within a few percent (hard to measure exactly due to variance between runs).

@vuule
Copy link
Contributor Author

vuule commented Jan 14, 2022

Will look into adding a test, unsure if this is possible with the available readers.

Copy link
Contributor

@hyperbolic2346 hyperbolic2346 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Copyrights need to be updated on some files here. I like these changes, I think they are going in the right direction for readability.

cpp/src/io/orc/orc.cpp Outdated Show resolved Hide resolved
cpp/src/io/orc/orc.h Outdated Show resolved Hide resolved
cpp/src/io/orc/writer_impl.cu Outdated Show resolved Hide resolved
@wbo4958
Copy link
Contributor

wbo4958 commented Jan 14, 2022

nd writes the encoded blobs into the protobuf, at the start of each stripe (other stats are in the file footer).
Adds put_bytes to ProtobufWriter to optimize writing of buffers.
Adds new struct to represent the encoded ORC

Maybe we can add a configure to enable or disable this FEA after SPARK has bumped to the ORC repo which has the fix.

@vuule
Copy link
Contributor Author

vuule commented Jan 15, 2022

nd writes the encoded blobs into the protobuf, at the start of each stripe (other stats are in the file footer).
Adds put_bytes to ProtobufWriter to optimize writing of buffers.
Adds new struct to represent the encoded ORC

Maybe we can add a configure to enable or disable this FEA after SPARK has bumped to the ORC repo which has the fix.

👍
@mythrocks is working on API changes that will allow callers to disable rowgroup level statistics, so they can effectively revert the behavioral changes in this PR.

@jlowe
Copy link
Member

jlowe commented Jan 15, 2022

Maybe we can add a configure to enable or disable this FEA after SPARK has bumped to the ORC repo which has the fix.

That only works when everyone stops using the older Spark version(s) (and any other Java-based data processing frameworks) that still using the old ORC version with the reading bug. While those frameworks on the older ORC version are still in use, cudf applications could still end up creating ORC files that those frameworks will silently drop data when reading with predicate pushdown. Even though the spec says these things are technically optional, it is very sketchy to be the only ORC writer on the planet that is not generating these stats.

I think it's fine making it possible in libcudf to avoid writing these stats, but IMO cudf applications should always ask for the stats to be generated unless they know there's no chance the files they're creating could be read by data processing frameworks that could be affected by the ORC reading bug.

@vuule
Copy link
Contributor Author

vuule commented Jan 15, 2022

I think it's fine making it possible in libcudf to avoid writing these stats, but IMO cudf applications should always ask for the stats to be generated unless they know there's no chance the files they're creating could be read by data processing frameworks that could be affected by the ORC reading bug.

That's right. Writing all statistics will be the default.

@vuule
Copy link
Contributor Author

vuule commented Jan 15, 2022

rerun tests

m_buf->data()[lpos + 2] = (uint8_t)(sz);

if (stats != nullptr) {
sz += put_uint(encode_field_number<decltype(*stats)>(2)); // 2: statistics
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nit: maybe field number (2 in this case) should be an enum. I see that it's used in a lot of places though, so maybe a followup.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It's doable, but there would need to be different enums for each ORC message type, since the numbers are not unique between messages (see https://orc.apache.org/specification/ORCv1/). We can have the set of enums (non-class) and still pass them as int. I would really need to do this in a follow up for this one to make it into 22.02.

cpp/src/io/orc/orc.h Outdated Show resolved Hide resolved
cpp/src/io/orc/orc.h Outdated Show resolved Hide resolved
Copy link
Contributor

@nvdbaranec nvdbaranec left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks good. Might not hurt to cook up some kind of tests for this.

@vuule vuule added the 5 - Ready to Merge Testing and reviews complete, ready to merge label Jan 19, 2022
@vuule
Copy link
Contributor Author

vuule commented Jan 19, 2022

rerun tests

1 similar comment
@galipremsagar
Copy link
Contributor

rerun tests

@galipremsagar
Copy link
Contributor

@gpucibot merge

@rapids-bot rapids-bot bot merged commit f193d59 into rapidsai:branch-22.02 Jan 19, 2022
@vuule vuule deleted the fea-rowgroup-stats branch January 19, 2022 18:03
rapids-bot bot pushed a commit that referenced this pull request Jan 20, 2022
Depends on #10041.

The erstwhile ORC writer API exposed only a binary choice to choose
the level of statistics: ENABLED/DISABLED.
This commit allows the ORC writer to further choose whether statistics
are collected at the ROW_GROUP or STRIPE level.

This commit also includes the relevant changes to `java/` and `python/`.

Authors:
  - MithunR (https://github.com/mythrocks)
  - Vukasin Milovanovic (https://github.com/vuule)

Approvers:
  - Jason Lowe (https://github.com/jlowe)
  - GALI PREM SAGAR (https://github.com/galipremsagar)
  - Christopher Harris (https://github.com/cwharris)
  - Vukasin Milovanovic (https://github.com/vuule)

URL: #10058
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
5 - Ready to Merge Testing and reviews complete, ready to merge bug Something isn't working cuIO cuIO issue libcudf Affects libcudf (C++/CUDA) code. non-breaking Non-breaking change
Projects
None yet
Development

Successfully merging this pull request may close these issues.

[FEA] orc file written by cudf doesn't include Column Statistics in RowIndex
6 participants