Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[FEA] orc file written by cudf doesn't include Column Statistics in RowIndex #9964

Closed
wbo4958 opened this issue Jan 4, 2022 · 9 comments · Fixed by #10041
Closed

[FEA] orc file written by cudf doesn't include Column Statistics in RowIndex #9964

wbo4958 opened this issue Jan 4, 2022 · 9 comments · Fixed by #10041
Assignees
Labels
cuIO cuIO issue feature request New feature or request Spark Functionality that helps Spark RAPIDS

Comments

@wbo4958
Copy link
Contributor

wbo4958 commented Jan 4, 2022

Spark 3.2 has changed the orc dependency to 1.6.11 which has different behaviors with orc 1.5.10 (spark-plugins shaded) when picking row group with filter pushed down.

In a word, Spark 3.2 will return empty when reading the orc file written by cudf with filter pushed down which is because of missing Column Statistic in RowIndex.

From the orc spec, Column Statistic of RowIndex seems not to be a required field. But if the orc file didn't include Column Statistic in RowIndex, the spark will get incorrect result.

@wbo4958 wbo4958 added bug Something isn't working Needs Triage Need team to review and classify labels Jan 4, 2022
@revans2
Copy link
Contributor

revans2 commented Jan 4, 2022

Is there a bug filed against Spark for this? It is one thing to work around it in CUDF, which would be nice, but if it really is optional then it is a bug in Spark itself.

@wbo4958
Copy link
Contributor Author

wbo4958 commented Jan 5, 2022

I think it's the ORC issue not a spark issue, I just filed an ORC issue https://issues.apache.org/jira/browse/ORC-1075

@wbo4958 wbo4958 changed the title [BUG][FEA] orc file written by cudf doesn't include RowIndex [BUG][FEA] orc file written by cudf doesn't include Column Statistics in RowIndex Jan 5, 2022
@vuule vuule self-assigned this Jan 5, 2022
@vuule
Copy link
Contributor

vuule commented Jan 6, 2022

@wbo4958 can you confirm that columnStatistics is not present in RowIndex? If that's the case, this is a feature request as we current don't support row group statistics (which are optional). If somehow there is an incorrect value, I will look into fixing that.

Since columnStatistics are optional in RowIndex, the code should not depend on existence of this field. Maybe stripe statistics can be used instead?

@wbo4958
Copy link
Contributor Author

wbo4958 commented Jan 6, 2022

@vuule, Yeah, columnStatistics is not present in RowIndex. I have filed an issue for the Orc reader. Suppose it is orc reader issue. Thx

@vuule vuule changed the title [BUG][FEA] orc file written by cudf doesn't include Column Statistics in RowIndex [FEA] orc file written by cudf doesn't include Column Statistics in RowIndex Jan 7, 2022
@vuule vuule added cuIO cuIO issue feature request New feature or request and removed Needs Triage Need team to review and classify bug Something isn't working labels Jan 7, 2022
@wbo4958
Copy link
Contributor Author

wbo4958 commented Jan 7, 2022

Closed this issue, since the orc file written by cudf is following ORC format. Cudf doesn't have to add statistics in RowIndex.

ORC maintainer has confirmed it's the ORC java issue, and there is a PR pending to review.

@wbo4958 wbo4958 closed this as completed Jan 7, 2022
@vuule
Copy link
Contributor

vuule commented Jan 7, 2022

I think we can keep this open as a feature request. @wbo4958 are you okay with this option?

@wbo4958 wbo4958 reopened this Jan 7, 2022
@wbo4958
Copy link
Contributor Author

wbo4958 commented Jan 7, 2022

sure.

@jlowe jlowe added the Spark Functionality that helps Spark RAPIDS label Jan 7, 2022
@vuule
Copy link
Contributor

vuule commented Jan 10, 2022

Scoped out the feature. Changes required:

  1. Encode rowgroup-level stats in the writer (currently they are merged into stripe-level stats and discarded).
  2. Include the encoded stats in the row index entries.
  3. Enable stats by level (rowgroup, stripe, file, none) in the ORC API (currently a boolean option). This would also make the API consistent with Parquet 👍

rapids-bot bot pushed a commit that referenced this issue Jan 19, 2022
Closes #9964 
Encodes row group level stats with the rest and writes the encoded blobs into the protobuf, at the start of each stripe (other stats are in the file footer).
Adds `put_bytes` to `ProtobufWriter` to optimize writing of buffers.
Adds new struct to represent the encoded ORC statistics so they are separated by granularity level (instead of using a single vector).

Authors:
  - Vukasin Milovanovic (https://github.com/vuule)

Approvers:
  - Mike Wilson (https://github.com/hyperbolic2346)
  - https://github.com/nvdbaranec

URL: #10041
@vuule vuule reopened this Jan 19, 2022
@vuule
Copy link
Contributor

vuule commented Jan 24, 2022

Both PRs are merged, closing.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
cuIO cuIO issue feature request New feature or request Spark Functionality that helps Spark RAPIDS
Projects
None yet
Development

Successfully merging a pull request may close this issue.

4 participants