[FEA] orc file written by cudf doesn't include Column Statistics in RowIndex #9964

wbo4958 · 2022-01-04T03:55:51Z

Spark 3.2 has changed its ORC dependency to 1.6.11, which behaves differently from ORC 1.5.10 (spark-plugins shaded) when picking row groups with a filter pushed down.

In short, Spark 3.2 returns an empty result when reading an ORC file written by cudf with a filter pushed down, because Column Statistics are missing from the RowIndex.

According to the ORC spec, Column Statistics in the RowIndex do not appear to be a required field. But if the ORC file doesn't include Column Statistics in the RowIndex, Spark returns an incorrect result.
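To make the failure mode concrete, here is a minimal sketch of row group selection under a pushed-down equality filter. It is not the ORC Java reader's code; `ColumnStats`, `may_contain`, and `select_row_groups` are hypothetical names, and the statistics are reduced to an integer min/max. The point is only that a missing entry should mean "unknown, read it" rather than "no match".

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class ColumnStats:
    """Hypothetical, simplified stand-in for per-row-group column statistics."""
    minimum: int
    maximum: int


def may_contain(stats: Optional[ColumnStats], value: int) -> bool:
    """Return True if a row group might hold rows where column == value."""
    if stats is None:
        # Statistics are optional, so absence means "unknown":
        # the row group cannot be skipped and has to be read.
        return True
    return stats.minimum <= value <= stats.maximum


def select_row_groups(row_index: List[Optional[ColumnStats]], value: int) -> List[int]:
    """Return the indices of row groups that must be read for `column == value`."""
    return [i for i, stats in enumerate(row_index) if may_contain(stats, value)]
```

A reader that instead treated `None` as "cannot match" would skip every row group of a file without row group statistics and return an empty result, which matches the symptom described here.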

revans2 · 2022-01-04T21:00:23Z

Is there a bug filed against Spark for this? It is one thing to work around it in CUDF, which would be nice, but if it really is optional then it is a bug in Spark itself.

wbo4958 · 2022-01-05T07:49:32Z

I think it's an ORC issue, not a Spark issue. I just filed an ORC issue: https://issues.apache.org/jira/browse/ORC-1075

vuule · 2022-01-06T00:27:38Z

@wbo4958 can you confirm that columnStatistics is not present in RowIndex? If that's the case, this is a feature request, as we currently don't support row group statistics (which are optional). If somehow there is an incorrect value, I will look into fixing that.

Since columnStatistics is optional in RowIndex, the reader code should not depend on the existence of this field. Maybe stripe statistics can be used instead?
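A rough sketch of that fallback idea, under stated assumptions: statistics are reduced to an inclusive `(min, max)` integer range, and `row_group_may_match` is a hypothetical helper, not an API of ORC, Spark, or cudf. When the row group entry has no statistics, the coarser stripe range is used, and when neither is available the row group is kept.

```python
from typing import Optional, Tuple

# Simplified statistics: an inclusive (min, max) range for one column.
Range = Tuple[int, int]


def row_group_may_match(
    row_group_stats: Optional[Range],
    stripe_stats: Optional[Range],
    value: int,
) -> bool:
    """Decide whether a row group could contain `value`.

    Prefers row group statistics, falls back to stripe statistics,
    and keeps the row group when no statistics exist at all.
    """
    stats = row_group_stats if row_group_stats is not None else stripe_stats
    if stats is None:
        return True  # nothing known, so the row group cannot be skipped
    low, high = stats
    return low <= value <= high
```

Stripe statistics are coarser, so the fallback can only prune a whole stripe's worth of row groups at once, but it never drops rows that might match.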

wbo4958 · 2022-01-06T02:12:33Z

@vuule, yes, columnStatistics is not present in RowIndex. I have filed an issue against the ORC reader, since I suppose it is an ORC reader issue. Thanks.

wbo4958 · 2022-01-07T03:33:35Z

Closing this issue, since the ORC file written by cudf follows the ORC format. cudf doesn't have to add statistics to the RowIndex.

The ORC maintainer has confirmed that it's an ORC Java issue, and there is a PR pending review.

vuule · 2022-01-07T04:09:59Z

I think we can keep this open as a feature request. @wbo4958 are you okay with this option?

wbo4958 · 2022-01-07T04:58:06Z

sure.

vuule · 2022-01-10T08:35:14Z

Scoped out the feature. Changes required:

- Encode rowgroup-level stats in the writer (currently they are merged into stripe-level stats and discarded).
- Include the encoded stats in the row index entries.
- Enable stats by level (rowgroup, stripe, file, none) in the ORC API (currently a boolean option). This would also make the API consistent with Parquet 👍
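As a rough illustration of what "stats by level" could look like in place of a boolean, here is a hypothetical sketch. The enum and its member names are made up for this example and are not the cudf API; it also assumes that choosing a finer level implies writing the coarser ones too.

```python
from enum import IntEnum
from typing import List


class StatisticsLevel(IntEnum):
    """Hypothetical granularity setting; names are illustrative, not cudf's."""
    NONE = 0
    FILE = 1
    STRIPE = 2
    ROWGROUP = 3


def levels_written(level: StatisticsLevel) -> List[StatisticsLevel]:
    """List the statistics levels a writer would emit for a given setting.

    Assumption: each finer level also includes all coarser levels.
    """
    return [lvl for lvl in StatisticsLevel if StatisticsLevel.NONE < lvl <= level]
```

With this shape, `levels_written(StatisticsLevel.STRIPE)` yields the file and stripe levels, while `StatisticsLevel.NONE` yields an empty list, which corresponds to the old boolean option being off.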

Closes #9964

Encodes row group level stats with the rest and writes the encoded blobs into the protobuf, at the start of each stripe (other stats are in the file footer).
Adds `put_bytes` to `ProtobufWriter` to optimize writing of buffers.
Adds new struct to represent the encoded ORC statistics so they are separated by granularity level (instead of using a single vector).

Authors:
- Vukasin Milovanovic (https://github.com/vuule)

Approvers:
- Mike Wilson (https://github.com/hyperbolic2346)
- https://github.com/nvdbaranec

URL: #10041
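For readers unfamiliar with how a pre-encoded blob ends up inside a protobuf message, the following is a minimal sketch of the standard length-delimited wire format: a varint tag, a varint length, then the raw bytes. It is written in Python for brevity and is not the cudf `ProtobufWriter` implementation; `encode_varint` and `put_bytes_field` are hypothetical names, and field number 2 in the usage note is only an illustrative choice.

```python
LENGTH_DELIMITED = 2  # protobuf wire type for bytes, strings, and embedded messages


def encode_varint(value: int) -> bytes:
    """Encode a non-negative integer as a protobuf base-128 varint."""
    if value < 0:
        raise ValueError("varint sketch only handles non-negative integers")
    out = bytearray()
    while True:
        byte = value & 0x7F
        value >>= 7
        if value:
            out.append(byte | 0x80)  # more bytes follow
        else:
            out.append(byte)
            return bytes(out)


def put_bytes_field(buf: bytearray, field_number: int, payload: bytes) -> None:
    """Append a length-delimited field whose payload is already encoded."""
    buf.extend(encode_varint((field_number << 3) | LENGTH_DELIMITED))
    buf.extend(encode_varint(len(payload)))
    buf.extend(payload)  # copied as one block rather than byte by byte
```

Calling `put_bytes_field(buf, 2, blob)` appends the tag byte `0x12`, the length of `blob`, and then `blob` itself, so statistics that were encoded earlier can be embedded without being re-serialized.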

vuule · 2022-01-24T18:15:12Z

Both PRs are merged, closing.

wbo4958 added the bug and Needs Triage labels Jan 4, 2022

wbo4958 changed the title ~~[BUG][FEA] orc file written by cudf doesn't include RowIndex~~ [BUG][FEA] orc file written by cudf doesn't include Column Statistics in RowIndex Jan 5, 2022

vuule self-assigned this Jan 5, 2022

vuule changed the title ~~[BUG][FEA] orc file written by cudf doesn't include Column Statistics in RowIndex~~ [FEA] orc file written by cudf doesn't include Column Statistics in RowIndex Jan 7, 2022

vuule added the cuIO and feature request labels and removed the Needs Triage and bug labels Jan 7, 2022

wbo4958 closed this as completed Jan 7, 2022

wbo4958 reopened this Jan 7, 2022

jlowe added the Spark label Jan 7, 2022

This was referenced Jan 11, 2022

Use the ORC version that corresponds to the Spark version [databricks] NVIDIA/spark-rapids#4408

Merged

[BUG] Spark 3.3.0 test failure: NoSuchMethodError org.apache.orc.TypeDescription.getAttributeValue NVIDIA/spark-rapids#4031

Closed

vuule mentioned this issue Jan 13, 2022

Include row group level stats when writing ORC files #10041

Merged

rapids-bot bot closed this as completed in #10041 Jan 19, 2022

vuule reopened this Jan 19, 2022

vuule closed this as completed Jan 24, 2022

guiyanakuang mentioned this issue Sep 27, 2023

[FEA] Set the officially assigned id for the ORC writer implemented by cuDF #13977

Closed
