[FEA] Add File Statistic when writing the ORC file #10075

wbo4958 · 2022-01-19T01:50:13Z

Even with #10041 adding statistics to the RowIndex, cuDF still does not support writing the full File Statistics into the ORC file.

Why do we need this FEA?

As of Spark 3.3.0, Spark supports aggregate pushdown for the ORC and Parquet file formats, which pushes the aggregation down to the DataSource to compute.

For ORC, Spark typically reads the File Statistics from the ORC file to get the column stats and then computes the aggregate (max/min/count) without reading any actual data from the file, which can remarkably improve performance.

But if there is no File Statistic in the ORC file, Spark will throw an exception.
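To make the idea concrete, here is a minimal Python sketch of answering min/max/count from file-level column statistics alone. It is not Spark's or cuDF's code: `ColumnStats` and `pushed_down_aggregate` are hypothetical names invented for illustration, and the `ValueError` only stands in for whatever exception Spark actually raises when the statistics are absent.

```python
from dataclasses import dataclass
from typing import Dict, Optional, Union

Number = Union[int, float]


@dataclass
class ColumnStats:
    """Hypothetical stand-in for one column's file-level statistics."""
    count: int        # number of non-null values in the column
    minimum: Number   # smallest value in the column
    maximum: Number   # largest value in the column


def pushed_down_aggregate(
    file_stats: Optional[Dict[str, ColumnStats]], column: str, agg: str
) -> Number:
    """Answer min/max/count from statistics without scanning any rows."""
    if file_stats is None or column not in file_stats:
        # Assumption: models the failure described above when the writer
        # did not emit file-level statistics.
        raise ValueError(f"no file statistics for column {column!r}")
    stats = file_stats[column]
    if agg == "min":
        return stats.minimum
    if agg == "max":
        return stats.maximum
    if agg == "count":
        return stats.count
    raise ValueError(f"aggregate {agg!r} cannot be answered from statistics")
```

With statistics present, `pushed_down_aggregate({"price": ColumnStats(3, 1.5, 9.0)}, "price", "max")` returns the stored maximum directly; passing `None` for `file_stats` raises, which is the case this issue is about.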

github-actions · 2022-02-18T02:15:35Z

This issue has been labeled inactive-30d due to no recent activity in the past 30 days. Please close this issue if no further response or action is needed. Otherwise, please respond with a comment indicating any updates or changes to the original issue and/or confirm this issue still needs to be addressed. This issue will be labeled inactive-90d if there is no activity in the next 60 days.

amahussein · 2022-03-15T14:45:06Z

There is a standing bug #5826 that points to missing statistics for chunk mode.

github-actions · 2022-04-15T00:09:42Z

This issue has been labeled inactive-30d due to no recent activity in the past 30 days. Please close this issue if no further response or action is needed. Otherwise, please respond with a comment indicating any updates or changes to the original issue and/or confirm this issue still needs to be addressed. This issue will be labeled inactive-90d if there is no activity in the next 60 days.

vuule · 2022-06-08T06:22:40Z

@wbo4958 I think the issue should be fixed by #10694
Is anything else pending here?

wbo4958 added the feature request and Needs Triage labels and removed the feature request label Jan 19, 2022

wbo4958 added the feature request label Jan 19, 2022

github-actions bot added the inactive-30d label Feb 18, 2022

amahussein mentioned this issue Mar 8, 2022

[BUG] GPU writing ORC columns statistics NVIDIA/spark-rapids#4860

Closed

amahussein mentioned this issue Mar 15, 2022

Document agg pushdown on ORC file limitation [skip ci] NVIDIA/spark-rapids#4957

Merged

github-actions bot removed the inactive-30d label Mar 15, 2022

github-actions bot added the inactive-30d label Apr 15, 2022

amahussein mentioned this issue Jun 1, 2022

Update GPU ORC statistics write support NVIDIA/spark-rapids#5715

Merged

vuule added the cuIO label Jun 8, 2022

GregoryKimball closed this as completed Jun 24, 2022

bdice removed the Needs Triage label Mar 4, 2024
