[FEA] Add File Statistic when writing the ORC file #10075
There is a standing bug #5826 that points to missing statistics for chunk mode.
Even with #10041 adding statistics to the RowIndex, cuDF still does not support writing file-level statistics (File Statistics) into the ORC file.
Why do we need this FEA?
As of Spark 3.3.0, Spark supports aggregate pushdown for the ORC and Parquet file formats, which pushes the aggregate down to the DataSource to compute.
Typically, for ORC, Spark reads the File Statistics from the ORC file to get the column stats and then computes the aggregate (max/min/count) without reading any actual data from the ORC file, which can remarkably improve performance.
But if there are no File Statistics in the ORC file, Spark will throw an exception.
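To make the dependency concrete, here is a minimal sketch in plain Python of how an aggregate can be answered from file-level statistics alone, and why a missing statistics entry leaves the reader with nothing to fall back on. It does not use the Spark, ORC, or cuDF APIs: `ColumnStats`, `merge_stats`, and `pushed_down_aggregate` are hypothetical names made up for illustration, and the integer-only column is an assumption.

```python
from dataclasses import dataclass
from typing import Optional, Sequence


@dataclass(frozen=True)
class ColumnStats:
    """Hypothetical stand-in for one column's statistics entry."""
    count: int    # number of non-null values
    minimum: int
    maximum: int


def merge_stats(stripes: Sequence[ColumnStats]) -> ColumnStats:
    """Roll per-stripe statistics up into one file-level entry.

    Assumes at least one stripe; an empty file is not handled here.
    """
    return ColumnStats(
        count=sum(s.count for s in stripes),
        minimum=min(s.minimum for s in stripes),
        maximum=max(s.maximum for s in stripes),
    )


def pushed_down_aggregate(file_stats: Optional[ColumnStats], op: str) -> int:
    """Answer min/max/count from the statistics only, never touching row data."""
    if file_stats is None:
        # Mirrors the failure described above: no statistics, no answer.
        raise ValueError("ORC file has no file-level statistics")
    answers = {
        "min": file_stats.minimum,
        "max": file_stats.maximum,
        "count": file_stats.count,
    }
    return answers[op]


# Two stripes of a single integer column (made-up values).
stripe_stats = [ColumnStats(3, 1, 9), ColumnStats(2, -4, 5)]
file_stats = merge_stats(stripe_stats)

print(pushed_down_aggregate(file_stats, "min"))
print(pushed_down_aggregate(file_stats, "max"))
print(pushed_down_aggregate(file_stats, "count"))
```

In this model the writer is responsible for the `merge_stats` step. A writer that only records per-stripe or row-index statistics leaves `file_stats` as `None`, which is the case this request asks cuDF to cover.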