Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[BUG] ORC write does not include full timestamp metrics which causes spark to not do predicate push down #13899

Closed
revans2 opened this issue Aug 17, 2023 · 2 comments · Fixed by #13848
Labels
0 - Backlog In queue waiting for assignment bug Something isn't working cuIO cuIO issue libcudf Affects libcudf (C++/CUDA) code. Spark Functionality that helps Spark RAPIDS

Comments

@revans2
Copy link
Contributor

revans2 commented Aug 17, 2023

Describe the bug
I am not totally sure if this is a bug or a new feature, but it ends up being a performance problem for Spark.

ORC statistics for timestamps are confusing. They include a minimum, maximum, minimum UTC, maximum UTC, and then min and max nanoseconds.

CUDF only sets minimum and maximum UTC. Spark and java ORC by default use only minimum and maximum +min/max nanos for predicate push down. This results in predicate push down not working at all for timestamps in ORC.

NVIDIA/spark-rapids#9068

The Spark CPU writer fills in all of the fields. It would really be great if CUDF could do the same. The nano seconds is probably not that important, but timestamp really is important.

Steps/Code to reproduce bug
Write a file out with CUDF in ORC. Try to read the data in with spark along with a filter where the value is outside of the range for any value in a timestamp column.

This will result in all of the data being read when CUDF wrote the file, but if the CPU wrote the file none of it would have been read.

Expected behavior
Predicate push down works no matter who wrote the data.

@revans2 revans2 added bug Something isn't working Needs Triage Need team to review and classify Spark Functionality that helps Spark RAPIDS labels Aug 17, 2023
@GregoryKimball GregoryKimball added 0 - Backlog In queue waiting for assignment libcudf Affects libcudf (C++/CUDA) code. cuIO cuIO issue and removed Needs Triage Need team to review and classify labels Aug 18, 2023
@vuule
Copy link
Contributor

vuule commented Aug 24, 2023

@revans2 could you share more info about the nanosecond fields? I don't see anything about that in the specs.
I sill add min/max in #13848; IIUC, the change is trivial because we always write files in UTC.

@revans2
Copy link
Contributor Author

revans2 commented Aug 30, 2023

@vuule I don't know about the spec. i was just reading through the code and looking at the dump of the metrics form the command line tool.

https://github.com/apache/orc/blob/a9e49656558e54f3f5c1b1eb0b9d70e2ad80bc21/proto/orc_proto.proto#L70-L72

is the protocol buffer code for the nanoseconds. If they are not provided, then the default for min is 0, and the default for max is 999999. The only reason I mentioned it is because when I dumped the footer metrics I saw that the max for our file had the 999999 at the end for nanoseconds and I wanted to understand why it was different from what Spark had written out. I don't think it would change anything functionally

rapids-bot bot pushed a commit that referenced this issue Sep 18, 2023
Closes #7087, closes #13793, closes #13899

This PR adds support for several cases and statistics types:
- sum statistics are included even when all elements are null (no minmax);
- sum statistics are included in double stats;
- minimum/maximum and minimumNanos/maximumNanos are included in timestamp stats;
- hasNull field is written for all columns.
- decimal statistics

Added tests for all supported stats.

Authors:
  - Vukasin Milovanovic (https://github.com/vuule)
  - Karthikeyan (https://github.com/karthikeyann)

Approvers:
  - Lawrence Mitchell (https://github.com/wence-)
  - Robert (Bobby) Evans (https://github.com/revans2)
  - Vyas Ramasubramani (https://github.com/vyasr)
  - Karthikeyan (https://github.com/karthikeyann)

URL: #13848
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
0 - Backlog In queue waiting for assignment bug Something isn't working cuIO cuIO issue libcudf Affects libcudf (C++/CUDA) code. Spark Functionality that helps Spark RAPIDS
Projects
None yet
Development

Successfully merging a pull request may close this issue.

3 participants