[BUG] Spark cannot do predicate push down on INTs and LONGs parquet columns written by CUDF #11626
Comments
I filed https://issues.apache.org/jira/browse/SPARK-40280 in Spark for this issue.
Just FYI: I tested a file written by pandas/pyarrow and it looks like Spark's output.
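As a rough sketch of that check (not from the original thread; the file and column names are made up for illustration), one can write a file with pandas and dump the parquet-level schema with pyarrow:

```python
import pandas as pd
import pyarrow.parquet as pq

# Write a plain int64 column with pandas (pyarrow engine); the file and column
# names here are hypothetical.
pd.DataFrame({"a": [1, 2, 3]}).to_parquet("pandas_int64.parquet", engine="pyarrow")

# Print the parquet schema from the footer. For a pandas/pyarrow file the int64
# column appears without an extra INT(64, true) annotation, which matches what
# Spark's writer produces.
print(pq.ParquetFile("pandas_int64.parquet").schema)
```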
I'm guessing this is due to changes I made in #11302. I needed the converted type to figure out which field from the stats union to use when deciding column ordering for the column index. Looking back, I changed the logic to always assume signed integers unless the converted type is unsigned, so dropping the converted type on INT32 and INT64 should be safe. Testing now.
@etseidl That comment was an emotional rollercoaster. Thanks for the clarification there. I was remembering that PR as I read the comment, but had forgotten it!
…11627) As brought up in #11626, writing the converted type on INT32 and INT64 columns is not out of spec, but it is unusual for parquet writers. This change brings cudf's writer in line with pandas and Spark by not including converted type information on these types. Closes #11626.
Authors:
- Mike Wilson (https://github.com/hyperbolic2346)
Approvers:
- Robert (Bobby) Evans (https://github.com/revans2)
- Jim Brennan (https://github.com/jbrennan333)
- Yunsong Wang (https://github.com/PointKernel)
URL: #11627
Describe the bug
You are going to love this one. To be clear, this is not really a bug in CUDF; I think this is a bug in Spark, and I will file an issue there for it. This issue is asking whether we can work around it in CUDF, because it is going to be a while before any fix goes into Spark, and even longer before customers are able to upgrade to a version of Spark that has the fix.
When talking about signed numbers, the parquet format specification says that annotations like INT(32, true) and INT(64, true) are implied by the int32 and int64 primitive types, so writers do not need to include them.
The CUDF code adds the INT(32, true) and INT(64, true) annotations to the INT32 and INT64 columns that it writes. You can see this by using the parquet-cli tool to dump the footer for a file.
The Spark writer does not include those extra metadata tags.
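For anyone without parquet-cli handy, here is a minimal pyarrow-based sketch of the same footer check (the file name is a placeholder, not a path from this report):

```python
import pyarrow.parquet as pq

# Parquet-level schema as recorded in the file footer.
schema = pq.ParquetFile("cudf_out.parquet").schema
for i in range(len(schema)):
    col = schema.column(i)
    # physical_type is the primitive type (INT32/INT64); converted_type and
    # logical_type show any extra annotation such as INT(64, true).
    print(col.name, col.physical_type, col.converted_type, col.logical_type)
```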
Both should be fine according to the spec, but the difference matters when Spark sets up filters for a predicate push down, like
a > 500 and b < 5
When Spark translates its filter into a parquet filter, the unexpected metadata on the integer column does not match the type it expects, and it ends up not pushing down any filter at all. This results in a lot of extra data being read when we could have skipped over entire row groups. Could we please stop inserting this metadata in the footer for columns of type INT32 and INT64? All of the other types appear to be doing what Spark wants.
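As an illustrative sketch of how this shows up on the Spark side (the path and column names are hypothetical, not from the report), checking the physical plan for PushedFilters makes the difference visible:

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Filter on two integer columns, as in the example above.
df = spark.read.parquet("cudf_out.parquet").filter("a > 500 and b < 5")

# For a file written without the extra INT annotations, the plan shows
# PushedFilters: [..., GreaterThan(a,500), LessThan(b,5)]; when the footer
# carries the unexpected annotations, those filters are missing and whole
# row groups can no longer be skipped.
df.explain()
```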