-
Notifications
You must be signed in to change notification settings - Fork 5.4k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Add support for DATE predicate pushdown with Parquet via min/max and … #10181
Conversation
Thank you for your pull request and welcome to our community. We require contributors to sign our Contributor License Agreement, and we don't seem to have you on file. In order for us to review and merge your code, please sign up at https://code.facebook.com/cla. If you are contributing on behalf of someone else (eg your employer), the individual CLA may not be sufficient and your employer may need the corporate CLA signed. If you have received this in error or have any questions, please contact us at [email protected]. Thanks! |
This should be the same issue described here - #8243 (comment) - here's an example exception stack when doing a query such as:
when the backing data is Parquet with "date_field" being an INT32 logically typed as a DATE. Also, relates to #6892 since the types now come from the Parquet data rather than before I think this was looking at the column from the query that Presto derived. So for example, the above query worked in 0.172 (with predicate pushdown enabled it just ignored dates) because before the call to |
Thank you for signing our Contributor License Agreement. We can now accept your code for this (and any) Facebook open source project. Thanks! |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for fixing this!
Please update your commit message (its title is too long) and make sure it follows the guidelines here.
presto-hive/src/main/java/com/facebook/presto/hive/parquet/ParquetTypeUtils.java
Outdated
Show resolved
Hide resolved
0b43a30
to
669c012
Compare
I rebased the commit message and pushed changes to support TIMESTAMP as well here (which includes dictionary support although it's sort of weird because cardinality will be high typically but Parquet will use a dictionary if it's not). The pushdown support I added is only for INT64 currently logically typed as TIMESTAMP_MILLIS. TIMESTAMP_MICROS is also supported in newer versions of Parquet I believe. I didn't do anything with INT96 but it looks like maybe this is getting deprecated anyway - https://issues.apache.org/jira/browse/PARQUET-323 |
669c012
to
1168491
Compare
@findepi So for Parquet, my understanding is TIMESTAMP_MILLIS is a logical type annotation for a INT64 to indicate milliseconds since epoch - https://github.com/apache/parquet-format/blob/master/LogicalTypes.md - although there's this whole Impala article about the nuances of this so I may be missing some quirks there -https://www.cloudera.com/documentation/enterprise/5-9-x/topics/impala_timestamp.html I may back out the timestamp changes then though because I was going off of TIMESTAMP in Presto = Instant (UTC) which the linked issue talking about how those semantics are incorrect. It seems like then the corresponding Presto type should be TIMESTAMP_WITH_TIME_ZONE with the zone being UTC however Hive doesn't support that as a column type yet (https://issues.apache.org/jira/browse/HIVE-14412). Then if I try to create a table in Hive with just a timestamp column backed by Parquet's INT64 TIMESTAMP_MILLIS, Presto won't push down the predicate because of the cast that has to be issued e.g.:
Since Presto ends up having to wrap timestamp_col with cast(timestamp_col AS timestamp with time zone) and then this doesn't get pushed down in the effective predicate. |
Yeah, if Hive's As I mentioned before though this won't actually work until the Hive mapping change takes place because the predicate doesn't get pushed down since Presto ends up performing a cast then on the Hive timestamp column into timestamp with time zone. It doesn't really break anything though, just the pushdown portion doesn't trim any data. To test this out I changed the HiveType to Presto |
@findepi I took a look at the changes in #9385 although I saw it's been superseded by #10193 (removes the Hive changes though for now). It looks like via Guessing this stuff may be influx still but just wondering because I could back out the timestamp changes for now so the date changes could get merged (I don't see an issue with this for dates at least). |
@ryanrupp thank you for taking your time to look into this. #10193 doesn't ship Hive changes for the very reason they are not in their final shape yet.
That might be safer way to do. However, please note that I'm fluent in parquet file handling, i'd defer to @nezihyigitbasi for an ultimate decision here. |
The changes to the timestamp types are in progress and it may take some time to have the issue completely fixed. So, I think it makes sense to first get this patch in with only fixing the problem with the |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I will take another look once you update the PR to have the fix for only the DATE
type.
presto-hive/src/main/java/com/facebook/presto/hive/parquet/ParquetTypeUtils.java
Outdated
Show resolved
Hide resolved
...ve/src/main/java/com/facebook/presto/hive/parquet/predicate/TupleDomainParquetPredicate.java
Outdated
Show resolved
Hide resolved
...ve/src/main/java/com/facebook/presto/hive/parquet/predicate/TupleDomainParquetPredicate.java
Outdated
Show resolved
Hide resolved
...ve/src/main/java/com/facebook/presto/hive/parquet/predicate/TupleDomainParquetPredicate.java
Outdated
Show resolved
Hide resolved
f90813c
to
0ee54b7
Compare
@nezihyigitbasi I backed out the As a seperate follow up I was going to try to catch up and see what the current state of the |
@@ -169,7 +171,7 @@ public static Type getPrestoType(TupleDomain<ColumnDescriptor> effectivePredicat | |||
case DOUBLE: | |||
return DoubleType.DOUBLE; | |||
case INT32: | |||
return createDecimalType(descriptor).orElse(IntegerType.INTEGER); | |||
return getInt32Type(descriptor); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
why INT64
and INT32
cases are handled differently now?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is because Parquet has Logical Types where additional metadata in originalType
determines how an int32
can be interpreted. In this case, it's checking if the int32 was configured to be interpreted as a date (number of days since Unix epoch which matches the semantics of the Presto Date
type).
Before this change if I had a query like:
select count(*) from my_table where date_col = DATE '2018-01-01'
The query would fail with a type mismatch exception about comparing a date
value to integer
because the logical type wasn't considered. There's a similar issue related to this here - #11118 - which is more involved as I was only targeting the date
pushdown currently.
I backed out the timestamp changes but the case statement for int64
in the future would be adjusted in a similar way to consider logical types e.g. an int64 can be logically typed as a timestamp
where the value is in epoch millis, micro or nanos.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Added some unit tests in 1f78d8e3c0dfa4ac8e9ad7a8e2b9f53409a55e05 for this in case this changes in the future.
IntStatistics intStatistics = (IntStatistics) statistics; | ||
// ignore corrupted statistics | ||
if (intStatistics.genericGetMin() > intStatistics.genericGetMax()) { | ||
return Domain.create(ValueSet.all(type), hasNullValue); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@nezihyigitbasi do you know the reason why we silently ignore something? I see we do this already for other cases above.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I don't exactly remember the history behind that. @zhenxiao do you remember why we are silently ignoring corrupt stats?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
as far as I remembered, if the statistics is obviously incorrect/corrupted(in this case, min > max), we silently do not leverage statistics to evaluate predicates.
it should not error out, as if stats are incorrect/corrupted, we could still scan data to complete the query, just lost a potential speedup.
maybe adding a log message here? @findepi @nezihyigitbasi what do you think?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
In general I prefer failing fast & loudly for such cases as that may uncover bugs/issues in whoever has written those files/stats.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
get it. fail loudly is better. I will file following task to fix it
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
here it is: #12036
@nezihyigitbasi could this get another review? I'm not sure how to indicate the review open issues were followed up on, may be because I force pushed out the timestamp change so github is just referencing a commit that no longer exists. Thanks |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM.
presto-parquet/src/main/java/com/facebook/presto/parquet/ParquetTypeUtils.java
Outdated
Show resolved
Hide resolved
presto-parquet/src/test/java/com/facebook/presto/parquet/TestParquetTypeUtils.java
Outdated
Show resolved
Hide resolved
presto-parquet/src/test/java/com/facebook/presto/parquet/TestParquetTypeUtils.java
Outdated
Show resolved
Hide resolved
presto-parquet/src/test/java/com/facebook/presto/parquet/TestTupleDomainParquetPredicate.java
Outdated
Show resolved
Hide resolved
@ryanrupp I think we can squash the commits into a single commit as I don't think we need a separate commit for the unit tests. Let me know once you address the comments and we can merge this. Thanks! |
This additionally fixes an issue that filtering against a DATE column would result in an exception due to type mismatch. DATE support is for Parquet INT32 columns logically typed as DATE.
1f78d8e
to
7626073
Compare
@nezihyigitbasi thanks for the review. I made the follow up changes and squashed the commits. |
Merged, thanks @ryanrupp. |
Hi All, I am still facing this issue. Please confirm how to resolve this ? Do i need to change anything in code or is there a new jar availalbe with the fix for this ? Error ? Caused by: java.sql.SQLException: Query failed (#20181217_165822_00916_nh8w3): Error opening Hive split /user/test/file_path/part-00000-830904de-7b0e-41c0-88e5-40d3990689e5-c000.snappy.parquet (offset=0, length=7333): Mismatched Domain types: date vs integer |
@sa255304 I recently ran into the same issue using Presto version 0.211. According to the code, the fix has been released in version 0.215. In version 0.215, the fix works fine. Please upgrade to Presto to version 0.215. |
thanks for your response. I am using presto 0.212 by the time i posted my problem. I am Reading my presto table data to pyspark code and trying to do date comparisons with spark sql . Then i got that issue. my EMR Cluster is currently having presto 0.214 version. Still, I tried to pass presto 0.215 jars as an external dependency to pyspark code. The issue still exists. Is there any alternative for this issue? |
You need version 0.215 to get the fix. You should talk to the EMR team for EMR-specific questions. |
…dictionary (plain). Fix an issue where during existing pushdown this would throw an exception for date to integer domain comparison.
This fixes an issue where because Parquet dates are stored as int32 with logical typing of "date", Presto was detecting that the column was an Integer. Then trying to do comparisons against a Date value the domain comparison would throw an exception:
Mismatched Domain types: date vs integer
This would happen on queries such as:
select count(*) from my_table where date_field > DATE '2018-01-01'
Where the left hand side column was detected as an integer and the right hand side literal was a date. Resolved by inspecting Parquet column data for INT32 to detect if the original type was a DATE.