Use Hive Metastore API for listing partitions that allows range (and other) predicates on the partition keys #611
Comments
This is a good idea. Do you know what Hive version these were added in? Hopefully they’re old enough to be available everywhere, otherwise we’ll need some fallback code (catch remote “no such method” error and call the old one). |
If I just go by when the API was added, then that was 9 years ago. |
We have explored this a bit, but I think we should probably avoid using
There are a few potential issues:
We should probably try |
The problem with With the existing Presto fetches partition names while planning, then fetches the full partition metadata iteratively during split generation as the query executes. The engine provides the connector with a I now remember looking into this in the past and coming to the above conclusion that there was no better API that does what we need (list partition names with a filter). However, they recently added We should check if this API is available in CDH 5 and HDP 2. |
CDH 5 seems to be on Hive 1.1.0 plus patches, and the Jira you linked does not seem to be in the list of patches. HDP-2.6.5, which is the latest release of HDP-2, also only ships Hive 2.1.0. No matter when we decide to move to a newer API, I believe we will have to make it config-enabled to support organizations that are running older versions of Hive servers, so I do not see a reason not to implement this right now and make it config-enabled. |
I agree, there's no reason not to implement it now. We can do it without config if we make the fallback transparent. Have a boolean flag defaulting to true to indicate it is supported, then set it to false if we get a "no such method" error from the remote Thrift server. This means the first request after a server restart will take slightly longer, but this shouldn't cause any problems. |
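The transparent fallback described above can be sketched roughly as follows. This is only an illustration: `FallbackPartitionLister`, `UnknownMethodException`, and the two `Supplier` arguments are hypothetical stand-ins, not the actual Presto or Thrift classes, and how the remote "no such method" error is really detected is left as an assumption.

```java
import java.util.List;
import java.util.concurrent.atomic.AtomicBoolean;
import java.util.function.Supplier;

class FallbackPartitionLister {
    /** Hypothetical stand-in for the remote "no such method" error. */
    static class UnknownMethodException extends RuntimeException {
        UnknownMethodException(String message) {
            super(message);
        }
    }

    // Defaults to true: assume the newer API exists until the server says otherwise.
    private final AtomicBoolean newApiSupported = new AtomicBoolean(true);
    private final Supplier<List<String>> newApi; // assumed call to the newer filter-based API
    private final Supplier<List<String>> oldApi; // assumed call to the existing API

    FallbackPartitionLister(Supplier<List<String>> newApi, Supplier<List<String>> oldApi) {
        this.newApi = newApi;
        this.oldApi = oldApi;
    }

    List<String> listPartitionNames() {
        if (newApiSupported.get()) {
            try {
                return newApi.get();
            } catch (UnknownMethodException e) {
                // Remember the failure so later requests skip the newer API entirely.
                newApiSupported.set(false);
            }
        }
        return oldApi.get();
    }

    boolean isNewApiSupported() {
        return newApiSupported.get();
    }
}
```

With this shape, only the first request against an older server pays for the failed call; every later request goes straight to the old API.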
Agreed with @electrum. As long as we have a fallback mechanism it should not break the compatibility. Hive adopts a similar strategy where a
Here also @tooptoop4 |
I'm taking a look at this right now |
After a discussion with @electrum we decided to break this into pieces
I'm tidying up a PR for #1 and will attach today for feedback (consider WIP, still testing). It can be reviewed and committed separately, or bundled with the next steps. However, if the Glue implementation is the desired feature, I would recommend implementing and testing that without #3 first. |
I'm debugging some CI test failures with this commit, but if anyone wants to start providing feedback on the refactor, that would be helpful. Once I've sorted out any test issues, I'll begin work on the HiveGlueMetastore implementation that translates TupleDomain -> a filter string. I'll do the BridgingHiveMetastore/ThriftMetastore translation in the first commit after the refactor, which will likely push any Domain -> wildcard translation into each HMS impl. |
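As a rough illustration of what translating a domain into a filter string could look like, here is a minimal sketch. It assumes a SQL-like filter syntax with `>=`, `<`, and `AND`, and it models each column's domain as a single half-open range. `PartitionFilterBuilder` and `ColumnRange` are hypothetical names, not the classes from the PR, and the real TupleDomain is considerably richer.

```java
import java.util.ArrayList;
import java.util.List;

class PartitionFilterBuilder {
    /** Hypothetical, simplified stand-in for one column's domain: the range [low, high). */
    static final class ColumnRange {
        final String column;
        final String low;  // inclusive lower bound, or null if unbounded
        final String high; // exclusive upper bound, or null if unbounded

        ColumnRange(String column, String low, String high) {
            this.column = column;
            this.low = low;
            this.high = high;
        }
    }

    /** Joins one comparison per bound with AND; unbounded columns contribute nothing. */
    static String toFilter(List<ColumnRange> ranges) {
        List<String> conjuncts = new ArrayList<>();
        for (ColumnRange range : ranges) {
            if (range.low != null) {
                conjuncts.add(range.column + " >= " + quote(range.low));
            }
            if (range.high != null) {
                conjuncts.add(range.column + " < " + quote(range.high));
            }
        }
        return String.join(" AND ", conjuncts);
    }

    // The escaping rules of the target filter language are not assumed here,
    // so values containing a quote are rejected rather than escaped.
    private static String quote(String value) {
        if (value.contains("'")) {
            throw new IllegalArgumentException("unsupported value: " + value);
        }
        return "'" + value + "'";
    }
}
```

An empty result means no predicate could be pushed down, in which case the caller would list all partitions and filter on the engine side.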
I'm pushing the conversion to List into implementations now and will begin work on the Glue impl + test cases after that. |
Pushed the conversion of TupleDomain -> List into each HiveMetastore impl. Next: GlueHiveMetastore implementation + tests. |
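For readers following along, the conversion into a list of partition values might look roughly like the sketch below. It assumes that an empty string acts as a wildcard for a partition key and that a key can only be pinned when its domain holds exactly one value. `PartitionValueWildcards` and the `Map`-based domain are hypothetical simplifications, not the code in the PR.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.Set;

class PartitionValueWildcards {
    // Assumption: the metastore treats an empty string as "match any value".
    static final String WILDCARD = "";

    /**
     * Produces one entry per partition column, in column order. A column constrained
     * to exactly one value yields that value; anything else (unconstrained, multiple
     * values) degrades to a wildcard and must be filtered by the engine afterwards.
     */
    static List<String> toPartValues(List<String> partitionColumns, Map<String, Set<String>> allowedValues) {
        List<String> parts = new ArrayList<>();
        for (String column : partitionColumns) {
            Set<String> values = allowedValues.get(column);
            if (values != null && values.size() == 1) {
                parts.add(values.iterator().next());
            } else {
                parts.add(WILDCARD);
            }
        }
        return parts;
    }
}
```

This also shows why non-equality predicates are lost with this style of API: a column with two allowed values becomes a wildcard, so the metastore returns every partition for that key.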
See the comments in the PR for progress. The PR/diff is around 1k, but it includes some copy-and-pasted files (SerDe-related code). |
I'm working on testing out the |
Has this improvement been implemented for non-Glue Hive? @rash67 |
See thread https://prestosql.slack.com/archives/CFLB9AMBN/p1554842107205400.
Currently, HivePartitionManager uses the getPartitionNamesByParts API from the metastore, which only allows a single value per partition predicate. This means any non-equality partition predicate cannot be pushed down to the metastore, which results in 2 problems:
Newer versions of Hive introduce new APIs to avoid this issue. We should look into moving to get_partitions_by_expr or get_partitions_by_filter for types that are supported by these APIs.