
Refactor HiveMetastore API and Implementations #4748

Merged · 1 commit merged into trinodb:master on Aug 27, 2020

Conversation

@rash67 (Member) commented Aug 8, 2020:

This commit changes the HiveMetastore API to remove the methods
getPartitionNames() and getPartitionNamesByParts() and replace them with
getPartitionNamesByFilter(). Each implementation is updated, as are all
call sites. The previous behavior is preserved: any non-equality Domain
value is translated to the wildcard value.
A subsequent commit will implement this for the GlueHiveMetastore and
pass the supported translation of the Domain per column using the
GetPartitions API in the Glue Data Catalog.
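The translation described above can be illustrated with a minimal sketch. This is not the PR's actual code: the class, the method, and the use of a plain `Map` in place of Trino's `TupleDomain` are all hypothetical simplifications. It shows only the core idea that each partition column's domain becomes either its single exact value or the wildcard `""`.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.Optional;

// Hypothetical simplification of the commit's wildcard translation:
// columns constrained to a single value keep that value; any column whose
// domain is absent or not a single equality becomes the wildcard "".
public class PartitionFilterSketch
{
    public static final String WILDCARD = "";

    // `exactValues` stands in for TupleDomain: only single-value equality
    // constraints appear in the map.
    public static List<String> toPartitionFilter(List<String> columnNames, Map<String, String> exactValues)
    {
        List<String> parts = new ArrayList<>();
        for (String column : columnNames) {
            parts.add(Optional.ofNullable(exactValues.get(column)).orElse(WILDCARD));
        }
        return parts;
    }
}
```

For example, columns `["ds", "region"]` with only `ds` pinned to a value would translate to `["2020-08-08", ""]`, matching every region under that date.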

@cla-bot cla-bot bot added the cla-signed label Aug 8, 2020
@rash67 rash67 force-pushed the hive-metastore-refactor branch 4 times, most recently from cfed8bc to efa50d4 Compare August 10, 2020 23:03
@rash67 (Member Author) commented Aug 10, 2020:

fix merge conflict and resulting compiler errors; minor format fix

@rash67 rash67 force-pushed the hive-metastore-refactor branch 2 times, most recently from 2ca72a4 to fa91389 Compare August 12, 2020 01:59
@rash67 (Member Author) commented Aug 12, 2020:

debugging failed tests; will update today

@rash67 rash67 force-pushed the hive-metastore-refactor branch 6 times, most recently from dfa4ff4 to c906c6e Compare August 14, 2020 21:20
@rash67 (Member Author) commented Aug 14, 2020:

split the single property

hive.assume-canonical-partition-keys

out of HiveConfig into per-metastore config

also debugging various test failures

@rash67 rash67 force-pushed the hive-metastore-refactor branch from c906c6e to 9b5a5d4 Compare August 14, 2020 21:34
@findepi findepi requested a review from electrum August 17, 2020 13:58
@electrum (Member) left a comment:

Still reviewing, but some initial comments. Overall looks good. Great work.

* See the License for the specific language governing permissions and
* limitations under the License.
*/
package io.prestosql.plugin.hive.spi.block;
Member:

Move to io.prestosql.plugin.hive.util package

Member Author:

done

* See the License for the specific language governing permissions and
* limitations under the License.
*/
package io.prestosql.plugin.hive.spi.block;
Member:

Move to io.prestosql.plugin.hive.util package

@@ -124,6 +128,11 @@ public void configure(Binder binder)

jsonBinder(binder).addDeserializerBinding(Type.class).to(TypeDeserializer.class);

binder.bind(HiveBlockEncodingSerde.class).in(Scopes.SINGLETON);
Member:

We shouldn't bind RecordingHiveMetastore here unconditionally.

There are three duplicated bindings for RecordingHiveMetastore in ThriftMetastoreModule and GlueMetastoreModule. Existing duplicated code:

binder.bind(HiveMetastore.class)
        .annotatedWith(ForCachingHiveMetastore.class)
        .to(RecordingHiveMetastore.class)
        .in(Scopes.SINGLETON);
binder.bind(RecordingHiveMetastore.class).in(Scopes.SINGLETON);
newExporter(binder).export(RecordingHiveMetastore.class).withGeneratedName();

Create a RecordingHiveMetastoreModule and move that plus this code into the new module. Then install the module in those metastore modules.

Member:

relevant?

Member:

We now need additional code when binding RecordingHiveMetastore, so it makes sense to extract the module for it.

@@ -297,5 +296,21 @@ public static void main(String[] args)
Thread.sleep(10);
log.info("======== SERVER STARTED ========");
log.info("\n====\n%s\n====", queryRunner.getCoordinator().getBaseUrl());

Session session = createSession(Optional.empty());
Member:

Make sure to remove later

Member Author:

oh shoot, yes, done

@electrum electrum self-requested a review August 18, 2020 22:02
@electrum (Member) left a comment:

Looks great. Remove WIP and "closes xxx" from the commit message, then we can merge it.

@@ -266,70 +241,17 @@ public static boolean partitionMatches(List<HiveColumnHandle> partitionColumns,
return true;
}

private List<String> getFilteredPartitionNames(SemiTransactionalHiveMetastore metastore, HiveIdentity identity, SchemaTableName tableName, List<HiveColumnHandle> partitionKeys, TupleDomain<ColumnHandle> effectivePredicate)
private List<String> getFilteredPartitionNames(
Member:

It looks like this method signature didn't change, so keep the original formatting for the arguments.

/**
* @param columnNames
* @param partitionKeysFilter
* @param assumeCanonicalPartitionKeys allow conversion of
Member:

This seems to be missing something. Maybe remove all the @param and just explain what the method does.

}

/**
* @param columnNames
Member:

We avoid having empty/useless documentation for obvious parameters.

Member Author:

my mistake, i left this half-finished

checkArgument(partitionKeysFilter.getDomains().isPresent());

Map<String, Domain> domainMap = partitionKeysFilter.getDomains().orElse(ImmutableMap.of());
List<String> partitionList = columnNames.stream().map(cn -> domainToString(domainMap.get(cn), assumeCanonicalPartitionKeys, HIVE_PARTITION_VALUE_WILDCARD)).collect(toImmutableList());
Member:

Nit: wrap stream operations

List x = stream()
        .map(...)
        .collect(...);

Member Author:

will do. i'll check my other uses of stream()

if (domain != null && domain.isNullableSingleValue()) {
return sqlScalarToString(domain.getType(), domain.getNullableSingleValue(), assumeCanonicalPartitionKeys, partitionWildcardString);
}
else { // null or not a single value
Member:

Nit: else is redundant
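The nit can be shown with a generic example (illustrative only, not the PR's actual code): once the `if` branch returns, the `else` wrapper adds nothing, so the fall-through case can sit at the top level of the method.

```java
// Illustrative sketch of the "redundant else" cleanup: the first branch
// returns, so the remaining case needs no `else` block.
public class ElseRedundant
{
    public static String domainToString(String singleValue, String wildcard)
    {
        if (singleValue != null) {
            return singleValue;
        }
        // null or not a single value: no `else` needed after a return
        return wildcard;
    }
}
```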

{
this.client = requireNonNull(client);
this.assumeCanonicalPartitionKeys = requireNonNull(metastoreConfig).isAssumeCanonicalPartitionKeys();
Member:

No need for requireNonNull since the . access will throw NPE anyway

Member Author:

ah good point

if (partitionKeysFilter.isNone()) {
return Optional.of(ImmutableList.of());
}
Optional<List<String>> parts = partitionKeyFilterToStringList(columnNames, partitionKeysFilter, assumeCanonicalPartitionKeys);
Member:

We shouldn't need the assumeCanonicalPartitionKeys config or filtering here at all, as filtering by the metastore is best effort. This isn't being pushed down, so we can let the caller do it.

Member Author:

I'm not sure I follow. previously, this information was indeed used by the caller to generate a List<String> parts that included the information. it would be a "regression" in the sense the caller would get a larger list.

are you saying just pass false and let the caller deal with it to avoid having this config for AlluxioHiveMetastore ?

the Glue and Thrift impls will still need it

Member Author:

talked on slack and i get this, i've made the change

Member:

isn't assumeCanonicalPartitionKeys applicable to some metastores and not the others -- eg depending whether the call is over string representation or interpreted values?

for example it may turn out to be HMS- (and file-?) metastore specific?
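The "best effort" contract under discussion can be sketched generically. All names here are hypothetical, not Trino APIs: the point is only that a metastore's filtered listing may return a superset of the matching partition names, so the caller always re-applies the exact predicate itself and the metastore-side filter is merely an optimization.

```java
import java.util.List;
import java.util.function.Predicate;
import static java.util.stream.Collectors.toList;

// Sketch of best-effort metastore filtering: the hint result may contain
// extra names; correctness comes from the caller's own re-filtering.
public class BestEffortFiltering
{
    public static List<String> getMatchingPartitions(
            List<String> metastoreHintResult, // possibly an unfiltered superset
            Predicate<String> exactPredicate)
    {
        return metastoreHintResult.stream()
                .filter(exactPredicate)
                .collect(toList());
    }
}
```

Under this contract an implementation like AlluxioHiveMetastore can ignore a filter it cannot express, which is why the config knob is unnecessary there.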

{
return delegate.getPartitionNamesByParts(identity, databaseName, tableName, parts);
if (partitionKeysFilter.isAll()) {
return delegate.getPartitionNames(identity, databaseName, tableName);
Member:

Should we drop the getPartitionNames() method from the ThriftMetastore interface? Then this can be a straight delegation.

Member Author:

I thought about this, but I didn't know what to pass to thrift to say "all". This should, in theory, be just TupleDomain.all(), but i hit some snags

i can look at it again

Member Author:

I made this change and I'm waiting for a new CI (pushed right now) to run to verify that my translation of TupleDomain.all() to a list of "" works. it should as this was the previous behavior.

Member Author:

I removed this and what happens is that once it's down to the thrift API, it seems like any value for a partitionKey in the 'parts' api that is "" (our wildcard value) gets removed. the parts api then fails when the list is empty

we might be able to solve this when we translate it to an expression, but for now, i think i need to restore the previous special case
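The special case being restored can be sketched as a simple guard (hypothetical names, not the PR's code): since `""` is the wildcard and the Thrift parts-based call drops empty values, a translation where every part is a wildcard must fall back to listing all partition names rather than issuing a parts call with an empty list.

```java
import java.util.List;

// Sketch of the TupleDomain.all() special case: an all-wildcard parts list
// cannot be sent to the Thrift parts API (empty strings are dropped and the
// call fails on an empty list), so detect it and list all names instead.
public class WildcardFallback
{
    public static boolean needsListAllFallback(List<String> parts)
    {
        // "" is the wildcard value for a partition key
        return parts.stream().allMatch(String::isEmpty);
    }
}
```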

@tooptoop4 (Contributor):

does this change mean a higher minimum version of hive is required?

@electrum (Member):

@tooptoop4 No, the subsequent PR that adds support for the new API will include a fallback for the old API.

@@ -635,34 +652,17 @@ public synchronized void truncateUnpartitionedTable(ConnectorSession session, St
}
// add newly-added partitions to the results from underlying metastore
if (!partitionActionsOfTable.isEmpty()) {
List<String> columnNames = table.get().getPartitionColumns().stream().map(Column::getName).collect(Collectors.toList());
// the top of the function guards against TupleDomain.none() so this cannot be empty
Member:

"this" means what here?

Member Author:

removing this comment, holdover from when this still did filtering. it referred to the columnNames

I'll remove the comment

@rash67 rash67 changed the title Refactor HiveMetastore API and Implementations (WIP) Refactor HiveMetastore API and Implementations Aug 21, 2020
@rash67 rash67 force-pushed the hive-metastore-refactor branch 3 times, most recently from 0ce0a8d to bb6119e Compare August 22, 2020 00:25
@rash67 (Member Author) commented Aug 22, 2020:

I just pushed the latest update that should address all comments. i'll make one last pass locally, though
(and verify CI passes)

@rash67 rash67 force-pushed the hive-metastore-refactor branch from bb6119e to 39faca6 Compare August 24, 2020 19:51
@rash67 rash67 force-pushed the hive-metastore-refactor branch 8 times, most recently from a8327ae to c24a983 Compare August 25, 2020 23:38
* @param partitionKeysFilter map of filters (Domain) for each partition column
* @return optionally, a list of strings where each entry is in the form of {key}={value}
* @see TupleDomain
*/

Member:

Remove blank line before Javadoc

if (partitionKeysFilter.isNone()) {
return Optional.of(Collections.emptyList());
}
Optional<List<String>> optionalParts = partitionKeyFilterToStringList(columnNames, partitionKeysFilter, false);
Member:

We can remove all this client side filtering here since it's only a hint. So I think the whole method can be

return ProtoUtils.toPartitionInfoList(
        client.readTable(databaseName, tableName, Constraint.getDefaultInstance()));

Member Author:

got it. it looks like I need a mapping of PartitionInfo::getPartitionName as well

@@ -69,7 +72,7 @@
private TableMasterClient client;

@Inject
public AlluxioHiveMetastore(TableMasterClient client)
public AlluxioHiveMetastore(TableMasterClient client, AlluxioHiveMetastoreConfig metastoreConfig)
Member:

The config is not needed

* See the License for the specific language governing permissions and
* limitations under the License.
*/
package io.prestosql.plugin.hive.spi.block;
Member:

Move to io.prestosql.plugin.hive.util package

import static com.google.common.base.Preconditions.checkArgument;
import static java.nio.charset.StandardCharsets.UTF_8;

// This class is exactly the same as BlockEncodingManager. They are in SPI and don't have access to InternalBlockEncodingSerde.
Member:

Remove the second sentence as it doesn't really fit here

Member Author:

i thought i did these, will update

throws Exception
{
Optional<List<String>> ds = MetastoreUtil.partitionKeyFilterToStringList(Arrays.asList("ds1", "ds2", "ds3"), TupleDomain.all(), false);
System.err.println(ds.get());
Member:

Remove the printing. Probably add an assertion or remove the entire test method.

Member Author:

oops again, yea i'll assert or remove it

@rash67 rash67 force-pushed the hive-metastore-refactor branch 4 times, most recently from ecfd87c to a167bba Compare August 27, 2020 00:13
This commit changes the HiveMetastore API to remove the methods
getPartitionNames() and getPartitionNamesByParts() and replace them with
getPartitionNamesByFilter(). Each implementation is updated, as are all
call sites. The previous behavior is preserved: any non-equality Domain
value is translated to the wildcard value.
A subsequent commit will implement translation of a Domain into a
proper Glue expression for the GlueHiveMetastore GetPartitions API in
the Glue Data Catalog.
@rash67 rash67 force-pushed the hive-metastore-refactor branch from a167bba to 43843b2 Compare August 27, 2020 01:24
@electrum electrum merged commit 975bd66 into trinodb:master Aug 27, 2020
@electrum (Member):

Thanks!

@rash67 rash67 deleted the hive-metastore-refactor branch September 3, 2020 01:28
return retry()
.stopOn(NoSuchObjectException.class)
.stopOnIllegalExceptions()
.run("getPartitionNamesByParts", stats.getGetPartitionNamesByParts().wrap(() -> {
try (ThriftMetastoreClient client = createMetastoreClient(identity)) {
return Optional.of(client.getPartitionNamesFiltered(databaseName, tableName, parts));
return Optional.of(client.getPartitionNamesFiltered(databaseName, tableName, parts.get()));
Member:

the "getPartitionNamesByParts", stats.getGetPartitionNamesByParts()... above should be renamed too.

Member:

nvrm, i see this is pre-existing... still worth updating
