Respect hive.metastore.thrift.delete-files-on-drop config property for dropping partitions #13822

Merged
ebyhr merged 3 commits into trinodb:master from ebi/hive-delete-files-on-drop
Oct 12, 2022

@ebyhr
Member

ebyhr commented Aug 24, 2022

Description

Respect hive.metastore.thrift.delete-files-on-drop config property for dropping partitions
Fixes #13545 Fixes #13821

Documentation

(x) Sufficient documentation is included in this PR.

Release notes

(x) Release notes entries required with the following suggested text:

# Hive
* Respect `hive.metastore.thrift.delete-files-on-drop` config property for dropping partitions. ({issue}`13545`)

@alexjo2144
Member

The Hive connector doc needs an update if we're going to do this. Specifically:

system.unregister_partition(schema_name, table_name, partition_columns, partition_values)

Unregisters given, existing partition in the metastore for the specified table. The partition data is not deleted.

@ebyhr
Member Author

ebyhr commented Sep 12, 2022

I don't think we need to update the unregister_partition docs, since the deleteData argument is false there. I will add a test for the procedure, though.

metastore.dropPartition(
        session,
        table.getDatabaseName(),
        table.getTableName(),
        partition.getValues(),
        false); // deleteData
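
To illustrate the point about the `deleteData` argument, here is a minimal sketch, not the actual Trino implementation. It assumes that files are removed only when both the caller's `deleteData` flag and the `hive.metastore.thrift.delete-files-on-drop` setting are true; the class `DropPartitionSketch` and its method are hypothetical names.

```java
// Hypothetical sketch; not the Trino implementation.
final class DropPartitionSketch
{
    // Stands in for the hive.metastore.thrift.delete-files-on-drop config property.
    private final boolean deleteFilesOnDrop;

    DropPartitionSketch(boolean deleteFilesOnDrop)
    {
        this.deleteFilesOnDrop = deleteFilesOnDrop;
    }

    // Assumption: files are deleted only if the caller asks for it AND the config allows it.
    boolean shouldDeleteFiles(boolean deleteData)
    {
        return deleteFilesOnDrop && deleteData;
    }

    public static void main(String[] args)
    {
        DropPartitionSketch metastore = new DropPartitionSketch(true);
        // A caller passing deleteData = false, as unregister_partition does above.
        System.out.println(metastore.shouldDeleteFiles(false));
        // A caller passing deleteData = true.
        System.out.println(metastore.shouldDeleteFiles(true));
    }
}
```

Under that assumption, a caller that passes `false` never triggers file deletion, regardless of the config property.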

@ebyhr ebyhr force-pushed the ebi/hive-delete-files-on-drop branch from f7ff00b to 577addc on September 13, 2022 04:48
import static java.nio.charset.StandardCharsets.UTF_8;
import static java.util.Objects.requireNonNull;

public class TestHiveThriftMetastoreWithS3
Member Author

Using the normal test environment because, for some reason, I can't reproduce the issue in the product test environment.

@ebyhr ebyhr force-pushed the ebi/hive-delete-files-on-drop branch from 577addc to eceb235 on September 27, 2022 12:42
client.dropPartition(databaseName, tableName, parts, deleteData);
String partitionLocation = partition.getSd().getLocation();
Member

Should we wrap the file deletion in a separate try/catch?

I'm just thinking: if the partition is dropped from the metastore but the file system doesn't allow us to delete the files for some reason, the query should still succeed, right?

Member Author

I feel users should disable it in such cases. This feature can be controlled by a config property, so throwing an exception looks better in my opinion.

Member

I think it depends on what the behavior is for a following select statement. If I run a drop partition and it fails, the partition should still show up in the next query. If it doesn't show up, the drop should succeed. Does that make sense?

Member Author

I missed that the subsequent deleteDirRecursive call already has a try-catch block inside the method. Do you think it still requires an additional change?

Member

You're totally right. This is fine
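
The resolution of this thread relies on the file deletion catching its own errors, so that a file-system failure does not fail the drop once the metastore entry is gone. A minimal sketch of that pattern follows. It uses `java.nio.file` on the local file system as a stand-in, since the actual `deleteDirRecursive` method is not shown in this thread; `BestEffortDelete` and `deleteDirRecursiveQuietly` are hypothetical names.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.Comparator;
import java.util.List;
import java.util.stream.Collectors;
import java.util.stream.Stream;

// Hypothetical sketch of a best-effort recursive delete.
final class BestEffortDelete
{
    private BestEffortDelete() {}

    // Returns true if the directory tree was removed, false if deletion failed.
    // Failures are swallowed so the caller (the metadata drop) can still succeed.
    static boolean deleteDirRecursiveQuietly(Path directory)
    {
        try {
            List<Path> deepestFirst;
            try (Stream<Path> paths = Files.walk(directory)) {
                // Reverse order puts children before their parent directories.
                deepestFirst = paths.sorted(Comparator.reverseOrder()).collect(Collectors.toList());
            }
            for (Path path : deepestFirst) {
                Files.delete(path);
            }
            return true;
        }
        catch (IOException | RuntimeException e) {
            // Placeholder logging; a real implementation would use its own logger.
            System.err.println("Failed to delete " + directory + ": " + e);
            return false;
        }
    }
}
```

With this shape, the caller can drop the partition from the metastore first and treat a `false` result as a warning rather than a query failure.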

@findinpath
Contributor

There are several CI failures.

@findepi
Member

findepi commented Sep 28, 2022

> There are several CI failures.

Added empty commit to retrigger the build after #14323 merged.

@ebyhr ebyhr force-pushed the ebi/hive-delete-files-on-drop branch from 6af2c70 to db0758b on October 5, 2022 09:31
@ebyhr
Member Author

ebyhr commented Oct 5, 2022

CI hit #14441

@ebyhr ebyhr requested review from findepi and alexjo2144 October 5, 2022 22:35
.setBucketName(writableBucket)
.setPopulateTpchData(false)
.setThriftMetastoreConfig(new ThriftMetastoreConfig().setDeleteFilesOnDrop(true))
.setHiveProperties(ImmutableMap.of("hive.allow-register-partition-procedure", "true"))
Member

redundant?

Member Author

It's not redundant. ThriftMetastoreConfig.setDeleteFilesOnDrop corresponds to hive.metastore.thrift.delete-files-on-drop, not hive.allow-register-partition-procedure.
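
To make the distinction concrete, the two settings can be written as separate catalog properties. This is only a sketch: the property names come from this thread, while the class `CatalogPropertiesSketch`, its method, and the per-property comments are illustrative assumptions.

```java
import java.util.Map;

// Hypothetical helper that lists the two independent Hive catalog properties.
final class CatalogPropertiesSketch
{
    private CatalogPropertiesSketch() {}

    static Map<String, String> hiveCatalogProperties()
    {
        return Map.of(
                // Whether files are deleted on drop through the Thrift metastore
                // (what ThriftMetastoreConfig.setDeleteFilesOnDrop(true) sets).
                "hive.metastore.thrift.delete-files-on-drop", "true",
                // Separately allows the register/unregister partition procedures.
                "hive.allow-register-partition-procedure", "true");
    }
}
```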

@@ -255,7 +262,9 @@ public DistributedQueryRunner build()
queryRunner.createCatalog(HIVE_CATALOG, "hive", hiveProperties);
queryRunner.createCatalog(HIVE_BUCKETED_CATALOG, "hive", hiveBucketedProperties);

populateData(queryRunner, metastore);
if (populateTpchData) {
Member

else validate initialTables is empty, since it's ignored otherwise

however, a better way to do this would be to call the parameter createTpchSchemas;
then createTpchSchemas(false) would skip schema creation and initialTables(empty) (the default) would skip table creation

Member Author

Renamed to createTpchSchemas.
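
The review suggestion in this thread can be sketched as a builder in which schema creation and table population are controlled independently, with a validation for the combination that would otherwise be silently ignored. Everything here is hypothetical: the class `QueryRunnerBuilderSketch` only borrows the names `createTpchSchemas` and `initialTables` from the discussion, and `build()` returns a list of planned steps instead of starting a query runner.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical builder; names only mirror the ones mentioned in the review.
final class QueryRunnerBuilderSketch
{
    private boolean createTpchSchemas = true;
    private List<String> initialTables = List.of();

    QueryRunnerBuilderSketch setCreateTpchSchemas(boolean createTpchSchemas)
    {
        this.createTpchSchemas = createTpchSchemas;
        return this;
    }

    QueryRunnerBuilderSketch setInitialTables(List<String> initialTables)
    {
        this.initialTables = List.copyOf(initialTables);
        return this;
    }

    // Returns the setup steps that would run, instead of touching a real cluster.
    List<String> build()
    {
        // Tables cannot be copied into schemas that were never created.
        if (!createTpchSchemas && !initialTables.isEmpty()) {
            throw new IllegalStateException("initialTables must be empty when createTpchSchemas is false");
        }
        List<String> steps = new ArrayList<>();
        if (createTpchSchemas) {
            steps.add("create tpch schemas");
        }
        for (String table : initialTables) {
            steps.add("copy table " + table);
        }
        return steps;
    }
}
```

In this sketch, `setCreateTpchSchemas(false)` with the default empty `initialTables` produces no setup steps at all, which is the behavior a test against an empty bucket would want.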

@@ -136,7 +144,9 @@ public DistributedQueryRunner build()
requireNonNull(s3SecretKey, "s3SecretKey is null");
requireNonNull(bucketName, "bucketName is null");
String lowerCaseS3Endpoint = s3Endpoint.toLowerCase(Locale.ENGLISH);
checkArgument(lowerCaseS3Endpoint.startsWith("http://") || lowerCaseS3Endpoint.startsWith("https://"), "Expected http URI for S3 endpoint; got %s", s3Endpoint);
checkArgument(
lowerCaseS3Endpoint.matches("s3.*\\.amazonaws\\.com") || lowerCaseS3Endpoint.startsWith("http://") || lowerCaseS3Endpoint.startsWith("https://"),
Member

are we using s3.*.amazonaws.com without https://? should we add it explicitly? in secrets?

Member Author

Pushed another commit to add the prefix.
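
Assuming the endpoint is now always supplied with an explicit scheme, the check can stay strict. The following is a sketch in plain Java rather than Guava's `checkArgument`; `S3EndpointCheck` and `requireHttpEndpoint` are hypothetical names, and the error message is copied from the diff above.

```java
import java.util.Locale;

// Hypothetical sketch of a strict S3 endpoint check.
final class S3EndpointCheck
{
    private S3EndpointCheck() {}

    // Returns the endpoint unchanged if it starts with http:// or https:// (case-insensitive).
    static String requireHttpEndpoint(String s3Endpoint)
    {
        String lowerCaseS3Endpoint = s3Endpoint.toLowerCase(Locale.ENGLISH);
        if (!(lowerCaseS3Endpoint.startsWith("http://") || lowerCaseS3Endpoint.startsWith("https://"))) {
            throw new IllegalArgumentException("Expected http URI for S3 endpoint; got " + s3Endpoint);
        }
        return s3Endpoint;
    }
}
```

A bare host name such as `s3.us-east-1.amazonaws.com` is rejected by this sketch, which is why the prefix has to be part of the configured value.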


// Recreating a table throws "Target directory for table 'xxx' already exists" in real S3 (not MinIO)
// when 'hive.metastore.thrift.delete-files-on-drop' config property is false
assertUpdate("CREATE TABLE " + tableName + "(col int)");
Member

Maybe it would make sense to check that the table in fact is empty?

Member Author

Included additional assertions.

@ebyhr
Member Author

ebyhr commented Oct 11, 2022

Addressed comments.

@findepi
Member

findepi commented Oct 12, 2022

LGTM, but the build is red

@ebyhr
Member Author

ebyhr commented Oct 12, 2022

The failed jobs succeeded in #14557 (within the repo). I will take a look at these failures and file issues if needed.

@ebyhr ebyhr requested a review from findepi October 12, 2022 12:13
@ebyhr ebyhr merged commit a87fc63 into trinodb:master Oct 12, 2022
@ebyhr ebyhr deleted the ebi/hive-delete-files-on-drop branch October 12, 2022 12:33
@ebyhr ebyhr mentioned this pull request Oct 12, 2022
@github-actions github-actions bot added this to the 400 milestone Oct 12, 2022
@sumanshusamarora

@ebyhr Is there a similar property available for the Glue metastore? On a drop table command, the table metadata is deleted from Glue, but the underlying data remains in Azure. Is there anything you can suggest to allow dropping the data as well?

6 participants