Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Disallow dropping Hive schema that contains external files #10146

Merged
merged 7 commits into from
Dec 20, 2021

Conversation

jirassimok
Copy link
Member

This will attempt to reimplement the behavior from #9740 in a way that doesn't break for some configurations.

Copy link
Member

@findepi findepi left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

i understand the missing piece is the logic which would drive the value for the deleteData parameter.

@findepi
Copy link
Member

findepi commented Dec 2, 2021

@jirassimok can you please verify Glue CREATE SCHEMA and CREATE SCHEMA WITH (location=..) vs DROP SCHEMA?
this can be manual, but can also be as a prep commit with a test in TestHiveGlueMetastore / AbstractTestHive

@jirassimok jirassimok force-pushed the drop-schema-file-handling branch from 811fe45 to 71571a0 Compare December 3, 2021 19:29
@findepi
Copy link
Member

findepi commented Dec 6, 2021

@jirassimok did you look at the CI's red?

@jirassimok
Copy link
Member Author

It doesn't look related.

io.trino.plugin.thrift.integration.TestThriftDistributedQueriesIndexed.testHighCardinalityIndexJoinResult

java.lang.IllegalArgumentException: Expected operatorType to be PagesIndexBuilderOperator but was PageBufferOperator

@jirassimok jirassimok force-pushed the drop-schema-file-handling branch from e773947 to c3e09be Compare December 6, 2021 19:17
Copy link
Member

@losipiuk losipiuk left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

reviewed. Tiny things + question ragarding default behaviour for case when we fail to list schema directory.

@jirassimok jirassimok force-pushed the drop-schema-file-handling branch from c3e09be to e3abc7a Compare December 8, 2021 00:16
@jirassimok jirassimok requested a review from losipiuk December 8, 2021 00:17
@@ -203,6 +203,12 @@ public synchronized void createDatabase(HiveIdentity identity, Database database

Path databaseMetadataDirectory = getDatabaseMetadataDirectory(database.getDatabaseName());
writeSchemaFile(DATABASE, databaseMetadataDirectory, databaseCodec, new DatabaseMetadata(currentVersion, database), false);
try {
metadataFileSystem.mkdirs(databaseMetadataDirectory);
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why is it needed in Don't put database schema files in the database directories commit?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It's ensuring that the database directory exists when the database is created. It's not strictly necessary, but I think it makes sense.

@@ -555,6 +562,10 @@ public void dropDatabase(HiveIdentity identity, String databaseName, boolean del
catch (AmazonServiceException e) {
throw new TrinoException(HIVE_METASTORE_ERROR, e);
}

if (deleteData) {
location.ifPresent(path -> deleteDir(hdfsContext, hdfsEnvironment, new Path(path), true));
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

it's an error when deleteData && location.isEmpty().
Either checkState or add a comment why we're not failing

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This can theoretically occur if database.getLocation() returns empty.

In this case, it only occurs if Glue's API returns null for the database location, which seems unlikely, but that method is not documented as non-null.

@jirassimok jirassimok force-pushed the drop-schema-file-handling branch from e3abc7a to 07ed54f Compare December 8, 2021 21:08
@jirassimok jirassimok marked this pull request as ready for review December 8, 2021 21:08
@jirassimok jirassimok force-pushed the drop-schema-file-handling branch 4 times, most recently from 2fd7051 to 6116ac0 Compare December 10, 2021 14:32
@jirassimok
Copy link
Member Author

jirassimok commented Dec 10, 2021

Green tests before new push.

@jirassimok jirassimok force-pushed the drop-schema-file-handling branch from 6116ac0 to 73c0880 Compare December 10, 2021 14:34
@jirassimok
Copy link
Member Author

Changes in last update: auto-close the FileSystem object and catch runtime exceptions when attempting to list files.

Copy link
Member

@findepi findepi left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

lgtm

@@ -93,6 +93,7 @@
private boolean immutablePartitions;
private Optional<InsertExistingPartitionsBehavior> insertExistingPartitionsBehavior = Optional.empty();
private boolean createEmptyBucketFiles;
private boolean deleteSchemaLocationsFallback = true;
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why true as default. You probably talked this through.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It's the current behavior, so it's less likely to cause problems for anyone upgrading. But it's also the less-safe behavior for someone who isn't familiar with Trino who might , especially if they aren't very familiar with it. I think we did discussed it last week, but maybe we should change it?

What do you think, @findepi?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The point of the change is "do not delete those random files when deleting schema directory".
false feels more appropriate in case we cannot verify whether there are any files in there.
let's also add code comment discussion the rationale behind the choice we put in the default

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Switched the default and added a comment.

@jirassimok
Copy link
Member Author

Rebased forward, and avoided including new HdfsContext in the try/catch.

@jirassimok jirassimok force-pushed the drop-schema-file-handling branch from ec3c615 to 877cc04 Compare December 16, 2021 21:40
@jirassimok jirassimok force-pushed the drop-schema-file-handling branch from 877cc04 to 11055a1 Compare December 16, 2021 23:00
@findepi
Copy link
Member

findepi commented Dec 17, 2021

@jirassimok mind the CI

@findepi findepi changed the title Handle external files when dropping Hive schemas Disallow dropping Hive schema that contains external files Dec 17, 2021
In SemiTransactionalHiveMetastore, check for files before dropping the
schema. Do not request deletion (via HiveMetastore) if files are
visible in the schema location.

A new config property, hive.delete-schema-locations-fallback,
determines the behavior when Trino can't check the file location.
False (the default) will not request deletion in that case.
@jirassimok jirassimok force-pushed the drop-schema-file-handling branch from 11055a1 to 703098e Compare December 17, 2021 15:08
@jirassimok
Copy link
Member Author

All tests have passed, something is just dangling in this one so it isn't finishing the job.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Development

Successfully merging this pull request may close these issues.

3 participants