Disallow dropping Hive schemas that contain external files #9740

Merged
merged 1 commit into from
Nov 20, 2021

Conversation

jirassimok
Member

This avoids potential data loss when a schema is created and dropped in a location that already has files.

@cla-bot cla-bot bot added the cla-signed label Oct 22, 2021
@jirassimok jirassimok requested a review from losipiuk October 22, 2021 13:37
Member

@losipiuk losipiuk left a comment

LGTM

@jirassimok jirassimok force-pushed the dont-drop-schemas-with-files branch 2 times, most recently from 514304f to 2d197c5 Compare October 22, 2021 16:03
@jirassimok jirassimok force-pushed the dont-drop-schemas-with-files branch from 1aa0008 to e993519 Compare October 27, 2021 21:07
@jirassimok jirassimok requested a review from losipiuk October 28, 2021 15:20
Comment on lines 132 to 134
private static final SchemaType DATABASE = SchemaType.DATABASE;
private static final SchemaType TABLE = SchemaType.TABLE;
private static final SchemaType PARTITION = SchemaType.PARTITION;
Member

can you just static import those (actually I am not sure if you really can) :)

Member Author

If I make the enum non-private, I could.

Member

Yeah - would make more sense to me :)

Member Author

Made it package-private with a comment.

deleteMetadataDirectory(DATABASE, databaseMetadataDirectory);
try {
if (!metadataFileSystem.delete(getSchemaPath(DATABASE, databaseMetadataDirectory), false)) {
throw new TrinoException(HIVE_METASTORE_ERROR, "Failed to delete database schema file");
Member

nit: put path in exception message

Member

@losipiuk losipiuk left a comment

LGTM. A few nits. To be merged after.

@jirassimok jirassimok force-pushed the dont-drop-schemas-with-files branch 4 times, most recently from fc68747 to daf9981 Compare October 29, 2021 14:33
@jirassimok
Member Author

This shouldn't be merged yet, pending offline discussion.

@findepi
Member

findepi commented Oct 30, 2021

Per offline discussion, we should unregister such a schema and leave the files on storage intact.
The rationale is that a schema could have been created for an existing location (CREATE SCHEMA x WITH (location='existing/location/with/some/existing/data')). As written, this PR would prevent such a schema from being dropped.

@jirassimok jirassimok force-pushed the dont-drop-schemas-with-files branch from daf9981 to 412d29e Compare November 4, 2021 18:52
@@ -783,7 +783,27 @@ public void createSchema(ConnectorSession session, String schemaName, Map<String
@Override
public void dropSchema(ConnectorSession session, String schemaName)
Member Author

The more I think about it, the less sure I am that this is the right place to delete the files.

I think it might be better to include a deleteData parameter for HiveMetastore.dropDatabase, and pass true when there are no external files. That argument is already present on HiveMetastore.dropTable and dropPartition, too, so it wouldn't be entirely out of place.

I guess if we delete the file here, though, HiveMetastore.dropDatabase should note that it's not supposed to delete non-metadata files.
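
A minimal sketch of the deleteData idea above, in plain Java. `InMemoryMetastore`, its storage map, and every method name here are hypothetical stand-ins, not Trino's `HiveMetastore`. The point is only that the caller decides whether data is removed, and the metastore never deletes non-metadata files on its own.

```java
import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;

// Hypothetical stand-in for a metastore; this is NOT Trino's HiveMetastore API.
public class InMemoryMetastore {
    // database name -> location
    private final Map<String, String> databases = new HashMap<>();
    // location -> file names stored there (a toy model of external storage)
    private final Map<String, Set<String>> storage = new HashMap<>();

    public void putFile(String location, String fileName) {
        storage.computeIfAbsent(location, key -> new HashSet<>()).add(fileName);
    }

    public void createDatabase(String name, String location) {
        databases.put(name, location);
        storage.computeIfAbsent(location, key -> new HashSet<>());
    }

    public boolean databaseExists(String name) {
        return databases.containsKey(name);
    }

    public boolean locationExists(String location) {
        return storage.containsKey(location);
    }

    // Always unregisters the database; removes the location only when asked to.
    public void dropDatabase(String name, boolean deleteData) {
        String location = databases.remove(name);
        if (location == null) {
            throw new IllegalArgumentException("Database not found: " + name);
        }
        if (deleteData) {
            storage.remove(location);
        }
    }

    // Caller-side policy: pass deleteData = true only when the location holds no files.
    public void dropSchema(String name) {
        String location = databases.get(name);
        if (location == null) {
            throw new IllegalArgumentException("Database not found: " + name);
        }
        Set<String> files = storage.get(location);
        boolean hasFiles = files != null && !files.isEmpty();
        dropDatabase(name, !hasFiles);
    }
}
```

With this shape, dropping a schema whose location already contains files unregisters it and leaves the files alone, while an empty location is cleaned up.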

assertQuerySucceeds(format("CREATE SCHEMA %s WITH (location = '%s')", schemaName, schemaDir));
assertQuerySucceeds(format("DROP SCHEMA %s", schemaName));
@Test
public void testDropSchemaWithoutLocation()
Member Author

@jirassimok jirassimok Nov 8, 2021

I'm not completely sure how this case still works (well, it seemed to work locally; maybe it doesn't actually work). If the location property is not set, HiveMetadata shouldn't be deleting anything, and the HiveMetastore implementations shouldn't be deleting anything, either.

(Though I don't believe this test runs with every HiveMetastore we have, which would probably be good coverage to have.)

@findepi
Member

findepi commented Nov 8, 2021

Extracted #9902 from here

FileSystem fs = hdfsEnvironment.getFileSystem(new HdfsContext(session), path);
// If no files in schema directory, delete it
if (!fs.listFiles(path, false).hasNext()) {
fs.delete(path, true);
Member

@findepi findepi Nov 8, 2021

This should be done in SemiTransactionalHiveMetastore, and only when commit is invoked

add a test with:

  • DROP SCHEMA
  • rollback
  • try to use that schema
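
One way to picture the commit-time behaviour requested above is the toy model below. `DeferredDropSketch` and all of its members are hypothetical; it is not `SemiTransactionalHiveMetastore`, and it records a cleaned-up name instead of touching a file system. It only illustrates that a drop stays pending until commit and that rollback leaves the schema usable.

```java
import java.util.ArrayList;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

// Hypothetical model of commit-time cleanup; not Trino's SemiTransactionalHiveMetastore.
public class DeferredDropSketch {
    private final Set<String> schemas = new HashSet<>();       // committed state
    private final Set<String> pendingDrops = new HashSet<>();  // drops in the open transaction
    private final List<String> cleanedUp = new ArrayList<>();  // stands in for directory deletion

    public void createSchema(String name) {
        schemas.add(name);
    }

    // Records the drop; nothing is removed until commit().
    public void dropSchema(String name) {
        if (!schemaExists(name)) {
            throw new IllegalArgumentException("Schema not found: " + name);
        }
        pendingDrops.add(name);
    }

    // What the current transaction sees.
    public boolean schemaExists(String name) {
        return schemas.contains(name) && !pendingDrops.contains(name);
    }

    // Applies pending drops, then performs the cleanup side effect.
    public void commit() {
        for (String name : pendingDrops) {
            schemas.remove(name);
            cleanedUp.add(name);
        }
        pendingDrops.clear();
    }

    // Discards pending drops; no cleanup happens.
    public void rollback() {
        pendingDrops.clear();
    }

    public List<String> cleanedUp() {
        return cleanedUp;
    }
}
```

The requested test maps onto this model as: create a schema, call `dropSchema`, call `rollback`, then check that `schemaExists` is still true and that `cleanedUp()` is empty.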

}

@Test
public void testDropSchemaDeletesDirectory()
Member

This depends on file systems.
We should have a copy of that test in io.trino.plugin.hive.AbstractTestHive, so it's run for supported storages too.

@findepi
Member

findepi commented Nov 9, 2021

#9902 merged, @jirassimok can you please rebase & squash?

@jirassimok jirassimok force-pushed the dont-drop-schemas-with-files branch 2 times, most recently from aab6657 to 1b56cc6 Compare November 15, 2021 22:57
@jirassimok jirassimok force-pushed the dont-drop-schemas-with-files branch from fb8d1c4 to 85a8348 Compare November 17, 2021 17:39
Member

@losipiuk losipiuk left a comment

First commit LGTM

@losipiuk
Member

@electrum this slightly changes the semantics of DROP SCHEMA in Hive. You may be interested.

@losipiuk losipiuk requested a review from electrum November 17, 2021 17:47
@jirassimok jirassimok force-pushed the dont-drop-schemas-with-files branch 2 times, most recently from fe78787 to 7b2e2d3 Compare November 18, 2021 19:45
@jirassimok
Member Author

That update added some logging.

@losipiuk
Member

Still LGTM. Maybe:

  • drop "WIP" commit
  • test Glue manually for now
  • fix maven checks

And we can merge this one?

@jirassimok
Member Author

Sounds good.

@alexjo2144 also suggested logging a warning instead of throwing an exception if there's an error while trying to delete the files, because the schema has already been dropped at that point.
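
A rough sketch of that suggestion, assuming plain `java.util.logging` rather than whatever logger the Trino code actually uses. `BestEffortCleanup`, `DeleteAction`, and `deleteBestEffort` are hypothetical names introduced for illustration; the delete call itself is passed in so the failure path can be exercised without a real file system.

```java
import java.io.IOException;
import java.util.logging.Level;
import java.util.logging.Logger;

// Hypothetical helper; not part of the Trino code base.
public class BestEffortCleanup {
    private static final Logger log = Logger.getLogger(BestEffortCleanup.class.getName());

    // Hypothetical hook for the storage call that may fail.
    @FunctionalInterface
    public interface DeleteAction {
        void run() throws IOException;
    }

    // Returns true if the cleanup succeeded. On failure, logs a warning and
    // returns false instead of throwing, because the schema is already dropped.
    public static boolean deleteBestEffort(String location, DeleteAction action) {
        try {
            action.run();
            return true;
        }
        catch (IOException e) {
            log.log(Level.WARNING, "Could not delete schema directory " + location, e);
            return false;
        }
    }
}
```

The caller can ignore the return value; it exists mainly so the outcome is observable in tests.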

@jirassimok jirassimok force-pushed the dont-drop-schemas-with-files branch from 7b2e2d3 to 149992b Compare November 18, 2021 22:33
In HiveMetadata, delete an empty schema location after dropping it
from the metastore.

In ThriftHiveMetastore and FileHiveMetastore, do not delete data.
@jirassimok jirassimok force-pushed the dont-drop-schemas-with-files branch from 149992b to 11ba2d2 Compare November 19, 2021 19:46
@jirassimok jirassimok requested a review from findepi November 19, 2021 20:31
@jirassimok
Member Author

Last push just fixed a copy/paste error in the tests that made one of them fail.

3 participants