Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Create external table location for Hive #17920

Conversation

vlad-lyutenko
Copy link
Contributor

@vlad-lyutenko vlad-lyutenko commented Jun 15, 2023

Description

Create directory structure if it's not exists.
external location will be created if writesToNonManagedTablesEnabled flag is set

The same behaviour is on pure Hive as mentioned in #17920 (comment)

Additional context and related issues

Release notes

( ) This is not user-visible or docs only and no release notes are required.
( ) Release notes are required, please propose a release note for me.
(x) Release notes are required, with the following suggested text:

With current change - external location will be created for Hive tables if flag hive.non-managed-table-writes-enabled is set,
otherwise exception will raised as it was before.

# Hive
* Create an empty directory if the external location doesn't exist

@cla-bot cla-bot bot added the cla-signed label Jun 15, 2023
@vlad-lyutenko vlad-lyutenko marked this pull request as draft June 15, 2023 13:38
@github-actions github-actions bot added hive Hive connector tests:hive labels Jun 15, 2023
@vlad-lyutenko vlad-lyutenko force-pushed the vlad-lyutenko/create-external-table-location branch from a7e603c to a8a7670 Compare June 15, 2023 16:20
@vlad-lyutenko vlad-lyutenko changed the title WIP Create external table location Create external table location for Hive Jun 15, 2023
@vlad-lyutenko vlad-lyutenko marked this pull request as ready for review June 15, 2023 16:24
@hashhar

This comment was marked as off-topic.

Copy link
Contributor

@krvikash krvikash left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM. nitpick comments only.

@vlad-lyutenko vlad-lyutenko force-pushed the vlad-lyutenko/create-external-table-location branch from a8a7670 to 12b7e34 Compare June 16, 2023 10:09
@vlad-lyutenko
Copy link
Contributor Author

LGTM. nitpick comments only.

Big thanks, all comments addressed

@ebyhr
Copy link
Member

ebyhr commented Jun 16, 2023

I would recommend reading #1277. I don't think we want to allow creating the directory with external_location table property.

Copy link
Contributor

@krvikash krvikash left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It seems the commit message is separated into two lines. Could you please make it in one line?

@vlad-lyutenko
Copy link
Contributor Author

I would recommend reading #1277. I don't think we want to allow creating the directory with external_location table property.

@ebyhr thx, for discussion

Maybe we can introduce this change with some config toggle like:
@Config("hive.allow-create-external-table-path")
?

@vlad-lyutenko vlad-lyutenko force-pushed the vlad-lyutenko/create-external-table-location branch from 12b7e34 to ad55aa3 Compare June 16, 2023 10:24
@tooptoop4
Copy link
Contributor

need a way to disable this for following reasons:

  1. creating a table that points to an existing s3 location that contains sensitive data would circumvent table controls. ie create table myschema.sneakytable WITH (external_location = 's3://sensitive_bucket/data_only_other_team_should_see/')
  2. creating a table that changes/deletes/overwrites data on some existing s3 location

Copy link
Contributor

@tooptoop4 tooptoop4 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

need config to prevent

@hashhar
Copy link
Member

hashhar commented Jun 18, 2023

creating a table that points to an existing s3 location that contains sensitive data would circumvent table controls. ie create table myschema.sneakytable WITH (external_location = 's3://sensitive_bucket/data_only_other_team_should_see/')

For this there is hive.non-managed-table-creates-enabled and hive.non-managed-table-writes-enabled.

This change does allow a user to create a table pointing to a non-existent location though because the directory would get auto-created.

Also I agree with Yuya on that we shouldn't add this. Hive doesn't allow this for example and also because "Create external table target location, if it's not exists." is what a managed table is - not an external table. External tables are supposed to point to a table which already exists but isn't registered. And for allowing arbitrary location for managed tables see concerns in the issue and the two PRs linked from that issue that Yuya has shared.

@vlad-lyutenko
Copy link
Contributor Author

For me this change is more about, synchronise behaviour with s3:

 private void checkExternalPath(HdfsContext context, Path path)
    {
        try {
            if (!isS3FileSystem(context, hdfsEnvironment, path)) {

As far as I understand on s3 we don't have abstraction as directory, so if for example someone wants to create external table, he could just provide any non existing path and we even don't check it (exists or not exists), because this path will be automatically "created" by s3 (path will be just part of the key).

So I'd like to give the same option for HDFS, if we want external table just give trino a path and we will create it.

@vlad-lyutenko
Copy link
Contributor Author

vlad-lyutenko commented Jun 19, 2023

External tables are supposed to point to a table which already exists but isn't registered

I mean, that I could not find code which actually check that table exists on provided external_path, it's just check that path exists (and only for non s3 file systems).

@vlad-lyutenko
Copy link
Contributor Author

vlad-lyutenko commented Jun 19, 2023

I would recommend reading #1277.

Yes and this is good option to allow create managed table with location provided by the user.
But this will require to disallow create external table on path, where no table exists (not just path not exists).

So as soon, as it will be implemented and merged, it will eliminate current change, but mean time current change just synchronise behaviours between s3 and hdfs.

@Praveen2112
Copy link
Member

Hive doesn't allow this for example and also because "Create external table target location, if it's not exists."

Last time when I checked with Hive - if create an external table pointing to non-existing directory, hive would create an empty directory

#1277 is about setting the location of the manged table while this PR is about external table - I'm not sure if it would help us here.

Copy link
Member

@Praveen2112 Praveen2112 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can we have additional PT coverage where we check the permission of the directory for external tables ? We do have a similar test for hdfs-impersonation

@vlad-lyutenko vlad-lyutenko force-pushed the vlad-lyutenko/create-external-table-location branch 4 times, most recently from ded99ad to 1a8c89a Compare July 5, 2023 10:22
@vlad-lyutenko
Copy link
Contributor Author

vlad-lyutenko commented Aug 3, 2023

. We should instead separate these into two different properties.

so we should introduce smth like external_location_create_if_not_exists (or some more appropriate name)?

@vlad-lyutenko
Copy link
Contributor Author

vlad-lyutenko commented Aug 3, 2023

All new code needs to use TrinoFileSystem.

didn't find how to create directory using TrinoFileSystem,

or we should add method like createDirecotry in TrinoFileSystem ?

@Praveen2112
Copy link
Member

We should NOT create a non-managed location, as by definition it’s not managed by us.

@electrum Please correct me if I am wrong, this definition is derived from hive right, so if hive supports creating an empty directory if it doesn't exist shouldn't (for an external table) we support the same. Today we do support inserting data into a new partition based on writesToNonManagedTablesEnabled flag can't we exhibit a similar behavior for create table as well.

This is why we don’t allow writing to external locations by default.

Even in this PR we do hide them behind a flag which is disabled by default.

@vlad-lyutenko vlad-lyutenko force-pushed the vlad-lyutenko/create-external-table-location branch 5 times, most recently from fffebe7 to 89246ff Compare August 15, 2023 11:16
@Praveen2112 Praveen2112 requested a review from electrum August 23, 2023 02:59
@Praveen2112
Copy link
Member

Discussed offline with @electrum , now external location will be created if writesToNonManagedTablesEnabled flag is set.

@vlad-lyutenko vlad-lyutenko force-pushed the vlad-lyutenko/create-external-table-location branch from 89246ff to 5442e70 Compare August 23, 2023 08:29
@github-actions github-actions bot added the docs label Aug 23, 2023
@vlad-lyutenko vlad-lyutenko force-pushed the vlad-lyutenko/create-external-table-location branch 4 times, most recently from a99dad3 to 39aa317 Compare August 23, 2023 13:13
@Praveen2112
Copy link
Member

/test-with-secrets sha=39aa31743ed6cd83b0fcd785e22c43f450783ceb

@github-actions
Copy link

github-actions bot commented Aug 24, 2023

The CI workflow run with tests that require additional secrets finished as failure: https://github.com/trinodb/trino/actions/runs/5961189431

@vlad-lyutenko vlad-lyutenko force-pushed the vlad-lyutenko/create-external-table-location branch from 39aa317 to 883ae60 Compare August 24, 2023 09:03
@Praveen2112
Copy link
Member

/test-with-secrets sha=883ae60c59bafb1f4d7dfb6dafb0ad8db2e1c602

@github-actions
Copy link

github-actions bot commented Aug 24, 2023

The CI workflow run with tests that require additional secrets finished as failure: https://github.com/trinodb/trino/actions/runs/5961868678

@Praveen2112
Copy link
Member

There were a few flaky failures, re-ran them, now the pipeline is green

@Praveen2112 Praveen2112 merged commit ba2bb9b into trinodb:master Aug 24, 2023
@Praveen2112
Copy link
Member

Thanks for working on this.

@github-actions github-actions bot added this to the 425 milestone Aug 24, 2023
@@ -36,6 +36,7 @@ public List<SuiteTestRun> getTestRuns(EnvironmentConfig config)
return ImmutableList.of(
testOnEnvironment(EnvMultinode.class)
.withGroups("configured_features", "hdfs_no_impersonation")
.withExcludedTests("io.trino.tests.product.TestImpersonation.testExternalLocationTableCreationSuccess")
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

why this exclusion?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

to be honest, don't remember any specific reason, need to run and check this test

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

understood. please recover this info & capture it as a code comment.

Copy link
Contributor Author

@vlad-lyutenko vlad-lyutenko Mar 5, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

understood. please recover this info & capture it as a code comment.

got it I didn't want enable hive.non-managed-table-writes-enabled for whole EnvMultinode environment to break other tests,
which is required for this test

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Development

Successfully merging this pull request may close these issues.

8 participants