Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add support to redirect table reads from Hive to Iceberg #10173

Conversation

findepi
Copy link
Member

@findepi findepi commented Dec 3, 2021

Hive Connector redirects Iceberg table access to the configured
Iceberg catalog.

This change adds implementation of HiveTableRedirectionsProvider,
plugging in Hive->Iceberg redirects and leveraging existing framework.

This is based on work from several authors. In particular, this is based
on #8340 with conflicts resolved, and retrofit to current APIs in Hive connector.

As tests show, not all the statements support redirects properly and
this needs to be followed up upon.

@findepi findepi added enhancement New feature or request tests:hive labels Dec 3, 2021
@cla-bot cla-bot bot added the cla-signed label Dec 3, 2021
@findepi findepi force-pushed the findepi/redirect-hive-iceberg/there-and-back-again branch from f4a3f99 to 708cdf5 Compare December 3, 2021 16:52
@findepi findepi force-pushed the findepi/redirect-hive-iceberg/there-and-back-again branch from 708cdf5 to d12502a Compare December 3, 2021 19:24
@findepi
Copy link
Member Author

findepi commented Dec 3, 2021

per @ssheikin's #8340 (comment), added @MiguelWeezardo as co-author.

@findepi findepi force-pushed the findepi/redirect-hive-iceberg/there-and-back-again branch from ceae60d to 7b9d7ba Compare December 4, 2021 11:24
.map(propertyMetadata -> immutableEntry(
propertyMetadata.getName(),
session.getProperty(propertyMetadata.getName(), propertyMetadata.getJavaType())))
// The session properties collected here are used for events only. Filter out nulls to avoid problems with downstream consumers
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is there a reason why we do not care about properties which are set to null in events?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I am not aware of any consumer of these events. This must be some extension point for someone.
The only usage i found looks like a dummy consumer, at least this is what the commit message says.

log.debug("File created: query: %s, schema: %s, table: %s, partition: '%s', format: %s, size: %s, path: %s",
writeCompletedEvent.getQueryId(),
writeCompletedEvent.getSchemaName(),
writeCompletedEvent.getTableName(),
writeCompletedEvent.getPartitionName(),
writeCompletedEvent.getStorageFormat(),
writeCompletedEvent.getBytes(),
writeCompletedEvent.getPath());

Including null values could be a breaking change to downstream even consumers (e.g. if the [Immutable]Map.copyOf(event.getSessionProperties()). Converting values to Optional would be a breaking change to. In the absence of more information, i deemed that omitting null values is the least breaking of all options.

import static org.testcontainers.utility.MountableFile.forHostPath;

@TestsEnvironment
public class EnvSinglenodeHiveRedirectionToIceberg
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This probably does not matter here but I heard many times in the past that we should refrain from using singlenode envs as they provide worse test coverage than multinode ones.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for reminding me. We have 27 or something singlenode environments. And yes, here it doesn't matter, as tested functionality is really coordinator-only. Let me keep it as is.


onTrino().executeQuery("ALTER TABLE " + hiveTableName + " ADD COLUMN some_new_column double");

// TODO: ALTER TABLE succeeded, but new column was not added
Copy link
Member

@losipiuk losipiuk Dec 6, 2021

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

can you create a github issue which lists missing functionalities?
And tag relevant TODOs with an issue

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't think it's necessary in this case. #8340 is going to fix this problems (once #8340 (comment) is resolved).

Copy link
Member

@hashhar hashhar left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do we want to test SHOW STATS redirection too?

What about a non-default hive.timestamp-precision? Redirects will return different results.
Same for the use_column_name configs in tables which have different schema across partitions.

@findepi
Copy link
Member Author

findepi commented Dec 6, 2021

What about a non-default hive.timestamp-precision? Redirects will return different results.

hive.timestamp-precision doesn't matter, since there is no Hive table. Hive doesn't know the table schema, so it doesn't know whether there are any timestamps.

Same for the use_column_name configs in tables which have different schema across partitions.

same

Hive Connector redirects Iceberg table access to the configured
Iceberg catalog.

This change adds implementation of `HiveTableRedirectionsProvider`,
plugging in Hive->Iceberg redirects and leveraging existing framework.

This is based on work from several authors, mentioned at the end of
commit message.

As tests show, not all the statements support redirects properly and
this needs to be followed up upon.

Co-authored-by: Xingyuan Lin <[email protected]>
Co-authored-by: Pratham Desai <[email protected]>
Co-authored-by: Sasha Sheikin <[email protected]>
Co-authored-by: Michał Ślizak <[email protected]>
Co-authored-by: Łukasz Osipiuk <[email protected]>
@findepi findepi force-pushed the findepi/redirect-hive-iceberg/there-and-back-again branch from 7b9d7ba to 8feb622 Compare December 6, 2021 11:17
@findepi
Copy link
Member Author

findepi commented Dec 6, 2021

Do we want to test SHOW STATS redirection too?

added

Also, split the test class into two (#10173 (comment)).

@findepi findepi merged commit d79f6ea into trinodb:master Dec 6, 2021
@findepi findepi deleted the findepi/redirect-hive-iceberg/there-and-back-again branch December 6, 2021 15:34
@findepi findepi mentioned this pull request Dec 6, 2021
10 tasks
@aierate
Copy link

aierate commented Oct 27, 2022

Do I have to use like
USE a-catalog.myschema;
SELECT * FROM mytable;
or
SELECT * FROM a-catalog.myschema.mytable;
?
Instead of using SELECT * FROM myschema.mytable without using a-catalog or USE a-catalog?

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
cla-signed enhancement New feature or request
Development

Successfully merging this pull request may close these issues.

4 participants