Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Support Parquet Modular Encryption for Trino #20069

Open
wants to merge 6 commits into
base: master
Choose a base branch
from

Conversation

my7ym
Copy link

@my7ym my7ym commented Dec 8, 2023

Description

This large PR is to support Parquet Modular Encryption for Trino. The PR core logic is migrated from the one in Presto with several changes to bridge the big gaps between APIs in these two repos.

To avoid no trino-hdfs dep in other modules, we pass the trino decryption configs in parquet configs and then translate to parquet flags. Not sure if there is a better way to do this.

For Trino issue: #9383

Will fix the test and linting failures after the implementation is stable.

Additional context and related issues

Release notes

( ) This is not user-visible or is docs only, and no release notes are required.
( ) Release notes are required. Please propose a release note for me.
( ) Release notes are required, with the following suggested text:

# Section
* Fix some things. ({issue}`issuenumber`)

Copy link

cla-bot bot commented Dec 8, 2023

Thank you for your pull request and welcome to the Trino community. We require contributors to sign our Contributor License Agreement, and we don't seem to have you on file. Continue to work with us on the review and improvements in this PR, and submit the signed CLA to [email protected]. Photos, scans, or digitally-signed PDF files are all suitable. Processing may take a few days. The CLA needs to be on file before we merge your changes. For more information, see https://github.com/trinodb/cla

@github-actions github-actions bot added tests:hive hudi Hudi connector iceberg Iceberg connector delta-lake Delta Lake connector hive Hive connector labels Dec 8, 2023
@my7ym
Copy link
Author

my7ym commented Dec 8, 2023

Thank you for your pull request and welcome to the Trino community. We require contributors to sign our Contributor License Agreement, and we don't seem to have you on file. Continue to work with us on the review and improvements in this PR, and submit the signed CLA to [email protected]. Photos, scans, or digitally-signed PDF files are all suitable. Processing may take a few days. The CLA needs to be on file before we merge your changes. For more information, see https://github.com/trinodb/cla

Done

@@ -264,6 +264,12 @@
<scope>provided</scope>
</dependency>

<dependency>
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We recently completed a huge project to decouple Trino from Hive and Hadoop, so we can't reintroduce Hadoop here. Unfortunately, since the Parquet encryption APIs are tied to Hadoop, we'll need to reimplement the decryption code ourselves, inside of Trino, similar to what we have done for most of Parquet and the other file format readers.

Copy link
Author

@my7ym my7ym Dec 11, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@electrum Hey, thanks for review. I will take a look and the complexity of it and get back here. (Also thanks for the link to the project. It's nice to learn the reasons behind it)

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hey @electrum I have migrated necessary classes and methods to Trino project and remove the dep on hadoop. I tested with HUDI catalog and it works with encrypted datasets. I could add more unit tests later if the PR looks OK to you. Could you take another look? Thanks!

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@electrum Bump. No rush tho. Thanks!

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I see that the Hadoop dependency is still here. We need to remove this.

Copy link

cla-bot bot commented Dec 15, 2023

Thank you for your pull request and welcome to the Trino community. We require contributors to sign our Contributor License Agreement, and we don't seem to have you on file. Continue to work with us on the review and improvements in this PR, and submit the signed CLA to [email protected]. Photos, scans, or digitally-signed PDF files are all suitable. Processing may take a few days. The CLA needs to be on file before we merge your changes. For more information, see https://github.com/trinodb/cla

@my7ym my7ym force-pushed the wyu-encryption-latest branch from 3bbc5a5 to e6578c9 Compare December 15, 2023 01:17
Copy link

cla-bot bot commented Dec 15, 2023

Thank you for your pull request and welcome to the Trino community. We require contributors to sign our Contributor License Agreement, and we don't seem to have you on file. Continue to work with us on the review and improvements in this PR, and submit the signed CLA to [email protected]. Photos, scans, or digitally-signed PDF files are all suitable. Processing may take a few days. The CLA needs to be on file before we merge your changes. For more information, see https://github.com/trinodb/cla

@my7ym my7ym requested a review from electrum December 15, 2023 01:23
Copy link

This pull request has gone a while without any activity. Tagging the Trino developer relations team: @bitsondatadev @colebow @mosabua

@github-actions github-actions bot added the stale label Jan 10, 2024
@mosabua
Copy link
Member

mosabua commented Jan 11, 2024

👋 @my7ym .. please let us know if you need any help and ping us when this is ready for review again.

@github-actions github-actions bot removed the stale label Jan 12, 2024
@mosabua
Copy link
Member

mosabua commented Feb 2, 2024

@amoghmargoor @findepi @electrum can you potentially help here and work on the removal of any Hadoop deps as discussed in the contributor call https://github.com/trinodb/trino/wiki/Contributor-meetings#trino-contributor-call-1-feb-2024

In fact looking at the message in the PR it looks like these removals are done. Maybe all we need is for @my7ym to rebase and then coordinate reviews and changes.

Also @my7ym .. I am not sure if you are on trino slack .. if so please let us know and chime in at https://trinodb.slack.com/archives/CP1MUNEUX/p1706832386414769 or sync up in general with us to continue this work together

@electrum
Copy link
Member

electrum commented Feb 2, 2024

@my7ym Great work on reimplementing the Parquet code to not require Hadoop. I'll review this ASAP.

@mosabua
Copy link
Member

mosabua commented Feb 2, 2024

Also fyi @raunaqmorarka

Copy link
Member

@amoghmargoor amoghmargoor left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

hey @my7ym - thanks for PR. I will try to run this internally and see if it works on our file. If any issues I will raise my patch so that it can help taking this forward. We have been running this in production for 1.5+ years a while and have fixed some crucial bugs on the way.

@electrum
Copy link
Member

It looks like this code still depends on Hadoop. The config also contains a class name that is loaded at runtime, which is something we avoid in Trino. I think we'll want to have a list of supported KMS implementations that work out-of-the-box with appropriate config.

It would be helpful to have a product test that shows this working end-to-end.

@my7ym
Copy link
Author

my7ym commented Feb 17, 2024

It looks like this code still depends on Hadoop. The config also contains a class name that is loaded at runtime, which is something we avoid in Trino.
May I quickly know what class it is?

I think we'll want to have a list of supported KMS implementations that work out-of-the-box with appropriate config.

Good call. Will do.

It would be helpful to have a product test that shows this working end-to-end.
May I know where is the best place to add it? Any existing example?

Thanks!

Copy link

This pull request has gone a while without any activity. Tagging the Trino developer relations team: @bitsondatadev @colebow @mosabua

@github-actions github-actions bot added the stale label Mar 11, 2024
@mosabua
Copy link
Member

mosabua commented Mar 11, 2024

@my7ym let us know if you need any help with rebasing and next steps. cc @electrum

@github-actions github-actions bot removed the stale label Mar 12, 2024
@my7ym
Copy link
Author

my7ym commented Mar 16, 2024

@mosabua A bit tied up in the past weeks. Will work on this in next week. Will let you know if I have more questions. Thanks for checking in.

@ancientmodern
Copy link

hi @my7ym, I'm working on a project that needs this feature, and I've been following this PR with interest. Is the branch featured in this PR in a workable state (with the understanding that there are Hadoop deps that currently prevent merging into Trino)?

I'm eager to assist if there's any way I can help to move this forward. I'm relatively new to Trino but very enthusiastic about contributing to the community :)

Copy link

This pull request has gone a while without any activity. Tagging the Trino developer relations team: @bitsondatadev @colebow @mosabua

@github-actions github-actions bot added the stale label May 10, 2024
@mosabua
Copy link
Member

mosabua commented May 29, 2024

@electrum @amoghmargoor how do we want to proceed here?

@github-actions github-actions bot removed the stale label May 30, 2024
Copy link

This pull request has gone a while without any activity. Tagging the Trino developer relations team: @bitsondatadev @colebow @mosabua

@github-actions github-actions bot added the stale label Jun 21, 2024
Copy link

Closing this pull request, as it has been stale for six weeks. Feel free to re-open at any time.

@github-actions github-actions bot closed this Jul 15, 2024
@mosabua mosabua added stale-ignore Use this label on PRs that should be ignored by the stale bot so they are not flagged or closed. and removed stale labels Jul 15, 2024
@mosabua
Copy link
Member

mosabua commented Jul 15, 2024

Can you provide guidance on how to proceed here @electrum ?

@mosabua mosabua reopened this Jul 15, 2024
@mosabua
Copy link
Member

mosabua commented Jul 16, 2024

@my7ym could you proceed to remove the hadoop dependencies as requested by @electrum and rebase the PR?

@ebyhr
Copy link
Member

ebyhr commented Oct 2, 2024

@cla-bot check

Copy link

cla-bot bot commented Oct 2, 2024

Thank you for your pull request and welcome to the Trino community. We require contributors to sign our Contributor License Agreement, and we don't seem to have you on file. Continue to work with us on the review and improvements in this PR, and submit the signed CLA to [email protected]. Photos, scans, or digitally-signed PDF files are all suitable. Processing may take a few days. The CLA needs to be on file before we merge your changes. For more information, see https://github.com/trinodb/cla

Copy link

cla-bot bot commented Oct 2, 2024

The cla-bot has been summoned, and re-checked this pull request!

@ebyhr
Copy link
Member

ebyhr commented Oct 2, 2024

CLA check failed because it expects @my7ym, but this PR uses @will-ym account.

@sopel39
Copy link
Member

sopel39 commented Nov 5, 2024

Related to #23583

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
delta-lake Delta Lake connector hive Hive connector hudi Hudi connector iceberg Iceberg connector stale-ignore Use this label on PRs that should be ignored by the stale bot so they are not flagged or closed.
Development

Successfully merging this pull request may close these issues.

9 participants