[WIP] Add Parquet decryption support for Hive tables #23583
Conversation
Force-pushed from f9d182e to 25e6fd1
Force-pushed from 25e6fd1 to cdb6cba
Thanks @amoghmargoor . A high-level comment first. This patch is big; I think it can be split into smaller patches that are easier to process.
OK, I will try to split it. Regarding the crypto package, I think some jars were removed or decoupled as part of removing the Hadoop dependency. Do we have any Hadoop dependency in parquet-java?
I think the dependency on Apache Hadoop jars is "provided" in the parquet-hadoop pom, so maybe they won't be fetched during the Trino build. Most of the basic crypto package classes (unrelated to factories) don't need Apache Hadoop artifacts. Unless there are some other cross-dependencies, we might be able to re-use those classes.
if (trinoParquetCryptoConfig.getCryptoFactoryClass() == null) {
    return Optional.empty();
}
final Class<?> foundClass = TrinoCryptoConfigurationUtil.getClassFromConfig(
We don’t do this kind of dynamic class loading based on class names from config. Everything should be configured statically.
The last time I looked, the Parquet crypto classes rely on Hadoop
@electrum all crypto classes are forked into
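For context, the config-driven dynamic class loading pattern being discouraged here looks roughly like the sketch below. Class and method names (`CryptoFactoryLoader`, `loadFactoryClass`) are hypothetical illustrations, not the actual Trino API:

```java
import java.util.Optional;

public final class CryptoFactoryLoader
{
    private CryptoFactoryLoader() {}

    // Hypothetical sketch: resolve a crypto factory class whose name comes
    // from configuration. This is the dynamic-loading pattern the review
    // discourages in favor of static wiring.
    public static Optional<Class<?>> loadFactoryClass(String configuredClassName)
    {
        if (configuredClassName == null || configuredClassName.isEmpty()) {
            return Optional.empty();
        }
        try {
            return Optional.of(Class.forName(configuredClassName));
        }
        catch (ClassNotFoundException e) {
            return Optional.empty();
        }
    }

    public static void main(String[] args)
    {
        System.out.println(loadFactoryClass("java.lang.String").isPresent()); // true
        System.out.println(loadFactoryClass("no.such.Clazz").isPresent());    // false
    }
}
```

Static configuration avoids the main drawbacks of this pattern: classpath surprises at runtime and configuration errors that only surface when the class is first resolved.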
@@ -379,6 +379,21 @@
        <scope>runtime</scope>
    </dependency>

    <!-- Below two dependencies are not needed when parquet-kms-apple includes jackson dependency -->
we should remove this or update the comment
    <dependency>
        <groupId>com.google.errorprone</groupId>
        <artifactId>error_prone_annotations</artifactId>
        <optional>true</optional>
    </dependency>
unrelated change, undo
Sorry for the confusion. I was answering these comments:
<build>
    <plugins>
        <plugin>
            <groupId>org.basepom.maven</groupId>
do we still need these?
@@ -208,6 +208,16 @@
            </excludes>
        </configuration>
    </plugin>
    <plugin>
        <groupId>org.basepom.maven</groupId>
remove. ORC shouldn't be affected
@@ -13,12 +13,25 @@
    <description>Trino - Parquet file format support</description>

    <dependencies>
        <dependency>
            <groupId>com.fasterxml.jackson.core</groupId>
are these required?
@@ -26,4 +28,6 @@ public int getUncompressedSize()
{
    return uncompressedSize;
}

public abstract Slice getSlice();
nit: this should be a separate commit
import java.lang.reflect.InvocationTargetException;
import java.util.Optional;

public class EncryptionUtils
this should be final
If no dependency on parquet-hadoop, then sure, it's fine to fork the crypto classes. They are quite stable in the Apache Parquet-java repo and have not been modified in many years (besides code style changes). The only differences in Trino will be related to the config, file path, and file system objects; I'll review those parts.
import java.io.IOException;
import java.util.Map;

public class TrinoHadoopFSKeyMaterialStore
might want to rename it, this class doesn't use HadoopFS
public boolean isUniformEncryption()
{
    return parquetReaderEncryptionOptions.uniformEncryption;
}

public boolean isEncryptionParameterChecked()
{
    return parquetReaderEncryptionOptions.encryptionParameterChecked;
}

public String getFailsafeEncryptionKeyId()
{
    return parquetReaderEncryptionOptions.failsafeEncryptionKeyId;
}

public String getEncryptionColumnKeys()
{
    return parquetReaderEncryptionOptions.columnKeys;
}

public String getEncryptionFooterKeyId()
These parameters are write-only, not required for reading (except for isEncryptionParameterChecked).
public String[] getEncryptionVersionedKeyList()
{
    return parquetReaderEncryptionOptions.versionedKeyList;
}

public String[] getEncryptionKeyList()
{
    return parquetReaderEncryptionOptions.keyList;
}

public String getEncryptionKeyFile()
{
    return parquetReaderEncryptionOptions.keyFile;
}

public boolean isEncryptionEnvironmentKeys()
these are custom parameters for one particular KMS client
Had a look at these. The file path and file system parts are fine. Regarding the config (encryption parameters): this is related to the general challenge of supporting custom KMS systems. There is basically an infinite number of key managers out there (each public cloud has one, plus open-source KMSs, plus org-private KMS systems), so ideally we'd have a dynamic loading mechanism for custom KMS client classes, and a map-like config object for supplying custom encryption parameters to these custom classes. Otherwise, users would have to add their custom KMS client class (and perform custom modifications of the
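The map-like config object described above could be sketched as a small SPI like the one below. The interface and names (`TrinoKmsClient`, `kms.endpoint`) are hypothetical illustrations, not an existing Trino or Parquet API:

```java
import java.nio.charset.StandardCharsets;
import java.util.Map;

// Hypothetical SPI: custom KMS integrations receive arbitrary encryption
// parameters via a string map instead of vendor-specific typed config fields.
interface TrinoKmsClient
{
    void initialize(Map<String, String> encryptionParameters);

    byte[] unwrapKey(String wrappedKey, String masterKeyId);
}

public final class ExampleKmsClient
        implements TrinoKmsClient
{
    private Map<String, String> parameters = Map.of();

    @Override
    public void initialize(Map<String, String> encryptionParameters)
    {
        // A real client would read its endpoint, credentials, etc. from here.
        this.parameters = encryptionParameters;
    }

    @Override
    public byte[] unwrapKey(String wrappedKey, String masterKeyId)
    {
        // Toy behavior for the sketch; a real client would call the KMS service.
        String endpoint = parameters.getOrDefault("kms.endpoint", "unset");
        return (endpoint + "/" + masterKeyId + ":" + wrappedKey).getBytes(StandardCharsets.UTF_8);
    }

    public static void main(String[] args)
    {
        TrinoKmsClient client = new ExampleKmsClient();
        client.initialize(Map.of("kms.endpoint", "https://kms.example.com"));
        System.out.println(new String(client.unwrapKey("wrapped", "keyA"), StandardCharsets.UTF_8));
    }
}
```

With this shape, adding a new KMS vendor means shipping one class and passing its parameters through the generic map, rather than extending the engine's typed configuration.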
This pull request has gone a while without any activity. Tagging for triage help: @mosabua |
I had a discussion with @electrum regarding KMS. The best approach would be to have one or two predefined KMS clients and let people contribute more to OSS, or override the binding if they need company-specific customization.
SGTM |
Superseded by #24517
Description
Adds support to read Hive tables with encrypted Parquet files.
PS: This PR is a work in progress; we are adding tests to it.
Additional context and related issues
Parquet added support for encryption: https://parquet.apache.org/docs/file-format/data-pages/encryption/. Spark has also added support for reading and writing tables with encrypted Parquet files. In this PR, we add support for reading Hive tables with encrypted Parquet files in Trino.
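For context, Parquet modular encryption (as used by Spark) is typically driven by Hadoop configuration properties along these lines. The property and factory class names follow the Apache Parquet documentation; the key names, key material, and column names below are purely illustrative:

```properties
# Crypto factory and KMS client class (names per Apache Parquet docs)
parquet.crypto.factory.class=org.apache.parquet.crypto.keytools.PropertiesDrivenCryptoFactory
parquet.encryption.kms.client.class=org.apache.parquet.crypto.keytools.mocks.InMemoryKMS

# Illustrative master keys (base64) and footer/column key assignments
parquet.encryption.key.list=keyA:AAECAwQFBgcICQoLDA0ODw==,keyB:AAECAAECAAECAAECAAECAA==
parquet.encryption.footer.key=keyB
parquet.encryption.column.keys=keyA:ssn,credit_card
```

A Trino reader needs the equivalent decryption side of this configuration: a way to resolve the KMS client and obtain the footer and column keys referenced in the file metadata.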
Release notes
( ) This is not user-visible or is docs only, and no release notes are required.
( ) Release notes are required. Please propose a release note for me.
( ) Release notes are required, with the following suggested text: