
[WIP] Add Parquet decryption support for Hive tables #23583

Closed
wants to merge 1 commit into from

Conversation

amoghmargoor
Member

@amoghmargoor amoghmargoor commented Sep 26, 2024

Description

Adds support for reading Hive tables with encrypted Parquet files.
Note: this PR is a work in progress; tests are still being added.

Additional context and related issues

Parquet added support for modular encryption (https://parquet.apache.org/docs/file-format/data-pages/encryption/), and Spark can already read and write tables with encrypted Parquet files. This PR adds support for reading Hive tables with encrypted Parquet files in Trino.

Release notes

( ) This is not user-visible or is docs only, and no release notes are required.
( ) Release notes are required. Please propose a release note for me.
( ) Release notes are required, with the following suggested text:

# Section
* Fix some things. ({issue}`issuenumber`)

@cla-bot cla-bot bot added the cla-signed label Sep 26, 2024
@amoghmargoor amoghmargoor changed the title Add Parquet decryption support for Hive tables [WIP] Add Parquet decryption support for Hive tables Sep 26, 2024
@github-actions github-actions bot added hudi Hudi connector iceberg Iceberg connector delta-lake Delta Lake connector hive Hive connector labels Sep 26, 2024
@amoghmargoor amoghmargoor force-pushed the master-pme-oss branch 2 times, most recently from f9d182e to 25e6fd1 on September 26, 2024 14:45
@ggershinsky

Thanks @amoghmargoor. A high-level comment first: this patch is big; I think it can be split into smaller patches that are easier to process.

  • the classes under the crypto/keytools package handle key management. They can be moved into a separate PR. This PR can focus on the basic PME layer that is given an encryption key via the reader API, and doesn't manage the key storage etc.
  • many classes in the crypto package are identical to the same classes in the Apache parquet-java artifact. Can we re-use them?

@amoghmargoor
Member Author

Thanks @amoghmargoor. A high-level comment first: this patch is big; I think it can be split into smaller patches that are easier to process.

  • the classes under the crypto/keytools package handle key management. They can be moved into a separate PR. This PR can focus on the basic PME layer that is given an encryption key via the reader API, and doesn't manage the key storage etc.
  • many classes in the crypto package are identical to the same classes in the Apache parquet-java artifact. Can we re-use them?

OK, I will try to split it. Regarding the crypto package: I think some jars were removed or decoupled as part of removing the Hadoop dependency. Do we have any Hadoop dependency in parquet-java?

@ggershinsky

I think the dependency on Apache Hadoop jars is "provided" in the parquet-hadoop pom, so maybe they won't be fetched during the Trino build. Most of the basic crypto package classes (unrelated to factories) don't need Apache Hadoop artifacts. Unless there are some other cross-dependencies, we might be able to re-use those classes.

if (trinoParquetCryptoConfig.getCryptoFactoryClass() == null) {
    return Optional.empty();
}
final Class<?> foundClass = TrinoCryptoConfigurationUtil.getClassFromConfig(
Member

We don’t do this kind of dynamic class loading based on class names from config. Everything should be configured statically.

@electrum
Member

electrum commented Oct 17, 2024

The last time I looked, the Parquet crypto classes rely on Hadoop Configuration. Also, they’re mixed together in a JAR with other Hadoop-requiring classes, so we don’t include parquet-hadoop in Trino.

@sopel39
Member

sopel39 commented Oct 28, 2024

The last time I looked, the Parquet crypto classes rely on Hadoop Configuration. Also, they’re mixed together in a JAR with other Hadoop-requiring classes, so we don’t include parquet-hadoop in Trino.

@electrum all crypto classes are forked into the io.trino.parquet.crypto package. There is no dependency on parquet-hadoop from trino-parquet.

@@ -379,6 +379,21 @@
<scope>runtime</scope>
</dependency>

<!-- Below two dependencies are not needed when parquet-kms-apple includes jackson dependency -->
Member

we should remove this or update the comment

<dependency>
    <groupId>com.google.errorprone</groupId>
    <artifactId>error_prone_annotations</artifactId>
    <optional>true</optional>
</dependency>

Member

unrelated change, undo

@electrum
Member

Sorry for the confusion. I was answering these comments:

many classes in the crypto package are identical to the same classes in the Apache parquet-java artifact. Can we re-use them?

Most of the basic crypto package classes (unrelated to factories) don't need Apache Hadoop artifacts. Unless there are some other cross-dependencies, we might be able to re-use those classes.

<build>
    <plugins>
        <plugin>
            <groupId>org.basepom.maven</groupId>
Member

do we still need these?

@@ -208,6 +208,16 @@
        </excludes>
    </configuration>
</plugin>
<plugin>
    <groupId>org.basepom.maven</groupId>
Member

remove. ORC shouldn't be affected

@@ -13,12 +13,25 @@
    <description>Trino - Parquet file format support</description>

    <dependencies>
        <dependency>
            <groupId>com.fasterxml.jackson.core</groupId>
Member

are these required?

@@ -26,4 +28,6 @@ public int getUncompressedSize()
{
    return uncompressedSize;
}

public abstract Slice getSlice();
Member

nit: this should be a separate commit

import java.lang.reflect.InvocationTargetException;
import java.util.Optional;

public class EncryptionUtils
Member

this should be final

@ggershinsky

all crypto classes are forked into the io.trino.parquet.crypto package. There is no dependency on parquet-hadoop from trino-parquet.

If there is no dependency on parquet-hadoop, then sure, it's fine to fork the crypto classes. They are quite stable in the Apache parquet-java repo and have not been modified in many years (besides code style changes). The only differences in Trino will be related to the config, file path, and file system objects; I'll review those parts.

import java.io.IOException;
import java.util.Map;

public class TrinoHadoopFSKeyMaterialStore


might want to rename it, this class doesn't use HadoopFS

Comment on lines +155 to +175
public boolean isUniformEncryption()
{
    return parquetReaderEncryptionOptions.uniformEncryption;
}

public boolean isEncryptionParameterChecked()
{
    return parquetReaderEncryptionOptions.encryptionParameterChecked;
}

public String getFailsafeEncryptionKeyId()
{
    return parquetReaderEncryptionOptions.failsafeEncryptionKeyId;
}

public String getEncryptionColumnKeys()
{
    return parquetReaderEncryptionOptions.columnKeys;
}

public String getEncryptionFooterKeyId()

@ggershinsky ggershinsky Nov 12, 2024


these parameters are write-only, not required for reading.
(except for isEncryptionParameterChecked)

Comment on lines +180 to +195
public String[] getEncryptionVersionedKeyList()
{
    return parquetReaderEncryptionOptions.versionedKeyList;
}

public String[] getEncryptionKeyList()
{
    return parquetReaderEncryptionOptions.keyList;
}

public String getEncryptionKeyFile()
{
    return parquetReaderEncryptionOptions.keyFile;
}

public boolean isEncryptionEnvironmentKeys()


these are custom parameters for one particular KMS client

@ggershinsky

The only differences in Trino will be related to the config, file path and file system objects, I'll review those parts.

Had a look at these. The file path and file system parts are fine.

Regarding the config (encryption parameters): this is related to the general challenge of supporting custom KMS systems. There is basically an infinite number of key managers out there (each public cloud has one, plus open-source KMSes, plus org-private KMS systems), so ideally we'd have a dynamic loading mechanism for custom KMS client classes, and a map-like config object for supplying custom encryption parameters to these custom classes.

Otherwise, users would have to add their custom KMS client class (and make custom modifications to ParquetReaderOptions.java) in their Trino fork.


github-actions bot commented Dec 3, 2024

This pull request has gone a while without any activity. Tagging for triage help: @mosabua

@github-actions github-actions bot added the stale label Dec 3, 2024
@sopel39
Member

sopel39 commented Dec 10, 2024

I had a discussion with @electrum regarding KMS. The best would be to have one or two predefined KMS clients and let people contribute more to OSS, or override the binding if they need company-specific customization.

@ggershinsky

I had a discussion with @electrum regarding KMS. The best would be to have one or two predefined KMS clients and let people contribute more to OSS, or override the binding if they need company-specific customization.

SGTM

@github-actions github-actions bot removed the stale label Dec 10, 2024
@sopel39
Member

sopel39 commented Dec 18, 2024

Superseded by #24517

@sopel39 sopel39 closed this Dec 18, 2024