[GLUTEN-8455][VL] Fallback Scan for Encrypted Parquet Files #8456

Merged: 7 commits merged into apache:main from arnavb/encrypted-parquet-fallback on Jan 9, 2025

ArnavBalyan commented Jan 7, 2025

  • Scan is currently offloaded to Velox, but Velox does not support encrypted Parquet files.
  • This leads to a Velox-side error and a runtime failure. This PR adds support for falling back in such cases.
  • We attempt to read the Parquet footer for the root paths in the scan operator; if encryption is detected, the scan falls back.
  • This is behind a config and is not enabled by default. Users can enable scan fallback by setting spark.gluten.sql.fallbackEncryptedParquet to true.
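
The probing step described above can be sketched roughly as follows. This is a minimal illustration, not the code from this PR: `CryptoFooterException` is a hypothetical stand-in for Parquet's `ParquetCryptoRuntimeException`, and `readFooter` is an injected function standing in for the real footer read, so that the sketch compiles with only the Scala standard library. The names `EncryptedScanProbe`, `validateEncryption`, and `fileLimit` are likewise assumptions made for the example.

```scala
import scala.annotation.tailrec
import scala.util.control.NonFatal

// Hypothetical stand-in for org.apache.parquet.crypto.ParquetCryptoRuntimeException,
// used so the sketch does not need Parquet on the classpath.
final class CryptoFooterException(msg: String) extends RuntimeException(msg)

object EncryptedScanProbe {

  // Walks the cause chain, because the crypto error may be wrapped by another exception.
  @tailrec
  def hasCryptoCause(t: Throwable): Boolean =
    if (t == null) false
    else if (t.isInstanceOf[CryptoFooterException]) true
    else hasCryptoCause(t.getCause)

  // True only when reading the footer fails with a crypto-related error.
  // Other failures are ignored here and left for the scan itself to report.
  private def looksEncrypted(path: String, readFooter: String => Unit): Boolean =
    try {
      readFooter(path)
      false
    } catch {
      case NonFatal(e) => hasCryptoCause(e)
    }

  // Probes at most `fileLimit` files and returns a fallback reason for the first
  // file whose footer read hits a crypto error, or None if no such file is seen.
  def validateEncryption(
      files: Iterator[String],
      fileLimit: Int,
      readFooter: String => Unit): Option[String] =
    files.take(fileLimit).collectFirst {
      case path if looksEncrypted(path, readFooter) =>
        s"Encrypted Parquet footer detected in $path"
    }
}
```

Bounding the number of probed files keeps planning cost predictable on large tables, at the price of missing an encrypted file that sits beyond the limit.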

Fixes: #8455

@github-actions github-actions bot added CORE works for Gluten Core VELOX labels Jan 7, 2025

github-actions bot commented Jan 7, 2025

#8455

github-actions bot commented Jan 7, 2025

Run Gluten Clickhouse CI on x86

@ArnavBalyan ArnavBalyan force-pushed the arnavb/encrypted-parquet-fallback branch from dab4674 to e375c06 Compare January 7, 2025 20:17

@Yohahaha Yohahaha left a comment

LGTM! Just one comment.

@@ -2278,4 +2281,11 @@ object GlutenConfig {
"Otherwise, broadcast build relation will use onheap memory.")
.booleanConf
.createWithDefault(false)

val ENCRYPTED_PARQUET_FALLBACK_ENABLED =
buildStaticConf("spark.gluten.sql.fallbackEncryptedParquet")
Contributor

Why use a static conf?

Contributor Author

Updated. This config is unlikely to change at runtime, but I removed static to allow for such use cases.

case _ => ValidationResult.succeeded
def validateEncryption(): Option[String] = {

val encryptionValidationEnabled = GlutenConfig.getConf.enableEncryptedParquetFallback
Contributor

The config name is a little weird to me. Maybe parquetEncryptionValidationEnabled?

Contributor Author

Updated it.

val fileStatus = filesIterator.next()
checkedFileCount += 1
try {
ParquetFileReader.readFooter(conf, fileStatus.getPath).toString
Contributor

Is there a better way to check for encrypted metadata than relying on the exception?

Contributor Author

This is done to keep it backward compatible: Spark 3.3 uses Parquet 1.12, where support for such a check is not there yet.

Contributor

Spark 3.5 uses Parquet 1.13, and I found it defined in Thrift. Could this work for Spark 3.5?
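
One exception-free check that does not depend on the Parquet library version is to look at the trailing magic bytes of the file. The Parquet modular encryption format marks files written in encrypted-footer mode with the magic `PARE` instead of `PAR1`. The sketch below is not what this PR implements; it is an illustration under stated assumptions: it reads a local file through `java.io.RandomAccessFile` (a real scan would go through the Hadoop `FileSystem` API), and the names `ParquetMagic` and `footerKind` are hypothetical. It also does not detect files written in plaintext-footer mode with encrypted columns, since those keep the `PAR1` magic.

```scala
import java.io.RandomAccessFile
import java.nio.charset.StandardCharsets
import java.nio.file.Path

object ParquetMagic {

  sealed trait FooterKind
  case object PlaintextFooter extends FooterKind // trailing magic "PAR1"
  case object EncryptedFooter extends FooterKind // trailing magic "PARE"
  case object NotParquet extends FooterKind      // too short or unknown magic

  // Classifies a local file by its last four bytes without parsing the footer.
  def footerKind(path: Path): FooterKind = {
    val file = new RandomAccessFile(path.toFile, "r")
    try {
      val length = file.length()
      if (length < 4) NotParquet
      else {
        val magic = new Array[Byte](4)
        file.seek(length - 4)
        file.readFully(magic)
        new String(magic, StandardCharsets.US_ASCII) match {
          case "PAR1" => PlaintextFooter
          case "PARE" => EncryptedFooter
          case _      => NotParquet
        }
      }
    } finally {
      file.close()
    }
  }
}
```

Because only four bytes are read per file, this kind of probe is cheap, but it answers a narrower question than a full footer read.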

@ArnavBalyan

Thanks for the review, @Yohahaha. I've addressed the comments; could you please take another look?

FelixYBW commented Jan 8, 2025

CI fails. Can you fix the formatting?

Yohahaha commented Jan 9, 2025

[ERROR] /__w/incubator-gluten/incubator-gluten/backends-velox/src/main/scala/org/apache/gluten/backendsapi/velox/VeloxBackend.scala:196: error: value getConf is not a member of object org.apache.gluten.config.GlutenConfig
[ERROR]       val encryptionValidationEnabled = GlutenConfig.getConf.parquetEncryptionValidationEnabled
[ERROR]                                                      ^
[ERROR] one error found
[ERROR] exception compilation error occurred!!!

Please rebase and fix the conflict.

@Yohahaha Yohahaha merged commit de725bd into apache:main Jan 9, 2025
48 checks passed
@jackylee-ch

@Yohahaha The way to check for encrypted files hasn't been confirmed. I'm afraid this will not work on Spark 3.5.

ArnavBalyan commented Jan 9, 2025

> @Yohahaha The way to check for encrypted files hasn't been confirmed. I'm afraid this will not work on Spark 3.5.

Hi @jackylee-ch, Spark 3.5 uses Parquet 1.13. After Parquet 1.13, a new field was added that indicates whether the file is encrypted. However, if we try to read the footer of an encrypted file, it throws ParquetCryptoRuntimeException. Could you please elaborate on why it may not work on 3.5? Thanks

Alternatively, I was planning to shade Parquet 1.14 and ship the shaded Parquet as a packaged dependency in Gluten (to check for encryption), which would be a more elegant solution and work across multiple Spark versions. If there are no concerns about bringing shaded Parquet into Gluten, I'm happy to work on that implementation as well. I could not find such implementations inside Gluten, so I was a bit hesitant, although Parquet encryption seems to be a special case that could benefit from such a solution. Let me know what you think, thanks!

Yohahaha commented Jan 9, 2025

> @Yohahaha The way to check for encrypted files hasn't been confirmed. I'm afraid this will not work on Spark 3.5.

@jackylee-ch Sorry, I missed your comments.

The check is disabled by default, so it's safe to merge into the main branch. If we verify that it doesn't work on Spark 3.5, we can fix it in a follow-up PR.

Yohahaha commented Jan 9, 2025

@ArnavBalyan Thanks for the update!

Would you add a UT in the different Spark shim modules with a minimal encrypted Parquet file?

I don't think we need to package Parquet deps in Gluten; just do the verification in the shim module.

Yohahaha commented Jan 9, 2025

@jackylee-ch I'm OK if you want to revert this PR over the Spark 3.5 compatibility concern.

jackylee-ch commented Jan 9, 2025

> Hi @jackylee-ch, Spark 3.5 uses Parquet 1.13. After Parquet 1.13, a new field was added that indicates whether the file is encrypted. However, if we try to read the footer of an encrypted file, it throws ParquetCryptoRuntimeException. Could you please elaborate on why it may not work on 3.5? Thanks

@ArnavBalyan You mean that no matter which Spark version we use, we get ParquetCryptoRuntimeException if we try to read the encrypted footer?

BTW, what would happen if the footer is not encrypted but a column is encrypted?

jackylee-ch commented Jan 9, 2025

> @jackylee-ch I'm OK if you want to revert this PR over the Spark 3.5 compatibility concern.

The current PR should work fine on Spark 3.2 and 3.3 for encrypted footers; I'm not sure about other cases. If it doesn't support those cases, we need to move ParquetMetadataUtils to the shim layer and support them.

Since it is merged and disabled by default, a follow-up PR is fine with me.

ArnavBalyan commented Jan 9, 2025

> > Hi @jackylee-ch, Spark 3.5 uses Parquet 1.13. After Parquet 1.13, a new field was added that indicates whether the file is encrypted. However, if we try to read the footer of an encrypted file, it throws ParquetCryptoRuntimeException. Could you please elaborate on why it may not work on 3.5? Thanks
>
> @ArnavBalyan You mean that no matter which Spark version we use, we get ParquetCryptoRuntimeException if we try to read the encrypted footer?
>
> BTW, what would happen if the footer is not encrypted but a column is encrypted?

Sure, let me add a follow-up UT for 3.5. This feature is behind a feature flag and has been verified for Parquet 1.13. @jackylee-ch, it would depend on how you are doing encryption in your setup. Typically, the footer metadata indicates encryption in newer versions of Parquet.

> I don't think we need to package Parquet deps in Gluten; just do the verification in the shim module.

@Yohahaha, if we want to stay backward compatible with Parquet 1.12 and not use the exception, then we will need a newer Parquet inside Gluten regardless of the Spark version; in that case, the check would keep working across future Parquet upgrades inside Spark. With the shim layer we can probably do the validation, but we would still use exception checking for 1.12. I was wondering how feasible it would be to bring in shaded Parquet to do this? Thanks

@jackylee-ch

We can use a different checker for each Spark version in the shims. For Parquet 1.12, the exception is fine with me if there is no better idea.
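
The per-version checker idea could look roughly like the sketch below. All names here (`EncryptionCheckerShims`, `EncryptionChecker`, `ExceptionBasedChecker`, `MetadataBasedChecker`, `forSparkVersion`) are hypothetical, and the footer read and the metadata lookup are injected as plain functions so the example needs only the Scala standard library. The version cutoff (metadata-based checking from Spark 3.5 onward, exception-based checking before that) is an assumption taken from this thread, not a confirmed mapping. Real shim modules are selected at build time per Spark version; the runtime dispatch here is a simplification for illustration.

```scala
import scala.util.control.NonFatal

object EncryptionCheckerShims {

  trait EncryptionChecker {
    def isEncrypted(path: String): Boolean
  }

  // Older Parquet: infer encryption from a crypto-related failure of the footer read.
  final class ExceptionBasedChecker(
      readFooter: String => Unit,
      isCryptoError: Throwable => Boolean)
    extends EncryptionChecker {
    override def isEncrypted(path: String): Boolean =
      try {
        readFooter(path)
        false
      } catch {
        case NonFatal(e) => isCryptoError(e)
      }
  }

  // Newer Parquet: ask the footer metadata directly (lookup injected as a function).
  final class MetadataBasedChecker(footerSaysEncrypted: String => Boolean)
    extends EncryptionChecker {
    override def isEncrypted(path: String): Boolean = footerSaysEncrypted(path)
  }

  // Assumed cutoff: Spark 3.5 and later use the metadata-based check.
  def usesMetadataCheck(sparkVersion: String): Boolean =
    sparkVersion.split("\\.").toList match {
      case major :: minor :: _ =>
        val (maj, min) = (major.toInt, minor.toInt)
        maj > 3 || (maj == 3 && min >= 5)
      case _ =>
        throw new IllegalArgumentException(s"Unrecognized Spark version: $sparkVersion")
    }

  // Picks one of the two checkers; by-name parameters avoid building the unused one.
  def forSparkVersion(
      sparkVersion: String,
      exceptionBased: => EncryptionChecker,
      metadataBased: => EncryptionChecker): EncryptionChecker =
    if (usesMetadataCheck(sparkVersion)) metadataBased else exceptionBased
}
```

Keeping both checkers behind one trait means the caller in the scan validation path stays the same regardless of which Parquet version a given Spark release bundles.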

Labels
CORE works for Gluten Core VELOX
Projects
None yet
Development

Successfully merging this pull request may close these issues.

[VL] Velox Runtime Error on Encrypted Parquet Files
4 participants