-
Notifications
You must be signed in to change notification settings - Fork 453
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
[GLUTEN-8455][VL] Fallback Scan for Encrypted Parquet Files #8456
[GLUTEN-8455][VL] Fallback Scan for Encrypted Parquet Files #8456
Conversation
Run Gluten Clickhouse CI on x86 |
dab4674
to
e375c06
Compare
Run Gluten Clickhouse CI on x86 |
Run Gluten Clickhouse CI on x86 |
Run Gluten Clickhouse CI on x86 |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM! just one comments
@@ -2278,4 +2281,11 @@ object GlutenConfig { | |||
"Otherwise, broadcast build relation will use onheap memory.") | |||
.booleanConf | |||
.createWithDefault(false) | |||
|
|||
val ENCRYPTED_PARQUET_FALLBACK_ENABLED = | |||
buildStaticConf("spark.gluten.sql.fallbackEncryptedParquet") |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
why use static conf?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
updated, this config changing at runtime is an unlikely event, kept removed static to allow any such usecases
case _ => ValidationResult.succeeded | ||
def validateEncryption(): Option[String] = { | ||
|
||
val encryptionValidationEnabled = GlutenConfig.getConf.enableEncryptedParquetFallback |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The config name is a little wird to me. maybe parquetEncryptionValidationEnabled
?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Updated it
val fileStatus = filesIterator.next() | ||
checkedFileCount += 1 | ||
try { | ||
ParquetFileReader.readFooter(conf, fileStatus.getPath).toString |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is it a better way to check encrypted metadata than use the Exception?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is done to keep it backward compatible, spark 33 uses parquet 1.12, support is not there yet
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Spark 3.5 uses parquet 1.13, and I found it defined in thrift. This could work for Spark 3.5?
Run Gluten Clickhouse CI on x86 |
Thanks for the review @Yohahaha addressed comments, could you please take a look |
CI fails. can you fix format? |
Run Gluten Clickhouse CI on x86 |
please rebase and fix conflict. |
Run Gluten Clickhouse CI on x86 |
@Yohahaha The way to check encrypted file haven't been confirmed, I'm afraid this is not work on Spark 3.5 |
Hi @jackylee-ch, spark 3.5 uses parquet 1.13. After parquet 1.13, there is a new field added to check for encryption, which can provide if the file is encrypted. However if we try to read the encrypted file footer, it throws ParquetCryptoRuntimeException. Could you please elaborate on why it may not work on 3.5? Thanks Alternatively, I was planning to shade parquet 1.14 and bring the shaded parquet as a packaged dependency in Gluten (to check the encryption), which will be a more elegant solution and work with multiple spark version. If there are no concerns with bringing in shaded parquet inside gluten, I'm happy to work on that implementation as well. I could not see such implementations inside Gluten so was a bit hesitant, althought parquet encryption seems to be a special case and could benefit from such a solution. Let me know what you think thanks! |
@jackylee-ch sorry, I missed your comments. the check is disabled by default, it's safe to merge to main branch. if we verified it's not work in Spark 3.5, we can fix with a followup PR. |
@ArnavBalyan thanks for the update! would you add UT in different spark shim module with a minimal encrypted parquet file? we do not need package parquet deps in Gluten I think, just do verification in shim module. |
@jackylee-ch I'm ok if you want to revert this PR with Spark 3.5 compatibility concern. |
@ArnavBalyan You mean that no matter which Spark version we use, we can get BTW what would happen if the footer is not encrypted but the column is encrypted? |
Current PR should work fine for Spark 3.2 and 3.3 for encrypted footer, not sure for other cases. If it doesn't support those cases, we need move Since it is merged and not work by default, a followed PR is good for me. |
Sure let me add a follow up UT for 3.5, this feature is behind feature flag and verified for parquet 1.13. @jackylee-ch, it would depend how you are doing encryption in your setup. Typically the footer metadata will indicate encryption for newer versions of parquet.
@Yohahaha, if we want to keep it backward compatible for parquet 1.12, and not use exception, then we will need a newer parquet inside Gluten regardless of the spark version, in that case, the check can hold true if there are future parquet upgrades inside spark. Using the shim layer we can probably do validation but still will use exception checking for 1.12, I was wondering how feasible it would be to bring in shaded parquet to do this? Thanks |
We can use different checker for different Spark version in shims, and for parquet 1.12, the exception is good to me if there is no other better idea. |
spark.gluten.sql.fallbackEncryptedParquet
true.Fixes: #8455