-
Notifications
You must be signed in to change notification settings - Fork 3.6k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
GH-39645: [Python] Fix read_table for encrypted parquet #39438
Conversation
Thanks for opening a pull request! If this is not a minor PR. Could you open an issue for this pull request on GitHub? https://github.com/apache/arrow/issues/new/choose Opening GitHub issues ahead of time contributes to the Openness of the Apache Arrow project. Then could you also rename the pull request title in the following format?
or
In the case of PARQUET issues on JIRA the title also supports:
See also: |
Could you add a test for this case and open a new issue for this? |
Thank you for fixing this! |
|
I have added new test test_encrypted_parquet_read_table |
The failures look connected. |
I provided updates, and it looks like two tests are failing. I am not sure if they are related, however. |
Could you rebase on main? |
|
The failures can be found on other builds also so they are not related to this PR. |
Hello everyone, anyway to get a reviewer on this PR? |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Adding two comments, thank you for your patience! =)
python/pyarrow/_dataset_parquet.pyx
Outdated
@@ -808,6 +815,18 @@ cdef class ParquetFragmentScanOptions(FragmentScanOptions): | |||
raise ValueError("size must be larger than zero") | |||
self.reader_properties().set_thrift_container_size_limit(size) | |||
|
|||
def set_file_decryption_properties(self, FileDecryptionProperties decryption_properties): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Several things:
- will this work if Parquet encryption is not enabled at compile time? (see how
parquet_decryption_config
works) - can you make this a read/write property like the other options here?
- can it also be added to the constructor like the other options?
- can you add unit tests for it? see existing test in
# set decryption config for parquet fragment scan options - how does this interact with
decryption_config
?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ill look into this
This comment was marked as outdated.
This comment was marked as outdated.
Sorry, something went wrong.
This comment was marked as outdated.
This comment was marked as outdated.
Sorry, something went wrong.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think 1 through 4 are addressed now.
As for 5 decryption_config
and file_decryption_properties
currently don't interact in Python code base. But as per @wgtmac they do in c++. Perhaps @tolleybot has more insight.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
arrow/cpp/src/arrow/dataset/file_parquet_encryption_test.cc
Lines 220 to 233 in 318d22a
// Create the ReaderProperties object using the FileDecryptionProperties object | |
auto decryption_config = | |
std::make_shared<parquet::encryption::DecryptionConfiguration>(); | |
auto file_decryption_properties = crypto_factory_->GetFileDecryptionProperties( | |
*kms_connection_config_, *decryption_config); | |
auto reader_properties = parquet::default_reader_properties(); | |
reader_properties.file_decryption_properties(file_decryption_properties); | |
// Read entire file as a single Arrow table | |
parquet::arrow::FileReaderBuilder reader_builder; | |
ASSERT_OK(reader_builder.Open(input, reader_properties)); | |
ASSERT_OK_AND_ASSIGN(auto arrow_reader, reader_builder.Build()); | |
std::shared_ptr<Table> table; | |
ASSERT_OK(arrow_reader->ReadTable(&table)); |
It seems
DecryptionConfiguration
is meant for scanning, while FileDecryptionProperties
is to be used when reading single files with ReadTable
. Both seem to be public.
@wgtmac Would you like to take a look at this? I'm not entirely sure which kind of decryption arguments should be exposed in the public Parquet reading APIs. It seems we already take a also cc @jorisvandenbossche |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@pitrou I'm not that familiar with the python code base. AFAIK, ParquetDecryptionConfig is used by parquet dataset reader to create FileDecryptionProperties for each file like this:
arrow/cpp/src/arrow/dataset/file_parquet.cc
Lines 82 to 86 in 40a8a68
if (parquet_decrypt_config != nullptr) { | |
auto file_decryption_prop = | |
parquet_decrypt_config->crypto_factory->GetFileDecryptionProperties( | |
*parquet_decrypt_config->kms_connection_config, | |
*parquet_decrypt_config->decryption_config, path, filesystem); |
@pitrou could you do another pass here please? |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM
Thank you for working on this!
@wgtmac could you please do a quick sanity check over the |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
+1
python/pyarrow/_dataset_parquet.pyx
Outdated
@@ -715,6 +714,9 @@ cdef class ParquetFragmentScanOptions(FragmentScanOptions): | |||
decryption_config : pyarrow.dataset.ParquetDecryptionConfig, default None | |||
If not None, use the provided ParquetDecryptionConfig to decrypt the | |||
Parquet file. | |||
decryption_properties : pyarrow.dataset.FileDecryptionProperties, default None |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is this correct? I only see C++ class FileDecryptionProperties under parquet namespace.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Nice catch! Changed to:
decryption_properties : pyarrow.parquet.FileDecryptionProperties, default None
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'll wait for the CI and merge.
Refactor Dataset Loading to Utilize ds.dataset for Single File Handling Changed the dataset loading logic to directly use ds.dataset for both single files and directories. Removed the conditional check for single_file and the manual creation of FileSystemDataset from a single fragment. Rationale: - The updated approach leverages the flexibility of ds.dataset, which, according to its documentation, can handle a single file scenario. As per the docstring: "Path pointing to a single file: Open a FileSystemDataset from a single file." - This simplification leads to cleaner and more maintainable code, as ds.dataset internally manages the distinctions between a single file and a directory, removing the need for explicit conditional checks and fragment creation. This refactor enhances code clarity and relies on the robustness of PyArrow's dataset handling, making the code more streamlined. Commit Summary: - Removed `decryption_properties` from `scan_args` to avoid passing it to `ParquetFragmentScanOptions`. - Added `set_file_decryption_properties` function to apply decryption properties directly to `default_fragment_scan_options`. - Applied decryption properties using the new function, ensuring compatibility with `ParquetFragmentScanOptions`. Rationale: - Eliminating `decryption_properties` from `scan_args` prevents errors since `ParquetFragmentScanOptions` does not support this parameter. Its set through the CReaderProperties property - Introducing `set_file_decryption_properties` provides a clear and focused method for handling decryption settings, by updating the CReaderProperties Add test for writing and reading encrypted Parquet files with parquet.read_table I change the name of test_encrypted_parquet_write_read_table. Updated some formatting.
moved name from single_file back to buffer_reader. updated some of the tests in test_encryption to use some utility functions to help with the repetitiveness updates to test_encryption.py to try and re-use common code through utility functions Changed the variable buffer_reader back to single_file
After merging your PR, Conbench analyzed the 0 benchmarking runs that have been run so far on merge-commit bd44410. None of the specified runs were found on the Conbench server. The full Conbench report has more details. |
…#39438) ### Rationale for this change Currently, if you try to read a decrypted parquet with read_table, passing decryption_properties - in the happy path (pyarrow.data available for import) the reading/decryption of the file fails, as the decryption properties are missing. ### What changes are included in this PR? Pass through the argument that was intended to have been passed. ### Are these changes tested? We have tested this locally on an encrypted parquet dataset - please advise on any further testing you would like beyond that and the standard CI. ### Are there any user-facing changes? Not in any cases where their code was previously working? The intended behaviour for encrypted dataset decryption should start working. * Closes: apache#39645 Lead-authored-by: Tom McTiernan <[email protected]> Co-authored-by: Don <[email protected]> Co-authored-by: Rok Mihevc <[email protected]> Signed-off-by: Rok Mihevc <[email protected]>
…#39438) ### Rationale for this change Currently, if you try to read a decrypted parquet with read_table, passing decryption_properties - in the happy path (pyarrow.data available for import) the reading/decryption of the file fails, as the decryption properties are missing. ### What changes are included in this PR? Pass through the argument that was intended to have been passed. ### Are these changes tested? We have tested this locally on an encrypted parquet dataset - please advise on any further testing you would like beyond that and the standard CI. ### Are there any user-facing changes? Not in any cases where their code was previously working? The intended behaviour for encrypted dataset decryption should start working. * Closes: apache#39645 Lead-authored-by: Tom McTiernan <[email protected]> Co-authored-by: Don <[email protected]> Co-authored-by: Rok Mihevc <[email protected]> Signed-off-by: Rok Mihevc <[email protected]>
Rationale for this change
Currently, if you try to read a decrypted parquet with read_table, passing decryption_properties - in the happy path (pyarrow.data available for import) the reading/decryption of the file fails, as the decryption properties are missing.
What changes are included in this PR?
Pass through the argument that was intended to have been passed.
Are these changes tested?
We have tested this locally on an encrypted parquet dataset - please advise on any further testing you would like beyond that and the standard CI.
Are there any user-facing changes?
Not in any cases where their code was previously working? The intended behaviour for encrypted dataset decryption should start working.