Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

GH-39645: [Python] Fix read_table for encrypted parquet #39438

Merged
merged 3 commits into from
May 9, 2024
Merged

Conversation

tmct
Copy link
Contributor

@tmct tmct commented Jan 3, 2024

Rationale for this change

Currently, if you try to read a decrypted parquet with read_table, passing decryption_properties - in the happy path (pyarrow.data available for import) the reading/decryption of the file fails, as the decryption properties are missing.

What changes are included in this PR?

Pass through the argument that was intended to have been passed.

Are these changes tested?

We have tested this locally on an encrypted parquet dataset - please advise on any further testing you would like beyond that and the standard CI.

Are there any user-facing changes?

Not in any cases where their code was previously working? The intended behaviour for encrypted dataset decryption should start working.

Copy link

github-actions bot commented Jan 3, 2024

Thanks for opening a pull request!

If this is not a minor PR. Could you open an issue for this pull request on GitHub? https://github.com/apache/arrow/issues/new/choose

Opening GitHub issues ahead of time contributes to the Openness of the Apache Arrow project.

Then could you also rename the pull request title in the following format?

GH-${GITHUB_ISSUE_ID}: [${COMPONENT}] ${SUMMARY}

or

MINOR: [${COMPONENT}] ${SUMMARY}

In the case of PARQUET issues on JIRA the title also supports:

PARQUET-${JIRA_ISSUE_ID}: [${COMPONENT}] ${SUMMARY}

See also:

@tmct tmct changed the title Pass forgotten keyword argument GH-39438: [${pyarrow}] ${Pass forgotten keyword argument} Jan 3, 2024
@tmct tmct changed the title GH-39438: [${pyarrow}] ${Pass forgotten keyword argument} GH-39438: [pyarrow] Pass forgotten keyword argument Jan 3, 2024
@tmct tmct changed the title GH-39438: [pyarrow] Pass forgotten keyword argument GH-39438: [Python] Pass forgotten keyword argument Jan 3, 2024
@tmct tmct changed the title GH-39438: [Python] Pass forgotten keyword argument MINOR: [Python] Pass forgotten keyword argument Jan 3, 2024
@kou
Copy link
Member

kou commented Jan 3, 2024

Could you add a test for this case and open a new issue for this?
This is not a MINOR change: https://github.com/apache/arrow/blob/main/CONTRIBUTING.md#Minor-Fixes

@raulcd
Copy link
Member

raulcd commented Jan 15, 2024

@tmct do you want to open an issue and follow it up as suggested by @kou or do you want us to follow it up?
We should create an issue, link it to the PR and create a test. cc @AlenkaF

@AlenkaF
Copy link
Member

AlenkaF commented Jan 15, 2024

Thank you for fixing this!
We should add a test to the pyarrow/tests/parquet/test_encryption.py::read_encrypted_parquet.

@tolleybot
Copy link
Contributor

tolleybot commented Jan 16, 2024

I have added the issue ##39645 Issue with Decryption Properties in read_table for Encrypted Parquet Files for this PR
@tmct let me know if there are any details you want to add to the issue.

@AlenkaF AlenkaF changed the title MINOR: [Python] Pass forgotten keyword argument GH-39645: [Python] Pass forgotten keyword argument Jan 17, 2024
Copy link

⚠️ GitHub issue #39645 has been automatically assigned in GitHub to PR creator.

@tolleybot
Copy link
Contributor

I have added new test test_encrypted_parquet_read_table

@AlenkaF
Copy link
Member

AlenkaF commented Jan 22, 2024

The failures look connected.

@tolleybot
Copy link
Contributor

I provided updates, and it looks like two tests are failing. I am not sure if they are related, however.

@kou
Copy link
Member

kou commented Jan 26, 2024

Could you rebase on main?

@tmct tmct changed the title GH-39645: [Python] Pass forgotten keyword argument GH-39645: [Python] Fix read_table for encrypted parquet Jan 26, 2024
Copy link

⚠️ GitHub issue #39645 has been automatically assigned in GitHub to PR creator.

@AlenkaF
Copy link
Member

AlenkaF commented Jan 29, 2024

The failures can be found on other builds also so they are not related to this PR.

@github-actions github-actions bot added awaiting committer review Awaiting committer review and removed awaiting review Awaiting review labels Jan 29, 2024
@tolleybot
Copy link
Contributor

Hello everyone, anyway to get a reviewer on this PR?

Copy link
Member

@AlenkaF AlenkaF left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Adding two comments, thank you for your patience! =)

python/pyarrow/_dataset_parquet.pyx Outdated Show resolved Hide resolved
python/pyarrow/parquet/core.py Show resolved Hide resolved
python/pyarrow/parquet/core.py Outdated Show resolved Hide resolved
python/pyarrow/tests/parquet/test_encryption.py Outdated Show resolved Hide resolved
@@ -808,6 +815,18 @@ cdef class ParquetFragmentScanOptions(FragmentScanOptions):
raise ValueError("size must be larger than zero")
self.reader_properties().set_thrift_container_size_limit(size)

def set_file_decryption_properties(self, FileDecryptionProperties decryption_properties):
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Several things:

  1. will this work if Parquet encryption is not enabled at compile time? (see how parquet_decryption_config works)
  2. can you make this a read/write property like the other options here?
  3. can it also be added to the constructor like the other options?
  4. can you add unit tests for it? see existing test in
    # set decryption config for parquet fragment scan options
  5. how does this interact with decryption_config?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ill look into this

This comment was marked as outdated.

This comment was marked as outdated.

Copy link
Member

@rok rok May 8, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think 1 through 4 are addressed now.
As for 5 decryption_config and file_decryption_properties currently don't interact in Python code base. But as per @wgtmac they do in c++. Perhaps @tolleybot has more insight.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

// Create the ReaderProperties object using the FileDecryptionProperties object
auto decryption_config =
std::make_shared<parquet::encryption::DecryptionConfiguration>();
auto file_decryption_properties = crypto_factory_->GetFileDecryptionProperties(
*kms_connection_config_, *decryption_config);
auto reader_properties = parquet::default_reader_properties();
reader_properties.file_decryption_properties(file_decryption_properties);
// Read entire file as a single Arrow table
parquet::arrow::FileReaderBuilder reader_builder;
ASSERT_OK(reader_builder.Open(input, reader_properties));
ASSERT_OK_AND_ASSIGN(auto arrow_reader, reader_builder.Build());
std::shared_ptr<Table> table;
ASSERT_OK(arrow_reader->ReadTable(&table));

It seems DecryptionConfiguration is meant for scanning, while FileDecryptionProperties is to be used when reading single files with ReadTable. Both seem to be public.

python/pyarrow/_dataset_parquet.pyx Outdated Show resolved Hide resolved
@pitrou
Copy link
Member

pitrou commented Feb 27, 2024

@wgtmac Would you like to take a look at this? I'm not entirely sure which kind of decryption arguments should be exposed in the public Parquet reading APIs. It seems we already take a decryption_config for Parquet datasets, but this is adding decryption_properties.

also cc @jorisvandenbossche

Copy link
Member

@wgtmac wgtmac left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@pitrou I'm not that familiar with the python code base. AFAIK, ParquetDecryptionConfig is used by parquet dataset reader to create FileDecryptionProperties for each file like this:

if (parquet_decrypt_config != nullptr) {
auto file_decryption_prop =
parquet_decrypt_config->crypto_factory->GetFileDecryptionProperties(
*parquet_decrypt_config->kms_connection_config,
*parquet_decrypt_config->decryption_config, path, filesystem);

python/pyarrow/parquet/core.py Outdated Show resolved Hide resolved
python/pyarrow/parquet/core.py Show resolved Hide resolved
@github-actions github-actions bot added awaiting change review Awaiting change review awaiting changes Awaiting changes and removed awaiting change review Awaiting change review labels May 8, 2024
@github-actions github-actions bot added awaiting change review Awaiting change review awaiting changes Awaiting changes and removed awaiting changes Awaiting changes awaiting change review Awaiting change review labels May 8, 2024
@rok
Copy link
Member

rok commented May 8, 2024

@pitrou could you do another pass here please?

Copy link
Member

@AlenkaF AlenkaF left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM
Thank you for working on this!

@rok rok requested a review from wgtmac May 9, 2024 13:22
@rok
Copy link
Member

rok commented May 9, 2024

@wgtmac could you please do a quick sanity check over the decryption_config / decryption_properties part?

Copy link
Member

@wgtmac wgtmac left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

+1

@@ -715,6 +714,9 @@ cdef class ParquetFragmentScanOptions(FragmentScanOptions):
decryption_config : pyarrow.dataset.ParquetDecryptionConfig, default None
If not None, use the provided ParquetDecryptionConfig to decrypt the
Parquet file.
decryption_properties : pyarrow.dataset.FileDecryptionProperties, default None
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is this correct? I only see C++ class FileDecryptionProperties under parquet namespace.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nice catch! Changed to:

  decryption_properties : pyarrow.parquet.FileDecryptionProperties, default None

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'll wait for the CI and merge.

@github-actions github-actions bot added awaiting change review Awaiting change review awaiting changes Awaiting changes and removed awaiting changes Awaiting changes awaiting change review Awaiting change review labels May 9, 2024
tmct and others added 3 commits May 9, 2024 20:48
Refactor Dataset Loading to Utilize ds.dataset for Single File Handling

Changed the dataset loading logic to directly use ds.dataset for both single files and directories. Removed the conditional check for single_file and the manual creation of FileSystemDataset from a single fragment.

Rationale:
- The updated approach leverages the flexibility of ds.dataset, which, according to its documentation, can handle a single file scenario. As per the docstring: "Path pointing to a single file: Open a FileSystemDataset from a single file."
- This simplification leads to cleaner and more maintainable code, as ds.dataset internally manages the distinctions between a single file and a directory, removing the need for explicit conditional checks and fragment creation.

This refactor enhances code clarity and relies on the robustness of PyArrow's dataset handling, making the code more streamlined.

Commit Summary:
- Removed `decryption_properties` from `scan_args` to avoid passing it to `ParquetFragmentScanOptions`.
- Added `set_file_decryption_properties` function to apply decryption properties directly to `default_fragment_scan_options`.
- Applied decryption properties using the new function, ensuring compatibility with `ParquetFragmentScanOptions`.

Rationale:
- Eliminating `decryption_properties` from `scan_args` prevents errors since `ParquetFragmentScanOptions` does not support this parameter.  Its set through the CReaderProperties property
- Introducing `set_file_decryption_properties` provides a clear and focused method for handling decryption settings, by updating the CReaderProperties

Add test for writing and reading encrypted Parquet files with parquet.read_table

I change the name of test_encrypted_parquet_write_read_table.
Updated some formatting.
moved name from single_file back to buffer_reader.

 updated some of the tests in test_encryption to use some utility functions to help with the repetitiveness

updates to test_encryption.py to try and re-use common code through utility functions

Changed the variable buffer_reader back to single_file
@rok rok merged commit bd44410 into apache:main May 9, 2024
10 of 12 checks passed
@rok rok removed the awaiting change review Awaiting change review label May 9, 2024
Copy link

After merging your PR, Conbench analyzed the 0 benchmarking runs that have been run so far on merge-commit bd44410.

None of the specified runs were found on the Conbench server.

The full Conbench report has more details.

vibhatha pushed a commit to vibhatha/arrow that referenced this pull request May 25, 2024
…#39438)

### Rationale for this change

Currently, if you try to read a decrypted parquet with read_table, passing decryption_properties - in the happy path (pyarrow.data available for import) the reading/decryption of the file fails, as the decryption properties are missing.

### What changes are included in this PR?

Pass through the argument that was intended to have been passed.

### Are these changes tested?

We have tested this locally on an encrypted parquet dataset - please advise on any further testing you would like beyond that and the standard CI.

### Are there any user-facing changes?

Not in any cases where their code was previously working? The intended behaviour for encrypted dataset decryption should start working.

* Closes: apache#39645

Lead-authored-by: Tom McTiernan <[email protected]>
Co-authored-by: Don <[email protected]>
Co-authored-by: Rok Mihevc <[email protected]>
Signed-off-by: Rok Mihevc <[email protected]>
JerAguilon pushed a commit to JerAguilon/arrow that referenced this pull request May 29, 2024
…#39438)

### Rationale for this change

Currently, if you try to read a decrypted parquet with read_table, passing decryption_properties - in the happy path (pyarrow.data available for import) the reading/decryption of the file fails, as the decryption properties are missing.

### What changes are included in this PR?

Pass through the argument that was intended to have been passed.

### Are these changes tested?

We have tested this locally on an encrypted parquet dataset - please advise on any further testing you would like beyond that and the standard CI.

### Are there any user-facing changes?

Not in any cases where their code was previously working? The intended behaviour for encrypted dataset decryption should start working.

* Closes: apache#39645

Lead-authored-by: Tom McTiernan <[email protected]>
Co-authored-by: Don <[email protected]>
Co-authored-by: Rok Mihevc <[email protected]>
Signed-off-by: Rok Mihevc <[email protected]>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

[Python] Issue with Decryption Properties in read_table for Encrypted Parquet Files
9 participants