GH-29238 [C++][Dataset][Parquet] Support parquet modular encryption in the new Dataset API #34616
Conversation
Great work! Remember to run clang-format on the C++ code.
Thanks for starting this @tolleybot. I did a first pass looking through it, and it seems like we could use a bit more testing, both in C++ and in Python.
On the C++ formatting: if you install clang-tools 14 (it must be that version) and set the path to the binaries, you can run the formatter locally. For Python formatting, see these instructions: https://arrow.apache.org/docs/developers/python.html#coding-style
@github-actions crossbow submit -g python
Revision: d7a6b55 Submitted crossbow builds: ursacomputing/crossbow @ actions-f652f7cf6d
@github-actions crossbow submit test-conda-python-3.10-pandas-latest
Revision: b323457 Submitted crossbow builds: ursacomputing/crossbow @ actions-6e2d0464d5
I updated the branch with the latest.
@github-actions crossbow submit -g python
Revision: ced2ed2 Submitted crossbow builds: ursacomputing/crossbow @ actions-6cee43f1c5
@github-actions crossbow submit verifypython*
Revision: ced2ed2 Submitted crossbow builds: ursacomputing/crossbow @ actions-379629a2a0
It seems that all triggered builds are green! Thanks @tolleybot and all others involved in getting this over the finish line.
Thanks, everyone, for all the reviews and work on this!
Thank you @tolleybot for contributing this in the first place :-)
After merging your PR, Conbench analyzed the 6 benchmarking runs that have been run so far on merge-commit 0793432. There were 7 benchmark results indicating a performance regression:
The full Conbench report has more details. It also includes information about 6 possible false positives for unstable benchmarks that are known to sometimes produce them.
FYI I checked the reports above and they are all flakes (the benchmark results are stable beyond this commit).
* Closes: apache#29238
Lead-authored-by: Don <[email protected]>
Co-authored-by: Donald Tolley <[email protected]>
Co-authored-by: Joris Van den Bossche <[email protected]>
Co-authored-by: anjakefala <[email protected]>
Co-authored-by: Sutou Kouhei <[email protected]>
Co-authored-by: Weston Pace <[email protected]>
Co-authored-by: Gang Wu <[email protected]>
Co-authored-by: scoder <[email protected]>
Co-authored-by: Will Jones <[email protected]>
Signed-off-by: Joris Van den Bossche <[email protected]>
Rationale for this change
The purpose of this pull request is to support modular encryption in the new Dataset API. See https://docs.google.com/document/d/13EysCNC6-Nu9wnJ8YpdzmD-aMLn4i2KXUJTNqIihy7A/edit# for the supporting design document.
What changes are included in this PR?
I made improvements to the C++ and Python code so the Dataset API supports per-file settings for each file it saves. Previously, the Dataset API applied the same encryption properties to all saved files; I've updated the code to allow for greater flexibility. In the Python code, I've added support for these changes by updating the ParquetFormat class to accept DatasetEncryptionConfiguration and DatasetDecryptionConfiguration structures. With these changes, you can pass the format object to the write_dataset function, giving you the ability to set unique encryption properties for each file in your Dataset.
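The per-file behavior described above can be sketched in plain Python. This is an illustrative mock only: the names `PerFileEncryptionConfig` and `encryption_properties_for` are hypothetical stand-ins, not the real pyarrow API. The sketch models the key idea that a fresh set of encryption properties is built for each file the dataset writes, instead of one global properties object being shared by all files.

```python
from dataclasses import dataclass, field


@dataclass
class PerFileEncryptionConfig:
    # Hypothetical stand-in for the encryption configuration described above.
    footer_key: str                                   # key id for the footer
    column_keys: dict = field(default_factory=dict)   # column name -> key id


def encryption_properties_for(path, config):
    # Build one properties dict per file, so each written file can be
    # encrypted with its own settings rather than a shared global object.
    return {
        "file": path,
        "footer_key": config.footer_key,
        "column_keys": dict(config.column_keys),
    }


config = PerFileEncryptionConfig("footer_key_id", {"ssn": "col_key_id"})
per_file = [encryption_properties_for(p, config)
            for p in ("part-0.parquet", "part-1.parquet")]
# Distinct objects per file: a writer could vary them file by file.
assert per_file[0] is not per_file[1]
assert per_file[0]["footer_key"] == "footer_key_id"
```

In the actual PR, this role is played by the configuration structures attached to the ParquetFormat object rather than a free function.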
Are these changes tested?
Yes, unit tests are included. I have also included a Python sample project.
Are there any user-facing changes?
Yes, as stated above, the ParquetFormat class has optional DatasetEncryptionConfiguration and DatasetDecryptionConfiguration parameters, exposed through setters and getters. The Dataset can now use these to set different encryption properties per file.
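The setter/getter surface described above can be mocked as follows. This is a hypothetical sketch of the shape only: `MockParquetFormat` and its property names are illustrative and do not match the real pyarrow signatures.

```python
class MockParquetFormat:
    # Hypothetical ParquetFormat-like object whose encryption and
    # decryption configuration are set and read through properties.
    def __init__(self, dataset_encryption_config=None,
                 dataset_decryption_config=None):
        self._enc = dataset_encryption_config
        self._dec = dataset_decryption_config

    @property
    def dataset_encryption_config(self):
        return self._enc

    @dataset_encryption_config.setter
    def dataset_encryption_config(self, value):
        self._enc = value

    @property
    def dataset_decryption_config(self):
        return self._dec

    @dataset_decryption_config.setter
    def dataset_decryption_config(self, value):
        self._dec = value


fmt = MockParquetFormat()
fmt.dataset_encryption_config = {"footer_key": "kf"}          # setter
assert fmt.dataset_encryption_config == {"footer_key": "kf"}  # getter
assert fmt.dataset_decryption_config is None                  # optional
# In the real API, a format object configured like this is then passed
# to write_dataset so each written file picks up its settings.
```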