Add column field ID control in parquet writer #10504

Merged: 20 commits, Apr 15, 2022

@PointKernel PointKernel (Member) commented Mar 24, 2022

Closes #10375
Closes #10376

This PR enables column field_id control in the Parquet writer. When writing a Parquet file, users can specify a column's field_id via column_in_metadata.set_parquet_field_id(). JNI bindings and unit tests are added as well.
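
As a rough illustration of the idea, the following self-contained Java sketch models per-column field IDs on a nested metadata tree. `ColumnMetaSketch` and its methods are hypothetical stand-ins written for this example; they are not libcudf's `column_in_metadata` or the cuDF JNI classes. The sketch also assumes that a column whose field ID was never set simply omits it from the schema.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.OptionalInt;

// Hypothetical stand-in for per-column writer metadata; not the cuDF API.
final class ColumnMetaSketch {
    private final String name;
    private OptionalInt fieldId = OptionalInt.empty(); // unset by default
    private final List<ColumnMetaSketch> children = new ArrayList<>();

    ColumnMetaSketch(String name) {
        this.name = name;
    }

    // Assign a field ID to this column; returns this for chaining.
    ColumnMetaSketch setParquetFieldId(int id) {
        this.fieldId = OptionalInt.of(id);
        return this;
    }

    // Attach a child column (for example a struct member); returns this for chaining.
    ColumnMetaSketch addChild(ColumnMetaSketch child) {
        children.add(child);
        return this;
    }

    // Render the schema depth-first, printing field_id only when it was set.
    String describe() {
        StringBuilder sb = new StringBuilder();
        describe(sb, 0);
        return sb.toString();
    }

    private void describe(StringBuilder sb, int depth) {
        for (int i = 0; i < depth; i++) {
            sb.append("  ");
        }
        sb.append(name);
        if (fieldId.isPresent()) {
            sb.append(" field_id=").append(fieldId.getAsInt());
        }
        sb.append('\n');
        for (ColumnMetaSketch child : children) {
            child.describe(sb, depth + 1);
        }
    }
}
```

Keeping the ID in an `OptionalInt` makes the "not set" state explicit, so code that reads it has to check for presence before using the value.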

@PointKernel PointKernel added the feature request, 2 - In Progress, libcudf, cuIO, and non-breaking labels Mar 24, 2022
@PointKernel PointKernel self-assigned this Mar 24, 2022
@PointKernel PointKernel marked this pull request as ready for review March 29, 2022 21:59
@PointKernel PointKernel requested a review from a team as a code owner March 29, 2022 21:59
@PointKernel PointKernel changed the title Add schema field ID control in parquet writer Add column field ID control in parquet writer Mar 29, 2022

codecov bot commented Mar 29, 2022

Codecov Report

Merging #10504 (64427a4) into branch-22.06 (3c13ef1) will increase coverage by 0.03%.
The diff coverage is n/a.

@@               Coverage Diff                @@
##           branch-22.06   #10504      +/-   ##
================================================
+ Coverage         86.33%   86.37%   +0.03%     
================================================
  Files               140      142       +2     
  Lines             22289    22356      +67     
================================================
+ Hits              19244    19310      +66     
- Misses             3045     3046       +1     
| Impacted Files | Coverage Δ |
|---|---|
| python/cudf/cudf/core/frame.py | 93.67% <0.00%> (-1.09%) ⬇️ |
| python/dask_cudf/dask_cudf/tests/test_binops.py | 92.00% <0.00%> (-0.60%) ⬇️ |
| python/dask_cudf/dask_cudf/core.py | 73.36% <0.00%> (-0.27%) ⬇️ |
| python/cudf/cudf/core/cut.py | 82.69% <0.00%> (ø) |
| python/cudf/cudf/core/series.py | 95.28% <0.00%> (ø) |
| python/cudf/cudf/utils/ioutils.py | 79.60% <0.00%> (ø) |
| python/cudf/cudf/core/indexed_frame.py | 91.77% <0.00%> (ø) |
| python/dask_cudf/dask_cudf/tests/utils.py | 90.90% <0.00%> (ø) |
| python/dask_cudf/dask_cudf/tests/test_applymap.py | 100.00% <0.00%> (ø) |
| python/cudf/cudf/core/single_column_frame.py | 96.52% <0.00%> (+0.07%) ⬆️ |

... and 7 more

Δ = absolute <relative> (impact), ø = not affected, ? = missing data

Contributor

@devavret devavret left a comment

Looks fine. Apart from @jlowe's concerns, this should be good to go

cpp/src/io/parquet/writer_impl.cu (outdated, resolved)
cpp/src/io/parquet/parquet.hpp (resolved)
jlowe previously approved these changes Mar 31, 2022
@jlowe jlowe dismissed their stale review March 31, 2022 13:05

Looks like the Java build is failing on a chunked write with "Optional has no value". Is there a place that was missed for handling chunked writes?

@PointKernel PointKernel added the 3 - Ready for Review label and removed the 2 - In Progress label Mar 31, 2022
@PointKernel
Member Author

@res-life Thanks! I just gave you write access to my repo. Can you please add the JNI bindings and tests to this PR?

@github-actions github-actions bot added the Java label Apr 13, 2022
Signed-off-by: Chong Gao <[email protected]>
@PointKernel PointKernel requested a review from a team as a code owner April 13, 2022 09:52
Signed-off-by: Chong Gao <[email protected]>
@res-life
Contributor

@jlowe, please help review the JNI part.

Contributor

@devavret devavret left a comment

Almost there!

cpp/tests/io/parquet_test.cpp (outdated, resolved)
@PointKernel PointKernel requested a review from devavret April 13, 2022 15:44
* Set a simple child meta data
* @return this for chaining.
*/
public T withColumns(boolean nullable, String name, int parquetFieldId) {
Member

This should be withColumn since it's only adding a single column.
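
The naming point can be sketched with a small self-typed builder. `MetaBuilderSketch`, `Column`, and `WriterOptionsBuilderSketch` are hypothetical names invented for this example and are not the cuDF Java classes; the sketch only shows a singular `withColumn` that adds exactly one column with a field ID and returns the builder for chaining.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical sketch of a self-typed builder; not the cuDF Java API.
abstract class MetaBuilderSketch<T extends MetaBuilderSketch<T>> {
    // Simple value holder for one column's settings.
    static final class Column {
        final boolean nullable;
        final String name;
        final int parquetFieldId;

        Column(boolean nullable, String name, int parquetFieldId) {
            this.nullable = nullable;
            this.name = name;
            this.parquetFieldId = parquetFieldId;
        }
    }

    protected final List<Column> columns = new ArrayList<>();

    // Returns the concrete builder type so subclass methods stay chainable.
    protected abstract T self();

    // Singular name: each call adds exactly one column.
    public T withColumn(boolean nullable, String name, int parquetFieldId) {
        columns.add(new Column(nullable, name, parquetFieldId));
        return self();
    }

    public List<Column> columns() {
        return columns;
    }
}

// Concrete builder used only to make the sketch runnable.
final class WriterOptionsBuilderSketch extends MetaBuilderSketch<WriterOptionsBuilderSketch> {
    @Override
    protected WriterOptionsBuilderSketch self() {
        return this;
    }
}
```

With this shape, a call such as `new WriterOptionsBuilderSketch().withColumn(true, "id", 1).withColumn(false, "name", 2)` reads as two single-column additions, which is the behavior the singular name describes.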

@PointKernel PointKernel requested a review from jlowe April 14, 2022 12:44
@PointKernel
Member Author

@gpucibot merge

@rapids-bot rapids-bot bot merged commit d5a982b into rapidsai:branch-22.06 Apr 15, 2022
@PointKernel PointKernel deleted the parquet-field-id-writing branch May 26, 2022 17:43
@vyasr vyasr added the 4 - Needs Review label and removed the 4 - Needs cuDF (Java) Reviewer label Feb 23, 2024
Labels: 3 - Ready for Review, 4 - Needs Review, cuIO, feature request, Java, libcudf, non-breaking
Projects: None yet
6 participants