Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

JNI: Pass names of children struct columns to native Arrow IPC writer [skip ci] #7598

Merged
merged 16 commits into from
Mar 20, 2021

Conversation

firestarman
Copy link
Contributor

@firestarman firestarman commented Mar 15, 2021

This PR is to add the support of building the structure of column metadata from the flattened column names according to the table schema.
Since the children column metadata is required when converting cudf tables to arrow tables.

Also updating the related unit tests.

closes #7570

Signed-off-by: Firestarman [email protected]

Pass the names of children struct columns to the naitve for arrow
IPC writer, which is required to build column_metadata.

Also add the related unit tests.

Signed-off-by: Firestarman <[email protected]>
@github-actions github-actions bot added the Java Affects Java cuDF API. label Mar 15, 2021
@firestarman firestarman changed the title JNI: Pass the names of children struct columns down to the native Arrow IPC writer. JNI: Pass the names of children struct columns down to the native Arrow IPC writer [skip ci]. Mar 15, 2021
@codecov
Copy link

codecov bot commented Mar 15, 2021

Codecov Report

Merging #7598 (16fa512) into branch-0.19 (7871e7a) will increase coverage by 0.52%.
The diff coverage is 93.22%.

❗ Current head 16fa512 differs from pull request most recent head 54395f0. Consider uploading reports for the commit 54395f0 to get more accurate results
Impacted file tree graph

@@               Coverage Diff               @@
##           branch-0.19    #7598      +/-   ##
===============================================
+ Coverage        81.86%   82.38%   +0.52%     
===============================================
  Files              101      101              
  Lines            16884    17350     +466     
===============================================
+ Hits             13822    14294     +472     
+ Misses            3062     3056       -6     
Impacted Files Coverage Δ
python/cudf/cudf/core/index.py 93.34% <ø> (+0.48%) ⬆️
python/cudf/cudf/core/column/column.py 87.83% <83.33%> (+0.07%) ⬆️
python/cudf/cudf/core/column/numerical.py 94.85% <85.71%> (-0.17%) ⬇️
python/cudf/cudf/core/frame.py 89.12% <89.47%> (+0.10%) ⬆️
python/cudf/cudf/core/column/decimal.py 92.75% <90.32%> (-2.12%) ⬇️
python/cudf/cudf/core/dataframe.py 90.58% <95.00%> (+0.11%) ⬆️
python/cudf/cudf/core/series.py 91.57% <95.55%> (+0.78%) ⬆️
python/cudf/cudf/core/column/string.py 86.76% <100.00%> (+0.26%) ⬆️
python/cudf/cudf/core/column_accessor.py 95.45% <100.00%> (+0.14%) ⬆️
python/cudf/cudf/core/dtypes.py 91.13% <100.00%> (+1.40%) ⬆️
... and 55 more

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update 5fea6ad...54395f0. Read the comment docs.

This is required by native.

Signed-off-by: Firestarman <[email protected]>
@firestarman firestarman added the improvement Improvement / enhancement to an existing function label Mar 16, 2021
@firestarman firestarman marked this pull request as ready for review March 16, 2021 01:58
@firestarman firestarman requested a review from a team as a code owner March 16, 2021 01:58
@firestarman
Copy link
Contributor Author

rerun tests

@firestarman firestarman added the non-breaking Non-breaking change label Mar 16, 2021
@jlowe jlowe changed the title JNI: Pass the names of children struct columns down to the native Arrow IPC writer [skip ci]. JNI: Pass names of children struct columns to native Arrow IPC writer [skip ci] Mar 16, 2021
java/src/main/native/src/TableJni.cpp Outdated Show resolved Hide resolved
java/src/main/native/src/TableJni.cpp Outdated Show resolved Hide resolved
java/src/main/native/src/TableJni.cpp Outdated Show resolved Hide resolved
Do this to avoid callback callback into the JVM.

Signed-off-by: Firestarman <[email protected]>
@github-actions github-actions bot added the CMake CMake build issue label Mar 17, 2021
Signed-off-by: Firestarman <[email protected]>
Signed-off-by: Firestarman <[email protected]>
Copy link
Member

@jlowe jlowe left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It's not necessary to refactor this to the flattened name approach, but there are some resource leaks that are possible with the approach used here that are not with the flattened approach and should be fixed.

java/src/main/java/ai/rapids/cudf/ColumnMetadata.java Outdated Show resolved Hide resolved
java/src/main/native/src/ColumnMetadataJni.cpp Outdated Show resolved Hide resolved
java/src/main/native/src/ColumnMetadataJni.cpp Outdated Show resolved Hide resolved
java/src/main/native/src/TableJni.cpp Outdated Show resolved Hide resolved
java/src/main/native/src/TableJni.cpp Outdated Show resolved Hide resolved
@github-actions github-actions bot removed the CMake CMake build issue label Mar 18, 2021
@firestarman
Copy link
Contributor Author

firestarman commented Mar 18, 2021

It's not necessary to refactor this to the flattened name approach, but there are some resource leaks that are possible with the approach used here that are not with the flattened approach and should be fixed.

I updated to align with the flattened name approach, and it is a good suggestion, because it not only reduces the code change, but also hides some column metadata details (e.g. stub meta for list type) from Java.

@firestarman
Copy link
Contributor Author

rerun tests

Copy link
Member

@jlowe jlowe left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for updating @firestarman, this did get a lot cleaner overall. The main thing I see missing now is that the behavior of column names for nested types isn't documented in the Java APIs anywhere. If we're going the flattened names route for all writers then this should be documented on WriterBuilder, but if the flattening logic is only going to apply to Arrow IPC then its builder should override the withColumnNames method if only to provide documentation on the expected behavior.

cc: @revans2 for visibility

java/src/main/native/src/TableJni.cpp Outdated Show resolved Hide resolved
java/src/main/native/src/TableJni.cpp Outdated Show resolved Hide resolved
java/src/main/native/src/TableJni.cpp Show resolved Hide resolved
java/src/main/native/src/TableJni.cpp Show resolved Hide resolved
Signed-off-by: Firestarman <[email protected]>
@firestarman
Copy link
Contributor Author

firestarman commented Mar 19, 2021

Thanks for updating @firestarman, this did get a lot cleaner overall. The main thing I see missing now is that the behavior of column names for nested types isn't documented in the Java APIs anywhere. If we're going the flattened names route for all writers then this should be documented on WriterBuilder, but if the flattening logic is only going to apply to Arrow IPC then its builder should override the withColumnNames method if only to provide documentation on the expected behavior.

I think it is only for Arrow IPC now, so I updated its builder to override the two withXXXXNames for the documentation.

@firestarman
Copy link
Contributor Author

@gpucibot merge

@firestarman firestarman added the 5 - Ready to Merge Testing and reviews complete, ready to merge label Mar 20, 2021
@firestarman
Copy link
Contributor Author

@gpucibot merge

@rapids-bot rapids-bot bot merged commit cdd44d2 into rapidsai:branch-0.19 Mar 20, 2021
@firestarman firestarman deleted the arrow_ipc_column_metadata branch March 20, 2021 01:18
@firestarman
Copy link
Contributor Author

Thanks Jason, learnt a lot

firestarman added a commit to NVIDIA/spark-rapids that referenced this pull request Mar 24, 2021
This PR is to support running scalar pandas UDF with array type.

Add array type signature for related expressions and plans.
Flatten the names of nested struct columns from schema, which is also required by the cudf Arrow IPC writer.
This PR depends on rapidsai/cudf#7598

closes #1912

Signed-off-by: Firestarman <[email protected]>
nartal1 pushed a commit to nartal1/spark-rapids that referenced this pull request Jun 9, 2021
This PR is to support running scalar pandas UDF with array type.

Add array type signature for related expressions and plans.
Flatten the names of nested struct columns from schema, which is also required by the cudf Arrow IPC writer.
This PR depends on rapidsai/cudf#7598

closes NVIDIA#1912

Signed-off-by: Firestarman <[email protected]>
nartal1 pushed a commit to nartal1/spark-rapids that referenced this pull request Jun 9, 2021
This PR is to support running scalar pandas UDF with array type.

Add array type signature for related expressions and plans.
Flatten the names of nested struct columns from schema, which is also required by the cudf Arrow IPC writer.
This PR depends on rapidsai/cudf#7598

closes NVIDIA#1912

Signed-off-by: Firestarman <[email protected]>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
5 - Ready to Merge Testing and reviews complete, ready to merge improvement Improvement / enhancement to an existing function Java Affects Java cuDF API. non-breaking Non-breaking change
Projects
None yet
Development

Successfully merging this pull request may close these issues.

[BUG] Fail to convert the data to arrow format when there is a child column of struct type.
2 participants