JNI: Pass names of children struct columns to native Arrow IPC writer [skip ci] #7598

firestarman · 2021-03-15T07:57:54Z

This PR adds support for building the nested column metadata structure from the flattened column names, according to the table schema.
This is needed because the children column metadata is required when converting cudf tables to Arrow tables.

Also updating the related unit tests.

closes #7570

Signed-off-by: Firestarman [email protected]
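
To illustrate the idea, here is a minimal sketch of rebuilding nested column metadata from a flat name list by walking the schema depth-first. It is not the code in this PR: `Kind`, `SchemaNode`, `ColumnMeta`, and `FlatNamesToMeta` are hypothetical stand-ins. The rules that a list element does not consume a name and that a list carries an unnamed stub for its offsets child are assumptions drawn from the discussion in this thread.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical stand-ins for illustration; these are not cudf classes.
enum Kind { PRIMITIVE, STRUCT, LIST }

final class SchemaNode {
  final Kind kind;
  final List<SchemaNode> children; // struct fields, or the single list element

  SchemaNode(Kind kind, SchemaNode... children) {
    this.kind = kind;
    this.children = Arrays.asList(children);
  }
}

final class ColumnMeta {
  final String name; // empty for nodes that do not consume a name
  final List<ColumnMeta> children = new ArrayList<>();

  ColumnMeta(String name) { this.name = name; }
}

final class FlatNamesToMeta {
  private final String[] names;
  private int next = 0; // index of the next unconsumed flattened name

  private FlatNamesToMeta(String[] names) { this.names = names; }

  static List<ColumnMeta> build(List<SchemaNode> roots, String... flatNames) {
    FlatNamesToMeta builder = new FlatNamesToMeta(flatNames);
    List<ColumnMeta> out = new ArrayList<>();
    for (SchemaNode root : roots) {
      out.add(builder.buildOne(root, true)); // every root column consumes a name
    }
    if (builder.next != flatNames.length) {
      throw new IllegalArgumentException("Too many column names");
    }
    return out;
  }

  private ColumnMeta buildOne(SchemaNode node, boolean consumeName) {
    String name = "";
    if (consumeName) {
      if (next >= names.length) {
        throw new IllegalArgumentException("Not enough column names");
      }
      name = names[next++];
    }
    ColumnMeta meta = new ColumnMeta(name);
    if (node.kind == Kind.STRUCT) {
      // Each struct field consumes the next flattened name.
      for (SchemaNode field : node.children) {
        meta.children.add(buildOne(field, true));
      }
    } else if (node.kind == Kind.LIST) {
      // Assumed layout: child 0 is the offsets column, child 1 is the element column.
      meta.children.add(new ColumnMeta(""));                    // stub for offsets
      meta.children.add(buildOne(node.children.get(0), false)); // element has no name
    }
    return meta;
  }

  public static void main(String[] args) {
    List<SchemaNode> roots = Arrays.asList(
        new SchemaNode(Kind.PRIMITIVE),
        new SchemaNode(Kind.STRUCT, new SchemaNode(Kind.PRIMITIVE)));
    for (ColumnMeta meta : build(roots, "id", "person", "name")) {
      System.out.println(meta.name + " has " + meta.children.size() + " child(ren)");
    }
  }
}
```

Keeping a single cursor into the flat list means the caller only supplies names, while the nesting (including any unnamed list children) is derived from the schema.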

Pass the names of children struct columns to the native Arrow IPC writer, which is required to build column_metadata. Also add the related unit tests. Signed-off-by: Firestarman <[email protected]>

codecov · 2021-03-15T10:56:20Z

Codecov Report

Merging #7598 (16fa512) into branch-0.19 (7871e7a) will increase coverage by 0.52%.
The diff coverage is 93.22%.

❗ Current head 16fa512 differs from pull request most recent head 54395f0. Consider uploading reports for the commit 54395f0 to get more accurate results

@@               Coverage Diff               @@
##           branch-0.19    #7598      +/-   ##
===============================================
+ Coverage        81.86%   82.38%   +0.52%     
===============================================
  Files              101      101              
  Lines            16884    17350     +466     
===============================================
+ Hits             13822    14294     +472     
+ Misses            3062     3056       -6

Impacted Files	Coverage Δ
python/cudf/cudf/core/index.py	`93.34% <ø> (+0.48%)`	⬆️
python/cudf/cudf/core/column/column.py	`87.83% <83.33%> (+0.07%)`	⬆️
python/cudf/cudf/core/column/numerical.py	`94.85% <85.71%> (-0.17%)`	⬇️
python/cudf/cudf/core/frame.py	`89.12% <89.47%> (+0.10%)`	⬆️
python/cudf/cudf/core/column/decimal.py	`92.75% <90.32%> (-2.12%)`	⬇️
python/cudf/cudf/core/dataframe.py	`90.58% <95.00%> (+0.11%)`	⬆️
python/cudf/cudf/core/series.py	`91.57% <95.55%> (+0.78%)`	⬆️
python/cudf/cudf/core/column/string.py	`86.76% <100.00%> (+0.26%)`	⬆️
python/cudf/cudf/core/column_accessor.py	`95.45% <100.00%> (+0.14%)`	⬆️
python/cudf/cudf/core/dtypes.py	`91.13% <100.00%> (+1.40%)`	⬆️
... and 55 more

Δ = absolute <relative> (impact), ø = not affected, ? = missing data

firestarman · 2021-03-16T02:00:46Z

rerun tests

java/src/main/native/src/TableJni.cpp

jlowe

It's not necessary to refactor this to the flattened name approach, but there are some resource leaks that are possible with the approach used here that are not with the flattened approach and should be fixed.

java/src/main/java/ai/rapids/cudf/ColumnMetadata.java

java/src/main/native/src/ColumnMetadataJni.cpp

java/src/main/native/src/TableJni.cpp

firestarman · 2021-03-18T07:43:47Z

> It's not necessary to refactor this to the flattened name approach, but there are some resource leaks that are possible with the approach used here that are not with the flattened approach and should be fixed.

I updated the PR to align with the flattened name approach. It is a good suggestion: it not only reduces the code change, but also hides some column metadata details (e.g. the stub metadata for the list type) from Java.

firestarman · 2021-03-18T07:43:57Z

rerun tests

jlowe

Thanks for updating @firestarman, this did get a lot cleaner overall. The main thing I see missing now is that the behavior of column names for nested types isn't documented in the Java APIs anywhere. If we're going the flattened names route for all writers then this should be documented on WriterBuilder, but if the flattening logic is only going to apply to Arrow IPC then its builder should override the withColumnNames method if only to provide documentation on the expected behavior.

cc: @revans2 for visibility

java/src/main/native/src/TableJni.cpp

firestarman · 2021-03-19T02:19:03Z

> Thanks for updating @firestarman, this did get a lot cleaner overall. The main thing I see missing now is that the behavior of column names for nested types isn't documented in the Java APIs anywhere. If we're going the flattened names route for all writers then this should be documented on WriterBuilder, but if the flattening logic is only going to apply to Arrow IPC then its builder should override the withColumnNames method if only to provide documentation on the expected behavior.

I think the flattening applies only to Arrow IPC for now, so I updated its builder to override the two withXXXXNames methods in order to document the expected behavior.
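
For readers unfamiliar with the pattern, overriding a builder method purely to attach format-specific documentation might look like the sketch below. `SketchWriterBuilder` and `SketchArrowIpcBuilder` are hypothetical, simplified stand-ins and not the actual cudf Java classes; only one of the two name methods is shown, and the Javadoc wording is illustrative.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical, simplified stand-ins; not the actual cudf Java classes.
abstract class SketchWriterBuilder<T extends SketchWriterBuilder<T>> {
  final List<String> columnNames = new ArrayList<>();

  /** Add the names to be used for the columns that are written. */
  @SuppressWarnings("unchecked")
  public T withColumnNames(String... names) {
    columnNames.addAll(Arrays.asList(names));
    return (T) this;
  }
}

final class SketchArrowIpcBuilder extends SketchWriterBuilder<SketchArrowIpcBuilder> {
  /**
   * Add the names to be used for the columns that are written.
   *
   * For nested types the names are expected in flattened, depth-first order:
   * a struct column's name is followed by the names of its children.
   * (Illustrative wording; the exact contract is defined by the real API.)
   */
  @Override
  public SketchArrowIpcBuilder withColumnNames(String... names) {
    // Behavior is unchanged; the override exists to carry the documentation.
    return super.withColumnNames(names);
  }

  public static void main(String[] args) {
    SketchArrowIpcBuilder builder =
        new SketchArrowIpcBuilder().withColumnNames("id", "person", "name");
    System.out.println(builder.columnNames);
  }
}
```

The override delegates straight to the parent, so other writers keep the generic documentation while the Arrow IPC builder states the flattened-name expectation where users will look for it.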

firestarman · 2021-03-20T01:13:40Z

@gpucibot merge

firestarman · 2021-03-20T01:18:37Z

@gpucibot merge

firestarman · 2021-03-20T03:50:25Z

Thanks Jason, learnt a lot

This PR supports running scalar pandas UDFs with array types. It adds the array type signature for the related expressions and plans, and flattens the names of nested struct columns from the schema, which is also required by the cudf Arrow IPC writer. This PR depends on rapidsai/cudf#7598. Closes #1912. Signed-off-by: Firestarman <[email protected]>
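
As a rough illustration of the caller side, flattening names from a schema can be sketched as a depth-first walk. The types below (`FieldType`, `PrimitiveType`, `ListType`, `StructType`, `NamedField`, `SchemaNameFlattener`) are hypothetical and are not the Spark or cudf classes; skipping a name for list elements is an assumption based on the commit "Only root columns and nested struct columns consume names".

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical schema model for illustration only.
abstract class FieldType {}

final class PrimitiveType extends FieldType {}

final class ListType extends FieldType {
  final FieldType element; // list elements carry no name of their own
  ListType(FieldType element) { this.element = element; }
}

final class StructType extends FieldType {
  final List<NamedField> fields;
  StructType(NamedField... fields) { this.fields = Arrays.asList(fields); }
}

final class NamedField {
  final String name;
  final FieldType type;
  NamedField(String name, FieldType type) {
    this.name = name;
    this.type = type;
  }
}

final class SchemaNameFlattener {
  /** Returns root and struct-field names in depth-first order. */
  static List<String> flatten(List<NamedField> roots) {
    List<String> out = new ArrayList<>();
    for (NamedField root : roots) {
      out.add(root.name);
      collect(root.type, out);
    }
    return out;
  }

  private static void collect(FieldType type, List<String> out) {
    if (type instanceof StructType) {
      for (NamedField field : ((StructType) type).fields) {
        out.add(field.name);     // struct fields contribute their names
        collect(field.type, out);
      }
    } else if (type instanceof ListType) {
      // Assumption: the element itself adds no name, but structs inside it do.
      collect(((ListType) type).element, out);
    }
  }

  public static void main(String[] args) {
    List<NamedField> roots = Arrays.asList(
        new NamedField("id", new PrimitiveType()),
        new NamedField("person", new StructType(
            new NamedField("name", new PrimitiveType()))));
    System.out.println(flatten(roots));
  }
}
```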

Pass names of children struct columns to native

1c8d676

Pass the names of children struct columns to the native Arrow IPC writer, which is required to build column_metadata. Also add the related unit tests. Signed-off-by: Firestarman <[email protected]>

github-actions bot added the Java Affects Java cuDF API. label Mar 15, 2021

firestarman changed the title ~~JNI: Pass the names of children struct columns down to the native Arrow IPC writer.~~ JNI: Pass the names of children struct columns down to the native Arrow IPC writer [skip ci]. Mar 15, 2021

Add a metadata for offset column of array type.

6adafda

This is required by native. Signed-off-by: Firestarman <[email protected]>

firestarman added the improvement Improvement / enhancement to an existing function label Mar 16, 2021

firestarman mentioned this pull request Mar 16, 2021

[BUG] Fail to convert the data to arrow format when there is a child column of struct type. #7570

Closed

firestarman marked this pull request as ready for review March 16, 2021 01:58

firestarman requested a review from a team as a code owner March 16, 2021 01:58

firestarman added the non-breaking Non-breaking change label Mar 16, 2021

firestarman mentioned this pull request Mar 16, 2021

Support running scalar pandas UDF with array type. NVIDIA/spark-rapids#1944

Merged

jlowe changed the title ~~JNI: Pass the names of children struct columns down to the native Arrow IPC writer [skip ci].~~ JNI: Pass names of children struct columns to native Arrow IPC writer [skip ci] Mar 16, 2021

jlowe reviewed Mar 16, 2021

java/src/main/native/src/TableJni.cpp (outdated, resolved)

java/src/main/native/src/TableJni.cpp (outdated, resolved)

java/src/main/native/src/TableJni.cpp (outdated, resolved)

Use the regular JNI call to create the c++ column meta

44fa204

Do this to avoid calling back into the JVM. Signed-off-by: Firestarman <[email protected]>

github-actions bot added the CMake CMake build issue label Mar 17, 2021

firestarman added 2 commits March 17, 2021 16:16

Correct the year

a767e70

Signed-off-by: Firestarman <[email protected]>

comment update

d83d0e9

Signed-off-by: Firestarman <[email protected]>

jlowe requested changes Mar 17, 2021

Align with the flattened name approach.

4c2eeef

Signed-off-by: Firestarman <[email protected]>

github-actions bot removed the CMake CMake build issue label Mar 18, 2021

Return a reference from get_column_name

fbbbe5b

Signed-off-by: Firestarman <[email protected]>

firestarman added 4 commits March 18, 2021 15:55

Update the function signature

bc33abc

Signed-off-by: Firestarman <[email protected]>

correct the child index of array type column

eb39f16

Signed-off-by: Firestarman <[email protected]>

Correct the index for list type

55a1e17

Signed-off-by: Firestarman <[email protected]>

Comment update

07d09e5

Signed-off-by: Firestarman <[email protected]>

Remove unexpected rmm log file

87d403a

Signed-off-by: Firestarman <[email protected]>

jlowe reviewed Mar 18, 2021

java/src/main/native/src/TableJni.cpp (outdated, resolved)

java/src/main/native/src/TableJni.cpp (outdated, resolved)

java/src/main/native/src/TableJni.cpp (resolved)

java/src/main/native/src/TableJni.cpp (resolved)

Address some comments

1f21c3a

Signed-off-by: Firestarman <[email protected]>

firestarman added 3 commits March 19, 2021 10:28

error message update

420f524

Signed-off-by: Firestarman <[email protected]>

Only root columns and nested struct columns consume names

b8966a0

Signed-off-by: Firestarman <[email protected]>

error message update

54395f0

Signed-off-by: Firestarman <[email protected]>

jlowe approved these changes Mar 19, 2021

firestarman added the 5 - Ready to Merge Testing and reviews complete, ready to merge label Mar 20, 2021

rapids-bot bot merged commit cdd44d2 into rapidsai:branch-0.19 Mar 20, 2021

firestarman deleted the arrow_ipc_column_metadata branch March 20, 2021 01:18
