Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[BUG] Fail to convert the data to arrow format when there is a child column of struct type. #7570

Closed
firestarman opened this issue Mar 11, 2021 · 7 comments · Fixed by #7598
Labels
bug Something isn't working

Comments

@firestarman
Copy link
Contributor

firestarman commented Mar 11, 2021

Describe the bug
The cudf is complaining the error as below when writing a child column of struct type to an arrow table.

Caused by: ai.rapids.cudf.CudfException: cuDF failure at: /home/liangcail/work/projects/on_github/cudf/cpp/src/interop/to_arrow.cpp:207: Number of field names and number of children doesn't match

	at ai.rapids.cudf.Table.convertCudfToArrowTable(Native Method)
	at ai.rapids.cudf.Table.access$1500(Table.java:46)
	at ai.rapids.cudf.Table$ArrowIPCTableWriter.write(Table.java:1010)

Steps/Code to reproduce bug
Apply this patch and run the test testArrowIPCWriteToBufferChunked .

cd <cudf_root>/java
mvn test -Dtest=ai.rapids.cudf.TableTest#testArrowIPCWriteToBufferChunked

Expected behavior
The test applied the patch above should pass.

@firestarman firestarman added bug Something isn't working Needs Triage Need team to review and classify labels Mar 11, 2021
@firestarman firestarman changed the title [BUG] Fail to convert the struct data to arrow format when there is a child column of struct type. [BUG] Fail to convert the data to arrow format when there is a child column of struct type. Mar 11, 2021
@firestarman
Copy link
Contributor Author

firestarman commented Mar 11, 2021

It the below a reasonable fix (It works)? Since I see there is a simliar check here https://github.com/rapidsai/cudf/blob/branch-0.19/cpp/src/interop/to_arrow.cpp#L248 for list_view .

diff --git a/cpp/src/interop/to_arrow.cpp b/cpp/src/interop/to_arrow.cpp
index 7daffc1a3c..bf269ebfe1 100644
--- a/cpp/src/interop/to_arrow.cpp
+++ b/cpp/src/interop/to_arrow.cpp
@@ -202,20 +202,22 @@ std::shared_ptr<arrow::Array> dispatch_to_arrow::operator()<cudf::struct_view>(
   arrow::MemoryPool* ar_mr,
   rmm::cuda_stream_view stream)
 {
-  CUDF_EXPECTS(metadata.children_meta.size() == static_cast<std::size_t>(input.num_children()),
+  auto children_meta =
+    metadata.children_meta.empty() ? std::vector<column_metadata>(input.num_children()) : metadata.children_meta;
+  CUDF_EXPECTS(children_meta.size() == static_cast<std::size_t>(input.num_children()),
                "Number of field names and number of children doesn't match\n");
   std::unique_ptr<column> tmp_column = nullptr;
 
   if (input.offset() != 0) { tmp_column = std::make_unique<cudf::column>(input); }
 
   column_view input_view = (tmp_column != nullptr) ? tmp_column->view() : input;
-  auto child_arrays      = fetch_child_array(input_view, metadata.children_meta, ar_mr, stream);
+  auto child_arrays      = fetch_child_array(input_view, children_meta, ar_mr, stream);
   auto mask              = fetch_mask_buffer(input_view, ar_mr, stream);
 
   std::vector<std::shared_ptr<arrow::Field>> fields;
   std::transform(child_arrays.cbegin(),
                  child_arrays.cend(),
-                 metadata.children_meta.cbegin(),
+                 children_meta.cbegin(),
                  std::back_inserter(fields),
                  [](auto const array, auto const meta) {

@jrhemstad
Copy link
Contributor

@rgsl888prabhu I believe you implemented this originally. Could you take a look?

@rgsl888prabhu
Copy link
Contributor

rgsl888prabhu commented Mar 11, 2021

@firestarman When you try to read the struct column written with the patch you have provided, what are the names of children/struct members. Would it be empty ?

@firestarman
Copy link
Contributor Author

firestarman commented Mar 12, 2021

@firestarman When you try to read the struct column written with the patch you have provided, what are the names of children/struct members. Would it be empty ?

@rgsl888prabhu Yes, i believe it is empty here. Are the names necessary for this operation ? Or could it be empty ? If names are required, we may need to fix this from JNI side.

@firestarman
Copy link
Contributor Author

firestarman commented Mar 15, 2021

I add the support in JNI to carry the column names onto native by this change, however seems native to_arrow does NOT support array of struct?

@rgsl888prabhu Could you help confirm whether to_arrow support the column with array of struct type ?
If not, shall we going to support it ?

@rgsl888prabhu
Copy link
Contributor

to_arrow supports array of structs, here are some tests that were added for structs

TEST_F(ToArrowTest, StructColumn)
.

For structs, it is better to have name for members which is required for accessing the element, else it would be same as list of list.

What error are you getting currently with the changes you mentioned ?

@firestarman
Copy link
Contributor Author

Thanks, i made the PR #7598 to close this issue.

rapids-bot bot pushed a commit that referenced this issue Mar 20, 2021
…#7598)

This PR is to add the support of building the structure of column metadata from the flattened column names according to the table schema.
Since the children column metadata is required when converting cudf tables to arrow tables.

Also updating the related unit tests.

closes #7570

Signed-off-by: Firestarman <[email protected]>

Authors:
  - Liangcai Li (@firestarman)

Approvers:
  - Jason Lowe (@jlowe)

URL: #7598
@bdice bdice removed the Needs Triage Need team to review and classify label Mar 4, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
bug Something isn't working
Projects
None yet
Development

Successfully merging a pull request may close this issue.

4 participants