[FEA][Java] ArrowIPCTableWriter writes an empty batch in the case of an empty table.
#11882
Labels
feature request
Is your feature request related to a problem? Please describe.
Currently the ArrowIPCTableWriter writes no batches into the stream when given an empty table. PySpark cannot handle this correctly and fails with an error.
PySpark calls pyarrow.Table.from_batches without specifying a schema, so it expects to receive at least one batch (even an empty one) from which to infer the schema.
I have written a unit test to reproduce this case.
Describe the solution you'd like
ArrowIPCTableWriter writes an empty batch explicitly in the case of an empty table. We can do this easily in the JNI layer.
Describe alternatives you've considered
Let the Arrow C++ IPC writer used by the cuDF JNI support writing an empty batch for an empty table. I tried this but failed.
For more details, please refer to https://issues.apache.org/jira/browse/ARROW-17912. By the way, the Arrow Java IPC writer already does this, that is, it sends out an empty batch implicitly.
We could also update PySpark to specify the schema, but it is not clear how long that change would take to land.
So in the short term, we can do it in the cuDF JNI.