
[BUG] distributed CI failed in test_read_case_col_name #8403

Open
abellina opened this issue May 25, 2023 · 9 comments
Labels
bug Something isn't working cudf_dependency An issue or PR with this label depends on a new feature in cudf

Comments

@abellina
Collaborator

abellina commented May 25, 2023

I see the following failures:

```
14:12:15  FAILED ../../src/main/python/json_test.py::test_read_case_col_name[V0-][INJECT_OOM, IGNORE_ORDER]
14:12:15  FAILED ../../src/main/python/json_test.py::test_read_case_col_name[V0-json][IGNORE_ORDER]
14:12:15  FAILED ../../src/main/python/json_test.py::test_read_case_col_name[v0-][INJECT_OOM, IGNORE_ORDER]
14:12:15  FAILED ../../src/main/python/json_test.py::test_read_case_col_name[v0-json][IGNORE_ORDER]
AssertionError: GPU and CPU are not both null at [99, 'v0']
```

These happened in the multi-threaded shuffle CI job, but I am seeing the same failure locally against a standalone Spark.

I also saw #8400.

@abellina abellina added bug Something isn't working ? - Needs Triage Need team to review and classify labels May 25, 2023
@abellina
Collaborator Author

This test was recently added here: #8371

@abellina
Collaborator Author

This repros against a standalone cluster without the multi-threaded shuffle.

@abellina abellina changed the title [BUG] Multi-threaded shuffle CI failed in test_read_case_col_name [BUG] distributed CI failed in test_read_case_col_name May 26, 2023
@abellina
Collaborator Author

Trying to come up with a smaller repro, but I ended up with this: #8411

@abellina
Collaborator Author

A related (but not complete) repro:

```scala
import org.apache.spark.sql.types._
val schema = StructType(Array(
  StructField("k0", LongType, false),
  StructField("k1", LongType, false),
  StructField("k2", LongType, false),
  StructField("k3", LongType, false),
  StructField("v0", LongType, true),
  StructField("v1", LongType, true),
  StructField("v2", LongType, true),
  StructField("v3", LongType, true)))
val d = spark.read.schema(schema).json("file:///tmp/json_test_repro")
d.show()
```

Where json_test_repro is a folder with a json file with contents:

```json
{"v1":5139590093827188858,"v2":-1404180295742992878}
```

And a partitioned directory structure like this:

```
./k0=1
./k0=1/k1=1
./k0=1/k1=1/k2=2
./k0=1/k1=1/k2=2/k3=3
./k0=1/k1=1/k2=2/k3=3/part-00014-48dd252f-0dec-4969-ac53-79d4cb8a8331.c000.json
```

Note that the schema passed to cuDF has columns v0 and v3 that don't exist in the JSON itself. The issue is that we get back a table from cuDF with 2 columns instead of 4 (after fixing #8411):

```
java.lang.ArrayIndexOutOfBoundsException: 2
        at ai.rapids.cudf.Table.getColumn(Table.java:124)
        at org.apache.spark.sql.catalyst.json.rapids.JsonPartitionReader.$anonfun$readToTable$4(GpuJsonScan.scala:326)
        at scala.collection.IndexedSeqOptimized.foreach(IndexedSeqOptimized.scala:36)
        at scala.collection.IndexedSeqOptimized.foreach$(IndexedSeqOptimized.scala:33)
        at scala.collection.mutable.ArrayOps$ofRef.foreach(ArrayOps.scala:198)
        at org.apache.spark.sql.catalyst.json.rapids.JsonPartitionReader.$anonfun$readToTable$3(GpuJsonScan.scala:319)
        at com.nvidia.spark.rapids.Arm$.withResource(Arm.scala:29)
```

Since the files can have any number of columns (if all rows are null, Spark can omit the column when writing), we need to make sure cuDF follows suit.
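To make the mismatch concrete, here is a minimal sketch in plain Java (the helper is hypothetical, not a cuDF API) of the reconciliation step that is missing: for each column the schema requests, find its index among the columns the reader actually produced, or mark it as absent so it can be filled with nulls instead of indexed directly.

```java
import java.util.Arrays;
import java.util.List;

public class ColumnReconcile {
    // For each requested column name, return its index among the columns the
    // reader actually produced, or -1 if the column is absent from the file
    // and must be materialized as all nulls.
    static int[] mapRequestedToRead(List<String> readCols, List<String> requested) {
        int[] mapping = new int[requested.size()];
        for (int i = 0; i < requested.size(); i++) {
            mapping[i] = readCols.indexOf(requested.get(i));
        }
        return mapping;
    }

    public static void main(String[] args) {
        // The repro file contains only v1 and v2; the schema asks for v0..v3.
        List<String> readCols = Arrays.asList("v1", "v2");
        List<String> requested = Arrays.asList("v0", "v1", "v2", "v3");
        System.out.println(Arrays.toString(mapRequestedToRead(readCols, requested)));
        // Indexing the 2-column table with positions 0..3 directly, without
        // this mapping, is what triggers the ArrayIndexOutOfBoundsException.
    }
}
```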

@abellina abellina added the cudf_dependency An issue or PR with this label depends on a new feature in cudf label May 26, 2023
@abellina
Collaborator Author

We need to file a cuDF bug for this.

My guess is that the integration test failure is an odd combination of contents of a source table.

If the file in question had the following contents:

```json
{"v0":0, "v1":1, "v2":2, "v3":3}
{"v1":5139590093827188858,"v2":-1404180295742992878}
```

We get back 4 columns and all seems well:

```
+----+-------------------+--------------------+----+---+---+---+---+
|  v0|                 v1|                  v2|  v3| k0| k1| k2| k3|
+----+-------------------+--------------------+----+---+---+---+---+
|   0|                  1|                   2|   3|  1|  1|  2|  3|
|null|5139590093827188858|-1404180295742992878|null|  1|  1|  2|  3|
+----+-------------------+--------------------+----+---+---+---+---+
```

The CPU consistently returns the above table with all columns and nulls where appropriate.

@revans2
Collaborator

revans2 commented May 30, 2023

Yes, this is a bug in our CUDF JNI layer.

https://github.com/rapidsai/cudf/blob/5e12c25ca40c41f21fc80d4cd36713c514fabaa4/java/src/main/native/src/TableJni.cpp#L1504-L1521

If we cannot find a column that matches the name, we return the original data unchanged. We really should be moving columns over to a new table and generating columns of nulls when we cannot find the column we want.
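A sketch of that approach in plain Java (the `Column` type here is a stand-in for a device column, not the cudf class): walk the requested names, move matching columns over to the new table, and synthesize an all-null column of the right row count for any name not found.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

public class NullFillReorder {
    // Stand-in for a device column: just a name and row values (null = null row).
    record Column(String name, List<Long> rows) {}

    static List<Column> reorderWithNulls(List<Column> read, List<String> wanted, int rowCount) {
        Map<String, Column> byName = new LinkedHashMap<>();
        for (Column c : read) byName.put(c.name(), c);
        List<Column> out = new ArrayList<>();
        for (String name : wanted) {
            Column c = byName.get(name);
            if (c != null) {
                out.add(c); // move the existing column over unchanged
            } else {
                // Column absent from the file: generate an all-null column.
                List<Long> nulls = new ArrayList<>();
                for (int i = 0; i < rowCount; i++) nulls.add(null);
                out.add(new Column(name, nulls));
            }
        }
        return out;
    }

    public static void main(String[] args) {
        // The repro file yields only v1 and v2; the schema wants v0..v3.
        List<Column> read = List.of(
            new Column("v1", List.of(5139590093827188858L)),
            new Column("v2", List.of(-1404180295742992878L)));
        List<Column> fixed = reorderWithNulls(read, List.of("v0", "v1", "v2", "v3"), 1);
        for (Column c : fixed) System.out.println(c.name() + " -> " + c.rows());
    }
}
```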

@revans2
Collaborator

revans2 commented May 30, 2023

From a performance/efficiency standpoint I think we want to modify where we reorder the columns. I think we want to return a TableWithMeta, and let the java code handle the column names. This is so we can avoid copying the table multiple times and possibly making multiple copies of the same columns.

@revans2
Collaborator

revans2 commented May 30, 2023

Crap. I don't know if we can fix it without pushing some of this back to the plugin or having CUDF actually return nulls for columns that were not in the file.

I'll file an issue and see what CUDF has to say about it.

@revans2
Collaborator

revans2 commented May 30, 2023

I filed rapidsai/cudf#13473 for the long term fix. In the short term I am going to do my best to make it work without it. We will not be able to make the corner case of a file with no columns, only rows, work.
