
[BUG] distributed CI failed in test_read_case_col_name #8403

Open
abellina opened this issue May 25, 2023 · 9 comments
Labels
bug Something isn't working cudf_dependency An issue or PR with this label depends on a new feature in cudf

Comments

@abellina
Collaborator

abellina commented May 25, 2023

I see the following failures:

```
14:12:15  FAILED ../../src/main/python/json_test.py::test_read_case_col_name[V0-][INJECT_OOM, IGNORE_ORDER]
14:12:15  FAILED ../../src/main/python/json_test.py::test_read_case_col_name[V0-json][IGNORE_ORDER]
14:12:15  FAILED ../../src/main/python/json_test.py::test_read_case_col_name[v0-][INJECT_OOM, IGNORE_ORDER]
14:12:15  FAILED ../../src/main/python/json_test.py::test_read_case_col_name[v0-json][IGNORE_ORDER]
AssertionError: GPU and CPU are not both null at [99, 'v0']
```

These happened in the multi-threaded shuffle CI job, but I am seeing the same failure locally against a standalone Spark.

I also saw #8400.

@abellina abellina added bug Something isn't working ? - Needs Triage Need team to review and classify labels May 25, 2023
@abellina
Collaborator Author

This test was recently added here: #8371

@abellina
Collaborator Author

This repros against a standalone cluster without the multi-threaded shuffle.

@abellina abellina changed the title [BUG] Multi-threaded shuffle CI failed in test_read_case_col_name [BUG] distributed CI failed in test_read_case_col_name May 26, 2023
@abellina
Collaborator Author

Trying to come up with a smaller repro, but I ended up with this: #8411

@abellina
Collaborator Author

A related (but not complete) repro:

```scala
import org.apache.spark.sql.types._
val schema = StructType(Array(
  StructField("k0", LongType, false),
  StructField("k1", LongType, false),
  StructField("k2", LongType, false),
  StructField("k3", LongType, false),
  StructField("v0", LongType, true),
  StructField("v1", LongType, true),
  StructField("v2", LongType, true),
  StructField("v3", LongType, true)))
val d = spark.read.schema(schema).json("file:///tmp/json_test_repro")
d.show()
```

Where json_test_repro is a folder with a json file with contents:

```json
{"v1":5139590093827188858,"v2":-1404180295742992878}
```

And a partitioned directory structure like this:

```
./k0=1
./k0=1/k1=1
./k0=1/k1=1/k2=2
./k0=1/k1=1/k2=2/k3=3
./k0=1/k1=1/k2=2/k3=3/part-00014-48dd252f-0dec-4969-ac53-79d4cb8a8331.c000.json
```

Note that the schema passed to cuDF has columns v0 and v3 that don't exist in the JSON itself. The issue is that we get back a table from cuDF with 2 columns instead of 4 (after fixing #8411):

```
java.lang.ArrayIndexOutOfBoundsException: 2
        at ai.rapids.cudf.Table.getColumn(Table.java:124)
        at org.apache.spark.sql.catalyst.json.rapids.JsonPartitionReader.$anonfun$readToTable$4(GpuJsonScan.scala:326)
        at scala.collection.IndexedSeqOptimized.foreach(IndexedSeqOptimized.scala:36)
        at scala.collection.IndexedSeqOptimized.foreach$(IndexedSeqOptimized.scala:33)
        at scala.collection.mutable.ArrayOps$ofRef.foreach(ArrayOps.scala:198)
        at org.apache.spark.sql.catalyst.json.rapids.JsonPartitionReader.$anonfun$readToTable$3(GpuJsonScan.scala:319)
        at com.nvidia.spark.rapids.Arm$.withResource(Arm.scala:29)
```

Since the files can have any number of columns (if all rows are null, Spark can omit the column when writing), we need to make sure cuDF follows suit.
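To make the mismatch concrete, here is a minimal sketch in plain Java (the helper is hypothetical, not a cuDF API) of the reconciliation step that is missing: for each column the schema requests, find its index among the columns the reader actually produced, or mark it as absent so it can be filled with nulls instead of indexed directly.

```java
import java.util.Arrays;
import java.util.List;

public class ColumnReconcile {
    // For each requested column name, return its index among the columns the
    // reader actually produced, or -1 if the column is absent from the file
    // and must be materialized as all nulls.
    static int[] mapRequestedToRead(List<String> readCols, List<String> requested) {
        int[] mapping = new int[requested.size()];
        for (int i = 0; i < requested.size(); i++) {
            mapping[i] = readCols.indexOf(requested.get(i));
        }
        return mapping;
    }

    public static void main(String[] args) {
        // The repro file contains only v1 and v2; the schema asks for v0..v3.
        List<String> readCols = Arrays.asList("v1", "v2");
        List<String> requested = Arrays.asList("v0", "v1", "v2", "v3");
        System.out.println(Arrays.toString(mapRequestedToRead(readCols, requested)));
        // Indexing the 2-column table with positions 0..3 directly, without
        // this mapping, is what triggers the ArrayIndexOutOfBoundsException.
    }
}
```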

@abellina abellina added the cudf_dependency An issue or PR with this label depends on a new feature in cudf label May 26, 2023
@abellina
Collaborator Author

We need to file a cuDF bug for this.

My guess is that the integration test failure is an odd combination of contents of a source table.

If the file in question had the following contents:

```json
{"v0":0, "v1":1, "v2":2, "v3":3}
{"v1":5139590093827188858,"v2":-1404180295742992878}
```

We get back 4 columns and all seems well:

```
+----+-------------------+--------------------+----+---+---+---+---+
|  v0|                 v1|                  v2|  v3| k0| k1| k2| k3|
+----+-------------------+--------------------+----+---+---+---+---+
|   0|                  1|                   2|   3|  1|  1|  2|  3|
|null|5139590093827188858|-1404180295742992878|null|  1|  1|  2|  3|
+----+-------------------+--------------------+----+---+---+---+---+
```

The CPU consistently returns the above table with all columns and nulls where appropriate.

@revans2
Collaborator

revans2 commented May 30, 2023

Yes, this is a bug in our CUDF JNI layer.

https://github.com/rapidsai/cudf/blob/5e12c25ca40c41f21fc80d4cd36713c514fabaa4/java/src/main/native/src/TableJni.cpp#L1504-L1521

If we cannot find a column that matches the name, we return the original data unchanged. We really should be moving columns over to a new table and generating columns of nulls when we cannot find the column we want.
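A sketch of that approach in plain Java (the `Column` type here is a stand-in for a device column, not the cudf class): walk the requested names, move matching columns over to the new table, and synthesize an all-null column of the right row count for any name not found.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

public class NullFillReorder {
    // Stand-in for a device column: just a name and row values (null = null row).
    record Column(String name, List<Long> rows) {}

    static List<Column> reorderWithNulls(List<Column> read, List<String> wanted, int rowCount) {
        Map<String, Column> byName = new LinkedHashMap<>();
        for (Column c : read) byName.put(c.name(), c);
        List<Column> out = new ArrayList<>();
        for (String name : wanted) {
            Column c = byName.get(name);
            if (c != null) {
                out.add(c); // move the existing column over unchanged
            } else {
                // Column absent from the file: generate an all-null column.
                List<Long> nulls = new ArrayList<>();
                for (int i = 0; i < rowCount; i++) nulls.add(null);
                out.add(new Column(name, nulls));
            }
        }
        return out;
    }

    public static void main(String[] args) {
        // The repro file yields only v1 and v2; the schema wants v0..v3.
        List<Column> read = List.of(
            new Column("v1", List.of(5139590093827188858L)),
            new Column("v2", List.of(-1404180295742992878L)));
        List<Column> fixed = reorderWithNulls(read, List.of("v0", "v1", "v2", "v3"), 1);
        for (Column c : fixed) System.out.println(c.name() + " -> " + c.rows());
    }
}
```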

@revans2
Collaborator

revans2 commented May 30, 2023

From a performance/efficiency standpoint I think we want to modify where we reorder the columns. I think we want to return a TableWithMeta, and let the java code handle the column names. This is so we can avoid copying the table multiple times and possibly making multiple copies of the same columns.

@revans2
Collaborator

revans2 commented May 30, 2023

Crap. I don't know if we can fix it without pushing some of this back to the plugin or having CUDF actually return nulls for columns that were not in the file.

I'll file an issue and see what CUDF has to say about it.

@revans2
Collaborator

revans2 commented May 30, 2023

I filed rapidsai/cudf#13473 for the long term fix. In the short term I am going to do my best to make it work without it. We will not be able to make the corner case of a file with no columns, only rows, work.
