[FEA] JSON Reader: Support `dropFieldIfAllNull` option #4718

HaoYang670 · 2022-02-08T07:29:51Z

Description
In Spark, if you set the option dropFieldIfAllNull to true when reading a JSON file, schema inference ignores columns whose type is NullType.
For example, write a JSON file in Spark and then read it back:

> val df = Seq((1, null, "", Seq()), (2, null, "", Seq())).toDF
> df.write.mode("overwrite").json(path)
> val full_df = spark.read.format("json").option("dropFieldIfAllNull", false).load(path)
> full_df.show
+---+----+---+---+
| _1|  _2| _3| _4|
+---+----+---+---+
|  1|null|   | []|
+---+----+---+---+
|  2|null|   | []|
+---+----+---+---+
> full_df.printSchema
root
 |-- _1: long (nullable = true)
 |-- _2: string (nullable = true)
 |-- _3: string (nullable = true)
 |-- _4: array (nullable = true)
 |    |-- element: string (containsNull = true)
> val dropped_df = spark.read.format("json").option("dropFieldIfAllNull", true).load(path)
> dropped_df.show
+---+
| _1|
+---+
|  1|
+---+
|  2|
+---+

When dropFieldIfAllNull is set to true, columns that contain only null, empty strings, or empty arrays (and possibly other values) are dropped.
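
To make the rule concrete, here is a minimal sketch in plain Python that emulates the behavior observed above. It is not Spark's implementation: the helper names `is_null_like` and `infer_kept_fields` are hypothetical, the input lines are hand-written to mirror the example data, and the set of "null-like" values is limited to the three cases seen in this issue.

```python
import json


def is_null_like(value):
    """True for the values the example above shows being dropped (assumed set)."""
    return value is None or value == "" or value == []


def infer_kept_fields(json_lines, drop_field_if_all_null):
    """Return the field names, in first-seen order, that survive inference."""
    seen = {}  # field name -> True once any non-null-like value is observed
    for line in json_lines:
        record = json.loads(line)
        for name, value in record.items():
            seen[name] = seen.get(name, False) or not is_null_like(value)
    if not drop_field_if_all_null:
        return list(seen)
    return [name for name, has_value in seen.items() if has_value]


# Hand-written JSON lines mirroring the DataFrame above (an assumption).
lines = [
    '{"_1": 1, "_2": null, "_3": "", "_4": []}',
    '{"_1": 2, "_2": null, "_3": "", "_4": []}',
]
print(infer_kept_fields(lines, drop_field_if_all_null=False))  # ['_1', '_2', '_3', '_4']
print(infer_kept_fields(lines, drop_field_if_all_null=True))   # ['_1']
```

The sketch only decides which field names to keep. It does not model how Spark types the remaining fields.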

In cuDF, reading the same JSON file gives:

In [9]: df = cudf.read_json(path, lines=True)
In [10]: df
Out[10]: 
   _1    _2    _3  _4
 0  1  <NA>  <NA>  []
 1  2  <NA>  <NA>  []
In [11]: df.info()
Out [11]:
<class 'cudf.core.dataframe.DataFrame'>
RangeIndex: 2 entries, 0 to 1
Data columns (total 4 columns):
 #   Column  Non-Null Count  Dtype
---  ------  --------------  -----
 0   _1      2 non-null      int64
 1   _2      0 non-null      int8
 2   _3      0 non-null      int8
 3   _4      2 non-null      object
dtypes: int64(1), int8(2), object(1)

The differences between Spark and cuDF are:

- Null fields are inferred as string in Spark (if dropFieldIfAllNull is false) but as int8 in cuDF.
- An empty array is not null in cuDF.
- Possibly more.

Describe the solution you'd like
Support dropFieldIfAllNull in spark-rapids, with the same behavior as Spark.

Additional context
https://issues.apache.org/jira/browse/SPARK-23772

sameerz · 2022-02-08T21:11:20Z

dropFieldIfAllNull will be used when we are doing schema discovery, which is unlikely in the near future.

revans2 · 2024-03-14T14:15:21Z

Actually this is only used for schema discovery, which is something we do not do in the plugin at all. It is not directly related to the read portion that we do.
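
As an illustration of that distinction, here is a small sketch in plain Python, assuming the reader is handed the field names up front. `read_with_schema` is a hypothetical helper, not part of Spark or the plugin. Because no inference step runs, there is no point at which an all-null field could be dropped.

```python
import json


def read_with_schema(json_lines, schema):
    """Project each record onto the caller-supplied field names.

    Hypothetical helper: the schema is given, so nothing is inferred and
    every requested field is returned, even if all of its values are null.
    """
    rows = []
    for line in json_lines:
        record = json.loads(line)
        rows.append({name: record.get(name) for name in schema})
    return rows


# Hand-written input mirroring the example data (an assumption).
lines = [
    '{"_1": 1, "_2": null, "_3": "", "_4": []}',
    '{"_1": 2, "_2": null, "_3": "", "_4": []}',
]
rows = read_with_schema(lines, schema=["_1", "_2"])
print(rows)  # [{'_1': 1, '_2': None}, {'_1': 2, '_2': None}]
```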

HaoYang670 added the feature request and ? - Needs Triage labels Feb 8, 2022

sameerz removed the ? - Needs Triage label Feb 8, 2022

HaoYang670 mentioned this issue Feb 9, 2022: [FEA] JSON input support #9 (Open, 62 tasks)

revans2 closed this as not planned Mar 14, 2024
