[FEA] JSON Reader: Support dropFieldIfAllNull option #4718

Closed
Tracked by #9
HaoYang670 opened this issue Feb 8, 2022 · 2 comments
Labels
feature request New feature or request

Comments

HaoYang670 (Collaborator) commented Feb 8, 2022

Description
In Spark, if you set the option dropFieldIfAllNull to true when reading a JSON file, schema inference ignores any column whose inferred type is NullType.
For example, write a JSON file in Spark and then read it back:

> val df = Seq((1, null, "", Seq()), (2, null, "", Seq())).toDF
> df.write.mode("overwrite").json(path)
> val full_df = spark.read.format("json").option("dropFieldIfAllNull", false).load(path)
> full_df.show
+---+----+---+---+
| _1|  _2| _3| _4|
+---+----+---+---+
|  1|null|   | []|
+---+----+---+---+
|  2|null|   | []|
+---+----+---+---+
> full_df.printSchema
root
 |-- _1: long (nullable = true)
 |-- _2: string (nullable = true)
 |-- _3: string (nullable = true)
 |-- _4: array (nullable = true)
 |    |-- element: string (containsNull = true)
> val dropped_df = spark.read.format("json").option("dropFieldIfAllNull", true).load(path)
> dropped_df.show
+---+
| _1|
+---+
|  1|
+---+
|  2|
+---+

Columns whose values are all null, empty strings, or empty arrays (and possibly others) are dropped when dropFieldIfAllNull is set to true.
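
The behavior observed above can be sketched in plain Python. This is only an illustration of the reported semantics, not Spark's implementation: the helper names `is_null_like` and `infer_fields` are hypothetical, and the set of values treated as null-like (null, empty string, empty array) is taken from the output in this issue.

```python
import json

def is_null_like(value):
    # Assumption: these three cases mirror what the issue reports as dropped.
    return value is None or value == "" or value == []

def infer_fields(json_lines, drop_field_if_all_null=False):
    """Return field names in first-seen order.

    When drop_field_if_all_null is True, fields whose every observed
    value is null-like are left out of the result.
    """
    has_real_value = {}  # field name -> True once a non-null-like value is seen
    for line in json_lines:
        record = json.loads(line)
        for name, value in record.items():
            seen = has_real_value.get(name, False)
            has_real_value[name] = seen or not is_null_like(value)
    if not drop_field_if_all_null:
        return list(has_real_value)
    return [name for name, seen in has_real_value.items() if seen]

lines = [
    '{"_1": 1, "_2": null, "_3": "", "_4": []}',
    '{"_1": 2, "_2": null, "_3": "", "_4": []}',
]
print(infer_fields(lines))                               # ['_1', '_2', '_3', '_4']
print(infer_fields(lines, drop_field_if_all_null=True))  # ['_1']
```

A field is kept as soon as one record carries a real value for it, so the decision depends on the whole input rather than on any single row.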

Reading the same JSON file with cuDF gives:

In [9]: df = cudf.read_json(path, lines=True)
In [10]: df
Out[10]: 
   _1    _2    _3  _4
 0  1  <NA>  <NA>  []
 1  2  <NA>  <NA>  []
In [11]: df.info()
Out [11]:
<class 'cudf.core.dataframe.DataFrame'>
RangeIndex: 2 entries, 0 to 1
Data columns (total 4 columns):
 #   Column  Non-Null Count  Dtype
---  ------  --------------  -----
 0   _1      2 non-null      int64
 1   _2      0 non-null      int8
 2   _3      0 non-null      int8
 3   _4      2 non-null      object
dtypes: int64(1), int8(2), object(1)

The differences between Spark and cuDF are:

  1. All-null fields are inferred as string in Spark (when dropFieldIfAllNull is false) but as int8 in cuDF.
  2. An empty array is not treated as null in cuDF.
  3. There may be more.

Describe the solution you'd like
Support dropFieldIfAllNull in spark-rapids, with the same behavior as Spark.

Additional context
https://issues.apache.org/jira/browse/SPARK-23772

HaoYang670 added the "feature request" and "? - Needs Triage" labels on Feb 8, 2022

sameerz (Collaborator) commented Feb 8, 2022

dropFieldIfAllNull will be used when we are doing schema discovery, which is unlikely in the near future.

sameerz removed the "? - Needs Triage" label on Feb 8, 2022
HaoYang670 mentioned this issue on Feb 9, 2022 (62 tasks)

revans2 (Collaborator) commented Mar 14, 2024

Actually this is only used for schema discovery, which is something we do not do in the plugin at all. It is not directly related to the read portion that we do.
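
To illustrate the distinction: when the caller supplies the schema, the output columns are fixed up front, so there is no inference step for dropFieldIfAllNull to act on. The following is a minimal pure-Python sketch of that idea. The function `read_with_schema` is hypothetical and is not part of Spark, cuDF, or spark-rapids.

```python
import json

def read_with_schema(json_lines, schema):
    """Project each record onto a user-supplied list of field names.

    Missing or null fields become None. No field is dropped, because
    the set of columns comes from the schema and not from the data.
    """
    rows = []
    for line in json_lines:
        record = json.loads(line)
        rows.append({name: record.get(name) for name in schema})
    return rows

lines = ['{"_1": 1, "_2": null}', '{"_1": 2, "_2": null}']
rows = read_with_schema(lines, schema=["_1", "_2"])
print(rows)  # [{'_1': 1, '_2': None}, {'_1': 2, '_2': None}]
```

The all-null column `_2` stays in the result because it was requested explicitly.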

revans2 closed this as not planned on Mar 14, 2024