Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Make map column non-nullable when it's a key in another map. #9147

Merged
merged 3 commits into from
Sep 7, 2023

Conversation

res-life
Copy link
Collaborator

fixes #9129

Fix map column can not be non-nullable.
Depends on rapidsai/cudf#14003.

@res-life res-life changed the title Make map column non-nullable Make map column non-nullable when it's a key in another map. Aug 31, 2023
case m: MapType =>
// if this is a key of another map. e.g.: map1(map2(int,int), int), the map2 is the map
// key of map1, map2 should be non-nullable
val isMapNullable = if (isMapKey) false else nullable
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why do we need to set isMapKey instead of just going off of nullable? We set nullable to false specifically each place we set isMapKey to true.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Consider 2 scenarios.
First scenario:

struct(
  field(column_1, map(int, int), nullable = true),
  field(column_1, map(int, int), nullable = false)
)

We should honor the nullable in the else branch

if (isMapKey) false else nullable

Second scenario:

struct(
  field(column_1, map1(map2(int, int), int), nullable = true)
)

The isMapKey means this map is a map key in another map?.
The isMapKey here means the m: MapType is a map key of another map, and its type is also Map.
We can not know isMapKey unless that the outer writerOptionsFromField passes it in.
Consider the map1(map2(int, int), int), the logic is from outer to inner.
writerOptionsFromField first handles map1, then handle its (key(map2(int, int)), value), here isMapKey means (map2(int, int) is key of map1.
writerOptionsFromField handles map2(int, int) (the key of outer map1), isMapKey is true, isMapKey is passed in from outer writerOptionsFromField.
writerOptionsFromField handles int (the value of outer map1)

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't think you understand what I am saying. isMapKey is only ever set to true on line 296. On line 295 nullable is also always set to false for this same method call.

              nullable = false,
              writeInt96, fieldMeta, parquetFieldIdWriteEnabled, isMapKey = true).build()

There is never a situation if the current code when

is (isMapKey) false else nullable

will return anything different from nullable by itself. If it did, then it would be an error in the code, because all map keys, no matter the type, are required to never be nullable. There is not even an option in the Spark MapType to make them nullable, which is why line 295 is hard coded.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for your detailed review. Fixed.
And also tested succefully with rapidsai/cudf#14003.
Once this PR is approved, I'll merge rapidsai/cudf#14003 and this PR.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The test code is in: #9090

  test("Statistics tests for Parquet files written by GPU, map(map)") {
    assume(false, "https://github.com/NVIDIA/spark-rapids/issues/9129")
    assume(true, "Move to scale test")

After this PR is merged, I'll update #9090

@res-life
Copy link
Collaborator Author

res-life commented Sep 6, 2023

build

2 similar comments
@res-life
Copy link
Collaborator Author

res-life commented Sep 7, 2023

build

@res-life
Copy link
Collaborator Author

res-life commented Sep 7, 2023

build

@res-life res-life merged commit d6a8338 into NVIDIA:branch-23.10 Sep 7, 2023
@res-life res-life deleted the fix-map-nullable branch September 7, 2023 09:04
@sameerz sameerz added the bug Something isn't working label Sep 9, 2023
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
bug Something isn't working
Projects
None yet
Development

Successfully merging this pull request may close these issues.

[BUG] Writing Parquet map(map) column can not set the outer key as non-null.
3 participants