Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Fix Null literals to be not parsed as string when mixed types as string is enabled in JSON reader #14939

Merged
merged 14 commits into from
Mar 7, 2024

Conversation

karthikeyann
Copy link
Contributor

@karthikeyann karthikeyann commented Jan 31, 2024

Description

Fixes #14864

null literal should be ignored (considered as null) during parsing while handling mixed types.
Unit tests of complex scenarios are added to test this as well.

Checklist

  • I am familiar with the Contributing Guidelines.
  • New or existing tests cover these changes.
  • The documentation is up to date with these changes.

@karthikeyann karthikeyann added bug Something isn't working 2 - In Progress Currently a work in progress libcudf Affects libcudf (C++/CUDA) code. cuIO cuIO issue Java Affects Java cuDF API. 4 - Needs cuDF (Java) Reviewer non-breaking Non-breaking change labels Jan 31, 2024
@github-actions github-actions bot removed the Java Affects Java cuDF API. label Jan 31, 2024
@karthikeyann karthikeyann added 3 - Ready for Review Ready for review by team 4 - Needs cuIO Reviewer and removed 2 - In Progress Currently a work in progress labels Feb 1, 2024
@karthikeyann karthikeyann marked this pull request as ready for review February 1, 2024 13:28
@karthikeyann karthikeyann requested a review from a team as a code owner February 1, 2024 13:28
Copy link
Contributor

@mythrocks mythrocks left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

A couple of nitpicks. This looks good to me. I'm still digesting the test cases.

cpp/src/io/json/json_column.cu Outdated Show resolved Hide resolved
cpp/src/io/json/json_column.cu Outdated Show resolved Hide resolved
cpp/src/io/json/json_column.cu Show resolved Hide resolved
Copy link
Contributor

@shrshi shrshi left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks great overall! There are a few points that are unclear to me -

test_fn(R"(
{ "a": { "b": 1 } }
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can you help me understand why we have STRUCT + STR when we already have STR + STRUCT test case in line 2112? This also applies for the other <STRUCT, STR, LIST> permutations above. Does the ordering of the records in the JSON lines input matter for the parser?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

One of the examples provided by @andygrove successfully caught a segfault bug where the ordering is different. After fixing the bug, for unit tests, I created all test cases with different order.

cpp/tests/io/json_test.cpp Outdated Show resolved Hide resolved
cpp/src/io/json/json_column.cu Show resolved Hide resolved
cpp/src/io/json/json_column.cu Show resolved Hide resolved
cpp/tests/io/json_test.cpp Show resolved Hide resolved
@karthikeyann karthikeyann requested a review from shrshi February 20, 2024 17:51
Copy link
Contributor

@shrshi shrshi left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks good to me, thank you!

@vyasr vyasr added 4 - Needs Review Waiting for reviewer to review or respond and removed 4 - Needs cuIO Reviewer labels Feb 23, 2024
cpp/src/io/json/json_column.cu Outdated Show resolved Hide resolved
cpp/src/io/json/json_column.cu Show resolved Hide resolved
cpp/src/io/json/json_column.cu Outdated Show resolved Hide resolved
@karthikeyann karthikeyann requested a review from nvdbaranec March 6, 2024 06:03
@karthikeyann
Copy link
Contributor Author

/merge

@rapids-bot rapids-bot bot merged commit 753bf3e into rapidsai:branch-24.04 Mar 7, 2024
74 checks passed
rapids-bot bot pushed a commit that referenced this pull request Mar 11, 2024
Addresses part of #14288
Depends on  #14939 (mixed type ignore nulls fix)

In the input schema, if a struct column is given as STRING type, it's forced to be a STRING column.
This could be used to support map type in spark JSON reader. (Force a map type to be a STRING, and use different parser to extract this string column as key, value columns)
To enable this forcing, mixed type as string should be enabled in json_reader_options.

Authors:
  - Karthikeyan (https://github.com/karthikeyann)
  - Nghia Truong (https://github.com/ttnghia)

Approvers:
  - Andy Grove (https://github.com/andygrove)
  - Mike Wilson (https://github.com/hyperbolic2346)
  - Shruti Shivakumar (https://github.com/shrshi)
  - Bradley Dice (https://github.com/bdice)

URL: #14936
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
3 - Ready for Review Ready for review by team 4 - Needs Review Waiting for reviewer to review or respond bug Something isn't working cuIO cuIO issue libcudf Affects libcudf (C++/CUDA) code. non-breaking Non-breaking change
Projects
None yet
Development

Successfully merging this pull request may close these issues.

[BUG] JSON mixed_types_as_strings feature incorrectly returns some structs as strings
7 participants