Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Resolve path parsing issues in get_json_object #15082

Merged

Conversation

SurajAralihalli
Copy link
Contributor

@SurajAralihalli SurajAralihalli commented Feb 17, 2024

This PR addresses a parsing issue related to JSONPath by implementing distinct parsing rules for values inside and outside brackets. For instance, in { "A.B": 2, "'A": { "B'": 3 } }, $.'A.B' differs from $['A.B']. (See Assertible JSON Path Documentation)

The fix ensures accurate parsing of JSONPath values containing quotes. For example in { "A.B": 2, "'A": { "B'": 3 } }

JSONPath Before Fix Spark After Fix
$.'A.B' 2 3 3
$.'A CUDF_FAIL("Encountered invalid JSONPath input string") {"B'": 3} {"B'": 3}

Resolves 12483.

Checklist

  • I am familiar with the Contributing Guidelines.
  • New or existing tests cover these changes.
  • The documentation is up to date with these changes.

Signed-off-by: Suraj Aralihalli <[email protected]>
@github-actions github-actions bot added the libcudf Affects libcudf (C++/CUDA) code. label Feb 17, 2024
Signed-off-by: Suraj Aralihalli <[email protected]>
@SurajAralihalli
Copy link
Contributor Author

SurajAralihalli commented Feb 18, 2024

However, when encountering an invalid JSONPath, CUDF throws an error, whereas Spark silently returns null. For instance, consider the example { "A" : 1, "B" : 2 }

JSONPath Cudf Spark
. CUDF_FAIL: invalid_argument null
${A} CUDF_FAIL: invalid_argument null
][ CUDF_FAIL: invalid_argument null

We can update the get_json_object to return null instead. If direct modification is not feasible, in order to maintain consistency with the current behavior in CUDF, we can introduce a new parameter named return_null_if_invalid_argument to the get_json_object_options function. This new parameter can have a default value of false.

@SurajAralihalli SurajAralihalli added Spark Functionality that helps Spark RAPIDS non-breaking Non-breaking change labels Feb 18, 2024
@davidwendt
Copy link
Contributor

However, when encountering an invalid JSONPath, CUDF throws an error, whereas Spark silently returns null.

If this is an invalid argument, could the Spark logic just catch the exception and then return a null on its own?

@SurajAralihalli
Copy link
Contributor Author

However, when encountering an invalid JSONPath, CUDF throws an error, whereas Spark silently returns null.

If this is an invalid argument, could the Spark logic just catch the exception and then return a null on its own?

Yes, I think that can be done. We can have the Spark catch CudfException and return a column of nulls.

@SurajAralihalli SurajAralihalli added the bug Something isn't working label Feb 20, 2024
@SurajAralihalli SurajAralihalli marked this pull request as ready for review February 20, 2024 01:18
@SurajAralihalli SurajAralihalli requested a review from a team as a code owner February 20, 2024 01:18
@ttnghia
Copy link
Contributor

ttnghia commented Feb 20, 2024

Please clarify: In such situations, Spark will return a column of all nulls, or only nulls corresponding to the invalid input rows (JSON objects)?

@SurajAralihalli
Copy link
Contributor Author

SurajAralihalli commented Feb 20, 2024

Please clarify: In such situations, Spark will return a column of all nulls, or only nulls corresponding to the invalid input rows (JSON objects)?

I believe the column would consist entirely of null values since an invalid JSONPath (argument) would affect all the input rows uniformly.

@ttnghia
Copy link
Contributor

ttnghia commented Feb 20, 2024

If this is an invalid argument, could the Spark logic just catch the exception and then return a null on its own?

This is dangerous. We cannot know if such exception is due to invalid JSON path or something else unless we have a new exception type dedicated to it. So we should better to handle this case explicitly in the code.

@davidwendt
Copy link
Contributor

If this is an invalid argument, could the Spark logic just catch the exception and then return a null on its own?

This is dangerous. We cannot know if such exception is due to invalid JSON path or something else unless we have a new exception type dedicated to it. So we should better to handle this case explicitly in the code.

I believe the exception occurs because of an unexpected parameter value which is checked in host code.
The exception type is std::invalid_argument and hope this would be specific enough to check by the caller.

cpp/src/json/json_path.cu Outdated Show resolved Hide resolved
Copy link
Contributor

@revans2 revans2 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

But I am just looking at the results, not the code as I am no C++ expert

Signed-off-by: Suraj Aralihalli <[email protected]>
Signed-off-by: Suraj Aralihalli <[email protected]>
Copy link
Contributor

@bdice bdice left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Let's refactor the bool inside_brackets to an enum class. It would help eliminate the comments that are currently needed to clarify the intent of the argument.

cpp/src/json/json_path.cu Outdated Show resolved Hide resolved
Signed-off-by: Suraj Aralihalli <[email protected]>
@ttnghia
Copy link
Contributor

ttnghia commented Feb 27, 2024

/ok to test

Copy link
Contributor

@revans2 revans2 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This looks fine for the java side of things.

@SurajAralihalli
Copy link
Contributor Author

/ok to test

cpp/src/json/json_path.cu Outdated Show resolved Hide resolved
cpp/src/json/json_path.cu Show resolved Hide resolved
@ttnghia
Copy link
Contributor

ttnghia commented Feb 29, 2024

/ok to test

@SurajAralihalli
Copy link
Contributor Author

/merge

@rapids-bot rapids-bot bot merged commit a4f1118 into rapidsai:branch-24.04 Feb 29, 2024
76 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
bug Something isn't working Java Affects Java cuDF API. libcudf Affects libcudf (C++/CUDA) code. non-breaking Non-breaking change Spark Functionality that helps Spark RAPIDS
Projects
None yet
Development

Successfully merging this pull request may close these issues.

[BUG] GPU get_json_object does incompatible escaping and error checking
6 participants