Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Use parser from spark to normalize json path in GetJsonObject #10466

Merged

Conversation

thirtiseven
Copy link
Collaborator

@thirtiseven thirtiseven commented Feb 22, 2024

Fix #10213
Fix #10214
Fix #10215
Contributes to #10216 and #10217

Spark has implemented a JsonPathParser to parse the json path in getJsonObject, which will have some different behavior than https://jsonpath.com/ and the cudf implementation.

https://github.com/apache/spark/blob/78f7c30e140fd8cf4a80b783dd7e9ee4d1b4d7e2/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/jsonExpressions.scala#L59

This PR uses JsonPathParser to normalize the json path before passing it to the kernel, so GetJsonObject will better match Spark in some cases related to json path.

This PR also fallback GetJsonObject when json paths contain wildcard ("*"), as the plugin may produce different results in this case. Note that . * and [ *] will be parsed to a Named("*") by Spark parser, so it also checked Named("*") from path.

The reason why I didn't put it in kernel is I think this feature is only for Spark and we only support scalar for this parameter, so it will consume constant time to do this preprocessing.

In 10213, path not starting with a $ like [0] or ['a'] should be invalid. The parser will reject these cases so this issue can be fixed.

In 10214, Spark and https://jsonpath.com/ happily support paths like $[a] or $[a1] while our implementation throws an exception about them not being a valid number. In fact, such cases should be invalid and rejected by the parser, so this PR will return nulls for them instead of passing them to the kernel and throwing an exception. So this issue can also be fixed in this PR.

In 10215, gpu and https://jsonpath.com/ agree that whitespace should not be stripped from the path, but Spark will treats $. a and $.a the same. This PR normalizes those cases so that the whitespace is stripped according to Spark's way.

In SPARK-46761, Spark does not support ? characters. With the parser in this PR, these cases will be rejected so that the behavior matches Spark.

@thirtiseven thirtiseven self-assigned this Feb 22, 2024
@thirtiseven thirtiseven marked this pull request as ready for review February 22, 2024 10:37
@revans2
Copy link
Collaborator

revans2 commented Feb 22, 2024

build

throw new IllegalArgumentException("Invalid path")
}
// Remove all leading whitespaces before names, so we erase all whitespaces after . or ['
val normalizedPath = pathStr.replaceAll("""\.(\s+)""", ".").replaceAll("""\['(\s+)""", "['")
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't like us using regular expressions here. It might work, but I don't see it as a good long term solution. Could we please take the code that Spark uses to parse a JSON path.

https://github.com/apache/spark/blob/78f7c30e140fd8cf4a80b783dd7e9ee4d1b4d7e2/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/jsonExpressions.scala#L59

And modify it so that we output a normalized JSON path instead. That would also give us the ability to detect errors in the path before we send it to CUDF and make sure that we are compatible with what CUDF supports.

Meaning we really just reuse their code and then convert the Option[List[PathInstruction]] back to a string.

parsedPath match {
  case None => // ERROR return a column of nulls
  case Some(lp) =>
    val ret = new StringBuffer()
    ret.append('$')
    lp.foreach {
      case Subscript | Key => // Ignored we don't care about these
      case Wildcard => ret.append("[*]")
      case Named(name) => ret.append(s"['${name}']")
      case Index(idx) => ret.append(s"[$idx]")
      case _ => ERROR!!!!
    }
}

Copy link
Collaborator Author

@thirtiseven thirtiseven Feb 23, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks! Updated to use JsonPathParser, this approach also looks to fix some other issues related to json path.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Updated the PR description to explain why this PR can fix these issues.

@thirtiseven thirtiseven changed the title Remove leading space for json path in GetJsonObject Use parser from spark to normalize json path in GetJsonObject Feb 23, 2024
Signed-off-by: Haoyang Li <[email protected]>
Signed-off-by: Haoyang Li <[email protected]>
@thirtiseven
Copy link
Collaborator Author

build

@sameerz sameerz added the bug Something isn't working label Feb 24, 2024
@thirtiseven thirtiseven requested a review from revans2 February 26, 2024 08:33
Copy link
Collaborator

@revans2 revans2 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks good, just want to make sure we don't leak the scalar value.

override def doColumnar(lhs: GpuColumnVector, rhs: GpuScalar): ColumnVector = {
lhs.getBase().getJSONObject(rhs.getBase,
GetJsonObjectOptions.builder().allowSingleQuotes(true).build());
def normalizeJsonPath(path: GpuScalar): Option[Scalar] = {
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The way this code is written it looks like it is going to leak the Scalar. Can we make this an Option[String] instead? That would also let us cache it, because we know that it is a constant value.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks, done

val pathStr = if (path.isValid) {
path.getValue.toString()
} else {
throw new IllegalArgumentException("Invalid path")
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't know how big of a deal this is, but it means that the path is null. We should not throw an exception here, but we should return None as a null path should result in a null for all resulting values instead of an exception.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

done

@revans2
Copy link
Collaborator

revans2 commented Feb 28, 2024

build

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
bug Something isn't working
Projects
None yet
3 participants