Use parser from spark to normalize json path in GetJsonObject #10466
Conversation
Signed-off-by: Haoyang Li <[email protected]>
build
  throw new IllegalArgumentException("Invalid path")
}
// Remove all leading whitespace before names, i.e. erase whitespace after . or ['
val normalizedPath = pathStr.replaceAll("""\.(\s+)""", ".").replaceAll("""\['(\s+)""", "['")
I don't like us using regular expressions here. It might work, but I don't see it as a good long-term solution. Could we please take the code that Spark uses to parse a JSON path and modify it so that we output a normalized JSON path instead? That would also give us the ability to detect errors in the path before we send it to CUDF, and to make sure that we are compatible with what CUDF supports.
Meaning we really just reuse their code and then convert the `Option[List[PathInstruction]]` back to a string.
parsedPath match {
  case None => // ERROR: return a column of nulls
  case Some(lp) =>
    val ret = new StringBuilder()
    ret.append('$')
    lp.foreach {
      case Subscript | Key => // ignored, we don't care about these
      case Wildcard => ret.append("[*]")
      case Named(name) => ret.append(s"['$name']")
      case Index(idx) => ret.append(s"[$idx]")
      case _ => throw new IllegalArgumentException("unexpected path instruction")
    }
}
Thanks! Updated to use JsonPathParser, this approach also looks to fix some other issues related to json path.
Updated the PR description to explain why this PR can fix these issues.
Signed-off-by: Haoyang Li <[email protected]>
build
Looks good, just want to make sure we don't leak the scalar value.
override def doColumnar(lhs: GpuColumnVector, rhs: GpuScalar): ColumnVector = {
  lhs.getBase().getJSONObject(rhs.getBase,
    GetJsonObjectOptions.builder().allowSingleQuotes(true).build());
def normalizeJsonPath(path: GpuScalar): Option[Scalar] = {
The way this code is written, it looks like it is going to leak the Scalar. Can we make this an `Option[String]` instead? That would also let us cache it, because we know that it is a constant value.
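The suggestion above can be sketched as follows. This is a minimal illustration only: `CachedJsonPath` and its stub `normalizeJsonPath` are hypothetical names, not the plugin's actual API; the point is that a `lazy val Option[String]` is computed once for the constant path and holds no GPU `Scalar`, so there is nothing to leak.

```scala
// Hypothetical sketch of the reviewer's suggestion: normalize the constant
// path once and cache the result as an Option[String].
class CachedJsonPath(rawPath: Option[String]) {
  // Stand-in normalization, NOT the plugin's real implementation: accept only
  // paths starting with '$' and strip whitespace after '.'.
  private def normalizeJsonPath(path: String): Option[String] =
    if (path.startsWith("$")) Some(path.replaceAll("""\.\s+""", ".")) else None

  // lazy val: computed at most once; no GPU Scalar is held here.
  lazy val normalized: Option[String] = rawPath.flatMap(normalizeJsonPath)
}
```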
Thanks, done
Signed-off-by: Haoyang Li <[email protected]>
val pathStr = if (path.isValid) {
  path.getValue.toString()
} else {
  throw new IllegalArgumentException("Invalid path")
I don't know how big of a deal this is, but it means that the path is null. We should not throw an exception here; we should return `None`, as a null path should result in null for all resulting values instead of an exception.
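A minimal sketch of that handling, with hypothetical names (`Option[String]` stands in for the validity check on the scalar): a null path maps to `None`, which the caller turns into an all-null result column rather than an exception.

```scala
// Sketch: instead of throwing on a null path, return None and let the
// caller produce a column of nulls. Names here are illustrative only.
def normalizeJsonPath(path: Option[String]): Option[String] =
  path match {
    case None          => None               // null path => null results, no exception
    case Some(pathStr) => Some(pathStr.trim) // trim stands in for the real normalization
  }
```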
done
Signed-off-by: Haoyang Li <[email protected]>
sql-plugin/src/main/scala/com/nvidia/spark/rapids/GpuGetJsonObject.scala
build
Fix #10213
Fix #10214
Fix #10215
Contributes to #10216 and #10217
Spark has implemented a JsonPathParser to parse the json path in getJsonObject, which in some cases behaves differently from https://jsonpath.com/ and the cudf implementation.
https://github.com/apache/spark/blob/78f7c30e140fd8cf4a80b783dd7e9ee4d1b4d7e2/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/jsonExpressions.scala#L59
This PR uses JsonPathParser to normalize the json path before passing it to the kernel, so GetJsonObject will better match Spark in some cases related to json path.
This PR also falls back GetJsonObject to the CPU when the json path contains a wildcard (`*`), as the plugin may produce different results in this case. Note that `.*` and `['*']` will be parsed to a `Named("*")` by the Spark parser, so the check also looks for `Named("*")` in the path. The reason I didn't put this in the kernel is that this behavior is Spark-specific and we only support a scalar for this parameter, so the preprocessing takes constant time.
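The wildcard check described above can be sketched like this. The `PathInstruction` ADT below is a local stand-in mirroring the shape of Spark's version in jsonExpressions.scala, and `containsWildcard` is a hypothetical helper name, not the plugin's actual code.

```scala
// Local stand-in for Spark's PathInstruction ADT (jsonExpressions.scala).
sealed trait PathInstruction
case object Subscript extends PathInstruction
case object Key extends PathInstruction
case object Wildcard extends PathInstruction
case class Named(name: String) extends PathInstruction
case class Index(index: Long) extends PathInstruction

// Fall back when the path contains an explicit Wildcard, or a Named("*"),
// which is how Spark's parser represents `.*` and `['*']`.
def containsWildcard(instructions: List[PathInstruction]): Boolean =
  instructions.exists {
    case Wildcard   => true
    case Named("*") => true
    case _          => false
  }
```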
In 10213, a path not starting with `$`, like `[0]` or `['a']`, should be invalid. The parser will reject these cases, so this issue can be fixed.

In 10214, Spark and https://jsonpath.com/ happily support paths like `$[a]` or `$[a1]`, while our implementation throws an exception about them not being a valid number. In fact, such cases should be invalid and rejected by the parser, so this PR will return nulls for them instead of passing them to the kernel and throwing an exception. So this issue can also be fixed in this PR.

In 10215, the GPU and https://jsonpath.com/ agree that whitespace should not be stripped from the path, but Spark treats `$. a` and `$.a` the same. This PR normalizes those cases so that whitespace is stripped the way Spark does.

In SPARK-46761, Spark does not support `?` characters. With the parser in this PR, these cases will be rejected so that the behavior matches Spark.
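For the 10215 case specifically, the effect of normalization can be illustrated with a small sketch. This shows only the intended behavior (`$. a` treated like `$.a`), not the parser-based implementation this PR actually uses; `stripPathWhitespace` is a hypothetical helper.

```scala
// Illustration of the whitespace normalization the PR performs, so that
// `$. a` is handled like `$.a`, matching Spark's behavior.
def stripPathWhitespace(path: String): String =
  path.replaceAll("""\.\s+""", ".").replaceAll("""\['\s+""", "['")
```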