Convert String to DecimalType without casting to FloatType [databricks] #4081
Signed-off-by: Raza Jafri <[email protected]>
@revans2 PTAL
    input: ColumnView,
    ansiEnabled: Boolean): ColumnVector = {

    // This regex gets applied to filter out known edge cases that would result in incorrect
What are the known edge cases? It would be very nice to know why we have to include this, as it is expensive. Looking at the code, it appears that is_fixed_point cuts off early if it sees something it does not expect, so it might be nice to have a follow-on issue to actually fix that, either in cuDF or in Spark-specific code.
Where is it cutting off early?
Are you saying that if I pass c = ["", "1.2", "3", ""] and the boolean vector is initialized to true, then
d = c.is_fixed_point() = [false, true, true, true]
and everything after the first value in d is bogus?
Sorry, I was wrong. Reading through the code, it looked like the check ignored anything after it saw something it didn't expect, but that is not true.
It looks like "1.5ABC" will result in false being returned. If that is true, then I don't think we need the regular expression check at all any more. Is that what triggered this? Why do we need the regexp? What "edge cases" does it cover that are not covered by the existing type check code?
You are right, we don't need the regex check anymore, as cuDF is reporting everything we need. The check is still relevant for floats, because it needs to convert "infinity" => "inf".
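To make the row-by-row behaviour discussed in this thread concrete, here is a small CPU-only Scala sketch. It is not the cuDF implementation: `looksLikeDecimal` and `normalizeInfinity` are hypothetical helper names, `java.math.BigDecimal` parsing stands in for `is_fixed_point` as an assumption (the grammar cuDF accepts may differ), and the case-insensitive handling of "infinity" is also an assumption.

```scala
import scala.util.Try

object ValidityMaskSketch {
  // Hypothetical stand-in for a per-row fixed-point check (NOT cuDF's code):
  // a row is valid only if the whole string parses, so "1.5ABC" is rejected.
  def looksLikeDecimal(s: String): Boolean =
    s != null && Try(new java.math.BigDecimal(s)).isSuccess

  // Hypothetical helper modelling the float-only rewrite "infinity" => "inf".
  // Assumption: matching is case-insensitive and an optional sign is kept.
  def normalizeInfinity(s: String): String = {
    val lower = s.toLowerCase
    if (lower == "infinity" || lower == "+infinity" || lower == "-infinity") {
      lower.replace("infinity", "inf")
    } else {
      s
    }
  }

  def main(args: Array[String]): Unit = {
    // Same input as in the question above, plus the "1.5ABC" case.
    val c = Seq("", "1.2", "3", "", "1.5ABC")
    println(c.map(looksLikeDecimal))
    println(normalizeInfinity("Infinity"))
  }
}
```

In this model every row is judged independently, so an invalid first row has no effect on the rows after it, and both empty strings as well as "1.5ABC" are marked invalid.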
    withResource(input.strip()) { stripped =>
      withResource(GpuScalar.from(null, DataTypes.StringType)) { nullString =>
        // filter out strings containing breaking whitespace
        val withoutWhitespace = withResource(ColumnVector.fromStrings("\r", "\n")) {
So in ANSI mode, is this not an error? Does the regular expression not match this? It sure looks like the regexp would error out on anything that has any whitespace in it at all.
This is a very good point. ANSI mode doesn't like spaces and throws an ANSI exception. I will file an issue for floats as well.
This is actually an unnecessary check, as \r is being checked as a string, which would already be caught by the regex check.
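The ANSI versus non-ANSI distinction raised in this thread can be sketched on the CPU as follows. This is only an illustrative model, not the plugin's `castStringToDecimal`: `castStringToDecimalModel` is a hypothetical name, `java.math.BigDecimal` is an assumed stand-in for the GPU validity check, and the sketch deliberately does not strip the input first.

```scala
import java.math.{BigDecimal => JBigDecimal}
import scala.util.{Failure, Success, Try}

object AnsiCastSketch {
  // Hypothetical CPU-only model: returns None where a non-ANSI cast would
  // produce null, and throws where an ANSI cast would raise an error.
  def castStringToDecimalModel(s: String, ansiEnabled: Boolean): Option[JBigDecimal] =
    Try(new JBigDecimal(s)) match {
      case Success(d) => Some(d)
      // Non-ANSI: invalid input (including embedded "\r" or "\n") becomes null.
      case Failure(_) if !ansiEnabled => None
      // ANSI: invalid input is an error rather than a null.
      case Failure(_) =>
        throw new NumberFormatException(s"invalid input for decimal cast: '$s'")
    }
}
```

With this model, `castStringToDecimalModel("1\r2", ansiEnabled = false)` yields `None`, while the same call with `ansiEnabled = true` throws, which mirrors the point that whitespace should be an error in ANSI mode rather than being silently filtered.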
build
castStringToDecimal method
Fixes #2019
Depends on rapidsai/cudf#9658