-
Notifications
You must be signed in to change notification settings - Fork 71
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Fix getJsonObject number normalization 0.0 <=> inf bug #1944
Fix getJsonObject number normalization 0.0 <=> inf bug #1944
Conversation
Signed-off-by: Haoyang Li <[email protected]>
Signed-off-by: Chong Gao <[email protected]>
…0, V100)" This reverts commit ff2db7e.
Signed-off-by: Haoyang Li <[email protected]>
Add some cases in GetJsonObjectTest.java |
There is only one line change:
But the following code will adjust
So if Fortunately, JSON parser defined the max digits for JSON numbers, refer to
So this change is OK. |
* | ||
* This function will also handle scientific notation format. | ||
*/ | ||
__device__ inline double stod(cudf::string_view const& d_str) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Please describe this is copied from cuDF to do a temporary fix for branch-24.04.
Long term fix will be in cuDF.
The diff is:
+ if (exp_ten >= 1e8) break;
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
done.
@@ -1516,8 +1517,8 @@ class json_parser { | |||
// 200.000 => 200.0, 351.980 => 351.98, 12345678900000000000.0 | |||
// => 1.23456789E19 0.0000000000003 => 3.0E-13; 0.003 => 0.003; 0.0003 | |||
// => 3.0E-4 leverage function: `get_current_float_parts` | |||
double d_value = | |||
cudf::strings::detail::stod(cudf::string_view(current_token_start_pos, number_token_len)); | |||
double d_value = spark_rapids_jni::detail::stod( |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
BTW, please update the comments:
// number normalization:
// 0.03E-2 => 0.3E-4; large value => infinity; small value => 0.0
// 200.000 => 200.0, 351.980 => 351.98, 12345678900000000000.0
// => 1.23456789E19 0.0000000000003 => 3.0E-13; 0.003 => 0.003; 0.0003
// => 3.0E-4
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
done.
following IT passed: def json_string_to_float(x):
if x == '"-Infinity"':
return float('-inf')
elif x == '"Infinity"':
return float('inf')
else:
return float(x)
@pytest.mark.parametrize('data_gen', [StringGen('-?[1-9]\d{0,20}\.\d{1,20}', nullable=False),
StringGen('-?[1-9]\d{0,5}\.\d{1,20}', nullable=False),
StringGen('-?[1-9]\d{0,20}\.\d{1,5}', nullable=False),
StringGen('-?[1-9]\d{0,5}\.\d{1,5}', nullable=False),
StringGen('-?[1-9]\d{0,20}E-?\d{1,20}', nullable=False),
StringGen('-?[1-9]\d{0,5}E-?\d{1,20}', nullable=False),
StringGen('-?[1-9]\d{0,20}E-?\d{1,5}', nullable=False),
StringGen('-?[1-9]\d{0,5}E-?\d{1,5}', nullable=False)], ids=idfn)
def test_get_json_object_floating_normalization(data_gen):
schema = StructType([StructField("jsonStr", StringType())])
gpu_res = [[row[1]] for row in with_gpu_session(
lambda spark: unary_op_df(spark, data_gen, length=1000000).selectExpr(
'a',
'get_json_object(a,"$")'
).collect(),
conf={'spark.rapids.sql.expression.GetJsonObject': 'true'})]
cpu_res = [[row[1]] for row in with_cpu_session(
lambda spark: unary_op_df(spark, data_gen, length=1000000).selectExpr(
'a',
'get_json_object(a,"$")'
).collect())]
for i in range(len(gpu_res)):
if not math.isclose(json_string_to_float(gpu_res[i][0]), json_string_to_float(cpu_res[i][0])):
print("GPU: ", gpu_res[i][0], json_string_to_float(gpu_res[i][0]), " CPU: ", cpu_res[i][0], json_string_to_float(cpu_res[i][0]))
assert math.isclose(json_string_to_float(gpu_res[i][0]), json_string_to_float(cpu_res[i][0])) We will add above tests to 24.06 in a follow-up. |
Signed-off-by: Haoyang Li <[email protected]>
build |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Just one more comment.
ch = *in_ptr++; | ||
if (ch < '0' || ch > '9') break; | ||
exp_ten = (exp_ten * 10) + (int)(ch - '0'); | ||
if (exp_ten >= 1e8) break; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can we have an explanation about why this is here? This is the only change between CUDF and this version so I would like to understand it better. Mainly because it has a magic number of 1e8 in it and I would really like to understand where that came from.
Fixes #1923
Floating number normalization in getJsonObject kernel using
cudf::strings::detail::stod
to convert string to double, then convert back to string withftos_converter
.For numbers with very large exp, like
9.299999257686047e-0005603333574677677
or9.299999257686047e0005603333574677677
, following code will overflow and produce wrong sign:9.299999257686047e-0005603333574677677
should be 0.0, but cuDF gets Infinity9.299999257686047e0005603333574677677
should be infinity, but cuDF gets 0.0cuDF is using
int
to save exponent value.For the above 1st double string, it will overflow from negative value to positive value.
For the above 2nd double string, it will overflow from positive value to negative value.
So cuDF gets totally wrong result.
// check for exponent char int exp_ten = 0; int exp_sign = 1; if (in_ptr < end) { char ch = *in_ptr++; if (ch == 'e' || ch == 'E') { if (in_ptr < end) { ch = *in_ptr; if (ch == '-' || ch == '+') { exp_sign = (ch == '-' ? -1 : 1); ++in_ptr; } while (in_ptr < end) { ch = *in_ptr++; if (ch < '0' || ch > '9') break; exp_ten = (exp_ten * 10) + (int)(ch - '0'); + if (exp_ten >= 1e8) break; } } } }
Then with a wrong sign, it will turn to
0.0
orinf
in a wrong way.Adding
if (exp_ten >= 1e8) break;
will prevent it from overflowing, and since the exp here is big enough to be an inf, it doesn't matter what the real value is. I choose1e8
here because it's the largest power of 10 that can be multiplied by 10 one more time without overflowing, so it's the largest possible limit here. The reason not to choosestd::numeric_limits<double>::max_exponent10
(which is 308) here is that there is anexp_off
, which could be negative, to calculate the final exp later.I think it's a cudf bug, but let's hot fix it in a copy of stod in jni, to make getJsonObject happy first.
Will file a cudf issue soon.