[BUG] Decimal parses non-arabic numbers #10532

Open
revans2 opened this issue Mar 1, 2024 · 0 comments
Labels
bug Something isn't working


revans2 commented Mar 1, 2024

Describe the bug
I don't know how critical this is, but as part of my investigation into JSON number parsing, specifically decimals, I found that Spark passes in a locale when parsing quoted string decimals. This led me to discover that, by default, BigDecimal treats any character that can be converted to a digit as a digit, as does java.lang.Long.

(0 until 32767*2 + 1).map(i => i.toChar).filter(c => Character.isDigit(c)).map(c => (Character.digit(c, 10), c))

This shows that there are at least 350 different characters that Java treats as "digits".
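
To make the mapping concrete, here is a minimal sketch that looks at one string character by character. The object and method names (`DigitValues`, `digitValues`, `nonAsciiDigits`) are made up for illustration; only `Character.digit` and `Character.isDigit` come from the JDK. The sample string is written with Unicode escapes: U+0662 is ARABIC-INDIC DIGIT TWO and U+096D is DEVANAGARI DIGIT SEVEN.

```scala
object DigitValues {
  // Radix-10 value of each character, or -1 where Character.digit finds no digit.
  def digitValues(s: String): Seq[Int] =
    s.map(c => Character.digit(c, 10))

  // Characters that Character.isDigit accepts but that fall outside ASCII '0'..'9'.
  def nonAsciiDigits(s: String): String =
    s.filter(c => Character.isDigit(c) && !(c >= '0' && c <= '9'))
}
```

For the string `"1\u0662\u096D"`, `digitValues` should yield the values 1, 2 and 7, and `nonAsciiDigits` should return the last two characters. Like the enumeration above, this only inspects individual UTF-16 `Char` values.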

scala> new java.math.BigDecimal("1٢७")
res20: java.math.BigDecimal = 127

scala> new java.lang.Long("1٢७")
res21: Long = 127

scala> new java.lang.Float("1٢७")
java.lang.NumberFormatException: For input string: "1٢७"
  at sun.misc.FloatingDecimal.readJavaFormatString(FloatingDecimal.java:2043)
  at sun.misc.FloatingDecimal.parseFloat(FloatingDecimal.java:122)
  at java.lang.Float.parseFloat(Float.java:451)
  at java.lang.Float.<init>(Float.java:532)
  ... 47 elided
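
The three REPL results above can be put side by side with a small sketch that wraps each parser in `Try`, so the `NumberFormatException` becomes a `Failure` instead of a stack trace. `ParseComparison` and its members are hypothetical names. The sketch uses `Long.parseLong` and `Float.parseFloat` rather than the boxed constructors, which are deprecated on newer JDKs; the constructors delegate to these methods, so the behavior should be the same.

```scala
import scala.util.Try

object ParseComparison {
  // "1", ARABIC-INDIC DIGIT TWO (U+0662), DEVANAGARI DIGIT SEVEN (U+096D)
  val mixed: String = "1\u0662\u096D"

  // Each call returns Success(value) or Failure(NumberFormatException).
  def asDecimal(s: String): Try[java.math.BigDecimal] =
    Try(new java.math.BigDecimal(s))

  def asLong(s: String): Try[Long] =
    Try(java.lang.Long.parseLong(s))

  def asFloat(s: String): Try[Float] =
    Try(java.lang.Float.parseFloat(s))
}
```

With `ParseComparison.mixed` as input, `asDecimal` and `asLong` are expected to succeed with 127, while `asFloat` is expected to fail.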

Happily, in Spark this appears to affect only casts to decimal; the same string cast to long or double returns null.

scala> Seq("1٢७").toDF("what").selectExpr("what", "CAST(what AS LONG) what_long", "CAST(what AS DECIMAL(10,0)) what_dec", "CAST(what AS double) what_double").show()
+----+---------+--------+-----------+
|what|what_long|what_dec|what_double|
+----+---------+--------+-----------+
| 1٢७|     null|     127|       null|
+----+---------+--------+-----------+

But that still means we parse these strings differently from Spark. This is probably not a big deal, but it is a little scary.
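
As an illustration only, and not a proposed fix, one hypothetical way to get the same decimal result from a parser that understands only ASCII digits would be to normalize the string first. The sketch below assumes that this step would run only on the string-to-decimal path, since long and double casts return null for such input in Spark. `DigitNormalizer` and `toAsciiDigits` are invented names.

```scala
object DigitNormalizer {
  // Replace every character that Character.isDigit accepts with its ASCII
  // equivalent; all other characters pass through unchanged.
  // Works on individual UTF-16 Chars, as java.math.BigDecimal's parser does.
  def toAsciiDigits(s: String): String =
    s.map { c =>
      if (Character.isDigit(c)) ('0' + Character.digit(c, 10)).toChar else c
    }
}
```

Under these assumptions, `DigitNormalizer.toAsciiDigits("1\u0662\u096D")` returns `"127"`, and strings that already use only ASCII digits come back unchanged.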

revans2 added the "bug" (Something isn't working) and "? - Needs Triage" (Need team to review and classify) labels Mar 1, 2024
sameerz removed the "? - Needs Triage" (Need team to review and classify) label Mar 5, 2024
sameerz changed the title from "[BUG] Decimal parses non-aribic numbers" to "[BUG] Decimal parses non-arabic numbers" Mar 6, 2024