Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Refactor JSON::PullParser#consume_number to use stdlib number parsing #10447

Merged

Conversation

straight-shoota
Copy link
Member

@straight-shoota straight-shoota commented Feb 26, 2021

The custom parsing algorithm was insufficient and had several issues around range boundaries.
Replacing it with the established conversion methods ensures correctness.

Resolves #10419
Resolves #10920

The custom parsing algorithm was insufficient and had several issues
around range boundaries.
Replacing it with the established conversion methods ensures
correctness.
@straight-shoota straight-shoota added kind:bug A bug in the code. Does not apply to documentation, specs, etc. topic:stdlib:serialization labels Feb 26, 2021
@asterite
Copy link
Member

How is performance affected?

src/json/token.cr Show resolved Hide resolved
src/json/token.cr Show resolved Hide resolved
@straight-shoota
Copy link
Member Author

The focus is correctness, not performance. So I don't have an extensive performance analysis. It doesn't matter if we have a fast algorithm when its results are plain wrong. I'm sure this can be optimized further, but for now I'd just like to fix the bugs in number parsing.

These are the results for a simple test, parsing an array of 100 random ints/floats.

$ ./bm-json-number-old
  ints  73.67k ( 13.57µs) (± 3.63%)   5.5kB/op        fastest
floats  45.46k ( 22.00µs) (± 2.83%)  7.11kB/op   1.62× slower
$ ./bm-json-number-new
  ints  58.82k ( 17.00µs) (± 2.79%)  5.47kB/op        fastest
floats  31.69k ( 31.56µs) (± 2.92%)  7.09kB/op   1.86× slower

@kostya
Copy link
Contributor

kostya commented Feb 26, 2021

for this bench: https://github.com/kostya/benchmarks#json
was: 0.661 s
this pr: 0.839 s

@asterite
Copy link
Member

@straight-shoota I believe correctness can be achieved without sacrificing performance. There's no need to go from a string to an intermediate string to an int or float. It can be parsed directly. Also numbers in JSON are simpler than what we cover in the std (no whitespace, no underscore, etc.)

@straight-shoota
Copy link
Member Author

Sure, but this is not trivial. Until somebody implements that correctly, we should fall back to a less performant but correct implementation.

@kostya
Copy link
Contributor

kostya commented Feb 26, 2021

what if adapt this parser https://github.com/lemire/fast_double_parser, which is used by simdjson

@kostya
Copy link
Contributor

kostya commented Mar 4, 2021

for fun i try this, with this shard: https://github.com/kostya/fast_to_f

with such dirty patch:

class JSON::Lexer::StringBased
  private def consume_number
    buf = @reader.string.to_unsafe + @reader.pos
    v, end_s = FastToF.parse_internal(buf)

    if end_s.null?
      unexpected_char
    else
      if v.round == v
        @token.kind = :int
        @token.int_value = v.to_i64
      else
        @token.kind = :float
        @token.float_value = v
      end

      chars_count = end_s - buf
      @reader.pos = @reader.pos + chars_count
    end
  end
end

my benchmark:
was: 0.655 s
with this: 0.550 s

Copy link
Member

@sdogruyol sdogruyol left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm on the same page with @straight-shoota here. I think the program should behave correct first, then we can always think about performance.

@straight-shoota straight-shoota added this to the 1.1.0 milestone Apr 13, 2021
@straight-shoota straight-shoota modified the milestones: 1.1.0, 1.2.0 Jun 22, 2021
@straight-shoota
Copy link
Member Author

We'll postpone this to work on the implementation.

@straight-shoota straight-shoota changed the title Refactor JSON::PullParser#consume_number to use stdlib number parsing Refactor JSON::PullParser#consume_number to use stdlib number parsing Jun 29, 2021
@straight-shoota
Copy link
Member Author

After reviewing this again, I think we should merge this as is. This patch fixes a number of serious bugs in JSON number parsing. Performance degradation is insignificant over ensuring correctness.
Sure, this can be implemented more efficiently and we should aim to do that. But that's gonna take some more effort. We must not postpone the low hanging fruit of fixing the parser while waiting for a more performant implementation.

@straight-shoota straight-shoota added this to the 1.2.0 milestone Aug 25, 2021
Copy link
Member

@beta-ziliani beta-ziliani left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Given that there's no better proposal yet, I would like this one to get merged. Like @straight-shoota said, better a correct algorithm than a broken one. When we get a new working and fast algorithm, we'll replace this one.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
kind:bug A bug in the code. Does not apply to documentation, specs, etc. topic:stdlib:serialization
Projects
None yet
Development

Successfully merging this pull request may close these issues.

JSON parse gives wrong Float64 result JSON::PullParser is broken for MIN and MAX values
7 participants