Data read with Zlib::GzipReader (but not copy of same data) slow to parse with JSON.parse #2193
Comments
Can you try putting |
Adding |
Thank you for the great report, it looks like pure-Ruby JSON parsing is slower with the String from
Maybe it has many ConcatRope and the reading accesses in JSON don't force to materialize the @billdueber To make sure the pure-Ruby version of |
It's simpler than that, actually:
p Truffle::CExt.native_string?(previously_gzipped_data) # => true
forced_copy = previously_gzipped_data + " "
p Truffle::CExt.native_string?(forced_copy) # => false
json = JSON.parse(previously_gzipped_data)
p Truffle::CExt.native_string?(previously_gzipped_data) # => true
So the issue when we use |
Most likely involved, this truffleruby/lib/json/lib/json/pure/parser.rb Line 140 in e020c89
previously_gzipped_data.encoding is UTF-8 (maybe a bit surprising? But MRI behaves the same).
And Then this String is passed to StringScanner, and StringScanner#scan is used which will run Regexps against it. |
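To illustrate why this matters: a parser built on StringScanner calls scan once per token, so any fixed cost paid each time a Regexp is run against the source string is multiplied by the number of tokens. The toy tokenizer below is only a sketch of that access pattern. It is not the json gem's parser, and the TOKEN pattern and count_tokens name are made up for illustration.

```ruby
require "strscan"

# Toy tokenizer (NOT the json gem's parser): counts how many Regexp matches
# a StringScanner-based pass performs over a JSON-like string.
TOKEN = /\s*(?:"(?:[^"\\]|\\.)*"|-?\d+(?:\.\d+)?|[\[\]{},:]|true|false|null)/

def count_tokens(source)
  scanner = StringScanner.new(source)
  count = 0
  until scanner.eos?
    # Each scan call runs the Regexp at the scanner's current position.
    # Stop if nothing matches (e.g. trailing whitespace or invalid input).
    break unless scanner.scan(TOKEN)
    count += 1
  end
  count
end

puts count_tokens('[["a", 1], ["b", 2]]')
```

Even this tiny input needs one scan call per bracket, comma, string, and number, which is why a per-call slowdown on one kind of String would show up so clearly in a parse benchmark.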
JSON gem (almost) just as you expected -- I apparently have 2.1.0
|
Right, that's expected for TruffleRuby 20.3.0, thanks for confirming. |
Just pinging on this to see if it's on anyone's radar. I've updated the benchmarking code to explicitly assert that calling Removed Oj because it made the "x slower" comparison so hard to read, but it's about 10x faster than JSON. tl;dr is that on my hardware gzdata is about 30x slower on the smaller (array-of-100-arrays) json string, and
|
Thank you, we should indeed look at this again, especially since the migration to TruffleString which should help with this. |
I threw up another gist with a less-organically-written benchmark script that strictly compares one set of plain data vs one set of gzipped data. https://gist.github.com/billdueber/87b4f100c4a2d5d1c756470e09615857 Sample output:
|
Thank you, I can reproduce it with
if IS_TRUFFLE
  FILEDATA["gzdata"] = Truffle::Debug.flatten_string(FILEDATA["gzdata"])
end
makes them both the same speed. |
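The same idea can be wrapped so that a benchmark script still runs on other Ruby implementations. This is only a sketch: flatten_for_parsing is a hypothetical helper name, and Truffle::Debug.flatten_string is a debugging API taken from the comment above that may not exist in every TruffleRuby release, which is why the call is guarded.

```ruby
require "json"

# Hypothetical helper: on TruffleRuby, pass the string through
# Truffle::Debug.flatten_string before parsing; on any other Ruby
# (or if that debugging API is missing), return the string unchanged.
def flatten_for_parsing(string)
  if RUBY_ENGINE == "truffleruby" &&
     defined?(Truffle::Debug) &&
     Truffle::Debug.respond_to?(:flatten_string)
    Truffle::Debug.flatten_string(string)
  else
    string
  end
end

parsed = JSON.parse(flatten_for_parsing("[[1, 2], [3, 4]]"))
puts parsed.length
```

Since this relies on a debug API, it is more useful for confirming the diagnosis than as a production workaround.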
Just checking in again every year-and-a-half or so to see if there's any progress or a "won't-fix" on this. Still using the simple benchmark from https://gist.github.com/billdueber/87b4f100c4a2d5d1c756470e09615857
Thanks! |
One workaround/solution for this is to use the json gem native extension, e.g. with
So this performance issue still exists, although these days it seems to be because of the SwitchEncodingNode in MatchInRegionTRegexNode (vs BytesNode previously). After
Same speed for plain & gzdata, but much slower than the pure-Ruby JSON. But we can also use the new
Same speed for plain & gzdata, and close to the pure-Ruby JSON. In general we are working on improving JSON speed as it's been identified as significantly slower than it could be (cc @andrykonchin). |
Pure-Ruby JSON and with a fix to avoid repeated native->managed conversions:
So that solves this issue, will put up a PR for it. |
(first issue opened here; if I'm off the mark or in the wrong spot, my apologies and please let me know)
JSON-parsing data read from a gzipped file with Zlib::GzipReader is very slow. Parsing the same byte-for-byte data read from a non-gzipped file is not slow. Parsing a copy of the data read from a gzipped file is also not slow.
This gist gives a small program that creates a bunch of arrays-of-arrays and dumps json output into two files, one plain and one gzipped. It then reads those data back in via File.read or Zlib::GzipReader, additionally making a copy of the formerly-gzipped data. Note that I'm not parsing every time -- I'm just getting a copy of the data (as seen in this snippet from the gist):
Oj doesn't show this same issue. I know benchmarking this stuff is treacherous, but the pattern seems pretty consistent.
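For readers without the gist at hand, the setup described above can be sketched roughly as follows. This is not the gist's code: sample_json, load_variants, time_parsing, the file names, the data shape, and the iteration count are all assumptions, and String.new is just one possible way to make the copy.

```ruby
require "json"
require "zlib"
require "tmpdir"
require "benchmark"

# Hypothetical stand-in for the gist's data: an array of 100 small arrays.
def sample_json(rows: 100, cols: 10)
  JSON.generate(Array.new(rows) { |i| Array.new(cols) { |j| "value-#{i}-#{j}" } })
end

# Write the same JSON to a plain file and a gzipped file, then read both back.
def load_variants(json)
  Dir.mktmpdir do |dir|
    plain_path = File.join(dir, "data.json")
    gz_path    = File.join(dir, "data.json.gz")

    File.write(plain_path, json)
    Zlib::GzipWriter.open(gz_path) { |gz| gz.write(json) }

    gzdata = Zlib::GzipReader.open(gz_path) { |gz| gz.read }
    {
      "plain"  => File.read(plain_path),
      "gzdata" => gzdata,
      "gzcopy" => String.new(gzdata) # one way to copy; the gist may differ
    }
  end
end

# Time repeated JSON.parse calls on each variant (seconds per variant).
def time_parsing(variants, iterations: 200)
  variants.transform_values do |data|
    Benchmark.realtime { iterations.times { JSON.parse(data) } }
  end
end

VARIANTS = load_variants(sample_json)
time_parsing(VARIANTS).each { |name, secs| puts format("%-7s %.4fs", name, secs) }
```

All three strings hold the same bytes, so any timing difference between them comes from how the String is represented, not from its content.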
My original data was just a file on disk as gzipped with gzip, so I don't think this has anything to do with Zlib::GzipWriter -- that's just to make it a self-contained bug reproduction. I'm assuming that GzipReader is hanging on to the string in some way that makes things nasty?
-Bill-