Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Optimize String#to_json #5456

Merged
merged 1 commit into from
Jan 2, 2018
Merged

Conversation

larubujo
Copy link
Contributor

saw this: https://stackoverflow.com/questions/47952457/crystal-slow-json-serialization-of-structs-containing-large-strings

this pr still slower then go (crystal memory allocator slower, probably gc slower), but faster then before

old code: go char by char and append it
new code: go char by char, if escape then append all before that point, then escaped char

best case: string has no escape, append all in one go
worse case: all chars escapes, similar to old code

bench:

require "benchmark"
require "json"

record Page, uri : String, html : String do
  JSON.mapping(uri: String, html: String)
end

url = "not an uri has escape like \" \\ and also \n "
html = "a" * (128 * 1024)

page = Page.new(url, html)
io = IO::Memory.new

Benchmark.ips do |x|
  x.report("json") do
    page.to_json(io)
    io.clear
  end
end

old:

json 766.91  (   1.3ms) (± 3.00%) fastest

new:

json    1.7k (587.83µs) (± 8.62%) fastest

pos = 0
reader = Char::Reader.new(string)

while reader.has_next?
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You could use while char = reader.next_char (obviously modified to include the first char), eliminating an additional method call.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

no. next_char raises on end

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This can easily be handled with when Char::ZERO; break and the reader will never get to a point where it raises. This solution should be faster than calling has_next? and current_char on each iteration.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

string can have zero byte in it. so condition not always work.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

True, that's a valid character in a JS string.
Still, you could check has_next? only if char == Char::ZERO instead of every iteration.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

search while reader.has_next? in std. 7 occurrences. nothing wrong with this code. plus i dont think its slower than your approach. this is idiomatic code.

when '\\'
io << "\\\\"
escape = "\\\\"
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

escape could just be a Char (the one being escaped).

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

and then?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

io << '\\' << escaped_char. It would be more efficient than creating and inserting a new string

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Strings are allocated on the heap, chars not. Here it is very easy to avoid heap allocations and use chars instead.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Constant strings are not allocated on the heap. They're pre-built in read only ELF data, so doing foo = "bar" is simply the cost of assigning an address to a variable.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

True, forgot about that.

Still, using a Char should be slightly preferrable. It's just less data to store: no size field + backslash is only needed once for all escape sequences.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

escape cant be char. for \b want "\b", not '\' and '\b'. escape must be string. or it must be 'b' in case of '\b', but then two writes, slower than one write. but also, nanosecond time. not relevant.

please send a different pr where performance is beat, then i can accept changes. otherwise this pr is good.

string.each_char do |char|
case char

pos = 0
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Please use a more descriptive name like last_start_pos.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

too long... maybe start or start_pos... theres only one pos to track, why last? theres no first. theres no middle. algorithm is simple.

end

if escape
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'd suggest to put this behaviour directly in the when branches (maybe extract in a private method). This adds some code duplication but eliminates an additional condition check.

Or just skip to the next char with next from the .ascii_control? and else branches. Then you can keep it here but don't need the condition.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

put directly in when: longer code.

i understand your concern, but these little details not improve performance, and make code longer, same to understand. so i prefer not change.

io << "\\u"
ord = char.ord
io << '0' if ord < 0x1000
io << '0' if ord < 0x100
io << '0' if ord < 0x10
ord.to_s(16, io)
else
io << char
set_pos = false
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You could just insert next here, eliminating the need for a variable set_pos.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

maybe, ends up being same longer

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The code might be a bit longer, but I expect execution time to be lower.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

i changed code, same time. so stay the same for me.

Copy link
Contributor

@RX14 RX14 Dec 29, 2017

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I would like to remove set_pos personally... it's not a big deal for me but the logic is slightly harder to follow. I want to see what the rest of the core team thinks.

@larubujo
Copy link
Contributor Author

@straight-shoota sorry, but you not core team member, so why do i need apply suggestions?

better close this. you want to implement this, you do it.

@larubujo larubujo closed this Dec 26, 2017
@RX14
Copy link
Contributor

RX14 commented Dec 26, 2017

@larubujo all reviews are just suggestions, if you disagree with them feel free to not implement them, but I don't see why you'd close this PR.

@straight-shoota
Copy link
Member

Yeah, you don't have to follow my suggestions. I was just giving you hints how I think you could improve your code. If you don't want to apply them, then don't. But I think they're reasonable to improve efficiency even further.

@larubujo larubujo reopened this Dec 29, 2017
@larubujo
Copy link
Contributor Author

i reopen. i can rename pos to start_pos, but other suggestions not apply for me.

Copy link
Contributor

@RX14 RX14 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is fine to merge if the rest of the core team agrees.

We might need to change the scope of the escaping in the future to escape more characters but that's outside the scope of this PR.

@larubujo
Copy link
Contributor Author

refactored. let me know what you think.

@RX14
Copy link
Contributor

RX14 commented Dec 29, 2017

I think I prefer the old version, it has less duplication and less to go wrong. Replacing the conditionals with next results in code which is harder to maintain.

@larubujo
Copy link
Contributor Author

boring...

@larubujo larubujo closed this Dec 29, 2017
@RX14
Copy link
Contributor

RX14 commented Dec 30, 2017

@larubujo I meant your old version that you originally PRed, not the version before this PR! The version before you said "refactored. let me know what you think." Sorry for the misunderstanding.

@larubujo
Copy link
Contributor Author

problem is, i lost that code. cant remember now. plus review is "do this, no, do that, revert this now". really boring and frustrating. better someone continue this, i got tired.

@larubujo larubujo reopened this Dec 31, 2017
@larubujo
Copy link
Contributor Author

reopen second time. please accept it. also note this is faster than original code, so better:

json   1.82k ( 550.2µs) (± 2.41%) fastest

slightly faster. but maybe moon closer to earth these days, who knows.

@straight-shoota
Copy link
Member

I think it is good now.

However, it could improve the handling of special characters besides performance improvements. But this could be done in a subsequent PR. Namely, it misses named escapes for \0 and \v - they should still be escaped as \u0000 and \u000B but using the named escapes is more elegant and saves a little bitte of space. .ascii_control? should better be .control?. And it should fail for invalid unicode characters. Crystal strings can currently contain invalid UTF-8, JavaScript/JSON cannot.

Copy link
Member

@bcardiff bcardiff left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I like the optimization, but since the flow is more complicated than linear iteration of chars I would prefer to have some specs to ensure corner cases of the code won't fail in the future due to further changes. Like: if the string starts with sequences to escape, if it has more than one sequences to escape together, if there is trailing sequences to escape. Those kind of cases might work right now, but some slightly changes could end in chars been lost.

@larubujo larubujo force-pushed the opt-string-to-json branch from 5eb00cf to 22a66cc Compare January 2, 2018 13:22
@larubujo
Copy link
Contributor Author

larubujo commented Jan 2, 2018

@bcardiff added more specs

@@ -215,6 +215,19 @@ describe "JSON serialization" do
"\u{19}".to_json.should eq("\"\\u0019\"")
end

it "does for String with control codes in a few places" do
"\fab".to_json.should eq("\"\\fab\"")
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It's probably much more readable to use %q("\fab") here instead of "\"\\fab\"".

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

fixed

@RX14 RX14 added this to the Next milestone Jan 2, 2018
@RX14 RX14 merged commit be9d9c7 into crystal-lang:master Jan 2, 2018
lukeasrodgers pushed a commit to lukeasrodgers/crystal that referenced this pull request Jan 7, 2018
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

6 participants