Optimize Integer decoding from bytes #11796
Conversation
By using `Pointer#copy_to` and not allocating new slices.
There's actually no memory allocation involved here. This is slightly faster in this PR because some bounds checks are removed, which makes this easily prone to segfaults. I'm not sure we should do that. I'm not at a computer right now, but if I have time I will send code that makes this change segfault.
Yes, the change to … I believe the change to …
Removed
🚀 Thanks!
The discussion within this PR shows that we're missing failing specs for encode/decode.
@carlhoerberg Your benchmark results seem to have run into the issue where each iteration is so fast that the results are inaccurate. I updated the benchmark code to run more iterations inside the `report` blocks.

Updated code:

```crystal
require "benchmark"

module IO::ByteFormat
  {% for mod in %w(LittleEndian BigEndian) %}
    module {{mod.id}}
      {% for type, i in %w(Int8 UInt8 Int16 UInt16 Int32 UInt32 Int64 UInt64 Int128 UInt128) %}
        {% bytesize = 2 ** (i // 2) %}
        def self.decode2(type : {{type.id}}.class, bytes : Bytes)
          buffer = uninitialized UInt8[{{bytesize}}]
          bytes.copy_to(buffer.to_unsafe, {{bytesize}})
          buffer.reverse! unless SystemEndian == self
          buffer.unsafe_as({{type.id}})
        end
      {% end %}
    end
  {% end %}
end

b = Bytes.new(128) { 1u8 }
iterations = 1_000

{% for mod in %w(LittleEndian BigEndian) %}
  {% for type, i in %w(Int8 UInt8 Int16 UInt16 Int32 UInt32 Int64 UInt64 Int128 UInt128) %}
    Benchmark.ips do |x|
      x.report("old decode {{mod.id}} {{ type.id }}") do
        iterations.times { IO::ByteFormat::{{mod.id}}.decode({{type.id}}, b) }
      end
      x.report("new decode {{mod.id}} {{ type.id }}") do
        iterations.times { IO::ByteFormat::{{mod.id}}.decode2({{type.id}}, b) }
      end
    end
  {% end %}
{% end %}
```

Diff from the original benchmark code:

```diff
19a20
> iterations = 1_000
25c26
< IO::ByteFormat::{{mod.id}}.decode({{type.id}}, b)
---
> iterations.times { IO::ByteFormat::{{mod.id}}.decode({{type.id}}, b) }
28c29
< IO::ByteFormat::{{mod.id}}.decode2({{type.id}}, b)
---
> iterations.times { IO::ByteFormat::{{mod.id}}.decode2({{type.id}}, b) }
```

Intel results
ARM64 results
Seems like LLVM might be optimizing the stack allocation out of the resulting binary. I'm surprised your results were so consistently one-sided.
By using `Slice#copy_to(target : Pointer(T), count)` and not (stack) allocating new slices.

Benchmark
Results
Discussion
Decoding is about 80% faster, with the notable exception of BigEndian (U)Int64, which becomes 15% slower.