Optimize Integer decoding from bytes #11796

@carlhoerberg (Contributor) commented Feb 3, 2022

Speeds up integer decoding by using Slice#copy_to(target : Pointer(T), count) instead of (stack) allocating new slices.
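For context, here is a minimal sketch of what the decode call being optimized does from the caller's side. It uses only the public `IO::ByteFormat` API that the benchmark below exercises; the byte values are made up for illustration.

```crystal
# Both calls read the same 16-bit value, stored in opposite byte orders.
little = IO::ByteFormat::LittleEndian.decode(UInt16, Bytes[0x34, 0x12])
big = IO::ByteFormat::BigEndian.decode(UInt16, Bytes[0x12, 0x34])

puts little == 0x1234_u16 # => true
puts big == 0x1234_u16    # => true

# Only the first sizeof(UInt16) bytes are read; trailing bytes are ignored.
padded = IO::ByteFormat::LittleEndian.decode(UInt16, Bytes[0x34, 0x12, 0xff])
puts padded == 0x1234_u16 # => true
```

The benchmark passes a 128-byte slice for every type, so it relies on this "read a prefix" behaviour.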

Benchmark

require "benchmark"

module IO::ByteFormat
  {% for mod in %w(LittleEndian BigEndian) %}
    module {{mod.id}}
      {% for type, i in %w(Int8 UInt8 Int16 UInt16 Int32 UInt32 Int64 UInt64 Int128 UInt128) %}
        {% bytesize = 2 ** (i // 2) %}
        def self.decode2(type : {{type.id}}.class, bytes : Bytes)
          buffer = uninitialized UInt8[{{bytesize}}]
          bytes.copy_to(buffer.to_unsafe, {{bytesize}})
          buffer.reverse! unless SystemEndian == self
          buffer.unsafe_as({{type.id}})
        end
      {% end %}
    end
  {% end %}
end

b = Bytes.new(128) { 1u8 }

{% for mod in %w(LittleEndian BigEndian) %}
  {% for type, i in %w(Int8 UInt8 Int16 UInt16 Int32 UInt32 Int64 UInt64 Int128 UInt128) %}
    Benchmark.ips do |x|
      x.report("old decode {{mod.id}} {{ type.id }}") do
        IO::ByteFormat::{{mod.id}}.decode({{type.id}}, b)
      end
      x.report("new decode {{mod.id}} {{ type.id }}") do
        IO::ByteFormat::{{mod.id}}.decode2({{type.id}}, b)
      end
    end
  {% end %}
{% end %}

Results

old decode LittleEndian Int8 481.42M (  2.08ns) (± 3.63%)  0.0B/op   1.80× slower
new decode LittleEndian Int8 865.40M (  1.16ns) (± 1.36%)  0.0B/op        fastest
old decode LittleEndian UInt8 532.87M (  1.88ns) (± 3.58%)  0.0B/op   1.65× slower
new decode LittleEndian UInt8 877.89M (  1.14ns) (± 3.18%)  0.0B/op        fastest
old decode LittleEndian Int16 492.96M (  2.03ns) (± 3.21%)  0.0B/op   1.79× slower
new decode LittleEndian Int16 881.37M (  1.13ns) (± 3.86%)  0.0B/op        fastest
old decode LittleEndian UInt16 468.77M (  2.13ns) (± 4.01%)  0.0B/op   1.84× slower
new decode LittleEndian UInt16 864.24M (  1.16ns) (± 1.69%)  0.0B/op        fastest
old decode LittleEndian Int32 491.97M (  2.03ns) (± 5.36%)  0.0B/op   1.85× slower
new decode LittleEndian Int32 911.73M (  1.10ns) (± 4.76%)  0.0B/op        fastest
old decode LittleEndian UInt32 492.52M (  2.03ns) (± 4.21%)  0.0B/op   1.75× slower
new decode LittleEndian UInt32 863.92M (  1.16ns) (± 2.46%)  0.0B/op        fastest
old decode LittleEndian Int64 481.59M (  2.08ns) (± 2.89%)  0.0B/op   1.92× slower
new decode LittleEndian Int64 925.13M (  1.08ns) (± 4.13%)  0.0B/op        fastest
old decode LittleEndian UInt64 474.08M (  2.11ns) (± 3.60%)  0.0B/op   1.83× slower
new decode LittleEndian UInt64 865.81M (  1.15ns) (± 1.47%)  0.0B/op        fastest
old decode LittleEndian Int128 479.22M (  2.09ns) (± 2.60%)  0.0B/op   1.80× slower
new decode LittleEndian Int128 861.06M (  1.16ns) (± 2.56%)  0.0B/op        fastest
old decode LittleEndian UInt128 478.12M (  2.09ns) (± 3.83%)  0.0B/op   1.85× slower
new decode LittleEndian UInt128 885.39M (  1.13ns) (± 3.43%)  0.0B/op        fastest
old decode BigEndian Int8 508.86M (  1.97ns) (± 2.52%)  0.0B/op   1.73× slower
new decode BigEndian Int8 882.45M (  1.13ns) (± 3.30%)  0.0B/op        fastest
old decode BigEndian UInt8 399.35M (  2.50ns) (±14.63%)  0.0B/op   2.18× slower
new decode BigEndian UInt8 871.88M (  1.15ns) (± 2.95%)  0.0B/op        fastest
old decode BigEndian Int16 520.54M (  1.92ns) (± 4.86%)  0.0B/op   1.76× slower
new decode BigEndian Int16 916.68M (  1.09ns) (± 5.12%)  0.0B/op        fastest
old decode BigEndian UInt16 493.51M (  2.03ns) (± 3.94%)  0.0B/op   1.75× slower
new decode BigEndian UInt16 863.52M (  1.16ns) (± 4.39%)  0.0B/op        fastest
old decode BigEndian Int32 144.54M (  6.92ns) (± 2.27%)  0.0B/op   1.20× slower
new decode BigEndian Int32 172.98M (  5.78ns) (± 2.75%)  0.0B/op        fastest
old decode BigEndian UInt32 144.65M (  6.91ns) (± 2.09%)  0.0B/op   1.19× slower
new decode BigEndian UInt32 172.56M (  5.80ns) (± 2.49%)  0.0B/op        fastest
old decode BigEndian Int64 172.93M (  5.78ns) (± 2.60%)  0.0B/op        fastest
new decode BigEndian Int64 150.70M (  6.64ns) (± 4.02%)  0.0B/op   1.15× slower
old decode BigEndian UInt64 175.44M (  5.70ns) (± 3.70%)  0.0B/op        fastest
new decode BigEndian UInt64 150.72M (  6.63ns) (± 3.56%)  0.0B/op   1.16× slower
old decode BigEndian Int128 123.75M (  8.08ns) (± 2.76%)  0.0B/op   1.32× slower
new decode BigEndian Int128 163.88M (  6.10ns) (± 5.83%)  0.0B/op        fastest
old decode BigEndian UInt128 132.80M (  7.53ns) (± 2.45%)  0.0B/op   1.21× slower
new decode BigEndian UInt128 161.01M (  6.21ns) (± 1.47%)  0.0B/op        fastest

Discussion

Decoding is about 80% faster in most cases (BigEndian 32-bit and 128-bit types gain only about 20-30%), with the notable exception of BigEndian (U)Int64, which becomes about 15% slower.

By using Pointer#copy_to and not allocating new slices.

@asterite (Member) commented Feb 3, 2022

There's actually no memory allocation involved here. The code in this PR is slightly faster because some bounds checks are removed, which makes it prone to segfaults. I'm not sure we should do that.

I'm not at a computer right now, but if I have time I will send code that makes this change segfault.

@straight-shoota (Member) commented

Yes, the change to .encode is definitely incorrect. It removes bounds checking on bytes and doesn't validate that it's not read_only.

I believe the change to .decode should be safe, though. It still uses Slice#copy_to which includes bounds checking. And we can be certain of the buffer size, so the pointer parameter should not be an issue.
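A small sketch of the bounds check referred to above, assuming `Slice#copy_to(target : Pointer(T), count)` validates `count` against the slice size as described. The variable names are illustrative only.

```crystal
source = Bytes[1, 2]                     # only 2 bytes available
buffer = StaticArray(UInt8, 4).new(0_u8) # room for a 4-byte integer

# Asking the slice for 4 bytes when it only holds 2 is expected to raise
# IndexError rather than read past the end of the slice.
rejected = begin
  source.copy_to(buffer.to_unsafe, 4)
  false
rescue IndexError
  true
end

puts rejected
```

By contrast, `Pointer#copy_to` has no size to check against, which is why replacing the slice call with a raw pointer copy would lose this protection.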

@carlhoerberg (Contributor, Author) commented Feb 3, 2022

Removed the .encode change; it was an afterthought and didn't improve performance anyway.

@carlhoerberg changed the title from "Optimize Integer encoding/decoding to/from bytes" to "Optimize Integer decoding from bytes" on Feb 4, 2022

@beta-ziliani (Member) left a comment

🚀 Thanks!

The discussion within this PR shows that we're missing failing specs for encode/decode.
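As a rough idea of what such specs could look like, here is a sketch that is not taken from the Crystal spec suite. It assumes both `.decode` and `.encode` raise `IndexError` when the slice is shorter than the integer, which follows from the bounds checks discussed above.

```crystal
require "spec"

describe IO::ByteFormat do
  it "raises when decoding from a slice that is too short" do
    expect_raises(IndexError) do
      IO::ByteFormat::LittleEndian.decode(UInt32, Bytes[1, 2, 3])
    end
    expect_raises(IndexError) do
      IO::ByteFormat::BigEndian.decode(UInt32, Bytes[1, 2, 3])
    end
  end

  it "raises when encoding into a slice that is too short" do
    expect_raises(IndexError) do
      IO::ByteFormat::LittleEndian.encode(0x01020304_u32, Bytes.new(3))
    end
    expect_raises(IndexError) do
      IO::ByteFormat::BigEndian.encode(0x01020304_u32, Bytes.new(3))
    end
  end
end
```

A spec for the read-only case mentioned earlier in the thread would follow the same pattern with a slice created via `Bytes.new(4, read_only: true)`.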

@beta-ziliani added this to the 1.4.0 milestone on Mar 18, 2022
@straight-shoota merged commit 9a5e6fa into crystal-lang:master on Mar 22, 2022

@jgaskins (Contributor) commented

@carlhoerberg Your benchmark results seem to have run into the issue where each iteration is so fast that the results are inaccurate. Benchmark.ips can't be trusted when the results are in the 1-3ns range.

I updated the benchmark code to run more iterations inside the report block and the only notable change in performance was on Int128. Everything else was within ±1% and even closer on ARM.

Updated code

require "benchmark"

module IO::ByteFormat
  {% for mod in %w(LittleEndian BigEndian) %}
    module {{mod.id}}
      {% for type, i in %w(Int8 UInt8 Int16 UInt16 Int32 UInt32 Int64 UInt64 Int128 UInt128) %}
        {% bytesize = 2 ** (i // 2) %}
        def self.decode2(type : {{type.id}}.class, bytes : Bytes)
          buffer = uninitialized UInt8[{{bytesize}}]
          bytes.copy_to(buffer.to_unsafe, {{bytesize}})
          buffer.reverse! unless SystemEndian == self
          buffer.unsafe_as({{type.id}})
        end
      {% end %}
    end
  {% end %}
end

b = Bytes.new(128) { 1u8 }
iterations = 1_000

{% for mod in %w(LittleEndian BigEndian) %}
  {% for type, i in %w(Int8 UInt8 Int16 UInt16 Int32 UInt32 Int64 UInt64 Int128 UInt128) %}
    Benchmark.ips do |x|
      x.report("old decode {{mod.id}} {{ type.id }}") do
        iterations.times { IO::ByteFormat::{{mod.id}}.decode({{type.id}}, b) }
      end
      x.report("new decode {{mod.id}} {{ type.id }}") do
        iterations.times { IO::ByteFormat::{{mod.id}}.decode2({{type.id}}, b) }
      end
    end
  {% end %}
{% end %}

Diff from the original benchmark code

19a20
> iterations = 1_000
25c26
<         IO::ByteFormat::{{mod.id}}.decode({{type.id}}, b)
---
>         iterations.times { IO::ByteFormat::{{mod.id}}.decode({{type.id}}, b) }
28c29
<         IO::ByteFormat::{{mod.id}}.decode2({{type.id}}, b)
---
>         iterations.times { IO::ByteFormat::{{mod.id}}.decode2({{type.id}}, b) }

Intel results

old decode LittleEndian Int8   4.44M (224.98ns) (± 1.25%)  0.0B/op   1.01× slower
new decode LittleEndian Int8   4.49M (222.73ns) (± 0.78%)  0.0B/op        fastest
old decode LittleEndian UInt8   4.48M (223.25ns) (± 0.94%)  0.0B/op        fastest
new decode LittleEndian UInt8   4.44M (225.25ns) (± 1.85%)  0.0B/op   1.01× slower
old decode LittleEndian Int16   4.49M (222.81ns) (± 1.23%)  0.0B/op        fastest
new decode LittleEndian Int16   4.45M (224.68ns) (± 0.75%)  0.0B/op   1.01× slower
old decode LittleEndian UInt16   4.46M (224.09ns) (± 1.10%)  0.0B/op   1.01× slower
new decode LittleEndian UInt16   4.49M (222.90ns) (± 1.23%)  0.0B/op        fastest
old decode LittleEndian Int32   4.50M (222.37ns) (± 1.15%)  0.0B/op        fastest
new decode LittleEndian Int32   4.48M (223.13ns) (± 0.88%)  0.0B/op   1.00× slower
old decode LittleEndian UInt32   4.44M (225.38ns) (± 0.66%)  0.0B/op   1.01× slower
new decode LittleEndian UInt32   4.47M (223.47ns) (± 1.75%)  0.0B/op        fastest
old decode LittleEndian Int64   4.44M (225.45ns) (± 2.39%)  0.0B/op        fastest
new decode LittleEndian Int64   4.43M (225.85ns) (± 1.15%)  0.0B/op   1.00× slower
old decode LittleEndian UInt64   4.46M (224.41ns) (± 1.31%)  0.0B/op   1.00× slower
new decode LittleEndian UInt64   4.48M (223.41ns) (± 1.65%)  0.0B/op        fastest
old decode LittleEndian Int128   4.49M (222.78ns) (± 1.03%)  0.0B/op        fastest
new decode LittleEndian Int128   4.46M (224.29ns) (± 1.27%)  0.0B/op   1.01× slower
old decode LittleEndian UInt128   4.48M (223.31ns) (± 0.77%)  0.0B/op   1.00× slower
new decode LittleEndian UInt128   4.49M (222.76ns) (± 0.72%)  0.0B/op        fastest
old decode BigEndian Int8   4.48M (223.07ns) (± 0.94%)  0.0B/op        fastest
new decode BigEndian Int8   4.45M (224.97ns) (± 0.67%)  0.0B/op   1.01× slower
old decode BigEndian UInt8   4.48M (223.39ns) (± 0.97%)  0.0B/op   1.01× slower
new decode BigEndian UInt8   4.53M (220.96ns) (± 0.73%)  0.0B/op        fastest
old decode BigEndian Int16   4.52M (221.31ns) (± 0.82%)  0.0B/op        fastest
new decode BigEndian Int16   4.48M (223.06ns) (± 0.93%)  0.0B/op   1.01× slower
old decode BigEndian UInt16   4.47M (223.78ns) (± 1.01%)  0.0B/op   1.01× slower
new decode BigEndian UInt16   4.51M (221.84ns) (± 0.67%)  0.0B/op        fastest
old decode BigEndian Int32   4.50M (222.36ns) (± 0.93%)  0.0B/op        fastest
new decode BigEndian Int32   4.48M (223.10ns) (± 0.83%)  0.0B/op   1.00× slower
old decode BigEndian UInt32   4.45M (224.72ns) (± 0.88%)  0.0B/op   1.01× slower
new decode BigEndian UInt32   4.50M (222.18ns) (± 1.17%)  0.0B/op        fastest
old decode BigEndian Int64   4.50M (221.98ns) (± 1.16%)  0.0B/op        fastest
new decode BigEndian Int64   4.48M (223.24ns) (± 0.84%)  0.0B/op   1.01× slower
old decode BigEndian UInt64   4.49M (222.73ns) (± 0.72%)  0.0B/op   1.00× slower
new decode BigEndian UInt64   4.50M (222.26ns) (± 0.65%)  0.0B/op        fastest
old decode BigEndian Int128 154.46k (  6.47µs) (± 1.14%)  0.0B/op   1.43× slower
new decode BigEndian Int128 220.18k (  4.54µs) (± 0.74%)  0.0B/op        fastest
old decode BigEndian UInt128 154.52k (  6.47µs) (± 1.05%)  0.0B/op   1.42× slower
new decode BigEndian UInt128 219.74k (  4.55µs) (± 1.08%)  0.0B/op        fastest

ARM64 results

old decode LittleEndian Int8   3.09M (323.65ns) (± 0.33%)  0.0B/op        fastest
new decode LittleEndian Int8   3.09M (323.79ns) (± 0.15%)  0.0B/op   1.00× slower
old decode LittleEndian UInt8   3.09M (323.69ns) (± 0.32%)  0.0B/op        fastest
new decode LittleEndian UInt8   3.09M (324.13ns) (± 0.47%)  0.0B/op   1.00× slower
old decode LittleEndian Int16   3.09M (323.54ns) (± 0.21%)  0.0B/op        fastest
new decode LittleEndian Int16   3.09M (323.73ns) (± 0.15%)  0.0B/op   1.00× slower
old decode LittleEndian UInt16   3.09M (323.68ns) (± 0.40%)  0.0B/op        fastest
new decode LittleEndian UInt16   3.09M (323.77ns) (± 0.15%)  0.0B/op   1.00× slower
old decode LittleEndian Int32   3.09M (324.00ns) (± 0.89%)  0.0B/op   1.00× slower
new decode LittleEndian Int32   3.09M (323.90ns) (± 0.26%)  0.0B/op        fastest
old decode LittleEndian UInt32   3.09M (324.02ns) (± 0.86%)  0.0B/op   1.00× slower
new decode LittleEndian UInt32   3.09M (323.86ns) (± 0.20%)  0.0B/op        fastest
old decode LittleEndian Int64   3.09M (323.46ns) (± 0.14%)  0.0B/op        fastest
new decode LittleEndian Int64   3.09M (323.91ns) (± 0.31%)  0.0B/op   1.00× slower
old decode LittleEndian UInt64   3.09M (323.59ns) (± 0.23%)  0.0B/op        fastest
new decode LittleEndian UInt64   3.09M (323.93ns) (± 0.34%)  0.0B/op   1.00× slower
old decode LittleEndian Int128   3.09M (323.47ns) (± 0.15%)  0.0B/op        fastest
new decode LittleEndian Int128   3.09M (323.92ns) (± 0.24%)  0.0B/op   1.00× slower
old decode LittleEndian UInt128   3.09M (323.49ns) (± 0.14%)  0.0B/op        fastest
new decode LittleEndian UInt128   3.09M (323.94ns) (± 0.39%)  0.0B/op   1.00× slower
old decode BigEndian Int8   3.09M (323.54ns) (± 0.20%)  0.0B/op        fastest
new decode BigEndian Int8   3.09M (323.83ns) (± 0.27%)  0.0B/op   1.00× slower
old decode BigEndian UInt8   3.09M (323.43ns) (± 0.12%)  0.0B/op        fastest
new decode BigEndian UInt8   3.09M (323.77ns) (± 0.15%)  0.0B/op   1.00× slower
old decode BigEndian Int16   3.09M (323.46ns) (± 0.13%)  0.0B/op        fastest
new decode BigEndian Int16   3.08M (324.20ns) (± 0.86%)  0.0B/op   1.00× slower
old decode BigEndian UInt16   3.09M (323.62ns) (± 0.26%)  0.0B/op        fastest
new decode BigEndian UInt16   3.09M (323.82ns) (± 0.16%)  0.0B/op   1.00× slower
old decode BigEndian Int32   3.09M (323.47ns) (± 0.16%)  0.0B/op        fastest
new decode BigEndian Int32   3.09M (323.88ns) (± 0.22%)  0.0B/op   1.00× slower
old decode BigEndian UInt32   3.09M (323.45ns) (± 0.15%)  0.0B/op        fastest
new decode BigEndian UInt32   3.09M (323.78ns) (± 0.16%)  0.0B/op   1.00× slower
old decode BigEndian Int64   3.09M (323.49ns) (± 0.18%)  0.0B/op        fastest
new decode BigEndian Int64   3.09M (323.77ns) (± 0.15%)  0.0B/op   1.00× slower
old decode BigEndian UInt64   3.09M (323.65ns) (± 0.37%)  0.0B/op        fastest
new decode BigEndian UInt64   3.09M (323.87ns) (± 0.22%)  0.0B/op   1.00× slower
old decode BigEndian Int128 159.81k (  6.26µs) (± 0.32%)  0.0B/op   1.66× slower
new decode BigEndian Int128 265.72k (  3.76µs) (± 0.14%)  0.0B/op        fastest
old decode BigEndian UInt128 159.77k (  6.26µs) (± 0.30%)  0.0B/op   1.66× slower
new decode BigEndian UInt128 265.63k (  3.76µs) (± 0.18%)  0.0B/op        fastest

Seems like LLVM might be optimizing the stack allocation out of the resulting binary. I'm surprised your results were so consistently one-sided.
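When reading the batched tables, keep in mind that each reported time now covers `iterations` calls, so the per-call cost is the reported time divided by 1,000. The helper `per_call_ns` below is hypothetical and not part of `Benchmark`; the inputs are copied from the Intel and ARM64 tables above.

```crystal
# Converts the time of one report block (in nanoseconds) to a per-call time.
def per_call_ns(report_ns : Float64, iterations : Int32) : Float64
  report_ns / iterations
end

# Intel, new decode LittleEndian Int8: 222.73ns per batch, about 0.22ns per call
puts per_call_ns(222.73, 1_000)

# ARM64, new decode LittleEndian Int8: 323.79ns per batch, about 0.32ns per call
puts per_call_ns(323.79, 1_000)

# Intel, BigEndian Int128: 6.47µs (old) vs 4.54µs (new) per batch,
# about 6.5ns vs 4.5ns per call
puts per_call_ns(6470.0, 1_000)
puts per_call_ns(4540.0, 1_000)
```

Per-call times well below a nanosecond for everything except BigEndian (U)Int128 may indicate that the compiler removed most of the work in those loops, which would fit the suspicion about LLVM above.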
