Optimize Integer decoding from bytes #11796

carlhoerberg · 2022-02-03T22:14:49Z

By using Slice#copy_to(target : Pointer(T), count) and not (stack) allocating new slices.
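For readers who prefer to see the technique without the macro layer, here is a hand-expanded sketch of what the approach looks like for a single type. The helper name `decode_u32_be` is hypothetical and exists only for illustration; it is not part of the standard library or of this PR.

```crystal
# Hypothetical standalone helper illustrating the approach for one type.
# It mirrors the shape of the macro-generated method for big-endian UInt32.
def decode_u32_be(bytes : Bytes) : UInt32
  # Fixed-size stack buffer; its size is known at compile time.
  buffer = uninitialized UInt8[4]
  # Slice#copy_to(Pointer, count) checks that `bytes` holds at least 4 elements.
  bytes.copy_to(buffer.to_unsafe, 4)
  # Swap byte order only when the host byte order differs from big-endian.
  buffer.reverse! unless IO::ByteFormat::SystemEndian == IO::ByteFormat::BigEndian
  # Reinterpret the 4 bytes as a UInt32.
  buffer.unsafe_as(UInt32)
end

p decode_u32_be(Bytes[0x00, 0x00, 0x01, 0x00]) # => 256
```

Only the first four bytes of the slice are read, so a longer slice is accepted as well.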

Benchmark

require "benchmark"

module IO::ByteFormat
  {% for mod in %w(LittleEndian BigEndian) %}
    module {{mod.id}}
      {% for type, i in %w(Int8 UInt8 Int16 UInt16 Int32 UInt32 Int64 UInt64 Int128 UInt128) %}
        {% bytesize = 2 ** (i // 2) %}
        def self.decode2(type : {{type.id}}.class, bytes : Bytes)
          buffer = uninitialized UInt8[{{bytesize}}]
          bytes.copy_to(buffer.to_unsafe, {{bytesize}})
          buffer.reverse! unless SystemEndian == self
          buffer.unsafe_as({{type.id}})
        end
      {% end %}
    end
  {% end %}
end

b = Bytes.new(128) { 1u8 }

{% for mod in %w(LittleEndian BigEndian) %}
  {% for type, i in %w(Int8 UInt8 Int16 UInt16 Int32 UInt32 Int64 UInt64 Int128 UInt128) %}
    Benchmark.ips do |x|
      x.report("old decode {{mod.id}} {{ type.id }}") do
        IO::ByteFormat::{{mod.id}}.decode({{type.id}}, b)
      end
      x.report("new decode {{mod.id}} {{ type.id }}") do
        IO::ByteFormat::{{mod.id}}.decode2({{type.id}}, b)
      end
    end
  {% end %}
{% end %}

Results

old decode LittleEndian Int8 481.42M (  2.08ns) (± 3.63%)  0.0B/op   1.80× slower
new decode LittleEndian Int8 865.40M (  1.16ns) (± 1.36%)  0.0B/op        fastest
old decode LittleEndian UInt8 532.87M (  1.88ns) (± 3.58%)  0.0B/op   1.65× slower
new decode LittleEndian UInt8 877.89M (  1.14ns) (± 3.18%)  0.0B/op        fastest
old decode LittleEndian Int16 492.96M (  2.03ns) (± 3.21%)  0.0B/op   1.79× slower
new decode LittleEndian Int16 881.37M (  1.13ns) (± 3.86%)  0.0B/op        fastest
old decode LittleEndian UInt16 468.77M (  2.13ns) (± 4.01%)  0.0B/op   1.84× slower
new decode LittleEndian UInt16 864.24M (  1.16ns) (± 1.69%)  0.0B/op        fastest
old decode LittleEndian Int32 491.97M (  2.03ns) (± 5.36%)  0.0B/op   1.85× slower
new decode LittleEndian Int32 911.73M (  1.10ns) (± 4.76%)  0.0B/op        fastest
old decode LittleEndian UInt32 492.52M (  2.03ns) (± 4.21%)  0.0B/op   1.75× slower
new decode LittleEndian UInt32 863.92M (  1.16ns) (± 2.46%)  0.0B/op        fastest
old decode LittleEndian Int64 481.59M (  2.08ns) (± 2.89%)  0.0B/op   1.92× slower
new decode LittleEndian Int64 925.13M (  1.08ns) (± 4.13%)  0.0B/op        fastest
old decode LittleEndian UInt64 474.08M (  2.11ns) (± 3.60%)  0.0B/op   1.83× slower
new decode LittleEndian UInt64 865.81M (  1.15ns) (± 1.47%)  0.0B/op        fastest
old decode LittleEndian Int128 479.22M (  2.09ns) (± 2.60%)  0.0B/op   1.80× slower
new decode LittleEndian Int128 861.06M (  1.16ns) (± 2.56%)  0.0B/op        fastest
old decode LittleEndian UInt128 478.12M (  2.09ns) (± 3.83%)  0.0B/op   1.85× slower
new decode LittleEndian UInt128 885.39M (  1.13ns) (± 3.43%)  0.0B/op        fastest
old decode BigEndian Int8 508.86M (  1.97ns) (± 2.52%)  0.0B/op   1.73× slower
new decode BigEndian Int8 882.45M (  1.13ns) (± 3.30%)  0.0B/op        fastest
old decode BigEndian UInt8 399.35M (  2.50ns) (±14.63%)  0.0B/op   2.18× slower
new decode BigEndian UInt8 871.88M (  1.15ns) (± 2.95%)  0.0B/op        fastest
old decode BigEndian Int16 520.54M (  1.92ns) (± 4.86%)  0.0B/op   1.76× slower
new decode BigEndian Int16 916.68M (  1.09ns) (± 5.12%)  0.0B/op        fastest
old decode BigEndian UInt16 493.51M (  2.03ns) (± 3.94%)  0.0B/op   1.75× slower
new decode BigEndian UInt16 863.52M (  1.16ns) (± 4.39%)  0.0B/op        fastest
old decode BigEndian Int32 144.54M (  6.92ns) (± 2.27%)  0.0B/op   1.20× slower
new decode BigEndian Int32 172.98M (  5.78ns) (± 2.75%)  0.0B/op        fastest
old decode BigEndian UInt32 144.65M (  6.91ns) (± 2.09%)  0.0B/op   1.19× slower
new decode BigEndian UInt32 172.56M (  5.80ns) (± 2.49%)  0.0B/op        fastest
old decode BigEndian Int64 172.93M (  5.78ns) (± 2.60%)  0.0B/op        fastest
new decode BigEndian Int64 150.70M (  6.64ns) (± 4.02%)  0.0B/op   1.15× slower
old decode BigEndian UInt64 175.44M (  5.70ns) (± 3.70%)  0.0B/op        fastest
new decode BigEndian UInt64 150.72M (  6.63ns) (± 3.56%)  0.0B/op   1.16× slower
old decode BigEndian Int128 123.75M (  8.08ns) (± 2.76%)  0.0B/op   1.32× slower
new decode BigEndian Int128 163.88M (  6.10ns) (± 5.83%)  0.0B/op        fastest
old decode BigEndian UInt128 132.80M (  7.53ns) (± 2.45%)  0.0B/op   1.21× slower
new decode BigEndian UInt128 161.01M (  6.21ns) (± 1.47%)  0.0B/op        fastest

Discussion

Decoding is about 80% faster, with the notable exception of BigEndian (U)Int64, which becomes about 15% slower.

By using Pointer#copy_to and not allocating new slices.

asterite · 2022-02-03T22:32:36Z

There's actually no memory allocation involved here. This works slightly faster in this PR because some bound checks are removed, which makes this easily prone to segfault. I'm not sure we should do that.

I'm not on the computer, but if I have time I will send code that makes this change segfault.

straight-shoota · 2022-02-03T22:47:09Z

Yes, the change to .encode is definitely incorrect. It removes bounds checking on bytes and doesn't validate that it's not read_only.

I believe the change to .decode should be safe, though. It still uses Slice#copy_to which includes bounds checking. And we can be certain of the buffer size, so the pointer parameter should not be an issue.
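The bounds check mentioned here can be observed directly. The following is a minimal sketch, assuming a source slice that is deliberately shorter than the destination buffer; the variable names are illustrative only.

```crystal
# A 2-byte source slice and a 4-byte stack buffer as the destination.
short = Bytes.new(2) { 1u8 }
buffer = uninitialized UInt8[4]

# Slice#copy_to(Pointer, count) validates `count` against the slice size,
# so asking for 4 bytes from a 2-byte slice raises instead of reading
# past the end of the slice.
raised = begin
  short.copy_to(buffer.to_unsafe, 4)
  false
rescue IndexError
  true
end

puts "Slice#copy_to raised IndexError: #{raised}"
```

The destination side is not checked, since a raw pointer carries no size; the call is only safe because the buffer is declared with exactly the number of bytes being copied.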

carlhoerberg · 2022-02-03T22:54:31Z

Removed .encode, it was an afterthought and didn't improve performance anyway.

beta-ziliani

🚀 Thanks!

The discussion within this PR shows that we're missing failing specs for encode/decode
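A sketch of what such specs might look like is given below. These are illustrative examples written for this discussion, not the specs that exist in the repository; they assume that both `.decode` and `.encode` raise `IndexError` when the given slice is shorter than the integer's byte size.

```crystal
require "spec"

describe IO::ByteFormat do
  it "raises IndexError when decoding from a slice that is too short" do
    expect_raises(IndexError) do
      IO::ByteFormat::LittleEndian.decode(Int32, Bytes.new(2))
    end
  end

  it "raises IndexError when encoding into a slice that is too short" do
    expect_raises(IndexError) do
      IO::ByteFormat::LittleEndian.encode(1_i32, Bytes.new(2))
    end
  end

  it "decodes using only the leading bytes of a longer slice" do
    bytes = Bytes[0x01, 0x02, 0x03, 0x04, 0xff]
    IO::ByteFormat::BigEndian.decode(UInt32, bytes).should eq(0x01020304_u32)
  end
end
```

A spec for the read-only case raised earlier in the thread would be a natural addition, but the exact exception it should expect is not stated in this conversation, so it is left out here.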

jgaskins · 2022-03-22T23:15:46Z

@carlhoerberg Your benchmark results seem to have run into the issue where each iteration is so fast that the results are inaccurate. Benchmark.ips can't be trusted when the results are in the 1-3ns range.

I updated the benchmark code to run more iterations inside the report block and the only notable change in performance was on Int128. Everything else was within ±1% and even closer on ARM.

Updated code

require "benchmark"

module IO::ByteFormat
  {% for mod in %w(LittleEndian BigEndian) %}
    module {{mod.id}}
      {% for type, i in %w(Int8 UInt8 Int16 UInt16 Int32 UInt32 Int64 UInt64 Int128 UInt128) %}
        {% bytesize = 2 ** (i // 2) %}
        def self.decode2(type : {{type.id}}.class, bytes : Bytes)
          buffer = uninitialized UInt8[{{bytesize}}]
          bytes.copy_to(buffer.to_unsafe, {{bytesize}})
          buffer.reverse! unless SystemEndian == self
          buffer.unsafe_as({{type.id}})
        end
      {% end %}
    end
  {% end %}
end

b = Bytes.new(128) { 1u8 }
iterations = 1_000

{% for mod in %w(LittleEndian BigEndian) %}
  {% for type, i in %w(Int8 UInt8 Int16 UInt16 Int32 UInt32 Int64 UInt64 Int128 UInt128) %}
    Benchmark.ips do |x|
      x.report("old decode {{mod.id}} {{ type.id }}") do
        iterations.times { IO::ByteFormat::{{mod.id}}.decode({{type.id}}, b) }
      end
      x.report("new decode {{mod.id}} {{ type.id }}") do
        iterations.times { IO::ByteFormat::{{mod.id}}.decode2({{type.id}}, b) }
      end
    end
  {% end %}
{% end %}

Diff from the original benchmark code

19a20
> iterations = 1_000
25c26
<         IO::ByteFormat::{{mod.id}}.decode({{type.id}}, b)
---
>         iterations.times { IO::ByteFormat::{{mod.id}}.decode({{type.id}}, b) }
28c29
<         IO::ByteFormat::{{mod.id}}.decode2({{type.id}}, b)
---
>         iterations.times { IO::ByteFormat::{{mod.id}}.decode2({{type.id}}, b) }

Intel results

old decode LittleEndian Int8   4.44M (224.98ns) (± 1.25%)  0.0B/op   1.01× slower
new decode LittleEndian Int8   4.49M (222.73ns) (± 0.78%)  0.0B/op        fastest
old decode LittleEndian UInt8   4.48M (223.25ns) (± 0.94%)  0.0B/op        fastest
new decode LittleEndian UInt8   4.44M (225.25ns) (± 1.85%)  0.0B/op   1.01× slower
old decode LittleEndian Int16   4.49M (222.81ns) (± 1.23%)  0.0B/op        fastest
new decode LittleEndian Int16   4.45M (224.68ns) (± 0.75%)  0.0B/op   1.01× slower
old decode LittleEndian UInt16   4.46M (224.09ns) (± 1.10%)  0.0B/op   1.01× slower
new decode LittleEndian UInt16   4.49M (222.90ns) (± 1.23%)  0.0B/op        fastest
old decode LittleEndian Int32   4.50M (222.37ns) (± 1.15%)  0.0B/op        fastest
new decode LittleEndian Int32   4.48M (223.13ns) (± 0.88%)  0.0B/op   1.00× slower
old decode LittleEndian UInt32   4.44M (225.38ns) (± 0.66%)  0.0B/op   1.01× slower
new decode LittleEndian UInt32   4.47M (223.47ns) (± 1.75%)  0.0B/op        fastest
old decode LittleEndian Int64   4.44M (225.45ns) (± 2.39%)  0.0B/op        fastest
new decode LittleEndian Int64   4.43M (225.85ns) (± 1.15%)  0.0B/op   1.00× slower
old decode LittleEndian UInt64   4.46M (224.41ns) (± 1.31%)  0.0B/op   1.00× slower
new decode LittleEndian UInt64   4.48M (223.41ns) (± 1.65%)  0.0B/op        fastest
old decode LittleEndian Int128   4.49M (222.78ns) (± 1.03%)  0.0B/op        fastest
new decode LittleEndian Int128   4.46M (224.29ns) (± 1.27%)  0.0B/op   1.01× slower
old decode LittleEndian UInt128   4.48M (223.31ns) (± 0.77%)  0.0B/op   1.00× slower
new decode LittleEndian UInt128   4.49M (222.76ns) (± 0.72%)  0.0B/op        fastest
old decode BigEndian Int8   4.48M (223.07ns) (± 0.94%)  0.0B/op        fastest
new decode BigEndian Int8   4.45M (224.97ns) (± 0.67%)  0.0B/op   1.01× slower
old decode BigEndian UInt8   4.48M (223.39ns) (± 0.97%)  0.0B/op   1.01× slower
new decode BigEndian UInt8   4.53M (220.96ns) (± 0.73%)  0.0B/op        fastest
old decode BigEndian Int16   4.52M (221.31ns) (± 0.82%)  0.0B/op        fastest
new decode BigEndian Int16   4.48M (223.06ns) (± 0.93%)  0.0B/op   1.01× slower
old decode BigEndian UInt16   4.47M (223.78ns) (± 1.01%)  0.0B/op   1.01× slower
new decode BigEndian UInt16   4.51M (221.84ns) (± 0.67%)  0.0B/op        fastest
old decode BigEndian Int32   4.50M (222.36ns) (± 0.93%)  0.0B/op        fastest
new decode BigEndian Int32   4.48M (223.10ns) (± 0.83%)  0.0B/op   1.00× slower
old decode BigEndian UInt32   4.45M (224.72ns) (± 0.88%)  0.0B/op   1.01× slower
new decode BigEndian UInt32   4.50M (222.18ns) (± 1.17%)  0.0B/op        fastest
old decode BigEndian Int64   4.50M (221.98ns) (± 1.16%)  0.0B/op        fastest
new decode BigEndian Int64   4.48M (223.24ns) (± 0.84%)  0.0B/op   1.01× slower
old decode BigEndian UInt64   4.49M (222.73ns) (± 0.72%)  0.0B/op   1.00× slower
new decode BigEndian UInt64   4.50M (222.26ns) (± 0.65%)  0.0B/op        fastest
old decode BigEndian Int128 154.46k (  6.47µs) (± 1.14%)  0.0B/op   1.43× slower
new decode BigEndian Int128 220.18k (  4.54µs) (± 0.74%)  0.0B/op        fastest
old decode BigEndian UInt128 154.52k (  6.47µs) (± 1.05%)  0.0B/op   1.42× slower
new decode BigEndian UInt128 219.74k (  4.55µs) (± 1.08%)  0.0B/op        fastest

ARM64 results

old decode LittleEndian Int8   3.09M (323.65ns) (± 0.33%)  0.0B/op        fastest
new decode LittleEndian Int8   3.09M (323.79ns) (± 0.15%)  0.0B/op   1.00× slower
old decode LittleEndian UInt8   3.09M (323.69ns) (± 0.32%)  0.0B/op        fastest
new decode LittleEndian UInt8   3.09M (324.13ns) (± 0.47%)  0.0B/op   1.00× slower
old decode LittleEndian Int16   3.09M (323.54ns) (± 0.21%)  0.0B/op        fastest
new decode LittleEndian Int16   3.09M (323.73ns) (± 0.15%)  0.0B/op   1.00× slower
old decode LittleEndian UInt16   3.09M (323.68ns) (± 0.40%)  0.0B/op        fastest
new decode LittleEndian UInt16   3.09M (323.77ns) (± 0.15%)  0.0B/op   1.00× slower
old decode LittleEndian Int32   3.09M (324.00ns) (± 0.89%)  0.0B/op   1.00× slower
new decode LittleEndian Int32   3.09M (323.90ns) (± 0.26%)  0.0B/op        fastest
old decode LittleEndian UInt32   3.09M (324.02ns) (± 0.86%)  0.0B/op   1.00× slower
new decode LittleEndian UInt32   3.09M (323.86ns) (± 0.20%)  0.0B/op        fastest
old decode LittleEndian Int64   3.09M (323.46ns) (± 0.14%)  0.0B/op        fastest
new decode LittleEndian Int64   3.09M (323.91ns) (± 0.31%)  0.0B/op   1.00× slower
old decode LittleEndian UInt64   3.09M (323.59ns) (± 0.23%)  0.0B/op        fastest
new decode LittleEndian UInt64   3.09M (323.93ns) (± 0.34%)  0.0B/op   1.00× slower
old decode LittleEndian Int128   3.09M (323.47ns) (± 0.15%)  0.0B/op        fastest
new decode LittleEndian Int128   3.09M (323.92ns) (± 0.24%)  0.0B/op   1.00× slower
old decode LittleEndian UInt128   3.09M (323.49ns) (± 0.14%)  0.0B/op        fastest
new decode LittleEndian UInt128   3.09M (323.94ns) (± 0.39%)  0.0B/op   1.00× slower
old decode BigEndian Int8   3.09M (323.54ns) (± 0.20%)  0.0B/op        fastest
new decode BigEndian Int8   3.09M (323.83ns) (± 0.27%)  0.0B/op   1.00× slower
old decode BigEndian UInt8   3.09M (323.43ns) (± 0.12%)  0.0B/op        fastest
new decode BigEndian UInt8   3.09M (323.77ns) (± 0.15%)  0.0B/op   1.00× slower
old decode BigEndian Int16   3.09M (323.46ns) (± 0.13%)  0.0B/op        fastest
new decode BigEndian Int16   3.08M (324.20ns) (± 0.86%)  0.0B/op   1.00× slower
old decode BigEndian UInt16   3.09M (323.62ns) (± 0.26%)  0.0B/op        fastest
new decode BigEndian UInt16   3.09M (323.82ns) (± 0.16%)  0.0B/op   1.00× slower
old decode BigEndian Int32   3.09M (323.47ns) (± 0.16%)  0.0B/op        fastest
new decode BigEndian Int32   3.09M (323.88ns) (± 0.22%)  0.0B/op   1.00× slower
old decode BigEndian UInt32   3.09M (323.45ns) (± 0.15%)  0.0B/op        fastest
new decode BigEndian UInt32   3.09M (323.78ns) (± 0.16%)  0.0B/op   1.00× slower
old decode BigEndian Int64   3.09M (323.49ns) (± 0.18%)  0.0B/op        fastest
new decode BigEndian Int64   3.09M (323.77ns) (± 0.15%)  0.0B/op   1.00× slower
old decode BigEndian UInt64   3.09M (323.65ns) (± 0.37%)  0.0B/op        fastest
new decode BigEndian UInt64   3.09M (323.87ns) (± 0.22%)  0.0B/op   1.00× slower
old decode BigEndian Int128 159.81k (  6.26µs) (± 0.32%)  0.0B/op   1.66× slower
new decode BigEndian Int128 265.72k (  3.76µs) (± 0.14%)  0.0B/op        fastest
old decode BigEndian UInt128 159.77k (  6.26µs) (± 0.30%)  0.0B/op   1.66× slower
new decode BigEndian UInt128 265.63k (  3.76µs) (± 0.18%)  0.0B/op        fastest

Seems like LLVM might be optimizing the stack allocation out of the resulting binary. I'm surprised your results were so consistently one-sided.
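One way to probe that hypothesis is to make every decoded value observable, so the optimizer cannot discard the call. The sketch below is an assumption about how this could be done, not something that was run in this thread: the `sink` accumulator and the shortened warmup and calculation times are illustrative choices, and no results are claimed for it.

```crystal
require "benchmark"

bytes = Bytes.new(128) { 1u8 }
iterations = 1_000
# Hypothetical accumulator: adding each result to it keeps the decoded
# value live, so the decode call cannot be removed as dead code.
sink = 0_u64

Benchmark.ips(calculation: 2.seconds, warmup: 1.seconds) do |x|
  x.report("decode BigEndian UInt64 (accumulating)") do
    iterations.times do
      # &+= is wrapping addition, so the accumulator never raises on overflow.
      sink &+= IO::ByteFormat::BigEndian.decode(UInt64, bytes)
    end
  end
end

# Printing the accumulator makes the final value an observable output.
puts sink
```

If the timings change noticeably once the result is consumed, that would suggest the earlier numbers were partly measuring an optimized-away loop body rather than the decode itself.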

Optimize Integer encoding/decoding to/from bytes

270f7c2

By using Pointer#copy_to and not allocating new slices.

fixup! Optimize Integer encoding/decoding to/from bytes

e3f776e

Blacksmoke16 added performance topic:stdlib:numeric labels Feb 3, 2022

straight-shoota approved these changes Feb 4, 2022

carlhoerberg changed the title ~~Optimize Integer encoding/decoding to/from bytes~~ Optimize Integer decoding from bytes Feb 4, 2022

beta-ziliani approved these changes Mar 18, 2022

beta-ziliani added this to the 1.4.0 milestone Mar 18, 2022

Merge branch 'master' into optimized-int-encoding-decoding

99fd136

straight-shoota merged commit 9a5e6fa into crystal-lang:master Mar 22, 2022
