Optimize `Array#concat(Indexable)` #13280

HertzDevil · 2023-04-04T17:18:40Z

When concatenating an Indexable to an Array, we already know the argument's number of elements in advance and this number won't change (formally this needs the wording from #13061), so we never need more than one reallocation of the array's buffer. Additionally, for containers that store their elements in a contiguous buffer (or two in the case of Deque), we can leverage Intrinsics.memcpy whenever the element types are the same. Benchmarks:

Source

require "benchmark"

class Array(T)
  def concat2(other : Indexable)
    # this PR's implementation
  end
end

{% begin %}
  M = {{ env("M").to_i }}
  N = {{ env("N").to_i }}
{% end %}
puts "M=#{M}, N=#{N}"

x = [] of Int64
y_slice = Slice(Int64).empty
y_static_array = uninitialized Int64[N]
y_deque = Deque(Int64).new

puts "Slice:"
Benchmark.ips do |b|
  b.report("ctrl") do
    x = Array(Int64).new(M, &.to_i64)
    y_slice = Slice(Int64).new(N, &.to_i64)
  end

  b.report("old") do
    x = Array(Int64).new(M, &.to_i64)
    y_slice = Slice(Int64).new(N, &.to_i64)
    x.concat(y_slice)
  end

  b.report("new") do
    x = Array(Int64).new(M, &.to_i64)
    y_slice = Slice(Int64).new(N, &.to_i64)
    x.concat2(y_slice)
  end
end

puts "StaticArray:"
Benchmark.ips do |b|
  b.report("old") do
    x = Array(Int64).new(M, &.to_i64)
    x.concat(y_static_array)
  end

  b.report("new") do
    x = Array(Int64).new(M, &.to_i64)
    x.concat2(y_static_array)
  end
end

puts "Deque:"
Benchmark.ips do |b|
  b.report("ctrl") do
    x = Array(Int64).new(M, &.to_i64)
    y_deque = Deque(Int64).new(N, &.to_i64)
  end

  b.report("old") do
    x = Array(Int64).new(M, &.to_i64)
    y_deque = Deque(Int64).new(N, &.to_i64)
    x.concat(y_deque)
  end

  b.report("new") do
    x = Array(Int64).new(M, &.to_i64)
    y_deque = Deque(Int64).new(N, &.to_i64)
    x.concat2(y_deque)
  end
end

M=1000, N=100
Slice:
ctrl 948.57k (  1.05µs) (± 0.92%)  8.86kB/op        fastest
 old 642.38k (  1.56µs) (± 0.80%)  20.1kB/op   1.48× slower
 new 668.03k (  1.50µs) (± 0.74%)  20.1kB/op   1.42× slower
StaticArray:
old 709.49k (  1.41µs) (± 0.31%)  19.1kB/op   1.05× slower
new 742.15k (  1.35µs) (± 0.78%)  19.1kB/op        fastest
Deque:
ctrl 917.24k (  1.09µs) (± 0.71%)  8.89kB/op        fastest
 old 627.34k (  1.59µs) (± 0.55%)  20.2kB/op   1.46× slower
 new 654.55k (  1.53µs) (± 0.72%)  20.2kB/op   1.40× slower

M=1000, N=1000
Slice:
ctrl 491.24k (  2.04µs) (± 0.93%)  15.7kB/op        fastest
 old 214.47k (  4.66µs) (± 0.64%)  63.6kB/op   2.29× slower
 new 283.87k (  3.52µs) (± 0.90%)  36.7kB/op   1.73× slower
StaticArray:
old 278.31k (  3.59µs) (± 0.68%)  55.8kB/op   1.37× slower
new 382.17k (  2.62µs) (± 1.30%)  28.8kB/op        fastest
Deque:
ctrl 495.76k (  2.02µs) (± 1.05%)  15.7kB/op        fastest
 old 222.66k (  4.49µs) (± 0.88%)  63.6kB/op   2.23× slower
 new 303.31k (  3.30µs) (± 0.90%)  36.7kB/op   1.63× slower

M=100, N=1000
Slice:
ctrl 968.83k (  1.03µs) (± 0.78%)  8.86kB/op        fastest
 old 400.14k (  2.50µs) (± 0.71%)  27.7kB/op   2.42× slower
 new 472.52k (  2.12µs) (± 0.59%)  20.7kB/op   2.05× slower
StaticArray:
old 614.25k (  1.63µs) (± 0.59%)  19.8kB/op   1.27× slower
new 778.45k (  1.28µs) (± 0.43%)  12.9kB/op        fastest
Deque:
ctrl 833.65k (  1.20µs) (± 0.55%)  8.89kB/op        fastest
 old 373.26k (  2.68µs) (± 0.62%)  27.7kB/op   2.23× slower
 new 436.66k (  2.29µs) (± 0.40%)  20.7kB/op   1.91× slower

The above results are collected on top of #13275; without that PR, when N ≪ M, i.e. a small Indexable is being concatenated to a large Array, the Array's buffer size would grow to the next power of 2, which is often less inefficient than the new growth policy. In contrast, the existing #concat(Enumerable) inserts elements one by one, so it already uses the new growth policy.

The old Array#concat(Array) overload is absorbed into the new implementation for Indexable arguments.

HertzDevil · 2023-04-04T21:03:31Z

There is a compilation error that only affects Crystal 1.0.0? Same piece of code works on Crystal 1.1.1, and I don't recall any changes to the type inference logic in-between. Only reason I could think of is #10582

Optimize Array#concat(Indexable)

65c787b

HertzDevil added performance topic:stdlib:collection labels Apr 4, 2023

straight-shoota approved these changes Apr 4, 2023

View reviewed changes

straight-shoota added this to the 1.9.0 milestone Apr 4, 2023

HertzDevil added a commit to HertzDevil/crystal that referenced this pull request Apr 4, 2023

Deque.half_slices from crystal-lang#13280

c56b7d2

HertzDevil mentioned this pull request Apr 4, 2023

Optimize Deque#concat(Indexable) #13283

Merged

add workaround for 1.0.0

e67b0dd

straight-shoota merged commit 70bf29e into crystal-lang:master Apr 18, 2023

HertzDevil deleted the perf/array-concat-indexable branch April 18, 2023 12:21

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Optimize `Array#concat(Indexable)` #13280

Optimize `Array#concat(Indexable)` #13280

HertzDevil commented Apr 4, 2023

HertzDevil commented Apr 4, 2023 •

edited

Loading

Optimize Array#concat(Indexable) #13280

Optimize Array#concat(Indexable) #13280

Conversation

HertzDevil commented Apr 4, 2023

HertzDevil commented Apr 4, 2023 • edited Loading

Optimize `Array#concat(Indexable)` #13280

Optimize `Array#concat(Indexable)` #13280

HertzDevil commented Apr 4, 2023 •

edited

Loading