Channel perf #2751

nikiforo · 2021-12-10T15:26:50Z

benchmark/ jmh:run -i 10 -wi 6 -f1 -t4 -gc true -jvmArgs "-XX:MaxRecursiveInlineLevel=3 -XX:MaxInlineSize=50 -Dcats.effect.tracing.mode=none" fs2.benchmark.ChannelBenchmark

channel-perf

[info] Benchmark                              (size)   Mode  Cnt      Score     Error  Units
[info] ChannelBenchmark.sendPull                  64  thrpt   10  70315,735 ± 990,801  ops/s
[info] ChannelBenchmark.sendPull                1024  thrpt   10   8489,787 ± 220,705  ops/s
[info] ChannelBenchmark.sendPull               16384  thrpt   10    586,689 ±  14,841  ops/s
[info] ChannelBenchmark.sendPullPar8              64  thrpt   10  48603,532 ± 755,156  ops/s
[info] ChannelBenchmark.sendPullPar8            1024  thrpt   10  10497,909 ± 444,052  ops/s
[info] ChannelBenchmark.sendPullPar8           16384  thrpt   10    650,486 ±  36,143  ops/s
[info] ChannelBenchmark.sendPullParUnlimited      64  thrpt   10  46336,865 ± 640,137  ops/s
[info] ChannelBenchmark.sendPullParUnlimited    1024  thrpt   10   3421,420 ± 268,378  ops/s
[info] ChannelBenchmark.sendPullParUnlimited   16384  thrpt   10    162,638 ±   9,527  ops/s

main

[info] Benchmark                              (size)   Mode  Cnt      Score      Error  Units
[info] ChannelBenchmark.sendPull                  64  thrpt   10  65333,103 ± 4694,045  ops/s
[info] ChannelBenchmark.sendPull                1024  thrpt   10   7288,803 ±  309,856  ops/s
[info] ChannelBenchmark.sendPull               16384  thrpt   10    487,553 ±   28,040  ops/s
[info] ChannelBenchmark.sendPullPar8              64  thrpt   10  45527,449 ± 2153,177  ops/s
[info] ChannelBenchmark.sendPullPar8            1024  thrpt   10   8268,452 ±  320,778  ops/s
[info] ChannelBenchmark.sendPullPar8           16384  thrpt   10    490,139 ±   25,620  ops/s
[info] ChannelBenchmark.sendPullParUnlimited      64  thrpt   10  36154,018 ± 2300,873  ops/s
[info] ChannelBenchmark.sendPullParUnlimited    1024  thrpt   10   2891,059 ±  219,453  ops/s
[info] ChannelBenchmark.sendPullParUnlimited   16384  thrpt   10    143,254 ±   10,190  ops/s

The new version differs from the one in main in two respects. First, it replaces Vector with List in the State. I don’t think that this change is responsible for the observed performance change. ~~Yet, currently fs2.Chunk doesn’t have a vector-specialized implementation, therefore it’ll anyway be converted to Array.~~ A more specialized implementation reduces the amount of computation required.
The second change shrinks the critical section in the CAS loop. Previously, the loop not only computed the next state but also transformed the old state into the output. The updated version splits this into two distinct pieces of work: the new state is computed inside the CAS loop, while the conversion of the previous state into the emitted chunk happens after the CAS. This reduces contention and eliminates duplicated preparation of the emitted chunk.
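
A minimal sketch of the second change, assuming a plain `AtomicReference` in place of the effectful state cell used in Channel. `SimpleState`, `drainInsideLoop`, `swapOut` and `drainAfterLoop` are hypothetical names for illustration and are not taken from the PR:

```scala
import java.util.concurrent.atomic.AtomicReference
import scala.annotation.tailrec

// Hypothetical, simplified state: values are stored newest-first.
final case class SimpleState(values: List[Int], size: Int)

object CasSketch {
  val empty: SimpleState = SimpleState(Nil, 0)

  // Before: the output is built inside the retry loop, so a failed CAS
  // discards the converted Vector and builds it again on the next attempt.
  @tailrec
  def drainInsideLoop(ref: AtomicReference[SimpleState]): Vector[Int] = {
    val prev = ref.get()
    val out = prev.values.reverse.toVector // potentially repeated work
    if (ref.compareAndSet(prev, empty)) out else drainInsideLoop(ref)
  }

  // After: the loop only swaps in the next state and hands back the previous one.
  @tailrec
  def swapOut(ref: AtomicReference[SimpleState]): SimpleState = {
    val prev = ref.get()
    if (ref.compareAndSet(prev, empty)) prev else swapOut(ref)
  }

  // The conversion runs once, after the CAS has succeeded.
  def drainAfterLoop(ref: AtomicReference[SimpleState]): Vector[Int] =
    swapOut(ref).values.reverse.toVector
}
```

Both variants return the same elements; the difference is only how much work sits between the read and the `compareAndSet`, which is where concurrent writers can force a retry.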

                                  method  (size)   channel-perf          main ratio
               ChannelBenchmark.sendPull      64      70315,735     65333,103 1,076
               ChannelBenchmark.sendPull    1024       8489,787      7288,803 1,165
               ChannelBenchmark.sendPull   16384        586,689       487,553 1,203
           ChannelBenchmark.sendPullPar8      64      48603,532     45527,449 1,068
           ChannelBenchmark.sendPullPar8    1024      10497,909      8268,452 1,270
           ChannelBenchmark.sendPullPar8   16384        650,486       490,139 1,327
   ChannelBenchmark.sendPullParUnlimited      64      46336,865     36154,018 1,282
   ChannelBenchmark.sendPullParUnlimited    1024       3421,420      2891,059 1,183
   ChannelBenchmark.sendPullParUnlimited   16384        162,638       143,254 1,135

mpilquist · 2021-12-10T15:45:27Z

> Yet, currently fs2.Chunk doesn’t have a vector-specialized implementation, therefore it’ll anyway be converted to Array.

It should be wrapped in an instance of Chunk.IndexedSeqChunk and do no copying. The List version will result in a copy to an array.

nikiforo · 2021-12-10T19:01:38Z

> It should be wrapped in an instance of Chunk.IndexedSeqChunk and do no copying. The List version will result in a copy to an array.

I think I phrased that poorly. I believe that combining List (an immutable data structure used in the CAS) with Array (for the emitted Chunk, filled in reverse order) might be the most effective approach here. However, I will switch back to Vector and measure the performance.
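
To illustrate the List + Array idea, here is a rough sketch, assuming the values are prepended to the List (newest first) and the element count is tracked separately, as the `size` field of State suggests. `ReverseFill.toArrayReversed` is a hypothetical helper and may differ from what the PR actually does:

```scala
import scala.reflect.ClassTag

object ReverseFill {
  // Copies a newest-first List into an Array in insertion order by writing
  // from the last index down to 0, so no intermediate reversed List is built.
  // Precondition (not checked, to avoid a second traversal): size == values.length.
  def toArrayReversed[A: ClassTag](values: List[A], size: Int): Array[A] = {
    val arr = new Array[A](size)
    var i = size - 1
    var rest = values
    while (i >= 0) {
      arr(i) = rest.head
      rest = rest.tail
      i -= 1
    }
    arr
  }
}
```

Prepending to a List is a constant-time allocation of one cons cell, which keeps the work inside the CAS small; the linear copy into the Array is paid once when the chunk is emitted.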

Benchmarks show that we might gain 20%+ in this class. The question is: should we consider a 20% increase in throughput in this class an optimization worth exploring?

mpilquist · 2021-12-10T19:03:05Z

Yeah, that's very possible. List beats Vector in so many cases I don't expect. No objection overall as long as benchmark results point us in the right direction and we're testing both small and large collections.

diesalbla · 2021-12-10T19:50:44Z

core/shared/src/main/scala/fs2/concurrent/Channel.scala

        closed: Boolean
    )

-    val initial = State(Vector.empty, 0, None, Vector.empty, false)
+    val initial = State(List.empty, 0, None, List.empty, false)


Suggested change

-    val initial = State(List.empty, 0, None, List.empty, false)
+    def empty(isClosed: Boolean) = State(List.empty, 0, None, List.empty, isClosed)
+    val initial = empty(isClosed = false)

diesalbla · 2021-12-10T19:56:31Z

core/shared/src/main/scala/fs2/concurrent/Channel.scala

+                  case prev @ State(values, size, ignorePreviousWaiting @ _, producers, closed) =>
+                    if (shouldEmit(prev)) (State(List.empty, 0, None, List.empty, closed), prev)
+                    else (State(values, size, waiting.some, producers, closed), prev)


Suggested change

-                  case prev @ State(values, size, ignorePreviousWaiting @ _, producers, closed) =>
-                    if (shouldEmit(prev)) (State(List.empty, 0, None, List.empty, closed), prev)
-                    else (State(values, size, waiting.some, producers, closed), prev)
+                  case prev if shouldEmit(prev) =>
+                    empty(prev.closed) -> prev
+                  case prev =>
+                    prev.copy(waiting = Some(waiting)) -> prev
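
For reference, a self-contained sketch of the suggested shape. The field names follow the diff above, but the element and waiter types are placeholders and `shouldEmit` is a made-up condition, so this only illustrates the guard-plus-`copy` pattern rather than the real Channel logic:

```scala
// Simplified stand-in for the Channel state (placeholder field types).
final case class State(
    values: List[Int],
    size: Int,
    waiting: Option[String],
    producers: List[String],
    closed: Boolean
)

object StateSketch {
  // An empty state that preserves the closed flag passed in.
  def empty(isClosed: Boolean): State =
    State(List.empty, 0, None, List.empty, isClosed)

  // Hypothetical emit condition, used only to exercise both branches.
  def shouldEmit(s: State): Boolean = s.values.nonEmpty || s.closed

  // Returns (next state, previous state), like the function given to the CAS loop.
  def step(prev: State, waiting: String): (State, State) =
    prev match {
      case p if shouldEmit(p) => empty(p.closed) -> p
      case p                  => p.copy(waiting = Some(waiting)) -> p
    }
}
```

Using `copy` means the second branch names only the field that changes, so adding a field to State later does not require touching this match.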

nikiforo · 2021-12-14T21:33:17Z

I've changed ChannelBenchmark a bit: I changed the bound and added some dummy load. I've also modified @diesalbla's command by adding -Xmx and -Xms, in the hope of running GC on every iteration.

benchmark/ jmh:run -i 20 -wi 10 -f1 -t4 -gc true -jvmArgs "-XX:MaxRecursiveInlineLevel=3 -XX:MaxInlineSize=50 -Dcats.effect.tracing.mode=none -Xmx512m -Xms512m" fs2.benchmark.ChannelBenchmark

Using updated benchmarks I've tested four versions:

main
channel-perf (version from this PR)
vector-CAS (all occurrences of List in this PR were changed to Vector)
List only (all occurrences of Vector in main were changed to List)

main

[info] Benchmark                              (size)   Mode  Cnt      Score     Error  Units
[info] ChannelBenchmark.sendPull                  64  thrpt   20  48074,370 ± 734,320  ops/s
[info] ChannelBenchmark.sendPull                1024  thrpt   20   8525,740 ±  87,360  ops/s
[info] ChannelBenchmark.sendPull               16384  thrpt   20    597,088 ±   2,922  ops/s
[info] ChannelBenchmark.sendPullPar8              64  thrpt   20  37157,267 ± 424,852  ops/s
[info] ChannelBenchmark.sendPullPar8            1024  thrpt   20   7778,680 ±  36,153  ops/s
[info] ChannelBenchmark.sendPullPar8           16384  thrpt   20    516,252 ±   5,478  ops/s
[info] ChannelBenchmark.sendPullParUnlimited      64  thrpt   20  31382,873 ± 394,371  ops/s
[info] ChannelBenchmark.sendPullParUnlimited    1024  thrpt   20   2434,107 ±  14,289  ops/s
[info] ChannelBenchmark.sendPullParUnlimited   16384  thrpt   20    101,704 ±   2,496  ops/s

Channel-perf

[info] Benchmark                              (size)   Mode  Cnt      Score     Error  Units
[info] ChannelBenchmark.sendPull                  64  thrpt   20  47493,665 ± 403,438  ops/s
[info] ChannelBenchmark.sendPull                1024  thrpt   20   9822,312 ±  28,054  ops/s
[info] ChannelBenchmark.sendPull               16384  thrpt   20    718,336 ±   4,700  ops/s
[info] ChannelBenchmark.sendPullPar8              64  thrpt   20  37502,432 ± 270,349  ops/s
[info] ChannelBenchmark.sendPullPar8            1024  thrpt   20   8997,285 ±  23,286  ops/s
[info] ChannelBenchmark.sendPullPar8           16384  thrpt   20    665,925 ±   2,915  ops/s
[info] ChannelBenchmark.sendPullParUnlimited      64  thrpt   20  36746,409 ± 348,711  ops/s
[info] ChannelBenchmark.sendPullParUnlimited    1024  thrpt   20   2829,545 ±   7,964  ops/s
[info] ChannelBenchmark.sendPullParUnlimited   16384  thrpt   20    133,678 ±   1,450  ops/s

Vector-CAS

[info] Benchmark                              (size)   Mode  Cnt      Score     Error  Units
[info] ChannelBenchmark.sendPull                  64  thrpt   20  46099,934 ± 398,498  ops/s
[info] ChannelBenchmark.sendPull                1024  thrpt   20   8318,947 ±  31,983  ops/s
[info] ChannelBenchmark.sendPull               16384  thrpt   20    574,230 ±   1,996  ops/s
[info] ChannelBenchmark.sendPullPar8              64  thrpt   20  35967,734 ± 306,821  ops/s
[info] ChannelBenchmark.sendPullPar8            1024  thrpt   20   7829,428 ±  16,850  ops/s
[info] ChannelBenchmark.sendPullPar8           16384  thrpt   20    531,156 ±   1,920  ops/s
[info] ChannelBenchmark.sendPullParUnlimited      64  thrpt   20  34700,785 ± 436,214  ops/s
[info] ChannelBenchmark.sendPullParUnlimited    1024  thrpt   20   2607,731 ±   6,898  ops/s
[info] ChannelBenchmark.sendPullParUnlimited   16384  thrpt   20    117,265 ±   0,804  ops/s

List only

[info] Benchmark                              (size)   Mode  Cnt      Score     Error  Units
[info] ChannelBenchmark.sendPull                  64  thrpt   20  46135,928 ± 712,387  ops/s
[info] ChannelBenchmark.sendPull                1024  thrpt   20   8579,870 ±  25,847  ops/s
[info] ChannelBenchmark.sendPull               16384  thrpt   20    641,415 ±   5,867  ops/s
[info] ChannelBenchmark.sendPullPar8              64  thrpt   20  36748,903 ± 479,760  ops/s
[info] ChannelBenchmark.sendPullPar8            1024  thrpt   20   8342,074 ±  68,747  ops/s
[info] ChannelBenchmark.sendPullPar8           16384  thrpt   20    550,346 ±   5,130  ops/s
[info] ChannelBenchmark.sendPullParUnlimited      64  thrpt   20  33841,176 ± 348,651  ops/s
[info] ChannelBenchmark.sendPullParUnlimited    1024  thrpt   20   2566,426 ±   6,597  ops/s
[info] ChannelBenchmark.sendPullParUnlimited   16384  thrpt   20    121,940 ±   4,638  ops/s

Ratio of each version's score to the main score:

                                  method  (size)           main  channel-perf  vector-CAS  List only
               ChannelBenchmark.sendPull      64      48074,370         0,988       0,959      0,960
               ChannelBenchmark.sendPull    1024       8525,740         1,152       0,976      1,006
               ChannelBenchmark.sendPull   16384        597,088         1,203       0,962      1,074
           ChannelBenchmark.sendPullPar8      64      37157,267         1,009       0,968      0,989
           ChannelBenchmark.sendPullPar8    1024       7778,680         1,157       1,007      1,072
           ChannelBenchmark.sendPullPar8   16384        516,252         1,290       1,029      1,066
   ChannelBenchmark.sendPullParUnlimited      64      31382,873         1,171       1,106      1,078
   ChannelBenchmark.sendPullParUnlimited    1024       2434,107         1,162       1,071      1,054
   ChannelBenchmark.sendPullParUnlimited   16384        101,704         1,314       1,153      1,199
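
The ratios appear to be the candidate score divided by the main score, rounded to three decimals. A small sketch for reproducing them, where `RatioSketch.ratio` is a hypothetical helper and the scores are written with a decimal point instead of the locale comma:

```scala
object RatioSketch {
  // Throughput of a candidate relative to main, rounded to three decimals.
  def ratio(candidate: BigDecimal, main: BigDecimal): BigDecimal =
    (candidate / main).setScale(3, BigDecimal.RoundingMode.HALF_UP)
}
```

For example, `RatioSketch.ratio(BigDecimal("47493.665"), BigDecimal("48074.370"))` corresponds to the channel-perf entry for sendPull at size 64, and `RatioSketch.ratio(BigDecimal("718.336"), BigDecimal("597.088"))` to the entry at size 16384.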

mpilquist · 2021-12-22T22:48:14Z

@nikiforo Think this is ready for merge? Anyone else you want a review from?

nikiforo · 2021-12-23T09:37:25Z

> Think this is ready for merge?

I think it is. I'm sure that it shouldn't make things worse. In some scenarios it even shows a 30% performance increase.

> Anyone else you want a review from?

Because I haven't changed the behavior, there is no strong requirement for @SystemFw's review. Yet, I would love to hear his thoughts about the PR.

nikiforo added 3 commits December 9, 2021 23:48

perf

8539d09

benchmark

45eee2b

channel-perf

066253e

nikiforo force-pushed the channel-perf branch from 21b5bb2 to 066253e Compare December 10, 2021 15:28

diesalbla reviewed Dec 10, 2021

channel-perf - CR comments + scalafmt + blackhole

65a5ca4

channel-perf - CR comment

f12628f

nikiforo mentioned this pull request Dec 22, 2021

text.lines enhancements #2758

Merged

diesalbla approved these changes Dec 22, 2021

mpilquist merged commit 2173855 into typelevel:main Dec 23, 2021

nikiforo mentioned this pull request Nov 2, 2022

Reimplemented Channel in terms of Queue #2856

Closed
