Channel perf #2751

Merged
merged 5 commits into from
Dec 23, 2021
@nikiforo nikiforo commented Dec 10, 2021

benchmark/ jmh:run -i 10 -wi 6 -f1 -t4 -gc true -jvmArgs "-XX:MaxRecursiveInlineLevel=3 -XX:MaxInlineSize=50 -Dcats.effect.tracing.mode=none" fs2.benchmark.ChannelBenchmark

channel-perf

[info] Benchmark                              (size)   Mode  Cnt      Score     Error  Units
[info] ChannelBenchmark.sendPull                  64  thrpt   10  70315,735 ± 990,801  ops/s
[info] ChannelBenchmark.sendPull                1024  thrpt   10   8489,787 ± 220,705  ops/s
[info] ChannelBenchmark.sendPull               16384  thrpt   10    586,689 ±  14,841  ops/s
[info] ChannelBenchmark.sendPullPar8              64  thrpt   10  48603,532 ± 755,156  ops/s
[info] ChannelBenchmark.sendPullPar8            1024  thrpt   10  10497,909 ± 444,052  ops/s
[info] ChannelBenchmark.sendPullPar8           16384  thrpt   10    650,486 ±  36,143  ops/s
[info] ChannelBenchmark.sendPullParUnlimited      64  thrpt   10  46336,865 ± 640,137  ops/s
[info] ChannelBenchmark.sendPullParUnlimited    1024  thrpt   10   3421,420 ± 268,378  ops/s
[info] ChannelBenchmark.sendPullParUnlimited   16384  thrpt   10    162,638 ±   9,527  ops/s

main

[info] Benchmark                              (size)   Mode  Cnt      Score      Error  Units
[info] ChannelBenchmark.sendPull                  64  thrpt   10  65333,103 ± 4694,045  ops/s
[info] ChannelBenchmark.sendPull                1024  thrpt   10   7288,803 ±  309,856  ops/s
[info] ChannelBenchmark.sendPull               16384  thrpt   10    487,553 ±   28,040  ops/s
[info] ChannelBenchmark.sendPullPar8              64  thrpt   10  45527,449 ± 2153,177  ops/s
[info] ChannelBenchmark.sendPullPar8            1024  thrpt   10   8268,452 ±  320,778  ops/s
[info] ChannelBenchmark.sendPullPar8           16384  thrpt   10    490,139 ±   25,620  ops/s
[info] ChannelBenchmark.sendPullParUnlimited      64  thrpt   10  36154,018 ± 2300,873  ops/s
[info] ChannelBenchmark.sendPullParUnlimited    1024  thrpt   10   2891,059 ±  219,453  ops/s
[info] ChannelBenchmark.sendPullParUnlimited   16384  thrpt   10    143,254 ±   10,190  ops/s

The new version differs from the one in main in two respects. First, it replaces Vector with List in the State. I don’t think that this change is responsible for the observed performance change. Yet, currently fs2.Chunk doesn’t have a vector-specialized implementation, therefore it’ll anyway be converted to Array. A more specialized implementation reduces the amount of computation required.
The second change shrinks the critical section of the CAS loop. Previously, the loop not only computed the next state but also transformed the old state into the output. The updated version splits this into two distinct pieces of work: the new state is computed inside the CAS, and the previous state is converted to the emitted chunk after the CAS. This reduces contention and eliminates duplicated preparation of the emitted chunk.
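
To make the second change concrete, here is a minimal sketch of the two shapes of the loop. It is an illustration under assumptions, not the code from this PR: it uses a plain java.util.concurrent.atomic.AtomicReference rather than the Ref that fs2 uses, and State, drainInsideLoop and drainAfterLoop are hypothetical names.

```scala
import java.util.concurrent.atomic.AtomicReference

object CasSketch {
  // Hypothetical, simplified state: each send prepends, so values are newest-first.
  final case class State[A](values: List[A], size: Int)

  def empty[A]: State[A] = State(Nil, 0)

  // Shape 1: the output is prepared inside the retry loop,
  // so every failed compareAndSet repeats the conversion.
  def drainInsideLoop[A](ref: AtomicReference[State[A]]): Vector[A] = {
    var out: Vector[A] = Vector.empty
    var done = false
    while (!done) {
      val prev = ref.get()
      out = prev.values.reverse.toVector // recomputed on each retry
      done = ref.compareAndSet(prev, empty[A])
    }
    out
  }

  // Shape 2: the loop only swaps in the next state;
  // the previous state is converted once, after the CAS succeeded.
  def drainAfterLoop[A](ref: AtomicReference[State[A]]): Vector[A] = {
    var prev = ref.get()
    while (!ref.compareAndSet(prev, empty[A])) prev = ref.get()
    prev.values.reverse.toVector
  }
}
```

Both functions return the same elements; the difference is only how much work a retry has to redo when several fibers or threads compete for the same reference.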

                                  method   size   channel-perf          main ratio
               ChannelBenchmark.sendPull     64      70315,735     65333,103 1,076
               ChannelBenchmark.sendPull   1024       8489,787      7288,803 1,165
               ChannelBenchmark.sendPull  16384        586,689       487,553 1,203
           ChannelBenchmark.sendPullPar8     64      48603,532     45527,449 1,068
           ChannelBenchmark.sendPullPar8   1024      10497,909      8268,452 1,270
           ChannelBenchmark.sendPullPar8  16384        650,486       490,139 1,327
   ChannelBenchmark.sendPullParUnlimited     64      46336,865     36154,018 1,282
   ChannelBenchmark.sendPullParUnlimited   1024       3421,420      2891,059 1,183
   ChannelBenchmark.sendPullParUnlimited  16384        162,638       143,254 1,135

@mpilquist
Member

> Yet, currently fs2.Chunk doesn’t have a vector-specialized implementation, therefore it’ll anyway be converted to Array.

It should be wrapped in an instance of Chunk.IndexedSeqChunk and do no copying. The List version will result in a copy to an array.

@nikiforo
Contributor Author

> It should be wrapped in an instance of Chunk.IndexedSeqChunk and do no copying. The List version will result in a copy to an array.

I think I phrased that poorly. I believe that the combination of List (an immutable data structure used in the CAS) and Array (for the emitted Chunk, filled in reverse order) might be the most effective approach here. However, I will change back to Vector and measure the performance.
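
As a rough sketch of the List + Array idea: if each send prepends to the List, the newest element is at the head, so the array can be filled from its last index backwards to restore insertion order in a single pass. ReverseFill and toArrayInOrder are hypothetical names, and the sketch assumes the element count is tracked alongside the list, as the size field of State suggests; the PR itself may do this differently.

```scala
import scala.reflect.ClassTag

object ReverseFill {
  // `newestFirst` holds the elements in reverse insertion order.
  // `size` is assumed to equal newestFirst.length (tracked separately in the state).
  def toArrayInOrder[A: ClassTag](newestFirst: List[A], size: Int): Array[A] = {
    val arr = new Array[A](size)
    var i = size - 1
    var rest = newestFirst
    // Walk the list once, writing from the back of the array to the front.
    while (i >= 0 && rest.nonEmpty) {
      arr(i) = rest.head
      rest = rest.tail
      i -= 1
    }
    arr
  }
}
```

For example, `ReverseFill.toArrayInOrder(List(3, 2, 1), 3)` yields an array containing 1, 2, 3, without building an intermediate reversed list.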

The benchmarks show that we might gain 20%+ in this class. The question is: should we consider a 20% increase in throughput in this class an optimization worth exploring?

@mpilquist
Member

Yeah, that's very possible. List beats Vector in so many cases where I don't expect it. No objection overall, as long as benchmark results point us in the right direction and we're testing both small and large collections.

    closed: Boolean
)

- val initial = State(Vector.empty, 0, None, Vector.empty, false)
+ val initial = State(List.empty, 0, None, List.empty, false)
Contributor

Suggested change
- val initial = State(List.empty, 0, None, List.empty, false)
+ def empty(isClosed: Boolean) = State(List.empty, 0, None, List.empty, isClosed)
+ val initial = empty(isClosed = false)

Comment on lines 187 to 189
case prev @ State(values, size, ignorePreviousWaiting @ _, producers, closed) =>
  if (shouldEmit(prev)) (State(List.empty, 0, None, List.empty, closed), prev)
  else (State(values, size, waiting.some, producers, closed), prev)
Contributor

Suggested change
- case prev @ State(values, size, ignorePreviousWaiting @ _, producers, closed) =>
-   if (shouldEmit(prev)) (State(List.empty, 0, None, List.empty, closed), prev)
-   else (State(values, size, waiting.some, producers, closed), prev)
+ case prev if shouldEmit(prev) =>
+   empty(prev.closed) -> prev
+ case prev =>
+   prev.copy(waiting = Some(waiting)) -> prev
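
The suggested shape can be tried out in isolation with a simplified stand-in for State. Everything below is an assumption made for illustration: the field types are placeholders (the real state holds effectful handles rather than strings), and the body of shouldEmit is invented, since the PR excerpt does not show it.

```scala
object StateSketch {
  // Simplified stand-in for the channel's State; field types are placeholders.
  final case class State[A](
      values: List[A],
      size: Int,
      waiting: Option[String],      // placeholder for the consumer's wake-up handle
      producers: List[(A, String)], // placeholder for blocked producers
      closed: Boolean
  )

  def empty[A](isClosed: Boolean): State[A] =
    State(List.empty, 0, None, List.empty, isClosed)

  // Hypothetical predicate; the real condition is not shown in the excerpt.
  def shouldEmit[A](s: State[A]): Boolean = s.values.nonEmpty || s.closed

  // Returns (next state, previous state), using guards and copy
  // instead of destructuring every field.
  def step[A](prev: State[A], waiting: String): (State[A], State[A]) =
    prev match {
      case p if shouldEmit(p) => empty[A](p.closed) -> p
      case p                  => p.copy(waiting = Some(waiting)) -> p
    }
}
```

The guard-based version never names fields it does not change, so adding a field to State later would not require touching this match.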

core/shared/src/main/scala/fs2/concurrent/Channel.scala
@nikiforo
Contributor Author

nikiforo commented Dec 14, 2021

I've changed ChannelBenchmark a bit: changed bounded and added some dummy load. I've also modified @diesalbla's command by adding -Xmx and -Xms, in the hope of running GC every iteration.

benchmark/ jmh:run -i 20 -wi 10 -f1 -t4 -gc true -jvmArgs "-XX:MaxRecursiveInlineLevel=3 -XX:MaxInlineSize=50 -Dcats.effect.tracing.mode=none -Xmx512m -Xms512m" fs2.benchmark.ChannelBenchmark

Using the updated benchmarks, I've tested four versions:

  1. main
  2. channel-perf (version from this PR)
  3. vector-CAS (all occurrences of List in this PR were changed to Vector)
  4. List only (all occurrences of Vector in main were changed to List)

main

[info] Benchmark                              (size)   Mode  Cnt      Score     Error  Units
[info] ChannelBenchmark.sendPull                  64  thrpt   20  48074,370 ± 734,320  ops/s
[info] ChannelBenchmark.sendPull                1024  thrpt   20   8525,740 ±  87,360  ops/s
[info] ChannelBenchmark.sendPull               16384  thrpt   20    597,088 ±   2,922  ops/s
[info] ChannelBenchmark.sendPullPar8              64  thrpt   20  37157,267 ± 424,852  ops/s
[info] ChannelBenchmark.sendPullPar8            1024  thrpt   20   7778,680 ±  36,153  ops/s
[info] ChannelBenchmark.sendPullPar8           16384  thrpt   20    516,252 ±   5,478  ops/s
[info] ChannelBenchmark.sendPullParUnlimited      64  thrpt   20  31382,873 ± 394,371  ops/s
[info] ChannelBenchmark.sendPullParUnlimited    1024  thrpt   20   2434,107 ±  14,289  ops/s
[info] ChannelBenchmark.sendPullParUnlimited   16384  thrpt   20    101,704 ±   2,496  ops/s

channel-perf

[info] Benchmark                              (size)   Mode  Cnt      Score     Error  Units
[info] ChannelBenchmark.sendPull                  64  thrpt   20  47493,665 ± 403,438  ops/s
[info] ChannelBenchmark.sendPull                1024  thrpt   20   9822,312 ±  28,054  ops/s
[info] ChannelBenchmark.sendPull               16384  thrpt   20    718,336 ±   4,700  ops/s
[info] ChannelBenchmark.sendPullPar8              64  thrpt   20  37502,432 ± 270,349  ops/s
[info] ChannelBenchmark.sendPullPar8            1024  thrpt   20   8997,285 ±  23,286  ops/s
[info] ChannelBenchmark.sendPullPar8           16384  thrpt   20    665,925 ±   2,915  ops/s
[info] ChannelBenchmark.sendPullParUnlimited      64  thrpt   20  36746,409 ± 348,711  ops/s
[info] ChannelBenchmark.sendPullParUnlimited    1024  thrpt   20   2829,545 ±   7,964  ops/s
[info] ChannelBenchmark.sendPullParUnlimited   16384  thrpt   20    133,678 ±   1,450  ops/s

vector-CAS

[info] Benchmark                              (size)   Mode  Cnt      Score     Error  Units
[info] ChannelBenchmark.sendPull                  64  thrpt   20  46099,934 ± 398,498  ops/s
[info] ChannelBenchmark.sendPull                1024  thrpt   20   8318,947 ±  31,983  ops/s
[info] ChannelBenchmark.sendPull               16384  thrpt   20    574,230 ±   1,996  ops/s
[info] ChannelBenchmark.sendPullPar8              64  thrpt   20  35967,734 ± 306,821  ops/s
[info] ChannelBenchmark.sendPullPar8            1024  thrpt   20   7829,428 ±  16,850  ops/s
[info] ChannelBenchmark.sendPullPar8           16384  thrpt   20    531,156 ±   1,920  ops/s
[info] ChannelBenchmark.sendPullParUnlimited      64  thrpt   20  34700,785 ± 436,214  ops/s
[info] ChannelBenchmark.sendPullParUnlimited    1024  thrpt   20   2607,731 ±   6,898  ops/s
[info] ChannelBenchmark.sendPullParUnlimited   16384  thrpt   20    117,265 ±   0,804  ops/s

List only

[info] Benchmark                              (size)   Mode  Cnt      Score     Error  Units
[info] ChannelBenchmark.sendPull                  64  thrpt   20  46135,928 ± 712,387  ops/s
[info] ChannelBenchmark.sendPull                1024  thrpt   20   8579,870 ±  25,847  ops/s
[info] ChannelBenchmark.sendPull               16384  thrpt   20    641,415 ±   5,867  ops/s
[info] ChannelBenchmark.sendPullPar8              64  thrpt   20  36748,903 ± 479,760  ops/s
[info] ChannelBenchmark.sendPullPar8            1024  thrpt   20   8342,074 ±  68,747  ops/s
[info] ChannelBenchmark.sendPullPar8           16384  thrpt   20    550,346 ±   5,130  ops/s
[info] ChannelBenchmark.sendPullParUnlimited      64  thrpt   20  33841,176 ± 348,651  ops/s
[info] ChannelBenchmark.sendPullParUnlimited    1024  thrpt   20   2566,426 ±   6,597  ops/s
[info] ChannelBenchmark.sendPullParUnlimited   16384  thrpt   20    121,940 ±   4,638  ops/s

Ratio of each version's score to the main score (values above 1 are faster than main):

                                  method   size           main channel-perf vector-CAS List only
               ChannelBenchmark.sendPull     64      48074,370        0,988      0,959     0,960
               ChannelBenchmark.sendPull   1024       8525,740        1,152      0,976     1,006
               ChannelBenchmark.sendPull  16384        597,088        1,203      0,962     1,074
           ChannelBenchmark.sendPullPar8     64      37157,267        1,009      0,968     0,989
           ChannelBenchmark.sendPullPar8   1024       7778,680        1,157      1,007     1,072
           ChannelBenchmark.sendPullPar8  16384        516,252        1,290      1,029     1,066
   ChannelBenchmark.sendPullParUnlimited     64      31382,873        1,171      1,106     1,078
   ChannelBenchmark.sendPullParUnlimited   1024       2434,107        1,162      1,071     1,054
   ChannelBenchmark.sendPullParUnlimited  16384        101,704        1,314      1,153     1,199
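
For reference, each ratio in the table is the version's throughput divided by the throughput of main for the same benchmark and size, rounded to three decimals. A small sketch of that arithmetic follows; Ratios, ratio and rounded are hypothetical helper names, and the scores are written with a decimal point instead of the decimal comma used in the JMH output above.

```scala
object Ratios {
  // Throughput ratio: values above 1.0 mean the candidate is faster than the baseline.
  def ratio(candidate: Double, baseline: Double): Double = candidate / baseline

  // Round to three decimal places, matching the precision of the table.
  def rounded(x: Double): Double = math.round(x * 1000) / 1000.0
}

object RatiosExample extends App {
  // sendPull, size 64: channel-perf vs main
  println(Ratios.rounded(Ratios.ratio(47493.665, 48074.370))) // 0.988
  // sendPullPar8, size 16384: channel-perf vs main
  println(Ratios.rounded(Ratios.ratio(665.925, 516.252))) // 1.29
}
```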

@mpilquist
Member

@nikiforo Think this is ready for merge? Anyone else you want a review from?

@nikiforo
Contributor Author

> Think this is ready for merge?

I think it is. I'm sure that it shouldn't make things worse. For some scenarios it even shows a 30% performance increase.

> Anyone else you want a review from?

Because I haven't changed the behavior, there is no strong requirement for @SystemFw's review. Yet, I would love to hear his thoughts about the PR.

@mpilquist mpilquist merged commit 2173855 into typelevel:main Dec 23, 2021