Updated Http4sWebSockets. #3393

kamilkloch · 2023-12-12T15:13:08Z

Optimizes Http4sWebSockets.pipeToBody twofold:

fast_path: If the concatenateFragmentedFrames, ignorePong, autoPongOnPing and autoPing flags in WebSocketBodyOutput are all false, lift Http4sWebSocketFrames into REQ, run them through the pipe, convert RESP back to Http4sWebSocketFrame, and use fs2 Chunks for mapping the output.
standard_path: Otherwise, concurrently merge the business logic response, autoPings and autoPongOnPing. Use fs2.Channel to perform the merge (more efficient than Stream#mergeHaltL / Stream#parJoin).
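
The fast path described above can be sketched roughly as follows. This is a minimal illustration, not the code from the PR: `Frame`, `Text`, `Ping`, `decode` and `encode` are hypothetical stand-ins for Http4sWebSocketFrame and tapir's frame codecs, and the effect type is fixed to `IO` for brevity.

```scala
import cats.effect.IO
import fs2.Pipe

object FastPathSketch {
  // Hypothetical stand-ins for Http4sWebSocketFrame (assumption, not the real types).
  sealed trait Frame
  final case class Text(payload: String) extends Frame
  case object Ping extends Frame

  // No concurrency involved: decode whole chunks, run the business logic pipe,
  // then encode whole chunks of responses back into frames.
  def fastPath[REQ, RESP](
      decode: Frame => Option[REQ], // hypothetical: None means "drop this frame"
      encode: RESP => Frame         // hypothetical
  )(logic: Pipe[IO, REQ, RESP]): Pipe[IO, Frame, Frame] =
    in =>
      in.mapChunks(_.mapFilter(decode))
        .through(logic)
        .mapChunks(_.map(encode))
}
```

Because nothing is merged here, the cost per frame is just the two mapping steps plus whatever the business logic pipe does.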

Performance results

before:

#[Mean    =       41.509, StdDeviation   =       57.180]
#[Max     =      748.000, Total count    =      6000000]

after standard_path :

#[Mean    =        1.106, StdDeviation   =        0.309]
#[Max     =        4.000, Total count    =      6000000]

after fast_path :

#[Mean    =        0.000, StdDeviation   =        0.013]
#[Max     =        4.000, Total count    =      6000000]

Histograms

tapir_before vs tapir_after:

tapir_after vs plain http4s+blaze:

CPU pressure

Number of async-profiler samples:

before: 2M
after standard_path: 734k
after fast_path: 384k

2M -> ~400k samples is roughly a 5x reduction in CPU usage.

before:

after standard_path:

after fast_path:

Benchmark repo:

https://github.com/kamilkloch/websocket-benchmark/tree/master/results

plokhotnyuk · 2023-12-15T14:06:49Z

For comparison here is a flame graph for http4s with blaze only (no tapir) under the same load:

@kciesielski @adamw Is there anything left that could defer merging?

adamw · 2023-12-15T15:33:12Z

@plokhotnyuk the PR was in draft mode, didn't notice it's finalised now. Will take a look on Monday probably :)

adamw · 2023-12-18T15:17:02Z

server/http4s-server/src/main/scala/sttp/tapir/server/http4s/Http4sWebSockets.scala

+                                   pipe: Pipe[F, REQ, RESP],
+                                   o: WebSocketBodyOutput[Pipe[F, REQ, RESP], REQ, RESP, _, Fs2Streams[F]]
+                                 ): F[Pipe[F, Http4sWebSocketFrame, Http4sWebSocketFrame]] = {
+    if ((!o.concatenateFragmentedFrames) && (!o.ignorePong) && (!o.autoPongOnPing) && o.autoPing.isEmpty) {


I'm not quite sure what's the real difference between the fast & slow tracks - if the flags above are set to appropriate values, apart from some additional initialisation code, the fs2 pipeline seems to end up the same in both cases?

Second question: why has this particular flag combination been chosen for the fast track? The default is to have auto pings (which I think is reasonable), auto pongs, and ignore pongs.

The difference between the fast and slow track is that a fast track merely pipes the input stream through the business logic, whereas "slow" track involves a heavier machinery of concurrently merging a) business logic output b) autoPing and c) autoPongOnPing streams. This also answers the second question - if we want to avoid concurrent merging of streams for performance reasons, a certain flag combination is enforced, unfortunately. That said, looking at the flags in the fast track:

concatenateFragmentedFrames is currently a no-op,

ignorePong and autoPongOnPing can be handled on the http4s level. (autoPongOnPing is a bit of an unfortunate flag, as it conflates two distinct behaviors: auto pong AND filtering ping. We considered adding WebSocketBodyOutput.ignorePing but ditched the idea due to binary compatibility issues.)

admittedly, the one remaining is autoPing == false, which has to be handled somewhere else (server layer or business layer)

Also, even the "slow" track is quite performant compared to the status quo.

Thanks for the reply - so in the "slow path" we are allocating a Channel to do the concurrent merging.

We are using the channel for two purposes: auto-pong and auto-ping. The other flags simply modify the input / output streams, leaving them unchanged if they are false. So maybe we could relax the condition for the fast path, so that it's taken when auto-pong=false & auto-ping=false? I think we should be able to do the concatenating / close-decoding transformations on the fast path as well?
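
To make the two conditions concrete, here is a small sketch comparing the original fast-path check with the relaxed one proposed above. `WsFlags` is a hypothetical, simplified stand-in for the relevant fields of WebSocketBodyOutput; in particular, `autoPing` is reduced to an optional interval here, which is an assumption made for illustration.

```scala
import scala.concurrent.duration._

// Hypothetical, simplified view of the WebSocketBodyOutput flags.
final case class WsFlags(
    concatenateFragmentedFrames: Boolean,
    ignorePong: Boolean,
    autoPongOnPing: Boolean,
    autoPing: Option[FiniteDuration] // assumption: interval only
)

object FastPathCondition {
  // First revision of the PR: every flag must be off.
  def strict(o: WsFlags): Boolean =
    !o.concatenateFragmentedFrames && !o.ignorePong && !o.autoPongOnPing && o.autoPing.isEmpty

  // Relaxed: only the flags that inject extra frames force the slow path.
  def relaxed(o: WsFlags): Boolean =
    !o.autoPongOnPing && o.autoPing.isEmpty
}
```

With `ignorePong = true` and both auto-pong and auto-ping disabled, `strict` rejects the fast path while `relaxed` accepts it, since ignoring pongs only filters the input stream.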

Secondly, I think it would be great to add some documentation explaining how best to use http4s websockets. As I understand, http4s also has a built-in auto-ping mechanism, which should preferably be used - that should end up in the docs (a similar note is present in the akka docs, btw.). Similarly, information about the slow/fast paths and the performance difference between them would be great to have, so that people can make an informed choice.

~~Will revise, cannot promise before Christmas, though :)~~
Done with the code part.

Regrettably, http4s lacks the auto-ping mechanism, and doing it on the Stream (tapir) level turns out to be an (unbelievably) costly operation. Here is why: merging a stream with an empty(!) stream slows it down by roughly 800x:

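    import cats.effect.IO
    import fs2.Stream

    val n = 10_000
    val m = 100
    val s = Stream(1).covary[IO].repeatN(n)
    // the same stream, merged with an empty stream
    val s2 = s.mergeHaltL(Stream.empty)

    // time m runs of each stream, in milliseconds
    s.compile.last.replicateA_(m).timed.map(_._1.toMillis).flatMap(IO.println) >>
      s2.compile.last.replicateA_(m).timed.map(_._1.toMillis).flatMap(IO.println)

Output (milliseconds):

    130
    105394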

That looks ... really bad. To be honest I did not suspect that usage of fs2 might dominate I/O, but I guess that's what we're looking at here.

Unless we're using fs2 in a totally wrong way ...

kamilkloch · 2023-12-21T15:03:37Z

Updated tests (increased number of connections from 10k to 25k). (results for current tapir not included as it failed to finish the test for perf reasons)

Optimizes Http4sWebSockets.pipeToBody twofold:

fast_path If autoPongOnPing and autoPing flags in WebSocketBodyOutput are both false, lift Http4sWebSocketFrames into REQ, run through pipe, convert RESP back to Http4sWebSocketFrame, use fs2 Chunks for mapping output.
standard_path Otherwise, concurrently merge business logic response, autoPings, autoPongOnPing. Use fs2.Channel to perform the merge (more efficient than Stream#mergeHaltL / Stream#parJoin).
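
As an illustration of the Channel-based merge mentioned for the standard path, here is a minimal sketch. It is not the PR's implementation: `mergeViaChannel`, the capacity of 64 and the string payloads are assumptions chosen for the example, and the effect type is fixed to `IO`.

```scala
import cats.effect.{IO, IOApp}
import fs2.Stream
import fs2.concurrent.Channel
import scala.concurrent.duration._

object ChannelMergeSketch extends IOApp.Simple {
  // Funnel `main` and a background `pings` stream through one Channel.
  // The output ends when `main` ends; `pings` is then interrupted.
  def mergeViaChannel[A](main: Stream[IO, A], pings: Stream[IO, A]): Stream[IO, A] =
    Stream.eval(Channel.bounded[IO, A](64)).flatMap { ch =>
      // sendAll closes the channel once `main` has finished
      val producer = main.through(ch.sendAll)
      // send returns Left(Closed) once the channel is closed; that result is ignored here
      val pinger = pings.evalMap(ch.send).drain
      ch.stream.concurrently(producer).concurrently(pinger)
    }

  def run: IO[Unit] =
    mergeViaChannel(
      Stream.range(0, 5).covary[IO].map(i => s"resp-$i").metered(30.millis),
      Stream.awakeEvery[IO](50.millis).as("ping")
    ).evalMap(s => IO.println(s)).compile.drain
}
```

The interleaving of responses and pings depends on timing, so the printed order may vary between runs.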

Performance results

http4s + blaze:

#[Mean    =        0.255, StdDeviation   =        0.436]
#[Max     =        2.000, Total count    =     15000000]

after fast_path :

#[Mean    =        0.384, StdDeviation   =        0.486]
#[Max     =        3.000, Total count    =     15000000]

after standard_path :

#[Mean    =       20.424, StdDeviation   =       40.654]
#[Max     =      732.000, Total count    =     15000000]

Histograms

tapir_after_fast_track vs plain http4s+blaze:

tapir_after_fast_track vs tapir_after_slow_track vs plain http4s+blaze:

CPU pressure

Number of async-profiler samples:

after standard_path: 2M
after fast_path: 937k
plain http4s + blaze: 874k

http4s + blaze

after fast_path:

after standard_path:

Benchmark repo:

https://github.com/kamilkloch/websocket-benchmark/tree/master/results

adamw · 2023-12-21T15:25:34Z

@kamilkloch

Done with the code part.

Great, thanks! :) I can merge & release this now, and we can do the documentation / comments (so that future generation - that is us in a month - will know why the if is there) in a separate PR, if you'd like?

kamilkloch · 2023-12-21T15:30:40Z

I think it's a good idea. By that time we might perhaps land an auto-ping in the http4s/blaze layer, and then we will be able to add more comprehensive docs, or even change tapir defaults.

adamw · 2023-12-21T15:39:33Z

Ok, a release is on its way. Please don't forget about the docs ;) I'll be waiting ;)

kamilkloch · 2024-01-19T12:45:39Z

I have not forgotten; I am waiting until we land an autoPing in http4s, ember and blaze first.

Updated Http4sWebSockets, added ignorePing to WebSocketBodyOutput.

87d2a5c

kamilkloch changed the title ~~Updated Http4sWebSockets, added ignorePing to WebSocketBodyOutput.~~ Updated Http4sWebSockets, added ignorePing to WebSocketBodyOutput. WIP Dec 12, 2023

Updated Http4sWebSockets, removed ignorePing from WebSocketBodyOutput.

f65ea65

kamilkloch changed the title ~~Updated Http4sWebSockets, added ignorePing to WebSocketBodyOutput. WIP~~ Updated Http4sWebSockets. WIP Dec 13, 2023

kamilkloch marked this pull request as ready for review December 13, 2023 16:00

kamilkloch changed the title ~~Updated Http4sWebSockets. WIP~~ Updated Http4sWebSockets. Dec 13, 2023

Kamil Kloch added 2 commits December 14, 2023 11:56

Cosmetics - reverted changes in whitespaces.

7a9d408

Cosmetics - replaced Chunk.apply with Chunk.singleton.

c069e88

adamw reviewed Dec 18, 2023

Relaxed the constraint to hit the fast track in Http4sWebSockets, updated stream merging logic.

7c52d45

adamw merged commit bf2cd21 into softwaremill:master Dec 21, 2023
23 checks passed

kamilkloch mentioned this pull request Dec 26, 2023

Add autoPing for web sockets (ember server). http4s/http4s#7348

Open

kciesielski mentioned this pull request Feb 28, 2024

Vert.X: improve WebSocket performance #3539

Closed
