Updated Http4sWebSockets. #3393

Merged
merged 5 commits into from
Dec 21, 2023

kamilkloch
Contributor

@kamilkloch kamilkloch commented Dec 12, 2023

Fixes #3397.

Optimizes Http4sWebSockets.pipeToBody twofold:

  1. fast_path: if the concatenateFragmentedFrames, ignorePong, autoPongOnPing and autoPing flags in WebSocketBodyOutput are all false (autoPing empty), lift Http4sWebSocketFrames into REQ, run them through the pipe, convert RESP back to Http4sWebSocketFrame, and use fs2 Chunks for mapping the output.
  2. standard_path: otherwise, concurrently merge the business logic response, autoPings and autoPongOnPing. Use fs2.Channel to perform the merge (more efficient than Stream#mergeHaltL / Stream#parJoin).
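A minimal sketch of the fast path described above, not the actual tapir code. `Frame` is a hypothetical stand-in for Http4sWebSocketFrame, and REQ and RESP are both fixed to `String` to keep the example short. It only illustrates the shape: lift frames into requests, run the business-logic pipe, and map the responses back one chunk at a time.

```scala
import fs2.{Pipe, Pure, Stream}

object FastPathSketch {
  // Hypothetical stand-in for Http4sWebSocketFrame; only text frames are modelled.
  final case class Frame(text: String)

  // Lift each frame into the request type, run the business-logic pipe,
  // then convert responses back to frames chunk by chunk.
  def fastPath[F[_]](logic: Pipe[F, String, String]): Pipe[F, Frame, Frame] =
    in => in.map(_.text).through(logic).mapChunks(_.map(Frame(_)))

  // Usage with a pure stream, so no effect runtime is needed.
  val out: List[Frame] =
    Stream(Frame("ping"), Frame("pong"))
      .through(fastPath[Pure](_.map(_.toUpperCase)))
      .toList
}
```

No concurrency primitive is allocated here, which is the point of the fast path: the output is a plain transformation of the input stream.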

Performance results

before:

#[Mean    =       41.509, StdDeviation   =       57.180]
#[Max     =      748.000, Total count    =      6000000]

after standard_path :

#[Mean    =        1.106, StdDeviation   =        0.309]
#[Max     =        4.000, Total count    =      6000000]

after fast_path :

#[Mean    =        0.000, StdDeviation   =        0.013]
#[Max     =        4.000, Total count    =      6000000]

Histograms

tapir_before vs tapir_after:
Screenshot from 2023-12-13 16-01-56

tapir_after vs plain http4s+blaze:
Screenshot from 2023-12-14 11-04-14

CPU pressure

Number of async-profiler samples:

  • before: 2M
  • after standard_path: 734k
  • after fast_path: 384k

Going from 2M samples to about 400k (fast_path) is roughly a 5x reduction in CPU cycles.

before:
Screenshot from 2023-12-13 15-53-26

after standard_path:
Screenshot from 2023-12-13 15-53-08

after fast_path:
Screenshot from 2023-12-13 15-52-47

Benchmark repo:

https://github.com/kamilkloch/websocket-benchmark/tree/master/results

@kamilkloch kamilkloch changed the title from "Updated Http4sWebSockets, added ignorePing to WebSocketBodyOutput." to "Updated Http4sWebSockets, added ignorePing to WebSocketBodyOutput. WIP" Dec 12, 2023
@kamilkloch kamilkloch changed the title from "Updated Http4sWebSockets, added ignorePing to WebSocketBodyOutput. WIP" to "Updated Http4sWebSockets. WIP" Dec 13, 2023
@kamilkloch kamilkloch marked this pull request as ready for review December 13, 2023 16:00
@kamilkloch kamilkloch changed the title from "Updated Http4sWebSockets. WIP" to "Updated Http4sWebSockets." Dec 13, 2023
@plokhotnyuk
Contributor

For comparison here is a flame graph for http4s with blaze only (no tapir) under the same load:

image

@kciesielski @adamw Is there anything left that could defer merging?

@adamw
Member

adamw commented Dec 15, 2023

@plokhotnyuk the PR was in draft mode, didn't notice it's finalised now. Will take a look on Monday probably :)

pipe: Pipe[F, REQ, RESP],
o: WebSocketBodyOutput[Pipe[F, REQ, RESP], REQ, RESP, _, Fs2Streams[F]]
): F[Pipe[F, Http4sWebSocketFrame, Http4sWebSocketFrame]] = {
if ((!o.concatenateFragmentedFrames) && (!o.ignorePong) && (!o.autoPongOnPing) && o.autoPing.isEmpty) {
Member

I'm not quite sure what's the real difference between the fast & slow tracks - if the flags above are set to appropriate values, apart from some additional initialisation code, the fs2 pipeline seems to end up the same in both cases?

Member

Second question: why was this particular flag combination chosen for the fast track? The defaults are auto pings (which I think is reasonable), auto pongs, and ignoring pongs.

Contributor Author

The difference between the fast and slow track is that a fast track merely pipes the input stream through the business logic, whereas "slow" track involves a heavier machinery of concurrently merging a) business logic output b) autoPing and c) autoPongOnPing streams. This also answers the second question - if we want to avoid concurrent merging of streams for performance reasons, a certain flag combination is enforced, unfortunately. That said, looking at the flags in the fast track:

  • concatenateFragmentedFrames is currently a no-op,
  • ignorePong and autoPongOnPing can be handled on the http4s level. (autoPongOnPing is a bit of an unfortunate flag as it conflates two distinct behaviors: auto pong AND filtering pings. We considered adding WebSocketBodyOutput.ignorePing but ditched the idea due to binary compatibility issues.)
  • admittedly, the one remaining is autoPing == false, which has to be handled somewhere else (server layer or business layer)

Also, even the "slow" track is quite performant compared to the status quo.
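To make the "slow" track more concrete, here is a rough sketch of merging two streams through a single fs2 Channel. It is an illustration under assumptions rather than the code from this PR: the name `mergeViaChannel`, the capacity of 64, and the choice to end the output when the main stream ends are all hypothetical.

```scala
import cats.effect.IO
import fs2.Stream
import fs2.concurrent.Channel

object ChannelMergeSketch {
  // Merge `main` with a `side` stream through a single bounded channel.
  // The output ends when `main` ends, similar in spirit to mergeHaltL.
  def mergeViaChannel[A](main: Stream[IO, A], side: Stream[IO, A]): Stream[IO, A] =
    Stream.eval(Channel.bounded[IO, A](64)).flatMap { ch =>
      // Both producers push into the same channel; closing it once `main`
      // is done lets the consumer drain what was sent and then terminate.
      val producers =
        main.evalMap(ch.send).onFinalize(ch.close.void)
          .concurrently(side.evalMap(ch.send))
      // Consume the channel while the producers run in the background.
      ch.stream.concurrently(producers)
    }
}
```

In the setting discussed here, `main` would play the role of the business-logic output and `side` the auto-ping or auto-pong frames.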

Member

Thanks for the reply - so in the "slow path" we are allocating a Channel to do the concurrent merging.

We are using the channel for two purposes: auto-pong and auto-ping. The other flags simply modify the input / output streams, leaving them unchanged if they are false. So maybe we could relax the condition for the fast path, so that it's taken when auto-pong=false & auto-ping=false? I think we should be able to do the concatenating / close-decoding transformations on the fast path as well?

Secondly, I think it would be great to add some documentation explaining how to best use http4s websockets. As I understand, http4s also has a built-in auto-ping mechanism, which should preferably be used; that should end up in the docs (a similar note is present in the akka docs, btw.). Similarly, information about the slow/fast paths and the performance difference between them would be great to have, so that people can make an informed choice.

Contributor Author

@kamilkloch kamilkloch Dec 20, 2023

Will revise, cannot promise before Christmas, though :)
Done with the code part.

Contributor Author

Regrettably, http4s lacks an auto-ping mechanism, and doing it on the Stream (tapir) level turns out to be an (unbelievably) costly operation. Here is why: merging a stream with an empty(!) stream slows it down by roughly 800x:

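import cats.effect.IO
import fs2.Stream

val n = 10_000
val m = 100
val s = Stream(1).covary[IO].repeatN(n)
// Same stream, merged with an empty stream.
val s2 = s.mergeHaltL(Stream.empty)

// Time m runs of each stream and print the elapsed milliseconds.
s.compile.last.replicateA_(m).timed.map(_._1.toMillis).flatMap(IO.println) >>
s2.compile.last.replicateA_(m).timed.map(_._1.toMillis).flatMap(IO.println)
// Output (milliseconds), first for s, then for s2:
// 130
// 105394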

Member

> Regrettably, http4s lacks an auto-ping mechanism, and doing it on the Stream (tapir) level turns out to be an (unbelievably) costly operation. Here is why: merging a stream with an empty(!) stream slows it down by roughly 800x:

That looks ... really bad. To be honest I did not suspect that usage of fs2 might dominate I/O, but I guess that's what we're looking at here.

Unless we're using fs2 in a totally wrong way ...

@kamilkloch
Contributor Author

kamilkloch commented Dec 21, 2023

Updated tests (increased the number of connections from 10k to 25k). Results for the current tapir are not included, as it failed to finish the test for performance reasons.

Optimizes Http4sWebSockets.pipeToBody twofold:

  1. fast_path: if the autoPongOnPing and autoPing flags in WebSocketBodyOutput are both false (autoPing empty), lift Http4sWebSocketFrames into REQ, run them through the pipe, convert RESP back to Http4sWebSocketFrame, and use fs2 Chunks for mapping the output.
  2. standard_path: otherwise, concurrently merge the business logic response, autoPings and autoPongOnPing. Use fs2.Channel to perform the merge (more efficient than Stream#mergeHaltL / Stream#parJoin).
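The relaxed selection rule can be sketched as a small predicate. `WsFlags` is a hypothetical, trimmed-down stand-in for the two relevant WebSocketBodyOutput settings, and modelling `autoPing` as an optional interval is a simplification made for this example.

```scala
import scala.concurrent.duration._

object PathSelectionSketch {
  // Hypothetical stand-in for the two flags that decide which path is taken.
  final case class WsFlags(autoPongOnPing: Boolean, autoPing: Option[FiniteDuration])

  // The fast path is only possible when nothing has to be merged concurrently
  // into the output: no automatic pongs and no periodic pings.
  def useFastPath(flags: WsFlags): Boolean =
    !flags.autoPongOnPing && flags.autoPing.isEmpty
}
```

For example, `useFastPath(WsFlags(autoPongOnPing = false, autoPing = None))` selects the fast path, while enabling either setting falls back to the standard path.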

Performance results

http4s + blaze:

#[Mean    =        0.255, StdDeviation   =        0.436]
#[Max     =        2.000, Total count    =     15000000]

after fast_path :

#[Mean    =        0.384, StdDeviation   =        0.486]
#[Max     =        3.000, Total count    =     15000000]

after standard_path :

#[Mean    =       20.424, StdDeviation   =       40.654]
#[Max     =      732.000, Total count    =     15000000]

Histograms

tapir_after_fast_track vs plain http4s+blaze:
Screenshot from 2023-12-21 12-43-02

tapir_after_fast_track vs tapir_after_slow_track vs plain http4s+blaze:
Screenshot from 2023-12-21 12-42-37

CPU pressure

Number of async-profiler samples:

  • after standard_path: 2M
  • after fast_path: 937k
  • plain http4s + blaze: 874k

http4s + blaze
Screenshot from 2023-12-21 15-58-17

after fast_path:
Screenshot from 2023-12-21 15-58-30

after standard_path:
Screenshot from 2023-12-21 15-58-39

Benchmark repo:

https://github.com/kamilkloch/websocket-benchmark/tree/master/results

@adamw
Member

adamw commented Dec 21, 2023

@kamilkloch

> Done with the code part.

Great, thanks! :) I can merge & release this now, and we can do the documentation / comments (so that future generation - that is us in a month - will know why the if is there) in a separate PR, if you'd like?

@kamilkloch
Contributor Author

> @kamilkloch
>
> > Done with the code part.
>
> Great, thanks! :) I can merge & release this now, and we can do the documentation / comments (so that future generation - that is us in a month - will know why the if is there) in a separate PR, if you'd like?

I think it is a good idea. By that time we might perhaps land an auto-ping in the http4s/blaze layer, and then we will be able to add more comprehensive docs, or even change tapir defaults.

@adamw adamw merged commit bf2cd21 into softwaremill:master Dec 21, 2023
23 checks passed
@adamw
Member

adamw commented Dec 21, 2023

Ok, a release is on its way. Please don't forget about the docs ;) I'll be waiting ;)

@kamilkloch
Contributor Author

> Ok, a release is on its way. Please don't forget about the docs ;) I'll be waiting ;)

I have not forgotten, I am waiting until we land autoPing in http4s, ember and blaze first.


Successfully merging this pull request may close these issues.

Improve server web socket performance.
3 participants