
fixes performance degradation when fragmentation is used #995

Merged
merged 1 commit into rsocket:1.0.x from the fragment_fix branch on Mar 4, 2021

Conversation

koldat
Contributor

@koldat koldat commented Mar 3, 2021

When one defines a custom MTU to be used as the fragment size, it significantly degrades performance. The attached example code sends 1M records of 5 bytes each. Before the fix it takes 39 seconds; after the fix it takes 5 seconds (the same time as with no custom fragmentation). We need to enable it because the WebSocket max data size is 64 kB and we support both transports.

See #994
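
A minimal sketch of that kind of benchmark, assuming the rsocket-java 1.0.x client API over the TCP transport; the endpoint, the 16 KiB MTU, and the 5-byte payload are placeholders, not the attached code:

  import io.rsocket.RSocket;
  import io.rsocket.core.RSocketConnector;
  import io.rsocket.transport.netty.client.TcpClientTransport;
  import io.rsocket.util.DefaultPayload;
  import reactor.core.publisher.Flux;

  public class FragmentationBenchmark {
    public static void main(String[] args) {
      // Any custom MTU enabled the slow path before the fix; 16 KiB is an arbitrary choice.
      RSocket requester =
          RSocketConnector.create()
              .fragment(16_384)
              .connect(TcpClientTransport.create("localhost", 7000)) // assumed local responder
              .block();

      long start = System.nanoTime();
      Flux.range(0, 1_000_000)
          // 5-byte records, sent as fire-and-forget frames
          .concatMap(i -> requester.fireAndForget(DefaultPayload.create("12345")))
          .blockLast();
      System.out.printf("sent 1M records in %.1f s%n", (System.nanoTime() - start) / 1e9);

      requester.dispose();
    }
  }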

@OlegDokuka
Member

Unfortunately, this fix breaks the idea of fragmentation and introduces head-of-line blocking, since now basically all frames are fragmented and sent one by one, meaning that if one huge frame has to be sent, it will block the others. I believe what you need is another fragmentation level, which WebSocket provides by default.

@koldat
Contributor Author

koldat commented Mar 3, 2021

@OlegDokuka " that if one huge frame has to be sent, it will be blocking the others."

I am not sure I understand. Why it should be blocking others? original code is doing the same. There is only exception that concatMap has condition that when no fragmentation is not needed we do not need to send with "fragmented" branch.

@OlegDokuka
Member

I did. What you do is basically have a concatMap that sends all the frames through a single delegate.send. Am I missing something?

@koldat
Contributor Author

koldat commented Mar 3, 2021

Is delegate.send blocking the others? Then multiple streams could never work on the same channel.

@OlegDokuka
Member

OlegDokuka commented Mar 3, 2021

I am not sure I understand. Why should it block the others? The original code does the same thing; the only difference is that concatMap now has a condition so that when fragmentation is not needed, we do not go through the "fragmented" branch.

concatMap allows only a single Flux to produce frames at a time, which means that if we have a huge payload it will produce a long Flux of small frames, so that payload will take a long time to complete and will queue up all the other frames; they will be waiting their turn to be fragmented.

Is delegate.send blocking the others? Then multiple streams could never work on the same channel.

It drains frames sequentially, one by one.
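
A small standalone Reactor sketch of the behavior described above: concatMap drains one inner Flux to completion before starting the next, while flatMap subscribes to the inner publishers eagerly and may interleave their elements (plain Reactor, no RSocket types):

  import java.time.Duration;
  import reactor.core.publisher.Flux;

  public class ConcatVsFlatMap {
    public static void main(String[] args) {
      // "A" stands for a huge frame fragmented into many small pieces emitted slowly;
      // "B" stands for a small frame that is ready immediately.
      Flux<String> slowA = Flux.interval(Duration.ofMillis(10)).take(5).map(i -> "A" + i);
      Flux<String> fastB = Flux.just("B0");

      // concatMap: A0 A1 A2 A3 A4 B0 -- B waits behind A (head-of-line blocking).
      Flux.just(slowA, fastB).concatMap(f -> f).doOnNext(System.out::println).blockLast();

      // flatMap: B0 is emitted right away, interleaved with A's elements.
      Flux.just(slowA, fastB).flatMap(f -> f).doOnNext(System.out::println).blockLast();
    }
  }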

@OlegDokuka
Member

I mean, looking at your fix and the explanation, I believe that you need something different, at the Netty level. Can you please look at my comments on #994?

@koldat
Contributor Author

koldat commented Mar 3, 2021

What about flatMap then?

@koldat
Contributor Author

koldat commented Mar 3, 2021

@OlegDokuka Why is this code not blocking in the same way you say my version is? That concatMap also processes frames one by one, and if there is a huge data frame then the others wait in the queue:

  @Override
  public Mono<Void> send(Publisher<ByteBuf> frames) {
    return Flux.from(frames).concatMap(this::sendOne).then();
  }

@OlegDokuka
Member

OlegDokuka commented Mar 3, 2021

@koldat this code is weird, and it is kind of the same and kind of not. The main difference is that every sub-Flux generated within concatMap is sent through its own delegate.send. Under the hood, reactor-netty allocates a whole MonoSendMany for that purpose (which is why you see such tremendous overhead: it creates a couple of queues and objects). MonoSendMany has a prefetch mechanism and technically uses its own channel (if I'm not wrong), so it can write some amount of data into the channel's queue before being blocked. That said, onComplete comes back fast enough compared to the case where all the data is written into the same MonoSendMany (before, it was one MonoSendMany per Flux generated by concatMap; after your changes it is a single MonoSendMany for everything, so whenever the channel says "I'm full" we have to wait and only prefetch after that). (Apologies if the above reads hard, it is ~1 AM local time.)

What about flatMap then?

Actually, it can be a good alternative and I guess we can land it as an improvement. So, your code is good; let's try to migrate from concatMap to flatMap and see if nothing breaks.
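
For context, a rough sketch of the two shapes being compared in this discussion. The helper names (delegate, mtu, alloc(), shouldFragment, fragmentFrame) are borrowed from the snippets quoted in this thread; this is an approximation, not the actual 1.0.x sources:

  import io.netty.buffer.ByteBuf;
  import io.netty.buffer.ByteBufAllocator;
  import io.rsocket.DuplexConnection;
  import io.rsocket.frame.FrameHeaderCodec;
  import io.rsocket.frame.FrameType;
  import org.reactivestreams.Publisher;
  import reactor.core.publisher.Flux;
  import reactor.core.publisher.Mono;

  abstract class FragmentationSendSketch {
    final DuplexConnection delegate;
    final int mtu;

    FragmentationSendSketch(DuplexConnection delegate, int mtu) {
      this.delegate = delegate;
      this.mtu = mtu;
    }

    abstract ByteBufAllocator alloc();

    abstract boolean shouldFragment(FrameType frameType, int readableBytes);

    abstract Publisher<ByteBuf> fragmentFrame(
        ByteBufAllocator alloc, int mtu, ByteBuf frame, FrameType frameType);

    // Before: one delegate.send(...) call -- and one reactor-netty MonoSendMany -- per frame.
    Mono<Void> sendBefore(Publisher<ByteBuf> frames) {
      return Flux.from(frames)
          .concatMap(
              frame -> {
                FrameType frameType = FrameHeaderCodec.frameType(frame);
                if (!shouldFragment(frameType, frame.readableBytes())) {
                  return delegate.send(Mono.just(frame));
                }
                return delegate.send(fragmentFrame(alloc(), mtu, frame, frameType));
              })
          .then();
    }

    // After: fragment inline and hand a single Publisher to a single delegate.send(...).
    Mono<Void> sendAfter(Publisher<ByteBuf> frames) {
      return delegate.send(
          Flux.from(frames)
              .concatMap(
                  frame -> {
                    FrameType frameType = FrameHeaderCodec.frameType(frame);
                    if (!shouldFragment(frameType, frame.readableBytes())) {
                      return Flux.just(frame);
                    }
                    return Flux.from(fragmentFrame(alloc(), mtu, frame, frameType));
                  }));
    }
  }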

@OlegDokuka OlegDokuka added this to the 1.0.4 milestone Mar 3, 2021
@koldat
Contributor Author

koldat commented Mar 3, 2021

Changed to flatMap. Performance is almost the same (5 seconds vs. 30 or more without the fix). Tests pass (locally).

Member

@OlegDokuka OlegDokuka left a comment

LGTM

@OlegDokuka OlegDokuka self-requested a review March 4, 2021 09:59
Member

@OlegDokuka OlegDokuka left a comment

From my private discussion with @rstoyanchev, we figured out that flatMap may break frame ordering, so frames for the same stream could potentially be reordered, which is something we don't want to have. Actually, it turned out that the previous implementation may do the same, so we need to iterate a little more to ensure we have good performance and do not break frame ordering.

@rstoyanchev
Contributor

@koldat the head-of-line and performance issues with fragmentation are well-known limitations in 1.0.x that required the significant rework done for 1.1; take a look at #761 and the issues linked to it. Given there are no easy solutions, this will likely remain a limitation in 1.0.x, and you'll need to upgrade to 1.1 to get the benefits of the rework. You mentioned Spring Boot 2.3, which has 3 months of support remaining as well, so you'll need to upgrade to 2.4, which is based on RSocket Java 1.1.

As @OlegDokuka mentioned, using flatMap is likely an issue and we can't roll it in with only 3 months of OSS support left, but if it works for you, feel free to apply it in your environment. Another idea: I don't know whether reducing the fragmentation size to less than 60K might give you slightly better performance for the remaining time before you upgrade.

@OlegDokuka
Member

OlegDokuka commented Mar 4, 2021

@koldat as it turned out, you were right about the behavior; we were kind of exploiting a reactor-netty bug that shuffled frames, and that bug is fixed now, which means we do have head-of-line blocking at the moment, which is unfortunate.

We chatted with @rstoyanchev and figured out that we can use flatMap; however, we need to put a groupBy operator in front of it to ensure that flatMap does not reorder frames for the same streamId.

I suggest doing the following:

delegate.send(
    Flux.from(frames)
        .groupBy(frame -> FrameHeaderCodec.streamId(frame), Integer.MAX_VALUE)
        .flatMap(
            groupedById ->
                groupedById.concatMap(
                    frame -> {
                      FrameType frameType = FrameHeaderCodec.frameType(frame);
                      int readableBytes = frame.readableBytes();
                      if (!shouldFragment(frameType, readableBytes)) {
                        return Flux.just(frame);
                      }

                      return logFragments(
                          Flux.from(fragmentFrame(alloc(), mtu, frame, frameType)));
                    }),
            Integer.MAX_VALUE));

Can you please check whether that solution is still good enough for you?

@OlegDokuka
Member

OlegDokuka commented Mar 4, 2021

If the above does not work well for you, I guess we can stick with concatMap (instead of flatMap) and just state that we have a head-of-line blocking problem in 1.0.x even though we support fragmentation (which is useless in that case).

@koldat
Contributor Author

koldat commented Mar 4, 2021

I think groupBy is not a good idea, as it keeps every stream in the groupBy operator forever (internally in a Map).

I would go with concatMap; as you have said, it is the same performance and has the least chance of introducing an issue. Scaling this can easily be done with load balancing (more connections). Still, I do not think it is an issue, because the connection serves only the application that uses it, so a fully utilized wire cannot go any faster; having a way to interleave the streams does not increase the final throughput.

@rstoyanchev yes, 3 months sounds short, but some deployments do not move that fast, especially in production. Yes, we plan to move forward and upgrade, but we also want the current version to be stable and performant. Regarding the comment on changing the fragmentation size: that is actually the problem. Setting any value causes this issue; it does not matter what the value is.

Should I switch back to concatMap, or do you not want to include it in the release?

@OlegDokuka
Member

OlegDokuka commented Mar 4, 2021

I think groupBy is not a good idea, as it keeps every stream in the groupBy operator forever (internally in a Map).

Technically, we can track the end of the stream, so it will not be a problem, and we can cancel the inner Flux when we see the terminal frame for that streamId. I will try to do that for 1.0.5 (if we end up doing 1.0.5)
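
A pure-Reactor illustration of that idea, with strings standing in for frames and a hypothetical COMPLETE marker playing the role of the terminal frame; takeUntil completes (and cancels) the group, so groupBy can drop the finished stream from its internal map:

  import reactor.core.publisher.Flux;

  public class GroupByWithTermination {
    public static void main(String[] args) {
      // Simulated frames in "streamId:type" form; COMPLETE ends that stream.
      Flux<String> frames =
          Flux.just("1:NEXT", "2:NEXT", "1:NEXT", "1:COMPLETE", "2:NEXT", "2:COMPLETE");

      frames
          .groupBy(f -> f.split(":")[0])
          .flatMap(
              group ->
                  group
                      // Emit the terminal frame, then complete the group so it is
                      // not retained by groupBy forever.
                      .takeUntil(f -> f.endsWith("COMPLETE"))
                      // Per-frame work (e.g. fragmentation) would go here.
                      .map(f -> "send " + f))
          .doOnNext(System.out::println)
          .blockLast();
    }
  }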

Should I switch back to concatMap, or do you not want to include it in the release?

@koldat yes, please

@koldat
Contributor Author

koldat commented Mar 4, 2021

Change done

@OlegDokuka OlegDokuka changed the title from "Fix performance degradation when fragmentation is used (#994)" to "fixes performance degradation when fragmentation is used" on Mar 4, 2021
@OlegDokuka OlegDokuka merged commit e4d62b6 into rsocket:1.0.x Mar 4, 2021
@OlegDokuka
Member

OlegDokuka commented Mar 4, 2021

@koldat thanks for your contribution

@koldat koldat deleted the fragment_fix branch March 4, 2021 14:26