reactor.core.Exceptions$OverflowException: The receiver is overrun by more signals than expected (bounded queue...) #959
Comments
Looks like it is possible with fusion enabled. I will try to arrange something similar with the
It is happening for us in more places. So far we have not found any workaround. How can we disable fusion?
@OlegDokuka why is there, in handleRequestStream/handleChannel, an assumption that the receiver queue has a bounded size of 1? I would understand it in handleRequestResponse, but not in the Flux variants. io.rsocket.core.RSocketRequester.handleRequestStream(Payload)
The handle-frame function can consume network frames as fast as it can, so it can happen that the processor has not yet dispatched the previous frame downstream. When I used an unbounded queue it started working without the exception.
There is no way to do so on your end.
Please find a full example here -> https://github.com/rsocket/rsocket-java/blob/master/rsocket-examples/src/main/java/io/rsocket/examples/transport/tcp/plugins/LimitRateInterceptorExample.java The
There is a strict Reactive Streams relationship: if you requested N elements, you clearly stated that you are ready to receive N elements. Following Reactive Streams, the producer follows your demand and produces exactly N elements (or fewer). That said, there is no reason for us (the rsocket core logic) to cache N elements anywhere before your event handler. That is why, to avoid redundant allocations, we use a queue of a single element, since we expect no values to be enqueued at all. The mentioned issue is the result of Reactor optimizations that we have not taken into account, which fuse with the UnicastProcessor queue and do not take the queue capacity into account.
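A minimal sketch, not from this thread, of the demand contract described above: by calling request(n) the subscriber declares it can consume n elements, so a compliant producer never emits beyond the outstanding demand and nothing has to be buffered on its behalf. The numbers and the printing subscriber are purely illustrative:

import org.reactivestreams.Subscription;
import reactor.core.publisher.BaseSubscriber;
import reactor.core.publisher.Flux;

public class DemandContractSketch {
  public static void main(String[] args) {
    Flux.range(1, 100).subscribe(new BaseSubscriber<Integer>() {
      @Override
      protected void hookOnSubscribe(Subscription subscription) {
        request(5); // "I am ready to receive 5 elements" -- the producer may emit at most 5
      }

      @Override
      protected void hookOnNext(Integer value) {
        System.out.println("consumed " + value);
        if (value % 5 == 0) {
          request(5); // replenish demand only once the previous batch has been consumed
        }
      }
    });
  }
}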
@OlegDokuka I am not sure I described that properly. Of course I know that N requested elements are ready to be consumed. But the problem is that there is also an asynchronous effect, so the downstream can consume, but it can take time. Our application uses the usual backpressure (512 elements) and normal operators. As I said, the issue is fixed when the queue is unbounded, because in a rare case it hits the condition where there is one element in the queue still not dispatched (randomly after 10k - 1M items received). But it also sometimes happens with much shorter streams. I tried to remove fusion everywhere. I will also try your suggestion, but again:
Here is the point. Your
The problem is that Reactor is a smart framework and tries to do lots of optimizations. Then Reactor will optimize your workflow and make it the following: PublishOn[UnicastProcessor[Queue]], which means that one less queue will be allocated. But here comes the problem. We use UnicastProcessor just because it is a convenient API that lets us send elements manually. On the other hand, we don't need to store any elements, because we know the strict relationship between publisher and subscriber, which says that if the subscriber says request(5) it is ready to consume 5; if later on it says request(10) but the previous 5 were not yet delivered, it means it is ready to receive 15 (5 + 10) (and again we don't have to store anything). The problem comes from the fact that UnicastProcessor fuses with any operator that would otherwise create its own queue. This means that if we create queue(1) and publishOn wanted queue(512), the UnicastProcessor will have no clue about that requirement; hence we see overflow exceptions because of the requirements mismatch. As I said, to avoid this issue for now, just place the
Also, feel free to migrate to 1.1.0, which provides custom implementations of the request operators and does not use UnicastProcessor anymore.
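A minimal sketch, not from this thread, of the kind of workaround referred to above: placing .hide() between the source and publishOn breaks operator fusion, so publishOn allocates its own queue instead of reusing the source's internal one. Flux.range stands in for the actual rsocket-produced stream:

import reactor.core.publisher.Flux;
import reactor.core.scheduler.Schedulers;

public class HideFusionSketch {
  public static void main(String[] args) throws InterruptedException {
    Flux.range(1, 1000)
        .hide()                                  // prevents queue fusion with the upstream
        .publishOn(Schedulers.boundedElastic())  // publishOn now uses its own bounded queue
        .subscribe(i -> { /* consume */ });

    Thread.sleep(500); // crude wait so the asynchronous pipeline can finish in this sketch
  }
}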
On the other hand, we have
@lkolisko apologies for the confusion. Let me reproduce that issue locally and then I can provide you with more updates. For now, please use either 1.0.2 or 1.1.0.
I am working with Lukas on this issue. Unfortunately we use Spring Boot, and I am not sure we can upgrade to 1.1.0; it brings some incompatibilities on interfaces. On the other hand, we tried 1.0.2 and it did not help. We placed ".hide" almost everywhere without any effect, as we cannot go under "subscribedOn" in handleRequestStream. The only fix that helped is to use an unbounded queue in handleRequestStream to cover that glitch. In the worst case there will be 2 elements in memory, but that is still better than crashing the stream.
The change (unbounded vs. one()) was introduced in 18050ed, after 1.0.2. Hmm... We will try again to see if it is really happening even with 1.0.2.
You are right, we have just verified that. The issue was that we got the same exception text (overrun by more signals than expected) from the server side as well and thus were confused. So it looked the same, but was not. It means a similar problem also exists on the other side (RSocketResponder). When we changed both server and client the issue was solved. Thanks a lot for looking into this.
@koldat one more question. Do you have any custom publishers on the producer side? I did a couple of different test cases, and I cannot reproduce this issue. Also, we have an extensive number of tests inside RSocket, which means that if we had a bug it would eventually appear on the CI, but it does not. Thus, I suspect that something is on your end. To make sure it is our bug, I would appreciate it if you share more details on what kind of operators you use: what is the producer, what is the consumer, and what Reactor operators do you use (if not Reactor, what libraries do you use in combination with rsocket)?
We use Spring Boot 2.3.5. The Flux we return is a flatMap:
where the inner flux is using fromIterable (see the sketch below for the general shape):
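A hypothetical sketch of the shape described in the previous comment (an outer flatMap whose inner Flux is built with fromIterable); the expand helper, element types, and values are placeholders rather than the application's actual code:

import java.util.Arrays;
import java.util.List;
import reactor.core.publisher.Flux;

public class FlatMapFromIterableSketch {
  // Placeholder for whatever per-key lookup the application performs.
  static List<String> expand(int key) {
    return Arrays.asList(key + "-a", key + "-b");
  }

  public static void main(String[] args) {
    Flux.range(1, 10)
        .flatMap(key -> Flux.fromIterable(expand(key))) // inner Flux built via fromIterable
        .subscribe(System.out::println);
  }
}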
@koldat Anything special on the consumer side, e.g. custom subscriber, etc? |
It is failing in two different cases:
The API call is built like this for stream:
The API call is built like this for channel (see the sketch below for the general shape):
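A hypothetical sketch of how such stream and channel calls are typically built with Spring's RSocketRequester; the route names and the String payload type are placeholders, not the ones used here:

import org.springframework.messaging.rsocket.RSocketRequester;
import reactor.core.publisher.Flux;

public class RequesterCallsSketch {

  // request-stream: one request payload, a stream of responses back
  Flux<String> stream(RSocketRequester requester, String query) {
    return requester
        .route("query.stream")        // placeholder route
        .data(query)
        .retrieveFlux(String.class);
  }

  // request-channel: a bidirectional stream of payloads
  Flux<String> channel(RSocketRequester requester, Flux<String> outbound) {
    return requester
        .route("query.channel")       // placeholder route
        .data(outbound, String.class)
        .retrieveFlux(String.class);
  }
}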
Yup. Looks like it is not related to the business logic at all.
@OlegDokuka any update on this issue? Using the 1.0.2 version we have not hit a single issue for the last month, while with 1.0.3 (queue with bounded size == 1) we were facing it almost immediately. I would still like to use the official latest version provided by Spring, which also has other fixes. Thank you very much for your support so far.
@koldat can we have a call so I can look at your application setup / how you write your dataflow, etc.? Unfortunately, I was not able to reproduce the mentioned issue with any of the setups you mentioned. So... basically, I'm stuck with the progress on this. Cheers,
@OlegDokuka We are still trying to collect more information to help you resolve this. So far it looks to me like an issue that is also happening to us with WebFlux. Basically, using limitRate (I think you are the author) together with schedulers leads to some multithreading issue, I think. You can see more information on the Reactor page, including a simple unit test. When we removed limitRate it at least started to work properly with WebFlux. I will retry the same for rsocket (I just found that).
@koldat Is it the same set of operators you are using with rsocket Java? |
Unfortunately not exactly, but we are using a similar combination, like this:
@Override
public Flux<MetadataQueryMatch> query(Mono<MetadataQueryReactiveRequest> request, RoleBasedSession session) {
request = request.publishOn(readScheduler);
// Some standard operators here
return flux.subscribeOn(readScheduler);
}
What can happen is that the server side does not follow the rules, because it emits some value twice. So in very rare cases the client throws the exception we see. Version 1.0.2 maybe just hides the Reactor defect (double emit of the same item). Just brainstorming, we have no evidence... But if it can happen in the unit test, why not in any other operator combination together with subscribeOn/publishOn? I have tried elastic and also executorService schedulers. No difference.
It can be a bug in Reactor. I will check that.
@koldat Also, I would appreciate it if you could enable rsocket frame logging and your flux logging and share those logs with me. That can help with debugging as well.
@OlegDokuka no, the latest Reactor did not help; it is another issue. The problem is still in this queue:
final UnicastProcessor<Payload> receiver = UnicastProcessor.create(Queues.<Payload>one().get());
The issue is that there are two threads: one is requesting and one is dispatching to the queue. As the request from the downstream is higher than one, there is more than one in-flight message coming to the receiver. Now imagine you are dispatching to the queue (io.rsocket.core.RSocketRequester.handleFrame(int, FrameType, ByteBuf)), but at the same time there is a "request" call from the downstream side. It locks WIP so that onNext will NOT drain. Then it immediately dispatches a second message from the wire, which ends up with two elements in the queue -> overflow.
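A minimal sketch, not the rsocket-java code itself, showing the end result of the race described above: a UnicastProcessor backed by Queues.one() rejects a second element when the first has not been drained yet. Here the drain simply has not started (no subscriber yet), standing in for the WIP being held by the concurrent request() call:

import reactor.core.publisher.UnicastProcessor;
import reactor.util.concurrent.Queues;

public class OneElementQueueOverflow {
  public static void main(String[] args) {
    UnicastProcessor<String> receiver = UnicastProcessor.create(Queues.<String>one().get());

    receiver.onNext("frame-1"); // enqueued; nothing drains it yet
    receiver.onNext("frame-2"); // queue already full -> OverflowException signalled via onError

    receiver.subscribe(
        value -> System.out.println("received: " + value),
        error -> System.err.println("terminated: " + error)); // prints the overflow error
  }
}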
Indeed. I see what you mean. Good catch 👍
Do you want me to do a PR for this, or will you handle it? I would simply revert to the default create (unbounded). It does not matter much if for a tiny bit of time there are a couple of elements in it.
@koldat thank you for offering! Let's keep what we have right now and do that differently. Having an unbounded queue behind UnboundedProcessor brings other DoS-related problems (for more info see -> #887). Thus, I don't wanna circle back to that problem in 1.0.x again and would rather relax
However, if you see what I mean by a "relaxed impl", which has to be Queue-free and WIP-free, feel free to send a PR before me. (I'm planning to work on that this weekend.)
Thanks,
#887 makes sense. What I would personally do is add a requestN assertion that checks whether the queue is drained and filled with respect to the downstream requests. If the client does not follow the rules it will throw an exception. It could maybe even be a nice operator in Reactor. The point is that even if you fix this place, the same type of attack can be applied downstream, so every downstream service would maybe need the same protection. So having that at the entry would be the best, I think. I am not sure what is meant by a "relaxed impl", but if it is easier then I would leave it up to you as an expert in this area.
Right. The more challenging part is to have a Processor API which does that + a single-subscriber assertion + a requestN assertion, though that is part of the rsocket impl. The point here is to avoid any queue on our end. Thus the first trial was implementing UnicastProcessor with Queues.one(), but it seems that we have to come back to a custom UnicastProcessor impl again... Although... there is a simpler way. What if we
Having that, we can keep
@koldat, do you wish to make a PR for that, adding a test ensuring such racing does not reproduce anymore?
There are many such issues in our production environment. Dependencies: spring-boot:2.3.12.RELEASE, rsocket:1.0.5, reactor-core:3.3.17.
public Flux<Void> voiceChat(RSocketRequester requester, @Header String callId, @Payload final Flux<byte[]> payloads) {
payloads.subscribeOn(callRxScheduler, false)
.onBackpressureBuffer(100, rxOnBufferOverflow, BufferOverflowStrategy.DROP_OLDEST)
.subscribe(consumer, errorConsumer, completeConsumer);
}
reactor.core.Exceptions$OverflowException: The receiver is overrun by more signals than expected (bounded queue...)
at reactor.core.Exceptions.failWithOverflow(Exceptions.java:221)
Suppressed: reactor.core.publisher.FluxOnAssembly$OnAssemblyException:
Assembly trace from producer [reactor.core.publisher.FluxDoFinallyFuseable] :
reactor.core.publisher.Flux.doFinally
io.rsocket.core.RSocketResponder.handleChannel(RSocketResponder.java:563)
Error has been observed at the following site(s):
|_ Flux.doFinally ⇢ at io.rsocket.core.RSocketResponder.handleChannel(RSocketResponder.java:563)
|_ Flux.doOnDiscard ⇢ at io.rsocket.core.RSocketResponder.handleChannel(RSocketResponder.java:576)
|_ Flux.from ⇢ at org.springframework.messaging.rsocket.annotation.support.MessagingRSocket.requestChannel(MessagingRSocket.java:131)
|_ Flux.map ⇢ at org.springframework.messaging.rsocket.annotation.support.MessagingRSocket.handleAndReply(MessagingRSocket.java:169)
|_ Flux.doOnSubscribe ⇢ at org.springframework.messaging.rsocket.annotation.support.MessagingRSocket.handleAndReply(MessagingRSocket.java:169)
|_ Flux.from ⇢ at org.springframework.messaging.handler.annotation.reactive.PayloadMethodArgumentResolver.extractContent(PayloadMethodArgumentResolver.java:178)
|_ Flux.map ⇢ at org.springframework.messaging.handler.annotation.reactive.PayloadMethodArgumentResolver.extractContent(PayloadMethodArgumentResolver.java:178)
|_ Flux.map ⇢ at org.springframework.messaging.handler.annotation.reactive.PayloadMethodArgumentResolver.decodeContent(PayloadMethodArgumentResolver.java:235)
|_ Flux.error ⇢ at org.springframework.messaging.handler.annotation.reactive.PayloadMethodArgumentResolver.lambda$decodeContent$2(PayloadMethodArgumentResolver.java:236)
|_ Flux.onErrorResume ⇢ at org.springframework.messaging.handler.annotation.reactive.PayloadMethodArgumentResolver.decodeContent(PayloadMethodArgumentResolver.java:236)
|_ Flux.switchIfEmpty ⇢ at org.springframework.messaging.handler.annotation.reactive.PayloadMethodArgumentResolver.decodeContent(PayloadMethodArgumentResolver.java:238)
|_ Flux.subscribeOn ⇢ at com.xxx.dm.controller.VoiceCallController.voiceChat(VoiceCallController.java:371)
|_ Flux.onBackpressureBuffer ⇢ at com.xxx.dm.controller.VoiceCallController.voiceChat(VoiceCallController.java:372)
Stack trace:
at reactor.core.Exceptions.failWithOverflow(Exceptions.java:221)
at reactor.core.publisher.UnicastProcessor.onNext(UnicastProcessor.java:394)
at io.rsocket.core.RSocketResponder.handleFrame(RSocketResponder.java:336)
at reactor.core.publisher.LambdaSubscriber.onNext(LambdaSubscriber.java:160)
at reactor.core.publisher.MonoFlatMapMany$FlatMapManyInner.onNext(MonoFlatMapMany.java:242)
at reactor.core.publisher.FluxGroupBy$UnicastGroupedFlux.drainRegular(FluxGroupBy.java:560)
at reactor.core.publisher.FluxGroupBy$UnicastGroupedFlux.drain(FluxGroupBy.java:645)
at reactor.core.publisher.FluxGroupBy$UnicastGroupedFlux.onNext(FluxGroupBy.java:685)
at reactor.core.publisher.FluxGroupBy$GroupByMain.onNext(FluxGroupBy.java:205)
at reactor.core.publisher.FluxHandleFuseable$HandleFuseableSubscriber.onNext(FluxHandleFuseable.java:178)
at reactor.core.publisher.FluxMapFuseable$MapFuseableConditionalSubscriber.onNext(FluxMapFuseable.java:287)
at reactor.core.publisher.FluxMapFuseable$MapFuseableConditionalSubscriber.onNext(FluxMapFuseable.java:287)
at reactor.netty.channel.FluxReceive.drainReceiver(FluxReceive.java:265)
at reactor.netty.channel.FluxReceive.onInboundNext(FluxReceive.java:371)
at reactor.netty.channel.ChannelOperations.onInboundNext(ChannelOperations.java:358)
at reactor.netty.channel.ChannelOperationsHandler.channelRead(ChannelOperationsHandler.java:96)
at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:379)
at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:365)
at io.netty.channel.AbstractChannelHandlerContext.fireChannelRead(AbstractChannelHandlerContext.java:357)
at io.netty.handler.codec.ByteToMessageDecoder.fireChannelRead(ByteToMessageDecoder.java:324)
at io.netty.handler.codec.ByteToMessageDecoder.fireChannelRead(ByteToMessageDecoder.java:311)
at io.netty.handler.codec.ByteToMessageDecoder.callDecode(ByteToMessageDecoder.java:432)
at io.netty.handler.codec.ByteToMessageDecoder.channelRead(ByteToMessageDecoder.java:276)
at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:379)
at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:365)
at io.netty.channel.AbstractChannelHandlerContext.fireChannelRead(AbstractChannelHandlerContext.java:357)
at io.netty.channel.DefaultChannelPipeline$HeadContext.channelRead(DefaultChannelPipeline.java:1410)
at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:379)
at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:365)
at io.netty.channel.DefaultChannelPipeline.fireChannelRead(DefaultChannelPipeline.java:919)
at io.netty.channel.epoll.AbstractEpollStreamChannel$EpollStreamUnsafe.epollInReady(AbstractEpollStreamChannel.java:795)
at io.netty.channel.epoll.EpollEventLoop.processReady(EpollEventLoop.java:480)
at io.netty.channel.epoll.EpollEventLoop.run(EpollEventLoop.java:378)
at io.netty.util.concurrent.SingleThreadEventExecutor$4.run(SingleThreadEventExecutor.java:989)
at io.netty.util.internal.ThreadExecutorMap$2.run(ThreadExecutorMap.java:74)
at io.netty.util.concurrent.FastThreadLocalRunnable.run(FastThreadLocalRunnable.java:30)
at java.lang.Thread.run(Thread.java:748)
It does not work for me, with the rsocket:1.0.5 dependency and context: #959 (comment)
Using RSocket REQUEST_STREAM to implement a query where results are streamed back to the client. The client sends a REQUEST_N frame of 512 and is successfully receiving results. Once the client successfully receives >256 results, it requests an additional 256. The communication works fine, and FrameLogger shows the expected received and sent frames. After some time, the exception above is thrown on the client on the event of an additional REQUEST_N.
This happens as part of a large application. Therefore I do not have a standalone reproduction example at the moment.
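A hypothetical sketch of a client pipeline that can produce the REQUEST_N pattern described above (an initial 512, then replenishment in chunks of 256) by applying limitRate to the retrieved stream; the route and element type are placeholders:

import org.springframework.messaging.rsocket.RSocketRequester;
import reactor.core.publisher.Flux;

public class StreamingQuerySketch {

  Flux<String> query(RSocketRequester requester, String request) {
    return requester
        .route("query.stream")        // placeholder route
        .data(request)
        .retrieveFlux(String.class)
        .limitRate(512, 256);         // caps upstream demand at 512, replenishes in batches of 256
  }
}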