Event Hubs: Performance degradation between 5.6.0 and 5.7.0 #20841

Closed
Tracked by #18819
conniey opened this issue Apr 20, 2021 · 10 comments
Labels: amqp, customer-reported, Event Hubs, pillar-performance, pillar-reliability

conniey commented Apr 20, 2021

From #19698:

I replaced in my project
azure-messaging-eventhubs: 5.6.0 and azure-messaging-eventhubs-checkpointstore-blob: 1.5.0
with
azure-messaging-eventhubs: 5.7.0 and azure-messaging-eventhubs-checkpointstore-blob: 1.6.0
and unfortunately I see a performance decrease of up to 50%.

I also tested
azure-messaging-eventhubs: 5.6.0 and azure-messaging-eventhubs-checkpointstore-blob: 1.6.0
as well as
azure-messaging-eventhubs: 5.7.0 and azure-messaging-eventhubs-checkpointstore-blob: 1.5.0
with the same performance decrease.

I haven't looked into this in detail yet.
It would be nice if you could check this behavior on your side.
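
For reference, a sketch of how the reported combination could be declared in a Maven pom.xml (the com.azure group ID is the standard one for these artifacts; swap the versions to test the other combinations listed above):

<!-- Sketch: the 5.7.0 / 1.6.0 combination reported above. -->
<dependency>
    <groupId>com.azure</groupId>
    <artifactId>azure-messaging-eventhubs</artifactId>
    <version>5.7.0</version>
</dependency>
<dependency>
    <groupId>com.azure</groupId>
    <artifactId>azure-messaging-eventhubs-checkpointstore-blob</artifactId>
    <version>1.6.0</version>
</dependency>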

Related #20791

conniey added the Event Hubs, pillar-performance, pillar-reliability, amqp, and customer-reported labels on Apr 20, 2021

the-mod commented Apr 20, 2021

@conniey: sorry for the late reply, and thanks for creating a new issue.

I ran my test app with 3 replicas on AKS 1.19.9 with 3 nodes of Standard_D8as_v4.
Each JVM (zulu-openjdk 16) had 3 GB of heap space assigned.
The code was compiled for Java 13 and uses Spring Boot 2.3.2.RELEASE.
For each run I send 10 million messages to an event hub (let's call it the input event hub) that my app reads from.
After processing, my app writes the results round-robin to two event hubs (the output event hubs).
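
For illustration, a hypothetical sketch of that round-robin forwarding (the class name, event hub names, connection string handling, and the use of synchronous producer clients are assumptions of mine, not details from this issue):

import com.azure.messaging.eventhubs.EventData;
import com.azure.messaging.eventhubs.EventHubClientBuilder;
import com.azure.messaging.eventhubs.EventHubProducerClient;
import java.util.Collections;
import java.util.concurrent.atomic.AtomicLong;

public class RoundRobinForwarder {
    private final EventHubProducerClient[] outputs;
    private final AtomicLong counter = new AtomicLong();

    public RoundRobinForwarder(String connectionString) {
        // Two producer clients, one per output event hub (names are placeholders).
        this.outputs = new EventHubProducerClient[] {
            new EventHubClientBuilder()
                .connectionString(connectionString, "output-eventhub-1")
                .buildProducerClient(),
            new EventHubClientBuilder()
                .connectionString(connectionString, "output-eventhub-2")
                .buildProducerClient()
        };
    }

    // Sends one processed result to the next output event hub in round-robin order.
    public void forward(EventData result) {
        int index = (int) (counter.getAndIncrement() % outputs.length);
        outputs[index].send(Collections.singletonList(result));
    }
}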

  • dark blue line: incoming messages of the input event hub
  • turquoise line: outgoing messages of the input event hub; these are the messages read by my app
  • red/blue lines: incoming messages of the output event hubs; these are the messages my app is writing
  • the resolution of the chart is 1 minute

Here are my results:
[chart: benchmarking results]

We can see that the SDK combinations mentioned above seem to hit a limit of about 1 million messages per minute when reading/receiving.
Let me know if I can provide more information.
Regards

mikeharder (Member) commented:

@the-mod: We believe this performance regression should be fixed in [email protected] and [email protected]. Could you please try upgrading to these versions and let us know if it has fixed your issue?


the-mod commented Aug 27, 2021

Hi @mikeharder, thanks for pinging me.

I tried the versions you mentioned and saw reactor.core.Exceptions$OverflowException exceptions being thrown:

reactor.core.Exceptions$OverflowException:
    at reactor.core.Exceptions.failWithOverflow (Exceptions.java:220)
    at reactor.core.publisher.FluxWindowTimeout$WindowTimeoutSubscriber.onNext (FluxWindowTimeout.java:241)
    at reactor.core.publisher.FluxPeek$PeekSubscriber.onNext (FluxPeek.java:200)
    at reactor.core.publisher.FluxDoFinally$DoFinallySubscriber.onNext (FluxDoFinally.java:130)
    at reactor.core.publisher.FluxPeek$PeekSubscriber.onNext (FluxPeek.java:200)
    at reactor.core.publisher.FluxMap$MapSubscriber.onNext (FluxMap.java:120)
    at com.azure.messaging.eventhubs.implementation.AmqpReceiveLinkProcessor.drainQueue (AmqpReceiveLinkProcessor.java:486)
    at com.azure.messaging.eventhubs.implementation.AmqpReceiveLinkProcessor.drain (AmqpReceiveLinkProcessor.java:447)
    at com.azure.messaging.eventhubs.implementation.AmqpReceiveLinkProcessor.lambda$onNext$8 (AmqpReceiveLinkProcessor.java:261)
    at reactor.core.publisher.LambdaSubscriber.onNext (LambdaSubscriber.java:160)
    at reactor.core.publisher.FluxOnBackpressureBufferStrategy$BackpressureBufferDropOldestSubscriber.innerDrain (FluxOnBackpressureBufferStrategy.java:270)
    at reactor.core.publisher.FluxOnBackpressureBufferStrategy$BackpressureBufferDropOldestSubscriber.drain (FluxOnBackpressureBufferStrategy.java:234)
    at reactor.core.publisher.FluxOnBackpressureBufferStrategy$BackpressureBufferDropOldestSubscriber.onNext (FluxOnBackpressureBufferStrategy.java:199)
    at reactor.core.publisher.FluxFlatMap$FlatMapMain.tryEmit (FluxFlatMap.java:543)
    at reactor.core.publisher.FluxFlatMap$FlatMapInner.onNext (FluxFlatMap.java:984)
    at reactor.core.publisher.MonoCreate$DefaultMonoSink.success (MonoCreate.java:160)
    at com.azure.core.amqp.implementation.ReactorReceiver.lambda$new$0 (ReactorReceiver.java:78)
    at com.azure.core.amqp.implementation.handler.DispatchHandler.onTimerTask (DispatchHandler.java:34)
    at com.azure.core.amqp.implementation.ReactorDispatcher$WorkScheduler.run (ReactorDispatcher.java:184)
    at org.apache.qpid.proton.reactor.impl.SelectableImpl.readable (SelectableImpl.java:118)
    at org.apache.qpid.proton.reactor.impl.IOHandler.handleQuiesced (IOHandler.java:61)
    at org.apache.qpid.proton.reactor.impl.IOHandler.onUnhandled (IOHandler.java:390)
    at com.azure.core.amqp.implementation.handler.CustomIOHandler.onUnhandled (CustomIOHandler.java:41)
    at org.apache.qpid.proton.engine.BaseHandler.onReactorQuiesced (BaseHandler.java:87)
    at org.apache.qpid.proton.engine.BaseHandler.handle (BaseHandler.java:206)
    at org.apache.qpid.proton.engine.impl.EventImpl.dispatch (EventImpl.java:108)
    at org.apache.qpid.proton.engine.impl.EventImpl.delegate (EventImpl.java:129)
    at org.apache.qpid.proton.engine.impl.EventImpl.dispatch (EventImpl.java:114)
    at org.apache.qpid.proton.reactor.impl.ReactorImpl.dispatch (ReactorImpl.java:324)
    at org.apache.qpid.proton.reactor.impl.ReactorImpl.process (ReactorImpl.java:291)
    at com.azure.core.amqp.implementation.ReactorExecutor.run (ReactorExecutor.java:86)
    at reactor.core.scheduler.SchedulerTask.call (SchedulerTask.java:68)
    at reactor.core.scheduler.SchedulerTask.call (SchedulerTask.java:28)
    at java.util.concurrent.FutureTask.run (FutureTask.java:264)
    at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run (ScheduledThreadPoolExecutor.java:304)
    at java.util.concurrent.ThreadPoolExecutor.runWorker (ThreadPoolExecutor.java:1130)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run (ThreadPoolExecutor.java:630)
    at java.lang.Thread.run (Thread.java:831)

mikeharder (Member) commented:

The OverflowException in [email protected] is a known issue and we are working on a fix.

As a workaround, I believe the EventProcessorClientBuilder.processEvent() and EventProcessorClientBuilder.processEventBatch() overloads that do not accept a maxWaitTime parameter should avoid the OverflowException, if those would work in your application.
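
A minimal sketch of that workaround, assuming a blob checkpoint store as in the versions above; the connection strings, entity names, and the batch size of 100 are placeholders, not values from this thread:

import com.azure.messaging.eventhubs.EventProcessorClient;
import com.azure.messaging.eventhubs.EventProcessorClientBuilder;
import com.azure.messaging.eventhubs.checkpointstore.blob.BlobCheckpointStore;
import com.azure.storage.blob.BlobContainerAsyncClient;
import com.azure.storage.blob.BlobContainerClientBuilder;

public class BatchWithoutMaxWaitTime {
    public static void main(String[] args) {
        // Blob container backing the checkpoint store (placeholder connection string and container name).
        BlobContainerAsyncClient blobClient = new BlobContainerClientBuilder()
            .connectionString(System.getenv("STORAGE_CONNECTION_STRING"))
            .containerName("checkpoints")
            .buildAsyncClient();

        EventProcessorClient processor = new EventProcessorClientBuilder()
            .connectionString(System.getenv("EVENTHUB_CONNECTION_STRING"), "input-eventhub") // placeholders
            .consumerGroup("$Default")
            .checkpointStore(new BlobCheckpointStore(blobClient))
            // Overload without maxWaitTime: a batch is delivered once maxBatchSize events are available.
            .processEventBatch(batchContext -> {
                batchContext.getEvents().forEach(event -> {
                    // process the event here
                });
                batchContext.updateCheckpoint();
            }, 100)
            .processError(errorContext ->
                System.err.println("Error on partition "
                    + errorContext.getPartitionContext().getPartitionId()
                    + ": " + errorContext.getThrowable()))
            .buildEventProcessorClient();

        processor.start();
    }
}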


the-mod commented Sep 8, 2021

@mikeharder

I tried processEventBatch() without the maxWaitTime parameter as suggested.
It ran stably at a high level of performance.

I only got some errors in the log without any stack trace:

com.azure.core.amqp.implementation.ReactorDispatcher - ReactorDispatcher instance is closed.


mikeharder commented Sep 8, 2021

@the-mod: I see existing issues with the error "ReactorDispatcher instance is closed", but they are all closed: #19698, #19753

@conniey, @anuchandy: Do you know more about this issue and whether it has been fixed?


conniey commented Sep 10, 2021

This may be a false warning (one we need to suppress), because we emit that message any time work is scheduled on a closed reactor. It's normal for reactor instances to be closed while we are recreating connections, etc.

Does this message impact your application? Do you see it not recovering?


the-mod commented Sep 13, 2021

@conniey The app was running fine; I only saw this popping up in the logs.


conniey commented Sep 13, 2021

Thanks for confirming! @anuchandy and I were discussing how useful this log message is, since it seems to add noise rather than point to a root cause.


conniey commented Sep 21, 2021

It sounds like we were able to solve the issue. We are still looking at resolving the overflow exception via #23950. Please feel free to open another issue if it crops up again. Thanks!

@conniey conniey closed this as completed Sep 21, 2021
@github-actions github-actions bot locked and limited conversation to collaborators Apr 12, 2023