
[QUERY] Azure.Messaging.EventHub PartitionReceiver/EventHubConsumerClient High CPU Usage? #21099

Closed
diverges opened this issue May 16, 2021 · 5 comments
Assignees: jsquire
Labels: Client, customer-reported, Event Hubs, needs-author-feedback, question


diverges commented May 16, 2021

Context
I'm consuming from a dedicated event hub containing more than 200 partitions, using an instance of the SDK's PartitionReceiver (with TransportType = EventHubsTransportType.AmqpWebSockets and TrackLastEnqueuedEventProperties = true) for each partition. I've observed high CPU usage when a single box consumes from a large number of partitions at a high event rate (9k - 70k events per second); a minimal sketch of this setup follows the environment details below.

Query/Question
What's the best practice when consuming from many partitions and/or at a high event rate? I ran the same test targeting .NET 5 and it was not an issue: CPU stayed at 20% while consuming ~50k events per second.

Environment:

  • Azure.Messaging.EventHubs 5.3.0
  • .NET Framework 4.7.2 / .NET Framework 4.8
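
For reference, here's a minimal sketch of the receiver setup described above; the connection details, consumer group, and batch sizes are placeholders rather than the values from my actual service:

```csharp
// Minimal sketch of the described setup: one PartitionReceiver per partition,
// web sockets transport, and last-enqueued-event tracking enabled.
using System;
using System.Threading;
using System.Threading.Tasks;
using Azure.Messaging.EventHubs;
using Azure.Messaging.EventHubs.Consumer;
using Azure.Messaging.EventHubs.Primitives;

class ReceiverSetup
{
    static async Task ReadPartitionAsync(
        string connectionString, string eventHubName, string partitionId, CancellationToken token)
    {
        var options = new PartitionReceiverOptions
        {
            // The two settings called out above.
            TrackLastEnqueuedEventProperties = true,
            ConnectionOptions = { TransportType = EventHubsTransportType.AmqpWebSockets }
        };

        await using var receiver = new PartitionReceiver(
            EventHubConsumerClient.DefaultConsumerGroupName,
            partitionId,
            EventPosition.Latest,
            connectionString,
            eventHubName,
            options);

        while (!token.IsCancellationRequested)
        {
            // Placeholder batch size and wait time.
            var batch = await receiver.ReceiveBatchAsync(100, TimeSpan.FromSeconds(1), token);

            foreach (var eventData in batch)
            {
                // Process the event.
            }
        }
    }
}
```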
ghost added the needs-triage, customer-reported, and question labels on May 16, 2021
diverges (Author) commented

Could this be related? Microsoft.Azure.Amqp: lock contention is extremely high when the request rate is high

jsquire self-assigned this on May 17, 2021
jsquire added the Client and Event Hubs labels on May 17, 2021
ghost removed the needs-triage label on May 17, 2021

jsquire commented May 17, 2021

It's not possible to make any definitive statements with the current context and available information, so the best that I can do is generalize and speculate a bit.

Generally speaking, it sounds like you've hit a point where you're doing too much on a single machine and should consider a different distribution of work. The best practice for consuming from many partitions at a high rate is to spread the partitions out among different machines. Each partition requires a dedicated AMQP link to read from the service, and how that link is managed differs between the clients:

  • In its default configuration, the PartitionReceiver will open a dedicated AMQP connection and link for the partition it is associated with. With many of these running concurrently, you're potentially seeing network contention due to the high number of connections open to the same host. This is especially noticeable when using web sockets.

  • The EventHubConsumerClient represents a single AMQP connection, and each partition being read opens an AMQP link that shares that connection, taking advantage of multiplexing. Having too many partitions read concurrently can saturate the connection and cause request queuing, which triggers a higher degree of synchronization within the AMQP transport library.

It's difficult to say what the optimal number of partitions per machine is, as it will vary quite a bit with the size of the machine, the size of events, the work being done, the hosting environment, and other factors. My advice would be to start with 2-4 partitions per CPU thread and then measure and experiment to tune from there; a rough sketch of a read sized that way follows.
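
As an illustration of that sizing advice, here's what reading a partition subset through a single EventHubConsumerClient might look like; the multiplier and the Take-based assignment are illustrative placeholders, not recommendations:

```csharp
// Sketch: one EventHubConsumerClient (one AMQP connection) per host, reading a
// subset of partitions sized at roughly three partitions per CPU thread.
using System;
using System.Linq;
using System.Threading;
using System.Threading.Tasks;
using Azure.Messaging.EventHubs.Consumer;

class PartitionSubsetReader
{
    static async Task ReadAssignedPartitionsAsync(
        string connectionString, string eventHubName, CancellationToken token)
    {
        await using var consumer = new EventHubConsumerClient(
            EventHubConsumerClient.DefaultConsumerGroupName, connectionString, eventHubName);

        // Each partition read below adds an AMQP link multiplexed over the client's connection.
        var allPartitions = await consumer.GetPartitionIdsAsync(token);
        var partitionsPerHost = Environment.ProcessorCount * 3;           // starting point; measure and tune
        var assigned = allPartitions.Take(partitionsPerHost).ToList();    // real assignment logic goes here

        var readers = assigned.Select(partitionId => Task.Run(async () =>
        {
            await foreach (var partitionEvent in consumer.ReadEventsFromPartitionAsync(
                partitionId, EventPosition.Latest, token))
            {
                // Process partitionEvent.Data here.
            }
        }, token));

        await Task.WhenAll(readers);
    }
}
```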

Some other potential things that you can consider:

  • The PartitionReceiver and EventHubConsumerClient types each have a constructor overload that takes an EventHubConnection; this allows you to explicitly control how many connections are used and share them among multiple receivers to take advantage of multiplexing. You may want to experiment to find the number of connections that gives you the best performance; see the sketch after this list.

  • Experimenting with the PrefetchCount in the PartitionReceiverOptions may help you tune the buffering to reduce contention.

  • Experimenting with the PrefetchCount and CacheEventCount in the ReadEventOptions used by the EventHubConsumerClient may help you tune the buffering to reduce contention.

  • Experimenting with Server GC may help lower CPU use or at least make things more consistent and predictable.

  • Experimenting with the DefaultConnectionLimit of the ServicePointManager may help reduce contention and throttling for the number of connections open.
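
To make a few of those suggestions concrete, here's a sketch that shares an explicit EventHubConnection among several PartitionReceivers, sets PrefetchCount, and raises the DefaultConnectionLimit; the specific numbers are placeholders to experiment with, not recommendations:

```csharp
// Sketch: one explicitly managed connection shared by several receivers, plus the
// prefetch, connection-limit, and Server GC knobs mentioned above.
using System;
using System.Collections.Generic;
using System.Net;
using System.Runtime;
using Azure.Messaging.EventHubs;
using Azure.Messaging.EventHubs.Consumer;
using Azure.Messaging.EventHubs.Primitives;

class SharedConnectionSetup
{
    static IReadOnlyList<PartitionReceiver> CreateReceivers(
        string connectionString, string eventHubName, IReadOnlyList<string> partitionIds)
    {
        // Relevant mainly on .NET Framework, where outgoing connections are throttled per host.
        ServicePointManager.DefaultConnectionLimit = 64;

        // Server GC itself is enabled via <gcServer enabled="true"/> in app.config; verify it here.
        Console.WriteLine($"Server GC enabled: {GCSettings.IsServerGC}");

        // One connection shared by the receivers; the link for each partition multiplexes over it.
        var connection = new EventHubConnection(
            connectionString,
            eventHubName,
            new EventHubConnectionOptions { TransportType = EventHubsTransportType.AmqpWebSockets });

        var receiverOptions = new PartitionReceiverOptions
        {
            PrefetchCount = 100,                        // experiment: lower reduces buffering and may reduce contention
            TrackLastEnqueuedEventProperties = true
        };

        var receivers = new List<PartitionReceiver>();

        foreach (var partitionId in partitionIds)
        {
            receivers.Add(new PartitionReceiver(
                EventHubConsumerClient.DefaultConsumerGroupName,
                partitionId,
                EventPosition.Latest,
                connection,
                receiverOptions));
        }

        return receivers;
    }
}
```

When a connection is passed in explicitly like this, the receivers don't own it; it needs to be closed separately once the receivers are done.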

Some wider speculation on causes:

  • If you're seeing behavior differ by host framework, with .NET 5 and the desktop framework performing differently, then it's quite possible that you're seeing something related to the networking stack, where a large amount of effort was spent to reduce allocations and improve performance. You'd see a non-trivial increase in garbage collection on the desktop framework, which would drive up CPU use.

  • The majority of the work in a scenario like the one you're describing is going to be deserializing data from the AMQP message format into the EventData model for consumption. This kind of workload is likely to be CPU-bound in general and will trigger a high volume of allocations, which would in turn likely trigger more frequent garbage collections. (A small diagnostic sketch for confirming this follows this list.)

  • It is possible that you're seeing network contention with a large number of partitions, which would potentially be related to the lock contention within the AMQP transport library. That said, that issue has been open for 3 years and the maintainers of that library have not been convinced this is the case. We'd need data to support that speculation and either contribute to that issue or open a new one for their consideration.
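
If you'd like to test the garbage collection theory, a small diagnostic sketch like the following may help; the consume delegate and the fixed measurement window are placeholders for whatever your test harness already does:

```csharp
// Sketch: sample GC collection counts and CPU time around a fixed consumption
// window, then compare the numbers between .NET Framework and .NET 5 runs.
using System;
using System.Diagnostics;
using System.Threading.Tasks;

class GcPressureProbe
{
    static async Task MeasureAsync(Func<Task> consumeForFixedWindow)
    {
        var gen0 = GC.CollectionCount(0);
        var gen1 = GC.CollectionCount(1);
        var gen2 = GC.CollectionCount(2);
        var cpuBefore = Process.GetCurrentProcess().TotalProcessorTime;
        var stopwatch = Stopwatch.StartNew();

        await consumeForFixedWindow();   // e.g. receive events for 60 seconds, then return

        stopwatch.Stop();
        var cpuUsed = Process.GetCurrentProcess().TotalProcessorTime - cpuBefore;

        Console.WriteLine($"Elapsed: {stopwatch.Elapsed}, CPU time: {cpuUsed}");
        Console.WriteLine($"Gen0: {GC.CollectionCount(0) - gen0}, " +
                          $"Gen1: {GC.CollectionCount(1) - gen1}, " +
                          $"Gen2: {GC.CollectionCount(2) - gen2}");
    }
}
```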

jsquire added the needs-author-feedback label on May 17, 2021

diverges commented May 19, 2021

Thank you @jsquire, I appreciate the detailed response, which provided an excellent starting point for my investigation. Here are a couple of follow-up observations:

  • Reducing the PrefetchCount of the PartitionReceivers did help reduce CPU in the original scenario, but it also came with a noticeable decrease in throughput. It's also a bit tricky to reason about the relationship between PrefetchCount and CPU usage, as one also needs to consider the partitions assigned to a box.
  • The biggest change occurred when comparing the two EventHubsTransportType values, AmqpTcp and AmqpWebSockets. The latter, regardless of the assigned partition count (4, 16, or 256), would peak at an event rate of 20,000 and consume about 60% of CPU, while AmqpTcp maintained a steady 16-20% usage while consuming 50,000+ events per second (awesome!).

Using PerfView, I dug a bit deeper into the .NET 5 and .NET Framework differences for the AmqpWebSockets results. It seems that the largest difference occurs in a call to microsoft.azure.amqp!Microsoft.Azure.Amqp.AsyncIO+AsyncReader.OnReadBufferComplete(class Microsoft.Azure.Amqp.Transport.TransportAsyncCallbackArgs). .NET 5 shows about 1% inclusive time for this call, while .NET Framework sits at around 25%. I downloaded PerfView specifically for this experimentation and have maybe 4 hours of experience with it, so I'm still learning to interpret the data it provides.

[PerfView screenshot: .NET Framework 4.7.2]

[PerfView screenshot: .NET 5]

The inclusive time for this method wasn't as large when using AmqpTcp on .NET Framework, where it hovered at around 10% (still much higher than .NET 5). It's pretty hard to tell what's going on from this data alone, without an understanding of what the azure.amqp library is doing. For reference, I think this is the call that's showing up with high inclusive time.

Could such a large gap be tied to the allocation and networking improvements in .NET 5, as you mentioned? We have some scenarios that read from a remote event hub (compute on the west coast, event hub on the east coast) that benefit greatly from the AmqpWebSockets option, since AmqpTcp struggles to maintain high throughput there.

ghost added the needs-team-attention label and removed the needs-author-feedback label on May 19, 2021

jsquire commented May 19, 2021

Could such a large gap be tied to the allocation and networking improvements in .NET 5, as you mentioned?

It is very possible and, in this case, quite likely. What you're observing is the same code running on two different host frameworks with different performance characteristics. We use very few compiler-constant branches to sniff frameworks, and those are intended only to work around compatibility issues.

There was a very large amount of effort put into reducing allocations in .NET 5, along with a strong focus on performance tuning of the networking components used by ASP.NET. Though it's a bit outdated by now, this blog post by Stephen Toub highlights some of the significant areas, which have been improved further since. I don't want to link to non-authoritative sources, but there are plenty of more recent articles around performance testing. The .NET team may have additional resources to share if you decide to reach out to them directly.

Using PerfView, I dug a bit deeper into the .NET 5 and .NET Framework differences for the AmqpWebSockets results. It seems that the largest difference occurs in a call to microsoft.azure.amqp!Microsoft.Azure.Amqp.AsyncIO+AsyncReader.OnReadBufferComplete(class Microsoft.Azure.Amqp.Transport.TransportAsyncCallbackArgs)

The AMQP library was developed alongside the Azure Messaging services and was written at a time when allocations were less of a focus. There are definitely code paths within it that could be improved, but the networking primitives that it uses are provided by .NET itself; in the case of web sockets, that's the System.Net.WebSockets.WebSocket class, which I believe is where the observed path is hitting a hotspot. That said, the Event Hubs SDK is just a consumer of that package, and my knowledge of some of its deeper details is limited. You may want to consider opening an issue in the Microsoft.Azure.Amqp repository to discuss that further.

jsquire added the needs-author-feedback label and removed the needs-team-attention label on May 19, 2021
diverges (Author) commented

Thanks for the detailed explanation and references! I got help parsing the traces today and observed that a lot of the extra time spent by the .NET Framework version is in SslStream. It's kind of cool to see how much System.Net.WebSockets.WebSocket has improved in .NET 5.

github-actions bot locked and limited conversation to collaborators on Mar 27, 2023