Strange memory allocation after RabbitMQ-disconnect #1213
-
Does this look like the contents of any of the messages you are sending via RabbitMQ?
-
So here I am with more logs, and they brought me to a new working hypothesis on what is going on (I also figured out some things I probably could have noticed earlier). In my Wireshark capture I don't see the AMQP protocol itself, basically only TCP, with one exception: the packet that went awry. Wireshark thinks it looks like an HTTP packet, and judging by its content it certainly is one. What I don't know yet is where this stray packet comes from. I'm not sure whether it is Traefik or something during the start of the new rabbit container, though my money would be on Traefik. (Which would also explain why I cannot reproduce it locally.) I still think it is good if RabbitMQ.Client is resilient against such a stray packet, but that is a problem you don't need to follow up on.
Full Wireshark traffic captured by the client: rabbit-tcp-full.zip (the container stop is initiated around packets 141+).
A full log of the first container:
A full log of the second container:
-
Right-click on a packet and choose "Decode As" from the menu. You can set port 5674 to decode as AMQP.
-
Thank you for the full packet capture. I can see a typical AMQP connection negotiation in frames 1 through 10. I compared packet 4 and packet 151 and don't see any meaningful difference, so the fact that Wireshark mis-decodes 151 does not indicate an issue.
You can see in packets 152 through 156 that the connection that was going to start using client port 49635 is shut down cleanly. I also don't see anything in the RabbitMQ logs that corresponds to this connection shutting down early (it would be logged). Then a new connection attempt is made using port 49636, and that is shut down too. Then you see 49637, and so on. Again, nothing in the RabbitMQ logs.
Your load balancer should be logging these connection attempts. If you can increase its verbosity to see what the heck it's doing, that would be great. My guess is that the LB is causing this issue.
I should have a beta release of version 6.4.0 soon for you to try in your environment. Have a good weekend.
-
Have a good weekend as well, and thanks for looking into it :-)
-
No rush, but if you get the chance to take a traffic capture on the client and on the RabbitMQ node at the same time, it should show that RabbitMQ is not getting the reconnection attempts. At least, that's what I think is going on.
-
Given the provided information, I'm not convinced this is a bug in this library unless it can be shown that this client rents a memory pool with an enormous size (in the above code). I say this because it's more likely that some other part of your application is renting an extremely large pool, and it just happens that it's also being used by this client to read a frame. The evidence for this is the content of the rented pool, which appears to be:
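(For context, a minimal sketch of the renting mechanism being discussed, using the standard ArrayPool&lt;byte&gt;.Shared; the sizes come from the numbers reported in this thread, not from the library's code:)

```csharp
using System;
using System.Buffers;

class RentDemo
{
    static void Main()
    {
        // Whoever calls Rent gets back an array at least as large as requested,
        // and that array stays reachable until it is returned (or dropped).
        byte[] small = ArrayPool<byte>.Shared.Rent(16);
        Console.WriteLine(small.Length); // >= 16, a small pooled array
        ArrayPool<byte>.Shared.Return(small);

        // A single Rent call with an enormous minimum length is enough to
        // produce the ~1.3 GB array seen in the dotMemory snapshot, regardless
        // of which component made the call. (This really allocates ~1.3 GB;
        // lower the number if you only want to see the mechanism.)
        byte[] huge = ArrayPool<byte>.Shared.Rent(1_345_270_063);
        Console.WriteLine(huge.Length);  // >= 1_345_270_063
        ArrayPool<byte>.Shared.Return(huge);
    }
}
```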
-
Wow, that is surprising!
-
I'm wondering if that content returned by MassTransit somehow makes the client read a frame header, and whether what it thinks is the "frame-size" part of that response makes it rent this huge array? If you take that response and treat it as an AMQP frame, what does it look like?
-
From what I have figured out so far, it looks like Traefik produces an HTTP Bad Request response during container start/failover. To me it looks like RabbitMQ.Client tries to interpret it as an AMQP packet, resulting in the problem: the payloadSize it reads (which is 0x502F312E in hex) corresponds to the ASCII bytes "P/1." of the "HTTP/1.1" in that response.
The part starting at offset 0036 (so 48 54 54 ...) decoded as UTF-8 is the text of the HTTP response ("HTT...").
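To make the arithmetic visible, here is a small stand-alone sketch (not the library's actual reader code) that parses the first seven bytes of such a response the way an AMQP 0-9-1 frame reader would, and shows how the reported array length of 1345270063 falls out of it:

```csharp
using System;
using System.Buffers.Binary;
using System.Text;

class FrameHeaderDemo
{
    static void Main()
    {
        // First bytes of the response the load balancer sends back during failover.
        byte[] data = Encoding.ASCII.GetBytes("HTTP/1.1 400 Bad Request\r\n");

        // AMQP 0-9-1 frame header: type (1 octet), channel (2 octets),
        // payload size (4 octets, big-endian); the payload is followed by a
        // frame-end octet (0xCE).
        byte frameType   = data[0];                                                  // 'H' = 0x48 = 72 (valid frame types are only 1, 2, 3 and 8)
        ushort channel   = BinaryPrimitives.ReadUInt16BigEndian(data.AsSpan(1, 2));  // "TT"   -> 0x5454     = 21588
        uint payloadSize = BinaryPrimitives.ReadUInt32BigEndian(data.AsSpan(3, 4));  // "P/1." -> 0x502F312E = 1,345,270,062

        Console.WriteLine($"type={frameType} channel={channel} payloadSize={payloadSize}");

        // A reader that trusts payloadSize and rents a buffer of payloadSize + 1
        // bytes (payload plus frame-end octet) asks the array pool for exactly
        // 1,345,270,063 bytes, i.e. the array length reported in this discussion.
    }
}
```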
-
What I'm wondering is why the RabbitMQ client even gets that far. Doesn't it go through the protocol handshake again after disconnecting, or is there some magic going on in Traefik?
-
Yes, we see the AMQP 0-9-1 connection header sent over and over. At some point, though, the client does make it past that. You can see it in this capture: several incomplete reconnect attempts, then the one using port 49637 reads the invalid frame with the huge payload size. I will dig into this further, but I agree with @ChristianSteu that this is due to the load balancer, and the best we can do is be a bit more defensive in the code.
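(For illustration only, a sketch of what such a defensive check could look like; this is not the actual 6.4.0 change, and maxAllowedPayload is a made-up parameter, not an existing RabbitMQ.Client setting:)

```csharp
using System;
using System.Buffers;
using System.Buffers.Binary;

static class DefensiveFrameReader
{
    // Sketch only: validate the declared payload size against an upper bound
    // before renting a buffer for it.
    public static byte[] RentPayloadBuffer(ReadOnlySpan<byte> frameHeader, uint maxAllowedPayload)
    {
        // The declared payload size sits in octets 3..6 of the 7-octet frame header.
        uint payloadSize = BinaryPrimitives.ReadUInt32BigEndian(frameHeader.Slice(3, 4));

        if (payloadSize > maxAllowedPayload)
        {
            // Refuse the frame instead of renting a gigabyte-sized buffer;
            // the peer is clearly not speaking sane AMQP at this point.
            throw new InvalidOperationException(
                $"Declared frame payload of {payloadSize} bytes exceeds the allowed maximum of {maxAllowedPayload}.");
        }

        // +1 for the frame-end octet that follows the payload.
        return ArrayPool<byte>.Shared.Rent((int)payloadSize + 1);
    }
}
```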
-
@ChristianSteu - please test out this version in your environment: https://www.nuget.org/packages/RabbitMQ.Client/6.4.0-rc.1
Also, please see my comment here - #1213 (reply in thread)
Thanks!
-
Yeah, that is certainly much better. Unfortunately I currently see no way to configure the setting via MassTransit, but whether the setting is or can be exposed via another library is not your responsibility. It is still much better because the client no longer gets stuck in the while(true) loop, so it can recover, and while the allocation still happens, it is not permanent, so the memory isn't lost.
I will see if I can get a TCP dump of the RabbitMQ side, but I'm not familiar enough with this part of the stack to do it myself without investing too much time. I have asked a coworker whether that is feasible, but I cannot promise I can provide one. It seems fairly safe to say, though, that this is not a problem with RabbitMQ but with our Traefik setup: the same thing happens (for a longer time) if the health-check period is changed, so it seems to occur while the old container is down and the new one isn't healthy yet.
If I get a capture I'll post it, but from my side the issue is solved (at least as far as RabbitMQ.Client is concerned), so thank you for your time :-)
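(For anyone who does control the ConnectionFactory directly rather than going through MassTransit, setting the new limit would look roughly like this. MaxMessageSize is my assumption of how 6.4.0 exposes the setting; check the release notes for the exact property name.)

```csharp
using RabbitMQ.Client;

class Program
{
    static void Main()
    {
        var factory = new ConnectionFactory
        {
            HostName = "my-rabbit-host",      // placeholder host
            // Assumed property: cap the size the client will accept for an
            // incoming message instead of trusting whatever the peer declares.
            MaxMessageSize = 16 * 1024 * 1024 // e.g. 16 MiB
        };

        using var connection = factory.CreateConnection();
        // ... create channels / consumers as usual ...
    }
}
```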
-
Great, I appreciate the follow-up and will release 6.4.0 today.
-
https://github.com/rabbitmq/rabbitmq-dotnet-client/releases/tag/v6.4.0
-
I'm tracking down a problematic behavior of my application using RabbitMQ (via MassTransit).
Normally everything works fine, but we recently had a short outage of our RabbitMQ container and the application failed to reconnect, spontaneously allocated ~1.3 GB of memory, and its CPU usage rose to a constant 50% (while it normally needs only a few percent at most). Unfortunately this doesn't occur every time a disconnect happens, which makes it quite a bit harder to figure out...
Investigating the issue so far makes me believe these are network shenanigans that aren't handled well by RabbitMQ.Client: the memory is one byte array of length 1345270063, and the first 96 bytes decoded to UTF-8 are:
while the remainder of it seems to be all zeroes. I was able to memory-profile a run with dotMemory, and the allocation stack is the reason I'm here (though I had previously started a discussion on MassTransit):
I could not reproduce it on a locally run rabbitmq-docker-node.
Tested/affected versions:
The following is the last output of the rabbitmq-container shutting down before the incident happened:
I couldn't get any useful log output from the client; the only warnings that pop up are the same ones that get logged when everything works fine (that is, RabbitMQ being unavailable for some time, but the application reconnecting afterwards without issues):
Here is the full client project code I used while trying to reproduce the issue (apart from the host/credentials being changed to point at the other rabbit node), which can still trigger it (though unfortunately not guaranteed):
Any help/solution would be greatly appreciated. I'm still trying to further isolate/pinpoint the issue, but I have the feeling I'm going to have to Wireshark the network traffic to get closer to the root cause...