Low traffic or publish-only connections are closed with "too many missed heartbeats" #106

rajmaniar · 2023-08-16T18:24:55Z

When we have a connection to rabbit with very low consumer traffic or a connection with not consumers (only used for publishing) we get an occasion close with the message too many missed heartbeats

I'll add more details to this ticket as I investigate this a bit more so it's a placeholder for now.

The text was updated successfully, but these errors were encountered:

rajmaniar · 2023-08-16T18:37:17Z

We're also seeing the client generated exception: Server did not respond to heartbeats for 30s

achilleasa · 2023-08-17T20:28:59Z

According to https://www.rabbitmq.com/heartbeats.html#heartbeats-interval any incoming message (heartbeat or not) from the broker can be treated as a heartbeat. This is the approach that this client also implements.

In addition, the client sets up a timer (interval = half the heartbeat timeout) which is used to send out periodic heartbeat frames to the broker to prevent exactly the scenario you describe...

To help me diagnose the issue, can you add some logging statements in writeHeartbeat (specifically, L64 before the return, L71 before returning and L76 inside the catch block) and let me know if the client fails to write the heartbeat frame?

mchandler-plato · 2023-12-07T23:52:42Z

So working with @rajmaniar we have found what i think is causing this issue.

It looks like reading the spec the rabbitmq server only sends at least one heartbeat every heartbeat interval the client sets:

sent if no non-heartbeat AMQP traffic is sent for longer than one heartbeat interval

So since dart_amqp also sets a timer to the same interval it can race the rabbitmq server heartbeat and close the connection just before it reads the heart beat. That is shown in the following wireshark trace:

(10.240.51.176 is the client, dark blue is sending to rabbitmq server)

Suggestions on how to fix:

set the heartbeat interval dart_amqp sends to rabbit at half the actual value so we are guaranteed to get at least one heartbeat from the server in the interval
add a small interval to the timer (like 2 seconds) to avoid the race condition with the server
switch to the model that go amqp driver uses where it counts outstanding heartbeats and closes when more than 3 are in flight

achilleasa · 2024-01-02T06:18:56Z

The 3rd option (counting heartbeats in flight and disconnecting once a threshold is exceeded) sounds like the more robust approach to me. I will propose a PR to implement this mechanism.

This commit implements a suggestion from #106 (comment). The client-side heartbeat implementation is modified via the introduction of a new tuning parameter called `maxMissedHeartbeats` (default: 3). If heartbeats are enabled and the client sends `maxMissedHeartbeats` consecutive heartbeat messages at the interval negotiated with the broker without receiving any message (heartbeat or regular traffic) back, the client assumes that the server is not available and will raise a HeartbeatFailedException.

achilleasa · 2024-01-02T09:34:09Z

@mchandler-plato Can you please check if #111 rectifies the issues you are having?

This commit implements a suggestion from #106 (comment). The client-side heartbeat implementation is modified via the introduction of a new tuning parameter called `maxMissedHeartbeats` (default: 3). If heartbeats are enabled and the client sends `maxMissedHeartbeats` consecutive heartbeat messages at the interval negotiated with the broker without receiving any message (heartbeat or regular traffic) back, the client assumes that the server is not available and will raise a HeartbeatFailedException.

Boris-Plato · 2024-01-08T15:12:56Z

@achilleasa We work together with @mchandler-plato. I have a branch in my forked repo where I skip every second sent heartbeat, which helps catch broken heartbeat state quite fast. I just tried your PR with this scenario, and it looks to be fixed - I do not see "broken heartbeats" restarts anymore. So 1st check is good. But the main check will be when we deeply your changes to prod, I'll discuss it with the team soon. Right now, we use in prod our own "2x check timer duration" fix, which also helps for my test scenario with skipped heartbeats.

Boris-Plato · 2024-01-08T16:07:19Z

btw since in your PR missInterval became heartbeatPeriod x3 with default value it is very close to our current fix with hardcoded heartbeatPeriod x2 =)

achilleasa mentioned this issue Jan 2, 2024

Raise HeartbeatFailedException after missing maxMissedHeartbeats (new tuning param) consecutive heartbeats #111

Merged

achilleasa closed this as completed in #111 Jan 28, 2024

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Low traffic or publish-only connections are closed with "too many missed heartbeats" #106

Low traffic or publish-only connections are closed with "too many missed heartbeats" #106

rajmaniar commented Aug 16, 2023

rajmaniar commented Aug 16, 2023

achilleasa commented Aug 17, 2023

mchandler-plato commented Dec 7, 2023 •

edited

Loading

achilleasa commented Jan 2, 2024

achilleasa commented Jan 2, 2024

Boris-Plato commented Jan 8, 2024 •

edited

Loading

Boris-Plato commented Jan 8, 2024

Low traffic or publish-only connections are closed with "too many missed heartbeats" #106

Low traffic or publish-only connections are closed with "too many missed heartbeats" #106

Comments

rajmaniar commented Aug 16, 2023

rajmaniar commented Aug 16, 2023

achilleasa commented Aug 17, 2023

mchandler-plato commented Dec 7, 2023 • edited Loading

achilleasa commented Jan 2, 2024

achilleasa commented Jan 2, 2024

Boris-Plato commented Jan 8, 2024 • edited Loading

Boris-Plato commented Jan 8, 2024

mchandler-plato commented Dec 7, 2023 •

edited

Loading

Boris-Plato commented Jan 8, 2024 •

edited

Loading