Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Low traffic or publish-only connections are closed with "too many missed heartbeats" #106

Closed
rajmaniar opened this issue Aug 16, 2023 · 7 comments · Fixed by #111
Closed

Comments

@rajmaniar
Copy link
Contributor

When we have a connection to rabbit with very low consumer traffic or a connection with not consumers (only used for publishing) we get an occasion close with the message too many missed heartbeats

I'll add more details to this ticket as I investigate this a bit more so it's a placeholder for now.

@rajmaniar
Copy link
Contributor Author

We're also seeing the client generated exception: Server did not respond to heartbeats for 30s

@achilleasa
Copy link
Owner

According to https://www.rabbitmq.com/heartbeats.html#heartbeats-interval any incoming message (heartbeat or not) from the broker can be treated as a heartbeat. This is the approach that this client also implements.

In addition, the client sets up a timer (interval = half the heartbeat timeout) which is used to send out periodic heartbeat frames to the broker to prevent exactly the scenario you describe...

To help me diagnose the issue, can you add some logging statements in writeHeartbeat (specifically, L64 before the return, L71 before returning and L76 inside the catch block) and let me know if the client fails to write the heartbeat frame?

@mchandler-plato
Copy link

mchandler-plato commented Dec 7, 2023

So working with @rajmaniar we have found what i think is causing this issue.

It looks like reading the spec the rabbitmq server only sends at least one heartbeat every heartbeat interval the client sets:

sent if no non-heartbeat AMQP traffic is sent for longer than one heartbeat interval

So since dart_amqp also sets a timer to the same interval it can race the rabbitmq server heartbeat and close the connection just before it reads the heart beat. That is shown in the following wireshark trace:
image
(10.240.51.176 is the client, dark blue is sending to rabbitmq server)

Suggestions on how to fix:

  • set the heartbeat interval dart_amqp sends to rabbit at half the actual value so we are guaranteed to get at least one heartbeat from the server in the interval
  • add a small interval to the timer (like 2 seconds) to avoid the race condition with the server
  • switch to the model that go amqp driver uses where it counts outstanding heartbeats and closes when more than 3 are in flight

@achilleasa
Copy link
Owner

The 3rd option (counting heartbeats in flight and disconnecting once a threshold is exceeded) sounds like the more robust approach to me. I will propose a PR to implement this mechanism.

achilleasa added a commit that referenced this issue Jan 2, 2024
This commit implements a suggestion from #106 (comment).

The client-side heartbeat implementation is modified via the
introduction of a new tuning parameter called `maxMissedHeartbeats`
(default: 3).

If heartbeats are enabled and the client sends `maxMissedHeartbeats`
consecutive heartbeat messages at the interval negotiated with the
broker without receiving any message (heartbeat or regular traffic)
back, the client assumes that the server is not available and will raise
a HeartbeatFailedException.
@achilleasa
Copy link
Owner

@mchandler-plato Can you please check if #111 rectifies the issues you are having?

achilleasa added a commit that referenced this issue Jan 4, 2024
This commit implements a suggestion from #106 (comment).

The client-side heartbeat implementation is modified via the
introduction of a new tuning parameter called `maxMissedHeartbeats`
(default: 3).

If heartbeats are enabled and the client sends `maxMissedHeartbeats`
consecutive heartbeat messages at the interval negotiated with the
broker without receiving any message (heartbeat or regular traffic)
back, the client assumes that the server is not available and will raise
a HeartbeatFailedException.
@Boris-Plato
Copy link

Boris-Plato commented Jan 8, 2024

@achilleasa We work together with @mchandler-plato. I have a branch in my forked repo where I skip every second sent heartbeat, which helps catch broken heartbeat state quite fast. I just tried your PR with this scenario, and it looks to be fixed - I do not see "broken heartbeats" restarts anymore. So 1st check is good. But the main check will be when we deeply your changes to prod, I'll discuss it with the team soon. Right now, we use in prod our own "2x check timer duration" fix, which also helps for my test scenario with skipped heartbeats.

@Boris-Plato
Copy link

btw since in your PR missInterval became heartbeatPeriod x3 with default value it is very close to our current fix with hardcoded heartbeatPeriod x2 =)

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
4 participants