ACK timeout kills connection without getting restarted #106

D4no0 · 2021-09-24T09:52:24Z

versions:
broadway: 1.0.0
broadway_rabbitmq: 0.7.0
amqp: 2.1
elixir: 1.12
otp: 24.0.5

I have some long-running tasks that may sometimes exceed RabbitMQ's consumer_timeout, which produces this message:

09:38:26.044 [warn] AMQP channel went down with reason: {:shutdown, {:server_initiated_close, 406, "PRECONDITION_FAILED - delivery acknowledgement on channel 1 timed out. Timeout value used: 7200000 ms. This timeout value can be configured, see consumers doc guide to learn more"}}

The expected behavior would be to establish a new connection, kill the timed-out processors, and have RabbitMQ redeliver the messages.

The current behavior is that the GenServer is killed and Broadway can no longer send messages to RabbitMQ. The only fix is to restart the Broadway process.


josevalim · 2021-09-24T10:35:40Z

Thanks for the report. The log you are seeing immediately causes the client to reconnect, so I am assuming there is something more at play here: https://github.com/dashbitco/broadway_rabbitmq/blob/master/lib/broadway_rabbitmq/producer.ex#L527-L536

D4no0 · 2021-09-27T08:06:40Z

The error after that comes from a GenServer call made while the GenServer is down:

07:57:38.686 [error] ** (exit) exited in: :gen_server.call(#PID<0.6829.0>, {:call, {:"basic.ack", 6, false}, :none, #PID<0.2649.0>}, 70000)
** (EXIT) no process: the process is not alive or there's no process currently associated with the given name, possibly because its application isn't started


whatyouhide · 2023-02-16T07:26:36Z

So, we ack from a different process than the RabbitMQ producer (the processor or batcher acks). I don't think we can "save" the ack if the channel is down. What we can do, however, is have a better error message from Broadway, which is what I did with #122. I think for now that's pretty much it. 😞 Eventually the producer should reconnect.


v-anyukov · 2024-04-12T18:46:11Z

We ran into a related issue: ack timeout -> channel closed by the RabbitMQ server -> while Broadway reconnects, there are several log messages about being unable to ack/reject messages because the channel is dead -> more ack timeouts, piling up every 30 minutes (the default RabbitMQ consumer timeout). Eventually we end up with a lot of channel reconnects, but the worst part is that RabbitMQ appears to keep all Mnesia segments containing unacked messages. With a 30-minute timeout and high throughput, that eats disk space very quickly. We are going to try a short timeout, since our ingestion is intended to be pretty fast.
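For reference, the broker-side timeout discussed here is the consumer_timeout setting, given in milliseconds in rabbitmq.conf. A minimal sketch follows; the 5-minute value is an assumption chosen only for illustration, not a value recommended anywhere in this thread:

```ini
# rabbitmq.conf
# Delivery acknowledgement timeout, in milliseconds.
# 300000 ms = 5 minutes (illustrative value, not a recommendation).
# The 30-minute default mentioned above corresponds to 1800000.
consumer_timeout = 300000
```

A shorter value makes the broker close channels holding stale unacked deliveries sooner, so it should stay comfortably above the longest expected processing time for a message.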

Regarding the topic: does it make any sense to retry the ack/reject several times when the channel is not alive? Another option would be to at least give some control over the messages Broadway is unable to ack/reject, something like handle_ack_error or so.

whatyouhide added the Kind:Bug Something isn't working label Feb 8, 2023

whatyouhide added a commit that referenced this issue Feb 16, 2023

Improve error when acking on a closed channel

497b142

See #106.

whatyouhide mentioned this issue Feb 16, 2023

Improve error when acking on a closed channel #122

Merged

whatyouhide added a commit that referenced this issue Feb 16, 2023

Improve error when acking on a closed channel (#122)

d90814d

See #106.
