Hi there; I posted in Slack about this and I think this is the right place for feedback?
We recently had a routine cluster operation which took down the RabbitMQ node that held the log_message_saving queue. The jobs service retried a few times, but eventually gave up and stopped the consumer.
Because this failure wasn't surfaced to the cluster, Kubernetes left the pod running as-is, and we ended up with 16 million messages in the log_message_saving queue as whatever was producing them kept publishing.
Ideally, the liveness probe responder would check that this queue consumer is still running and report a failure externally when it isn't, so that the cluster can restart the pod. Something like the sketch below is what I have in mind.
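For what it's worth, here's a rough sketch of that check, assuming the jobs service is a Spring Boot application consuming via a Spring AMQP listener container (I haven't verified this against the actual code, and the bean/class names are made up for illustration):

```java
// Hypothetical sketch: expose consumer liveness through a Spring Boot Actuator
// HealthIndicator that reports DOWN once the listener container has stopped.
import org.springframework.amqp.rabbit.listener.AbstractMessageListenerContainer;
import org.springframework.boot.actuate.health.Health;
import org.springframework.boot.actuate.health.HealthIndicator;
import org.springframework.stereotype.Component;

@Component
public class LogMessageConsumerHealthIndicator implements HealthIndicator {

    // Listener container consuming from log_message_saving
    // (illustrative name; the real bean depends on the service's config).
    private final AbstractMessageListenerContainer logMessageListenerContainer;

    public LogMessageConsumerHealthIndicator(
            AbstractMessageListenerContainer logMessageListenerContainer) {
        this.logMessageListenerContainer = logMessageListenerContainer;
    }

    @Override
    public Health health() {
        // isRunning() goes false once the container gives up and stops itself,
        // which is exactly the state that went unnoticed in this incident.
        if (logMessageListenerContainer.isRunning()) {
            return Health.up().build();
        }
        return Health.down()
                .withDetail("queue", "log_message_saving")
                .withDetail("reason", "listener container stopped")
                .build();
    }
}
```

If the service already exposes Spring Boot's liveness group (/actuator/health/liveness), an indicator like this could be included in that group so the kubelet's livenessProbe fails and the pod gets restarted, rather than sitting there with a stopped consumer.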
As an aside, I'm surprised that this consumer lives in this service, since jobs can't be run replicated (I assume because the cleanup jobs would then run in parallel), which means we can't scale out to recover from a large message backlog. Is there a reason it's in this service and not somewhere else?
@cailyoung Scaling the cleanup jobs doesn't sound feasible: they touch the data being deleted in many places and would require synchronization between replicas. It's easier for us to leave the cleanup jobs unscaled. An alternative would be to convert them into serverless calls.