Hi there; I posted in Slack about this and I think this is the right place for feedback?
We recently had a routine cluster operation which took down the RabbitMQ node that held the log_message_saving queue. The jobs service retried a few times, but eventually gave up and stopped the consumer.
Because this failure wasn't surfaced to the cluster, Kubernetes left the pod running as-is, and we ended up with 16 million messages in the log_message_saving queue as whatever was producing them kept publishing.
Ideally, the liveness probe responder would check that this queue consumer is still running and report a failure externally when it isn't, so that the cluster can restart the pod. Something like the sketch below is what I have in mind.
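For what it's worth, here's a rough sketch of that check, assuming the jobs service is a Spring Boot application consuming via a Spring AMQP listener container (I haven't verified this against the actual code, and the bean/class names are made up for illustration):

```java
// Hypothetical sketch: expose consumer liveness through a Spring Boot Actuator
// HealthIndicator that reports DOWN once the listener container has stopped.
import org.springframework.amqp.rabbit.listener.AbstractMessageListenerContainer;
import org.springframework.boot.actuate.health.Health;
import org.springframework.boot.actuate.health.HealthIndicator;
import org.springframework.stereotype.Component;

@Component
public class LogMessageConsumerHealthIndicator implements HealthIndicator {

    // Listener container consuming from log_message_saving
    // (illustrative name; the real bean depends on the service's config).
    private final AbstractMessageListenerContainer logMessageListenerContainer;

    public LogMessageConsumerHealthIndicator(
            AbstractMessageListenerContainer logMessageListenerContainer) {
        this.logMessageListenerContainer = logMessageListenerContainer;
    }

    @Override
    public Health health() {
        // isRunning() goes false once the container gives up and stops itself,
        // which is exactly the state that went unnoticed in this incident.
        if (logMessageListenerContainer.isRunning()) {
            return Health.up().build();
        }
        return Health.down()
                .withDetail("queue", "log_message_saving")
                .withDetail("reason", "listener container stopped")
                .build();
    }
}
```

If the service already exposes Spring Boot's liveness group (/actuator/health/liveness), an indicator like this could be included in that group so the kubelet's livenessProbe fails and the pod gets restarted, rather than sitting there with a stopped consumer.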
As an aside, I'm surprised that this consumer lives in this service, since jobs can't be run replicated (I assume because the cleanup jobs would then run in parallel), which means we can't scale out to recover from a large message backlog. Is there a reason it's in this service and not somewhere else?
@cailyoung Scaling the cleanup jobs doesn't sound feasible: they touch the data being deleted in many places and would require synchronization between replicas. It's easier for us to leave the cleanup jobs unscaled. An alternative would be to convert them into serverless calls.