[🐛 BUG]: wait until all in-flight tasks are complete after SIGTERM during the grace period #1776
Labels: B-missing feature, beta-nominated, C-feature-accepted, P-temporal, R-beta
Plugin
Temporal
I have an idea!
After RR receives SIGTERM, multiple Activity errors occur during the grace period:
It would be great if RR completed in-flight tasks before shutting down.
UPD:
After testing, this appears to be a bug in 2023.3.3 (and probably in all 2023.3.* releases).
After receiving the SIGTERM signal, some workers keep consuming Temporal Activity Tasks for a while (hundreds of them) and then fail them all with an error at the same moment:
Example:
RR received the SIGTERM signal:
There are 3 "destroy signal received" records because we have the http, temporal, and rpc plugins enabled (or because of several pools in the temporal plugin), not because multiple SIGTERMs were sent.
Activity task event logs:
The Activity is configured without retries, so there was only one attempt.
Here we can see that the Activity was scheduled after the RR worker received the SIGTERM signal, yet the worker still consumed it and failed it (this is only one of hundreds of failed activities for the same RR worker).
The RR configuration was the same for both RR versions (2022.2.2 and 2023.3.3).
The bug cannot be reproduced on RR version 2022.2.2 and never happened on previous versions in our production environment (we have been using the same configuration in production for about 1.5 years).