Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[🐛 BUG]: wait all in-flight tasks are complete after SIGTERM during the grace period #1776

Closed
pfy-oleksii-storozhylov opened this issue Nov 10, 2023 · 0 comments · Fixed by temporalio/roadrunner-temporal#441
Assignees
Labels
B-missing feature Bug: missing feature after implementation beta-nominated Nominated for backporting to the RR in the beta channel. C-feature-accepted Category: Feature discussed and accepted P-temporal Plugin: Temporal R-beta Release: Nominated for backporting to the RR in the beta channel.
Milestone

Comments

@pfy-oleksii-storozhylov
Copy link

pfy-oleksii-storozhylov commented Nov 10, 2023

Plugin

Temporal

I have an idea!

After the SigTerm is received by RR during the grace period there are multiple Activity errors:

activity_pool_execute_activity: Workers watcher stopped:
	static_pool_exec:
	static_pool_exec:
	worker_watcher_get_free_worker

It would be great to let RR to complete in-flight tasks before shutting down

UPD:
After the testing, It seems that there is a bug on 2023.3.3 (probably on 2023.3.*)

Some workers after receiving SIGTERM signal can consume hundreds of Temporal Activity Tasks for a while and fail them with an error at the same moment:

activity_pool_execute_activity: Workers watcher stopped:
	static_pool_exec:
	static_pool_exec:
	worker_watcher_get_free_worker

Example:
RR received SIGTERM signal

2023-11-10 16:35:11.155	 Activity error.          roadrunner-temporal-74f9b86d98-hdk8w Temporal.Activity 
...
2023-11-10 16:34:17.459	 destroy signal received  roadrunner-temporal-74f9b86d98-hdk8w  
2023-11-10 16:34:17.422	 destroy signal received  roadrunner-temporal-74f9b86d98-hdk8w  
2023-11-10 16:34:17.393  Failed to poll for task. roadrunner-temporal-74f9b86d98-hdk8w  
2023-11-10 16:34:17.393	 destroy signal received  roadrunner-temporal-74f9b86d98-hdk8w  

There are 3 destroy signal received records because we have http, temporal and rpc plugins enabled (or because of several pools in temporal plugin), but not because of the multiple SIGTERMs.

Activity task events logs:

2023-11-10 EET 16:35:11.13 Scheduled
2023-11-10 EET 16:35:11.14 Started
2023-11-10 EET 16:35:11.15 Failed

Activity is configured without retries so there was only one attempt.

Here we can see that the Activity event was scheduled after the RR worker received the SIGTERM signal, but anyway, it was consumed and failed by the RR worker (this is only one failed activity from hundreds for the same RR worker).
RR configuration was the same for both RR versions (2022.2.2 and 2023.3.3)

version: "3"

rpc:
  listen: tcp://0.0.0.0:6001

server:
  command: "php /srv/app/bin/console worker:temporal"
  env:
    APP_RUNTIME: Shared\Infrastructure\TemporalRunner\Runtime
    APP_ROLE: "temporal"
    LOG_PATH: php://stderr

http:
  # host and port separated by semicolon
  address: 0.0.0.0:8080
  pool:
    num_workers: 1
    command: "php /srv/app/public/index.php"

temporal:
  address: "${TEMPORAL_ADDRESS}"
  namespace: "${TEMPORAL_NAMESPACE}"
  activities:
    num_workers: 3
    supervisor:
      max_worker_memory: 100
      ttl: 10800s
  metrics:
    address: 0.0.0.0:8082
    type: "summary"

metrics:
  address: 0.0.0.0:8081

logs:
  output: stderr
  encoding: console
  level: info
  mode: production

The bug can not be reproduced on the RR version 2022.2.2 and never happen on previous versions (we are using the same configuration for about 1.5 years in production) on our prod env.

@pfy-oleksii-storozhylov pfy-oleksii-storozhylov added the C-feature-request Category: feature requested, but need to be discussed label Nov 10, 2023
@rustatian rustatian added B-missing feature Bug: missing feature after implementation C-feature-accepted Category: Feature discussed and accepted beta-nominated Nominated for backporting to the RR in the beta channel. R-beta Release: Nominated for backporting to the RR in the beta channel. P-temporal Plugin: Temporal and removed C-feature-request Category: feature requested, but need to be discussed labels Nov 10, 2023
@rustatian rustatian added this to the v2023.3.5 milestone Nov 10, 2023
@rustatian rustatian moved this to 🏗 In progress in Jira 😄 Nov 10, 2023
@rustatian rustatian changed the title [💡 FEATURE REQUEST]: wait all in-flight temporal tasks are complete after sigterm during the grace period [💡 FEATURE REQUEST]: wait all in-flight tasks are complete after SIGTERM during the grace period Nov 10, 2023
@rustatian rustatian changed the title [💡 FEATURE REQUEST]: wait all in-flight tasks are complete after SIGTERM during the grace period [🐛 BUG]: wait all in-flight tasks are complete after SIGTERM during the grace period Nov 13, 2023
@github-project-automation github-project-automation bot moved this from 🏗 In progress to ✅ Done in Jira 😄 Nov 13, 2023
@rustatian rustatian mentioned this issue Nov 16, 2023
6 tasks
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
B-missing feature Bug: missing feature after implementation beta-nominated Nominated for backporting to the RR in the beta channel. C-feature-accepted Category: Feature discussed and accepted P-temporal Plugin: Temporal R-beta Release: Nominated for backporting to the RR in the beta channel.
Projects
Status: ✅ Done
Development

Successfully merging a pull request may close this issue.

2 participants