fix: worker not exited when executing quit or reload command #9909

jiangfucheng · 2023-07-26T13:15:34Z

Description

Fixes #9802

Checklist

I have explained the need for this PR and the problem it solves
I have explained the changes or the new features added to this PR
I have added tests corresponding to this change
I have updated the documentation to reflect this change
I have verified that this change is backward compatible (If not, please discuss on the APISIX mailing list first)

monkeyDluffy6017 · 2023-07-27T01:50:57Z

How long does the long connection prevent the quit operation?

kingluo · 2023-07-27T01:57:47Z

@jiangfucheng Please show your steps to reproduce this issue.

jiangfucheng · 2023-07-27T02:01:20Z

> How long does the long connection prevent the quit operation?

According to my test, the worker never exits.

jiangfucheng · 2023-07-27T02:03:05Z

> @jiangfucheng Please show your steps to reproduce this issue.

1. make run
2. make reload (or make quit)
3. ps -ef | grep nginx to check whether the old worker has exited

apisix/core/config_etcd.lua

kingluo · 2023-08-03T14:09:05Z

@jiangfucheng
The code looks OK.
Please check the failing CI test: t/node/healthcheck-stop-checker.t.

apisix/core/config_etcd.lua

jiangfucheng · 2023-08-08T12:58:11Z

t/node/healthcheck-stop-checker.t

@@ -256,7 +247,6 @@ ok
 --- grep_error_log eval
 qr/create new checker: table: 0x|try to release checker: table: 0x/
 --- grep_error_log_out
-try to release checker: table: 0x


This line can be removed, for the following reasons:

1. The health checker is removed after the worker exits, since its state is stored in memory. That is why TEST 5 does not print "try to release checker: table: 0x" as the first line.
2. Why is "try to release checker: table: 0x" printed if we add sleep(1) before the worker exits?
- Because the log is printed by the old worker. The new worker is created before the old worker is killed, and during that window both the old and the new worker receive events. This is easy to confirm by printing a debug log before calling fire_all_clean_handlers:

    if pre_index then
        local pre_val = self.values[pre_index]
        if pre_val then
            -- debug log: shows which process (by pid) still holds clean handlers
            log.info("sync_data: check pre_val: ", inspect(pre_val.clean_handlers),
                     worker_id_str, " pid: ", ngx.worker.pid())
            config_util.fire_all_clean_handlers(pre_val)
        end
    end

Logs:

    2023/08/08 18:04:31 [info] 77582#1004031: *312 [lua] config_etcd.lua:741: sync_data(): sync_data: check pre_val: { { f = <function 1>, id = 1 }, _id = 2 } worker_id: 0 pid: 77582, context: ngx.timer
    2023/08/08 18:04:31 [info] 77620#1004305: *447 [lua] config_etcd.lua:741: sync_data(): sync_data: check pre_val: {} worker_id: 0 pid: 77620, context: ngx.timer

The etcd event is received by two processes that both report worker_id 0 but have different PIDs. We can easily confirm that the one with a non-empty clean_handlers list (pid 77582) is the old worker.

3. Why did these test cases pass in the old version (before #9456 was merged)?
- Because in the old version the worker did not exit immediately either; it exited about 60s after make quit/reload, so the reason is the same as above.

kingluo · 2023-08-14T08:36:11Z

apisix/core/config_etcd.lua

@@ -257,6 +260,30 @@ local function run_watch(premature)
 end


+local function run_watch(premature)
+    ::restart::


No! The watch routine must not be restarted!
Because it is a stateful routine (e.g. it tracks the revision the last watch started from), it should start exactly once and notify all child watchers that it has started.

@jiangfucheng
Instead, just wait for check_worker_th: if it exits, the worker process is exiting, so kill run_watch_th, and that's it.
Please update the code again, thank you very much!

What if the thread do_run_watch crashes?

> What if the thread do_run_watch crashes?

It's not supposed to crash.
In fact, almost none of the timers have a crash guard.
And even if run_watch crashed, restarting it would not fix anything, because it is a stateful routine.

Frankly, restarting it in this scenario is over-engineering.

Agreed with @kingluo, restarting is unnecessary.

Updated, thank you for your review.❤️
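The approach agreed on in this thread (wait for check_worker_th, then kill run_watch_th, and never restart the stateful watch) can be sketched roughly as below. This is a minimal illustration assuming OpenResty's ngx.thread.spawn/wait/kill, ngx.worker.exiting and ngx.timer.at APIs; it is not the code merged in this PR. watch_once and start_watch are hypothetical helpers, and the real etcd watch is reduced to a sleep. The idea: a watch loop that never returns keeps its timer callback running, which presumably is what held the old worker alive after quit or reload, so the callback instead blocks on a small checker thread and kills the watch thread once the worker starts exiting.

```lua
-- Sketch only: assumes OpenResty (ngx.thread.*, ngx.worker.exiting,
-- ngx.timer.at). watch_once and start_watch are HYPOTHETICAL helpers;
-- the real etcd watch in apisix/core/config_etcd.lua is far more involved.

-- Hypothetical stand-in for one round of the stateful etcd watch.
local function watch_once()
    ngx.sleep(1)
end

-- Long-lived watch routine: started once and never restarted,
-- because it carries state (e.g. the revision to resume from).
local function do_run_watch()
    while true do
        watch_once()
    end
end

-- Returns as soon as this worker begins a graceful shutdown
-- (triggered by quit or reload).
local function check_worker()
    while not ngx.worker.exiting() do
        ngx.sleep(0.1)
    end
end

local function run_watch(premature)
    if premature then
        return
    end

    local run_watch_th, err = ngx.thread.spawn(do_run_watch)
    if not run_watch_th then
        ngx.log(ngx.ERR, "failed to spawn do_run_watch: ", err)
        return
    end

    local check_worker_th
    check_worker_th, err = ngx.thread.spawn(check_worker)
    if not check_worker_th then
        ngx.log(ngx.ERR, "failed to spawn check_worker: ", err)
        ngx.thread.kill(run_watch_th)
        return
    end

    -- Block until the checker returns, i.e. until the worker is exiting.
    local ok
    ok, err = ngx.thread.wait(check_worker_th)
    if not ok then
        ngx.log(ngx.ERR, "check_worker thread failed: ", err)
    end

    -- Kill the watch thread so this timer callback can finish and no
    -- longer holds the old worker alive.
    local killed, kill_err = ngx.thread.kill(run_watch_th)
    if not killed then
        ngx.log(ngx.ERR, "failed to kill do_run_watch: ", kill_err)
    end
end

-- Hypothetical entry point: call once from init_worker_by_lua_block.
local function start_watch()
    local ok, err = ngx.timer.at(0, run_watch)
    if not ok then
        ngx.log(ngx.ERR, "failed to create watch timer: ", err)
    end
end
```

In init_worker_by_lua_block, start_watch() would be called once after these definitions. Waiting on a separate checker thread, rather than polling ngx.worker.exiting() inside the watch loop, leaves the stateful watch code itself untouched.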

kingluo · 2023-08-14T14:30:03Z

@jiangfucheng The code looks OK. I will approve it once CI passes.

jiangfucheng · 2023-08-16T14:15:07Z

@monkeyDluffy6017 @kingluo Please take a look again, thanks.❤️

moonming · 2023-08-17T02:35:18Z

@kingluo please take a look again

jiangfucheng added 4 commits July 26, 2023 21:14

fix: worker not exited when executing quit or reload command

5ec9f64

fix lint

9c3aeb3

refactor

3268145

fix

c1e3b2b

monkeyDluffy6017 requested a review from kingluo July 27, 2023 01:51

kingluo reviewed Jul 27, 2023

apisix/core/config_etcd.lua (outdated)

kingluo reviewed Jul 27, 2023

apisix/core/config_etcd.lua (outdated)

fixg

9407ed5

monkeyDluffy6017 added the discuss label Aug 3, 2023

kingluo reviewed Aug 3, 2023

apisix/core/config_etcd.lua (outdated)

jiangfucheng added 2 commits August 3, 2023 21:15

fix

ea1d312

Merge branch 'master' into worker_cant_exit

b06da5b

monkeyDluffy6017 added the wait for update label and removed the discuss label Aug 4, 2023

jiangfucheng added 4 commits August 6, 2023 15:41

debug

995a7fe

test

ab00691

fix

658cc1c

fix

5169849

monkeyDluffy6017 requested a review from kingluo August 7, 2023 06:46

kingluo previously approved these changes Aug 7, 2023

monkeyDluffy6017 reviewed Aug 7, 2023

apisix/core/config_etcd.lua (outdated)

monkeyDluffy6017 reviewed Aug 7, 2023

apisix/core/config_etcd.lua

test

960b0e4

jiangfucheng dismissed kingluo’s stale review via 960b0e4 August 7, 2023 15:24

jiangfucheng marked this pull request as draft August 7, 2023 15:24

jiangfucheng added 2 commits August 7, 2023 23:25

fix

b270856

fix

4fb91d1

jiangfucheng marked this pull request as ready for review August 8, 2023 12:50

jiangfucheng commented Aug 8, 2023

monkeyDluffy6017 previously approved these changes Aug 14, 2023

monkeyDluffy6017 added the approved label and removed the wait for update label Aug 14, 2023

monkeyDluffy6017 requested a review from kingluo August 14, 2023 07:33

kingluo requested changes Aug 14, 2023

fix

6eaec15

jiangfucheng dismissed monkeyDluffy6017’s stale review via 6eaec15 August 14, 2023 13:59

monkeyDluffy6017 approved these changes Aug 16, 2023

kingluo approved these changes Aug 17, 2023

moonming merged commit 5d6bde2 into apache:master Aug 17, 2023

jiangfucheng deleted the worker_cant_exit branch August 23, 2023 06:47

jensonfunfun mentioned this pull request Nov 27, 2023

bug: nginx relaod but worker does not exit #10554

Closed
