fix: worker not exited when executing quit or reload command #9909
How long does the long-lived connection block the quit operation?
@jiangfucheng Please show your steps to reproduce this issue.
According to my test, the worker never exits.
@jiangfucheng
@@ -256,7 +247,6 @@ ok
--- grep_error_log eval
qr/create new checker: table: 0x|try to release checker: table: 0x/
--- grep_error_log_out
try to release checker: table: 0x
This line can be removed, for the following reasons:
1. The healthchecker is removed after the worker exits, since its status is stored in memory. That's why TEST 5 does not print "try to release checker: table: 0x" on the first line.
2. Why is "try to release checker: table: 0x" printed if we add sleep(1) before the worker exits?
- Because the log is printed by the old worker. The new worker is created before the old worker is killed, so for a moment both the old and the new worker can receive events. This is easy to prove by printing a debug log before calling fire_all_clean_handlers:
if pre_index then
    local pre_val = self.values[pre_index]
    -- debug log: show which worker (id and pid) still holds clean handlers
    log.info("sync_data: check pre_val: ", inspect(pre_val.clean_handlers), worker_id_str, " pid: ", ngx.worker.pid())
    if pre_val then
        config_util.fire_all_clean_handlers(pre_val)
    end
end
logs:
2023/08/08 18:04:31 [info] 77582#1004031: *312 [lua] config_etcd.lua:741: sync_data(): sync_data: check pre_val: { {
f = <function 1>,
id = 1
},
_id = 2
} worker_id: 0 pid: 77582, context: ngx.timer
2023/08/08 18:04:31 [info] 77620#1004305: *447 [lua] config_etcd.lua:741: sync_data(): sync_data: check pre_val: {} worker_id: 0 pid: 77620, context: ngx.timer
We can see that the etcd events are received by two workers with worker_id 0 whose pids differ (77582 and 77620). The one whose clean_handlers field is non-empty is easily identified as the old worker.
3. Why did these test cases pass in the old version (before #9456 was merged)?
- Because in the old version the worker did not exit immediately either; it exited about 60s after make quit/reload, so the reason is the same as above.
apisix/core/config_etcd.lua
@@ -257,6 +260,30 @@ local function run_watch(premature)
end


local function run_watch(premature)
    ::restart::
No! The watch routine must not be restarted!
It is a stateful routine (e.g. it tracks the revision at which the last watch started), so it should start once and notify all child watchers that it has started.
@jiangfucheng
Instead, just wait for check_worker_th: if it exits, the worker process is exiting, so kill run_watch_th, and that's it.
Please update the code again, thank you very much!
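The wait-then-kill approach described in this comment can be sketched roughly as follows. This is a minimal sketch, not necessarily the code that was merged: the thread names check_worker_th and run_watch_th and the routine name do_run_watch come from this thread, while passing do_run_watch in as a parameter and the 0.1s polling interval are assumptions made for illustration. It relies on OpenResty's ngx.thread.spawn, ngx.thread.wait, ngx.thread.kill, ngx.worker.exiting and ngx.sleep, so it only runs in a context that allows light threads, such as a timer callback.

```lua
-- Sketch only: do_run_watch is passed in as a parameter for illustration.
local function run_watch(premature, do_run_watch)
    -- start the stateful watch routine exactly once
    local run_watch_th, err = ngx.thread.spawn(do_run_watch, premature)
    if not run_watch_th then
        ngx.log(ngx.ERR, "failed to spawn do_run_watch: ", err)
        return
    end

    -- poll until the worker starts shutting down (quit or reload)
    local check_worker_th, err2 = ngx.thread.spawn(function()
        while not ngx.worker.exiting() do
            ngx.sleep(0.1)  -- polling interval is an assumption
        end
    end)
    if not check_worker_th then
        ngx.log(ngx.ERR, "failed to spawn check_worker: ", err2)
        ngx.thread.kill(run_watch_th)
        return
    end

    -- block until check_worker_th returns, i.e. the worker is exiting
    ngx.thread.wait(check_worker_th)

    -- stop the watch routine instead of restarting it
    ngx.thread.kill(run_watch_th)
end
```

Because the watch routine is spawned only once, its state (such as the start revision) is never re-initialized; the second light thread exists solely to detect that the worker is exiting.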
What if the do_run_watch thread crashes?
"What if the thread do_run_watch crashes?"
It's not supposed to crash.
In fact, almost none of the timers have a crash guard.
And even if run_watch crashes, you cannot fix it by restarting it, because it's a stateful routine.
Frankly, restarting it in this scenario is an over-design.
Agreed with @kingluo, the restart is unnecessary.
Updated, thank you for your review.❤️
@jiangfucheng The code is OK. I will approve it after the CI finishes successfully.
@monkeyDluffy6017 @kingluo Please take a look again, thanks.❤️
@kingluo please take a look again
* upstream/master: (77 commits)
  docs: Update admin-api.md (apache#10056)
  ci: fix a bug that can not open nginx.pid (apache#10061)
  feat: remove rust dependency by rollback lua-resty-ldap on master (apache#9936)
  docs: fix grpc-transcode.md error (apache#10059)
  feat: upgrade lua dependencies (apache#10051)
  fix: rollback lua-resty-session to 3.10 (apache#10046)
  feat: upgrade resty-redis-cluster from 1.02-4->1.05-1 (apache#10041)
  feat: update lua library (apache#10037)
  fix: worker not exited when executing quit or reload command (apache#9909)
  fix: traffic split plugin not validating upstream_id (apache#10008)
  ci: update the timeout value in cli.yml (apache#10026)
  fix(tencent-cloud-cls): DNS parsing failure (apache#9843)
  chore(deps): bump actions/setup-node from 3.7.0 to 3.8.0 (apache#10025)
  feat(openid-connect): add proxy_opts attribute (apache#9948)
  perf(log-rotate): replace string.sub with string.byte (apache#9984)
  fix(ci): replace github action in update-labels.yml (apache#9987)
  fix: can't sync etcd data if key has special character (apache#9967)
  perf(aws-lambda): cache the index of the array (apache#9944)
  fix: add support for dependency installation on endeavouros (apache#9985)
  chore(ci): automate management of unresponded issues (apache#9927)
  ...
Description
Fixes #9802