
bug: the worker runs into a dead end in events.lua #10431

Closed · nanamikon opened this issue Nov 1, 2023 · 1 comment · Fixed by #10550
Labels: bug (Something isn't working)

nanamikon (Contributor) commented Nov 1, 2023

Current Behavior

One of the workers runs into a dead end (a busy loop).

Expected Behavior

The worker should not run into a dead end.

Error Logs

The Lua thread stack is as follows (dumped twice with gdb's lbt; the stack is identical after continuing and interrupting again, which suggests the light thread is stuck at events.lua:353):

(gdb) lbt
C:ngx_http_lua_ngx_sleep
builtin#21
@/scripts//deps/share/lua/5.1/resty/worker/events.lua:353
@/scripts//deps/share/lua/5.1/resty/healthcheck.lua:1156
@/scripts//deps/share/lua/5.1/resty/healthcheck.lua:621
builtin#21
@/scripts//deps/share/lua/5.1/resty/healthcheck.lua:505
@/scripts//deps/share/lua/5.1/resty/healthcheck.lua:527
@/scripts/apisix/balancer.lua:235
@/scripts/apisix/balancer.lua:363
@/scripts/apisix/init.lua:830
=balancer_by_lua:2
(gdb) c
Continuing.
^C
Program received signal SIGINT, Interrupt.
0x00007f08bd849000 in _Unwind_Find_FDE () from /lib/x86_64-linux-gnu/libgcc_s.so.1
(gdb) lbt
C:ngx_http_lua_ngx_sleep
builtin#21
@/scripts//deps/share/lua/5.1/resty/worker/events.lua:353
@/scripts//deps/share/lua/5.1/resty/healthcheck.lua:1156
@/scripts//deps/share/lua/5.1/resty/healthcheck.lua:621
builtin#21
@/scripts//deps/share/lua/5.1/resty/healthcheck.lua:505
@/scripts//deps/share/lua/5.1/resty/healthcheck.lua:527
@/scripts/apisix/balancer.lua:235
@/scripts/apisix/balancer.lua:363
@/scripts/apisix/init.lua:830

The C-level stack of the worker process is as follows:

(gdb) c
Continuing.

Breakpoint 1, lj_str_new (L=0x405b6380, 
    str=0x47b72010 "API disabled in the context of balancer_by_lua*-gz/upstreams/740001-cbca25b9-2f64-46b0-80f5-62f5cb408ea1:state:10.112.230.60:8080808080r4LYLNew!10.124.117.66direction=OUTdirection=IN475&clusters=&dire"..., lenx=47) at lj_str.c:351
351	in lj_str.c
(gdb) bt
#0  lj_str_new (L=0x405b6380, 
    str=0x47b72010 "API disabled in the context of balancer_by_lua*-gz/upstreams/740001-cbca25b9-2f64-46b0-80f5-62f5cb408ea1:state:10.112.230.60:8080808080r4LYLNew!10.124.117.66direction=OUTdirection=IN475&clusters=&dire"..., lenx=47) at lj_str.c:351
#1  0x00007f08bfa2305a in lj_buf_str (sb=0x405b6460, L=0x405b6380) at lj_buf.h:195
#2  lj_strfmt_pushvf (L=L@entry=0x405b6380, fmt=<optimized out>, argp=argp@entry=0x7fff7f224028) at lj_strfmt.c:590
#3  0x00007f08bfa1966e in luaL_error (L=0x405b6380, fmt=<optimized out>) at lj_err.c:1097
#4  0x00007f08bfa14d73 in lj_BC_FUNCC () from /usr/local/openresty/luajit/lib/libluajit-5.1.so.2
#5  0x00007f08bfa281ed in lua_pcall (L=L@entry=0x405b6380, nargs=nargs@entry=0, nresults=nresults@entry=1, errfunc=errfunc@entry=1) at lj_api.c:1145
#6  0x000000000052c413 in ngx_http_lua_balancer_by_chunk (L=0x405b6380, r=0x1e12400) at ../ngx_lua-0.10.21/src/ngx_http_lua_balancer.c:189
#7  0x000000000052cbe9 in ngx_http_lua_balancer_get_peer (pc=0x1cc2388, data=0x1d8a528) at ../ngx_lua-0.10.21/src/ngx_http_lua_balancer.c:456
#8  0x0000000000450c68 in ngx_event_connect_peer (pc=pc@entry=0x1cc2388) at src/event/ngx_event_connect.c:34
#9  0x000000000048581b in ngx_http_upstream_connect (r=0x1e12400, u=0x1cc2378) at src/http/ngx_http_upstream.c:1559
#10 0x0000000000481c58 in ngx_http_upstream_handler (ev=<optimized out>) at src/http/ngx_http_upstream.c:1310
#11 0x00000000004589db in ngx_epoll_process_events (cycle=<optimized out>, timer=<optimized out>, flags=<optimized out>)
    at src/event/modules/ngx_epoll_module.c:901
#12 0x000000000044f213 in ngx_process_events_and_timers (cycle=cycle@entry=0x16c6fb0) at src/event/ngx_event.c:257
#13 0x0000000000456bb2 in ngx_worker_process_cycle (cycle=0x16c6fb0, data=<optimized out>) at src/os/unix/ngx_process_cycle.c:806
#14 0x000000000045543c in ngx_spawn_process (cycle=cycle@entry=0x16c6fb0, proc=proc@entry=0x456b40 <ngx_worker_process_cycle>, data=data@entry=0x3, 
    name=name@entry=0x5df3cd "worker process", respawn=respawn@entry=-3) at src/os/unix/ngx_process.c:199
#15 0x00000000004570ac in ngx_start_worker_processes (cycle=cycle@entry=0x16c6fb0, n=8, type=type@entry=-3) at src/os/unix/ngx_process_cycle.c:392
#16 0x00000000004578d4 in ngx_master_process_cycle (cycle=cycle@entry=0x16c6fb0) at src/os/unix/ngx_process_cycle.c:138
#17 0x000000000042e419 in main (argc=<optimized out>, argv=<optimized out>) at src/core/nginx.c:386
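
Note the string being formatted in the lj_str_new breakpoint above: "API disabled in the context of balancer_by_lua*". For reference, a minimal sketch (an assumed configuration, not taken from this issue) of how that error class arises: ngx.sleep is one of the yielding APIs that lua-nginx-module disables in the balancer phase, and a surrounding pcall traps the error instead of failing the request.

upstream backend {
    server 0.0.0.1;   # placeholder address; the Lua balancer picks the real peer
    balancer_by_lua_block {
        -- ngx.sleep cannot yield here, so pcall returns false plus the
        -- "API disabled in the context of balancer_by_lua*" message
        local ok, err = pcall(ngx.sleep, 0.01)
        ngx.log(ngx.ERR, "sleep in balancer: ", tostring(ok), " ", tostring(err))
        assert(require("ngx.balancer").set_current_peer("127.0.0.1", 8080))
    }
}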

Steps to Reproduce

Refer to lib/resty/worker/events.lua (tag 1.0.0)
https://github.com/Kong/lua-resty-worker-events/blob/1.0.0/lib/resty/worker/events.lua

  local count = 0
  local cache_data = {}
  local cache_err = {}
  -- in case an event id has been published, but we're fetching it before
  -- its data was posted and we have to wait, we don't want the next
  -- event to timeout before we get to it, so go and cache what's
  -- available, to minimize lost data
  while _last_event < event_id do
    count = count + 1
    _last_event = _last_event + 1
    --debug("fetching event", _last_event)
    cache_data[count], cache_err[count] = get_event_data(_last_event)
  end
  local expire = now() + _wait_max
  for idx = 1, count do
    local data = cache_data[idx]
    local err = cache_err[idx]
    while not data do
      if err then
        log(ERR, "worker-events: error fetching event data: ", err)
        break
      else
        -- just nil, so must wait for data to appear
        if now() >= expire then
          break
        end
        -- wait and retry
        -- if the `sleep` function is unavailable in the current openresty
        -- 'context' (eg. 'init_worker'), then the pcall fails. We're not
        -- checking the result, but will effectively be doing a busy-wait
        -- by looping until it hits the time-out, or the data is retrieved
        _busy_polling = true  -- need to flag because `sleep` will yield control
                              -- and another coroutine might re-enter
        pcall(sleep, _wait_interval)
        _busy_polling = nil
        data, err = get_event_data(_last_event - count + idx)
      end
    end
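
For context, the _wait_interval and _wait_max upvalues above come from the library's configure() call. A sketch of how they are set (the option names exist in lua-resty-worker-events 1.0.0; the values shown are illustrative, not APISIX's actual settings):

local we = require "resty.worker.events"
local ok, err = we.configure({
    shm = "worker-events",   -- lua_shared_dict holding the event queue
    interval = 0.1,          -- timer interval for polling new events
    wait_interval = 0.5,     -- becomes _wait_interval: pause between data retries
    wait_max = 0.5,          -- becomes _wait_max: bounds the `expire` horizon above
})
if not ok then
    ngx.log(ngx.ERR, "failed to configure worker events: ", err)
end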

If no data is found in cache_data, the loop pcalls ngx.sleep in the balancer_by_lua context, where that API is not allowed. The pcall therefore fails immediately, without yielding; and because ngx.now() only returns a refreshed timestamp after the event loop regains control (or after an explicit ngx.update_time()), now() never advances. The now() >= expire check can then never become true, and the worker spins in this loop forever: a dead end.
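
One way to break the lock-up, sketched below, is to check whether the pcall actually slept: if it failed, nothing yielded and the cached clock is stale, so refresh it explicitly to let the timeout fire. (This is an illustrative assumption, not necessarily the patch that upstream actually applied.)

        _busy_polling = true
        local ok = pcall(sleep, _wait_interval)
        if not ok then
          -- sleep is disabled in this phase, so no yield happened and
          -- ngx.now() still returns the stale cached time; force a
          -- refresh so the `expire` deadline can actually be reached
          ngx.update_time()
        end
        _busy_polling = nil
        data, err = get_event_data(_last_event - count + idx)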

I found that lua-resty-worker-events has already fixed this bug upstream, but the version bundled with APISIX 2.15.3, and with the latest APISIX, is still 1.0.0. So isn't it time to update the bundled worker-events?

Environment

  • APISIX version (run apisix version): 2.15.3
  • Operating system (run uname -a): ubuntu 16
  • OpenResty / Nginx version (run openresty -V or nginx -V):
  • etcd version, if relevant (run curl http://127.0.0.1:9090/v1/server_info):
  • APISIX Dashboard version, if relevant:
  • Plugin runner version, for issues related to plugin runners:
  • LuaRocks version, for installation issues (run luarocks --version):
moonming (Member) commented
I am considering removing lua-resty-worker-events from Apache APISIX and using lua-resty-events instead. This would not only fix the bug but also let us upgrade in step with upstream. The disadvantage is that it requires more time to prepare and test.
@nanamikon What do you think?
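
For reference, lua-resty-events replaces the shared-dict polling model with a unix-domain-socket broker, so workers are pushed events instead of sleeping in a poll loop like the one above. A rough sketch of that style of API (every name below is an assumption from memory; check the library's README for the real signatures):

local events = require("resty.events").new({
    listening = "unix:/tmp/events.sock",  -- hypothetical broker socket path
})

-- in init_worker_by_lua*: connect this worker to the broker
local ok, err = events:init_worker()

-- push-based delivery: the callback runs when an event arrives,
-- with no poll/sleep loop in request-handling contexts
events:subscribe("healthcheck", "target-status", function(data)
    ngx.log(ngx.INFO, "got event: ", data)
end)

events:publish("all", "healthcheck", "target-status", "10.0.0.1:8080 is up")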
