
db-less kong gets an empty reply when the number of targets > 3 in the posted JSON #5869

Closed
yaoice opened this issue May 8, 2020 · 11 comments · Fixed by #5917

Comments


yaoice commented May 8, 2020

Summary

I use kong and the kubernetes ingress controller in an arm64v8 environment, but I encountered a strange problem: kubernetes-ingress-controller failed to send a POST to the kong admin API. At first I thought it was a kubernetes-ingress-controller bug, but later I found that it is a kong problem.

kong ingress controller error message:

E0507 13:35:46.952831 1 controller.go:119] unexpected failure updating Kong configuration:
posting new config to /config: making HTTP reqeust: Post http://localhost:8001/config?check_hash=1: EOF
W0507 13:35:46.952869 1 queue.go:112] requeuing tcnp/dashboard, err posting new config to /config: making HTTP reqeust: Post http://localhost:8001/config?check_hash=1: EOF

Steps To Reproduce

  1. Use all-in-one-dbless.yaml to install the kubernetes kong ingress controller
  2. Run a deployment with an ingress in k8s
  3. Scale the deployment replicas to 5; the kubernetes ingress controller will then get the error
  4. Scale the deployment replicas back to 3, wait a while, and the kubernetes ingress controller will recover

Additional Details & Logs

Kubernetes version: 1.14.6
Kong version 1.4.3
kubernetes-ingress-controller version: 0.7.1
Kong debug-level startup logs ($ kong start --vv)
Kong error logs (there seem to be no error messages): log link
Kubernetes-ingress-controller error logs: log link
Json post to kong admin api: JSON
Operating system: Ubuntu 9.1.0-2ubuntu2~16.04
Arch: arm64v8


yaoice commented May 9, 2020

The same issue occurs with kong 2.0.4 and kubernetes-ingress-controller 0.8.1 (k8s deploy yaml: yaml), but it works fine on the x86 arch. Maybe it is a bug in luajit on arm64v8?
@hbagdi


hbagdi commented May 11, 2020

ping @javierguerragiraldez @guanlan

javierguerragiraldez commented:

We recently added a workaround for a bug in OpenResty's fork of LuaJIT in #5797. Can you try it in your version of Kong?
The final fix to the JIT will be included in the next release of OpenResty.


Guojian commented May 12, 2020

@javierguerragiraldez we have tried the workaround (disabling luajit's get_bulk), but the problem still exists. I also think this is a different problem, because it is not random: it can be reproduced on any arm64 machine, and we also see a core dump in the log.

2020/05/12 12:07:21 [debug] 24#0: *5450 [lua] base.lua:465: f(): [ringbalancer 3] dns record type changed for 172.20.0.9, nil -> 1
2020/05/12 12:07:21 [debug] 24#0: *5450 [lua] base.lua:330: newAddress(): [ringbalancer 3] new address for host '172.20.0.9' created: 172.20.0.9:1337 (weight 100)
2020/05/12 12:07:21 [debug] 24#0: *5450 [lua] base.lua:529: f(): [ringbalancer 3] updating balancer based on dns changes for 172.20.0.9
2020/05/12 12:07:21 [debug] 24#0: *5450 [lua] ring.lua:238: redistributeIndices(): [ringbalancer 3] redistributed indices, size=10000, dropped=3334, assigned=3334, left unassigned=0
2020/05/12 12:07:21 [debug] 24#0: *5450 [lua] base.lua:539: f(): [ringbalancer 3] querying dns and updating for 172.20.0.9 completed
2020/05/12 12:07:21 [debug] 24#0: *5450 [lua] base.lua:780: newHost(): [ringbalancer 3] created a new host for: 172.20.0.63
2020/05/12 12:07:21 [debug] 24#0: *5450 [lua] base.lua:550: queryDns(): [ringbalancer 3] querying dns for 172.20.0.63
2020/05/12 12:07:21 [debug] 24#0: *5450 [lua] base.lua:465: f(): [ringbalancer 3] dns record type changed for 172.20.0.63, nil -> 1
2020/05/12 12:07:21 [debug] 24#0: *5450 [lua] base.lua:330: newAddress(): [ringbalancer 3] new address for host '172.20.0.63' created: 172.20.0.63:1337 (weight 100)
2020/05/12 12:07:21 [debug] 24#0: *5450 [lua] base.lua:529: f(): [ringbalancer 3] updating balancer based on dns changes for 172.20.0.63
2020/05/12 12:07:21 [notice] 1#0: signal 17 (SIGCHLD) received from 24
2020/05/12 12:07:21 [alert] 1#0: worker process 24 exited on signal 11 (core dumped)
2020/05/12 12:07:21 [notice] 1#0: start worker process 64
2020/05/12 12:07:21 [debug] 64#0: *5451 [lua] globalpatches.lua:243: randomseed(): seeding PRNG from OpenSSL RAND_bytes()
2020/05/12 12:07:21 [debug] 64#0: *5451 [lua] globalpatches.lua:269: randomseed(): random seed: 210131225321 for worker nb 0
2020/05/12 12:07:21 [debug] 64#0: *5451 [lua] events.lua:211: do_event_json(): worker-events: handling event; source=resty-worker-events, event=started, pid=64, data=nil
2020/05/12 12:07:21 [debug] 64#0: *5451 [lua] counter.lua:50: new(): start timer for shdict kong on worker 0
2020/05/12 12:07:22 [debug] 64#0: *5453 [lua] balancer.lua:99: fetching upstream: b99d933e-1cf1-5112-ac6e-1930ec327d60
2020/05/12 12:07:22 [debug] 64#0: *5453 [lua] base.lua:1381: new(): [ringbalancer 1] balancer_base created
2020/05/12 12:07:22 [debug] 64#0: *5453 [lua] ring.lua:492: new(): [ringbalancer 1] ringbalancer created
2020/05/12 12:07:22 [debug] 64#0: *5453 [lua] cache.lua:289: invalidate_local(): [DB cache] invalidating (local): 'balancer:upstreams:b99d933e-1cf1-5112-ac6e-1930ec327d60'
2020/05/12 12:07:22 [debug] 64#0: *5453 [lua] events.lua:211: do_event_json(): worker-events: handling event; source=mlcache, event=mlcache:invalidations:kong_db_cache, pid=64, data=balancer:upstreams:b99d933e-1cf1-5112-ac6e-1930ec327d60
2020/05/12 12:07:22 [debug] 64#0: *5453 [lua] cache.lua:289: invalidate_local(): [DB cache] invalidating (local): 'balancer:targets:b99d933e-1cf1-5112-ac6e-1930ec327d60'
2020/05/12 12:07:22 [debug] 64#0: *5453 [lua] events.lua:211: do_event_json(): worker-events: handling event; source=mlcache, event=mlcache:invalidations:kong_db_cache, pid=64, data=balancer:targets:b99d933e-1cf1-5112-ac6e-1930ec327d60
2020/05/12 12:07:22 [debug] 64#0: *5453 [lua] balancer.lua:125: fetching targets for upstream: b99d933e-1cf1-5112-ac6e-1930ec327d60
2020/05/12 12:07:22 [debug] 64#0: *5453 [lua] base.lua:780: newHost(): [ringbalancer 1] created a new host for: 172.20.0.64
2020/05/12 12:07:22 [debug] 64#0: *5453 [lua] base.lua:550: queryDns(): [ringbalancer 1] querying dns for 172.20.0.64
2020/05/12 12:07:22 [debug] 64#0: *5453 [lua] base.lua:465: f(): [ringbalancer 1] dns record type changed for 172.20.0.64, nil -> 1
2020/05/12 12:07:22 [debug] 64#0: *5453 [lua] base.lua:330: newAddress(): [ringbalancer 1] new address for host '172.20.0.64' created: 172.20.0.64:1337 (weight 100)
2020/05/12 12:07:22 [debug] 64#0: *5453 [lua] base.lua:529: f(): [ringbalancer 1] updating balancer based on dns changes for 172.20.0.64


Guojian commented May 13, 2020

@javierguerragiraldez @hbagdi @guanlan I have confirmed the bug comes from https://github.com/Kong/lua-resty-dns-client/blob/master/src/resty/dns/balancer/ring.lua L136

self.indices = table.move(self.indices, 1, size, 1, {})

table.move leads to a core dump in an arm64 environment whenever any backend upstream has more than 3 targets.
As far as I know, table.move was introduced in Lua 5.3, but I found that kong uses Lua 5.1 on both x86 and arm64; x86, however, works fine. Does kong patch Lua on x86, or not?
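
For reference, table.move(a1, f, e, t, a2) copies a1[f..e] into a2 starting at index t, so the call above just copies the first size entries of self.indices into a fresh table. A minimal standalone illustration (assuming a Lua 5.3 or LuaJIT build where table.move behaves correctly):

local src = { "a", "b", "c", "d", "e" }
-- copy src[1..#src] into a fresh table, starting at destination index 1
local dst = table.move(src, 1, #src, 1, {})
assert(#dst == 5 and dst[3] == "c")
-- the equivalent plain-Lua loop (the "simple loop assignment" mentioned below)
local dst2 = {}
for i = 1, #src do dst2[i] = src[i] end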

Finally, if I replace the table.move call with a simple loop assignment, the problem is solved.

javierguerragiraldez commented:

Oh, this is very interesting! table.move() is a Lua 5.3 feature, as you point out, but it was backported into LuaJIT. Probably something is not quite right when it is compiled on ARM64, or in OpenResty's fork. Maybe we should force the fallback on ARM64 until these kinks are ironed out.
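
A rough sketch of what such a forced fallback could look like (only an illustration under the assumption that jit.arch is available, as in LuaJIT; this is not the actual Kong/OpenResty patch):

local move = table.move
if jit and jit.arch == "arm64" then
  -- plain-Lua replacement that avoids the affected interpreter fast path;
  -- copies a1[f..e] into a2[t..], handling overlapping ranges like table.move
  move = function(a1, f, e, t, a2)
    a2 = a2 or a1
    if t > f and a2 == a1 then
      for i = e - f, 0, -1 do a2[t + i] = a1[f + i] end
    else
      for i = 0, e - f do a2[t + i] = a1[f + i] end
    end
    return a2
  end
end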

@hbagdi added the task/bug and task/needs-investigation (requires investigation and reproduction before classifying it as a bug or not) labels on May 13, 2020

hbagdi commented May 13, 2020

One meta question for @Guojian @yaoice: are you folks compiling the Ingress Controller for arm64 yourselves (because we don't officially supply a build for that)?

javierguerragiraldez commented:

I've confirmed that table.move() segfaults on ARM64. A simple luajit -e 'table.move({1}, 1, 1, 1, {})' is enough to trigger it.


Guojian commented May 14, 2020

@hbagdi Yes, we compile the Ingress Controller for arm64.

@javierguerragiraldez Thank you. Is there any way to disable luajit on arm64? Maybe that would make it easier for us to debug the problem.

javierguerragiraldez commented:

Is there any way to disable luajit on arm64? Maybe that would make it easier for us to debug the problem.

This issue is a bug in LuaJIT's interpreter, not in the JIT itself. Specifically, it's in the bytecode instruction used to set a table element within a "spliced in" built-in function like table.move. The best solution is to avoid table.move until it's fixed.

Since in this case it's just copying to a new table, you can use

local function copy_n(t, n)
  local o = {}
  for i = 1, n do
    o[i] = t[i]
  end
  return o
end

and replace self.indices = table.move(self.indices, 1, size, 1, {}) with self.indices = copy_n(self.indices, size). (Disclaimer: I haven't tested this; I've just submitted a patch to the LuaJIT repo.)
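
As an untested aside: on a platform where table.move does work (e.g. x86_64), copy_n can be sanity-checked against it like this:

-- hypothetical sanity check, to be run on an unaffected platform (e.g. x86_64)
local t = { 10, 20, 30, 40, 50 }
local a = copy_n(t, 5)
local b = table.move(t, 1, 5, 1, {})
for i = 1, 5 do
  assert(a[i] == b[i])
end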


Guojian commented May 14, 2020

@javierguerragiraldez thanks, I have made a patch for our kong in the arm64 environment and it works.

@hbagdi removed the task/needs-investigation (requires investigation and reproduction before classifying it as a bug or not) label on May 14, 2020