
Cassandra error from kong.core.cluster #966

Closed
mars opened this issue Feb 8, 2016 · 12 comments

mars commented Feb 8, 2016

I've been seeing these errors coming out of Nginx:

(Edited for clarity. Removed misleading log INFO entry.)

2016/02/08 22:46:39 [notice] 91#0: signal 17 (SIGCHLD) received
2016/02/08 22:46:39 [notice] 91#0: unknown process 1476 exited with code 0
2016/02/08 22:46:39 [error] 91#0: [lua] cluster.lua:84: Cassandra error: 10.1.16.105, context: ngx.timer

This seems to indicate that the calls from Kong.init_worker() are missing the global configuration set in Kong.init().

So, I tried adding the global configuration to init_worker(), and now I see a much more descriptive error:

2016/02/08 22:45:57 [notice] 62#0: start worker process 88
2016/02/08 22:45:57 [info] 89#0: [lua] log.lua:22: info(): Refreshing local and peers info, context: init_worker_by_lua*
2016/02/08 22:45:57 [error] 88#0: init_worker_by_lua error: /app/.heroku/lualib/resty/lock.lua:153: API disabled in the context of init_worker_by_lua*
stack traceback:
 [C]: in function 'sleep'
 /app/.heroku/lualib/resty/lock.lua:153: in function 'lock'
 /app/.heroku/share/lua/5.1/cassandra.lua:51: in function 'lock_mutex'
 /app/.heroku/share/lua/5.1/cassandra.lua:1104: in function 'refresh_hosts'
 /app/.heroku/share/lua/5.1/cassandra.lua:1216: in function 'spawn_cluster'
 /app/.heroku/share/lua/5.1/kong/dao/cassandra/factory.lua:38: in function 'new'
2016/02/08 22:45:57 [error] 89#0: init_worker_by_lua error: /app/.heroku/share/lua/5.1/cassandra.lua:109: API disabled in the context of init_worker_by_lua*
stack traceback:
 [C]: in function 'tcp_sock'
 /app/.heroku/share/lua/5.1/classic.lua:63: in function 'load'
 /app/.heroku/share/lua/5.1/kong.lua:161: in function 'init_worker'
 init_worker_by_lua:1: in main chunk
 /app/.heroku/share/lua/5.1/cassandra.lua:109: in function 'new_socket'
 /app/.heroku/share/lua/5.1/cassandra.lua:261: in function 'connect'
 /app/.heroku/share/lua/5.1/cassandra.lua:509: in function 'get_first_coordinator'
 /app/.heroku/share/lua/5.1/cassandra.lua:1118: in function 'refresh_hosts'
 /app/.heroku/share/lua/5.1/cassandra.lua:1216: in function 'spawn_cluster'
 /app/.heroku/share/lua/5.1/kong/dao/cassandra/factory.lua:38: in function 'new'
 /app/.heroku/share/lua/5.1/classic.lua:63: in function 'load'
 /app/.heroku/share/lua/5.1/kong.lua:161: in function 'init_worker'
 init_worker_by_lua:1: in main chunk

A few preliminary conclusions:

  • Confusion arises from kong.core.cluster relying on globals configuration & dao
  • It seems sleep & tcp_sock are not allowed in init_worker_by_lua*
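
That observation matches the OpenResty phase rules: ngx.sleep and the cosocket API (ngx.socket.tcp) are disabled in the init_by_lua* and init_worker_by_lua* phases. A minimal sketch of guarding against this, with a hypothetical helper that is not part of Kong's code:

```lua
-- Hypothetical guard, not Kong code: check the current OpenResty phase
-- before attempting cosocket work. ngx.get_phase() returns strings such
-- as "init", "init_worker", "timer", "access", etc.
local function can_use_cosockets()
  local phase = ngx.get_phase()
  -- cosockets (and ngx.sleep) are unavailable in these phases
  return phase ~= "init" and phase ~= "init_worker"
end

if can_use_cosockets() then
  local sock = ngx.socket.tcp()
  -- ... connect to Cassandra ...
else
  ngx.log(ngx.WARN, "deferring Cassandra call: cosockets disabled in ",
          ngx.get_phase())
end
```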

As @thibaultcha & @thefosk know, I'm operating Kong on an experimental fork using an external supervisor #928, so this issue may be self-inflicted or a canary indicator of a latent issue.


mars commented Feb 8, 2016

BTW, I see the same error when running with Kong's internal process manager, via kong start -c $KONG_CONF:

2016/02/08 23:52:13 [notice] 91#0: signal 17 (SIGCHLD) received
2016/02/08 23:52:13 [notice] 91#0: unknown process 241 exited with code 0
2016/02/08 23:52:13 [error] 91#0: [lua] cluster.lua:84: Cassandra error: 10.1.60.153, context: ngx.timer


mars commented Feb 8, 2016

…and these errors from a different function in cluster.lua:

2016/02/08 23:57:21 [error] 158#0: [lua] cluster.lua:40: Cassandra error: 10.1.13.186:9042, context: ngx.timer
2016/02/08 23:57:24 [error] 158#0: [lua] cluster.lua:40: Cassandra error: 10.1.60.153, context: ngx.timer
2016/02/08 23:57:27 [error] 158#0: [lua] cluster.lua:40: Cassandra error: 10.1.16.105, context: ngx.timer

@thibaultcha (Member)

Hi @mars,

The first log you are seeing is not an error but an info message (the authentication logs).

The second log is odd; lua-cassandra should print more descriptive error messages than this.

So, I tried adding the global configuration to init_worker(), and now I see a much more descriptive error:

Globals initialized in the init_by_lua phase should also be available in init_worker_by_lua, so that shouldn't be necessary.

Otherwise, I believe lua-cassandra was never intended to be used inside of init_worker_by_lua (except for the cluster initialization step), and I was not aware that clustering used it there. lua-cassandra would have to be updated for this, but that is unrelated to Kong itself.


mars commented Feb 9, 2016

Hi @thibaultcha! Thanks for taking time to look into this,

I do see lots of log entries like [error] 91#0: [lua] cluster.lua:84: Cassandra error when running Mashape/kong 0.6.1. This is not something unique to my fork. Can anyone corroborate these errors from their own {nginx_working}/logs/error.log?

Globals initialized in the init_by_lua phase should also be available in init_worker_by_lua

init_worker_by_lua is run in each child/worker process of Nginx. That means this is interprocess sharing of globals 😨

So, I've been trying to diagnose further by exposing configuration & dao as top-level module locals or creating them again in Kong.init_worker(), but these reveal other errors like a nil conf in kong.tools.cluster.get_node_name() and API disabled in the context of init_worker_by_lua*.

@thibaultcha (Member)

init_worker_by_lua is run in each child/worker process of Nginx. That means this is interprocess sharing of globals

That is correct. This is one thing that annoys me greatly as well. The big, long-term plan is to refactor this along with the plugin definition. We have brainstormed a few ideas to make plugins more robust and, most of all, to stop depending on upvalues such as the Kong globals or ngx.


mars commented Feb 9, 2016

In another experiment, I just converted the Kong codebase from those global configuration & dao variables to reference a "singletons" module containing those values.

With that in place, I still see the same Cassandra error appear in the error log 😵
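
For reference, the singletons approach described above can be as simple as a shared module table. This is a hypothetical sketch of the idea, not the actual branch:

```lua
-- Hypothetical sketch of the "singletons" approach, not the actual branch.
-- kong/singletons.lua: a plain module table. Since require() caches
-- modules, every module in a worker that requires this gets the same
-- table, sharing state without writing into _G.
local singletons = {
  configuration = nil,  -- populated once, e.g. during Kong.init()
  dao = nil,
}
return singletons
```

Note this only shares state within a single worker process; each forked Nginx worker still gets its own copy, so it removes the globals without changing the cross-process semantics.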


mars commented Feb 10, 2016

Here's a branch to demo defining the configuration & dao globals for each Nginx process (in both init_by_lua & init_worker_by_lua).

On that branch, when running busted spec/integration/cluster/cluster_spec.lua, the Nginx error.log shows these errors (like those quoted above):

2016/02/10 10:15:56 [info] 48259#0: [lua] log.lua:22: info(): Refreshing local and peers info, context: init_worker_by_lua*
2016/02/10 10:15:56 [error] 48259#0: init_worker_by_lua error: /usr/local/share/lua/5.1/cassandra.lua:109: API disabled in the context of init_worker_by_lua*
stack traceback:
    [C]: in function 'tcp_sock'
    /usr/local/share/lua/5.1/cassandra.lua:109: in function 'new_socket'
    /usr/local/share/lua/5.1/cassandra.lua:261: in function 'connect'
    /usr/local/share/lua/5.1/cassandra.lua:509: in function 'get_first_coordinator'
    /usr/local/share/lua/5.1/cassandra.lua:1118: in function 'refresh_hosts'
    /usr/local/share/lua/5.1/cassandra.lua:1216: in function 'spawn_cluster'
    ./kong/dao/cassandra/factory.lua:38: in function 'new'
    /usr/local/share/lua/5.1/classic.lua:63: in function 'load'
    ./kong/kong.lua:112: in function 'define_globals'
    ./kong/kong.lua:154: in function 'init_worker'
    init_worker_by_lua:1: in main chunk
2016/02/10 10:15:56 [error] 48260#0: init_worker_by_lua error: /usr/local/openresty/lualib/resty/lock.lua:153: API disabled in the context of init_worker_by_lua*
stack traceback:
    [C]: in function 'sleep'
    /usr/local/openresty/lualib/resty/lock.lua:153: in function 'lock'
    /usr/local/share/lua/5.1/cassandra.lua:51: in function 'lock_mutex'
    /usr/local/share/lua/5.1/cassandra.lua:1104: in function 'refresh_hosts'
    /usr/local/share/lua/5.1/cassandra.lua:1216: in function 'spawn_cluster'
    ./kong/dao/cassandra/factory.lua:38: in function 'new'
    /usr/local/share/lua/5.1/classic.lua:63: in function 'load'
    ./kong/kong.lua:112: in function 'define_globals'
    ./kong/kong.lua:154: in function 'init_worker'
    init_worker_by_lua:1: in main chunk

Reading the docs for ngx.sleep & ngx.socket.tcp, it's clear that this code shouldn't ever work in the init_worker_by_lua context.

Co-sockets are not supported in these contexts.

It looks like the solution might be to define the configuration & dao globals inside each of the kong.core.cluster timers. Those ngx.sleep & ngx.socket.tcp APIs are available within the timer callback functions.
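
The timer-based workaround described above can be sketched roughly as follows (hypothetical code, not Kong's actual cluster.lua; the address is taken from the logs above):

```lua
-- Hypothetical sketch: defer cosocket work out of init_worker_by_lua
-- and into a timer callback, where ngx.sleep and ngx.socket.tcp are
-- permitted.
local function cluster_task(premature)
  if premature then return end  -- worker is shutting down

  -- Safe here: we are in the "timer" phase, so cosockets work.
  -- (Re)create the configuration & dao inside the callback rather than
  -- relying on state built during init_worker.
  local sock = ngx.socket.tcp()
  sock:settimeout(1000)
  local ok, err = sock:connect("10.1.16.105", 9042)
  if not ok then
    ngx.log(ngx.ERR, "Cassandra error: ", err)
  else
    sock:close()
  end

  -- reschedule, emulating a recurring task
  ngx.timer.at(10, cluster_task)
end

-- from init_worker_by_lua*: only schedule the timer, do no I/O
local ok, err = ngx.timer.at(0, cluster_task)
if not ok then
  ngx.log(ngx.ERR, "failed to schedule cluster task: ", err)
end
```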

@thibaultcha (Member)

Thank you for taking the time to investigate this! I find it very valuable that you dedicate time to looking into Kong. I did not know lua-cassandra was even being used there, and curiously I have never encountered this error myself, which is either very strange or means I simply missed it.

Yes, like I said, lua-cassandra was never meant to work in init_worker_by_lua, precisely because cosockets are not supported there. While I am not a fan of the idea, I started implementing such support last night here (it still needs some work; the second commit broke some things), assuming people will know that it is not ideal to run lua-cassandra in unsupported contexts.
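
The compatibility approach mentioned here (which became lua-resty-socket) boils down to picking a socket implementation based on the current phase. A rough sketch of the idea, not the library's actual code:

```lua
-- Rough sketch of the fallback idea, not lua-resty-socket's actual code:
-- use non-blocking cosockets where available, and fall back to the
-- blocking LuaSocket library in phases where cosockets are disabled.
local function new_tcp_socket()
  local phase = ngx.get_phase()
  if phase == "init" or phase == "init_worker" then
    -- cosockets are disabled here; LuaSocket blocks the worker, which
    -- is tolerable during startup but not during request processing
    return require("socket").tcp()
  end
  return ngx.socket.tcp()
end
```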

It looks like the solution might be to define the configuration & dao globals inside each of the kong.core.cluster timers.

That could be one solution too, maybe a more short-term one, since ideally those globals should be nuked anyway (#709 is a good place to talk about this, even though since then we have a clearer idea of how this should be accomplished and would like to build a proof of concept someday).


mars commented Feb 18, 2016

It looks like the solution might be to define the configuration & dao globals inside each of the kong.core.cluster timers.

I just tried that fix, which resulted in a new mutex problem from resty.lock:

2016/02/18 16:39:28 [info] 71#0: [lua] log.lua:22: info(): Refreshing local and peers info, context: ngx.timer
2016/02/18 16:39:28 [error] 71#0: lua entry thread aborted: runtime error: unknown reason
stack traceback:
coroutine 0:
 /app/.heroku/share/lua/5.1/kong/dao/cassandra/factory.lua:40: in function 'new'
 [C]: in function 'error'
 /app/.heroku/share/lua/5.1/classic.lua:63: in function 'load'
 /app/.heroku/share/lua/5.1/kong/core/cluster.lua:26: in function </app/.heroku/share/lua/5.1/kong/core/cluster.lua:21>, context: ngx.timer
2016/02/18 16:39:38 [error] 72#0: lua entry thread aborted: runtime error: /app/.heroku/share/lua/5.1/kong/dao/cassandra/factory.lua:40: Error locking mutex: timeout
stack traceback:
coroutine 0:
 /app/.heroku/share/lua/5.1/kong/dao/cassandra/factory.lua:40: in function 'new'
 [C]: in function 'error'
 /app/.heroku/share/lua/5.1/classic.lua:63: in function 'load'
 /app/.heroku/share/lua/5.1/kong/core/cluster.lua:26: in function </app/.heroku/share/lua/5.1/kong/core/cluster.lua:21>, context: ngx.timer

I see @thibaultcha making progress on lua-resty-socket & lua-cassandra late last night 😲 So, I'm going to jump back out of this 🐰 hole.

@thibaultcha (Member)

It looks like the solution might be to define the configuration & dao globals inside each of the kong.core.cluster timers.

I don't see how doing so would solve the problem: since those values are globals anyway, they are already available in all contexts, including the timers (regardless of how wrong that is).

There is one thing I still don't understand, though: I am really unable to replicate this and am not seeing any such error, especially since dao.nodes:find_by_keys is invoked in a timer context, not in init_worker. So are you talking about an issue on next, or on your fork?


mars commented Feb 18, 2016

The Cassandra error only happens when running a cluster with at least two nodes; a single Kong node doesn't trigger it.

I see the error running Kong 0.6.1 with both:

  • kong start CLI via heroku run bash
    • the error appears in {nginx-working-dir}/logs/error.log
  • web dyno via the external supervisor Procfile.web
    • the error appears in heroku logs stream

The only place I have a Kong cluster running is in a Heroku Private Space. I've been trying to reproduce locally, but would need to set up a development cluster on OS X, which I have not yet figured out. Do you have any notes on how you all work with clustering locally/in development?


mars commented Feb 18, 2016

I finally figured out how to get a second Kong running locally in a cluster.

I now confirm that this error does not occur when running locally on OS X with either Mashape/kong 0.6.1 or the fork mars/kong.

I guess there's something not-quite-right about the build on Heroku.

I'll close this in favor of a new issue in that repo.

Thanks for your patience.

mars closed this as completed Feb 18, 2016