
Cassandra error from kong.core.cluster #966

Closed
mars opened this issue Feb 8, 2016 · 12 comments

mars commented Feb 8, 2016

I've been seeing these errors coming out of Nginx:

(Edited for clarity. Removed misleading log INFO entry.)

2016/02/08 22:46:39 [notice] 91#0: signal 17 (SIGCHLD) received
2016/02/08 22:46:39 [notice] 91#0: unknown process 1476 exited with code 0
2016/02/08 22:46:39 [error] 91#0: [lua] cluster.lua:84: Cassandra error: 10.1.16.105, context: ngx.timer

This seems to indicate that the calls from Kong.init_worker() are missing the global configuration set in Kong.init().

So, I tried adding the global configuration to init_worker(), and now I see a much more descriptive error:

2016/02/08 22:45:57 [notice] 62#0: start worker process 88
2016/02/08 22:45:57 [info] 89#0: [lua] log.lua:22: info(): Refreshing local and peers info, context: init_worker_by_lua*
2016/02/08 22:45:57 [error] 88#0: init_worker_by_lua error: /app/.heroku/lualib/resty/lock.lua:153: API disabled in the context of init_worker_by_lua*
stack traceback:
 [C]: in function 'sleep'
 /app/.heroku/lualib/resty/lock.lua:153: in function 'lock'
 /app/.heroku/share/lua/5.1/cassandra.lua:51: in function 'lock_mutex'
 /app/.heroku/share/lua/5.1/cassandra.lua:1104: in function 'refresh_hosts'
 /app/.heroku/share/lua/5.1/cassandra.lua:1216: in function 'spawn_cluster'
 /app/.heroku/share/lua/5.1/kong/dao/cassandra/factory.lua:38: in function 'new'
2016/02/08 22:45:57 [error] 89#0: init_worker_by_lua error: /app/.heroku/share/lua/5.1/cassandra.lua:109: API disabled in the context of init_worker_by_lua*
stack traceback:
 [C]: in function 'tcp_sock'
 /app/.heroku/share/lua/5.1/classic.lua:63: in function 'load'
 /app/.heroku/share/lua/5.1/kong.lua:161: in function 'init_worker'
 init_worker_by_lua:1: in main chunk
 /app/.heroku/share/lua/5.1/cassandra.lua:109: in function 'new_socket'
 /app/.heroku/share/lua/5.1/cassandra.lua:261: in function 'connect'
 /app/.heroku/share/lua/5.1/cassandra.lua:509: in function 'get_first_coordinator'
 /app/.heroku/share/lua/5.1/cassandra.lua:1118: in function 'refresh_hosts'
 /app/.heroku/share/lua/5.1/cassandra.lua:1216: in function 'spawn_cluster'
 /app/.heroku/share/lua/5.1/kong/dao/cassandra/factory.lua:38: in function 'new'
 /app/.heroku/share/lua/5.1/classic.lua:63: in function 'load'
 /app/.heroku/share/lua/5.1/kong.lua:161: in function 'init_worker'
 init_worker_by_lua:1: in main chunk

A few preliminary conclusions:

  • Confusion arises from kong.core.cluster relying on globals configuration & dao
  • It seems sleep & tcp_sock are not allowed in init_worker_by_lua*
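
That observation matches the OpenResty phase rules: ngx.sleep and the cosocket API (ngx.socket.tcp) are disabled in the init_by_lua* and init_worker_by_lua* phases. A minimal sketch of guarding against this, with a hypothetical helper that is not part of Kong's code:

```lua
-- Hypothetical guard, not Kong code: check the current OpenResty phase
-- before attempting cosocket work. ngx.get_phase() returns strings such
-- as "init", "init_worker", "timer", "access", etc.
local function can_use_cosockets()
  local phase = ngx.get_phase()
  -- cosockets (and ngx.sleep) are unavailable in these phases
  return phase ~= "init" and phase ~= "init_worker"
end

if can_use_cosockets() then
  local sock = ngx.socket.tcp()
  -- ... connect to Cassandra ...
else
  ngx.log(ngx.WARN, "deferring Cassandra call: cosockets disabled in ",
          ngx.get_phase())
end
```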

As @thibaultcha & @thefosk know, I'm operating Kong on an experimental fork using an external supervisor #928, so this issue may be self-inflicted or a canary indicator of a latent issue.


mars commented Feb 8, 2016

BTW, I see the same error when running with Kong's internal process manager, via kong start -c $KONG_CONF:

2016/02/08 23:52:13 [notice] 91#0: signal 17 (SIGCHLD) received
2016/02/08 23:52:13 [notice] 91#0: unknown process 241 exited with code 0
2016/02/08 23:52:13 [error] 91#0: [lua] cluster.lua:84: Cassandra error: 10.1.60.153, context: ngx.timer


mars commented Feb 8, 2016

…and these errors from a different function in cluster.lua:

2016/02/08 23:57:21 [error] 158#0: [lua] cluster.lua:40: Cassandra error: 10.1.13.186:9042, context: ngx.timer
2016/02/08 23:57:24 [error] 158#0: [lua] cluster.lua:40: Cassandra error: 10.1.60.153, context: ngx.timer
2016/02/08 23:57:27 [error] 158#0: [lua] cluster.lua:40: Cassandra error: 10.1.16.105, context: ngx.timer

@thibaultcha (Member)

Hi @mars,

The first log you are seeing is not an error but an info message (the authentication logs).

The second log is odd; lua-cassandra should print more descriptive error messages than this.

So, I tried adding the global configuration to init_worker(), and now I see a much more descriptive error:

Globals initialized in the init_by_lua phase should also be available in init_worker_by_lua, so that shouldn't be necessary.

Otherwise, I believe lua-cassandra was never intended to be used inside of init_worker_by_lua (except for the cluster initialization step), and I was not aware that clustering used it there. lua-cassandra would have to be updated for this, but that is unrelated to Kong itself.


mars commented Feb 9, 2016

Hi @thibaultcha! Thanks for taking time to look into this,

I do see lots of log entries like [error] 91#0: [lua] cluster.lua:84: Cassandra error when running Mashape/kong 0.6.1. This is not something unique to my fork. Can anyone corroborate these errors from their own {nginx_working}/logs/error.log?

Globals initialized in the init_by_lua phase should also be available in init_worker_by_lua

init_worker_by_lua is run in each child/worker process of Nginx. That means this is interprocess sharing of globals 😨

So, I've been trying to diagnose further by exposing configuration & dao as top-level module locals or creating them again in Kong.init_worker(), but these reveal other errors like a nil conf in kong.tools.cluster.get_node_name() and API disabled in the context of init_worker_by_lua*.

@thibaultcha (Member)

init_worker_by_lua is run in each child/worker process of Nginx. That means this is interprocess sharing of globals

That is correct. This is one thing that annoys me greatly as well. The big, long-term plan is to refactor this along with the plugin definition. We have brainstormed a few ideas to make plugins more robust and, most of all, to stop depending on upvalues such as the Kong globals or ngx.


mars commented Feb 9, 2016

In another experiment, I just converted the Kong codebase from those global configuration & dao variables to reference a "singletons" module containing those values.

With that in place, I still see the same Cassandra error appear in the error log 😵
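
For reference, the singletons approach described above can be as simple as a shared module table. This is a hypothetical sketch of the idea, not the actual branch:

```lua
-- Hypothetical sketch of the "singletons" approach, not the actual branch.
-- kong/singletons.lua: a plain module table. Since require() caches
-- modules, every module in a worker that requires this gets the same
-- table, sharing state without writing into _G.
local singletons = {
  configuration = nil,  -- populated once, e.g. during Kong.init()
  dao = nil,
}
return singletons
```

Note this only shares state within a single worker process; each forked Nginx worker still gets its own copy, so it removes the globals without changing the cross-process semantics.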


mars commented Feb 10, 2016

Here's a branch to demo defining the configuration & dao globals for each Nginx process (in both init_by_lua & init_worker_by_lua).

On that branch, when running busted spec/integration/cluster/cluster_spec.lua, the Nginx error.log shows these errors (like those quoted above):

2016/02/10 10:15:56 [info] 48259#0: [lua] log.lua:22: info(): Refreshing local and peers info, context: init_worker_by_lua*
2016/02/10 10:15:56 [error] 48259#0: init_worker_by_lua error: /usr/local/share/lua/5.1/cassandra.lua:109: API disabled in the context of init_worker_by_lua*
stack traceback:
    [C]: in function 'tcp_sock'
    /usr/local/share/lua/5.1/cassandra.lua:109: in function 'new_socket'
    /usr/local/share/lua/5.1/cassandra.lua:261: in function 'connect'
    /usr/local/share/lua/5.1/cassandra.lua:509: in function 'get_first_coordinator'
    /usr/local/share/lua/5.1/cassandra.lua:1118: in function 'refresh_hosts'
    /usr/local/share/lua/5.1/cassandra.lua:1216: in function 'spawn_cluster'
    ./kong/dao/cassandra/factory.lua:38: in function 'new'
    /usr/local/share/lua/5.1/classic.lua:63: in function 'load'
    ./kong/kong.lua:112: in function 'define_globals'
    ./kong/kong.lua:154: in function 'init_worker'
    init_worker_by_lua:1: in main chunk
2016/02/10 10:15:56 [error] 48260#0: init_worker_by_lua error: /usr/local/openresty/lualib/resty/lock.lua:153: API disabled in the context of init_worker_by_lua*
stack traceback:
    [C]: in function 'sleep'
    /usr/local/openresty/lualib/resty/lock.lua:153: in function 'lock'
    /usr/local/share/lua/5.1/cassandra.lua:51: in function 'lock_mutex'
    /usr/local/share/lua/5.1/cassandra.lua:1104: in function 'refresh_hosts'
    /usr/local/share/lua/5.1/cassandra.lua:1216: in function 'spawn_cluster'
    ./kong/dao/cassandra/factory.lua:38: in function 'new'
    /usr/local/share/lua/5.1/classic.lua:63: in function 'load'
    ./kong/kong.lua:112: in function 'define_globals'
    ./kong/kong.lua:154: in function 'init_worker'
    init_worker_by_lua:1: in main chunk

Reading the docs for ngx.sleep & ngx.socket.tcp, it's clear that this code shouldn't ever work in the init_worker_by_lua context.

Co-sockets are not supported in these contexts.

It looks like the solution might be to define the configuration & dao globals inside each of the kong.core.cluster timers. Those ngx.sleep & ngx.socket.tcp APIs are available within the timer callback functions.
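
The timer-based workaround described above can be sketched roughly as follows (hypothetical code, not Kong's actual cluster.lua; the address is taken from the logs above):

```lua
-- Hypothetical sketch: defer cosocket work out of init_worker_by_lua
-- and into a timer callback, where ngx.sleep and ngx.socket.tcp are
-- permitted.
local function cluster_task(premature)
  if premature then return end  -- worker is shutting down

  -- Safe here: we are in the "timer" phase, so cosockets work.
  -- (Re)create the configuration & dao inside the callback rather than
  -- relying on state built during init_worker.
  local sock = ngx.socket.tcp()
  sock:settimeout(1000)
  local ok, err = sock:connect("10.1.16.105", 9042)
  if not ok then
    ngx.log(ngx.ERR, "Cassandra error: ", err)
  else
    sock:close()
  end

  -- reschedule, emulating a recurring task
  ngx.timer.at(10, cluster_task)
end

-- from init_worker_by_lua*: only schedule the timer, do no I/O
local ok, err = ngx.timer.at(0, cluster_task)
if not ok then
  ngx.log(ngx.ERR, "failed to schedule cluster task: ", err)
end
```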

@thibaultcha (Member)

Thank you for taking the time to investigate this! I find it very valuable that you dedicate time to looking into Kong. I did not know lua-cassandra was even being used there, and curiously I have never encountered this error myself, which is either very strange or means I simply missed it.

Yes, like I said, lua-cassandra was never meant to work in init_worker_by_lua, precisely because cosockets are not supported there. While I am not a fan of the idea, I started implementing such support last night here (it still needs some work; the second commit broke some things), assuming people will know that it is not ideal to run lua-cassandra in unsupported contexts.
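
The compatibility approach mentioned here (which became lua-resty-socket) boils down to picking a socket implementation based on the current phase. A rough sketch of the idea, not the library's actual code:

```lua
-- Rough sketch of the fallback idea, not lua-resty-socket's actual code:
-- use non-blocking cosockets where available, and fall back to the
-- blocking LuaSocket library in phases where cosockets are disabled.
local function new_tcp_socket()
  local phase = ngx.get_phase()
  if phase == "init" or phase == "init_worker" then
    -- cosockets are disabled here; LuaSocket blocks the worker, which
    -- is tolerable during startup but not during request processing
    return require("socket").tcp()
  end
  return ngx.socket.tcp()
end
```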

It looks like the solution might be to define the configuration & dao globals inside each of the kong.core.cluster timers.

That could be one solution too, maybe a more short-term one, since ideally those globals should be nuked anyway (#709 is a good place to talk about this, even though since then we have a clearer idea of how this should be accomplished and would like to build a proof of concept someday).


mars commented Feb 18, 2016

It looks like the solution might be to define the configuration & dao globals inside each of the kong.core.cluster timers.

I just tried that fix, which resulted in a new mutex problem from resty.lock:

2016/02/18 16:39:28 [info] 71#0: [lua] log.lua:22: info(): Refreshing local and peers info, context: ngx.timer
2016/02/18 16:39:28 [error] 71#0: lua entry thread aborted: runtime error: unknown reason
stack traceback:
coroutine 0:
 /app/.heroku/share/lua/5.1/kong/dao/cassandra/factory.lua:40: in function 'new'
 [C]: in function 'error'
 /app/.heroku/share/lua/5.1/classic.lua:63: in function 'load'
 /app/.heroku/share/lua/5.1/kong/core/cluster.lua:26: in function </app/.heroku/share/lua/5.1/kong/core/cluster.lua:21>, context: ngx.timer
2016/02/18 16:39:38 [error] 72#0: lua entry thread aborted: runtime error: /app/.heroku/share/lua/5.1/kong/dao/cassandra/factory.lua:40: Error locking mutex: timeout
stack traceback:
coroutine 0:
 /app/.heroku/share/lua/5.1/kong/dao/cassandra/factory.lua:40: in function 'new'
 [C]: in function 'error'
 /app/.heroku/share/lua/5.1/classic.lua:63: in function 'load'
 /app/.heroku/share/lua/5.1/kong/core/cluster.lua:26: in function </app/.heroku/share/lua/5.1/kong/core/cluster.lua:21>, context: ngx.timer

I see @thibaultcha making progress on lua-resty-socket & lua-cassandra late last night 😲 So, I'm going to jump back out of this 🐰 hole.

@thibaultcha (Member)

It looks like the solution might be to define the configuration & dao globals inside each of the kong.core.cluster timers.

I don't see how doing so would solve the problem: since those values are globals anyway, they are already available in all contexts, including the timers (regardless of how wrong that is).

There is one thing I still don't understand, though: I am really unable to replicate this and am not seeing any such error, especially since dao.nodes:find_by_keys is invoked in a timer context, not in init_worker. So are you talking about an issue on next, or on your fork?


mars commented Feb 18, 2016

The Cassandra error only happens when running a cluster with at least two nodes; a single Kong node doesn't trigger it.

I see the error running Kong 0.6.1 with both:

  • kong start CLI via heroku run bash
    • the error appears in {nginx-working-dir}/logs/error.log
  • web dyno via the external supervisor Procfile.web
    • the error appears in heroku logs stream

The only place I have a Kong cluster running is in a Heroku Private Space. I've been trying to reproduce locally, but would need to set up a development cluster on OS X, which I have not yet figured out. Do you have any notes on how you all work with clustering locally/in development?


mars commented Feb 18, 2016

I finally figured out how to get a second Kong running locally in a cluster.

I now confirm that this error does not occur when running locally on OS X with either Mashape/kong 0.6.1 or the fork mars/kong.

I guess there's something not-quite-right about the build on Heroku.

I'll close this in favor of a new issue in that repo.

Thanks for your patience.

mars closed this as completed Feb 18, 2016