
feat(healthchecks) add active and passive health checks #3096

Merged
hishamhm merged 17 commits from feat/healthchecks into next on Dec 15, 2017

Conversation

hishamhm
Contributor

Summary

Adds support for health checks in Kong.

Health checkers can perform active health checks, periodically probing a configurable upstream URL and checking the HTTP status code, TCP errors, and timeouts. They can also perform passive health checks, observing the same signals on regular proxied traffic, which enables circuit-breaker functionality.
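For illustration, the new healthchecks configuration on an upstream has roughly the following shape. This is a hedged sketch only: the default values and exact field names below are assumptions based on the discussion in this thread and should be verified against the 0.12 documentation.

    {
      "healthchecks": {
        "active": {
          "http_path": "/",
          "timeout": 1,
          "concurrency": 10,
          "healthy":   { "interval": 0, "http_statuses": [200, 302], "successes": 2 },
          "unhealthy": { "interval": 0, "http_statuses": [429, 500, 503],
                         "tcp_failures": 2, "timeouts": 3, "http_failures": 5 }
        },
        "passive": {
          "healthy":   { "http_statuses": [200, 302], "successes": 5 },
          "unhealthy": { "http_statuses": [500, 503],
                         "tcp_failures": 2, "timeouts": 7, "http_failures": 5 }
        }
      }
    }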

Major kudos to @Tieske for his work on lua-resty-dns-client and our joint work on lua-resty-healthcheck (on which this is based), and to @thibaultcha for his reviews while this PR was in the making.

Full changelog

  • Overhaul of the update flow for ring-balancer objects of upstream entities: instead of checking and updating balancer objects on demand for each client request, the module now keeps these objects updated in response to CRUD events on the underlying entities (see the sketch after this list)
  • Attaches a health checker object to each balancer object
  • Adds the healthchecks configuration field for the upstreams entity
  • Adds support for active health checks
  • Adds support for passive health checks
  • Adds endpoints for manually setting the health status: /upstreams/:id/targets/:target/healthy and /upstreams/:id/targets/:target/unhealthy
  • Adds a dependency on lua-resty-healthcheck, which provides the backend functionality for health checks
  • Bumps the dependency versions for lua-resty-worker-events and lua-resty-dns-client
  • Adds related unit and integration tests for all functionality listed above
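To make the first changelog item concrete, here is a minimal sketch of the event-driven update pattern using the lua-resty-worker-events API. The "crud"/"upstreams" source/event names and the rebuild_balancer helper are illustrative assumptions, not Kong's actual identifiers:

    local worker_events = require "resty.worker.events"

    -- placeholder for whatever rebuilds the ring-balancer of an upstream
    local function rebuild_balancer(upstream)
      -- ...
    end

    -- React to CRUD events on upstream entities instead of rebuilding
    -- balancers lazily on each proxied request.
    worker_events.register(function(data)
      rebuild_balancer(data.entity)
    end, "crud", "upstreams")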

Issues resolved

Fix #112
Fix #154

@hishamhm hishamhm changed the base branch from master to next December 14, 2017 22:35
end


local function check_http_statuses(arg)
Contributor

could we also apply check_positive_int in this check as well?

Member

Why? The only logic I can see the check_positive_int() validator enforcing would be checking if the argument is an integer. Is that what you are suggesting we should check in this validator? Otherwise, the check_http_statuses() validator already has narrower acceptance rules regarding the allowed range.

Member

@thibaultcha thibaultcha Dec 15, 2017

Not saying this isn't a valid check to perform, ofc, just making sure.

Contributor

Yes, that is what I'm suggesting, just for completeness' sake.
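For reference, a combined check along these lines might look like the following sketch; the error strings are partly illustrative, and thibaultcha's patch later in this thread adds essentially this integer check to check_http_statuses:

    local function check_http_statuses(arg)
      for _, s in ipairs(arg) do
        if type(s) ~= "number" then
          return false, "array element is not a number"
        end

        if math.floor(s) ~= s then
          return false, "must be an integer"
        end

        -- accept any three-digit status code (Postel's law)
        if s < 100 or s > 999 then
          return false, "invalid status code"
        end
      end

      return true
    end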

  if not ok then
    ngx_log(ngx_ERR, "failed to retry the dns/balancer resolver for ",
            tostring(addr.host), "' with: ", tostring(err))

-   return responses.send(500)
+   return ngx.exit(errcode)
Contributor

curious about this change away from responses.send?

Contributor Author

For 500 in particular it would work, but it's for consistency with the change above. Our responses module sometimes does ngx.say, which causes a crash in the balancer context.
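For context, a minimal sketch of the pattern, assuming the balancer phase, where writing a response body is not allowed; balancer_execute and addr are placeholder names, not the PR's actual ones:

    local ok, err = balancer_execute(addr)
    if not ok then
      ngx.log(ngx.ERR, "failed to retry the dns/balancer resolver for '",
              tostring(addr.host), "' with: ", tostring(err))
      -- bail out with a bare status code: helpers that write a body
      -- (e.g. via ngx.say) would error in balancer_by_lua*
      return ngx.exit(500)
    end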

@@ -32,6 +32,7 @@ lua_shared_dict kong 5m;
 lua_shared_dict kong_cache ${{MEM_CACHE_SIZE}};
 lua_shared_dict kong_process_events 5m;
 lua_shared_dict kong_cluster_events 5m;
+lua_shared_dict kong_healthchecks 5m;
Contributor

Just doing due diligence: have we verified that this is a large enough allocation, and that the healthcheck library handles and alerts on OOM/eviction conditions?

Contributor Author

Large enough is always relative, but yes, the library does alert on shm allocation failures. This PR logs whatever comes from the healthcheck library.
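As a general illustration of the kind of failure that gets surfaced, this is the stock ngx.shared.DICT behaviour rather than lua-resty-healthcheck's internal code, and the key/value used here are placeholders:

    local shm = ngx.shared.kong_healthchecks
    local key, value = "target:127.0.0.1:8080", "healthy"

    local ok, err, forcible = shm:set(key, value)
    if not ok then
      -- allocation failure in the shared dict (e.g. "no memory")
      ngx.log(ngx.ERR, "failed to store health data in shm: ", err)
    elseif forcible then
      -- the dict was full and an older entry was evicted to make room
      ngx.log(ngx.WARN, "kong_healthchecks shm is running low on memory")
    end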

  healthchecks = {
    type = "table",
    schema = gen_schema(healthchecks_defaults),
    default = healthchecks_defaults,
Contributor

from a UX perspective, what happens when I want to define a healthchecks sub-schema with mostly default data, but a few changes? Do I need to manually fill out the whole healthchecks_defaults as part of my request?

Member

The above gen_schema() function generates fields that have the default attribute set to the one defined in the above stub healthchecks_defaults, so you should indeed be able to partially specify health check configuration and expect default values to be inserted when necessary.
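For example, a partial update in the style craig-mld posts further down this thread should be enough; the upstream name here is a placeholder, and the untouched healthchecks fields keep their defaults:

    curl -X PATCH http://localhost:8001/upstreams/my.upstream \
      -H 'Content-Type: application/json' \
      -d '{"healthchecks.active.healthy.http_statuses": [200, 302]}'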

@thibaultcha thibaultcha added this to the 0.12.0 milestone Dec 15, 2017
@thibaultcha
Member

About to push the following patch to this branch:

diff --git a/kong/core/balancer.lua b/kong/core/balancer.lua
index 664c0e55..d45cc027 100644
--- a/kong/core/balancer.lua
+++ b/kong/core/balancer.lua
@@ -604,7 +604,7 @@ end
 local function init()
   local upstreams, err = get_all_upstreams()
   if not upstreams then
-    log(ERR, "failed loading initial list of upstreams: ", err)
+    log(ngx.STDERR, "failed loading initial list of upstreams: ", err)
     return
   end
 
@@ -615,7 +615,7 @@ local function init()
     if ok ~= nil then
       oks = oks + 1
     else
-      log(ERR, "error creating balancer for ", name, ": ", err)
+      log(ngx.STDERR, "failed creating balancer for ", name, ": ", err)
       errs = errs + 1
     end
   end
diff --git a/kong/dao/schemas/upstreams.lua b/kong/dao/schemas/upstreams.lua
index de3444e8..cf409850 100644
--- a/kong/dao/schemas/upstreams.lua
+++ b/kong/dao/schemas/upstreams.lua
@@ -41,6 +41,10 @@ local function check_http_statuses(arg)
       return false, "array element is not a number"
     end
 
+    if math.floor(s) ~= s then
+      return false, "must be an integer"
+    end
+
     -- Accept any three-digit status code,
     -- applying Postel's law in case of nonstandard HTTP codes
     if s < 100 or s > 999 then
diff --git a/spec/01-unit/007-entities_schemas_spec.lua b/spec/01-unit/007-entities_schemas_spec.lua
index ebc9f32a..05881ede 100644
--- a/spec/01-unit/007-entities_schemas_spec.lua
+++ b/spec/01-unit/007-entities_schemas_spec.lua
@@ -763,6 +763,7 @@ describe("Entities Schemas", function()
         {{ active = { healthy = { http_statuses = { -1 }}}}, "status code" },
         {{ active = { healthy = { http_statuses = { 99 }}}}, "status code" },
         {{ active = { healthy = { http_statuses = { 1000 }}}}, "status code" },
+        {{ active = { healthy = { http_statuses = { 111.314 }}}}, "must be an integer" },
         {{ active = { healthy = { successes = 0.5 }}}, "must be an integer" },
         {{ active = { healthy = { successes = 0 }}}, "must be an integer" },
         {{ active = { healthy = { successes = -1 }}}, "an integer between" },

Backport from commit fd5c8dc39127e9362975faefd30d915f0710fa77
by @kikito

Reworks the management of balancer objects from being on-demand,
on each request, to be event-based, responding to DAO events.
Balancers are now initialized on system startup, and updated
based on propagated worker-events and cluster-events.
These changes allow healthchecks to update properly.

Attaches a healthchecker object to each balancer. The balancers report HTTP
statuses and TCP failures to the healthchecker.

Moves objects that were attached to the balancer object using
fields such as `__target_history` into weakly-keyed local tables.

Use ctx.KONG_PROXIED to differentiate between responses
produced by an upstream vs. responses produced by Kong plugins.

Allow the internal http_server to produce a series of success
and failure responses (e.g. 10 successes, followed by 5
failures, followed by 10 successes).
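A minimal sketch of the ctx.KONG_PROXIED check described above, assumed to run in a response/log phase handler; report_http_status and target are placeholder names, not the PR's actual identifiers:

    if ngx.ctx.KONG_PROXIED then
      -- only responses that actually came from the upstream feed the
      -- passive health checker; Kong-generated responses (e.g. a 401
      -- from an auth plugin) are skipped
      report_http_status(target, ngx.status)
    end
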
Adds two new endpoints to Target entities:

* `/upstreams/:id/targets/:target/healthy`
* `/upstreams/:id/targets/:target/unhealthy`

These post an event that is forwarded to all relevant
healthcheckers, updating their immediate status. This
is useful for manually re-enabling a target that has been
disabled by passive health checks.
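For illustration, marking a target healthy or unhealthy by hand would look roughly like this; the upstream and target names are placeholders, and POST is the assumed verb:

    curl -X POST http://localhost:8001/upstreams/my.upstream/targets/192.168.0.10:8080/healthy
    curl -X POST http://localhost:8001/upstreams/my.upstream/targets/192.168.0.10:8080/unhealthy
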
Add tests for:

* manual recovery using /healthy endpoint: test scenario using
  passive healthchecks as circuit breaker, then using the
  endpoint to restore the target.
* manual shutdown using the /unhealthy endpoint.

Remove hardcoded references to 127.0.0.1 in balancer integration tests.
This sets the stage for IPv6 tests, pending an update of
lua-resty-dns-client.

Adds a test verifying that a Kong-generated error (in this case
a `401 Unauthorized` generated by the key-auth plugin) does not count
toward the health checker's event counters.

This integration test:

1. configures healthchecks with a 1-error threshold
2. adds a key-auth plugin with no configured consumer
3. runs a request, which fails with 401 but doesn't hit the 1-error threshold
4. deletes the plugin
5. starts proxied servers and runs two full rounds of the balancer wheel
6. each server reports one full round of the wheel's worth of successes,
   meaning both are healthy and were not affected by the 401.
Member

@thibaultcha thibaultcha left a comment

Applied the above patch, and rebased on top of next (itself including the recent master work).

@hishamhm Please do merge at your convenience, but please do so with the "Create a merge commit" option in this particular case (large features). Thanks!

@thibaultcha thibaultcha added the pr/ready This PR is considered ready and can be merged at anytime (given it received no subsequent changes) label Dec 15, 2017
@hishamhm hishamhm merged commit ceab1d9 into next Dec 15, 2017
@xiaolongiii

xiaolongiii commented Dec 25, 2017

After installing version 0.12, how can I set an upstream to use active health checks?
I can't find any documentation explaining it.

@coopr
Contributor

coopr commented Dec 26, 2017 via email

@Tieske Tieske deleted the feat/healthchecks branch December 29, 2017 07:59
@craig-mld

I can't work out how to set the array of statuses for healthchecks.active.healthy.http_statuses. Could you provide an example curl command? No matter what I try, I just get "array element is not a number" in response.

@craig-mld

Sorted - requires JSON input. Doesn't seem to work with "key=var" input format.
curl -X PATCH http://localhost:8001/upstreams/my.new.upstream \
  -H 'content-type: application/json' \
  -d '{"healthchecks.active.healthy.http_statuses":[200, 302, 401]}'

@gohonsen

gohonsen commented Apr 19, 2018

@craig-mld Thanks, I had the same problem. I tried -d "healthchecks.active.healthy.http_statuses[]=200", -d "healthchecks.active.healthy.http_statuses=200", etc., and all returned errors; -d '{"healthchecks.active.healthy.http_statuses":[200]}' works.
