fix(balancer) fix upstreams reload every 10s #8974
Hello, |
Hey @marc-charpentier, sorry for the late check-in on this (we're all a little bit swamped at the moment prepping the next release). That test failure looks very similar to one we've been tracking internally that is known to be flaky, so I think it's okay for you to disregard for now. This changeset looks good to me (thanks again for your efforts--much appreciated!). When I have the time I'll probably do a little exploratory work to see if adding a targeted integration test is feasible, though it might not be necessary in the end. Aside from that, just awaiting a second opinion from somebody else on the team. |
Hi @flrgh , thank you for your answer, and no problem for the late check-in. |
It's difficult to rebase, so I'm trying to cherry-pick it. I'm also trying to find some way to test this.
Hmm. I was
So it seems there at least was a reason to not cache an empty table because it was exacerbating another DNS balancer problem. Maybe it's not safe to revert this change? On the other hand, the original bugfix is also over 2 years old now, so it's also semi-plausible that this is not a problem anymore. Edit: I am now very tempted to believe that the bug fixed by #5831 might be due to this logic here: kong/kong/runloop/balancer/upstreams.lua Lines 96 to 99 in f38b38e
This doesn't appear to be correct usage of lua-resty-mlcache. According to the docs, a single

    -- arg1, arg2, and arg3 are arguments forwarded to the callback from the
    -- `get()` variadic arguments, like so:
    -- cache:get(key, opts, callback, arg1, arg2, arg3)
    local function callback(arg1, arg2, arg3)
      -- I/O lookup logic
      -- ...
      -- value: the value to cache (Lua scalar or table)
      -- err: if not `nil`, will abort get(), which will return `value` and `err`
      -- ttl: override ttl for this value
      --      If returned as `ttl >= 0`, it will override the instance
      --      (or option) `ttl` or `neg_ttl`.
      --      If returned as `ttl < 0`, `value` will be returned by get(),
      --      but not cached. This return value will be ignored if not a number.
      return value, err, ttl
    end

In light of that I think this change should be accompanied by fixing the return semantics of the function when there's a DB-related error:

     if err then
       log(CRIT, "could not obtain list of upstreams: ", err)
    -  return nil
    +  return nil, err
     end
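To make the difference between `return nil` and `return nil, err` concrete, here is a minimal sketch. It uses a toy stand-in for a cache lookup that follows the `value, err, ttl` contract quoted above; `toy_get`, both loader functions, and the error string are hypothetical names for illustration, and this is not lua-resty-mlcache's actual implementation.

```lua
-- Toy stand-in for a cache lookup. NOT lua-resty-mlcache itself; it only
-- mimics the documented contract of the callback's return values.
local function toy_get(store, key, callback, ...)
  local entry = store[key]
  if entry then
    return entry.value, nil            -- served from the toy cache
  end

  local value, err, ttl = callback(...)
  if err ~= nil then
    return value, err                  -- error: abort, nothing is cached
  end
  if type(ttl) == "number" and ttl < 0 then
    return value, nil                  -- returned to the caller, not cached
  end

  store[key] = { value = value }       -- a nil value is cached too (a "miss")
  return value, nil
end

-- Hypothetical loader that swallows the DB error (the old behaviour).
local function loader_swallowing_error(db_err)
  if db_err then
    return nil
  end
  return {}
end

-- Hypothetical loader that forwards the DB error (the suggested fix).
local function loader_propagating_error(db_err)
  if db_err then
    return nil, db_err
  end
  return {}
end

local store_a, store_b = {}, {}
local v1, e1 = toy_get(store_a, "upstreams", loader_swallowing_error, "connection refused")
local v2, e2 = toy_get(store_b, "upstreams", loader_propagating_error, "connection refused")
```

In this model the first call looks like an ordinary miss (`v1` and `e1` are both `nil`) and the failure is stored in `store_a`, while the second call surfaces the error in `e2` and leaves `store_b` empty so the next lookup retries.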
Nice finding. Since we have now corrected the handling, we could probably remove that option?
Interestingly, the hardcoded negative TTL came from #4301. This thread is the explanation I was looking for. Many of the concerns raised there seem to still be somewhat valid, though I tend to think we should trust that I think a valuable change would be to detect the case where there are zero upstreams and no error is encountered and add a debug log entry in this case. That way this scenario is at least observable if/when there's a reason to suspect that something isn't working correctly. @locao what do you think about all of this? |
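The debug-log idea could look roughly like the sketch below. It is only an illustration: `build_upstreams_dict`, `iter_from`, `log_debug`, and the row shape (`name`, `id`) are hypothetical stand-ins rather than Kong's actual loader, DAO iterator, or logger, and the iterator is assumed to yield `false, err` on failure.

```lua
-- Builds a name -> id dictionary from an iterator of upstream rows.
-- `each_upstream` and `log_debug` are injected so the sketch runs standalone.
local function build_upstreams_dict(each_upstream, log_debug)
  local dict = {}

  for up, err in each_upstream() do
    if err then
      return nil, err                  -- forward the error to the caller
    end
    dict[up.name] = up.id
  end

  -- Zero upstreams and no error: still a valid (cacheable) result,
  -- but leave a trace so the situation is observable.
  if next(dict) == nil then
    log_debug("no upstreams found; caching an empty table")
  end

  return dict
end

-- Hypothetical iterator factory: yields each row, then `false, err` once
-- if an error was supplied, then nil to end the loop.
local function iter_from(rows, err)
  local i = 0
  return function()
    return function()
      i = i + 1
      if rows[i] then
        return rows[i]
      end
      if err then
        local e = err
        err = nil
        return false, e
      end
      return nil
    end
  end
end

local messages = {}
local function log_debug(msg)
  messages[#messages + 1] = msg
end

local empty = build_upstreams_dict(iter_from({}), log_debug)
local one = build_upstreams_dict(iter_from({ { name = "a", id = "1" } }), log_debug)
local failed, load_err = build_upstreams_dict(iter_from({}, "db down"), log_debug)
```

Only the empty, error-free case writes a log entry here; the error case returns `nil, err` without logging at debug level, which keeps the two situations distinguishable.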
The upstreams module's load_upstreams_dict_into_memory returned a non-cacheable value when the upstreams table was empty, causing the empty table to be reloaded in the request context each time the 10s negative TTL expired.
Co-authored-by: Michael Martin <[email protected]>
LGTM, but as I'm committing to this, we need someone else to review it.
looks good to me 👍 @locao care to give this a quick review?
Reviewing this is in my backlog, sorry for the long delay. I'll check it ASAP.
Summary
See #8970 (comment).
To fix this, an empty table is now treated as a valid value to cache.
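The effect can be illustrated with a small simulation. This is a toy model under stated assumptions, not Kong's or lua-resty-mlcache's code: `new_toy_cache` and `count_loads` are hypothetical, `nil` results are assumed to expire after a 10s negative TTL, non-`nil` results are assumed never to expire, and one lookup per second is simulated for 60 seconds.

```lua
-- Toy single-key cache: nil values expire after `neg_ttl` seconds,
-- non-nil values never expire (a simplifying assumption of this sketch).
local function new_toy_cache(neg_ttl)
  local entry
  local self = { loads = 0 }

  function self.get(now, loader)
    if entry and (entry.expires == nil or now < entry.expires) then
      return entry.value
    end
    self.loads = self.loads + 1
    local value = loader()
    entry = {
      value = value,
      expires = (value == nil) and (now + neg_ttl) or nil,
    }
    return value
  end

  return self
end

-- Runs one lookup per simulated second and reports how often the loader ran.
local function count_loads(loader)
  local cache = new_toy_cache(10)
  for now = 0, 59 do
    cache.get(now, loader)
  end
  return cache.loads
end

-- Old behaviour: no upstreams is reported as nil (a negative result).
local loads_when_nil = count_loads(function() return nil end)
-- New behaviour: no upstreams is reported as an empty table.
local loads_when_empty_table = count_loads(function() return {} end)

print(loads_when_nil, loads_when_empty_table)  -- prints: 6	1
```

Under these assumptions the `nil` result triggers a reload every 10 seconds (six times in a minute), while the empty table is loaded once and then served from the cache.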
Full changelog
Issue reference
Fix #8970