fix(balancer) fix upstreams reload every 10s #8974

marc-charpentier · 2022-06-17T14:29:40Z

The upstreams module's load_upstreams_dict_into_memory returned a
non-cacheable value when the upstreams table was empty, causing the empty
table to be reloaded in request context each time the 10s negative TTL expired.

Summary

See #8970 (comment).

To fix this, an empty table can be treated as a valid value to cache.
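A minimal sketch of why the reload happens, assuming a toy cache that keeps non-nil values until invalidated and remembers nil results only for a negative TTL. The cache, `load_old`, `load_new`, and `count_loads` are hypothetical names for illustration; this is not lua-resty-mlcache or Kong code.

```lua
-- Toy cache, NOT lua-resty-mlcache: it only mimics the positive/negative
-- TTL behaviour described above. All names here are hypothetical.
local function new_cache(neg_ttl)
  local entry                 -- single slot: { value = ..., expires = ... }
  local self = { loads = 0 }  -- loads counts how often the loader ran

  function self.get(now, loader)
    if entry and (entry.expires == nil or now < entry.expires) then
      return entry.value
    end
    self.loads = self.loads + 1
    local value = loader()
    if value == nil then
      -- a nil result is remembered only for neg_ttl seconds
      entry = { value = nil, expires = now + neg_ttl }
    else
      -- any non-nil result, including an empty table, is kept (assumption:
      -- no positive TTL, entries live until invalidated)
      entry = { value = value, expires = nil }
    end
    return value
  end

  return self
end

-- Loader before the fix: an empty dict is reported as nil.
local function load_old()
  local dict = {}
  return next(dict) and dict or nil
end

-- Loader after the fix: an empty dict is a valid, cacheable value.
local function load_new()
  return {}
end

-- Simulate one lookup per second for 35 seconds and count loader calls.
local function count_loads(loader)
  local cache = new_cache(10)
  for now = 0, 34 do
    cache.get(now, loader)
  end
  return cache.loads
end

print("old loader calls:", count_loads(load_old))
print("new loader calls:", count_loads(load_new))
```

In this simulation the old loader runs again each time the 10s negative TTL window ends, while the new loader runs once.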

Full changelog

[Fix empty upstreams table reload every 10s]

Issue reference

Fix #8970

marc-charpentier · 2022-06-23T14:33:31Z

Hello,
Sorry for the last commit coming and going; I was wondering whether the cache invalidation delay caused this unit test to fail.

marc-charpentier · 2022-06-24T13:59:31Z

Hello,
I think I managed to run the failing unit test on my local machine, but the result is the same with both branches (release/2.8.x and fix/empty-upstreams-reload-every-10-seconds). I also added error-level log statements to both modified functions in kong/runloop/balancer/upstreams.lua, but they didn't appear in the unit test run.
I tend to think the proposed fix isn't causing the unit test failure.
Could someone more experienced than me have a look?
Thanks.

flrgh · 2022-06-24T18:39:06Z

Hey @marc-charpentier, sorry for the late check-in on this (we're all a little bit swamped at the moment prepping the next release).

That test failure looks very similar to one we've been tracking internally that is known to be flaky, so I think it's okay for you to disregard for now.

This changeset looks good to me (thanks again for your efforts--much appreciated!). When I have the time I'll probably do a little exploratory work to see if adding a targeted integration test is feasible, though it might not be necessary in the end. Aside from that, just awaiting a second opinion from somebody else on the team.

marc-charpentier · 2022-06-26T14:31:12Z

Hi @flrgh, thank you for your answer, and no problem about the late check-in.
Now all tests have passed, although I don't know whether someone triggered them again, or why the previously failing one succeeded this time.

StarlightIbuki · 2022-11-10T09:09:28Z

Difficult to rebase, so I'm trying to cherry-pick it. Trying to find some way to test this.
Sorry for messing up the tags... I did not notice this is targeting 2.8.x.
This fix should also work for master.

spec/02-integration/06-invalidations/04-balancer_cache_correctness_spec.lua

flrgh · 2022-11-15T22:01:13Z

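Hmm. I was git blame-ing my way around looking for the reason why the upstream negative TTL is hardcoded to 10s, and I found these:

fix(balancer) don't cache an empty upstream name dict #5831

fix(balancer) don't cache an empty upstream name dict #7002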

So it seems there at least was a reason to not cache an empty table because it was exacerbating another DNS balancer problem. Maybe it's not safe to revert this change? On the other hand, the original bugfix is also over 2 years old now, so it's also semi-plausible that this is not a problem anymore.

CC @javierguerragiraldez

Edit: I am now very tempted to believe that the bug fixed by #5831 might be due to this logic here:

kong/kong/runloop/balancer/upstreams.lua

Lines 96 to 99 in f38b38e

    if err then
      log(CRIT, "could not obtain list of upstreams: ", err)
      return nil
    end

This doesn't appear to be correct usage of lua-resty-mlcache. According to the docs, a single nil return value will be treated as negative cache if not accompanied by a second (error) return value:

-- arg1, arg2, and arg3 are arguments forwarded to the callback from the
-- `get()` variadic arguments, like so:
-- cache:get(key, opts, callback, arg1, arg2, arg3)

local function callback(arg1, arg2, arg3)
    -- I/O lookup logic
    -- ...

    -- value: the value to cache (Lua scalar or table)
    -- err: if not `nil`, will abort get(), which will return `value` and `err`
    -- ttl: override ttl for this value
    --      If returned as `ttl >= 0`, it will override the instance
    --      (or option) `ttl` or `neg_ttl`.
    --      If returned as `ttl < 0`, `value` will be returned by get(),
    --      but not cached. This return value will be ignored if not a number.
    return value, err, ttl
end

In light of that I think this change should be accompanied by fixing the return semantics of the function when there's a DB-related error:

    if err then
      log(CRIT, "could not obtain list of upstreams: ", err)
-      return nil
+      return nil, err
    end
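To make the difference between `return nil` and `return nil, err` concrete, here is a small sketch. The `get` function below is a hypothetical stand-in that follows the callback contract quoted above (a non-nil `err` aborts, a lone `nil` is cached); it is not the lua-resty-mlcache implementation, and `failing_each`, `load_swallow`, and `load_propagate` are illustrative names.

```lua
-- Hypothetical stand-in for a cache get() that follows the callback
-- contract quoted above; it is not the lua-resty-mlcache implementation.
local store = {}

local function get(key, callback, ...)
  local hit = store[key]
  if hit then
    return hit.value, nil
  end
  local value, err = callback(...)
  if err ~= nil then
    -- abort: report the error and cache nothing
    return value, err
  end
  -- no error: the result is cached, even when it is nil (negative caching)
  store[key] = { value = value }
  return value, nil
end

-- Simulated DB read that always fails (assumption for the example).
local function failing_each()
  return nil, "connection refused"
end

-- Before: the DB error is swallowed, so the failure looks like "no data".
local function load_swallow(each)
  local rows, err = each()
  if err then
    return nil
  end
  return rows
end

-- After: the error is propagated, so nothing is cached.
local function load_propagate(each)
  local rows, err = each()
  if err then
    return nil, err
  end
  return rows
end

local v1, e1 = get("swallow", load_swallow, failing_each)
local v2, e2 = get("propagate", load_propagate, failing_each)
print(v1, e1, store["swallow"] ~= nil)
print(v2, e2, store["propagate"] ~= nil)
```

With the swallowed error, the toy cache stores a negative entry as if the database were empty; with the propagated error, the caller sees the failure and the next lookup retries.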

StarlightIbuki · 2022-11-16T09:24:57Z

Hmm. I was git blame-ing my way around looking for the reason why the upstream negative TTL is hardcoded to 10s, and I found these:

fix(balancer) don't cache an empty upstream name dict #5831

fix(balancer) don't cache an empty upstream name dict #7002

Nice find. Since we have now corrected the handling, we could probably remove that option?

flrgh · 2022-11-16T22:59:30Z

Since we have now corrected the handling, we could probably remove that option?

Interestingly, the hardcoded negative TTL came from #4301. This thread is the explanation I was looking for.

Many of the concerns raised there seem to still be somewhat valid, though I tend to think we should trust that upstreams:each() will always return an error on an unexpected failure.

I think a valuable change would be to detect the case where there are zero upstreams and no error is encountered and add a debug log entry in this case. That way this scenario is at least observable if/when there's a reason to suspect that something isn't working correctly.
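A rough sketch of that idea, under stated assumptions: `each_upstream` and `log` are injected stand-ins for the DB iterator and the logger, the row fields `name` and `id` are assumed, and `rows_iter` is a helper for the example only. None of this is Kong's actual code.

```lua
-- Sketch only: each_upstream and log are injected stand-ins (assumptions),
-- not Kong APIs. The iterator yields `row` on success or `false, err` on
-- failure, which is also an assumption of this example.
local function load_upstreams_dict(each_upstream, log)
  local dict = {}
  local count = 0
  for up, err in each_upstream() do
    if err then
      log("crit", "could not obtain list of upstreams: " .. err)
      return nil, err   -- propagate so the failure is not cached
    end
    dict[up.name] = up.id
    count = count + 1
  end
  if count == 0 then
    -- zero upstreams and no error: make the situation observable
    log("debug", "no upstreams found; caching an empty dict")
  end
  return dict
end

-- Example helper: iterator factory over a fixed list of rows.
local function rows_iter(rows)
  return function()
    local i = 0
    return function()
      i = i + 1
      return rows[i]
    end
  end
end

local messages = {}
local function log(level, msg)
  messages[#messages + 1] = level .. ": " .. msg
end

local dict, err = load_upstreams_dict(rows_iter({}), log)
print(type(dict), err, messages[1])
```

Passing the iterator and logger as arguments keeps the sketch runnable on its own; in a real module they would come from the surrounding code.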

@locao what do you think about all of this?

StarlightIbuki

LGTM, but as I'm committing to this, we need someone else to review it.

flrgh

looks good to me 👍 @locao care to give this a quick review?

locao · 2022-12-09T13:32:55Z

Reviewing this is in my backlog, sorry for the long delay. I'll check it asap.

pull-request-size bot added the size/XS label Jun 17, 2022

github-actions bot added the core/balancer label Jun 17, 2022

kikito requested a review from locao June 20, 2022 20:45

marc-charpentier force-pushed the fix/empty-upstreams-reload-every-10-seconds branch from b759b95 to 17f1b59 Compare June 23, 2022 14:24

StarlightIbuki force-pushed the fix/empty-upstreams-reload-every-10-seconds branch from 17f1b59 to f766517 Compare November 10, 2022 09:11

pull-request-size bot added size/XXL and removed size/XS labels Nov 10, 2022

StarlightIbuki force-pushed the fix/empty-upstreams-reload-every-10-seconds branch from 7476061 to b3eb8cb Compare November 14, 2022 07:02

bungle force-pushed the fix/empty-upstreams-reload-every-10-seconds branch from b3eb8cb to fb30687 Compare November 15, 2022 10:10

flrgh reviewed Nov 15, 2022

spec/02-integration/06-invalidations/04-balancer_cache_correctness_spec.lua

flrgh reviewed Nov 15, 2022

spec/02-integration/06-invalidations/04-balancer_cache_correctness_spec.lua (outdated)

StarlightIbuki force-pushed the fix/empty-upstreams-reload-every-10-seconds branch from 4511db3 to 2012430 Compare November 16, 2022 09:21

ADD-SP force-pushed the fix/empty-upstreams-reload-every-10-seconds branch from 78394e2 to 16374d1 Compare November 18, 2022 05:57

flrgh reviewed Nov 18, 2022

spec/02-integration/06-invalidations/04-balancer_cache_correctness_spec.lua (outdated)

StarlightIbuki force-pushed the fix/empty-upstreams-reload-every-10-seconds branch from d06fcdc to 8480a69 Compare November 22, 2022 03:38

StarlightIbuki self-assigned this Nov 22, 2022

marc-charpentier and others added 7 commits November 22, 2022 15:41

fix(balancer) fix upstreams reload every 10s

f694733

The upstreams module's load_upstreams_dict_into_memory returned a non-cacheable value when the upstreams table was empty, causing the empty table to be reloaded in request context each time the 10s negative TTL expired.

add test

2bd4ff4

fix lint

f24b490

apply suggestion

f67f770

Co-authored-by: Michael Martin <[email protected]>

seems now it's fine to remove the ttl

eb6df4d

apply suggestion

4043915

Co-authored-by: Michael Martin <[email protected]>

apply suggestion

8d56772

Co-authored-by: Michael Martin <[email protected]>

StarlightIbuki force-pushed the fix/empty-upstreams-reload-every-10-seconds branch from 8480a69 to 8d56772 Compare November 22, 2022 07:48

StarlightIbuki requested a review from flrgh November 22, 2022 07:48

StarlightIbuki approved these changes Dec 7, 2022

StarlightIbuki added the pr/please review and pr/ready labels Dec 7, 2022

StarlightIbuki requested a review from a team December 7, 2022 10:45

flrgh approved these changes Dec 9, 2022

StarlightIbuki added and removed the pending author feedback label Dec 14, 2022

hanshuebner merged commit e5d9235 into Kong:master Dec 21, 2022
