Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

fix(balancer) fix upstreams reload every 10s #8974

Conversation

marc-charpentier
Copy link
Contributor

The upstreams module's load_upstreams_dict_into_memory returned
non-cacheable value when upstreams table is empty, causing empty
table reload in request context after 10s negative TTL's expiration.

Summary

See #8970 (comment).

To fix this, empty table may be considered a valid value to cache.

Full changelog

  • [Fix empty upstreams table reload every 10s]

Issue reference

Fix #8970

@kikito kikito requested a review from locao June 20, 2022 20:45
@marc-charpentier marc-charpentier force-pushed the fix/empty-upstreams-reload-every-10-seconds branch from b759b95 to 17f1b59 Compare June 23, 2022 14:24
@marc-charpentier
Copy link
Contributor Author

Hello,
Sorry for last commit come and go, I was wondering if cache invalidation delay caused this unit test to fail.

@marc-charpentier
Copy link
Contributor Author

Hello,
I think I managed to run the failing unit test on my local machine, but the result is the same with both branches (release/2.8.x and fix/empty-upstreams-reload-every-10-seconds). I also added logging instructions on error level in both modified functions in kong/runloop/balancer/upstreams.lua but they didn't appear on unit test run.
I tend to think the proposed fix isn't causing the unit test failure.
Would someone more experienced than me have a look ?
Thanks.

@flrgh
Copy link
Contributor

flrgh commented Jun 24, 2022

Hey @marc-charpentier, sorry for the late check-in on this (we're all a little bit swamped at the moment prepping the next release).

That test failure looks very similar to one we've been tracking internally that is known to be flaky, so I think it's okay for you to disregard for now.

This changeset looks good to me (thanks again for your efforts--much appreciated!). When I have the time I'll probably do a little exploratory work to see if adding a targeted integration test is feasible, though it might not be necessary in the end. Aside from that, just awaiting a second opinion from somebody else on the team.

@marc-charpentier
Copy link
Contributor Author

Hi @flrgh , thank you for your answer, and no problem for the late check-in.
Now all tests passed, although I don't know if someone triggered them again, nor why the previously failing one succeeded this time.

@StarlightIbuki
Copy link
Contributor

StarlightIbuki commented Nov 10, 2022

Difficult to rebase so I'm trying to cherry-pick it. Trying to find some way to test this.
Sorry to mess up with the tags... I did not notice this is targeting 2.8.x.
This fix should also work for the master.

@StarlightIbuki StarlightIbuki force-pushed the fix/empty-upstreams-reload-every-10-seconds branch from 7476061 to b3eb8cb Compare November 14, 2022 07:02
@bungle bungle force-pushed the fix/empty-upstreams-reload-every-10-seconds branch from b3eb8cb to fb30687 Compare November 15, 2022 10:10
@flrgh
Copy link
Contributor

flrgh commented Nov 15, 2022

Hmm. I was git blame-ing my way around looking for the reason why the upstream negative TTL is hardcoded to 10s, and I found these:

So it seems there at least was a reason to not cache an empty table because it was exacerbating another DNS balancer problem. Maybe it's not safe to revert this change? On the other hand, the original bugfix is also over 2 years old now, so it's also semi-plausible that this is not a problem anymore.

CC @javierguerragiraldez


Edit: I am now very tempted to believe that the bug fixed by #5831 might be due to this logic here:

if err then
log(CRIT, "could not obtain list of upstreams: ", err)
return nil
end

This doesn't appear to be correct usage of lua-resty-mlcache. According to the docs, a single nil return value will be treated as negative cache if not accompanied by a second (error) return value:

-- arg1, arg2, and arg3 are arguments forwarded to the callback from the
-- `get()` variadic arguments, like so:
-- cache:get(key, opts, callback, arg1, arg2, arg3)

local function callback(arg1, arg2, arg3)
    -- I/O lookup logic
    -- ...

    -- value: the value to cache (Lua scalar or table)
    -- err: if not `nil`, will abort get(), which will return `value` and `err`
    -- ttl: override ttl for this value
    --      If returned as `ttl >= 0`, it will override the instance
    --      (or option) `ttl` or `neg_ttl`.
    --      If returned as `ttl < 0`, `value` will be returned by get(),
    --      but not cached. This return value will be ignored if not a number.
    return value, err, ttl
end

In light of that I think this change should be accompanied by fixing the return semantics of the function when there's a DB-related error:

    if err then
      log(CRIT, "could not obtain list of upstreams: ", err)
-      return nil
+      return nil, err
    end

@StarlightIbuki StarlightIbuki force-pushed the fix/empty-upstreams-reload-every-10-seconds branch from 4511db3 to 2012430 Compare November 16, 2022 09:21
@StarlightIbuki
Copy link
Contributor

Hmm. I was git blame-ing my way around looking for the reason why the upstream negative TTL is hardcoded to 10s, and I found these:

So it seems there at least was a reason to not cache an empty table because it was exacerbating another DNS balancer problem. Maybe it's not safe to revert this change? On the other hand, the original bugfix is also over 2 years old now, so it's also semi-plausible that this is not a problem anymore.

CC @javierguerragiraldez

Edit: I am now very tempted to believe that the bug fixed by #5831 might be due to this logic here:

if err then
log(CRIT, "could not obtain list of upstreams: ", err)
return nil
end

This doesn't appear to be correct usage of lua-resty-mlcache. According to the docs, a single nil return value will be treated as negative cache if not accompanied by a second (error) return value:

-- arg1, arg2, and arg3 are arguments forwarded to the callback from the
-- `get()` variadic arguments, like so:
-- cache:get(key, opts, callback, arg1, arg2, arg3)

local function callback(arg1, arg2, arg3)
    -- I/O lookup logic
    -- ...

    -- value: the value to cache (Lua scalar or table)
    -- err: if not `nil`, will abort get(), which will return `value` and `err`
    -- ttl: override ttl for this value
    --      If returned as `ttl >= 0`, it will override the instance
    --      (or option) `ttl` or `neg_ttl`.
    --      If returned as `ttl < 0`, `value` will be returned by get(),
    --      but not cached. This return value will be ignored if not a number.
    return value, err, ttl
end

In light of that I think this change should be accompanied by fixing the return semantics of the function when there's a DB-related error:

    if err then
      log(CRIT, "could not obtain list of upstreams: ", err)
-      return nil
+      return nil, err
    end

Nice finding. Since we have now corrected the handling probably we could remove that option?

@flrgh
Copy link
Contributor

flrgh commented Nov 16, 2022

Since we have now corrected the handling probably we could remove that option?

Interestingly, the hardcoded negative TTL came from #4301. This thread is the explanation I was looking for.

Many of the concerns raised there seem to still be somewhat valid, though I tend to think we should trust that upstreams:each() will always return an error on an unexpected failure.

I think a valuable change would be to detect the case where there are zero upstreams and no error is encountered and add a debug log entry in this case. That way this scenario is at least observable if/when there's a reason to suspect that something isn't working correctly.

@locao what do you think about all of this?

@ADD-SP ADD-SP force-pushed the fix/empty-upstreams-reload-every-10-seconds branch from 78394e2 to 16374d1 Compare November 18, 2022 05:57
@StarlightIbuki StarlightIbuki force-pushed the fix/empty-upstreams-reload-every-10-seconds branch from d06fcdc to 8480a69 Compare November 22, 2022 03:38
@StarlightIbuki StarlightIbuki self-assigned this Nov 22, 2022
marc-charpentier and others added 7 commits November 22, 2022 15:41
The upstreams module's load_upstreams_dict_into_memory returned
non-cacheable value when upstreams table is empty, causing empty
table reload in request context after 10s negative TTL's expiration.
Co-authored-by: Michael Martin <[email protected]>
Co-authored-by: Michael Martin <[email protected]>
Co-authored-by: Michael Martin <[email protected]>
@StarlightIbuki StarlightIbuki force-pushed the fix/empty-upstreams-reload-every-10-seconds branch from 8480a69 to 8d56772 Compare November 22, 2022 07:48
@StarlightIbuki StarlightIbuki requested a review from flrgh November 22, 2022 07:48
Copy link
Contributor

@StarlightIbuki StarlightIbuki left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM, but as I'm committing to this, we need someone else to review it.

@StarlightIbuki StarlightIbuki added pr/please review pr/ready This PR is considered ready and can be merged at anytime (given it received no subsequent changes) labels Dec 7, 2022
@StarlightIbuki StarlightIbuki requested a review from a team December 7, 2022 10:45
Copy link
Contributor

@flrgh flrgh left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

looks good to me 👍 @locao care to give this a quick review?

@locao
Copy link
Contributor

locao commented Dec 9, 2022

Reviewing this is in my backlog, sorry for the long delay. I'll check it asap.

@StarlightIbuki StarlightIbuki added pending author feedback Waiting for the issue author to get back to a maintainer with findings, more details, etc... and removed pending author feedback Waiting for the issue author to get back to a maintainer with findings, more details, etc... labels Dec 14, 2022
@hanshuebner hanshuebner merged commit e5d9235 into Kong:master Dec 21, 2022
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
core/balancer pr/ready This PR is considered ready and can be merged at anytime (given it received no subsequent changes) size/M
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Balancer/upstreams: Kong loads upstreams table every 10s in request context when no upstream exist in database
5 participants