Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[backport -> release/2.8.x] fix(dns): Eliminate asynchronous timer in syncQuery() to prevent hang #12041

Merged
merged 8 commits into from
Nov 24, 2023

Conversation

chobits
Copy link
Contributor

@chobits chobits commented Nov 16, 2023

backport #11900 and #11386 to 2.8.x
backport some commits about pr/do not merge label to 2.8.x

Summary

  1. fix(dns): eliminate asynchronous timer in syncQuery() to prevent deadlock risk (fix(dns): Eliminate asynchronous timer in syncQuery() to prevent hang risk #11900)
    * Revert "fix(conf): set default value of `dns_no_sync` to `on` (#11869)"
     
    This reverts commit 3be2513a60b9f5f0a89631ff17c202e6113981c0.

    * fix(dns): introduce the synchronous query in syncQuery() to prevent hang risk

    Originally the first request to `syncQuery()` will trigger an asynchronous timer
    event, which added the risk of thread pool hanging.

    With this patch, cold synchronously DNS query will always happen in the current
    thread if current phase supports yielding.
  1. fix(dns): fix retry and timeout handling (fix(dns): fix retry and timeout handling #11386)
    - Stop retrying in dns/client.lua, let the resolver handle this.  This
       change also makes it possible to disable retries, which previously
       was not possible
     - Be more faithful to the timeouts set by the user.  Previously, the
       timeout configured was used only for the ultimate request sent to
       the DNS server, but asynchronous requests allowed longer timeouts
       which was not transparent.
     - When the DNS server fails, stop trying other query types.  Previously,
       the behavior was such that after an (intermediate) failure to query
       for one record type (say "SRV"), the client would try the next record
       type (say "A") and succeed with that.  It would then return the
       contents of the "A" record even if the "SRV" record pointed to a
       different address.
     - Change domain names used for testing the DNS client into the
       kong-gateway-testing.link zone, which is controlled by the Kong Gateway
       team.

    Fixes https://github.com/Kong/kong/issues/10182
    KAG-2300

Checklist

Full changelog

  • [Implement ...]

Issue reference

Fix #[issue number]

- Stop retrying in dns/client.lua, let the resolver handle this.  This
   change also makes it possible to disable retries, which previously
   was not possible
 - Be more faithful to the timeouts set by the user.  Previously, the
   timeout configured was used only for the ultimate request sent to
   the DNS server, but asynchronous requests allowed longer timeouts
   which was not transparent.
 - When the DNS server fails, stop trying other query types.  Previously,
   the behavior was such that after an (intermediate) failure to query
   for one record type (say "SRV"), the client would try the next record
   type (say "A") and succeed with that.  It would then return the
   contents of the "A" record even if the "SRV" record pointed to a
   different address.
 - Change domain names used for testing the DNS client into the
   kong-gateway-testing.link zone, which is controlled by the Kong Gateway
   team.

Fixes #10182
KAG-2300

Signed-off-by: Xiaochen Wang <[email protected]>
@chobits chobits changed the title Backport 11900 to release/2.8.x [backport -> release/2.8.x] fix(dns): Eliminate asynchronous timer in syncQuery() to prevent hang Nov 16, 2023
@chobits chobits requested review from hanshuebner and dndx November 16, 2023 05:46
@ADD-SP ADD-SP self-requested a review November 16, 2023 06:33
Copy link
Contributor

@ADD-SP ADD-SP left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

CI is read

@chobits chobits force-pushed the backport-11900-to-release/2.8.x branch 2 times, most recently from c348210 to a09b97f Compare November 16, 2023 06:54
@outsinre
Copy link
Contributor

Please double-check this on 2.8.x.x. According to customer reports, 2.8 does not have DNS issue. Pay attention to code base diff.

…adlock risk (#11900)

* fix(dns): introduce the synchronous query in syncQuery() to prevent hang risk

Originally the first request to `syncQuery()` will trigger an asynchronous timer
event, which added the risk of thread pool hanging.

With this patch, cold synchronously DNS query will always happen in the current
thread if current phase supports yielding.

Fix FTI-5348

---------

Co-authored-by: Datong Sun <[email protected]>
@chobits chobits force-pushed the backport-11900-to-release/2.8.x branch from a09b97f to cdc7b0d Compare November 16, 2023 07:04
@chobits chobits requested a review from ADD-SP November 16, 2023 07:16
@chobits
Copy link
Contributor Author

chobits commented Nov 17, 2023

Please double-check this on 2.8.x.x. According to customer reports, 2.8 does not have DNS issue. Pay attention to code base diff.

Talked with datong in person, this is a potential critical problem in 2.8.x, so we need to advance the merging process.

mayocream and others added 6 commits November 24, 2023 14:50
With this patch, CI will notify a Kong Inc internal slack channel on
every PR that performs a schema change.
It seems that if the do not merge label job is skipped then the second
job doesn't run either:
https://github.com/Kong/kong/actions/runs/4307151445/jobs/7511859202

This change splits the job into two and narrows down the events on which
these jobs are triggered since the only meaninful input are the labels
on the PR.
This is a bad practice which could cause merge conflicts and is against our backport policy.
It seems that Github Actions is not running these jobs even once even
though the PRs are labelled at least once. This patch runs these jobs
on other related PR activity.
@github-actions github-actions bot added the chore Not part of the core functionality of kong, but still needed label Nov 24, 2023
@dndx dndx merged commit e1778b9 into release/2.8.x Nov 24, 2023
@dndx dndx deleted the backport-11900-to-release/2.8.x branch November 24, 2023 08:48
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
chore Not part of the core functionality of kong, but still needed size/L
Projects
None yet
Development

Successfully merging this pull request may close these issues.

8 participants