DNS responses are cached #135888

Closed
zhenyavinogradov opened this issue Aug 27, 2021 · 29 comments
Labels
0.kind: bug Something is broken 6.topic: nixos Issues or PRs affecting NixOS modules, or package usability issues specific to NixOS

Comments

@zhenyavinogradov
Contributor

According to #89274, NixOS uses nscd only for dispatching NSS modules, and the caching functionality of nscd is disabled by default. But when I run any application that resolves the same DNS name in a loop on a clean NixOS system, I observe that DNS packets are not sent on each request; they are only sent after the TTL elapses. This means the other requests are served from some local cache. Only when I stop the nscd service do I see packets being sent on each request.

A simple script to reproduce this: while true; do getent ahosts github.com; sleep 1; done

Is there some component other than nscd that does this caching, or does nscd itself need some extra configuration to actually disable caching?
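
A bounded variant of the loop above can make the behaviour easier to compare across runs. This is only a sketch: the `lookup_loop` function and the `LOOKUP` override variable are hypothetical names introduced here (the override exists so the loop can be exercised without network access), and the default command is the same `getent ahosts` used in the report.

```shell
#!/bin/sh
# Assumption: LOOKUP is a hypothetical override hook, not part of any tool.
# By default the loop uses the same command as the reproduction script.
LOOKUP="${LOOKUP:-getent ahosts}"

# lookup_loop HOST [COUNT] [INTERVAL_SECONDS]
# Runs COUNT lookups and prints one timestamped line per lookup.
lookup_loop() {
  host="$1"; count="${2:-5}"; interval="${3:-1}"
  i=1
  while [ "$i" -le "$count" ]; do
    # Keep only the address column of the first result line.
    first="$($LOOKUP "$host" 2>/dev/null | awk 'NR == 1 { print $1 }')"
    printf '%s lookup=%s host=%s addr=%s\n' \
      "$(date +%T)" "$i" "$host" "${first:-none}"
    # Skip the sleep after the last iteration.
    [ "$i" -lt "$count" ] && sleep "$interval"
    i=$((i + 1))
  done
}
```

Running `tcpdump -ni any port 53` in a second terminal while calling `lookup_loop github.com 10 1` should show whether each lookup produces a query on the wire or only the first one within a TTL window.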

cc @flokli

@zhenyavinogradov zhenyavinogradov added the 0.kind: bug Something is broken label Aug 27, 2021
@flokli
Contributor

flokli commented Aug 27, 2021 via email

@zhenyavinogradov
Contributor Author

I'm testing it in a qemu VM with a minimal NixOS config, with no systemd-resolved/unbound:

{ lib, ... }: {
  boot.initrd.availableKernelModules = [ "virtio_net" "virtio_pci" "virtio_mmio" "virtio_blk" "virtio_scsi" ];
  boot.initrd.kernelModules = [ "virtio_balloon" "virtio_console" "virtio_rng" ];
  boot.growPartition = true;
  boot.loader.grub.device = "/dev/vda";
  fileSystems."/".device = "/dev/disk/by-label/nixos";
  fileSystems."/".fsType = "ext4";
  fileSystems."/".autoResize = true;
  services.getty.autologinUser = lib.mkForce "root";
}

/etc/nsswitch.conf:

passwd:    files systemd
group:     files systemd
shadow:    files

hosts:     mymachines files myhostname dns
networks:  files

ethers:    files
services:  files
protocols: files
rpc:       files

/etc/resolv.conf:

# Generated by resolvconf
nameserver 10.0.99.1
options edns0

@flokli
Contributor

flokli commented Aug 27, 2021

cc @arianvp (who did dig into nscd and its caching behaviour)

@veprbl veprbl added the 6.topic: nixos Issues or PRs affecting NixOS modules, or package usability issues specific to NixOS label Aug 28, 2021
@3nprob

3nprob commented Sep 9, 2021

Confirmed.

When changing the nameserver in /etc/resolv.conf, nslookup and dig resolve through the new server, but other applications return a cached result. There are no systemd-resolved or other local DNS services running, and no entries in /etc/hosts.

Restarting nscd yields the new entry.

# echo "nameserver $NS1" > /etc/resolv.conf
# nslookup $HOST
Server:         $NS1
Address:        $NS1#53

Name:   $HOST
Address: $IP1

# getent hosts $HOST
$IP1  $HOST

# echo "nameserver $NS2" > /etc/resolv.conf
# nslookup $HOST
Server:         $NS2
Address:        $NS2#53

Name:   $HOST
Address: $IP2

# getent hosts $HOST
$IP1  $HOST

# grep hosts /etc/nsswitch.conf
hosts:     files mymachines dns myhostname

# grep hosts /etc/nscd.conf
enable-cache            hosts           yes
positive-time-to-live   hosts           0
negative-time-to-live   hosts           0
shared                  hosts           yes

# systemctl restart nscd
# getent hosts $HOST
$IP2  $HOST
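
The manual comparison in the session above (NSS answer via `getent` versus direct DNS answer via `nslookup`/`dig`) can be scripted. The sketch below assumes `dig` is installed and that the name has IPv4 addresses; the function names are hypothetical. A mismatch is only a hint of a stale cache, since round-robin or split-horizon DNS can also produce different answers.

```shell
#!/bin/sh
# Addresses as seen through NSS (served via nscd when it is running).
nss_addrs() { getent ahostsv4 "$1" | awk '{ print $1 }' | sort -u; }

# Addresses from a direct query to the resolver in /etc/resolv.conf.
# The grep drops CNAME targets that `dig +short` prints before addresses.
dns_addrs() { dig +short A "$1" | grep -E '^[0-9]+(\.[0-9]+){3}$' | sort -u; }

# same_addrs A B: succeed when both newline-separated lists hold the same set.
same_addrs() {
  [ "$(printf '%s\n' "$1" | sort -u)" = "$(printf '%s\n' "$2" | sort -u)" ]
}

# check_answers HOST: print "consistent" or "mismatch".
check_answers() {
  if same_addrs "$(nss_addrs "$1")" "$(dns_addrs "$1")"; then
    echo "consistent"
  else
    echo "mismatch"
  fi
}
```

Usage would be `check_answers "$HOST"` after editing /etc/resolv.conf. As an alternative to a full restart, nscd(8) documents `nscd -i hosts` for invalidating the hosts table; whether that is sufficient in the situation described here was not tested in this thread.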

@flokli
Contributor

flokli commented Sep 16, 2021

Thanks for the digging! In that case, one more reason to work on #55276 :-)

flokli added a commit to flokli/nixpkgs that referenced this issue Sep 21, 2021
NSS modules are now globally provided (by providing a `/run/nss-modules`
symlink), similar to how we handle OpenGL drivers.

This removes the need for nscd as a proxy for all NSS requests, and avoids
DNS requests leaking across network namespaces.

While doing this upgrade, existing applications need to be restarted, so
they know how to pick up NSS modules from `/run/nss-modules`.

If you want to defer application restart to a later time, explicitly enable
`nscd` via `services.nscd.enable` until the application restart.

We can mix NSS modules from any version of glibc according to
https://sourceware.org/legacy-ml/libc-help/2016-12/msg00008.html,
so glibc upgrades shouldn't break old userland loading more recent NSS
modules (and most likely, NSS modules are already loaded)

Fixes: NixOS#55276
Fixes: NixOS#135888
Fixes: NixOS#105353
Cc:    NixOS#52411 (comment)
flokli added a commit to flokli/nixpkgs that referenced this issue Sep 22, 2021
NSS modules are now globally provided (by providing a `/run/nss-modules`
symlink).

See the text added to `rl-2111.section.md` for further details.

Fixes: NixOS#55276
Fixes: NixOS#135888
Fixes: NixOS#105353
Cc:    NixOS#52411 (comment)
erikarvstedt added a commit to erikarvstedt/nixpkgs that referenced this issue Oct 18, 2021
NSS modules are now globally provided by a symlink in `/run`.

See the description in `add-extra-module-load-path.patch` for further details.

Fixes: NixOS#55276
Fixes: NixOS#135888
Fixes: NixOS#105353
Cc:    NixOS#52411 (comment)

Co-authored-by: Erik Arvstedt
erikarvstedt added a commit to erikarvstedt/nixpkgs that referenced this issue Oct 24, 2021
NSS modules are now globally provided by a symlink in `/run`.

See the description in `add-extra-module-load-path.patch` for further details.

Fixes: NixOS#55276
Fixes: NixOS#135888
Fixes: NixOS#105353
Cc:    NixOS#52411 (comment)

Co-authored-by: Erik Arvstedt
@ctheune
Contributor

ctheune commented Nov 16, 2021

A quick note, as we have a similar issue: the caching is bad enough that we're affected by negative cache entries with an unknown but longish TTL (at least multiple minutes) and have to restart nscd. We're debugging an annoyance where one upstream DNS server sometimes (but rarely) misbehaves and either responds with a negative answer or times out. I'm a bit suspicious that nscd is caching that timeout as a negative entry (although that doesn't match glibc's timeout fallback behaviour).
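
One way to put a number on the "unknown but longish" negative TTL is to poll a lookup after the upstream problem has cleared and count how long it keeps failing. This is a rough sketch, not something used in this thread: `wait_until_resolves` and the `LOOKUP` override are hypothetical names, and the measurement is only as precise as the polling interval.

```shell
#!/bin/sh
# Assumption: LOOKUP is a hypothetical override hook; the default goes
# through NSS (and therefore through nscd when it is running).
LOOKUP="${LOOKUP:-getent hosts}"

# wait_until_resolves HOST [MAX_TRIES] [INTERVAL_SECONDS]
# Prints how many tries were needed; returns 1 if the name never resolved.
wait_until_resolves() {
  host="$1"; max_tries="${2:-60}"; interval="${3:-5}"
  tries=0
  while [ "$tries" -lt "$max_tries" ]; do
    tries=$((tries + 1))
    if $LOOKUP "$host" >/dev/null 2>&1; then
      echo "resolved after $tries tries"
      return 0
    fi
    sleep "$interval"
  done
  echo "gave up after $tries tries"
  return 1
}
```

For example, `wait_until_resolves some.host 120 5` polls every 5 seconds for up to 10 minutes; multiplying the reported try count by the interval gives an approximate lower bound on how long the negative answer was served.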

@ctheune
Contributor

ctheune commented Nov 16, 2021

I spent some time reading the glibc/nscd code. There was a "recent" change (5e74e6f85842892bc25da8e8c70d8dadd485941a) where the shared cache caused a problem. I'm going to try running with the shared cache disabled...

@ctheune
Contributor

ctheune commented Nov 16, 2021

Enabling and disabling the shared flag did not change anything. However, I noticed that doing a negative lookup actually is not cached, but positive values are (based on our current nscd.conf). I wonder what value it is assuming. The code internally has some defaults like 3600 for positive values. Digging deeper.

@flokli
Contributor

flokli commented Nov 16, 2021

Check nixos/modules/services/system/nscd.conf and the git history around it.

The semantics might have changed recently, though…
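
When comparing behaviour against the configuration, it can help to read the per-service values straight out of the generated file. The helper below is a sketch with a hypothetical name (`nscd_conf_get`); it assumes the three-column `<key> <service> <value>` layout shown in the `grep hosts /etc/nscd.conf` output earlier in this thread and prints every matching line if a setting is repeated.

```shell
#!/bin/sh
# nscd_conf_get FILE KEY SERVICE
# Print the value of a per-service setting from an nscd.conf-style file.
nscd_conf_get() {
  awk -v key="$2" -v svc="$3" '
    /^[[:space:]]*#/ { next }             # skip comment lines
    $1 == key && $2 == svc { print $3 }   # value is the third column
  ' "$1"
}
```

For example, `nscd_conf_get /etc/nscd.conf positive-time-to-live hosts` would print `0` for the configuration quoted above. The file only shows what was requested; `nscd -g` prints the configuration and statistics the daemon itself reports.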

@ctheune
Contributor

ctheune commented Nov 16, 2021

Yeah, I'm aware of that. Any specific thing that you think I'm missing?

I'm a bit worried that we do not have tests for this behaviour and that either the previous change was bogus or glibc changed its behaviour without us noticing.

@flokli
Contributor

flokli commented Nov 16, 2021 via email

@ctheune
Contributor

ctheune commented Nov 17, 2021

Alright. I set up a test case that shows how to reproduce this and I based it on the original commit where nscd caching was supposedly disabled. It doesn't work even back then: f9a5a65801889df5848eff0d90b2edeee0fe390a

I guess next step would be debugging nscd?!? Le sigh.

Anyone got a better idea?

@flokli
Contributor

flokli commented Nov 17, 2021

I've been iterating a bit with @erikarvstedt on how to accomplish #55276 in a non-breaking fashion, which would put nscd out of the loop for most of the requests. This is still WIP though.

@ctheune
Contributor

ctheune commented Nov 17, 2021

Yeah, I've seen that. We're seeing some relevant breakage and I need to come up with a short term fix, though.

@flokli
Contributor

flokli commented Nov 17, 2021 via email

@ctheune
Contributor

ctheune commented Nov 18, 2021

Yeah, unfortunately we just started using the container integration with mymachines ... perfect timing ;)

@ctheune
Contributor

ctheune commented Nov 18, 2021

I did some more digging and the whole dance of how nscd works with timeouts and pruning the cache just seems off. I found a piece of code in the cache_add method that handles the case of running out of memory and just ignores things added to the cache. I'm currently running a test (waiting for a large system rebuild) that patches this method to always choose the 'ignore the new cache entry' path so nothing will ever be added to the cache. That should solve this reasonably well for now.

@ctheune
Contributor

ctheune commented Nov 18, 2021

Ok, so here's a patch to glibc that simply deactivates the cache function in nscd completely. We likely would not want to ship it this way to the general userbase, but I could run it this way on our platform. If there's interest in using this upstream until your work on ripping nscd out is done, we could add this as a configuration option to nscd or so.

@ctheune
Contributor

ctheune commented Nov 18, 2021

Here's the commit, I forgot that this is in a separate repo and won't be picked up through the issue id references:
flyingcircusio@39abefe

@tomfitzhenry
Contributor

tomfitzhenry commented Dec 5, 2021

https://udrepper.livejournal.com/16362.html suggests:

> For all getaddrinfo lookups the TTL value from DNS replies takes precedence over the TTL value from /etc/nscd.conf. The latter is used for services which do not provide a TTL themselves (today all other services).

i.e. perhaps modifying the positive-time-to-live has had no impact because that value is overridden by the DNS TTL.
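
The precedence rule quoted above can be written down as a tiny function. This is only an illustration of the quoted rule, not nscd's actual code; `effective_ttl` is a hypothetical name, and an empty second argument stands for "the service did not supply a TTL".

```shell
#!/bin/sh
# effective_ttl CONF_TTL [REPLY_TTL]
# Returns the TTL a cache following the quoted rule would apply:
# the TTL from the DNS reply if there is one, otherwise the configured one.
effective_ttl() {
  conf_ttl="$1"; reply_ttl="${2:-}"
  if [ -n "$reply_ttl" ]; then
    echo "$reply_ttl"
  else
    echo "$conf_ttl"
  fi
}
```

Under this reading, `effective_ttl 0 60` yields the reply's 60 seconds even though `positive-time-to-live hosts 0` is configured, which would be consistent with the observation that packets are only sent after the DNS TTL elapses.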

As an aside, the default reload-count of 5 means each DNS query will be queried 5 times. I was surprised to see this query amplification.

@ctheune
Contributor

ctheune commented Dec 5, 2021

From my perspective (I'm interested in the negative-time-to-live case) this looks like it doesn't differentiate between SERVFAIL and NXDOMAIN.
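
To see which of the two cases an upstream server actually returned, the response code can be read from the header line of regular (non-`+short`) `dig` output. The sketch below uses hypothetical function names, and the classification expresses the distinction one might want a cache to make; it does not describe what nscd does.

```shell
#!/bin/sh
# dig_status: read full `dig` output on stdin and print the response code
# from the ";; ->>HEADER<<- ... status: X, ..." line (e.g. NXDOMAIN).
dig_status() {
  sed -n 's/.*status: \([A-Z]*\),.*/\1/p' | head -n 1
}

# classify_negative RCODE
# NXDOMAIN is a definite "name does not exist"; SERVFAIL, REFUSED, or no
# header at all (e.g. a timeout) mean the lookup itself failed.
classify_negative() {
  case "$1" in
    NXDOMAIN) echo "does-not-exist" ;;
    SERVFAIL | REFUSED | "") echo "lookup-failed" ;;
    *) echo "other" ;;
  esac
}
```

For example, `classify_negative "$(dig some.host | dig_status)"` prints `lookup-failed` when the server answered SERVFAIL or did not answer at all, which is the case that arguably should not be cached as a negative entry.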

erikarvstedt added a commit to erikarvstedt/nixpkgs that referenced this issue Jan 19, 2022
NSS modules are now globally provided by a symlink in `/run`.

See the description in `add-extra-module-load-path.patch` for further details.

Fixes: NixOS#55276
Fixes: NixOS#135888
Fixes: NixOS#105353
Cc:    NixOS#52411 (comment)

Co-authored-by: Erik Arvstedt
erikarvstedt added a commit to erikarvstedt/nixpkgs that referenced this issue Jan 19, 2022
NSS modules are now globally provided by a symlink in `/run`.

See the description in `add-extra-module-load-path.patch` for further details.

Fixes: NixOS#55276
Fixes: NixOS#135888
Fixes: NixOS#105353
Cc:    NixOS#52411 (comment)

Co-authored-by: Erik Arvstedt
erikarvstedt pushed a commit to erikarvstedt/nixpkgs that referenced this issue Jan 19, 2022
NSS modules are now globally provided (by providing a `/run/nss-modules`
symlink).

See the text added to `rl-2111.section.md` for further details.

Fixes: NixOS#55276
Fixes: NixOS#135888
Fixes: NixOS#105353
Cc:    NixOS#52411 (comment)
@stale stale bot added the 2.status: stale https://github.com/NixOS/nixpkgs/blob/master/.github/STALE-BOT.md label Jun 19, 2022
@3nprob

3nprob commented Aug 16, 2022

@Stale not stale

@stale stale bot removed the 2.status: stale https://github.com/NixOS/nixpkgs/blob/master/.github/STALE-BOT.md label Aug 16, 2022
@flokli
Contributor

flokli commented Oct 13, 2022

Me and @NinjaTrappeur took a closer look at the nscd protocol and codebase. It's not really possible to run it in a pure "no caching" mode.

However, we found a good replacement: nsncd.
It's written in Rust and was only missing host-related lookups to be usable for NixOS, or other distros running nix-built binaries. We sent PRs to add this upstream; they are awaiting review.

Try the following snippet to switch your system(s) to a version containing all the PRs:

https://gist.github.com/flokli/b1b0a1d2c0b7ba6e73101e1447812114

I hope this gets included soon upstream.

On top of #194916, I also added an integration test for nsncd to nixosTests.nscd, see my nixpkgs nsncd branch for details. It validates dynamic user lookup, host lookups via glibc-internal nss modules and external nss modules work, like they do with nscd.

It would be nice if more people could test this!

@picnoir
Member

picnoir commented Oct 14, 2022

I've been running this patch today. So far I did not hit any bug. That being said, I don't use any dodgy/segfaulting NSS modules.

The post-boot/resume firefox name resolution issues I was experiencing are gone.

@flokli
Contributor

flokli commented Oct 20, 2022

I sent a proper PR to nixpkgs, see #196917.

@fgaz
Member

fgaz commented Dec 4, 2022

> The post-boot/resume firefox name resolution issues I was experiencing are gone.

Enabling nsncd through the new option fixed it for me too! I hated that issue...

@flokli
Contributor

flokli commented Dec 5, 2022

Good to hear! Let's see if we hear any negative reports, otherwise we should probably default to this after some more testing…

@Tungsten842
Member

I think that this issue can be closed, nsncd is now used by default: #214153

@flokli
Contributor

flokli commented Feb 2, 2023

Yes, thanks for the ping.

@flokli flokli closed this as completed Feb 2, 2023