Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add dns_cache for server addresses as in pgbouncer #249

Merged
merged 3 commits into from
May 2, 2023

Conversation

magec
Copy link
Collaborator

@magec magec commented Dec 7, 2022

Hello!

We are considering using PgCat as an alternative to PgBouncer in our infrastructure. One thing that we noticed is that it lacks support for dns cache, a thing that we rely on.

In Pgbouncer (from its docu):

[...] Host names are resolved at connection time, the result is cached per dns_max_ttl parameter. When a host name’s resolution changes, existing server connections are automatically closed when they are released (according to the pooling mode), and new server connections immediately use the new resolution. If DNS returns several results, they are used in a round-robin manner [...].

Unfortunately this is not the case in PgCat so we decided to implement it ourselves.

This change:

  • Adds two new general config parameters to enable/disable dns_cache feature and configure its 'refresh time'.
  • Adds a new global Lazy evaluated CACHED_RESOLVER
  • Adds some code to initialize CACHED_RESOLVER
  • Adds a new field to Server struct that saves the value of the address resolution when the connection is first established.
  • Modifies the 'is_bad' method of Server to check whether the underlying address has changed and if so, returns true.

TODO:

  • Reloading dns config is not (yet) possible.
  • DNS zone serial check to simplify (and optimize dns cache validation).
  • Fill up stats that have to do with this feature, currently they are set to zero.
res.put(data_row(&vec!["dns_names".to_string(), "0".to_string()]));
res.put(data_row(&vec!["dns_zones".to_string(), "0".to_string()]));
res.put(data_row(&vec!["dns_queries".to_string(), "0".to_string()]));
res.put(data_row(&vec!["dns_pending".to_string(), "0".to_string()]));

@magec magec changed the title Add dns_cache so server addresses are cached and invalidated when DNS… Add dns_cache for server addresses as in pgbouncer Dec 7, 2022
@levkk
Copy link
Contributor

levkk commented Dec 8, 2022

Hey there,

Thanks for the PR, I really like the way it's implemented. I've seen that feature in PgBouncer, but I've never used it personally or really understood why it's there. Could you explain to me the use case for it? It seems like it's overriding the TTL on the DNS records.

@magec
Copy link
Collaborator Author

magec commented Dec 8, 2022

We use DNS as our single source of truth for database servers. Our services use server names instead of Ip addresses for connecting to databases. This way, when we do cutovers of replicas (because of hw maintenance, etc), we simply have to update the server record, in this situation, pgbouncer detects the change and starts sending traffic to the new server. Apart from that, we also have services that connect directly to the database, and usually, when they loose connection to Db, they reconnect (so they get the new ip). We want this feature because it simplifies the instrumentation needed when dealing with our DB infrastructure. As we have 16 shards plus 1 common db, each one with a (at least) 1 read only replica, hw maintenance and cutovers are frequent. That said, I saw in the code (doc is outated BTW) that pool conf can be changed without restarting, that could be used to do cutovers, but it is a bit more impactful because the entire pool needs to be reconnected, also the dns feature ensures "compatibility" with the rest of the services that do not use db poolers for connecting to databases. I actually saw the possibility of reloading pools on the fly, while reading the code for implementing this feature. Still, I found this "complementary" to "pool reloading" and fault tolerance features.

@levkk
Copy link
Contributor

levkk commented Dec 8, 2022

Thank you for explaining your use case to me, I appreciate it!

Based on my understanding, you don't need a DNS cache feature. We already support using DNS in the config, for example localhost here is a DNS entry in /etc/hosts, but it can be any other value, e.g. my-db.amazonaws.com

You can change the A or CNAME value of those records in your DNS server and PgCat will use those new values when both of these things are true:

  1. The TTL has expired
  2. A new server connection is created

PgCat uses tokio::net::TcpStream to connect to databases, so whatever that interface will do, PgCat will as well, including correct DNS resolution.

It's my understanding that PgBouncer DNS cache is used to override whatever the TTL is on the DNS records that are coming from the DNS server. I think it helps in cases when the DNS server is temporarily unavailable (broken) or is reporting incorrect TTLs for some reason. This certainly happens in cloud environments and on prem, for sure, e.g. Route53 has intermittent outages sometimes, but they are exceedingly rare.

So overall, I'm still not sure why you'd want this feature.

Thanks!

@magec
Copy link
Collaborator Author

magec commented Dec 8, 2022

As you mention, without this feature, new connections to servers (provided TTLs has expired), will end up in the new server, but the whole point of the feature is not new connections but current ones.

From PGbouncer doc:

When a host name’s resolution changes, existing server connections are automatically closed when they are released (according to the pooling mode), and new server connections immediately use the new resolution.

And this is the whole point, by just changing the DNS record, you have a way of redirecting traffic to a new server (usually a replica) in a zero disruptive manner. Old connections will be freed as the transactions/sessions finishes and new transactions/sessions will use the new ip.

Maybe the naming here is a bit misleading I use dns_max_ttl because that's the name pgbouncer chose, but is more of a "refresh time" for checking "when a host name's resolution changes".

@levkk
Copy link
Contributor

levkk commented Dec 8, 2022

but is more of a "refresh time" for checking "when a host name's resolution changes".

That makes way more sense to me now.

And this is the whole point, by just changing the DNS record, you have a way of redirecting traffic to a new server (usually a replica) in a zero disruptive manner. Old connections will be freed as the transactions/sessions finishes and new transactions/sessions will use the new ip.

Very interesting approach. I've never used DNS changes to trigger a failover like that. The way I always did it in the past was by changing the config to point to new DNS records for the new primary/replica - this way, I control the exact timing of the failover and can immediately check it's effect, without worrying about DNS cache & propagation. Since config reloads are zero downtime, this was a good enough approach for me.

I've mostly used Pgb in AWS with RDS, where we had no control over the DNS of the databases, so I think that's the main reason I never had to deal with this before.

Please allow me a few days to think about your proposal. Since I've never used this PgBouncer feature and I need to consider it. At this moment, I'm thinking this is a good feature to have, but perhaps it should be better named. One issue I've always had with Pgb documentation and features is they were powerful yet sometimes had unforeseen interactions and consequences, for example: pgbouncer/pgbouncer#647

My goal with pgcat is to have the best way to run Postgres at scale, implemented in code. So while having many configuration options and features is often seen as sign of maturity and completeness of a project, they can lead to user confusion and production incidents, when used incorrectly.

@magec magec force-pushed the cached-resolver branch 3 times, most recently from 96cf5c5 to dfe1d0e Compare January 23, 2023 13:11
@magec magec force-pushed the cached-resolver branch 2 times, most recently from a8b3ad8 to 7a2d5e4 Compare February 2, 2023 12:09
@magec magec force-pushed the cached-resolver branch 2 times, most recently from 5026387 to 99871c6 Compare February 23, 2023 09:25
@magec magec force-pushed the cached-resolver branch from 99871c6 to ab932a9 Compare March 13, 2023 09:14
@magec magec force-pushed the cached-resolver branch from ab932a9 to 87a0d0f Compare March 28, 2023 15:45
Copy link
Contributor

@levkk levkk left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM, just make sure CI passes before merging.

@magec magec force-pushed the cached-resolver branch 2 times, most recently from 3e0fbbd to 9a4eac1 Compare April 24, 2023 10:34
magec added 2 commits May 2, 2023 09:56
… changes.

Adds a module to deal with dns_cache feature. It's
main struct is CachedResolver, which is a simple thread safe
hostname <-> Ips cache with the ability to refresh resolutions
every `dns_max_ttl` seconds. This way, a client can check whether its
ip address has changed.
@magec magec force-pushed the cached-resolver branch from 9a4eac1 to ac39ea0 Compare May 2, 2023 07:57
@magec magec merged commit 7dfbd99 into postgresml:main May 2, 2023
@nicolasvan nicolasvan deleted the cached-resolver branch March 18, 2024 12:52
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants