CASSGO-5 `system.peers` queried even if `DisableInitialHostLookup` set to `true` #1665

leesio · 2022-11-17T11:35:54Z

What version of Cassandra are you using?

We're actually using AWS Keyspaces

What version of Gocql are you using?

The issue exists in 1.2.1

What version of Go are you using?

1.19

What did you do?

Set the DisableInitialHostLookup to true

What did you expect to see?

The driver would never query system.peers for host information

What did you see instead?

On heartbeat failure, the driver tries to query system.peers.

Hi there,

Apologies in advance if this has been answered before - I searched through the issues and couldn't see anything related.

The docs for the DisableInitialHostLookup config flag state:

If DisableInitialHostLookup then the driver will not attempt to get host info from the system.peers table, this will mean that the driver will connect to hosts supplied and will not attempt to lookup the hosts information

It's true that the library avoids querying system.peers on session initialization. However, in the event of a heartbeat failure on the control connection, we call (*controlConn).reconnect with the refreshring argument set to true, which (as far as I can tell) will ultimately query system.peers.

It seems to me that this is true to the name of the config flag DisableInitialHostLookup but is not quite what the documentation suggests. I'm interested to know the maintainers' view on this and whether they'd consider a patch to change the reconnection logic to consider the DisableInitialHostLookup configuration. Perhaps something like this:

reconn:
	// try to connect a bit faster
	sleepTime = 1 * time.Second
	refreshRing := !c.session.cfg.DisableInitialHostLookup
	c.reconnect(refreshRing)
	continue
}

Why do we care?

A little bit more context to explain why this matters to us.

We're using gocql to speak to AWS Keyspaces and, for various reasons, we don't want to grant the driver access to the system.peers table. We set the DisableInitialHostLookup flag to true, but it seems that if we get a heartbeat timeout on a control connection, we'll try to refresh the ring and immediately run into this problem.


martin-sucha · 2022-11-18T13:21:52Z

Thanks for raising the issue. It seems the behavior has been like this since commit 83932d6, which introduced the option. It seems reasonable to me to treat this as a bug since the documentation says that the driver won't try to discover the hosts, although DisableHostLookup would indeed be a better name for a config option like that. It is also not clear to me how users use the DisableInitialHostLookup option, or whether always disabling the lookup could break someone's program.

Currently, gocql queries system.peers in these cases:

During Session.init if DisableInitialHostLookup is false.
When reconnecting control connection.
When calling Session.AwaitSchemaAgreement explicitly.
When a schema change query successfully executes.
When a keyspace change event is received.
When a node is added or removed from the cluster. (Session.addNewNode, Session.handleNewNode, Session.handleRemovedNode)

We would need to disable it in all these cases, not just during reconnection.
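One way to picture "disable it in all these cases" is a single gate that every lookup site passes through. The following is only a sketch under stated assumptions: `Config`, `lookupReason`, and `refreshRingIfAllowed` are hypothetical names invented for illustration and are not part of gocql; only the `DisableInitialHostLookup` field name comes from the thread.

```go
package main

import "fmt"

// Config is a hypothetical stand-in that mirrors only the flag discussed here.
type Config struct {
	DisableInitialHostLookup bool
}

// lookupReason labels the situations listed above (hypothetical type).
type lookupReason string

const (
	reasonSessionInit      lookupReason = "session init"
	reasonControlReconnect lookupReason = "control connection reconnect"
	reasonSchemaChange     lookupReason = "schema change"
	reasonTopologyChange   lookupReason = "node added or removed"
)

// refreshRingIfAllowed runs query only when host lookup is enabled.
// It reports whether the query was attempted and wraps any query error.
func refreshRingIfAllowed(cfg Config, reason lookupReason, query func() error) (bool, error) {
	if cfg.DisableInitialHostLookup {
		return false, nil // never touch system.peers
	}
	if err := query(); err != nil {
		return true, fmt.Errorf("host lookup during %s: %w", reason, err)
	}
	return true, nil
}

func main() {
	queries := 0
	query := func() error { queries++; return nil } // stands in for the system.peers query

	cfg := Config{DisableInitialHostLookup: true}
	reasons := []lookupReason{reasonSessionInit, reasonControlReconnect, reasonSchemaChange, reasonTopologyChange}
	for _, r := range reasons {
		ran, _ := refreshRingIfAllowed(cfg, r, query)
		fmt.Printf("%s: queried=%v\n", r, ran)
	}
	fmt.Println("system.peers queries:", queries)
}
```

Routing every call site through one helper would keep the flag's meaning consistent, though whether that matches how the driver is actually structured is an open question here.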

Could you elaborate on the reasons why you don't want to grant access to the system.peers table? Note that features like token-aware routing don't work if the driver does not know the datacenter/rack for nodes, as it can't build the ring topology. So far I've only noticed users using DisableInitialHostLookup to speed up session initialization, for example in cases when only a single query is executed. Also, do you allow access to the system.local table?

Does the issue affect executing queries for you or the only effect is that some error messages appear in logs?

leesio · 2022-11-18T14:02:09Z

Thanks for getting back to me 🙇

We would need to disable it in all these cases, not just during reconnection.

Yes, that's a good point. We don't use those particular features so I neglected to include them in my issue, but of course a complete fix would need to address those call sites too.

Could you elaborate on the reasons why you don't want to grant access to system.peers table?

Yeah sure. It's primarily to do with a quirk in AWS Keyspaces rather than us specifically wanting to deny access. We're using VPC endpoints, and a query to system.peers seems to ultimately call through to some AWS APIs which have fairly aggressively low rate limits.

Note that features like token-aware routing don't work if the driver does not know the datacenter/rack for nodes as it can't build the ring topology.

Yeah, understood. I don't think that's such a problem with Keyspaces since the ring is abstracted away - we get a number of fixed endpoints which don't ever change!

Does the issue affect executing queries for you or the only effect is that some error messages appear in logs?

I should have mentioned this in the initial issue 🤦🏻‍♂️

It does impact query execution. If a heartbeat fails, we try to query system.peers, which returns 0 rows. This effectively empties the ring, and all query executions (understandably) result in a "no hosts available in the pool" error.

Also, do you allow access to system.local table?

That's a good question. We don't disallow access but I'm not sure off the top of my head what AWS returns from system.local

For what it's worth, I recognise that this issue is pretty specific to the AWS service - which is why I was reticent to call it a bug in the first instance 😅

Update: After a bit more debugging, I'm now less convinced that this is what caused our issues.

Update 2: I think I've found the specific scenario that led to the "no hosts in pool" error, it's a bit of a doozy - sharing in case it's of any interest!

Specific issue outline
Let's say we have 3 hosts: 10.0.0.1, 10.0.0.2, 10.0.0.3

We're having intermittent network problems to 10.0.0.1, so we decide to add it to a blocklist which drives a HostFilter.
This creates a new session which spawns a new control connection. (*controlConn).connect shuffles the unfiltered host list then dials the first one.
By chance, we end up creating our control connection to 10.0.0.1 - the connection with intermittent problems
We miss a heartbeat and hit the reconnect loop
We successfully connect to the same host again (again, issues are intermittent)
We refresh the ring
First we get the hosts. The first host in the ring is the host we have our control connection to, we then add all of the hosts from system.peers (which in our case is none!)
Then we remove the filtered hosts
Since the only host in our host list is filtered, we end up with no hosts
Every query receives Error handling request: gocql: no hosts available in the pool
Since we've wiped out our hosts list we have no chance of reconnecting to a different host
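The sequence above can be modelled in a few lines. This is a simplified sketch, not gocql's code: `refreshRing` and its parameters are hypothetical, and the model only captures "control host plus system.peers rows, then apply the filter".

```go
package main

import "fmt"

// refreshRing models the refresh described above: start with the host the
// control connection is attached to, append the rows from system.peers,
// then drop every host the filter rejects.
func refreshRing(controlHost string, peers []string, accept func(string) bool) []string {
	candidates := append([]string{controlHost}, peers...)
	var ring []string
	for _, h := range candidates {
		if accept(h) {
			ring = append(ring, h)
		}
	}
	return ring
}

func main() {
	// 10.0.0.1 is on the blocklist that drives the host filter.
	blocked := map[string]bool{"10.0.0.1": true}
	accept := func(h string) bool { return !blocked[h] }

	// The control connection landed on the blocked host, and system.peers
	// returned no rows (the Keyspaces case from this thread).
	ring := refreshRing("10.0.0.1", nil, accept)
	fmt.Println("hosts after refresh:", len(ring)) // prints "hosts after refresh: 0"
}
```

With the control connection on any unblocked host, the same refresh would leave exactly one host in the ring, which is why the failure only shows up by chance.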

So it seems like this problem was caused by the fact that:

We always refresh the ring, even if DisableInitialHostLookup is set
and
We randomly choose a control connection from the unfiltered host list
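For the second cause, one possible mitigation is to apply the host filter before shuffling the control-connection candidates. The sketch below assumes a hypothetical `controlCandidates` helper and a plain `func(string) bool` filter; it does not reflect how gocql currently selects the control host.

```go
package main

import (
	"fmt"
	"math/rand"
)

// controlCandidates returns the hosts accepted by the filter in random order,
// so a blocklisted host is never dialed for the control connection.
func controlCandidates(hosts []string, accept func(string) bool) []string {
	out := make([]string, 0, len(hosts))
	for _, h := range hosts {
		if accept(h) {
			out = append(out, h)
		}
	}
	rand.Shuffle(len(out), func(i, j int) { out[i], out[j] = out[j], out[i] })
	return out
}

func main() {
	hosts := []string{"10.0.0.1", "10.0.0.2", "10.0.0.3"}
	blocked := map[string]bool{"10.0.0.1": true}
	accept := func(h string) bool { return !blocked[h] }

	// Only 10.0.0.2 and 10.0.0.3 can appear, in either order.
	fmt.Println(controlCandidates(hosts, accept))
}
```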

RostislavPorohnya · 2024-06-12T10:35:59Z

I would like to work on this issue

leesio changed the title ~~system.peers queried event if DisableInitialHostLookup set to true~~ system.peers queried even if DisableInitialHostLookup set to true Nov 17, 2022

This was referenced Aug 1, 2024

Fix disableHostLookup flag logic #1789

Open

CASSGO-5 Fix DisableInitialHostLookup flag ignored when querying system.peers #1790

Open

joao-r-reis changed the title ~~system.peers queried even if DisableInitialHostLookup set to true~~ CASSGO-5 system.peers queried even if DisableInitialHostLookup set to true Oct 30, 2024
