CASSGO-5 system.peers queried even if DisableInitialHostLookup set to true #1665

Open
leesio opened this issue Nov 17, 2022 · 3 comments

leesio commented Nov 17, 2022

What version of Cassandra are you using?

We're actually using AWS Keyspaces

What version of Gocql are you using?

The issue exists in 1.2.1

What version of Go are you using?

1.19

What did you do?

Set the DisableInitialHostLookup to true

What did you expect to see?

The driver would never query system.peers for host information

What did you see instead?

On heartbeat failure, the driver tries to query system.peers.


Hi there,

Apologies in advance if this has been answered before - I searched through the issues and couldn't see anything related.

The docs for the DisableInitialHostLookup config flag state:

If DisableInitialHostLookup then the driver will not attempt to get host info from the system.peers table, this will mean that the driver will connect to hosts supplied and will not attempt to lookup the hosts information

It's true that the library avoids querying system.peers on session initialization. However, when a heartbeat fails on the control connection, we call (*controlConn).reconnect with the refreshring arg set to true, which (as far as I can tell) ultimately queries system.peers.

It seems to me that this is true to the name of the config flag DisableInitialHostLookup but is not quite what the documentation suggests. I'm interested to know the maintainers' view on this and whether they'd consider a patch to change the reconnection logic to consider the DisableInitialHostLookup configuration. Perhaps something like this:

reconn:
	// try to connect a bit faster
	sleepTime = 1 * time.Second
	refreshRing := !c.session.cfg.DisableInitialHostLookup
	c.reconnect(refreshRing)
	continue
}
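To make the proposal above concrete, here is a minimal, self-contained sketch of the gating logic. It does not use gocql itself: the config and controlConn types, the ringRefreshes counter, and the onHeartbeatFailure method are hypothetical stand-ins, and reconnect only counts how often a ring refresh (and therefore a system.peers query) would be triggered.

```go
package main

import "fmt"

// config mirrors only the one ClusterConfig field discussed in this issue.
type config struct {
	DisableInitialHostLookup bool
}

// controlConn is a hypothetical stand-in for gocql's control connection.
type controlConn struct {
	cfg           config
	ringRefreshes int // how many times a ring refresh was requested
}

// reconnect models (*controlConn).reconnect: when refreshRing is true the
// real driver would reload the host list, which queries system.peers.
func (c *controlConn) reconnect(refreshRing bool) {
	if refreshRing {
		c.ringRefreshes++
	}
}

// onHeartbeatFailure applies the proposed rule: refresh the ring only when
// host lookup has not been disabled in the configuration.
func (c *controlConn) onHeartbeatFailure() {
	refreshRing := !c.cfg.DisableInitialHostLookup
	c.reconnect(refreshRing)
}

func main() {
	for _, disabled := range []bool{false, true} {
		c := &controlConn{cfg: config{DisableInitialHostLookup: disabled}}
		c.onHeartbeatFailure()
		fmt.Printf("DisableInitialHostLookup=%v ringRefreshes=%d\n", disabled, c.ringRefreshes)
	}
}
```

With the flag set, the heartbeat failure path reconnects without requesting a ring refresh, so nothing on that path would need to touch system.peers.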

Why do we care?

A little bit more context to explain why this matters to us.

We're using gocql to speak to AWS Keyspaces and, for various reasons, we don't want to grant the driver access to the system.peers table. We set the DisableInitialHostLookup flag to true, but it seems that if we get a heartbeat timeout on a control connection, we'll try to refresh the ring and immediately run into this problem.

@leesio leesio changed the title system.peers queried event if DisableInitialHostLookup set to true system.peers queried even if DisableInitialHostLookup set to true Nov 17, 2022
@martin-sucha
Contributor

Thanks for raising the issue. It seems the behavior has been like this since commit 83932d6, which introduced the option. It seems reasonable to me to treat this as a bug, since the documentation says that the driver won't try to discover the hosts, although DisableHostLookup would indeed be a better name for a config option like that. It is also not clear to me how users use the DisableInitialHostLookup option, or whether always disabling the lookup could break someone's program.

Currently, gocql queries system.peers in these cases:

We would need to disable it in all these cases, not just during reconnection.

Could you elaborate on the reasons why you don't want to grant access to the system.peers table? Note that features like token-aware routing don't work if the driver does not know the datacenter/rack for nodes, as it can't build the ring topology. So far I've only noticed users using DisableInitialHostLookup to speed up session initialization, for example in cases when only a single query is executed. Also, do you allow access to the system.local table?

Does the issue affect executing queries for you, or is the only effect that some error messages appear in logs?

@leesio
Author

leesio commented Nov 18, 2022

Thanks for getting back to me 🙇

We would need to disable it in all these cases, not just during reconnection.

Yes, that's a good point. We don't use those particular features, so I neglected to include them in my issue, but of course a complete fix would need to address those call sites too.

Could you elaborate on the reasons why you don't want to grant access to system.peers table?

Yeah, sure. It's primarily to do with a quirk in AWS Keyspaces rather than us specifically wanting to deny access. We're using VPC endpoints, and a query to system.peers seems to ultimately call through to some AWS APIs which have fairly aggressively low rate limits.

Note that features like token-aware routing don't work if the driver does not know the datacenter/rack for nodes as it can't build the ring topology.

Yeah, understood. I don't think that's such a problem with Keyspaces since the ring is abstracted away - we get a number of fixed endpoints which don't ever change!

Does the issue affect executing queries for you or the only effect is that some error messages appear in logs?

I should have mentioned this in the initial issue 🤦🏻‍♂️

It does impact query execution: if a heartbeat fails, we try to query system.peers, which returns 0 rows. This effectively empties the ring, and all query executions (understandably) result in a "no hosts available in the pool" error.

Also, do you allow access to system.local table?

That's a good question. We don't disallow access, but I'm not sure off the top of my head what AWS returns from system.local.

For what it's worth, I recognise that this issue is pretty specific to the AWS service, which is why I was reluctant to call it a bug in the first instance 😅

Update: After a bit more debugging, I'm now less convinced that this is what caused our issues.

Update 2: I think I've found the specific scenario that led to the "no hosts in pool" error, it's a bit of a doozy - sharing in case it's of any interest!

Specific issue outline
Let's say we have 3 hosts: 10.0.0.1, 10.0.0.2, 10.0.0.3

  • We're having intermittent network problems to 10.0.0.1, so we decide to add it to a blocklist which drives a HostFilter.
  • This creates a new session which spawns a new control connection. (*controlConn).connect shuffles the unfiltered host list then dials the first one.
  • By chance, we end up creating our control connection to 10.0.0.1 - the connection with intermittent problems
  • We miss a heartbeat and hit the reconnect loop
  • We successfully connect to the same host again (again, issues are intermittent)
  • We refresh the ring
  • First we get the hosts. The first host in the ring is the host we have our control connection to; we then add all of the hosts from system.peers (which in our case is none!)
  • Then we remove the filtered hosts
  • Since the only host in our host list is filtered, we end up with no hosts
  • Every query receives Error handling request: gocql: no hosts available in the pool
  • Since we've wiped out our hosts list we have no chance of reconnecting to a different host
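The sequence above can be reproduced in a few lines without the driver. This is only a sketch of the described behaviour, not gocql's actual code: refreshRing and its parameters are hypothetical names, the host list starts with the control connection's host, the peers slice stands in for the rows returned by system.peers, and accept plays the role of the HostFilter.

```go
package main

import "fmt"

// refreshRing models the described sequence: start from the control
// connection's host, append whatever system.peers returned, then drop
// every host the filter rejects.
func refreshRing(controlHost string, peers []string, accept func(string) bool) []string {
	candidates := append([]string{controlHost}, peers...)
	var ring []string
	for _, h := range candidates {
		if accept(h) {
			ring = append(ring, h)
		}
	}
	return ring
}

func main() {
	blocked := map[string]bool{"10.0.0.1": true}
	accept := func(h string) bool { return !blocked[h] }

	// Case from the issue: system.peers yields no rows and the control
	// connection landed on the blocked host, so the ring ends up empty.
	fmt.Println(len(refreshRing("10.0.0.1", nil, accept))) // prints 0

	// Same filter, but system.peers returns the other two hosts.
	fmt.Println(refreshRing("10.0.0.1", []string{"10.0.0.2", "10.0.0.3"}, accept)) // prints [10.0.0.2 10.0.0.3]
}
```

The empty result in the first case corresponds to the point where every query starts failing with "no hosts available in the pool".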

So it seems like this problem was caused by the fact that:

  • We always refresh the ring, even if DisableInitialHostLookup is set
    and
  • We randomly choose a control connection from the unfiltered host list
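For the second cause, one possible mitigation is to apply the filter before shuffling the candidate hosts for the control connection. The sketch below is an assumption about how that could look, not what gocql does today: controlCandidates, its parameters, and the seed argument are hypothetical and exist only to keep the example deterministic and self-contained.

```go
package main

import (
	"fmt"
	"math/rand"
)

// controlCandidates returns the hosts a control connection may dial:
// it filters first and shuffles afterwards, so a blocked host can never
// be picked. Hypothetical helper, not part of gocql's API.
func controlCandidates(hosts []string, accept func(string) bool, seed int64) []string {
	var out []string
	for _, h := range hosts {
		if accept(h) {
			out = append(out, h)
		}
	}
	rng := rand.New(rand.NewSource(seed))
	rng.Shuffle(len(out), func(i, j int) { out[i], out[j] = out[j], out[i] })
	return out
}

func main() {
	hosts := []string{"10.0.0.1", "10.0.0.2", "10.0.0.3"}
	accept := func(h string) bool { return h != "10.0.0.1" }

	// The order varies with the seed, but 10.0.0.1 is never included.
	fmt.Println(controlCandidates(hosts, accept, 1))
}
```

Whether filtered hosts should be excluded from control connections at all is a design question for the maintainers; some users may rely on a filter that only restricts query routing.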

@RostislavPorohnya

I would like to work on this issue

@joao-r-reis joao-r-reis changed the title system.peers queried even if DisableInitialHostLookup set to true CASSGO-5 system.peers queried even if DisableInitialHostLookup set to true Oct 30, 2024