Watch /catalog/services optimization: X-Consul-Index changes less often #3934
Conversation
A previous PR, hashicorp#3899, added an optimization of watches for a given service name (see also #3890). This PR applies the same optimization to /catalog/services, so it will dramatically reduce the bandwidth used by consul-template and all tools (Prometheus, for example) watching /catalog/services for changes.

What it does:

* Only bump the index of /catalog/services when tags OR node meta change (node meta is included because it is possible to filter by node-meta).

On a typical large cluster of more than 5k nodes whose services carry the same kind of tags (meaning no host-specific tags), this decreases the number of watch wake-ups from around 3 per second to only a few changes every few seconds or minutes. This change will greatly benefit Prometheus when performing discovery on /catalog/services (see prometheus/prometheus#3814).
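As a rough illustration of the intended behaviour (a toy sketch, not the actual patch; the type and field names here are made up):

```go
package main

import "fmt"

// Toy model of the relevant indexes; the real change lives in
// agent/consul/state/catalog.go and uses go-memdb transactions.
type stateStore struct {
	indexes map[string]uint64 // index name -> raft index of last change
}

// ensureService sketches the proposed write path: the per-service index is
// always bumped (as in hashicorp#3899), but the "services" index that
// /catalog/services blocks on only moves when tags or node meta change.
func (s *stateStore) ensureService(idx uint64, name string, tagsChanged, nodeMetaChanged bool) {
	s.indexes["service."+name] = idx
	if tagsChanged || nodeMetaChanged {
		s.indexes["services"] = idx
	}
}

func main() {
	s := &stateStore{indexes: map[string]uint64{}}
	s.ensureService(10, "web", true, false)  // tags changed: watchers wake up
	s.ensureService(11, "web", false, false) // same tags: watchers keep blocking
	fmt.Println(s.indexes["services"])       // 10
}
```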
if idx := s.maxIndex("services"); idx != 50 {
	t.Fatalf("bad index: %d", idx)
}
I might be missing it since I'm just skimming on GitHub, but there doesn't seem to be a test case covering only the tags changing and ensuring that does get a new index?
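A test along these lines would cover it (a sketch only; testStateStore and testRegisterNode are assumed to be the existing helpers in agent/consul/state/catalog_test.go, and the index values are illustrative):

```go
package state

import (
	"testing"

	"github.com/hashicorp/consul/agent/structs"
)

// Re-registering a service with different tags must bump the "services" index.
func TestStateStore_EnsureService_TagChangeBumpsIndex(t *testing.T) {
	s := testStateStore(t)
	testRegisterNode(t, s, 1, "node1")

	if err := s.EnsureService(2, "node1", &structs.NodeService{
		ID: "redis", Service: "redis", Tags: []string{"master"}, Port: 8000,
	}); err != nil {
		t.Fatalf("err: %s", err)
	}

	// Same service, different tags: this is the change that must remain
	// visible to /catalog/services watchers.
	if err := s.EnsureService(3, "node1", &structs.NodeService{
		ID: "redis", Service: "redis", Tags: []string{"slave"}, Port: 8000,
	}); err != nil {
		t.Fatalf("err: %s", err)
	}

	if idx := s.maxIndex("services"); idx != 3 {
		t.Fatalf("bad index: %d", idx)
	}
}
```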
Oops, you are right... adding it now
agent/consul/state/catalog.go
Outdated
@@ -639,8 +643,11 @@ func (s *Store) ensureServiceTxn(tx *memdb.Txn, idx uint64, node string, svc *st
	if err := tx.Insert("services", entry); err != nil {
		return fmt.Errorf("failed inserting service: %s", err)
	}
	if err := tx.Insert("index", &IndexEntry{"services", idx}); err != nil {
		return fmt.Errorf("failed updating index: %s", err)
	if !hasSameTags {
We can't assume that tags are the only thing that can change. The catalog/services endpoint is also used for external service registration, where the node could be different. In the future we may add service metadata, and then we would have to remember to come back here and change the diffing logic.
What's also not clear to me is why we even need this logic. There's a PR from back in January 2017 that addressed the issue of excessive churn by introducing logic to compare the existing service to the one being registered even before ensureServiceTxn is called (f2d9da2). It uses logic defined here to do the comparison.
The other part of this PR, using maxIndexForService as the modify index in the reply, looks reasonable, but this conditional on !changedTags is too risky to merge.
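For context, the pre-existing churn protection being referred to is roughly of this shape (a sketch, not the literal code from f2d9da2; the function name is made up, though structs.NodeService.IsSame is a real helper in the agent's structs package):

```go
package state

import "github.com/hashicorp/consul/agent/structs"

// Sketch: skip the write entirely (and therefore any index bump) when the
// incoming registration is identical to what is already stored. IsSame
// compares ID, service name, tags, address, port and so on, ignoring the
// raft indexes.
func registrationChanged(existing, incoming *structs.NodeService) bool {
	return existing == nil || !existing.IsSame(incoming)
}
```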
@preetapan the point is about X-Consul-Index when using watches, as outlined here: #3890 (comment). Basically, when doing a watch on /catalog/services, if your X-Consul-Index is a bit older than the current raft index of "services" (say, the current index minus 1), the call returns the value immediately, since many kinds of changes bump the "services" index (health checks, services, ...).
Why is all of this a concern?
On a large cluster, we see around 4-5 modifications per second on the "services" index. If the latency between the client and the server is a bit high, OR if the client does not re-establish the watch immediately (for instance, because the implementation uses the result to create other watches), then all the data is downloaded again and this never stops. This is less of a problem than it was for maxIndexForService, since the "services" index is used by far fewer clients (fewer not in terms of REST endpoints, but in terms of clients watching those endpoints for changes; fewer clients look at all the services in the catalog), but it is still a problem.
I agree with you about the risk of the tags optimization (which is why I did not apply this optimization when filtering by NodeMeta as well), so I can propose the following:
Add a new index called "services-tags", use it on the catalog and similar endpoints, and apply the same logic, but with a clear comment in the code.
What do you think?
Best Regards
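To make the watch mechanics described above concrete, here is a minimal blocking watch on /catalog/services using the official Go API client (the loop structure and wait time are illustrative, not part of the PR):

```go
package main

import (
	"fmt"
	"log"
	"time"

	"github.com/hashicorp/consul/api"
)

func main() {
	client, err := api.NewClient(api.DefaultConfig())
	if err != nil {
		log.Fatal(err)
	}

	var waitIndex uint64
	for {
		// Blocking query: the call only returns once X-Consul-Index moves
		// past waitIndex (or the wait time expires). If the index we pass is
		// already stale, the server answers immediately with the full payload.
		services, meta, err := client.Catalog().Services(&api.QueryOptions{
			WaitIndex: waitIndex,
			WaitTime:  5 * time.Minute,
		})
		if err != nil {
			log.Fatal(err)
		}
		if meta.LastIndex != waitIndex {
			fmt.Printf("catalog changed at index %d: %d services\n", meta.LastIndex, len(services))
		}
		waitIndex = meta.LastIndex
	}
}
```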
@pierresouchay we talked about this internally. We think it's best to add a new index, "service-catalog", for this, updated when the catalog changes according to the logic mentioned above, with a clear comment documenting why. You will also need to handle deletions, similar to what you did in PR #3899: if the last remaining instance of a service is deregistered, it should affect the value of the service-catalog index so watches of the /catalog/services endpoint can see it.
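A minimal sketch of that approach (the index name "service-catalog" and the helper functions are illustrative, not the final implementation; IndexEntry and the "index" table are the existing state-store mechanisms quoted in the diff above):

```go
package state

import (
	"fmt"

	memdb "github.com/hashicorp/go-memdb"
)

const indexServiceCatalog = "service-catalog"

// bumpServiceCatalogIndex records that the output of /catalog/services may
// have changed at raft index idx.
func bumpServiceCatalogIndex(tx *memdb.Txn, idx uint64) error {
	if err := tx.Insert("index", &IndexEntry{indexServiceCatalog, idx}); err != nil {
		return fmt.Errorf("failed updating %s index: %s", indexServiceCatalog, err)
	}
	return nil
}

// On deregistration: if this was the last instance of the service, its
// disappearance also changes /catalog/services output, so watchers must wake.
func bumpOnDeregister(tx *memdb.Txn, idx uint64, remainingInstances int) error {
	if remainingInstances == 0 {
		return bumpServiceCatalogIndex(tx, idx)
	}
	return nil
}
```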
@preetapan Ok ! Thank you, I'll do it asap
@preetapan DONE
I still have some doubts about the risk involved with this change.
I'd like a few extra tests for edge cases I've not manually checked the code path for, like services changing name (but not ID).
The higher-level concerns are:
- Brittleness and Maintenance Complexity: in the future, anyone changing the /catalog/services output would have to remember to come in and add additional code like this. We have thoughts/requests about implementing proper per-service-instance metadata, which would certainly need this treatment. Worse than the risk of us just forgetting and breaking the semantics of this endpoint in a release is the risk that a future change, like adding metadata or additional query filters, would be very costly or impossible to maintain in the same way, and we'd effectively be blocked from making the change without introducing a performance regression, a breaking change, or a janky API that only works well for some sets of arguments.
- Setting a precedent: there are other compound endpoints that probably don't invalidate their index optimally now, because doing so tightly couples index accounting to endpoint specifics. If we get future requests to optimize those too, it's hard to argue against them once we've done it here, yet every one adds maintenance overhead and code complexity and increases the tightness of the coupling between API responses and some code deep inside the state machine in a non-obvious way. For example, we had a request just yesterday to add a way to list /health/services for all healthy instances across all services filtered by tag. There are other reasons that might be challenging, but if we set the precedent that we also attempt to optimize the index to change minimally for actual output changes, it gets even more thorny.
To be clear I'm not saying that any of those things will certainly happen, but it seems to be a realistic risk that we can't get back from once this is merged.
I wonder if there is a more general solution that reduces this tight coupling between storing indexes with subtle semantics and API output. You could imagine a new mechanism that is more like a regular HTTP etag. For example:
- initial request gets the full output of the endpoint, including an "etag" header
- subsequent requests submit the "etag" instead of an index (header or query string)
- the API handler in the agent submits that tag to the servers in the RPC upcall
- servers initially build the full RPC response as if it were not a blocking call, calculate the tag, and check for a match:
  a. if it's a match, the RPC enters the blocking/watching state as it does now (based on the raft index it found in that initial response)
  b. if it's not a match, return immediately with the new tag, etc.
This is off the top of my head, it may not be a good idea, it's certainly more work and it may not be approved by other team members. But I wonder if something like that is a better general solution rather than implementing increasingly subtle logic to couple index manipulation on writes with different consumption semantics.
There is one comment fix needed regardless. Any thoughts on my concerns above are welcome, and I'll discuss with the team again how we feel about this direction.
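As a rough sketch of the etag idea (purely illustrative, not an agreed design): the server could hash a canonical form of the endpoint output and compare it against the tag the client sent, independent of which raft index last touched the underlying tables:

```go
package main

import (
	"crypto/sha256"
	"encoding/hex"
	"encoding/json"
	"fmt"
	"sort"
)

// responseTag hashes a canonical (sorted) form of /catalog/services output so
// a server can tell whether a would-be response actually differs from what
// the client already has.
func responseTag(services map[string][]string) (string, error) {
	canonical := make(map[string][]string, len(services))
	for name, tags := range services {
		sorted := append([]string(nil), tags...)
		sort.Strings(sorted) // tag order is not guaranteed, so normalise it
		canonical[name] = sorted
	}
	// encoding/json sorts map keys, giving a deterministic byte form.
	raw, err := json.Marshal(canonical)
	if err != nil {
		return "", err
	}
	sum := sha256.Sum256(raw)
	return hex.EncodeToString(sum[:]), nil
}

func main() {
	a, _ := responseTag(map[string][]string{"web": {"v1", "prod"}})
	b, _ := responseTag(map[string][]string{"web": {"prod", "v1"}})
	fmt.Println(a == b) // true: same logical output, same tag
}
```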
agent/consul/state/catalog.go
Outdated
// In some cases, it would be possible to avoid updating this index if
// the tags of this service instance are all present in other instances
// But the optimization might be costly since we would have to run thru the
// whole list of services.
This comment is a tiny bit misleading - even if there were no unique tags on the service being deregistered, the /catalog/services endpoint output might still need to change if this is the last instance of the named service. We already check that below from your previous PR, but I agree we should just always update here for now, as it's simpler to reason about and we'd need to check tags too, like you said.
Could we just change the comment to include the case that deregistering the last instance of a service would still need to bump this?
Hello Paul,
Regarding etags and friends, I could not agree more; I even proposed this a few weeks ago: #3871
Regarding the etag, there are a few issues, however: for instance, the order is not deterministic for now (so a basic shasum or similar cannot simply be computed over the string value), but that is certainly solvable.
The main optimisation was definitely the per-service index. This one is more the cherry on top, but I still think it is worth it, especially for consul-template-like tools (and, in our case, Prometheus).
Do discuss this internally, but the first optimisation (per service name) really "destroyed" all of our scalability issues with Consul (agreed, we might be using Consul differently from several other users, but still), and this one might finish the job.
This PR also does not handle all cases (it does not optimize the case where node meta is used for filtering), so optimizing only the common use cases is a nice first step, but it is unlikely that this optimisation would be used for anything other than /catalog/services for now, hence the comment. I'll update my comment in the source code right away. All of this is purely internal, so this kind of optimisation can be removed or changed to something else in the future anyway.
Heh, I remember that now. It's actually a little different though: I don't intend this to be useful for caching necessarily, so it would be done on the Consul servers to avoid sending the response over the wire to the agent in the first place (which is not HTTP but our internal RPC protocol). So I called it an "etag", but the purpose is not preventing traffic from the Consul agent (HTTP server) to the browser/client app; it's about having cleaner semantics around watches. I really don't know if it's even worth talking about more; I was just trying to flip the problem around to see if there is a solution that "just works" into the future without imposing extra maintenance burden. Anyway, we will discuss this and see.
For the paper trail: this bug found in the previously accepted PR to optimise service watches is a reasonable example of the kind of subtle issue that is pretty hard to reason about, and that I feel will get significantly worse with additional optimisations like this: #3970 (review). While that might be the only issue there, it shows none of us had a clear enough understanding while reviewing to spot that case where some state change may impact the output of another API and so needs to touch its service index. Discussion on this PR is still pending due to a few absences last week; hopefully this week.
@pierresouchay yep, that's totally fair, and I didn't realise that when I made the previous comment :). We discussed this at length today in a team meeting and came to a consensus: while this specific PR seems like it would work, we are not really happy with the overall direction of adding increasingly complex and subtle index manipulation to incrementally optimize watches on each endpoint. Your last PR that was merged arguably set off in that direction already, but we see that as both somewhat simpler than this and much more of an obvious performance issue affecting you and others who have large clusters. It's simpler because it reasons only about the identity of the resources being touched, not about whether they changed in a way that matters to specific endpoint responses. We do acknowledge that getting watches to scale really well is important, but we think it's better long-term to approach that problem holistically and come up with a general solution that we can apply everywhere without incrementally increasing coupling and maintenance burden with each endpoint we add or optimize. Two very high level approaches we will consider are:
We're really sorry not to accept the work that you've put into this, especially since we steered you in this direction in the first place! After a lot of reflection, though, we don't think this is the right pattern to follow going forward. We do think we have a much clearer understanding of the problem and the potential solutions from this discussion. Thanks for all your work on Consul and for helping improve the tool for others at scale.
Ok, I understand; this is less of an issue than the per-service optimisation anyway, and I understand your reasons (a bit sad, but life goes on :-)). What you might consider, however, is #3920: since this endpoint changes frequently (~2-3 times per second), all discovery tools such as Prometheus will request data from it very often and hit only the leader of the cluster. This applies only to tools performing discovery of services naively, but I have found that most tools that discover services with Consul do not use stale requests when performing discovery. Thanks for your reviews. Regards
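For reference, opting into stale reads with the official Go API client looks like this (a minimal example; error handling reduced to the basics):

```go
package main

import (
	"fmt"
	"log"

	"github.com/hashicorp/consul/api"
)

func main() {
	client, err := api.NewClient(api.DefaultConfig())
	if err != nil {
		log.Fatal(err)
	}

	// AllowStale lets any server answer the read instead of only the leader,
	// spreading discovery load across the cluster; LastContact reports how
	// far behind the answering follower was.
	services, meta, err := client.Catalog().Services(&api.QueryOptions{AllowStale: true})
	if err != nil {
		log.Fatal(err)
	}
	fmt.Printf("%d services (replica lag: %s)\n", len(services), meta.LastContact)
}
```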
Below is an issue I'm encountering using Consul 1.4.0.
Here is my definition of kafka-0.kafka.service.consul using https://www.nomadproject.io/docs/job-specification/service.html#tags
Running nslookup in a for loop, it will intermittently fail, returning:
a) 45% of the results
b) the other 45% of the results
c) and 5% just one IP, A, where this loop was invoked.
Apologies, I discovered the source of the issue: disabling IPv6 (https://www.admintome.com/blog/disable-ipv6-on-ubuntu-18-04/) fixed it. The behavior is now working as expected.