
Watch /catalog/services optimization: X-Consul-Index changes less often #3934

Conversation

pierresouchay
Contributor

A previous PR, #3899, added an optimization of watches for a given service name (see also #3890).

This optimization does the same for /catalog/services.
Thus, it will dramatically reduce the bandwidth usage of consul-template and of all tools (for example Prometheus) watching for changes in /catalog/services.

What it does is:

  • Only report a change on /catalog/services when the index of the services themselves actually changes
  • Change /catalog/services less often (only when tags change while registering / deregistering services)

On a typical large cluster of more than 5k nodes whose services share the same kind of tags (meaning no host-specific tags), it decreases the rate of watch wake-ups from around 3 per second to only a few changes every few seconds or minutes. This change will greatly benefit Prometheus when it performs discovery on /catalog/services (see prometheus/prometheus#3814).
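
For context, a watch on /catalog/services is just a blocking HTTP query driven by X-Consul-Index. Below is a minimal sketch of the kind of loop consul-template or Prometheus effectively runs; the /v1/catalog/services endpoint, the ?index/?wait parameters and the X-Consul-Index header are the standard Consul HTTP API, while the address, wait time and error handling are illustrative:

package main

import (
    "fmt"
    "io"
    "net/http"
    "time"
)

// watchCatalogServices polls /v1/catalog/services as a blocking query.
// Each response carries X-Consul-Index; passing it back via ?index= makes the
// next request block until that index changes. The less often the index is
// bumped, the fewer full payloads a watcher has to download.
func watchCatalogServices(addr string) error {
    index := "0"
    for {
        url := fmt.Sprintf("http://%s/v1/catalog/services?index=%s&wait=5m", addr, index)
        resp, err := http.Get(url)
        if err != nil {
            time.Sleep(time.Second) // back off on transient errors
            continue
        }
        body, err := io.ReadAll(resp.Body)
        resp.Body.Close()
        if err != nil {
            return err
        }
        index = resp.Header.Get("X-Consul-Index")
        fmt.Printf("services changed (index=%s): %d bytes\n", index, len(body))
    }
}

func main() {
    _ = watchCatalogServices("127.0.0.1:8500")
}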

A previous PR, hashicorp#3899, added an
optimization of watches for a given service name.

This optimization does the same for /catalog/services.
Thus, it will dramatically reduce the bandwidth usage of consul-template
and of all tools (for example Prometheus) watching for changes in /catalog/services.

What it does is:

* Only apply changes in /catalog/services when tags OR node meta change (since
  it is possible to filter using node-meta, this is needed)

On a typical large cluster of more than 5k nodes, it decreases the rate of
watch wake-ups from around 3 per second to only a few changes every few seconds or minutes.
if idx := s.maxIndex("services"); idx != 50 {
    t.Fatalf("bad index: %d", idx)
}

Member

I might be missing it since I'm just skimming on GitHub, but there doesn't seem to be a test case covering only the tags changing and ensuring that this does get a new index?

Contributor Author

Oops, you are right... adding it now
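
Roughly, the test case being asked for might look like the sketch below. It assumes it sits in the state package next to the existing catalog tests; the testStateStore helper and the chosen indexes are modelled on the surrounding suite and are assumptions, not the actual test that was added:

func TestStateStore_EnsureService_TagChangeBumpsIndex(t *testing.T) {
    s := testStateStore(t) // assumed test helper from the surrounding suite

    // Register a node and a service at known indexes.
    if err := s.EnsureNode(10, &structs.Node{Node: "node1", Address: "127.0.0.1"}); err != nil {
        t.Fatalf("err: %s", err)
    }
    if err := s.EnsureService(20, "node1", &structs.NodeService{
        ID: "web1", Service: "web", Tags: []string{"v1"}, Port: 8080,
    }); err != nil {
        t.Fatalf("err: %s", err)
    }

    // Re-register with only the tags changed: the services index must move.
    if err := s.EnsureService(30, "node1", &structs.NodeService{
        ID: "web1", Service: "web", Tags: []string{"v2"}, Port: 8080,
    }); err != nil {
        t.Fatalf("err: %s", err)
    }
    if idx := s.maxIndex("services"); idx != 30 {
        t.Fatalf("bad index: %d", idx)
    }
}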

@@ -639,8 +643,11 @@ func (s *Store) ensureServiceTxn(tx *memdb.Txn, idx uint64, node string, svc *st
    if err := tx.Insert("services", entry); err != nil {
        return fmt.Errorf("failed inserting service: %s", err)
    }
    if err := tx.Insert("index", &IndexEntry{"services", idx}); err != nil {
        return fmt.Errorf("failed updating index: %s", err)
    if !hasSameTags {
Contributor
@preetapan preetapan Mar 7, 2018

We can't assume that tags are the only thing that can change. The catalog/services endpoint is also used for external service registration, where the node could be different. In the future we may add service metadata, and then we would have to remember to come back here and change the diffing logic.

What's also not clear to me is why we even need this logic. There's a PR from back in January 2017 that addressed the issue of excessive churn by introducing logic to compare the existing service to the one being registered even before ensureServiceTxn is called (f2d9da2). It uses logic defined here to do the comparison.

The other part of this PR, around using maxIndexForService as the modify index in the reply, looks reasonable, but this conditional on !changedTags is too risky to merge.

Contributor Author

@preetapan the point is about X-Consul-Index when using watches, as outlined here: #3890 (comment). Basically, when doing a watch on /catalog/services, if your X-Consul-Index is a bit older (let's say the current raft index of "services" minus 1), then the call returns the value immediately, since many kinds of changes bump the "services" index (health checks, services, ...).

Why is all of this a concern?

On a large cluster, we have around 4-5 modifications per second to the "services" index. If the latency between client and server is a bit high, OR if the client does not re-issue the watch immediately (for instance because the implementation uses the result to create other watches), then all the data is downloaded once again, and this never stops. This is less of a problem than for maxIndexForService, because far fewer clients watch all the services of a node or of the whole catalog than watch individual services, but it is still a problem.

I agree with you about the risk of the tags optimization (which is also why I did not apply this optimization when filtering with NodeMeta), so I can propose the following:

Add a new index called "services-tags", use it on the catalog and similar endpoints, and apply the same logic, but with a clear comment in the code.

What do you think?

Best Regards

Contributor

@pierresouchay we talked about this internally. We think it's best to add a new index, "service-catalog", for this, updated when the catalog changes according to the logic mentioned above, and with a clear comment documenting why. You will also need to handle deletions, similar to what you did in PR #3899: if the last remaining instance of a service is deregistered, it should affect the value of the service-catalog index so that watches on the /catalog/services endpoint can see it.
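
A rough sketch of what that bookkeeping could look like, following the shape of the diff above. It is meant to live in the same state package (it reuses the memdb transaction and IndexEntry type shown in the diff); the "service-catalog" index name, the function name and the catalogOutputChanged flag are placeholders, not the final implementation:

// updateCatalogIndexes sketches the bookkeeping discussed above. It would be
// called from ensureServiceTxn / deleteServiceTxn with the write index and a
// flag saying whether the /catalog/services output can actually change
// (tags changed on registration, or the last instance was deregistered).
func updateCatalogIndexes(tx *memdb.Txn, idx uint64, catalogOutputChanged bool) error {
    // The generic "services" index always moves on a write.
    if err := tx.Insert("index", &IndexEntry{"services", idx}); err != nil {
        return fmt.Errorf("failed updating index: %s", err)
    }
    // The catalog-facing index only moves when the endpoint output can change,
    // so watches on /catalog/services wake up far less often.
    if catalogOutputChanged {
        if err := tx.Insert("index", &IndexEntry{"service-catalog", idx}); err != nil {
            return fmt.Errorf("failed updating index: %s", err)
        }
    }
    return nil
}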

Contributor Author

@preetapan OK! Thank you, I'll do it ASAP.

Contributor Author

@preetapan DONE

Member
@banks banks left a comment

I still have some doubts about the risk involved with this change.

I'd like a few extra tests for edge cases I've not manually checked the code path for, like services changing name (but not ID).

The higher-level concerns are:

  1. Brittleness and Maintenance Complexity: in the future, anyone changing the /catalog/services output would have to remember to come in and add additional code like this. We have thoughts/requests about implementing proper per-service-instance metadata, which would certainly need this treatment. Worse than the risk of us simply forgetting and breaking the semantics of this endpoint in a release is the risk that a future change, like adding metadata or additional query filters, would be very costly or impossible to maintain in the same way, and we'd effectively be blocked from making the change without introducing a performance regression, a breaking change, or a janky API that only works well for some sets of arguments.
  2. Setting a precedent: there are other compound endpoints that probably don't invalidate their index optimally now, because doing so tightly couples index accounting and endpoint specifics. If we get future requests to optimize those too, it's hard to argue against them once we've done it here, yet every one brings maintenance overhead and code complexity, and tightens the coupling between an API response and its options and some code deep inside the state machine in a non-obvious way. For example, we had a request just yesterday to add a way to list /health/services for all healthy instances across all services filtered by tag. There are other reasons why that might be challenging, but if we set the precedent that we also attempt to optimize the index to change minimally for actual output changes, it gets even more thorny.

To be clear I'm not saying that any of those things will certainly happen, but it seems to be a realistic risk that we can't get back from once this is merged.

I wonder if there is a more general solution that reduces this tight coupling between storing indexes with subtle semantics and API output. You could imagine a new mechanism that is more like a regular HTTP etag. For example:

  1. initial request gets full output of endpoint including "etag" header
  2. subsequent requests submit "etag" instead of index, (header or query string)
  3. API handler in agent submits that tag to servers in RPC upcall
  4. Servers initially build full RPC response as if it was not a blocking call, calculate tag, and check match
    a. If it's a match, the RPC enters the blocking/watching state as it does now (based on the raft index it found in that initial response)
    b. If it's not a match, return immediately with the new tag, etc.

This is off the top of my head, it may not be a good idea, it's certainly more work and it may not be approved by other team members. But I wonder if something like that is a better general solution rather than implementing increasingly subtle logic to couple index manipulation on writes with different consumption semantics.
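
To make the idea concrete, here is a minimal, self-contained sketch of the comparison step only; the hashServices function, its input shape and the sorting are assumptions, meant to show that a deterministic tag can be computed from the response content regardless of map or tag ordering:

package main

import (
    "crypto/sha256"
    "encoding/hex"
    "encoding/json"
    "fmt"
    "sort"
)

// hashServices hashes a deterministic rendering of a service-name -> tags map.
// Sorting both the service names and the tags makes the result independent of
// map iteration order.
func hashServices(services map[string][]string) (string, error) {
    names := make([]string, 0, len(services))
    for name := range services {
        names = append(names, name)
    }
    sort.Strings(names)

    h := sha256.New()
    enc := json.NewEncoder(h)
    for _, name := range names {
        tags := append([]string(nil), services[name]...)
        sort.Strings(tags)
        if err := enc.Encode([]interface{}{name, tags}); err != nil {
            return "", err
        }
    }
    return hex.EncodeToString(h.Sum(nil)), nil
}

func main() {
    tag1, _ := hashServices(map[string][]string{"web": {"v1", "eu"}, "db": nil})
    tag2, _ := hashServices(map[string][]string{"db": nil, "web": {"eu", "v1"}})
    fmt.Println(tag1 == tag2) // true: same content, different ordering
    // A blocking RPC could compare the client-supplied tag with the freshly
    // computed one: on mismatch return immediately, on match start blocking.
}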

There is a comment fix to make regardless. Any thoughts on my concerns above are welcome, and I'll discuss with the team again how we feel about this direction.

// In some cases, it would be possible to avoid updating this index if
// the tags of this service instance are all present in other instances
// But the optimization might be costly since we would have to run thru the
// whole list of services.
Member

This comment is a tiny bit misleading: even if there were no unique tags on the service being deregistered, the /catalog/services endpoint output might still need to change if this is the last instance of the named service. We already check that below, from your previous PR, but I agree we should just always update here for now, as it's simpler to reason about, and we'd need to check tags too, like you said.

Could we just change the comment to mention that the last-instance case would still need to bump this?

Contributor Author

Hello Paul,

Regarding eTags and friends, I could not agree more, I even proposed this a few weeks ago: #3871

Regarding the eTag, there are a few issues however, since for instance the order is not deterministic for now (thus a basic shasum or similar cannot be done on a string value for instance - but of course definitely solvable).

Well, the main optimisation was definitely the per-service index. This one is a bit of a cherry on top, but I still think it is worth it, especially for consul-template-like tools (and, in our case, Prometheus).

Do discuss this internally, but the first optimisation (per service name) really "destroyed" all of our scalability issues with Consul (agreed, we might be using Consul differently from several other users, but still), and this one might finish the job.

This PR also does not handle all cases (it does not optimize the case where node meta is used for filtering), so optimizing only the common use-cases is a nice first step, and it is unlikely that this optimisation will be used for anything other than /catalog/services for now, hence the comment. I will update my comment in the source code right away. All of this is purely internal, so this kind of optimisation could be removed or changed to something else in the future anyway.

@banks
Member

banks commented Mar 14, 2018

> Regarding eTags and friends, I could not agree more, I even proposed this a few weeks ago: #3871
>
> Regarding the eTag, there are a few issues however, since for instance the order is not deterministic for now (thus a basic shasum or similar cannot be done on a string value for instance - but of course definitely solvable).

Heh, I remember that now. It's actually a little different though: I don't intend this to be useful for caching necessarily, so it would be done on the Consul servers to prevent sending the response over the wire to the agent in the first place (which is not HTTP but an internal RPC protocol).

So I mentioned it as "etag", but the purpose is not preventing traffic from the Consul agent (HTTP server) to the browser/client app; it's about having cleaner semantics around watches.

I really don't know if it's even worth talking about more, I was just trying to flip the problem around to see if there is a solution that "just works" into the future without imposing extra maintenance burden.

Anyway we will discuss this and see.

@banks
Member

banks commented Mar 19, 2018

For the paper trail: this bug found in the previously accepted PR to optimise service watches is a reasonable example of the kind of subtle issue that is pretty hard to reason about, and that I feel will get significantly worse with additional optimisations like this: #3970 (review)

While that might be the only issue there, it shows none of us had a clear enough understanding while reviewing to spot the case where some state change may impact the output of another API and so needs to touch its service index.

Discussion on this PR is still pending due to a few absences last week; hopefully this week.

@pierresouchay
Contributor Author

@banks to be fair, the bug fixed in #3970 already existed without those optimizations :-)

The optimization just made it easier to spot on a real-life cluster.

@banks
Member

banks commented Mar 19, 2018

@pierresouchay yep that's totally fair and I didn't realise when I made the previous comment :).

We discussed this at length today in a team meeting and came to a consensus that, while this specific PR seems like it would work, we are not really happy with the overall direction of adding increasingly complex and subtle index manipulation to incrementally optimize watches on each endpoint.

Your last PR that was merged arguably set off in that direction already, but we see that as both somewhat simpler than this and much more of an obvious performance issue affecting you and others who have large clusters. It's simpler because it reasons only about the identity of the resources being touched, and not about the semantics of whether they changed in a way that matters to specific endpoint responses or not.

We do acknowledge that getting watches to scale really well is important but we think it's better long-term to approach that problem holistically and come up with a general solution that we can apply everywhere without incrementally increasing coupling and maintenance burden with each endpoint we add or optimize.

Two very high level approaches we will consider are:

  1. Something like the hash-based mechanism I described previously: internally, a blocking RPC can generate the full output and then hash it, comparing against the hash value passed from the last response to determine whether the result has changed since the last watch query. If the hash doesn't match, return right away and start again; if it does, start blocking, and execution continues just as it does now.
  2. If not that for some reason, we could at least explicitly model the abstract "resources" each API endpoint represents rather than couple them from a distance. We'd have these resource definitions live alongside the APIs that define them in code and somehow register hooks for the actual DB models they are concerned with. They would encapsulate the decision about whether or not to update any particular index locally, in code that lives next to the API whose response depends on the update being correct. The central state store would then just arrange to call into all interested resources on any change, allowing them the opportunity to update any indexes they want within the transaction (a rough sketch of this idea follows below).
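
As a rough illustration of what approach 2 could mean in code (every name here is illustrative; none of this exists in Consul today):

package main

import (
    "fmt"
    "sync"
)

// Change describes a single state-store mutation (table + raft index).
type Change struct {
    Table string
    Index uint64
}

// Resource lives next to the API endpoint whose response depends on it and
// decides whether a given change should invalidate that endpoint's watches.
type Resource interface {
    Name() string
    Affected(c Change) bool
}

// IndexRegistry is notified by the state store on every commit; interested
// resources get their per-endpoint index bumped.
type IndexRegistry struct {
    mu        sync.Mutex
    resources []Resource
    indexes   map[string]uint64
}

func NewIndexRegistry() *IndexRegistry {
    return &IndexRegistry{indexes: make(map[string]uint64)}
}

func (r *IndexRegistry) Register(res Resource) {
    r.mu.Lock()
    defer r.mu.Unlock()
    r.resources = append(r.resources, res)
}

func (r *IndexRegistry) Commit(c Change) {
    r.mu.Lock()
    defer r.mu.Unlock()
    for _, res := range r.resources {
        if res.Affected(c) {
            r.indexes[res.Name()] = c.Index
        }
    }
}

// catalogServices is what the /catalog/services endpoint might register.
type catalogServices struct{}

func (catalogServices) Name() string           { return "catalog-services" }
func (catalogServices) Affected(c Change) bool { return c.Table == "services" }

func main() {
    reg := NewIndexRegistry()
    reg.Register(catalogServices{})
    reg.Commit(Change{Table: "nodes", Index: 41})    // ignored by this resource
    reg.Commit(Change{Table: "services", Index: 42}) // bumps catalog-services
    fmt.Println(reg.indexes)
}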

We're really sorry not to accept the work that you've put into this, especially since we steered you in this direction in the first place! After a lot of reflection, though, we don't think this is the right pattern to follow going forward. We do think this discussion has given us a much clearer understanding of the problem and the potential solutions.

Thanks for all your work on Consul and for helping improve the tool for others at scale.

@banks banks closed this Mar 19, 2018
@pierresouchay
Contributor Author

Ok, I understand; this is less of an issue than the per-service optimisation anyway, and I understand your reasons (a bit sad, but life goes on :-)

What you might consider, however, is #3920: since this endpoint changes frequently (~2-3 times per second), all discovery tools such as Prometheus will request data from it very often and hit only the cluster leader. Granted, this applies ONLY to tools performing service discovery naively, but in practice I found that most tools that discover services through Consul do not use stale requests when doing so.
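
For reference, opting into stale reads is a one-line change for clients using the official Go API client (the equivalent of adding ?stale to the HTTP request); a minimal sketch, with the default local agent address assumed:

package main

import (
    "fmt"
    "log"

    "github.com/hashicorp/consul/api"
)

func main() {
    client, err := api.NewClient(api.DefaultConfig()) // defaults to 127.0.0.1:8500
    if err != nil {
        log.Fatal(err)
    }

    // AllowStale lets any server answer the query instead of only the leader,
    // spreading the load of frequent discovery polls across the cluster.
    services, meta, err := client.Catalog().Services(&api.QueryOptions{AllowStale: true})
    if err != nil {
        log.Fatal(err)
    }
    fmt.Printf("%d services, last index %d, last contact %s\n",
        len(services), meta.LastIndex, meta.LastContact)
}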

Thanks for your reviews

Regards

@vvitayau

vvitayau commented Dec 13, 2018

Below is an issue I'm encountering using consul 1.4.0

  1. Is the implementation discussed above related to my bug?

Here is my definition of kafka-0.kafka.service.consul using https://www.nomadproject.io/docs/job-specification/service.html#tags

      service {
        name = "kafka"
        tags = [
          "kafka-${NOMAD_ALLOC_INDEX}",
        ]
      }

In a for-loop of nslookup, it will intermittently fail with

** server can't find kafka-0.kafka.service.consul: NXDOMAIN

  2. It never fails in a for-loop of nslookup on kafka.service.consul.
    However, I do notice that, performing 5000 iterations over a span of 40s on machine A,
    I have yet to see all three Consul servers listed in a lookup. Why is that?

  3. Instead, I see that the list varies. Why does it toggle back and forth so much in such a short duration?

a) 45% of the result

Name: kafka.service.consul
Address: A
Name: kafka.service.consul
Address: B

b) the other 45% of the result

Name: kafka.service.consul
Address: A
Name: kafka.service.consul
Address: C

c) and 5% of the time, just the one IP A where this loop was invoked.

Name: kafka.service.consul
Address: A

@vvitayau

Apologies, I discovered the source of the issue: I needed to disable IPv6.

https://www.admintome.com/blog/disable-ipv6-on-ubuntu-18-04/

Now the behavior is working as expected.

n=0; time for i in {1..5000}; do n=$((n+1)); echo $n && dig @127.0.0.1 -p 8600 kafka-0.kafka.service.consul | grep -A 1 '^;; ANSWER SECTION:' || exit 1; done
