Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Service registration idempotence regarding HealthChecks #4903

Closed
pierresouchay opened this issue Nov 6, 2018 · 17 comments
Closed

Service registration idempotence regarding HealthChecks #4903

pierresouchay opened this issue Nov 6, 2018 · 17 comments
Labels
needs-discussion Topic needs discussion with the larger Consul maintainers before committing to for a release
Milestone

Comments

@pierresouchay
Copy link
Contributor

Issue with Service registration and checks

When registering several times a service on a local Agent using /agent/service/register, if checks are removed, those checks are not removed from the service.

So, if at t0, we register service redis with checks:[ check0, check1] and then, at t1 we register checks:[check2], the service will have the following services:

redis: [check0, check1, check2]

Ideally, in a RESTfull way, it should be redis: [check2]

It means the PUT registration is not idempotent. This is a bit problematic sometimes.

However, it probably means that some people are relying on this behavior, so changing the behaviour might break clients.

We thus propose to add an optional parameter for /agent/service/register?idempotent=true to enable this idempotent behavior.

@Aestek will soon provide a patch.

Issue with checks registration

Another bug can be found in /agent/check/register: when registering several times (in a idempotent way) a check, the state will override the state of check. Consider this:

Register check1 on Service2

{
  "Name": "check1",
  "ServiceID": "service2",
  "HTTP": "localhost:5050",
  "Interval": "30s",
}

At first run, the check will be initialized => Ok
At second registration time, the check (which was passing) will be reset to Critical (default value from registration) and will flap.

@Aestek will also propose another patch to address this: if a check already exists and is not modified, do not change the current status of the check.

Fixing those 2 issues will ease deployment of services using idempotent systems such as chef.

@pearkes Would you consider this for inclusion in upstream?

Kind Regards

@banks banks added the needs-discussion Topic needs discussion with the larger Consul maintainers before committing to for a release label Nov 8, 2018
@banks banks added this to the Upcoming milestone Nov 8, 2018
@banks
Copy link
Member

banks commented Nov 8, 2018

Seems reasonable, I've labelled it so we can discuss with team at next available opportunity.

@gertcuykens
Copy link

I am not the smartest so I am an advocate for simplicity, when I read ?idempotent=true my mind is already saying 'what?' I believe one scenario is a bug the other the correct way to handle the checks. Please don't fragment behavior like this, just discus which one is idiomatic and if it breaks stuf version bump and move on, thx

@pierresouchay
Copy link
Contributor Author

@gertcuykens agree, but I do know that some people rely on this behavior (by inserting some checks outside service definition and applying service definition on each chef run). So not doing that way will break some users.

We could however recommend using this query parameter in the documentation

God built Earth in 7 days between there were no legacy ;)

@gertcuykens
Copy link

Can we settle on a version bump and the reverse ?idempotent=false instead. Including a deprecated warning for does who rely on this non idempotent bug :P

@banks
Copy link
Member

banks commented Nov 12, 2018

and if it breaks stuf version bump and move on, thx

Sadly version bumps on our API are way to drastic a thing to get around minor breaking changes like this. A version bump that changes the URL basically breaks every Consul client in the world and causes massive upgrade pain.

We appreciate how ugly it is trying to maintain "legacy" behaviour - there is tons of it in Consul as something that's been around a while, but any change that breaks existing clients effectively prevents many of our largest users from being able to upgrade at all and leaves them stranded on older versions so we have to be extremely pragmatic with our approach.

If you mean version bump Consul that's even worse - even if we make breaking changes we still have to give loads of warning of the change and ensure that the upgrade is smooth so old clients can still use new servers etc without any semantics being confused etc. How would one roll out an upgrade of all clients if they have a breaking API change? You can't upgrade the clients first since then they will do the wrong thing against old API and you can't upgrade the consul clients first for same reason etc.

That said, we all agreed that this was unpleasant and that just "fixing" it might be an option - expecting Check append behaviour is pretty odd and not something we ever documented.

BUT there is a major use-case that is relied upon all the time by existing tooling that would be broken if we change the default behaviour here. In many cases people register services with all their checks etc either via Config or API, and then use this endpoint from some totally separate automation to do things like update the service tags or metadata. When they do that, they rely on the fact that when they PUT a change to the service with it's new tags, they don't need to perfectly redefine all it's Checks too. They just leave those blank and it doesn't update them. If we change the default then all such integrations would remove all their checks on next tag update!

We could special case empty checks but then it still isn't really idempotent for Chef as there would be no way to remove all checks.

I do agree @gertcuykens that ?idempotent=true is not a great API choice, but I also agree with Pierre that we do need to make this opt-in somehow and it's not easy to come up with a better name than that! Versioning is hard!

@gertcuykens
Copy link

Then I suggest we leave current api v1 behavior, document that part better and use github v2 tags to mark all legacy code for when it's time to clean things up on a api level so we can revisit this later.

I strongly believe we should follow docker and kubernetes version / config filosophie here as in /v2/agent/service/register It's the only acceptable way to change it.

As far as i can tell you will not find any ?... workaround in docker and kubernetes api right?

https://docs.docker.com/registry/spec/api/
https://kubernetes.io/docs/concepts/overview/kubernetes-api/

Also who said a tool can't mix v1 and v2 api? Kubernetes does it all the time. Maybe we can introduce a api level in our consul config files also where the default is v1 or do we have that already?

@banks
Copy link
Member

banks commented Nov 13, 2018

@gertcuykens thanks for the input, I agree those are potential options.

In general HashiCorp tools don't always make the same choices as others like Docker and Kubernetes when it comes to versioning. Partly because there is more than one way to organise, partly because some of them pre-dated other popular projects.

It's a hard problem in general because it is much more than a technical decision -- it's super important that we evolve our software in ways that our current users expect and that cause a minimum of disruption and extra work for them to do. Having HashiCorp tools be somewhat consistent with themselves historically in how they handle backwards incompatibility and consistent with each other is probably more important that doing the same thing as Kubernetes for example.

We could add a /v2/ but every time we "bump" version like that it requires a massive amount of work to rewrite all clients that are using it. For a minor "bug" like this that isn't affecting anyone much for the last 4 years it seems, it's hard to justify asking people to change stuff. Sure both can coexist a while and would have to but that increases maintenance burden. Imagine ending up in a situation where we have v1,2,3,4 because each one only changed one minor bug with backwards compatibility ramifications in one endpoint. That would be a nightmare to maintain and document as well as horrible for clients to keep up with.

In general URL versioning is a pretty heated topic (Google API versioning if you're interested in some viewpoints!) but most people agree it's pretty horrible to change the URL of every resource in your API to introduce a single breaking change somewhere.

That said, adding an query param might not be the best way either - there is a spectrum of possible solutions between that and making a whole new /v2/ and changing every single API endpoint and client that exists!

We'll consider the simplest way that will work least badly for users 😄.

@pearkes
Copy link
Contributor

pearkes commented Nov 30, 2018

We just had a chance to discuss this and aren't decided but I wanted to take some notes:

  • If we were to take the query param idea, replace_checks=1 would feel the best / most explicit, but we'd need to validate the assumption that tags/service meta wouldn't have similar problem, if you want to look into that it'd be welcome.
  • We talked about a new endpoint that didn't duplicate much other than modifying this behavior but aren't sold on that idea.
  • The 2nd PR here (Prevent status flap when re-registering a check #4904) we want to prioritize getting in and have triaged it as such. The query param PR needs more thought as mentioned above.

Thanks!

@ShimmerGlass
Copy link
Contributor

@pearkes thx for discussing this :)

Tags and meta are replaced when re-registering a service, only checks are left behind.

As for the param, another option would be to introduce a new config file param like replace_checks_on_service_register that would default to false. This new param could be removed in an upcoming non-backward compatible release. This would keep this kind of "bugfix" param centralized. What do you think ?

@pierresouchay
Copy link
Contributor Author

@Aestek I am not a fan of the agent config, or in that case, it should be possible to put: replace_checks=0 if default behavior is set to 1.

We currently have some agents (I mean, now, only a single case known to me) were one our beloved users did mix behaviors (service created by chef, additional healthcheck added later by a script)

@glakshmeepathi
Copy link

I have a question.
Say the check is changed from "timeout 5s docker ps" to "timeout 10s docker ps" and then re-registered.
There is a slight change in the content of the health check here. Will it fail momentarily ?

If so, why not "fail" only if the re-registered health check actually fails ? ( retain the state of the old content until the result of the new content is available )

@ShimmerGlass
Copy link
Contributor

@glakshmeepathi no it wont fail momentarily, the state (passing, warning...) will stay the same until the check is re-ran with the new parameters.

mkeeler pushed a commit that referenced this issue Jan 7, 2019
Fixes point `#2` of: #4903

When registering a service each healthcheck status is saved and restored (https://github.com/hashicorp/consul/blob/master/agent/agent.go#L1914) to avoid unnecessary flaps in health state.
This change extends this feature to single check registration by moving this protection in `AddCheck()` so that both `PUT /v1/agent/service/register` and `PUT /v1/agent/check/register` behave in the same idempotent way.

#### Steps to reproduce
1. Register a check :
```
curl -X PUT \
  http://127.0.0.1:8500/v1/agent/check/register \
  -H 'Content-Type: application/json' \
  -d '{
  "Name": "my_check",
  "ServiceID": "srv",
  "Interval": "10s",
  "Args": ["true"]
}'
```
2. The check will initialize and change to `passing`
3. Run the same request again
4. The check status will quickly go from `critical` to `passing` (the delay for this transission is determined by https://github.com/hashicorp/consul/blob/master/agent/checks/check.go#L95)
@hanshasselberg
Copy link
Member

Having read the discussion it seems to me that we want to support idempotent service registrations. I think a parameter is the best way to do it for now, since we don't want to start api v2 and we hesitate to create a new endpoint just for that. Which doesn't mean that we won't do that in the future. I can see how v2 of this endpoint will default to the idempotent behaviour.

I think the parameter name idempotent leaves lots of room for interpretation though, what do you think about replace_existing_checks which defaults to false?

Other than that I think #4905 is good to go.

@stale
Copy link

stale bot commented Oct 21, 2019

Hey there,
We wanted to check in on this request since it has been inactive for at least 60 days.
If you think this is still an important issue in the latest version of Consul
or its documentation please reply with a comment here which will cause it to stay open for investigation.
If there is still no activity on this issue for 30 more days, we will go ahead and close it.

Feel free to check out the community forum as well!
Thank you!

@stale stale bot added the waiting-reply Waiting on response from Original Poster or another individual in the thread label Oct 21, 2019
@stale
Copy link

stale bot commented Nov 20, 2019

Hey there, This issue has been automatically closed because there hasn't been any activity for at least 90 days. If you are still experiencing problems, or still have questions, feel free to open a new one 👍

@stale stale bot closed this as completed Nov 20, 2019
@banks
Copy link
Member

banks commented Nov 21, 2019

Stale bot closed this but AFAICT it was actually fixed by #4905 and release in Consul 1.6.1.

@ghost
Copy link

ghost commented Jan 25, 2020

Hey there,

This issue has been automatically locked because it is closed and there hasn't been any activity for at least 30 days.

If you are still experiencing problems, or still have questions, feel free to open a new one 👍.

@ghost ghost locked and limited conversation to collaborators Jan 25, 2020
@ghost ghost removed waiting-reply Waiting on response from Original Poster or another individual in the thread labels Jan 25, 2020
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Labels
needs-discussion Topic needs discussion with the larger Consul maintainers before committing to for a release
Projects
None yet
Development

No branches or pull requests

7 participants