Release Alertmanager v0.15.0 #1340

Closed
mxinden opened this issue Apr 23, 2018 · 43 comments

Comments

@mxinden
Member

mxinden commented Apr 23, 2018

I would like to start the discussion about doing either a v0.15.0 or a v0.15.0-rc.2 release.

@stuartnelson3 @simonpasquier @fabxc are there any blockers that I am not aware of?

If there is consensus I am more than happy to prepare the next release.

@stuartnelson3
Contributor

There were two issues I wanted to fix before the next release: #1331 and #1330. #1330 is blocking @TheTincho from making a Debian release, but I believe that still doesn't resolve the issue of our Elm frontend, i.e. even with #1330 he won't be able to release AM as an official Debian package.

I haven't had a chance to test the new mesh library at SoundCloud, which I would like to do before making an official release, but I'm also not aware of the level of testing you've done with it over at Red Hat. @simonpasquier commented on a PR that he thought it was stable; I would be interested in hearing his opinion.

@NightTsarina
Contributor

Hi,

Yes, #1330 is indeed a blocker for me now.

The Elm issue has not gone away, but I have basically abandoned any hope of getting Elm into Debian (React does not look any better in this respect). So last week I started working on producing an official release without any web frontend.

My plan would be to upload this to experimental so users can make use of the newer AM, and meanwhile work on backporting the old, simpler frontend to AM 0.14.

@simonpasquier
Member

Correct, my own tests with 0.15 are conclusive (see my original comment). That being said, they are limited to my local environment and, although I've played a bit with ambench, that can't be compared to any (pre-)production setup. We have also had reports from users successfully deploying 0.15-RC versions with the Prometheus Operator, which is encouraging. Maybe @iksaif has done some testing too?

My feeling is that people eager to test the RC have done it already and the major issues regarding clustering have been addressed (except for DNS resolution, #1307, but AFAIU it isn't a blocking problem). Getting 0.15 out the door would help surface new issues, if any. And in case of a blocker, downgrading from 0.15 to 0.14 works fine since the on-disk format of the silences and notification logs hasn't changed (I've just checked this).

@stuartnelson3
Contributor

stuartnelson3 commented Apr 24, 2018

In that case, my chief concern is that the release makes it explicit that mesh configuration requires FQDNs.

EDIT:
There are also breaking changes in amtool.

@iksaif
Contributor

iksaif commented Apr 24, 2018 via email

@mxinden
Member Author

mxinden commented Apr 24, 2018

@stuartnelson3 we have been testing the new Alertmanager with the Prometheus Operator test suite and have deployed it in some dev environments. No production experience so far.

There are also breaking changes in amtool.

@stuartnelson3 Are these changes in the release-0.15 branch? I would suggest only cherry-picking bug fixes from the master branch into the release branch.

My feeling is that people eager to test the RC have done it already and the major issues regarding clustering have been addressed (except for DNS resolution, #1307, but AFAIU it isn't a blocking problem). Getting 0.15 out the door would help surface new issues, if any. And in case of a blocker, downgrading from 0.15 to 0.14 works fine since the on-disk format of the silences and notification logs hasn't changed (I've just checked this).

@simonpasquier I agree.

@stuartnelson3
Contributor

I just started running a cluster at SoundCloud and would like to investigate it more thoroughly. For example, it appears that even after having run for a couple of hours, all instances are sending notifications. Given that we are using the default peer timeout (15s), and it doesn't appear that we're surpassing that timeout, I would expect just a single instance to be sending notifications based on how clusterWait is defined (https://github.com/prometheus/alertmanager/blob/master/cmd/alertmanager/main.go#L408-L412). I'm dubious that our mesh is so unstable that all instances are sending notifications so regularly.

[screenshot]

Perhaps I'm misunderstanding something about the mesh and someone can clarify for me what I'm seeing.
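
For context, here is a minimal, self-contained sketch of the staggering idea behind the linked clusterWait code, assuming a Position() method like the one on Alertmanager's cluster.Peer; the stand-in peer type and values below are illustrative, not the actual implementation. With a healthy mesh, only the peer at position 0 should normally send, and the others should see the gossiped notification log entry before their wait expires.

```go
package main

import (
	"fmt"
	"time"
)

// peer is a stand-in for Alertmanager's cluster.Peer; Position() is assumed
// to report this instance's index in the sorted list of cluster members.
type peer struct{ position int }

func (p *peer) Position() int { return p.position }

// clusterWait sketches the staggering: each peer waits peerTimeout multiplied
// by its position before sending a notification, so duplicates only appear if
// gossip hasn't delivered the notification log entry within that window.
func clusterWait(p *peer, peerTimeout time.Duration) func() time.Duration {
	return func() time.Duration {
		return time.Duration(p.Position()) * peerTimeout
	}
}

func main() {
	for pos := 0; pos < 3; pos++ {
		wait := clusterWait(&peer{position: pos}, 15*time.Second)
		fmt.Printf("peer %d waits %s before sending\n", pos, wait())
	}
}
```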

@stuartnelson3
Contributor

stuartnelson3 commented Apr 24, 2018

@stuartnelson3 Are these changes in the release-0.15 branch? I would suggest only cherry-picking bug fixes from the master branch into the release branch.

@mxinden What's your reasoning? It feels arbitrary not to release the changes made to amtool, especially when withholding them is preventing @TheTincho from releasing AM as a Debian package, and to instead release the new clustering library now, wait some period of time, and then release the amtool changes along with whatever else has been merged since.

@mxinden
Member Author

mxinden commented Apr 24, 2018

@stuartnelson3 In regards to all instances sending notifications: Would you mind creating a new issue with your configuration and status page? I would test it out on one of our dev clusters.

In regards to not including new features in release candidates: This strategy is not amtool related. My preference would be to not include new features (only bug fixes) between rc and release to preserve its tested stability.

If the current release-0.15 amtool version is not compatible with the current release-0.15 alertmanager binary, I would see those new amtool features as bug fixes.

This is not a strong opinion, just a suggestion. How have we been handling this before?

@stuartnelson3
Contributor

@mxinden will send the issue soon; it's half-written on my other laptop :)

In regards to not including new features in release candidates: This strategy is not amtool related. My preference would be to not include new features (only bug fixes) between rc and release to preserve its tested stability.

I have to check the changelog, but I believe all the non-bugfix changes have been in amtool.

Alertmanager's API hasn't changed, so amtool should be compatible. The changes are a couple of renamed flags, a move to a stable underlying CLI library, and a bug fix.

This is not a strong opinion, just a suggestion. How have we been handling this before?

I'm not sure, unfortunately. I don't think we have a set policy. I would prefer to get the amtool changes out as those won't affect running alertmanager, and whoever does the release can check for changes to alertmanager itself and decide if they're a risk to stability.

@stuartnelson3
Contributor

issue created: #1341

@tzz
Contributor

tzz commented Apr 27, 2018

Can 0.15 please include #1339 as a trivial fix?

I've been running rc1 and it's been absolutely great. Thank you, developers.

@stuartnelson3
Contributor

There is still the generally confusing issue explained in #1341, but having run a version of master that includes all the changes in v0.15.0-rc.1 and some bug fixes, I think we can release this. In general, the multi-send seems to be an issue with pipeline creation that only occurs rarely, and fixing that (i.e. syncing pipeline execution) will take more time to test and verify.

@mxinden do you have time to work on the release?

@mxinden
Member Author

mxinden commented May 4, 2018

@stuartnelson3 Sounds good. I will try my best over the next two days.

@mxinden
Member Author

mxinden commented May 5, 2018

I have created a release-0.15 branch. #1363 cherry-picks discussed commits from master into release-0.15. #1364 bumps VERSION and adds changes to CHANGELOG.md.

Let me know what you think.

@stuartnelson3
Contributor

According to @grobie, the rate of more than one peer sending alerts seems to be higher than it was with the previous mesh library.

I think we should address this before releasing a version that doesn't support some form of synchronization of pipeline execution between peers.

@grobie
Member

grobie commented May 8, 2018

I don't know whether it's multiple peers sending. We definitely see a significantly increased number of duplicated notifications sent with 0.15, even while the network and instances are healthy.

@simonpasquier
Member

Is there anything that looks suspicious when checking the cluster metrics (e.g. alertmanager_cluster_.+)?

@grobie
Member

grobie commented May 9, 2018

Nope. We spent some hours again this morning investigating the situation, added more debug logging and released the new version. Will work on that again this Friday.

@stuartnelson3
Contributor

We experienced a network partition over the weekend during which every Alertmanager was isolated. After the partition, the mesh did not recover. At the time of this writing, the 4 instances have been running independently for over 48 hours. alertmanager_cluster_health_score and alertmanager_cluster_messages_queued reflect the current bad state of the mesh.

We are actively looking at this and think that an official 0.15.0 release is not ready until we can figure out the root cause.

@mxinden, is there a test case for this in the Prometheus Operator acceptance tests?

@stuartnelson3
Contributor

Confirmed in a test using iptables between two machines in the same datacenter: once nodes are marked dead, they will not rejoin without restarting memberlist.

@simonpasquier
Member

Indeed, once a node has left the cluster (e.g. on connection time-out), memberlist on its own will never try to reconnect to it. Serf (which uses memberlist heavily) handles the reconnection with a background task:
https://github.com/hashicorp/serf/blob/80ab48778deee28e4ea2dc4ef1ebb2c5f4063996/serf/serf.go#L1426-L1437
https://github.com/hashicorp/serf/blob/80ab48778deee28e4ea2dc4ef1ebb2c5f4063996/serf/serf.go#L1484-L1523

AIUI AlertManager needs to deal with it in a similar fashion.
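
Not the actual fix, just a rough sketch of the shape such a Serf-style background reconnect could take on top of memberlist, assuming the peer addresses supplied at startup are kept around; the addresses, port, and interval below are placeholders.

```go
package main

import (
	"log"
	"time"

	"github.com/hashicorp/memberlist"
)

// reconnectLoop periodically re-attempts to join a set of known peer
// addresses, so that members marked dead after a partition are contacted
// again instead of being forgotten forever.
func reconnectLoop(ml *memberlist.Memberlist, knownPeers []string, interval time.Duration) {
	ticker := time.NewTicker(interval)
	defer ticker.Stop()
	for range ticker.C {
		// Join is harmless for peers that are already part of the cluster
		// and re-establishes contact with dead ones once they are reachable.
		if n, err := ml.Join(knownPeers); err != nil {
			log.Printf("reconnect attempt failed: %v", err)
		} else {
			log.Printf("reconnect attempt reached %d peers", n)
		}
	}
}

func main() {
	ml, err := memberlist.Create(memberlist.DefaultLANConfig())
	if err != nil {
		log.Fatal(err)
	}
	// Placeholder addresses standing in for the peers passed on the command line.
	go reconnectLoop(ml, []string{"am-1.example.com:9094", "am-2.example.com:9094"}, 30*time.Second)
	select {}
}
```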

@mxinden
Member Author

mxinden commented May 14, 2018

@mxinden, is there a test case for this in the Prometheus Operator acceptance tests?

@stuartnelson3 there is none at the moment, sorry. I will look into improving that for the future.

We are actively looking at this and think that an official 0.15.0 release is not ready until we can figure out the root cause.

I agree. Thanks a lot for looking into this.

@stuartnelson3
Contributor

stuartnelson3 commented May 17, 2018

Investigating another duplicate (where the pipelines are firing at the correct time):

- peer0 sends correctly
- peer2 and peer3 receive the nflog notification
- peer1 does not receive the nflog notification. 15 seconds pass, and it sends a duplicate.

I'm trying to understand why peer1 doesn't receive the updated state within the 15s window, which is leading me to read up on how memberlist works. I'm trying to figure out:

- how many times a gossip message is sent (TransmitLimitedQueue#GetBroadcasts)
- how to force all nodes to be gossiped to (utils.go#kRandomNodes)

I would prefer not to have to do a full state sync every 10s to ensure the messages have propagated, but that would probably stop the duplicates. The issue then is that we're not really gossiping any more, but maybe that's fine 🤷‍♂️.
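
For reference, a sketch of the memberlist configuration knobs that map to those two questions and to the periodic full state sync idea; the values below are illustrative, not what Alertmanager actually configures.

```go
package main

import (
	"log"
	"time"

	"github.com/hashicorp/memberlist"
)

func main() {
	cfg := memberlist.DefaultLANConfig()

	// Gossip fan-out and frequency: how many random peers (kRandomNodes) each
	// gossip round targets, and how often a round runs.
	cfg.GossipNodes = 4
	cfg.GossipInterval = 200 * time.Millisecond

	// RetransmitMult controls how many times TransmitLimitedQueue re-broadcasts
	// a message before dropping it (a small multiple that grows with cluster size).
	cfg.RetransmitMult = 4

	// The periodic full state exchange: every interval a node does a complete
	// push/pull sync with one randomly chosen peer.
	cfg.PushPullInterval = 10 * time.Second

	ml, err := memberlist.Create(cfg)
	if err != nil {
		log.Fatal(err)
	}
	log.Printf("local node: %s", ml.LocalNode().Name)
}
```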

edit:
This has been moved to its own issue: #1387

@anandsinghkunwar

Any ETA on the new release?

@stuartnelson3
Contributor

I've been addressing what I see as the two main issues blocking a new release.

I've just returned from a week's vacation and need to pick the work back up again. I hope to finish these in the next two weeks and get a release out.

@bforbis

bforbis commented Jun 11, 2018

How long will v0.15.0-rc.2 be monitored for stability before an official v0.15.0 is released? There's a lot of great stuff in here, so I'm very excited to start using it in an official release.

@stuartnelson3
Contributor

We're currently running it at SoundCloud. It's looking stable, so I'm hoping we can release it on Wednesday or Thursday.

@mxinden unless you have any objections, would you like to do the release for 0.15.0?

@mxinden
Member Author

mxinden commented Jun 11, 2018

@stuartnelson3 Wednesday or Thursday sounds good.

would you like to do the release for 0.15.0?

For sure. Glad to do so, provided no issues are reported before then. 🎉

@stuartnelson3

This comment has been minimized.

@brancz

This comment has been minimized.

@simonpasquier

This comment has been minimized.

@brancz

This comment has been minimized.

@stuartnelson3

This comment has been minimized.

@stuartnelson3
Contributor

Alertmanager with the latest changes for sending large messages via TCP seems to be doing the right thing.

I'll be gone without internet access for the next two weeks. I propose that @mxinden releases the final RC, @grobie keeps an eye on the Alertmanager dashboard over the weekend to make sure it's working correctly, and then @mxinden can release 0.15.0 early next week. How does that sound?

@mxinden
Member Author

mxinden commented Jun 15, 2018

@stuartnelson3 That sounds good to me.

@mxinden
Member Author

mxinden commented Jun 20, 2018

@grobie keeps an eye on the Alertmanager dashboard over the weekend to make sure it's working correctly

@grobie did you run into any issues with the new v0.15.0-rc.3 release candidate in the last couple of days?

I don't see any blocking issues reported by the community so far. 🎉

@grobie
Member

grobie commented Jun 20, 2018 via email

@Techcadia

+1 Sounds great

@brian-brazil
Contributor

Who's planning on doing the release?

@mxinden
Member Author

mxinden commented Jun 22, 2018

@brian-brazil I will prepare it today.

@brian-brazil
Contributor

Great!

@mxinden mxinden mentioned this issue Jun 22, 2018
@mxinden
Member Author

mxinden commented Jun 22, 2018

With #1429 merged, I will close here. Feel free to reopen if there are any questions.

@mxinden mxinden closed this as completed Jun 22, 2018
hh pushed a commit to ii/alertmanager that referenced this issue May 10, 2019
According to the golang docs, the syscall package is deprecated.
https://golang.org/pkg/syscall
This updates collectors to use the x/sys/unix package instead.
Also updates the vendored x/sys/unix module to latest.

Signed-off-by: Paul Gier <[email protected]>