Release Alertmanager v0.15.0 #1340

Closed
mxinden opened this issue Apr 23, 2018 · 43 comments

Comments

@mxinden
Member

mxinden commented Apr 23, 2018

I would like to start the discussion about doing either a v0.15.0 or a v0.15.0-rc.2 release.

@stuartnelson3 @simonpasquier @fabxc are there any blockers that I am not aware of?

If there is consensus I am more than happy to prepare the next release.

@stuartnelson3
Contributor

There were two issues I wanted to fix before the next release: #1331 and #1330. #1330 is blocking @TheTincho from making a Debian release, but I believe that still doesn't resolve the issue of our Elm frontend, i.e. even with #1330 he won't be able to release AM as an official Debian package.

I haven't had a chance to test the new mesh library at SoundCloud, which I would like to do before making an official release, but I'm also not aware of the level of testing you've done with it over at Red Hat. @simonpasquier commented on a PR that he thought it was stable; I would be interested in hearing his opinion.

@NightTsarina
Contributor

Hi,

Yes, #1330 is indeed a blocker for me now.

The Elm issue has not gone away, but I have basically abandoned any hope of getting Elm into Debian (React does not look any better in this respect). So last week I started working on producing an official release without any web frontend.

My plan would be to upload this to experimental so users can make use of the newer AM, and meanwhile work on backporting the old, simpler frontend to AM 0.14.

@simonpasquier
Member

Correct, my own tests with 0.15 are conclusive (see my original comment). That being said, they are limited to my local environment and, although I've played a bit with ambench, that can't be compared to any (pre-)production setup. We have also had reports from users successfully deploying 0.15-RC versions with the Prometheus Operator, which is encouraging. Maybe @iksaif has done some testing too?

My feeling is that people eager to test the RC have done it already and the major issues regarding clustering have been addressed (except for DNS resolution, #1307, but AFAIU it isn't a blocking problem). Getting 0.15 out the door would help surface new issues, if any. And in case of a blocker, downgrading from 0.15 to 0.14 works fine since the on-disk format of the silences and notification logs hasn't changed (I've just checked this).

@stuartnelson3
Contributor

stuartnelson3 commented Apr 24, 2018

In that case, my chief concern is that the release makes it explicit that mesh configuration requires FQDNs.

EDIT:
There are also breaking changes in amtool.

@iksaif
Contributor

iksaif commented Apr 24, 2018 via email

@mxinden
Member Author

mxinden commented Apr 24, 2018

@stuartnelson3 we have been testing the new Alertmanager with the Prometheus Operator test suite and have deployed it in some dev environments. No production experience so far.

There are also breaking changes in amtool.

@stuartnelson3 Are these changes in the release-0.15 branch? I would suggest only cherry-picking bug fixes from the master branch into the release branch.

My feeling is that people eager to test the RC have done it already and the major issues regarding clustering have been addressed (except for DNS resolution, #1307, but AFAIU it isn't a blocking problem). Getting 0.15 out the door would help surface new issues, if any. And in case of a blocker, downgrading from 0.15 to 0.14 works fine since the on-disk format of the silences and notification logs hasn't changed (I've just checked this).

@simonpasquier I agree.

@stuartnelson3
Contributor

I just started running a cluster at SoundCloud and would like to investigate it more thoroughly. For example, it appears that even after having run for a couple of hours, all instances are sending notifications. Given that we are using the default peer timeout (15s), and it doesn't appear that we're surpassing that timeout, I would expect just a single instance to be sending notifications based on how clusterWait is defined (https://github.com/prometheus/alertmanager/blob/master/cmd/alertmanager/main.go#L408-L412). I'm dubious that our mesh is so unstable that all instances are sending notifications so regularly.

[screenshot]

Perhaps I'm misunderstanding something about the mesh and someone can clarify for me what I'm seeing.
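
For context, here is a minimal, self-contained sketch of the staggering idea behind the linked clusterWait code, assuming a Position() method like the one on Alertmanager's cluster.Peer; the stand-in peer type and values below are illustrative, not the actual implementation. With a healthy mesh, only the peer at position 0 should normally send, and the others should see the gossiped notification log entry before their wait expires.

```go
package main

import (
	"fmt"
	"time"
)

// peer is a stand-in for Alertmanager's cluster.Peer; Position() is assumed
// to report this instance's index in the sorted list of cluster members.
type peer struct{ position int }

func (p *peer) Position() int { return p.position }

// clusterWait sketches the staggering: each peer waits peerTimeout multiplied
// by its position before sending a notification, so duplicates only appear if
// gossip hasn't delivered the notification log entry within that window.
func clusterWait(p *peer, peerTimeout time.Duration) func() time.Duration {
	return func() time.Duration {
		return time.Duration(p.Position()) * peerTimeout
	}
}

func main() {
	for pos := 0; pos < 3; pos++ {
		wait := clusterWait(&peer{position: pos}, 15*time.Second)
		fmt.Printf("peer %d waits %s before sending\n", pos, wait())
	}
}
```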

@stuartnelson3
Contributor

stuartnelson3 commented Apr 24, 2018

@stuartnelson3 Are these changes in the release-0.15 branch? I would suggest only cherry-picking bug fixes from the master branch into the release branch.

@mxinden What's your reasoning? It feels arbitrary not to release the changes made to amtool, especially when withholding them is preventing @TheTincho from releasing AM as a Debian package, and to instead release the new clustering library now, wait some period of time, and then release the amtool changes along with whatever else has been merged since.

@mxinden
Member Author

mxinden commented Apr 24, 2018

@stuartnelson3 In regards to all instances sending notifications: Would you mind creating a new issue with your configuration and status page? I would test it out on one of our dev clusters.

In regards to not including new features in release candidates: This strategy is not amtool related. My preference would be to not include new features (only bug fixes) between rc and release to preserve its tested stability.

If the current release-0.15 amtool version is not compatible with the current release-0.15 alertmanager binary, I would see those new amtool features as bug fixes.

This is not a strong opinion, just a suggestion. How have we been handling this before?

@stuartnelson3
Contributor

@mxinden will send the issue soon; it's half-written on my other laptop :)

In regards to not including new features in release candidates: This strategy is not amtool related. My preference would be to not include new features (only bug fixes) between rc and release to preserve its tested stability.

I have to check the changelog, but I believe all the non-bugfix changes have been in amtool.

Alertmanager's API hasn't changed, so amtool should be compatible. The changes are a couple of renamed flags, a move to a stable underlying CLI library, and a bug fix.

This is not a strong opinion, just a suggestion. How have we been handling this before?

I'm not sure, unfortunately. I don't think we have a set policy. I would prefer to get the amtool changes out as those won't affect running alertmanager, and whoever does the release can check for changes to alertmanager itself and decide if they're a risk to stability.

@stuartnelson3
Contributor

issue created: #1341

@tzz
Contributor

tzz commented Apr 27, 2018

Can 0.15 please include #1339 as a trivial fix?

I've been running rc1 and it's been absolutely great. Thank you, developers.

@stuartnelson3
Contributor

There is still the generally confusing issue explained in #1341, but having run a version of master that includes all the changes in v0.15.0-rc.1 and some bug fixes, I think we can release this. In general, the multi-send seems to be an issue with pipeline creation that only occurs rarely, and fixing that (i.e. syncing pipeline execution) will take more time to test and verify.

@mxinden do you have time to work on the release?

@mxinden
Member Author

mxinden commented May 4, 2018

@stuartnelson3 Sounds good. I will try my best over the next two days.

@mxinden
Member Author

mxinden commented May 5, 2018

I have created a release-0.15 branch. #1363 cherry-picks discussed commits from master into release-0.15. #1364 bumps VERSION and adds changes to CHANGELOG.md.

Let me know what you think.

@stuartnelson3
Contributor

According to @grobie, the rate of more than one peer sending alerts seems to be higher than it was with the previous mesh library.

I think we should address this before releasing a version that doesn't support some form of synchronization of pipeline execution between peers.

@grobie
Member

grobie commented May 8, 2018

I don't know whether it's multiple peers sending. We definitely see a significantly increased number of duplicated notifications sent with 0.15, even while the network and instances are healthy.

@simonpasquier
Member

Is there anything that looks suspicious when checking the cluster metrics (e.g. alertmanager_cluster_.+)?

@grobie
Member

grobie commented May 9, 2018

Nope. We spent some hours again this morning investigating the situation, added more debug logging and released the new version. Will work on that again this Friday.

@stuartnelson3
Contributor

We experienced a network partition over the weekend during which every Alertmanager was isolated. After the partition, the mesh did not recover. At the time of this writing, the 4 instances have been running independently for over 48 hours. alertmanager_cluster_health_score and alertmanager_cluster_messages_queued reflect the current bad state of the mesh.

We are actively looking at this and think that an official 0.15.0 release is not ready until we can figure out the root cause.

@mxinden, is there a test case for this in the Prometheus Operator acceptance tests?

@stuartnelson3
Contributor

Confirmed in a test using iptables between two machines in the same datacenter: once nodes are marked dead, they will not rejoin without restarting memberlist.

@simonpasquier
Member

Indeed, once a node has left the cluster (e.g. on connection time-out), memberlist on its own will never try to reconnect to it. Serf (which uses memberlist heavily) handles the reconnection with a background task:
https://github.com/hashicorp/serf/blob/80ab48778deee28e4ea2dc4ef1ebb2c5f4063996/serf/serf.go#L1426-L1437
https://github.com/hashicorp/serf/blob/80ab48778deee28e4ea2dc4ef1ebb2c5f4063996/serf/serf.go#L1484-L1523

AIUI AlertManager needs to deal with it in a similar fashion.
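
Not the actual fix, just a rough sketch of the shape such a Serf-style background reconnect could take on top of memberlist, assuming the peer addresses supplied at startup are kept around; the addresses, port, and interval below are placeholders.

```go
package main

import (
	"log"
	"time"

	"github.com/hashicorp/memberlist"
)

// reconnectLoop periodically re-attempts to join a set of known peer
// addresses, so that members marked dead after a partition are contacted
// again instead of being forgotten forever.
func reconnectLoop(ml *memberlist.Memberlist, knownPeers []string, interval time.Duration) {
	ticker := time.NewTicker(interval)
	defer ticker.Stop()
	for range ticker.C {
		// Join is harmless for peers that are already part of the cluster
		// and re-establishes contact with dead ones once they are reachable.
		if n, err := ml.Join(knownPeers); err != nil {
			log.Printf("reconnect attempt failed: %v", err)
		} else {
			log.Printf("reconnect attempt reached %d peers", n)
		}
	}
}

func main() {
	ml, err := memberlist.Create(memberlist.DefaultLANConfig())
	if err != nil {
		log.Fatal(err)
	}
	// Placeholder addresses standing in for the peers passed on the command line.
	go reconnectLoop(ml, []string{"am-1.example.com:9094", "am-2.example.com:9094"}, 30*time.Second)
	select {}
}
```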

@mxinden
Member Author

mxinden commented May 14, 2018

@mxinden, is there a test case for this in the Prometheus Operator acceptance tests?

@stuartnelson3 there is none at the moment, sorry. I will look into improving that for the future.

We are actively looking at this and think that an official 0.15.0 release is not ready until we can figure out the root cause.

I agree. Thanks a lot for looking into this.

@stuartnelson3
Contributor

stuartnelson3 commented May 17, 2018

Investigating another duplicate (where the pipelines are firing at the correct time):

- peer0 sends correctly
- peer2 and peer3 receive the nflog notification
- peer1 does not receive the nflog notification. 15 seconds pass, and it sends a duplicate.

I'm trying to understand why peer1 doesn't receive the updated state within the 15s window, which is leading me to read up on how memberlist works. I'm trying to figure out:

- how many times a gossip message is sent (TransmitLimitedQueue#GetBroadcasts)
- how to force all nodes to be gossiped to (utils.go#kRandomNodes)

I would prefer not to have to do a full state sync every 10s to ensure the messages have propagated, but that would probably stop the duplicates. The issue then is that we're not really gossiping any more, but maybe that's fine 🤷‍♂️.
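
For reference, a sketch of the memberlist configuration knobs that map to those two questions and to the periodic full state sync idea; the values below are illustrative, not what Alertmanager actually configures.

```go
package main

import (
	"log"
	"time"

	"github.com/hashicorp/memberlist"
)

func main() {
	cfg := memberlist.DefaultLANConfig()

	// Gossip fan-out and frequency: how many random peers (kRandomNodes) each
	// gossip round targets, and how often a round runs.
	cfg.GossipNodes = 4
	cfg.GossipInterval = 200 * time.Millisecond

	// RetransmitMult controls how many times TransmitLimitedQueue re-broadcasts
	// a message before dropping it (a small multiple that grows with cluster size).
	cfg.RetransmitMult = 4

	// The periodic full state exchange: every interval a node does a complete
	// push/pull sync with one randomly chosen peer.
	cfg.PushPullInterval = 10 * time.Second

	ml, err := memberlist.Create(cfg)
	if err != nil {
		log.Fatal(err)
	}
	log.Printf("local node: %s", ml.LocalNode().Name)
}
```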

edit:
This has been moved to its own issue: #1387

@anandsinghkunwar

Any ETA on the new release?

@stuartnelson3
Contributor

I've been addressing what I see as the two main issues blocking a new release.

I've just returned from a week's vacation and need to pick the work back up again. I hope to finish these in the next two weeks and get a release out.

@bforbis

bforbis commented Jun 11, 2018

How long will v0.15.0-rc.2 be monitored for stability before an official v0.15.0 is released? There's a lot of great stuff in here, so I'm very excited to start using it in an official release.

@stuartnelson3
Contributor

We're currently running it at SoundCloud. It's looking stable, so I'm hoping we can release it on Wednesday or Thursday.

@mxinden unless you have any objections, would you like to do the release for 0.15.0?

@mxinden
Member Author

mxinden commented Jun 11, 2018

@stuartnelson3 Wednesday or Thursday sounds good.

would you like to do the release for 0.15.0?

For sure. Glad to do so, provided no issues are reported before then. 🎉

@stuartnelson3

This comment has been minimized.

@brancz

This comment has been minimized.

@simonpasquier

This comment has been minimized.

@brancz

This comment has been minimized.

@stuartnelson3

This comment has been minimized.

@stuartnelson3
Contributor

Alertmanager with the latest changes for sending large messages via TCP seems to be doing the right thing.

I'll be gone without internet access for the next two weeks. I propose that @mxinden releases the final RC, @grobie keeps an eye on the Alertmanager dashboard over the weekend to make sure it's working correctly, and then @mxinden can release 0.15.0 early next week. How does that sound?

@mxinden
Member Author

mxinden commented Jun 15, 2018

@stuartnelson3 That sounds good to me.

@mxinden
Member Author

mxinden commented Jun 20, 2018

@grobie keeps an eye on the Alertmanager dashboard over the weekend to make sure it's working correctly

@grobie did you run into any issues with the new v0.15.0-rc.3 release candidate in the last couple of days?

I don't see any blocking issues reported by the community so far. 🎉

@grobie
Member

grobie commented Jun 20, 2018 via email

@Techcadia

+1 Sounds great

@brian-brazil
Contributor

Who's planning on doing the release?

@mxinden
Member Author

mxinden commented Jun 22, 2018

@brian-brazil I will prepare it today.

@brian-brazil
Contributor

Great!

@mxinden mxinden mentioned this issue Jun 22, 2018
@mxinden
Member Author

mxinden commented Jun 22, 2018

With #1429 merged, I will close here. Feel free to reopen if there are any questions.

@mxinden mxinden closed this as completed Jun 22, 2018
hh pushed a commit to ii/alertmanager that referenced this issue May 10, 2019
According to the golang docs, the syscall package is deprecated.
https://golang.org/pkg/syscall
This updates collectors to use the x/sys/unix package instead.
Also updates the vendored x/sys/unix module to latest.

Signed-off-by: Paul Gier <[email protected]>