Release Alertmanager v0.15.0 #1340
Comments
There were two issues I wanted to fix before the next release, #1331 and #1330. #1330 is blocking @TheTincho from making a Debian release, but I believe that still doesn't resolve the issue of our Elm frontend, i.e. even with #1330, he won't be able to release AM as an official Debian package. I haven't had a chance to test the new mesh lib at SoundCloud, which I would like to do before making an official release, but I'm also not aware of the level of testing you've done with it over at Red Hat. @simonpasquier commented on a PR that he thought it was stable; I would be interested in hearing his opinion. |
Hi, yes, #1330 is indeed a blocker for me now. The Elm issue has not gone away, but I have basically abandoned any hope of getting Elm into Debian (React does not look any better in this respect). So last week I started working on producing an official release without any web frontend. My plan is to upload this to experimental so users can avail of the newer AM, and meanwhile work on backporting the old, simpler frontend to AM 0.14. |
Correct, my own tests with 0.15 are conclusive (see my original comment). That being said, they are limited to my local environment, and although I've played a bit with ambench, it can't be compared to any (pre-)production setup. We have also had reports from users successfully deploying 0.15-RC versions with the Prometheus operator, which is encouraging. Maybe @iksaif has done some testing too? My feeling is that people eager to test the RC have done it already and the major issues regarding the clustering have been addressed (except for DNS resolution #1307, but AFAIU it isn't a blocking problem). Getting 0.15 out of the door would help surface new issues, if any. And in case of a blocker, downgrading from 0.15 to 0.14 works fine since the definition of the silences and notification logs on disk hasn't changed (I've just checked this). |
In that case, my chief concern then is that the release makes it explicit that mesh configuration requires FQDNs. EDIT: |
I have 0.15rc1 running in a limited environment and it seems to work. But
it doesn't get much traffic.
Maybe it would be nice to have an integration test that generates a few
hundred alerts and silences for ~20min.
--
Corentin Chary
http://xf.iksaif.net
|
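The load test suggested above could start from something like the following. This is only a minimal sketch: the `alert` struct mirrors the JSON fields the v1 alerts endpoint is assumed to accept, and `makeBatch`, the `LoadTest` alert name and the `host-NNN` instance labels are hypothetical names chosen for illustration. It builds the payload only; sending it and creating silences are left out.

```go
package main

import (
	"encoding/json"
	"fmt"
	"time"
)

// alert mirrors the JSON fields the v1 alerts endpoint is assumed to accept.
type alert struct {
	Labels      map[string]string `json:"labels"`
	Annotations map[string]string `json:"annotations,omitempty"`
	StartsAt    time.Time         `json:"startsAt"`
	EndsAt      time.Time         `json:"endsAt"`
}

// makeBatch builds n distinct alerts that stay active for ttl after now.
// The label values are made up for the purpose of the sketch.
func makeBatch(n int, now time.Time, ttl time.Duration) []alert {
	batch := make([]alert, 0, n)
	for i := 0; i < n; i++ {
		batch = append(batch, alert{
			Labels: map[string]string{
				"alertname": "LoadTest",
				"instance":  fmt.Sprintf("host-%03d", i),
			},
			Annotations: map[string]string{"summary": "synthetic alert"},
			StartsAt:    now,
			EndsAt:      now.Add(ttl),
		})
	}
	return batch
}

func main() {
	batch := makeBatch(300, time.Now(), 20*time.Minute)
	body, err := json.Marshal(batch)
	if err != nil {
		panic(err)
	}
	// A real test would POST body to each instance on a ticker for ~20 minutes
	// and interleave silence creation; this sketch only builds the payload.
	fmt.Printf("built %d alerts, %d bytes of JSON\n", len(batch), len(body))
}
```

Each alert carries a distinct `instance` label so that the alerts are not collapsed into one by identical label sets.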
@stuartnelson3 we have been testing the new Alertmanager with the Prometheus Operator test suite and have deployed it in some dev environments. No production experience so far.
@stuartnelson3 Are these changes in the
@simonpasquier I agree. |
I just started running a cluster at soundcloud and would like to investigate it more thoroughly. For example, it appears that even after having run for a couple hours, all instances are sending notifications. Given that we are using the default peer timeout (15s), and it doesn't appear that we're surpassing that timeout, I would expect just a single instance to be sending notifications based on how Perhaps I'm misunderstanding something about the mesh and someone can clarify for me what I'm seeing. |
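The expectation described above (only one instance sends while the others hold back) can be illustrated with a small sketch. It assumes that each peer has an integer position in the cluster and waits `position * peer timeout` before notifying, so that an earlier peer has time to send and gossip its notification log entry. The function name `clusterWait` and the three-peer loop are illustrative and are not quoted from the code under discussion.

```go
package main

import (
	"fmt"
	"time"
)

// clusterWait returns how long the peer at the given position should wait
// before sending a notification. Position 0 sends immediately; every later
// position waits one more peer timeout.
func clusterWait(position int, peerTimeout time.Duration) time.Duration {
	return time.Duration(position) * peerTimeout
}

func main() {
	const peerTimeout = 15 * time.Second // the default peer timeout mentioned above
	for pos := 0; pos < 3; pos++ {
		fmt.Printf("peer at position %d waits %s\n", pos, clusterWait(pos, peerTimeout))
	}
}
```

Under this assumption, a later peer only sends if, after its wait, it still has not learned that the notification was already delivered. Duplicates on a healthy network would then point at either the gossip or the pipeline timing.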
@mxinden What's your reasoning? It feels arbitrary not to release changes made to |
@stuartnelson3 In regards to all instances sending notifications: Would you mind creating a new issue with your configuration and status page? I would test it out on one of our dev clusters. In regards to not including new features in release candidates: This strategy is not If the current This is not a strong opinion just a suggestion. How have we been handling this before? |
@mxinden will send the issue soon, it's half-way written on my other laptop :)
I have to check the changelog, but I believe all non-bugfixes have been in Alertmanager's API hasn't changed, so amtool should be compatible. The changes are a couple flag names, moving to a stable underlying CLI library, and fixing a bug.
I'm not sure, unfortunately. I don't think we have a set policy. I would prefer to get the amtool changes out as those won't affect running alertmanager, and whoever does the release can check for changes to alertmanager itself and decide if they're a risk to stability. |
issue created: #1341 |
Can 0.15 please include #1339 as a trivial fix? I've been running rc1 and it's been absolutely great. Thank you, developers. |
There is still the generally confusing issue explained in #1341, but having run a version of master that includes all the changes in v0.15.0-rc.1 and some bug fixes, I think we can release this. In general, the multi-send seems to be an issue with pipeline creation that only occurs rarely, and fixing that (i.e. syncing pipeline execution) will take more time to test and verify. @mxinden do you have time to work on the release? |
@stuartnelson3 Sounds good. I will try my best in the upcoming two days. |
According to @grobie, there seems to be a higher rate of more than one peer sending alerts than with the previous mesh library. I think we should address this before releasing a version that doesn't support some form of synchronization of pipeline execution between peers. |
I don't know whether it's multiple peers sending. We definitely see a significantly increased number of duplicated notifications sent with 0.15, while the network and instances are healthy. |
Is there anything that looks suspicious in the cluster metrics (eg |
Nope. We spent some hours again this morning investigating the situation, added more debug logging and released the new version. Will work on that again this Friday. |
We experienced a network partition over the weekend during which every Alertmanager was isolated. After the partition, the mesh did not recover. At the time of this writing, the 4 instances have been running independently for over 48 hours. We are actively looking at this and think that an official 0.15.0 release is not ready until we can figure out the root cause. @mxinden is there a test case for this in the Prometheus operator acceptance tests? |
Confirmed in a test using iptables between two machines in the same datacenter, once nodes are marked dead they will not rejoin without restarting memberlist. |
Indeed once a node has left the cluster (eg on connection time-out), memberlist on its own will never try to reconnect to it. Serf (which is heavily using memberlist) handles the reconnection with a background task: AIUI AlertManager needs to deal with it in a similar fashion. |
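A background reconnect task of the kind described above might look roughly like this. It is a sketch under stated assumptions: the `joiner` interface only borrows the shape of memberlist's `Join(existing []string) (int, error)`, while `reconnector`, `peerLeft`, `retryOnce` and `fakeJoiner` are hypothetical names invented here, not Serf's or Alertmanager's actual implementation.

```go
package main

import (
	"fmt"
	"sync"
)

// joiner is the subset of a membership library this sketch needs.
// memberlist's Join has this shape; everything else here is hypothetical.
type joiner interface {
	Join(addrs []string) (int, error)
}

// reconnector remembers peers that have left or failed and retries them.
type reconnector struct {
	mu     sync.Mutex
	failed map[string]struct{}
	j      joiner
}

func newReconnector(j joiner) *reconnector {
	return &reconnector{failed: make(map[string]struct{}), j: j}
}

// peerLeft records a peer so that a later retry round attempts to rejoin it.
func (r *reconnector) peerLeft(addr string) {
	r.mu.Lock()
	defer r.mu.Unlock()
	r.failed[addr] = struct{}{}
}

// retryOnce tries to rejoin every failed peer and returns how many succeeded.
// A caller would run it from a time.Ticker loop in a background goroutine.
func (r *reconnector) retryOnce() int {
	r.mu.Lock()
	addrs := make([]string, 0, len(r.failed))
	for a := range r.failed {
		addrs = append(addrs, a)
	}
	r.mu.Unlock()

	rejoined := 0
	for _, a := range addrs {
		if _, err := r.j.Join([]string{a}); err != nil {
			continue // still unreachable; keep it for the next round
		}
		r.mu.Lock()
		delete(r.failed, a)
		r.mu.Unlock()
		rejoined++
	}
	return rejoined
}

// fakeJoiner stands in for the real cluster: only addresses in up are reachable.
type fakeJoiner struct{ up map[string]bool }

func (f fakeJoiner) Join(addrs []string) (int, error) {
	for _, a := range addrs {
		if !f.up[a] {
			return 0, fmt.Errorf("cannot reach %s", a)
		}
	}
	return len(addrs), nil
}

func main() {
	fj := fakeJoiner{up: map[string]bool{}}
	r := newReconnector(fj)
	r.peerLeft("am-1.example.com:9094")
	fmt.Println("rejoined during partition:", r.retryOnce())
	fj.up["am-1.example.com:9094"] = true // partition heals
	fmt.Println("rejoined after partition:", r.retryOnce())
}
```

Failed peers are kept in a set rather than dropped, which is what allows the cluster to heal after every node has been isolated. A real implementation would probably also expire entries after some time so that permanently removed peers are not retried forever.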
@stuartnelson3 there is none at the moment, sorry. I will look into improving that for the future.
I agree. Thanks a lot for looking into this. |
edit: |
Any ETA on the new release? |
I've been addressing what I see as the two main blockers for a new release: I've just returned from a week's vacation and need to pick the work back up again. I hope to finish these in the next two weeks and get a release out. |
How long will |
We're currently running it at SoundCloud. It's looking stable, so I'm hoping we can release it on Wednesday or Thursday. @mxinden unless you have any objections, would you like to do the release for 0.15.0? |
@stuartnelson3 Wednesday or Thursday sounds good.
For sure. Glad to do so in case there are no issues reported till then. 🎉 |
Alertmanager with the latest changes for sending large messages via TCP seems to be doing the right thing. I'm gone without internet access for the next two weeks. I propose that @mxinden releases the final rc and @grobie keeps an eye on the Alertmanager dashboard over the weekend to make sure it's working correctly; then @mxinden can release 0.15.0 early next week. How does this sound? |
@stuartnelson3 That sounds good to me. |
I mentioned it in #prometheus-dev earlier today, looks all good from what I
can see. Go for it!
…On Wed, Jun 20, 2018, 20:13 Max Inden ***@***.***> wrote:
@grobie <https://github.com/grobie> keeps an eye on the alertmanager
dashboard over the weekend to make sure it's working correctly
@grobie <https://github.com/grobie> did you run into any issues on the
new v0.15.0-rc3 release candidate in the last couple of days?
I don't see any blocking issues reported by the community so far. 🎉
|
+1 Sounds great |
Who's planning on doing the release? |
@brian-brazil I will prepare it today. |
Great! |
With #1429 merged, I will close here. Feel free to reopen if there are any questions. |
I would like to start the discussion of doing either a v0.15.0 or a v0.15.0-rc.2 release.
@stuartnelson3 @simonpasquier @fabxc are there any blockers that I am not aware of?
If there is consensus I am more than happy to prepare the next release.