Skip to content
This repository has been archived by the owner on May 3, 2022. It is now read-only.

ReleaseController application picks the tip of the release chain as contender #166

Merged
merged 1 commit into from
Sep 13, 2019

Conversation

osdrv
Copy link
Contributor

@osdrv osdrv commented Aug 27, 2019

The current implementation has a slightly relaxed algorithm of finding
contender and incumbent in the chain of releases: in particular, the
latest (sorted by generation) scheduled release is identified as a
contender and the latest complete one is our incumbent.

This commit changes the approach and forces release controller to start
using releaseutil methods for finding incumbent and contender. The
incumbent finding is still very naive as it's relying on release status
to figure out whether it's complete or not. On the improvement side, we
got rid of using the most dangerous condition in release object:
scheduled criteria. From now on, we always pick the tip of the release
chain.

Signed-off-by: Oleg Sidorov [email protected]

@osdrv osdrv self-assigned this Aug 27, 2019
@osdrv osdrv added the bug Something isn't working label Aug 27, 2019
@osdrv osdrv added this to the release-0.6 milestone Aug 27, 2019
@osdrv
Copy link
Contributor Author

osdrv commented Aug 29, 2019

@kanatohodets @juliogreff Hey folks! Mind dropping a line on this one as we all had concerns regarding this change, it would be great to keep it tracked. Thanks!

pkg/util/application/releases.go Outdated Show resolved Hide resolved
@juliogreff
Copy link
Contributor

My original worry on this change was actually unfounded (I took a code comment to its word, but that has since been addressed). I'm actually quite comfortable with the spirit of the change now that my misunderstanding has been cleared up.

There is a quite unlikely scenario that might be worth considering though. Imagine that a healthy release A is being replaced by release B. If B has its target step advanced all the way to the last step in the strategy, but a new release C gets created before B has time to complete (say, because of an emergency hotfix), the incumbent will remain being A, even though it might've lost quite some capacity during the rollout, and maybe in this particular case B could've been a better candidate.

Honestly I can't think of a good way around this (maybe preventing new releases when there's an incomplete contender?), so this is just putting it out there. As I said earlier, the idea behind the current state of this PR is sound to me now.

osdrv pushed a commit that referenced this pull request Sep 3, 2019
This change addresses the concern raised in
#166 (comment)
and enforces release equality check by their name, not pointer-based.

Signed-off-by: Oleg Sidorov <[email protected]>
pkg/controller/release/release_controller.go Outdated Show resolved Hide resolved
pkg/controller/release/release_controller.go Outdated Show resolved Hide resolved
pkg/controller/release/release_controller.go Outdated Show resolved Hide resolved
pkg/controller/release/release_controller.go Outdated Show resolved Hide resolved
@osdrv osdrv force-pushed the olegs/fix-get-release-pair branch from 416cb0e to 6f30009 Compare September 4, 2019 14:14
@osdrv osdrv marked this pull request as ready for review September 4, 2019 14:17
@osdrv osdrv force-pushed the olegs/fix-get-release-pair branch 3 times, most recently from e8bec08 to a580c1a Compare September 5, 2019 09:02
pkg/chart/repo/repo.go Outdated Show resolved Hide resolved
pkg/chart/repo/repo.go Outdated Show resolved Hide resolved
@osdrv osdrv force-pushed the olegs/fix-get-release-pair branch 2 times, most recently from 3897928 to 66b122d Compare September 5, 2019 14:19
juliogreff
juliogreff previously approved these changes Sep 12, 2019
The current implementation has a slightly relaxed algorithm of finding
contender and incumbent in the chain of releases: in particular, the
latest (sorted by generation) scheduled release is identified as a
contender and the latest complete one is our incumbent.

This commit changes the approach and forces release controller to start
using releaseutil methods for finding incumbent and contender. The
incumbent finding is still very naive as it's relying on release status
to figure out whether it's complete or not. On the improvement side, we
got rid of using the most dangerous condition in release object:
scheduled criteria. From now on, we always pick the tip of the release
chain.

Some extra fixes in release controller and chart repo:

* Chart repo returns ChartFetchFailureError if fetch failed
* Release controller: moved away from 2-step cluster selection and
scheduling: 1 pass now
* ChartRepoInternalError is a recognised error now
* Application utils: incumbent equality to contender is name-based

Signed-off-by: Oleg Sidorov <[email protected]>
@osdrv osdrv force-pushed the olegs/fix-get-release-pair branch from d1d5d79 to 3ef70e6 Compare September 12, 2019 15:38
@osdrv osdrv merged commit c2c2742 into master Sep 13, 2019
@osdrv osdrv deleted the olegs/fix-get-release-pair branch September 13, 2019 12:18
juliogreff added a commit that referenced this pull request Oct 1, 2019
This may or may not make conceptual sense (it does to me, but I could
not find out why choosing clusters and scheduling a release was two
separate steps in the first place, although after #166[1] I'm fairly
convinced this was just a technical artifact), but it sure is
convenient: we move all the error handling during the scheduling step
to a single chunk of code. This fixes an issue where errors in
ChooseClusters were not reflected in any condition, making the Release
object not change during the sync, and therefore not triggering any
events, being essentially invisible to users.

As a bonus, I restored the actual testing part of this in the unit
tests. We were previously just checking that ChooseClusters didn't
trigger any updates, without actually checking if it was doing the right
thing (choosing clusters).

[1] https://github.com/bookingcom/shipper/pull/166/files#diff-caffe52421149f1f8d77a0e7c749867dR327-R341
juliogreff added a commit that referenced this pull request Oct 1, 2019
This may or may not make conceptual sense (it does to me, but I could
not find out why choosing clusters and scheduling a release was two
separate steps in the first place, although after #166[1] I'm fairly
convinced this was just a technical artifact), but it sure is
convenient: we move all the error handling during the scheduling step
to a single chunk of code. This fixes an issue where errors in
ChooseClusters were not reflected in any condition, making the Release
object not change during the sync, and therefore not triggering any
events, being essentially invisible to users.

As a bonus, I restored the actual testing part of this in the unit
tests. We were previously just checking that ChooseClusters didn't
trigger any updates, without actually checking if it was doing the right
thing (choosing clusters).

[1] https://github.com/bookingcom/shipper/pull/166/files#diff-caffe52421149f1f8d77a0e7c749867dR327-R341
juliogreff added a commit that referenced this pull request Oct 4, 2019
This may or may not make conceptual sense (it does to me, but I could
not find out why choosing clusters and scheduling a release was two
separate steps in the first place, although after #166[1] I'm fairly
convinced this was just a technical artifact), but it sure is
convenient: we move all the error handling during the scheduling step
to a single chunk of code. This fixes an issue where errors in
ChooseClusters were not reflected in any condition, making the Release
object not change during the sync, and therefore not triggering any
events, being essentially invisible to users.

As a bonus, I restored the actual testing part of this in the unit
tests. We were previously just checking that ChooseClusters didn't
trigger any updates, without actually checking if it was doing the right
thing (choosing clusters).

[1] https://github.com/bookingcom/shipper/pull/166/files#diff-caffe52421149f1f8d77a0e7c749867dR327-R341
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Labels
bug Something isn't working
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants