Skip to content
This repository has been archived by the owner on Jan 30, 2024. It is now read-only.

Add support for hand-over update strategy #8836

Merged
1 commit merged into from
Aug 12, 2022
Merged

Add support for hand-over update strategy #8836

1 commit merged into from
Aug 12, 2022

Conversation

ramirogm
Copy link
Contributor

@ramirogm ramirogm commented Jul 8, 2022

Change-type: patch
Signed-off-by: Ramiro Gonzalez [email protected]


Please remember to write tests for your changes. We aim to automatically
deploy master to production, and we can't safely do this without a solid
test suite!

@ghost
Copy link

ghost commented Jul 8, 2022

An error occurred whilst building your landr site preview:

{
  "name": "Error",
  "message": "Command failed with code undefined: /usr/src/app/node_modules/gatsby/cli.js build",
  "stack": "Error: Command failed with code undefined: /usr/src/app/node_modules/gatsby/cli.js build\n    at Object.exports.run (/usr/src/app/lib/build-runner.js:257:11)\n    at async build (/usr/src/app/bot/index.js:132:19)\n    at async /usr/src/app/bot/index.js:210:25\n    at async Promise.all (index 0)\n    at async middleware (/usr/src/app/node_modules/@octokit/webhooks/dist-node/index.js:355:5)"
}

@ramirogm ramirogm marked this pull request as draft July 8, 2022 03:25
Copy link
Contributor

@LucianBuzzo LucianBuzzo left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is really cool @ramirogm !
Would you be able to extract some of this logic out to a separate file?
I think this might be a nice utility to make available via npm i the future, for when we start deploying more of our services to balenaCloud+AWS

@ramirogm
Copy link
Contributor Author

@LucianBuzzo I've refactored the logic into a HandoverPeer component that we can extract later if useful.

@ramirogm ramirogm marked this pull request as ready for review July 12, 2022 12:59
@ramirogm ramirogm force-pushed the ramirogm/hand-over branch from 9c79ccf to 6285ecc Compare July 15, 2022 17:41
@ramirogm
Copy link
Contributor Author

There's a test repo for the library on https://github.com/ramirogm/handover-lib-test

I haven't extracted it to its own repo yet - for now, is a copy-paste to test it while developing

@dfunckt
Copy link

dfunckt commented Jul 26, 2022

Interesting stuff. Just wanted to point out that this seems to assume a single instance and if we ever need to scale horizontally the handover algorithm will need to be revised.

@ramirogm
Copy link
Contributor Author

ramirogm commented Jul 26, 2022 via email

@dfunckt
Copy link

dfunckt commented Jul 26, 2022

As it is, the handover algorithm can only handle one instance succeeding another because it's based on instance startup time which is impossible to control precisely. When you consider multiple instances, you quickly see that during deployment the system will never reach a steady state, because every instance will have slightly different startup time, and even if it did, it would re-enter that unstable state as soon as an instance crashed and restarted.

@ab77
Copy link
Member

ab77 commented Aug 2, 2022

How would this work around starting up multiple haproxy containers with host port mappings?

@ramirogm
Copy link
Contributor Author

ramirogm commented Aug 4, 2022

As it is, the handover algorithm can only handle one instance succeeding another because it's based on instance startup time which is impossible to control precisely. When you consider multiple instances, you quickly see that during deployment the system will never reach a steady state, because every instance will have slightly different startup time, and even if it did, it would re-enter that unstable state as soon as an instance crashed and restarted.

Hi @dfunckt I'd like to try this to understand the issue better. Initially, I thought we were talking of instances as multiple devices inside a fleet, but now I think you mean multiple instances of the same app in a docker-compose file? Using different service names on the docker-compose file ( like api1, api2 )? Or something else?

@ramirogm
Copy link
Contributor Author

ramirogm commented Aug 4, 2022

Hi @ab77

How would this work around starting up multiple haproxy containers with host port mappings?

I'm currently testing a reduced version of the JF setup: haproxy with no specific handover strategy fronting an app that uses the handover strategy with the improvements here that provide a) clean shutdown and b) handover when the new instance is ready to handle connections.

In this setup the supervisor replaces the haproxy instance for the new one as the first task, and then when the handover is done the new haproxy picks up the IP address change. In the logs it shows up as:

03.08.22 18:11:14 (-0300)  api  2022-08-03T21:11:14.347Z [build/handover-peer/index] [info] [SERVER-PID-18-localhost-0.0.1-primary]: PID: 18. Waiting for shutdown
03.08.22 18:11:15 (-0300) Service exited 'api sha256:f5f1767ee504868390746ed8a266b3166ed5bcfa96a5371cce0cc2afd1cd2f1e'
03.08.22 18:11:15 (-0300) Killed service 'api sha256:f5f1767ee504868390746ed8a266b3166ed5bcfa96a5371cce0cc2afd1cd2f1e'
[...]
03.08.22 18:11:15 (-0300)  haproxy  [WARNING]  (10) : api-backend/api changed its IP from 172.18.0.4 to 172.18.0.3 by docker-bridge-resolver/docker-resolver.
03.08.22 18:11:15 (-0300)  haproxy  api-backend/api changed its IP from 172.18.0.4 to 172.18.0.3 by docker-bridge-resolver/docker-resolver.
03.08.22 18:11:15 (-0300)  haproxy  api-backend/api changed its IP from 172.18.0.4 to 172.18.0.3 by docker-bridge-resolver/docker-resolver.

I'm pinging from another host, similar to what nodeping does, and I see no downtime.

@ramirogm
Copy link
Contributor Author

ramirogm commented Aug 4, 2022

re-Correction. Found out there's a conflict between our goals of zero-downtime and orderly shutdown.
There's a downtime of about 2 secs but it's not related to haproxy, but to the fact that we're doing an orderly shutdown. There's a lapse ( in this example those 2secs ) when the original service is shutting down and is not handling new requirements, the new service is already there to handle requessts, but the supervisor ( the docker-bridge-resolver/docker-resolver ) hasn't switched the service from from the old IP to the new one. It only does that after killing the old service. During this lapse haproxy is sending requests to the service that is shutting down, so you get a 503
For JF I'll keep the io.balena.update.handover-timeout to a low value ( 1 sec ) and speed up the shutdown to give priority to the "almos-zero downtime" goal.
Note that currently because of the download-then-kill default strategy JF is being killed merciless anyway . Not sure if this is causing some resource leakage ( redis? db? ) or if they fix themselves

@ramirogm ramirogm force-pushed the ramirogm/hand-over branch from f84b5e0 to e218f53 Compare August 10, 2022 12:23
@ramirogm
Copy link
Contributor Author

ramirogm commented Aug 10, 2022

This is now working with a library that implements the handover logic; See https://github.com/balena-io-playground/handover-lib

@ramirogm
Copy link
Contributor Author

I've done some testing on a test fleet with two devices: https://dashboard.balena-cloud.com/fleets/1949587

Some notes:

  • The handover is done after the new service starts up. This may take up to 20secs since the startup code upserts a lot of contracts etc
  • There is maybe a 1 sec of downtime while haproxy restarts
  • While the two instances are running BOTH handle requests, since the two are alive and the docker embedded DNS server reports BOTH IP addresses ( the old and new one ) to haproxy ( which uses this DNS server ).
  • When the original instance shuts down, there's a 1-3 sec period where haproxy may report the backend-api as down; during this time it is sending requests to the old IP which is not listening and waiting to be killed and its entry removed from the DNS round robin

@ramirogm ramirogm force-pushed the ramirogm/hand-over branch from e218f53 to 6d05da8 Compare August 10, 2022 12:52
apps/server/lib/index.ts Outdated Show resolved Hide resolved
apps/server/lib/index.ts Outdated Show resolved Hide resolved
apps/server/package.json Outdated Show resolved Hide resolved
@ramirogm ramirogm force-pushed the ramirogm/hand-over branch from 6d05da8 to 62f76d6 Compare August 12, 2022 01:11
@joshbwlng
Copy link
Collaborator

@ramirogm Thanks for the fixes, LGTM

@ramirogm ramirogm force-pushed the ramirogm/hand-over branch from 0997c1f to bc135c2 Compare August 12, 2022 02:00
Change-type: minor
Signed-off-by: Ramiro Gonzalez <[email protected]>
@ramirogm ramirogm force-pushed the ramirogm/hand-over branch from bc135c2 to 8764f06 Compare August 12, 2022 02:01
@ramirogm
Copy link
Contributor Author

@balena-ci I self-certify!

@ghost ghost merged commit e681b0b into master Aug 12, 2022
@ghost ghost deleted the ramirogm/hand-over branch August 12, 2022 02:37
This pull request was closed.
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

5 participants