Restart handler strategy behaviour #231
Conversation
@hwo-wd please review; if we agree on this coauthored PR, I'll find another pair of eyes to do the formal review
Mostly nitpicking
throttle: 1
loop: "{{ ansible_play_hosts }}"
block:
  - name: "Restart and enable {{ keycloak.service_name }} service on first host"
... service on {{ item }}"
when: inventory_hostname != ansible_play_hosts | first
- name: "Wait until {{ keycloak.service_name }} service becomes active {{ keycloak.health_url }}"
  ansible.builtin.uri:
    url: "{{ keycloak.health_url }}"
My intent here was indeed to contact localhost (localhost from delegate_to being the restarted node); do you mean that in a production scenario keycloak_quarkus_host would change from localhost? Ideally it should be keycloak_quarkus_frontend_url that takes the load-balanced / reverse-proxied domain name (no?).
you're right, I implied that keycloak_quarkus_host (=localhost, by default) == rhbk_frontend_url (=load balanced), which is not the case -- all good, sorry for the confusion
@@ -425,6 +431,20 @@ argument_specs:
      default: true
      description: "Allow the option to ignore invalid certificates when downloading JDBC drivers from a custom URL"
      type: "bool"
    keycloak_quarkus_restart_health_check:
type: "bool" | ||
keycloak_quarkus_restart_strategy: | ||
description: > | ||
Strategy task file for restarting in HA, one of [ 'serial', 'none', 'verify_first' ] below, or path to |
Strategy task file for restarting if keycloak_quarkus_restart_health_check==True, ...
block:
  - name: "Restart and enable {{ keycloak.service_name }} service on first host"
    ansible.builtin.include_tasks: ../restart.yml
    delegate_to: "{{ item }}"
I guess we might need the pause task here as well, otherwise the health check returns 200 OK quite early (since the 2nd node is up) -> we restart the 2nd service -> ispn cache lost; considering this is the default strategy (if ha enabled), we should make this the most failsafe, albeit the slowest:
- name: Pause to give distributed ispn caches time to (re-)replicate back onto first host
  ansible.builtin.pause:
    seconds: "{{ keycloak_quarkus_restart_pause }}"
  when:
    - keycloak_quarkus_ha_enabled
That's very true, I'll move the sleeping out of the strategy and directly inside the main restart.yml block
not sure about the verify_first naming, the longer I think about it the more it confuses me; how about restart_first_verify_restart_rest - not a nice name either, but more descriptive
restart is implied by the restart/ subdirectory... how about serial_then_parallel ?
let me expand on this, I probably jumped too fast to code without design notes.
- for HA clusters with embedded infinispan (on defaults, so with distributed-caches with 2 owners), we can tolerate only one node going down at a time, so serial is really the only strategy that works
- restarting a single node, verifying its liveliness, then parallelizing all other node restarts is for remote caches. When the first node goes down, the others preserve the service; when the first is up again, it takes the traffic back and all the others can restart. Ideally that "first" should become a parameter itself, could be "2", "owners-1" or even, one day, "one per region".
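Piecing together the snippets quoted in this review, the serial_then_parallel strategy under discussion presumably looks roughly like the sketch below; the retry/delay values and the exact task layout are assumptions, not the merged code.

```yaml
---
# Sketch of a serial_then_parallel strategy file (assumed layout).
# keycloak.service_name, keycloak.health_url and the sibling restart.yml
# come from the snippets quoted in this review; retries/delay are placeholders.
- name: Verify first restarted service with health URL, then restart the rest in parallel
  block:
    - name: "Restart and enable {{ keycloak.service_name }} service on first host"
      ansible.builtin.include_tasks: ../restart.yml
      when: inventory_hostname == ansible_play_hosts | first

    - name: "Wait until {{ keycloak.service_name }} service becomes active {{ keycloak.health_url }}"
      ansible.builtin.uri:
        url: "{{ keycloak.health_url }}"
      register: health_result
      until: health_result.status == 200
      retries: 25
      delay: 10
      when: inventory_hostname == ansible_play_hosts | first

    - name: "Restart and enable {{ keycloak.service_name }} service on remaining hosts"
      ansible.builtin.include_tasks: ../restart.yml
      when: inventory_hostname != ansible_play_hosts | first
```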
---
- name: Verify first restarted service with health URL, then rest restart in parallel
  block:
    - name: "Restart and enable {{ keycloak.service_name }} service on first host"
we should re-use restart.yml here with delegate_to: "{{ ansible_play_hosts | first }}" and run_once: true, i.e. replacing the next three tasks with this include instead
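In other words, the shape of this suggestion would be something like the following (a sketch of the reviewer's proposal, not the code that was eventually merged):

```yaml
# Sketch of the suggested single include, delegated to the first play host.
- name: "Restart and enable {{ keycloak.service_name }} service on first host"
  ansible.builtin.include_tasks: ../restart.yml
  delegate_to: "{{ ansible_play_hosts | first }}"
  run_once: true
```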
not really; restart.yml can have the health check turned off, while in here it is enforced
you're right, but still: e.g. introducing keycloak_quarkus_internal_enforce_health_check and setting it to true here would lower the maintenance burden in the long run -- your call, of course
mmh, on second thought, yeah you're right, let me amend that
@@ -7,7 +7,8 @@
   ansible.builtin.include_tasks: bootstrapped.yml
   listen: bootstrapped
 - name: "Restart {{ keycloak.service_name }}"
-  ansible.builtin.include_tasks: restart.yml
+  ansible.builtin.include_tasks:
+    file: "{{ keycloak_quarkus_restart_strategy if keycloak_quarkus_ha_enabled else 'restart.yml' }}"
   listen: "restart keycloak"
the linter seems to prefer Restart keycloak here
That's new... and kind of nonsense to me: ansible/ansible-lint#4168
LGTM
@hwo-wd I believe all your notes were addressed, and we got the green flag from Andy; if you have other commits to push or comments please do; or, I'll proceed and merge. This has been pretty epic :)
lgtm, thanks for the journey :)
Builds on and supersedes #230; a directory for restart strategies is provided with three implementations: none (nothing is restarted), serial (restart hosts in a serial fashion), serial_then_parallel (restart the first node, verify the health URL, proceed with the rest in parallel). The restart health check can be mixed and matched with the wait_for parameters. The default strategy file, imported from the restart handler, is 'serial_then_parallel'; users can use that, choose one of the others, or specify their own custom taskfile (in a path relative to the calling playbook).

serial is the only viable restart behaviour on keycloak with the default embedded infinispan caches (distributed-caches with owners = 2 implies a single point of failure). serial_then_parallel is for the case of remote infinispan caches, where liveliness of the first restarted node allows the other keycloak nodes to be restarted in parallel.

The molecule scenario quarkus_ha tests a cluster of two keycloak instances with a shared postgresql instance, restarting with the serial strategy.

New parameters:
| Parameter | Default |
| --- | --- |
| keycloak_quarkus_restart_strategy | restart/serial.yml |
| keycloak_quarkus_restart_health_check | {{ keycloak_quarkus_ha_enabled }} |
| keycloak_quarkus_restart_health_check_delay | 10 |
| keycloak_quarkus_restart_health_check_reries | 25 |
| keycloak_quarkus_restart_pause | 15 |
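For illustration, a playbook consuming the role could override these variables along the following lines; the role FQCN and the values shown are assumptions, only the variable names come from the table above.

```yaml
---
# Illustrative playbook only: the role name and the values are assumptions.
- name: Install keycloak in HA with a custom restart strategy
  hosts: keycloak
  vars:
    keycloak_quarkus_ha_enabled: true
    keycloak_quarkus_restart_strategy: restart/serial_then_parallel.yml
    keycloak_quarkus_restart_health_check: true
    keycloak_quarkus_restart_health_check_delay: 10
    keycloak_quarkus_restart_pause: 15
  roles:
    - middleware_automation.keycloak.keycloak_quarkus
```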
Fix #182
Fix #221