Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Restart handler strategy behaviour #231

Merged
merged 8 commits into from
May 16, 2024
Merged

Conversation

guidograzioli
Copy link
Member

@guidograzioli guidograzioli commented May 15, 2024

Builds on and supersedes #230; a directory for restart strategies is provided with three implementations: none (nothing is restarted), serial (restart hosts in a serial fashion), serial_then_parallel (restart first node, verify health url, proceed with the rest in parallel). Restart health check can be mix-and-matched with the wait_for parameters.

The default strategy file, imported from the restart handler, is 'serial_then_parallel'; user can use that, choose one of the other too, or specify their custom taskfile (in a relative path of the calling playbook).

  • serial is the only viable restart behaviour on keycloak with default embedded infinispan caches (distributed-caches with owners = 2 implies a single point of failure).
  • serial_then_parallel is for the case of remote infinispan caches, where liveliness of the first node to be restarted allows the other keycloak node to be restarted in parallel

The molecule scenario quarkus_ha tests a cluster of two keycloak instances with a shared postgresql instance, restart in serial strategy

New parameters:

Variable Description Default
keycloak_quarkus_restart_strategy Strategy task file for restarting in HA (one of provided restart/['serial.yml','none.yml','serial_then_parallel.yml']) or path to file when providing custom strategy restart/serial.yml
keycloak_quarkus_restart_health_check Whether to wait for successful health check after restart {{ keycloak_quarkus_ha_enabled }}
keycloak_quarkus_restart_health_check_delay Seconds to let pass before starting healch checks 10
keycloak_quarkus_restart_health_check_reries Number of attempts for successful health check before failing 25
keycloak_quarkus_restart_pause Seconds to wait between restarts in HA strategy 15

Fix #182
Fix #221

@guidograzioli guidograzioli force-pushed the feature/182_restart_handler branch from 8f3fe9a to 7349e42 Compare May 15, 2024 11:42
Co-authored-by: Helmut Wolf <[email protected]>
Co-authored-by: Guido Grazioli <[email protected]>
@guidograzioli guidograzioli force-pushed the feature/182_restart_handler branch from 7349e42 to 2d573c2 Compare May 15, 2024 11:48
@guidograzioli guidograzioli requested a review from hwo-wd May 15, 2024 11:54
@guidograzioli guidograzioli added the major_changes Major changes mean the user can CHOOSE to make a change when they update but do not have to label May 15, 2024
@guidograzioli
Copy link
Member Author

@hwo-wd please review; if we agree on this coauthored PR, I'll find another pair of eye to do the formal review

Copy link
Collaborator

@hwo-wd hwo-wd left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Mostly nitpicking

throttle: 1
loop: "{{ ansible_play_hosts }}"
block:
- name: "Restart and enable {{ keycloak.service_name }} service on first host"
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

... service on {{ item }}"

when: inventory_hostname != ansible_play_hosts | first
- name: "Wait until {{ keycloak.service_name }} service becomes active {{ keycloak.health_url }}"
ansible.builtin.uri:
url: "{{ keycloak.health_url }}"

This comment was marked as resolved.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

My intent here was indeed to contact localhost (localhost from delegate_to being the restarted node); do you mean in production scenario keycloak_quarkus_host would change from localhost? Ideally it should be keycloak_quarkus_frontend_url that takes the load balanced / reverse proxied domainname (no?).

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

you're right, I implied that keycloak_quarkus_host (=localhost, by default) ==rhbk_frontend_url (=load balanced) which is not the case -- all good, sry for the confusion

@@ -425,6 +431,20 @@ argument_specs:
default: true
description: "Allow the option to ignore invalid certificates when downloading JDBC drivers from a custom URL"
type: "bool"
keycloak_quarkus_restart_health_check:

This comment was marked as outdated.

type: "bool"
keycloak_quarkus_restart_strategy:
description: >
Strategy task file for restarting in HA, one of [ 'serial', 'none', 'verify_first' ] below, or path to
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Strategy task file for restarting if keycloak_quarkus_restart_health_check==True,...

block:
- name: "Restart and enable {{ keycloak.service_name }} service on first host"
ansible.builtin.include_tasks: ../restart.yml
delegate_to: "{{ item }}"
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I guess we might need the pause task here as well, otherwise the health check returns 200 OK quite early (since the 2nd node is up) -> we restart the 2nd service -> ispn cache lost; considering this is the default strategy (if ha enabled), we should make this the most failsafe, ableit the slowest

    - name: Pause to give distributed ispn caches time to (re-)replicate back onto first host
      ansible.builtin.pause:
        seconds: "{{ keycloak_quarkus_restart_pause }}"
      when:
        - keycloak_quarkus_ha_enabled

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That's very true, I'll move the sleeping out of the strategy and directly inside the main restart.yml block

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

not sure about verify_first naming, the longer I think about it the more it confuses me;
how about restart_first_verify_restart_rest - not a nice name either, but more descriptive

Copy link
Member Author

@guidograzioli guidograzioli May 15, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

restart is implied by the restart/ subdirectory... how about serial_then_parallel ?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

let me expand on this, I probably jumped to fast to code with out design notes.

  • for HA clusters with embedded infinispan (on defaults, so with distributed-caches with 2 owners), we can tolerate just one node to go down at a time, so serial is the really only one strategy that works
  • restarting a single node, verifying its liveliness, then parallelize all other node restarts is for remote caches. When the first node goes down, the others preserve the service; when the first is up again, it will take the traffic and all others can restart. Ideally that "first" should become a parameter itself, could be "2", "owners-1" or even, one day, "one per region".

@guidograzioli guidograzioli changed the title Feature/182 restart handler Restart handler strategy behaviour May 15, 2024
@guidograzioli guidograzioli requested a review from sabre1041 May 15, 2024 18:03
@hwo-wd hwo-wd self-requested a review May 16, 2024 07:48
---
- name: Verify first restarted service with health URL, then rest restart in parallel
block:
- name: "Restart and enable {{ keycloak.service_name }} service on first host"
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

we should re-use restart.yml here with delegate_to: "{{ ansible_play_hosts | first }}" and run_once: true, i.e. replacing the next three tasks with this include instead

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

not really; restart.yml can have the health check turned off, while in there is enforced

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

your right, but still: e.g. introducing keycloak_quarkus_internal_enforce_health_check and setting it to true here would lower the maintenance burden on the long run -- your call, of course

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

mmh after second thought, yeah you're right, let me amend that

@@ -7,7 +7,8 @@
ansible.builtin.include_tasks: bootstrapped.yml
listen: bootstrapped
- name: "Restart {{ keycloak.service_name }}"
ansible.builtin.include_tasks: restart.yml
ansible.builtin.include_tasks:
file: "{{ keycloak_quarkus_restart_strategy if keycloak_quarkus_ha_enabled else 'restart.yml' }}"
listen: "restart keycloak"
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

the linter seems to prefer Restart keycloak here

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That's new.. and kind of non-sense to me ansible/ansible-lint#4168

Copy link
Contributor

@sabre1041 sabre1041 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM

@guidograzioli
Copy link
Member Author

@hwo-wd I believe all your notes were addressed, and we got the green flag from Andy; if you have other commits to push or comments please do; or, I'll proceed and merge. This has been pretty epic :)

@hwo-wd
Copy link
Collaborator

hwo-wd commented May 16, 2024

lgtm, thanks for the journey :)

@guidograzioli guidograzioli merged commit 1519d46 into main May 16, 2024
24 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
major_changes Major changes mean the user can CHOOSE to make a change when they update but do not have to
Projects
None yet
3 participants