Add health checks to upgrade playbook #4372
Conversation
See https://github.com/juanvallejo/openshift-ansible/tree/a4728f1335dc4f5d90ff3c5db976e83a41df571d/playbooks/common/openshift-cluster/upgrades/pre; some precondition checks already exist there.
roles:
- openshift_health_checker
vars:
- r_openshift_health_checker_playbook_context: "upgrade"
Presumably, we need to handle this in openshift-ansible/roles/openshift_health_checker/callback_plugins/zz_failure_summary.py (line 98 in 2d4709b):
if context in ['pre-install', 'health']:
Force-pushed a4728f1 to 4b4e609 (compare)
Does
I bet it doesn't, I grepped for
@@ -95,7 +95,7 @@ def _print_check_failure_summary(self, failed_checks, context):
     'Variables can be set in the inventory or passed on the\n'
     'command line using the -e flag to ansible-playbook.\n'
     ).format(playbook=self._playbook_file, checks=checks)
-    if context in ['pre-install', 'health']:
+    if context in ['pre-install', 'pre-upgrade', 'health']:
@sosiouxme when you designed this, do you remember in which "contexts" the summary message below should not be printed?
@rhcarvalho @juanvallejo I guess a comment would have helped. So what prompted this was that if you're running a playbook like byo/config.yml, these checks may come as a surprise, and that needs a little more explanation. If you're running a playbook that's specifically for running checks, it's not about "failing early" since the checks are the whole point, and so the default preamble above is a little out of place. I could imagine other cases in which we'd like to customize the message or even the outcome based on the user intent/expectations. But for now, there are really just two cases: you're trying to run checks, or you're trying to get something else done and the checks are "in the way".
So a "pre-upgrade" context would mean you're running a playbook specifically to see if your upgrade is gonna work out. An "upgrade" context would mean you're actually trying to run the upgrade.
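The two-case distinction could be sketched roughly like this; the context names come from the discussion, but the messages and function names here are illustrative, not the actual zz_failure_summary.py implementation:

```python
# Hypothetical sketch: a failure-summary callback picks its message
# based on the playbook context. Message text is illustrative only.

CHECK_ONLY_CONTEXTS = ('pre-install', 'health')

def failure_summary(context):
    """Return a failure summary suited to the user's intent."""
    if context in CHECK_ONLY_CONTEXTS:
        # Running checks was the whole point; no preamble needed.
        return 'The execution of checks found failures.'
    # Checks ran as a gate inside another playbook (install/upgrade),
    # so explain why they appeared at all.
    return ('The execution stopped early because preflight checks '
            'found failures.')
```

The only decision point is whether the playbook's purpose was the checks themselves; everything else shares one code path.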
verify_memory_and_diskspace.yml should be called from an include: at the bottom of common/openshift-cluster/upgrades/init.yml. This will call the verification checks after the groups and facts have been established.
@@ -0,0 +1,12 @@
- name: Verify Host Requirements
  hosts: OSEv3
This should be oo_all_hosts. OSEv3 should not be called in common playbooks.
Force-pushed 4b4e609 to 0621520 (compare)
@mtnbikenc @rhcarvalho thanks for the feedback, review comments addressed
Force-pushed 0621520 to 1278b26 (compare)
roles:
- openshift_health_checker
vars:
- r_openshift_health_checker_playbook_context: "pre-upgrade"
So if this is being included in an actual upgrade playbook above, I would say that should be just "upgrade" context. If you want to reuse the playbook it should probably not specify a context, instead relying on including playbooks to set context.
Also I don't like calling this verify_memory_and_diskspace.yml -- I assume we're going to add more checks to the same play in later iterations.
Thanks, will update
Force-pushed a11a0e5 to 73e0a2e (compare)
roles:
- openshift_health_checker
vars:
- r_openshift_health_checker_playbook_context: "pre-upgrade"
I believe we can drop the quotes in this case.
args:
  checks:
  - disk_availability
  - memory_availability
Missing newline character.
roles:
- openshift_health_checker
vars:
- r_openshift_health_checker_playbook_context: "upgrade"
We could safely drop the quotes here, just like in all other places in this YAML file.
@@ -23,3 +23,5 @@
   set_fact:
     os_firewall_use_firewalld: false
   when: "'Active: active' in service_iptables_status.stdout"

- include: ./pre/verify_memory_and_diskspace.yml
This reference is outdated.
- action: openshift_health_check
  args:
    checks:
    - disk_availability
The memory requirements in an upgrade should match the requirements of the newer version... but for disk requirements, we probably have at least two scenarios:
- In-place upgrade: the free space requirement should likely be different than for an install; after all, there is already data in the cluster (e.g., Docker images and installed RPM packages). I couldn't find any specific guidance about free disk space in the docs.
- Blue-green: I believe the free space requirement for the new hosts should be the same as for a fresh install.
We may need a way to tell when an upgrade is in-place and when it is blue-green.
Maybe we could propagate a context to our checks?
What context do you have in mind? We do have r_openshift_health_checker_playbook_context, but that's manually set from playbooks.
If you're just configuring how the checks run per playbook, you can just add another variable (like r_openshift_health_checker_playbook_context) on the playbook and read that from the check. The only thing that's special about that context var is that it gets recorded in results so that the callback plugin can see it and adjust output.
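A check reading such a playbook-level variable is very simple; this is a minimal sketch, where get_var is a simplified stand-in for the project's helper of the same name:

```python
# Sketch: a check reads a variable set on the playbook via task_vars.
# get_var here is a simplified stand-in, not the project's real helper.

def get_var(task_vars, name, default=None):
    """Fetch a variable the playbook set, falling back to a default."""
    return task_vars.get(name, default)

def playbook_context(task_vars):
    """Return the context string the playbook declared, if any."""
    return get_var(task_vars,
                   'r_openshift_health_checker_playbook_context',
                   default='')
```

A check can then branch on the returned string without the callback plugin being involved at all.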
@sosiouxme Thanks, although I don't suppose there is a straightforward way of determining the type of upgrade that's taking place? All I have found was https://docs.openshift.org/latest/install_config/upgrading/automated_upgrades.html#install-config-upgrading-automated-upgrades
@mtnbikenc wondering if you have any thoughts on this as well?
@juanvallejo Does a different type of upgrade use a different playbook, or just different parameters? Either way you have something to key off of.
If I understand https://docs.openshift.org/latest/install_config/upgrading/blue_green_deployments.html#blue-green-deployments-preparing-for-upgrade correctly, it looks like in the case of a blue-green upgrade, an in-place upgrade is still run on existing masters and etcd hosts, but not on existing nodes. Instead, existing nodes are labeled with color=blue (or whatever value a user decides on). Then, entirely new nodes are created with the new version of OpenShift using a playbook. Once green nodes are ready, pods are evacuated from blue nodes to green.
Therefore, I think that if the upgrade context is set, and the host is in the nodes group, it is safe to assume that it is going through an in-place upgrade (since a blue-green upgrade would be creating entirely new nodes instead). masters and etcd hosts can be treated as in-place for both cases.
Sounds right.
So essentially there are just two cases, install or upgrade.
Will push an updated disk_availability check that takes the upgrade context into account, halving the required disk space in that case.
Force-pushed 73e0a2e to c56f15d (compare)
@sosiouxme @rhcarvalho @mtnbikenc wondering your thoughts on 7bd04ce
# in use by the existing OpenShift deployment.
context = get_var(task_vars, "r_openshift_health_checker_playbook_context", default="")
if context == "upgrade":
    min_free_bytes /= 2
Seems... kinda arbitrary. I think in this context, if we run the check at all, it should just be checking that they have enough to run the upgrade. Not sure what that should be, maybe like 1GB? Or, this could be re-purposed as a "health" check when it's not in a fresh install context.
Not able to find a definite requirement in the docs for disk size during an upgrade.
Or, this could be re-purposed as a "health" check when it's not in a fresh install context.
That could also work. We could fail the check in this context if disk usage is something like > 90%
Percentage is not good. 90% usage of a 1TB disk means 100GB free -- more than enough for an install/upgrade. 90% usage of 20GB means 2GB free, risky.
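The arithmetic behind that objection is easy to verify; this tiny helper just restates the two examples from the comment:

```python
# Why a percentage threshold misleads: identical usage percentages
# leave very different absolute headroom depending on disk size.

def free_gb(total_gb, used_percent):
    """Free space left, in GB, given disk size and usage percentage."""
    return total_gb * (100 - used_percent) / 100

big = free_gb(1000, 90)   # 90% used on a 1 TB disk: 100 GB still free
small = free_gb(20, 90)   # 90% used on a 20 GB disk: only 2 GB free
```

An absolute minimum in bytes captures the real constraint; a percentage does not.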
Still unclear on a minimum disk availability value under the "upgrade" context. Updated the check to require at least 5.0 GB for in-place upgrades, for now.
@@ -95,7 +95,7 @@ def _print_check_failure_summary(self, failed_checks, context):
     'Variables can be set in the inventory or passed on the\n'
     'command line using the -e flag to ansible-playbook.\n'
     ).format(playbook=self._playbook_file, checks=checks)
-    if context in ['pre-install', 'health']:
+    if context in ['pre-install', 'upgrade', 'health']:
This is the summary for playbooks where the user is trying to run checks. I wouldn't include "upgrade" here so they can have the previous msg with a little more explanation. If there were a "pre-upgrade" playbook just for running checks then I would put it here.
Sounds good, will remove
Force-pushed 0ca870e to 44fef09 (compare)
'failed': True,
'msg': (
    'Available disk space ({:.1f} GB) for the volume containing '
    '"/var" is below minimum recommended space for an upgrade ({:.1f} GB).'
This introduces some duplication with lines coming right below.
Not sure what you mean. The return statement inside this block prevents the text below from being displayed if this one is.
# in use by the existing OpenShift deployment.
context = get_var(task_vars, "r_openshift_health_checker_playbook_context", default="")
if context == "upgrade":
    upgrade_min_required_diskspace = 5.0 * 10**9
We need to keep in mind that both this and #4436 change the disk availability check. In an upgrade, this change of minimum requirements, whatever it ends up being, will possibly only apply to /var, while /usr/local/bin and /tmp requirements would be left unchanged, agree?
I'm fine with waiting for #4436 to merge, then adding the decreased upgrade requirement on top of that.
@@ -0,0 +1,13 @@
---
- name: Verify Host Requirements
  hosts: oo_all_hosts
@sosiouxme shall we limit this also to run only on masters and nodes, or is it enough to filter it on the action plugin level?
@rhcarvalho sorry I missed this. No, it looks like oo_all_hosts comes from g_all_hosts, which is masters/nodes/etcd only, so this should be fine. All our playbooks basically need to work the same as byo/config.yml with respect to groups.
Force-pushed eeac970 to 0b4d699 (compare)
@sosiouxme @rhcarvalho rebased with changes introduced to disk_availability.py in #4436
Force-pushed 0b4d699 to e840183 (compare)
aos-ci-test
We may still be able to simplify a bit, but otherwise looking good.
# if not able to resolve a recommended value, default to non-upgrade recommendation
min_upgrade_rec = max(recommendation.get(name, 0) for name in group_names)

if free_bytes < min_upgrade_rec:
I feel like this duplicates logic unnecessarily. Couldn't we simply use the upgrade values when it's an upgrade and then use the same code paths to compare free / required?
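The suggested simplification amounts to selecting the threshold from data and keeping a single comparison; a rough sketch, where all numbers, names, and the table layout are illustrative placeholders rather than the project's actual requirements:

```python
# Sketch: one code path, with the free-space threshold chosen from a
# table keyed by context. Values below are placeholders, not real
# OpenShift requirements.

GB = 10**9

RECOMMENDED_FREE_VAR = {
    'default': {'masters': 40 * GB, 'nodes': 15 * GB, 'etcd': 20 * GB},
    'upgrade': {'masters': 10 * GB, 'nodes': 5 * GB, 'etcd': 5 * GB},
}

def required_free_bytes(context, group_names):
    """Strictest requirement among the host's groups for this context."""
    per_group = RECOMMENDED_FREE_VAR.get(context, RECOMMENDED_FREE_VAR['default'])
    return max(per_group.get(name, 0) for name in group_names)

def check_passes(context, group_names, free_bytes):
    # The comparison is identical for install and upgrade; only the
    # data behind required_free_bytes differs.
    return free_bytes >= required_free_bytes(context, group_names)
```

Changing the upgrade requirements then means editing the table, not touching any branching logic.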
tempfile.gettempdir(): {
    'masters': 0.5 * 10**9,
    'nodes': 0.5 * 10**9,
    'etcd': 0.5 * 10**9,
Granted, the 1 GB for /usr/local/bin and temp were both originally a rough guess; why do we make this different here?
Will update to just reduce space requirements for /var
        ["20.0 GB", "below minimum"],
    ),
])
def test_min_required_space_decreases_with_upgrade_context(group_names, context, ansible_mounts, failed, extra_words):
Here and other places we say "decrease" or otherwise give suggestions as to what the values are compared to "non-upgrade". I'd avoid doing that type of comment or naming, because what governs the relationship of the numbers is actual data. Change the data and all the names become misleading.
Likewise, on an extreme, it is considered bad practice to name variables like two = 2 only to see two = 3 some day in the future...
Force-pushed 071069a to 577fdc0 (compare)
We will be refactoring init.yml for upgrades but the playbook flow looks good for now.
thanks @mtnbikenc
aos-ci-test
@rhcarvalho ok to merge?
Force-pushed 577fdc0 to 8af1839 (compare)
@juanvallejo yes, we've got plenty of approval :-) time for the CI dance
aos-ci-test
@rhcarvalho ci tests are green, will tag for [merge]
[test]ing while waiting on the merge queue
Evaluated for openshift ansible test up to 8af1839
continuous-integration/openshift-jenkins/test FAILURE (https://ci.openshift.redhat.com/jenkins/job/test_pull_request_openshift_ansible/370/) (Base Commit: cf80c7e) (PR Branch Commit: 8af1839)
flaked on openshift/origin#10162 re[merge]
Flake openshift/origin#15356, perhaps fixed in openshift/origin#15482. And apparently something new, openshift/origin#15522. [merge] one last time.
Evaluated for openshift ansible merge up to 8af1839
continuous-integration/openshift-jenkins/merge FAILURE (https://ci.openshift.redhat.com/jenkins/job/merge_pull_request_openshift_ansible/754/) (Base Commit: 4a442d8) (PR Branch Commit: 8af1839)
Flake openshift/origin#15522 again, proceeding to manual merge as per https://github.com/openshift/openshift-ansible/blob/master/docs/pull_requests.md#manual-merges
@juanvallejo please review https://ci.openshift.redhat.com/jenkins/job/merge_pull_request_openshift_ansible/754/ and open a new PR.
I made a mistake and merged when the error was actually related to the changes introduced by this PR. Subsequently reverted the changes.
begin adding memory and disk-space verification to upgrade process
Begin addressing https://trello.com/c/K93Wzz4u
cc @sosiouxme @brenton @mtnbikenc @rhcarvalho