Add health checks to upgrade playbook #4372

Merged
2 changes: 2 additions & 0 deletions playbooks/common/openshift-cluster/upgrades/init.yml
@@ -23,3 +23,5 @@
set_fact:
os_firewall_use_firewalld: false
when: "'Active: active' in service_iptables_status.stdout"

- include: ./pre/verify_health_checks.yml
@@ -0,0 +1,13 @@ (new file: pre/verify_health_checks.yml)
---
- name: Verify Host Requirements
  hosts: oo_all_hosts
Contributor (@rhcarvalho):
@sosiouxme shall we limit this also to run only on masters and nodes, or is it enough to filter it on the action plugin level?

Member @sosiouxme (Jul 2, 2017):

@rhcarvalho Sorry, I missed this. No, it looks like oo_all_hosts comes from g_all_hosts, which is masters/nodes/etcd only, so this should be fine. All our playbooks basically need to work the same as byo/config.yml with respect to groups.

  roles:
  - openshift_health_checker
  vars:
  - r_openshift_health_checker_playbook_context: upgrade
  post_tasks:
  - action: openshift_health_check
    args:
      checks:
      - disk_availability
Contributor:

The memory requirements in an upgrade should match the requirements of the newer version, but for disk requirements we probably have at least two scenarios:

In-place upgrade

The free space requirement should likely be different from that of a fresh install; after all, there is already data in the cluster (e.g., Docker images and installed RPM packages).

I couldn't find any specific guidance about free disk space in the docs.

Blue-green

I believe the free space requirement for the new hosts should be the same as for a fresh install.

We may need a way to tell when an upgrade is in-place and when it is blue-green.

Contributor Author:

Maybe we could propagate a context to our checks?

Contributor:

What context do you have in mind? We do have r_openshift_health_checker_playbook_context, but that's manually set from playbooks.

Member:

If you're just configuring how the checks run per playbook, you can just add another variable (like r_openshift_health_checker_playbook_context) on the playbook and read that from the check. The only thing that's special about that context var is that it gets recorded in results so that the callback plugin can see it and adjust output.
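As a minimal sketch of that mechanism (the FakeCheck class and the shape of get_var are assumptions for illustration; only the variable name r_openshift_health_checker_playbook_context comes from this PR), a check reads a playbook-scoped variable like any other task var:

```python
# Illustrative sketch: a health check reading a per-playbook variable.
# FakeCheck and get_var() are hypothetical stand-ins for the real check API;
# only the context variable name is taken from this PR.
class FakeCheck:
    def __init__(self, task_vars):
        self.task_vars = task_vars

    def get_var(self, name, default=None):
        # Look the variable up among the task vars the action plugin passes in.
        return self.task_vars.get(name, default)

    def run(self):
        context = self.get_var("r_openshift_health_checker_playbook_context", default="")
        # Per-playbook behavior keys off the context string.
        return {"is_upgrade": context == "upgrade"}
```

Nothing about the context var is special to the check itself; it only matters to the callback plugin because it is recorded in the results.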

Contributor Author:

@sosiouxme Thanks, although I don't suppose there is a straightforward way of determining the type of upgrade that's taking place? All I found was https://docs.openshift.org/latest/install_config/upgrading/automated_upgrades.html#install-config-upgrading-automated-upgrades

Contributor Author:

@mtnbikenc wondering if you have any thoughts on this as well?

Member:

@juanvallejo Does a different type of upgrade use a different playbook, or just different parameters? Either way you have something to key off of.

Contributor Author:

If I understand https://docs.openshift.org/latest/install_config/upgrading/blue_green_deployments.html#blue-green-deployments-preparing-for-upgrade correctly, it looks like in the case of a blue-green upgrade, an in-place upgrade is still run on existing masters and etcd hosts, but not on existing nodes. Instead, existing nodes are labeled with color=blue (or whatever value a user decides on). Then, entirely new nodes are created with the new version of OpenShift using a playbook. Once green nodes are ready, pods are evacuated from blue nodes to green.

Therefore, I think that if the upgrade context is set, and the host is in the nodes group, it is safe to assume that it is going through an in-place upgrade (since a blue-green upgrade would be creating entirely new nodes instead). masters and etcd hosts can be treated as in-place for both cases.
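That conclusion could be captured in a tiny helper (hypothetical, for illustration only): every host the upgrade playbook reaches is treated as in-place, while green nodes in a blue-green upgrade are freshly installed and go through the install playbook instead.

```python
def treat_as_in_place(playbook_context, group_names):
    # Hypothetical helper mirroring the reasoning above: masters and etcd are
    # upgraded in place in both scenarios, and any node the upgrade playbook
    # touches is also in-place, because blue-green provisions brand-new green
    # nodes that run the install playbook (and its full requirements) instead.
    return playbook_context == "upgrade" and bool(
        {"masters", "nodes", "etcd"} & set(group_names)
    )
```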

Member:

Sounds right.
So essentially there are just two cases, install or upgrade.

Contributor Author:

Will push an updated disk_availability check that takes the upgrade context into account, halving the required disk space in that case.

      - memory_availability
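Condensing the DiskAvailability change that follows, the requirement-selection logic amounts to the sketch below (the threshold values are taken from the diff; the surrounding check class is simplified away, and the function name is hypothetical):

```python
# Simplified restatement of the DiskAvailability logic in this PR: under the
# "upgrade" context, fall back to lower per-group recommendations, unless the
# user set an explicit minimum (config_bytes) in the inventory.
RECOMMENDED_DISK_UPGRADE_BYTES = {
    '/var': {'masters': 10 * 10**9, 'nodes': 5 * 10**9, 'etcd': 5 * 10**9},
}


def effective_recommended_bytes(path, group_names, config_bytes,
                                recommended_bytes, context):
    if context == "upgrade":
        upgrade_paths = RECOMMENDED_DISK_UPGRADE_BYTES.get(path, {})
        if upgrade_paths:
            # A user-configured minimum still wins over the lowered default;
            # otherwise take the largest recommendation among the host's groups.
            return config_bytes or max(upgrade_paths.get(name, 0)
                                       for name in group_names)
    return recommended_bytes
```

For example, a host in both nodes and masters gets the 10 GB master requirement under "upgrade", while any other context leaves the install-time recommendation untouched.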
@@ -35,6 +35,15 @@ class DiskAvailability(OpenShiftCheck):
        },
    }

    # recommended disk space for each location under an upgrade context
    recommended_disk_upgrade_bytes = {
        '/var': {
            'masters': 10 * 10**9,
            'nodes': 5 * 10**9,
            'etcd': 5 * 10**9,
        },
    }

    def is_active(self):
        """Skip hosts that do not have recommended disk space requirements."""
        group_names = self.get_var("group_names", default=[])
@@ -80,9 +89,34 @@ def run(self):
        config_bytes = max(config.get(name, 0) for name in group_names) * 10**9
        recommended_bytes = config_bytes or recommended_bytes

        # if an "upgrade" context is set, update the minimum disk requirement
        # as this signifies an in-place upgrade - the node might have the
        # required total disk space, but some of that space may already be
        # in use by the existing OpenShift deployment.
        context = self.get_var("r_openshift_health_checker_playbook_context", default="")
        if context == "upgrade":
            recommended_upgrade_paths = self.recommended_disk_upgrade_bytes.get(path, {})
            if recommended_upgrade_paths:
                recommended_bytes = config_bytes or max(recommended_upgrade_paths.get(name, 0)
                                                        for name in group_names)

        if free_bytes < recommended_bytes:
            free_gb = float(free_bytes) / 10**9
            recommended_gb = float(recommended_bytes) / 10**9
            msg = (
                'Available disk space in "{}" ({:.1f} GB) '
                'is below minimum recommended ({:.1f} GB)'
            ).format(path, free_gb, recommended_gb)

            # warn if check failed under an "upgrade" context
            # due to limits imposed by the user config
            if config_bytes and context == "upgrade":
                msg += ('\n\nMake sure to account for decreased disk space during an upgrade\n'
                        'due to an existing OpenShift deployment. Please check the value of\n'
                        '  openshift_check_min_host_disk_gb={}\n'
                        'in your Ansible inventory, and lower the recommended disk space availability\n'
                        'if necessary for this upgrade.').format(config_bytes)

            return {
                'failed': True,
                'msg': (
69 changes: 64 additions & 5 deletions roles/openshift_health_checker/test/disk_availability_test.py
@@ -97,8 +97,9 @@
    assert not result.get('failed', False)


@pytest.mark.parametrize('group_names,configured_min,ansible_mounts,extra_words', [
@pytest.mark.parametrize('name,group_names,configured_min,ansible_mounts,extra_words', [
(
'test with no space available',
['masters'],
0,
[{
@@ -108,6 +109,7 @@
['0.0 GB'],
),
(
'test with a higher configured required value',
['masters'],
100, # set a higher threshold
[{
@@ -117,6 +119,7 @@
['100.0 GB'],
),
(
'test with 1GB available, but "0" GB space requirement',
['nodes'],
0,
[{
@@ -126,6 +129,7 @@
['1.0 GB'],
),
(
'test with no space available, but "0" GB space requirement',
['etcd'],
0,
[{
@@ -135,16 +139,17 @@
['0.0 GB'],
),
(
'test with enough space for a node, but not for a master',
['nodes', 'masters'],
0,
[{
'mount': '/',
# enough space for a node, not enough for a master
'size_available': 15 * 10**9 + 1,
}],
['15.0 GB'],
),
(
'test failure with enough space on "/", but not enough on "/var"',
['etcd'],
0,
[{
@@ -158,8 +163,8 @@
}],
['0.0 GB'],
),
])
def test_fails_with_insufficient_disk_space(group_names, configured_min, ansible_mounts, extra_words):
], ids=lambda argval: argval[0])
def test_fails_with_insufficient_disk_space(name, group_names, configured_min, ansible_mounts, extra_words):
    task_vars = dict(
        group_names=group_names,
        openshift_check_min_host_disk_gb=configured_min,
@@ -170,7 +175,61 @@

assert result['failed']
for word in 'below recommended'.split() + extra_words:
assert word in result['msg']
assert word in result.get('msg', '')


@pytest.mark.parametrize('name,group_names,context,ansible_mounts,failed,extra_words', [
    (
        'test without enough space for master under "upgrade" context',
        ['nodes', 'masters'],
        "upgrade",
        [{
            'mount': '/',
            'size_available': 1 * 10**9 + 1,
            'size_total': 21 * 10**9 + 1,
        }],
        True,
        ["1.0 GB"],
    ),
    (
        'test with enough space for master under "upgrade" context',
        ['nodes', 'masters'],
        "upgrade",
        [{
            'mount': '/',
            'size_available': 10 * 10**9 + 1,
            'size_total': 21 * 10**9 + 1,
        }],
        False,
        [],
    ),
    (
        'test with not enough space for master, and non-upgrade context',
        ['nodes', 'masters'],
        "health",
        [{
            'mount': '/',
            # not enough space for a master,
            # "health" context should not lower requirement
            'size_available': 20 * 10**9 + 1,
        }],
        True,
        ["20.0 GB", "below minimum"],
    ),
], ids=lambda argval: argval[0])
def test_min_required_space_changes_with_upgrade_context(name, group_names, context, ansible_mounts, failed, extra_words):
    task_vars = dict(
        r_openshift_health_checker_playbook_context=context,
        group_names=group_names,
        ansible_mounts=ansible_mounts,
    )

    check = DiskAvailability(fake_execute_module, task_vars)
    result = check.run()

    assert result.get("failed", False) == failed
    for word in extra_words:
        assert word in result.get('msg', '')


def fake_execute_module(*args):