
Add health checks to upgrade playbook #4372

Conversation

@juanvallejo juanvallejo commented Jun 5, 2017

begin adding memory and disk-space verification to upgrade process

Begin addressing https://trello.com/c/K93Wzz4u

cc @sosiouxme @brenton @mtnbikenc @rhcarvalho


@rhcarvalho rhcarvalho left a comment


roles:
- openshift_health_checker
vars:
- r_openshift_health_checker_playbook_context: "upgrade"
Contributor

Presumably, we need to handle this in

? /cc @sosiouxme

@juanvallejo juanvallejo force-pushed the jvallejo/verify-disk-memory-before-upgrade branch from a4728f1 to 4b4e609 on June 6, 2017 16:42
@rhcarvalho

Does pre/verify_memory_and_diskspace.yml get run automatically?

@rhcarvalho

rhcarvalho commented Jun 12, 2017

Does pre/verify_memory_and_diskspace.yml get run automatically?

I bet it doesn't. I grepped for verify_upgrade_targets (because there is a pre/verify_upgrade_targets.yml file) and found multiple places where it is included, for example here:

https://github.com/openshift/openshift-ansible/blob/master/playbooks/common/openshift-cluster/upgrades/v3_3/upgrade.yml#L84

@@ -95,7 +95,7 @@ def _print_check_failure_summary(self, failed_checks, context):
'Variables can be set in the inventory or passed on the\n'
'command line using the -e flag to ansible-playbook.\n'
).format(playbook=self._playbook_file, checks=checks)
if context in ['pre-install', 'health']:
if context in ['pre-install', 'pre-upgrade', 'health']:
Contributor

@sosiouxme when you designed this, do you remember what the "contexts" were in which the summary message below should not be printed?

Member

@rhcarvalho @juanvallejo I guess a comment would have helped. So what prompted this was that if you're running a playbook like byo/config.yml then these checks running may come as a surprise and it needs a little more explanation. If you're running a playbook that's specifically for running checks, it's not about "failing early" as they're the whole point, and so the default preamble above is a little out of place. I could imagine other cases in which we'd like to customize the message or even the outcome based on the user intent/expectations. But for now, there are really just two cases: you're trying to run checks, or you're trying to get something else done and the checks are "in the way".

So a "pre-upgrade" context would mean you're running a playbook specifically to see if your upgrade is gonna work out. An "upgrade" context would mean you're actually trying to run the upgrade.
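
For illustration only, a minimal Python sketch of the two cases described above -- a playbook run specifically for checks versus one where the checks are "in the way". This is not the callback plugin's actual code; the function name and message wording here are invented:

def failure_summary(context, playbook, checks):
    # the user asked for checks (e.g. a health or pre-install playbook): keep it short
    if context in ("pre-install", "health"):
        return "The following checks failed: {}.".format(checks)
    # the user was trying to get something else done (install/upgrade): explain why
    # checks ran at all and point at the variables that control them
    return (
        'The execution of "{}" includes checks designed to fail early ({}). '
        "Variables can be set in the inventory or passed on the "
        "command line using the -e flag to ansible-playbook."
    ).format(playbook, checks)

print(failure_summary("upgrade", "upgrade.yml", "disk_availability, memory_availability"))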


@mtnbikenc mtnbikenc left a comment


verify_memory_and_diskspace.yml should be called from an include: at the bottom of common/openshift-cluster/upgrades/init.yml. This will call the verification checks after the groups and facts have been established.

@@ -0,0 +1,12 @@
- name: Verify Host Requirements
hosts: OSEv3
Member

This should be oo_all_hosts. OSEv3 should not be called in common playbooks.

@juanvallejo juanvallejo force-pushed the jvallejo/verify-disk-memory-before-upgrade branch from 4b4e609 to 0621520 on June 13, 2017 19:16
@juanvallejo

@mtnbikenc @rhcarvalho thanks for the feedback, review comments addressed

@juanvallejo juanvallejo force-pushed the jvallejo/verify-disk-memory-before-upgrade branch from 0621520 to 1278b26 on June 14, 2017 21:11
roles:
- openshift_health_checker
vars:
- r_openshift_health_checker_playbook_context: "pre-upgrade"
Member

So if this is being included in an actual upgrade playbook above, I would say that should be just "upgrade" context. If you want to reuse the playbook it should probably not specify a context, instead relying on including playbooks to set context.

Also I don't like calling this verify_memory_and_diskspace.yml -- I assume we're going to add more checks to the same play in later iterations.

Contributor Author

Thanks, will update

@juanvallejo juanvallejo force-pushed the jvallejo/verify-disk-memory-before-upgrade branch 2 times, most recently from a11a0e5 to 73e0a2e on June 14, 2017 21:34
roles:
- openshift_health_checker
vars:
- r_openshift_health_checker_playbook_context: "pre-upgrade"
Contributor

I believe we can drop the quotes in this case.

args:
checks:
- disk_availability
- memory_availability
Contributor

Missing newline character.

roles:
- openshift_health_checker
vars:
- r_openshift_health_checker_playbook_context: "upgrade"
Contributor

We could safely drop the quotes here, just like in all other places in this YAML file.

@@ -23,3 +23,5 @@
set_fact:
os_firewall_use_firewalld: false
when: "'Active: active' in service_iptables_status.stdout"

- include: ./pre/verify_memory_and_diskspace.yml
Contributor

This reference is outdated.

- action: openshift_health_check
args:
checks:
- disk_availability
Contributor

The memory requirements in an upgrade should match the requirements of the newer version... but for disk requirements, we probably have at least two scenarios:

In-place upgrade

The free space requirement should likely be different than for an install, after all there is already data in the cluster (e.g., Docker images and installed RPM packages).

I couldn't find any specific guidance about free disk space in the docs.

Blue-green

I believe the free space requirement for the new hosts should be the same as for a fresh install.


We may need a way to tell when an upgrade is in-place and when it is blue-green.

Contributor Author

Maybe we could propagate a context to our checks?

Contributor

What context do you have in mind? We do have r_openshift_health_checker_playbook_context, but that's manually set from playbooks.

Member

If you're just configuring how the checks run per playbook, you can just add another variable (like r_openshift_health_checker_playbook_context) on the playbook and read that from the check. The only thing that's special about that context var is that it gets recorded in results so that the callback plugin can see it and adjust output.
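
As a rough sketch of that suggestion (the variable name r_openshift_health_checker_upgrade_type is hypothetical, and get_var here is a stand-in for the helper the existing checks already use):

def get_var(task_vars, name, default=None):
    # stand-in for the repo's get_var helper used by the existing checks
    return task_vars.get(name, default)

# the playbook would set the variable in its vars: section, and the check reads it
task_vars = {"r_openshift_health_checker_upgrade_type": "in-place"}
upgrade_type = get_var(task_vars, "r_openshift_health_checker_upgrade_type", default="")
print(upgrade_type)  # -> in-place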

Contributor Author

@sosiouxme Thanks, although I don't suppose there is a straightforward way of determining the type of upgrade that's taking place? All I have found is https://docs.openshift.org/latest/install_config/upgrading/automated_upgrades.html#install-config-upgrading-automated-upgrades

Contributor Author

@mtnbikenc wondering if you have any thoughts on this as well?

Member

@juanvallejo Does a different type of upgrade use a different playbook, or just different parameters? Either way you have something to key off of.

Contributor Author

If I understand https://docs.openshift.org/latest/install_config/upgrading/blue_green_deployments.html#blue-green-deployments-preparing-for-upgrade correctly, it looks like in the case of a blue-green upgrade, an in-place upgrade is still run on existing masters and etcd hosts, but not on existing nodes. Instead, existing nodes are labeled with color=blue (or whatever value a user decides on). Then, entirely new nodes are created with the new version of OpenShift using a playbook. Once green nodes are ready, pods are evacuated from blue nodes to green.

Therefore, I think that if the upgrade context is set, and the host is in the nodes group, it is safe to assume that it is going through an in-place upgrade (since a blue-green upgrade would be creating entirely new nodes instead). masters and etcd hosts can be treated as in-place for both cases.
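
A minimal sketch of that heuristic, assuming the check can see the playbook context and the host's group membership through task_vars (get_var is a stand-in for the repo's helper):

def get_var(task_vars, name, default=None):
    # stand-in for the repo's get_var helper
    return task_vars.get(name, default)

def treat_as_in_place(task_vars):
    context = get_var(task_vars, "r_openshift_health_checker_playbook_context", default="")
    group_names = get_var(task_vars, "group_names", default=[])
    if context != "upgrade":
        return False
    # masters and etcd hosts are upgraded in place in both scenarios; existing nodes
    # only reach this check for in-place upgrades, since blue-green creates new nodes
    return any(group in group_names for group in ("masters", "etcd", "nodes"))

print(treat_as_in_place({
    "r_openshift_health_checker_playbook_context": "upgrade",
    "group_names": ["nodes"],
}))  # -> True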

Member

Sounds right.
So essentially there are just two cases, install or upgrade.

Contributor Author

Will push an updated disk_availability check that takes the upgrade context into account, halving the required disk space in that case.

@juanvallejo juanvallejo force-pushed the jvallejo/verify-disk-memory-before-upgrade branch from 73e0a2e to c56f15d on June 15, 2017 22:13
@juanvallejo

@sosiouxme @rhcarvalho @mtnbikenc wondering your thoughts on 7bd04ce

# in use by the existing OpenShift deployment.
context = get_var(task_vars, "r_openshift_health_checker_playbook_context", default="")
if context == "upgrade":
min_free_bytes /= 2
Member

Seems... kinda arbitrary. I think in this context, if we run the check at all, it should just be checking that they have enough to run the upgrade. Not sure what that should be, maybe like 1GB? Or, this could be re-purposed as a "health" check when it's not in a fresh install context.

Contributor Author

Not able to find a definite requirement in the docs for disk size during an upgrade.

Or, this could be re-purposed as a "health" check when it's not in a fresh install context.

That could also work. We could fail the check in this context if disk usage is something like > 90%

Contributor

Percentage is not good. 90% usage of a 1TB disk means 100GB free -- more than enough for an install/upgrade. 90% usage of 20GB means 2GB free, risky.

Contributor Author

Still unclear on a minimum disk availability value under the "upgrade" context. Updated the check to require at least 5.0 GB for in-place upgrades, for now.

@@ -95,7 +95,7 @@ def _print_check_failure_summary(self, failed_checks, context):
'Variables can be set in the inventory or passed on the\n'
'command line using the -e flag to ansible-playbook.\n'
).format(playbook=self._playbook_file, checks=checks)
if context in ['pre-install', 'health']:
if context in ['pre-install', 'upgrade', 'health']:
Member

This is the summary for playbooks where the user is trying to run checks. I wouldn't include "upgrade" here so they can have the previous msg with a little more explanation. If there were a "pre-upgrade" playbook just for running checks then I would put it here.

Contributor Author

Sounds good, will remove

@juanvallejo juanvallejo force-pushed the jvallejo/verify-disk-memory-before-upgrade branch 2 times, most recently from 0ca870e to 44fef09 on June 23, 2017 20:21
'failed': True,
'msg': (
'Available disk space ({:.1f} GB) for the volume containing '
'"/var" is below minimum recommended space for an upgrade ({:.1f} GB).'
Contributor

This introduces some duplication with lines coming right below.

Contributor Author

Not sure what you mean. The return statement inside this block prevents the text below from being displayed if this one is.

# in use by the existing OpenShift deployment.
context = get_var(task_vars, "r_openshift_health_checker_playbook_context", default="")
if context == "upgrade":
upgrade_min_required_diskspace = 5.0 * 10**9
Contributor

We need to keep in mind that both this and #4436 change the disk availability check. In an upgrade, this change of minimum requirements, whatever it ends up being, will possibly only apply to /var, while the /usr/local/bin and /tmp requirements would be left unchanged, agree?

Contributor Author

I'm fine with waiting for #4436 to merge, then adding the decreased upgrade requirement on top of that

@@ -0,0 +1,13 @@
---
- name: Verify Host Requirements
hosts: oo_all_hosts
Contributor

@sosiouxme shall we limit this also to run only on masters and nodes, or is it enough to filter it on the action plugin level?

@sosiouxme sosiouxme Jul 2, 2017

@rhcarvalho sorry I missed this. No, it looks like oo_all_hosts comes from g_all_hosts, which is masters/nodes/etcd only, so this should be fine. All our playbooks basically need to work the same as byo/config.yml with respect to groups.

@juanvallejo juanvallejo force-pushed the jvallejo/verify-disk-memory-before-upgrade branch 2 times, most recently from eeac970 to 0b4d699 on June 30, 2017 17:50
@juanvallejo

@sosiouxme @rhcarvalho rebased with changes introduced to disk_availability.py in #4436
PTAL

@juanvallejo juanvallejo force-pushed the jvallejo/verify-disk-memory-before-upgrade branch from 0b4d699 to e840183 on June 30, 2017 17:52
@juanvallejo

aos-ci-test

@openshift-bot

success: "aos-ci-jenkins/OS_3.5_NOT_containerized, aos-ci-jenkins/OS_3.5_NOT_containerized_e2e_tests" for e840183 (logs)

@openshift-bot

success: "aos-ci-jenkins/OS_3.5_containerized, aos-ci-jenkins/OS_3.5_containerized_e2e_tests" for e840183 (logs)

@openshift-bot

success: "aos-ci-jenkins/OS_3.6_containerized, aos-ci-jenkins/OS_3.6_containerized_e2e_tests" for e840183 (logs)

@openshift-bot

success: "aos-ci-jenkins/OS_3.6_NOT_containerized, aos-ci-jenkins/OS_3.6_NOT_containerized_e2e_tests" for e840183 (logs)


@rhcarvalho rhcarvalho left a comment


We may still be able to simplify a bit, but otherwise looking good.

# if not able to resolve a recommended value, default to non-upgrade recommendation
min_upgrade_rec = max(recommendation.get(name, 0) for name in group_names)

if free_bytes < min_upgrade_rec:
Contributor

I feel like this duplicates logic unnecessarily. Couldn't we simply use the upgrade values when it's an upgrade and then use the same code paths to compare free / required?
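
A sketch of that idea, with the recommended values keyed by context up front so a single comparison path remains (the numbers below are illustrative, not the check's real data):

def required_free_bytes(path, group_names, context):
    install_recommendation = {
        "/var": {"masters": 40 * 10**9, "nodes": 15 * 10**9, "etcd": 20 * 10**9},
    }
    upgrade_recommendation = {
        "/var": {"masters": 10 * 10**9, "nodes": 5 * 10**9, "etcd": 5 * 10**9},
    }
    # pick the table once, based on context, then reuse the same code path
    table = upgrade_recommendation if context == "upgrade" else install_recommendation
    per_group = table.get(path, {})
    values = [per_group.get(name, 0) for name in group_names]
    return max(values) if values else 0

free_bytes = 8 * 10**9
needed = required_free_bytes("/var", ["nodes"], "upgrade")
print(free_bytes < needed)  # -> False: 8 GB free vs. 5 GB required for a node being upgraded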

tempfile.gettempdir(): {
'masters': 0.5 * 10**9,
'nodes': 0.5 * 10**9,
'etcd': 0.5 * 10**9,
Contributor

Granted, the 1 GB requirements for /usr/local/bin and temp were both originally rough guesses, so why do we make this one different here?

Contributor Author

Will update to just reduce space requirements for /var

["20.0 GB", "below minimum"],
),
])
def test_min_required_space_decreases_with_upgrade_context(group_names, context, ansible_mounts, failed, extra_words):
Contributor

Here and in other places we say "decrease" or otherwise suggest how the values compare to the "non-upgrade" ones. I'd avoid that type of comment or naming, because what governs the relationship of the numbers is the actual data. Change the data and all the names become misleading.

Likewise, as an extreme example, it is considered bad practice to name a variable two = 2 only to see two = 3 some day in the future...

@juanvallejo juanvallejo force-pushed the jvallejo/verify-disk-memory-before-upgrade branch from 071069a to 577fdc0 on July 21, 2017 20:06

@mtnbikenc mtnbikenc left a comment


We will be refactoring init.yml for upgrades but the playbook flow looks good for now.

@rhcarvalho

thanks @mtnbikenc

@rhcarvalho

aos-ci-test

@openshift-bot

success: "aos-ci-jenkins/OS_3.6_NOT_containerized, aos-ci-jenkins/OS_3.6_NOT_containerized_e2e_tests" for 577fdc0 (logs)

@openshift-bot

success: "aos-ci-jenkins/OS_3.6_containerized, aos-ci-jenkins/OS_3.6_containerized_e2e_tests" for 577fdc0 (logs)

@juanvallejo

@rhcarvalho ok to merge?

@openshift-bot openshift-bot added the needs-rebase label Jul 26, 2017
@juanvallejo juanvallejo force-pushed the jvallejo/verify-disk-memory-before-upgrade branch from 577fdc0 to 8af1839 on July 26, 2017 23:04
@openshift-bot openshift-bot removed the needs-rebase label Jul 27, 2017
@rhcarvalho

@juanvallejo yes, we've got plenty of approval :-) time for the CI dance

@rhcarvalho

aos-ci-test

@openshift-bot

success: "aos-ci-jenkins/OS_3.6_NOT_containerized, aos-ci-jenkins/OS_3.6_NOT_containerized_e2e_tests" for 8af1839 (logs)

@openshift-bot

success: "aos-ci-jenkins/OS_3.6_containerized, aos-ci-jenkins/OS_3.6_containerized_e2e_tests" for 8af1839 (logs)

@juanvallejo

juanvallejo commented Jul 27, 2017

@rhcarvalho ci tests are green, will tag for [merge]

@openshift-bot

[test]ing while waiting on the merge queue

@openshift-bot

Evaluated for openshift ansible test up to 8af1839

@openshift-bot

continuous-integration/openshift-jenkins/test FAILURE (https://ci.openshift.redhat.com/jenkins/job/test_pull_request_openshift_ansible/370/) (Base Commit: cf80c7e) (PR Branch Commit: 8af1839)

@juanvallejo

flaked on openshift/origin#10162 re[merge]

@rhcarvalho

Flake openshift/origin#15356, perhaps fixed in openshift/origin#15482.

And apparently something new, openshift/origin#15522.

[merge] one last time.

@openshift-bot

Evaluated for openshift ansible merge up to 8af1839

@openshift-bot

continuous-integration/openshift-jenkins/merge FAILURE (https://ci.openshift.redhat.com/jenkins/job/merge_pull_request_openshift_ansible/754/) (Base Commit: 4a442d8) (PR Branch Commit: 8af1839)

@rhcarvalho rhcarvalho merged commit d0761ae into openshift:master Jul 28, 2017
@rhcarvalho

I made a mistake and merged when the error was actually related to the changes introduced by this PR. Subsequently reverted the changes.
