
Verify memory and disk requirements before install #4161

Merged
merged 7 commits into openshift:master on May 25, 2017

Conversation

@rhcarvalho (Contributor) commented May 11, 2017

Points to discuss:

  • Where exactly to call the checks
    I considered placing the verification in the openshift-cluster role, but I am afraid it might end up affecting runs other than an initial install. Open to suggestions.

  • How to disable the checks
    We need a mechanism for opting out of the checks. So far, the mechanism is a comma-separated list of check names defined in the inventory (or passed on the command line) as openshift_disable_check. Open to alternatives and naming suggestions.

    Example call:

    $ ansible-playbook -i hosts playbooks/byo/config.yml -e openshift_disable_check=memory_availability,disk_availability

    Maybe we could also have a way to disable all (openshift_disable_check=all).

    Edit: one problem with a "disable all" mechanism is that it may set the wrong expectations, as "all" has no clear meaning. E.g., in the context of this PR, "disable all" would disable the disk and memory verifications, but would NOT disable the version verification done in the openshift_version role, nor other things a user might have intended to disable. The Python mantra fits well here: "explicit is better than implicit".
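The opt-out parsing described above can be sketched as follows (a minimal sketch of the assumed behavior, not the exact code in this PR):

```python
def parse_disabled_checks(value):
    """Split a comma-separated openshift_disable_check value into
    individual check names, stripping whitespace and empty entries."""
    return [name.strip() for name in value.split(",") if name.strip()]

# Both spellings yield the same two check names.
print(parse_disabled_checks("memory_availability, disk_availability"))
```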

@rhcarvalho (Contributor, Author)

I expect this to make the installation CI jobs fail, because the hosts probably don't meet the memory and disk requirements... let's see.

@rhcarvalho (Contributor, Author)

aos-ci-test

@@ -1,2 +1,14 @@
---
- name: Verify Requirements
Member

IIRC, we use playbooks/byo only as entry points. The verify code should go under playbooks/common. @mtnbikenc ?

Contributor Author

I think @mtnbikenc was working on some graphical representation of what calls what... that would be useful to look at now. Any pointers to this dependency graph somewhere?

@@ -46,18 +46,25 @@ def run(self, tmp=None, task_vars=None):

    result["checks"] = check_results = {}

    user_disabled_checks = [
        check.strip()
        for check in task_vars.get("openshift_disable_check", "").split(",")
    ]
Member

Is the list of all checks that can be disabled documented somewhere?

Member

Not really. Presumably there will eventually be docs for this.

The checks that did fail are mentioned in the output. We could make that more obvious with a summary line so that after a failure the user can review and decide if they don't care about the failures or want to reconfigure thresholds etc.
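A summary line like the one suggested could be derived from the per-check results; here is a rough sketch (the shape of check_results is assumed from this PR's action plugin, and the function name is hypothetical):

```python
def summarize_failed_checks(check_results):
    """Return a one-line summary naming the checks that failed,
    or an empty string if everything passed."""
    failed = sorted(name for name, result in check_results.items()
                    if result.get("failed"))
    return ("Failed checks: " + ", ".join(failed)) if failed else ""
```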

Contributor

I make a brief mention of this var in openshift/openshift-docs#4452

@openshift-bot

error: aos-ci-jenkins/OS_3.5_NOT_containerized for 89fe04b (logs)

@openshift-bot

error: aos-ci-jenkins/OS_3.5_containerized for 89fe04b (logs)

@openshift-bot

error: aos-ci-jenkins/OS_3.6_containerized for 89fe04b (logs)

@openshift-bot

error: aos-ci-jenkins/OS_3.6_NOT_containerized for 89fe04b (logs)

@rhcarvalho (Contributor, Author)

The error logs show what I wanted to see:

...

Failure summary:

  1. Host:     10.8.170.165
     Play:     Verify Requirements
     Task:     openshift_health_check
     Message:  One or more checks failed
     Details:  check "memory_availability":
               Available memory (3.8 GB) below recommended value (8.0 GB)

  2. Host:     10.8.170.17
     Play:     Verify Requirements
     Task:     openshift_health_check
     Message:  One or more checks failed
     Details:  check "memory_availability":
               Available memory (3.8 GB) below recommended value (8.0 GB)

  3. Host:     10.8.170.174
     Play:     Verify Requirements
     Task:     openshift_health_check
     Message:  One or more checks failed
     Details:  check "memory_availability":
               Available memory (7.8 GB) below recommended value (16.0 GB)

I'll send a patch to include a sentence after the error explaining how to disable checks.

@rhcarvalho (Contributor, Author)

Now the message should look like this, but only when there are failed checks:

...

Failure summary:

  1. Host:     master1
     Play:     Verify Requirements
     Task:     openshift_health_check
     Message:  One or more checks failed
     Details:  check "disk_availability":
               Available disk space (10.5 GB) for the volume containing "/var" is below minimum recommended space (40.0 GB)

    ...

The execution of the playbook 'playbooks/byo/config.yml' includes checks designed to ensure it can complete successfully. One or more of these checks failed. You may choose to disable checks by setting an Ansible variable:

    openshift_disable_check=disk_availability

Set the variable to a comma-separated list of check names. Check names are shown in the failure summary above.
The variable can be set in the inventory or passed in the command line using the -e flag to ansible-playbook.

@@ -1,2 +1,14 @@
---
- name: Verify Requirements
# REVIEW: what's the proper group to use: OSEv3, g_all_hosts or something else?
Contributor Author

Needs review

Member

OSEv3 at this location; however these changes might go better in playbooks/byo/openshift-cluster/config.yml which is included only by this file. They could go after

- include: initialize_groups.yml

... and actually use those group names.

Or if we put this in a playbook in common we can rely on the groups too.

try:
    r = check.run(tmp, task_vars)
except OpenShiftCheckException as e:
    r = {}
    r["failed"] = True
    r["msg"] = str(e)
else:
    # TODO(rhcarvalho): we may want to provide some distinctive
    # complementary message to know why a check was skipped.
Contributor Author

We may want to distinguish why a check was skipped, not necessarily to be implemented in this PR.

Member

Actually should be easy enough to record that it's either explicitly skipped or not active.
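Recording the two skip cases discussed here could look something like this (the function and parameter names are hypothetical, not from the PR):

```python
def skip_reason(check_name, is_active, user_disabled_checks):
    """Distinguish a check explicitly disabled by the user from one
    that simply is not active for this host."""
    if check_name in user_disabled_checks:
        return "explicitly disabled by user"
    if not is_active:
        return "not active for this host"
    return None  # the check should run
```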

    if result.get('failed', False)
]

# FIXME: get name of currently running playbook, if possible.
NAME_OF_PLAYBOOK = 'playbooks/byo/config.yml'
Contributor Author

This may be tricky to figure out, if it's possible at all. I didn't have much time to keep digging. We may also just drop this information from the output.

    'Check names are shown in the failure summary above.\n'
    'The variable can be set in the inventory or passed in the '
    'command line using the -e flag to ansible-playbook.'
).format(NAME_OF_PLAYBOOK, ','.join(sorted(set(failed_checks))))
Contributor Author

Note: we need to remove duplicates. Sorting provides consistent output.
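The dedupe-and-sort step from the quoted line, in isolation:

```python
# Duplicates can occur when the same check fails on several hosts;
# set() removes them and sorted() makes the output deterministic.
failed_checks = ["memory_availability", "disk_availability", "memory_availability"]
names = ",".join(sorted(set(failed_checks)))
print(names)  # -> disk_availability,memory_availability
```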

args:
  checks:
    - disk_availability
    - memory_availability
Contributor Author

Idea: this block might become a playbooks/openshift-check/pre-install.yml, and then we simply include it here.

Member

The checks we want to run automatically on every install may not coincide with the checks we want to run if they're specifically running pre-install checks. I'll think about this a bit...

@sosiouxme (Member) left a comment

Submitting review comments I forgot to on Friday :-(


@sosiouxme (Member) commented May 16, 2017

aos-ci-test

NAME_OF_PLAYBOOK = 'playbooks/byo/config.yml'
msg = (
    "\nThe execution of the playbook '{}' includes checks designed "
    'to ensure it can complete successfully. One or more of these '
@sosiouxme (Member), May 16, 2017

Actually this message is directed at the checks being run as part of the install playbook. I would like it to be different for running the pre-install playbook explicitly, and for running health checks. I think it would make sense to add a role variable that the playbook sets to give some context for which of those it is, so this plugin can adjust output accordingly.
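The role-variable idea could be sketched roughly like this (the context names and messages are illustrative assumptions, not the eventual implementation):

```python
# Hypothetical context values a playbook could set via a role variable
# so the plugin can pick an epilogue that fits how checks were invoked.
CONTEXT_EPILOGUES = {
    "install": "includes checks designed to fail early if install requirements are not met.",
    "pre-install": "ran the pre-install checks you requested.",
    "health": "ran the health checks you requested.",
}

def epilogue_for(context):
    # Fall back to the install wording when the context is unknown.
    return CONTEXT_EPILOGUES.get(context, CONTEXT_EPILOGUES["install"])
```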

@openshift-bot

error: aos-ci-jenkins/OS_3.5_containerized for b0bd18e (logs)

@openshift-bot

error: aos-ci-jenkins/OS_3.6_containerized for b0bd18e (logs)

@openshift-bot

success: "aos-ci-jenkins/OS_3.5_NOT_containerized, aos-ci-jenkins/OS_3.5_NOT_containerized_e2e_tests" for b0bd18e (logs)

@openshift-bot

success: "aos-ci-jenkins/OS_3.6_NOT_containerized, aos-ci-jenkins/OS_3.6_NOT_containerized_e2e_tests" for b0bd18e (logs)

@sosiouxme sosiouxme force-pushed the integrate-checks-with-install branch from b0bd18e to 06e9eaa Compare May 23, 2017 16:06
@sosiouxme (Member)

aos-ci-test

@sosiouxme (Member) commented May 23, 2017

New example output when there are failed checks:

Failure summary:

  1. Host:     ec2-34-207-204-214.compute-1.amazonaws.com
     Play:     Verify Requirements
     Task:     openshift_health_check
     Message:  One or more checks failed
     Details:  check "disk_availability":
               Available disk space (25.3 GB) for the volume containing "/var" is below minimum recommended space (40.0 GB)
               
               check "memory_availability":
               Available memory (3.5 GB) below recommended value (20.0 GB)

  2. Host:     ec2-54-197-219-172.compute-1.amazonaws.com
     Play:     Verify Requirements
     Task:     openshift_health_check
     Message:  One or more checks failed
     Details:  check "memory_availability":
               Available memory (3.5 GB) below recommended value (8.0 GB)

The execution of "playbooks/byo/config.yml"
includes checks designed to fail early if the requirements
of the playbook are not met. One or more of these checks
failed. To disregard these results, you may choose to
disable failing checks by setting an Ansible variable:

   openshift_disable_check=disk_availability,memory_availability

Failing check names are shown in the failure details above.
Some checks may be configurable by variables if your requirements
are different from the defaults; consult check documentation.
Variables can be set in the inventory or passed on the
command line using the -e flag to ansible-playbook.

If nothing fails, there's no summary. If something other than the checks fails, then failures are still nicely displayed by host but the summary about configuring checks is not displayed, e.g.:

Failure summary:

  1. Host:     ec2-34-207-204-214.compute-1.amazonaws.com
     Play:     Verify Requirements
     Task:     openshift_sanitize_inventory : Abort when deployment type is invalid
     Message:  Please set openshift_deployment_type to one of:
               origin, online, enterprise, atomic-enterprise, openshift-enterprise

  2. Host:     ec2-54-197-219-172.compute-1.amazonaws.com
     Play:     Verify Requirements
     Task:     openshift_sanitize_inventory : Abort when deployment type is invalid
     Message:  Please set openshift_deployment_type to one of:
               origin, online, enterprise, atomic-enterprise, openshift-enterprise

@openshift-bot

success: "aos-ci-jenkins/OS_3.5_NOT_containerized, aos-ci-jenkins/OS_3.5_NOT_containerized_e2e_tests" for 06e9eaa (logs)

@openshift-bot

success: "aos-ci-jenkins/OS_3.6_NOT_containerized, aos-ci-jenkins/OS_3.6_NOT_containerized_e2e_tests" for 06e9eaa (logs)

@openshift-bot

success: "aos-ci-jenkins/OS_3.5_containerized, aos-ci-jenkins/OS_3.5_containerized_e2e_tests" for 06e9eaa (logs)

@openshift-bot

success: "aos-ci-jenkins/OS_3.6_containerized, aos-ci-jenkins/OS_3.6_containerized_e2e_tests" for 06e9eaa (logs)

Example usage:

$ ansible-playbook -i hosts playbooks/byo/config.yml \
    -e openshift_disable_check=memory_availability,disk_availability

Or add the variable to the inventory / hosts file.
@@ -3,6 +3,19 @@
  tags:
  - always

- name: Verify Requirements
  hosts: OSEv3
Member

This will include other host types like [nfs] and [lb] which we don't want to apply our checks to. We should probably move this into playbooks/common/openshift-cluster/config.yml and use the oo_masters_to_config:oo_nodes_to_config:oo_etcd_to_config there.

Member

Nevermind, I see that you're selecting which checks to apply based on byo group names.

Member

It would be best to use the right groups in the checks (the way checks are structured, they all run under one task against all hosts, so they filter themselves out by various criteria). It's just not obvious to me what the ideal groups would be, since I haven't grokked what the various groups are for and where they come from.

@sdodson (Member) commented May 23, 2017

aos-ci-test

@openshift-bot

success: "aos-ci-jenkins/OS_3.5_NOT_containerized, aos-ci-jenkins/OS_3.5_NOT_containerized_e2e_tests" for da04b57 (logs)

@openshift-bot

success: "aos-ci-jenkins/OS_3.6_NOT_containerized, aos-ci-jenkins/OS_3.6_NOT_containerized_e2e_tests" for da04b57 (logs)

@openshift-bot

success: "aos-ci-jenkins/OS_3.6_containerized, aos-ci-jenkins/OS_3.6_containerized_e2e_tests" for da04b57 (logs)

@openshift-bot

success: "aos-ci-jenkins/OS_3.5_containerized, aos-ci-jenkins/OS_3.5_containerized_e2e_tests" for da04b57 (logs)

@sosiouxme (Member)

[merge]

@sosiouxme (Member)

Jenkins tests are failing because of the new checks. I guess I have to update those as well.

@sosiouxme (Member)

[test] with CI changes

@sdodson (Member) commented May 24, 2017

[merge]

@sosiouxme (Member)

[test] again with openshift-eng/aos-cd-jobs#288 in

@brenton (Contributor) commented May 25, 2017

[test]

@brenton (Contributor) commented May 25, 2017

@sdodson, since this was done by devcut but just wasn't able to make it through the merge queue in time, do you have any problem with hitting the green button if we can show a passed test run?

/cc @jupierce

@openshift-bot

Evaluated for openshift ansible test up to da04b57

@sdodson (Member) commented May 25, 2017

[merge][severity: bug]

@sdodson (Member) commented May 25, 2017

If that fails we'll manually merge.

@openshift-bot

Evaluated for openshift ansible merge up to da04b57

@openshift-bot

continuous-integration/openshift-jenkins/test FAILURE (https://ci.openshift.redhat.com/jenkins/job/test_pull_request_openshift_ansible/155/) (Base Commit: a353d4d)


@openshift-bot

continuous-integration/openshift-jenkins/merge FAILURE (https://ci.openshift.redhat.com/jenkins/job/merge_pull_request_openshift_ansible/459/) (Base Commit: a353d4d) (Extended Tests: bug)

@sdodson sdodson merged commit e8c8ed5 into openshift:master May 25, 2017
@sosiouxme (Member)

Thx :)
https://ci.openshift.redhat.com/jenkins/job/merge_pull_request_openshift_ansible/459/ failed on "github3.models.GitHubError: 403 API rate limit exceeded for openshift-bot." - that's a new flake to me! LOL

@jeremyeder (Contributor)

#2152

free_bytes = self.openshift_available_disk(ansible_mounts)

recommended_min = max(self.recommended_disk_space_bytes.get(name, 0) for name in group_names)
configured_min = int(get_var(task_vars, "openshift_check_min_host_disk_gb", default=0)) * 10**9
Contributor Author

@sosiouxme this variable seems too generic; how was it supposed to work?

I'm updating this check to verify multiple paths / mount points (/var, /usr/local/bin, /tmp or equivalent), and in that case this variable becomes ambiguous: which filesystem should it apply to?

Is it used anywhere we know of?

Member

It's not used anywhere except here... I made it up
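The multi-path direction mentioned above could replace the single global minimum with per-path minimums; a sketch under that assumption (paths, values, and names are illustrative only):

```python
# Hypothetical per-path minimums, one entry per mount point to verify.
RECOMMENDED_DISK_SPACE_BYTES = {
    "/var": 40 * 10**9,
    "/usr/local/bin": 1 * 10**9,
    "/tmp": 1 * 10**9,
}

def insufficient_paths(free_bytes_by_path, minimums):
    """Return the paths whose free space is below their minimum;
    a path with no reported free space counts as insufficient."""
    return sorted(path for path, minimum in minimums.items()
                  if free_bytes_by_path.get(path, 0) < minimum)
```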

@rhcarvalho rhcarvalho deleted the integrate-checks-with-install branch June 22, 2017 08:58