
Verify memory and disk requirements before install #4161

Merged
merged 7 commits into openshift:master on May 25, 2017

Conversation

@rhcarvalho (Contributor) commented May 11, 2017

Points to discuss:

  • Where exactly to call the checks
    I considered placing the verification in the openshift-cluster role, but I am afraid it might end up affecting runs other than an initial install. Open to suggestions.

  • How to disable the checks
    We need a mechanism for opting out of the checks. So far, the mechanism is a comma-separated list of check names defined in the inventory (or passed on the command line) as openshift_disable_check. Open to alternatives and naming suggestions.

    Example call:

    $ ansible-playbook -i hosts playbooks/byo/config.yml -e openshift_disable_check=memory_availability,disk_availability

    Maybe we could also have a way to disable all (openshift_disable_check=all).

    Edit: one problem with a "disable all" mechanism is that it may set the wrong expectations, as "all" has no clear meaning. E.g., in the context of this PR, "disable all" would disable the disk and memory verifications, but would NOT disable the version verification done in the openshift_version role, nor other things a user might have intended to disable. The Python mantra fits well here: "explicit is better than implicit".
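The opt-out parsing described above can be sketched as follows (a minimal sketch of the assumed behavior, not the exact code in this PR):

```python
def parse_disabled_checks(value):
    """Split a comma-separated openshift_disable_check value into
    individual check names, stripping whitespace and empty entries."""
    return [name.strip() for name in value.split(",") if name.strip()]

# Both spellings yield the same two check names.
print(parse_disabled_checks("memory_availability, disk_availability"))
```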

@rhcarvalho (Contributor, Author)

I expect this to make the installation CI jobs fail, because the hosts probably don't meet the memory and disk requirements... let's see.

@rhcarvalho (Contributor, Author)

aos-ci-test

@@ -1,2 +1,14 @@
---
- name: Verify Requirements
Member

IIRC, we use playbooks/byo only as entry points. The verify code should go under playbooks/common. @mtnbikenc ?

Contributor Author

I think @mtnbikenc was working on some graphical representation of what calls what... that would be useful to look at now. Any pointers to this dependency graph somewhere?

@@ -46,18 +46,25 @@ def run(self, tmp=None, task_vars=None):

    result["checks"] = check_results = {}

    user_disabled_checks = [
        check.strip()
        for check in task_vars.get("openshift_disable_check", "").split(",")
    ]
Member

Is the list of all checks that can be disabled documented somewhere?

Member

Not really. Presumably there will eventually be docs for this.

The checks that did fail are mentioned in the output. We could make that more obvious with a summary line so that after a failure the user can review and decide if they don't care about the failures or want to reconfigure thresholds etc.
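A summary line like the one suggested could be derived from the per-check results; here is a rough sketch (the shape of check_results is assumed from this PR's action plugin, and the function name is hypothetical):

```python
def summarize_failed_checks(check_results):
    """Return a one-line summary naming the checks that failed,
    or an empty string if everything passed."""
    failed = sorted(name for name, result in check_results.items()
                    if result.get("failed"))
    return ("Failed checks: " + ", ".join(failed)) if failed else ""
```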

Contributor

I make a brief mention of this var in openshift/openshift-docs#4452

@openshift-bot

error: aos-ci-jenkins/OS_3.5_NOT_containerized for 89fe04b (logs)

@openshift-bot

error: aos-ci-jenkins/OS_3.5_containerized for 89fe04b (logs)

@openshift-bot

error: aos-ci-jenkins/OS_3.6_containerized for 89fe04b (logs)

@openshift-bot

error: aos-ci-jenkins/OS_3.6_NOT_containerized for 89fe04b (logs)

@rhcarvalho (Contributor, Author)

The error logs show what I wanted to see:

...

Failure summary:

  1. Host:     10.8.170.165
     Play:     Verify Requirements
     Task:     openshift_health_check
     Message:  One or more checks failed
     Details:  check "memory_availability":
               Available memory (3.8 GB) below recommended value (8.0 GB)

  2. Host:     10.8.170.17
     Play:     Verify Requirements
     Task:     openshift_health_check
     Message:  One or more checks failed
     Details:  check "memory_availability":
               Available memory (3.8 GB) below recommended value (8.0 GB)

  3. Host:     10.8.170.174
     Play:     Verify Requirements
     Task:     openshift_health_check
     Message:  One or more checks failed
     Details:  check "memory_availability":
               Available memory (7.8 GB) below recommended value (16.0 GB)

I'll send a patch to include a sentence after the error explaining how to disable checks.

@rhcarvalho (Contributor, Author)

Now the message should look like this, but only when there are failed checks:

...

Failure summary:

  1. Host:     master1
     Play:     Verify Requirements
     Task:     openshift_health_check
     Message:  One or more checks failed
     Details:  check "disk_availability":
               Available disk space (10.5 GB) for the volume containing "/var" is below minimum recommended space (40.0 GB)

    ...

The execution of the playbook 'playbooks/byo/config.yml' includes checks designed to ensure it can complete successfully. One or more of these checks failed. You may choose to disable checks by setting an Ansible variable:

    openshift_disable_check=disk_availability

Set the variable to a comma-separated list of check names. Check names are shown in the failure summary above.
The variable can be set in the inventory or passed in the command line using the -e flag to ansible-playbook.

@@ -1,2 +1,14 @@
---
- name: Verify Requirements
# REVIEW: what's the proper group to use: OSEv3, g_all_hosts or something else?
Contributor Author

Needs review

Member

OSEv3 at this location; however these changes might go better in playbooks/byo/openshift-cluster/config.yml which is included only by this file. They could go after

- include: initialize_groups.yml

... and actually use those group names.

Or if we put this in a playbook in common we can rely on the groups too.

try:
    r = check.run(tmp, task_vars)
except OpenShiftCheckException as e:
    r = {}
    r["failed"] = True
    r["msg"] = str(e)
else:
    # TODO(rhcarvalho): we may want to provide some distinctive
    # complementary message to know why a check was skipped.
Contributor Author

We may want to distinguish why a check was skipped, not necessarily to be implemented in this PR.

Member

Actually should be easy enough to record that it's either explicitly skipped or not active.
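Recording the two skip cases discussed here could look something like this (the function and parameter names are hypothetical, not from the PR):

```python
def skip_reason(check_name, is_active, user_disabled_checks):
    """Distinguish a check explicitly disabled by the user from one
    that simply is not active for this host."""
    if check_name in user_disabled_checks:
        return "explicitly disabled by user"
    if not is_active:
        return "not active for this host"
    return None  # the check should run
```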

    if result.get('failed', False)
]

# FIXME: get name of currently running playbook, if possible.
NAME_OF_PLAYBOOK = 'playbooks/byo/config.yml'
Contributor Author

This may be tricky to figure out, if it's possible at all. I didn't have much time to keep digging. We may also just drop this information from the output.

    'Check names are shown in the failure summary above.\n'
    'The variable can be set in the inventory or passed in the '
    'command line using the -e flag to ansible-playbook.'
).format(NAME_OF_PLAYBOOK, ','.join(sorted(set(failed_checks))))
Contributor Author

Note: we need to remove duplicates. Sorting provides consistent output.
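The dedupe-and-sort step from the quoted line, in isolation:

```python
# Duplicates can occur when the same check fails on several hosts;
# set() removes them and sorted() makes the output deterministic.
failed_checks = ["memory_availability", "disk_availability", "memory_availability"]
names = ",".join(sorted(set(failed_checks)))
print(names)  # -> disk_availability,memory_availability
```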

args:
  checks:
    - disk_availability
    - memory_availability
Contributor Author

Idea: this block might become a playbooks/openshift-check/pre-install.yml, and then we simply include it here.

Member

The checks we want to run automatically on every install may not coincide with the checks we want to run if they're specifically running pre-install checks. I'll think about this a bit...

@sosiouxme (Member) left a comment

Submitting review comments I forgot to on Friday :-(


@sosiouxme (Member) commented May 16, 2017

aos-ci-test

NAME_OF_PLAYBOOK = 'playbooks/byo/config.yml'
msg = (
    "\nThe execution of the playbook '{}' includes checks designed "
    'to ensure it can complete successfully. One or more of these '
@sosiouxme (Member), May 16, 2017

Actually this message is directed at the checks being run as part of the install playbook. I would like it to be different for running the pre-install playbook explicitly, and for running health checks. I think it would make sense to add a role variable that the playbook sets to give some context for which of those it is, so this plugin can adjust output accordingly.
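The role-variable idea could be sketched roughly like this (the context names and messages are illustrative assumptions, not the eventual implementation):

```python
# Hypothetical context values a playbook could set via a role variable
# so the plugin can pick an epilogue that fits how checks were invoked.
CONTEXT_EPILOGUES = {
    "install": "includes checks designed to fail early if install requirements are not met.",
    "pre-install": "ran the pre-install checks you requested.",
    "health": "ran the health checks you requested.",
}

def epilogue_for(context):
    # Fall back to the install wording when the context is unknown.
    return CONTEXT_EPILOGUES.get(context, CONTEXT_EPILOGUES["install"])
```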

@openshift-bot

error: aos-ci-jenkins/OS_3.5_containerized for b0bd18e (logs)

@openshift-bot

error: aos-ci-jenkins/OS_3.6_containerized for b0bd18e (logs)

@openshift-bot

success: "aos-ci-jenkins/OS_3.5_NOT_containerized, aos-ci-jenkins/OS_3.5_NOT_containerized_e2e_tests" for b0bd18e (logs)

@openshift-bot

success: "aos-ci-jenkins/OS_3.6_NOT_containerized, aos-ci-jenkins/OS_3.6_NOT_containerized_e2e_tests" for b0bd18e (logs)

@sosiouxme sosiouxme force-pushed the integrate-checks-with-install branch from b0bd18e to 06e9eaa Compare May 23, 2017 16:06
@sosiouxme (Member)

aos-ci-test

@sosiouxme (Member) commented May 23, 2017

New example output when there are failed checks:

Failure summary:

  1. Host:     ec2-34-207-204-214.compute-1.amazonaws.com
     Play:     Verify Requirements
     Task:     openshift_health_check
     Message:  One or more checks failed
     Details:  check "disk_availability":
               Available disk space (25.3 GB) for the volume containing "/var" is below minimum recommended space (40.0 GB)
               
               check "memory_availability":
               Available memory (3.5 GB) below recommended value (20.0 GB)

  2. Host:     ec2-54-197-219-172.compute-1.amazonaws.com
     Play:     Verify Requirements
     Task:     openshift_health_check
     Message:  One or more checks failed
     Details:  check "memory_availability":
               Available memory (3.5 GB) below recommended value (8.0 GB)

The execution of "playbooks/byo/config.yml"
includes checks designed to fail early if the requirements
of the playbook are not met. One or more of these checks
failed. To disregard these results, you may choose to
disable failing checks by setting an Ansible variable:

   openshift_disable_check=disk_availability,memory_availability

Failing check names are shown in the failure details above.
Some checks may be configurable by variables if your requirements
are different from the defaults; consult check documentation.
Variables can be set in the inventory or passed on the
command line using the -e flag to ansible-playbook.

If nothing fails, there's no summary. If something other than the checks fails, then failures are still nicely displayed by host but the summary about configuring checks is not displayed, e.g.:

Failure summary:

  1. Host:     ec2-34-207-204-214.compute-1.amazonaws.com
     Play:     Verify Requirements
     Task:     openshift_sanitize_inventory : Abort when deployment type is invalid
     Message:  Please set openshift_deployment_type to one of:
               origin, online, enterprise, atomic-enterprise, openshift-enterprise

  2. Host:     ec2-54-197-219-172.compute-1.amazonaws.com
     Play:     Verify Requirements
     Task:     openshift_sanitize_inventory : Abort when deployment type is invalid
     Message:  Please set openshift_deployment_type to one of:
               origin, online, enterprise, atomic-enterprise, openshift-enterprise

@openshift-bot

success: "aos-ci-jenkins/OS_3.5_NOT_containerized, aos-ci-jenkins/OS_3.5_NOT_containerized_e2e_tests" for 06e9eaa (logs)

@openshift-bot

success: "aos-ci-jenkins/OS_3.6_NOT_containerized, aos-ci-jenkins/OS_3.6_NOT_containerized_e2e_tests" for 06e9eaa (logs)

@openshift-bot

success: "aos-ci-jenkins/OS_3.5_containerized, aos-ci-jenkins/OS_3.5_containerized_e2e_tests" for 06e9eaa (logs)

@openshift-bot

success: "aos-ci-jenkins/OS_3.6_containerized, aos-ci-jenkins/OS_3.6_containerized_e2e_tests" for 06e9eaa (logs)

Example usage:

$ ansible-playbook -i hosts playbooks/byo/config.yml \
    -e openshift_disable_check=memory_availability,disk_availability

Or add the variable to the inventory / hosts file.
@@ -3,6 +3,19 @@
  tags:
  - always

- name: Verify Requirements
  hosts: OSEv3
Member

This will include other host types like [nfs] and [lb] which we don't want to apply our checks to. We should probably move this into playbooks/common/openshift-cluster/config.yml and use the oo_masters_to_config:oo_nodes_to_config:oo_etcd_to_config there.

Member

Nevermind, I see that you're selecting which checks to apply based on byo group names.

Member

It would be best to use the right groups in the checks (the way checks are structured, they all run under one task against all hosts, so they filter themselves out by various criteria). It's just not obvious to me what the ideal groups would be, since I haven't grokked what the various groups are for and where they come from.

@sdodson (Member) commented May 23, 2017

aos-ci-test

@openshift-bot

success: "aos-ci-jenkins/OS_3.5_NOT_containerized, aos-ci-jenkins/OS_3.5_NOT_containerized_e2e_tests" for da04b57 (logs)

@openshift-bot

success: "aos-ci-jenkins/OS_3.6_NOT_containerized, aos-ci-jenkins/OS_3.6_NOT_containerized_e2e_tests" for da04b57 (logs)

@openshift-bot

success: "aos-ci-jenkins/OS_3.6_containerized, aos-ci-jenkins/OS_3.6_containerized_e2e_tests" for da04b57 (logs)

@openshift-bot

success: "aos-ci-jenkins/OS_3.5_containerized, aos-ci-jenkins/OS_3.5_containerized_e2e_tests" for da04b57 (logs)

@sosiouxme (Member)

[merge]

@sosiouxme (Member)

Jenkins tests are failing because of the new checks. I guess I have to update those as well.

@sosiouxme (Member)

[test] with CI changes

@sdodson (Member) commented May 24, 2017

[merge]

@sosiouxme (Member)

[test] again with openshift-eng/aos-cd-jobs#288 in

@brenton (Contributor) commented May 25, 2017

[test]

@brenton (Contributor) commented May 25, 2017

@sdodson, since this was done by devcut but just wasn't able to make it through the merge queue in time, do you have any problem with hitting the green button if we can show a passed test run?

/cc @jupierce

@openshift-bot

Evaluated for openshift ansible test up to da04b57

@sdodson (Member) commented May 25, 2017

[merge][severity: bug]

@sdodson (Member) commented May 25, 2017

If that fails we'll manually merge.

@openshift-bot

Evaluated for openshift ansible merge up to da04b57

@openshift-bot

continuous-integration/openshift-jenkins/test FAILURE (https://ci.openshift.redhat.com/jenkins/job/test_pull_request_openshift_ansible/155/) (Base Commit: a353d4d)


@openshift-bot

continuous-integration/openshift-jenkins/merge FAILURE (https://ci.openshift.redhat.com/jenkins/job/merge_pull_request_openshift_ansible/459/) (Base Commit: a353d4d) (Extended Tests: bug)

@sdodson sdodson merged commit e8c8ed5 into openshift:master May 25, 2017
@sosiouxme (Member)

Thx :)
https://ci.openshift.redhat.com/jenkins/job/merge_pull_request_openshift_ansible/459/ failed on "github3.models.GitHubError: 403 API rate limit exceeded for openshift-bot." - that's a new flake to me! LOL

@jeremyeder (Contributor)

#2152

free_bytes = self.openshift_available_disk(ansible_mounts)

recommended_min = max(self.recommended_disk_space_bytes.get(name, 0) for name in group_names)
configured_min = int(get_var(task_vars, "openshift_check_min_host_disk_gb", default=0)) * 10**9
Contributor Author

@sosiouxme this variable seems too generic; how was it supposed to work?

I'm updating this check to verify multiple paths / mount points (/var, /usr/local/bin, /tmp or equivalent), and in that case this variable becomes ambiguous: which filesystem should it apply to?

Is it used anywhere we know of?

Member

It's not used anywhere except here... I made it up
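The multi-path direction mentioned above could replace the single global minimum with per-path minimums; a sketch under that assumption (paths, values, and names are illustrative only):

```python
# Hypothetical per-path minimums, one entry per mount point to verify.
RECOMMENDED_DISK_SPACE_BYTES = {
    "/var": 40 * 10**9,
    "/usr/local/bin": 1 * 10**9,
    "/tmp": 1 * 10**9,
}

def insufficient_paths(free_bytes_by_path, minimums):
    """Return the paths whose free space is below their minimum;
    a path with no reported free space counts as insufficient."""
    return sorted(path for path, minimum in minimums.items()
                  if free_bytes_by_path.get(path, 0) < minimum)
```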

@rhcarvalho rhcarvalho deleted the integrate-checks-with-install branch June 22, 2017 08:58