docs/user: add troubleshootingbootstrap to define the bootstrap log bundle #3506

Merged
merged 1 commit into from
May 6, 2020

Conversation

abhinavdahiya
Contributor

This adds a document that provides:

  1. structural information about the various files in the bootstrap log bundle
  2. some common failures that can be troubleshot using the bootstrap log bundle

/cc @openshift/openshift-team-installer

@openshift-ci-robot openshift-ci-robot requested a review from a team April 24, 2020 20:36
@abhinavdahiya abhinavdahiya added bugzilla/valid-bug Indicates that a referenced Bugzilla bug is valid for the branch this PR is targeting. retest-not-required-docs-only labels Apr 24, 2020
@abhinavdahiya abhinavdahiya reopened this Apr 24, 2020
@abhinavdahiya
Contributor Author

/test

@openshift-ci-robot
Contributor

@abhinavdahiya: The /test command needs one or more targets.
The following commands are available to trigger jobs:

  • /test e2e-aws
  • /test e2e-aws-disruptive
  • /test e2e-aws-fips
  • /test e2e-aws-proxy
  • /test e2e-aws-rhel8
  • /test e2e-aws-scaleup-rhel7
  • /test e2e-aws-shared-vpc
  • /test e2e-aws-upgrade
  • /test e2e-aws-upi
  • /test e2e-azure
  • /test e2e-azure-shared-vpc
  • /test e2e-azure-upi
  • /test e2e-gcp
  • /test e2e-gcp-shared-vpc
  • /test e2e-gcp-upgrade
  • /test e2e-gcp-upi
  • /test e2e-libvirt
  • /test e2e-metal
  • /test e2e-metal-ipi
  • /test e2e-openstack
  • /test e2e-openstack-parallel
  • /test e2e-ovirt
  • /test e2e-vsphere
  • /test e2e-vsphere-upi
  • /test gofmt
  • /test golint
  • /test govet
  • /test images
  • /test shellcheck
  • /test tf-fmt
  • /test tf-lint
  • /test unit
  • /test verify-vendor
  • /test yaml-lint

Use /test all to run all jobs.

In response to this:

/test

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@abhinavdahiya
Contributor Author

/test all

@wking
Member

wking commented Apr 26, 2020

/refresh
/retest
/test unit

@wking
Member

wking commented Apr 26, 2020

/test yaml-lint
/test verify-vendor
/test tf-lint
/test shellcheck
/test images
/test govet
/test golint
/test gofmt
/test e2e-aws-upgrade

@@ -77,6 +77,8 @@ The most important thing to look at on the bootstrap node is `bootkube.service`.
1. If SSH is available, the following command can be run on the bootstrap node: `journalctl --unit=bootkube.service`
2. Regardless of whether or not SSH is available, the following command can be run: `curl --insecure --cert ${INSTALL_DIR}/tls/journal-gatewayd.crt --key ${INSTALL_DIR}/tls/journal-gatewayd.key 'https://${BOOTSTRAP_IP}:19531/entries?follow&_SYSTEMD_UNIT=bootkube.service'`

The installer can also gather a log bundle from the bootstrap host using SSH as describe in [troubleshootingbootstrap][./troubleshootingbootstap.md] document.
Member

nit: Maybe as described [here](troubleshootingbootstrap.md).? But regardless of what you use as the link text, the URI should go in parens, because you have an inline link, not a reference-style link.

Contributor

@jstuever jstuever Apr 27, 2020

I like showing the filename to the user... but the syntax needs to be fixed or it doesn't work.
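
To illustrate the nit in this thread: in Markdown, `[text](target)` is an inline link, while `[text][label]` is a reference-style link that only resolves if a separate `[label]: target` definition exists. Below is a minimal sketch of a check that flags the second form when the label looks like a relative path or an anchor. The function name and the heuristic are hypothetical and not part of the installer repository.

```python
import re

# Matches [text][label] where the label starts like a relative path or an anchor,
# which suggests the author meant an inline link: [text](target).
SUSPECT_LINK = re.compile(r"\[([^\]]+)\]\[((?:\./|\.\./|#)[^\]]*)\]")


def find_suspect_links(markdown):
    """Return (line number, link text, label) for each suspicious reference-style link."""
    hits = []
    for lineno, line in enumerate(markdown.splitlines(), start=1):
        for match in SUSPECT_LINK.finditer(line):
            hits.append((lineno, match.group(1), match.group(2)))
    return hits


if __name__ == "__main__":
    sample = (
        "as describe in [troubleshootingbootstrap][./troubleshootingbootstap.md] document.\n"
        "as described [here](troubleshootingbootstrap.md).\n"
    )
    for lineno, text, label in find_suspect_links(sample):
        print(f"line {lineno}: [{text}][{label}] should probably be [{text}]({label})")
```

The check only looks at the label's first characters, so a legitimate reference-style link whose label happens to start with `#` or `./` would also be reported.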


1. Use the user's already setup `SSH_AGENT`. If the user has a ssh-agent setup, the installer will use it for SSH authentication.

2. Use the user'd home directory, `~/.ssh` on linux hosts, to load all the SSH private keys and use those for SSH authentication.
Member

nit: "linux" -> "Linux"


### directory: unit-status

The unit-status directory contains the details of each failed systemd unit from [failed-units][#file-failed-units-txt]
Member

nit: [failed-units](#file-failed-units-txt). (braces -> parens and trailing period).


### directory: bootstrap

The bootstrap directory consists of all the important logs and files from the bootstrap host. There are 2 sub directories for the bootstrap host
Member

nit: "sub directories" -> "subdirectories". And maybe want a trailing colon :.

Member

Also, example has three subdirectories, not two. Maybe just say "The subdirectories are:"?

* `crio-configure.log` and `crio.log`, these units are responsible for configuring the CRI-O on the bootstrap host and CRI-O daemon respectively.
* `kubelet.log`, the kubelet service is responsible for running the kubelet on the bootstrap host. The kubelet on the bootstrap host is responsible for running the static pods for etcd, bootstrap-kube-controlplane and various other operators in bootstrap mode.
* `approve-csr.log`, the approve-csr unit is responsible for allowing control-plane machines to join OpenShift cluster. This unit performs the job of in-cluster approver while the bootstrapping is in progress.

Member

nit: your previous list entries had no intervening blank lines; probably drop this one for consistency.

12 directories, 3 files
```

#### directory: control-plane/*/containers
Member

nit: Markdown (at least GitHub's PR/files renderer) thinks this `*` is the beginning of an italics span. You can backtick your paths like `control-plane/*/containers` to avoid confusing it.


#### directory: control-plane/*/containers

The containers directory contains the descriptions and logs from all the containers created by the kubelet using CRIO on the control-plane host. The files are same as [containers directory][#directory-bootstrap-containers] on bootstrap host.
Member

nit: another braces -> parens inline link. Also "CRIO" -> "CRI-O"

* `kubelet.log`
* `machine-config-daemon-host.log` and `pivot.log`, these files have logs for RHCOS pivot related actions on the control plane host.

## Common Failures
Member

I would still like the installer to grow diagnostics for common failures (#2569). Any thoughts about whether we can get an up/down decision on that direction once 4.6 splits off from master?

-- No entries --
```

There is high likelyhood that the Release Image cannot be downloaded and more details can be found using [release-image.log][#unable-to-pull-release-image]
Member

nit: another braces -> parens inline link.
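
As a rough illustration of the symptom quoted in this hunk, the sketch below checks whether a saved journal log contains only the `-- No entries --` marker. Which file in the bundle the doc refers to is not visible in this diff context, so the function name and the idea of running it over a unit log from the bundle are assumptions.

```python
def journal_log_has_no_entries(log_text):
    """Return True if a saved journalctl log contains no actual log entries."""
    lines = [line.strip() for line in log_text.splitlines() if line.strip()]
    if "-- No entries --" not in lines:
        return False
    # Assumption: any other lines are journalctl banner lines, which also start with "-- ".
    return all(line.startswith("-- ") for line in lines)


if __name__ == "__main__":
    print(journal_log_has_no_entries("-- No entries --\n"))
```

An empty log for a unit would then be a hint to look at `release-image.log` next, as the quoted doc text suggests.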

@abhinavdahiya
Contributor Author

/assign @jstuever

Contributor

@jstuever jstuever left a comment

Some minor changes requested. Also, I'd like to see an actual troubleshooting workflow of some sort... such as 1) confirm images are downloading, 2) confirm etcd is up... etc... IMHO that is the most value we can add here.


The installer will use the user's environment to discover the credentials to connect to the bootstrap host over SSH. One of the following methods is used by the installer,

1. Use the user's already setup `SSH_AGENT`. If the user has a ssh-agent setup, the installer will use it for SSH authentication.
Contributor

Use an already setup...


1. Use the user's already setup `SSH_AGENT`. If the user has a ssh-agent setup, the installer will use it for SSH authentication.

2. Use the user'd home directory, `~/.ssh` on Linux hosts, to load all the SSH private keys and use those for SSH authentication.
Contributor

user's
Also, clarify that this only happens if the SSH_AGENT isn't already running.

Contributor Author

@abhinavdahiya abhinavdahiya Apr 27, 2020

> Also, clarify that this only happens if the SSH_AGENT isn't already running.

https://github.com/openshift/installer/pull/3506/files#diff-135e3d860b56722d4c6282c25380d24dR13 already says "One of".

1. Use the user's already setup `SSH_AGENT`. If the user has a ssh-agent setup, the installer will use it for SSH authentication.

2. Use the user'd home directory, `~/.ssh` on Linux hosts, to load all the SSH private keys and use those for SSH authentication.
a. The installer also configures the bootstrap host with a *generated* SSH key, and this private key will be used for SSH authentication none of the user keys are trusted.
Contributor

The placement feels odd... should this be 3.?

Contributor Author

This is not 3; it only applies when the installer is discovering keys. If SSH_AGENT is set, we don't do any discovery.
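
The behaviour described in this thread can be sketched roughly as follows. This is not the installer's actual code (the installer is written in Go). The function name, the return shape, the file-name heuristic, and the use of the standard `SSH_AUTH_SOCK` variable as the signal that an agent is set up are all assumptions made for illustration.

```python
import os
from pathlib import Path


def discover_ssh_auth(env, ssh_dir, generated_key=None):
    """Pick SSH credentials as the thread describes: agent first, else key discovery.

    env: mapping of environment variables (for example os.environ)
    ssh_dir: directory to scan for private keys (for example ~/.ssh)
    generated_key: optional path to the installer-generated private key
    """
    # Assumption: a non-empty SSH_AUTH_SOCK means an ssh-agent is set up.
    if env.get("SSH_AUTH_SOCK"):
        return {"method": "agent", "keys": []}

    # No agent: discover keys from the user's SSH directory.
    # Heuristic for this sketch: regular files that are not *.pub or well-known non-key files.
    skip = {"known_hosts", "config", "authorized_keys"}
    keys = []
    ssh_path = Path(ssh_dir)
    if ssh_path.is_dir():
        keys = sorted(
            str(p) for p in ssh_path.iterdir()
            if p.is_file() and p.suffix != ".pub" and p.name not in skip
        )

    # Sub-item (a) only applies on this discovery path: the generated key is
    # offered as well, so it still works when none of the user keys are trusted.
    if generated_key:
        keys.append(str(generated_key))
    return {"method": "keys", "keys": keys}


if __name__ == "__main__":
    print(discover_ssh_auth(os.environ, Path.home() / ".ssh"))
```

Written this way, the generated key is clearly part of the second method rather than a third one, which is the point made in the reply above.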


When users are using the installer to create the OpenShift cluster, the installer has all the information to automatically capture the logs from bootstrap host in case of failure.

#### Authenticating with bootstrap host for ipi
Contributor

Authenticating to the bootstrap host

@abhinavdahiya
Contributor Author

abhinavdahiya commented Apr 27, 2020

Also, I'd like to see an actual troubleshooting workflow of some sort... such as 1) confirm images are downloading, 2) confirm etcd is up... etc... IMHO that is the most value we can add here.

@jstuever The goal is to tell people whether the failure is due to one of these reasons, so users can see which one applies to them.

A workflow of what you should look at just isn't possible, because there are too many moving parts, and people respond better to symptoms than to a path.

"My bootstrap failed; was it because the control-plane machines didn't join?" is easier to link to and define than "hey, let's go on a ride of flow-chart".

@abhinavdahiya abhinavdahiya force-pushed the bg_doc branch 2 times, most recently from a520530 to 120544e Compare April 28, 2020 05:49
@openshift-ci-robot
Contributor

@abhinavdahiya: The following tests failed, say /retest to rerun all failed tests:

| Test name | Commit | Details | Rerun command |
| --- | --- | --- | --- |
| ci/prow/yaml-lint | fdfe31b00ff1d8caecc73b14b7f9d109361c6cc9 | link | /test yaml-lint |
| ci/prow/golint | fdfe31b00ff1d8caecc73b14b7f9d109361c6cc9 | link | /test golint |
| ci/prow/gofmt | fdfe31b00ff1d8caecc73b14b7f9d109361c6cc9 | link | /test gofmt |
| ci/prow/govet | fdfe31b00ff1d8caecc73b14b7f9d109361c6cc9 | link | /test govet |
| ci/prow/verify-vendor | fdfe31b00ff1d8caecc73b14b7f9d109361c6cc9 | link | /test verify-vendor |
| ci/prow/unit | fdfe31b00ff1d8caecc73b14b7f9d109361c6cc9 | link | /test unit |
| ci/prow/images | fdfe31b00ff1d8caecc73b14b7f9d109361c6cc9 | link | /test images |
| ci/prow/e2e-ovirt | fdfe31b00ff1d8caecc73b14b7f9d109361c6cc9 | link | /test e2e-ovirt |
| ci/prow/e2e-aws-scaleup-rhel7 | fdfe31b00ff1d8caecc73b14b7f9d109361c6cc9 | link | /test e2e-aws-scaleup-rhel7 |
| ci/prow/e2e-metal-ipi | fdfe31b00ff1d8caecc73b14b7f9d109361c6cc9 | link | /test e2e-metal-ipi |

Full PR test history. Your PR dashboard. Please help us cut down on flakes by linking to an open issue when you hit one in your PR.

@abhinavdahiya
Contributor Author

ping @jstuever for review

Contributor

@patrickdillon patrickdillon left a comment

left some proofreading suggestions

@@ -77,6 +77,8 @@ The most important thing to look at on the bootstrap node is `bootkube.service`.
1. If SSH is available, the following command can be run on the bootstrap node: `journalctl --unit=bootkube.service`
2. Regardless of whether or not SSH is available, the following command can be run: `curl --insecure --cert ${INSTALL_DIR}/tls/journal-gatewayd.crt --key ${INSTALL_DIR}/tls/journal-gatewayd.key 'https://${BOOTSTRAP_IP}:19531/entries?follow&_SYSTEMD_UNIT=bootkube.service'`

The installer can also gather a log bundle from the bootstrap host using SSH as describe in [troubleshootingbootstrap](./troubleshootingbootstap.md) document.
Contributor

Link does not work.

as describe in [troubleshootingbootstrap] -> as described in the [troubleshooting bootstrap]

@abhinavdahiya
Contributor Author

@patrickdillon Thanks for the review, updated the PR! :)

@jstuever
Contributor

jstuever commented May 6, 2020

hey let's go on a ride of flow-chart.

I was thinking more of a high-level flow-chart.... pre-installation, wait-for bootstrap, wait-for install... to help direct what the user should be doing to troubleshoot and concentrate on which errors might be applicable to the user. However, in hind-sight, this is likely beyond the scope of this particular story.

@jstuever
Contributor

jstuever commented May 6, 2020

/lgtm

@openshift-ci-robot openshift-ci-robot added the lgtm Indicates that a PR is ready to be merged. label May 6, 2020
@jstuever
Contributor

jstuever commented May 6, 2020

/retest

@abhinavdahiya
Contributor Author

/approve

@openshift-ci-robot
Contributor

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: abhinavdahiya

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@openshift-ci-robot openshift-ci-robot added the approved Indicates a PR has been approved by an approver from all required OWNERS files. label May 6, 2020
@openshift-merge-robot openshift-merge-robot merged commit 51bbaa4 into openshift:master May 6, 2020
@abhinavdahiya abhinavdahiya deleted the bg_doc branch May 6, 2020 18:26
Labels
approved Indicates a PR has been approved by an approver from all required OWNERS files. bugzilla/valid-bug Indicates that a referenced Bugzilla bug is valid for the branch this PR is targeting. lgtm Indicates that a PR is ready to be merged. retest-not-required-docs-only

6 participants