docs/user: add troubleshootingbootstrap to define the bootstrap log bundle #3506
/test
/test all
/refresh
/test yaml-lint
docs/user/troubleshooting.md
Outdated
@@ -77,6 +77,8 @@ The most important thing to look at on the bootstrap node is `bootkube.service`.
1. If SSH is available, the following command can be run on the bootstrap node: `journalctl --unit=bootkube.service`
2. Regardless of whether or not SSH is available, the following command can be run: `curl --insecure --cert ${INSTALL_DIR}/tls/journal-gatewayd.crt --key ${INSTALL_DIR}/tls/journal-gatewayd.key 'https://${BOOTSTRAP_IP}:19531/entries?follow&_SYSTEMD_UNIT=bootkube.service'`
The installer can also gather a log bundle from the bootstrap host using SSH as describe in [troubleshootingbootstrap][./troubleshootingbootstap.md] document.
nit: Maybe `as described [here](troubleshootingbootstrap.md).`? But regardless of what you use as the link text, the URI should go in parens, because you have an inline link, not a reference-style link.
I like showing the filename to the user... but the syntax needs to be fixed or it doesn't work.
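Separately from the link syntax, the `curl` command quoted in the hunk wraps its URL in single quotes, so the shell passes `${BOOTSTRAP_IP}` to `curl` literally instead of expanding it; double quotes still protect the `&` and do expand the variable. A minimal sketch of the same journal-gatewayd query as a shell function, assuming the certificate paths from that command; the names `gatewayd_url` and `fetch_unit_journal` are hypothetical and not part of `openshift-install`:

```shell
#!/bin/sh
# Hypothetical helper (not part of openshift-install): stream a unit's journal
# from the bootstrap host via systemd-journal-gatewayd on port 19531.

# Print the journal-gatewayd URL for a host ($1) and a systemd unit ($2).
# The format string is single-quoted on purpose: it contains a literal "&"
# and no variables; host and unit are passed to printf as arguments.
gatewayd_url() {
  printf 'https://%s:19531/entries?follow&_SYSTEMD_UNIT=%s\n' "$1" "$2"
}

# Usage: fetch_unit_journal INSTALL_DIR BOOTSTRAP_IP [UNIT]
# Authenticates with the journal-gatewayd client cert/key under INSTALL_DIR/tls.
fetch_unit_journal() {
  install_dir="$1"
  host="$2"
  unit="${3:-bootkube.service}"
  # The URL is double-quoted: the command substitution is expanded, and "&"
  # is not treated as the shell's background operator.
  curl --insecure \
    --cert "${install_dir}/tls/journal-gatewayd.crt" \
    --key "${install_dir}/tls/journal-gatewayd.key" \
    "$(gatewayd_url "$host" "$unit")"
}
```

For example, `fetch_unit_journal "$INSTALL_DIR" "$BOOTSTRAP_IP"` follows `bootkube.service` until interrupted, because the `follow` parameter keeps the stream open.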
1. Use the user's already setup `SSH_AGENT`. If the user has a ssh-agent setup, the installer will use it for SSH authentication.
2. Use the user'd home directory, `~/.ssh` on linux hosts, to load all the SSH private keys and use those for SSH authentication.
nit: "linux" -> "Linux"
### directory: unit-status
The unit-status directory contains the details of each failed systemd unit from [failed-units][#file-failed-units-txt]
nit: [failed-units](#file-failed-units-txt).
(braces -> parens and trailing period).
### directory: bootstrap
The bootstrap directory consists of all the important logs and files from the bootstrap host. There are 2 sub directories for the bootstrap host
nit: "sub directories" -> "subdirectories". And maybe want a trailing colon `:`.
Also, the example has three subdirectories, not two. Maybe just say "The subdirectories are:"?
* `crio-configure.log` and `crio.log`, these units are responsible for configuring the CRI-O on the bootstrap host and CRI-O daemon respectively.
* `kubelet.log`, the kubelet service is responsible for running the kubelet on the bootstrap host. The kubelet on the bootstrap host is responsible for running the static pods for etcd, bootstrap-kube-controlplane and various other operators in bootstrap mode.
* `approve-csr.log`, the approve-csr unit is responsible for allowing control-plane machines to join OpenShift cluster. This unit performs the job of in-cluster approver while the bootstrapping is in progress.
|
nit: your previous list entries had no intervening blank lines; probably drop this one for consistency.
12 directories, 3 files
```
#### directory: control-plane/*/containers
nit: Markdown (at least GitHub's PR/files renderer) thinks this `*` is the beginning of an italics span. You can backtick your paths like `control-plane/*/containers` to avoid confusing it.
#### directory: control-plane/*/containers
The containers directory contains the descriptions and logs from all the containers created by the kubelet using CRIO on the control-plane host. The files are same as [containers directory][#directory-bootstrap-containers] on bootstrap host.
nit: another braces -> parens inline link. Also "CRIO" -> "CRI-O".
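Since both the bootstrap host and every control-plane host keep their container output in a `containers` directory, one loop can list all of it. A minimal sketch, assuming an already extracted bundle laid out as described in this file (`bootstrap/containers/` and `control-plane/<host>/containers/`) and assuming container logs carry a `.log` suffix; `list_container_logs` is a hypothetical name:

```shell
#!/bin/sh
# Hypothetical helper: list container log files in an extracted log bundle.
# Assumed layout (from the log bundle description):
#   BUNDLE/bootstrap/containers/
#   BUNDLE/control-plane/<host>/containers/
# Assumption: logs end in ".log"; other files (e.g. descriptions) are skipped.
list_container_logs() {
  bundle="$1"
  for dir in "$bundle"/bootstrap/containers "$bundle"/control-plane/*/containers; do
    # Skip missing directories and an unmatched control-plane glob.
    [ -d "$dir" ] || continue
    for f in "$dir"/*.log; do
      [ -f "$f" ] || continue
      # Print the path relative to the bundle root.
      echo "${f#"$bundle"/}"
    done
  done
}
```

Piping the output through `grep etcd`, for example, narrows it to the etcd containers across all hosts.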
* `kubelet.log`
* `machine-config-daemon-host.log` and `pivot.log`, these files have logs for RHCOS pivot related actions on the control plane host.
## Common Failures
I would still like the installer to grow diagnostics for common failures (#2569). Any thoughts about whether we can get an up/down decision on that direction once 4.6 splits off from master?
-- No entries --
```
There is high likelyhood that the Release Image cannot be downloaded and more details can be found using [release-image.log][#unable-to-pull-release-image]
nit: another braces -> parens inline link.
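The symptom in this hunk can be checked mechanically: `journalctl` prints `-- No entries --` when nothing matched, and the hunk points to `release-image.log` as the next place to look. A minimal sketch, assuming the saved `bootkube.service` journal and `release-image.log` have been extracted from the bundle and their paths are passed in; the functions `journal_is_empty` and `triage_bootkube` are hypothetical:

```shell
#!/bin/sh
# Hypothetical triage helper for the "bootkube.service has no entries" symptom.

# Succeeds if a saved journalctl dump ($1) contains the "-- No entries --" line.
journal_is_empty() {
  grep -qx -- '-- No entries --' "$1"
}

# Usage: triage_bootkube BOOTKUBE_LOG RELEASE_IMAGE_LOG
# Prints which file to read next.
triage_bootkube() {
  if journal_is_empty "$1"; then
    echo "bootkube.service has no entries; check $2 for release image pull errors"
  else
    echo "bootkube.service logged output; start with $1"
  fi
}
```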
/assign @jstuever
Some minor changes requested. Also, I'd like to see an actual troubleshooting workflow of some sort... such as 1) confirm images are downloading, 2) confirm etcd is up... etc... IMHO that is the most value we can add here.
The installer will use the user's environment to discover the credentials to connect to the bootstrap host over SSH. One of the following methods is used by the installer,
1. Use the user's already setup `SSH_AGENT`. If the user has a ssh-agent setup, the installer will use it for SSH authentication.
Use an already setup
...
1. Use the user's already setup `SSH_AGENT`. If the user has a ssh-agent setup, the installer will use it for SSH authentication.
2. Use the user'd home directory, `~/.ssh` on Linux hosts, to load all the SSH private keys and use those for SSH authentication.
user'd -> user's
Also, clarify that this only happens if the SSH_AGENT isn't already running.
Also, clarify that this only happens if the SSH_AGENT isn't already running.
https://github.com/openshift/installer/pull/3506/files#diff-135e3d860b56722d4c6282c25380d24dR13 already says "One of the following methods is used by the installer".
1. Use the user's already setup `SSH_AGENT`. If the user has a ssh-agent setup, the installer will use it for SSH authentication.
2. Use the user'd home directory, `~/.ssh` on Linux hosts, to load all the SSH private keys and use those for SSH authentication.
a. The installer also configures the bootstrap host with a *generated* SSH key, and this private key will be used for SSH authentication none of the user keys are trusted.
The placement feels odd... should this be `3.`?
This is not 3; it only applies when discovering keys. If SSH_AGENT is set, we don't do any discovery.
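The selection order discussed in this thread (agent first; key discovery under `~/.ssh` only when no agent is available) can be illustrated with a small shell function. This is a hypothetical sketch, not the installer's code: it assumes the "already setup `SSH_AGENT`" is detected through `SSH_AUTH_SOCK`, the variable OpenSSH's `ssh-agent` exports, and it recognizes private keys by their `-----BEGIN ... PRIVATE KEY-----` header line. The generated-key fallback (item a.) is not modeled, and `describe_ssh_auth` is a made-up name:

```shell
#!/bin/sh
# Hypothetical illustration of the credential order described in the doc.
# NOT the installer's implementation; it only mirrors the two steps:
#   1. a running ssh-agent (found through SSH_AUTH_SOCK) is used on its own;
#   2. otherwise, private keys are discovered under ~/.ssh.
describe_ssh_auth() {
  if [ -n "${SSH_AUTH_SOCK:-}" ] && [ -S "$SSH_AUTH_SOCK" ]; then
    echo "agent: $SSH_AUTH_SOCK"
    return 0
  fi
  found=0
  for key in "$HOME"/.ssh/*; do
    # Skip an unmatched glob, directories, and well-known non-key files.
    [ -f "$key" ] || continue
    case "$key" in
      *.pub | */known_hosts* | */authorized_keys* | */config) continue ;;
    esac
    # Heuristic: private key files start with a "-----BEGIN ... PRIVATE KEY-----" line.
    if head -n 1 "$key" | grep -q -- '-----BEGIN .*PRIVATE KEY-----'; then
      echo "key: $key"
      found=1
    fi
  done
  [ "$found" -eq 1 ] || echo "none"
}
```

To see which identities an agent currently holds (and therefore whether step 1 will be used), `ssh-add -l` lists them.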
When users are using the installer to create the OpenShift cluster, the installer has all the information to automatically capture the logs from bootstrap host in case of failure.
#### Authenticating with bootstrap host for ipi
Authenticating to the bootstrap host
@jstuever The goal is to tell people whether the failure is due to one of these reasons, so users can see which one applies to them. A workflow of what you should look at just isn't possible, because there are too many moving parts, and people respond better to symptoms than to paths. "My bootstrap failed; was it because the control-plane machines didn't join?" is easier to link to and define than "let's go on a flow-chart ride."
a520530 to 120544e
@abhinavdahiya: The following tests failed, say
ping @jstuever for review
left some proofreading suggestions
docs/user/troubleshooting.md
Outdated
@@ -77,6 +77,8 @@ The most important thing to look at on the bootstrap node is `bootkube.service`.
1. If SSH is available, the following command can be run on the bootstrap node: `journalctl --unit=bootkube.service`
2. Regardless of whether or not SSH is available, the following command can be run: `curl --insecure --cert ${INSTALL_DIR}/tls/journal-gatewayd.crt --key ${INSTALL_DIR}/tls/journal-gatewayd.key 'https://${BOOTSTRAP_IP}:19531/entries?follow&_SYSTEMD_UNIT=bootkube.service'`
The installer can also gather a log bundle from the bootstrap host using SSH as describe in [troubleshootingbootstrap](./troubleshootingbootstap.md) document.
Link does not work: the target `./troubleshootingbootstap.md` is missing an "r".
`as describe in [troubleshootingbootstrap]` -> `as described in the [troubleshooting bootstrap]`
@patrickdillon Thanks for the review, updated the PR! :)
I was thinking more of a high-level flow chart (pre-installation, wait-for bootstrap, wait-for install) to help direct what the user should be doing to troubleshoot, and to concentrate on which errors might be applicable to them. However, in hindsight, this is likely beyond the scope of this particular story.
/lgtm
/retest
/approve
[APPROVALNOTIFIER] This PR is APPROVED. This pull-request has been approved by: abhinavdahiya
This adds a document that provides,
/cc @openshift/openshift-team-installer