Commit
OCPBUGS-32105: Fix race to mark node Joined (openshift#823)
* OCPBUGS-32105: Narrow race between bootstrap and controller

If a host is in the Installed state already (which can occur when the assisted-installer-controller sets the progress to Done), don't try to set the progress to Joined as it will not only never succeed, but also take 30+ minutes of unlogged retries inside the client before an error is returned.

This narrows the window in which this can occur, but if the bootstrap assisted-installer reads the Host before the assisted-installer-controller updates the status, this could still occur.

Ensure any failed requests are retried by not adding the Node to the readyMasters list until the Progress has been set to either Joined or Done (the latter triggers a change of Status to Installed).

Improve debugging by not logging different request_ids for messages corresponding to a single request.

* OCPBUGS-32105: Reduce retries of 4xx error codes

Since 4xx error codes indicate a problem on the client side, most of them cannot be usefully retried at the HTTP transport level. For example, if a 409 Conflict is returned in response to a PUT request, then we need to fetch the resource again with a GET before creating a new PUT request. Blocking for 30+ minutes in the original PUT call (without logging) is not helpful; we want the transport to return immediately so we can try again.

Retry on only those 4xx error codes where it is conceivable that trying the same request again might work.
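
The retry rule described in the second commit could be sketched roughly as follows. This is an illustration only, not the code from this change: the function name `shouldRetryStatus` is hypothetical, and the particular set of retryable 4xx codes (408, 425, 429) is an assumption, since the commit message does not list which codes are retried. Treating every 5xx code as retryable is likewise an assumption made for the example.

```go
package main

import (
	"fmt"
	"net/http"
)

// shouldRetryStatus reports whether resending the identical request at the
// transport level could plausibly succeed for the given HTTP status code.
// Hypothetical helper: the set of retryable codes below is an assumption
// for illustration, not the list used by the actual change.
func shouldRetryStatus(code int) bool {
	switch {
	case code >= 500:
		// Server-side failure: assumed to be transient.
		return true
	case code == http.StatusRequestTimeout, // 408
		code == http.StatusTooEarly, // 425
		code == http.StatusTooManyRequests: // 429
		// Client-side codes where the same request may work later.
		return true
	default:
		// Everything else, including 409 Conflict, is returned to the
		// caller at once so it can re-fetch the resource and build a
		// new request instead of blocking inside the transport.
		return false
	}
}

func main() {
	for _, code := range []int{
		http.StatusOK,
		http.StatusConflict,
		http.StatusTooManyRequests,
		http.StatusServiceUnavailable,
	} {
		fmt.Printf("%d retry=%v\n", code, shouldRetryStatus(code))
	}
}
```

With a rule of this shape, a 409 Conflict on a PUT surfaces to the caller immediately, and the caller's own loop (GET the resource, rebuild the PUT) becomes responsible for the retry.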