Skip to content

Commit

Permalink
OCPBUGS-32105: Fix race to mark node Joined (openshift#823)
Browse files Browse the repository at this point in the history
* OCPBUGS-32105: Narrow race between bootstrap and controller

If a host is in the Installed state already (which can occur when the
assisted-installer-controller sets the progress to Done), don't try to
set the progress to Joined as it will not only never succeed, but also
take 30+ minutes of unlogged retries inside the client before an error
is returned.

This narrows the window in which this can occur, but if the bootstrap
assisted-installer reads the Host before the
assisted-installer-controller updates the status, this could still
occur.

Ensure any failed requests are retried by not adding the Node to the
readyMasters list until the Progress has been set to either Joined or
Done (the latter triggers a change of Status to Installed).

Improve debugging by not logging different request_ids for messages
corresponding to a single request.

* OCPBUGS-32105: Reduce retries of 4xx error codes

Since 4xx error codes indicate a problem on the client side, most of
them cannot be usefully retried at the HTTP transport level. e.g. if a
409 Conflict is returned in response to a PUT request, then we need to
fetch the resource again with a GET before creating a new PUT request.
Blocking for 30+ minutes in the original PUT call (without logging) is
not helpful; we want the transport to return immediately so we can try
again.

Retry on only those 4xx error codes where it is conceivable that trying
the same request again might work.
  • Loading branch information
zaneb authored May 2, 2024
1 parent 1c6860c commit f548d32
Show file tree
Hide file tree
Showing 2 changed files with 13 additions and 7 deletions.
12 changes: 7 additions & 5 deletions src/installer/installer.go
Original file line number Diff line number Diff line change
Expand Up @@ -777,16 +777,18 @@ func (i *installer) updateReadyMasters(nodes *v1.NodeList, readyMasters *[]strin
ctx := utils.GenerateRequestContext()
log := utils.RequestIDLogger(ctx, i.log)
log.Infof("Found a new ready master node %s with id %s", node.Name, node.Status.NodeInfo.SystemUUID)
*readyMasters = append(*readyMasters, node.Name)

host, ok := common.HostMatchByNameOrIPAddress(node, inventoryHostsMap, knownIpAddresses)
if ok && (host.Host.Status == nil || *host.Host.Status != models.HostStatusInstalled) {
if err := i.inventoryClient.UpdateHostInstallProgress(ctx, host.Host.InfraEnvID.String(), host.Host.ID.String(), models.HostStageJoined, ""); err != nil {
log.Errorf("Failed to update node installation status, %s", err)
continue
}
}
*readyMasters = append(*readyMasters, node.Name)
if !ok {
return fmt.Errorf("node %s is not in inventory hosts", node.Name)
}
ctx = utils.GenerateRequestContext()
if err := i.inventoryClient.UpdateHostInstallProgress(ctx, host.Host.InfraEnvID.String(), host.Host.ID.String(), models.HostStageJoined, ""); err != nil {
utils.RequestIDLogger(ctx, i.log).Errorf("Failed to update node installation status, %s", err)
}
}
}

Expand Down
8 changes: 6 additions & 2 deletions src/inventory_client/inventory_client.go
Original file line number Diff line number Diff line change
Expand Up @@ -120,11 +120,15 @@ func CreateInventoryClientWithDelay(clusterId string, inventoryURL string, pullS
rehttp.RetryAny(
rehttp.RetryAll(
rehttp.RetryMaxRetries(minRetries),
rehttp.RetryStatusInterval(400, 404),
rehttp.RetryStatuses(404, 423, 425),
),
rehttp.RetryAll(
rehttp.RetryMaxRetries(maxRetries),
rehttp.RetryStatusInterval(405, 600),
rehttp.RetryStatuses(408, 429),
),
rehttp.RetryAll(
rehttp.RetryMaxRetries(maxRetries),
rehttp.RetryStatusInterval(500, 600),
),
rehttp.RetryAll(
rehttp.RetryMaxRetries(maxRetries),
Expand Down

0 comments on commit f548d32

Please sign in to comment.