-
Notifications
You must be signed in to change notification settings - Fork 3.3k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
registry: add heartbeat call for running builds #11846
Conversation
packer/registry_builder.go
Outdated
@@ -50,6 +55,31 @@ func (b *RegistryBuilder) Run(ctx context.Context, ui packersdk.Ui, hook packers | |||
log.Printf("[TRACE] failed to update HCP Packer registry status for %q: %s", b.Name, err) | |||
} | |||
|
|||
go func() { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I believe similar logic will be needed in the RegistryPostProcessor which wraps all configured post processors and run after a successful build has completed.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
So I don't think copying this logic to the RegistryPostProcessor is necessary, since we only stop sending heartbeats on context cancellation, which happens independently from the build finishing.
However there is another issue due to the fact that the cloud-provider is set, which causes heartbeats to fail as they include the cloud-provider in the updates (this should not be the case). We need to fix this before we can consider merging this PR
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Note: fixed in c31a436
311b88a
to
c31a436
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This generally looks good to me, nice work Lucas! I think we should add some acceptance tests though in HPATS as we talked about, and probably wait for the next release on this one.
@@ -156,10 +156,10 @@ func (b *Bucket) UpdateBuildStatus(ctx context.Context, name string, status mode | |||
_, err = b.client.UpdateBuild(ctx, | |||
buildToUpdate.ID, | |||
buildToUpdate.RunUUID, | |||
buildToUpdate.CloudProvider, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Does this mean that the builds were never set to running? 🤔
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
They were, just once they passed a certain point in the build (after the build is done, when post-processors start to be worked on), the status would never be updated again as all requests would fail.
Also a new run of the build would also never succeed to change the status as it would get rejected since the updates payload would contain the cloud provider, which gets rejected on HCP
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm not sure I understand the issue here. A build, if I'm not mistaken, will get the call to update the status to running once a build begins in the registry builder. The cloud provider information is derived from an build image (at the end of a build) so cloud provider will always be empty here.
Unless a build was somehow given a provider without an image. Is this possible on the API?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I don't understand why the updating of build labels is being removed here. If the issue is that we can not update the cloud provider once set would it makes sense to consider the following options.
- Update the API to only error if the value of cloudprovider is being changed to some new value
- Update the UpdateBuild to check if the status of
buildToUpdate
is already set to DONE and if so bail as there is no need for a heartbeat. All builds are stored in memory and refreshed after a successful API call so that state in memory should contain all the information needed to determine if a heartbeat update needs to be sent. - Handle the error of calling UpdateBuild when the provider is set.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yeah, it's probably not super clear here, but that's linked to heartbeats being emitted during the post-processing steps, since they use that same function to update the status to "RUNNING" again in order to bump "UpdatedAt", which in turn caused the cloud-provider
filed to be set at that time, and because updates to the field are rejected and we include it all the time, we can't send anymore updates to the build later on.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Update the API to only error if the value of cloudprovider is being changed to some new value.
The API already does that! Labels can be updated and new labels will be concatenated with what's already stored.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I don't think checking for the Done status will fix this particular issue. If I got it right, the heartbeat that sends the cloud-provider happens before the build is set to done but after the build is populated with the metadata. So the heartbeat is the one updating those fields as Lucas said.
Is safe to only set the cloud provider inside markBuildComplete
, this way we make sure this is set just once when the build is actually done.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@sylviamoss you have nailed that issue, well said.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Understood. Thanks for clarifying!
@sylviamoss pointed out that the persisting of the cloud provider to the in-memory build object is also happening here and not only when calling markBuildComplete. Which is why CloudProvider is not empty for the UpdateBuild call.
Looking at the bucket method (b *Bucket) UpdateBuildStatus
I agree with removing all of the fields except the required and status. So this change make sense.
I do wonder if we should just create a new method called (c *Client) UpdateBuildStatus
that just updates the status that is called in (b *Bucket) UpdateBuildStatus
and leave the call to (c *Client) UpdateBuild
inside of the (b *Bucket) markBuildComplete
method. To avoid confusion in the future. What do you think?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
For visibility, there's going to be some changes to this code:
1- We'll change the conditions for the heartbeats, replicate the code, or mutualise it so we continue broadcasting the heartbeats until the individual build is done. As of now with the current code, the heartbeats may continue after a build has terminated (either with a success or a failure), which is not ideal.
2- We'll do a slight refactor to the heartbeat logic so they only send the state of the build, and not change the value of the shared status.
3- We'll split status updates from the final build done step.
There are more things to work on, but that'll probably be when the HCP logic gets extracted to a separate plugin later on.
Agree with @JenGoldstrich that we can bake this one a little more before releasing it. |
03f1a12
to
7fb9615
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Nicely done. The changes look good to me. I left a comment about the log messaging. Debug logs are used by the user for troubleshooting and sometimes general logging. We should provide some context into what information is being sent up every 2 minutes - heartbeat seems a bit vague.
@@ -141,7 +146,7 @@ func (b *Bucket) CreateInitialBuildForIteration(ctx context.Context, componentTy | |||
// UpdateBuildStatus updates the status of a build entry on the HCP Packer registry with its current local status. | |||
func (b *Bucket) UpdateBuildStatus(ctx context.Context, name string, status models.HashicorpCloudPackerBuildStatus) error { | |||
if status == models.HashicorpCloudPackerBuildStatusDONE { | |||
return b.markBuildComplete(ctx, name) | |||
return fmt.Errorf("do not use UpdateBuildStatus for updating to DONE") |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Nice way to bubble up the info to the caller.
When a build is running, we send periodic heartbeats to HCP Packer's API. This will be used on the service side to detect if a build is stalled on the core side because of a crash or any other malfunction that caused Packer not to send an update on the status of a build.
Prior to this commit, we'd send updates to both the labels and the cloud-provider whenever an update to the status of a build would be sent. This would cause a bug in which once a build reached the post-processing state, and its cloud-provider was set, the status could not be updated anymore, as the cloud-provider would be set and further updates are rejected by the platform. To avoid this problem, we only transmit the status when doing a status update, and no other fields along with it.
The markBuildComplete private function used to be called from a final status update to DONE, through the UpdateBuildStatus function. The problem being that the (now) CompleteBuild function not only updates the status of the build, but also all its metadata. For consistency, we remove the indirection, and explicitely call CompleteBuild when we want to finish a build.
3091024
to
0d80c19
Compare
I'm going to lock this pull request because it has been closed for 30 days ⏳. This helps our maintainers find and focus on the active issues. |
When a build is running, we send periodic heartbeats to HCP Packer's
API.
This will be used on the service side to detect if a build is stalled on
the core side because of a crash or any other malfunction that caused
Packer not to send an update on the status of a build.