Aktualizr refuses to update to requested image #906
I also don't know immediately how your device got into that state. It is true that we don't retry the ostree pull immediately, but normally, if a download fails, it will be re-attempted the next time aktualizr fetches metadata and checks for updates on the server. You can also try manually using ostree tools to fetch the missing object yourself; that might help figure out where the problem is. If you just desperately want to get the device out of that state, does …
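For anyone hitting this, a manual recovery attempt with the ostree CLI could look roughly like the sketch below. The repo path, remote name, and commit checksum are assumptions about a typical OSTree-based device; adjust them for your image.

```sh
# Check the sysroot repo for missing or corrupt objects.
ostree --repo=/ostree/repo fsck

# Re-pull the target commit from the configured remote; a successful pull
# should fetch any objects that are missing locally.
ostree --repo=/ostree/repo pull treehub <commit-checksum>
```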
This looks like an issue that happened during the download process without causing a failure there, with the deploy then failing instead (I was wondering whether the cleanup process that runs before deployment removed the object or something related, as I couldn't see how this could happen during the ostree pull itself). Assuming there was a failure after the pull, how do we restore the device to a state in which OTA+/aktualizr would allow another retry or moving to a newer update? While it's unclear how we got here, it looks like there might be a bug in the update state machine during the deployment phase, as aktualizr ended up stuck in this state.
That is possible. We're going to take a look and see if we can reproduce or simulate the error somehow.
I misspoke before: if an update fails, aktualizr should report that to the server, and the server should then allow trying again or installing something else. We will also try to reproduce that. We have an open task to allow canceling a pending update, but we don't support that yet. In the meantime, though, can you try pulling the missing object manually with ostree tools? And what happens after aktualizr reports the error to the server? Do you mind sharing the relevant part of the logs, preferably with …
I have confirmed that there is a bug. The installation error appears to be correctly reported to the server, but any subsequent installation requests seem to fail before they are even attempted. I'm looking into it.
Thanks, this is exactly the behavior of the issue we found.
@rsalveti Sorry for the long delay in addressing this properly. Over the past couple of weeks, we've looked into a few issues related to yours.

We believe we have fixed the issue of the server continually sending metadata that causes aktualizr to repeatedly try to install a package that fails. The change is still in staging but will go live soon. The server should then allow trying other packages that will hopefully succeed.

As for the second issue, the failed installation itself: we've recently seen a couple of similar instances that we traced back to a server-side problem in which OSTree objects could be deleted under obscure circumstances. We have fixed that. The most obvious way to trigger it, however, is running multiple instances of garage-deploy or garage-push on the same objects simultaneously. Do you think that might have been the case for you?

We've also fixed a number of issues on the client side, so garage-deploy, garage-push, and aktualizr should all be a bit more robust in these situations and should log things a bit better. We're still debating further improvements to the client-server interaction when installations fail.

Do you still have the troubled device available? If so, it would be interesting to see whether it can recover once the server-side fix goes into production. If not, is there anything else we can do or should consider before we close this ticket? Thanks again for your communication and help!
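If concurrent pushes were indeed the trigger, serializing them in CI is a cheap guard until the server-side fix lands. A minimal sketch using flock; the lock file path and the garage-push arguments are assumptions about a typical setup:

```sh
# Hold an exclusive lock so only one push can touch these objects at a time.
flock /var/lock/garage-push.lock \
  garage-push --repo ./ostree-repo --ref master --credentials credentials.zip
```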
I assume this is a fix to the Director. Is there a specific version we should try from here:
In our failure, the object was present in TreeHub; in fact, other devices in our test pool were able to fetch it without issue. So we may have hit something different from you.
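For what it's worth, object presence can also be probed from outside a device: in the standard OSTree repo layout, an object lives at objects/<first two hex chars of the checksum>/<remaining chars>.<type>. A hedged sketch; the TreeHub URL, auth header, full checksum, and object type are all placeholders:

```sh
# HTTP 200 means the object exists on the server, 404 means it is missing.
curl -s -o /dev/null -w '%{http_code}\n' \
  -H "Authorization: Bearer $TREEHUB_TOKEN" \
  "https://treehub.example.com/objects/ea/a9e....commit"
```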
Yes, currently the latest version: advancedtelematic/director@9fb516c, which looks like
We actually saw that in some obscure cases as well, although the objects were associated with other accounts. There is still some room for network connectivity problems while downloading; it's unlikely, but a retry might still be worthwhile.
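Until automatic retries exist in the client, a crude workaround for transient network failures is to wrap the pull in a loop, e.g. (purely illustrative; repo path, remote, and ref are placeholders):

```sh
# Retry the pull a few times with a short pause between attempts.
for i in 1 2 3; do
  ostree --repo=/ostree/repo pull treehub master && break
  echo "pull attempt $i failed, retrying..." >&2
  sleep 10
done
```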
@doanac We have recently implemented automatic download retries to help resolve issues like this. Dealing with missing objects is still an ongoing topic of discussion, but is it fair to say we've addressed your concerns here? Can we close this issue, or is there something else you were looking for?
As per: advancedtelematic/aktualizr#906 (comment) Signed-off-by: Andy Doan <[email protected]>
Closing due to lack of response and apparent fixes for the problem. |
I'm not quite sure how my device got into this state, but it looks like at some point our CI automation attempted to move my device to build "405". That update seems to have failed, and the device stayed on build "404". We now have "406", and I can tell the Director to update the device. The Director accepts the request, it shows up in api/v1/admin/devices//queue, and the device registry shows the device as "Outdated". However, if I run aktualizr with loglevel=0, the device seems to reject the update, and the device registry shows it in the "Error" state. Looking at some of the loglevel=0 output, it looks like this might be part of the issue:
So it feels like there are 2 issues here:

`eaa9e...` does exist in the treehub. So there's some retry-type logic that seems to be missing.
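For reference, a one-shot debug run of aktualizr of that era looked roughly like the following; the config path is an assumption, and depending on the aktualizr version the one-shot mode is spelled either as the --running-mode flag or as a once subcommand. loglevel 0 (trace) can also be set in the [logger] section of sota.toml.

```sh
# Run a single update cycle with trace-level (0) logging.
aktualizr --config /var/sota/sota.toml --loglevel 0 --running-mode once
```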