Aktualizr refuses to update to requested image #906
I also don't know immediately how your device got into that state. It is true that we don't retry the ostree pull immediately, but normally, if a download fails, it will be re-attempted the next time aktualizr fetches metadata and checks for updates on the server. You can also try manually using ostree tools to fetch the missing object yourself; that might help figure out where the problem is. If you just desperately want to get the device out of that state, does …
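For anyone hitting this, a manual recovery attempt with the ostree CLI could look roughly like the sketch below. The repo path, remote name, and commit checksum are assumptions about a typical OSTree-based device; adjust them for your image.

```sh
# Check the sysroot repo for missing or corrupt objects.
ostree --repo=/ostree/repo fsck

# Re-pull the target commit from the configured remote; a successful pull
# should fetch any objects that are missing locally.
ostree --repo=/ostree/repo pull treehub <commit-checksum>
```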
This looks like an issue that happened during the download process without causing a failure there, with the deploy then failing instead (I was wondering whether the cleanup process that runs before deployment removed the object or something related, as I couldn't see how this could happen during the ostree pull itself). Assuming there was a failure after the pull, how do we restore the device to a state in which OTA+/aktualizr would allow another retry or moving to a newer update? While it's unclear how we got here, it looks like there might be a bug in the update state machine during the deployment phase, as aktualizr ended up stuck in this state.
That is possible. We're going to take a look and see if we can reproduce or simulate the error somehow.
I misspoke before: if an update fails, aktualizr should report that to the server, and the server should then allow trying again or installing something else. We will also try to reproduce that. We have an open task to allow canceling a pending update, but we don't support that yet. In the meantime, though, can you try pulling the missing object manually with ostree tools? And what happens after aktualizr reports the error to the server? Do you mind sharing the relevant part of the logs, preferably with …
I have confirmed that there is a bug. The installation error appears to be correctly reported to the server, but any subsequent installation requests seem to fail before they are even attempted. I'm looking into it.
Thanks, this is exactly the behavior of the issue we found.
@rsalveti Sorry for the long delay in addressing this properly. Over the past couple of weeks, we've looked into a few issues related to yours.

We believe we have fixed the issue of the server continually sending metadata that causes aktualizr to repeatedly try to install a package that fails. The change is still in staging but will go live soon. The server should then allow trying other packages that will hopefully succeed.

As for the second issue, the failed installation itself: we've recently seen a couple of similar instances that we traced back to a server-side problem in which OSTree objects could be deleted under obscure circumstances. We have fixed that. The most obvious way to trigger it, however, is running multiple instances of garage-deploy or garage-push on the same objects simultaneously. Do you think that might have been the case for you?

We've also fixed a number of issues on the client side, so garage-deploy, garage-push, and aktualizr should all be a bit more robust in these situations and should log things a bit better. We're still debating further improvements to the client-server interaction when installations fail.

Do you still have the troubled device available? If so, it would be interesting to see whether it can recover once the server-side fix goes into production. If not, is there anything else we can do or should consider before we close this ticket? Thanks again for your communication and help!
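If concurrent pushes were indeed the trigger, serializing them in CI is a cheap guard until the server-side fix lands. A minimal sketch using flock; the lock file path and the garage-push arguments are assumptions about a typical setup:

```sh
# Hold an exclusive lock so only one push can touch these objects at a time.
flock /var/lock/garage-push.lock \
  garage-push --repo ./ostree-repo --ref master --credentials credentials.zip
```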
I assume this is a fix to the Director. Is there a specific version we should try from here:
In our failure, the object was present in TreeHub; in fact, other devices in our test pool were able to fetch it without issue. So we may have hit something different from you.
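For what it's worth, object presence can also be probed from outside a device: in the standard OSTree repo layout, an object lives at objects/<first two hex chars of the checksum>/<remaining chars>.<type>. A hedged sketch; the TreeHub URL, auth header, full checksum, and object type are all placeholders:

```sh
# HTTP 200 means the object exists on the server, 404 means it is missing.
curl -s -o /dev/null -w '%{http_code}\n' \
  -H "Authorization: Bearer $TREEHUB_TOKEN" \
  "https://treehub.example.com/objects/ea/a9e....commit"
```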
Yes, currently the latest version: advancedtelematic/director@9fb516c, which looks like
We actually saw that in some obscure cases as well, although the objects were associated with other accounts. There is still some room for network connectivity problems while downloading; it's unlikely, but a retry might still be worthwhile.
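Until automatic retries exist in the client, a crude workaround for transient network failures is to wrap the pull in a loop, e.g. (purely illustrative; repo path, remote, and ref are placeholders):

```sh
# Retry the pull a few times with a short pause between attempts.
for i in 1 2 3; do
  ostree --repo=/ostree/repo pull treehub master && break
  echo "pull attempt $i failed, retrying..." >&2
  sleep 10
done
```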
@doanac We have recently implemented automatic download retries to help resolve issues like this. Dealing with missing objects is still an ongoing topic of discussion, but is it fair to say we've addressed your concerns here? Can we close this issue, or is there something else you were looking for?
As per: advancedtelematic/aktualizr#906 (comment) Signed-off-by: Andy Doan <[email protected]>
Closing due to lack of response and apparent fixes for the problem. |
I'm not quite sure how my device got into this state, but it looks like at some point our CI automation attempted to move my device to build "405". That update seems to have failed, and the device stayed on build "404". We now have "406", and I can tell the Director to update the device. The Director accepts the request, it shows up in api/v1/admin/devices//queue, and the device registry shows the device as "Outdated". However, if I run aktualizr with loglevel=0, the device seems to reject the update, and the device registry shows it in the "Error" state. Looking at some of the loglevel=0 output, it looks like this might be part of the issue:
So it feels like there are 2 issues here:

`eaa9e...` does exist in the treehub. So there's some retry-type logic that seems to be missing.
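For reference, a one-shot debug run of aktualizr of that era looked roughly like the following; the config path is an assumption, and depending on the aktualizr version the one-shot mode is spelled either as the --running-mode flag or as a once subcommand. loglevel 0 (trace) can also be set in the [logger] section of sota.toml.

```sh
# Run a single update cycle with trace-level (0) logging.
aktualizr --config /var/sota/sota.toml --loglevel 0 --running-mode once
```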