inconsistent handling of invalid user data #1346

bcressey · 2021-02-23T20:14:54Z

Image I'm using:
aws-k8s-1.18 v1.05

What I expected to happen:
I expected the same failure mode for "invalid TOML" as for "unrecognized settings" from early-boot-config - in both cases, the desired settings are not usable in the provided form, so the system should respond in the same way.

What actually happened:
Invalid TOML triggers a cascading series of service startup failures:

   11.842376] early-boot-config[2778]: Provider error: Error parsing TOML user data: expected a table key, found a right bracket at line 1 column 1
[FAILED] Failed to start Bottlerocket userdata configuration system.
See 'systemctl status early-boot-config.service' for details.
[DEPEND] Dependency failed for Bottlerocket â�¦nd dynamic configuration complete.
[DEPEND] Dependency failed for A versatile iâ�¦tion of the Network Time Protocol.
[FAILED] Failed to start containerd container runtime.
See 'systemctl status containerd.service' for details.
[DEPEND] Dependency failed for Kubelet.
[DEPEND] Dependency failed for Call signpostâ�¦ter all required targets are met..
[FAILED] Failed to start containerd runtime for host containers.
See 'systemctl status host-containerd.service' for details.
         Starting User-specified setting generators...
[   11.932268] sundog[2853]: Setting generator 'schnauzer settings.host-containers.control.source' failed with exit code 1 - stderr: Failed to render setting 'settings.host-containers.control.source' from template '{{ ecr-prefix settings.aws.region }}/bottlerocket-control:v0.4.1': Error rendering "Unnamed template" line 1, col 1: TemplateHelperError: Expected an AWS region, got 'null' in template dynamic template
[FAILED] Failed to start User-specified setting generators.
See 'systemctl status sundog.service' for details.
         Starting Applies settings to create config files...
[   12.001676] thar-be-settings[2914]: Restart command failed - '/usr/bin/host-containers': 18:52:08 [ERROR] Failed to handle host container 'admin': Host containers 'admin' missing field 'source'
[   12.004636] thar-be-settings[2914]: 18:52:08 [ERROR] Failed to handle host container 'control': Host containers 'control' missing field 'source'
[FAILED] Failed to start Applies settings to create config files.

An unrecognized setting triggers a more compact set of failures that allows the default host containers to start:

[  OK  ] Stopped containerd runtime for host containers.
[   34.815484] early-boot-config[2854]: Error PATCHing '/settings?tx=bottlerocket-launch': Status 400 when PATCHing /settings?tx=bottlerocket-launch: Json deserialize error: unknown field `standalone-mode`, expected one of `cluster-name`, `cluster-certificate`, `api-server`, `node-labels`, `node-taints`, `max-pods`, `cluster-dns-ip`, `cluster-domain`, `node-ip`, `pod-infra-container-image` at line 1 column 162
[FAILED] Failed to start Bottlerocket userdata configuration system.
See 'systemctl status early-boot-config.service' for details.
[DEPEND] Dependency failed for Bottlerocket â�¦nd dynamic configuration complete.
         Starting containerd container runtime...
         Starting containerd runtime for host containers...
[  OK  ] Finished Applies settings to create config files.
[  OK  ] Started containerd runtime for host containers.
         Starting Host container: control...
[  OK  ] Started Host container: control.
[  OK  ] Started containerd container runtime.
[  OK  ] Reached target Multi-User System.
[  OK  ] Reached target Graphical Interface.

The system is easier to troubleshoot in this state since the admin container can be enabled to examine logs in more detail.

How to reproduce the problem:
Launch an instance with invalid TOML in the userdata.

The text was updated successfully, but these errors were encountered:

tjkirch · 2021-02-23T21:28:46Z

Should dependencies still try to start with default values? What if your user data was trying to set custom host containers, because you don't want to run the default AWS-based ones? I don't think it's safe to assume. To debug, you're going to be looking at journal output, which you can see on the console in any case, without having to access the node. In the error you pasted it's pretty clear what happened, I think; "unknown field foo" or "error parsing TOML" with some more details. If the real problem is that it's hard / unclear how to check console output, we can improve / document that...

bcressey · 2021-02-24T01:28:34Z

I've updated the title with what I consider the real issue, which is that the behavior is inconsistent. The failure mode should be the same.

bcressey · 2021-02-24T01:33:42Z

Should dependencies still try to start with default values? What if your user data was trying to set custom host containers, because you don't want to run the default AWS-based ones? I don't think it's safe to assume.

I agree. It's definitely surprising that the default host containers are used when custom host containers are specified along with an invalid setting. It's not a huge leap from "surprising" to a serious misconfiguration with security or operational implications.

The "invalid TOML" path is what we intend.

zmrow · 2021-03-01T22:47:13Z

I think there is a misunderstanding.

For AWS instances, early-boot-config fetches the IID (for region) and user-data from IMDS, and if fetching and parsing both is successful, sends them to the API.

@bcressey In this specific case, settings.aws.region doesn't get set because early-boot-config bombs out early because of invalid TOML. The system attempts to start the host containers but can't because the host container "source" template includes settings.aws.region. In this case, your system was hosed regardless. A few lines down (pardon the formatting):

...sundog[2853]...Failed to render setting 'settings.host-containers.control.source' 
from template '{{ ecr-prefix settings.aws.region}}'

AWS is special currently because it has multiple "sources" for data (even though they're both IMDS). Perhaps we should support PATCH-ing data to the API immediately after we fetch it successfully? In this case, your host containers would have started successfully, but the system still would have been crippled.

bcressey · 2021-03-02T03:11:32Z

I expect the same behavior from these two cases. I'm aware that the behavior is different, but that's not a misunderstanding, it's a bug.

arnaldo2792 · 2021-04-08T00:00:21Z

We fixed this in #1387

bcressey added type/bug Something isn't working priority/p1 labels Feb 23, 2021

bcressey changed the title ~~invalid user data prevents host containers from running~~ inconsistent handling of invalid user data Feb 24, 2021

gregdek assigned zmrow Feb 26, 2021

jhaynes unassigned zmrow Mar 31, 2021

gregdek added this to the techdebt milestone Apr 1, 2021

arnaldo2792 linked a pull request Apr 7, 2021 that will close this issue

Add support for bootstrap containers via API settings #1387

Merged

arnaldo2792 closed this as completed Apr 8, 2021

jhaynes modified the milestones: techdebt, 1.0.8 Apr 9, 2021

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

inconsistent handling of invalid user data #1346

inconsistent handling of invalid user data #1346

bcressey commented Feb 23, 2021

tjkirch commented Feb 23, 2021 •

edited

Loading

bcressey commented Feb 24, 2021

bcressey commented Feb 24, 2021

zmrow commented Mar 1, 2021

bcressey commented Mar 2, 2021

arnaldo2792 commented Apr 8, 2021

inconsistent handling of invalid user data #1346

inconsistent handling of invalid user data #1346

Comments

bcressey commented Feb 23, 2021

tjkirch commented Feb 23, 2021 • edited Loading

bcressey commented Feb 24, 2021

bcressey commented Feb 24, 2021

zmrow commented Mar 1, 2021

bcressey commented Mar 2, 2021

arnaldo2792 commented Apr 8, 2021

tjkirch commented Feb 23, 2021 •

edited

Loading