Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

amazon-k8s-cni crashes on startup (unless undocumented env var set) #1725

Closed
johngmyers opened this issue Nov 5, 2021 · 6 comments
Closed
Labels

Comments

@johngmyers
Copy link

What happened:

amazon-k8s-cni container dereferenced nil pointer upon startup

Attach logs

==== START logs for container aws-node of pod kube-system/aws-node-657r6 ====
{"level":"info","ts":"2021-11-05T01:24:37.847Z","caller":"entrypoint.sh","msg":"Validating env variables ..."}
{"level":"info","ts":"2021-11-05T01:24:37.848Z","caller":"entrypoint.sh","msg":"Install CNI binaries.."}
{"level":"info","ts":"2021-11-05T01:24:37.880Z","caller":"entrypoint.sh","msg":"Starting IPAM daemon in the background ... "}
{"level":"info","ts":"2021-11-05T01:24:37.885Z","caller":"entrypoint.sh","msg":"Checking for IPAM connectivity ... "}
panic: runtime error: invalid memory address or nil pointer dereference
[signal SIGSEGV: segmentation violation code=0x1 addr=0x39 pc=0x563711dac568]

goroutine 291 [running]:
github.com/aws/amazon-vpc-cni-k8s/pkg/ipamd.(*IPAMContext).StartNodeIPPoolManager(0x0)
	/go/src/github.com/aws/amazon-vpc-cni-k8s/pkg/ipamd/ipamd.go:633 +0x28
created by main._main
	/go/src/github.com/aws/amazon-vpc-cni-k8s/cmd/aws-k8s-agent/main.go:64 +0x32c
==== END logs for container aws-node of pod kube-system/aws-node-657r6 ====

What you expected to happen:

Container to not panic. Bonus points for running successfully, or at least including the reason for the failure in the container logs.

How to reproduce it (as minimally and precisely as possible):

  • Build containers off of 169d28d
  • Modify existing aws-node Daemonset manifest to use newly built containers in its image fields.

Anything else we need to know?:

As of the writing of this bug, the above commit is the most recent on the release-1.10 branch.

The panic indicates that StartNodeIPPoolManager was called on a nil IPAMContext. This would have to result from the code:

if !c.isConfigValid() {
return nil, err
}

As the body of the if statement does not assign to err, it is always nil at that point.

Further analysis shows the reason that isConfigValid() is returning false is this:

} else if !c.enableIPv4 && !c.enableIPv6 {
log.Errorf("IPv4 and IPv6 are both disabled. One of them have to be enabled")
return false
}

Strangely, isIPv4Enabled() defaults to false. I would have not made such an incompatible change, but it's your project.

I note that despite the requirement that having either ENABLE_IPv4 or ENABLE_IPv6 be set in the environment, neither variable is documented in the "CNI Configuration Variables" section of README.md.

I also suggest that having isConfigValid() return an error instead of logging to an obscure on-host-filesystem file and merely returning bool would make troubleshooting invalid configuration problems easier.

Environment:

@jayanthvn
Copy link
Contributor

This was recently merged to master - #1698 which will prevent the exception.

Yes readme is pending update which will be done prior to v1.10.0 release.

/cc @achevuru

@achevuru
Copy link
Contributor

achevuru commented Nov 5, 2021

@johngmyers These changes are not yet released. There will be corresponding CNI manifest and readme updates when we release IPv6 support. You are trying changes on top of master with an old CNI manifest.

New Manifest on top of master includes these new env variables - https://github.com/aws/amazon-vpc-cni-k8s/blob/master/config/master/aws-k8s-cni.yaml#L193.

@johngmyers
Copy link
Author

@achevuru I am aware the changes are not yet released; that is apparent from the lack of tags on the branch. Nonetheless, panicing on lack of required settings is probably something you'd want to backport the fix for prior to release.

@johngmyers
Copy link
Author

I note that returning the actual error instead of a generic "failed to validate configuration" message would be more helpful to diagnosis.

@achevuru
Copy link
Contributor

achevuru commented Nov 5, 2021

@johngmyers Understand what you're trying to say but the problem only arises when someone attempts to use the latest image on an old incompatible CNI manifest (i.e.,) updating the image tag on a v1.9 or v1.8 CNI manifest. If an user wants to upgrade to v1.10 - they would need to apply the v1.10 manifest which will have the required default values required for the new image. Same will be true for few of the older versions as well (trying to use 1.5 manifest for 1.6 etc). We'll consider taking it in.

Actual error log will be printed in the config validation function. For ex here and here. Above linked PR added a generic error msg in the caller.

@jayanthvn
Copy link
Contributor

#1698 will be part of v1.10.1 and that should prevent this issue. As @achevuru mentioned the env variables will be included in the v1.10.0 default manifest upon release.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
None yet
Development

No branches or pull requests

3 participants