Flaky network with CPU soft lockups after upgrade to 31.20200323.2.0 #665
@wcurry thanks for the very accurate report and for the bisection! I'd agree the console logging seems unrelated, while IOPS-throttling seems more interesting (but possibly not exactly the same thing). Here the most suspicious thing to me seems to be this:
Overall it looks like the kernel is having trouble keeping up with the load, and it seems to be somehow specific to AWS or related to ENA. I don't think there is anything FCOS-specific at play. /cc @davdunc @mmerkes
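For reference, the lockups the title refers to are reported by the kernel watchdog and can be checked with standard tooling (a quick sketch):

```
# Soft lockups show up in the kernel ring buffer
dmesg | grep -i 'soft lockup'

# Or via the journal, restricted to kernel messages
journalctl -k --grep 'soft lockup'
```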
I created this issue at amzn-drivers: amzn/amzn-drivers#147. Of note, the ena driver version is the same between those two kernels. Here are the notes I provided in that issue:

GoodOS: Fedora CoreOS 31.20200310.3.0
BadOS: Fedora CoreOS 31.20200323.2.0
Good - FCOS 31.20200310.3.0 / 5.5.8-200.fc31.x86_64
Bad - FCOS 31.20200323.2.0 / 5.5.10-200.fc31.x86_64
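One way to confirm the driver-version parity across the two releases (a sketch; `eth0` is assumed to be the ENA-backed interface):

```
# Report the driver name, version, and firmware bound to the interface
ethtool -i eth0

# Or query the module metadata directly
modinfo ena | grep -E '^(version|srcversion)'
```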
To confirm, do you also see this issue in f32 versions? (It might be worth checking f33 as well.) Edit: Ah right, I see you found this because it was present in f32. Would you be able to test f33 as well? It's possible that it was fixed by the newer kernel there. If so, it might be easier to just wait until f33 hits testing and stable.
I happened to test 33.20201101.1.0 and saw the issue there.
I scanned the changelogs for 5.5.9 and 5.5.10 and nothing obvious jumped out.
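For anyone repeating that scan, a sketch of two ways to pull those changelogs (assuming a Fedora host for the first, and that kernel.org still publishes per-release ChangeLog files for the second):

```
# Fedora kernel packages carry their changelog in the RPM metadata
rpm -q --changelog kernel | less

# Upstream stable changelogs are published per release
curl -s https://cdn.kernel.org/pub/linux/kernel/v5.x/ChangeLog-5.5.9 | less
curl -s https://cdn.kernel.org/pub/linux/kernel/v5.x/ChangeLog-5.5.10 | less
```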
I found the issue. We were enabling SMT on first boot by running a service with the following command:
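(The original command block didn't survive the copy; a minimal sketch of what such a first-boot service could look like, assuming it wrote to the kernel's standard SMT control interface — the unit name here is hypothetical:)

```
# smt-on.service (hypothetical name) - enable SMT on first boot
[Unit]
Description=Enable SMT
# /sys/devices/system/cpu/smt/control accepts "on", "off", and "forceoff"
ConditionPathExists=/sys/devices/system/cpu/smt/control

[Service]
Type=oneshot
ExecStart=/usr/bin/bash -c 'echo on > /sys/devices/system/cpu/smt/control'

[Install]
WantedBy=multi-user.target
```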
Later reboots took advantage of another unit that appended "mitigations=auto" to the kernel arguments. As of 31.20200323.2.0, this apparently stopped working. When adding only "--reboot" to our kargs unit and removing the /sys/devices... unit above, our etcd cluster would not survive simultaneous immediate reboots. I have added the following to each of the systemd service units (excluding the kargs unit) to delay their start until the second boot:
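(The appended snippet didn't survive the copy either; a plausible reconstruction, assuming the delay was keyed on FCOS's first-boot kernel argument — `ignition.firstboot` is only present on the very first boot — with the kargs unit presumably running something like `rpm-ostree kargs --append=mitigations=auto --reboot`:)

```
[Unit]
# Skip this unit on the Ignition first boot; it then runs on the
# second and later boots, after the new kargs have taken effect.
ConditionKernelCommandLine=!ignition.firstboot
```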
To clarify the last comment, our etcd/kube-system hosts didn't recover from a simultaneous reboot due to the use of bootkube and the lack of pod checkpointing. I'm closing this issue as we've got it working.
@wcurry - I'm glad you were able to figure out how to get unblocked. Thanks for updating this issue.
Describe the bug
Network is flaky after upgrade from 31.20200310.3.0 to 31.20200323.2.0.
Reproduction steps
Steps to reproduce the behavior:
Expected behavior
CPUs should not lock up. The network should deliver all packets. The network interface should not reset.
Actual behavior
System details
Ignition config
Additional information
While performing an upgrade from 31.20200310.3.0 to the latest FCOS 32, I tracked this issue back to 31.20200323.2.0. 31.20200310.3.0 (the next-oldest AMI available) does not exhibit the issue.
I found this issue (amzn/amzn-drivers#84), which suggested console logging could be to blame. We had SELinux in permissive mode and it was spamming the console. I never observed the "too much work for irq..." error. I disabled SELinux anyway to clean up dmesg, and the problem persisted in a new cluster.
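For anyone retracing this, a hedged sketch of how one might confirm the AVC spam and disable SELinux to rule it out (paths are the stock Fedora ones):

```
# Count SELinux AVC denials in the kernel ring buffer
sudo dmesg | grep -c 'avc: '

# Disable SELinux entirely (permissive mode still logs denials);
# takes effect on the next boot
sudo sed -i 's/^SELINUX=.*/SELINUX=disabled/' /etc/selinux/config
sudo systemctl reboot
```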
I found this issue (awslabs/amazon-eks-ami#454), which suggests IOPS throttling may be to blame. We have an NVMe disk and had 3 gp2 volumes attached. None of the volumes had used up their burst budget, but the root volume had come close. I converted all of these gp2 volumes to io1 with 3000 IOPS. The problem still exists with these settings.
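For reference, that volume change can be done in place with the AWS CLI (a sketch; the volume ID is a placeholder):

```
# Convert an attached gp2 volume to io1 with 3000 provisioned IOPS
aws ec2 modify-volume \
    --volume-id vol-0123456789abcdef0 \
    --volume-type io1 \
    --iops 3000

# Poll the modification until it reports "optimizing" or "completed"
aws ec2 describe-volumes-modifications \
    --volume-ids vol-0123456789abcdef0
```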
dmesg errors:
etcd errors:
etcd "timed out" warnings: