
Soft lock up issues in eks nodes with logs indicating ena issue #129

Closed
cshivashankar opened this issue May 3, 2020 · 7 comments

@cshivashankar

Hi,

We are running EKS clusters on version 1.14. We often experience node issues where a node becomes unresponsive due to soft lockups. When the logs were analyzed, the following information was found:

kernel: ena 0000:00:08.0 eth3: Found a Tx that wasn't completed on time, qid 4, index 787.
kernel: watchdog: BUG: soft lockup - CPU#58 stuck for 23s! [kworker/u146:7:707658]
kernel: Modules linked in: rpcsec_gss_krb5 auth_rpcgss nfsv4 dns_resolver nfs lockd grace fscache xfrm_user xfrm_algo br_netfilter bridge stp llc ip6table_nat nf_conntrack_ipv6 nf_defrag_ipv6 nf_nat_ipv6 ip6_tables veth iptable_mangle xt_connmark nf_conntrack_netlink nfnetlink xt_statistic xt_recent ipt_REJECT nf_reject_ipv4 xt_addrtype xt_nat xt_tcpudp ipt_MASQUERADE nf_nat_masquerade_ipv4 xt_comment xt_mark iptable_nat nf_conntrack_ipv4 nf_defrag_ipv4 nf_nat_ipv4 iptable_filter xt_conntrack nf_nat nf_conntrack overlay sunrpc crc32_pclmul ghash_clmulni_intel pcbc mousedev aesni_intel psmouse evdev aes_x86_64 crypto_simd glue_helper pcc_cpufreq button cryptd ena ip_tables x_tables xfs libcrc32c nvme crc32c_intel nvme_core ipv6 crc_ccitt autofs4
kernel: CPU: 58 PID: 707658 Comm: kworker/u146:7 Tainted: G             L  4.14.165-133.209.amzn2.x86_64 #1
kernel: Hardware name: Amazon EC2 c5.18xlarge/, BIOS 1.0 10/16/2017
kernel: Workqueue: writeback wb_workfn (flush-259:0)
kernel: task: ffff8893fefa0000 task.stack: ffffc9002daec000
kernel: RIP: 0010:__list_del_entry_valid+0x28/0x90
kernel: RSP: 0018:ffffc9002daefcc0 EFLAGS: 00000202 ORIG_RAX: ffffffffffffff10
kernel: RAX: ffff88916d19b470 RBX: ffffc9002daefce8 RCX: dead000000000200
kernel: RDX: ffff88804b1f36c8 RSI: ffff888013237e08 RDI: ffff888013237e08
kernel: RBP: ffff88916d19b470 R08: ffff889488d1eb48 R09: 0000000180400037
kernel: R10: ffffc9002daefe10 R11: 0000000000000000 R12: ffff88916d608800
kernel: R13: ffff888013237e08 R14: ffffc9002daefd78 R15: ffff889488d1eb48
kernel: FS:  0000000000000000(0000) GS:ffff88a371380000(0000) knlGS:0000000000000000
kernel: CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
kernel: CR2: 000010923314f000 CR3: 000000000200a002 CR4: 00000000007606e0
kernel: DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
kernel: DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
kernel: PKRU: 55555554
kernel: Call Trace:
kernel: move_expired_inodes+0x6a/0x230
kernel: queue_io+0x61/0xf0
kernel: wb_writeback+0x258/0x300
kernel: ? wb_workfn+0xdf/0x370
kernel: ? __local_bh_enable_ip+0x6c/0x70
kernel: wb_workfn+0xdf/0x370
kernel: ? __switch_to_asm+0x41/0x70
kernel: ? __switch_to_asm+0x35/0x70
kernel: process_one_work+0x17b/0x380
kernel: worker_thread+0x2e/0x390
kernel: ? process_one_work+0x380/0x380
kernel: kthread+0x11a/0x130
kernel: ? kthread_create_on_node+0x70/0x70
kernel: ret_from_fork+0x35/0x40

It shows that a Tx packet wasn't completed in time by the ENA driver on eth3, which could have triggered the issue. Because of this soft lockup, the node doesn't recover. Even though this is an EKS node issue, the logs indicate there could be a problem with ENA.

Can you please confirm whether there is a known ENA issue related to this? How can it be resolved?
Kindly let me know if any other information is required from my end.
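For anyone triaging similar reports, a minimal sketch of how the saved node logs could be summarized to see whether the ENA Tx-timeout warnings cluster on one queue (the sample log file below is hypothetical; on a real node the lines would come from dmesg or /var/log/messages):

```shell
# Sample log standing in for the real node log (assumption: logs were
# saved to a file before the node was replaced).
cat > /tmp/ena_sample.log <<'EOF'
kernel: ena 0000:00:08.0 eth3: Found a Tx that wasn't completed on time, qid 4, index 787.
kernel: ena 0000:00:08.0 eth3: Found a Tx that wasn't completed on time, qid 4, index 912.
EOF
# Count Tx-timeout events per queue id; a single dominating qid can hint
# at one stuck Tx queue rather than a device-wide stall.
grep -o 'qid [0-9]*' /tmp/ena_sample.log | sort | uniq -c | sort -rn
```

With the two sample lines above, this prints a count of 2 for qid 4.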

@sameehj

sameehj commented May 3, 2020

Hi @cshivashankar ,

To make this efficient, can you please contact me at my email address [email protected] so that we can continue investigating the issue offline? I need you to provide me with more information.

Thanks,
Sameeh

@cshivashankar
Author

Thanks @sameehj for the quick response.
I have sent you an email.

Regards,
Chetan

@sameehj

sameehj commented May 14, 2020

@cshivashankar

I'm closing this issue for now since we have resolved it offline, please feel free to reopen if needed.

Thanks,
Sameeh

@sameehj sameehj closed this as completed May 14, 2020
@eeeschwartz

@sameehj we're experiencing very similar log messages and behavior on EKS 1.16 running Amazon Linux. Are you able to share details of the resolution you reached? TIA -Erik

@AWSNB
Contributor

AWSNB commented May 28, 2020 via email

@cshivashankar
Author

Hi @eeeschwartz, I have raised an issue for the AMI at awslabs/amazon-eks-ami#454.
Can you please post your findings there as well, so it's easier to collect data in case it's an issue with the AMI or kernel?

@eeeschwartz

I'm cautiously optimistic that upgrading our CNI plugin to 1.6.1 has resolved the issue. We were seeing 1-2 nodes churn per hour; since upgrading we have had 10 hours without churn, so it looks promising. Thanks for the help.
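For others hitting this thread, a hedged sketch of checking whether a cluster's amazon-k8s-cni image tag is at least the 1.6.1 version mentioned above. The IMAGE value here is a made-up sample; on a live cluster the image string would typically come from `kubectl describe daemonset aws-node -n kube-system`:

```shell
# Sample image string (hypothetical registry host; on a real cluster,
# read this from the aws-node daemonset instead).
IMAGE="example.ecr.aws/amazon-k8s-cni:v1.5.7"

# Strip everything up to and including the ":v" to get the bare version.
ver=${IMAGE##*:v}

# sort -V does a version-aware compare: if 1.6.1 sorts first, the running
# version is >= 1.6.1; otherwise it is older.
if [ "$(printf '%s\n' "1.6.1" "$ver" | sort -V | head -n1)" = "1.6.1" ]; then
  echo "CNI $ver >= 1.6.1"
else
  echo "CNI $ver is older than 1.6.1 - consider upgrading"
fi
```

With the sample tag v1.5.7 this takes the "older than 1.6.1" branch.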


4 participants