
Soft lock up issues in eks nodes with logs indicating ena issue #129

Closed
cshivashankar opened this issue May 3, 2020 · 7 comments

@cshivashankar

Hi,

We are running EKS clusters on version 1.14. We often experience node issues where a node becomes unresponsive due to soft lockups. When the logs were analyzed, the following information was found:

kernel: ena 0000:00:08.0 eth3: Found a Tx that wasn't completed on time, qid 4, index 787.
kernel: watchdog: BUG: soft lockup - CPU#58 stuck for 23s! [kworker/u146:7:707658]
kernel: Modules linked in: rpcsec_gss_krb5 auth_rpcgss nfsv4 dns_resolver nfs lockd grace fscache xfrm_user xfrm_algo br_netfilter bridge stp llc ip6table_nat nf_conntrack_ipv6 nf_defrag_ipv6 nf_nat_ipv6 ip6_tables veth iptable_mangle xt_connmark nf_conntrack_netlink nfnetlink xt_statistic xt_recent ipt_REJECT nf_reject_ipv4 xt_addrtype xt_nat xt_tcpudp ipt_MASQUERADE nf_nat_masquerade_ipv4 xt_comment xt_mark iptable_nat nf_conntrack_ipv4 nf_defrag_ipv4 nf_nat_ipv4 iptable_filter xt_conntrack nf_nat nf_conntrack overlay sunrpc crc32_pclmul ghash_clmulni_intel pcbc mousedev aesni_intel psmouse evdev aes_x86_64 crypto_simd glue_helper pcc_cpufreq button cryptd ena ip_tables x_tables xfs libcrc32c nvme crc32c_intel nvme_core ipv6 crc_ccitt autofs4
kernel: CPU: 58 PID: 707658 Comm: kworker/u146:7 Tainted: G             L  4.14.165-133.209.amzn2.x86_64 #1
kernel: Hardware name: Amazon EC2 c5.18xlarge/, BIOS 1.0 10/16/2017
kernel: Workqueue: writeback wb_workfn (flush-259:0)
kernel: task: ffff8893fefa0000 task.stack: ffffc9002daec000
kernel: RIP: 0010:__list_del_entry_valid+0x28/0x90
kernel: RSP: 0018:ffffc9002daefcc0 EFLAGS: 00000202 ORIG_RAX: ffffffffffffff10
kernel: RAX: ffff88916d19b470 RBX: ffffc9002daefce8 RCX: dead000000000200
kernel: RDX: ffff88804b1f36c8 RSI: ffff888013237e08 RDI: ffff888013237e08
kernel: RBP: ffff88916d19b470 R08: ffff889488d1eb48 R09: 0000000180400037
kernel: R10: ffffc9002daefe10 R11: 0000000000000000 R12: ffff88916d608800
kernel: R13: ffff888013237e08 R14: ffffc9002daefd78 R15: ffff889488d1eb48
kernel: FS:  0000000000000000(0000) GS:ffff88a371380000(0000) knlGS:0000000000000000
kernel: CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
kernel: CR2: 000010923314f000 CR3: 000000000200a002 CR4: 00000000007606e0
kernel: DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
kernel: DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
kernel: PKRU: 55555554
kernel: Call Trace:
kernel: move_expired_inodes+0x6a/0x230
kernel: queue_io+0x61/0xf0
kernel: wb_writeback+0x258/0x300
kernel: ? wb_workfn+0xdf/0x370
kernel: ? __local_bh_enable_ip+0x6c/0x70
kernel: wb_workfn+0xdf/0x370
kernel: ? __switch_to_asm+0x41/0x70
kernel: ? __switch_to_asm+0x35/0x70
kernel: process_one_work+0x17b/0x380
kernel: worker_thread+0x2e/0x390
kernel: ? process_one_work+0x380/0x380
kernel: kthread+0x11a/0x130
kernel: ? kthread_create_on_node+0x70/0x70
kernel: ret_from_fork+0x35/0x40

It shows that a Tx packet wasn't completed in time by the ENA driver on eth3, which could have triggered the issue. Because of this soft lockup, the node doesn't recover. Even though this is an EKS node issue, the logs indicate there could be a problem with ENA.

Can you please confirm whether there is a known ENA issue related to this? How can it be resolved?
Kindly let me know if any other information is required from my end.
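For anyone triaging similar reports, a minimal sketch of how the saved node logs could be summarized to see whether the ENA Tx-timeout warnings cluster on one queue (the sample log file below is hypothetical; on a real node the lines would come from dmesg or /var/log/messages):

```shell
# Sample log standing in for the real node log (assumption: logs were
# saved to a file before the node was replaced).
cat > /tmp/ena_sample.log <<'EOF'
kernel: ena 0000:00:08.0 eth3: Found a Tx that wasn't completed on time, qid 4, index 787.
kernel: ena 0000:00:08.0 eth3: Found a Tx that wasn't completed on time, qid 4, index 912.
EOF
# Count Tx-timeout events per queue id; a single dominating qid can hint
# at one stuck Tx queue rather than a device-wide stall.
grep -o 'qid [0-9]*' /tmp/ena_sample.log | sort | uniq -c | sort -rn
```

With the two sample lines above, this prints a count of 2 for qid 4.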

@sameehj

sameehj commented May 3, 2020

Hi @cshivashankar ,

To make this efficient, can you please contact me at my email address [email protected] so that we can continue investigating the issue offline? I need you to provide me with more information.

Thanks,
Sameeh

@cshivashankar
Author

Thanks @sameehj for the quick response.
I have sent you an email.

Regards,
Chetan

@sameehj

sameehj commented May 14, 2020

@cshivashankar

I'm closing this issue for now since we have resolved it offline, please feel free to reopen if needed.

Thanks,
Sameeh

@sameehj sameehj closed this as completed May 14, 2020
@eeeschwartz

@sameehj we're experiencing very similar log messages and behavior on EKS 1.16 running Amazon Linux. Are you able to share details of the resolution you reached? TIA -Erik

@AWSNB
Contributor

AWSNB commented May 28, 2020 via email

@cshivashankar
Author

Hi @eeeschwartz, I have raised an issue for the AMI at awslabs/amazon-eks-ami#454.
Can you please post your findings there as well, so it's easier to collect data in case it's an issue with the AMI or kernel?

@eeeschwartz

I'm cautiously optimistic that upgrading our CNI plugin to 1.6.1 has resolved the issue. We were seeing 1-2 nodes churn per hour; since upgrading we have had 10 hours without churn, so it looks promising. Thanks for the help.
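For others hitting this thread, a hedged sketch of checking whether a cluster's amazon-k8s-cni image tag is at least the 1.6.1 version mentioned above. The IMAGE value here is a made-up sample; on a live cluster the image string would typically come from `kubectl describe daemonset aws-node -n kube-system`:

```shell
# Sample image string (hypothetical registry host; on a real cluster,
# read this from the aws-node daemonset instead).
IMAGE="example.ecr.aws/amazon-k8s-cni:v1.5.7"

# Strip everything up to and including the ":v" to get the bare version.
ver=${IMAGE##*:v}

# sort -V does a version-aware compare: if 1.6.1 sorts first, the running
# version is >= 1.6.1; otherwise it is older.
if [ "$(printf '%s\n' "1.6.1" "$ver" | sort -V | head -n1)" = "1.6.1" ]; then
  echo "CNI $ver >= 1.6.1"
else
  echo "CNI $ver is older than 1.6.1 - consider upgrading"
fi
```

With the sample tag v1.5.7 this takes the "older than 1.6.1" branch.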


4 participants