
Memory Leak on node versions below 18.17.0 #5851

Closed
matthewkeil opened this issue Aug 5, 2023 · 13 comments · Fixed by #5873
Labels: meta-bug (Issues that identify a bug and require a fix.)

Comments

matthewkeil (Member) commented Aug 5, 2023

NOTE: There is a memory leak on Lodestar version 1.10 with node versions below 18.17.0.

Make sure to update node to a version >=18.17.0 or >=20.1.0. We recommend installing the latest release of node 20.
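For anyone pinned to an older runtime, a minimal startup guard along these lines (hypothetical, not part of Lodestar) can flag affected versions:

```ts
// Hypothetical guard, not part of Lodestar: warn when the running node
// version still bundles the undici fetch leak (fixed in 18.17.0 / 20.1.0).
const [major, minor] = process.versions.node.split(".").map(Number);

const hasFetchLeak =
  major < 18 ||
  (major === 18 && minor < 17) ||
  major === 19 || // also excluded by the engines range Lodestar later adopted
  (major === 20 && minor < 1);

if (hasFetchLeak) {
  console.warn(
    `node ${process.version} leaks memory via native fetch; upgrade to >=18.17.0 or >=20.1.0`
  );
}
```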


Describe the bug

Memory is growing unbounded on a number of groups.

Affected Groups

beta on version v1.10.0/mkeil/network-worker-new-space/d0f01f3 (recently rebased onto unstable, no leak prior)
feat1 on version v1.10.0/mkeil/network-worker-new-space/d0f01f3 (recently rebased onto unstable, no leak prior)
feat2 on version v1.9.2/tuyen/gossipsub_protons/893eff7
feat3 on version v1.10.0/bbadfed

Affected in nogroup

nogroup-ctvoss-2 on version v1.10.0/tuyen/v1.10.0_rc.1_no_ee/83f6ba7

Unaffected in nogroup

nogroup-ctvoss-0 on version v1.8.0/stable/a4b29cf
nogroup-ctvoss-1 on version v1.8.0/tuyen/PR_5559_5592/b2a3949
nogroup-ctvoss-3 on version v1.8.0/tuyen/PR_5574_5592/2101f95
nogroup-ctvoss-4 on version v1.8.0/a4b29cf
nogroup-hzax41-0 on version v1.8.0/tuyen/network_thread_profile/f245849

Unaffected Groups

unstable on version v1.10.0/a7466f5
stable on version v1.10.0/7546292
stress_test on v1.8.0/8464c6b
goerli on v1.10.0/7546292
lido-goerli on v1.10.0/7546292
sepolia on v1.10.0/7546292

[Screenshots: memory usage dashboards for the affected groups, 2023-08-05]
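(For context, heap growth of this shape can also be confirmed directly on a host with a trivial sampler like the sketch below; illustrative only, not part of Lodestar.)

```ts
// Illustrative heap sampler: log heap and RSS once a minute; on an affected
// node version the numbers climb steadily instead of plateauing after GC.
setInterval(() => {
  const {heapUsed, rss} = process.memoryUsage();
  console.log(
    `heapUsed=${(heapUsed / 1e6).toFixed(1)}MB rss=${(rss / 1e6).toFixed(1)}MB`
  );
}, 60_000);
```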

Expected behavior

Memory should remain stable

Steps to reproduce

No response

Additional context

No response

Operating system

Linux

Lodestar version or commit hash

somewhere post 1.9.2 or maybe 1.10.0

matthewkeil added the meta-bug label Aug 5, 2023
matthewkeil (Member, Author) commented Aug 6, 2023

Rebased the beta and feat1 branches onto stable and will see if the memory leak re-emerges, to try to narrow down the commit where it originates.

Current stable seems good, so it seemed like a good baseline to search forward from.

twoeths (Contributor) commented Aug 7, 2023

v1.9.2/tuyen/gossipsub_protons/893eff7 => this branch has its own memory issue, ChainSafe/js-libp2p-gossipsub#318 (comment)

Update: protons may not be the root cause; it could be the same underlying Lodestar issue.

matthewkeil (Member, Author) commented

> Rebased the beta and feat1 branches onto stable and will see if the memory leak re-emerges, to try to narrow down the commit where it originates.
>
> Current stable seems good, so it seemed like a good baseline to search forward from.

Turns out the leak was also on stable. Currently the stable and unstable groups are deployed using docker and do not show the leak. I deployed stable to feat2 and unstable to feat3 on bare metal and will let them run overnight. If the leak proves to be there (the first hour of the run looks like it is), I will use beta, feat1, feat2, feat3, and any other instance we have, deploy a different sha to each in a methodical manner, and find the first commit it shows up in to narrow down the code that is causing it.

matthewkeil (Member, Author) commented Aug 7, 2023

Leak is present on the stable and unstable branches when running on bare metal but not when running in a container. Confirmed the code is the same version, just running on a different platform.

stable is version 7546292ef150c37a9a912b349d51f8ea59f686fb
unstable is version bbadfed24a5dceeb6a21b5bba08c2965dabb630d

twoeths (Contributor) commented Aug 8, 2023

All bare metal deployments run NodeJS v18.16.1, while docker deployments run NodeJS 20, which is what we should use from Lodestar v1.10.0 onward. I deployed feat3 with NodeJS v20.5.0 to confirm.

matthewkeil (Member, Author) commented

Deployed several commits from the last couple of weeks to beta and feat2; beta was on node 18 and feat2 on node 20. No leaks were present on v20, but the instance on node 18 running code that included the native fetch implementation was leaking. No other commits between the leaking and non-leaking versions were suspect, so I narrowed it down to commit 3e65be7. Looking through the node issues, nodejs/node#46435 seemed a possibility and I wanted to test for it.

As a note, we pass an AbortSignal into fetch in the beacon-node but not in the validator, yet the leak was present in both.
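Both call shapes, sketched below with an illustrative endpoint (not our actual client code), leaked on node 18.16.x, which points at undici itself rather than our signal handling:

```ts
// Illustrative only: both shapes leaked on node 18.16.x.

// beacon-node style: fetch with an AbortSignal
const controller = new AbortController();
const withSignal = await fetch("http://localhost:9596/eth/v1/node/health", {
  signal: controller.signal,
});

// validator style: plain fetch, no signal
const withoutSignal = await fetch("http://localhost:9596/eth/v1/node/health");

console.log(withSignal.status, withoutSignal.status);
```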

Built a branch mkeil/revert-native-fetch off yesterday's HEAD that reverts the native fetch swap commit. Deployed it to the beta group on node 18 and the leak is gone.

[Screenshots: memory metrics flat after the revert, 2023-08-10]

matthewkeil (Member, Author) commented

The leak was fixed in nodejs/undici#2049 and released in 5.21.1:
https://github.com/nodejs/undici/releases/tag/v5.21.1

That fix was applied to node 20.1.0:
https://github.com/nodejs/node/releases/tag/v20.1.0

> [ba9ec91f0e] - deps: update undici to 5.21.1 (Node.js GitHub Bot) nodejs/node#47488

I do not see that version of undici backported to node 18, which confirms the behavior we saw empirically.
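As an aside, recent node releases expose the bundled undici version at runtime, which gives a quick way to check whether a given binary carries the fix (assuming process.versions.undici is available on the version in question):

```ts
// Print the undici version bundled with this node binary; the leak is
// fixed in undici >= 5.21.1 (node >= 18.17.0 or >= 20.1.0).
console.log(`node ${process.versions.node}, undici ${process.versions.undici}`);
```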

nflaig (Member) commented Aug 10, 2023

Thanks @matthewkeil for the thorough testing. I am also pretty confident this is related to native fetch on older node versions, after reviewing all the PRs we did in 1.10 and going through the nodejs issues. I also never saw this on my end on any server, since I updated to node 20 pretty early.

> I do not see that version of undici backported to node 18, which confirms the behavior we saw empirically.

The fix has been backported to v18.17.0 as well.

matthewkeil (Member, Author) commented Aug 10, 2023

> The fix has been backported to v18.17.0 as well.

Nice catch @nflaig! I was looking at the synopsis on the releases page and didn't notice the tilde! 🤦‍♂️

On the upside, it still confirms the diagnosis, because we were running 18.16.1. So I think we are safe with 18.17.0. I will deploy that to beta with unstable to double check and we can make a docs note.

nflaig (Member) commented Aug 10, 2023

> So I think we are safe with 18.17.0. I will deploy that to beta with unstable to double check and we can make a docs note.

Sounds good, maybe we just bump the version in package.json for the next release:

lodestar/package.json, lines 4 to 6 @ 34d8955:

```json
"engines": {
  "node": ">=18.15.0"
},
```
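For reference, the range that eventually shipped (quoted in the closing comment below) looks like this:

```json
"engines": {
  "node": ">=18.17.0 <19 || >=20.1.0"
},
```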

matthewkeil (Member, Author) commented

> > So I think we are safe with 18.17.0. I will deploy that to beta with unstable to double check and we can make a docs note.
>
> Sounds good, maybe we just bump the version in package.json for the next release

Even better. Test is deploying now! 🚀

nflaig (Member) commented Aug 12, 2023

Re-opening for better visibility; I would suggest we keep this open for a few more weeks.

nflaig reopened this Aug 12, 2023
nflaig changed the title from "Memory Leak in node versions below 18.17.0" to "Memory Leak with node versions below 18.17.0" Aug 12, 2023
nflaig changed the title from "Memory Leak with node versions below 18.17.0" to "Memory Leak on node versions below 18.17.0" Aug 12, 2023
nflaig (Member) commented Oct 5, 2023

Closing as v1.11.1 strictly requires a node version >=18.17.0:

```json
"node": ">=18.17.0 <19 || >=20.1.0"
```
