Using debug jemalloc version with CMSSW_12_0_2 #35729

Closed
smorovic opened this issue Oct 19, 2021 · 10 comments

@smorovic
Contributor

Hello,

I would like to retool CMSSW to a debug jemalloc version to investigate memory corruption issues seen in HLT during 15-17 Oct. The crashes appeared all weekend and, as of now, have disappeared since Monday morning. They are possibly related to condition data, since they coincided with network issues, but we are not certain about it. Still, we would like to have tools ready to debug them in case they appear again.

Frontier experts, who helped with the investigation since many crashes appeared in frontier client code, suggested turning on debugging in the allocator.

I see that CMSSW 12_0_2 (slc7_amd64_gcc900) currently installs jemalloc v5.2.1, but also a debug version 4.5.0:
external+jemalloc+5.2.1-cms
external+jemalloc-debug+4.5.0-cms

Are debug builds suitable for inspecting memory corruption issues? For example, is thread caching disabled (MALLOC_CONF=tcache:false), as described here:
https://github.com/jemalloc/jemalloc/wiki/Use-Case%3A-Find-a-memory-corruption-bug
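For reference, a minimal sketch of how a job could be launched with thread caching turned off through the environment. Only `tcache:false` comes from this thread; the helper names `build_malloc_conf` and `debug_env`, and the `hlt.py` config name in the usage comment, are hypothetical and used for illustration only.

```python
import os
import subprocess


def build_malloc_conf(options):
    """Join options into the comma-separated key:value string used by MALLOC_CONF."""
    parts = []
    for key, value in options.items():
        # jemalloc expects lowercase true/false for boolean options
        if isinstance(value, bool):
            value = "true" if value else "false"
        parts.append(f"{key}:{value}")
    return ",".join(parts)


def debug_env(options, base_env=None):
    """Return a copy of the environment with MALLOC_CONF set from options."""
    env = dict(os.environ if base_env is None else base_env)
    env["MALLOC_CONF"] = build_malloc_conf(options)
    return env


# Usage (hypothetical config file name):
#   subprocess.run(["cmsRun", "hlt.py"], env=debug_env({"tcache": False}))
```

Setting the variable per process like this leaves the rest of the node's environment untouched, which may matter when only some jobs should run with the slower settings.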

Since the debug version is older, can it be used in place of 5.2.1? If not, I would like to ask for a debug build of 5.2.1.

Regarding how to switch the allocator: is it sufficient to modify the XML file config/toolbox/slc7_amd64_gcc900/tools/selected/jemalloc.xml and run scram setup, or is it more involved?

Best regards,
Srecko Morovic

@cmsbuild
Contributor

A new Issue was created by @smorovic Srecko Morovic.

@Dr15Jones, @perrotta, @dpiparo, @makortel, @smuzaffar, @qliphy can you please review it and eventually sign/assign? Thanks.

cms-bot commands are listed here

@makortel
Contributor

assign core

@cmsbuild
Contributor

New categories assigned: core

@Dr15Jones,@smuzaffar,@makortel you have been requested to review this Pull request/Issue and eventually sign? Thanks

@Dr15Jones
Contributor

Valgrind is the best tool we have for finding memory issues.

@smuzaffar
Contributor

@smorovic, regarding your question about using jemalloc-debug: yes, updating config/toolbox/slc7_amd64_gcc900/tools/selected/jemalloc.xml (just update JEMALLOC_BASE to point to the jemalloc-debug install location) and then running scram setup + cmsenv is enough for cmsRun to start using the debug jemalloc.
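A rough sketch of that edit, for anyone who wants to script it across many nodes. It assumes the tool file stores JEMALLOC_BASE as an `<environment name="JEMALLOC_BASE" default="..."/>` element, which should be checked against the actual jemalloc.xml. The function name `point_tool_at`, the sample XML, and the paths in it are hypothetical placeholders.

```python
import xml.etree.ElementTree as ET

# Hypothetical tool file content; the real jemalloc.xml may be laid out differently.
SAMPLE_TOOL_XML = """<tool name="jemalloc" version="5.2.1">
  <client>
    <environment name="JEMALLOC_BASE" default="/hypothetical/path/jemalloc/5.2.1"/>
  </client>
</tool>"""


def point_tool_at(tool_xml, variable, new_base):
    """Return tool_xml with the default of the named <environment> entry replaced."""
    root = ET.fromstring(tool_xml)
    matches = [e for e in root.iter("environment") if e.get("name") == variable]
    if not matches:
        # Fail loudly rather than silently writing back an unchanged file
        raise ValueError(f"no <environment name={variable!r}> entry found")
    for element in matches:
        element.set("default", new_base)
    return ET.tostring(root, encoding="unicode")


updated = point_tool_at(
    SAMPLE_TOOL_XML, "JEMALLOC_BASE", "/hypothetical/path/jemalloc-debug/4.5.0"
)
print(updated)
```

After writing the result back to the tool file, scram setup and cmsenv still need to be run in the work area, as described above.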

@smuzaffar
Contributor

By the way, cms-sw/cmsdist#7396 should fix the issue with different versions being used for jemalloc and jemalloc-debug.

@smorovic
Contributor Author

Thank you @smuzaffar. After your PR is merged, will this be reflected in the next release (e.g. 12_0_3)?

The idea was to be able to run the full HLT, on around 200 nodes, tooled with debug jemalloc in case the crashes happen again. They are not easily reproducible: they were spurious and occurred in bursts every few hours (although the issue could still be present in all instances, depending on the probability to crash).

Valgrind is difficult to use in this way. I can try to run it manually for a single process, if it's not too slow to keep up with the rest of HLT.

@smuzaffar
Contributor

It will first go into 12.1.X and then we will backport it to 12.0.X (it will take a couple of days before it is available in 12.0.X IBs).

@smuzaffar
Contributor

cms-sw/cmsdist#7399 has been merged in 12.0.X; the next CMSSW 12.0 release will include it.

@smorovic
Contributor Author

Thank you, @smuzaffar, for sorting this out.
I will then close the issue.
