Option for a faster allocator than dlmalloc (perhaps mimalloc) #18369

Closed
arsnyder16 opened this issue Dec 14, 2022 · 17 comments · Fixed by #20651

@arsnyder16
Contributor

I am investigating performance in my project, and one thing that stands out when comparing platforms (Windows, Linux, wasm) is that the wasm build is generally slower wherever there is a fair amount of allocation.

From what I can tell, Emscripten uses dlmalloc, which I believe is the same allocator that musl uses.
There is also a more compact allocator available, emmalloc.

From what I can find, poor allocator performance might be a known problem for musl, so I am curious about alternatives that I can try. One tricky part is that the allocator must support sbrk. One promising candidate I found is mimalloc, which does seem to have some support for wasm.

Has mimalloc been explored at all? Or how could I go about overriding the default malloc behavior to use mimalloc?

@sbc100
Collaborator

sbc100 commented Dec 14, 2022

I think all you would need to do is compile it and link with -lmimalloc. emcc would end up putting that on the link line before libc or libmalloc, and the symbols in your library would take precedence.

@arsnyder16
Contributor Author

Thanks! I'll give that a shot

@arsnyder16
Contributor Author

I was able to get this working but didn’t notice a big difference in my particular case unfortunately.

More interestingly, at the time we were a few patch releases behind, and once we upgraded we got a pretty significant improvement in performance. I am not sure what the direct cause was, but my suspicion is that it may have been #18186.

With that said I think there could be some interesting investigation around performance and other allocators.

I am certainly ok with closing this issue for now

@sbc100 Is this something you would like me to keep open?

@sbc100
Collaborator

sbc100 commented Apr 3, 2023

I'm pretty sure that using an alternate allocator works as expected in Emscripten. I think we have a test for it: see test_wrap_malloc in test_core.py. In this test we check not only that we can override malloc but that we can even call the original malloc.

So I think this issue can be closed.

@Kingrd97, perhaps you have a different specific issue? Can you share your full link command?

@kripken
Member

kripken commented Jun 14, 2023

This is probably worth looking into for the multithreading case, since dlmalloc doesn't have per-thread arenas, and as a result we can end up with a lot of lock contention in the case of many allocations on different threads.

Allocators with per-thread arenas like mimalloc can have much better performance even on native builds, and in wasm where atomics can be more expensive that might be even more noticeable. (edit: I didn't benchmark myself, but have heard reports of a 2x difference in microbenchmarks)

Adding an option for -sMALLOC=mimalloc might be worthwhile to simplify this for users, and also doing a full port could get things like mallinfo, tracing, and other stuff working with Emscripten.

@kripken kripken changed the title from "Alternate allocator" to "Option for a faster allocator than dlmalloc (perhaps mimalloc)" Jun 14, 2023
@junyuecao
Contributor

@arsnyder16 I am investigating a performance issue and I see you've got mimalloc integrated. It would help a lot if you could share some snippets showing how to build mimalloc and replace dlmalloc.

@arsnyder16
Contributor Author

@junyuecao Are you running into a particular issue? I don't recall hitting any roadblocks simply building mimalloc with emscripten and then linking my application with it

@junyuecao
Contributor

@arsnyder16 I linked with mimalloc successfully, but it keeps crashing inside mimalloc (an out-of-bounds memory error). By the way, it's a multi-threaded web app.

@arsnyder16
Contributor Author

Hmm, can you supply your link arguments?

@junyuecao
Contributor

@arsnyder16 just like this

add_library(libmimalloc STATIC IMPORTED)
set_target_properties(
    libmimalloc
    PROPERTIES IMPORTED_LOCATION
    /path/to/libmimalloc.a
)

@arsnyder16
Contributor Author

Can you supply the full link arguments you passed?

For example, something like:
-sINITIAL_MEMORY=100MB -sSTACK_SIZE=2MB -fexceptions -sWASM_BIGINT -sALLOW_MEMORY_GROWTH -sEXIT_RUNTIME -pthread -sPROXY_TO_PTHREAD -Os

Also, what version of the SDK are you using?

@Markus87
Contributor

My experience with using mimalloc (2.1.2, emsdk 3.1.25):
My application uses threads, and locking for allocations kills the performance: it runs 100x slower than on Windows.
With mimalloc it's only ~10x slower (in other words, over 10 times faster than with the default allocator).
Sadly, the workload does not complete yet because mimalloc quickly runs out of memory; I'm not sure why yet.
From what mimalloc logs, the reserved sizes for the threads do not seem crazy big.

So having mimalloc working for threaded applications seems to be a big win.

@kripken kripken self-assigned this Oct 11, 2023
@kripken
Member

kripken commented Oct 11, 2023

@Markus87 Thanks for the information!

I've recently been looking into another allocator option here. On a simple benchmark I see dlmalloc not scaling at all (it gets slower with each additional core), while other allocators improve. So, yes, dlmalloc being single-threaded can be a problem.
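To make the scaling claim concrete, here is a minimal sketch of the kind of microbenchmark that exposes allocator lock contention. It is not the benchmark referred to in this comment; `run_malloc_bench`, `bench_worker`, and `bench_arg` are hypothetical names chosen for illustration. Each thread performs the same number of malloc/free pairs, so with an allocator that scales well the elapsed time should stay roughly flat as threads are added, while a single global lock would be expected to make it grow.

```c
#define _POSIX_C_SOURCE 200809L
#include <pthread.h>
#include <stdlib.h>
#include <time.h>

typedef struct {
  int iterations; /* malloc/free pairs this thread should perform */
  size_t size;    /* bytes per allocation */
  int completed;  /* pairs actually completed */
} bench_arg;

static void *bench_worker(void *p) {
  bench_arg *arg = p;
  for (int i = 0; i < arg->iterations; i++) {
    volatile unsigned char *block = malloc(arg->size);
    if (!block) break;
    block[0] = (unsigned char)i; /* volatile store keeps the pair from being optimized away */
    free((void *)block);
    arg->completed++;
  }
  return NULL;
}

/* Runs `threads` workers concurrently and returns the elapsed wall-clock
 * time in seconds, or -1.0 if setup or any allocation failed. */
double run_malloc_bench(int threads, int iterations, size_t size) {
  if (threads <= 0 || size == 0) return -1.0;
  pthread_t *tids = malloc(sizeof *tids * (size_t)threads);
  bench_arg *args = malloc(sizeof *args * (size_t)threads);
  if (!tids || !args) {
    free(tids);
    free(args);
    return -1.0;
  }

  struct timespec start, end;
  clock_gettime(CLOCK_MONOTONIC, &start);
  int started = 0;
  for (; started < threads; started++) {
    args[started] = (bench_arg){iterations, size, 0};
    if (pthread_create(&tids[started], NULL, bench_worker, &args[started]) != 0) break;
  }
  for (int i = 0; i < started; i++) pthread_join(tids[i], NULL);
  clock_gettime(CLOCK_MONOTONIC, &end);

  int ok = (started == threads);
  for (int i = 0; ok && i < threads; i++) ok = (args[i].completed == iterations);
  free(tids);
  free(args);
  if (!ok) return -1.0;
  return (double)(end.tv_sec - start.tv_sec) +
         (double)(end.tv_nsec - start.tv_nsec) / 1e9;
}
```

Calling `run_malloc_bench` from a `main` with 1, 2, 4, and 8 threads, once per allocator under comparison, would give a rough picture of how each one scales. When building with Emscripten this needs `-pthread`, and probably something like the `-sPROXY_TO_PTHREAD` flag mentioned earlier in the thread so that joining threads does not block the browser's main thread.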

It will take some work to get a proper port of a new allocator, though. One issue, maybe related to the OOM issue you saw with mimalloc, is that it's easiest for such a parallel malloc to not return memory to the system at all (that's what the wasi port in mimalloc does), but that's obviously not ideal. The problem is that using sbrk or memory.grow underneath a parallel allocator, instead of what the allocator is used to using (VirtualAlloc or mmap), doesn't really allow freeing zones. But I think we can fix that with a two-tiered malloc, basically doing something more like VirtualAlloc than sbrk in wasm.

I hope to have a PR up in the next few weeks.

@Markus87
Contributor

@kripken That is good to know. I am looking forward to trying the new solution.

kripken added a commit that referenced this issue Oct 19, 2023
With this PR if emmalloc.c is built with -DEMMALLOC_NO_STD_EXPORTS then we do
not define malloc, free, etc. That means we only provide emmalloc_malloc,
emmalloc_free, etc., the prefixed versions. They can then be used alongside another
malloc impl.

This will be useful in a later PR that adds a two-tiered allocator: a fast multithreaded
one, and underneath it emmalloc, which will function as the "system allocator" for it.
That is, emmalloc will play the role of VirtualAlloc on Windows or mmap on POSIX,
a way for the main allocator to get system memory. (We can't just use sbrk for that
purpose because we also want to free memory to the system.) For that goal,
emmalloc seems suitable as it is compact (we don't need it to be super-fast; this is
the system allocator that will be called rarely, compared to the fast one before it).
And for emmalloc to be used like that we need this PR so that we can build
emmalloc alongside another allocator (that other allocator will define malloc etc.
itself).
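
The export switch described above can be pictured with a toy allocator. This is only a sketch under stated assumptions, not emmalloc's source: `toyalloc_malloc`, `toyalloc_free`, and `TOYALLOC_NO_STD_EXPORTS` are hypothetical names, and the macro is defined inside the file only so the snippet compiles natively without replacing the host's malloc (in a real build it would be passed with -D on the compiler command line).

```c
#include <stddef.h>

/* Hypothetical stand-in for passing -DTOYALLOC_NO_STD_EXPORTS to the compiler. */
#define TOYALLOC_NO_STD_EXPORTS 1

/* A tiny bump allocator over a static pool; it never calls libc malloc,
 * so it could safely be exported under the standard names as well. */
static _Alignas(max_align_t) unsigned char toy_pool[4096];
static size_t toy_used;

/* The prefixed entry points are always defined. */
void *toyalloc_malloc(size_t n) {
  const size_t align = _Alignof(max_align_t);
  if (n == 0) n = 1;
  if (n > sizeof toy_pool) return NULL; /* also guards the rounding below */
  size_t rounded = (n + align - 1) / align * align;
  if (rounded > sizeof toy_pool - toy_used) return NULL;
  void *p = toy_pool + toy_used;
  toy_used += rounded;
  return p;
}

void toyalloc_free(void *p) {
  (void)p; /* a bump allocator does not reclaim individual blocks */
}

#ifndef TOYALLOC_NO_STD_EXPORTS
/* Only without the flag does this file claim the standard names. With the
 * flag set, another allocator in the same link can define malloc and free
 * and call the prefixed functions underneath. */
void *malloc(size_t n) { return toyalloc_malloc(n); }
void free(void *p) { toyalloc_free(p); }
#endif
```

With the macro set, both the prefixed functions and some other `malloc` implementation can coexist in one program, which is the property the two-tiered design relies on.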

Helps #18369
@kripken
Member

kripken commented Nov 8, 2023

A PR for mimalloc is now up: #20651 - testing and feedback would be welcome!

@Markus87
Contributor

@kripken Thank you, this is amazing!
With your solution my use case is only around 1.5-3x slower than on Windows.
The problem where it ran out of memory is gone as well, as expected.

@kripken
Member

kripken commented Nov 15, 2023

Great, thanks for testing @Markus87 !

kripken added a commit that referenced this issue Nov 16, 2023
…20651)

The new allocator can be used with -sMALLOC=mimalloc.

On the benchmark added in this PR, dlmalloc does quite poorly (actually getting
slower with each additional core, because the lock contention is much larger
than the actual work in the artificial benchmark). mimalloc, in comparison,
scales the same as natively: more cores keep helping. So mimalloc can be a
significant speedup in codebases that have lock contention on malloc.

mimalloc is significantly larger than dlmalloc, however, so we do not want it
on by default. It also uses more memory, because of how mimalloc works and also
due to #20645.

Design-wise, this layers mimalloc on top of emmalloc. emmalloc functions as the
"system allocator", which is more powerful than just using raw sbrk - sbrk can't
free holes in the middle, for example.
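
As a rough illustration of that layering, the sketch below shows a fast bump arena that obtains fixed-size regions from a "system" tier and hands a region back as soon as its last allocation is freed, even when that region sits between regions that are still live. It is not the code of the port: `arena_alloc`, `arena_free`, `sys_region_alloc`, `sys_region_free`, and `REGION_SIZE` are hypothetical names, libc's `aligned_alloc` stands in for emmalloc, and the sketch is single-threaded (a real design would keep one arena per thread and handle frees coming from other threads).

```c
#include <stddef.h>
#include <stdint.h>
#include <stdlib.h>

#define REGION_SIZE 4096 /* hypothetical region size, a power of two */
#define REGION_HEADER 16 /* space reserved at the start of each region */

/* --- System tier: a stand-in for emmalloc, mmap, or VirtualAlloc --- */
static int sys_regions_live; /* regions currently held by the fast tier */

static void *sys_region_alloc(void) {
  /* Alignment equal to the size lets any pointer be mapped to its region. */
  void *r = aligned_alloc(REGION_SIZE, REGION_SIZE);
  if (r) sys_regions_live++;
  return r;
}

static void sys_region_free(void *r) {
  free(r);
  sys_regions_live--;
}

/* --- Fast tier: a bump arena on top of the system tier --- */
typedef struct {
  size_t used; /* bytes handed out so far, including the header */
  size_t live; /* allocations from this region not yet freed */
} region;

typedef struct {
  region *current; /* region that new allocations are carved from */
} arena;

void *arena_alloc(arena *a, size_t n) {
  if (n == 0) n = 1;
  if (n > REGION_SIZE - REGION_HEADER) return NULL; /* large sizes are out of scope */
  n = (n + 15) & ~(size_t)15;                        /* keep 16-byte alignment */
  region *r = a->current;
  if (!r || r->used + n > REGION_SIZE) {
    /* The current region is full. It still has live allocations, so it is
     * retired here and released later, when the last of them is freed. */
    region *fresh = sys_region_alloc();
    if (!fresh) return NULL;
    fresh->used = REGION_HEADER;
    fresh->live = 0;
    a->current = r = fresh;
  }
  void *p = (unsigned char *)r + r->used;
  r->used += n;
  r->live++;
  return p;
}

void arena_free(arena *a, void *p) {
  if (!p) return;
  /* Mask the pointer down to the start of its region. */
  region *r = (region *)((uintptr_t)p & ~(uintptr_t)(REGION_SIZE - 1));
  if (--r->live == 0) {
    if (r == a->current)
      r->used = REGION_HEADER; /* empty current region: reuse it in place */
    else
      sys_region_free(r);      /* retired region: return it to the system tier */
  }
}
```

The step that raw sbrk cannot express is the `sys_region_free` call on a retired region: the system tier can take back a region from the middle of the heap and reuse it later, whereas sbrk can only move the end of the heap.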

Code-wise, all of system/lib/mimalloc is unchanged from upstream (see
README.emscripten) except for an ifdef or two, and then the new backend which
is in system/lib/mimalloc/src/prim/emscripten/prim.c. That file has more
comments explaining the design of the port.

A new test is added which is also usable as a benchmark,
test/other/test_mimalloc.cpp, which is where the numbers above come from.

Fixes #18369
5 participants