Option for a faster allocator than dlmalloc (perhaps mimalloc) #18369

arsnyder16 · 2022-12-14T01:28:52Z

I am investigating performance in my project, and one thing that stands out when comparing platforms (Windows, Linux, wasm) is that the wasm version is generally slower wherever there is a fair amount of allocation.

From what I can tell, Emscripten uses dlmalloc, which I believe is the same allocator as musl.
There is also a more compact allocator available, emmalloc.

From what I can find, poor allocator performance might be a known problem for musl, so I am curious about alternatives that I can try. One tricky part is that the allocator must support sbrk. A promising one that I found is mimalloc, which does seem to have some support for wasm.

Has mimalloc been explored at all? Or how could I go about overriding the default malloc behavior to use mimalloc?

sbc100 · 2022-12-14T02:08:23Z

I think all you would need to do is compile it and link with -lmimalloc. emcc would end up putting that on the link line before libc or libmalloc, so the symbols in your library would take precedence.

arsnyder16 · 2022-12-14T02:10:08Z

Thanks! I'll give that a shot

arsnyder16 · 2023-04-02T22:47:48Z

I was able to get this working, but unfortunately didn't notice a big difference in my particular case.

More interestingly, we were a few patch releases behind at the time, and once we upgraded we got a pretty significant improvement in performance. I am not sure what the direct cause was, but my suspicion is that it may have been #18186.

With that said, I think there could be some interesting investigation around performance and other allocators.

I am certainly OK with closing this issue for now.

@sbc100 Is this something you would like me to hold open?

sbc100 · 2023-04-03T02:46:15Z

I'm pretty sure that using an alternate allocator works as expected in Emscripten. I think we have a test for it: see test_wrap_malloc in test_core.py. In this test we check not only that we can override malloc but also that we can still call the original malloc.
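A minimal sketch of the wrapping idea, not the code from that test. It assumes the original allocator is reachable under the names emscripten_builtin_malloc and emscripten_builtin_free when building with Emscripten, and falls back to plain malloc/free elsewhere so that it compiles natively. The counting_* names are hypothetical; to really override the allocator they would be named malloc and free.

```c
#include <stddef.h>
#include <stdlib.h>

#ifdef __EMSCRIPTEN__
/* Assumption: Emscripten exposes the original allocator under these names. */
void *emscripten_builtin_malloc(size_t size);
void emscripten_builtin_free(void *ptr);
#define ORIG_MALLOC emscripten_builtin_malloc
#define ORIG_FREE emscripten_builtin_free
#else
/* Native fallback so the sketch builds outside Emscripten. */
#define ORIG_MALLOC malloc
#define ORIG_FREE free
#endif

/* Number of blocks handed out and not yet freed (not thread-safe). */
static size_t live_allocations = 0;

/* Hypothetical wrapper: forwards to the original allocator and counts. */
void *counting_malloc(size_t size) {
  void *p = ORIG_MALLOC(size);
  if (p) live_allocations++;
  return p;
}

/* Hypothetical wrapper: counts the release, then forwards it. */
void counting_free(void *p) {
  if (p) live_allocations--;
  ORIG_FREE(p);
}

size_t counting_live(void) { return live_allocations; }
```

Calling counting_malloc twice and then counting_free on both pointers should bring counting_live() back to zero.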

So I think this issue can be closed.

@Kingrd97, perhaps you have a different specific issue? Can you share your full link command?

kripken · 2023-06-14T18:18:09Z

This is probably worth looking into for the multithreading case, since dlmalloc doesn't have per-thread arenas, and as a result we can end up with a lot of lock contention in the case of many allocations on different threads.

Allocators with per-thread arenas like mimalloc can have much better performance even on native builds, and in wasm where atomics can be more expensive that might be even more noticeable. (edit: I didn't benchmark myself, but have heard reports of a 2x difference in microbenchmarks)
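To make the contrast concrete, here is a toy sketch (not dlmalloc's or mimalloc's actual code) of a single shared arena behind one lock next to a per-thread arena that needs no lock. All names and sizes are hypothetical, and neither arena supports freeing.

```c
#include <pthread.h>
#include <stddef.h>

#define ARENA_BYTES (64 * 1024)
#define MAX_WORKERS 16

/* One shared arena: every thread contends for the same lock. */
static _Alignas(16) unsigned char shared_arena[ARENA_BYTES];
static size_t shared_used = 0;
static pthread_mutex_t shared_lock = PTHREAD_MUTEX_INITIALIZER;

/* One arena per thread: thread-local state, so no lock is needed. */
static _Thread_local _Alignas(16) unsigned char local_arena[ARENA_BYTES];
static _Thread_local size_t local_used = 0;

/* Bump-allocate from an arena; returns NULL when it is full. */
static void *bump(unsigned char *arena, size_t *used, size_t size) {
  if (size > ARENA_BYTES) return NULL;
  size = (size + 15) & ~(size_t)15; /* keep 16-byte alignment */
  if (size > ARENA_BYTES - *used) return NULL;
  void *p = arena + *used;
  *used += size;
  return p;
}

void *shared_alloc(size_t size) {
  pthread_mutex_lock(&shared_lock);
  void *p = bump(shared_arena, &shared_used, size);
  pthread_mutex_unlock(&shared_lock);
  return p;
}

void *local_alloc(size_t size) {
  return bump(local_arena, &local_used, size);
}

typedef struct { int use_local; int count; int ok; } arena_job;

static void *arena_worker(void *arg) {
  arena_job *job = arg;
  for (int i = 0; i < job->count; i++) {
    void *p = job->use_local ? local_alloc(16) : shared_alloc(16);
    if (p) job->ok++;
  }
  return NULL;
}

/* Runs up to MAX_WORKERS threads; returns the number of successful allocations. */
int run_workers(int nthreads, int use_local, int count) {
  pthread_t threads[MAX_WORKERS];
  arena_job jobs[MAX_WORKERS];
  int started = 0, total = 0;
  if (nthreads > MAX_WORKERS) nthreads = MAX_WORKERS;
  for (int i = 0; i < nthreads; i++) {
    jobs[i] = (arena_job){use_local, count, 0};
    if (pthread_create(&threads[i], NULL, arena_worker, &jobs[i]) != 0) break;
    started++;
  }
  for (int i = 0; i < started; i++) {
    pthread_join(threads[i], NULL);
    total += jobs[i].ok;
  }
  return total;
}
```

In the shared variant every allocation serializes on shared_lock, which is the contention described above; in the per-thread variant the threads never touch common state.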

Adding an option for -sMALLOC=mimalloc might be worthwhile to simplify this for users, and also doing a full port could get things like mallinfo, tracing, and other stuff working with Emscripten.

junyuecao · 2023-08-02T01:14:06Z

@arsnyder16 I am investigating a performance issue and I see you've got mimalloc integrated. It would help a lot if you could share some snippets showing how to build mimalloc and replace dlmalloc.

arsnyder16 · 2023-08-02T16:00:11Z

@junyuecao Are you running into a particular issue? I don't recall hitting any roadblocks simply building mimalloc with emscripten and then linking my application with it

junyuecao · 2023-08-03T11:15:46Z

@arsnyder16 I linked with mimalloc successfully, but it keeps crashing inside mimalloc (a memory out-of-bounds error). By the way, it's a multi-threaded web app.

arsnyder16 · 2023-08-03T15:15:59Z

Hmm, can you supply your link arguments?

junyuecao · 2023-08-09T09:39:06Z

@arsnyder16 Just like this:

add_library(libmimalloc STATIC IMPORTED)
set_target_properties(
  libmimalloc
  PROPERTIES
    IMPORTED_LOCATION /path/to/libmimalloc.a
)

arsnyder16 · 2023-08-09T13:58:26Z

Can you supply the full link arguments that are passed?

For example something like:
-sINITIAL_MEMORY=100MB -sSTACK_SIZE=2MB -fexceptions -sWASM_BIGINT -sALLOW_MEMORY_GROWTH -sEXIT_RUNTIME -pthread -sPROXY_TO_PTHREAD -Os

Also what version of the sdk are you using?

Markus87 · 2023-10-11T15:04:41Z

My experience with using mimalloc (2.1.2, emsdk 3.1.25):
My application uses threads, and locking for allocations kills the performance. It runs 100x slower than on Windows.
With mimalloc it's only ~10 times slower (or over 10 times faster than with the default allocator).
Sadly, the workload does not complete yet because mimalloc quickly runs out of memory; I'm not sure why yet.
From what mimalloc logs, the reserved sizes for the threads do not seem excessively large.

So having mimalloc working for threaded applications seems to be a big win.

kripken · 2023-10-11T19:41:55Z

@Markus87 Thanks for the information!

I've recently been looking into another allocator option here. On a simple benchmark I see dlmalloc not scaling at all (it gets slower with each additional core) while other allocators improve. So, yes, dlmalloc being single-threaded can be a problem.

It will take some work to get a proper port of a new allocator, though. One issue, maybe related to the OOM issue you saw with mimalloc, is that it's easiest for such a parallel malloc to not return memory to the system at all (that's what the wasi port in mimalloc does), but that's obviously not ideal. The problem is that using sbrk or memory.grow underneath a parallel allocator, instead of what the allocator is used to using (VirtualAlloc or mmap), doesn't really allow freeing zones. But I think we can fix that with a two-tiered malloc, basically to do something more like VirtualAlloc than sbrk in wasm.
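The limitation can be illustrated with a toy model of an sbrk-style break. This is only a sketch: toy_sbrk, toy_release, and toy_reserved are hypothetical names that track offsets rather than real memory.

```c
#include <stddef.h>

#define TOY_HEAP_LIMIT 4096

/* Simulated program break: everything below it counts as reserved. */
static long toy_break = 0;

/* sbrk-like call: moves the break by delta and returns the old break,
   or -1 if the request would leave the valid range. */
long toy_sbrk(long delta) {
  long old = toy_break;
  long next = old + delta;
  if (next < 0 || next > TOY_HEAP_LIMIT) return -1;
  toy_break = next;
  return old;
}

/* Tries to hand a zone back. Only a zone that ends exactly at the break
   can be released; a zone in the middle stays reserved as a hole.
   Returns 1 if memory was handed back, 0 otherwise. */
int toy_release(long start, long size) {
  if (start + size != toy_break) return 0;
  toy_sbrk(-size);
  return 1;
}

long toy_reserved(void) { return toy_break; }
```

With three 1024-byte zones reserved back to back, releasing the middle one returns nothing, while releasing the top one lowers the break. An interface like VirtualAlloc or mmap has no such restriction, which is what the second tier is meant to provide.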

I hope to have a PR up in the next few weeks.

Markus87 · 2023-10-12T07:53:50Z

@kripken That is good to know. I am looking forward to trying the new solution.

With this PR, if emmalloc.c is built with -DEMMALLOC_NO_STD_EXPORTS then we do not define malloc, free, etc. That means we only provide emmalloc_malloc, emmalloc_free, etc., the prefixed versions. They can then be used alongside another malloc impl.

This will be useful in a later PR that adds a two-tiered allocator: a fast multithreaded one, and underneath it emmalloc, which will function as the "system allocator" for it. That is, emmalloc will play the role of VirtualAlloc on Windows or mmap on POSIX, a way for the main allocator to get system memory. (We can't just use sbrk for that purpose because we also want to free memory to the system.) For that goal, emmalloc seems suitable as it is compact (we don't need it to be super-fast; this is the system allocator that will be called rarely, compared to the fast one before it).

And for emmalloc to be used like that we need this PR so that we can build emmalloc alongside another allocator (that other allocator will define malloc etc. itself). Helps #18369
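A rough sketch of the two-tier shape described above, not the actual port. The front tier carves small blocks out of segments and hands a segment back to the system tier once all of its blocks are freed. SYS_ALLOC and SYS_FREE are stand-ins: in the described design they would be emmalloc_malloc and emmalloc_free, but plain malloc and free are used here so the sketch builds anywhere. The tier_* names and the segment size are hypothetical, and the code is single-threaded.

```c
#include <stddef.h>
#include <stdlib.h>

/* Stand-ins for the system tier (assumption: emmalloc_malloc/emmalloc_free
   would fill this role when emmalloc is built without the standard exports). */
#define SYS_ALLOC malloc
#define SYS_FREE free

#define SEGMENT_BYTES 4096
#define ALIGNMENT _Alignof(max_align_t)
#define ALIGN_UP(n) (((n) + (ALIGNMENT - 1)) & ~(size_t)(ALIGNMENT - 1))

typedef struct segment {
  size_t used; /* bytes handed out so far, including this header */
  size_t live; /* blocks in this segment that are not yet freed */
} segment;

typedef union block_header {
  segment *owner;    /* segment this block was carved from */
  max_align_t align; /* keeps the payload suitably aligned */
} block_header;

static segment *current = NULL;  /* segment currently being carved */
static size_t segments_held = 0; /* segments obtained and not yet returned */

void *tier_alloc(size_t size) {
  if (size > SEGMENT_BYTES) return NULL; /* toy: no large objects */
  size_t need = sizeof(block_header) + ALIGN_UP(size);
  if (need > SEGMENT_BYTES - ALIGN_UP(sizeof(segment))) return NULL;
  if (!current || current->used + need > SEGMENT_BYTES) {
    /* Front tier is out of room: ask the system tier for a new segment. */
    segment *s = SYS_ALLOC(SEGMENT_BYTES);
    if (!s) return NULL;
    s->used = ALIGN_UP(sizeof(segment));
    s->live = 0;
    current = s;
    segments_held++;
  }
  block_header *h = (block_header *)((unsigned char *)current + current->used);
  h->owner = current;
  current->used += need;
  current->live++;
  return h + 1;
}

void tier_free(void *p) {
  if (!p) return;
  segment *s = ((block_header *)p - 1)->owner;
  if (--s->live == 0) {
    /* Whole segment is empty: return it to the system tier. */
    if (s == current) current = NULL;
    SYS_FREE(s);
    segments_held--;
  }
}

size_t tier_segments_held(void) { return segments_held; }
```

The point of the sketch is the last step of tier_free: because the system tier can free arbitrary segments, an empty segment in the middle of the heap can be given back, which a bare sbrk underneath could not do.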

kripken · 2023-11-08T00:45:51Z

A PR for mimalloc is now up: #20651 - testing and feedback would be welcome!

Markus87 · 2023-11-14T20:45:31Z

@kripken Thank you, this is amazing!
With your solution my use case is only around 1.5-3x slower than on Windows.
The problem where it ran out of memory is gone as well, as expected.

kripken · 2023-11-15T20:23:27Z

Great, thanks for testing @Markus87 !

…20651) The new allocator can be used with -sMALLOC=mimalloc.

On the benchmark added in this PR, dlmalloc does quite poorly (getting actually slower with each additional core, because the lock contention is much larger than the actual work in the artificial benchmark). mimalloc, in comparison, scales the same as natively: more cores keep helping. So mimalloc can be a significant speedup in codebases that have lock contention on malloc.

mimalloc is significantly larger than dlmalloc, however, so we do not want it on by default. It also uses more memory, because of how mimalloc works and also due to #20645.

Design-wise, this layers mimalloc on top of emmalloc. emmalloc functions as the "system allocator", which is more powerful than just using raw sbrk - sbrk can't free holes in the middle, for example.

Code-wise, all of system/lib/mimalloc is unchanged from upstream (see README.emscripten) except for an ifdef or two, and then the new backend which is in system/lib/mimalloc/src/prim/emscripten/prim.c. That file has more comments explaining the design of the port.

A new test is added which is also usable as a benchmark, test/other/test_mimalloc.cpp, which is where the numbers above come from. Fixes #18369
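For trying the option out, a small malloc-heavy threaded workload along these lines may be enough. This is a hypothetical sketch and not the contents of test/other/test_mimalloc.cpp; bench_run and the other names are made up for illustration.

```c
#define _POSIX_C_SOURCE 200809L /* for clock_gettime */
#include <pthread.h>
#include <stdlib.h>
#include <time.h>

#define MAX_BENCH_THREADS 16

typedef struct { int iterations; long completed; } bench_job;

static void *bench_worker(void *arg) {
  bench_job *job = arg;
  for (int i = 0; i < job->iterations; i++) {
    /* Writing through a volatile pointer keeps the pair from being optimized out. */
    volatile char *p = malloc(64 + (size_t)(i & 63));
    if (p) {
      p[0] = 1;
      job->completed++;
    }
    free((void *)p);
  }
  return NULL;
}

/* Runs up to MAX_BENCH_THREADS workers doing malloc/free pairs.
   Returns the total number of successful allocations and stores the
   elapsed wall-clock time in *seconds. */
long bench_run(int nthreads, int iterations, double *seconds) {
  pthread_t threads[MAX_BENCH_THREADS];
  bench_job jobs[MAX_BENCH_THREADS];
  struct timespec t0, t1;
  int started = 0;
  long total = 0;
  if (nthreads > MAX_BENCH_THREADS) nthreads = MAX_BENCH_THREADS;
  clock_gettime(CLOCK_MONOTONIC, &t0);
  for (int i = 0; i < nthreads; i++) {
    jobs[i].iterations = iterations;
    jobs[i].completed = 0;
    if (pthread_create(&threads[i], NULL, bench_worker, &jobs[i]) != 0) break;
    started++;
  }
  for (int i = 0; i < started; i++) {
    pthread_join(threads[i], NULL);
    total += jobs[i].completed;
  }
  clock_gettime(CLOCK_MONOTONIC, &t1);
  *seconds = (double)(t1.tv_sec - t0.tv_sec) +
             (double)(t1.tv_nsec - t0.tv_nsec) / 1e9;
  return total;
}
```

One way to use it is to call bench_run from main with increasing thread counts, build once with emcc -pthread and once with emcc -pthread -sMALLOC=mimalloc, and compare how the elapsed time changes as threads are added. Depending on the setup, a flag such as -sPROXY_TO_PTHREAD may be needed so that the joining thread is allowed to block.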

kripken changed the title ~~Alternate allocator~~ Option for a faster allocator than dlmalloc (perhaps mimalloc) Jun 14, 2023

kripken self-assigned this Oct 11, 2023

kripken mentioned this issue Oct 18, 2023

emmalloc: Add an option to not define the standard exports #20487

Merged

kripken mentioned this issue Nov 8, 2023

Add a port of mimalloc, a fast and scalable multithreaded allocator #20651

Merged

kripken closed this as completed in #20651 Nov 16, 2023
