MEDIA-3445: Reduce memory consumption for audio only conferences #401

RicardoMDomingues · 2025-02-11T10:52:26Z

The goal of this change is save memory for audio only conferences. On C9 SFU we experienced a problem where sufs runs out of memory with low seats reservation. SMB allocates a lot of memory for each conference and when we have a lot of small conference (low number of participants) the endpoints we can have in a single sfu reduce drastically due to the quantity of memory each conference pre allocated.

This PR tries to save around 1MB allocation per each conference when we know they are audio only (it's always the case for C9) Although for C9 video is never necessary, we decided to make it configurable through the API when the conference is allocated so allow a better memory adjustment when we have mixed scenarios (like we have in Symphony where the VoiceDirect rooms are audio only and maybe the future integration with WhatsApp calls will be also audio only).

If an endpoint tries to allocated video streams, the allocation does not fail. Instead, the video is ignored and the offer does not contains any information about video, only one warning is logged saying that endpoint tried to allocate video in an audio only conference.

Notes

We had to add support to zero capacity on MpmcHashmap and MpmcQueue. Now, we can say we want it with zero capacity and no dynamic memory is allocated

Memory allocated by each EngineMixer before this PR

Memory allocated by each EngineMixer after PR (for audio only). Diff: -6.6MB

Memory allocated by each Transport before this PR

Memory allocated by each Transport after PR (for audio only). Diff: -1MB

RicardoMDomingues · 2025-02-11T11:10:02Z

concurrency/MpmcQueue.h

-    const uint32_t _entriesPerCacheline;
-    const uint32_t _capacity;
-    const size_t _blockSize;
+    alignas(64) std::atomic<VersionedIndex> _readCursor;


Replace the explicit line sepatator fields and forced the alignment on 64 bits. This is a bit more resilient as before we couldn't have sure if readCursor was not sharing the cache line with another atomic members in a different object. alignas(64) force it to be the first address on the cache line

RicardoMDomingues · 2025-02-11T11:11:51Z

concurrency/MpmcQueue.h

+    // Special Cell state for when queue has zero size
+    // Null Entry is not written and it is only read it when queue is zero size. So it can live in same cache line of
+    // _readCursor without performance impact
+    const EntrySate _nullEntry = CellState::State::nullSlot;


This is in the same cache line of _readCursor to save a little of memory. This value is unchangeable and it only read when _readCursor is also unchangeable (ie, when the capacity is zero)

RicardoMDomingues · 2025-02-11T11:16:12Z

concurrency/MpmcQueue.h

+    alignas(64) const uint32_t _capacity;
    Entry* const _elements;


I am not convinced that separating this 2 fields from _writeCursor cache line has a significant impact on the performance as they are read only fields (so they don't cause cache trashing), maybe they can suffer from cache trashing when they are being read but as they are const fileds, I think compiler has ways to optimize it (or even we could to store them on the stack memory on the beginning of each method that needs them).

Anyway, I forced 64 bytes alignment to keep the same idea of _cacheLineSeparator2

The contention will be on readcursor and write cursor, those must be separated.
I assume it will invalidate the cacheline if it has been modified but I think that only is valid for atomic variables. If you read a variable that is not atomic like _capacity or _elements, you get whatever value is cached. CPU has not obligation to provide atomic access because you had an atomic in the cache line.
I think the cacheLineSeparator2 was intended to separate _writeCursor from atomics after the MpmcQueue. Otherwise that could cause contention with other unknown object containing atomics.

concurrency/MpmcQueue.h

concurrency/MpmcHashmap.h

bridge/Mixer.h

bridge/Mixer.cpp

bridge/engine/EngineMixer.cpp

concurrency/MpmcQueue.h

olofkallander

I would make the comments more concise. (really short and to the point). They take space and make it harder to read the real code.

olofkallander · 2025-02-12T06:46:51Z

concurrency/MpmcQueue.h

+    // but forcing a new cache line also ensures tha nothing is placed in this cache line after
+    // MpmcQueue boundary (which we can predict if can affect performance
+    // of _writeCursor)
+    alignas(64) const uint32_t _capacity;


Having thought this through, this one should not be aligned. It is not read/written atomically and the value will be from whatever is in cache and it will not stall cpu pipeline to read it (which is ok as it is const).
It is important that writecursor is protected from atomic operations in the next cache line, but that should be resolved by all other objects doing atomic operations align their first atomic member variable. Preferably having the atomic one first to not waste memory.

I removed this alignas. Then _nullEntry, _capacity and _elements are in the same cache line of _writeCursor to now waste space.

the _writecursor is protected from other atomic variables outside of this object because the MpmcQueue as an alignment requirement of 64 and the size needs to be a multiple of 64, so the size of MpmcQueue is 128. Although there is no standardization on binary level, I think compilers don't have another option than forcing to be a multiple of aligment due to arrays and pointers iterators (Example, if we have a MpmcQueue[N], accessing the memory position should index * sizeof(MpmcQueue) a and that needs always be a 64 bytes aligned. Same thing for when you iterating a MpmcQueue*, then a padding needs to added and that ensures no other variables are added to the cache line outside of this object)

RicardoMDomingues · 2025-02-13T23:10:27Z

bridge/Mixer.cpp

+    const bool canDoVideo = hasVideoEnabled && !this->hasVideoDisabled();
+    const bool enableUplinkEstimation = canDoVideo;
+    const size_t expectedInboundStreamCount = canDoVideo ? 16 : 8;
+    const size_t expectedOutboundStreamCount = canDoVideo ? 256 : 16;
+    const size_t jobQueueSize = canDoVideo ? 4096 : 1024;
+


Better adjustment of transport memory allocation

RicardoMDomingues requested review from olofkallander and reddvl1980 February 11, 2025 10:52

RicardoMDomingues commented Feb 11, 2025

View reviewed changes