Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

MEDIA-3445: Reduce memory consumption for audio only conferences #401

Open
wants to merge 9 commits into
base: master
Choose a base branch
from

Conversation

RicardoMDomingues
Copy link
Member

@RicardoMDomingues RicardoMDomingues commented Feb 11, 2025

The goal of this change is save memory for audio only conferences. On C9 SFU we experienced a problem where sufs runs out of memory with low seats reservation. SMB allocates a lot of memory for each conference and when we have a lot of small conference (low number of participants) the endpoints we can have in a single sfu reduce drastically due to the quantity of memory each conference pre allocated.

This PR tries to save around 1MB allocation per each conference when we know they are audio only (it's always the case for C9) Although for C9 video is never necessary, we decided to make it configurable through the API when the conference is allocated so allow a better memory adjustment when we have mixed scenarios (like we have in Symphony where the VoiceDirect rooms are audio only and maybe the future integration with WhatsApp calls will be also audio only).

If an endpoint tries to allocated video streams, the allocation does not fail. Instead, the video is ignored and the offer does not contains any information about video, only one warning is logged saying that endpoint tried to allocate video in an audio only conference.

Notes

We had to add support to zero capacity on MpmcHashmap and MpmcQueue. Now, we can say we want it with zero capacity and no dynamic memory is allocated

Memory allocated by each EngineMixer before this PR

Screenshot 2025-02-13 at 10 11 23

Memory allocated by each EngineMixer after PR (for audio only). Diff: -6.6MB

Screenshot 2025-02-13 at 09 44 26

Memory allocated by each Transport before this PR

Screenshot 2025-02-13 at 10 53 37

Memory allocated by each Transport after PR (for audio only). Diff: -1MB

Screenshot 2025-02-13 at 12 08 53

const uint32_t _entriesPerCacheline;
const uint32_t _capacity;
const size_t _blockSize;
alignas(64) std::atomic<VersionedIndex> _readCursor;
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Replace the explicit line sepatator fields and forced the alignment on 64 bits. This is a bit more resilient as before we couldn't have sure if readCursor was not sharing the cache line with another atomic members in a different object. alignas(64) force it to be the first address on the cache line

// Special Cell state for when queue has zero size
// Null Entry is not written and it is only read it when queue is zero size. So it can live in same cache line of
// _readCursor without performance impact
const EntrySate _nullEntry = CellState::State::nullSlot;
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is in the same cache line of _readCursor to save a little of memory. This value is unchangeable and it only read when _readCursor is also unchangeable (ie, when the capacity is zero)

Comment on lines 299 to 290
alignas(64) const uint32_t _capacity;
Entry* const _elements;
Copy link
Member Author

@RicardoMDomingues RicardoMDomingues Feb 11, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I am not convinced that separating this 2 fields from _writeCursor cache line has a significant impact on the performance as they are read only fields (so they don't cause cache trashing), maybe they can suffer from cache trashing when they are being read but as they are const fileds, I think compiler has ways to optimize it (or even we could to store them on the stack memory on the beginning of each method that needs them).

Anyway, I forced 64 bytes alignment to keep the same idea of _cacheLineSeparator2

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The contention will be on readcursor and write cursor, those must be separated.
I assume it will invalidate the cacheline if it has been modified but I think that only is valid for atomic variables. If you read a variable that is not atomic like _capacity or _elements, you get whatever value is cached. CPU has not obligation to provide atomic access because you had an atomic in the cache line.
I think the cacheLineSeparator2 was intended to separate _writeCursor from atomics after the MpmcQueue. Otherwise that could cause contention with other unknown object containing atomics.

bridge/Mixer.h Outdated Show resolved Hide resolved
bridge/Mixer.h Outdated Show resolved Hide resolved
bridge/Mixer.cpp Outdated Show resolved Hide resolved
Copy link
Contributor

@olofkallander olofkallander left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I would make the comments more concise. (really short and to the point). They take space and make it harder to read the real code.

// but forcing a new cache line also ensures tha nothing is placed in this cache line after
// MpmcQueue boundary (which we can predict if can affect performance
// of _writeCursor)
alignas(64) const uint32_t _capacity;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Having thought this through, this one should not be aligned. It is not read/written atomically and the value will be from whatever is in cache and it will not stall cpu pipeline to read it (which is ok as it is const).
It is important that writecursor is protected from atomic operations in the next cache line, but that should be resolved by all other objects doing atomic operations align their first atomic member variable. Preferably having the atomic one first to not waste memory.

Copy link
Member Author

@RicardoMDomingues RicardoMDomingues Feb 12, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I removed this alignas. Then _nullEntry, _capacity and _elements are in the same cache line of _writeCursor to now waste space.

the _writecursor is protected from other atomic variables outside of this object because the MpmcQueue as an alignment requirement of 64 and the size needs to be a multiple of 64, so the size of MpmcQueue is 128. Although there is no standardization on binary level, I think compilers don't have another option than forcing to be a multiple of aligment due to arrays and pointers iterators (Example, if we have a MpmcQueue[N], accessing the memory position should index * sizeof(MpmcQueue) a and that needs always be a 64 bytes aligned. Same thing for when you iterating a MpmcQueue*, then a padding needs to added and that ensures no other variables are added to the cache line outside of this object)

Comment on lines +294 to +299
const bool canDoVideo = hasVideoEnabled && !this->hasVideoDisabled();
const bool enableUplinkEstimation = canDoVideo;
const size_t expectedInboundStreamCount = canDoVideo ? 16 : 8;
const size_t expectedOutboundStreamCount = canDoVideo ? 256 : 16;
const size_t jobQueueSize = canDoVideo ? 4096 : 1024;

Copy link
Member Author

@RicardoMDomingues RicardoMDomingues Feb 13, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Better adjustment of transport memory allocation

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants