Activation Rebalancing #9140
This is how the allowed entropy deviation will change given a base rate.
@ledjon-behluli wrote:
In what way was the configuration changed to accomplish this?
Although I think you've figured it out by now, sorry I didn't see the notifications.
LGTM! I will discuss with the team before merging.
This PR introduces Memory-aware Activation Rebalancing (MAR). It also fixes #9135.
MAR is designed to ensure a balanced distribution of activations across silos by considering both the number of activations and their memory usage. The goal is to maintain an efficient and balanced system, even as the cluster undergoes dynamic changes. MAR employs the principle of maximum entropy, which involves constraints to balance the total number of activations. The process involves forming pairs of silos, calculating deviations, and iteratively adjusting the activations until equilibrium is achieved.
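To make the entropy idea concrete, here is a minimal sketch in Python. It is not the PR's implementation: it looks only at activation counts (MAR also accounts for memory usage), and the function names, the pairing order, and the "move half the difference" rule are all assumptions chosen for illustration. It measures how balanced a cluster is with normalized Shannon entropy, then runs one cycle that pairs the most loaded silo with the least loaded one and evens each pair out.

```python
import math

def normalized_entropy(activation_counts):
    """Shannon entropy of the activation distribution, scaled to [0, 1].

    1.0 means activations are spread evenly across silos.
    """
    total = sum(activation_counts)
    n = len(activation_counts)
    if n < 2 or total == 0:
        return 1.0  # a single silo or an empty cluster is trivially balanced
    entropy = 0.0
    for count in activation_counts:
        if count > 0:
            p = count / total
            entropy -= p * math.log(p)
    return entropy / math.log(n)

def rebalance_pair(counts, high, low):
    """Move half of the difference from silo `high` to silo `low` (hypothetical rule)."""
    delta = (counts[high] - counts[low]) // 2
    counts[high] -= delta
    counts[low] += delta
    return delta

def rebalance_cycle(counts):
    """One cycle: pair heaviest with lightest, second heaviest with second lightest, ..."""
    order = sorted(range(len(counts)), key=lambda i: counts[i])
    moved = 0
    for k in range(len(order) // 2):
        moved += rebalance_pair(counts, high=order[-1 - k], low=order[k])
    return moved

# Usage: an imbalanced 4-silo cluster, before and after one cycle.
counts = [1000, 200, 50, 10]
print(normalized_entropy(counts))
rebalance_cycle(counts)
print(counts, normalized_entropy(counts))
```

Repeating the cycle drives the entropy toward 1.0; a real rebalancer would stop once the deviation from the maximum falls within an allowed tolerance, and would cap how many activations it migrates per cycle.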
The full theory and simulations behind MAR can be found here
Here you can find an illustration of MAR working in an imbalanced 4-silo cluster. You can also run it yourself by navigating to "playground/ActivationRebalancing" and starting the cluster and frontend projects.
Implementation Details
Rebalancer
MAR is implemented as a grain called ActivationRebalancerWorker, and the algorithm runs there. It must be a single grain, and it must stay active at all times once users have enabled it (the feature is opt-in). Users can request a report from it at any time, and suspend or resume its activity for some time (or indefinitely).
Since the rebalancer is a grain hosted somewhere in the cluster, it must be kept alive under all circumstances. The rebalancer is watched by the IActivationRebalancerMonitor system targets and reports to them periodically. If it fails to do so, one of them will contact the rebalancer. If it is dead because its silo crashed, it will be woken up and continue to report back; if the cause is a network isolation issue instead, the call is a no-op from the monitor's side.
When the silo that hosts the rebalancer is shut down gracefully (rolling deployments and such), the system target on the same host as the rebalancer instructs the runtime to migrate the rebalancer somewhere else. The system target itself dies with the silo. If migration is successful, the rebalancer brings its current state over to the other silo and begins right where it left off before its host silo began shutting down. This means that if it is in the middle of a (potentially) long rebalancing session, progress is preserved.
Monitor
The monitor system targets are defined like below. One of them will start the rebalancer, and the rebalancer will report to all of them. If it fails to do so, one of them will wake it up.
The monitor also serves as a proxy for clients to:
The monitor system target extends IActivationRebalancer, which serves as a gateway to interface with the activation rebalancer itself.
Reports
Users can query the latest report from their local monitor, but it can lag behind if ActivationRebalancerOptions.SessionCyclePeriod is chosen to be less than IActivationRebalancerMonitor.WorkerReportPeriod (which it is by default). If users want the latest report, they can call GetRebalancingReport(force: true), which contacts the rebalancer grain directly and may incur a remote call.
The report structure contains information about the rebalancer itself, and rebalancing statistics for the active silos in the cluster.
Information about each property is provided via XML doc comments.
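The trade-off between the locally held report and a forced fetch can be sketched as follows. This is an illustrative Python model, not Orleans code: the Worker and Monitor classes, their method names, and the choice to cache the result of a forced fetch are all assumptions.

```python
class Worker:
    """Hypothetical stand-in for the rebalancer grain."""

    def __init__(self):
        self.cycle = 0

    def run_cycle(self):
        self.cycle += 1

    def current_report(self):
        return {"cycle": self.cycle}


class Monitor:
    """Hypothetical stand-in for a silo-local monitor that holds the last report."""

    def __init__(self, worker):
        self._worker = worker
        self._latest = None

    def on_worker_report(self, report):
        # Called by the worker once per report period.
        self._latest = report

    def get_report(self, force=False):
        # force=True bypasses the local copy and asks the worker directly,
        # which in a real cluster may be a remote call.
        if force or self._latest is None:
            self._latest = self._worker.current_report()
        return self._latest


worker = Worker()
monitor = Monitor(worker)
worker.run_cycle()
monitor.on_worker_report(worker.current_report())
worker.run_cycle()  # cycles run faster than reports arrive
worker.run_cycle()
stale = monitor.get_report()            # the report from the first cycle
fresh = monitor.get_report(force=True)  # reflects all three cycles
```

When cycles run more often than reports are pushed, the local copy describes an older cycle; forcing the call trades an extra hop for an up-to-date answer.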
Options
The rebalancer can be configured via the standard Options pattern in .NET. Below are the available options that control the behavior of the rebalancer. See the XML doc comments for further details.
Registration
As mentioned above, Activation Rebalancing is opt-in and is marked Experimental("ORLEANSEXP002"). Everything is pretty much standard, with the difference that users can supply their own implementation of IFailedRebalancingSessionBackoffProvider, which is used to determine how long to wait between successive rebalancing sessions if a prior session has failed. A session is considered "failed" if n consecutive cycles yielded no significant improvement to the cluster's entropy.
Tests
There are tests which cover the following:
This is a graph from a run I did at the beginning. It shows convergence with silos that have different initial activation counts and different memory usages, while more activations are added during the cycles. It might not be perfect, but this was during the initial phases.