[Proposal] Grain migration #7692
Comments
Would there be any disruption to grain extensions as a result of a migration? I'm thinking about transactions in particular, but this applies to other scenarios as well.
We've moved this issue to the Backlog. This means that it is not going to be worked on for the coming release. We review items in the backlog at the end of each milestone/release and, depending on the team's priority, we may reconsider this issue for the following milestone.
I think this grain-migration-during-rebalancing functionality is essential. Both Akka and Dapr have this functionality as I understand it. I'm new to Orleans, so apologies if any of what I say here is incorrect.

Do you intend grains to be migrated/rebalanced even while busy? I think it is essential that this be the case - i.e., I think grains should be migrated during rebalancing even if they are busy in the middle of one (or more) operations. However, this would mean that grains would be proactively deactivated in the middle of operations, which is not the case today; i.e., grains today are only deactivated after being idle for some time. Therefore, perhaps migration mid-operation needs to be opt-in via configuration. I think Dapr enables all actors (of any type) to be rebalanced via a configuration setting. I also think there should be a configurable timeout governing how long each grain has to wind down before it is migrated.

Also, what is the priority/milestone for implementing this feature? I noticed that #7470 was closed in favour of this proposal and that #7470 was on the roadmap for .NET 7.0 as stated by #7467; however, this proposal is currently on the Backlog milestone rather than the ".NET 7.0" or "Planning-4.0" milestone.
Neither Dapr nor Akka have this functionality. Actors in Akka are not virtual actors: their location is decided up-front and does not change while the actor is active. Dapr uses hash-based routing and does not have a directory, so actor placement cannot be customized, and it cannot migrate an actor from one place to another while preserving its ephemeral state (including pending requests). Dapr's rebalancing behavior is needed because of its lack of a directory: whenever a host joins or leaves the cluster (e.g., a scaling event or an upgrade), the actors necessarily shift around, because the hash ring has changed.

The purpose of this proposal is the ability to migrate grains without first deactivating them, allowing them to preserve state and move atomically from one host to another. If a grain is processing some request, it would at least finish that request before migrating the other requests elsewhere. This is not high priority. There are some benefits to having this functionality, but it's certainly not critical (it's not as if it exists elsewhere). Is there something you're looking for in this feature?
Thanks @ReubenBond. I'm looking for some grains to be migrated to a new silo when it joins the cluster. E.g., if there are two silos and a third joins the cluster due to scaling, 1/3 of the grains from each silo would move to the new silo, thus balancing the load. I'm not concerned about migrating grains without first deactivating them - I'm happy for them to be deactivated, migrated, and then reactivated. I'm only concerned that grains are not relocated to new silos added to the cluster during scaling. That's specifically the functionality I was referring to: migrating existing actors to new nodes added to a cluster during scaling. As you say, the absence of a directory forces Dapr into this behaviour, but that behaviour is desirable for me.
@oising's activation shedding package ought to help with that: https://github.com/oising/OrleansContrib.ActivationShedding
Thanks @ReubenBond! I wasn't aware of that library. You are correct in that we have a fairly fixed set of grains which are active at all times. The load across these grains is highly dynamic: most of the time, most active grains are being drip-fed work to do. However, under some circumstances (unpredictably so), hundreds of those grains can each start fully consuming 4-8 vCPUs for up to a few minutes. Kubernetes scales up the cluster in response to this increased load, at which time we need many grains to migrate to those new nodes to complete the work. Many of the grains needing to be migrated will be in the middle of executing one or more operations when the scaling event occurs (they are all re-entrant). I'll check out the OrleansContrib.ActivationShedding library. Thanks again!
@ReubenBond, upon inspection of OrleansContrib.ActivationShedding, it appears that the mechanism this library uses to migrate grains to a different silo is to deactivate them (specifically, by invoking `DeactivateOnIdle()`). However, it is stated above in this issue that "voluntarily deactivating the grain will not likely cause that grain to be reactivated on a different silo because caches will point to the original silo, which will happily reactivate the grain upon receiving a message." Will Orleans therefore actually reactivate a grain on a different silo after it has been voluntarily deactivated this way?
@ReubenBond out of all of the options for naming, I prefer the terminology you've used: dehydrating vs rehydrating. If my memory serves me correctly, BizTalk also used this term to describe a similar operation for orchestrations.
Orleans 4.0 is different in that regard, but the approach works for Orleans 3.x.
Is there or will there be an approach that works in Orleans 4.0?
Yes, the approach which we plan for 4.0 is an atomic directory update.
Is this proposal described somewhere that I could read to understand it? Will the approach of proactively deactivating a grain with `DeactivateOnIdle()` still work with that design?
@ReubenBond, if a grain is migrated between silos due to rebalancing, what would you expect to happen to its in-progress calls? This will be especially relevant if those calls are long-running (as proposed by #7649) and if the long-running call chains are long. If in-progress calls are cancelled when a grain is migrated, then that would cancel the entire call chain, which means a large amount of in-flight work could be cancelled because an upstream grain is migrated during rebalancing. This would make rebalancing much more disruptive (i.e., impacting many more grains than just those being migrated) in the case where there are long-running method calls and long call chains.
Calls (long-running or not) will need to complete before migration. As shown above, migration follows the regular deactivation path, so this is not related to migration in particular. If calls fail to complete before some timeout elapses, then they'll be rejected. We cannot terminate work (cancellation is cooperative), so if a grain has some runaway work, then it will continue running. We can throw exceptions when the runaway grain tries to access the runtime, though. This proposal is only about the mechanism for grain migration, not the policies. If you don't want a grain to migrate, then there will be some way to specify a policy stating that it must never be migrated.
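For illustration, a cooperative cancellation check inside a grain method might look like the following sketch; the work helpers (`HasMoreWork`, `DoOneChunkAsync`) are hypothetical:

```csharp
// Sketch: cancellation is cooperative - the runtime can only signal, and the
// method body must observe the token. GrainCancellationToken is Orleans'
// mechanism for cancellable grain calls; HasMoreWork and DoOneChunkAsync are
// hypothetical placeholders.
public async Task CrunchAsync(GrainCancellationToken gct)
{
    while (HasMoreWork())
    {
        // A runaway method that never checks the token cannot be stopped.
        gct.CancellationToken.ThrowIfCancellationRequested();
        await DoOneChunkAsync();
    }
}
```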
Thanks @ReubenBond. I figured that was the case. I was thinking that if a grain eligible for migration during rebalancing were informed that Orleans is seeking to migrate it (perhaps via an OnMigrating method call), then the grain could gracefully cancel its long-running operations so that it could be migrated. However, doing so would also mean cancelling any in-progress outbound calls, which would mean cancelling a lot of downstream in-progress work if the grain being migrated were upstream in a long call chain. I'm just making the point that having long-running calls in long call chains runs counter to dynamic/responsive cluster rebalancing. I was considering a design that had deep call chains containing long-running calls, but I require load to be dynamically/quickly rebalanced across the cluster when it is scaled up - which means I need to consider a different design: one where the heavy lifting is decomposed into smaller, shorter-running calls.
How long are these long-running calls that you speak of, and how frequently do you expect the cluster to be rebalancing?
The most upstream grain operations in the call chain would run for up to 5 minutes. They would spend most of their time waiting for downstream work to complete. The calls would be structured as a tree, where each node in the tree might have one minute of CPU-time work to do, but each node would also need to wait for all of its child nodes to complete before completing.

The load is highly dynamic in that most (99.999%) of events received from the outside world are handled with small trees executed over the cluster without much work to do. However, 0.001% of events must be handled with a large tree where each node has a lot of work to do - which means that when one of these events is received, the cluster needs to scale up substantially. I expect that the cluster will need to scale up to deal with these kinds of external events about 10-20 times per month per customer.

Furthermore, the tree is "spun up / scheduled" first, when the external event is received, before the work is executed, to ensure that all interdependencies between tree nodes are known before execution commences. Therefore, all grains are active prior to the heavy lifting beginning. Thus, when the heavy lifting begins and more nodes are added to the cluster, the grains are all busy doing work and/or waiting for child grains to complete their work. If the upstream grains (in the first 5-10 layers of the tree) were to be idled and migrated, then all the downstream work at the lower levels of the tree would also have to be cancelled and restarted.

I have some ideas for dealing with this, which involve a different decomposition of work across grains and grain operations. I'm just pointing out that my original design won't dynamically scale due to long-running calls in long call chains - just in case that might influence the design proposed by this issue.
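For illustration, the tree-shaped fan-out described above might look like the following sketch; all of the grain and helper names are hypothetical:

```csharp
using System;
using System.Collections.Generic;
using System.Linq;
using System.Threading.Tasks;
using Orleans;

// Sketch: each node grain does its own CPU-bound work, then awaits all of its
// children before completing, so upstream calls stay in flight for the
// duration of the whole subtree. All names here are illustrative.
public interface INodeGrain : IGrainWithStringKey
{
    Task ExecuteAsync();
}

public sealed class NodeGrain : Grain, INodeGrain
{
    public async Task ExecuteAsync()
    {
        await DoLocalWorkAsync(); // up to ~1 minute of CPU-bound work per node

        // Fan out to the child nodes; this call cannot complete until the
        // entire subtree below it has completed.
        var childCalls = GetChildIds()
            .Select(id => GrainFactory.GetGrain<INodeGrain>(id).ExecuteAsync());
        await Task.WhenAll(childCalls);
    }

    private static Task DoLocalWorkAsync() => Task.CompletedTask;             // placeholder
    private static IEnumerable<string> GetChildIds() => Array.Empty<string>(); // placeholder
}
```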
I was thinking about this some more over the weekend. What happens if `DeactivateOnIdle()` is called while the grain is executing a call - does the grain deactivate immediately, or after the call completes? What if the grain is re-entrant and has multiple concurrent calls at the time `DeactivateOnIdle()` is called - does Orleans wait for all of them to complete? If the latter is true, then a very busy re-entrant grain that is always executing one or more calls will never actually be deactivated after calling `DeactivateOnIdle()`. Ideally, Orleans will queue all new method calls to a grain after `DeactivateOnIdle()` is called, allowing in-flight calls to drain. The next question is whether Orleans will invoke the proposed migration hooks as part of this; I think it would be very useful if Orleans were to invoke them when the grain is about to be migrated.
Lots of good questions. When `DeactivateOnIdle()` is called, new calls are queued while currently executing calls are allowed to complete - I'm happy to say that the ideal case is what's implemented today. The regular deactivation procedure will be followed (i.e., it will not call the proposed migration hooks).
That's really great to hear! I'm very impressed.
Is there or will there be a mechanism for grains to be notified that Orleans wishes to migrate them to another silo due to cluster rebalancing? I.e., is there or will there be a mechanism for a grain to be informed that it is being deactivated (i.e., that its deactivation is pending)? This would allow a grain involved in long-running method calls to gracefully stop/pause that work so that those in-progress method calls complete more quickly, thus allowing the grain to be deactivated and migrated more quickly.
Hi @ReubenBond - this issue is still in the backlog, although it is a significant one for elastic systems. Is there any view on how it will be progressed? Side question: the ActOp paper reports some significant performance enhancements, and since that work was implemented directly on Orleans, is there any path for it to become an option in the framework? Thanks!
I am hoping to include this in .NET 8 and to start making progress on it soon.
That would be the next step, but I think it's an interesting one. Once this item is in, we'll have the basic mechanism required, and we can consider clever algorithms for auto-optimization like the graph-cutting one presented in ActOp.
* Grain migration
* Review feedback
* Add helper methods to migration state, migrate grain state by default, add IGrainDirectory fallback tentatively
* Fixup last commit
* Perform Migration lookup asynchronously and avoid starvation
* Minor cleanup, mostly drive-by
* Implement support for placement hints in placement directors where it makes some sense
* Add migration tests for Grain<T> and IPersistentState<T>
* Propagate captured RequestContext to dehydration and add fault tests
* Add GrainDirectory tests for atomic update
* RedisGrainDirectory: use a hash instead of serialized blob, update style
* Use JSON instead of hash again - retaining compatibility
* Additional tests and fix LRU to support updating
* Fix doc comment for IGrainMigrationParticipant
* Minor doc comment fix
* Support live migration for PubSubRendezvousGrain, MemoryStreamQueueGrain, and ReminderTableGrain
* Fixes
* Fix test
* Fix test
This is a proposal to implement live workload migration in Orleans. Live grain migration gives grains the capability to move from one host to another without dropping requests or losing in-memory state. The proposal is for a live migration mechanism; the policy, which determines when, where to, and which grains to migrate, is a separate concern, and there may be multiple policies which perform migration.
Motivation
Capacity utilization improvements for rolling upgrades & scale-out
When a silo is added to a cluster, active grains are not moved to the new silo. When Orleans was originally released, it was common to use blue/green deployments (on Azure Cloud Services Worker Roles), so the cluster was much less dynamic than it is today, where rolling deployments are fairly commonplace. Depending on the workload (number of grains, calling patterns), rolling deployments can result in grain distribution becoming heavily skewed towards the silos which were upgraded first. Therefore, some hosts may become heavily loaded while others are under-utilized, and adding additional resources will not promptly reduce utilization of the existing resources. There are two problems here: load remains skewed after rolling upgrades, and newly added capacity does not relieve already-loaded hosts. If we have a mechanism for live migration of grains, we could move a portion of the load on heavily loaded servers to under-utilized servers.
Why deactivation is not enough
Deactivating some portion of grains on the heavily loaded servers is not enough to implement load leveling. We will explore why that is in this section.
Grains are responsible for registering themselves in the grain directory as a part of the activation process. When a grain deactivates, it is also responsible for removing its own registration.
Thus, if a grain calls `DeactivateOnIdle()` today, there will be a window of time where the grain is not registered anywhere, and any silo is free to try to place it somewhere. In most cases, silos will choose different targets and there will be a race for the new activations on each silo to register themselves in the grain directory. Only one activation can win this race, and the others will need to forward pending messages to the winning activation and update their caches.

However, silos cache directory lookups, and those caches aren't immediately invalidated when a grain is deactivated. We no longer require that messages carry an `ActivationId` which must match an existing activation, or that they set the `NewActivation` flag on the message (which has also been removed). Hence, messages sent to the original silo will reactivate the grain without causing a message rejection and cache invalidation. This means that once a frequently messaged grain is activated on a silo, voluntarily deactivating the grain will not likely cause that grain to be reactivated on a different silo because caches will point to the original silo, which will happily reactivate the grain upon receiving a message. The rationale for the changes which led to this behavior is (I believe) sound: it reduces the flurry of rerouting and multiple directory lookups caused by activating a new grain, and it simplifies the process. However, it has the arguable downside of making grain placement stickier than is desirable in all circumstances.
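For concreteness, a minimal sketch of the voluntary-deactivation pattern described above, using `Grain.DeactivateOnIdle()`; the grain interface and the load check are illustrative:

```csharp
// Sketch: shedding load via voluntary deactivation using Grain.DeactivateOnIdle().
// As described above, this is unreliable for re-placement: remote directory
// caches still point at this silo, so the next message will usually just
// reactivate the grain here. IWorkerGrain and the load check are illustrative.
public sealed class WorkerGrain : Grain, IWorkerGrain
{
    public Task DoWorkAsync()
    {
        // ... handle the request ...

        if (ThisSiloIsOverloaded())
        {
            // Deactivates this activation once currently executing calls
            // complete, removing its directory registration and opening the
            // race window described above.
            DeactivateOnIdle();
        }

        return Task.CompletedTask;
    }

    private static bool ThisSiloIsOverloaded() => false; // hypothetical load signal
}
```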
Soft-state preservation during upgrades
Some customers have told us that they would like to preserve running grain state during migrations.
This tends to be soft state: state which can be reconstructed (perhaps loaded from a database) at a cost.
If a live grain migration mechanism includes a facility to preserve soft state across a migration, then that could be utilized by these customers to preserve soft state.
Additionally, it can be utilized by grain state providers to avoid needing to reload state from the database during the migration, lessening database load.
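For illustration, a minimal sketch of the kind of soft state in question, assuming an Orleans 4.x-style `OnActivateAsync(CancellationToken)` override; the grain, interface, and loader names are hypothetical:

```csharp
// Sketch: the kind of soft state this section describes - reconstructible,
// but at a cost. Without state transfer, every move to a new silo pays the
// reload below. The grain, interface, and loader are illustrative.
public interface IOrderStatsGrain : IGrainWithGuidKey { }

public sealed class OrderStatsGrain : Grain, IOrderStatsGrain
{
    private Dictionary<DateOnly, int> _ordersByDay = new();

    public override async Task OnActivateAsync(CancellationToken cancellationToken)
    {
        // Soft state: recoverable from the database, but expensive to rebuild.
        _ordersByDay = await LoadCountsFromDatabaseAsync();
        await base.OnActivateAsync(cancellationToken);
    }

    private static Task<Dictionary<DateOnly, int>> LoadCountsFromDatabaseAsync()
        => Task.FromResult(new Dictionary<DateOnly, int>()); // placeholder
}
```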
Self-optimizing placement
Previous research (referred to as ActOp) into optimizing actor-based systems (in particular, Orleans) indicates that substantial benefits can be obtained by co-locating grains which frequently communicate with each other. The ActOp research involved implementing an efficient distributed graph-partitioning algorithm to incrementally optimize inter-host RPC calls by co-locating the groups of grains which communicate with each other the most, without violating load-levelling constraints. In the Halo Presence scenario which the paper targeted, experiments showed a decrease from around 90% of calls being remote (using random placement) to around 12% of calls being remote. As a result, 99th percentile latency decreased by more than 3x and throughput increased by 2x. The improvements would vary substantially based upon cluster size and the particulars of the workload, but they demonstrate the potential gains.
Summary of the functionality

* Requests which cannot be forwarded during a migration will fail with a `SiloUnavailableException` on the call chain. This issue exists today with activation races and with clients calling gateways, but currently such requests are left to time out.
* Migrating grains follow the regular deactivation path (e.g., `OnDeactivateAsync`).

Design
The grain lifecycle will be modified with two new hooks, for initiating and completing migration. Any process can crash at any point; that fact is omitted below for simplicity.
The grain class and components injected into the grain can enlist in a per-activation migration lifecycle hook:
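A sketch of what enlisting in such a hook could look like, based on the `IGrainMigrationParticipant` interface named in the commit list above; the context member signatures (`TryAddValue`/`TryGetValue`) and the grain names are assumptions for illustration:

```csharp
// Sketch: a grain that carries soft state across a migration by enlisting in
// the per-activation migration hook. IGrainMigrationParticipant is named in
// the commit list above; the context member signatures (TryAddValue /
// TryGetValue) and the grain names are assumptions.
public sealed class ShoppingCartGrain : Grain, IShoppingCartGrain, IGrainMigrationParticipant
{
    private Dictionary<string, int> _quantitiesByItemId = new();

    public void OnDehydrate(IDehydrationContext context)
    {
        // Capture in-memory state into the migration payload on the source silo.
        context.TryAddValue("cart", _quantitiesByItemId);
    }

    public void OnRehydrate(IRehydrationContext context)
    {
        // Restore the state on the destination silo, if it was transferred.
        if (context.TryGetValue("cart", out Dictionary<string, int>? cart) && cart is not null)
        {
            _quantitiesByItemId = cart;
        }
    }
}
```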
When `OnDeactivateAsync` is called, the reason will be set to 'Dehydrating' so that app code can decide whether or not to persist state. Any developer performing a state write in `OnDeactivateAsync` either doesn't require that the state is written, or doesn't realize that a process can crash at any moment and that they therefore cannot rely on `OnDeactivateAsync` being called to preserve state.

Discussion: What terminology should be used here? This migration functionality could be used to put a grain into storage (e.g., cached on disk or in a database), so if it were repurposed for that, then maybe a more generic terminology should be used - e.g., dehydrating vs rehydrating, hibernating vs waking (estivate is the reverse of hibernate, but that's a very esoteric word), pause vs resume, or freeze vs thaw. Python uses the terms pickle & unpickle, but in the context of serialization, which this involves but which isn't the purpose. Personally, I favor dehydrate vs rehydrate, since we are using the analogue of grains. A little research shows that preserving grains for long periods largely involves removing oxygen, reducing moisture, and reducing temperature.
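Returning to the deactivation reason above, a sketch of how app code might branch on it, assuming an Orleans 4.x-style `OnDeactivateAsync(DeactivationReason, CancellationToken)` override; the `Dehydrating` enum member is the proposal's terminology, not a confirmed API name:

```csharp
// Sketch: skip the durable write when the grain is only being dehydrated for
// migration, since in-memory state travels with the activation. The
// 'Dehydrating' reason code follows the proposal's wording and may not match
// the final enum member; _state is an injected IPersistentState<T>.
public override async Task OnDeactivateAsync(DeactivationReason reason, CancellationToken cancellationToken)
{
    if (reason.ReasonCode != DeactivationReasonCode.Dehydrating)
    {
        // A true deactivation: flush soft state to durable storage.
        await _state.WriteStateAsync();
    }

    await base.OnDeactivateAsync(reason, cancellationToken);
}
```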
Enhancements / future work
See: [Idea] Salvage Activations from Defunct Directory Partitions #2656
Pausing the processing of background tasks is not straightforward: they can capture the activation's `TaskScheduler` and schedule any work they like on it. Perhaps we deem this acceptable, but a better solution would be to have a more resilient in-memory directory.
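A sketch of the capture problem just described; the loop body is a hypothetical placeholder:

```csharp
// Sketch: why pausing background work is hard. Grain code can capture the
// activation's TaskScheduler and keep scheduling work on it, outside the
// runtime's control. LoopBodyAsync is a hypothetical placeholder.
public Task StartBackgroundLoop()
{
    // Inside a grain call, TaskScheduler.Current is the activation's scheduler.
    var activationScheduler = TaskScheduler.Current;

    _ = Task.Factory.StartNew(
        async () =>
        {
            while (true)
            {
                await LoopBodyAsync();
                await Task.Delay(TimeSpan.FromSeconds(5));
            }
        },
        CancellationToken.None,
        TaskCreationOptions.None,
        activationScheduler); // work keeps landing on the activation's scheduler

    return Task.CompletedTask;
}

private static Task LoopBodyAsync() => Task.CompletedTask; // placeholder
```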
Related: #4694