-
Notifications
You must be signed in to change notification settings - Fork 2k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Live grain migration (#7692) #8452
Live grain migration (#7692) #8452
Conversation
881e14e
to
9f97751
Compare
@@ -17,7 +17,7 @@ public interface IGrainDirectory | |||
/// </summary> | |||
/// <param name="address">The <see cref="GrainAddress"/> to register</param> | |||
/// <returns>The <see cref="GrainAddress"/> that is effectively registered in the directory.</returns> | |||
Task<GrainAddress> Register(GrainAddress address); | |||
Task<GrainAddress> Register(GrainAddress address, GrainAddress? previousAddress); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Would it be better to introduce another interface for grain directories that support migration - that way this wouldn't be a breaking change for people with custom grain directories.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think we can afford having breaking changes (I don't expect a lot of people having a custom directory...)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I was thinking of using a default interface method which falls back to unregister+register
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Having a directory that does not implement the "replace" atomically will produce some race conditions during migrations though. Not sure if it is really a big deal, but on some very active grains this migration could fail often.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The race is acceptable: if the destination grain loses the race, it drops the state and deactivate to migrate. It's the same as a regular activation race today. I am also happy for us to make a breaking change instead.
var newLocation = await placementService.PlaceGrainAsync(GrainId, requestContext, PlacementStrategy); | ||
if (newLocation == Address.SiloAddress || newLocation is null) | ||
{ | ||
// No more appropriate silo was selected for this grain, but we cannot cancel migration now, so we must continue to deactivate and reactivate the grain. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
What protections do we have around a grain continually calling MigrateOnIdle (because of some condition it detects), migration being attempted but ends on the same place, and the process begins again. Should there be some kind of "cool off" for migration requests to help avoid this from happening (I'm assuming that the cost of migration is in the serialization of state in preparation for migration?).
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is the first thing which happens when migrating, so the cost is just the cost of running placement, which is cheap.
I've updated the comment, since this one is now inaccurate: the migration will be canceled, and the grain will not be deactivated.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I didn't answer your actual point: do we want to somehow prevent the grain code from spamming MigrateOnIdle()
? Maybe. I'm not sure what the behavior should be. It seems like throwing wouldn't be useful. We could make the method async and return ValueTask<bool>
or something else (with an implementation/design change). Something makes me uncomfortable about that. Another option is to ignore migration requests if they fall within some time period, but I think the added complexity is not worth it for the small benefit that you gain (not re-running placement, which is typically cheap).
void RemoveMigrationParticipant(IGrainMigrationParticipant participant); | ||
} | ||
|
||
public interface IGrainMigrationParticipant |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I am wondering whether we should rename these events and add some more:
- OnDehyrdrating
- OnDehydrated
- OnRehydrating
- OnRehydrated
Even if we don't add the additional events I think OnDehydrating and OnRehydrating are more descriptive temporaly speaking in terms of WHEN they are fired.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm ok with using different tenses. Can you think of a use case for using past/future tense instead of present tense?
To me, past/future tense make the interface seem more like it's for (passive) observers than for (active) participants. I could be in the minority. I also find it slightly confusing: as an implementer, what should I be doing in each of those methods? Should either of them be asynchronous?
Maybe the interface should be renamed to IGrainLifecycleParticipant
or IGrainLifecycleFilter
and it can be used as a place to add post-activation hooks, etc, satisfying #4519
As a counter argument (to my argument), you could take precedence from .NET BinaryFormatter which had OnSerializing/OnSerialized/OnDeserializing/OnDeserialized hooks - then again, those aren't the actual serialize/deserialize methods, they are there as some kind of pre/post processing. eg, to grab something from the serialization context after the fields have been deserialized in order to populate some value.
In Orleans, the grain base class has overridable OnActivateAsync
and OnDeactivateAsync
methods for the existing lifecycle events, which sets some precedence.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
My issue with present tense is that it is almost never true. You either fire these callbacks before or after something happens and if it fires in the middle it usually makes a case for it being something more specific. I don't think we need both callback methods but I think its good to communicate WHEN something occurs before or after so that there is no doubt.
Fair point about the existing OnActivateAsync and OnDeactivateAsync. I think provided its documented WHEN it occurs it should be fine.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@mitchdenny thank you for the feedback, I've added some doc comments to document when the call is made. Hopefully, these fit the bill:
public interface IGrainMigrationParticipant
{
/// <summary>
/// Called on the original activation when migration is initiated, after <see cref="IGrainBase.OnDeactivateAsync(DeactivationReason, CancellationToken)"/> completes.
/// The participant can access and update the dehydration context.
/// </summary>
/// <param name="dehydrationContext"></param>
void OnDehydrate(IDehydrationContext dehydrationContext);
/// <summary>
/// Called on the new activation after a migration, before <see cref="IGrainBase.OnActivateAsync(CancellationToken)"/> is called.
/// The participant can restore state from the migration context.
/// </summary>
/// <param name="rehydrationContext">The rehydration context.</param>
void OnRehydrate(IRehydrationContext rehydrationContext);
}
@@ -68,6 +71,7 @@ public PingBenchmark(int numSilos, bool startClient, bool grainsOnSecondariesOnl | |||
} | |||
}); | |||
|
|||
//hostBuilder.ConfigureLogging(logging => logging.AddConsole().SetMinimumLevel(LogLevel.Debug)); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Intentional?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Fixed
/// <summary> | ||
/// Records the state of a grain activation for migration. | ||
/// </summary> | ||
public interface IDehydrationContext |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Should we have extensions method for Adding/Getting value from the context that uses Orleans serializer?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
That sounds useful. In that case, maybe we should put IServiceProvider or IGrainContext on the IDe/RehydrationContext. I was thinking that this should mostly be used by grain storage and that should use the storage serializer.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Added in latest commit. Now, the interfaces look like this:
/// <summary>
/// Records the state of a grain activation which is in the process of being dehydrated for migration to another location.
/// </summary>
public interface IDehydrationContext
{
/// <summary>
/// Gets the keys in the context.
/// </summary>
IEnumerable<string> Keys { get; }
/// <summary>
/// Adds a sequence of bytes to the dehydration context, associating the sequence with the provided key.
/// </summary>
/// <param name="key">The key.</param>
/// <param name="value">The value.</param>
void AddBytes(string key, ReadOnlySpan<byte> value);
/// <summary>
/// Adds a sequence of bytes to the dehydration context, associating the sequence with the provided key.
/// </summary>
/// <param name="key">The key.</param>
/// <param name="valueWriter">A delegate used to write the provided value to the context.</param>
/// <param name="value">The value to provide to <paramref name="valueWriter"/>.</param>
void AddBytes<T>(string key, Action<T, IBufferWriter<byte>> valueWriter, T value);
/// <summary>
/// Attempts to a value to the dehydration context, associated with the provided key, serializing it using <see cref="Orleans.Serialization.Serializer"/>.
/// If a serializer is found for the value, and the key has not already been added, then the value is added and the method returns <see langword="true"/>.
/// If no serializer exists or the key has already been added, then the value is not added and the method returns <see langword="false"/>.
/// </summary>
/// <param name="key">The key.</param>
/// <param name="value">The value to add.</param>
bool TryAddValue<T>(string key, T? value);
}
/// <summary>
/// Contains the state of a grain activation which is in the process of being rehydrated after moving from another location.
/// </summary>
public interface IRehydrationContext
{
/// <summary>
/// Gets the keys in the context.
/// </summary>
IEnumerable<string> Keys { get; }
/// <summary>
/// Tries to get a sequence of bytes from the rehydration context, associated with the provided key.
/// </summary>
/// <param name="key">The key.</param>
/// <param name="value">The value, if present.</param>
/// <returns><see langword="true"/> if the key exists in the context, otherwise <see langword="false"/>.</returns>
bool TryGetBytes(string key, out ReadOnlySequence<byte> value);
/// <summary>
/// Tries to get a value from the rehydration context, associated with the provided key, deserializing it using <see cref="Orleans.Serialization.Serializer"/>.
/// If a serializer is found for the value, and the key is present, then the value is deserialized and the method returns <see langword="true"/>.
/// If no serializer exists or the key has already been added, then the value is not added and the method returns <see langword="false"/>.
/// </summary>
/// <param name="key">The key.</param>
/// <param name="value">The value, if present.</param>
/// <returns><see langword="true"/> if the key exists in the context, otherwise <see langword="false"/>.</returns>
bool TryGetValue<T>(string key, out T? value);
}
{ | ||
if (TryGetGrainContext(grainId, out var result)) | ||
{ | ||
rehydrationContext?.Dispose(); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think we should at least log a warning (maybe throw?) if rehydrationContext
isn't null in this case
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It would be benign - the possibility of rehydration is lost since the grain exists, but the silo sending the context may have just been declared dead. Maybe a counter would be appropriate. "rehydration-attempts-dropped" or something
@mitchdenny & @benjaminpetit, please take another look |
f2a4237
to
466db1c
Compare
I believe I have addressed feedback adequately and have completed the original TODO items (largely, adding test coverage in various places). Please give this a final review and squash+merge when ready, @benjaminpetit |
…, add IGrainDirectory fallback tentatively
… makes some sense
…ain, and ReminderTableGrain
d189925
to
6fcf04c
Compare
Awesome! I see it being used on multiple use-cases, specially in game servers! |
/// Attempts to migrate the current instance to a new location once it becomes idle. | ||
/// </summary> | ||
/// <returns>A <see cref="Task"/> which represents the method call.</returns> | ||
ValueTask MigrateOnIdle(); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I wonder if we should call it TryMigrateOnIdle
since it's not guaranteed, if people don't read the doc :p
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Good point. It not returning a bool
is my only issue, but I suppose we could make it async and make it return a Task<bool>
or Task<SiloAddress>
(prefer not this). Then again, would those be actionable results?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Na it's fine for now imo
|
||
void IGrainMigrationParticipant.OnDehydrate(IDehydrationContext dehydrationContext) | ||
{ | ||
dehydrationContext.TryAddValue("queue", _eventQueue); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Not important right now: we should log when adding the _eventQueue failed, just to help debugging/catch some bugs
This PR implements live grain migration and resolves #7692
It contains only the mechanism for migration for now, no policies for automatic grain migration. The intention is to introduce those in separate PRs. The mechanism is accessible in two ways:
this.MigrateOnIdle()
grain.Cast<IGrainManagementExtension>().MigrateOnIdle()
Migration supports preserving in-memory state without needing to refresh from storage. To access this functionality, grains can implement
IGrainMigrationParticipant
and components can injectIGrainContext
, accessObservableLifecycle
, and useAddGrainMigrationParticipant
. I do not intend for grains to implementIGrainMigrationParticipant
and the relevant interfaces are optimized for performance over usability. Instead, components of the grain which know how to preserve their state should implement the interface. The relevant interfaces here are:The lifecycle of a grain therefore becomes:
Note that rehydration occurs before
OnActivateAsync
is called and dehydration occurs afterOnDeactivateAsync
is called.This PR also involves adding an additional parameter to IGrainDirectory.Register so that registrations can be updated (ideally) atomically.
TODO before merge:
PooledArrayBufferWriter
more versatile, rename #8453)IGrainMigrationParticipant
for storage providers to preserve stateMicrosoft Reviewers: Open in CodeFlow