Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Live grain migration (#7692) #8452

Merged
merged 19 commits into from
Jun 6, 2023

Conversation

ReubenBond
Copy link
Member

@ReubenBond ReubenBond commented May 25, 2023

This PR implements live grain migration and resolves #7692

It contains only the mechanism for migration for now, no policies for automatic grain migration. The intention is to introduce those in separate PRs. The mechanism is accessible in two ways:

  1. Within a grain, call this.MigrateOnIdle()
  2. With a grain reference, call grain.Cast<IGrainManagementExtension>().MigrateOnIdle()

Migration supports preserving in-memory state without needing to refresh from storage. To access this functionality, grains can implement IGrainMigrationParticipant and components can inject IGrainContext, access ObservableLifecycle, and use AddGrainMigrationParticipant. I do not intend for grains to implement IGrainMigrationParticipant and the relevant interfaces are optimized for performance over usability. Instead, components of the grain which know how to preserve their state should implement the interface. The relevant interfaces here are:

public interface IGrainMigrationParticipant
{
    /// <summary>
    /// Called on the original activation when migration is initiated, before <see cref="IGrainBase.OnDeactivateAsync(DeactivationReason, CancellationToken)"/>.
    /// The participant can access and update the dehydration context.
    /// </summary>
    /// <param name="dehydrationContext"></param>
    void OnDehydrate(IDehydrationContext dehydrationContext);

    /// <summary>
    /// Called on the new activation after a migration, before <see cref="IGrainBase.OnActivateAsync(CancellationToken)"/>.
    /// The participant can restore state from the migration context.
    /// </summary>
    /// <param name="rehydrationContext">The rehydration context.</param>
    void OnRehydrate(IRehydrationContext rehydrationContext);
}

/// <summary>
/// Records the state of a grain activation for migration.
/// </summary>
public interface IDehydrationContext
{
    /// <summary>
    /// Gets the keys in the context.
    /// </summary>
    IEnumerable<string> Keys { get; }

    /// <summary>
    /// Adds a value to the dehydration context, associated with the provided key.
    /// </summary>
    /// <param name="key">The key.</param>
    /// <param name="value">The value.</param>
    void Add(string key, ReadOnlySpan<byte> value);

    /// <summary>
    /// Adds a value to the dehydration context, associated with the provided key.
    /// </summary>
    /// <param name="key">The key.</param>
    /// <param name="valueWriter">A delegate used to write the provided value to the context.</param>
    /// <param name="value">The value to provide to <paramref name="valueWriter"/>.</param>
    void Add(string key, Action<object, IBufferWriter<byte>> valueWriter, object value);
}

/// <summary>
/// Contains the state of a grain activation for migration.
/// </summary>
public interface IRehydrationContext
{
    /// <summary>
    /// Gets the keys in the context.
    /// </summary>
    IEnumerable<string> Keys { get; }

    /// <summary>
    /// Tries to get a value from the rehydration context, associated with the provided key.
    /// </summary>
    /// <param name="key">The key.</param>
    /// <param name="value">The value, if present.</param>
    /// <returns><see langword="true"/> if the key exists in the context, otherwise <see langword="false"/>.</returns>
    bool TryGetValue(string key, out ReadOnlySequence<byte> value);
}

The lifecycle of a grain therefore becomes:

stateDiagram-v2
    direction LR
    [*] --> Constructor
    Constructor --> OnActivateAsync
    Constructor --> OnRehydrate
    OnActivateAsync --> Running
    OnRehydrate --> OnActivateAsync 
    OnDeactivateAsync --> OnDehydrate
    Running --> OnDeactivateAsync 
    OnDeactivateAsync --> DisposeAsync
    OnDehydrate --> DisposeAsync
    DisposeAsync --> [*]
Loading

Note that rehydration occurs before OnActivateAsync is called and dehydration occurs after OnDeactivateAsync is called.

This PR also involves adding an additional parameter to IGrainDirectory.Register so that registrations can be updated (ideally) atomically.

TODO before merge:

  • Split PooledBuffer into a separate PR (Make PooledArrayBufferWriter more versatile, rename #8453)
  • Add additional test coverage
  • Add default implementation for new IGrainDirectory.Register overload (make it a new overload) which calls unregister+register in sequence
  • Add specific IGrainDirectory.Register tests
  • Implement IGrainMigrationParticipant for storage providers to preserve state
Microsoft Reviewers: Open in CodeFlow

@ReubenBond ReubenBond added this to the .NET 8 Planning milestone May 25, 2023
@ReubenBond ReubenBond force-pushed the feature/grain-migration/1 branch 5 times, most recently from 881e14e to 9f97751 Compare May 25, 2023 18:33
@ReubenBond ReubenBond changed the title Live grain migration Live grain migration (#7692) May 25, 2023
@@ -17,7 +17,7 @@ public interface IGrainDirectory
/// </summary>
/// <param name="address">The <see cref="GrainAddress"/> to register</param>
/// <returns>The <see cref="GrainAddress"/> that is effectively registered in the directory.</returns>
Task<GrainAddress> Register(GrainAddress address);
Task<GrainAddress> Register(GrainAddress address, GrainAddress? previousAddress);
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Would it be better to introduce another interface for grain directories that support migration - that way this wouldn't be a breaking change for people with custom grain directories.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think we can afford having breaking changes (I don't expect a lot of people having a custom directory...)

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I was thinking of using a default interface method which falls back to unregister+register

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Having a directory that does not implement the "replace" atomically will produce some race conditions during migrations though. Not sure if it is really a big deal, but on some very active grains this migration could fail often.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The race is acceptable: if the destination grain loses the race, it drops the state and deactivate to migrate. It's the same as a regular activation race today. I am also happy for us to make a breaking change instead.

var newLocation = await placementService.PlaceGrainAsync(GrainId, requestContext, PlacementStrategy);
if (newLocation == Address.SiloAddress || newLocation is null)
{
// No more appropriate silo was selected for this grain, but we cannot cancel migration now, so we must continue to deactivate and reactivate the grain.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What protections do we have around a grain continually calling MigrateOnIdle (because of some condition it detects), migration being attempted but ends on the same place, and the process begins again. Should there be some kind of "cool off" for migration requests to help avoid this from happening (I'm assuming that the cost of migration is in the serialization of state in preparation for migration?).

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is the first thing which happens when migrating, so the cost is just the cost of running placement, which is cheap.
I've updated the comment, since this one is now inaccurate: the migration will be canceled, and the grain will not be deactivated.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I didn't answer your actual point: do we want to somehow prevent the grain code from spamming MigrateOnIdle()? Maybe. I'm not sure what the behavior should be. It seems like throwing wouldn't be useful. We could make the method async and return ValueTask<bool> or something else (with an implementation/design change). Something makes me uncomfortable about that. Another option is to ignore migration requests if they fall within some time period, but I think the added complexity is not worth it for the small benefit that you gain (not re-running placement, which is typically cheap).

void RemoveMigrationParticipant(IGrainMigrationParticipant participant);
}

public interface IGrainMigrationParticipant
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I am wondering whether we should rename these events and add some more:

  • OnDehyrdrating
  • OnDehydrated
  • OnRehydrating
  • OnRehydrated

Even if we don't add the additional events I think OnDehydrating and OnRehydrating are more descriptive temporaly speaking in terms of WHEN they are fired.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm ok with using different tenses. Can you think of a use case for using past/future tense instead of present tense?

To me, past/future tense make the interface seem more like it's for (passive) observers than for (active) participants. I could be in the minority. I also find it slightly confusing: as an implementer, what should I be doing in each of those methods? Should either of them be asynchronous?

Maybe the interface should be renamed to IGrainLifecycleParticipant or IGrainLifecycleFilter and it can be used as a place to add post-activation hooks, etc, satisfying #4519

As a counter argument (to my argument), you could take precedence from .NET BinaryFormatter which had OnSerializing/OnSerialized/OnDeserializing/OnDeserialized hooks - then again, those aren't the actual serialize/deserialize methods, they are there as some kind of pre/post processing. eg, to grab something from the serialization context after the fields have been deserialized in order to populate some value.

In Orleans, the grain base class has overridable OnActivateAsync and OnDeactivateAsync methods for the existing lifecycle events, which sets some precedence.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

My issue with present tense is that it is almost never true. You either fire these callbacks before or after something happens and if it fires in the middle it usually makes a case for it being something more specific. I don't think we need both callback methods but I think its good to communicate WHEN something occurs before or after so that there is no doubt.

Fair point about the existing OnActivateAsync and OnDeactivateAsync. I think provided its documented WHEN it occurs it should be fine.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@mitchdenny thank you for the feedback, I've added some doc comments to document when the call is made. Hopefully, these fit the bill:

public interface IGrainMigrationParticipant
{
    /// <summary>
    /// Called on the original activation when migration is initiated, after <see cref="IGrainBase.OnDeactivateAsync(DeactivationReason, CancellationToken)"/> completes.
    /// The participant can access and update the dehydration context.
    /// </summary>
    /// <param name="dehydrationContext"></param>
    void OnDehydrate(IDehydrationContext dehydrationContext);

    /// <summary>
    /// Called on the new activation after a migration, before <see cref="IGrainBase.OnActivateAsync(CancellationToken)"/> is called.
    /// The participant can restore state from the migration context.
    /// </summary>
    /// <param name="rehydrationContext">The rehydration context.</param>
    void OnRehydrate(IRehydrationContext rehydrationContext);
}

test/Benchmarks/Ping/PingBenchmark.cs Outdated Show resolved Hide resolved
@@ -68,6 +71,7 @@ public PingBenchmark(int numSilos, bool startClient, bool grainsOnSecondariesOnl
}
});

//hostBuilder.ConfigureLogging(logging => logging.AddConsole().SetMinimumLevel(LogLevel.Debug));
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Intentional?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Fixed

/// <summary>
/// Records the state of a grain activation for migration.
/// </summary>
public interface IDehydrationContext
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should we have extensions method for Adding/Getting value from the context that uses Orleans serializer?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That sounds useful. In that case, maybe we should put IServiceProvider or IGrainContext on the IDe/RehydrationContext. I was thinking that this should mostly be used by grain storage and that should use the storage serializer.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Added in latest commit. Now, the interfaces look like this:

/// <summary>
/// Records the state of a grain activation which is in the process of being dehydrated for migration to another location.
/// </summary>
public interface IDehydrationContext
{
    /// <summary>
    /// Gets the keys in the context.
    /// </summary>
    IEnumerable<string> Keys { get; }

    /// <summary>
    /// Adds a sequence of bytes to the dehydration context, associating the sequence with the provided key.
    /// </summary>
    /// <param name="key">The key.</param>
    /// <param name="value">The value.</param>
    void AddBytes(string key, ReadOnlySpan<byte> value);

    /// <summary>
    /// Adds a sequence of bytes to the dehydration context, associating the sequence with the provided key.
    /// </summary>
    /// <param name="key">The key.</param>
    /// <param name="valueWriter">A delegate used to write the provided value to the context.</param>
    /// <param name="value">The value to provide to <paramref name="valueWriter"/>.</param>
    void AddBytes<T>(string key, Action<T, IBufferWriter<byte>> valueWriter, T value);

    /// <summary>
    /// Attempts to a value to the dehydration context, associated with the provided key, serializing it using <see cref="Orleans.Serialization.Serializer"/>.
    /// If a serializer is found for the value, and the key has not already been added, then the value is added and the method returns <see langword="true"/>.
    /// If no serializer exists or the key has already been added, then the value is not added and the method returns <see langword="false"/>.
    /// </summary>
    /// <param name="key">The key.</param>
    /// <param name="value">The value to add.</param>
    bool TryAddValue<T>(string key, T? value);
}

/// <summary>
/// Contains the state of a grain activation which is in the process of being rehydrated after moving from another location.
/// </summary>
public interface IRehydrationContext
{
    /// <summary>
    /// Gets the keys in the context.
    /// </summary>
    IEnumerable<string> Keys { get; }

    /// <summary>
    /// Tries to get a sequence of bytes from the rehydration context, associated with the provided key.
    /// </summary>
    /// <param name="key">The key.</param>
    /// <param name="value">The value, if present.</param>
    /// <returns><see langword="true"/> if the key exists in the context, otherwise <see langword="false"/>.</returns>
    bool TryGetBytes(string key, out ReadOnlySequence<byte> value);

    /// <summary>
    /// Tries to get a value from the rehydration context, associated with the provided key, deserializing it using <see cref="Orleans.Serialization.Serializer"/>.
    /// If a serializer is found for the value, and the key is present, then the value is deserialized and the method returns <see langword="true"/>.
    /// If no serializer exists or the key has already been added, then the value is not added and the method returns <see langword="false"/>.
    /// </summary>
    /// <param name="key">The key.</param>
    /// <param name="value">The value, if present.</param>
    /// <returns><see langword="true"/> if the key exists in the context, otherwise <see langword="false"/>.</returns>
    bool TryGetValue<T>(string key, out T? value);
}

{
if (TryGetGrainContext(grainId, out var result))
{
rehydrationContext?.Dispose();
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think we should at least log a warning (maybe throw?) if rehydrationContext isn't null in this case

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It would be benign - the possibility of rehydration is lost since the grain exists, but the silo sending the context may have just been declared dead. Maybe a counter would be appropriate. "rehydration-attempts-dropped" or something

@ReubenBond
Copy link
Member Author

@mitchdenny & @benjaminpetit, please take another look

@ReubenBond ReubenBond force-pushed the feature/grain-migration/1 branch 5 times, most recently from f2a4237 to 466db1c Compare June 5, 2023 20:40
@ReubenBond
Copy link
Member Author

I believe I have addressed feedback adequately and have completed the original TODO items (largely, adding test coverage in various places). Please give this a final review and squash+merge when ready, @benjaminpetit

@ReubenBond ReubenBond mentioned this pull request Jun 5, 2023
@ReubenBond ReubenBond force-pushed the feature/grain-migration/1 branch from d189925 to 6fcf04c Compare June 5, 2023 22:32
@galvesribeiro
Copy link
Member

Awesome! I see it being used on multiple use-cases, specially in game servers! :shipit:

/// Attempts to migrate the current instance to a new location once it becomes idle.
/// </summary>
/// <returns>A <see cref="Task"/> which represents the method call.</returns>
ValueTask MigrateOnIdle();
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I wonder if we should call it TryMigrateOnIdle since it's not guaranteed, if people don't read the doc :p

Copy link
Member Author

@ReubenBond ReubenBond Jun 6, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good point. It not returning a bool is my only issue, but I suppose we could make it async and make it return a Task<bool> or Task<SiloAddress> (prefer not this). Then again, would those be actionable results?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Na it's fine for now imo


void IGrainMigrationParticipant.OnDehydrate(IDehydrationContext dehydrationContext)
{
dehydrationContext.TryAddValue("queue", _eventQueue);
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Not important right now: we should log when adding the _eventQueue failed, just to help debugging/catch some bugs

Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

[Proposal] Grain migration
4 participants