Dedicated control loops per Indexer #811

Closed · Tracked by #300 · Fixed by #866

morgsmccauley opened this issue Jun 18, 2024 · 2 comments

Comments

@morgsmccauley (Collaborator) commented Jun 18, 2024

Our current control loop manages all Indexers, which is inconvenient for the following reasons:

  1. Individual Indexer tasks cannot block - any blocking task would block all Indexers. This specifically affects de-/provisioning, where we must start the task and then poll it on every subsequent loop.
  2. It's slow - Indexers are processed serially, so as the number of Indexers grows, the loop gets slower and slower.
  3. Managing the Indexer lifecycle is complicated - we currently have the New/Existing/Deleted states, but an Indexer has many more states than that, and it's hard to retrofit them in a non-blocking way.
  4. It's error prone - all errors must be handled gracefully, otherwise we risk affecting the synchronisation of all other Indexers.

I feel we have outgrown the current control loop. Rather than have a single control loop for all Indexers, I'm thinking we can have a dedicated loop for each of them: spawn a new task per Indexer, which then manages its own lifecycle. Each Indexer is then free to wait for as long as it wants without impacting other Indexers. This would allow us to handle the blocking provisioning step much more elegantly, and would also parallelise the computation.
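
Roughly, I'm imagining something like the following (a minimal sketch assuming a Tokio runtime; `IndexerConfig` and `run_lifecycle` are illustrative stand-ins, not actual types):

```rust
use std::collections::HashMap;
use std::time::Duration;

use tokio::task::JoinHandle;

// Illustrative only: stands in for whatever per-Indexer config the registry gives us.
#[derive(Clone)]
struct IndexerConfig {
    account_id: String,
    function_name: String,
}

// Each Indexer drives its own loop: provision, synchronise, retry, etc.
// Waiting here only delays this Indexer, never the others.
async fn run_lifecycle(config: IndexerConfig) {
    loop {
        // ... progress this Indexer's lifecycle using `config` ...
        tokio::time::sleep(Duration::from_secs(1)).await;
    }
}

// Spawn one dedicated task per Indexer, keyed so we can later stop/await them.
fn spawn_control_loops(indexers: Vec<IndexerConfig>) -> HashMap<String, JoinHandle<()>> {
    indexers
        .into_iter()
        .map(|config| {
            let key = format!("{}/{}", config.account_id, config.function_name);
            (key, tokio::spawn(run_lifecycle(config)))
        })
        .collect()
}
```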

@pkudinov (Collaborator) commented:

Add a Grafana metric for the time each control loop iteration takes, and use that to determine the priority of this ticket.
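
Something like the following could work, assuming the coordinator exports Prometheus metrics for Grafana to scrape (the metric name and wiring here are illustrative):

```rust
use prometheus::{register_histogram, Histogram};

lazy_static::lazy_static! {
    // Illustrative metric name; the real registry/exporter wiring lives elsewhere.
    static ref CONTROL_LOOP_DURATION: Histogram = register_histogram!(
        "coordinator_control_loop_duration_seconds",
        "Time taken by one full control loop iteration"
    )
    .unwrap();
}

fn timed_control_loop_iteration() {
    let timer = CONTROL_LOOP_DURATION.start_timer();
    // ... synchronise all Indexers ...
    timer.observe_duration();
}
```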

@morgsmccauley (Collaborator, Author) commented:


The main concern here isn't speed; it's the complexity involved in handling multiple Indexers, and transitioning those Indexers through different states, within a single loop. Having a dedicated loop for each Indexer would make managing the lifecycle much easier, which would in turn make schema editing easier.

@morgsmccauley morgsmccauley linked a pull request Jul 12, 2024 that will close this issue
@morgsmccauley morgsmccauley self-assigned this Jul 12, 2024
morgsmccauley added a commit that referenced this issue Jul 18, 2024
This PR introduces dedicated/self-contained control loops per Indexer,
replacing the single/combined control loop. The motivation for this ticket
is described in #811; you can read more about it there.

Overall, there is lots of clean up to be done, but I wanted to get this
out the door as quickly as possible so as not to block the features that
need to be built on top of it. I've discussed some of the major concerns
below.

## `LifecycleManager`
These dedicated control loops are managed by the `LifecycleManager`
struct. This is a state machine which progresses the Indexer through
different states depending on the context. The different states and
their transitions are described on the `LifecycleState` enum:
```rust
/// Represents the different lifecycle states of an Indexer
#[derive(Default, Debug, Clone, serde::Serialize, serde::Deserialize, PartialEq, Eq)]
pub enum LifecycleState {
    /// Pre-requisite resources, i.e. Data Layer, are being created.
    ///
    /// Transitions:
    /// - `Running` on success
    /// - `Repairing` on Data Layer provisioning failure
    #[default]
    Initializing,
    /// Indexer is functional, Block Stream and Executors are continuously monitored to ensure
    /// they are running the latest version of the Indexer.
    ///
    /// Transitions:
    /// - `Stopping` if suspended
    /// - `Running` if Block Stream or Executor fails to synchronise, essentially triggering a
    /// retry
    /// - `Running` on success
    Running,
    /// Indexer is being stopped, Block Stream and Executors are being stopped.
    ///
    /// Transitions:
    /// - `Stopping` on failure, triggering a retry
    /// - `Stopped` on success
    Stopping,
    /// Indexer is stopped, Block Stream and Executors are not running.
    ///
    /// Transitions:
    /// - `Running` if unsuspended
    Stopped,
    /// Indexer is in a bad state; currently requires manual intervention, but should eventually
    /// self-heal. This is a dead-end state.
    ///
    /// Transitions:
    /// - `Repairing` continuously
    Repairing, // TODO Add `error` to enable reparation
    /// Indexer is being deleted, all resources are being cleaned up
    ///
    /// Transitions:
    /// - `Deleting` on failure, triggering a retry
    /// - `Deleted` on success
    Deleting,
    /// Indexer is deleted, all resources are cleaned up, lifecycle manager will exit
    Deleted,
}
```

The logic of this `struct` is very light: it triggers the high-level actions
required within each state, and then returns the next desired state.
Most of the "doing" logic has been encapsulated in the other related
`structs`, as discussed below.
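
As a rough sketch, the driver loop boils down to something like this (each arm's body below is a simplified placeholder, not the actual handler calls):

```rust
// Simplified driver: `LifecycleState` is the enum above, but each arm's body is a
// placeholder for the real per-state logic.
async fn run_lifecycle_loop(mut state: LifecycleState) {
    loop {
        let next_state = match state {
            // e.g. ensure the Data Layer is provisioned; `Repairing` on failure
            LifecycleState::Initializing => LifecycleState::Running,
            // e.g. reconcile Block Stream/Executor with the registry; `Stopping` if suspended
            LifecycleState::Running => LifecycleState::Running,
            LifecycleState::Stopping => LifecycleState::Stopped,
            LifecycleState::Stopped => LifecycleState::Stopped,
            LifecycleState::Repairing => LifecycleState::Repairing,
            LifecycleState::Deleting => LifecycleState::Deleted,
            // Terminal state: this Indexer's dedicated task exits
            LifecycleState::Deleted => break,
        };

        // The real loop also persists transitions to Redis (see below) and backs off
        // between iterations.
        state = next_state;
        tokio::time::sleep(std::time::Duration::from_secs(1)).await;
    }
}
```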

The lifecycle state is stored in Redis so that the Indexer can pick up
where it left off. A migration has been added to accommodate this new
field, which replaces the existing `provisioned_state` field.
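
A minimal sketch of that persistence, assuming the state is serialised to JSON (the key layout and function names here are illustrative, not the actual Redis schema in this PR):

```rust
use redis::AsyncCommands;

// Illustrative key; the real schema/migration in this PR may differ.
fn lifecycle_key(account_id: &str, function_name: &str) -> String {
    format!("{}/{}:lifecycle_state", account_id, function_name)
}

async fn load_state(
    conn: &mut redis::aio::MultiplexedConnection,
    key: &str,
) -> redis::RedisResult<LifecycleState> {
    let raw: Option<String> = conn.get(key).await?;
    // Fall back to the default (`Initializing`) when nothing is stored yet.
    Ok(raw
        .and_then(|json| serde_json::from_str(&json).ok())
        .unwrap_or_default())
}

async fn save_state(
    conn: &mut redis::aio::MultiplexedConnection,
    key: &str,
    state: &LifecycleState,
) -> redis::RedisResult<()> {
    let json = serde_json::to_string(state).expect("LifecycleState serialises");
    conn.set(key, json).await
}
```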

## `Handler`s
Previously, the "handlers", i.e. `BlockStreamsHandler`, were lightweight
`structs` which wrapped the gRPC client/methods. In this PR, I've moved
all "synchronisation" logic to these structs. So rather than calling the
e.g. `data_layer_handler.start_provisioning_task()` method, we can call
`ensure_provisioned()` which manages all related logic. I feel this has
been encapsulation, and allows the `LifecycleManager` to be light.
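
For example, the provisioning flow now looks roughly like this (only `start_provisioning_task`/`ensure_provisioned` come from the PR; the other names are made-up stand-ins):

```rust
// Rough shape only: the real `DataLayerHandler` wraps a gRPC client, and
// `check_provisioning_task_status` is a made-up stand-in for that call.
struct DataLayerHandler {/* gRPC client omitted */}

#[derive(PartialEq)]
enum TaskStatus {
    Pending,
    Complete,
    Failed,
}

impl DataLayerHandler {
    /// Drives provisioning to completion instead of exposing start/poll steps to the
    /// caller, which keeps the `LifecycleManager` thin. Blocking here is fine now,
    /// because only this Indexer's dedicated loop is waiting.
    async fn ensure_provisioned(&self, account_id: &str, function_name: &str) -> Result<(), String> {
        self.start_provisioning_task(account_id, function_name).await?;

        loop {
            match self.check_provisioning_task_status(account_id, function_name).await? {
                TaskStatus::Complete => return Ok(()),
                TaskStatus::Failed => return Err("provisioning failed".to_string()),
                TaskStatus::Pending => {
                    tokio::time::sleep(std::time::Duration::from_secs(1)).await
                }
            }
        }
    }

    async fn start_provisioning_task(&self, _account_id: &str, _function_name: &str) -> Result<(), String> {
        unimplemented!("wraps the underlying gRPC call")
    }

    async fn check_provisioning_task_status(
        &self,
        _account_id: &str,
        _function_name: &str,
    ) -> Result<TaskStatus, String> {
        unimplemented!("wraps the underlying gRPC call")
    }
}
```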

I've had to remove `automock`, so we don't have mocked versions of these
right now. Cloning mocked versions is tricky and requires manual
mocking; rather than bloat this PR, I've left it out. Eventually, I'll
separate the "sync" logic from the "client" logic, so that the latter
can be easily mocked and the sync logic covered by unit tests.
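
One possible shape for that split, just to illustrate the direction (none of this is in the PR):

```rust
// Possible future split: the raw gRPC calls go behind a trait so the sync logic can
// be unit-tested against a hand-written mock, without `automock` or clonable stubs.
#[async_trait::async_trait]
trait BlockStreamsClient: Send + Sync {
    async fn start(&self, account_id: &str, function_name: &str) -> Result<(), String>;
    async fn stop(&self, account_id: &str, function_name: &str) -> Result<(), String>;
}

struct BlockStreamsHandler<C: BlockStreamsClient> {
    client: C,
}

impl<C: BlockStreamsClient> BlockStreamsHandler<C> {
    // Sync logic depends only on the trait, so tests can inject a mock client;
    // `ensure_running` is an illustrative name, not the actual API.
    async fn ensure_running(&self, account_id: &str, function_name: &str) -> Result<(), String> {
        // ... compare desired vs. actual state, restarting if versions differ ...
        self.client.start(account_id, function_name).await
    }
}
```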

Additionally, I've added `get` methods to both the Block Streamer and
Executors RPCs, as listing is no longer convenient given we are managing
Indexers individually. The getters use `account_id` and `function_name`
rather than IDs. I'm considering moving away from IDs entirely, as the
only way to get them is via list, which isn't helpful. Right now it's
somewhat of a transitional state.