Operations orchestration skeleton #543

arbulu89 · 2025-01-08T14:45:24Z

Description

Implement the operation orchestration skeleton.
Operations are defined using actual Elixir code. They are basically composed of a number of steps, which are executed sequentially on all the target agents. Each step consists of an operator execution (the same way we do with gatherers).

The orchestration works as follows:

An operation execution is requested on some targets. For example, a host operation has only one target, while a cluster operation has all the members of the cluster as targets.
Step N of the operation is executed (starting from the first, of course).
At this point, the operator execution is sent to all targets that match the step predicate. This means that certain steps might be skipped on some hosts.
Once the operator execution is sent, the server waits to receive the report from each of the hosts.
When all hosts have reported, the operation continues with the next step, until all steps are finished.
If any step fails on a host, the complete operation is aborted.

The steps are executed sequentially, but within each step the operator is executed on all related hosts in parallel.
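The flow above can be modelled as a small in-memory sketch. This is not the code in this PR: `OrchestrationSketch`, the `run_operator` callback (standing in for "send to the agent and wait for its report"), the function predicate and the `:failed` result atom are all assumptions made for illustration.

```elixir
defmodule OrchestrationSketch do
  @moduledoc "Hypothetical in-memory model of the orchestration flow."

  # steps: list of %{operator: String.t(), predicate: (target -> boolean)}
  # run_operator: (operator, target -> :updated | :failed), an assumed callback
  def run(steps, targets, run_operator) do
    result =
      Enum.reduce_while(steps, [], fn step, reports ->
        step_report = run_step(step, targets, run_operator)
        reports = [step_report | reports]

        # Abort the whole operation as soon as one host fails a step
        if Enum.any?(step_report, fn {_target, outcome} -> outcome == :failed end) do
          {:halt, {:aborted, reports}}
        else
          {:cont, reports}
        end
      end)

    case result do
      {:aborted, reports} -> {:aborted, Enum.reverse(reports)}
      reports -> {:completed, Enum.reverse(reports)}
    end
  end

  # Within one step, targets are handled concurrently; targets that do not
  # match the predicate are reported as :skipped without running the operator.
  defp run_step(%{operator: operator, predicate: predicate}, targets, run_operator) do
    targets
    |> Task.async_stream(fn target ->
      if predicate.(target) do
        {target, run_operator.(operator, target)}
      else
        {target, :skipped}
      end
    end)
    |> Enum.map(fn {:ok, report} -> report end)
  end
end
```

Steps are reduced one after another, while `Task.async_stream/2` covers the "parallel on all hosts" part. The real server waits for asynchronous agent reports instead of calling a function.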

I have excluded from this PR everything related to messaging, storing operation results in the database, and results evaluation.
It only includes the fundamental part of the orchestration.

The code for a saptune solution operation could look something like this:

%Operation{
    id: "abcdef",
    name: "saptune solution",
    required_args: ["solution"],
    steps: [
        %Step{
            operator: "saptune solution",
            predicate: "*"
        }
    ]
}
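For context, a minimal sketch of struct definitions that would make the literal above compile. The `Sketch.*` module names, the default values and the `missing_args/2` helper are assumptions; in particular, reading `required_args` as "argument names the caller must provide" is a guess, not something confirmed by this PR.

```elixir
defmodule Sketch.Step do
  # A step pairs an operator with the predicate that selects its targets
  @enforce_keys [:operator]
  defstruct [:operator, predicate: "*"]
end

defmodule Sketch.Operation do
  @enforce_keys [:id, :name]
  defstruct [:id, :name, required_args: [], steps: []]

  # Hypothetical helper: lists the required argument names absent from `args`
  def missing_args(%__MODULE__{required_args: required}, args) when is_map(args) do
    Enum.reject(required, &Map.has_key?(args, &1))
  end
end

defmodule Sketch.Catalog do
  alias Sketch.{Operation, Step}

  # Same shape as the example in the description
  def saptune_solution do
    %Operation{
      id: "abcdef",
      name: "saptune solution",
      required_args: ["solution"],
      steps: [%Step{operator: "saptune solution", predicate: "*"}]
    }
  end
end
```

With these definitions, `Sketch.Operation.missing_args(Sketch.Catalog.saptune_solution(), %{})` returns `["solution"]`, and passing `%{"solution" => "HANA"}` returns `[]`.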

How was this tested?

Unit tests added.

coveralls · 2025-01-08T14:48:09Z

coverage: 97.456% (-0.3%) from 97.753%
when pulling 65e9c71 on operations-orchestration
into b8f0a58 on main.

gagandeepb

Thanks for this. I wonder what your thoughts are about the pros and cons of using persistent/DB-backed worker processes in the context of operations orchestration/execution? In my understanding, with something like Oban we get better transactional support, robustness and error recovery/retries out of the box, as compared to more ephemeral processes that are not DB-backed (where these aspects are up to our own implementation).

arbulu89 · 2025-01-09T09:39:19Z

> Thanks for this. I wonder what your thoughts are about the pros and cons of using persistent/DB-backed worker processes in the context of operations orchestration/execution? In my understanding, with something like Oban we get better transactional support, robustness and error recovery/retries out of the box, as compared to more ephemeral processes that are not DB-backed (where these aspects are up to our own implementation).

In this case, I guess our business logic is not a single background async task by itself.
The task is composed of different actions, the most important of which is waiting for agents to report back.
In the end we would need to handle all of this ourselves, and I'm not sure a background jobs worker is what we need.
From the beginning, we thought that DynamicServers look simple and powerful enough to do what we need.

Anyway, I have not used Oban myself, so I cannot tell if what this code does could be replicated using it.
I guess having a DB-backed jobs system could have the benefit of resuming jobs from the point where they were if, for some reason, the main app crashes. Besides that, our code should manage all our needs (the checks execution code shows that, I guess).

CDimonaco

Beautiful, LGTM!

Regarding using a persistent database: I was one of the first to work on Wanda a long time ago, and we had a look at Oban back then. I have also used Oban myself in the past.

We decided it was not the right tool for the nature of the task that Wanda performs: in the past that was the check execution, and now it is the operation orchestration.

It's a dynamically spawned process and it's basically a state machine, which fits perfectly with the use case of dynamically supervised GenServers.

Using Oban or another job processing tool means that we have to change the nature and the structure of Wanda itself, and the benefits are not tangible because we would need to change the architecture and the way we think about Wanda processes.
If it were a recurrent job, a queue, or whatever simple task, Oban would be perfect, but I don't think it suits the use case we have right now.

If something happens to a process, it can be retriggered and retried. That's of course my two cents.
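For illustration, a minimal sketch of the dynamically supervised GenServer pattern mentioned above, assuming one server per `group_id` registered through a `Registry`. All module and process names here (`OperationServerSketch`, `OperationRegistrySketch`, `OperationSupervisorSketch`) are hypothetical and not taken from Wanda's code.

```elixir
defmodule OperationServerSketch do
  # :temporary means the supervisor does not restart the process after it
  # stops or crashes; a caller would have to trigger the operation again.
  use GenServer, restart: :temporary

  def start_link(opts) do
    group_id = Keyword.fetch!(opts, :group_id)
    GenServer.start_link(__MODULE__, opts, name: via(group_id))
  end

  # Registering under the group_id makes a second start for the same group fail
  def via(group_id), do: {:via, Registry, {OperationRegistrySketch, group_id}}

  @impl true
  def init(opts), do: {:ok, Map.new(opts)}
end

{:ok, _registry} = Registry.start_link(keys: :unique, name: OperationRegistrySketch)

{:ok, _supervisor} =
  DynamicSupervisor.start_link(strategy: :one_for_one, name: OperationSupervisorSketch)

child = {OperationServerSketch, group_id: "group-1"}
{:ok, pid} = DynamicSupervisor.start_child(OperationSupervisorSketch, child)

# A second operation for the same group is rejected while the first one runs
{:error, {:already_started, ^pid}} =
  DynamicSupervisor.start_child(OperationSupervisorSketch, child)
```

The state lives only in the process, so nothing is recovered automatically after a crash. That is the trade-off being discussed in this thread.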

gagandeepb

@CDimonaco: I have some questions to help improve my understanding, in case you have already evaluated Oban for Wanda:

How do we ensure that any submitted long-running requests are idempotent? Or is idempotence not needed here?
Main app crashes are one thing, but what about crashes in the parent process of the operations orchestration? The app is still up, but the parent process has crashed and has no way of regaining its old state (it was all in memory). How would the restarted parent process know about the previous history of partial execution? Or is this again not a concern?
You mention retries, but when running dynamically, without a DB backing the execution request/orchestration state, my understanding is that execution request deduplication/uniqueness would be difficult to implement. How do we avoid the impact of partial failures/(partially) duplicated executions in this case? Or do you have reasons to believe that this is somehow not a concern?
I believe I would benefit from a conversation about the architectural/"nature and structure of wanda" reasons. Let's talk about this, perhaps?
(Comment) There are a few ways to do state machines backed by the DB, with or without Oban, but I guess that's a topic for a later conversation.
Ways of replicating this using a DB-backed solution are also a topic for a sync/later conversation.

balanza

Great work @arbulu89!

The overall workflow sounds good to me. I left several comments with nitpicking and questions to better understand the code.

I see this is a skeleton, and we might stop here. Anyway, there is something we might want to include in the skeleton, too:

Test the workflow using receive_operation_reports, too
Some docs on the state struct will be helpful

UPDATE: I took a closer look at the tests. I see most of our tests (9 out of 14) make direct calls to lifecycle functions (handle_continue and handle_cast) and make assertions on the state shape. That makes it hard to refactor the implementation when needed (we're testing the implementation, not the result).
As we are laying the groundwork for the work to come, I think it's worth the investment to have a stronger test suite.

balanza · 2025-01-09T15:49:08Z

lib/wanda/operations/server.ex

+
+    %Step{predicate: predicate, operator: operator} = Enum.at(steps, current_step_index)
+
+    state =


suggestion(immutability): use newState to avoid reassigning a variable

balanza · 2025-01-09T16:05:44Z

lib/wanda/operations/server.ex

+    end
+  end
+
+  defp maybe_save_skipped_operation_state(


question: this function sets the report to :skipped to all the agents that don't satisfy the predicate. Right?

balanza · 2025-01-09T16:10:20Z

lib/wanda/operations/server.ex

+         %State{} = state,
+         _operator
+       ) do
+    # publish operation execution to agents


question: this is supposed to be a message dispatched to RabbitMQ, right?

balanza · 2025-01-09T17:43:16Z

lib/wanda/operations/server.ex

+
+    pending_targets = List.delete(targets, agent_id)
+
+    state =


suggestion(immutability): use newState to avoid reassigning a variable

balanza · 2025-01-09T17:51:43Z

test/wanda/operations/server_test.exs

+      assert :ok ==
+               Server.receive_operation_reports(operation_id, group_id, 1, UUID.uuid4(), :updated)


question: How can this test fail? What are the conditions for which we don't get :ok?

balanza · 2025-01-09T17:52:20Z

test/wanda/operations/server_test.exs

+      end
+    end
+
+    test "should not start opeartion if it is already running for that group_id" do


typo

Suggested change

-    test "should not start opeartion if it is already running for that group_id" do
+    test "should not start operation if it is already running for that group_id" do

balanza · 2025-01-09T17:55:02Z

lib/wanda/operations/state.ex

+  defstruct [
+    :engine,
+    :operation_id,
+    :group_id,
+    :operation,
+    :timeout,
+    targets: [],
+    pending_targets_on_step: [],
+    current_step_index: 0,
+    agent_reports: %{},
+    step_failed: false
+  ]


suggestion: a description of the fields and what to expect from them would be helpful. For example, I didn't get what group_id is and what its relationship with the other fields is.

balanza · 2025-01-09T18:01:03Z

test/wanda/operations/server_test.exs

+    test "should finish operation if all steps are completed" do
+      state = %State{
+        current_step_index: 1,
+        agent_reports: [
+          %StepReport{
+            step_number: 0,
+            agents: [%AgentReport{agent_id: UUID.uuid4(), result: :updated}]
+          }
+        ]
+      }
+
+      assert {:stop, :normal, ^state} =
+               Server.handle_continue(
+                 :execute_step,
+                 state
+               )
+    end


thought: Alongside calling handle_continue directly, it would be helpful to use the Server API, i.e. calling receive_operation_reports on a GenServer with the desired internal state.
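A minimal sketch of what testing through the public API could look like, assuming a much simplified server. `ReportServerSketch` and its two-argument `receive_report/2` are hypothetical; the real `Server.receive_operation_reports` takes more arguments, as shown in the test excerpt earlier in this review.

```elixir
defmodule ReportServerSketch do
  use GenServer

  # Public API: the only entry points a test needs to know about
  def start(pending_targets), do: GenServer.start(__MODULE__, pending_targets)
  def receive_report(pid, agent_id), do: GenServer.cast(pid, {:report, agent_id})

  @impl true
  def init(pending_targets), do: {:ok, pending_targets}

  @impl true
  def handle_cast({:report, agent_id}, pending_targets) do
    # Stop normally once every pending target has reported
    case List.delete(pending_targets, agent_id) do
      [] -> {:stop, :normal, []}
      rest -> {:noreply, rest}
    end
  end
end

{:ok, pid} = ReportServerSketch.start(["agent-1", "agent-2"])
ref = Process.monitor(pid)

ReportServerSketch.receive_report(pid, "agent-1")
ReportServerSketch.receive_report(pid, "agent-2")

# Observe the outcome (the process finishes) instead of the state shape
receive do
  {:DOWN, ^ref, :process, ^pid, :normal} -> :ok
after
  1_000 -> raise "server did not stop"
end
```

Inside an ExUnit test, the final `receive` block would typically become `assert_receive {:DOWN, ^ref, :process, ^pid, :normal}`. The assertion then survives a refactor of the callbacks or of the state struct.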

balanza · 2025-01-09T18:27:39Z

lib/wanda/operations/server.ex

+      %State{pending_targets_on_step: pending_targets} =
+      state
+      |> predicate_targets_execution(predicate)
+      |> maybe_save_skipped_operation_state()
+      |> maybe_request_operation_execution(operator)
+      |> maybe_increase_current_step()
+
+    if pending_targets == [] do
+      {:noreply, state, {:continue, :execute_step}}
+    else
+      {:noreply, state}


thought: I understand that this code does 3 things:

select target agents and dispatch the job to them

set a :skipped report for agents that do not satisfy the predicate

"shortcut" the workflow if no agent satisfies the predicate

I think we can make it simpler with some refactoring. I don't know if it's worth it, given that this PR is a skeleton.

The idea is, when dispatching the job to all the agents, we can:

if predicate_is_met(agent) dispatch on queue else receive_operation_reports(agent.id, :skipped)

Have you considered that already? If you think it's worth it, we can refactor in this PR; otherwise we can refactor later.
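A rough sketch of that idea, assuming agents are maps with an `id` and that both the predicate check and the queue dispatch are passed in as functions. `DispatchSketch`, `dispatch_step/3` and the returned map shape are hypothetical names, not part of this PR.

```elixir
defmodule DispatchSketch do
  # Walks the agents once: matching agents get the job dispatched and become
  # pending, the others are recorded directly with a :skipped report.
  def dispatch_step(agents, predicate_met?, dispatch) do
    Enum.reduce(agents, %{pending: [], reports: %{}}, fn agent, acc ->
      if predicate_met?.(agent) do
        dispatch.(agent)
        %{acc | pending: acc.pending ++ [agent.id]}
      else
        %{acc | reports: Map.put(acc.reports, agent.id, :skipped)}
      end
    end)
  end
end

agents = [%{id: "agent-1", role: :db}, %{id: "agent-2", role: :app}]

result =
  DispatchSketch.dispatch_step(
    agents,
    fn agent -> agent.role == :db end,
    # Stand-in for publishing the execution request to the message queue
    fn agent -> send(self(), {:dispatched, agent.id}) end
  )

# result.pending holds the agents still expected to report;
# an empty list would mean the step can be completed right away.
```

An empty `pending` list covers the "shortcut" case without a separate code path.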

arbulu89 added 4 commits January 8, 2025 15:35

Define operation struct

9a51708

Define server state and implement functionality

4b1944d

Start dynamic server in application bootup

6482783

Test operations server

745b3d9

arbulu89 added the enhancement label Jan 8, 2025

arbulu89 added 2 commits January 8, 2025 16:18

Fix credo warnings

f07d121

Improve test coverage

65e9c71

arbulu89 force-pushed the operations-orchestration branch from c35a4c7 to 65e9c71 Compare January 8, 2025 15:18

gagandeepb reviewed Jan 8, 2025

arbulu89 marked this pull request as ready for review January 9, 2025 09:06

CDimonaco approved these changes Jan 9, 2025

gagandeepb reviewed Jan 9, 2025

balanza reviewed Jan 9, 2025
