
Signaling proposal #24

Merged · 4 commits · Jan 23, 2023

Conversation


@ghost ghost commented Nov 20, 2022

Changes introduced with this PR

This PR adds a proposal to change the Arcaflow engine execution model to use signals instead of front-to-back execution. This also solves arcalot/arcaflow-engine#7.

Easy to read version: https://github.com/arcalot/arcalot-round-table/blob/signals/art-decisions/proposals/2022-11-20-arcaflow-signals.md

This proposal is currently up for debate, no voting period has been set.


By contributing to this repository, I agree to the contribution guidelines.

@ghost ghost self-requested a review November 20, 2022 07:58
@ghost ghost force-pushed the signals branch 6 times, most recently from 0a52891 to ef79221 on November 20, 2022 08:26
[qDup](https://github.com/Hyperfoil/qDup). This tool not only allows people to write workflows, but also to react
dynamically to the outputs of programs and to publish signals on an internal messaging bus. Workflow parts can
communicate with each other even while running, by sending and waiting for signals.


Another aspect of qDup signals that we did not discuss is that they are, in effect, global, resettable countdown latches. This provides a number of key capabilities for qDup scripts, such as:

  • a waiting script can wait for 1..n other scripts to raise a signal before proceeding
  • once a signal's counter has reached 0, it will no longer block scripts in the future (i.e. a condition has been met)
  • signals are resettable, so the pre-conditions for proceeding can be reset (useful for loops)
  • there are specific qDup commands that allow for looping and branching based on signal states, e.g. repeat-until a signal state is reached, read-signal for branching depending on signal state, wait-for a signal to be raised, etc.
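The latch semantics described above can be modeled in a few lines of Python. This is an illustrative sketch of the behavior, not qDup's actual implementation; the class and method names are hypothetical:

```python
import threading

class ResettableCountdownLatch:
    """Illustrative model of qDup-style signals: the latch opens once
    `count` signals have been raised, stays open for later waiters,
    and can be re-armed for the next loop iteration."""

    def __init__(self, count):
        self._initial = count
        self._count = count
        self._cond = threading.Condition()

    def signal(self):
        # One script raising the signal decrements the counter.
        with self._cond:
            if self._count > 0:
                self._count -= 1
            if self._count == 0:
                self._cond.notify_all()

    def wait(self, timeout=None):
        # Blocks until the counter reaches 0; once it has, the latch no
        # longer blocks anyone until reset() is called.
        with self._cond:
            return self._cond.wait_for(lambda: self._count == 0, timeout)

    def reset(self):
        # Re-arm the pre-condition, e.g. at the top of a loop.
        with self._cond:
            self._count = self._initial

latch = ResettableCountdownLatch(2)
latch.signal()
latch.signal()
print(latch.wait(timeout=0.1))  # latch open: True
latch.reset()
print(latch.wait(timeout=0.1))  # re-armed, no signals yet: False
```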

Contributor

@jaredoconnell jaredoconnell left a comment

It looks to me like a good foundation for solving a lot of the current limitations.

### Outputs

The current normal output results will also be transformed into signals, which depend on the completion of a plugin.
However, a plugin is now allowed not to declare an output and to work with signals instead.
Member

I can't think of a specific issue (I might need more coffee), but on the surface allowing the case of "no output" feels like it could cause problems.

Author

The intent here was to cover the uperf server (?) use case, which has no meaningful output.

Member

That's fair. I just feel like the output constitutes a sort of "punctuation" for a plugin's successful end. Maybe that's not functionally necessary.

Author

You can always wait for `state:finished` if you need to wait for the plugin to finish execution.

outputs will be prefixed with `output:`.

In order to ensure that a workflow can be executed, signals constitute a dependency on the step in the dependency tree.
This means that two steps cannot depend on each other's signals, nor can three or more steps form a dependency circle.
Member

How might this affect our feature requirement for looping over sub-workflows?

Author

You can still loop, the loop itself contains a sub-workflow. Signals from within sub-workflows can be propagated to the outside.

Contributor

@jdowni000 jdowni000 left a comment

I like the concept and agree this is a great direction for the project to become more versatile.

@dustinblack
Member

Our example workflow today is shown below. The one major problem here is that we accept a timeout value that we pass to both the PCP and the sysbench plugins independently.

input:
  root: RootObject
  objects:
    RootObject:
      id: RootObject
      properties:
        pmlogger_interval:
          display:
            description: The logger collection interval for PCP pmlogger
            name: PCP pmlogger collection interval
          type:
            type_id: integer
        sysbench_threads:
          display:
            description: The number of threads sysbench will run
            name: sysbench threads
          type:
            type_id: integer
        sysbench_events:
          display:
            description: The number of events sysbench will run
            name: sysbench events
          type:
            type_id: integer
        sysbench_cpumaxprime:
          display:
            description: The upper limit of the number of prime numbers generated
            name: sysbench cpu max primes
          type:
            type_id: integer
        sysbench_runtime:                                      <<== Timeout input
          display:
            description: The total runtime in seconds for the sysbench tests
            name: sysbench runtime seconds
          type:
            type_id: integer
        elastic_host:
          display:
            description: The host URL for the ElasticSearch service
            name: elasticsearch host url
          type:
            type_id: string
        elastic_username:
          display:
            description: The username for the ElasticSearch service
            name: elasticsearch username
          type:
            type_id: string
        elastic_password:
          display:
            description: The password for the ElasticSearch service
            name: elasticsearch password
          type:
            type_id: string
        elastic_index:
          display:
            description: The index for the ElasticSearch service
            name: elasticsearch index
          type:
            type_id: string
steps:
  pcp:
    plugin: quay.io/arcalot/arcaflow-plugin-pcp:0.2.0
    step: start-pcp
    input:
      pmlogger_interval: !expr $.input.pmlogger_interval
      run_duration: !expr $.input.sysbench_runtime             <<== Timeout used
  sysbench:
    plugin: quay.io/arcalot/arcaflow-plugin-sysbench:0.1.0
    step: sysbenchcpu
    input:
      threads: !expr $.input.sysbench_threads
      events: !expr $.input.sysbench_events
      cpu-max-prime: !expr $.input.sysbench_cpumaxprime
      time: !expr $.input.sysbench_runtime                     <<== Timeout used
  metadata:
    plugin: quay.io/arcalot/arcaflow-plugin-metadata:0.1.0
    input: {}
  opensearch:
    plugin: quay.io/arcalot/arcaflow-plugin-opensearch:0.1.0
    input:
      url: !expr $.input.elastic_host
      username: !expr $.input.elastic_username
      password: !expr $.input.elastic_password
      index: !expr $.input.elastic_index
      data:
        pcp: !expr $.steps.pcp.outputs.success
        sysbench: !expr $.steps.sysbench.outputs.success
        metadata: !expr $.steps.metadata.outputs.success
output:
  pcp: !expr $.steps.pcp.outputs.success
  sysbench: !expr $.steps.sysbench.outputs.success
  metadata: !expr $.steps.metadata.outputs.success
  opensearch: !expr $.steps.opensearch.outputs.success
flowchart LR
subgraph input
input.sysbench_threads
input.sysbench_events
input.elastic_password
input.sysbench_runtime
input.elastic_index
input.pmlogger_interval
input.sysbench_cpumaxprime
input.elastic_username
input.elastic_host
end
steps.metadata-->steps.metadata.outputs.success
steps.metadata-->steps.metadata.outputs.error
input.elastic_password-->steps.opensearch
steps.opensearch.outputs.success-->output
steps.pcp.outputs.success-->steps.opensearch
steps.pcp.outputs.success-->output
input.pmlogger_interval-->steps.pcp
steps.sysbench.outputs.success-->steps.opensearch
steps.sysbench.outputs.success-->output
steps.metadata.outputs.success-->steps.opensearch
steps.metadata.outputs.success-->output
input.elastic_index-->steps.opensearch
input.sysbench_runtime-->steps.pcp
input.sysbench_runtime-->steps.sysbench
input.elastic_host-->steps.opensearch
input.elastic_username-->steps.opensearch
input.sysbench_events-->steps.sysbench
input.sysbench_cpumaxprime-->steps.sysbench
steps.opensearch-->steps.opensearch.outputs.success
steps.opensearch-->steps.opensearch.outputs.error
input.sysbench_threads-->steps.sysbench
steps.pcp-->steps.pcp.outputs.error
steps.pcp-->steps.pcp.outputs.success
steps.sysbench-->steps.sysbench.outputs.error
steps.sysbench-->steps.sysbench.outputs.success

This is a fine prototype, but to be properly useful there needs to be a set of relationships between PCP and sysbench, where sysbench will only start once PCP reaches a "running" state, and PCP will only stop once sysbench reaches a "finished" state. This will ensure that the data collection time frame fully encapsulates the workload time. With signaling, it may look something like this:

input:
  ...
steps:
  pcp:
    plugin: quay.io/arcalot/arcaflow-plugin-pcp:0.2.0
    step: start-pcp
    stop_if:                                                   <<== New stop_if
        - !expr $.steps.sysbench.state.finished                <<== Depends on sysbench finish
    input:
      pmlogger_interval: !expr $.input.pmlogger_interval       <<== Removed timeout
  sysbench:
    plugin: quay.io/arcalot/arcaflow-plugin-sysbench:0.1.0
    step: sysbenchcpu
    start_if:                                                  <<== New start_if
        - !expr $.steps.pcp.state.running                      <<== Depends on pcp running
    input:
      threads: !expr $.input.sysbench_threads
      events: !expr $.input.sysbench_events
      cpu-max-prime: !expr $.input.sysbench_cpumaxprime
      time: !expr $.input.sysbench_runtime
  metadata:
    plugin: quay.io/arcalot/arcaflow-plugin-metadata:0.1.0
    input: {}
  opensearch:
    plugin: quay.io/arcalot/arcaflow-plugin-opensearch:0.1.0
    input:
      url: !expr $.input.elastic_host
      username: !expr $.input.elastic_username
      password: !expr $.input.elastic_password
      index: !expr $.input.elastic_index
      data:
        pcp: !expr $.steps.pcp.outputs.success
        sysbench: !expr $.steps.sysbench.outputs.success
        metadata: !expr $.steps.metadata.outputs.success
output:
  pcp: !expr $.steps.pcp.outputs.success
  sysbench: !expr $.steps.sysbench.outputs.success
  metadata: !expr $.steps.metadata.outputs.success
  opensearch: !expr $.steps.opensearch.outputs.success
flowchart LR
subgraph input
input.sysbench_threads
input.sysbench_events
input.elastic_password
input.sysbench_runtime
input.elastic_index
input.pmlogger_interval
input.sysbench_cpumaxprime
input.elastic_username
input.elastic_host
end
steps.metadata-->steps.metadata.outputs.success
steps.metadata-->steps.metadata.outputs.error
input.elastic_password-->steps.opensearch
steps.opensearch.outputs.success-->output
steps.pcp.outputs.success-->steps.opensearch
steps.pcp.outputs.success-->output
input.pmlogger_interval-->steps.pcp
steps.sysbench.outputs.success-->steps.opensearch
steps.sysbench.outputs.success-->output
steps.metadata.outputs.success-->steps.opensearch
steps.metadata.outputs.success-->output
input.elastic_index-->steps.opensearch
input.sysbench_runtime-->steps.pcp
input.sysbench_runtime-->steps.sysbench
input.elastic_host-->steps.opensearch
input.elastic_username-->steps.opensearch
input.sysbench_events-->steps.sysbench
input.sysbench_cpumaxprime-->steps.sysbench
steps.opensearch-->steps.opensearch.outputs.success
steps.opensearch-->steps.opensearch.outputs.error
input.sysbench_threads-->steps.sysbench
steps.pcp-->steps.pcp.outputs.error
steps.pcp-->steps.pcp.outputs.success
steps.pcp-->steps.pcp.state.running
steps.pcp.state.running-->steps.sysbench
steps.sysbench-->steps.sysbench.outputs.error
steps.sysbench-->steps.sysbench.outputs.success
steps.sysbench-->steps.sysbench.state.finished
steps.sysbench.state.finished-->steps.pcp

@jaredoconnell
Contributor

From an SDK standpoint, how are we differentiating between signals like "stop_if", vs ones that are listened to for a while to use as a stream?
Would the stop_if ones just be handled the same way, except they close before any data is sent?

Or are you going for a more single-message format, where no matter what, things are processed as single signals instead of connections? And even then, will there be a schema for events like stop_if that differs from custom signals?

@ghost ghost force-pushed the signals branch 2 times, most recently from 09fa225 to d5ed458 on January 16, 2023 12:45
@ghost
Author

ghost commented Jan 16, 2023

From an SDK standpoint, how are we differentiating between signals like "stop_if", vs ones that are listened to for a while to use as a stream? Would the stop_if ones just be handled the same way, except they close before any data is sent?

Updating the specification to address this in commit 7b8bb19 . A stop_if causes the plugin to receive a SIGTERM, followed by a SIGKILL 30 seconds later.

Or are you going for a more single-message format, where no matter what, things are processed as single signals instead of connections? And even then, will there be a schema for events like stop_if that differs from custom signals?

No, if this behavior is desired you shouldn't use stop_if, but rather an appropriate listen key that has a schema. The stop_if is purely a termination signal and has no explicit schema.
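The termination sequence described above (SIGTERM, then SIGKILL after a grace period) can be sketched as follows. The helper name `terminate_gracefully` and the short demo grace period are hypothetical illustrations, not the engine's actual code:

```python
import signal
import subprocess
import time

def terminate_gracefully(proc: subprocess.Popen, grace_period: float = 30.0) -> int:
    """Send SIGTERM, give the plugin `grace_period` seconds to shut down
    cleanly, then SIGKILL it if it is still alive. Returns the exit code."""
    proc.send_signal(signal.SIGTERM)
    try:
        return proc.wait(timeout=grace_period)
    except subprocess.TimeoutExpired:
        proc.kill()  # SIGKILL cannot be caught or ignored by the plugin
        return proc.wait()

# A child that ignores SIGTERM, forcing the SIGKILL path (the grace period
# is shortened here so the demo finishes quickly):
stubborn = subprocess.Popen(
    ["python3", "-c",
     "import signal, time; signal.signal(signal.SIGTERM, signal.SIG_IGN); time.sleep(60)"]
)
time.sleep(0.5)  # let the child install its SIGTERM handler first
print(terminate_gracefully(stubborn, grace_period=1.0))  # -9 on POSIX (SIGKILL)
```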

@ghost ghost marked this pull request as ready for review January 16, 2023 16:33
@ghost
Author

ghost commented Jan 16, 2023

The proposal is now open for voting!

Member

@dustinblack dustinblack left a comment

+1 vote for proposal

Contributor

@sandrobonazzola sandrobonazzola left a comment

+1

Contributor

@hubeadmin hubeadmin left a comment

Makes sense to me. @dustinblack, this is something you were also looking at/wanted to do with the PoC for perf and scale running their tests on in-vehicle OS testing, right?

@dustinblack
Member

Makes sense to me. @dustinblack, this is something you were also looking at/wanted to do with the PoC for perf and scale running their tests on in-vehicle OS testing, right?

Yes. Without this change, or otherwise a series of other changes for feature requirements, our workflows for Perf & Scale automotive and other use cases can't really move beyond the prototype phase.

@AvlWx2014
Contributor

+1 vote for proposal.

This seems like a really good idea to me. Is it going to be up to the engine to manage creation of signal channels at the request of the plugins that are going to publish to them? Also, (maybe I missed it) are channels going to be one-way?

@ghost
Author

ghost commented Jan 17, 2023

This seems like a really good idea to me. Is it going to be up to the engine to manage creation of signal channels at the request of the plugins that are going to publish to them?

Plugins should declare the signals they publish and accept, along with their schema. The engine will only do the piping and notify the plugins when these signals arrive. The SDK will need to be updated.

Also, (maybe I missed it) are channels going to be one-way?

Yes, channels are strictly one-way; otherwise we can't guarantee through static analysis that a workflow can be executed. Allowing circular dependencies would also introduce the problem of endless loops.

@ghost
Author

ghost commented Jan 17, 2023

@AvlWx2014 please vote with a code review.

Contributor

@portante portante left a comment

I am really not sure why there is even a vote for this change.

Why does the community need to vote for this change?

If I vote, yes, what am I saying? That I agree with all the details of the proposal? I don't. Or am I saying that I agree with the general notion of adding signals to the work flow engine? I do.

I think the weakness of the proposal is the lack of a clear mapping of the existing output behavior to signals.

I don't think we should add signals next to the current output behavior but define the old behavior entirely in terms of signals.

If that is not the direction this is taking, then my vote is no.

If that is the direction this is taking, then my vote is yes.

But I found it unclear what direction this is really heading in with regard to the existing output behavior and signals.


In this proposal, we transform the execution of Arcaflow by adding the ability to send and receive signals via signal
channels. Each signal channel will have a schema and is declared by a plugin. Workflow authors can take these signals
and pipe them into other plugins that have declared they can receive signals.
Contributor

Is all "output" now a signal? It seems like we need to strengthen this statement. I really like what I think the direction is here, that what was an output is now a signal. That means the previous execution model has a direct mapping to the signal execution model.

If that is not the case, then could we move in that direction?

Contributor

For steps that just do work and report the result, what would be the advantage of using signals for them?
It may just be easier to keep the existing input and output options, with signals for events and mid-step data transfers.

uperf_client:
...
uperf_server:
stop_if:
Contributor

This feels a bit arbitrary.

Why just pick on "stopping" execution?

It seems like we want to be able to declare a signal a plugin listens for, and let it decide the action to take.

If this is about having the engine listen for a signal and take action on the plugin, then that seems weird.

Member

I feel like what you're describing would be an extension of the proposed signaling system. My preference would be to start with the minimum required functionality described in this proposal, and only extend the signaling features as new use cases require.

Author

We define a stop condition because we have the need for it. Currently, there is no other life cycle event that needs special handling.

Contributor

So you are saying you want to embed in the workflow description one kind of stop condition?

Why do we need that?

Author

We need a way to signal a plugin to stop, or to forcefully kill it if need be. Uperf needs this, and there are a few other situations where a plugin needs to be stopped regardless of its successful termination.

4. `state:finished`

In addition, we will also introduce nodes for each of the signals the plugin emits, prefixed with `signal:`. The
outputs will be prefixed with `output:`.
Contributor

It seems an output is a kind of signal, if we are heading that way. Perhaps we could unify these?

Author

That would remove the ability to properly static type check workflows.

Contributor

Why?

Author

Each output has its own schema. Similarly, each signal is designed to have an explicit schema. If we unify the outputs into one schema, the typing is lost. Similarly, if we merge them with the state changes, you no longer have the ability to wait for one specific output, and you have no way to wait for the plugin to finish regardless of output.


In order to ensure that a workflow can be executed, signals constitute a dependency on the step in the dependency tree.
This means that two steps cannot depend on each other's signals, nor can three or more steps form a dependency circle.
This constitutes a limitation, because no two steps can form a constant back and forth communication via signals.
Contributor

I am not entirely sure why this is a limitation, and why it needs to be prevented.

Why not have a default behavior of disallowing the cycle, but allow the user to explicitly describe the cycle, along with a time limit after which the engine breaks it up when no progress is made?

Author

The Arcaflow engine works by evaluating which steps have the data to proceed and then executing the steps that are ready. This means that you don't have to specify explicit dependencies; they are implicitly derived from expressions.

Simultaneously, workflows without cycles are super simple to execute and detect bugs in. Loops can be implemented as subworkflows.

If we allow arbitrary loops in the workflows, we lose the simplicity and safety constraints of the workflow execution.
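A scheduler of this kind (run whichever steps have their inputs, reject dependency circles up front) can be sketched with Kahn's topological sort. The function and the dependency map below are illustrative, not the engine's actual code, and only model start dependencies:

```python
from collections import deque

def execution_order(deps):
    """deps maps each step to the steps it waits on (outputs or signals).
    Returns a valid run order, or raises if a dependency circle exists."""
    indegree = {step: len(waiting_on) for step, waiting_on in deps.items()}
    dependents = {step: [] for step in deps}
    for step, waiting_on in deps.items():
        for upstream in waiting_on:
            dependents[upstream].append(step)

    # Steps with no unmet dependencies are ready to run immediately.
    ready = deque(step for step, n in indegree.items() if n == 0)
    order = []
    while ready:
        step = ready.popleft()
        order.append(step)
        for downstream in dependents[step]:
            indegree[downstream] -= 1
            if indegree[downstream] == 0:
                ready.append(downstream)

    # Any step never reaching indegree 0 is part of a circle.
    if len(order) != len(deps):
        raise ValueError("dependency circle detected; workflow cannot be scheduled")
    return order

# Hypothetical dependency map mirroring the example workflow: sysbench
# waits on pcp's "running" signal; opensearch waits on all three outputs.
print(execution_order({
    "pcp": [],
    "metadata": [],
    "sysbench": ["pcp"],
    "opensearch": ["pcp", "sysbench", "metadata"],
}))
```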

Contributor

I am not following why we want simplicity after introducing the complexity of signals, and why we can't warn the user when it is not safe, instead of "knowing better than they do" and preventing it.

Author

I know of no better way to create a system that can have a schema, avoid endless loops, and automatically deduce dependencies.

@dustinblack
Member

I am really not sure why there is even a vote for this change.

Why does the community need to vote for this change?

This proposes a fundamental architectural change to the engine that could break compatibility with plugins. If this were a more modest feature enhancement, then we could certainly hash it out in the PR process. Due diligence is being done here to ensure that the Round Table agrees with the high-level description of the change proposal.

If I vote, yes, what am I saying? That I agree with all the details of the proposal? I don't. Or am I saying that I agree with the general notion of adding signals to the work flow engine? I do.

I think the weakness of the proposal is the lack of a clear mapping of the existing output behavior to signals.

I don't think we should add signals next to the current output behavior but define the old behavior entirely in terms of signals.

If that is not the direction this is taking, then my vote is no.

If that is the direction this is taking, then my vote is yes.

But I found it unclear what direction this is really heading in with regard to the existing output behavior and signals.

To the technical part of your questions: my understanding is that we want to preserve backward compatibility of existing plugins built with the SDK, so the existing input and output mechanics need to remain the same. But exactly how we implement that is, I believe, a technical discussion that can be had in the PR process. IMO the spirit of the proposal for an architectural change is what is important here.

The procedural parts of your questions are probably better addressed in a separate forum.

@ghost
Author

ghost commented Jan 18, 2023

@portante the proposal was in draft, asking for content change proposals, for over a month; it barely got any feedback. Now it's in the voting phase, which means you can agree or disagree with it. (See the charter.)

If you can write up your changes, perhaps as a separate/modified proposal, I'm happy to withdraw this one and we can continue the work on that. In general, whatever the changes may be, I would keep the initial design choices of static typing and the ability to verify the correctness of a workflow largely without running it.

If you want to have a go at rewriting this proposal, I'm happy to hold off on merging until you finish it.

10 participants