
Signaling proposal #24

Merged · 4 commits · Jan 23, 2023

Conversation


@ghost ghost commented Nov 20, 2022

Changes introduced with this PR

This PR adds a proposal to change the Arcaflow engine execution model to use signals instead of front-to-back execution. This also solves arcalot/arcaflow-engine#7.

Easy to read version: https://github.com/arcalot/arcalot-round-table/blob/signals/art-decisions/proposals/2022-11-20-arcaflow-signals.md

This proposal is currently up for debate, no voting period has been set.


By contributing to this repository, I agree to the contribution guidelines.

@ghost ghost self-requested a review November 20, 2022 07:58
@ghost ghost force-pushed the signals branch 6 times, most recently from 0a52891 to ef79221 on November 20, 2022 08:26
[qDup](https://github.com/Hyperfoil/qDup). This tool not only allows people to write workflows, but also to react
dynamically to the outputs of programs and to publish signals on an internal messaging bus. Workflow parts can
communicate with each other even while running, by sending and waiting for signals.


Another aspect of qDup signals that we did not discuss is that they are, in effect, global, resettable countdown latches. This provides a number of key capabilities for qDup scripts, such as:

  • a waiting script can wait for 1..n other scripts to raise a signal before proceeding
  • once a signal's counter has reached 0, it will no longer block scripts in the future (i.e. a condition has been met)
  • signals are resettable, so the pre-conditions for proceeding can be reset (useful for loops)
  • there are specific qDup commands that allow for looping and branching based on signal states, e.g. repeat-until a signal state is reached, read-signal for branching depending on signal state, wait-for a signal to be raised, etc.
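The latch semantics described above can be modeled in a few lines of Python. This is an illustrative sketch of the behavior, not qDup's actual implementation; the class and method names are hypothetical:

```python
import threading

class ResettableCountdownLatch:
    """Illustrative model of qDup-style signals: the latch opens once
    `count` signals have been raised, stays open for later waiters,
    and can be re-armed for the next loop iteration."""

    def __init__(self, count):
        self._initial = count
        self._count = count
        self._cond = threading.Condition()

    def signal(self):
        # One script raising the signal decrements the counter.
        with self._cond:
            if self._count > 0:
                self._count -= 1
            if self._count == 0:
                self._cond.notify_all()

    def wait(self, timeout=None):
        # Blocks until the counter reaches 0; once it has, the latch no
        # longer blocks anyone until reset() is called.
        with self._cond:
            return self._cond.wait_for(lambda: self._count == 0, timeout)

    def reset(self):
        # Re-arm the pre-condition, e.g. at the top of a loop.
        with self._cond:
            self._count = self._initial

latch = ResettableCountdownLatch(2)
latch.signal()
latch.signal()
print(latch.wait(timeout=0.1))  # latch open: True
latch.reset()
print(latch.wait(timeout=0.1))  # re-armed, no signals yet: False
```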

Contributor

@jaredoconnell jaredoconnell left a comment

It looks to me like a good foundation for solving a lot of the current limitations.

### Outputs

The current normal output results will also be transformed into signals, which depend on the completion of a plugin.
However, a plugin is now allowed not to declare an output and to work with signals instead.
Member

I can't think of a specific issue (I might need more coffee), but on the surface allowing the case of "no output" feels like it could cause problems.

Author

The intent here was to cover the uperf server (?) use case, which has no meaningful output.

Member

That's fair. I just feel like the output constitutes a sort of "punctuation" for a plugin's successful end. Maybe that's not functionally necessary.

Author

You can always wait for `state:finished` if you need to wait for the plugin to finish execution.

outputs will be prefixed with `output:`.

In order to ensure that a workflow can be executed, signals constitute a dependency on the step in the dependency tree.
This means that two steps cannot depend on each other's signals, nor can three or more steps form a dependency circle.
Member

How might this affect our feature requirement for looping over sub-workflows?

Author

You can still loop, the loop itself contains a sub-workflow. Signals from within sub-workflows can be propagated to the outside.

Contributor

@jdowni000 jdowni000 left a comment

I like the concept and agree this is a great direction for the project to become more versatile.

@dustinblack
Member

Our example workflow today is shown below. The one major problem here is that we accept a timeout value that we pass to both the PCP and the sysbench plugins independently.

input:
  root: RootObject
  objects:
    RootObject:
      id: RootObject
      properties:
        pmlogger_interval:
          display:
            description: The logger collection interval for PCP pmlogger
            name: PCP pmlogger collection interval
          type:
            type_id: integer
        sysbench_threads:
          display:
            description: The number of threads sysbench will run
            name: sysbench threads
          type:
            type_id: integer
        sysbench_events:
          display:
            description: The number of events sysbench will run
            name: sysbench events
          type:
            type_id: integer
        sysbench_cpumaxprime:
          display:
            description: The upper limit of the number of prime numbers generated
            name: sysbench cpu max primes
          type:
            type_id: integer
        sysbench_runtime:                                      <<== Timeout input
          display:
            description: The total runtime in seconds for the sysbench tests
            name: sysbench runtime seconds
          type:
            type_id: integer
        elastic_host:
          display:
            description: The host URL for the ElasticSearch service
            name: elasticsearch host url
          type:
            type_id: string
        elastic_username:
          display:
            description: The username for the ElasticSearch service
            name: elasticsearch username
          type:
            type_id: string
        elastic_password:
          display:
            description: The password for the ElasticSearch service
            name: elasticsearch password
          type:
            type_id: string
        elastic_index:
          display:
            description: The index for the ElasticSearch service
            name: elasticsearch index
          type:
            type_id: string
steps:
  pcp:
    plugin: quay.io/arcalot/arcaflow-plugin-pcp:0.2.0
    step: start-pcp
    input:
      pmlogger_interval: !expr $.input.pmlogger_interval
      run_duration: !expr $.input.sysbench_runtime             <<== Timeout used
  sysbench:
    plugin: quay.io/arcalot/arcaflow-plugin-sysbench:0.1.0
    step: sysbenchcpu
    input:
      threads: !expr $.input.sysbench_threads
      events: !expr $.input.sysbench_events
      cpu-max-prime: !expr $.input.sysbench_cpumaxprime
      time: !expr $.input.sysbench_runtime                     <<== Timeout used
  metadata:
    plugin: quay.io/arcalot/arcaflow-plugin-metadata:0.1.0
    input: {}
  opensearch:
    plugin: quay.io/arcalot/arcaflow-plugin-opensearch:0.1.0
    input:
      url: !expr $.input.elastic_host
      username: !expr $.input.elastic_username
      password: !expr $.input.elastic_password
      index: !expr $.input.elastic_index
      data:
        pcp: !expr $.steps.pcp.outputs.success
        sysbench: !expr $.steps.sysbench.outputs.success
        metadata: !expr $.steps.metadata.outputs.success
output:
  pcp: !expr $.steps.pcp.outputs.success
  sysbench: !expr $.steps.sysbench.outputs.success
  metadata: !expr $.steps.metadata.outputs.success
  opensearch: !expr $.steps.opensearch.outputs.success
flowchart LR
subgraph input
input.sysbench_threads
input.sysbench_events
input.elastic_password
input.sysbench_runtime
input.elastic_index
input.pmlogger_interval
input.sysbench_cpumaxprime
input.elastic_username
input.elastic_host
end
steps.metadata-->steps.metadata.outputs.success
steps.metadata-->steps.metadata.outputs.error
input.elastic_password-->steps.opensearch
steps.opensearch.outputs.success-->output
steps.pcp.outputs.success-->steps.opensearch
steps.pcp.outputs.success-->output
input.pmlogger_interval-->steps.pcp
steps.sysbench.outputs.success-->steps.opensearch
steps.sysbench.outputs.success-->output
steps.metadata.outputs.success-->steps.opensearch
steps.metadata.outputs.success-->output
input.elastic_index-->steps.opensearch
input.sysbench_runtime-->steps.pcp
input.sysbench_runtime-->steps.sysbench
input.elastic_host-->steps.opensearch
input.elastic_username-->steps.opensearch
input.sysbench_events-->steps.sysbench
input.sysbench_cpumaxprime-->steps.sysbench
steps.opensearch-->steps.opensearch.outputs.success
steps.opensearch-->steps.opensearch.outputs.error
input.sysbench_threads-->steps.sysbench
steps.pcp-->steps.pcp.outputs.error
steps.pcp-->steps.pcp.outputs.success
steps.sysbench-->steps.sysbench.outputs.error
steps.sysbench-->steps.sysbench.outputs.success

This is a fine prototype, but to be properly useful there needs to be a set of relationships between PCP and sysbench, where sysbench will only start once PCP reaches a "running" state, and PCP will only stop once sysbench reaches a "finished" state. This will ensure that the data collection time frame fully encapsulates the workload time. With signaling, it may look something like this:

input:
  ...
steps:
  pcp:
    plugin: quay.io/arcalot/arcaflow-plugin-pcp:0.2.0
    step: start-pcp
    stop_if:                                                   <<== New stop_if
        - !expr $.steps.sysbench.state.finished                <<== Depends on sysbench finish
    input:
      pmlogger_interval: !expr $.input.pmlogger_interval       <<== Removed timeout
  sysbench:
    plugin: quay.io/arcalot/arcaflow-plugin-sysbench:0.1.0
    step: sysbenchcpu
    start_if:                                                  <<== New start_if
        - !expr $.steps.pcp.state.running                      <<== Depends on pcp running
    input:
      threads: !expr $.input.sysbench_threads
      events: !expr $.input.sysbench_events
      cpu-max-prime: !expr $.input.sysbench_cpumaxprime
      time: !expr $.input.sysbench_runtime
  metadata:
    plugin: quay.io/arcalot/arcaflow-plugin-metadata:0.1.0
    input: {}
  opensearch:
    plugin: quay.io/arcalot/arcaflow-plugin-opensearch:0.1.0
    input:
      url: !expr $.input.elastic_host
      username: !expr $.input.elastic_username
      password: !expr $.input.elastic_password
      index: !expr $.input.elastic_index
      data:
        pcp: !expr $.steps.pcp.outputs.success
        sysbench: !expr $.steps.sysbench.outputs.success
        metadata: !expr $.steps.metadata.outputs.success
output:
  pcp: !expr $.steps.pcp.outputs.success
  sysbench: !expr $.steps.sysbench.outputs.success
  metadata: !expr $.steps.metadata.outputs.success
  opensearch: !expr $.steps.opensearch.outputs.success
flowchart LR
subgraph input
input.sysbench_threads
input.sysbench_events
input.elastic_password
input.sysbench_runtime
input.elastic_index
input.pmlogger_interval
input.sysbench_cpumaxprime
input.elastic_username
input.elastic_host
end
steps.metadata-->steps.metadata.outputs.success
steps.metadata-->steps.metadata.outputs.error
input.elastic_password-->steps.opensearch
steps.opensearch.outputs.success-->output
steps.pcp.outputs.success-->steps.opensearch
steps.pcp.outputs.success-->output
input.pmlogger_interval-->steps.pcp
steps.sysbench.outputs.success-->steps.opensearch
steps.sysbench.outputs.success-->output
steps.metadata.outputs.success-->steps.opensearch
steps.metadata.outputs.success-->output
input.elastic_index-->steps.opensearch
input.sysbench_runtime-->steps.pcp
input.sysbench_runtime-->steps.sysbench
input.elastic_host-->steps.opensearch
input.elastic_username-->steps.opensearch
input.sysbench_events-->steps.sysbench
input.sysbench_cpumaxprime-->steps.sysbench
steps.opensearch-->steps.opensearch.outputs.success
steps.opensearch-->steps.opensearch.outputs.error
input.sysbench_threads-->steps.sysbench
steps.pcp-->steps.pcp.outputs.error
steps.pcp-->steps.pcp.outputs.success
steps.pcp-->steps.pcp.state.running
steps.pcp.state.running-->steps.sysbench
steps.sysbench-->steps.sysbench.outputs.error
steps.sysbench-->steps.sysbench.outputs.success
steps.sysbench-->steps.sysbench.state.finished
steps.sysbench.state.finished-->steps.pcp

@jaredoconnell
Contributor

From an SDK standpoint, how are we differentiating between signals like "stop_if", vs ones that are listened to for a while to use as a stream?
Would the stop_if ones just be handled the same way, except they close before any data is sent?

Or are you going for a more single-message format, where no matter what, things are processed as single signals instead of connections? And even then, will there be a schema for events like stop_if that differs from custom signals?

@ghost ghost force-pushed the signals branch 2 times, most recently from 09fa225 to d5ed458 on January 16, 2023 12:45
@ghost
Author

ghost commented Jan 16, 2023

From an SDK standpoint, how are we differentiating between signals like "stop_if", vs ones that are listened to for a while to use as a stream? Would the stop_if ones just be handled the same way, except they close before any data is sent?

Updating the specification to address this in commit 7b8bb19 . A stop_if causes the plugin to receive a SIGTERM, followed by a SIGKILL 30 seconds later.

Or are you going for a more single-message format, where no matter what, things are processed as single signals instead of connections? And even then, will there be a schema for events like stop_if that differs from custom signals?

No, if this behavior is desired you shouldn't use stop_if, but rather an appropriate listen key that has a schema. The stop_if is purely a termination signal and has no explicit schema.
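The termination sequence described above (SIGTERM, then SIGKILL after a grace period) can be sketched as follows. The helper name `terminate_gracefully` and the short demo grace period are hypothetical illustrations, not the engine's actual code:

```python
import signal
import subprocess
import time

def terminate_gracefully(proc: subprocess.Popen, grace_period: float = 30.0) -> int:
    """Send SIGTERM, give the plugin `grace_period` seconds to shut down
    cleanly, then SIGKILL it if it is still alive. Returns the exit code."""
    proc.send_signal(signal.SIGTERM)
    try:
        return proc.wait(timeout=grace_period)
    except subprocess.TimeoutExpired:
        proc.kill()  # SIGKILL cannot be caught or ignored by the plugin
        return proc.wait()

# A child that ignores SIGTERM, forcing the SIGKILL path (the grace period
# is shortened here so the demo finishes quickly):
stubborn = subprocess.Popen(
    ["python3", "-c",
     "import signal, time; signal.signal(signal.SIGTERM, signal.SIG_IGN); time.sleep(60)"]
)
time.sleep(0.5)  # let the child install its SIGTERM handler first
print(terminate_gracefully(stubborn, grace_period=1.0))  # -9 on POSIX (SIGKILL)
```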

@ghost ghost marked this pull request as ready for review January 16, 2023 16:33
@ghost
Author

ghost commented Jan 16, 2023

The proposal is now open for voting!

Member

@dustinblack dustinblack left a comment

+1 vote for proposal

Contributor

@sandrobonazzola sandrobonazzola left a comment

+1

Contributor

@hubeadmin hubeadmin left a comment

Makes sense to me. @dustinblack, this is something you were also looking at/wanted to do with the PoC for perf and scale running their tests on in-vehicle OS testing, right?

@dustinblack
Member

Makes sense to me. @dustinblack, this is something you were also looking at/wanted to do with the PoC for perf and scale running their tests on in-vehicle OS testing, right?

Yes. Without this change, or otherwise a series of other changes for feature requirements, our workflows for Perf & Scale automotive and other use cases can't really move beyond the prototype phase.

@AvlWx2014
Contributor

+1 vote for proposal.

This seems like a really good idea to me. Is it going to be up to the engine to manage creation of signal channels at the request of the plugins that are going to publish to them? Also, (maybe I missed it) are channels going to be one-way?

@ghost
Author

ghost commented Jan 17, 2023

This seems like a really good idea to me. Is it going to be up to the engine to manage creation of signal channels at the request of the plugins that are going to publish to them?

Plugins should declare the signals they publish and accept, along with their schema. The engine will only do the piping and notify the plugins when these signals arrive. The SDK will need to be updated.

Also, (maybe I missed it) are channels going to be one-way?

Yes, channels are strictly one-way; otherwise we can't guarantee through static analysis that a workflow can be executed. Allowing circular dependencies would also introduce the problem of endless loops.

@ghost
Author

ghost commented Jan 17, 2023

@AvlWx2014 please vote with a code review.

Contributor

@portante portante left a comment

I am really not sure why there is even a vote for this change.

Why does the community need to vote for this change?

If I vote, yes, what am I saying? That I agree with all the details of the proposal? I don't. Or am I saying that I agree with the general notion of adding signals to the work flow engine? I do.

I think the weakness of the proposal is the lack of a clear mapping of the existing output behavior to signals.

I don't think we should add signals next to the current output behavior but define the old behavior entirely in terms of signals.

If that is not the direction this is taking, then my vote is no.

If that is the direction this is taking, then my vote is yes.

But I found it unclear what direction this is really heading in with regard to the existing output behavior and signals.


In this proposal, we transform the execution of Arcaflow by adding the ability to send and receive signals via signal
channels. Each signal channel will have a schema and is declared by a plugin. Workflow authors can take these signals
and pipe them into other plugins that have declared they can receive signals.
Contributor

Is all "output" now a signal? It seems like we need to strengthen this statement. I really like what I think the direction is here, that what was an output is now a signal. That means the previous execution model has a direct mapping to the signal execution model.

If that is not the case, then could we move in that direction?

Contributor

For steps that just do work and report the result, what would be the advantage of using signals for them?
It may just be easier to keep the existing input and output options, with signals for events and mid-step data transfers.

uperf_client:
...
uperf_server:
stop_if:
Contributor

This feels a bit arbitrary.

Why just pick on "stopping" execution?

It seems like we want to be able to declare a signal a plugin listens for, and let it decide the action to take.

If this is about having the engine listen for a signal and take action on the plugin, then that seems weird.

Member

I feel like what you're describing would be an extension of the proposed signaling system. My preference would be to start with the minimum required functionality described in this proposal, and only extend the signaling features as new use cases require.

Author

We define a stop condition because we have the need for it. Currently, there is no other life cycle event that needs special handling.

Contributor

So you are saying you want to embed in the workflow description one kind of stop condition?

Why do we need that?

Author

We need a way to signal a plugin to stop, or to forcefully kill it if need be. Uperf needs this, and there are a few other situations where a plugin needs to be stopped regardless of its successful termination.

4. `state:finished`

In addition, we will also introduce nodes for each of the signals the plugin emits, prefixed with `signal:`. The
outputs will be prefixed with `output:`.
Contributor

It seems an output is a kind of signal, if we are heading that way. Perhaps we could unify these?

Author

That would remove the ability to properly static type check workflows.

Contributor

Why?

Author

Each output has its own schema. Similarly, each signal is designed to have an explicit schema. If we unify the outputs into one schema, the typing is lost. Similarly, if we merge them with the state changes, you no longer have the ability to wait for one specific output, and you have no way to wait for the plugin to finish regardless of output.


In order to ensure that a workflow can be executed, signals constitute a dependency on the step in the dependency tree.
This means that two steps cannot depend on each other's signals, nor can three or more steps form a dependency circle.
This constitutes a limitation, because no two steps can form a constant back and forth communication via signals.
Contributor

I am not entirely sure why this is a limitation, and why it needs to be prevented.

Why not have a default behavior of disallowing the cycle, but allow the user to explicitly describe the cycle, along with a time limit after which the engine breaks it up when no progress is made?

Author

The Arcaflow engine works by evaluating which steps have the data to proceed and then executing the steps that are ready. This means that you don't have to specify explicit dependencies; they are implicitly derived from expressions.

Simultaneously, workflows without cycles are super simple to execute and detect bugs in. Loops can be implemented as subworkflows.

If we allow arbitrary loops in the workflows, we lose the simplicity and safety constraints of the workflow execution.
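A scheduler of this kind (run whichever steps have their inputs, reject dependency circles up front) can be sketched with Kahn's topological sort. The function and the dependency map below are illustrative, not the engine's actual code, and only model start dependencies:

```python
from collections import deque

def execution_order(deps):
    """deps maps each step to the steps it waits on (outputs or signals).
    Returns a valid run order, or raises if a dependency circle exists."""
    indegree = {step: len(waiting_on) for step, waiting_on in deps.items()}
    dependents = {step: [] for step in deps}
    for step, waiting_on in deps.items():
        for upstream in waiting_on:
            dependents[upstream].append(step)

    # Steps with no unmet dependencies are ready to run immediately.
    ready = deque(step for step, n in indegree.items() if n == 0)
    order = []
    while ready:
        step = ready.popleft()
        order.append(step)
        for downstream in dependents[step]:
            indegree[downstream] -= 1
            if indegree[downstream] == 0:
                ready.append(downstream)

    # Any step never reaching indegree 0 is part of a circle.
    if len(order) != len(deps):
        raise ValueError("dependency circle detected; workflow cannot be scheduled")
    return order

# Hypothetical dependency map mirroring the example workflow: sysbench
# waits on pcp's "running" signal; opensearch waits on all three outputs.
print(execution_order({
    "pcp": [],
    "metadata": [],
    "sysbench": ["pcp"],
    "opensearch": ["pcp", "sysbench", "metadata"],
}))
```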

Contributor

I am not following why we want simplicity after introducing the complexity of signals, and why we can't warn the user when it is not safe, instead of "knowing better than they do" and preventing it.

Author

I know of no better way to create a system that can have a schema, avoid endless loops, and automatically deduce dependencies.

@dustinblack
Member

I am really not sure why there is even a vote for this change.

Why does the community need to vote for this change?

This proposes a fundamental architectural change to the engine that could break compatibility with plugins. If this were a more modest feature enhancement, then we could certainly hash it out in the PR process. Due diligence is being done here to ensure that the Round Table agrees with the high-level description of the change proposal.

If I vote, yes, what am I saying? That I agree with all the details of the proposal? I don't. Or am I saying that I agree with the general notion of adding signals to the work flow engine? I do.

I think the weakness of the proposal is the lack of a clear mapping of the existing output behavior to signals.

I don't think we should add signals next to the current output behavior but define the old behavior entirely in terms of signals.

If that is not the direction this is taking, then my vote is no.

If that is the direction this is taking, then my vote is yes.

But I found it unclear what direction this is really heading in with regard to the existing output behavior and signals.

To the technical part of your questions: my understanding is that we want to preserve backward compatibility of existing plugins built with the SDK, so the existing input and output mechanics need to remain the same. But exactly how we implement that is, I believe, a technical discussion that can be had in the PR process. IMO the spirit of the proposal for an architectural change is what is important here.

The procedural parts of your questions are probably better addressed in a separate forum.

@ghost
Author

ghost commented Jan 18, 2023

@portante the proposal was in draft, asking for content change proposals, for over a month; it barely got any feedback. Now it's in the voting phase, which means you can agree or disagree with it. (See the charter.)

If you can write up your changes, perhaps as a separate/modified proposal, I'm happy to withdraw this one and we can continue the work on that. In general, whatever the changes may be, I would keep the initial design choices of static typing and the ability to verify the correctness of a workflow largely without running it.

If you want to have a go at rewriting this proposal, I'm happy to hold off on merging until you finish it.

10 participants