-
Notifications
You must be signed in to change notification settings - Fork 15
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Signaling proposal #24
Conversation
0a52891
to
ef79221
Compare
[qDup](https://github.com/Hyperfoil/qDup). This tool not only allows people to write workflows, but also dynamically | ||
react to outputs of programs and publish signals on an internal messaging bus. Workflow parts can communicate with each | ||
other even while running, sending and waiting for signals. | ||
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Another aspect of qDup signals that we did not discuss was that they are in effect global, resettable countdown latches. This provides a number of key capabilities for qDup scripts, such as;
- a waiting script can wait until for
1..n
other scripts to raise a signal before proceeding - when signals have reached a counter of 0, they will no longer block scripts in the future (i.e. a condition has been met)
- signals are resettable, so the pre-conditions for proceeding can be reset (useful for loops)
- there are specific qDup commands that allow for looping and branching based on signal states, e.g.
repeat-until
signal state is reached,read-signal
for branching depending on signal state,wait-for
signal to be raised etc
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It looks to me like it's a good foundation to solve a lot of the current limitations.
### Outputs | ||
|
||
The current normal output results will also be transformed into signals, which depend on the completion of a plugin. | ||
However, a plugin is now allowed to not declare an output and work with signals instead. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I can't think of a specific issue (I might need more coffee), but on the surface allowing the case of "no output" feels like it could cause problems.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The intent here was to cover the uperf server (?) use case, which has no meaningful output.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
That's fair. I just feel like the output
constitutes a sort of "punctuation" for a plugin's successful end. Maybe that's not functionally necessary.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
You can always wait for state:finished
if you need to wait for the plugin to finish execution.
outputs will be prefixed with `output:`. | ||
|
||
In order to ensure that a workflow can be executed, signals constitute a dependency on the step in the dependency tree. | ||
This means, that two steps cannot depend on each other's signals, nor can three or more steps form a dependency circle. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
How might this affect our feature requirement for looping over sub-workflows?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
You can still loop, the loop itself contains a sub-workflow. Signals from within sub-workflows can be propagated to the outside.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I like the concept and agree this is a great direction for the project to be more versatile
Our example workflow today is shown below. The one major problem here is that we accept a timeout value that we pass to both the PCP and the sysbench plugins independently. input:
root: RootObject
objects:
RootObject:
id: RootObject
properties:
pmlogger_interval:
display:
description: The logger collection interval for PCP pmlogger
name: PCP pmlogger collection interval
type:
type_id: integer
sysbench_threads:
display:
description: The number of threads sysbench will run
name: sysbench threads
type:
type_id: integer
sysbench_events:
display:
description: The number of events sysbench will run
name: sysbench events
type:
type_id: integer
sysbench_cpumaxprime:
display:
description: The upper limit of the number of prime numbers generated
name: sysbench cpu max primes
type:
type_id: integer
sysbench_runtime: <<== Timeout input
display:
description: The total runtime in seconds for the sysbench tests
name: sysbench runtime seconds
type:
type_id: integer
elastic_host:
display:
description: The host URL for the ElasticSearch service
name: elasticsearch host url
type:
type_id: string
elastic_username:
display:
description: The username for the ElasticSearch service
name: elasticsearch username
type:
type_id: string
elastic_password:
display:
description: The password for the ElasticSearch service
name: elasticsearch password
type:
type_id: string
elastic_index:
display:
description: The index for the ElasticSearch service
name: elasticsearch index
type:
type_id: string
steps:
pcp:
plugin: quay.io/arcalot/arcaflow-plugin-pcp:0.2.0
step: start-pcp
input:
pmlogger_interval: !expr $.input.pmlogger_interval
run_duration: !expr $.input.sysbench_runtime <<== Timeout used
sysbench:
plugin: quay.io/arcalot/arcaflow-plugin-sysbench:0.1.0
step: sysbenchcpu
input:
threads: !expr $.input.sysbench_threads
events: !expr $.input.sysbench_events
cpu-max-prime: !expr $.input.sysbench_cpumaxprime
time: !expr $.input.sysbench_runtime <<== Timeout used
metadata:
plugin: quay.io/arcalot/arcaflow-plugin-metadata:0.1.0
input: {}
opensearch:
plugin: quay.io/arcalot/arcaflow-plugin-opensearch:0.1.0
input:
url: !expr $.input.elastic_host
username: !expr $.input.elastic_username
password: !expr $.input.elastic_password
index: !expr $.input.elastic_index
data:
pcp: !expr $.steps.pcp.outputs.success
sysbench: !expr $.steps.sysbench.outputs.success
metadata: !expr $.steps.metadata.outputs.success
output:
pcp: !expr $.steps.pcp.outputs.success
sysbench: !expr $.steps.sysbench.outputs.success
metadata: !expr $.steps.metadata.outputs.success
opensearch: !expr $.steps.opensearch.outputs.success flowchart LR
subgraph input
input.sysbench_threads
input.sysbench_events
input.elastic_password
input.sysbench_runtime
input.elastic_index
input.pmlogger_interval
input.sysbench_cpumaxprime
input.elastic_username
input.elastic_host
end
steps.metadata-->steps.metadata.outputs.success
steps.metadata-->steps.metadata.outputs.error
input.elastic_password-->steps.opensearch
steps.opensearch.outputs.success-->output
steps.pcp.outputs.success-->steps.opensearch
steps.pcp.outputs.success-->output
input.pmlogger_interval-->steps.pcp
steps.sysbench.outputs.success-->steps.opensearch
steps.sysbench.outputs.success-->output
steps.metadata.outputs.success-->steps.opensearch
steps.metadata.outputs.success-->output
input.elastic_index-->steps.opensearch
input.sysbench_runtime-->steps.pcp
input.sysbench_runtime-->steps.sysbench
input.elastic_host-->steps.opensearch
input.elastic_username-->steps.opensearch
input.sysbench_events-->steps.sysbench
input.sysbench_cpumaxprime-->steps.sysbench
steps.opensearch-->steps.opensearch.outputs.success
steps.opensearch-->steps.opensearch.outputs.error
input.sysbench_threads-->steps.sysbench
steps.pcp-->steps.pcp.outputs.error
steps.pcp-->steps.pcp.outputs.success
steps.sysbench-->steps.sysbench.outputs.error
steps.sysbench-->steps.sysbench.outputs.success
This is a fine prototype, but to be properly useful there needs to be a set of relationships between PCP and sysbench, where sysbench will only start once PCP reaches a "running" state, and PCP will only stop once sysbench reaches a "finished" state. This will ensure that the data collection time frame fully encapsulates the workload time. With signaling, it may look something like this: input:
...
steps:
pcp:
plugin: quay.io/arcalot/arcaflow-plugin-pcp:0.2.0
step: start-pcp
stop_if: <<== New stop_if
- !expr $.steps.sysbench.state.finished <<== Depends on sysbench finish
input:
pmlogger_interval: !expr $.input.pmlogger_interval <<== Removed timeout
sysbench:
plugin: quay.io/arcalot/arcaflow-plugin-sysbench:0.1.0
step: sysbenchcpu
start_if: <<== New start_if
- !expr $.steps.pcp.state.running <<== Depends on pcp running
input:
threads: !expr $.input.sysbench_threads
events: !expr $.input.sysbench_events
cpu-max-prime: !expr $.input.sysbench_cpumaxprime
time: !expr $.input.sysbench_runtime
metadata:
plugin: quay.io/arcalot/arcaflow-plugin-metadata:0.1.0
input: {}
opensearch:
plugin: quay.io/arcalot/arcaflow-plugin-opensearch:0.1.0
input:
url: !expr $.input.elastic_host
username: !expr $.input.elastic_username
password: !expr $.input.elastic_password
index: !expr $.input.elastic_index
data:
pcp: !expr $.steps.pcp.outputs.success
sysbench: !expr $.steps.sysbench.outputs.success
metadata: !expr $.steps.metadata.outputs.success
output:
pcp: !expr $.steps.pcp.outputs.success
sysbench: !expr $.steps.sysbench.outputs.success
metadata: !expr $.steps.metadata.outputs.success
opensearch: !expr $.steps.opensearch.outputs.success flowchart LR
subgraph input
input.sysbench_threads
input.sysbench_events
input.elastic_password
input.sysbench_runtime
input.elastic_index
input.pmlogger_interval
input.sysbench_cpumaxprime
input.elastic_username
input.elastic_host
end
steps.metadata-->steps.metadata.outputs.success
steps.metadata-->steps.metadata.outputs.error
input.elastic_password-->steps.opensearch
steps.opensearch.outputs.success-->output
steps.pcp.outputs.success-->steps.opensearch
steps.pcp.outputs.success-->output
input.pmlogger_interval-->steps.pcp
steps.sysbench.outputs.success-->steps.opensearch
steps.sysbench.outputs.success-->output
steps.metadata.outputs.success-->steps.opensearch
steps.metadata.outputs.success-->output
input.elastic_index-->steps.opensearch
input.sysbench_runtime-->steps.pcp
input.sysbench_runtime-->steps.sysbench
input.elastic_host-->steps.opensearch
input.elastic_username-->steps.opensearch
input.sysbench_events-->steps.sysbench
input.sysbench_cpumaxprime-->steps.sysbench
steps.opensearch-->steps.opensearch.outputs.success
steps.opensearch-->steps.opensearch.outputs.error
input.sysbench_threads-->steps.sysbench
steps.pcp-->steps.pcp.outputs.error
steps.pcp-->steps.pcp.outputs.success
steps.pcp-->steps.pcp.state.running
steps.pcp.state.running-->steps.sysbench
steps.sysbench-->steps.sysbench.outputs.error
steps.sysbench-->steps.sysbench.outputs.success
steps.sysbench-->steps.sysbench.state.finished
steps.sysbench.state.finished-->steps.pcp
|
From an SDK standpoint, how are we differentiating between signals like ones like "stop_if", vs ones that are listened to for a while to use as a stream? Are are you going for a more single-message format, where no matter what things are processed as single signals instead of connections? And still then, will there be a schema for events like stop_if, that differs from custom signals? |
09fa225
to
d5ed458
Compare
Updating the specification to address this in commit 7b8bb19 . A
No, if this behavior is desired you shouldn't use |
The proposal is now open for voting! |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
+1 vote for proposal
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
+1
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Makes sense to me, @dustinblack This is something you were also looking at/wanted to do with the PoC for perf and scale running their tests on in-vehicle OS testing right?
Yes. Without this change, or otherwise a series of other changes for feature requirements, our workflows for Perf & Scale automotive and other use cases can't really move beyond the prototype phase. |
+1 vote for proposal. This seems like a really good idea to me. Is it going to be up to the engine to manage creation of signal channels at the request of the plugins that are going to publish to them? Also, (maybe I missed it) are channels going to be one-way? |
Plugins should declare signals they publish and accept, along with their schema. The engine will translate only do the piping and notify the plugins when these signals arrive. The SDK will need to be updated.
Yes, channels are strictly one-way, otherwise we can't guarantee that a workflow can be executed in a static analysis fashion. It also brings the problem of endless loops if we allow circular dependencies. |
@AvlWx2014 please vote with a code review. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I am really not sure why there is even a vote for this change.
Why does the community need to vote for this change?
If I vote, yes, what am I saying? That I agree with all the details of the proposal? I don't. Or am I saying that I agree with the general notion of adding signals to the work flow engine? I do.
I think the weakness of the proposal is the lack of a clear mapping of the existing output behavior to signals.
I don't think we should add signals next to the current output behavior but define the old behavior entirely in terms of signals.
If that is not the direction this is taking, then my vote is no.
If that is the direction this is taking, then my vote is yes.
But I found it unclear what direction this is really heading in with regard to the existing output behavior and signals.
|
||
In this proposal, we transform the execution of Arcaflow by adding the ability to send and receive signals via signal | ||
channels. Each signal channel will have a schema and is declared by a plugin. Workflow authors can take these signals | ||
and pipe them into other plugins that have declared they can receive signals. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is all "output" now a signal? It seems like we need to strengthen this statement. I really like what I think the direction is here, that what was an output is now a signal. That means the previous execution model has a direct mapping to the signal execution model.
If that is not the case, then could we move in that direction?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
For steps that just do work and report the result, what would be the advantage of using signals for them?
It may just be easier to have the existing input and output option, with signals for events and mid-step data transfers.
uperf_client: | ||
... | ||
uperf_server: | ||
stop_if: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This feels a bit arbitrary.
Why just pick on "stopping" execution?
It seems like we want to be able to declare a signal a plugin listens for, and let it decide the action to take.
If this is about have the engine listen for a signal and take action on the plugin, then that seems weird.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I feel like what your describing would be an extension of the proposed signaling system. My preference would be to start with the minimum required functionality described in this proposal, and only extend the signaling features as new use-cases might require.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We define a stop condition because we have the need for it. Currently, there is no other life cycle event that needs special handling.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
So you are saying you want to embed in the workflow description one kind of stop condition?
Why do we we need that?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We need a way to signal a plugin to stop or forcefully kill it if need be. Uperf needs this, and there are a few other situations when the plugin needs to be stopped regardless of its successful termination.
4. `state:finished` | ||
|
||
In addition, we will also introduce nodes for each of the signals the plugin emits, prefixed with `signal:`. The | ||
outputs will be prefixed with `output:`. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It seems an output is a kind of signal, if we are heading that way. Perhaps we could unify these?+
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
That would remove the ability to properly static type check workflows.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Why?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Each output has its own schema. Similarly, each signal is designed to have an explicit schema. If we unify the outputs into one schema, the typing is lost. Similarly, if we merge it with the state changes, you no longer have the ability to wait for one specific output, or you have no way to wait for the plugin to finish regardless of output.
|
||
In order to ensure that a workflow can be executed, signals constitute a dependency on the step in the dependency tree. | ||
This means, that two steps cannot depend on each other's signals, nor can three or more steps form a dependency circle. | ||
This constitutes a limitation, because no two steps can form a constant back and forth communication via signals. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I am not entirely sure why this is a limitation, and why it needs to be prevented.
Why not have a default behavior of disallowing the cycle, but allow the user to describe the cycle and time limit when no progress is made in face of it, where the engine breaks it up?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The Arcaflow engine works by evaluating which steps have the data to proceed and then executing the steps that are ready. This means that you don't have to specify explicit dependencies, they are implicitly evaluated from expressions.
Simultaneously, workflows without cycles are super simple to execute and detect bugs in. Loops can be implemented as subworkflows.
If we allow arbitrary loops in the workflows, we loose simplicity and safety constraints from the workflow execution.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I am not following why we want simplicity after introducing the complexity of signals, and why can't we warn the user when it is not safe instead of "knowing better then they" and preventing it?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I know of no better way to create a system that can have a schema, avoid endless loops, and automatically deduce dependencies.
This proposes a fundamental architectural change to the engine that could break compatibility with plugins. If this were a more modest feature enhancement, then we could certainly hash it out in the PR process. Due diligence is being done here to ensure that the Round Table agrees with the high-level description of the change proposal.
To the technical part of your questions, my understanding is that we want to preserve backward compatibility of existing plugins built with the SDK, so the existing input and output mechanics need to remain the same. But truly, exactly how we implement that I do believe is a technical discussion that can be had in the PR process. IMO the spirit of the proposal for an architectural change is what is important here. The procedural parts of your questions are probably better addressed in a separate forum. |
@portante the proposal was in draft, asking for content change proposals for over a month. Barely got any feedback. Now it's in the voting phase, which means you can agree or disagree with it. (See the charter.) If you can write up your changes, perhaps as a separate/modified proposal, I'm happy to withdraw this one, and we can continue the work on that. In general, whatever the changes will be, I would keep the initial design choice of static typing and the possibility to verify the correctness of a workflow in large parts without running it. If you want to have a go at rewriting this proposal, I'm happy to wait with the merge until you finish it. |
Changes introduced with this PR
This PR adds a proposal to change the Arcaflow engine execution model to use signals instead of front-to-back execution. This also solves arcalot/arcaflow-engine#7.
Easy to read version: https://github.com/arcalot/arcalot-round-table/blob/signals/art-decisions/proposals/2022-11-20-arcaflow-signals.md
This proposal is currently up for debate, no voting period has been set.
By contributing to this repository, I agree to the contribution guidelines.