Design DTT1 - Define (DAG Directed Acyclic Graph) #4766

Closed
23 tasks done
fcaffieri opened this issue Dec 12, 2023 · 10 comments

@fcaffieri
Member

fcaffieri commented Dec 12, 2023

EPIC: #4495

Description

The objective of this issue is to choose a directed acyclic graph (DAG) orchestrator to run the DTT modules.

We need to analyze the candidate orchestrators below and weigh the advantages and disadvantages of each one before deciding.

Platforms

Python frameworks

Others

Restrictions on tool choice

  • Python is desirable
  • A single tool should cover what we need
  • As long as it covers what we need, it should be as simple as possible.
  • Flexibility/modularization
  • Widely adopted tool - Maturity

Tool selection new criteria

Based on a discussion with @rauldpm and @jnasselle, we've decided to use a framework rather than a platform. The motivation is to avoid managing an extra service.

DoD

  • List as many frameworks as we can find that could help us achieve our goal. HINT: do not go deep, just list them.
  • Based on the search results, do a first cleanup using the most important selection criteria (TBD).
  • Based on the curated results, go deeper and aim for a maximum of three alternatives.
  • Pick one of those alternatives -> TaskFlow
@jnasselle
Member

jnasselle commented Dec 13, 2023

Update

Nice article about DAGs as a formal model of a workflow: https://www.prefect.io/blog/you-probably-dont-need-a-dag

It is not meant to change our current approach, but it is a good read.

Define terminology

Based on the industry and similar software, most tools use the following terms:

  • task/operator: an atomic, indivisible action
  • workflow/flow: a combination and sequencing of tasks/operators
  • execution/job: an instance of a workflow/flow
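
To make the three terms concrete, here is a minimal sketch of how they could map to Python types. The class and field names (`Task`, `Workflow`, `Execution`, `depends_on`, `run_id`) are illustrative assumptions, not part of any of the tools evaluated in this issue.

```python
import uuid
from dataclasses import dataclass, field
from typing import Callable, Dict, List, Sequence


@dataclass
class Task:
    """task/operator: an atomic, indivisible action (hypothetical shape)."""
    name: str
    action: Callable[[], object]


@dataclass
class Workflow:
    """workflow/flow: a set of tasks plus the dependencies between them."""
    name: str
    tasks: Dict[str, Task] = field(default_factory=dict)
    # Maps a task name to the names of the tasks it depends on.
    depends_on: Dict[str, List[str]] = field(default_factory=dict)

    def add(self, task: Task, depends_on: Sequence[str] = ()) -> None:
        self.tasks[task.name] = task
        self.depends_on[task.name] = list(depends_on)


@dataclass
class Execution:
    """execution/job: one run (instance) of a workflow."""
    workflow: Workflow
    run_id: str = field(default_factory=lambda: uuid.uuid4().hex)
    results: Dict[str, object] = field(default_factory=dict)
```

The same `Workflow` can back many `Execution` objects, which is the distinction the terminology is drawing.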

@jnasselle jnasselle self-assigned this Dec 13, 2023
@rauldpm rauldpm self-assigned this Dec 13, 2023
@fcaffieri
Member Author

fcaffieri commented Dec 13, 2023

Update

Installed Airflow to run a basic proof of concept.

So far it looks very powerful, but it has many features that we will not use.


One important point: at first glance, taskflows can only be created with Python code.

Next step: continue the analysis of Airflow and compare it with the other tools.


Restrictions

  • Python is desirable 🟢
  • A single tool should cover what we need 🟢
  • As long as it covers what we need, it should be as simple as possible. 🟡
  • Flexibility/modularization 🟢
  • Widely adopted tool - Maturity 🟢

@jnasselle
Member

jnasselle commented Dec 14, 2023

Update - Testing Prefect

Restrictions

  • Python is desirable 🟢
  • A single tool should cover what we need 🟡
  • As long as it covers what we need, it should be as simple as possible. 🟡
  • Flexibility/modularization 🟢
  • Widely adopted tool - Maturity 🟢


@jnasselle
Member

jnasselle commented Dec 14, 2023

Update - Testing Dagster

  • Python is desirable 🟢
  • A single tool should cover what we need 🟡
  • As long as it covers what we need, it should be as simple as possible. 🟢
  • Flexibility/modularization 🟢
  • Widely adopted tool - Maturity 🟢

Not going further with it, given the new selection criteria (frameworks rather than platforms).

@fcaffieri
Member Author

fcaffieri commented Dec 14, 2023

Update - Testing Yadage

Yadage uses JSON Reference to define workflows.


Restrictions

  • Python is desirable 🟢
  • A single tool should cover what we need 🟡
  • As long as it covers what we need, it should be as simple as possible. 🟡
  • Flexibility/modularization
  • Widely adopted tool - Maturity


Implementation

  • Easy to install.
  • No front end to visualise the workflow.

@jnasselle
Member

jnasselle commented Dec 19, 2023

Update - Testing Taskflow

A library to do [jobs, tasks, flows] in a highly available, easy to understand and declarative manner (and more!) to be used with OpenStack and other projects.

https://wiki.openstack.org/wiki/TaskFlow/Paradigm_shifts

Restrictions

  • Python is desirable 🟢
  • A single tool should cover what we need 🟡 (no YAML as DAG input)
  • As long as it covers what we need, it should be as simple as possible. 🟢
  • Flexibility/modularization 🟢
  • Widely adopted tool - Maturity 🟢


Implementation

@fcaffieri
Member Author

Update

Finished the schema validation for the input YAML used by the task flow.
The next step is to integrate this validator into the task flow.

Input yaml example:
inputTask.yml.txt

Schema definition:
schema.json.txt

Validator:
schemeValidator_v2.py.txt
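
The attached files are not reproduced in this thread, so the following is only a sketch of the kind of check such a validator might perform. It assumes the YAML has already been parsed into a Python dictionary, and the field names (`tasks`, `name`, `depends_on`) are hypothetical rather than taken from `schema.json.txt`. It uses only the standard library instead of a JSON Schema package.

```python
from typing import Any, List


def validate_workflow(doc: Any) -> List[str]:
    """Return a list of human-readable problems; an empty list means valid.

    The expected layout (hypothetical, for illustration only):
        {"tasks": [{"name": "a"}, {"name": "b", "depends_on": ["a"]}]}
    """
    if not isinstance(doc, dict):
        return ["document must be a mapping"]
    tasks = doc.get("tasks")
    if not isinstance(tasks, list) or not tasks:
        return ["'tasks' must be a non-empty list"]

    errors: List[str] = []
    names = set()
    for i, task in enumerate(tasks):
        if not isinstance(task, dict):
            errors.append(f"tasks[{i}] must be a mapping")
            continue
        name = task.get("name")
        if not isinstance(name, str) or not name:
            errors.append(f"tasks[{i}].name must be a non-empty string")
        elif name in names:
            errors.append(f"duplicate task name: {name}")
        else:
            names.add(name)
        deps = task.get("depends_on", [])
        if not isinstance(deps, list) or not all(isinstance(d, str) for d in deps):
            errors.append(f"tasks[{i}].depends_on must be a list of strings")

    # Second pass: every dependency must refer to a declared task.
    for task in tasks:
        if not isinstance(task, dict):
            continue
        deps = task.get("depends_on", [])
        if isinstance(deps, list):
            for dep in deps:
                if isinstance(dep, str) and dep not in names:
                    errors.append(f"unknown dependency: {dep}")
    return errors
```

Returning all problems at once, rather than raising on the first one, tends to make a malformed input file quicker to fix.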

@jnasselle
Member

Update

Based on the previous investigation of DAG engines, in which TaskFlow came out as the pick, we agreed to attempt our own DAG engine with only the bare basic functionality, for the following reasons:

  • No framework provides the functionality we need out of the box: a lot of glue is needed (YAML parser, workflow-to-TaskFlow conversion, ...)
  • No framework has outstanding maintenance or is widely adopted
  • Python's TopologicalSorter (graphlib) is now part of the standard library

Conclusion

We are going to develop our DAG engine based on graphlib
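
As an illustration of what a graphlib-based engine could look like, here is a minimal sketch using `graphlib.TopologicalSorter` (standard library since Python 3.9). The function name `run_dag` and its two-dictionary input are assumptions made for this example; the actual engine design is not described in this issue.

```python
from graphlib import CycleError, TopologicalSorter
from typing import Callable, Dict, Iterable, List


def run_dag(
    actions: Dict[str, Callable[[], object]],
    depends_on: Dict[str, Iterable[str]],
) -> List[str]:
    """Run every action after its dependencies and return the order used."""
    # TopologicalSorter expects a mapping of node -> predecessors.
    graph = {name: set(depends_on.get(name, ())) for name in actions}
    unknown = {dep for deps in graph.values() for dep in deps} - set(actions)
    if unknown:
        raise ValueError(f"dependencies without an action: {sorted(unknown)}")

    sorter = TopologicalSorter(graph)
    sorter.prepare()  # raises CycleError if the graph contains a cycle

    order: List[str] = []
    while sorter.is_active():
        # get_ready() returns the nodes whose predecessors are all done.
        for name in sorter.get_ready():
            actions[name]()
            order.append(name)
            sorter.done(name)
    return order


if __name__ == "__main__":
    steps = {
        "allocate": lambda: print("allocating"),
        "deploy": lambda: print("deploying"),
        "test": lambda: print("testing"),
    }
    try:
        run_dag(steps, {"deploy": ["allocate"], "test": ["deploy"]})
    except CycleError as exc:
        print(f"not a DAG: {exc}")
```

The `get_ready()`/`done()` pair is used here instead of `static_order()` because it would also allow independent tasks to be handed to a worker pool later; this sketch runs them sequentially.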

@QU3B1M
Member

QU3B1M commented Dec 26, 2023

LGTM!

@fcaffieri
Member Author

LGTM

Projects
No open projects
Status: Done
Development

No branches or pull requests

4 participants