[Spike] Explore integration with Dagster #3180

Open
astrojuanlu opened this issue Oct 16, 2023 · 9 comments
Labels
Issue: Feature Request New feature or improvement to existing feature

Comments

@astrojuanlu
Member

Description

I have heard from several data people that they're happy with Dagster, which is probably the only "modern", widely used orchestrator that is not mentioned in our docs.

There was a request upstream to add a Kedro integration to Dagster (dagster-io/dagster#2062), but it's unclear what finally happened.

@astrojuanlu astrojuanlu added Issue: Feature Request New feature or improvement to existing feature Component: Documentation 📄 Issue/PR for markdown and API documentation labels Oct 16, 2023
@stichbury
Contributor

I'm not clear what the ticket here is for. Is this documentation along the lines of #2817?

@datajoely
Contributor

I think it involves more of a spike to work out how it would actually work. I think Flyte (LFAI), Dagster and Metaflow all fall into the modern orchestrator space, which isn't served by Kedro. I would also push for us to address some of the fundamentals outlined in #3094 before doing this.

@stichbury
Contributor

Thanks! But in that case, it's not a docs ticket so I'll remove the label.

@stichbury stichbury removed the Component: Documentation 📄 Issue/PR for markdown and API documentation label Oct 16, 2023
@astrojuanlu
Member Author

Thanks both - yeah initially I thought about it as a docs ticket (even though the phrasing didn't match) but you're right, this should be a spike first.

And good point @datajoely on looking at Flyte and Metaflow too (let's call them Tier 3), although both have 0.1x the PyPI downloads of Dagster, so I wouldn't consider them on the same level of adoption. For reference, Dagster and Prefect (Tier 2) have about the same number of downloads, and both have 0.05x the downloads of Airflow (Tier 1). Kedro lies between Tier 2 and Tier 3 at the moment.

@astrojuanlu astrojuanlu changed the title Explore integration with Dagster [Spike] Explore integration with Dagster Oct 16, 2023
@datajoely
Contributor

Aligned - I also think Dagster is closer to Kedro than the others in terms of granularity. In recent years they've really invested in their dbt integration, and perhaps we can take inspiration from how they've done that.

@MatthiasRoels

I never explored Dagster as much as I should have, but I really like the idea of software-defined assets. However, Dagster looks complicated, as it has many concepts to understand. I'm also not sure how individual tasks run (especially in a Kubernetes context).

@astrojuanlu
Member Author

@gtauzin is experimenting with Kedro & Dagster! https://github.com/gtauzin/kedro-spaceflights-dagster

@gtauzin

gtauzin commented Nov 5, 2024

Hey! Thanks @astrojuanlu for pinging me, it's nice to see some interest in a dagster integration!

It seems to me kedro and dagster are nicely complementary:

  • dagster has an asset-driven perspective, which pushes you to define nodes in a graph from the perspective of the assets they generate. However, the node functions do not have to return the assets or take the ones they depend on as inputs. The data asset I/O is left to the user to write. This can be very confusing at first.
  • kedro has numerous data connectors and dataset factories, which help provide some structure and clarity to complex pipelines. They can be used directly to help define dagster "asset factories" and remove the trouble of having to handle I/O.
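
To make the second point a bit more concrete, here is a minimal plain-Python sketch of the "asset factory" idea. It deliberately imports neither kedro nor dagster: `ToyDataset`, `make_asset_fn` and the `catalog` dict are hypothetical names made up for illustration, and the only thing borrowed from kedro is the convention that a dataset exposes `load()` and `save()`. The node function stays pure, and the factory handles all the I/O around it.

```python
from typing import Any, Callable, Dict, List


class ToyDataset:
    """Hypothetical stand-in for a dataset: it only knows how to load and save."""

    def __init__(self, data: Any = None) -> None:
        self._data = data

    def load(self) -> Any:
        return self._data

    def save(self, data: Any) -> None:
        self._data = data


def make_asset_fn(
    func: Callable[..., Any],
    inputs: List[str],
    output: str,
    catalog: Dict[str, ToyDataset],
) -> Callable[[], None]:
    """Wrap a pure node function so that all I/O goes through the catalog."""

    def asset_fn() -> None:
        # Load every input from the catalog, call the pure function,
        # then save the result under the output name.
        loaded = [catalog[name].load() for name in inputs]
        catalog[output].save(func(*loaded))

    # Name the wrapper after the asset it produces.
    asset_fn.__name__ = output
    return asset_fn


def add_one(values):
    """A pure node function: no reading or writing of data here."""
    return [v + 1 for v in values]


catalog = {"raw": ToyDataset([1, 2, 3]), "shifted": ToyDataset()}
shifted = make_asset_fn(add_one, ["raw"], "shifted", catalog)
shifted()
print(catalog["shifted"].load())
```

In a real integration, the wrapper returned by the factory is what would presumably be registered with dagster as an asset, but that wiring is out of scope for this sketch.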

Like @MatthiasRoels, I also feel that dagster has a lot of concepts. Each of them separately is not necessarily complex, but the way they relate to each other is not always clear to me from the documentation (and the chatbot in there has confused me more than anything else so far).

For example, there are several ways of mapping kedro to dagster because dagster has many concepts around generic tasks:

  • an op: a task not necessarily associated with an asset;
  • a graph of ops;
  • an asset: which is also an op;
  • a multi asset: an op that defines multiple assets;
  • an asset graph: a graph of ops that ends up defining an asset.

In practice, to map kedro nodes, I believe multi assets would make sense even in the case of a node that does not have any outputs (and therefore does not define any assets). This is because ops are second-class citizens in dagster: they do not even appear in the DAG visualization (the global asset lineage) in the UI, but are presented in the form of a list lost somewhere in a menu. In the case of the spaceflights example, the last node, "evaluate_model_node", does not have any outputs. Defining it as a multi_asset with a corresponding intangible asset makes it part of the asset DAG:

image
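
The mapping rule described above can be sketched in plain Python, without touching the dagster API. Everything here is hypothetical and only for illustration: `NodeSpec`, `asset_names_for` and `asset_graph` are made-up names, and the input/output names given to the two nodes are illustrative rather than copied from the project. The only point is the rule itself: a node with outputs defines one asset per output, and a node without outputs gets a single placeholder asset named after the node.

```python
from dataclasses import dataclass
from typing import Dict, List, Tuple


@dataclass(frozen=True)
class NodeSpec:
    """Hypothetical, minimal description of a kedro-style node."""

    name: str
    inputs: Tuple[str, ...]
    outputs: Tuple[str, ...]


def asset_names_for(node: NodeSpec) -> List[str]:
    """Return the asset names a node would define as a multi asset.

    A node without outputs gets one placeholder ("intangible") asset named
    after the node, so it still has a place in the asset graph.
    """
    if node.outputs:
        return list(node.outputs)
    return [node.name]


def asset_graph(nodes: List[NodeSpec]) -> Dict[str, List[str]]:
    """Map each asset name to the sorted list of assets it depends on."""
    graph: Dict[str, List[str]] = {}
    for node in nodes:
        for asset in asset_names_for(node):
            graph[asset] = sorted(node.inputs)
    return graph


# Illustrative node definitions (names of inputs/outputs are assumptions).
nodes = [
    NodeSpec("train_model_node", ("X_train", "y_train"), ("regressor",)),
    NodeSpec("evaluate_model_node", ("regressor", "X_test", "y_test"), ()),
]

graph = asset_graph(nodes)
for asset, upstream in graph.items():
    print(asset, "<-", upstream)
```

With this rule, "evaluate_model_node" shows up as a key in the graph even though it produces no data, which is the behaviour the multi_asset approach is meant to achieve in the UI.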

This small project is a way for me to deepen my understanding of both kedro and dagster, and it is also something I am planning on using for work in the near future. So if you're interested or are also looking into it, don't hesitate to ping me on the kedro slack, I'd be happy to discuss it more.

@datajoely
Contributor

Super cool work!

5 participants