Use of dagster-dask on premises #1914

Closed
sephib opened this issue Nov 18, 2019 · 5 comments
Labels
area: integrations Related to general integrations, including requests for a new integration integration: dask Related to dagster-dask

Comments

@sephib

sephib commented Nov 18, 2019

Hi,
We are thinking of using dagster with a local (on-premises) YARN cluster with dask-yarn.
From the documentation we can see that one limitation is that we:

must use S3 for intermediates and run storage.

How can we use dagster in our environment?

@schrockn
Member

schrockn commented Nov 18, 2019

Hey @sephib that sounds really exciting.

To use dagster in a new or custom environment like this, you must implement a SystemStorageDefinition (usually via the @system_storage decorator).

See python_modules/libraries/dagster-aws/dagster_aws/s3/system_storage.py for an example.

The system can operate on an arbitrary instance of SystemStorageDefinition as long as it is faithful to the APIs that it needs. The interesting question is what storage you want to use in your yarn cluster.
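
To illustrate the idea of a pluggable storage backend, here is a minimal, library-independent sketch. It does not use Dagster's actual SystemStorageDefinition or @system_storage API; `IntermediateStore`, `LocalDirStore`, and `run_step` are hypothetical names invented for this example. The point is only that the calling code depends on a small interface, so a backend for S3, HDFS, or a local directory could be swapped in without changing the caller.

```python
import pickle
import tempfile
from abc import ABC, abstractmethod
from pathlib import Path
from typing import Callable


class IntermediateStore(ABC):
    """Hypothetical interface: the only operations the caller relies on."""

    @abstractmethod
    def put(self, run_id: str, key: str, value: object) -> None: ...

    @abstractmethod
    def get(self, run_id: str, key: str) -> object: ...

    @abstractmethod
    def has(self, run_id: str, key: str) -> bool: ...


class LocalDirStore(IntermediateStore):
    """Stand-in backend that pickles values under a root directory."""

    def __init__(self, root: str) -> None:
        self._root = Path(root)

    def _path(self, run_id: str, key: str) -> Path:
        return self._root / run_id / f"{key}.pkl"

    def put(self, run_id: str, key: str, value: object) -> None:
        path = self._path(run_id, key)
        path.parent.mkdir(parents=True, exist_ok=True)
        path.write_bytes(pickle.dumps(value))

    def get(self, run_id: str, key: str) -> object:
        return pickle.loads(self._path(run_id, key).read_bytes())

    def has(self, run_id: str, key: str) -> bool:
        return self._path(run_id, key).exists()


def run_step(store: IntermediateStore, run_id: str, key: str,
             compute: Callable[[], object]) -> object:
    """Compute a value once per run and reuse the stored copy afterwards."""
    if not store.has(run_id, key):
        store.put(run_id, key, compute())
    return store.get(run_id, key)


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        store = LocalDirStore(tmp)
        print(run_step(store, "run-1", "total", lambda: sum(range(10))))
```

An HDFS-backed class would implement the same three methods against the cluster filesystem; which client library it uses is a separate choice.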

Please feel free to hop in our slack! We'd love to hear more about your use case.

@sephib
Author

sephib commented Nov 18, 2019

The storage would be on HDFS, with the output format being Parquet or CSV.
To access HDFS, I think we would try to work with pyarrow.
Do you have any additional thoughts?
Do you know of a running dagster instance on YARN?

@natekupp
Contributor

@sephib - makes sense! I do think it would be fairly straightforward to implement storage on HDFS instead of S3 using the system storage abstraction that Nick mentioned above. I'm happy to work with you on this.

Re: YARN integration, we haven't deployed in that context yet, but it's something we'd like to support. If you are able to join our Slack (linked on the GitHub page here: https://github.com/dagster-io/dagster), we would love to hear more about your use case and what your needs are!

@mgasner added the labels "area: integrations" and "integration: dask" on Nov 21, 2019
@natekupp
Contributor

We have a prototype of HDFS system storage working now: https://dagster.phacility.com/D2259. We still have not explored dask-yarn. See also #2273.

@sephib
Author

sephib commented Mar 30, 2020

Thanks for the update. However, we are not yet sure when we will be able to check it out.

@natekupp added this to the Future Release milestone on Jun 1, 2020
@mgasner removed this from the Future Release milestone on Oct 7, 2020
5 participants