Examples to use large remote dataset (s3 or minio) #2497
Hey @Jeffwan, yes, we support s3 / minio and any remote object storage supported by fsspec. Reading the data from minio with Dask is one way to do it. This is the easiest way to go if your environment is not configured to automatically connect to the remote storage backend. We provide a wrapper for this as well (the `use_credentials` context manager in `ludwig.utils.data_utils`).
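For example, a minimal sketch of the Dask approach (the bucket, endpoint, credentials, config file, and the `ray` backend below are placeholder assumptions, not values from this thread):

```python
# Read directly from minio / s3 via s3fs; credentials and endpoint are passed
# explicitly through storage_options, so no ambient AWS configuration is needed.
import dask.dataframe as dd
from ludwig.api import LudwigModel

dataset_df = dd.read_csv(
    "s3://my-bucket/train-*.csv",
    storage_options={
        "key": "minio-access-key",        # placeholder
        "secret": "minio-secret-key",     # placeholder
        "client_kwargs": {"endpoint_url": "http://minio.example.com:9000"},
    },
)

# Hand the Dask dataframe to Ludwig; the ray backend is assumed here since
# Ludwig processes Dask dataframes on Ray.
model = LudwigModel(config="config.yaml", backend="ray")
model.train(dataset=dataset_df)
```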
The other option is to pass a string path. This also works with minio, but it assumes that your environment is already set up to connect to s3 / minio without specifying any additional credentials.

One thing we could do, if it would make things easier, is allow you to provide credentials (either a path to a credentials file, or directly) within the Ludwig config, similar to how we let the user specify the cache credentials: https://ludwig-ai.github.io/ludwig-docs/0.5/configuration/backend/ Let me know if that would help simplify things.

One last thing to note: it is true that s3fs needs to be installed to connect to s3 / minio. We decided against including this and other such libraries in the requirements to save space, but let me know if it would be preferred to bake them into the Docker image.
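A sketch of the string-path option, assuming s3fs can already authenticate from the environment (the path and config file below are placeholders):

```python
# Pass the remote path directly; s3fs must already be able to authenticate,
# e.g. via AWS_ACCESS_KEY_ID / AWS_SECRET_ACCESS_KEY and the minio endpoint.
from ludwig.api import LudwigModel

model = LudwigModel(config="config.yaml")
model.train(dataset="s3://my-bucket/train.csv")  # placeholder path
```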
Let me give it a try. The reason I asked this question is that I was not sure whether using a Dask dataframe is the recommended pattern, since the image doesn't include it. Now it makes more sense.
I have the following envs defined:
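(Illustrative guess at a typical s3fs / minio setup; the variable names and values here are assumptions:)

```python
# Illustrative only -- set these before importing dask / fsspec so they are
# picked up when the filesystem is created.
import os

os.environ["AWS_ACCESS_KEY_ID"] = "minio-access-key"       # placeholder
os.environ["AWS_SECRET_ACCESS_KEY"] = "minio-secret-key"   # placeholder
# fsspec can read per-protocol kwargs from FSSPEC_* variables, e.g. the
# minio endpoint for the "s3" protocol:
os.environ["FSSPEC_S3_ENDPOINT_URL"] = "http://minio.example.com:9000"
```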
Option 1: `ludwig.utils.data_utils.use_credentials` -> failed when I tried it.
Option 2: storage_options -> success
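For context, a sketch of the two options above (the `use_credentials` usage and the shape of the credentials dict are assumptions based on the fsspec-style format used for Ludwig's cache credentials; all values are placeholders):

```python
# Option 1 (attempted): Ludwig's use_credentials context manager applies an
# fsspec-style credentials dict while data is being read.
from ludwig.api import LudwigModel
from ludwig.utils.data_utils import use_credentials

creds = {
    "s3": {
        "client_kwargs": {
            "aws_access_key_id": "minio-access-key",       # placeholder
            "aws_secret_access_key": "minio-secret-key",   # placeholder
            "endpoint_url": "http://minio.example.com:9000",
        }
    }
}

model = LudwigModel(config="config.yaml")
with use_credentials(creds):
    model.train(dataset="s3://my-bucket/train.csv")

# Option 2 (worked): pass storage_options directly to dask.dataframe.read_csv,
# as in the earlier sketch, and hand the resulting dataframe to Ludwig.
```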
Anyway, option 2 works for me now.
There's a follow-up question. It seems there's some issue being reported:
Regarding the issue you ran into: if Option 2 works well for your use case, then that works too.

For reading from files given as string paths (so, without needing to manually load with Dask), what would be your preferred way to provide the credentials? I was thinking about adding something to the Ludwig config to specify credentials, like:
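A purely hypothetical illustration of that idea (a `credentials` section like this does not exist in Ludwig today; the nested keys mirror the fsspec-style cache credentials format and all values are placeholders):

```python
# Hypothetical config sketch -- "credentials" is a proposed option, not an
# existing Ludwig config field.
config = {
    "input_features": [{"name": "text", "type": "text"}],
    "output_features": [{"name": "label", "type": "category"}],
    "backend": {
        "credentials": {
            "s3": {
                "client_kwargs": {
                    "aws_access_key_id": "XXXX",
                    "aws_secret_access_key": "XXXX",
                    "endpoint_url": "http://minio.example.com:9000",
                }
            }
        }
    },
}
```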
For environment variables, we could provide a syntax similar to Skaffold:
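Another hypothetical sketch; the `{{.VAR}}` templating below is only a guess at a Skaffold-like environment-variable reference and is not an existing Ludwig feature:

```python
# Hypothetical sketch -- {{.VAR}} placeholders would be resolved from the
# environment at config-load time.
backend = {
    "credentials": {
        "s3": {
            "client_kwargs": {
                "aws_access_key_id": "{{.AWS_ACCESS_KEY_ID}}",
                "aws_secret_access_key": "{{.AWS_SECRET_ACCESS_KEY}}",
                "endpoint_url": "{{.S3_ENDPOINT_URL}}",
            }
        }
    }
}
```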
Finally, we could also let the user provide a path:
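And a hypothetical file-path variant, mirroring how `cache_credentials` can already point at a credentials file (the path is a placeholder):

```python
# Hypothetical sketch -- point the proposed credentials option at a JSON file
# instead of embedding secrets in the config itself.
backend = {
    "credentials": {
        "s3": "/home/user/.s3_credentials.json"
    }
}
```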
Let me know if any of these would be useful or preferred over reading from Dask directly.
We programmatically generate the config file, so I feel either way works for us. My program will receive custom credentials from users.
I can confirm the following way works fine for my case. The only tricky thing is that I need to use the credential env variables instead of `client_kwargs` to overcome the following issue:
If you add the following config support in the future, it would save us the additional effort of configuring `s3_creds` ourselves. This is not a blocking issue, and I will close this issue now.
Is your feature request related to a problem? Please describe.
I want to use a remote dataset hosted in S3 or minio. Do you have any examples? It seems most examples on the Ludwig website use built-in datasets or local files. Do you have any guidance on using S3 or minio?
I can read it with Dask:

```python
import dask.dataframe as dd

dataset_df = dd.read_csv('s3://bucket/myfiles.*.csv')
```

but I notice I have to handle `s3fs` (required by Dask), as well as the minio `endpoint` and `signature` settings, myself. Is this the right way, or is there an easier way?
Describe the use case
Use a remote dataset.
Describe the solution you'd like
Provide an easy-to-use wrapper.