Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Implement _get_kwargs_from_urls #273

Closed
remram44 opened this issue Dec 12, 2019 · 15 comments
Closed

Implement _get_kwargs_from_urls #273

remram44 opened this issue Dec 12, 2019 · 15 comments

Comments

@remram44
Copy link

Using URLs is very convenient, it allows configuring the storage via a single environment variable.

However S3FileSystem does not implement _get_kwargs_from_urls so it is impossible to specify anything via the URL, only the bucket.

It would be good to have S3FileSystem parse query arguments, such as endpoint_url (useful for testing locally with Minio for instance), region_name, requester_pays, signature_version, and timeouts.

Right now I have to do my own parsing on the URL to pass it to S3FileSystem, which duplicates the effort in fsspec.

@martindurant
Copy link
Member

martindurant commented Dec 12, 2019

What kind of URL do you have in mind, and what information would you get from it?

@remram44
Copy link
Author

Using query arguments, something like s3://my-bucket/key?endpoint_url=http://localhost:9000&signature_version=s3v4, would be enough for my use.

@martindurant
Copy link
Member

Do such URL patterns exist elsewhere? It seems to me like a bit of an overuse of the one string, especially when many parameters are not themselves expressed as strings.

@remram44
Copy link
Author

Maybe I'm assuming too much for the intended use of _get_kwargs_from_urls()? But since fsspec has support for URLs, I find it a shame not to be able to use that whenever a parameter needs to be changed.

I can implement this on top of fsspec somehow if fsspec's URLs are not meant to be used this way. It's just a shame to have all those different filesystems supported by the same API unified, and still have to switch-case on the ones I want to support to pass it the right options.

@martindurant
Copy link
Member

Then kind of thing we were targeting comes from SSH, also used by hadoop world, like "hdfs://user:pw@host:port/path". I suppose it could be extended to query parameters where we know that the path itself doesn't contain a query (as opposed, for example, to real HTTP URLs).

Let's leave this open a little while and see if the functionality would be useful to anyone else.

@DavidMStraub
Copy link

I think this is a great idea.

I bumped into this thread while trying to find out if I can specify the endpoint_url when using the s3fs implementation in pandas, when doing things like pd.from_csv("s3://..."). Given the proliferation of S3-compatible object storage provides, I think this is an important use case. Unfortunately, the pandas implementation does not allow to pass any arguments to S3FileSystem.

Moreover, I found a recent PR in pandas implementing this via environment variables (pandas-dev/pandas#29050) and it was closed by the panda devs arguing that this should be implemented in s3fs instead.

I think making these options configurable via the URL string would be the perfect solution.

@martindurant
Copy link
Member

Note that following pandas-dev/pandas#34266 , it should become possible to pass parameters to the filesystem backend sometime in the future.

@martindurant
Copy link
Member

The above happened, so this is less of a priority. Please reopen if you think it important.

@joshmoore
Copy link

I'd vote for any of a with-context, environment variable or URL parsing to be able to pass s3 strings down to clients that I don't have control of. Call graph:

 My code
  --> aicsimageio
    --> zarr/tiff/hdf5
      --> s3fs

and the intermediate two layers would currently (and don't) adhere to the same "storage_options" behavior.

@martindurant
Copy link
Member

You can control the "default" s3fs instance with configuration, in files or environment variables https://filesystem-spec.readthedocs.io/en/latest/features.html#configuration

@joshmoore
Copy link

Thanks, @martindurant. For what I'm currently attempting (pytest-driven benchmarking) it'll be fairly simple to mock out FSSPEC_CONFIG_DIR. For a server process, e.g., that will be less ideal. I'm also looking into https://stackoverflow.com/questions/32618216/overwrite-s3-endpoint-using-boto3-configuration-file as a possible boto-level workaround, but I do see the value of having everything encoded in the URL for simplifying working with non-AWS S3 stores.

@martindurant
Copy link
Member

You can also edit the config dictionary directly, if you wish.

As with the conversation above, I have trouble imagining how to encode all the things you might want to into a single URL.

@joshmoore
Copy link

As with the conversation above, I have trouble imagining how to encode all the things you might want to into a single URL.

I imagine AWS did as well hence the situation we're in. (OR they actively wanted to make it difficult to use custom endpoints) Take home: it'd be valuable but is obviously a prickly issue. Of course, if fsspec did find a way, others might follow suit. ;)

@joshmoore
Copy link

For anyone who tries to go down the environment variable route, see also #432 and stick to the file method.

(z) /opt/ome-zarr-py $env FSSPEC_S3_ANON=true FSSPEC_S3_CLIENT_KWARGS='{"endpoint_url": "https://s3.embassy.ebi.ac.uk"}' ome_zarr -vvvv info s3://idr/zarr/v0.1/6001240.zarr
....
packages/s3fs/core.py", line 328, in _connect
    client_kwargs = self.client_kwargs.copy()
AttributeError: 'str' object has no attribute 'copy'

but

cat ~/.config/fsspec/conf.json
{
    "s3": {
        "anon": true,
        "client_kwargs": {
            "endpoint_url": "https://s3.embassy.ebi.ac.uk"
        }
    }
}

👍

@martindurant
Copy link
Member

Hm, right, https://filesystem-spec.readthedocs.io/en/latest/features.html#configuration only mentions that INI values are limited to strings, I suppose I thought that was obvious for env-vars. We could, of course, do a literal_eval, but then there would need to be some logic around merging.
In fact, you might have noticed, that if you were to provide client_kwargs to the constructor with your given config, then the new value would take precedence whether or not it contained "endpoint_url", so you could loose your config value. This could be improved.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

4 participants