Implement _get_kwargs_from_urls #273

remram44 · 2019-12-12T19:46:27Z

Using URLs is very convenient, it allows configuring the storage via a single environment variable.

However S3FileSystem does not implement _get_kwargs_from_urls so it is impossible to specify anything via the URL, only the bucket.

It would be good to have S3FileSystem parse query arguments, such as endpoint_url (useful for testing locally with Minio for instance), region_name, requester_pays, signature_version, and timeouts.

Right now I have to do my own parsing on the URL to pass it to S3FileSystem, which duplicates the effort in fsspec.

The text was updated successfully, but these errors were encountered:

martindurant · 2019-12-12T19:49:22Z

What kind of URL do you have in mind, and what information would you get from it?

remram44 · 2019-12-12T19:51:00Z

Using query arguments, something like s3://my-bucket/key?endpoint_url=http://localhost:9000&signature_version=s3v4, would be enough for my use.

martindurant · 2019-12-12T19:54:48Z

Do such URL patterns exist elsewhere? It seems to me like a bit of an overuse of the one string, especially when many parameters are not themselves expressed as strings.

remram44 · 2019-12-12T20:02:02Z

Maybe I'm assuming too much for the intended use of _get_kwargs_from_urls()? But since fsspec has support for URLs, I find it a shame not to be able to use that whenever a parameter needs to be changed.

I can implement this on top of fsspec somehow if fsspec's URLs are not meant to be used this way. It's just a shame to have all those different filesystems supported by the same API unified, and still have to switch-case on the ones I want to support to pass it the right options.

martindurant · 2019-12-12T20:06:28Z

Then kind of thing we were targeting comes from SSH, also used by hadoop world, like "hdfs://user:pw@host:port/path". I suppose it could be extended to query parameters where we know that the path itself doesn't contain a query (as opposed, for example, to real HTTP URLs).

Let's leave this open a little while and see if the functionality would be useful to anyone else.

DavidMStraub · 2020-01-07T09:59:41Z

I think this is a great idea.

I bumped into this thread while trying to find out if I can specify the endpoint_url when using the s3fs implementation in pandas, when doing things like pd.from_csv("s3://..."). Given the proliferation of S3-compatible object storage provides, I think this is an important use case. Unfortunately, the pandas implementation does not allow to pass any arguments to S3FileSystem.

Moreover, I found a recent PR in pandas implementing this via environment variables (pandas-dev/pandas#29050) and it was closed by the panda devs arguing that this should be implemented in s3fs instead.

I think making these options configurable via the URL string would be the perfect solution.

martindurant · 2020-06-10T17:11:19Z

Note that following pandas-dev/pandas#34266 , it should become possible to pass parameters to the filesystem backend sometime in the future.

martindurant · 2020-07-31T20:18:44Z

The above happened, so this is less of a priority. Please reopen if you think it important.

joshmoore · 2021-02-09T14:56:03Z

I'd vote for any of a with-context, environment variable or URL parsing to be able to pass s3 strings down to clients that I don't have control of. Call graph:

 My code
  --> aicsimageio
    --> zarr/tiff/hdf5
      --> s3fs

and the intermediate two layers would currently (and don't) adhere to the same "storage_options" behavior.

martindurant · 2021-02-09T14:59:11Z

You can control the "default" s3fs instance with configuration, in files or environment variables https://filesystem-spec.readthedocs.io/en/latest/features.html#configuration

joshmoore · 2021-02-09T15:19:31Z

Thanks, @martindurant. For what I'm currently attempting (pytest-driven benchmarking) it'll be fairly simple to mock out FSSPEC_CONFIG_DIR. For a server process, e.g., that will be less ideal. I'm also looking into https://stackoverflow.com/questions/32618216/overwrite-s3-endpoint-using-boto3-configuration-file as a possible boto-level workaround, but I do see the value of having everything encoded in the URL for simplifying working with non-AWS S3 stores.

martindurant · 2021-02-09T15:25:47Z

You can also edit the config dictionary directly, if you wish.

As with the conversation above, I have trouble imagining how to encode all the things you might want to into a single URL.

joshmoore · 2021-02-09T20:01:48Z

As with the conversation above, I have trouble imagining how to encode all the things you might want to into a single URL.

I imagine AWS did as well hence the situation we're in. (OR they actively wanted to make it difficult to use custom endpoints) Take home: it'd be valuable but is obviously a prickly issue. Of course, if fsspec did find a way, others might follow suit. ;)

joshmoore · 2021-04-15T09:24:15Z

For anyone who tries to go down the environment variable route, see also #432 and stick to the file method.

(z) /opt/ome-zarr-py $env FSSPEC_S3_ANON=true FSSPEC_S3_CLIENT_KWARGS='{"endpoint_url": "https://s3.embassy.ebi.ac.uk"}' ome_zarr -vvvv info s3://idr/zarr/v0.1/6001240.zarr
....
packages/s3fs/core.py", line 328, in _connect
    client_kwargs = self.client_kwargs.copy()
AttributeError: 'str' object has no attribute 'copy'

but

cat ~/.config/fsspec/conf.json
{
    "s3": {
        "anon": true,
        "client_kwargs": {
            "endpoint_url": "https://s3.embassy.ebi.ac.uk"
        }
    }
}

👍

martindurant · 2021-04-15T12:50:54Z

Hm, right, https://filesystem-spec.readthedocs.io/en/latest/features.html#configuration only mentions that INI values are limited to strings, I suppose I thought that was obvious for env-vars. We could, of course, do a literal_eval, but then there would need to be some logic around merging.
In fact, you might have noticed, that if you were to provide client_kwargs to the constructor with your given config, then the new value would take precedence whether or not it contained "endpoint_url", so you could loose your config value. This could be improved.

martindurant mentioned this issue Jan 7, 2020

Support customised S3 servers endpoint URL pandas-dev/pandas#29050

Closed

5 tasks

martindurant closed this as completed Jul 31, 2020

joshmoore mentioned this issue May 27, 2022

Read data from protected s3? ome/ome-zarr-py#203

Closed

joshmoore mentioned this issue Oct 8, 2022

Referencing zarr stores, groups, arrays, and attributes ome/ngff#144

Open

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Implement _get_kwargs_from_urls #273

Implement _get_kwargs_from_urls #273

remram44 commented Dec 12, 2019

martindurant commented Dec 12, 2019 •

edited

Loading

remram44 commented Dec 12, 2019

martindurant commented Dec 12, 2019

remram44 commented Dec 12, 2019

martindurant commented Dec 12, 2019

DavidMStraub commented Jan 7, 2020

martindurant commented Jun 10, 2020

martindurant commented Jul 31, 2020

joshmoore commented Feb 9, 2021

martindurant commented Feb 9, 2021

joshmoore commented Feb 9, 2021

martindurant commented Feb 9, 2021

joshmoore commented Feb 9, 2021

joshmoore commented Apr 15, 2021

martindurant commented Apr 15, 2021

Implement _get_kwargs_from_urls #273

Implement _get_kwargs_from_urls #273

Comments

remram44 commented Dec 12, 2019

martindurant commented Dec 12, 2019 • edited Loading

remram44 commented Dec 12, 2019

martindurant commented Dec 12, 2019

remram44 commented Dec 12, 2019

martindurant commented Dec 12, 2019

DavidMStraub commented Jan 7, 2020

martindurant commented Jun 10, 2020

martindurant commented Jul 31, 2020

joshmoore commented Feb 9, 2021

martindurant commented Feb 9, 2021

joshmoore commented Feb 9, 2021

martindurant commented Feb 9, 2021

joshmoore commented Feb 9, 2021

joshmoore commented Apr 15, 2021

martindurant commented Apr 15, 2021

martindurant commented Dec 12, 2019 •

edited

Loading