This issue was moved to a discussion.

Guiding the use of HTTPS vs. S3 when working in-region #163

Closed
asteiker opened this issue Jan 26, 2023 · 3 comments
Labels
discussions Discussions and planning

Comments

@asteiker
Contributor

@ashiklom brought up a great question during the ESIP Earthdata Cloud session. I’m paraphrasing, but essentially: why not simply work with HTTPS links even if you’re in-region, instead of switching over to S3, which comes with so many additional challenges? Some feedback from @bilts:

I think it’s a both/and situation. HTTP could be improved. s3:// gives you some nice things (list operations, guaranteed partial file access, parallelism in access, some straightforward way to mount as a filesystem) and has a lot of traction with things like zarr/xarray

But I’m left wondering what our role is here as far as teaching or promoting this option. I’d love to know more about the pros/cons that Patrick listed (I’m sure there are many other considerations) and, as Alexey pointed out, whether we can identify where the limitations are (i.e. in the tooling or in the system itself).

Some ideas:

From @briannapagan:

This is where you would need some benchmarking to explain the use cases for choosing between the two. For cloud-optimized formats specifically, performance could vary greatly and should be better using S3 in-region, particularly for analysis-in-place (i.e. I am going to subset, average, or apply some other common operation to cloud files before downloading, if I download at all).

@ashiklom:

I think the most useful thing OpenScapes can do here in the short term is give a high-level survey of available options for s3 vs. http access as well as their pros/cons.
Also, check out this Twitter thread: https://twitter.com/charlesstern/status/1574497421245108224?s=20&t=rLvID-0c1j1NxHgy0JOCjQ
@charlesstern:
@ashiklom711 @StellarGeay @ProjectJupyter This is the public http endpoint Pangeo Forge's S3 bucket on @OpenStorageNet. Generic HTTP(S) servers won't be able to handle the scaled parallel reads a Zarr store is designed for. But yes, if the HTTP(S) link points to Zarr on cloud storage this should "just work".
Twitter | Sep 26th, 2022

And this closely-related thread:
https://twitter.com/charlesstern/status/1574499938465038336?s=20&t=rLvID-0c1j1NxHgy0JOCjQ
@charlesstern:
@_jhamman @ashiklom711 @StellarGeay @ProjectJupyter Good point!
Twitter | Sep 26th, 2022

See the comment in earthaccess issue nsidc/earthaccess#188, "Document why signed S3 URLs might be giving 400s when called from inside us-west-2".

@betolink :

Great points. HTTPS could also scale (with some latency, as Yuvi pointed out); in this case TEA (Thin Egress App) is only a proxy for S3. And I agree with Brianna Pagán: we need to do some benchmarking to verify how it impacts speed and parallelism.

This could be a topic for a future hackday. Further suggestions from @ScienceCat18 :

100% agree we should outline the different access options (HTTPS/S3) and their pros and cons (as has already been said in this thread), and I think the key is to do so from an end-user perspective (so high level and not too technical). Which use cases are best suited to HTTPS, and which to S3?
Agreed with updating the cheatsheets with this info as well!
+1 to having this as a topic for cookbook hack-day

@asteiker asteiker added the discussions Discussions and planning label Jan 26, 2023
@yuvipanda

When accessing NASA S3-hosted data in-region via HTTPS links, you're still actually just using S3! It is done automatically for you via presigned URLs (https://docs.aws.amazon.com/AmazonS3/latest/userguide/ShareObjectPreSignedURL.html). There are a couple of automatic steps in between that add latency, but it's not a generic HTTP server sending things to you; it's the exact same serving infrastructure that S3 uses, just with a different authentication mechanism.
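
To make the presigned-URL idea concrete, here is a minimal sketch (standard library only) that checks whether a URL carries AWS Signature Version 4 query parameters and, if so, when it expires. The helper name `presigned_expiry` and the example URL are hypothetical; the bucket, key, and signature are made up, and the URLs that Earthdata redirects actually produce may carry additional parameters.

```python
from datetime import datetime, timedelta, timezone
from urllib.parse import parse_qs, urlsplit


def presigned_expiry(url):
    """Return the expiry time of a SigV4 presigned URL, or None if unsigned."""
    query = parse_qs(urlsplit(url).query)
    if "X-Amz-Signature" not in query:
        return None
    # X-Amz-Date is the signing time in UTC; X-Amz-Expires is a lifetime in seconds.
    signed_at = datetime.strptime(query["X-Amz-Date"][0], "%Y%m%dT%H%M%SZ")
    signed_at = signed_at.replace(tzinfo=timezone.utc)
    return signed_at + timedelta(seconds=int(query["X-Amz-Expires"][0]))


# Hypothetical URL for illustration only (not a real bucket or signature).
example = (
    "https://example-bucket.s3.us-west-2.amazonaws.com/granule.nc"
    "?X-Amz-Algorithm=AWS4-HMAC-SHA256"
    "&X-Amz-Date=20230126T120000Z"
    "&X-Amz-Expires=3600"
    "&X-Amz-Signature=deadbeef"
)
print(presigned_expiry(example))
```

The point is only that the signature travels in the URL itself, so any plain HTTPS client can fetch the object until the URL expires.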

So the question really is 'what performance benefit do we get from using direct S3 authentication vs the presigned URLs given to us by earthdata redirects?'. That could be settled by some benchmarking.

My intuition is that we should tell users to use the HTTPS url by default, only switching to s3 in very specific (to be determined) cases. The advantages are:

  1. Users don't need to write different code based on where the code is executing!
  2. Users don't have to write fundamentally different code to access data based on where it lives (S3 vs on-prem)
  3. We stop giving AWS free publicity by spending our resources educating end users on AWS best practices they might not need :)
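
As an illustration of points 1 and 2, partial reads can be requested over plain HTTPS with a `Range` header, and the same code runs wherever it executes. This is a sketch using only the standard library; `build_range_request` and `read_range` are hypothetical helper names, and authentication (for example, Earthdata Login redirects) is deliberately left out.

```python
import urllib.request


def build_range_request(url, start, length):
    """Build a GET request for `length` bytes starting at byte offset `start`."""
    end = start + length - 1  # HTTP byte ranges are inclusive on both ends
    return urllib.request.Request(url, headers={"Range": f"bytes={start}-{end}"})


def read_range(url, start, length):
    """Fetch a byte range and return (HTTP status, body bytes)."""
    with urllib.request.urlopen(build_range_request(url, start, length)) as resp:
        # 206 means the server honored the range; 200 means it sent the whole file.
        return resp.status, resp.read()
```

A server is free to ignore `Range` and return the full object with status 200, which is one reason to check the status code rather than assume partial access.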

@yuvipanda

I do agree that hard data in terms of a quick benchmark would be necessary to move forward here.
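
A quick benchmark along these lines could start from a small timing harness like the sketch below. It assumes you supply one callable that reads a given object over HTTPS and another that reads the same object over s3://; the `benchmark` function and its result keys are hypothetical names, not part of any existing tool.

```python
import statistics
import time


def benchmark(read_fn, repeats=5):
    """Call `read_fn` (which must return bytes) several times; report MB/s."""
    rates = []
    for _ in range(repeats):
        start = time.perf_counter()
        n_bytes = len(read_fn())
        # Guard against a zero interval on very coarse clocks.
        elapsed = max(time.perf_counter() - start, 1e-9)
        rates.append(n_bytes / elapsed / 1e6)
    return {
        "median_mb_s": statistics.median(rates),
        "min_mb_s": min(rates),
        "max_mb_s": max(rates),
    }
```

Running it in-region with both readers against the same granule, and repeating the runs to smooth out caching and network noise, would give a first-order comparison. It does not measure parallel reads, which would need a separate test.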

@yuvipanda

At least in pangeo / xarray land, one big problem was that xarray / fsspec did not work with .netrc files for accessing data in the cloud. I've been working deep in the bowels of aiohttp (with this PR: aio-libs/aiohttp#7131) to fix that. Once that is done, the netrc-based solution will work for both on-prem and cloud access, regardless of where the calling code runs.
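
For readers unfamiliar with .netrc, the sketch below shows what a netrc lookup amounts to, using Python's standard `netrc` module. It is not how aiohttp or fsspec implement it. The helper name `earthdata_credentials` is hypothetical, and the default host `urs.earthdata.nasa.gov` (the Earthdata Login host) is an assumption about which entry the file contains.

```python
import netrc


def earthdata_credentials(netrc_path=None, host="urs.earthdata.nasa.gov"):
    """Return (login, password) for `host` from a netrc file, or None if absent.

    With netrc_path=None the default ~/.netrc is read.
    """
    auth = netrc.netrc(netrc_path).authenticators(host)
    if auth is None:
        return None
    login, _account, password = auth
    return login, password
```

A matching entry in the file would look like `machine urs.earthdata.nasa.gov login <user> password <pass>`.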

@NASA-Openscapes NASA-Openscapes locked and limited conversation to collaborators Oct 17, 2024
@asteiker asteiker converted this issue into discussion #360 Oct 17, 2024
