Return S3 data links by default when in region #318
Okay, this seems to be working as expected. For reference, here's the testing code I'm using:

import os
import tempfile

import coiled
import earthaccess
import xarray as xr

earthaccess.login()
granules = earthaccess.search_data(
    short_name="SEA_SURFACE_HEIGHT_ALT_GRIDS_L4_2SATS_5DAY_6THDEG_V_JPL2205",
    temporal=("2020", "2022"),
    count=2,
)

# Processing function for each data file
@coiled.function(
    region="us-west-2",
    environ={
        "EARTHDATA_USERNAME": os.environ["EARTHDATA_USERNAME"],
        "EARTHDATA_PASSWORD": os.environ["EARTHDATA_PASSWORD"],
    },
    keepalive="20 minutes",
)
def process(granule):
    results = []
    earthaccess.login()
    with tempfile.TemporaryDirectory() as tmpdir:
        files = earthaccess.download([granule], tmpdir)
        for file in files:
            ds = xr.open_dataset(file)
            ds = ds.sel(Latitude=slice(23, 50), Longitude=slice(270, 330))
            ds = ds.SLA.where((ds.SLA >= 0) & (ds.SLA < 10))
            results.append(ds)
    return xr.concat(results, dim="Time")

# Run processing on all the data granules
chunks = process.map(granules)

# Combine and plot results
ds = xr.concat(chunks, dim="Time")
ds.std("Time").plot(figsize=(14, 6), x="Longitude", y="Latitude").figure.savefig("foo.png")
assert g.data_links(access="direct")[0].startswith("s3://")
assert g.data_links(access="external")[0].startswith("https://")
# `in_region` specified
assert g.data_links(in_region=True)[0].startswith("s3://")
assert g.data_links(in_region=False)[0].startswith("https://")
# When `access` and `in_region` are both specified, `access` takes priority
assert g.data_links(access="direct", in_region=True)[0].startswith("s3://")
assert g.data_links(access="direct", in_region=False)[0].startswith("s3://")
assert g.data_links(access="external", in_region=True)[0].startswith("https://")
assert g.data_links(access="external", in_region=False)[0].startswith("https://")
I think this is the intended behavior we want, but let me know if I'm missing something.
As a side note, I'm not sure why we have separate `access` and `in_region` kwargs for determining if we want to use `s3` or `https` urls. Is one kwarg sufficient?
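For readers following along, the precedence that the assertions above check can be sketched as a small standalone function. This is only an illustration, not the actual earthaccess implementation: `select_links` and its arguments are hypothetical names, and it assumes the S3 and HTTPS link lists have already been collected.

```python
from typing import List, Optional


def select_links(
    s3_links: List[str],
    https_links: List[str],
    access: Optional[str] = None,
    in_region: bool = False,
) -> List[str]:
    """Hypothetical helper mirroring the precedence checked in the test."""
    # An explicit `access` always takes priority over `in_region`.
    if access == "direct":
        return s3_links
    if access == "external":
        return https_links
    # No explicit `access`: decide based on whether we are running in-region.
    return s3_links if in_region else https_links
```

With this shape, `in_region` only matters when `access` is left unset, which is the behavior the ten assertions describe.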
> As a side note, I'm not sure why we have separate access and in_region kwargs for determining if we want to use s3 or https urls. Is one kwarg sufficient?

I can't answer the question directly, but these keywords also feel unintuitive to me. What about `access="s3"`? To me, "direct" and "external" don't mean anything without more context, but "s3" and "https" do.
I agree that having two kwargs is clunky and direct/external aren't the most descriptive names, and we could likely handle it with a single kwarg.
That said, since they are the current interface, we should probably open an issue for possibly refactoring it and not block this PR.
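To make the single-kwarg idea a bit more concrete, here is a rough sketch of what such an interface could look like. Everything in it is hypothetical: the function name, the `"https"` default, and the accepted values are assumptions for discussion, and nothing like this exists in earthaccess as far as this thread shows.

```python
from typing import List


def data_links_single_kwarg(
    s3_links: List[str],
    https_links: List[str],
    access: str = "https",
) -> List[str]:
    """Hypothetical single-kwarg variant using protocol names directly."""
    if access == "s3":
        return s3_links
    if access == "https":
        return https_links
    # Reject anything else so typos fail loudly instead of silently
    # falling back to one of the two link types.
    raise ValueError(f"access must be 's3' or 'https', got {access!r}")
```

One design question this leaves open is how an in-region default would be detected, which is part of why a separate follow-up issue seems like the right place for it.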
@@ -325,7 +325,6 @@ def data_links(
     else:
         # we are not in us-west-2, even cloud collections have HTTPS links
         return https_links
-    return https_links
This is just cosmetic (this line could never be reached, so I decided to remove it).
👍 I tend to go the other way and drop the `else` statement, but six-of-one...
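The two styles being compared can be shown side by side. This is a simplified sketch with hypothetical function names, and the `if in_region` condition is an assumption about the surrounding code, which the hunk above does not show.

```python
def with_else(in_region, s3_links, https_links):
    # Style kept in this PR: explicit `else`, no trailing return.
    if in_region:
        return s3_links
    else:
        return https_links


def without_else(in_region, s3_links, https_links):
    # Alternative style: drop the `else` and let the final return
    # act as the fallback.
    if in_region:
        return s3_links
    return https_links
```

Both functions return the same result for every input; the difference is purely stylistic.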
@betolink @MattF-NSIDC do either of you have bandwidth to take a quick look at this PR? Totally fine if not -- I'm also okay just merging this (I think the actual changes here should be pretty uncontroversial) and owning any follow-up work if there is any.
Hey James, I've been refraining from speaking on your questions about intended behavior because I don't know :) The change itself looks great, I love the addition of a unit test for this aspect of the interface. I'm 100% on board with merging and dealing with any unexpected results as they come! 🚀
Sounds good, thanks @mfisher87. It turns out that while I can approve / merge PRs, I don't have sufficient permissions to override the "Review required" check on GitHub. Would you (or someone else) mind approving?
Thanks @mfisher87 @jhkennedy!
@jhkennedy see #327 for the follow up issue on
Shoot, let's fix that! My work laptop is put away, and I similarly don't have those permissions on this account, and also can't change branch protection rules (I'll do it on Wednesday if not resolved sooner). We could make James and other trusted maintainers admins of the repo (I think this is the best way forward so maintainers can also maintain repo settings), or we could remove the branch protection rule or add James to the list of people that can bypass it. @betolink @jrbourbeau what do you think?
@jrbourbeau you can bypass the protection rule now if you feel it's needed!
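I went to run the code referenced here #316 (comment) and got an error because it turns out `granule.data_links(access="direct", in_region=True)` returns an HTTPS link. This PR fixes that specific case and also makes it so `granule.data_links(in_region=True)` returns S3 links by default (which seems like it's the expected behavior, but @betolink @MattF-NSIDC let me know if you think otherwise). I'm testing this PR out now to make sure it in fact fixes things.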
returns S3 links by default (which seems like it's the expected behavior, but @betolink @MattF-NSIDC let me know if you think otherwise).I'm testing this PR out now to make sure if in fact fixes things