
Implement download retry mechanism #600

Open
mfisher87 opened this issue Jun 12, 2024 · 7 comments
Labels
type: enhancement New feature or request

Comments

@mfisher87
Collaborator

We've discussed this in a few places now:

#481 (comment)

#594 (comment)

Figured it's time for a dedicated issue 😁

@zfasnacht1013

Any updates on this plan?

I'm trying to process a month of data, using the earthaccess tool to grab 1 PACE file at a time, but for some reason Earthdata is giving timeout errors quite often, making it difficult to actually process multiple days of files. I'm not sure if there is an issue with Earthdata, so I've reached out to them (awaiting a response), but I'm also wondering if the retry option might be available soon. (Or is there at least a way to get an error message returned from earthaccess so I can implement a manual retry after a 30- or 60-second wait?)

Thanks!
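
(For reference, a minimal manual-retry sketch along the lines asked about here. It assumes earthaccess.download raises an exception on a failed transfer, which may not hold in every version; download_with_retries is a hypothetical helper, not part of earthaccess.)

import time

import earthaccess

def download_with_retries(results, path, attempts=3, wait_seconds=60):
    # Retry the whole batch, sleeping between attempts.
    for attempt in range(1, attempts + 1):
        try:
            return earthaccess.download(results, path)
        except Exception as err:
            if attempt == attempts:
                raise
            print(f'Download failed ({err}); retrying in {wait_seconds}s')
            time.sleep(wait_seconds)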

@itcarroll
Collaborator

Hi @zfasnacht1013, could you provide a code snippet that's giving you timeout errors? I would like to bring it to the attention of the OB.DAAC if it seems that's where the problem is. Thanks for reporting!

@zfasnacht1013

zfasnacht1013 commented Oct 28, 2024

@itcarroll

It's something as simple as

import earthaccess 

start_date = '2024-05-01 00:00:00'
end_date = '2024-05-01 23:59:00'

min_lon = -120; max_lon = -100; min_lat = 20; max_lat = 40
earthaccess.login(persist=True)
results = earthaccess.search_data(
    short_name='PACE_OCI_L1B_SCI',
    cloud_hosted=True,
    temporal=(start_date, end_date),
    count=400,
    bounding_box=(min_lon, min_lat, max_lon, max_lat),
    version='2',
)

earthaccess.download(results,'')

I don't think it's an OB.DAAC issue, though; I've been having issues with TROPOMI data as well.

The problem is that I'm trying to grab, say, 6 PACE granules at a time. Normally only 1 or 2 fail, but of course then I have gaps. Also, I'm trying to download the files temporarily, then delete them when I'm done with them, because I don't want to be storing TBs of PACE data locally.
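
(A sketch of that download-then-delete pattern, using Python's tempfile module; process_granule is a hypothetical stand-in for whatever per-file processing is being done:)

import tempfile
from pathlib import Path

import earthaccess

def process_batch(results):
    # Download into a throwaway directory so granules never accumulate locally.
    with tempfile.TemporaryDirectory() as tmpdir:
        for path in earthaccess.download(results, tmpdir):
            process_granule(Path(path))  # hypothetical per-file processing
    # tmpdir and its contents are removed when the block exits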

@zmoon

zmoon commented Oct 28, 2024

Similar to @zfasnacht1013, I tried using earthaccess.search_data/earthaccess.download to download multiple files (~a month of GPM_MERGIR files in my case). The first time, 3 failed with HTTP error 500. The second time, 1 (of those 3) failed. The third time, all succeeded.

@chuckwondo
Collaborator

Until we support a configurable retry mechanism for downloading, here is a workaround (a modification of the code given in a previous comment), which makes use of the tenacity library:

import earthaccess 
import tenacity  # NEW IMPORT

start_date = '2024-05-01 00:00:00'
end_date = '2024-05-01 23:59:00'

min_lon, max_lon, min_lat, max_lat = -120, -100, 20, 40
earthaccess.login(persist=True)

# ----- BEGIN NEW CODE (must appear AFTER calling earthaccess.login)

# Create a retrier function, wrapping the earthaccess.Store._download_file function so
# that it will simply retry each failing download (using exponential backoff to help
# avoid resource contention). By replacing the existing function with the wrapper, when
# we call earthaccess.download, it will use our wrapper to download each file.
always_retry = tenacity.retry(wait=tenacity.wait_random_exponential(multiplier=1, max=60))
tenaciously_download_file = always_retry(earthaccess.__store__._download_file)
earthaccess.__store__._download_file = tenaciously_download_file

# ----- END NEW CODE

results = earthaccess.search_data(
    short_name='PACE_OCI_L1B_SCI',
    cloud_hosted=True,
    temporal=(start_date, end_date),
    count=400,
    bounding_box=(min_lon, min_lat, max_lon, max_lat),
    version='2',
)

earthaccess.download(results, '')
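
One note on the retrier above: as written, it retries forever. tenacity can also cap the number of attempts and retry only on particular exception types; a variant (an untested sketch, assuming failures surface as requests-level exceptions, and again placed after earthaccess.login) might look like:

import earthaccess
import requests
import tenacity

bounded_retry = tenacity.retry(
    wait=tenacity.wait_random_exponential(multiplier=1, max=60),
    stop=tenacity.stop_after_attempt(5),  # give up after 5 attempts
    retry=tenacity.retry_if_exception_type(requests.exceptions.RequestException),
    reraise=True,  # surface the last error instead of a RetryError
)
earthaccess.__store__._download_file = bounded_retry(earthaccess.__store__._download_file)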

@chuckwondo
Collaborator

@zfasnacht1013, although the workaround above should do the trick, and can also serve as a basis for directly adding such functionality to earthaccess, would you mind elaborating on your use case, if you can?

In general, we want to discourage fully downloading files and instead provide advice on how to perform direct reads, grabbing only the parts of the files containing the data you need, assuming you don't actually need the files in their entirety.
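
For context, the direct-read pattern looks roughly like the following: earthaccess.open returns file-like objects that xarray can read lazily, so only the slices you access are transferred. The group name below is an assumption about the PACE L1B file layout, and reading netCDF this way typically requires the h5netcdf engine to be installed:

import earthaccess
import xarray as xr

earthaccess.login(persist=True)
results = earthaccess.search_data(
    short_name='PACE_OCI_L1B_SCI',
    temporal=('2024-05-01', '2024-05-02'),
    count=1,
)

# Open the remote file without downloading it in full; xarray reads lazily.
files = earthaccess.open(results)
ds = xr.open_dataset(files[0], group='observation_data')  # group name is an assumption
print(ds)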

@zfasnacht1013

@chuckwondo I'm developing research trace-gas algorithms for PACE. I'm using the PACE L1B at 1 km and use the full spectra of reflectances, so there's really not much of a way to subset the files before I use them. I need to start scaling up to produce 1-2 months of data to use for validation. Since it's a research product, I'm developing it on NCCS and not in any PACE sandbox.

This is going to be a general theme moving forward, not only with PACE, but other instruments. We are doing something similar with TEMPO and will also need to grab large chunks of data to process and develop our products for validation.

I'm not sure this is something that has been considered much yet at Earthdata. I would assume the ideal scenario for Earthdata is that folks work in AWS for development to limit network transfer, but since AWS is pay-to-play and NCCS is not, with budgets being tight, we are left to develop on NCCS.

This might be something for further discussion, to work out ideas on how to move forward with this kind of use case. Feel free to reach out to me if we should have a meeting to discuss further.
