Implement download retry mechanism #600
Any updates on this plan? I'm trying to process a month of data using the earthaccess tool to grab one PACE file at a time, but Earthdata is giving timeout errors quite often, making it difficult to actually process multiple days of files. I'm not sure whether there is an issue on the Earthdata side, so I've reached out to them (awaiting a response), but I'm also wondering whether the retry option might be available soon. (Or is there at least a way to get an error returned from earthaccess so I can implement a manual retry after a 30- or 60-second wait?) Thanks!
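[A manual retry of the sort described here could look roughly like the following sketch. It assumes the timeout surfaces as an exception from earthaccess.download, which is part of what this issue asks for; the attempt count and wait time are arbitrary, and download_with_manual_retry is a hypothetical helper, not earthaccess API:]

import time
import earthaccess

def download_with_manual_retry(granules, local_path, attempts=5, wait_seconds=60):
    # Retry the whole download a fixed number of times, sleeping between tries.
    for attempt in range(1, attempts + 1):
        try:
            return earthaccess.download(granules, local_path)
        except Exception as err:  # assumes the failure surfaces as an exception
            if attempt == attempts:
                raise
            print(f"Attempt {attempt} failed ({err}); retrying in {wait_seconds} s")
            time.sleep(wait_seconds)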
Hi @zfasnacht1013, could you provide a code snippet that's giving you timeout errors? I would like to bring it to the attention of the OB.DAAC if it seems that's where the problem is. Thanks for reporting!
It's something as simple as:
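[Reconstruction: the snippet itself is missing here; based on the workaround below, which describes itself as a modification of the code given in this comment, it was presumably along these lines:]

import earthaccess

start_date = '2024-05-01 00:00:00'
end_date = '2024-05-01 23:59:00'
min_lon, max_lon, min_lat, max_lat = -120, -100, 20, 40

earthaccess.login(persist=True)

results = earthaccess.search_data(
    short_name='PACE_OCI_L1B_SCI',
    cloud_hosted=True,
    temporal=(start_date, end_date),
    count=400,
    bounding_box=(min_lon, min_lat, max_lon, max_lat),
    version='2',
)
earthaccess.download(results, '')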
I don't think it's an OB.DAAC issue, though; I've been having issues with TROPOMI data as well. The problem is that I'm trying to grab, say, 6 PACE granules at a time. Normally only one or two fail, but of course then I have gaps. Also, I'm trying to download the files temporarily and then delete them when I'm done with them, because I don't want to be storing TBs of PACE data locally.
Similar to @zfasnacht1013, I tried using …
Until we support a configurable retry mechanism for downloading, here is a workaround (a modification of the code given in a previous comment), which makes use of the tenacity library:

import earthaccess
import tenacity  # NEW IMPORT

start_date = '2024-05-01 00:00:00'
end_date = '2024-05-01 23:59:00'
min_lon, max_lon, min_lat, max_lat = -120, -100, 20, 40

earthaccess.login(persist=True)

# ----- BEGIN NEW CODE (must appear AFTER calling earthaccess.login)
# Create a retrier function, wrapping the earthaccess.Store._download_file function so
# that it will simply retry each failing download (using exponential backoff to help
# avoid resource contention). By replacing the existing function with the wrapper, when
# we call earthaccess.download, it will use our wrapper to download each file.
always_retry = tenacity.retry(wait=tenacity.wait_random_exponential(multiplier=1, max=60))
tenaciously_download_file = always_retry(earthaccess.__store__._download_file)
earthaccess.__store__._download_file = tenaciously_download_file
# ----- END NEW CODE

results = earthaccess.search_data(
    short_name='PACE_OCI_L1B_SCI',
    cloud_hosted=True,
    temporal=(start_date, end_date),
    count=400,
    bounding_box=(min_lon, min_lat, max_lon, max_lat),
    version='2',
)
earthaccess.download(results, '')
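[One caveat about the wrapper above: wait_random_exponential alone retries forever, so a permanent failure (bad credentials, a granule that no longer exists) would loop indefinitely. tenacity can bound the attempts and restrict which exceptions trigger a retry; a variation, under the same assumptions and additionally assuming the underlying download surfaces network errors as requests exceptions, might be:]

import requests
import tenacity

bounded_retry = tenacity.retry(
    wait=tenacity.wait_random_exponential(multiplier=1, max=60),
    stop=tenacity.stop_after_attempt(5),  # give up after 5 tries
    # Only retry network-level errors (assumes requests is used under the hood):
    retry=tenacity.retry_if_exception_type(requests.exceptions.RequestException),
    reraise=True,  # re-raise the original exception once attempts are exhausted
)
earthaccess.__store__._download_file = bounded_retry(earthaccess.__store__._download_file)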
@zfasnacht1013, although the workaround above should do the trick, and can also serve as a basis for directly adding such functionality to earthaccess, would you mind elaborating on your use case, if you can? In general, we want to discourage fully downloading files and instead provide advice on how to perform direct reads, grabbing only the parts of the files containing the data you need, assuming you don't actually need the files in their entirety.
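[For reference, the direct-read approach mentioned here might look roughly like the following sketch. It assumes xarray with an fsspec-compatible netCDF engine such as h5netcdf is installed, and the group name 'observation_data' is an assumption about the PACE L1B file layout:]

import earthaccess
import xarray as xr

earthaccess.login(persist=True)
results = earthaccess.search_data(
    short_name='PACE_OCI_L1B_SCI',
    temporal=('2024-05-01 00:00:00', '2024-05-01 23:59:00'),
    count=1,
)
files = earthaccess.open(results)  # file-like objects; no full download
# Only the chunks actually accessed are transferred over the network.
ds = xr.open_dataset(files[0], group='observation_data')  # group name is an assumption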
@chuckwondo I'm developing research trace-gas algorithms for PACE. I'm using the PACE L1B at 1 km and use the full spectra of reflectances, so there's really not much of a way to subset the files before I use them. I need to start upscaling and produce 1-2 months of data to use for validation. Since it's a research product, I'm developing it in NCCS and not in any PACE sandbox. This is going to be a general theme moving forward, not only with PACE but with other instruments. We are doing something similar with TEMPO and will also need to grab large chunks of data to process and develop our products for validation. I'm not sure this is something that has been considered much yet at Earthdata. I would assume the ideal scenario for Earthdata is that folks work in AWS for development to limit the network transfer, but since AWS is pay-to-play and NCCS is not, with budgets being tight, we are left to develop in NCCS. This might be something for further discussion, to work out ideas on how to move forward with this kind of use case. Feel free to reach out to me if we should have a meeting to discuss further.
We've discussed this in a few places now:
#481 (comment)
#594 (comment)
Figured it's time for a dedicated issue 😁