
Implement download retry mechanism #600

Open
mfisher87 opened this issue Jun 12, 2024 · 7 comments
Labels
type: enhancement New feature or request

Comments

@mfisher87
Collaborator

We've discussed this in a few places now:

#481 (comment)

#594 (comment)

Figured it's time for a dedicated issue 😁

@zfasnacht1013

Any updates on this plan?

I'm trying to process a month of data, using the earthaccess tool to grab 1 PACE file at a time, but for some reason Earthdata is giving timeout errors quite often, making it difficult to actually process multiple days of files. I'm not sure if there is an issue with Earthdata, so I've reached out to them (awaiting a response), but I'm also wondering if the retry option might be available soon. (Or is there at least a way to get an error message returned from earthaccess so I can implement a manual retry after a 30- or 60-second wait?)

Thanks!
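
(For reference, a minimal manual-retry sketch along the lines asked about here. It assumes earthaccess.download raises an exception on a failed transfer, which may not hold in every version; download_with_retries is a hypothetical helper, not part of earthaccess.)

import time

import earthaccess

def download_with_retries(results, path, attempts=3, wait_seconds=60):
    # Retry the whole batch, sleeping between attempts.
    for attempt in range(1, attempts + 1):
        try:
            return earthaccess.download(results, path)
        except Exception as err:
            if attempt == attempts:
                raise
            print(f'Download failed ({err}); retrying in {wait_seconds}s')
            time.sleep(wait_seconds)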

@itcarroll
Collaborator

Hi @zfasnacht1013, could you provide a code snippet that's giving you timeout errors? I would like to bring it to the attention of the OB.DAAC if it seems that's where the problem is. Thanks for reporting!

@zfasnacht1013

zfasnacht1013 commented Oct 28, 2024

@itcarroll

It's something as simple as

import earthaccess 

start_date = '2024-05-01 00:00:00'
end_date = '2024-05-01 23:59:00'

min_lon = -120; max_lon = -100; min_lat = 20; max_lat = 40
earthaccess.login(persist=True)
results = earthaccess.search_data(
    short_name='PACE_OCI_L1B_SCI',
    cloud_hosted=True,
    temporal=(start_date, end_date),
    count=400,
    bounding_box=(min_lon, min_lat, max_lon, max_lat),
    version='2',
)

earthaccess.download(results,'')

I don't think it's an OB.DAAC issue, though; I've been having issues with TROPOMI data as well.

The problem is that I'm trying to grab, say, 6 PACE granules at a time. Normally only 1 or 2 fail, but of course then I have gaps. Also, I'm trying to download the files temporarily, then delete them when I'm done with them, because I don't want to be storing TBs of PACE data locally.
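
(A sketch of that download-then-delete pattern, using Python's tempfile module; process_granule is a hypothetical stand-in for whatever per-file processing is being done:)

import tempfile
from pathlib import Path

import earthaccess

def process_batch(results):
    # Download into a throwaway directory so granules never accumulate locally.
    with tempfile.TemporaryDirectory() as tmpdir:
        for path in earthaccess.download(results, tmpdir):
            process_granule(Path(path))  # hypothetical per-file processing
    # tmpdir and its contents are removed when the block exits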

@zmoon

zmoon commented Oct 28, 2024

Similar to @zfasnacht1013, I tried using earthaccess.search_data/earthaccess.download to download multiple files (~a month of GPM_MERGIR files in my case). The first time, 3 failed with HTTP error 500. The second time, 1 (of those 3) failed. The third time, all succeeded.

@chuckwondo
Collaborator

Until we support a configurable retry mechanism for downloading, here is a workaround (a modification of the code given in a previous comment), which makes use of the tenacity library:

import earthaccess 
import tenacity  # NEW IMPORT

start_date = '2024-05-01 00:00:00'
end_date = '2024-05-01 23:59:00'

min_lon, max_lon, min_lat, max_lat = -120, -100, 20, 40
earthaccess.login(persist=True)

# ----- BEGIN NEW CODE (must appear AFTER calling earthaccess.login)

# Create a retrier function, wrapping the earthaccess.Store._download_file function so
# that it will simply retry each failing download (using exponential backoff to help
# avoid resource contention). By replacing the existing function with the wrapper, when
# we call earthaccess.download, it will use our wrapper to download each file.
always_retry = tenacity.retry(wait=tenacity.wait_random_exponential(multiplier=1, max=60))
tenaciously_download_file = always_retry(earthaccess.__store__._download_file)
earthaccess.__store__._download_file = tenaciously_download_file

# ----- END NEW CODE

results = earthaccess.search_data(
    short_name='PACE_OCI_L1B_SCI',
    cloud_hosted=True,
    temporal=(start_date, end_date),
    count=400,
    bounding_box=(min_lon, min_lat, max_lon, max_lat),
    version='2',
)

earthaccess.download(results, '')
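
One note on the retrier above: as written, it retries forever. tenacity can also cap the number of attempts and retry only on particular exception types; a variant (an untested sketch, assuming failures surface as requests-level exceptions, and again placed after earthaccess.login) might look like:

import earthaccess
import requests
import tenacity

bounded_retry = tenacity.retry(
    wait=tenacity.wait_random_exponential(multiplier=1, max=60),
    stop=tenacity.stop_after_attempt(5),  # give up after 5 attempts
    retry=tenacity.retry_if_exception_type(requests.exceptions.RequestException),
    reraise=True,  # surface the last error instead of a RetryError
)
earthaccess.__store__._download_file = bounded_retry(earthaccess.__store__._download_file)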

@chuckwondo
Collaborator

@zfasnacht1013, although the workaround above should do the trick, and can also serve as a basis for directly adding such functionality to earthaccess, would you mind elaborating on your use case, if you can?

In general, we want to discourage fully downloading files and instead provide advice on how to perform direct reads, grabbing only the parts of the files containing the data you need, assuming you don't actually need the files in their entirety.
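
For context, the direct-read pattern looks roughly like the following: earthaccess.open returns file-like objects that xarray can read lazily, so only the slices you access are transferred. The group name below is an assumption about the PACE L1B file layout, and reading netCDF this way typically requires the h5netcdf engine to be installed:

import earthaccess
import xarray as xr

earthaccess.login(persist=True)
results = earthaccess.search_data(
    short_name='PACE_OCI_L1B_SCI',
    temporal=('2024-05-01', '2024-05-02'),
    count=1,
)

# Open the remote file without downloading it in full; xarray reads lazily.
files = earthaccess.open(results)
ds = xr.open_dataset(files[0], group='observation_data')  # group name is an assumption
print(ds)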

@zfasnacht1013

@chuckwondo I'm developing research trace-gas algorithms for PACE. I'm using the PACE L1B at 1 km and use the full spectra of reflectances, so there's really not much of a way to subset the files before I use them. I need to start scaling up to produce 1-2 months of data to use for validation. Since it's a research product, I'm developing it on NCCS and not in any PACE sandbox.

This is going to be a general theme moving forward, not only with PACE, but other instruments. We are doing something similar with TEMPO and will also need to grab large chunks of data to process and develop our products for validation.

I'm not sure this is something that has been considered much yet at Earthdata. I would assume the ideal scenario for Earthdata is that folks work in AWS for development to limit network transfer, but since AWS is pay-to-play and NCCS is not, with budgets being tight, we are left to develop on NCCS.

This might be something for further discussion, to work out ideas on how to move forward with this kind of use case. Feel free to reach out to me if we should have a meeting to discuss further.
