Modify data-scraping methods with crul #94
`crul` is a relatively new HTTP client that can make extracting data significantly faster. Currently, `rrricanes` accesses each page one at a time, regardless of whether it is getting a list of storms by year or products for a storm. Building an entire dataset of all storm/product combinations takes several hours. Worse, because timeouts become an issue with the NHC archives, consecutive attempts are inevitably required to build a full dataset.
The basic process of getting all storms for all years, all basins is:
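The original snippet isn't preserved here. A minimal sketch of the sequential approach (the real code uses `build_archive_df`, `extract_storms`, and other helpers mentioned below; the `"td a"` selector and `parse_archive` helper are my assumptions, not the package's code):

```r
library(rvest)
library(purrr)
library(dplyr)

# Parse one annual archive page into a data frame of storms.
# The "td a" selector is an assumption about the page structure.
parse_archive <- function(html, year) {
  links <- html_elements(html, "td a")
  tibble(
    Year = year,
    Name = html_text2(links),
    Link = html_attr(links, "href")
  )
}

# Sequential version: one blocking request per annual archive page.
get_storms_sequential <- function(years = 1998:2017) {
  map_df(years, function(y) {
    url <- sprintf("http://www.nhc.noaa.gov/archive/%d/", y)
    parse_archive(read_html(url), y)
  })
}
```

Each call to `read_html()` blocks until the page is returned, which is why roughly 20 archive pages add up to a noticeable wait.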
This returns a 647x4 data frame of all known storms for both basins since 1998. On my system it takes ~15 seconds.
Using `crul` as an alternative, we can make asynchronous requests to get the data. There are 20 web hits in total (each archive page covers both basins), so I'd like to hit as many of them as possible simultaneously. We can accomplish the same task using `crul` like this:

The code seems a bit longer, but only because the first example relies on additional functions not shown here (`build_archive_df`, `extract_storms`, and additional `dplyr` and `rvest` calls). The first example is loaded under the function `test_a`; the second under `test_b`:
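The original code blocks weren't preserved above; a hedged sketch of what the asynchronous `crul` version (the `test_b` side) might look like, with the selector and function names assumed rather than taken from the benchmark file:

```r
library(crul)
library(rvest)
library(purrr)
library(dplyr)

# Async version: fire all archive-page requests at once with crul::Async,
# then parse each response as it comes back.
get_storms_async <- function(years = 1998:2017) {
  urls <- sprintf("http://www.nhc.noaa.gov/archive/%d/", years)
  res <- crul::Async$new(urls = urls)$get()
  map2_df(res, years, function(r, y) {
    page <- read_html(r$parse("UTF-8"))
    links <- html_elements(page, "td a")   # assumed selector
    tibble(
      Year = y,
      Name = html_text2(links),
      Link = html_attr(links, "href")
    )
  })
}
```

`Async$new(urls = ...)$get()` returns one `HttpResponse` per URL, so the 20 archive pages are fetched concurrently instead of one after another.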
And using `microbenchmark`:

As expected, the results are better (a > 300% improvement). So I've decided to modify this chain of calls (as well as all other `get_*` functions).

Now, there are several issues to be aware of. First, hitting the NHC website too frequently. I emailed the NHC webmaster for information, who forwarded me to the NOAA Web Operation Center. Their response:
So requests must be limited to no more than 80 links per 10 seconds.
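One way to honor that limit (a sketch of my own, not settled code) is to split the URL list into batches of at most 80 and pause between batches:

```r
library(crul)

# Split urls into batches of at most `size` elements.
batch_urls <- function(urls, size = 80) {
  split(urls, ceiling(seq_along(urls) / size))
}

# Fetch each batch asynchronously, sleeping between batches so that no
# 10-second window ever sees more than 80 requests.
fetch_batched <- function(urls, size = 80, delay = 10) {
  batches <- batch_urls(urls, size)
  out <- list()
  for (i in seq_along(batches)) {
    if (i > 1) Sys.sleep(delay)
    out <- c(out, crul::Async$new(urls = batches[[i]])$get())
  }
  out
}
```

For the archive pages themselves (20 links) one batch suffices, but product pages for a full season can easily exceed 80 links, so the batching matters there.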
Second, the status code of each link must be checked and equal 200. Any links with invalid status codes must be hit again (governed by `rrricanes.http_attempts`).

Third, the timeout issue. The `timeout` parameter and the `rrricanes.http_timeout` option, in addition to `rrricanes.http_attempts`, should make additional requests if a page is non-responsive. Any additional attempts must not re-hit links that completed correctly; only those causing temporary problems.
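The retry policy above might be sketched like this. The fetcher is injected so the logic can be exercised without touching the NHC site; `fetch_with_retries` is a hypothetical name, and only `rrricanes.http_attempts` comes from the issue itself:

```r
# Re-request only the failed links, up to getOption("rrricanes.http_attempts").
# `fetch` takes a character vector of urls and returns a list of responses,
# each with at least a $status_code field (crul's Async$get() fits this shape).
fetch_with_retries <- function(urls, fetch,
                               attempts = getOption("rrricanes.http_attempts", 3)) {
  results <- setNames(vector("list", length(urls)), urls)
  pending <- urls
  for (i in seq_len(attempts)) {
    if (length(pending) == 0) break
    res <- fetch(pending)
    ok <- vapply(res, function(r) identical(r$status_code, 200L), logical(1))
    results[pending[ok]] <- res[ok]   # keep successes; never re-hit them
    pending <- pending[!ok]           # retry only the problem links
  }
  if (length(pending) > 0)
    warning("Links still failing after ", attempts, " attempts: ",
            paste(pending, collapse = ", "))
  results
}
```

Because `pending` shrinks to only the failed links after each pass, successful pages are requested exactly once, which is the behavior described above.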
get_storms_benchmark.r