Modify data-scraping methods with crul #94

Closed
timtrice opened this issue Jul 8, 2017 · 2 comments
Labels: Features, High Priority, Technical Debt

timtrice commented Jul 8, 2017

crul is a relatively new http client that can make extracting data significantly faster.

Currently, rrricanes accesses each page one at a time, regardless of whether it is getting a list of storms by year or products for a storm. Building an entire dataset of all storm/product combinations takes several hours.

Worse, because timeouts are an issue with the NHC archives, repeated attempts are inevitably required to build a full dataset.

The basic process of getting all storms for all years, all basins is:

# test_a (assumes purrr and dplyr are attached)
library(purrr)
library(dplyr)

year_archives <- map(c(1998:2017), 
                     .f = rrricanes:::year_archives_link) %>% 
  flatten_chr()

l <- map_df(year_archives, 
            .f = rrricanes:::build_archive_df, c("AL", "EP"), 
            p = progress_estimated(n = length(year_archives)))

This will return a 647x4 dataframe of all known storms for both basins since 1998. On my system this takes ~15 seconds.

Using crul as an alternative, we can make asynchronous requests to get the data. There are 20 web hits in total (each archive page contains both basins), so I'd like to hit as many of them as possible simultaneously.

We can accomplish the same task above using crul like:

# test_b (assumes crul, purrr, rvest, stringr, and dplyr are attached)
library(crul)
library(purrr)
library(rvest)
library(stringr)
library(dplyr)

get_basin_cyclones <- function(basin, res) {
  
  if (basin == "AL") {
    link_xpath <- "//td[(((count(preceding-sibling::*) + 1) = 1) and parent::*)]//a"
  } else if (basin == "EP") {
    link_xpath <- "//td[(((count(preceding-sibling::*) + 1) = 2) and parent::*)]//a"
  } else {
    stop("No basin")
  }
  
  contents <- map(res, ~.$parse("UTF-8")) %>% 
    map(read_html)
  
  years <- map(contents, html_nodes, xpath = "//title") %>% 
    map(html_text) %>% 
    str_sub(0L, 4L) %>% 
    as.numeric()
  
  storms <- map(contents, html_nodes, xpath = link_xpath)
  names <- map(storms, html_text) %>% map(str_to_title)
  links <- map(storms, html_attr, name = "href") %>% 
    map2(years, ~paste0(rrricanes:::year_archives_link(.y), .x))
  basins <- map(names, rep_along, basin)
  years <- map2(names, years, rep_along)

  df <- data_frame("Year" = years %>% flatten_dbl(), 
                   "Name" = names %>% flatten_chr(), 
                   "Basin" = basins %>% flatten_chr(), 
                   "Link" = links %>% flatten_chr())
  
  return(df)
}

year_archives <- map(c(1998:2017), .f = rrricanes:::year_archives_link) %>% 
  flatten_chr()

# 1998 is the only year with a slightly different URL. Modify accordingly.
year_archives[1] <- paste0(year_archives[1], "1998archive.shtml")

l <- Async$new(urls = year_archives)

res <- l$get()

Sys.sleep(3)

storm_df <- map_df(c("AL", "EP"), get_basin_cyclones, res)

The code seems a bit longer, but only because the first example relies on additional functions not shown here (build_archive_df, extract_storms, and additional dplyr and rvest calls).

The first example is wrapped in a function test_a; the second in test_b:

df_a <- test_a() %>% arrange(Basin, Year, Name)
df_b <- test_b() %>% arrange(Basin, Year, Name)
identical(df_a, df_b)
[1] TRUE

And using microbenchmark:

microbenchmark(test_a(), test_b(), times = 5L)
Unit: seconds
     expr       min        lq      mean    median        uq       max neval
 test_a() 14.038077 14.091104 14.517616 14.295890 15.070774 15.092238     5
 test_b()  3.832888  3.915871  3.983316  3.962955  4.097887  4.106977     5

As expected, the asynchronous version performs much better (roughly 3.5x faster in this benchmark). So I've decided to modify this chain of calls (as well as all the other get_* functions).

Now, there are several issues to be aware of. First, hitting the NHC website too frequently. I emailed the NHC webmaster for information, who forwarded me to the NOAA Web Operations Center. Their response:

If we see more than 80 connections within 10 seconds of each other, our Security Team will notice us and the IP space could be blocked.
IBJ-144-29526_ Fwd_ Asynchronous Requests to Website.pdf

So requests must be capped at no more than 80 within any 10-second window.

Second, the status code of each request must be checked and equal 200. Any links returning an invalid status code must be requested again (up to the rrricanes.http_attempts option).

Third, the timeout issue. Use the timeout parameter and the rrricanes.http_timeout option, in addition to rrricanes.http_attempts, to make additional requests if a page is non-responsive.

Any additional attempts must not re-hit links that completed successfully; only those causing temporary problems (a sketch of this retry flow follows).
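
Roughly, that retry flow could look like the sketch below. fetch_with_retries is a hypothetical helper (not part of rrricanes), and timeout handling via rrricanes.http_timeout is omitted for brevity:

library(crul)
library(purrr)

# Hypothetical helper: fetch a set of links and retry any that do not
# return status 200, up to getOption("rrricanes.http_attempts") times.
fetch_with_retries <- function(links) {
  attempts <- getOption("rrricanes.http_attempts", default = 3)
  results  <- vector("list", length(links))
  names(results) <- links
  pending  <- links
  
  for (i in seq_len(attempts)) {
    if (length(pending) == 0) break
    res <- Async$new(urls = pending)$get()
    ok  <- map_lgl(res, ~ .x$status_code == 200)
    # Keep the successful responses; only failed links are retried.
    results[pending[ok]] <- res[ok]
    pending <- pending[!ok]
  }
  
  if (length(pending) > 0)
    warning("Could not retrieve: ", paste(pending, collapse = ", "))
  
  results
}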

get_storms_benchmark.r

timtrice added the Features, High Priority, and Technical Debt labels Jul 8, 2017
timtrice commented

Rough funcs were added in commits 9cbb4f4, 8b06391, and bd9b55c. This issue will become the parent issue of redeveloping each of the primary get_* functions to access storm data.

Releases up to v0.2.0-5.1 have each of the product get_* functions (get_discus, get_fstadv, etc.; get_storms being the exception) as a standalone function, with get_storm_data serving as a wrapper that calls each of them.

That process will be reversed for release v0.2.0-6. The get_* funcs will instead serve as wrappers calling get_storm_data, which is where the datasets will be accessed. The product-scraping funcs (e.g., discus, fstadv, public) will be called from get_storm_data.

get_storm_data will then return a list of dataframes based on the products parameter. So, if a user calls get_storm_data with multiple products:

get_storm_data(link, products = c("discus", "fstadv"))

Then a list containing both dataframes will be returned, as it is now. The difference is that instead of get_storm_data going through get_discus and get_fstadv, those two funcs are bypassed and get_storm_data processes the data itself using discus and fstadv.

A call to get_discus, however, will be the same as calling

get_storm_data(link, "discus")

and the output will be a single dataframe containing all discus products.
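
Roughly, the planned structure could look like the following sketch, where discus and fstadv are the product-scraping functions mentioned above (the signatures here are assumptions, not the final API):

# get_storm_data does the scraping itself, dispatching to the
# product-scraping functions and returning one dataframe per product.
get_storm_data <- function(link, products = c("discus", "fstadv")) {
  scrapers <- list(discus = discus, fstadv = fstadv)
  ds <- lapply(products, function(p) scrapers[[p]](link))
  names(ds) <- products
  ds
}

# The product get_* functions become thin wrappers around it.
get_discus <- function(link) {
  get_storm_data(link, products = "discus")[["discus"]]
}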

HTTP Limits

As mentioned earlier, there is a limit of 80 requests per 10 seconds. Accepting this as-is would leave the problem of tracking how many links were sent at any given time and recording those timestamps to avoid complications. Yea, no.

Instead, a maximum of four requests per half second will be sent from get_storm_data, with the dplyr progress bar serving as the delay between batches. The downside is that if only one or two links are sent, there is still a wait of 0.5 seconds. But, 0.5 seconds; really?

The upside is a very simple implementation that avoids hitting the limit. The progress bar will move relatively quickly (given good internet access), so users won't be left staring at a progress bar that's barely moving.
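
Roughly, that batching could look like this sketch (get_url_contents is a hypothetical helper, not the actual rrricanes implementation):

library(crul)
library(dplyr)
library(purrr)

# Send links in batches of four; the progress-bar tick plus a short
# sleep pace the batches so the 80-requests-per-10-seconds limit is
# respected.
get_url_contents <- function(links) {
  batches <- split(links, ceiling(seq_along(links) / 4))
  p <- progress_estimated(n = length(batches))
  map(batches, function(batch) {
    res <- Async$new(urls = batch)$get()
    p$tick()$print()
    Sys.sleep(0.5)  # four requests per half second, as described above
    map_chr(res, ~ .x$parse("UTF-8"))
  }) %>% 
    flatten_chr()
}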


timtrice commented Dec 24, 2018

I'm reopening this issue.

If there are fewer than 80 links, then there should be no delay in downloading them. I believe the default is set to only process 4 links every half second, but that may depend on what URL is being accessed (are we getting a list of storms for one or many seasons, or are we downloading text products?).

This is a bottleneck, regardless, that needs to be addressed.

Source
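
One possible fix, as a sketch only (fetch_links is a hypothetical helper, not the current rrricanes code): skip the pacing entirely when the total number of links is already under the limit.

library(crul)
library(purrr)

fetch_links <- function(links, limit = 80, batch_size = 4, delay = 0.5) {
  if (length(links) < limit) {
    # Few enough links: fire them all at once, no artificial delay.
    return(Async$new(urls = links)$get())
  }
  # Otherwise fall back to paced batches.
  batches <- split(links, ceiling(seq_along(links) / batch_size))
  flatten(map(batches, function(batch) {
    res <- Async$new(urls = batch)$get()
    Sys.sleep(delay)
    res
  }))
}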

timtrice reopened this Dec 24, 2018
timtrice added this to the 0.2.1 milestone Jan 2, 2019
timtrice closed this as completed Jan 4, 2019