squish_df helper #38

Closed
JosiahParry opened this issue Mar 20, 2024 · 2 comments · Fixed by #42


@JosiahParry
Collaborator

A common need when processing many requests at once is to combine the results into a single data frame. This is currently done ad hoc in arcgislayers and arcpbf.

arcgislayers uses do.call(rbind.data.frame) which is the slowest approach. arcpbf has adopted a hierarchy of the fastest implementations using collapse, data.table, dplyr, and base R. This should be provided in arcgisutils. It is needed in arcgeocode at the moment as well.

See R-ArcGIS/arcgislayers#167

Also: https://github.com/R-ArcGIS/arcpbf/blob/main/R/post-process.R#L109-L121
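
To make the "hierarchy of implementations" idea concrete, here is a minimal sketch of what such a helper could look like. It is not the code in arcpbf or arcgisutils; the name `rbind_results_sketch()` is hypothetical, and the fallback order (collapse, then data.table, then dplyr, then base R) is simply the order listed above.

```r
# Hypothetical helper: row-bind a list of data frames using the fastest
# backend that happens to be installed. Illustrative only.
rbind_results_sketch <- function(x) {
  # Drop NULL elements (e.g. requests that returned nothing)
  x <- Filter(Negate(is.null), x)
  if (length(x) == 0L) {
    return(data.frame())
  }

  if (requireNamespace("collapse", quietly = TRUE)) {
    # rowbind() accepts a single list of data frames
    collapse::rowbind(x)
  } else if (requireNamespace("data.table", quietly = TRUE)) {
    # rbindlist() returns a data.table, so convert back
    as.data.frame(data.table::rbindlist(x))
  } else if (requireNamespace("dplyr", quietly = TRUE)) {
    dplyr::bind_rows(x)
  } else {
    # Base R fallback: the slowest option, but always available
    do.call(rbind.data.frame, x)
  }
}

res <- rbind_results_sketch(list(data.frame(a = 1:2), data.frame(a = 3:4)))
nrow(res)
```

The backends do not all return the same class (for example, data.table and dplyr do not necessarily preserve an sf class), so a real helper would likely need to restore the input class afterwards.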

@elipousson

This may be a question more relevant to {arcgislayers} than {arcgisutils} but: have you looked at httr2::resps_data()?

I noticed that parse_esri_json() is converting each response to an sf object before they get combined – but if you combine them with vctrs::list_rbind() the sf class is dropped and needs to be applied again with sf::st_as_sf(). If the conversion to sf happens after all of the elements are combined into a single list, that could make it easier to figure out the flow within arc_select().

@JosiahParry
Collaborator Author

I think this is actually something I want to formalize a bit more and get sorted out. I've had a chance to look at resps_data() and I think it's a really good pattern and I'm prototyping with it currently in {arcgeocode}.

The workflow in general for these packages is pretty standard:

  1. Create a bunch of requests (lapply, for loop, whatever)
  2. Send a bunch of requests (httr2::req_perform_parallel())
  3. Process the results (httr2::resps_data())
  4. Combine and return to the user - some utils function e.g. rbind_results()
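
The four steps above could be sketched roughly as follows. This assumes a recent httr2 (1.0 or later, for `on_error = "continue"`); `fetch_and_combine()` and its `parse_one` argument are hypothetical names, and `parse_one` stands in for whatever turns a single response into a data frame (JSON, protocol buffers, and so on).

```r
# Hypothetical wrapper around the create / send / process / combine steps.
# `urls` is a character vector; `parse_one` is a caller-supplied function
# that takes one httr2 response and returns a data frame.
fetch_and_combine <- function(urls, parse_one) {
  # 1. Create a bunch of requests
  reqs <- lapply(urls, httr2::request)

  # 2. Send them in parallel; keep going if an individual request fails
  resps <- httr2::req_perform_parallel(reqs, on_error = "continue")

  # 3. Process only the successful responses. Wrapping each result in
  #    list() makes resps_data() return a list of data frames instead of
  #    combining them itself.
  ok <- httr2::resps_successes(resps)
  parsed <- httr2::resps_data(ok, function(resp) list(parse_one(resp)))

  # 4. Combine and return to the user (base R here; a faster utility
  #    function could be swapped in)
  do.call(rbind.data.frame, parsed)
}
```

Nothing is sent until the function is called, so defining it has no side effects. Failed requests are silently dropped in this version.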

Question: how do we formalize this in a function signature / standard?

Using resps_data() we can get a list of responses that are already pre-processed that we can then pass to a squish_results() or rbind_results() type of function.

I'd like to clean this up in arcgislayers in particular because I want to use arcpbf when a service supports protocol buffers. So what might this look like? Right now there is arcpbf::resps_data_pbf(); should there be an arcgisutils::resps_data_json() as well? And then an arcgeocode::resps_data_rev_geocode() in arcgeocode (maybe not exported)?

Error handling?

One of the challenges here is how do we handle errors in the responses? We want to keep all of the responses that work because 1) they might have cost us money to execute and 2) it might have been slow. I'd like to be able to capture the errors and then return them as an attribute to the result so that they can be handled afterwards. But what does that look like?
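
One possible shape for this, as a base R sketch only: treat each element of the results list as either a data frame or a condition object, bind the data frames, and attach the conditions as an attribute. The function name `combine_with_errors()` and the attribute name `"errors"` are hypothetical, not an existing API.

```r
# Hypothetical: keep every successful result and carry failures along
# as an attribute so the caller can inspect or retry them afterwards.
combine_with_errors <- function(results) {
  # Flag elements that are error/condition objects rather than data
  is_err <- vapply(results, inherits, logical(1), what = "condition")

  ok <- unname(results[!is_err])
  out <- if (length(ok) > 0L) do.call(rbind.data.frame, ok) else data.frame()

  # Store the failures (possibly an empty list) on the result
  attr(out, "errors") <- results[is_err]
  out
}

results <- list(
  data.frame(id = 1),
  simpleError("request timed out"),
  data.frame(id = 2)
)
out <- combine_with_errors(results)
nrow(out)
length(attr(out, "errors"))
```

One caveat: custom attributes are easily lost when the data frame is subset or passed through other functions, so the errors would need to be pulled out right after the call.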

Row-binding results

Regarding combining results, collapse is by far the fastest way to do this in the benchmark below, and it respects the input classes. You can also preserve the class with the .ptype argument of vctrs::vec_rbind(), but it's quite slow.

x <- sf::read_sf(system.file("shape/nc.shp", package = "sf"))

bench::mark(
  collapse = collapse::rowbind(x, x, x, x, x, x, x, x, x),
  data.table = data.table::rbindlist(list(x, x, x, x, x, x, x, x, x)) |> 
    sf::st_sf(),
  vctrs = vctrs::vec_rbind(x, x, x, x, x, x, x, x, x, .ptype = x[0,]),
  check = FALSE # skip bench's equality check on the results
)
#> # A tibble: 3 × 6
#>   expression      min   median `itr/sec` mem_alloc `gc/sec`
#>   <bch:expr> <bch:tm> <bch:tm>     <dbl> <bch:byt>    <dbl>
#> 1 collapse     19.8µs   28.9µs    31690. 1007.98KB     66.7
#> 2 data.table 228.21µs  246.2µs     3895.    3.34MB     28.3
#> 3 vctrs        7.18ms    7.6ms      131.  826.91KB     67.4


# illustrate ptype argument
vctrs::vec_rbind(x, x, x, x, x, x, x, x, x, .ptype = x[0,])
#> Simple feature collection with 900 features and 14 fields
#> Geometry type: MULTIPOLYGON
#> Dimension:     XY
#> Bounding box:  xmin: -84.32385 ymin: 33.88199 xmax: -75.45698 ymax: 36.58965
#> Geodetic CRS:  NAD27
#> # A tibble: 900 × 15
#>     AREA PERIMETER CNTY_ CNTY_ID NAME  FIPS  FIPSNO CRESS_ID BIR74 SID74 NWBIR74
#>    <dbl>     <dbl> <dbl>   <dbl> <chr> <chr>  <dbl>    <int> <dbl> <dbl>   <dbl>
#>  1 0.114      1.44  1825    1825 Ashe  37009  37009        5  1091     1      10
#>  2 0.061      1.23  1827    1827 Alle… 37005  37005        3   487     0      10
#>  3 0.143      1.63  1828    1828 Surry 37171  37171       86  3188     5     208
#>  4 0.07       2.97  1831    1831 Curr… 37053  37053       27   508     1     123
#>  5 0.153      2.21  1832    1832 Nort… 37131  37131       66  1421     9    1066
#>  6 0.097      1.67  1833    1833 Hert… 37091  37091       46  1452     7     954
#>  7 0.062      1.55  1834    1834 Camd… 37029  37029       15   286     0     115
#>  8 0.091      1.28  1835    1835 Gates 37073  37073       37   420     0     254
#>  9 0.118      1.42  1836    1836 Warr… 37185  37185       93   968     4     748
#> 10 0.124      1.43  1837    1837 Stok… 37169  37169       85  1612     1     160
#> # ℹ 890 more rows
#> # ℹ 4 more variables: BIR79 <dbl>, SID79 <dbl>, NWBIR79 <dbl>,
#> #   geometry <MULTIPOLYGON [°]>

Created on 2024-03-21 with reprex v2.0.2

elipousson added a commit to elipousson/arcgislayers that referenced this issue Apr 9, 2024
Also implement new rbind_results function added w/ R-ArcGIS/arcgisutils#38