Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

st_intersect faster than geos_intersects_matrix() #66

Open
vronizor opened this issue Feb 15, 2022 · 2 comments · Fixed by #73
Open

st_intersect faster than geos_intersects_matrix() #66

vronizor opened this issue Feb 15, 2022 · 2 comments · Fixed by #73

Comments

@vronizor
Copy link

vronizor commented Feb 15, 2022

As per @paleolimbot 's suggestion, here is a reprex of a benchmark comparing spatial intersects with sf vs geos . st_intersects does seem faster than geos_intersects_matrix() at the moment.

library(sf)
#> Linking to GEOS 3.9.1, GDAL 3.4.0, PROJ 8.1.1; sf_use_s2() is TRUE
library(geos)
library(data.table)
library(microbenchmark)
options(timeout=10000)

rm(list = ls())
gc()
#>           used (Mb) gc trigger (Mb) limit (Mb) max used (Mb)
#> Ncells  901867 48.2    1397940 74.7         NA  1397940 74.7
#> Vcells 1578643 12.1    8388608 64.0      16384  2650514 20.3

url_trips = "https://s3.amazonaws.com/nyc-tlc/trip+data/yellow_tripdata_2015-01.csv"

download.file(url = url_trips, destfile = "trip_records.csv", mode = "wb") # takes some time and is pretty heavy!
trips.dt = fread("trip_records.csv", nrows = 100000) # enough rows to benchmark performance

# Keep only pickup (origin) coordinate columns
cols_drop = grep("pickup_l", names(trips.dt), value = T, invert = T)
trips.dt[, (cols_drop) := NULL]
trips.dt = trips.dt[!(pickup_longitude == 0 | pickup_latitude == 0)] # obviously invalid coordinates

# Generate geom
trips.dt[, `:=`(o_geom = geos_make_point(pickup_longitude, 
                                         pickup_latitude, 
                                         crs=4326))]

# Generate sf
trips.sf = st_as_sf(trips.dt)
trips.sf = st_transform(trips.sf, 4326)

# Get polygons to intersect
url_zones = "https://s3.amazonaws.com/nyc-tlc/misc/taxi_zones.zip"
download.file(url_zones, destfile = "zones_polygons.zip", mode = "wb")
unzip("zones_polygons.zip", exdir = file.path("zones"))

# sf version
zones.sf = st_read(file.path("zones", "taxi_zones.shp"))
#> Reading layer `taxi_zones' from data source 
#>   `/private/var/folders/y3/cg53v8tx3g538m3751w1_9v80000gn/T/RtmpwdNjKW/reprex-4d044a0e522a-cilia-stag/zones/taxi_zones.shp' 
#>   using driver `ESRI Shapefile'
#> Simple feature collection with 263 features and 6 fields
#> Geometry type: MULTIPOLYGON
#> Dimension:     XY
#> Bounding box:  xmin: 913175.1 ymin: 120121.9 xmax: 1067383 ymax: 272844.3
#> Projected CRS: NAD83 / New York Long Island (ftUS)
zones.sf = st_transform(zones.sf, 4326)

# geom version
zones.dt = as.data.table(zones.sf)[, geom := as_geos_geometry(geometry, crs = 4326)]
zones.dt[, geometry := NULL]

# Functions to test
geos.dt = function(x) {
  geos_intersects_matrix(zones.dt[, geom], trips.dt[, o_geom])
}

sf = function(x) {
  st_intersects(zones.sf, trips.sf)
}

# Benchmark
microbenchmark(geos = geos.dt(),
               sf = sf(),
               times = 10)
#> Unit: seconds
#>  expr      min       lq     mean   median       uq      max neval
#>  geos 4.217741 4.627104 5.214595 5.222753 5.842305 5.876518    10
#>    sf 3.014916 3.103549 3.493810 3.328212 3.455878 5.066740    10

Created on 2022-02-15 by the reprex package (v2.0.1)

@kadyb
Copy link
Contributor

kadyb commented Feb 16, 2022

Shouldn't you use a planar CRS (eg 2263)? It looks like the difference is even greater then.

result = bench::mark(
  check = FALSE, iterations = 10, time_unit = "s",
  geos.dt(),
  sf_4326(),
  sf_2263()
)
result[, 1:6]

  expression   min median `itr/sec` mem_alloc `gc/sec`
1 geos.dt()  4.80   5.05      0.199    3.79MB    2.79 
2 sf_4326()  2.62   2.79      0.351    6.61MB    0.351
3 sf_2263()  0.709  0.730     1.31     5.47MB    1.31

I'm also not sure that {sf} and {geos} work exactly the same way. geos_intersects_matrix() expects STRTree as the second argument and it takes the most time to convert to this type, but @paleolimbot will explain it best.

system.time(trips.dt2 <- as_geos_strtree(trips.dt[, o_geom]))
#> user  system elapsed 
#> 4.88    0.02    4.89
system.time(geos_intersects_matrix(zones.dt[, geom], trips.dt2))
#> user  system elapsed 
#> 0.14    0.00    0.14

@paleolimbot
Copy link
Owner

Thanks for this!

I'm 99% certain that the difference is because of how accumulating potentially matching features (based on bbox) is implemented (in sf it's std::vector and here I over-allocate an intermediary buffer that's way too big and I should just use std::vector).

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging a pull request may close this issue.

3 participants