
Slow processing with assign_polygons() (hungarian_costmin()) for 3 MB shapefile #30

Open
lgtm70b opened this issue Sep 6, 2018 · 3 comments

lgtm70b commented Sep 6, 2018

Here's a reprex using a larger shapefile with US congressional districts. Based on a single polygon taking ~10 seconds to pass through hungarian_costmin(), rendering the full cartogram should take over 80 minutes. I've already reduced the shapefile size from 65 MB to 3 MB, but is there a way the Hungarian algorithm can be optimized to run faster during grid assignment?

library(geogrid)
library(sf)
library(tidyverse)
library(tmap)

# Download and read example data
tmp_shps <- tempfile()
tmp_dir <- tempdir()
download.file(
  "https://www2.census.gov/geo/tiger/TIGER2016/CD/tl_2016_us_cd115.zip",
  tmp_shps
)
unzip(tmp_shps, exdir = tmp_dir)
us_shp <- st_read(paste(tmp_dir, "tl_2016_us_cd115.shp", sep = "/"))
  # > 64394144 bytes
shp <- us_shp %>% 
  filter(!STATEFP %in% c('15', '02', '60', '72', '69', '78', '66')) %>% # just mainland
  as('Spatial') %>%
  rmapshaper::ms_simplify() %>% # simplification using rmapshaper. reduces size from 65 MB to 3 MB
  st_as_sf()

# shp %>% object.size()
# > 3594432 bytes

# commented out: initial rawplot as in README example #
# substrRight <- function(x, n){
# substr(x, nchar(x)-n+1, nchar(x))
# }
# shp$SNAME <- paste(shp$STATEFP %>% as.character(), 
#                   substrRight(shp$NAMELSAD %>% as.character(), 2), 
#                               sep = '_') # get congr distr #
# rawplot <- tm_shape(shp) +
#   tm_polygons() +
#   tm_text("SNAME")
# rawplot

# commented out: exploring different grid layouts, settled on hexgrid seed 6 #
# par(mfrow = c(2, 3), mar = c(0, 0, 2, 0))
# for (i in 1:6) {
#   new_cells <- calculate_grid(shape = shp, grid_type = "hexagonal", seed = i)
#   plot(new_cells, main = paste("Seed", i, sep = " "))
# }

# Calculate Grid
new_cells_hex <- calculate_grid(shape = shp, grid_type = "hexagonal", seed = 6)

# Assign Polygons
## assign_polygons() doesn't complete here and sometimes crashes; it takes an extremely long time to run
# resulthex3 <- assign_polygons(shp, new_cells_hex)
#####################

##
# Where the delay is: hungarian_costmin within assign_polygons()
##

# within assign_polygons.sf(), which converts to an spdf and calls
# assign_polygons.SpatialPolygonsDataFrame();
# running the equivalent contents line by line, just the 1st element / polygon within the loop

  shape <- shp %>% as('Spatial')
  original_points <- rgeos::gCentroid(shape, byid = TRUE) # 'shape' here, not the undefined 'spf'
  shape@data$CENTROIX <- original_points$x
  shape@data$CENTROIY <- original_points$y
  shape@data$key_orig <- paste(original_points$x, original_points$y,
                               sep = "_")
  
  new_polygons <- new_cells_hex
  new_points <- new_polygons[[1]]
  vector_length <- length(shape)
  new_polygons2 <- new_polygons[[2]]
  s_poly <- sp::SpatialPolygonsDataFrame(new_polygons2, as.data.frame(sp::coordinates(new_polygons2)))
  s_poly$key_new <- paste(s_poly@data$V1, s_poly@data$V2, 
                          sep = "_")
  closest_site_vec <- vector(mode = "numeric", length = vector_length)
  min_dist_vec <- vector(mode = "numeric", length = vector_length)
  taken_vec <- vector(mode = "numeric", length = vector_length)
  taken_vec_index <- integer(vector_length)

## within loop ##
  i <- 1 # just first iteration
  dist_vec <- sp::spDistsN1(original_points, new_points[i], 
                            longlat = FALSE)
  min_dist_vec[i] <- min(dist_vec)
  closest_site_vec[i] <- which.min(dist_vec)

  taken_vec[i] <- which.min(dist_vec)
  taken_vec_index <- taken_vec[taken_vec > 0]
  costmatrix <- sp::spDists(original_points, new_points, 
                            longlat = FALSE)
  colnames(costmatrix) <- paste(s_poly@data$V1, s_poly@data$V2, 
                                sep = "_")
  rownames(costmatrix) <- paste(original_points@coords[, 1], original_points@coords[, 2], sep = "_")
  #################################################
  ### everything up to here evaluates instantly ###
  #################################################

  system.time(hungarian_costmin <- hungarian_cc(costmatrix)) # Just one element takes 10.8 s, MBP 2.2 GHz i7 16 GB RAM
  # user  system elapsed 
  # 10.796   0.033  10.876 
  
  # end loop
  # ~11 s * 436 polygons ~ 4796 seconds, i.e. ~80 minutes
lgtm70b changed the title to "Slow processing with assign_polygons() (hungarian_costmin()) for 3 MB shapefile" on Sep 6, 2018
@sassalley
Collaborator

Hi @ddheart, thanks for using the package and for providing a useful reprex. The short answer, to the best of my knowledge, is: not easily. There is a faster implementation of the algorithm which is O(n³), whereas this particular implementation is O(n⁴). This is why it slows down quite noticeably as the number of assignments (polygons) increases. More information about the implementations can be found here and here. If you are able to provide a faster implementation, that would be great. Nonetheless, I hope that despite taking time, it is helpful for your use case.
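
If it helps to see that growth in practice, here is a minimal sketch (mine, not part of the original reprex) that times the solver on random square cost matrices of increasing size. It assumes hungarian_cc() is callable as in the reprex above; depending on the geogrid version it may only be reachable as geogrid:::hungarian_cc().

library(geogrid)

# Time one assignment for a few random n x n cost matrices.
sizes <- c(50, 100, 200, 400)
elapsed <- sapply(sizes, function(n) {
  costs <- matrix(runif(n * n), nrow = n)
  system.time(hungarian_cc(costs))[["elapsed"]]
})
data.frame(n = sizes, elapsed = elapsed)
# With an O(n^4) implementation, doubling n should multiply the elapsed time
# by roughly 16; an O(n^3) solver would only multiply it by roughly 8.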

@aecoleman

I believe I was able to optimize the assignment algorithm and have submitted a pull request. After reading the C++ code, I realized the for loop wasn't necessary. On the 50 US states, my variant gave results in ~200 ms vs ~600 ms with the existing implementation (Intel i7-7700HQ @ 2.80 GHz). Assigning the 440 US congressional districts took 32 seconds. Assigning the 3000+ CONUS counties took ~34 hours.

@mrajeev08

Thanks for making this package! I also ran into a similar issue and found this Q&A on Stack Overflow suggesting clue::solve_LSAP, which is quite fast.
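
As a rough sketch (mine, assuming the costmatrix built in the reprex above), clue::solve_LSAP() can be dropped in for hungarian_cc(); it solves the same linear sum assignment problem and only requires a non-negative cost matrix with no more rows than columns, which the square distance matrix here satisfies.

library(clue)

# Solve the assignment on the cost matrix from the reprex above.
system.time(
  assignment <- solve_LSAP(costmatrix, maximum = FALSE)
)

# assignment[i] is the column (new hex cell) matched to original polygon i;
# the keys can then be paired the same way assign_polygons() does.
matched <- data.frame(
  key_orig = rownames(costmatrix),
  key_new  = colnames(costmatrix)[as.integer(assignment)],
  stringsAsFactors = FALSE
)
head(matched)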
