The goal of lazysf is to provide interactive delayed read of GDAL vector data sources.
Vector data sources, drawings (a.k.a. “shapefiles”) are files or web services or databases that provide tables of data fields. These fields may include spatial geometry data such as points, lines, polygons, and other planar types composed of paths of coordinates.
lazysf uses the dplyr/dbplyr ‘tbl_lazy’ mechanism by providing a GDAL
DBI-backend like many database packages in R. The convenience function
lazysf()
provides a single-argument wrapper around the database-like
workflows.
See it in action!
library(lazysf)
library(sf)
library(dplyr)
url <- "https://github.com/Nowosad/spData/raw/master/inst/shapes/NY8_bna_utm18.gpkg"
(x <- lazysf(url))
#> # Source: table<sf_bna2_utm18> [?? x 13]
#> # Database: SFSQLConnection
#> AREAKEY AREANAME X Y POP8 TRACTCAS PROPCAS PCTOWNHOME PCTAGE65P
#> <chr> <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
#> 1 360070… Bingham… 4.07 -67.4 3540 3.08 8.70e-4 0.328 0.147
#> 2 360070… Bingham… 4.64 -66.9 3560 4.08 1.15e-3 0.427 0.235
#> 3 360070… Bingham… 5.71 -67.0 3739 1.09 2.92e-4 0.338 0.138
#> 4 360070… Bingham… 7.61 -66.0 2784 1.07 3.84e-4 0.462 0.119
#> 5 360070… Bingham… 7.32 -67.3 2571 3.06 1.19e-3 0.192 0.142
#> 6 360070… Bingham… 8.56 -66.9 2729 1.06 3.88e-4 0.365 0.141
#> 7 360070… Bingham… 9.21 -67.2 3952 2.09 5.29e-4 0.666 0.231
#> 8 360070… Bingham… 10.2 -66.9 993 0.02 2.00e-5 0.667 0.279
#> 9 360070… Bingham… 8.70 -68.3 1908 2.04 1.07e-3 0.459 0.172
#> 10 360070… Bingham… 7.40 -68.1 948 0.02 2.10e-5 0.166 0.179
#> # … with more rows, and 4 more variables: Z <dbl>, AVGIDIST <dbl>,
#> # PEXPOSURE <dbl>, geom <MULTIPOLYGON [m]>
x %>% distinct(AREANAME) %>% arrange(AREANAME)
#> # Source: lazy query [?? x 1]
#> # Database: SFSQLConnection
#> # Ordered by: AREANAME
#> AREANAME
#> <chr>
#> 1 <NA>
#> 2 Auburn city
#> 3 Baldwinsville village
#> 4 Barker town
#> 5 Bayberry-Lynelle Mead
#> 6 Binghamton city
#> 7 Binghamton town
#> 8 Brookfield town
#> 9 Camillus village
#> 10 Canastota village
#> # … with more rows
plot(st_as_sf(x %>%
dplyr::filter(!(AREANAME %LIKE% "Ca%" | AREANAME %LIKE% "Bi%")) %>%
dplyr::select(AREANAME, geom)))
This is very largely format dependent, and by “format” we mean the driver as provided by GDAL.
We make no claims about performance or convenience, it will be affected by your system and your sf installation - lazysf just takes you closer the GDAL capabilities.
Performance can be excellent, and may be very competitive compared to reading an entire data source layer into memory. Really good drivers include ESRI Shapefile, Geopackage, PostgreSQL/PostGIS, MapInfo File, ESRI FileGDB, but there are dozens to choose from.
A query on a CSV, GeoJSON, or KML file (local or remote) is entirely subject to the performance of the matching GDAL driver.
- big text files will be slow, they aren’t a format suitable for database-like access
- geometry is not automatic for real database formats, and will depend on the SQL used
- geometry is automatic for non-DB formats
- non-DB formats without a geometry column name will be called
_ogr_geometry_
, other non-DB formats like ESRI’s geodatabase have other names likeSHAPE
- non-DB formats have access to special variable names and functions, listed on the OGRSQL page.
Real DBs don’t have these special OGRSQL features, but they do have their own special syntax which for the most part can be sent straight through.
When using dplyr verbs (filter()
, select()
, mutate()
,
transmute()
, arrange()
, left_join()
, …) we are also subject to the
rules of SQL translation. There are no specific ones provided by lazysf
but that might change.
Wrappers around lazysf could provide more specific tools for particular formats.
Yes (actually that is what lazysf uses) but with sf alone you get a fully materialized sf data frame, so you better get that query right first time!
With lazysf you get some control over intermediate steps, potentially expensive queries will only be run for a preview of the data until you are ready to fetch it.
You can install the dev version of lazysf from GitHub with:
remotes::install_github("mdsumner/lazysf")
This is a basic example.
library(lazysf)
f <- system.file("gpkg/nc.gpkg", package = "sf", mustWork = TRUE)
## specify only the data source
lazysf(f)
#> # Source: table<nc.gpkg> [?? x 15]
#> # Database: SFSQLConnection
#> AREA PERIMETER CNTY_ CNTY_ID NAME FIPS FIPSNO CRESS_ID BIR74 SID74 NWBIR74
#> <dbl> <dbl> <dbl> <dbl> <chr> <chr> <dbl> <int> <dbl> <dbl> <dbl>
#> 1 0.114 1.44 1825 1825 Ashe 37009 37009 5 1091 1 10
#> 2 0.061 1.23 1827 1827 Alle… 37005 37005 3 487 0 10
#> 3 0.143 1.63 1828 1828 Surry 37171 37171 86 3188 5 208
#> 4 0.07 2.97 1831 1831 Curr… 37053 37053 27 508 1 123
#> 5 0.153 2.21 1832 1832 Nort… 37131 37131 66 1421 9 1066
#> 6 0.097 1.67 1833 1833 Hert… 37091 37091 46 1452 7 954
#> 7 0.062 1.55 1834 1834 Camd… 37029 37029 15 286 0 115
#> 8 0.091 1.28 1835 1835 Gates 37073 37073 37 420 0 254
#> 9 0.118 1.42 1836 1836 Warr… 37185 37185 93 968 4 748
#> 10 0.124 1.43 1837 1837 Stok… 37169 37169 85 1612 1 160
#> # … with more rows, and 4 more variables: BIR79 <dbl>, SID79 <dbl>,
#> # NWBIR79 <dbl>, geom <MULTIPOLYGON [°]>
## specify the data source and a query to run
lazysf(f, query = "SELECT AREA, FIPS, geom FROM \"nc.gpkg\" WHERE AREA < 0.1")
#> # Source: SQL [?? x 3]
#> # Database: SFSQLConnection
#> AREA FIPS geom
#> <dbl> <chr> <MULTIPOLYGON [°]>
#> 1 0.061 37005 (((-81.23989 36.36536, -81.24069 36.37942, -81.26284 36.40504, -…
#> 2 0.07 37053 (((-76.00897 36.3196, -76.01735 36.33773, -76.03288 36.33598, -7…
#> 3 0.097 37091 (((-76.74506 36.23392, -76.98069 36.23024, -76.99475 36.23558, -…
#> 4 0.062 37029 (((-76.00897 36.3196, -75.95718 36.19377, -75.98134 36.16973, -7…
#> 5 0.091 37073 (((-76.56251 36.34057, -76.60424 36.31498, -76.64822 36.31532, -…
#> 6 0.072 37181 (((-78.49252 36.17359, -78.51472 36.17522, -78.51709 36.46148, -…
#> 7 0.053 37139 (((-76.29893 36.21423, -76.32423 36.23362, -76.37242 36.25235, -…
#> 8 0.081 37189 (((-81.80622 36.10456, -81.81715 36.10939, -81.82231 36.15786, -…
#> 9 0.063 37143 (((-76.48053 36.07979, -76.53696 36.08792, -76.5756 36.10266, -7…
#> 10 0.044 37041 (((-76.68874 36.29452, -76.64822 36.31532, -76.60424 36.31498, -…
#> # … with more rows
## specify the data source and the table/layer to access
lazysf(f, layer = "nc.gpkg") %>%
dplyr::select(AREA, FIPS, geom) %>%
dplyr::filter(AREA < 0.1)
#> # Source: lazy query [?? x 3]
#> # Database: SFSQLConnection
#> AREA FIPS geom
#> <dbl> <chr> <MULTIPOLYGON [°]>
#> 1 0.061 37005 (((-81.23989 36.36536, -81.24069 36.37942, -81.26284 36.40504, -…
#> 2 0.07 37053 (((-76.00897 36.3196, -76.01735 36.33773, -76.03288 36.33598, -7…
#> 3 0.097 37091 (((-76.74506 36.23392, -76.98069 36.23024, -76.99475 36.23558, -…
#> 4 0.062 37029 (((-76.00897 36.3196, -75.95718 36.19377, -75.98134 36.16973, -7…
#> 5 0.091 37073 (((-76.56251 36.34057, -76.60424 36.31498, -76.64822 36.31532, -…
#> 6 0.072 37181 (((-78.49252 36.17359, -78.51472 36.17522, -78.51709 36.46148, -…
#> 7 0.053 37139 (((-76.29893 36.21423, -76.32423 36.23362, -76.37242 36.25235, -…
#> 8 0.081 37189 (((-81.80622 36.10456, -81.81715 36.10939, -81.82231 36.15786, -…
#> 9 0.063 37143 (((-76.48053 36.07979, -76.53696 36.08792, -76.5756 36.10266, -7…
#> 10 0.044 37041 (((-76.68874 36.29452, -76.64822 36.31532, -76.60424 36.31498, -…
#> # … with more rows
## above was a real database (Geopackage), now with an actual shapefile
shp <- lazysf(system.file("shape/nc.shp", package = "sf", mustWork = TRUE))
library(dplyr)
shp %>%
filter(NAME %LIKE% 'A%') %>%
mutate(abc = 1.3) %>%
select(abc, NAME, `_ogr_geometry_`) %>%
arrange(desc(NAME)) #%>% show_query()
#> # Source: lazy query [?? x 3]
#> # Database: SFSQLConnection
#> # Ordered by: desc(NAME)
#> abc NAME `_ogr_geometry_`
#> <dbl> <chr> <POLYGON [°]>
#> 1 1.3 Avery ((-81.94135 35.95498, -81.9614 35.93922, -81.94495 35.91861, -…
#> 2 1.3 Ashe ((-81.47276 36.23436, -81.54084 36.27251, -81.56198 36.27359, …
#> 3 1.3 Anson ((-79.91995 34.80792, -80.32528 34.81476, -80.27512 35.19311, …
#> 4 1.3 Allegha… ((-81.23989 36.36536, -81.24069 36.37942, -81.26284 36.40504, …
#> 5 1.3 Alexand… ((-81.10889 35.7719, -81.12728 35.78897, -81.1414 35.82332, -8…
#> 6 1.3 Alamance ((-79.24619 35.86815, -79.23799 35.83725, -79.54099 35.83699, …
Online sources can also work if your build of sf supports.
# online sources can work
geojson <- file.path("https://raw.githubusercontent.com/SymbolixAU",
"geojsonsf/master/inst/examples/geo_melbourne.geojson")
lazysf(geojson)
#> # Source: table<geo_melbourne> [?? x 8]
#> # Database: SFSQLConnection
#> SA2_NAME polygonId SA3_NAME AREASQKM fillColor strokeColor strokeWeight
#> <chr> <int> <chr> <dbl> <chr> <chr> <int>
#> 1 Abbotsf… 70 Yarra 1.74 #440154 #440154 1
#> 2 Albert … 59 Port Ph… 4.67 #450457 #450457 1
#> 3 Alphing… 41 Darebin… 2.89 #46075A #46075A 1
#> 4 Armadale 66 Stonnin… 2.18 #460A5D #460A5D 1
#> 5 Ascot V… 44 Essendon 3.84 #460C5F #460C5F 1
#> 6 Brunswi… 36 Brunswi… 5.14 #472D7B #472D7B 1
#> 7 Brunswi… 37 Brunswi… 2.17 #472D7B #472D7B 1
#> 8 Brunswi… 38 Brunswi… 3.18 #472E7C #472E7C 1
#> 9 Carlton 48 Melbour… 1.82 #443A83 #443A83 1
#> 10 Carlton… 71 Yarra 2.30 #443A83 #443A83 1
#> # … with more rows, and 1 more variable: `_ogr_geometry_` <POLYGON [°]>
Also works on PostgreSQL and many others as per GDAL vector driver support.
To create a connection string for GDAL for PostgreSQL use something like
DSN <- glue::glue("PG:host='{host}' dbname='{dbname}' user='{user}' password='{password}'")
dbConnect(SFSQL(), DSN)
but the same can be done with generic DBI and (for example) the
Rpostgres package. With SFSQL()
we
just know that it’s executed by GDAL (via sf).
Note that GDAL drivers can be confusing, and it can be important to see the behaviours GDAL will provide by default. Here see that we can read from a Geopackage file but not as it was intended. We have used the driver-prefix to make GDAL choose its SQLite driver rather than the Geopackage driver.
library(lazysf)
gpkgfile <- system.file("gpkg/nc.gpkg", package = "sf", mustWork = TRUE)
lazysf(glue::glue("SQLite:{gpkgfile}"))
#> # Source: table<gpkg_contents> [?? x 10]
#> # Database: SFSQLConnection
#> table_name data_type identifier description last_change min_x min_y
#> <chr> <chr> <chr> <chr> <dttm> <dbl> <dbl>
#> 1 nc.gpkg features nc.gpkg "" 2016-09-28 14:57:13 -84.3 33.9
#> # … with 3 more variables: max_x <dbl>, max_y <dbl>, srs_id <int>
That’s not very spatial, but we can dig in to find out what else is there.
See https://twitter.com/mdsumner/status/1306937694467940353?s=20 for more (but a SQLite in GDAL returns character not binary geom … so WIP)
Please note that the lazysf project is released with a Contributor Code of Conduct. By contributing to this project, you agree to abide by its terms.