Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

ENH: Add support for mask parameter #285

Merged
merged 14 commits into from
Sep 29, 2023
Merged
Show file tree
Hide file tree
Changes from 4 commits
Commits
File filter

Filter by extension

Filter by extension

Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
6 changes: 5 additions & 1 deletion CHANGES.md
Original file line number Diff line number Diff line change
Expand Up @@ -11,9 +11,13 @@
unknown count (count is used in `read_info`, `read`, `read_dataframe`) (#271).
- Raise error if `read` or `read_dataframe` is called with parameters to read no
columns, geometry, or fids (#280)

- Automatically detect supported driver by extension for all available
write drivers and addition of `detect_write_driver` (#270)
- Addition of `mask` parameter to `open_arrow`, `read`, `read_dataframe`,
and `read_bounds` functions to select only the features in the dataset that
intersect the mask geometry (#285). Note: there is a known GDAL bug when
using the Arrow interface that causes it to return too many features; a fix
is expected in GDAL 3.8.0.

### Bug fixes

Expand Down
27 changes: 27 additions & 0 deletions docs/source/introduction.md
Original file line number Diff line number Diff line change
Expand Up @@ -200,6 +200,32 @@ will be returned; if GEOS is not available or not used by GDAL, all geometries
with bounding boxes that intersect this bbox will be returned.
`pyogrio.__gdal_geos_version__` will be `None` if GEOS is not detected.

## Filter records by a geometry

You can use the `mask` parameter to select only those features that intersect
with a Shapely (>= 2.0) geometry.

```python
>>> mask = shapely.Polygon(([-80,8], [-80, 10], [-85,10], [-85,8], [-80,8]))
>>> read_dataframe('ne_10m_admin_0_countries.shp', mask)
brendan-ward marked this conversation as resolved.
Show resolved Hide resolved
```

Note: the `mask` values must be in the same CRS as the dataset.

If your mask geometry is in some other representation, such as GeoJSON, you will
need to convert it to a Shapely geometry before using `mask`.

```python
>>> mask_geojson = '{"type":"Polygon","coordinates":[[[-80.0,8.0],[-80.0,10.0],[-85.0,10.0],[-85.0,8.0],[-80.0,8.0]]]}'
>>> mask = shapely.from_geojson(mask_geojson)
>>> read_dataframe('ne_10m_admin_0_countries.shp', mask)
brendan-ward marked this conversation as resolved.
Show resolved Hide resolved
```

Note: if GEOS is present and used by GDAL, only geometries that intersect `mask`
will be returned; if GEOS is not available or not used by GDAL, all geometries
with bounding boxes that intersect the bounding box of `mask` will be returned.
`pyogrio.__gdal_geos_version__` will be `None` if GEOS is not detected.

## Execute a sql query

You can use the `sql` parameter to execute a sql query on a dataset.
Expand Down Expand Up @@ -324,6 +350,7 @@ This function supports options to subset features from the dataset:
- `max_features`
- `where`
- `bbox`
- `mask`

## Write a GeoPandas GeoDataFrame

Expand Down
10 changes: 10 additions & 0 deletions docs/source/known_issues.md
Original file line number Diff line number Diff line change
Expand Up @@ -107,3 +107,13 @@ We recommend the following to sidestep performance issues:
- the `use_arrow=True` option may speed up reading from OSM files
- if possible, use a different tool such as `ogr2ogr` to translate the OSM
data source into a more performant format for reading by layer, such as GPKG

## Incorrect results when using a spatial filter and Arrow interface

Due to [a bug in GDAL](https://github.com/OSGeo/gdal/issues/8347), when using
the Arrow interface (e.g., via `use_arrow` on `read_dataframe`) certain drivers
(e.g., GPKG, FlatGeobuf, Arrow, Parquet) returned features whose bounding boxes
intersected the bounding box specified by `bbox` or `mask` geometry instead of
those whose geometry intersected the `bbox` or `mask`.

A fix is expected in GDAL 3.8.0.
61 changes: 52 additions & 9 deletions pyogrio/_io.pyx
Original file line number Diff line number Diff line change
Expand Up @@ -565,8 +565,8 @@ cdef apply_where_filter(OGRLayerH ogr_layer, str where):
raise ValueError(f"Invalid SQL query for layer '{OGR_L_GetName(ogr_layer)}': '{where}'")


cdef apply_spatial_filter(OGRLayerH ogr_layer, bbox):
"""Applies spatial filter to layer.
cdef apply_bbox_filter(OGRLayerH ogr_layer, bbox):
"""Applies bounding box spatial filter to layer.

Parameters
----------
Expand All @@ -586,6 +586,28 @@ cdef apply_spatial_filter(OGRLayerH ogr_layer, bbox):
OGR_L_SetSpatialFilterRect(ogr_layer, xmin, ymin, xmax, ymax)


cdef apply_geometry_filter(OGRLayerH ogr_layer, wkb):
"""Applies geometry spatial filter to layer.

Parameters
----------
ogr_layer : pointer to open OGR layer
wkb: WKB encoding of geometry
"""

cdef OGRGeometryH ogr_geometry = NULL
cdef unsigned char *wkb_buffer = wkb

err = OGR_G_CreateFromWkb(wkb_buffer, NULL, &ogr_geometry, len(wkb))
if err:
if ogr_geometry != NULL:
OGR_G_DestroyGeometry(ogr_geometry)
raise GeometryError("Could not create mask geometry") from None

OGR_L_SetSpatialFilter(ogr_layer, ogr_geometry)
OGR_G_DestroyGeometry(ogr_geometry)


cdef validate_feature_range(OGRLayerH ogr_layer, int skip_features=0, int max_features=0):
"""Limit skip_features and max_features to bounds available for dataset.

Expand Down Expand Up @@ -1001,6 +1023,7 @@ def ogr_read(
int max_features=0,
object where=None,
tuple bbox=None,
object mask=None,
object fids=None,
str sql=None,
str sql_dialect=None,
Expand All @@ -1021,10 +1044,10 @@ def ogr_read(
path_c = path_b

if fids is not None:
if where is not None or bbox is not None or sql is not None or skip_features or max_features:
if where is not None or bbox is not None or mask is not None or sql is not None or skip_features or max_features:
raise ValueError(
"cannot set both 'fids' and any of 'where', 'bbox', 'sql', "
"'skip_features' or 'max_features'"
"cannot set both 'fids' and any of 'where', 'bbox', 'mask', "
"'sql', 'skip_features' or 'max_features'"
)
fids = np.asarray(fids, dtype=np.intc)

Expand All @@ -1037,6 +1060,9 @@ def ogr_read(
"be None or non-empty"
)

if bbox and mask:
raise ValueError("cannot set both 'bbox' and 'mask'")

try:
dataset_options = dict_to_options(dataset_kwargs)
ogr_dataset = ogr_open(path_c, 0, dataset_options)
Expand Down Expand Up @@ -1107,7 +1133,10 @@ def ogr_read(

# Apply the spatial filter
if bbox is not None:
apply_spatial_filter(ogr_layer, bbox)
apply_bbox_filter(ogr_layer, bbox)

elif mask is not None:
apply_geometry_filter(ogr_layer, mask)

# Limit feature range to available range
skip_features, num_features = validate_feature_range(
Expand Down Expand Up @@ -1164,6 +1193,7 @@ def ogr_open_arrow(
int max_features=0,
object where=None,
tuple bbox=None,
object mask=None,
object fids=None,
str sql=None,
str sql_dialect=None,
Expand Down Expand Up @@ -1205,6 +1235,9 @@ def ogr_open_arrow(
"be None or non-empty"
)

if bbox and mask:
raise ValueError("cannot set both 'bbox' and 'mask'")

reader = None
try:
dataset_options = dict_to_options(dataset_kwargs)
Expand Down Expand Up @@ -1253,7 +1286,10 @@ def ogr_open_arrow(

# Apply the spatial filter
if bbox is not None:
apply_spatial_filter(ogr_layer, bbox)
apply_bbox_filter(ogr_layer, bbox)

elif mask is not None:
apply_geometry_filter(ogr_layer, mask)

# Limit to specified columns
if ignored_fields:
Expand Down Expand Up @@ -1331,7 +1367,8 @@ def ogr_read_bounds(
int skip_features=0,
int max_features=0,
object where=None,
tuple bbox=None):
tuple bbox=None,
object mask=None):

cdef int err = 0
cdef const char *path_c = NULL
Expand All @@ -1341,6 +1378,9 @@ def ogr_read_bounds(
cdef int feature_count = 0
cdef double xmin, ymin, xmax, ymax

if bbox and mask:
raise ValueError("cannot set both 'bbox' and 'mask'")

path_b = path.encode('utf-8')
path_c = path_b

Expand All @@ -1357,7 +1397,10 @@ def ogr_read_bounds(

# Apply the spatial filter
if bbox is not None:
apply_spatial_filter(ogr_layer, bbox)
apply_bbox_filter(ogr_layer, bbox)

elif mask is not None:
apply_geometry_filter(ogr_layer, mask)

# Limit feature range to available range
skip_features, num_features = validate_feature_range(ogr_layer, skip_features, max_features)
Expand Down
2 changes: 2 additions & 0 deletions pyogrio/_ogr.pxd
Original file line number Diff line number Diff line change
Expand Up @@ -255,6 +255,7 @@ cdef extern from "ogr_api.h":
void OGR_Fld_SetSubType(OGRFieldDefnH fielddefn, OGRFieldSubType subtype)

OGRGeometryH OGR_G_CreateGeometry(int wkbtypecode)
OGRErr OGR_G_CreateFromWkb(const void *bytes, OGRSpatialReferenceH srs, OGRGeometryH *geometry, int nbytes)
void OGR_G_DestroyGeometry(OGRGeometryH geometry)
void OGR_G_ExportToWkb(OGRGeometryH geometry, int endianness, unsigned char *buffer)
void OGR_G_GetEnvelope(OGRGeometryH geometry, OGREnvelope* envelope)
Expand Down Expand Up @@ -290,6 +291,7 @@ cdef extern from "ogr_api.h":
OGRErr OGR_L_SetNextByIndex(OGRLayerH layer, int nIndex)
int OGR_L_GetFeatureCount(OGRLayerH layer, int m)
void OGR_L_SetSpatialFilterRect(OGRLayerH layer, double xmin, double ymin, double xmax, double ymax)
void OGR_L_SetSpatialFilter(OGRLayerH layer, OGRGeometryH geometry)
OGRErr OGR_L_SetIgnoredFields(OGRLayerH layer, const char** fields)

void OGRSetNonLinearGeometriesEnabledFlag(int bFlag)
Expand Down
12 changes: 11 additions & 1 deletion pyogrio/core.py
Original file line number Diff line number Diff line change
@@ -1,5 +1,5 @@
from pyogrio._env import GDALEnv
from pyogrio.util import get_vsi_path, _preprocess_options_key_value
from pyogrio.util import get_vsi_path, _preprocess_options_key_value, _mask_to_wkb


with GDALEnv():
Expand Down Expand Up @@ -127,6 +127,7 @@ def read_bounds(
max_features=None,
where=None,
bbox=None,
mask=None,
):
"""Read bounds of each feature.

Expand Down Expand Up @@ -159,6 +160,14 @@ def read_bounds(
and used by GDAL, only geometries that intersect this bbox will be
returned; if GEOS is not available or not used by GDAL, all geometries
with bounding boxes that intersect this bbox will be returned.
mask : Shapely geometry, optional (default: None)
If present, will be used to filter records whose geometry intersects
this geometry. This must be in the same CRS as the dataset. If GEOS is
present and used by GDAL, only geometries that intersect this geometry
will be returned; if GEOS is not available or not used by GDAL, all
geometries with bounding boxes that intersect the bounding box of this
geometry will be returned. Requires Shapely >= 2.0.
Cannot be combined with ``bbox`` keyword.

Returns
-------
Expand All @@ -177,6 +186,7 @@ def read_bounds(
max_features=max_features or 0,
where=where,
bbox=bbox,
mask=_mask_to_wkb(mask),
)
finally:
if buffer is not None:
Expand Down
22 changes: 17 additions & 5 deletions pyogrio/geopandas.py
Original file line number Diff line number Diff line change
Expand Up @@ -37,6 +37,7 @@ def read_dataframe(
max_features=None,
where=None,
bbox=None,
mask=None,
fids=None,
sql=None,
sql_dialect=None,
Expand Down Expand Up @@ -95,13 +96,23 @@ def read_dataframe(
and used by GDAL, only geometries that intersect this bbox will be
returned; if GEOS is not available or not used by GDAL, all geometries
with bounding boxes that intersect this bbox will be returned.
Cannot be combined with ``mask`` keyword.
mask : Shapely geometry, optional (default: None)
If present, will be used to filter records whose geometry intersects
this geometry. This must be in the same CRS as the dataset. If GEOS is
present and used by GDAL, only geometries that intersect this geometry
will be returned; if GEOS is not available or not used by GDAL, all
geometries with bounding boxes that intersect the bounding box of this
geometry will be returned. Requires Shapely >= 2.0.
Cannot be combined with ``bbox`` keyword.
fids : array-like, optional (default: None)
Array of integer feature id (FID) values to select. Cannot be combined
with other keywords to select a subset (``skip_features``, ``max_features``,
``where``, ``bbox`` or ``sql``). Note that the starting index is driver and file
specific (e.g. typically 0 for Shapefile and 1 for GeoPackage, but can
still depend on the specific file). The performance of reading a large
number of features usings FIDs is also driver specific.
with other keywords to select a subset (``skip_features``,
``max_features``, ``where``, ``bbox``, ``mask``, or ``sql``). Note that
the starting index is driver and file specific (e.g. typically 0 for
Shapefile and 1 for GeoPackage, but can still depend on the specific
file). The performance of reading a large number of features usings FIDs
is also driver specific.
sql : str, optional (default: None)
The SQL statement to execute. Look at the sql_dialect parameter for
more information on the syntax to use for the query. When combined
brendan-ward marked this conversation as resolved.
Show resolved Hide resolved
Expand Down Expand Up @@ -179,6 +190,7 @@ def read_dataframe(
max_features=max_features,
where=where,
bbox=bbox,
mask=mask,
fids=fids,
sql=sql,
sql_dialect=sql_dialect,
Expand Down
Loading