BUG: (sometimes) error in read on .gpkg when using a bbox filter #149

Closed
theroggy opened this issue Aug 19, 2022 · 6 comments · Fixed by #150

Comments

@theroggy
Member

theroggy commented Aug 19, 2022

When reading a .gpkg with a bbox filter, some read operations trigger the following error:
pyogrio.errors.FeatureError: Could not read feature at index ???

After investigating a bit, it seems that for .gpkg files `OGR_L_GetFeatureCount` returns the number of results from a "coarse" bbox comparison (probably it just uses the spatial index), but when fetching the results with `OGR_L_GetNextFeature`, GDAL does a more thorough check (probably an actual intersection test), so fewer features may be returned than counted.
Because the result of `OGR_L_GetFeatureCount` is used to determine the range of the loop in `_io.pyx::get_features()`, this results in the error above.

For some other file types (.geojson, shapefile) this doesn't seem to be an issue at first sight.

The GDAL documentation for `OGR_L_GetFeatureCount` also hints that its result is not 100% reliable: "For dynamic databases the count may not be exact."
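A minimal sketch of how to observe the mismatch with the GDAL Python bindings (the file name and bbox here are hypothetical placeholders; pick a bbox that intersects a feature's envelope but not the geometry itself):

```python
from osgeo import ogr

# Hypothetical file and bbox for illustration.
ds = ogr.Open("example.gpkg")
layer = ds.GetLayer(0)
layer.SetSpatialFilterRect(4.0, 50.0, 4.5, 50.5)

# Forced count: for GPKG this appears to be based on the coarse
# (envelope / spatial index) comparison.
count = layer.GetFeatureCount(force=1)

# Iterating applies the stricter intersection test, so it can yield
# fewer features than `count`.
fetched = 0
feature = layer.GetNextFeature()
while feature is not None:
    fetched += 1
    feature = layer.GetNextFeature()

print(count, fetched)  # fetched can be < count for .gpkg
```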

@brendan-ward
Member

Is this consistently reproducible for the same GPKG? If this is reproducible, perhaps characteristics of the failing record may help us here. What version of GDAL is failing?

I had interpreted "dynamic" to mean that the actual set of records within the database may be changing when running the feature count or iterating over features; I assume that is not the case here?

It is possible that the bounding box of a given geometry intersects that of the spatial filter and yet the actual geometry does not intersect the filter box. But I'm surprised we haven't seen errors from this before, though maybe the spatial filter has not yet been used extensively in practice for pyogrio (I rarely use it), and this sort of case is not sufficiently covered in our tests.

We do need an accurate count in advance in order to set up the arrays to receive the data. We currently set the `bForce` flag in `OGR_L_GetFeatureCount` to get this even if it is expensive, so I'm unsure what else we could do to get an accurate count.
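For context, the read loop depends on the count roughly like this (a simplified sketch of the pattern, not pyogrio's actual Cython code):

```python
import numpy as np
from osgeo import ogr

def read_fids(layer: ogr.Layer) -> np.ndarray:
    # Preallocate based on the forced feature count, then fill by iterating.
    count = layer.GetFeatureCount(force=1)
    fids = np.empty(count, dtype=np.int64)
    for i in range(count):
        feature = layer.GetNextFeature()
        # If the count was an overestimate (as for .gpkg with a bbox
        # filter), `feature` is None here and the read fails.
        if feature is None:
            raise RuntimeError(f"Could not read feature at index {i}")
        fids[i] = feature.GetFID()
    return fids
```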

@theroggy
Member Author

> Is this consistently reproducible for the same GPKG? If this is reproducible, perhaps characteristics of the failing record may help us here.

Yes. As I mentioned above, the problem for the case I encountered is

> After investigating a bit, it seems that for .gpkg files `OGR_L_GetFeatureCount` returns the number of results from a "coarse" bbox comparison (probably it just uses the spatial index), but when fetching the results with `OGR_L_GetNextFeature`, GDAL does a more thorough check (probably an actual intersection test), so fewer features may be returned than counted.

In the meantime, I've also reproduced it in the pyogrio tests: if I enlarge the bbox in the test cases that use a bbox, so that another country's MBR falls inside the bbox while the geometry itself stays outside it, the same issue surfaces.

> What version of GDAL is failing?

I'm running GDAL 3.5.1.

> I had interpreted "dynamic" to mean that the actual set of records within the database may be changing when running the feature count or iterating over features; I assume that is not the case here?

Yes, understandable... but that's not the case here.

> It is possible that the bounding box of a given geometry intersects that of the spatial filter and yet the actual geometry does not intersect the filter box. But I'm surprised we haven't seen errors from this before, though maybe the spatial filter has not yet been used extensively in practice for pyogrio (I rarely use it), and this sort of case is not sufficiently covered in our tests.

> We do need an accurate count in advance in order to set up the arrays to receive the data. We currently set the `bForce` flag in `OGR_L_GetFeatureCount` to get this even if it is expensive, so I'm unsure what else we could do to get an accurate count.

I'm looking into writing a fix as we speak, and indeed I saw that the code relies heavily on the count being available beforehand.

If the count isn't known beforehand:

  • the double filtering that is most likely happening (once for the count, once when fetching rows) would be eliminated
  • but the memory allocation would then need to happen row per row, which would have quite an impact on the code + would need quite some testing + tuning, because (re)allocating many numpy arrays is slow (as sketched below)... so another approach would be needed, I think.
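A toy illustration of why row-per-row growth is costly (a hypothetical micro-benchmark, not pyogrio code): each `np.append` copies the whole array, so the loop becomes quadratic, whereas preallocation fills in linear time.

```python
import time
import numpy as np

n = 50_000

# Preallocate once: O(n)
start = time.perf_counter()
arr = np.empty(n, dtype=np.int64)
for i in range(n):
    arr[i] = i
print("preallocated:", time.perf_counter() - start)

# Grow row per row: every np.append copies the array, so O(n^2)
start = time.perf_counter()
arr = np.empty(0, dtype=np.int64)
for i in range(n):
    arr = np.append(arr, i)
print("row per row:", time.perf_counter() - start)
```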

So, to minimize the impact I'm keeping the count for now...

@brendan-ward
Member

Thanks for looking further into this!

It sounds like the count via `OGR_L_GetFeatureCount` should thus be >= the true count of features that we iterate over, and all valid features would be read before an iteration where the feature is returned as NULL. In that case, we can still use the count to set up the arrays to read into, but then gracefully stop once `OGR_L_GetNextFeature` returns NULL. Then we can return a slice of the original arrays.

@theroggy
Member Author

> Thanks for looking further into this!

> It sounds like the count via `OGR_L_GetFeatureCount` should thus be >= the true count of features that we iterate over, and all valid features would be read before an iteration where the feature is returned as NULL. In that case, we can still use the count to set up the arrays to read into, but then gracefully stop once `OGR_L_GetNextFeature` returns NULL. Then we can return a slice of the original arrays.

Indeed, that's how I implemented the fix. I also added an explicit check: if more rows turn out to be available than predicted by `OGR_L_GetFeatureCount`, a clear error is raised.
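A simplified sketch of that approach (not the actual `_io.pyx` implementation): treat the count as an upper bound, stop at NULL, return a slice, and fail loudly if the count was an underestimate.

```python
import numpy as np
from osgeo import ogr

def read_fids(layer: ogr.Layer) -> np.ndarray:
    # The forced count is treated as an upper bound, not an exact size.
    count = layer.GetFeatureCount(force=1)
    fids = np.empty(count, dtype=np.int64)

    i = 0
    feature = layer.GetNextFeature()
    while feature is not None:
        if i >= count:
            # More rows than predicted: raise a clear error instead of
            # silently dropping features.
            raise RuntimeError(
                "GetFeatureCount returned fewer rows than iteration yields"
            )
        fids[i] = feature.GetFID()
        i += 1
        feature = layer.GetNextFeature()

    # Fewer features than counted (the .gpkg case): return a slice.
    return fids[:i]
```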

@theroggy
Member Author

I had a look at the time it takes to execute `OGR_L_GetFeatureCount`. Also useful to note: apparently `validate_feature_range` runs another `OGR_L_GetFeatureCount` every time.

Logically, it depends on the file format you are reading from + the availability of a spatial index. I tested it on 3 different but similar files with 550,000 polygons of ~300 MB (including attributes), where ~1000 rows were selected using the bbox:

| format | read | OGR_L_GetFeatureCount 1 | OGR_L_GetFeatureCount 2 | total time |
| --- | --- | --- | --- | --- |
| shp, no index | 1st | 10.5 s | 2.9 s | 26.5 s |
| shp, no index | 2nd | 5.3 s | 2.9 s | 14.7 s |
| shp, index | 1st | 6.2 s | 0.01 s | 15.8 s |
| shp, index | 2nd | 0.55 s | 0.005 s | 1.8 s |
| gpkg, index | 1st | 1.5 s | 0.001 s | 14 s |
| gpkg, index | 2nd | 0.03 s | 0.001 s | 1.7 s |
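Timings of this sort can be gathered with something like the following (a hedged sketch; the file name and bbox are placeholders, not the actual test data):

```python
import time
from osgeo import ogr

ds = ogr.Open("parcels.gpkg")  # hypothetical test file
layer = ds.GetLayer(0)
layer.SetSpatialFilterRect(150000, 170000, 160000, 180000)

start = time.perf_counter()
count = layer.GetFeatureCount(force=1)
print(f"GetFeatureCount: {count} features in {time.perf_counter() - start:.3f}s")
```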

Conclusion?

  • for files having a spatial index it probably doesn't really matter. The second `OGR_L_GetFeatureCount` is almost instantaneous, so if the count(s) were eliminated, the total time would most likely barely be affected, because once the index and the relevant data are in RAM, the count/filtering is very fast.
  • for (shp) files without a spatial index there is a difference: each count probably adds an overhead of 3 s, so 6 seconds could theoretically be saved... Then again, if you care about (filter) speed, it is logical to use a spatial index... so it doesn't feel like a use case worth putting effort into.

@brendan-ward
Member

Thanks for the additional investigation and good catch on the performance impact of getting feature count twice when there are filters applied.

I think we should consolidate the use of `OGR_L_GetFeatureCount` so that we call it no more than once. Just added #151 for that.
