BUG: (sometimes) error in read on .gpkg when using a bbox filter #149

Closed
theroggy opened this issue Aug 19, 2022 · 6 comments · Fixed by #150

Comments

@theroggy
Member

theroggy commented Aug 19, 2022

When reading a .gpkg with a bbox filter, some read operations trigger the following error:
pyogrio.errors.FeatureError: Could not read feature at index ???

After investigating a bit, it seems that for .gpkg files `OGR_L_GetFeatureCount` returns the number of results from a "coarse" bbox comparison (probably it just uses the spatial index), but when fetching the results with `OGR_L_GetNextFeature`, GDAL does a more thorough check (probably an actual intersection test), so fewer features may be returned than counted.
Because the result of `OGR_L_GetFeatureCount` is used to determine the range of the loop in `_io.pyx::get_features()`, this results in the error above.

For some other file types (.geojson, shapefile) this doesn't seem to be an issue at first sight.

The GDAL documentation for `OGR_L_GetFeatureCount` also hints that its result is not 100% reliable: "For dynamic databases the count may not be exact."
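A minimal sketch of how to observe the mismatch with the GDAL Python bindings (the file name and bbox here are hypothetical placeholders; pick a bbox that intersects a feature's envelope but not the geometry itself):

```python
from osgeo import ogr

# Hypothetical file and bbox for illustration.
ds = ogr.Open("example.gpkg")
layer = ds.GetLayer(0)
layer.SetSpatialFilterRect(4.0, 50.0, 4.5, 50.5)

# Forced count: for GPKG this appears to be based on the coarse
# (envelope / spatial index) comparison.
count = layer.GetFeatureCount(force=1)

# Iterating applies the stricter intersection test, so it can yield
# fewer features than `count`.
fetched = 0
feature = layer.GetNextFeature()
while feature is not None:
    fetched += 1
    feature = layer.GetNextFeature()

print(count, fetched)  # fetched can be < count for .gpkg
```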

@brendan-ward
Member

Is this consistently reproducible for the same GPKG? If this is reproducible, perhaps characteristics of the failing record may help us here. What version of GDAL is failing?

I had interpreted "dynamic" to mean that the actual set of records within the database may be changing when running the feature count or iterating over features; I assume that is not the case here?

It is possible that the bounding box of a given geometry intersects that of the spatial filter and yet the actual geometry does not intersect the filter box. But I'm surprised we haven't seen errors from this before, though maybe the spatial filter has not yet been used extensively in practice for pyogrio (I rarely use it), and this sort of case is not sufficiently covered in our tests.

We do need an accurate count in advance in order to set up the arrays to receive the data. We currently set the `bForce` flag in `OGR_L_GetFeatureCount` to get this even if it is expensive, so I'm unsure what else we could do to get an accurate count.
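For context, the read loop depends on the count roughly like this (a simplified sketch of the pattern, not pyogrio's actual Cython code):

```python
import numpy as np
from osgeo import ogr

def read_fids(layer: ogr.Layer) -> np.ndarray:
    # Preallocate based on the forced feature count, then fill by iterating.
    count = layer.GetFeatureCount(force=1)
    fids = np.empty(count, dtype=np.int64)
    for i in range(count):
        feature = layer.GetNextFeature()
        # If the count was an overestimate (as for .gpkg with a bbox
        # filter), `feature` is None here and the read fails.
        if feature is None:
            raise RuntimeError(f"Could not read feature at index {i}")
        fids[i] = feature.GetFID()
    return fids
```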

@theroggy
Member Author

> Is this consistently reproducible for the same GPKG? If this is reproducible, perhaps characteristics of the failing record may help us here.

Yes. As I mentioned above, the problem for the case I encountered is

> After investigating a bit, it seems that for .gpkg files `OGR_L_GetFeatureCount` returns the number of results from a "coarse" bbox comparison (probably it just uses the spatial index), but when fetching the results with `OGR_L_GetNextFeature`, GDAL does a more thorough check (probably an actual intersection test), so fewer features may be returned than counted.

In the meantime, I've also reproduced it in the pyogrio tests: if I enlarge the bbox in the test cases that use a bbox, so that another country's MBR falls inside the bbox while the geometry itself stays outside it, the same issue surfaces.

> What version of GDAL is failing?

I'm running GDAL 3.5.1.

> I had interpreted "dynamic" to mean that the actual set of records within the database may be changing when running the feature count or iterating over features; I assume that is not the case here?

Yes, understandable... but that's not the case here.

> It is possible that the bounding box of a given geometry intersects that of the spatial filter and yet the actual geometry does not intersect the filter box. But I'm surprised we haven't seen errors from this before, though maybe the spatial filter has not yet been used extensively in practice for pyogrio (I rarely use it), and this sort of case is not sufficiently covered in our tests.

> We do need an accurate count in advance in order to set up the arrays to receive the data. We currently set the `bForce` flag in `OGR_L_GetFeatureCount` to get this even if it is expensive, so I'm unsure what else we could do to get an accurate count.

I'm looking into writing a fix as we speak, and indeed I saw that the code relies heavily on the count being available beforehand.

If the count isn't known beforehand:

  • the double filtering that is most likely happening (once for the count, once when fetching rows) would be eliminated
  • but the memory allocation would then need to happen row per row, which would have quite an impact on the code + would need quite some testing + tuning, because (re)allocating many numpy arrays is slow (as sketched below)... so another approach would be needed, I think.
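A toy illustration of why row-per-row growth is costly (a hypothetical micro-benchmark, not pyogrio code): each `np.append` copies the whole array, so the loop becomes quadratic, whereas preallocation fills in linear time.

```python
import time
import numpy as np

n = 50_000

# Preallocate once: O(n)
start = time.perf_counter()
arr = np.empty(n, dtype=np.int64)
for i in range(n):
    arr[i] = i
print("preallocated:", time.perf_counter() - start)

# Grow row per row: every np.append copies the array, so O(n^2)
start = time.perf_counter()
arr = np.empty(0, dtype=np.int64)
for i in range(n):
    arr = np.append(arr, i)
print("row per row:", time.perf_counter() - start)
```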

So, to minimize the impact I'm keeping the count for now...

@brendan-ward
Member

Thanks for looking further into this!

It sounds like the count via `OGR_L_GetFeatureCount` should thus be >= the true count of features that we iterate over, and all valid features would be read before an iteration where the feature is returned as NULL. In that case, we can still use the count to set up the arrays to read into, but then gracefully stop once `OGR_L_GetNextFeature` returns NULL. Then we can return a slice of the original arrays.

@theroggy
Member Author

> Thanks for looking further into this!

> It sounds like the count via `OGR_L_GetFeatureCount` should thus be >= the true count of features that we iterate over, and all valid features would be read before an iteration where the feature is returned as NULL. In that case, we can still use the count to set up the arrays to read into, but then gracefully stop once `OGR_L_GetNextFeature` returns NULL. Then we can return a slice of the original arrays.

Indeed, that's how I implemented the fix. I also added an explicit check: if more rows turn out to be available than predicted by `OGR_L_GetFeatureCount`, a clear error is raised.
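A simplified sketch of that approach (not the actual `_io.pyx` implementation): treat the count as an upper bound, stop at NULL, return a slice, and fail loudly if the count was an underestimate.

```python
import numpy as np
from osgeo import ogr

def read_fids(layer: ogr.Layer) -> np.ndarray:
    # The forced count is treated as an upper bound, not an exact size.
    count = layer.GetFeatureCount(force=1)
    fids = np.empty(count, dtype=np.int64)

    i = 0
    feature = layer.GetNextFeature()
    while feature is not None:
        if i >= count:
            # More rows than predicted: raise a clear error instead of
            # silently dropping features.
            raise RuntimeError(
                "GetFeatureCount returned fewer rows than iteration yields"
            )
        fids[i] = feature.GetFID()
        i += 1
        feature = layer.GetNextFeature()

    # Fewer features than counted (the .gpkg case): return a slice.
    return fids[:i]
```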

@theroggy
Member Author

I had a look at the time it takes to execute `OGR_L_GetFeatureCount`. Also useful to note: apparently `validate_feature_range` runs another `OGR_L_GetFeatureCount` every time.

Logically, it depends on the file format you are reading from + the availability of a spatial index. I tested it on 3 different but similar files with 550,000 polygons of ~300 MB (including attributes), where ~1000 rows were selected using the bbox:

| format | read | OGR_L_GetFeatureCount 1 | OGR_L_GetFeatureCount 2 | total time |
| --- | --- | --- | --- | --- |
| shp, no index | 1st | 10.5 s | 2.9 s | 26.5 s |
| shp, no index | 2nd | 5.3 s | 2.9 s | 14.7 s |
| shp, index | 1st | 6.2 s | 0.01 s | 15.8 s |
| shp, index | 2nd | 0.55 s | 0.005 s | 1.8 s |
| gpkg, index | 1st | 1.5 s | 0.001 s | 14 s |
| gpkg, index | 2nd | 0.03 s | 0.001 s | 1.7 s |
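Timings of this sort can be gathered with something like the following (a hedged sketch; the file name and bbox are placeholders, not the actual test data):

```python
import time
from osgeo import ogr

ds = ogr.Open("parcels.gpkg")  # hypothetical test file
layer = ds.GetLayer(0)
layer.SetSpatialFilterRect(150000, 170000, 160000, 180000)

start = time.perf_counter()
count = layer.GetFeatureCount(force=1)
print(f"GetFeatureCount: {count} features in {time.perf_counter() - start:.3f}s")
```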

Conclusion?

  • for files having a spatial index it probably doesn't really matter. The second `OGR_L_GetFeatureCount` is almost instantaneous, so if the count(s) were eliminated, the total time would most likely barely be affected, because once the index and the relevant data are in RAM, the count/filtering is very fast.
  • for (shp) files without a spatial index there is a difference: each count probably adds an overhead of 3 s, so 6 seconds could theoretically be saved... Then again, if you care about (filter) speed, it is logical to use a spatial index... so it doesn't feel like a use case worth putting effort into.

@brendan-ward
Member

Thanks for the additional investigation and good catch on the performance impact of getting feature count twice when there are filters applied.

I think we should consolidate the use of `OGR_L_GetFeatureCount` so that we call it no more than once. Just added #151 for that.
