Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

segmentation fault in st_contains spatial join #438

Open
johngcrowley opened this issue Oct 24, 2024 · 3 comments
Open

segmentation fault in st_contains spatial join #438

johngcrowley opened this issue Oct 24, 2024 · 3 comments

Comments

@johngcrowley
Copy link

johngcrowley commented Oct 24, 2024

Problem

Good morning, again!

When performing an ST_ContainsProperly or ST_Within of a centroid in another boundary (the centroids of a 152M record table compared against the large township boundaries of only a 2M record table) -- I'm getting a segmentation fault.

As an aside, I want to say that we are very close to getting DuckDB to work for processing all our spatial data. This spatial join is the last blocker. DuckDB is performing faster than PostGIS. The work done here has been a blessing. I am grateful for everyone here at DuckDB!

Traceback

segmentation fault

Steps to re-create

Running this query against a row table stored in .duckdb file, locally.

          select 
                 b.row_id,
                 t.township
          from parcel_batch_1 b
          join './section_township_range.parquet' t
          on st_containsproperly(st_centroid(b.geom), st_geomfromhexewkb(t.geom));
  • I've added an R-Tree index to the geom column of the large, 152M record table, parcel_batch_1.
  • I've added the query plan to this issue indicating the non-use of the R-tree index.
  • Query gets to ~49% completion before hitting segmentation fault.

Initial Query Plan

┌─────────────────────────────┐                                                                                                                                                              
│┌───────────────────────────┐│                                                                                                                                                              
││       Physical Plan       ││                                                                                                                                                              
│└───────────────────────────┘│                                                                                                                                                              
└─────────────────────────────┘                                                                                                                                                              
┌───────────────────────────┐                                                                                                                                                                
│         PROJECTION        │                                                                                                                                                                
│    ────────────────────   │                                                                                                                                                                
│            row_id         │                                                                                                                                                                                                                                                                                                                            
│          township         │                                                                                                                                                                                                                                                                                                                              
│                           │                                                                                                                                                                
│       ~25000000 Rows      │                                                                                                                                                                
└─────────────┬─────────────┘                                                                                                                                                                
┌─────────────┴─────────────┐                                                                                                                                                                
│           FILTER          │                                                                                                                                                                
│    ────────────────────   │                                                                                                                                                                
│    ST_ContainsProperly    │                                                                                                                                                                
│    (ST_Centroid(geom),    │                                                                                                                                                                
│  ST_GeomFromHEXEWKB(geom))│                                                                                                                                                                
│                           │                                                                                                                                                                
│       ~25000000 Rows      │                                                                                                                                                                
└─────────────┬─────────────┘                                                                                                                                                                
┌─────────────┴─────────────┐                                                                                                                                                                
│          IE_JOIN          │                                                                                                                                                                
│    ────────────────────   │                                                                                                                                                                
│      Join Type: INNER     │             
│                           │                                                                                                                                                                
│        Conditions:        │                                                                                                                                                                
│     ST_XMin(ST_Extent     │                                                                                                                                                                
│  (ST_Centroid(geom))) <=  │                                                                                                                                                                
│      ST_XMax(ST_Extent    │                                                                                                                                                                
│(ST_GeomFromHEXEWKB(geom)))│                                                                                                                                                                
│     ST_XMax(ST_Extent     │                                                                                                                                                                
│  (ST_Centroid(geom))) >=  │                                                                                                                                                                
│      ST_XMin(ST_Extent    ├──────────────┐                                                                                                                                                 
│(ST_GeomFromHEXEWKB(geom)))│              │                                                                                                                                                 
│     ST_YMin(ST_Extent     │              │                                                                                                                                                 
│  (ST_Centroid(geom))) <=  │              │                                                                                                                                                 
│      ST_YMax(ST_Extent    │              │
│(ST_GeomFromHEXEWKB(geom)))│              │
│     ST_YMax(ST_Extent     │              │
│  (ST_Centroid(geom))) >=  │              │
│      ST_YMin(ST_Extent    │              │
│(ST_GeomFromHEXEWKB(geom)))│              │
│                           │              │
│       ~25000000 Rows      │              │
└─────────────┬─────────────┘              │
┌─────────────┴─────────────┐┌─────────────┴─────────────┐
│         SEQ_SCAN          ││       PARQUET_SCAN        │
│    ────────────────────   ││    ────────────────────   │
│       parcel_batch_1      ││         Function:         │
│                           ││        PARQUET_SCAN       │
│        Projections:       ││                           │
│            geom           ││        Projections:       │
│            row_id         ││            geom           │
│                           ││          township         │
│                           ││                           │
│       ~25000000 Rows      ││       ~2379868 Rows       │
└───────────────────────────┘└───────────────────────────┘

Alternate Attempt (no segmentation fault, but query hangs at 50%)

  • I then created a centroid column on parcel_batch_1 so the st_centroid needn't be calculated in the join condition, and materialized the Parquet file as a table. I added an R-Tree index on both parcel_batch_1.centroid and the geom column of the township (formerly the Parquet file) table. Still, no usage in query plan.
  • I changed ST_ContainsProperly to ST_Contains, as well.
  • Memory is under control, threads being used:
    image

Environment:

 % uname -a
Linux john-XPS-15-9520 6.8.0-40-generic #40~22.04.3-Ubuntu SMP PREEMPT_DYNAMIC Tue Jul 30 17:30:19 UTC 2 x86_64 x86_64 x86_64 GNU/Linux

 % duckdb --version
v1.1.2 f680b7d08f

My Machine:

  • 64GiB Memory
  • 20 CPU
  • Dell XPS
@Maxxen
Copy link
Member

Maxxen commented Oct 27, 2024

Hi! Happy to hear DuckDB spatial is useful for you and thanks for opening this issue!

R-Tree indexes are not used for spatial joins yet, only when doing spatial filters, so it makes sense that you wouldn't see it used in the plan. Adding support for joins is of course high on the TODO list but I can't provide an exact timeline.

All the functions you use, ST_ContainsProperly, ST_Contains, ST_Within ST_Centroid etc. are all implemented by calling into the GEOS library, which will allocate memory outside of DuckDB's control. This is usually the reason why larger spatial joins/queries fail, but since you seem to be well within the memory bounds and the process doesnt get killed by the oom-killer, there might be a bug somewhere in either our code or GEOS. Is there any way you could share the data? Alternatively try to isolate which geometries caused the crash (e.g. by recursively dividing the datasets in half and still see if the segmentation fault occurs, and if so, in which half)?

If you convert and store the geometries into a table first (so that you don't have to call ST_GeomFromHEXEWKB in the main query), does it still crash?

@johngcrowley
Copy link
Author

@Maxxen, thank you for your reply. Unfortunately, I can not share the proprietary data, but the suggestions were right-on:

  • I pre-processed the ST_GeomFromHEXEWKB() column so i was only performing an st_within(), not calcuating the geometry as well. I was able to complete this query where the larger table was 152M rows and the smaller table was 1.5M. The maximum RAM usage was 40GiB. The query completed successfully! Thank you.

Yet, in another query, using the same data from the first query, I'm trying to perform a spatial join between a 142M row subset table and a 3M row subset table. Same geometries. Using the same Spatial extension functions as the first, passing query:

select ....
from 152_M_row_table large
join 3_M_row_table small
on st_within(geom_centroid, geom)

and it is using up to 57GiB RAM, but eventually hitting the initial segmentation fault.

I then tried dialing down the precision of my spatial join with a:

select ....
from 152_M_row_table large
join 3_M_row_table small
on st_intersects(geom_centroid, geom_envelope) 
  • whereby the geometry is just the calculated, st_envelope (minimum bounding box) value
  • still gets terminated with a segmentation fault.

Realizing I only really care how records match to the larger table, I changed the join to a left join, which has now shifted from an IE Join to an NL Blockwise Join, even with set prefer_range_joins=true;. Good news: the query plans total rows are the same, but I'm only using up to 9GiB of RAM, now. Not sure what the total run time will be, now, but I'll update here.

I really appreciate you taking the time to answer my questions, and the work you're doing to improve the Spatial extension.

@johngcrowley
Copy link
Author

Looking at it more:

  • the geometries in this second case have more vertices, are more complex.
  • When i run this in PostgreSQL, and add an additional joinfilter for matching on state, the query planner winnows to only try joining geometries that first match on state. However, when i try to replicate that filter condition (on state) in DuckDB, it explodes the join into trillions of rows in the explain.
  • wonder if reducing precision helps, meaningfully here.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

2 participants