Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Temp files not deleted on save for spatial addon #439

Open
2 tasks done
sehHeiden opened this issue Oct 24, 2024 · 3 comments
Open
2 tasks done

Temp files not deleted on save for spatial addon #439

sehHeiden opened this issue Oct 24, 2024 · 3 comments

Comments

@sehHeiden
Copy link

What happens?

I tried to benchmark the spatial Plugin.

With hyperfine I did 13 repetitions.

Each creating a fgb-file.
The open space on drive C: was reduced heavily. Benchmark crashed, when no save left.

Found out that on the Temp dir a copy of each output run was stored und never freed.

See Code:

https://github.com/sehHeiden/geospeed/blob/master/geospeed/duckdb_speed.py

To Reproduce

"""Test the execution speed for an equivalent of Geopandas Overlay with intersection using a spatial index."""

import time
from pathlib import Path

import duckdb

save_name = None
con = duckdb.connect(save_name if save_name else ":memory:", config={"threads": 15, "memory_limit": "20GB"})
con.install_extension("spatial")
con.load_extension("spatial")

con.sql("DROP TABLE IF EXISTS buildings")
con.sql("DROP TABLE IF EXISTS parcels")
con.sql("DROP TABLE IF EXISTS buildings_intersection")
con.sql("DROP VIEW IF EXISTS intersections")

Check supported formats for writing

formats = con.sql("SELECT short_name FROM ST_Drivers() WHERE can_create;").fetchall()
print(f"Writable formats: {formats}")
start = time.time()

shapefile_dir = Path("./ALKIS") # Base directory
building_files = list(shapefile_dir.glob("/GebauedeBauwerk.shp")) # Glob pattern for the subdirectories
parcel_files = list(shapefile_dir.glob("
/NutzungFlurstueck.shp")) # Glob pattern for the subdirectories

Error handling if no files are found

if not building_files or not parcel_files:
load_error_txt = "No shapefiles found in the provided directory."
raise FileNotFoundError(load_error_txt)

Create then insert

con.sql(f"CREATE TABLE buildings AS SELECT * FROM ST_Read('{building_files[0].resolve()!s}');") # noqa: S608
con.sql(f"CREATE TABLE parcels AS SELECT * FROM ST_Read('{parcel_files[0].resolve()!s}');") # noqa: S608
con.execute(
"""PREPARE insert_buildings_stmt AS
INSERT INTO buildings SELECT *
FROM ST_Read($1)
WHERE oid NOT IN (SELECT oid FROM parcels);"""
)
con.execute(
"""PREPARE insert_parcels_stmt AS
INSERT INTO parcels SELECT *
FROM ST_Read($1)
WHERE oid NOT IN(SELECT oid FROM parcels);"""
)

Iterate over the found shapefiles and load them into DuckDB

for building_file, parcel_file in zip(building_files[1:], parcel_files[1:], strict=True):
# Insert into the existing table for subsequent shapefiles
con.execute(f"EXECUTE insert_buildings_stmt('{building_file.resolve()!s}')")
con.execute(f"EXECUTE insert_parcels_stmt('{parcel_file.resolve()!s}')")

Make the data valid

Make geometries valid

con.sql("""
UPDATE buildings SET geom = ST_MakeValid(geom) WHERE NOT ST_IsValid(geom);
UPDATE parcels SET geom = ST_MakeValid(geom) WHERE NOT ST_IsValid(geom);
""")

create indexes

con.sql("CREATE INDEX buildings_idx ON buildings USING RTREE (geom);")
con.sql("CREATE INDEX parcels_idx ON parcels USING RTREE (geom);")
print(f"DuckDB: Loading data takes: {(time.time() - start):.0f} s.")

Perform intersection

time_intersection = time.time()
con.sql("""
CREATE TABLE buildings_intersection AS
SELECT ST_Intersection(buildings.geom, parcels.geom) as geom,
buildings.oid AS building_oid,
parcels.oid AS parcel_oid
FROM buildings, parcels
WHERE ST_Intersects(buildings.geom, parcels.geom);
""")

Drop the indexes and unnecessary columns

con.sql("""
DROP INDEX buildings_idx;
DROP INDEX parcels_idx;
ALTER TABLE buildings DROP COLUMN geom;
ALTER TABLE parcels DROP COLUMN geom;
""")

Create final intersections table

con.sql("""
CREATE VIEW intersections AS
SELECT *
FROM buildings_intersection AS bi
JOIN buildings AS bs ON bi.building_oid = bs.oid
JOIN parcels AS ps ON bi.parcel_oid = ps.oid;
""")
con.sql("""UPDATE buildings_intersection SET geom = ST_MakeValid(geom) WHERE NOT ST_IsValid(geom);""")
print(f"DuckDB: Intersection takes: {(time.time() - time_intersection):.0f} s.")

if not save_name:
# Save the result to a file
time_writing = time.time()
con.sql("""
COPY(SELECT * EXCLUDE geom, ST_AsWKB(geom) AS geometry
FROM intersections
WHERE ST_IsValid(geom) AND NOT ST_IsEmpty(geom))
TO 'buildings_with_parcels.fgb' WITH(FORMAT GDAL, DRIVER 'FlatGeobuf', SRS 'EPSG:25833')""")
print(f"DuckDB: Saving takes: {(time.time() - time_writing):.0f} s.")

print(f"DuckDB: Total duration: {(time.time() - start):.0f} s.")

OS:

Windows

DuckDB Version:

1.1

DuckDB Client:

Python

Hardware:

30 GB free on C:, Ryzen 5800x 40 GB RAM

Full Name:

Sebastian Heiden

Affiliation:

Delphi-IMM, But this usage of DuckDB is private

What is the latest build you tested with? If possible, we recommend testing with the latest nightly build.

I have tested with a stable release

Did you include all relevant data sets for reproducing the issue?

Yes

Did you include all code required to reproduce the issue?

  • Yes, I have

Did you include all relevant configuration (e.g., CPU architecture, Python version, Linux distribution) to reproduce the issue?

  • Yes, I have
@duckdblabs-bot
Copy link

Thanks for opening this issue! Based on our automated check, it seems that your post contains some code but it does not use code blocks to format it.

Please double-check your post and revise it if necessary. To employ syntax highlighting, it's recommended to use code blocks with triple backticks, e.g.:

```sql
SELECT ...
```

If this is a false positive, feel free to disregard this comment.

@szarnyasg szarnyasg transferred this issue from duckdb/duckdb Oct 25, 2024
@Maxxen
Copy link
Member

Maxxen commented Oct 27, 2024

Hi!

DuckDB spatial (like almost all geospatial libraries) uses GDAL to write FGB files. We don't know where, how and when GDAL creates temporary files so it's not really feasible for us to clean them up after a crash.

With that said I also don't think you will see any meaningful differences between DuckDBs FGB output performance compared to other systems, as we all use GDAL under the hood. In essence, you'd only be benchmarking how fast DuckDB can convert geometries into GDAL features, but thats not making the most out of duckdb as it is a single-threaded row-by-row process. I'd actually expect e.g. Geopandas to be a lot faster at doing that now because they can pass geoarrow arrays directly to GDAL while we still construct the geometries one at a time.

@sehHeiden
Copy link
Author

I want to make clear, that these temporary files are neither deleted upon crash nor on happy path of hyperfine.

But in the end it should not matter, as when hyperfine crashes on the x-th run, than the python code closed for each previous run successfully.

To the off-top of profiling, fgb. As opening and saving files are an integral part of the analysis, I need to be profiled two. I only used fgb, as I could not write to gpkg. I also wrote to geoparquet and that was really fast, I think about 30 % even faster than GeoPandas.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

3 participants