Encoding issue with geospatial data (shapefile) #394
Comments
Hi! Thanks for opening this issue! Unfortunately, DuckDB does not support encodings other than UTF-8. Even though st_read uses GDAL under the hood, I think the issue is that we don't bundle the (optional) iconv library that gives GDAL the capability to re-encode text. We are trying to reduce the number of dependencies in the spatial extension, so it is unlikely this use case will ever be supported.
That's bad news, as there are many shapefiles with exotic encodings in the wild :) Could we at least get garbage text fields instead of a fatal error? In many cases the accented characters are in columns that are not even used in the dataflow. Currently
In other tools, encoding is often an issue, but it does not cause a critical error.
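To illustrate the request above, here is a small Python sketch (not DuckDB or GDAL code) of the difference between failing on non-UTF-8 bytes and degrading to "garbage" text. The sample string and the helper names are made up for illustration; only the standard library is used.

```python
# Illustration only: CP1252 bytes handled three different ways.
# The sample value and function names are hypothetical.

raw = "Valérian".encode("cp1252")  # 'é' becomes the single byte 0xE9


def decode_strict(data: bytes) -> str:
    """Treat the bytes as UTF-8 and fail on anything invalid."""
    return data.decode("utf-8")


def decode_lossy(data: bytes) -> str:
    """Treat the bytes as UTF-8, replacing invalid bytes with U+FFFD."""
    return data.decode("utf-8", errors="replace")


def decode_declared(data: bytes, encoding: str = "cp1252") -> str:
    """Decode with the encoding the data was actually written in."""
    return data.decode(encoding)


if __name__ == "__main__":
    try:
        decode_strict(raw)
    except UnicodeDecodeError as exc:
        print("strict decode failed:", exc.reason)
    print("lossy decode:", decode_lossy(raw))
    print("declared-encoding decode:", decode_declared(raw))
```

The lossy variant corresponds to the "garbage text fields" behaviour asked for here: the row still loads, and only the accented character is lost.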
So spatial has its own experimental shapefile reader,
Using the experimental
OSGeo/gdal#10799 should improve that situation.
What happens?
When trying to import a shapefile encoded with CP1252, I get the following error:
InvalidInputException: Invalid Input Error: Invalid unicode (byte sequence mismatch) detected in segment statistics update
I tried various options in st_read, but without success.
A current workaround is to first convert the shapefile to GeoParquet with ogr2ogr and then import the GeoParquet file.
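A minimal sketch of that workaround, driven from Python, could look like the following. It assumes an ogr2ogr binary on the PATH from a GDAL build that includes the Parquet driver, and the duckdb Python package. The helper names and the output path are hypothetical. Passing the Shapefile driver's ENCODING open option is also an assumption; it may be unnecessary when the .cpg file is already picked up.

```python
import subprocess
from pathlib import Path


def build_ogr2ogr_command(src, dst, encoding="CP1252"):
    """Build the ogr2ogr call that rewrites a shapefile as (Geo)Parquet.

    -oo ENCODING=... tells the Shapefile driver how the attribute text is
    encoded (assumed here to be CP1252, as in this report).
    """
    return [
        "ogr2ogr",
        "-f", "Parquet",
        "-oo", f"ENCODING={encoding}",
        str(dst),
        str(src),
    ]


def convert_and_load(src, dst, encoding="CP1252"):
    """Convert with ogr2ogr, then open the Parquet file in DuckDB."""
    subprocess.run(build_ogr2ogr_command(src, dst, encoding), check=True)
    import duckdb  # imported lazily so the command builder works without it

    con = duckdb.connect()
    return con.read_parquet(str(dst))


if __name__ == "__main__":
    # Source path taken from the report; the destination name is made up.
    rel = convert_and_load(Path("source/t_adresse.shp"), Path("t_adresse.parquet"))
    print(rel.limit(5))
```

Depending on the DuckDB version, the geometry column may arrive as a plain binary (WKB) column until the spatial extension is loaded.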
To Reproduce
Note: the shapefile does have a .cpg file providing the encoding.
Even forcing the encoding fails:
However,
SELECT * FROM ST_READ('source/t_adresse.shp')
does not give an error (in Python).
Current workaround:
OS:
MacOS
DuckDB Version:
1.1
DuckDB Client:
Python
Hardware:
No response
Full Name:
Valérian LEBERT
Affiliation:
Digi-Studio
What is the latest build you tested with? If possible, we recommend testing with the latest nightly build.
I have tested with a stable release
Did you include all relevant data sets for reproducing the issue?
No - I cannot easily share my data sets due to their large size
Did you include all code required to reproduce the issue?
Did you include all relevant configuration (e.g., CPU architecture, Python version, Linux distribution) to reproduce the issue?