What is the bug?
For a non-UTF-8 encoded Shapefile with a proper .cpg file that correctly specifies the encoding (attached), testing OLCStringsAsUTF8 against the layer with no SQL query returns True, whereas testing the same capability on the layer returned from executing an SQL query against it returns False. It seems like it should return True in either case, unless it is deliberately disabled for SQL result layers.
Per pyogrio #384 we're trying to correctly detect that GDAL is auto-decoding from the native encoding of the file to UTF-8 for us (using the C API and this capability test), so that we specifically disable trying to decode text to a user-specified encoding in later steps (because text is already decoded). We use OLCStringsAsUTF8 to determine if the layer is likely to be returned as UTF-8, and in the case of Shapefiles specifically set a fallback to ISO-8859-1 when the capability is False (which in this case miscodes the results because we assume ISO-8859-1 when the text is in fact UTF-8).
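To make the failure mode concrete, here is a minimal sketch of the decision described above. The helper `guess_encoding` and its parameters are hypothetical and are not pyogrio's actual code; only the driver name "ESRI Shapefile" and the two encodings come from GDAL and this report.

```python
def guess_encoding(strings_as_utf8: bool, driver: str):
    """Hypothetical helper mirroring the decision described above."""
    if strings_as_utf8:
        # GDAL already recodes to UTF-8, so no further decoding is needed.
        return "UTF-8"
    if driver == "ESRI Shapefile":
        # Fallback used for Shapefiles when the capability is False.
        return "ISO-8859-1"
    return None  # assumption: otherwise leave the choice to the caller


# GDAL has in fact already recoded the text to UTF-8 ...
raw = "é".encode("UTF-8")
# ... but the SQL layer reports False, so the fallback is chosen.
wrong = raw.decode(guess_encoding(False, "ESRI Shapefile"))
right = raw.decode(guess_encoding(True, "ESRI Shapefile"))
```

Here `wrong` holds two mis-decoded characters where `right` holds the single intended one, which is the kind of corruption we see on SQL result layers.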
It appears that the GDAL Python bindings are able to correctly decode from the native encoding even though this capability returns False, presumably because the native encoding and recoding to UTF-8 are stored on the base layer independent of the SQL result layer on top of it. It is unclear how to obtain the base layer encoding support from the SQL result layer we get via the C API (via GDALDatasetExecuteSQL), so our current workaround is to fall back to getting the base layer without an SQL query (via GDALDatasetGetLayer on the first / only layer) and testing against that instead. It is unclear if that is the recommended practice, or if this is instead a bug in GDAL.
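The workaround can be sketched roughly as follows. This is written against objects that expose the same method names as the GDAL Python bindings (`GetLayer`, `TestCapability`) rather than the C API we actually call; the function name `strings_as_utf8` and the `from_sql` flag are hypothetical, and the sketch assumes a single-layer dataset such as a Shapefile.

```python
# String value of GDAL's OLCStringsAsUTF8 capability constant.
OLC_STRINGS_AS_UTF8 = "StringsAsUTF8"


def strings_as_utf8(dataset, layer, from_sql):
    """Return the capability, probing the base layer for SQL results.

    Assumption: the dataset has a single layer, so layer 0 is the
    base layer behind any SQL result layer.
    """
    probe = dataset.GetLayer(0) if from_sql else layer
    return bool(probe.TestCapability(OLC_STRINGS_AS_UTF8))
```

The SQL result layer is still the one that gets read; only the capability test is redirected to the base layer.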
Steps to reproduce the issue
Given this file: test.zip
extracted to /tmp/test.shp:
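As a stand-in sketch (not the original reproduction listing), the comparison can be expressed against a dataset object from the GDAL Python bindings. The helper name `compare_capability` is hypothetical, and the SQL statement in the usage note is an assumption, since any query against the layer should do.

```python
def compare_capability(dataset, sql):
    """Return (base_layer, sql_layer) results for OLCStringsAsUTF8."""
    capability = "StringsAsUTF8"  # value of OLCStringsAsUTF8
    base = bool(dataset.GetLayer(0).TestCapability(capability))
    result = dataset.ExecuteSQL(sql)
    if result is None:
        raise RuntimeError("SQL statement returned no result layer")
    try:
        on_sql = bool(result.TestCapability(capability))
    finally:
        # SQL result layers must be released by the dataset that made them.
        dataset.ReleaseResultSet(result)
    return base, on_sql
```

With the osgeo bindings installed, this could be called as `compare_capability(gdal.OpenEx("/tmp/test.shp"), "SELECT * FROM test")`; if the bindings mirror the C API behavior reported here, the two values would differ.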
It appears that the GDAL Python bindings are able to correctly decode from the native encoding even though this capability returns False, presumably because the native encoding and recoding to UTF-8 are stored on the base layer independent of the SQL result layer on top of it.
They completely disregard the OLCStringsAsUTF8 capability and optimistically assume that strings are UTF-8; if they are not (failure in PyUnicode_DecodeUTF8), they return a bytes object instead.
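A pure-Python analogue of that behavior, as a rough sketch only: the real logic lives in the bindings' C wrapper code, and the function name `decode_like_bindings` is hypothetical.

```python
def decode_like_bindings(raw: bytes):
    """Try UTF-8 first; hand back the raw bytes if that fails."""
    try:
        return raw.decode("utf-8")
    except UnicodeDecodeError:
        # Not valid UTF-8: return the undecoded bytes unchanged.
        return raw
```

Under this scheme the capability flag never enters the picture, which would explain why the bindings decode the SQL result layer correctly even though the flag is False.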
Versions and provenance
GDAL 3.8.3 on macOS 12.6.5 (M1)
Additional context
No response