
UTF8 layer capability returns false for SQL result layer of shapefile but true for base layer #9648

Closed
brendan-ward opened this issue Apr 6, 2024 · 3 comments · Fixed by #9649

What is the bug?

For a non-UTF-8 encoded shapefile with a proper .cpg file that correctly specifies the encoding (attached), testing OLCStringsAsUTF8 on the dataset with no SQL query returns True for this capability, whereas testing the same capability on the SQL result layer returned by executing an SQL query against that layer returns False. It seems like it should return True in either case, unless the capability is deliberately disabled for SQL result layers.

Per pyogrio #384, we're trying to correctly detect that GDAL is auto-decoding from the file's native encoding to UTF-8 for us (using the C API and this capability test), so that we can specifically disable decoding text to a user-specified encoding in later steps (because the text is already decoded). We use OLCStringsAsUTF8 to determine whether the layer is likely to be returned as UTF-8 and, for shapefiles specifically, set a fallback of ISO-8859-1 when the capability is False (which in this case mis-decodes the results, because we assume ISO-8859-1 when the text is in fact UTF-8). A simplified sketch of this detection logic follows.
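
The sketch below is a simplified illustration of that logic, not pyogrio's actual code; the is_shapefile flag and user_encoding parameter are hypothetical names introduced here, and the Python bindings stand in for the C API:

from osgeo import ogr

def detect_encoding(lyr, is_shapefile, user_encoding=None):
    """Pick the encoding used to decode string fields read from lyr."""
    if lyr.TestCapability(ogr.OLCStringsAsUTF8):
        # GDAL reports it is recoding to UTF-8 for us, so a
        # user-specified encoding must not be applied on top.
        return "utf-8"
    if user_encoding is not None:
        return user_encoding
    # Shapefile-specific fallback when the capability is False;
    # this is what mis-decodes the attached file, whose text is
    # in fact already UTF-8.
    return "ISO-8859-1" if is_shapefile else "utf-8"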

It appears that the GDAL Python bindings are able to correctly decode from the native encoding even though this capability returns False, presumably because the native encoding and the recoding to UTF-8 are stored on the base layer, independent of the SQL result layer on top of it. It is unclear how to obtain the base layer's encoding support from the SQL result layer we get via the C API (via GDALDatasetExecuteSQL), so our current workaround is to fall back to getting the base layer without an SQL query (via GDALDatasetGetLayer on the first / only layer) and testing the capability against that instead, as sketched below. It is unclear whether that is the recommended practice, or whether this is instead a bug in GDAL.
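
A minimal sketch of that workaround, expressed here with the Python bindings (the C API equivalents are GDALDatasetExecuteSQL, GDALDatasetGetLayer, and OGR_L_TestCapability):

from osgeo import ogr

ds = ogr.Open("/tmp/test.shp")
sql_lyr = ds.ExecuteSQL("SELECT * FROM test")
# Testing the SQL result layer reports False (the behavior at issue).
print(sql_lyr.TestCapability(ogr.OLCStringsAsUTF8))
# Workaround: test the capability on the first / only base layer instead.
print(ds.GetLayerByIndex(0).TestCapability(ogr.OLCStringsAsUTF8))
ds.ReleaseResultSet(sql_lyr)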

Steps to reproduce the issue

Given this file (test.zip), extracted to /tmp/test.shp:

from osgeo import ogr

drv = ogr.GetDriverByName("ESRI Shapefile")
ds = drv.Open("/tmp/test.shp", 0)

lyr = ds.GetLayerByIndex(0)
print(f"Base layer supports UTF-8: {lyr.TestCapability(ogr.OLCStringsAsUTF8)}")
# True
print(lyr.schema[0].name)
# 中文
print(lyr.GetFeature(0)[0])
# 中文


lyr = ds.ExecuteSQL("select * from test where \"中文\" = '中文' ", None, "")
print(f"SQL layer supports UTF-8: {lyr.TestCapability(ogr.OLCStringsAsUTF8)}")
# False
print(lyr.schema[0].name)
# 中文
print(lyr.GetFeature(0)[0])
# 中文

Versions and provenance

GDAL 3.8.3 on macOS 12.6.5 (M1)


rouault (Member) commented Apr 6, 2024

> It appears that the GDAL Python bindings are able to correctly decode from the native encoding even though this capability returns False, presumably because the native encoding and the recoding to UTF-8 are stored on the base layer, independent of the SQL result layer on top of it.

They totally disregard the UTF-8 capability and optimistically assume that strings are UTF-8; if they are not (PyUnicode_DecodeUTF8 fails), they return a bytes object:

static PyObject* GDALPythonObjectFromCStr(const char *pszStr)
{
  const unsigned char* pszIter = (const unsigned char*) pszStr;
  while(*pszIter != 0)
  {
    if (*pszIter > 127)
    {
        PyObject* pyObj = PyUnicode_DecodeUTF8(pszStr, strlen(pszStr), "strict");
        if (pyObj != NULL && !PyErr_Occurred())
            return pyObj;
        PyErr_Clear();
        return PyBytes_FromString(pszStr);
    }
    pszIter ++;
  }
  return PyUnicode_FromString(pszStr);
}
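
In practice this means a caller of the Python bindings can receive either str or bytes for a string field. A small helper (hypothetical, not part of GDAL) that mirrors this behaviour on the caller's side:

def field_as_str(value, fallback_encoding="ISO-8859-1"):
    """Normalize a GDAL field value that may be str or bytes."""
    if isinstance(value, bytes):
        # Strict UTF-8 decoding failed inside the bindings, so the
        # raw bytes were returned; decode with a chosen fallback.
        return value.decode(fallback_encoding)
    return value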

rouault (Member) commented Apr 6, 2024

Fix in #9649.

brendan-ward (Author) commented

Thanks @rouault for the extremely fast fix! 💯
