
UTF8 layer capability returns false for SQL result layer of shapefile but true for base layer #9648

Closed
brendan-ward opened this issue Apr 6, 2024 · 3 comments · Fixed by #9649

What is the bug?

For a non-UTF-8 encoded shapefile with a proper .cpg file that correctly specifies the encoding (attached), testing OLCStringsAsUTF8 on the dataset with no SQL query returns True for this capability, whereas testing the same capability on the SQL result layer returned by executing an SQL query against that layer returns False. It seems like it should return True in either case, unless the capability is deliberately disabled for SQL result layers.

Per pyogrio #384, we're trying to correctly detect that GDAL is auto-decoding from the file's native encoding to UTF-8 for us (using the C API and this capability test), so that we can specifically disable decoding text to a user-specified encoding in later steps (because the text is already decoded). We use OLCStringsAsUTF8 to determine whether the layer is likely to be returned as UTF-8 and, for shapefiles specifically, set a fallback of ISO-8859-1 when the capability is False (which in this case mis-decodes the results, because we assume ISO-8859-1 when the text is in fact UTF-8). A simplified sketch of this detection logic follows.
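
The sketch below is a simplified illustration of that logic, not pyogrio's actual code; the is_shapefile flag and user_encoding parameter are hypothetical names introduced here, and the Python bindings stand in for the C API:

from osgeo import ogr

def detect_encoding(lyr, is_shapefile, user_encoding=None):
    """Pick the encoding used to decode string fields read from lyr."""
    if lyr.TestCapability(ogr.OLCStringsAsUTF8):
        # GDAL reports it is recoding to UTF-8 for us, so a
        # user-specified encoding must not be applied on top.
        return "utf-8"
    if user_encoding is not None:
        return user_encoding
    # Shapefile-specific fallback when the capability is False;
    # this is what mis-decodes the attached file, whose text is
    # in fact already UTF-8.
    return "ISO-8859-1" if is_shapefile else "utf-8"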

It appears that the GDAL Python bindings are able to correctly decode from the native encoding even though this capability returns False, presumably because the native encoding and the recoding to UTF-8 are stored on the base layer, independent of the SQL result layer on top of it. It is unclear how to obtain the base layer's encoding support from the SQL result layer we get via the C API (via GDALDatasetExecuteSQL), so our current workaround is to fall back to getting the base layer without an SQL query (via GDALDatasetGetLayer on the first / only layer) and testing the capability against that instead, as sketched below. It is unclear whether that is the recommended practice, or whether this is instead a bug in GDAL.
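
A minimal sketch of that workaround, expressed here with the Python bindings (the C API equivalents are GDALDatasetExecuteSQL, GDALDatasetGetLayer, and OGR_L_TestCapability):

from osgeo import ogr

ds = ogr.Open("/tmp/test.shp")
sql_lyr = ds.ExecuteSQL("SELECT * FROM test")
# Testing the SQL result layer reports False (the behavior at issue).
print(sql_lyr.TestCapability(ogr.OLCStringsAsUTF8))
# Workaround: test the capability on the first / only base layer instead.
print(ds.GetLayerByIndex(0).TestCapability(ogr.OLCStringsAsUTF8))
ds.ReleaseResultSet(sql_lyr)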

Steps to reproduce the issue

Given this file (test.zip), extracted to /tmp/test.shp:

from osgeo import ogr

drv = ogr.GetDriverByName("ESRI Shapefile")
ds = drv.Open("/tmp/test.shp", 0)

lyr = ds.GetLayerByIndex(0)
print(f"Base layer supports UTF-8: {lyr.TestCapability(ogr.OLCStringsAsUTF8)}")
# True
print(lyr.schema[0].name)
# 中文
print(lyr.GetFeature(0)[0])
# 中文


lyr = ds.ExecuteSQL("select * from test where \"中文\" = '中文' ", None, "")
print(f"SQL layer supports UTF-8: {lyr.TestCapability(ogr.OLCStringsAsUTF8)}")
# False
print(lyr.schema[0].name)
# 中文
print(lyr.GetFeature(0)[0])
# 中文

Versions and provenance

GDAL 3.8.3 on macOS 12.6.5 (M1)


rouault (Member) commented Apr 6, 2024

> It appears that the GDAL Python bindings are able to correctly decode from the native encoding even though this capability returns False, presumably because the native encoding and the recoding to UTF-8 are stored on the base layer, independent of the SQL result layer on top of it.

They totally disregard the UTF-8 capability and optimistically assume that strings are UTF-8; if they are not (PyUnicode_DecodeUTF8 fails), they return a bytes object:

static PyObject* GDALPythonObjectFromCStr(const char *pszStr)
{
  const unsigned char* pszIter = (const unsigned char*) pszStr;
  while(*pszIter != 0)
  {
    if (*pszIter > 127)
    {
        PyObject* pyObj = PyUnicode_DecodeUTF8(pszStr, strlen(pszStr), "strict");
        if (pyObj != NULL && !PyErr_Occurred())
            return pyObj;
        PyErr_Clear();
        return PyBytes_FromString(pszStr);
    }
    pszIter ++;
  }
  return PyUnicode_FromString(pszStr);
}
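
In practice this means a caller of the Python bindings can receive either str or bytes for a string field. A small helper (hypothetical, not part of GDAL) that mirrors this behaviour on the caller's side:

def field_as_str(value, fallback_encoding="ISO-8859-1"):
    """Normalize a GDAL field value that may be str or bytes."""
    if isinstance(value, bytes):
        # Strict UTF-8 decoding failed inside the bindings, so the
        # raw bytes were returned; decode with a chosen fallback.
        return value.decode(fallback_encoding)
    return value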

rouault (Member) commented Apr 6, 2024

Fix in #9649.

brendan-ward (Author) commented

Thanks @rouault for the extremely fast fix! 💯
