GBK-encoded SHP file: read exception #380
Thanks for the report. Are you able to share a small sample zip file (all files related to your shp) that reproduces this issue? Is there a corresponding .cpg file?
The data mentioned above do not have .cpg files. Most SHP data without a .cpg file show no garbled characters; these several SHP files are exceptions. Unzip the shp data in the attachment (in theory they are all GBK-encoded). When reading with read_dataframe():

- 01.shp: with encoding unset or set to 'utf8' there are no garbled characters; setting encoding='cp936' raises an encoding error.
- 02.shp: setting encoding='cp936' gives no garbled characters, while 'utf8' does.
- 03.shp: both encoding='cp936' and encoding='utf8' give garbled characters.
Hello, have you downloaded the zip attachment above? If you have, I will delete it.
I have downloaded it and am able to reproduce your findings, and I am currently trying to get to the root of what is going on here. If possible, please leave the sample files available for a bit longer, so that other maintainers here can use them for testing to either isolate the issue or review a fix when/if identified.
Actually, I'm finding that Fiona produces the same default behavior as pyogrio, so I'm wondering if the system preferred encoding is different between our systems:
Results of reading the first value:

```text
pyogrio detected encoding: UTF-8
read_dataframe: 01.shp (default): 001街道尚营社区
fiona read:     01.shp (default): 001街道尚营社区
read_dataframe: 01.shp (UTF-8):   001街道尚营社区
read_dataframe: 01.shp (cp936):   failed with exception
========================================
pyogrio detected encoding: ISO-8859-1
read_dataframe: 02.shp (default): Ô¬ÀÏׯ´å
fiona read:     02.shp (default): Ô¬ÀÏׯ´å
read_dataframe: 02.shp (UTF-8):   failed with exception
read_dataframe: 02.shp (cp936):   袁老庄村
========================================
pyogrio detected encoding: UTF-8
read_dataframe: 03.shp (default): µËÖÝÊÐ
fiona read:     03.shp (default): µËÖÝÊÐ
read_dataframe: 03.shp (UTF-8):   µËÖÝÊÐ
read_dataframe: 03.shp (cp936):   碌脣脰脻脢脨
```

If `encoding` is not provided, we check with GDAL to see if the dataset supports UTF-8, and otherwise fall back to …

Here is what GDAL detects at a lower level:

```text
> ogrinfo -json 01.shp 01
...
"SHAPEFILE":{
  "ENCODING_FROM_LDID":"CP936",
  "LDID_VALUE":"77",
  "SOURCE_ENCODING":"CP936"
}
...
> ogrinfo -json 02.shp 02
...
"SHAPEFILE":{
  "SOURCE_ENCODING":""
}
...
> ogrinfo -json 03.shp 03
...
"SHAPEFILE":{
  "ENCODING_FROM_LDID":"ISO-8859-1",
  "LDID_VALUE":"87",
  "SOURCE_ENCODING":"ISO-8859-1"
}
...
```
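The 03.shp results can be reproduced outside GDAL entirely. This is a sketch, assuming (as the LDID output suggests) that GDAL decoded the file's native GBK bytes as ISO-8859-1 and handed the result over as UTF-8:

```python
# What read_dataframe returned for 03.shp by default: GBK bytes that GDAL
# decoded as ISO-8859-1 (per ENCODING_FROM_LDID above).
shown = "µËÖÝÊÐ"

# Reversing GDAL's latin-1 pass recovers the raw bytes, which decode
# correctly as GBK to the intended text.
native = shown.encode("latin-1")
print(native.decode("gbk"))

# pyogrio's encoding="cp936" then re-decoded GDAL's UTF-8 output as GBK,
# producing the doubly-garbled string from the results above.
print(shown.encode("utf-8").decode("gbk"))  # -> 碌脣脰脻脢脨
```

So with `encoding='cp936'` the data is effectively decoded twice, which is why no choice of `encoding` can fix 03.shp on its own.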
What this means is that the 3 files differ in how GDAL detects their encoding from the LDID in the .dbf header. I'm still trying to trace this through, but it looks like GDAL is automatically decoding from the detected encoding to UTF-8 before we attempt to detect the encoding of the file. This would explain the detected encodings reported above.
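For reference, the ENCODING_FROM_LDID values come from a single byte in the .dbf file: offset 29 of the xBase header holds the "language driver ID", which GDAL maps to a code page (e.g. 77 to CP936 and 87 to ISO-8859-1, as in the ogrinfo output above). A hypothetical inspection helper, not part of pyogrio:

```python
def read_ldid(dbf_path):
    """Return the language driver ID byte from a .dbf header.

    The xBase header stores the LDID at byte offset 29; a value of 0
    means no LDID was recorded.
    """
    with open(dbf_path, "rb") as f:
        header = f.read(32)  # the fixed part of the header is 32 bytes
    return header[29]
```

For the three samples above one would expect 77, presumably 0, and 87 respectively.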
This is to be expected for shapefile, as shapefile is an "OLCStringsAsUTF8" format, so we don't do any detecting; this is handled fully by GDAL. FYI:

```python
if OGR_L_TestCapability(ogr_layer, OLCStringsAsUTF8):
    # OGR_L_TestCapability returns True for OLCStringsAsUTF8 if GDAL hides encoding
    # complexities for this layer/driver type. In this case all string attribute
    # values have to be supplied in UTF-8 and values will be returned in UTF-8.
    # The encoding used to read/write under the hood depends on the driver used.
    # For layers/drivers where False is returned, the string values are written and
    # read without recoding. Hence, it is up to you to supply the data in the
    # appropriate encoding. More info:
    # https://gdal.org/development/rfc/rfc23_ogr_unicode.html#oftstring-oftstringlist-fields
    return "UTF-8"
```
I had a quick look, and it seems that the …
Ok, I think I understand better what is going on here. Where GDAL auto-detects the encoding (via the LDID), it automatically decodes the native encoding to UTF-8 before pyogrio sees the values, so the three files behave differently depending on which encoding GDAL auto-detected. There are a couple of ways to sidestep the above issues:
```python
from pyogrio import set_gdal_config_options

set_gdal_config_options({"SHAPE_ENCODING": "cp936"})
```

Note: this then applies to all read operations; I need to check for a dataset / layer read option. We may also need to set options when …
@theroggy thanks for looking at this too; I'm not terribly familiar with alternative encodings. I'm starting to wonder if we should not be trying to decode via a user-passed `encoding` ourselves. Like you say, for reading shapefiles, we need to be opting out of GDAL's auto-detection when the user passes an `encoding`. Some related bits in Fiona for further investigation: Fiona #516, Fiona #512
It looks like Fiona has removed anything that was directly setting … It looks like we can use the open option …
Thank you so much for your help. When I noticed the garbled text in the feedback, I realized that I forgot to mention that the …
resolved by #380 |
Windows 10 professional 22H2 19045.4170
pyogrio == 0.7.2
GDAL == 3.8.4
fiona == 1.9.5
When using pyogrio.read_dataframe() to read a shp file whose .dbf file is GBK-encoded, you specify the parameter encoding='gbk' or encoding='cp936'. Two exceptional situations were encountered; when specifying encoding='cp936' in fiona, there are no similar issues.
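On the choice between encoding='gbk' and encoding='cp936' in the report: in Python's codec registry, "cp936" is simply an alias for "gbk", so on the Python side both select the same codec (GDAL's own SHAPE_ENCODING handling is a separate code path):

```python
import codecs

# "cp936" resolves to the same codec as "gbk" in Python.
print(codecs.lookup("cp936").name)  # -> gbk

# Consequently the two names encode identically.
print("袁".encode("gbk") == "袁".encode("cp936"))  # -> True
```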