unicode: encoding/decoding errors in python2 #315

tilboerner · 2013-05-17T22:19:18Z

For the 🍶 of Python 2, we need to ensure we're handling non-ASCII string data in the same uniform way that Python 3 includes by design. Internally, all strings should have the same representation.

Create a central place that takes strings (str, bytes or unicode) of various encodings and returns them in standard unicode form (py2->unicode, py3->str). All external string data needs to be passed through it as soon as possible, probably the moment it enters the cherrymusicserver namespace.

There should probably be two different implementations: one for each major version of Python.
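
A minimal sketch of what such a central place could look like, assuming UTF-8 as the default encoding for incoming bytes. The names `to_text` and `text_type` are hypothetical and not part of cherrymusicserver; the same source runs on both major versions instead of being split into two implementations.

```python
import sys

PY2 = sys.version_info[0] == 2
# The text type differs per major version: unicode on py2, str on py3.
# The conditional is lazy, so `unicode` is only looked up on py2.
text_type = unicode if PY2 else str  # noqa: F821


def to_text(value, encodings=('utf-8',)):
    """Return value as text, decoding bytes with the first encoding that works.

    Hypothetical helper: the default encoding list is an assumption.
    """
    if isinstance(value, text_type):
        return value  # already in the internal representation
    if not isinstance(value, (bytes, bytearray)):
        raise TypeError('expected a string, got %r' % type(value))
    for encoding in encodings:
        try:
            return bytes(value).decode(encoding)
        except UnicodeDecodeError:
            continue  # try the next candidate encoding
    raise ValueError('cannot decode %r with any of %r' % (value, encodings))
```

Calling `to_text` right where external data enters (os calls, command line, HTTP) would keep byte strings from leaking further into the program.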

External sources:

calls to os module: filesystem, text file contents (config!), ...
command line interface
HTTP interface
anything we're getting from sys ❓

Reset or consider as external:

databases

Please feel free to add other relevant sources, observations or notable consequences that are still missing.


tilboerner · 2013-06-09T23:43:16Z

For reference: http://blog.vrypan.net/2012/11/13/hfsplus-unicode-and-accented-chars/

tilboerner · 2013-06-11T18:02:37Z

For reference: http://blog.vrypan.net/2012/11/13/hfsplus-unicode-and-accented-chars/

This is relevant because it concerns non-normalized unicode filenames. It also hints at problems of unicode non-normalization in general.

Or should I say: co-normalization? As a matter of fact, in valid unicode there exist several equivalent normal forms, and identical characters (glyphs) may be equivalently represented by different codepoints. Yes, the same unicode text, in the same UTF encoding and byte order, can be stored in different byte sequences which don't compare equal. For example:

Python certainly doesn't care to look beyond the level of codepoints;
SQLite turns bytestrings into unicode and returns them as such, by whatever unknown procedure; it also doesn't normalize unicode that is put into it;
a unicode-aware filesystem will happily treat different canonically-equivalent forms of the same glyph as different names.

So, it's on us to be aware of such things. One significant consequence is that an HFS+ filename generated by OS X may be different from the same name as handled by Linux (see original link above). This means that, no, we cannot enforce a certain normal form without potentially invalidating stuff like filenames. In general, this means that string data generated by the program in one environment is not in all cases interoperable with the same program in another environment. More stuff to watch out for, hooh yeah.
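
To make the point concrete, here is a small sketch using the standard `unicodedata` module. It shows two canonically equivalent spellings of 'é' that do not compare equal. The helper `same_text` is a hypothetical name; it normalizes only for the comparison and leaves the original strings (e.g. filenames) untouched.

```python
import unicodedata

composed = u'\u00e9'      # 'é' as a single code point (what NFC produces)
decomposed = u'e\u0301'   # 'e' plus COMBINING ACUTE ACCENT (what NFD produces)

# Same glyph, different code points, different UTF-8 byte sequences.
print(composed == decomposed)                                    # False
print(composed.encode('utf-8') == decomposed.encode('utf-8'))    # False


def same_text(a, b):
    """Compare two unicode strings by canonical equivalence (hypothetical helper)."""
    return unicodedata.normalize('NFC', a) == unicodedata.normalize('NFC', b)


print(same_text(composed, decomposed))                           # True
```

Comparing normalized copies like this avoids rewriting names on disk, which is the part that could invalidate them.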

acidtonic · 2014-10-07T19:17:15Z

This issue is really bugging me and preventing nearly 50% of the songs in my library from even appearing in the web front end. My server.log is simply overflowing with errors about decoding various filenames. I'll paste a few here for reference.....

I'm going to be installing convmv and simply adding a hack to python so these errors run convmv on the files that throw exceptions. That tool will be able to fix the filenames and I really wish it was just something cherrymusic ran by itself or supported.....

ERROR [2014-10-07 14:43:48,447] : cherrypy.error.139977052412496 : from line (201) at
/usr/lib64/python2.7/site-packages/cherrypy/_cplogging.py
--
[07/Oct/2014:14:43:48] HTTP Traceback (most recent call last):
File "/usr/lib64/python2.7/site-packages/cherrypy/_cprequest.py", line 656, in respond
response.body = self.handler()
File "/usr/lib64/python2.7/site-packages/cherrypy/lib/encoding.py", line 188, in __call__
self.body = self.oldhandler(*args, **kwargs)
File "/usr/lib64/python2.7/site-packages/cherrypy/_cpdispatch.py", line 34, in __call__
return self.callable(*self.args, **self.kwargs)
File "/home/acidtonic/code/cherrymusic/cherrymusicserver/httphandler.py", line 290, in api
return json.dumps({'data': handler(**handler_args)})
File "/home/acidtonic/code/cherrymusic/cherrymusicserver/httphandler.py", line 452, in api_listdir
return [entry.to_dict() for entry in self.model.listdir(directory)]
File "/home/acidtonic/code/cherrymusic/cherrymusicserver/cherrymodel.py", line 163, in listdir
musicentry.count_subfolders_and_files()
File "/home/acidtonic/code/cherrymusic/cherrymusicserver/cherrymodel.py", line 355, in count_subfolders_and_files
subfilefullpath = os.path.join(fullpath, filename)
File "/usr/lib64/python2.7/posixpath.py", line 80, in join
path += '/' + b
UnicodeDecodeError: 'ascii' codec can't decode byte 0xc3 in position 18: ordinal not in range(128)

ERROR [2014-10-07 14:43:41,425] : cherrymusicserver.sqlitecache : from line (688) at
/home/acidtonic/code/cherrymusic/cherrymusicserver/sqlitecache.py
--
unable to decode filename '05 - Jeunesse Dor\xc3\xa9e.mp3' in u'/home/acidtonic/music/De Phazz/Big'; skipping.

ERROR [2014-10-07 14:43:41,427] : cherrymusicserver.sqlitecache : from line (688) at
/home/acidtonic/code/cherrymusic/cherrymusicserver/sqlitecache.py
--
unable to decode filename '12 - Jeunesse Dor\xc3\xa9e (Dub-Mix).mp3' in u'/home/acidtonic/music/De Phazz/Big'; skipping.
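
For readers following along, the traceback above can be reproduced in miniature. In Python 2, joining a unicode path with a byte string makes the interpreter decode the bytes implicitly as ASCII; the sketch below does that decode explicitly so that it behaves the same on Python 2 and 3. It assumes the file name bytes are exactly as they appear in the log.

```python
# Bytes as os.listdir would return them for an undecodable name.
name = b'12 - Jeunesse Dor\xc3\xa9e (Dub-Mix).mp3'

error = None
try:
    # posixpath.join does `path += '/' + b`; with a unicode `path`,
    # Python 2 implicitly runs the equivalent of this ASCII decode.
    (b'/' + name).decode('ascii')
except UnicodeDecodeError as exc:
    error = exc
    print(exc)

# Decoding with the encoding the bytes were actually written in works.
decoded = name.decode('utf-8')
```

The leading '/' added by the join is why the failing byte 0xc3 is reported at position 18 rather than 17.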

devsnd · 2014-10-11T12:47:24Z

Hi @acidtonic!

Thanks for your report! I just checked what could be wrong using the information you gave us, and it seems the encoding of the filenames and your file system encoding differ from one another. The file name is encoded using UTF-8:

>>> import codecs
>>> print codecs.decode('12 - Jeunesse Dor\xc3\xa9e (Dub-Mix).mp3', 'utf-8')
12 - Jeunesse Dorée (Dub-Mix).mp3

but your filesystem seems to be using ASCII, or maybe windows-1252 (but that would probably rather lead to mojibake in this case), as far as I can tell. Can you please post the output of the following command to find out if my theory is correct?

python2 -c "import sys; print sys.getfilesystemencoding()"

acidtonic · 2014-10-11T15:19:56Z

$ python2 -c "import sys; print sys.getfilesystemencoding()"
ANSI_X3.4-1968
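
A quick way to see what that name means to Python: `ANSI_X3.4-1968` is an alias that the standard codec registry resolves to the ASCII codec, which fits the theory above. A small check, using only `codecs.lookup`:

```python
import codecs

reported = 'ANSI_X3.4-1968'   # value printed by sys.getfilesystemencoding() above

# The codec registry resolves aliases to a canonical codec name.
codec_name = codecs.lookup(reported).name
print(codec_name)   # ascii
```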

I ended up solving the problem by downloading a tool called "detox" instead of the one I mentioned before.

The command I used was .... (-n is dry-run for no changes)

detox -s utf_8 -r -v -n *

I do wish however that there was some screen that listed all the tracks that were skipped for import errors. So I can catch if anyone else using the server is accidentally missing tracks because of this without babysitting it.

devsnd · 2014-10-11T16:04:33Z

you can get that information from the error log file, which is probably located in ~/.local/share/cherrymusic/error.log

$ grep "unable to decode filename" ~/.local/share/cherrymusic/error.log

should do the trick
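
If a list is more useful than raw grep output, the same lines can be picked apart in Python. This is only a sketch: the regular expression assumes the message format shown in the log excerpts earlier in this thread, and `skipped_files` is a hypothetical helper, not something cherrymusic provides.

```python
import re

# Matches e.g.: unable to decode filename '<name>' in u'<folder>'; skipping.
SKIPPED = re.compile(
    r"unable to decode filename (?P<name>.+) in (?P<folder>u?['\"].*['\"]); skipping\."
)


def skipped_files(lines):
    """Yield (filename repr, folder repr) for each 'unable to decode' log line."""
    for line in lines:
        match = SKIPPED.search(line)
        if match:
            yield match.group('name'), match.group('folder')
```

It can be fed any iterable of lines, for example an open log file; the values come back as the quoted representations that appear in the log.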

tilboerner referenced this issue May 17, 2013

improved unicode handling in python2 for compact file listing

25a2848

tilboerner mentioned this issue Jun 7, 2013

Error Updating Music Library when reading accented characters #319

Closed

tilboerner referenced this issue Jun 11, 2013

Made workaround even uglier, but it works for python2 and 3 #273

af8ad67

tilboerner mentioned this issue Jul 5, 2013

Setup Crashes for some folder names #325

Closed

devsnd mentioned this issue Jul 28, 2013

Setup failing due to unicode errors #333

Closed

tilboerner mentioned this issue Nov 12, 2013

Unable to process Chinese filename #367

Closed

tilboerner added a commit that referenced this issue Nov 18, 2013

Serve utf-8 paths in python 2

5e103f8

Read config file as utf-8; encode '/serve' staticdir config to prevent mixing of byte/unicode strings in cherrypy code, which would yield ASCII decoding errors. (#315, #325, #368)

tilboerner added a commit that referenced this issue Nov 18, 2013

Fix unicode filename bug in sqlitecache (#315, #333)

f04bbd0

When os.listdir returns a bytestring name, make an effort to decode it; if that fails, skip the file and log the error.
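
The behaviour described in that commit message can be sketched roughly as follows. This is not the code from the commit; `decode_names` is a hypothetical helper, and falling back to the filesystem encoding (or UTF-8 if none is reported) is an assumption.

```python
import logging
import sys

log = logging.getLogger(__name__)


def decode_names(names, encoding=None):
    """Return names as text; skip and log byte names that cannot be decoded."""
    encoding = encoding or sys.getfilesystemencoding() or 'utf-8'
    decoded = []
    for name in names:
        if isinstance(name, bytes):   # on py2, bytes is the plain str type
            try:
                name = name.decode(encoding)
            except UnicodeDecodeError:
                log.error('unable to decode filename %r; skipping.', name)
                continue
        decoded.append(name)
    return decoded
```

With an ASCII filesystem encoding, a UTF-8 name such as `b'Dor\xc3\xa9e.mp3'` would be skipped and logged, which matches the log entries reported above.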

6arms1leg mentioned this issue Apr 3, 2016

Encoding error in file names ? #611

Closed
