unicode: encoding/decoding errors in python2 #315

Open
5 tasks
tilboerner opened this issue May 17, 2013 · 6 comments

Comments

@tilboerner
Collaborator

For the 🍶 of Python 2, we need to ensure we're handling non-ASCII string data in the same uniform way that Python 3 provides by design. Internally, all strings should have the same representation.

Create a central place that takes strings (str, bytes or unicode) of various encodings and returns them in standard unicode form (py2->unicode, py3->str). All external string data needs to be passed through it as soon as possible, probably the moment it enters the cherrymusicserver namespace.

There should probably be two different implementations: one for each major version of Python.
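
A minimal sketch of what such a central place could look like. The name `to_text`, the `encodings` parameter and the default of trying only UTF-8 are assumptions for illustration, not existing cherrymusic code. It uses a single function with a version switch rather than two separate implementations, which is one possible way to cover both major versions.

```python
import sys

PY2 = sys.version_info[0] == 2
# Text type of the running interpreter: unicode on py2, str on py3.
# The name `unicode` is only evaluated when PY2 is true.
text_type = unicode if PY2 else str  # noqa: F821


def to_text(value, encodings=("utf-8",)):
    """Return `value` as text (hypothetical helper, not part of cherrymusic).

    Text is passed through unchanged; bytes are decoded with the first
    encoding in `encodings` that works.
    """
    if isinstance(value, text_type):
        return value
    if not isinstance(value, bytes):
        raise TypeError("expected text or bytes, got %r" % type(value))
    for encoding in encodings:
        try:
            return value.decode(encoding)
        except UnicodeDecodeError:
            continue  # try the next candidate encoding
    raise ValueError("could not decode %r with any of %r" % (value, encodings))
```

Callers at the boundaries (filesystem, CLI, HTTP) would pass data through `to_text` once on entry, so that everything further inside only ever sees text.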

External sources:

  • calls to os module: filesystem, text file contents (config!), ...
  • command line interface
  • HTTP interface
  • anything we're getting from sys

Reset or consider as external:

  • databases

Please feel free to add other relevant sources, observations or notable consequences that are still missing.

@tilboerner
Collaborator Author

For reference: http://blog.vrypan.net/2012/11/13/hfsplus-unicode-and-accented-chars/

This is relevant because it concerns non-normalized unicode filenames. It also hints at problems of unicode non-normalization in general.

Or should I say: co-normalization? As a matter of fact, valid unicode has several equivalent normal forms, and identical characters (glyphs) may be represented equivalently by different codepoints. Yes, the same unicode text, in the same UTF encoding and byte order, can be stored as different byte sequences which don't compare equal. For example:

  • Python certainly doesn't care to look beyond the level of codepoints;
  • SQLite turns bytestrings into unicode and returns them as such, by whatever unknown procedure; it also doesn't normalize unicode that is put into it;
  • a unicode-aware filesystem will happily treat different canonically-equivalent forms of the same glyph as different names.
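
To make the point concrete, here is a small sketch using the standard library's `unicodedata` module. The helper name `same_text` is made up for illustration; it compares two strings under NFC normalization without changing either of them.

```python
import unicodedata

composed = u"\u00e9"      # 'é' as a single codepoint (precomposed form)
decomposed = u"e\u0301"   # 'e' followed by a combining acute accent

# Same glyph, different codepoints: a plain comparison says "not equal".
print(composed == decomposed)  # False

# The UTF-8 byte sequences differ as well.
print(repr(composed.encode("utf-8")), repr(decomposed.encode("utf-8")))


def same_text(a, b, form="NFC"):
    """Compare two text strings after normalizing both to `form`."""
    return unicodedata.normalize(form, a) == unicodedata.normalize(form, b)


print(same_text(composed, decomposed))  # True
```

Normalizing only for the comparison leaves the stored strings, such as filenames, untouched.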

So, it's on us to be aware of such things. One significant consequence is that an HFS+ filename generated by OS X may be different from the same name as handled by Linux (see original link above). This means that, no, we cannot enforce a certain normal form without potentially invalidating stuff like filenames. In general, this means that string data generated by the program in one environment is not in all cases interoperable with the same program in another environment. More stuff to watch out for, hooh yeah.

tilboerner added a commit that referenced this issue Nov 18, 2013
Read config file as utf-8; encode '/serve' staticdir config to prevent
mixing of byte/unicode strings in cherrypy code, which would yield
ASCII decoding errors. (#315, #325, #368)
tilboerner added a commit that referenced this issue Nov 18, 2013
When os.listdir returns a bytestring name, make an effort to decode it;
if that fails, skip the file and log the error.
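
A rough sketch of the decode-or-skip behaviour described in that commit message. This is not the actual commit; the function names, the `encoding` parameter and its UTF-8 default are assumptions. The log message format mirrors the one that shows up in server logs.

```python
import logging
import os

log = logging.getLogger(__name__)


def decode_names(raw_names, directory, encoding="utf-8"):
    """Return the names as text, skipping (and logging) undecodable ones."""
    names = []
    for name in raw_names:
        if isinstance(name, bytes):
            try:
                name = name.decode(encoding)
            except UnicodeDecodeError:
                log.error("unable to decode filename %r in %r; skipping.",
                          name, directory)
                continue
        names.append(name)
    return names


def listdir_decoded(directory, encoding="utf-8"):
    """os.listdir, but with every returned name guaranteed to be text."""
    return decode_names(os.listdir(directory), directory, encoding)
```

Keeping the decoding in `decode_names` separate from the `os.listdir` call makes the skip logic testable without touching the filesystem.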
@acidtonic

This issue is really bugging me and preventing nearly 50% of the songs in my library from even appearing in the web front end. My server.log is simply overflowing with errors about decoding various filenames. I'll paste a few here for reference.....

I'm going to be installing convmv and simply adding a hack to python so these errors run convmv on the files that throw exceptions. That tool will be able to fix the filenames and I really wish it was just something cherrymusic ran by itself or supported.....


ERROR [2014-10-07 14:43:48,447] : cherrypy.error.139977052412496 : from line (201) at
/usr/lib64/python2.7/site-packages/cherrypy/_cplogging.py
--
[07/Oct/2014:14:43:48] HTTP Traceback (most recent call last):
File "/usr/lib64/python2.7/site-packages/cherrypy/_cprequest.py", line 656, in respond
response.body = self.handler()
File "/usr/lib64/python2.7/site-packages/cherrypy/lib/encoding.py", line 188, in __call__
self.body = self.oldhandler(*args, **kwargs)
File "/usr/lib64/python2.7/site-packages/cherrypy/_cpdispatch.py", line 34, in __call__
return self.callable(*self.args, **self.kwargs)
File "/home/acidtonic/code/cherrymusic/cherrymusicserver/httphandler.py", line 290, in api
return json.dumps({'data': handler(**handler_args)})
File "/home/acidtonic/code/cherrymusic/cherrymusicserver/httphandler.py", line 452, in api_listdir
return [entry.to_dict() for entry in self.model.listdir(directory)]
File "/home/acidtonic/code/cherrymusic/cherrymusicserver/cherrymodel.py", line 163, in listdir
musicentry.count_subfolders_and_files()
File "/home/acidtonic/code/cherrymusic/cherrymusicserver/cherrymodel.py", line 355, in count_subfolders_and_files
subfilefullpath = os.path.join(fullpath, filename)
File "/usr/lib64/python2.7/posixpath.py", line 80, in join
path += '/' + b
UnicodeDecodeError: 'ascii' codec can't decode byte 0xc3 in position 18: ordinal not in range(128)


ERROR [2014-10-07 14:43:41,425] : cherrymusicserver.sqlitecache : from line (688) at
/home/acidtonic/code/cherrymusic/cherrymusicserver/sqlitecache.py
--
unable to decode filename '05 - Jeunesse Dor\xc3\xa9e.mp3' in u'/home/acidtonic/music/De Phazz/Big'; skipping.


ERROR [2014-10-07 14:43:41,427] : cherrymusicserver.sqlitecache : from line (688) at
/home/acidtonic/code/cherrymusic/cherrymusicserver/sqlitecache.py
--
unable to decode filename '12 - Jeunesse Dor\xc3\xa9e (Dub-Mix).mp3' in u'/home/acidtonic/music/De Phazz/Big'; skipping.
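
For readers puzzled by the traceback above: on Python 2, joining a unicode directory with a bytestring filename makes the interpreter decode the bytes implicitly as ASCII, and that fails on the first non-ASCII byte. The sketch below reproduces that decode step explicitly so it behaves the same on Python 2 and 3. The directory is a made-up example path; the filename bytes are taken from the log lines above.

```python
import posixpath

directory = u"/home/user/music"            # text; hypothetical example path
filename = b"05 - Jeunesse Dor\xc3\xa9e.mp3"  # raw bytes, as listed by the OS

# Python 2 performs this decode implicitly when text and bytes are mixed.
try:
    filename.decode("ascii")
    implicit_decode_failed = False
except UnicodeDecodeError as exc:
    implicit_decode_failed = True
    print(exc)

# Decoding explicitly with the encoding the name was actually written in
# avoids the implicit ASCII step altogether.
fullpath = posixpath.join(directory, filename.decode("utf-8"))
print(repr(fullpath))
```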

@devsnd
Owner

devsnd commented Oct 11, 2014

Hi @acidtonic!

Thanks for your report! I just checked what could be wrong using the information you gave us, and it seems that the encoding of the file names and your file system encoding differ from one another. The file name is encoded using UTF-8:

>>> import codecs
>>> print codecs.decode('12 - Jeunesse Dor\xc3\xa9e (Dub-Mix).mp3', 'utf-8')
12 - Jeunesse Dorée (Dub-Mix).mp3

but your filesystem seems to be using ASCII, or maybe windows-1252 (but that would probably rather lead to mojibake in this case), as far as I can tell. Can you please post the output of the following command to find out if my theory is correct?

python2 -c "import sys; print sys.getfilesystemencoding()"

@acidtonic

$ python2 -c "import sys; print sys.getfilesystemencoding()"
ANSI_X3.4-1968
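
A side note on that output: ANSI_X3.4-1968 is a registered name for plain ASCII, and Python typically reports it when the process runs under the C/POSIX locale. The small check below is only a sketch, and `is_ascii_alias` is a name made up here; it asks Python's codec registry which codec the reported name resolves to.

```python
import codecs


def is_ascii_alias(encoding_name):
    """True if Python resolves `encoding_name` to its ASCII codec."""
    return codecs.lookup(encoding_name).name == "ascii"


reported = "ANSI_X3.4-1968"
print(is_ascii_alias(reported))
print(is_ascii_alias("utf-8"))
```

If the first check comes out true, every filename with a non-ASCII byte will fail to decode with the filesystem encoding, which matches the errors in the log.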

I ended up solving the problem by downloading a tool called "detox" instead of the one I mentioned before.

The command I used was .... (-n is dry-run for no changes)

detox -s utf_8 -r -v -n *

I do wish, however, that there was some screen that listed all the tracks that were skipped because of import errors, so I can catch it when anyone else using the server is accidentally missing tracks because of this, without babysitting it.

@devsnd
Owner

devsnd commented Oct 11, 2014

You can get that information from the error log file, which is probably located in ~/.local/share/cherrymusic/error.log

$ grep "unable to decode filename" ~/.local/share/cherrymusic/error.log

should do the trick
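
If a plain grep is not enough, the same lines can be pulled apart in Python. This is a sketch only: the regular expression is reverse-engineered from the two log messages quoted earlier in this thread and may not cover every variant, and `skipped_files` and `report` are hypothetical names. The directory and filename are returned exactly as they appear in the log (as Python reprs).

```python
import io
import os
import re

# Pattern based on the log messages quoted in this thread (an assumption).
SKIPPED = re.compile(
    r"unable to decode filename (?P<name>.+) "
    r"in (?P<directory>u?['\"].*['\"]); skipping\."
)


def skipped_files(lines):
    """Yield (directory, filename) pairs, as written in the log."""
    for line in lines:
        match = SKIPPED.search(line)
        if match:
            yield match.group("directory"), match.group("name")


def report(log_path="~/.local/share/cherrymusic/error.log"):
    """Print one line per skipped file found in the log at `log_path`."""
    path = os.path.expanduser(log_path)
    with io.open(path, encoding="utf-8", errors="replace") as logfile:
        for directory, name in skipped_files(logfile):
            print("%s: %s" % (directory, name))
```

Calling `report()` with no arguments reads the default location mentioned above; pass another path if the log lives elsewhere.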
