non-ascii XML output does not work with Python2 #106

Closed
goodmami opened this issue May 8, 2017 · 4 comments
@goodmami
Member

goodmami commented May 8, 2017

Converting to an XML-based format (e.g. mrx or dmrx) with Python 2 raises a UnicodeDecodeError when there are non-ASCII characters in the stream:

  [...]
  File "/usr/lib/python2.7/xml/etree/ElementTree.py", line 1073, in _escape_cdata
    return text.encode(encoding, "xmlcharrefreplace")
UnicodeDecodeError: 'ascii' codec can't decode byte 0xe3 in position 2: ordinal not in range(128)
@goodmami
Member Author

goodmami commented Aug 8, 2018

Here is a simple MRS that illustrates the problem:

[ TOP: h0
  INDEX: e2 [ e TENSE: pres MOOD: indicative PROG: - PERF: - ASPECT: default_aspect PASS: - SF: prop ]
  RELS: < [ "_スマート_a_1_rel"<0:4> LBL: h1 ARG0: e2 ARG1: i3 ] >
  HCONS: < h0 qeq h1 > ]

Once #164 is fixed, you'll see an error like the one in the original post. The first frame of the traceback that is in PyDelphin code, though, is this:

  File "delphin/mrs/util.py", line 94, in etree_tostring
    return etree.tostring(elem, encoding='utf-8', **kwargs).decode('utf-8')

So it seems the delphin.mrs.util.etree_tostring() function is probably where to look for a fix.

@goodmami
Member Author

goodmami commented Aug 21, 2018

In an email, @mcmillanmajora says this:

It seemed like the internal ElementTree decoder was defaulting to ascii regardless of the encoding given to it, so the only thing that ended up working for me was decoding the text as it was read in from sys.stdin in main.py. Strings don't have a decode() method in Python 3 though, so if an AttributeError is thrown, I have it reverting back to reading from stdin without the decoding.

From line 106 in main.py:
    text = sys.stdin.read()
    try:
        xs = loads(text.decode('utf-8', errors='replace'))
    except AttributeError:
        xs = loads(text)

I think the elements contain non-unicode (byte) strings in Python 2, so the encoding step in etree_tostring (the text.encode(...) call shown in the traceback) first has to decode them into unicode; it assumes they are ASCII and fails. Note that if it worked, it would first decode, then encode, then decode again, which is rather inefficient. It might be possible to ensure that everything in the element is unicode from the beginning, which would avoid that first implicit decode.
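To make the mechanism concrete, here is a minimal sketch that emulates under Python 3 what Python 2 does when .encode() is called on a byte string: it implicitly decodes with the default codec (ASCII unless reconfigured) before encoding. The helper name py2_style_encode is hypothetical and only illustrates the behavior; it is not part of PyDelphin or ElementTree.

```python
# -*- coding: utf-8 -*-

# UTF-8 bytes, i.e. what a plain (non-unicode) str holds in Python 2.
raw = u'スマート'.encode('utf-8')


def py2_style_encode(byte_string, encoding='utf-8'):
    """Emulate Python 2's str.encode(): implicit ASCII decode, then encode.

    Hypothetical helper for illustration only.
    """
    return byte_string.decode('ascii').encode(encoding, 'xmlcharrefreplace')


try:
    py2_style_encode(raw)
    failed = False
except UnicodeDecodeError:
    # Same error class as in the traceback: the first byte, 0xe3,
    # is outside the ASCII range.
    failed = True

# Decoding explicitly with the real encoding avoids the implicit ASCII step.
text = raw.decode('utf-8')
```

Pure-ASCII byte strings survive the implicit decode, which is presumably why the bug only shows up with non-ASCII input.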

The code you've shown avoids the problem by converting to unicode before constructing the XML structure (although it might be better to use the codecs or io module for a Py2/3 compatible version instead of catching an exception). Your solution is generally good, but since it is in main.py, I think it only helps when using the delphin command. Is the bug resolved if you use the API (e.g., dmrx.dumps(x))?
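As a rough illustration of the io-based alternative, the sketch below reads a file descriptor as unicode text in a way that should behave the same on Python 2.7 and Python 3, with no AttributeError handling. The function name read_text_from_fd is hypothetical, and a pipe stands in for stdin so the example is self-contained; in main.py one would presumably pass sys.stdin.fileno() instead.

```python
# -*- coding: utf-8 -*-
import io
import os


def read_text_from_fd(fd, encoding='utf-8'):
    """Read everything from file descriptor `fd` and return unicode text.

    Hypothetical helper; closefd=False leaves the descriptor open for
    the caller, as one would want for stdin.
    """
    with io.open(fd, mode='r', encoding=encoding,
                 errors='replace', closefd=False) as stream:
        return stream.read()


# Demo: a pipe stands in for stdin.
read_end, write_end = os.pipe()
os.write(write_end, u'スマート'.encode('utf-8'))
os.close(write_end)
text = read_text_from_fd(read_end)
os.close(read_end)
```

This only covers input read by the command; strings passed in through the API would still need to be unicode already.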

@mcmillanmajora
Contributor

You're right. It only solves the delphin command. I tried applying the conversion in loads() instead, and it works for the command line, but the encoding issue persists when using the API.

@goodmami
Member Author

Here's a unit test (I called it tests/mrs_util_test.py) that passes with Python3 but fails with Python2:

# -*- coding: utf-8 -*-

from delphin.mrs.util import etree_tostring

def test_etree_tostring():
    import xml.etree.ElementTree as etree
    e = etree.Element('a')
    # ASCII-only text
    e.text = 'a'
    assert etree_tostring(e, encoding='unicode') == u'<a>a</a>'
    # non-ASCII text given as a unicode string
    e.text = u'あ'
    assert etree_tostring(e, encoding='unicode') == u'<a>あ</a>'
    # non-ASCII text given as a plain str (a UTF-8 byte string in Python 2)
    e.text = 'あ'
    assert etree_tostring(e, encoding='unicode') == u'<a>あ</a>'
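One possible direction for making a test like this pass is sketched below: coerce any byte strings in the element tree to unicode before serializing. This is only a sketch under stated assumptions, not the fix PyDelphin adopted. The names _to_text and unicode_tostring are hypothetical, byte strings are assumed to be UTF-8, and only text, tail, and attribute values are coerced (not tags or attribute names).

```python
# -*- coding: utf-8 -*-
import xml.etree.ElementTree as etree


def _to_text(value, encoding='utf-8'):
    """Decode byte strings to unicode; pass unicode and None through."""
    if isinstance(value, bytes):  # `bytes` is `str` on Python 2
        return value.decode(encoding)
    return value


def unicode_tostring(elem):
    """Serialize `elem` to a unicode string (hypothetical helper).

    Note: this coerces the tree in place rather than copying it.
    """
    for node in elem.iter():
        node.text = _to_text(node.text)
        node.tail = _to_text(node.tail)
        for key, val in list(node.attrib.items()):
            node.attrib[key] = _to_text(val)
    # With only unicode in the tree, encoding cannot trigger an
    # implicit ASCII decode.
    return etree.tostring(elem, encoding='utf-8').decode('utf-8')


# Demo: text supplied as UTF-8 bytes, as with a plain str in Python 2.
e = etree.Element('a')
e.text = u'あ'.encode('utf-8')
result = unicode_tostring(e)
```

The final encode-then-decode round trip is still there; it is kept because it behaves the same on both Python versions.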
