Test failures when cchardet-2.1.7 and chardet are both installed #318

mgorny · 2022-07-02T03:13:52Z

When cchardet-2.1.7 and chardet-5.0.0 are both installed, the following tests fail.

FWICS, two of them fail because the encoding names differ only in case (the expected name is lowercase or mixed-case, the detected name is uppercase), and the other two are detected as a superset of the expected encoding (i.e. EUC-KR as UHC, and GB2312 as GB18030).
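A quick way to tell the two kinds of mismatch apart is to ask Python's codec registry whether both names resolve to the same codec. This is only a sketch, not something feedparser does; `same_codec` is a hypothetical helper, and the pairs are the expected/detected names from the failures in this report.

```python
import codecs

def same_codec(a, b):
    """Return True if Python resolves both names to the same codec."""
    return codecs.lookup(a).name == codecs.lookup(b).name

# (expected by the test, reported by cchardet) for the four failing tests
pairs = [
    ("windows-1255", "WINDOWS-1255"),
    ("Big5", "BIG5"),
    ("GB2312", "GB18030"),
    ("EUC-KR", "UHC"),
]

for expected, detected in pairs:
    kind = "spelling only" if same_codec(expected, detected) else "different codec"
    print(f"{expected:<14} {detected:<14} {kind}")
```

The first two pairs resolve to the same codec, so a case-insensitive comparison would be enough for them; the last two are genuinely different codecs and need to be accepted explicitly.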

...F...FF.F..... [remaining progress dots trimmed]
======================================================================
FAIL: test_001742 (__main__.TestCase)
./tests/illformed/chardet/windows1255.xml: windows-1255 with no encoding information
----------------------------------------------------------------------
Traceback (most recent call last):
  File "/tmp/feedparser/tests/runtests.py", line 1191, in fn
    self.fail_unless_eval(xmlfile, eval_string)
  File "/tmp/feedparser/tests/runtests.py", line 177, in fail_unless_eval
    raise self.failureException(failure)
AssertionError: not eval(b"bozo and encoding == 'windows-1255'") 
WITH env({'bozo': True,
 'bozo_exception': CharacterEncodingOverride('document declared as utf-8, but parsed as WINDOWS-1255'),
 'content-type': '',
 'encoding': 'WINDOWS-1255',
 'entries': [{'summary': 'האם תדפיס נייר של אתר אינטרנט שמוצג על מסך משתמש הוא '
                         'העתק נאמן למקור של אתר האינטרנט? רבים יגידו שכן, '
                         'ולפעמים גם בתי המשפט יצטרפו אליהם שיקבלו פלט מאתר '
                         'אינטרנט כראיה קבילה. אבל, זה ממש לא כך. ויש אפילו '
                         'הוכחה מדהימה.',
              'summary_detail': {'base': '',
                                 'language': None,
                                 'type': 'text/html',
                                 'value': 'האם תדפיס נייר של אתר אינטרנט שמוצג '
                                          'על מסך משתמש הוא העתק נאמן למקור של '
                                          'אתר האינטרנט? רבים יגידו שכן, '
                                          'ולפעמים גם בתי המשפט יצטרפו אליהם '
                                          'שיקבלו פלט מאתר אינטרנט כראיה '
                                          'קבילה. אבל, זה ממש לא כך. ויש אפילו '
                                          'הוכחה מדהימה.'}}],
 'feed': {},
 'headers': {},
 'namespaces': {},
 'version': 'rss'})

======================================================================
FAIL: test_001746 (__main__.TestCase)
./tests/illformed/chardet/gb2312.xml: GB2312 with no encoding information
----------------------------------------------------------------------
Traceback (most recent call last):
  File "/tmp/feedparser/tests/runtests.py", line 1191, in fn
    self.fail_unless_eval(xmlfile, eval_string)
  File "/tmp/feedparser/tests/runtests.py", line 177, in fail_unless_eval
    raise self.failureException(failure)
AssertionError: not eval(b"bozo and encoding == 'GB2312'") 
WITH env({'bozo': True,
 'bozo_exception': CharacterEncodingOverride('document declared as utf-8, but parsed as GB18030'),
 'content-type': '',
 'encoding': 'GB18030',
 'entries': [{'title': '不归移民漫画系列：专业工作',
              'title_detail': {'base': '',
                               'language': None,
                               'type': 'text/plain',
                               'value': '不归移民漫画系列：专业工作'}}],
 'feed': {},
 'headers': {},
 'namespaces': {},
 'version': 'rss'})

======================================================================
FAIL: test_001747 (__main__.TestCase)
./tests/illformed/chardet/euckr.xml: EUC-KR with no encoding information
----------------------------------------------------------------------
Traceback (most recent call last):
  File "/tmp/feedparser/tests/runtests.py", line 1191, in fn
    self.fail_unless_eval(xmlfile, eval_string)
  File "/tmp/feedparser/tests/runtests.py", line 177, in fail_unless_eval
    raise self.failureException(failure)
AssertionError: not eval(b"bozo and encoding == 'EUC-KR'") 
WITH env({'bozo': True,
 'bozo_exception': CharacterEncodingOverride('document declared as utf-8, but parsed as UHC'),
 'content-type': '',
 'encoding': 'UHC',
 'entries': [{'summary': 'TypeKey 시스템이 UTF-8로 돌아가는데, 거기서 한글로 된 닉네임을 정할 경우에, '
                         'EUC-KR로 된 무버블타입 블록에선 리다이렉트되어 전송되어오는 닉네임이 UTF라 당연히 '
                         '깨어져 나타난다. 실제 블록 등에서 사용하는 필명 내지는 닉네임은 한글로 사용하는 많은 분들도 '
                         '타입키에서의 닉네임은 이런 문제때문에 울며겨자먹기로 영어로 짓고 있다....',
              'summary_detail': {'base': '',
                                 'language': None,
                                 'type': 'text/html',
                                 'value': 'TypeKey 시스템이 UTF-8로 돌아가는데, 거기서 한글로 '
                                          '된 닉네임을 정할 경우에, EUC-KR로 된 무버블타입 블록에선 '
                                          '리다이렉트되어 전송되어오는 닉네임이 UTF라 당연히 깨어져 '
                                          '나타난다. 실제 블록 등에서 사용하는 필명 내지는 닉네임은 '
                                          '한글로 사용하는 많은 분들도 타입키에서의 닉네임은 이런 '
                                          '문제때문에 울며겨자먹기로 영어로 짓고 있다....'},
              'title': 'EUC-KR 에서 TypeKey 한글닉네임 표시하기',
              'title_detail': {'base': '',
                               'language': None,
                               'type': 'text/plain',
                               'value': 'EUC-KR 에서 TypeKey 한글닉네임 표시하기'}}],
 'feed': {},
 'headers': {},
 'namespaces': {},
 'version': 'rss'})

======================================================================
FAIL: test_001749 (__main__.TestCase)
./tests/illformed/chardet/big5.xml: Big5 with no encoding information
----------------------------------------------------------------------
Traceback (most recent call last):
  File "/tmp/feedparser/tests/runtests.py", line 1191, in fn
    self.fail_unless_eval(xmlfile, eval_string)
  File "/tmp/feedparser/tests/runtests.py", line 177, in fail_unless_eval
    raise self.failureException(failure)
AssertionError: not eval(b"bozo and encoding == 'Big5'") 
WITH env({'bozo': True,
 'bozo_exception': CharacterEncodingOverride('document declared as utf-8, but parsed as BIG5'),
 'content-type': '',
 'encoding': 'BIG5',
 'entries': [],
 'feed': {'title': '我希望??很容易?其翻?成中文，并有助于改??件。 感?您??本文。',
          'title_detail': {'base': '',
                           'language': None,
                           'type': 'text/plain',
                           'value': '我希望??很容易?其翻?成中文，并有助于改??件。 感?您??本文。'}},
 'headers': {},
 'namespaces': {'': 'http://www.w3.org/2005/Atom'},
 'version': 'atom10'})

----------------------------------------------------------------------
Ran 4354 tests in 4.892s

FAILED (failures=4)

maksverver · 2024-08-29T16:55:32Z

I ran into the same problem. Here's a snippet that can be used to show the differences between chardet and cchardet.

import glob

import cchardet
import chardet

# Print the encoding name each detector reports for every chardet test fixture.
for path in glob.glob('tests/illformed/chardet/*'):
    with open(path, 'rb') as f:
        data = f.read()
    enc1 = chardet.detect(data)['encoding']
    enc2 = cchardet.detect(data)['encoding']
    print('%-40s %-20s %-20s %s' % (path, enc1, enc2, 'same' if enc1 == enc2 else 'different'))

tests/illformed/chardet/koi8r.xml        KOI8-R               KOI8-R               same
tests/illformed/chardet/windows1255.xml  windows-1255         WINDOWS-1255         different
tests/illformed/chardet/gb2312.xml       GB2312               GB18030              different
tests/illformed/chardet/big5.xml         Big5                 BIG5                 different
tests/illformed/chardet/shiftjis.xml     SHIFT_JIS            SHIFT_JIS            same
tests/illformed/chardet/eucjp.xml        EUC-JP               EUC-JP               same
tests/illformed/chardet/euckr.xml        EUC-KR               UHC                  different
tests/illformed/chardet/tis620.xml       TIS-620              TIS-620              same

feedparser imports cchardet or chardet depending on what's installed (see https://github.com/kurtmckee/feedparser/blob/11990ea1d8791acc76c67781f1d2011daf0c3a99/feedparser/encodings.py#L37-L40). Although these libraries are mostly equivalent, they sometimes return different encoding names for the same input; both names are correct and lead to successful decoding. This change allows the tests to be run with either library by accepting both encoding names as correct.
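As a rough illustration of what "accepting both encoding names" could look like on the test side, here is a minimal sketch. It is not the actual change; `SUPERSETS` and `encoding_ok` are hypothetical names, and the table only covers the two superset detections seen in this thread.

```python
# Hypothetical table: expected name (lowercased) -> extra names to accept.
# The two entries mirror the superset detections reported in this issue.
SUPERSETS = {
    "gb2312": {"gb18030"},
    "euc-kr": {"uhc"},
}

def encoding_ok(expected, detected):
    """Accept the expected name in any case, or a known superset of it."""
    expected, detected = expected.lower(), detected.lower()
    return detected == expected or detected in SUPERSETS.get(expected, set())

# Bytes written in the narrower encoding decode identically with the superset
# codec (Python knows UHC as cp949), which is why either name is acceptable.
text_zh = "专业工作"
text_ko = "한글"
print(text_zh.encode("gb2312").decode("gb18030") == text_zh)
print(text_ko.encode("euc_kr").decode("cp949") == text_ko)
```

With a check like this, `encoding_ok('GB2312', 'GB18030')` and `encoding_ok('windows-1255', 'WINDOWS-1255')` both pass, while an unrelated detection such as `encoding_ok('EUC-KR', 'GB18030')` still fails.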
