Test failures when cchardet-2.1.7 and chardet are both installed #318

Open
mgorny opened this issue Jul 2, 2022 · 1 comment
mgorny commented Jul 2, 2022

When cchardet-2.1.7 and chardet-5.0.0 are both installed, the following tests fail.

From what I can see, two of them fail because of encoding name mismatches (the expected name is mixed-case, while the detected name is uppercase), and the other two are detected as a superset of the expected encoding (i.e. EUC-KR as UHC, and GB2312 as GB18030).

...F...FF.F.....................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................
................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................
..................................................................................................................................................................................................................................................................................................................................................................
======================================================================
FAIL: test_001742 (__main__.TestCase)
./tests/illformed/chardet/windows1255.xml: windows-1255 with no encoding information
----------------------------------------------------------------------
Traceback (most recent call last):
  File "/tmp/feedparser/tests/runtests.py", line 1191, in fn
    self.fail_unless_eval(xmlfile, eval_string)
  File "/tmp/feedparser/tests/runtests.py", line 177, in fail_unless_eval
    raise self.failureException(failure)
AssertionError: not eval(b"bozo and encoding == 'windows-1255'") 
WITH env({'bozo': True,
 'bozo_exception': CharacterEncodingOverride('document declared as utf-8, but parsed as WINDOWS-1255'),
 'content-type': '',
 'encoding': 'WINDOWS-1255',
 'entries': [{'summary': 'האם תדפיס נייר של אתר אינטרנט שמוצג על מסך משתמש הוא '
                         'העתק נאמן למקור של אתר האינטרנט? רבים יגידו שכן, '
                         'ולפעמים גם בתי המשפט יצטרפו אליהם שיקבלו פלט מאתר '
                         'אינטרנט כראיה קבילה. אבל, זה ממש לא כך. ויש אפילו '
                         'הוכחה מדהימה.',
              'summary_detail': {'base': '',
                                 'language': None,
                                 'type': 'text/html',
                                 'value': 'האם תדפיס נייר של אתר אינטרנט שמוצג '
                                          'על מסך משתמש הוא העתק נאמן למקור של '
                                          'אתר האינטרנט? רבים יגידו שכן, '
                                          'ולפעמים גם בתי המשפט יצטרפו אליהם '
                                          'שיקבלו פלט מאתר אינטרנט כראיה '
                                          'קבילה. אבל, זה ממש לא כך. ויש אפילו '
                                          'הוכחה מדהימה.'}}],
 'feed': {},
 'headers': {},
 'namespaces': {},
 'version': 'rss'})

======================================================================
FAIL: test_001746 (__main__.TestCase)
./tests/illformed/chardet/gb2312.xml: GB2312 with no encoding information
----------------------------------------------------------------------
Traceback (most recent call last):
  File "/tmp/feedparser/tests/runtests.py", line 1191, in fn
    self.fail_unless_eval(xmlfile, eval_string)
  File "/tmp/feedparser/tests/runtests.py", line 177, in fail_unless_eval
    raise self.failureException(failure)
AssertionError: not eval(b"bozo and encoding == 'GB2312'") 
WITH env({'bozo': True,
 'bozo_exception': CharacterEncodingOverride('document declared as utf-8, but parsed as GB18030'),
 'content-type': '',
 'encoding': 'GB18030',
 'entries': [{'title': '不归移民漫画系列:专业工作',
              'title_detail': {'base': '',
                               'language': None,
                               'type': 'text/plain',
                               'value': '不归移民漫画系列:专业工作'}}],
 'feed': {},
 'headers': {},
 'namespaces': {},
 'version': 'rss'})

======================================================================
FAIL: test_001747 (__main__.TestCase)
./tests/illformed/chardet/euckr.xml: EUC-KR with no encoding information
----------------------------------------------------------------------
Traceback (most recent call last):
  File "/tmp/feedparser/tests/runtests.py", line 1191, in fn
    self.fail_unless_eval(xmlfile, eval_string)
  File "/tmp/feedparser/tests/runtests.py", line 177, in fail_unless_eval
    raise self.failureException(failure)
AssertionError: not eval(b"bozo and encoding == 'EUC-KR'") 
WITH env({'bozo': True,
 'bozo_exception': CharacterEncodingOverride('document declared as utf-8, but parsed as UHC'),
 'content-type': '',
 'encoding': 'UHC',
 'entries': [{'summary': 'TypeKey 시스템이 UTF-8로 돌아가는데, 거기서 한글로 된 닉네임을 정할 경우에, '
                         'EUC-KR로 된 무버블타입 블록에선 리다이렉트되어 전송되어오는 닉네임이 UTF라 당연히 '
                         '깨어져 나타난다. 실제 블록 등에서 사용하는 필명 내지는 닉네임은 한글로 사용하는 많은 분들도 '
                         '타입키에서의 닉네임은 이런 문제때문에 울며겨자먹기로 영어로 짓고 있다....',
              'summary_detail': {'base': '',
                                 'language': None,
                                 'type': 'text/html',
                                 'value': 'TypeKey 시스템이 UTF-8로 돌아가는데, 거기서 한글로 '
                                          '된 닉네임을 정할 경우에, EUC-KR로 된 무버블타입 블록에선 '
                                          '리다이렉트되어 전송되어오는 닉네임이 UTF라 당연히 깨어져 '
                                          '나타난다. 실제 블록 등에서 사용하는 필명 내지는 닉네임은 '
                                          '한글로 사용하는 많은 분들도 타입키에서의 닉네임은 이런 '
                                          '문제때문에 울며겨자먹기로 영어로 짓고 있다....'},
              'title': 'EUC-KR 에서 TypeKey 한글닉네임 표시하기',
              'title_detail': {'base': '',
                               'language': None,
                               'type': 'text/plain',
                               'value': 'EUC-KR 에서 TypeKey 한글닉네임 표시하기'}}],
 'feed': {},
 'headers': {},
 'namespaces': {},
 'version': 'rss'})

======================================================================
FAIL: test_001749 (__main__.TestCase)
./tests/illformed/chardet/big5.xml: Big5 with no encoding information
----------------------------------------------------------------------
Traceback (most recent call last):
  File "/tmp/feedparser/tests/runtests.py", line 1191, in fn
    self.fail_unless_eval(xmlfile, eval_string)
  File "/tmp/feedparser/tests/runtests.py", line 177, in fail_unless_eval
    raise self.failureException(failure)
AssertionError: not eval(b"bozo and encoding == 'Big5'") 
WITH env({'bozo': True,
 'bozo_exception': CharacterEncodingOverride('document declared as utf-8, but parsed as BIG5'),
 'content-type': '',
 'encoding': 'BIG5',
 'entries': [],
 'feed': {'title': '我希望??很容易?其翻?成中文,并有助于改??件。 感?您??本文。',
          'title_detail': {'base': '',
                           'language': None,
                           'type': 'text/plain',
                           'value': '我希望??很容易?其翻?成中文,并有助于改??件。 感?您??本文。'}},
 'headers': {},
 'namespaces': {'': 'http://www.w3.org/2005/Atom'},
 'version': 'atom10'})

----------------------------------------------------------------------
Ran 4354 tests in 4.892s

FAILED (failures=4)
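
One quick way to tell the two kinds of failure apart is to normalize both names through Python's codec registry. This is only a sketch: `same_codec` is a hypothetical helper, not part of feedparser or its test suite. `codecs.lookup()` resolves aliases case-insensitively, so names that differ only in case should map to the same codec, while a superset encoding should resolve to a different one.

```python
import codecs


def same_codec(expected, detected):
    """Return True if both names resolve to the same Python codec.

    Hypothetical helper for illustration; not part of feedparser.
    """
    try:
        return codecs.lookup(expected).name == codecs.lookup(detected).name
    except LookupError:
        # At least one name is unknown to Python's codec registry.
        return False


# (name expected by the test, name reported in the failure output)
pairs = [
    ('windows-1255', 'WINDOWS-1255'),
    ('Big5', 'BIG5'),
    ('GB2312', 'GB18030'),
    ('EUC-KR', 'UHC'),
]
for expected, detected in pairs:
    print(expected, detected, same_codec(expected, detected))
```

With CPython's bundled codecs, the first two pairs should compare equal (a pure naming difference) and the last two should not (a genuinely different, wider codec).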
@maksverver
Contributor

I ran into the same problem. Here's a snippet that can be used to show the differences between chardet and cchardet.

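import glob

import cchardet
import chardet

# Print the encoding name each library detects for every chardet test file.
for path in glob.glob('tests/illformed/chardet/*'):
    # Read the raw bytes; the context manager closes the file promptly.
    with open(path, 'rb') as f:
        data = f.read()
    enc1 = chardet.detect(data)['encoding']
    enc2 = cchardet.detect(data)['encoding']
    print('%-40s %-20s %-20s %s' % (path, enc1, enc2, 'same' if enc1 == enc2 else 'different'))

Output (path, chardet result, cchardet result):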
tests/illformed/chardet/koi8r.xml        KOI8-R               KOI8-R               same
tests/illformed/chardet/windows1255.xml  windows-1255         WINDOWS-1255         different
tests/illformed/chardet/gb2312.xml       GB2312               GB18030              different
tests/illformed/chardet/big5.xml         Big5                 BIG5                 different
tests/illformed/chardet/shiftjis.xml     SHIFT_JIS            SHIFT_JIS            same
tests/illformed/chardet/eucjp.xml        EUC-JP               EUC-JP               same
tests/illformed/chardet/euckr.xml        EUC-KR               UHC                  different
tests/illformed/chardet/tis620.xml       TIS-620              TIS-620              same
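
The "different" rows for GB2312 and EUC-KR look like a naming disagreement rather than a decoding problem, since GB18030 and UHC (which Python exposes as `cp949`) are supersets of those encodings. A minimal sketch of that claim, using short sample strings rather than the actual test files; `decodes_identically` is a hypothetical helper name:

```python
# Text that is valid in the narrower encoding should decode to the same
# string when the superset codec is used instead.
samples = [
    ('专业工作', 'gb2312', 'gb18030'),
    ('한글', 'euc_kr', 'cp949'),  # cp949 is Python's codec name for UHC
]


def decodes_identically(text, narrow, superset):
    """Encode with the narrow codec, then decode with both and compare."""
    raw = text.encode(narrow)
    return raw.decode(narrow) == raw.decode(superset)


for text, narrow, superset in samples:
    print(narrow, superset, decodes_identically(text, narrow, superset))
```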

maksverver added a commit to maksverver/feedparser that referenced this issue Aug 29, 2024
feedparser imports cchardet or chardet depending on what's installed:
https://github.com/kurtmckee/feedparser/blob/11990ea1d8791acc76c67781f1d2011daf0c3a99/feedparser/encodings.py#L37-L40

Although these libraries are mostly equivalent, they return slightly different
encoding strings, even though both are correct and lead to successful decoding.
This change allows the tests to be run with either library by accepting both
encoding names as correct.
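
One possible shape for "accepting both encoding names" is sketched below. It is an illustration only, not the code from the referenced commit: `ACCEPTED_ALTERNATIVES` and `encoding_matches` are hypothetical names, and the alternatives listed are just the ones reported in this issue.

```python
# Hypothetical mapping: expected name (lowercased) -> superset names that
# cchardet was reported to return instead.
ACCEPTED_ALTERNATIVES = {
    'gb2312': {'gb18030'},
    'euc-kr': {'uhc'},
}


def encoding_matches(expected, detected):
    """Case-insensitive comparison that also accepts known superset names."""
    expected = expected.lower()
    detected = detected.lower()
    if detected == expected:
        return True
    return detected in ACCEPTED_ALTERNATIVES.get(expected, set())
```

A test could then assert `encoding_matches('EUC-KR', encoding)` instead of `encoding == 'EUC-KR'`, which would pass for either library's result while still rejecting unrelated encodings.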

cchardet detects slightly different encodings from chardet,