Verbose output
Using the CLI, run normalizer -v ./my-file.txt and paste the result in here.
normalizer -v ../competitive-verifier/examples/tests/encoding/cp932.txt
2023-10-06 02:56:01,424 | Level 5 | override steps (5) and chunk_size (512) as content does not fit (485 byte(s) given) parameters.
2023-10-06 02:56:01,424 | Level 5 | Code page ascii does not fit given bytes sequence at ALL. 'ascii' codec can't decode byte 0x89 in position 0: ordinal not in range(128)
2023-10-06 02:56:01,424 | Level 5 | Code page utf_8 does not fit given bytes sequence at ALL. 'utf-8' codec can't decode byte 0x89 in position 0: invalid start byte
2023-10-06 02:56:01,425 | Level 5 | Code page big5 does not fit given bytes sequence at ALL. 'big5' codec can't decode byte 0x89 in position 0: illegal multibyte sequence
2023-10-06 02:56:01,425 | Level 5 | Code page big5hkscs does not fit given bytes sequence at ALL. 'big5hkscs' codec can't decode byte 0x89 in position 0: illegal multibyte sequence
2023-10-06 02:56:01,425 | Level 5 | cp037 was excluded because of initial chaos probing. Gave up 1 time(s). Computed mean chaos is 93.600000 %.
2023-10-06 02:56:01,426 | Level 5 | cp1006 was excluded because of initial chaos probing. Gave up 1 time(s). Computed mean chaos is 521.200000 %.
2023-10-06 02:56:01,426 | Level 5 | cp1026 is deemed too similar to code page cp037 and was consider unsuited already. Continuing!
2023-10-06 02:56:01,426 | Level 5 | cp1125 was excluded because of initial chaos probing. Gave up 1 time(s). Computed mean chaos is 115.300000 %.
2023-10-06 02:56:01,426 | Level 5 | cp1140 is deemed too similar to code page cp037 and was consider unsuited already. Continuing!
2023-10-06 02:56:01,426 | Level 5 | Code page cp1250 does not fit given bytes sequence at ALL. 'charmap' codec can't decode byte 0x83 in position 2: character maps to <undefined>
2023-10-06 02:56:01,426 | Level 5 | cp1251 was excluded because of initial chaos probing. Gave up 1 time(s). Computed mean chaos is 102.400000 %.
2023-10-06 02:56:01,427 | Level 5 | Code page cp1252 does not fit given bytes sequence at ALL. 'charmap' codec can't decode byte 0x90 in position 26: character maps to <undefined>
2023-10-06 02:56:01,427 | Level 5 | Code page cp1253 does not fit given bytes sequence at ALL. 'charmap' codec can't decode byte 0x90 in position 26: character maps to <undefined>
2023-10-06 02:56:01,427 | Level 5 | Code page cp1254 does not fit given bytes sequence at ALL. 'charmap' codec can't decode byte 0x90 in position 26: character maps to <undefined>
2023-10-06 02:56:01,427 | Level 5 | Code page cp1255 does not fit given bytes sequence at ALL. 'charmap' codec can't decode byte 0x90 in position 26: character maps to <undefined>
2023-10-06 02:56:01,427 | Level 5 | cp1256 was excluded because of initial chaos probing. Gave up 1 time(s). Computed mean chaos is 25.000000 %.
2023-10-06 02:56:01,427 | Level 5 | Code page cp1257 does not fit given bytes sequence at ALL. 'charmap' codec can't decode byte 0x83 in position 2: character maps to <undefined>
2023-10-06 02:56:01,428 | Level 5 | Code page cp1258 does not fit given bytes sequence at ALL. 'charmap' codec can't decode byte 0x90 in position 26: character maps to <undefined>
2023-10-06 02:56:01,428 | Level 5 | cp273 is deemed too similar to code page cp037 and was consider unsuited already. Continuing!
2023-10-06 02:56:01,428 | Level 5 | Code page cp424 does not fit given bytes sequence at ALL. 'charmap' codec can't decode byte 0x76 in position 54: character maps to <undefined>
2023-10-06 02:56:01,428 | Level 5 | cp437 was excluded because of initial chaos probing. Gave up 1 time(s). Computed mean chaos is 107.800000 %.
2023-10-06 02:56:01,428 | Level 5 | cp500 is deemed too similar to code page cp037 and was consider unsuited already. Continuing!
2023-10-06 02:56:01,428 | Level 5 | cp720 was excluded because of initial chaos probing. Gave up 1 time(s). Computed mean chaos is 156.500000 %.
2023-10-06 02:56:01,429 | Level 5 | cp737 was excluded because of initial chaos probing. Gave up 1 time(s). Computed mean chaos is 115.300000 %.
2023-10-06 02:56:01,429 | Level 5 | cp775 was excluded because of initial chaos probing. Gave up 1 time(s). Computed mean chaos is 25.100000 %.
2023-10-06 02:56:01,429 | Level 5 | cp850 is deemed too similar to code page cp437 and was consider unsuited already. Continuing!
2023-10-06 02:56:01,429 | Level 5 | cp852 was excluded because of initial chaos probing. Gave up 1 time(s). Computed mean chaos is 104.400000 %.
2023-10-06 02:56:01,430 | Level 5 | cp855 was excluded because of initial chaos probing. Gave up 1 time(s). Computed mean chaos is 127.500000 %.
2023-10-06 02:56:01,430 | Level 5 | Code page cp856 does not fit given bytes sequence at ALL. 'charmap' codec can't decode byte 0xe1 in position 27: character maps to <undefined>
2023-10-06 02:56:01,430 | Level 5 | cp857 was excluded because of initial chaos probing. Gave up 1 time(s). Computed mean chaos is 107.800000 %.
2023-10-06 02:56:01,430 | Level 5 | cp858 is deemed too similar to code page cp437 and was consider unsuited already. Continuing!
2023-10-06 02:56:01,430 | Level 5 | cp860 is deemed too similar to code page cp437 and was consider unsuited already. Continuing!
2023-10-06 02:56:01,431 | Level 5 | cp861 is deemed too similar to code page cp437 and was consider unsuited already. Continuing!
2023-10-06 02:56:01,431 | Level 5 | cp862 is deemed too similar to code page cp437 and was consider unsuited already. Continuing!
2023-10-06 02:56:01,431 | Level 5 | cp863 is deemed too similar to code page cp437 and was consider unsuited already. Continuing!
2023-10-06 02:56:01,431 | Level 5 | Code page cp864 does not fit given bytes sequence at ALL. 'charmap' codec can't decode byte 0xa6 in position 325: character maps to <undefined>
2023-10-06 02:56:01,431 | Level 5 | cp865 is deemed too similar to code page cp437 and was consider unsuited already. Continuing!
2023-10-06 02:56:01,431 | Level 5 | cp866 is deemed too similar to code page cp1125 and was consider unsuited already. Continuing!
2023-10-06 02:56:01,432 | Level 5 | Code page cp869 does not fit given bytes sequence at ALL. 'charmap' codec can't decode byte 0x83 in position 2: character maps to <undefined>
2023-10-06 02:56:01,432 | Level 5 | Code page cp874 does not fit given bytes sequence at ALL. 'charmap' codec can't decode byte 0x89 in position 0: character maps to <undefined>
2023-10-06 02:56:01,432 | Level 5 | cp875 was excluded because of initial chaos probing. Gave up 1 time(s). Computed mean chaos is 75.200000 %.
2023-10-06 02:56:01,432 | Level 5 | Code page cp932 is a multi byte encoding table and it appear that at least one character was encoded using n-bytes.
2023-10-06 02:56:01,433 | Level 5 | cp932 was excluded because of initial chaos probing. Gave up 1 time(s). Computed mean chaos is 48.600000 %.
2023-10-06 02:56:01,433 | Level 5 | Code page cp949 does not fit given bytes sequence at ALL. 'cp949' codec can't decode byte 0x83 in position 6: illegal multibyte sequence
2023-10-06 02:56:01,434 | Level 5 | Code page cp950 does not fit given bytes sequence at ALL. 'cp950' codec can't decode byte 0x89 in position 0: illegal multibyte sequence
2023-10-06 02:56:01,434 | Level 5 | Code page euc_jis_2004 does not fit given bytes sequence at ALL. 'euc_jis_2004' codec can't decode byte 0x89 in position 0: illegal multibyte sequence
2023-10-06 02:56:01,434 | Level 5 | Code page euc_jisx0213 does not fit given bytes sequence at ALL. 'euc_jisx0213' codec can't decode byte 0x89 in position 0: illegal multibyte sequence
2023-10-06 02:56:01,434 | Level 5 | Code page euc_jp does not fit given bytes sequence at ALL. 'euc_jp' codec can't decode byte 0x89 in position 0: illegal multibyte sequence
2023-10-06 02:56:01,434 | Level 5 | Code page euc_kr does not fit given bytes sequence at ALL. 'euc_kr' codec can't decode byte 0x89 in position 0: illegal multibyte sequence
2023-10-06 02:56:01,434 | Level 5 | Code page gb18030 does not fit given bytes sequence at ALL. 'gb18030' codec can't decode byte 0xc9 in position 247: illegal multibyte sequence
2023-10-06 02:56:01,435 | Level 5 | Code page gb2312 does not fit given bytes sequence at ALL. 'gb2312' codec can't decode byte 0x89 in position 0: illegal multibyte sequence
2023-10-06 02:56:01,435 | Level 5 | Code page gbk does not fit given bytes sequence at ALL. 'gbk' codec can't decode byte 0xc9 in position 247: illegal multibyte sequence
2023-10-06 02:56:01,435 | Level 5 | hp_roman8 was excluded because of initial chaos probing. Gave up 1 time(s). Computed mean chaos is 484.800000 %.
2023-10-06 02:56:01,435 | Level 5 | Code page hz does not fit given bytes sequence at ALL. 'hz' codec can't decode byte 0x89 in position 0: illegal multibyte sequence
2023-10-06 02:56:01,435 | Level 5 | Code page iso2022_jp does not fit given bytes sequence at ALL. 'iso2022_jp' codec can't decode byte 0x89 in position 0: illegal multibyte sequence
2023-10-06 02:56:01,435 | Level 5 | Code page iso2022_jp_1 does not fit given bytes sequence at ALL. 'iso2022_jp_1' codec can't decode byte 0x89 in position 0: illegal multibyte sequence
2023-10-06 02:56:01,436 | Level 5 | Code page iso2022_jp_2 does not fit given bytes sequence at ALL. 'iso2022_jp_2' codec can't decode byte 0x89 in position 0: illegal multibyte sequence
2023-10-06 02:56:01,436 | Level 5 | Code page iso2022_jp_2004 does not fit given bytes sequence at ALL. 'iso2022_jp_2004' codec can't decode byte 0x89 in position 0: illegal multibyte sequence
2023-10-06 02:56:01,436 | Level 5 | Code page iso2022_jp_3 does not fit given bytes sequence at ALL. 'iso2022_jp_3' codec can't decode byte 0x89 in position 0: illegal multibyte sequence
2023-10-06 02:56:01,436 | Level 5 | Code page iso2022_jp_ext does not fit given bytes sequence at ALL. 'iso2022_jp_ext' codec can't decode byte 0x89 in position 0: illegal multibyte sequence
2023-10-06 02:56:01,436 | Level 5 | Code page iso2022_kr does not fit given bytes sequence at ALL. 'iso2022_kr' codec can't decode byte 0x89 in position 0: illegal multibyte sequence
2023-10-06 02:56:01,437 | Level 5 | iso8859_10 was excluded because of initial chaos probing. Gave up 1 time(s). Computed mean chaos is 484.800000 %.
2023-10-06 02:56:01,437 | Level 5 | Code page iso8859_11 does not fit given bytes sequence at ALL. 'charmap' codec can't decode byte 0xfc in position 195: character maps to <undefined>
2023-10-06 02:56:01,437 | Level 5 | iso8859_13 was excluded because of initial chaos probing. Gave up 1 time(s). Computed mean chaos is 484.800000 %.
2023-10-06 02:56:01,437 | Level 5 | iso8859_14 is deemed too similar to code page iso8859_10 and was consider unsuited already. Continuing!
2023-10-06 02:56:01,437 | Level 5 | iso8859_15 is deemed too similar to code page iso8859_10 and was consider unsuited already. Continuing!
2023-10-06 02:56:01,437 | Level 5 | iso8859_16 was excluded because of initial chaos probing. Gave up 1 time(s). Computed mean chaos is 484.800000 %.
2023-10-06 02:56:01,438 | Level 5 | iso8859_2 is deemed too similar to code page iso8859_16 and was consider unsuited already. Continuing!
2023-10-06 02:56:01,438 | Level 5 | Code page iso8859_3 does not fit given bytes sequence at ALL. 'charmap' codec can't decode byte 0xae in position 264: character maps to <undefined>
2023-10-06 02:56:01,438 | Level 5 | iso8859_4 is deemed too similar to code page iso8859_10 and was consider unsuited already. Continuing!
2023-10-06 02:56:01,438 | Level 5 | iso8859_5 was excluded because of initial chaos probing. Gave up 1 time(s). Computed mean chaos is 521.200000 %.
2023-10-06 02:56:01,438 | Level 5 | Code page iso8859_6 does not fit given bytes sequence at ALL. 'charmap' codec can't decode byte 0xfa in position 122: character maps to <undefined>
2023-10-06 02:56:01,438 | Level 5 | Code page iso8859_7 does not fit given bytes sequence at ALL. 'charmap' codec can't decode byte 0xae in position 264: character maps to <undefined>
2023-10-06 02:56:01,439 | Level 5 | Code page iso8859_8 does not fit given bytes sequence at ALL. 'charmap' codec can't decode byte 0xc4 in position 33: character maps to <undefined>
2023-10-06 02:56:01,439 | Level 5 | iso8859_9 is deemed too similar to code page iso8859_10 and was consider unsuited already. Continuing!
2023-10-06 02:56:01,439 | Level 5 | Code page johab does not fit given bytes sequence at ALL. 'johab' codec can't decode byte 0x83 in position 2: illegal multibyte sequence
2023-10-06 02:56:01,439 | Level 5 | koi8_r was excluded because of initial chaos probing. Gave up 1 time(s). Computed mean chaos is 251.600000 %.
2023-10-06 02:56:01,439 | Level 5 | Code page koi8_t does not fit given bytes sequence at ALL. 'charmap' codec can't decode byte 0x8f in position 36: character maps to <undefined>
2023-10-06 02:56:01,439 | Level 5 | koi8_u was excluded because of initial chaos probing. Gave up 1 time(s). Computed mean chaos is 251.600000 %.
2023-10-06 02:56:01,440 | Level 5 | kz1048 is deemed too similar to code page cp1251 and was consider unsuited already. Continuing!
2023-10-06 02:56:01,440 | Level 5 | latin_1 is deemed too similar to code page iso8859_10 and was consider unsuited already. Continuing!
2023-10-06 02:56:01,440 | Level 5 | mac_cyrillic was excluded because of initial chaos probing. Gave up 1 time(s). Computed mean chaos is 115.300000 %.
2023-10-06 02:56:01,440 | Level 5 | mac_greek was excluded because of initial chaos probing. Gave up 1 time(s). Computed mean chaos is 243.600000 %.
2023-10-06 02:56:01,440 | Level 5 | mac_iceland was excluded because of initial chaos probing. Gave up 1 time(s). Computed mean chaos is 243.500000 %.
2023-10-06 02:56:01,441 | Level 5 | mac_latin2 was excluded because of initial chaos probing. Gave up 1 time(s). Computed mean chaos is 200.100000 %.
2023-10-06 02:56:01,441 | Level 5 | mac_roman is deemed too similar to code page mac_iceland and was consider unsuited already. Continuing!
2023-10-06 02:56:01,441 | Level 5 | mac_turkish is deemed too similar to code page mac_iceland and was consider unsuited already. Continuing!
2023-10-06 02:56:01,441 | Level 5 | ptcp154 is deemed too similar to code page cp1251 and was consider unsuited already. Continuing!
2023-10-06 02:56:01,441 | Level 5 | Code page shift_jis is a multi byte encoding table and it appear that at least one character was encoded using n-bytes.
2023-10-06 02:56:01,441 | Level 5 | shift_jis was excluded because of initial chaos probing. Gave up 1 time(s). Computed mean chaos is 48.600000 %.
2023-10-06 02:56:01,442 | Level 5 | Code page shift_jis_2004 is a multi byte encoding table and it appear that at least one character was encoded using n-bytes.
2023-10-06 02:56:01,442 | Level 5 | shift_jis_2004 was excluded because of initial chaos probing. Gave up 1 time(s). Computed mean chaos is 48.600000 %.
2023-10-06 02:56:01,442 | Level 5 | Code page shift_jisx0213 is a multi byte encoding table and it appear that at least one character was encoded using n-bytes.
2023-10-06 02:56:01,442 | Level 5 | shift_jisx0213 was excluded because of initial chaos probing. Gave up 1 time(s). Computed mean chaos is 48.600000 %.
2023-10-06 02:56:01,442 | Level 5 | Code page tis_620 does not fit given bytes sequence at ALL. 'charmap' codec can't decode byte 0xfc in position 195: character maps to <undefined>
2023-10-06 02:56:01,442 | Level 5 | Encoding utf_16 won't be tested as-is because it require a BOM. Will try some sub-encoder LE/BE.
2023-10-06 02:56:01,442 | Level 5 | Code page utf_16_be does not fit given bytes sequence at ALL. 'utf-16-be' codec can't decode bytes in position 258-259: illegal encoding
2023-10-06 02:56:01,442 | Level 5 | Code page utf_16_le does not fit given bytes sequence at ALL. 'utf-16-le' codec can't decode bytes in position 150-151: illegal UTF-16 surrogate
2023-10-06 02:56:01,442 | Level 5 | Encoding utf_32 won't be tested as-is because it require a BOM. Will try some sub-encoder LE/BE.
2023-10-06 02:56:01,443 | Level 5 | Code page utf_32_be does not fit given bytes sequence at ALL. 'utf-32-be' codec can't decode bytes in position 0-3: code point not in range(0x110000)
2023-10-06 02:56:01,443 | Level 5 | Code page utf_32_le does not fit given bytes sequence at ALL. 'utf-32-le' codec can't decode bytes in position 0-3: code point not in range(0x110000)
2023-10-06 02:56:01,443 | Level 5 | Encoding utf_7 won't be tested as-is because detection is unreliable without BOM/SIG.
2023-10-06 02:56:01,443 | DEBUG | Encoding detection: Unable to determine any suitable charset.
Unable to identify originating encoding for "../competitive-verifier/examples/tests/encoding/cp932.txt". Maybe try increasing maximum amount of chaos.
{
"path": "/home/kzrnm/workspace/competitive-verifier/examples/tests/encoding/cp932.txt",
"encoding": null,
"encoding_aliases": [],
"alternative_encodings": [],
"language": "Unknown",
"alphabets": [],
"has_sig_or_bom": false,
"chaos": 1.0,
"coherence": 0.0,
"unicode_path": null,
"is_preferred": true
}
Expected encoding
CP932
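The expected encoding can be checked independently of the detector. The sketch below uses only the Python standard library and is not charset_normalizer's implementation: `strict_decode_report` is a hypothetical helper, and the sample string is an assumed stand-in for the attached file. It roughly mirrors the "does not fit given bytes sequence at ALL" step in the log, where a strict decode is attempted for each candidate.

```python
from typing import Dict, List, Optional


def strict_decode_report(data: bytes, candidates: List[str]) -> Dict[str, Optional[str]]:
    """Hypothetical helper: try a strict decode with each candidate codec.

    Maps each codec name to None when the bytes decode cleanly,
    or to the decoder's failure reason otherwise.
    """
    report: Dict[str, Optional[str]] = {}
    for name in candidates:
        try:
            data.decode(name, errors="strict")
            report[name] = None
        except UnicodeDecodeError as exc:
            report[name] = exc.reason
    return report


# Assumed sample (not the attached file): kanji followed by half-width kana.
sample = "雨ﾆﾓﾏｹｽﾞ".encode("cp932")
report = strict_decode_report(sample, ["ascii", "utf_8", "cp932"])
for name, reason in report.items():
    print(name, "ok" if reason is None else "failed: " + reason)
```

A clean strict decode is only a necessary condition: single-byte code pages such as latin_1 accept any byte sequence, which is why the detector goes on to score chaos and coherence.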
Desktop (please complete the following information):
OS: Ubuntu (WSL on Windows 11)
Python version: 3.10
Package version: 3.3
Additional context
If all the characters are kanji or full-width kana, charset_normalizer detects the encoding correctly.
About hankaku (half-width) kana: https://en.wikipedia.org/wiki/Half-width_kana
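To illustrate the difference at the byte level, here is a minimal sketch using only the standard library. In CP932, kanji and full-width kana take two bytes each, while half-width kana take a single byte in the range 0xA1 to 0xDF. The helper name `cp932_bytes_per_char` and the three sample characters are assumptions made for illustration; nothing here claims to explain why detection fails.

```python
import unicodedata
from typing import List, Tuple


def cp932_bytes_per_char(text: str) -> List[Tuple[str, str, int]]:
    """Hypothetical helper: for each character, return
    (character, East Asian Width class, number of bytes in CP932)."""
    return [
        (ch, unicodedata.east_asian_width(ch), len(ch.encode("cp932")))
        for ch in text
    ]


# One kanji, one full-width katakana, one half-width katakana.
rows = cp932_bytes_per_char("雨アｱ")
for ch, width, size in rows:
    print(ch, width, size)
```

The width class "W" marks wide (full-width) characters and "H" marks half-width ones, so the same helper can be used to check whether a given text contains half-width kana at all.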
Notice
I hereby announce that my raw input is not :
Provide the file
https://github.com/competitive-verifier/competitive-verifier/blob/89102a878a9081f72bd3450065bcf7d9fd536a5f/examples/tests/encoding/cp932.txt
For example, a text made up only of kanji and full-width kana, such as the one linked below, is detected correctly.
https://en.wikipedia.org/wiki/Ame_ni_mo_makezu