`codecs.encode` with `utf-*` encoding and errors returning `str` rejects surrogates blindly #127305

litlighilit · 2024-11-26T20:17:53Z

Bug report

Bug description:

For codecs.encode,
with a utf-* encoding and a custom errors handler that returns str,
if you pass characters that cannot be encoded in UTF (e.g. surrogates),
UnicodeEncodeError is simply raised instead of the expected (and documented) behavior,
where the returned str is appended.

import codecs

ERRORS_NAME = "returning non-ascii"

# a character that cannot be encoded via `utf-*`
BAD_UTF = "\uD800"  # the first high surrogate character

def register_repl_error(repl):
    def error_handle(exc: UnicodeEncodeError):
        return (repl, exc.end)

    codecs.register_error(ERRORS_NAME, error_handle)



def encode_surrogate(encoding, repl):
    register_repl_error(repl)
    max_enc_len = 9
    try:
        res = codecs.encode(BAD_UTF, encoding, ERRORS_NAME)
    except UnicodeEncodeError as err:
        reason = err.reason
        print(f"codecs.encode({BAD_UTF!r}, {encoding=:{max_enc_len}}) " +
              f"with custom errors {ERRORS_NAME} raises with {reason=}")

    else:
        print(f"codecs.encode({BAD_UTF!r}, {encoding=:{max_enc_len}}) " +
              f"with custom errors {ERRORS_NAME} returns {res}")


## emoji

NON_ASCII = "\U0001F605" # \N{smiling face with open mouth and cold sweat}: 😅

for i in ('8', '16', '32', '16-le', '16-be'):
    encode_surrogate("utf-" + i, NON_ASCII)


print('-'*3)


## cjk

NON_ASCII = "龍"  # loong in Chinese


### zh
for enc in ("gbk", "big5"):
    encode_surrogate(enc, NON_ASCII)

### jp
for enc in ("Shift_JIS", "EUC-JP"):
    encode_surrogate(enc, NON_ASCII)

Output:

codecs.encode('\ud800', encoding=utf-8    ) with custom errors returning non-ascii raises with reason='surrogates not allowed'
codecs.encode('\ud800', encoding=utf-16   ) with custom errors returning non-ascii raises with reason='surrogates not allowed'
codecs.encode('\ud800', encoding=utf-32   ) with custom errors returning non-ascii raises with reason='surrogates not allowed'
codecs.encode('\ud800', encoding=utf-16-le) with custom errors returning non-ascii raises with reason='surrogates not allowed'
codecs.encode('\ud800', encoding=utf-16-be) with custom errors returning non-ascii raises with reason='surrogates not allowed'
---
codecs.encode('\ud800', encoding=gbk      ) with custom errors returning non-ascii returns b'\xfd\x88'
codecs.encode('\ud800', encoding=big5     ) with custom errors returning non-ascii returns b'\xc0s'
codecs.encode('\ud800', encoding=Shift_JIS) with custom errors returning non-ascii returns b'\x97\xb4'
codecs.encode('\ud800', encoding=EUC-JP   ) with custom errors returning non-ascii returns b'\xce\xb6'
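
For comparison, the `codecs` documentation says a replacement may be either str or bytes, and that bytes are copied into the output as-is. The following is a minimal sketch of that case, not part of the original report: the handler name `BYTES_ERRORS_NAME` and the function `bytes_handler` are hypothetical, and the handler is only meant for the utf-8 codec since it returns bytes that are already UTF-8 encoded.

```python
import codecs

# Hypothetical handler name, chosen for this sketch only.
BYTES_ERRORS_NAME = "returning utf-8 bytes"

REPLACEMENT = "\U0001F605"  # the same emoji used in the report


def bytes_handler(exc):
    # Only encoding errors are handled here; anything else is re-raised.
    if not isinstance(exc, UnicodeEncodeError):
        raise exc
    # Return already-encoded bytes, so the encoder does not have to
    # encode a non-ASCII str replacement itself.
    return (REPLACEMENT.encode("utf-8"), exc.end)


codecs.register_error(BYTES_ERRORS_NAME, bytes_handler)

result = codecs.encode("a\ud800b", "utf-8", BYTES_ERRORS_NAME)
print(result)
```

This only sidesteps the problem for one fixed encoding; a handler that returns str should not need to know which codec called it.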

CPython versions tested on:

3.9, 3.11, 3.12, 3.13, 3.14

Operating systems tested on:

Linux, Windows


litlighilit · 2024-11-26T20:38:03Z

I was going to fix it, until I realized there is something important to discuss first: how should errors be handled within error_handle?

One option is to simply call encode again with 'strict' errors on the returned str, which,

~~however, may lead to an infinite loop (recursion), because `encode` will still invoke the error handler on the same `str` data~~

is the approach also used by the _multibytecodec module.

(EDIT: using strict as the fallback won't cause an infinite loop)
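
The strict fallback described above could be emulated in pure Python roughly as follows. This is only a sketch of the idea, not the C implementation: `encode_with_strict_fallback`, `make_handler` and the handler names are hypothetical, it assumes a BOM-less encoding such as utf-8, and it assumes the handler always moves the position forward.

```python
import codecs


def encode_with_strict_fallback(text, encoding, errors_name):
    """Emulate an encoder that encodes str replacements with 'strict'."""
    handler = codecs.lookup_error(errors_name)
    out = bytearray()
    pos = 0
    while pos < len(text):
        try:
            # Encode one character at a time (fine for BOM-less codecs).
            out += text[pos].encode(encoding, "strict")
            pos += 1
            continue
        except UnicodeEncodeError as err:
            # Rebuild the error so start/end refer to the whole input.
            exc = UnicodeEncodeError(encoding, text, pos, pos + 1, err.reason)
        repl, new_pos = handler(exc)
        if new_pos <= pos:
            raise ValueError("handler must advance in this sketch")
        if isinstance(repl, bytes):
            out += repl
        else:
            # 'strict' never calls the custom handler again, so a bad
            # replacement raises instead of recursing.
            out += repl.encode(encoding, "strict")
        pos = new_pos
    return bytes(out)


def make_handler(repl):
    def handler(exc):
        return (repl, exc.end)
    return handler


codecs.register_error("sketch: emoji", make_handler("\U0001F605"))
codecs.register_error("sketch: surrogate", make_handler("\udc00"))

encoded = encode_with_strict_fallback("a\ud800b", "utf-8", "sketch: emoji")
print(encoded)

try:
    encode_with_strict_fallback("\ud800", "utf-8", "sketch: surrogate")
except UnicodeEncodeError as err:
    print("propagated:", err.reason)
```

Because the replacement is encoded with 'strict', a replacement that is itself unencodable ends in a plain UnicodeEncodeError rather than another call to the handler.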

picnixz · 2024-11-27T01:17:14Z

I'll have a look at it in a few days. I think error handler should not have errors but I need to verify. If the error handler errs, then the exception should be propagated.
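
A quick way to check the propagation behavior mentioned above is a handler that always raises. This is a small sketch under assumptions: `HandlerFailure`, `failing_handler` and the name "always failing" are hypothetical and exist only for this example.

```python
import codecs


class HandlerFailure(Exception):
    """Hypothetical exception type used only in this sketch."""


def failing_handler(exc):
    # The handler itself errs instead of returning a replacement.
    raise HandlerFailure("error handler failed")


codecs.register_error("always failing", failing_handler)

try:
    codecs.encode("\ud800", "utf-8", "always failing")
except HandlerFailure as err:
    propagated = err
else:
    propagated = None

print(type(propagated).__name__)
```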

litlighilit · 2024-11-27T09:13:49Z

> I'll have a look at it in a few days. I think error handler should not have errors but I need to verify. If the error handler errs, then the exception should be propagated.

Right, and it does in fact behave as expected. As noted in my edit above, my previous opinion was incorrect.

Then let's return to the original issue:

I've figured out where the faulty code lies, and there are at least three places that need to be fixed.

However, I'm too busy over the next two days to focus on this patch. Sorry in advance, but I'll get to it later this week.

litlighilit added the type-bug An unexpected behavior, bug, or error label Nov 26, 2024

picnixz added interpreter-core (Objects, Python, Grammar, and Parser dirs) stdlib Python modules in the Lib dir labels Nov 27, 2024

picnixz added this to Codecs and encodings issues Nov 27, 2024

github-actions bot mentioned this issue Dec 1, 2024

Monthly issue metrics report hugovk/test#88

Closed
