Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

unicode decoding error while reading from stderr / stdout #638

Closed
peick opened this issue Jul 21, 2015 · 21 comments · Fixed by #1029
Closed

unicode decoding error while reading from stderr / stdout #638

peick opened this issue Jul 21, 2015 · 21 comments · Fixed by #1029

Comments

@peick
Copy link

peick commented Jul 21, 2015

Having a supervisor configuration that produces non-utf-8 data (e.g. binary) triggers an error in supervisor followed by a close of stdout / stderr for that process. This leads to a not responding process because data which should be send to stdout / stderr are blocked / stuck in the write operation.

[program:produce-unicode-error]
command=echo -e "\x88"

Log message of supervisor (formatted):

2015-07-21 15:56:04,934 CRIT uncaptured python exception, closing channel <POutputDispatcher at 26310992 for <Subprocess at 26313168 with name produce-unicode-error in state STARTING>  
    (stdout)> (<type 'exceptions.UnicodeDecodeError'>:'utf8' codec can't decode byte 0x88 in position 0: invalid start byte
    [/opt/python2.6/supervisor/supervisord.py|runforever|223] 
    [/opt/python2.6/supervisor/dispatchers.py|handle_read_event|260] 
    [/opt/python2.6/supervisor/options.py|readfd|1507]
    [/opt/python2.6/supervisor/compat.py|as_string|24]
    [/opt/python2.6/encodings/utf_8.py|decode|16])
2015-07-21 15:56:04,934 DEBG fd 6 closed, stopped monitoring <POutputDispatcher at 26310992 for <Subprocess at 26313168 with name produce-unicode-error in state STARTING> (stdout)>
@mnaberez
Copy link
Member

From the traceback, the source of the exception is the bytes/string conversion code added for Python 3 compatibility (even though you are running on it Python 2). This should only happen on the master branch. A release version of Supervisor like 3.1.3 won't have this problem.

@Lucretiel
Copy link

It looks like we're assuming UTF-8 for all data coming off of the child process:

def as_string(s): return s if isinstance(s, unicode) else s.decode('utf-8')

return as_string(data)

Would this be worth making configurable in the program options? Something like:

[program:emit-raw]
command: echo -e "\x88"
stdout-encoding: latin-1

The alternative would be to alter the supervisor internals to pass only bytes around. This would be a pretty drastic change, and would really only defer the problem, as many parts of supervisor (event handling, logging logs) require text strings.

@regebro
Copy link

regebro commented Feb 18, 2016

Indeed passing bytes around internally is a big and difficult change (I tried) and since we also need to write outputs the supervisor stdout etc, I think it's the wrong solution.

We need to make the encoding/decoding more error resistant (easy) and I do agree there should be a way to configure the stdout/stderr encodings, which isn't a trivial change either.

@mnaberez
Copy link
Member

Indeed passing bytes around internally is a big and difficult change (I tried)

Could you please push this branch to GitHub? I checked your fork and couldn't find it.

since we also need to write outputs the supervisor stdout etc, I think it's the wrong solution.

Subprocess output is only written to supervisord's stdout when running non-daemonized and at the log level debug. This is not the way Supervisor is typically used. If needed, we could check for this scenario specifically to avoid unnecessarily decoding/encoding subprocess output.

There is a large performance impact by all the bytes to strings conversions added by the Python 3 port. The port also has hacks that convert bytes to strings and back to bytes again. I suspect the only way to restore performance and remove the hacks is to make it work with bytes internally.

@regebro
Copy link

regebro commented Feb 18, 2016

To be clearer, I tried and failed. :-) I'll check if I have the branch left, but it's going to be a big failing mess in any case.

You are right about the performance impact, and that's why I thought it was worth a try. Perhaps I can make some changes to give a cleaner separation between processes and logging, then it might work. I'll look at that (but not this week).

@ztane
Copy link

ztane commented Jun 5, 2016

Encode-Decode should perhaps use the surrogate-escape explicitly

@evanunderscore
Copy link

@vsajip
Copy link
Contributor

vsajip commented Dec 27, 2016

I came across this via a recent call for help with porting to Python 3 (which was posted to Reddit).

From @regebro 's comment

Indeed passing bytes around internally is a big and difficult change

Isn't that the right thing to do, though?

Trying to understand ... it looks to me (w.r.t. the above traceback) like ServerOptions.readfd reads some number of bytes from a child process' output fd, whereas POutputDispatcher.handle_read_event just collects what it gets from readfd into a buffer. Even if utf-8 is the output encoding of the child process, what guarantee do we have that each call to readfd will produce a decodable sequence of bytes, rather than sometimes (due to the vagaries of I/O, buffering etc.) delivering a partial utf-8 sequence that can't be decoded until the remainder of the sequence is received in a subsequent readfd call?

Unlike the usual "Unicode sandwich" case we get in most text-centric applications, isn't Supervisor a candidate for a "reverse-Unicode sandwich" where everything internal is in bytes until it needs to be output to console or file, at which point it may need some decoding (based on child process output encoding) and re-encoding (based on the device being output toi)? Sorry if I've got the wrong end of the stick here, but it seems like doing an as_string in readfd is fundamentally wrong, and decoding can only reliably happen after all the output has been received (or else, you decode as much as you can, and leave incomplete sequences undecoded until they are completed - but that level of complication seems unwarranted) ... hoping someone can educate me.

@regebro
Copy link

regebro commented Dec 28, 2016

Supervisor most importantly needs to be consistent with what it passes around internally. Bytes is OK, and may be the fastest. That will mean that any logfile will be in the same character encoding as any output is. This may not be desirable, as you can in theory have different encodings from different processes all going to the same logfile.

To handle that you need to use unicode internally. But that also means we need to deal with when processes send invalid data.

So both sandwiches are quite soggy, but we should eat one of them. :-)

@vsajip
Copy link
Contributor

vsajip commented Dec 28, 2016

There may be a need to use Unicode internally, but it could be fairly localised, say to POutputDispatcher.record_output. Assume that all child process output is collected as bytes throughout the system, and kept together with an assumed encoding (perhaps a default, overridable via [command:XXX] configuration). Assuming that the output is line-oriented (as one might expect for a log), one could try splitting the bytes into lines and then doing something like line.decode(child_process_encoding).encode(desired_output_encoding) and outputting the result to the log. If the decoding process fails, either the child process is outputting corrupted data or the child_process_encoding is wrong. The line could then be output with a suitable rider and suitable escaping.

The reason for doing it line-by-line is just to avoid trying to be more granular and failing due to only part of the bytes in an encoding sequence being available.

Are there any other places where Unicode needs to be handled, other than e.g. reading configuration files, RPC interface and similar?

@mnaberez
Copy link
Member

Supervisor most importantly needs to be consistent with what it passes around internally. Bytes is OK, and may be the fastest. That will mean that any logfile will be in the same character encoding as any output is.

Using bytes internally is closest to what released versions of Supervisor on Python 2 do now. A subprocess can output whatever binary data it wants on its stdout and supervisord will write that verbatim to the log file. It's supposed to, anyway.

Users already complain on the issue tracker that logging is slow, so the additional overhead of decoding all output may not be tolerable. Decoding all subprocess output also probably necessitates more configuration (encoding per stream) and will corrupt output when that is configured incorrectly.

I'd be in favor of using bytes internally and avoiding decoding whenever possible. Some decoding is necessary. For the HTTP servers (XML-RPC and web interface), we need the subprocess output to be decoded so we can turn around and encode it as valid UTF-8. Performance there isn't sensitive like normal logging, since those are only active when a user makes a remote connection. Imperfect conversions on those is also probably more acceptable than on the actual log files.

This may not be desirable, as you can in theory have different encodings from different processes all going to the same logfile.

All processes spawned from supervisord in inherit its environment (LANG etc) so I'd hope the case of different processes outputting with different encodings is rare. I'd be fine putting something in the docs saying supervisord is not going to normalize all your logfile output to a common encoding. It doesn't do that now.

@mnaberez
Copy link
Member

The reason for doing it line-by-line is just to avoid trying to be more granular and failing due to only part of the bytes in an encoding sequence being available.

Line buffering doesn't work for supervisorctl fg mode (see this example). Based on issue reports, users actually use this feature.

@vsajip
Copy link
Contributor

vsajip commented Dec 29, 2016

A subprocess can output whatever binary data it wants on its stdout and supervisord will write that verbatim to the log file.

Suppose this were done for Python 3, too - just write binary streams to binary streams. If the processes all use a common encoding, then the resulting output should be consistent for whoever/whatever consumes the output log. If the encoding used by different processes isn't the same, you might get mojibake in the resulting output, but the onus is on the consumer of the log from those processes to sort it out when they consume it, without any decoding/encoding penalties incurred in situations where everything does utf-8, for example.

Line buffering doesn't work for supervisorctl fg mode

I see. Then the approach would be to try and decode, see at what offset the decoding failure occurred, and keep that part for decoding later when more of the data is available, passing on the successfully decoded part to the output.

What problems have been reported with the binary-stream -> binary-stream (without decode/encode) approach, or has that not been tried in Python 3 execution paths?

@mnaberez
Copy link
Member

What problems have been reported with the binary-stream -> binary-stream (without decode/encode) approach, or has that not been tried in Python 3 execution paths?

Nobody has tried it.

@vsajip
Copy link
Contributor

vsajip commented Dec 29, 2016

OK, I've had a go. See this branch comparison - I've not actually raised a pull request.

All tests pass in this branch on 2.7 and 3.5 on my system, FWIW.

Also, running the above produce-unicode-error program under Supervisor results in a child log file with the two bytes '\x88', '\x0A', as one might expect.

@evanunderscore
Copy link

@vsajip
Copy link
Contributor

vsajip commented Dec 30, 2016

Good catch, thanks. IMO the branch still needs a bit of work - b-prefixes won't work on older Pythons. What's the thinking about dropping support for Python < 2.6? To support 2.4 and 2.5 I would see the need to keep using as_bytes and as_string, perhaps aliased as _B() and _T() or similar. There are also a couple of sections marked TODO, at least one of which is a test/mock-related artifact. The use of BytesIO would also need revisiting for the older Pythons.

The use of the same machinery for child logs and Supervisor's own log also seems a bit clunky. The child logs are IMO data from Supervisor's POV and shouldn't go through the same path as Supervisor's own logging - what I've done there is a bit of a kludge.

@evanunderscore
Copy link

I'm assuming any of this work is for Supervisor 4.0, in which case you only need to worry about 2.6+.

Supervisor 4.0 (unreleased) is planned to work under Python 2 version 2.6 or greater and Python 3 version 3.2 or greater. See note above about the current state of Python 3 support.

There's an open question here about whether 3.2 can also be dropped, though right now the u prefix isn't used anywhere so that may not be necessary.

@mnaberez
Copy link
Member

mnaberez commented Jan 1, 2017

OK, I've had a go. See this branch comparison - I've not actually raised a pull request.

Thanks so much for your work on this. This is the first effort to solve the logging issues that I have seen. There are a number of other Python 3 bytes/strings issues tagged python 3 if you are up for it.

The tests are apparently not very good. Core features that people use all the time like eventlisteners and supervisorctl tail -f don't work on Python 3, but the tests pass anyway. However, I have added reproduce instructions to several of the Python 3 tickets.

What's the thinking about dropping support for Python < 2.6?

The other packages in this org (meld3 and superlance) now have release versions that support 2.6+ and 3.2+. The next major version of Supervisor is planned to support the same. Dropping 3.2 was suggested but there doesn't seem to be a compelling reason to do that yet.

The use of the same machinery for child logs and Supervisor's own log also seems a bit clunky

I think the origin for this is that the main log and the child logs have all the same options in the config file, e.g. rotation.

@vsajip
Copy link
Contributor

vsajip commented Jan 1, 2017

Thanks so much for your work on this.

You're welcome - Supervisor has been of good use to me and it's nice to be able to give something back ☺️

There are a number of other Python 3 bytes/strings issues tagged python 3 if you are up for it.

Indeed, I've already incorporated some patches/ideas from others in my branch. As I see it, this patch should go some way to addressing #565 (thanks to a patch by @regebro), #638 (this issue), #663, #835, #836 (thanks to #868 by @evanunderscore ), and I'll be looking at #664 soon as well (you can already transmit Unicode bytes using e.g. fg functionality, but I haven't tried making a non-ASCII command name yet). That's all of the issues tagged python 3 at the moment 😉

The tests are apparently not very good.

They're not that bad, but there are definitely gaps. There ought, for instance, to be some "end-to-end" tests that invoke supervisord and supervisorctl as subprocesses (the way a user would) but I see no references to subprocess in the tests.

The use of the same machinery for child logs and Supervisor's own log also seems a bit clunky

I think the origin for this is that the main log and the child logs have all the same options in the config file

Maybe we just need different handlers to avoid the kludging I've currently done in emit(). e.g. a specific ChildLogHandler. This needs a little further thought.

@evanunderscore
Copy link

For #664, I think I've solved that here, but haven't submitted a PR since most of the files I touched suffer from some other serious problems that I've talked about in #109.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Development

Successfully merging a pull request may close this issue.

7 participants