unicode decoding error while reading from stderr / stdout #638
Comments
From the traceback, the source of the exception is the bytes/string conversion code added for Python 3 compatibility (even though you are running it on Python 2). This should only happen on the master branch. A release version of Supervisor like 3.1.3 won't have this problem.
It looks like we're assuming UTF-8 for all data coming off of the child process: supervisor/supervisor/compat.py Line 22 in 7dcbc2c
supervisor/supervisor/options.py Line 1507 in 669ec85
Would this be worth making configurable in the program options? Something like:
The alternative would be to alter the supervisor internals to pass only |
Indeed, passing bytes around internally is a big and difficult change (I tried), and since we also need to write output to supervisor's stdout etc., I think it's the wrong solution. We need to make the encoding/decoding more error-resistant (easy), and I do agree there should be a way to configure the stdout/stderr encodings, which isn't a trivial change either.
Could you please push this branch to GitHub? I checked your fork and couldn't find it.
Subprocess output is only written to supervisord's stdout when running non-daemonized and at the debug log level. This is not the way Supervisor is typically used. If needed, we could check for this scenario specifically to avoid unnecessarily decoding/encoding subprocess output. All the bytes-to-strings conversions added by the Python 3 port have a large performance impact. The port also has hacks that convert bytes to strings and back to bytes again. I suspect the only way to restore performance and remove the hacks is to make it work with bytes internally.
To be clearer, I tried and failed. :-) I'll check if I have the branch left, but it's going to be a big failing mess in any case. You are right about the performance impact, and that's why I thought it was worth a try. Perhaps I can make some changes to give a cleaner separation between processes and logging, then it might work. I'll look at that (but not this week). |
Encoding and decoding should perhaps use the surrogateescape error handler explicitly.
Link to relevant PEP for surrogateescape: https://www.python.org/dev/peps/pep-0383/ An implementation of this for Python 2: https://github.com/PythonCharmers/python-future/blob/fe494df5d66db19af94a1fdc32afe0e2e52dcf66/src/future/utils/surrogateescape.py |
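For readers unfamiliar with the handler, here is a minimal Python 3 sketch of the round trip that PEP 383 describes. It is an illustration only, not code from Supervisor or from any branch discussed here, and the sample bytes are made up.

```python
# Python 3 only: the "surrogateescape" error handler (PEP 383) is built in.
raw = b"valid text \xff\xfe not valid UTF-8"

# Undecodable bytes become lone surrogates (U+DC80..U+DCFF) instead of raising.
text = raw.decode("utf-8", errors="surrogateescape")

# Encoding with the same handler restores the original bytes exactly.
restored = text.encode("utf-8", errors="surrogateescape")
assert restored == raw

# A strict encode of the same string fails, which is why the handler has to
# be named explicitly on both the decode and the encode side.
try:
    text.encode("utf-8")
except UnicodeEncodeError:
    strict_encode_failed = True
else:
    strict_encode_failed = False
```

The round trip is lossless, but the intermediate string is not safe to hand to code that encodes strictly, so every writer would need to use the same handler.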
I came across this via a recent call for help with porting to Python 3 (which was posted to Reddit). From @regebro 's comment
Isn't that the right thing to do, though? Trying to understand ... it looks to me (w.r.t. the above traceback) like Unlike the usual "Unicode sandwich" case we get in most text-centric applications, isn't Supervisor a candidate for a "reverse-Unicode sandwich" where everything internal is in bytes until it needs to be output to console or file, at which point it may need some decoding (based on child process output encoding) and re-encoding (based on the device being output to)? Sorry if I've got the wrong end of the stick here, but it seems like doing an
Supervisor most importantly needs to be consistent with what it passes around internally. Bytes is OK, and may be the fastest. That will mean that any logfile will be in the same character encoding as any output is. This may not be desirable, as you can in theory have different encodings from different processes all going to the same logfile. To handle that you need to use unicode internally. But that also means we need to deal with when processes send invalid data. So both sandwiches are quite soggy, but we should eat one of them. :-) |
There may be a need to use Unicode internally, but it could be fairly localised, say to The reason for doing it line-by-line is just to avoid trying to be more granular and failing due to only part of the bytes in an encoding sequence being available. Are there any other places where Unicode needs to be handled, other than e.g. reading configuration files, RPC interface and similar? |
Using bytes internally is closest to what released versions of Supervisor on Python 2 do now. A subprocess can output whatever binary data it wants on its stdout and supervisord will write that verbatim to the log file. It's supposed to, anyway. Users already complain on the issue tracker that logging is slow, so the additional overhead of decoding all output may not be tolerable. Decoding all subprocess output also probably necessitates more configuration (encoding per stream) and will corrupt output when that is configured incorrectly. I'd be in favor of using bytes internally and avoiding decoding whenever possible. Some decoding is necessary. For the HTTP servers (XML-RPC and web interface), we need the subprocess output to be decoded so we can turn around and encode it as valid UTF-8. Performance there isn't sensitive like normal logging, since those are only active when a user makes a remote connection. Imperfect conversions on those is also probably more acceptable than on the actual log files.
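As a rough sketch of the "decode only at the HTTP boundary" idea, assuming Python 3: the helper below turns arbitrary subprocess bytes into valid UTF-8 by substituting U+FFFD for anything undecodable. The name `to_valid_utf8` and its `source_encoding` parameter are hypothetical and are not part of Supervisor.

```python
def to_valid_utf8(data, source_encoding="utf-8"):
    """Return `data` re-encoded as valid UTF-8 bytes.

    Hypothetical helper for illustration. Bytes that cannot be decoded
    with `source_encoding` are replaced by U+FFFD, so the conversion is
    lossy but never raises.
    """
    text = data.decode(source_encoding, errors="replace")
    return text.encode("utf-8")


# Invalid UTF-8 input still yields valid UTF-8 output.
cleaned = to_valid_utf8(b"ok\xff")

# Input in another encoding is converted when the encoding is known.
converted = to_valid_utf8(b"caf\xe9", source_encoding="latin-1")
```

The log file itself would stay untouched under this approach; only the copy served over XML-RPC or the web interface goes through the lossy conversion.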
All processes spawned from supervisord inherit its environment (
Line buffering doesn't work for |
Suppose this were done for Python 3, too - just write binary streams to binary streams. If the processes all use a common encoding, then the resulting output should be consistent for whoever/whatever consumes the output log. If the encoding used by different processes isn't the same, you might get mojibake in the resulting output, but the onus is on the consumer of the log from those processes to sort it out when they consume it, without any decoding/encoding penalties incurred in situations where everything does utf-8, for example.
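A minimal sketch of what "binary streams to binary streams" could look like, assuming both ends are opened in binary mode. `copy_bytes` is a hypothetical name and the in-memory streams stand in for a child's pipe and a log file; none of this is taken from Supervisor.

```python
import io


def copy_bytes(src, dst, chunk_size=8192):
    """Copy everything from binary stream `src` to binary stream `dst`.

    No decoding or encoding happens, so the bytes arrive unchanged
    whatever encoding (if any) the producer used.
    """
    total = 0
    while True:
        chunk = src.read(chunk_size)
        if not chunk:
            break
        dst.write(chunk)
        total += len(chunk)
    return total


# Two "processes" with different encodings writing to the same log.
payload = b"caf\xe9 (latin-1)\n" + "caf\u00e9 (utf-8)\n".encode("utf-8")
child_output = io.BytesIO(payload)
log_file = io.BytesIO()
copied = copy_bytes(child_output, log_file)
```

The log ends up with mixed encodings, which is exactly the mojibake risk described above, but nothing is lost and nothing can raise a decoding error.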
I see. Then the approach would be to try and decode, see at what offset the decoding failure occurred, and keep that part for decoding later when more of the data is available, passing on the successfully decoded part to the output. What problems have been reported with the binary-stream -> binary-stream (without decode/encode) approach, or has that not been tried in Python 3 execution paths? |
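The "keep the undecodable tail for later" bookkeeping does not have to be written by hand: the standard library's incremental decoders already buffer an incomplete multi-byte sequence until the next chunk arrives. The sketch below assumes UTF-8 input and uses `errors="replace"` so that genuinely invalid bytes do not raise; the chunk contents are made up.

```python
import codecs

# One decoder instance per stream; it carries state between calls.
decoder = codecs.getincrementaldecoder("utf-8")(errors="replace")

# "é" (b"\xc3\xa9") is split across the first two chunks, and the second
# chunk also contains a byte (0xff) that is never valid in UTF-8.
chunks = [b"caf\xc3", b"\xa9 \xff ", b"done"]

decoded = [decoder.decode(chunk) for chunk in chunks]

# final=True flushes the decoder; a dangling partial sequence left at end
# of stream would be replaced here rather than silently dropped.
tail = decoder.decode(b"", final=True)

text = "".join(decoded) + tail
```

This addresses the split-sequence problem for any chunk size, so line buffering is not required just to decode safely.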
Nobody has tried it.
OK, I've had a go. See this branch comparison - I've not actually raised a pull request. All tests pass in this branch on 2.7 and 3.5 on my system, FWIW. Also, running the above |
I spotted a missed The tests don't cover this function. |
Good catch, thanks. IMO the branch still needs a bit of work - b-prefixes won't work on older Pythons. What's the thinking about dropping support for Python < 2.6? To support 2.4 and 2.5 I would see the need to keep using The use of the same machinery for child logs and Supervisor's own log also seems a bit clunky. The child logs are IMO data from Supervisor's POV and shouldn't go through the same path as Supervisor's own logging - what I've done there is a bit of a kludge. |
I'm assuming any of this work is for Supervisor 4.0, in which case you only need to worry about 2.6+.
There's an open question here about whether 3.2 can also be dropped, though right now the |
Thanks so much for your work on this. This is the first effort to solve the logging issues that I have seen. There are a number of other Python 3 bytes/strings issues tagged python 3 if you are up for it. The tests are apparently not very good. Core features that people use all the time like eventlisteners and
The other packages in this org (meld3 and superlance) now have release versions that support 2.6+ and 3.2+. The next major version of Supervisor is planned to support the same. Dropping 3.2 was suggested but there doesn't seem to be a compelling reason to do that yet.
I think the origin of this is that the main log and the child logs all have the same options in the config file, e.g. rotation.
You're welcome - Supervisor has been of good use to me and it's nice to be able to give something back
Indeed, I've already incorporated some patches/ideas from others in my branch. As I see it, this patch should go some way to addressing #565 (thanks to a patch by @regebro), #638 (this issue), #663, #835, #836 (thanks to #868 by @evanunderscore ), and I'll be looking at #664 soon as well (you can already transmit Unicode bytes using e.g.
They're not that bad, but there are definitely gaps. There ought, for instance, to be some "end-to-end" tests that invoke
Maybe we just need different handlers to avoid the kludging I've currently done in |
A supervisor configuration in which a program produces non-UTF-8 data (e.g. binary) triggers an error in supervisor, followed by stdout / stderr being closed for that process. This leaves the process unresponsive, because data that should be sent to stdout / stderr is stuck in the write operation.
Log message of supervisor (formatted):