Replace MIME parsing with custom HTTP parsing. #200

aaugustin · 2017-07-09T16:28:15Z

Given that websockets makes straightforward use of HTTP, that websocket
implementations can be expected not to exhibit legacy behaviors, and
that RFC 7230 deprecates this behavior, parsing HTTP is doable.

Thanks https://github.com/njsmith/h11 for providing some inspiration,
especially for translating the RFC to regular expressions and figuring
out some edge cases.

I expect the new implementation to be faster, since it has a much
tighter focus than the stdlib's general purpose MIME parser, and
possibly more secure, since it was written from the beginning with
security as a primary goal (with the caveat that it's new code,
which means it's more likely to have security issues).

Fix #19.
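As a rough illustration of what translating the RFC 7230 grammar to regular expressions can look like, here is a minimal sketch. The names `parse_header_line`, `_token_re` and `_value_re` are hypothetical and not taken from this PR, and the patterns are simplified (for instance, obsolete line folding is not handled).

```python
import re

# Simplified patterns loosely following RFC 7230, section 3.2.
# token = 1*tchar
_token_re = re.compile(rb"[-!#$%&'*+.^_`|~0-9a-zA-Z]+")
# field-value: tabs, spaces, visible ASCII and obs-text bytes
_value_re = re.compile(rb"[\x09\x20-\x7e\x80-\xff]*")


def parse_header_line(line):
    """Parse one header line (without the trailing CRLF) into (name, value)."""
    name, sep, value = line.partition(b":")
    # No whitespace is allowed between the field name and the colon.
    if not sep or not _token_re.fullmatch(name):
        raise ValueError("invalid header name: %r" % name)
    # Optional whitespace around the value is not part of the value.
    value = value.strip(b" \t")
    if not _value_re.fullmatch(value):
        raise ValueError("invalid header value: %r" % value)
    return name.decode("ascii"), value.decode("latin-1")
```

Returning plain `(name, value)` pairs keeps the parser independent of whatever header container is built on top of it.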


aaugustin · 2017-07-09T19:08:22Z

@cjerdonek Any thoughts on this?

aaugustin · 2017-07-09T19:09:44Z

Context: I got side-tracked and cleaned this up because I was wondering how to handle multiple Sec-WebSocket-Extensions headers.

🐰 🕳

cjerdonek · 2017-07-09T20:21:41Z

websockets/client.py

-        for name, value in headers:
-            self.request_headers[name] = value
+        self.request_headers = http.client.HTTPMessage()
+        self.request_headers._headers = headers     # HACK


I looked at CPython's code here, and HTTPMessage doesn't really add much, and it's only reluctantly part of the public API, if that.

What about the idea of defining a websockets.HTTPMessage(email.message.Message) class? At the least, this would let you DRY up the four occurrences of:

self.response_headers._headers = headers # HACK

and confine the hack to one spot. You could also add a custom property / method on the class to eliminate the need for separate self.raw_*_headers attributes on the WebSocketCommonProtocol class.

Also, it might be worth having a couple unit tests to check that setting _headers directly will continue to work in future Pythons -- giving you the same result as going through email.message.Message's (slightly slower) public API, etc.

These two frequently used functions could also go on the class.

set_header = lambda k, v: headers.append((k, v))
get_header = lambda k: headers.get(k, '')
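The suggestion to confine the hack to one spot, plus a check that setting `_headers` directly gives the same result as the public API, could be sketched like this. The function names `build_headers` and `build_headers_public` are hypothetical, not part of websockets.

```python
import http.client


def build_headers(raw_headers):
    """Build an HTTPMessage from a list of (name, value) pairs.

    HACK: assigns the private ``_headers`` attribute directly to skip
    the per-header work done by the public API.
    """
    headers = http.client.HTTPMessage()
    headers._headers = list(raw_headers)
    return headers


def build_headers_public(raw_headers):
    """Build the same object through the public API only."""
    headers = http.client.HTTPMessage()
    for name, value in raw_headers:
        headers[name] = value  # appends; does not replace existing values
    return headers
```

A unit test can then compare the two on representative input, so a change in a future Python version would show up as a test failure rather than as a silent behavior change.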

Yes, I ended up using http.client.HTTPMessage a bit reluctantly and I'm not very happy with that hack.

The constraints here are:

not losing any information: I believe the raw_*_headers lists of (name, value) pairs are a good way to achieve that; I'm happy with them (even though it's unlikely that they will be used widely)

providing a convenient API for accessing headers: we're immediately getting into the problem of multi-value dicts where 99.99% of accesses expect a single value; each library tends to invent its own API; I decided to use the closest matching data structure in the standard library to avoid adding one more variant to the Python ecosystem

minimizing performance overhead: the HACK is the only way I found to instantiate an HTTPMessage without running a lot of code; I don't expect the implementation to change, but the risk exists in theory; I believe tests cover this

not reinventing the wheel: not only does http.client.HTTPMessage feel semantically correct for representing HTTP headers, but it's backwards compatible because it inherits email.message.Message (the less semantically correct type previously used there); see also https://docs.python.org/3/library/http.client.html#http.client.HTTPResponse.msg
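To make the multi-value point concrete, here is a small sketch of how the standard library's `email.message.Message` (the base class of `http.client.HTTPMessage`) behaves when a header is repeated. The extension name `x-custom-extension` is made up for the example.

```python
import email.message

headers = email.message.Message()
# Assigning the same name twice appends a second header.
headers["Sec-WebSocket-Extensions"] = "permessage-deflate"
headers["Sec-WebSocket-Extensions"] = "x-custom-extension"

# Single-value access: returns the first matching header, or the default.
first = headers.get("Sec-WebSocket-Extensions", "")

# Multi-value access: returns every matching header, in order.
all_values = headers.get_all("Sec-WebSocket-Extensions", [])
```

Lookups are case-insensitive, which matches how HTTP header names are compared.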

I considered using properties and rejected that idea because it would make the code more verbose, slightly less efficient, and make it less obvious where these values are set.

Conversely, defining get/set_header at the class level is certainly more efficient than redefining them dynamically with each execution.

Historically they couldn't be methods because headers weren't stored as attributes. Certainly a good time to reconsider that part of the design.

To start with, I think it could be as simple as a class with two attributes for "raw" and "not raw" (matching the two attributes currently stored on the Protocol classes), and you wouldn't lose anything. And then you could add whatever helper methods as needed, like the get / set header functions.

Later on (and only if you wanted), you could get more sophisticated with how you store and mutate the raw header info internally to the class.
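A class along those lines might look like the sketch below. `Headers`, `raw` and `message` are hypothetical names chosen for illustration; nothing here is taken from the websockets code base.

```python
import email.message


class Headers:
    """Hypothetical container pairing raw and parsed headers."""

    def __init__(self, raw=()):
        # "raw": ordered list of (name, value) pairs, so nothing is lost.
        self.raw = list(raw)
        # "not raw": a Message offering convenient, case-insensitive access.
        self.message = email.message.Message()
        for name, value in self.raw:
            self.message[name] = value

    def get_header(self, name):
        """Return the first value for ``name``, or '' if it is absent."""
        return self.message.get(name, "")

    def set_header(self, name, value):
        """Append a header, keeping both representations in sync."""
        self.raw.append((name, value))
        self.message[name] = value
```

Keeping both representations behind one object means the internal storage could later change without touching callers.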

cjerdonek · 2017-07-09T20:24:34Z

websockets/http.py

@@ -34,20 +53,38 @@ def read_request(stream):
    ``stream`` is an :class:`~asyncio.StreamReader`.

    Return ``(path, headers)`` where ``path`` is a :class:`str` and
-    ``headers`` is a :class:`~email.message.Message`. ``path`` isn't
-    URL-decoded.
+    ``headers`` is a list of ``(name, value)`` tuples.


With a custom class, you can probably go back to returning an object in these (three?) spots.

I could, but I like using basic Python types in low level APIs when possible. A list of pairs is pretty manageable.

aaugustin · 2017-07-10T21:35:46Z

Thanks for the review, I'll make changes next weekend.

cjerdonek · 2017-07-15T06:57:38Z

Serious question: since the library would now be doing its own parsing, does it even need to construct and set an HTTPMessage object?

It looks like the only place the class is actually being used (aside from test code) is to call its get() method in the following two spots.

For the server, here in its handshake() method:

get_header = lambda k: headers.get(k, '')

And in the client, here also in its handshake() method:

get_header = lambda k: headers.get(k, '')
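If the HTTPMessage object were dropped, the same lookup could work directly on the list of (name, value) pairs. A minimal sketch, where the three-argument signature is an assumption and not code from this PR:

```python
def get_header(headers, name, default=""):
    """Return the first value for ``name`` in a list of (name, value) pairs.

    Header names are compared case-insensitively, as HTTP requires.
    """
    name = name.lower()
    for key, value in headers:
        if key.lower() == name:
            return value
    return default
```

This is a linear scan, which should be acceptable for the handful of headers in a WebSocket handshake.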

aaugustin · 2017-07-17T08:20:12Z

Headers represented as a MIME message are currently part of the public API, but I'm open to changing that, see #210.

aaugustin · 2017-07-17T10:51:01Z

I added a commit that wraps the # HACK into a helper function.

I looked at refactoring get/set_header as methods, but I like that the path, request_headers and raw_request_headers attributes are set on instances atomically when sending out the HTTP request, so I didn't make that change for now.

Since this part is hacky and likely to change in the future (#210), wrap it into a single function and add tests for the public API we really care about.

cjerdonek · 2017-07-18T21:50:28Z

Thanks for making a helper function and for opening issue #210. I appreciate it.

aaugustin force-pushed the http branch 4 times, most recently from b6a1d71 to 24c5317 Compare July 9, 2017 19:01

aaugustin force-pushed the http branch from 24c5317 to dcd9b1a Compare July 9, 2017 19:04

cjerdonek reviewed Jul 9, 2017


aaugustin mentioned this pull request Jul 17, 2017

Consider a better structure for HTTP headers #210

Closed

Encapsulate creation of HTTP headers.

a1a6aef


aaugustin force-pushed the http branch from b57a410 to a1a6aef Compare July 17, 2017 12:50

aaugustin merged commit 00efcc6 into master Jul 17, 2017

aaugustin deleted the http branch July 17, 2017 12:54
