-
-
Notifications
You must be signed in to change notification settings - Fork 530
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Replace MIME parsing with custom HTTP parsing. #200
Conversation
b6a1d71
to
24c5317
Compare
Given that websockets makes straightforward use of HTTP, that websocket implementations can be expected not to exhibit legacy behaviors, and that RFC 7230 deprecates this behavior, parsing HTTP is doable. Thanks https://github.com/njsmith/h11 for providing some inspiration, especially for translating the RFC to regular expressions and figuring out some edge cases. I expect the new implementation to be faster, since it has a much tighter focus than the stdlib's general purpose MIME parser, and possibly more secure, since it was written from the beginning with security as a primary goal (with the caveat that it's new code, which means it's more likely to have security issues). Fix #19.
@cjerdonek Any thoughts on this? |
Context: I got side-tracked and cleaned this up because I was wondering how to handle multiple 🐰 🕳 |
websockets/client.py
Outdated
for name, value in headers: | ||
self.request_headers[name] = value | ||
self.request_headers = http.client.HTTPMessage() | ||
self.request_headers._headers = headers # HACK |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I looked at CPython's code here, and HTTPMessage doesn't really add much, and it's only reluctantly part of the public API, if that.
What about the idea of defining a websockets.HTTPMessage(email.message.Message)
class? At the least, this would let you DRY up the four occurrences of:
self.response_headers._headers = headers # HACK
and confine the hack to one spot. You could also add a custom property / method on the class to eliminate the need for separate self.raw_*_headers
attributes on the WebSocketCommonProtocol
class.
Also, it might be worth having a couple unit tests to check that setting _headers
directly will continue to work in future Pythons -- giving you the same result as going through email.message.Message
's (slightly slower) public API, etc.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
These two frequently used functions could also go on the class.
set_header = lambda k, v: headers.append((k, v))
get_header = lambda k: headers.get(k, '')
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes, I ended up using http.client.HTTPMessage
a bit reluctantly and I'm not very happy with that hack.
The constraints here are:
- not losing any information: I believe the the
raw_*_headers
lists of(name, value)
pairs are a good way to achieve that; I'm happy with them (even though it's unlikely that they will be used widely) - providing a convenient API for accessing headers: we're immediately getting into the problem of multi-value dicts where 99,99% of accesses expect a single value; each library tends to invent its own API; I decided to use the closest matching data structure in the standard library to avoid adding one more variant to the Python ecosystem
- minimizing performance overhead; the HACK is the only way I found to instantiate a
HTTPMessage
without running a lot of code; it don't expect the implementation to change, but the risk exists in theory; I believe tests cover this - not reinventing the wheel: not only does
http.client.HTTPMessage
feel semantically correct for representing HTTP headers, but it's backwards compatible because it inheritsemail.message.Message
(the less semantically correct type previously used there); see also https://docs.python.org/3/library/http.client.html#http.client.HTTPResponse.msg
I considered using properties and rejected that idea because it would make the code more verbose, slightly less efficient, and make it less obvious where these values are set.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Conversely, defining get/set_header at the class level is certainly more efficient that redefining them dynamically with each execution.
Historically they couldn't be methods because headers weren't stored as attributes. Certainly a good time to reconsider that part of the design.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
To start with, I think it could be as simple as a class with two attributes for "raw" and "not raw" (matching the two attributes currently stored on the Protocol classes), and you wouldn't lose anything. And then you could add whatever helper methods as needed, like the get / set header functions.
Later on (and only if you wanted), you could get more sophisticated with how you store and mutate the raw header info internally to the class.
@@ -34,20 +53,38 @@ def read_request(stream): | |||
``stream`` is an :class:`~asyncio.StreamReader`. | |||
|
|||
Return ``(path, headers)`` where ``path`` is a :class:`str` and | |||
``headers`` is a :class:`~email.message.Message`. ``path`` isn't | |||
URL-decoded. | |||
``headers`` is a list of ``(name, value)`` tuples. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
With a custom class, you can probably go back to returning an object in these (three?) spots.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I could, but I like using basic Python types in low level APIs when possible. A list of pairs is pretty manageable.
Thanks for the review, I'll make changes next week-end. |
Serious question: since the library would now be doing its own parsing, does it even need to construct and set an It looks like the only place the class is actually being used (aside from test code) is to call its For the server, here in its
And in the client, here also in its
|
Headers represented as a MIME message are currently part of the public API, but I'm open to changing that, see #210. |
I added a commit that wraps the I looked at refactoring |
Since this part is hacky and likely to change in the future (#210), wrap it into a single function and add tests for the public API we really care about.
Thanks for making a helper function and for opening issue #210. I appreciate it. |
Given that websockets makes straightforward use of HTTP, that websocket
implementations can be expected not to exhibit legacy behaviors, and
that RFC 7230 deprecates this behavior, parsing HTTP is doable.
Thanks https://github.com/njsmith/h11 for providing some inspiration,
especially for translating the RFC to regular expressions and figuring
out some edge cases.
I expect the new implementation to be faster, since it has a much
tighter focus than the stdlib's general purpose MIME parser, and
possibly more secure, since it was written from the beginning with
security as a primary goal (with the caveat that it's new code,
which means it's more likely to have security issues).
Fix #19.