Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Replace MIME parsing with custom HTTP parsing. #200

Merged
merged 2 commits into from
Jul 17, 2017
Merged

Replace MIME parsing with custom HTTP parsing. #200

merged 2 commits into from
Jul 17, 2017

Conversation

aaugustin
Copy link
Member

@aaugustin aaugustin commented Jul 9, 2017

Given that websockets makes straightforward use of HTTP, that websocket
implementations can be expected not to exhibit legacy behaviors, and
that RFC 7230 deprecates this behavior, parsing HTTP is doable.

Thanks https://github.com/njsmith/h11 for providing some inspiration,
especially for translating the RFC to regular expressions and figuring
out some edge cases.

I expect the new implementation to be faster, since it has a much
tighter focus than the stdlib's general purpose MIME parser, and
possibly more secure, since it was written from the beginning with
security as a primary goal (with the caveat that it's new code,
which means it's more likely to have security issues).

Fix #19.

@aaugustin aaugustin force-pushed the http branch 4 times, most recently from b6a1d71 to 24c5317 Compare July 9, 2017 19:01
Given that websockets makes straightforward use of HTTP, that websocket
implementations can be expected not to exhibit legacy behaviors, and
that RFC 7230 deprecates this behavior, parsing HTTP is doable.

Thanks https://github.com/njsmith/h11 for providing some inspiration,
especially for translating the RFC to regular expressions and figuring
out some edge cases.

I expect the new implementation to be faster, since it has a much
tighter focus than the stdlib's general purpose MIME parser, and
possibly more secure, since it was written from the beginning with
security as a primary goal (with the caveat that it's new code,
which means it's more likely to have security issues).

Fix #19.
@aaugustin
Copy link
Member Author

@cjerdonek Any thoughts on this?

@aaugustin
Copy link
Member Author

Context: I got side-tracked and cleaned this up because I was wondering how to handle multiple Sec-Websocket-Extension headers.

🐰 🕳

for name, value in headers:
self.request_headers[name] = value
self.request_headers = http.client.HTTPMessage()
self.request_headers._headers = headers # HACK
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I looked at CPython's code here, and HTTPMessage doesn't really add much, and it's only reluctantly part of the public API, if that.

What about the idea of defining a websockets.HTTPMessage(email.message.Message) class? At the least, this would let you DRY up the four occurrences of:

self.response_headers._headers = headers    # HACK

and confine the hack to one spot. You could also add a custom property / method on the class to eliminate the need for separate self.raw_*_headers attributes on the WebSocketCommonProtocol class.

Also, it might be worth having a couple unit tests to check that setting _headers directly will continue to work in future Pythons -- giving you the same result as going through email.message.Message's (slightly slower) public API, etc.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

These two frequently used functions could also go on the class.

set_header = lambda k, v: headers.append((k, v))
get_header = lambda k: headers.get(k, '')

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, I ended up using http.client.HTTPMessage a bit reluctantly and I'm not very happy with that hack.

The constraints here are:

  • not losing any information: I believe the the raw_*_headers lists of (name, value) pairs are a good way to achieve that; I'm happy with them (even though it's unlikely that they will be used widely)
  • providing a convenient API for accessing headers: we're immediately getting into the problem of multi-value dicts where 99,99% of accesses expect a single value; each library tends to invent its own API; I decided to use the closest matching data structure in the standard library to avoid adding one more variant to the Python ecosystem
  • minimizing performance overhead; the HACK is the only way I found to instantiate a HTTPMessage without running a lot of code; it don't expect the implementation to change, but the risk exists in theory; I believe tests cover this
  • not reinventing the wheel: not only does http.client.HTTPMessage feel semantically correct for representing HTTP headers, but it's backwards compatible because it inherits email.message.Message (the less semantically correct type previously used there); see also https://docs.python.org/3/library/http.client.html#http.client.HTTPResponse.msg

I considered using properties and rejected that idea because it would make the code more verbose, slightly less efficient, and make it less obvious where these values are set.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Conversely, defining get/set_header at the class level is certainly more efficient that redefining them dynamically with each execution.

Historically they couldn't be methods because headers weren't stored as attributes. Certainly a good time to reconsider that part of the design.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

To start with, I think it could be as simple as a class with two attributes for "raw" and "not raw" (matching the two attributes currently stored on the Protocol classes), and you wouldn't lose anything. And then you could add whatever helper methods as needed, like the get / set header functions.

Later on (and only if you wanted), you could get more sophisticated with how you store and mutate the raw header info internally to the class.

@@ -34,20 +53,38 @@ def read_request(stream):
``stream`` is an :class:`~asyncio.StreamReader`.

Return ``(path, headers)`` where ``path`` is a :class:`str` and
``headers`` is a :class:`~email.message.Message`. ``path`` isn't
URL-decoded.
``headers`` is a list of ``(name, value)`` tuples.
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

With a custom class, you can probably go back to returning an object in these (three?) spots.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I could, but I like using basic Python types in low level APIs when possible. A list of pairs is pretty manageable.

@aaugustin
Copy link
Member Author

Thanks for the review, I'll make changes next week-end.

@cjerdonek
Copy link
Collaborator

Serious question: since the library would now be doing its own parsing, does it even need to construct and set an HTTPMessage object?

It looks like the only place the class is actually being used (aside from test code) is to call its get() method in the following two spots.

For the server, here in its handshake() method:

get_header = lambda k: headers.get(k, '')

And in the client, here also in its handshake() method:

get_header = lambda k: headers.get(k, '')

@aaugustin
Copy link
Member Author

Headers represented as a MIME message are currently part of the public API, but I'm open to changing that, see #210.

@aaugustin
Copy link
Member Author

I added a commit that wraps the # HACK into a helper function.

I looked at refactoring get/set_header as methods, but I like that the path, request_headers and raw_request_headers attributes are set on instances atomically when sending out the HTTP request, so I didn't make that change for now.

Since this part is hacky and likely to change in the future (#210),
wrap it into a single function and add tests for the public API we
really care about.
@aaugustin aaugustin merged commit 00efcc6 into master Jul 17, 2017
@aaugustin aaugustin deleted the http branch July 17, 2017 12:54
@cjerdonek
Copy link
Collaborator

Thanks for making a helper function and for opening issue #210. I appreciate it.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants