-
Notifications
You must be signed in to change notification settings - Fork 181
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Attempt to reencode malformed headers from Latin-1 to UTF8 #830
Conversation
As brought up in #796, there may be scenarios where headers may contain non-UTF8 characters (even though they're supposed to be ASCII). Appreciation to @StefanKarpinski for the Latin-1 -> UTF-8 conversion code and the suggestion to try reencoding before throwing an error. As proposed in this PR, the normal header parsing path should be unaffected and only when we're unable to parse a normal header will we attempt this reencoding. Note that curl warns on the malformed header and filters it out.
@StefanKarpinski, thanks for the suggestions. I incorporated them locally since I forgot that we need to do the
We could maybe try to match some client behavior here and just "skip" the header w/ latin1, parsing other headers, but it seems tricky to try and get exactly right because we're dealing w/ potentially corrupt data, but still trying to find the |
Also, FYI, I found this as a decent discussion |
Codecov Report
@@ Coverage Diff @@
## master #830 +/- ##
==========================================
+ Coverage 78.48% 78.56% +0.07%
==========================================
Files 36 36
Lines 2524 2538 +14
==========================================
+ Hits 1981 1994 +13
- Misses 543 544 +1
📣 Codecov can now indicate which changes are the most critical in Pull Requests. Learn more |
end | ||
end | ||
bytes = SubString(String(buf)) | ||
!isvalid(bytes) && @goto error |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
There's no need to check this—the resulting string is guaranteed to be valid by construction.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Good catch.
if !isvalid(bytes) | ||
@warn "malformed HTTP header detected; attempting to re-encode from Latin-1 to UTF8" | ||
rawbytes = codeunits(bytes) | ||
buf = Base.StringVector(length(rawbytes) + count(≥(0x80), rawbytes)) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Should we add a fallback for when Base.StringVector
does not exist?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We're requiring Julia 1.6 in HTTP.jl now, so I think we should be covered?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It's just that this is not public Julia API (it has no docstring). But I guess that if Stefan recommends it then it should be fine?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It's extremely unlikely to be removed and should probably be made public. @JeffBezanson, do you think there's any real risk that Base.StringVector
will be removed?
if !isvalid(bytes) | ||
@warn "malformed HTTP header detected; attempting to re-encode from Latin-1 to UTF8" | ||
rawbytes = codeunits(bytes) | ||
buf = Base.StringVector(length(rawbytes) + count(≥(0x80), rawbytes)) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Should we add a fallback for when Base.StringVector
does not exist?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I don't think supporting Julia 0.5 is that important at this point.
As brought up in #796, there may be scenarios where headers may contain
non-UTF8 characters (even though they're supposed to be ASCII). Appreciation
to @StefanKarpinski for the Latin-1 -> UTF-8 conversion code and the suggestion
to try reencoding before throwing an error. As proposed in this PR, the normal
header parsing path should be unaffected and only when we're unable to parse
a normal header will we attempt this reencoding.
Note that curl warns on the malformed header and filters it out.