Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Attempt to reencode malformed headers from Latin-1 to UTF8 #830

Merged
merged 4 commits into from
May 25, 2022
Merged

Conversation

quinnj
Copy link
Member

@quinnj quinnj commented May 25, 2022

As brought up in #796, there may be scenarios where headers may contain
non-UTF8 characters (even though they're supposed to be ASCII). Appreciation
to @StefanKarpinski for the Latin-1 -> UTF-8 conversion code and the suggestion
to try reencoding before throwing an error. As proposed in this PR, the normal
header parsing path should be unaffected and only when we're unable to parse
a normal header will we attempt this reencoding.

Note that curl warns on the malformed header and filters it out.

As brought up in #796, there may be scenarios where headers may contain
non-UTF8 characters (even though they're supposed to be ASCII). Appreciation
to @StefanKarpinski for the Latin-1 -> UTF-8 conversion code and the suggestion
to try reencoding before throwing an error. As proposed in this PR, the normal
header parsing path should be unaffected and only when we're unable to parse
a normal header will we attempt this reencoding.

Note that curl warns on the malformed header and filters it out.
src/Parsers.jl Outdated Show resolved Hide resolved
src/Parsers.jl Outdated Show resolved Hide resolved
src/Parsers.jl Outdated Show resolved Hide resolved
@quinnj
Copy link
Member Author

quinnj commented May 25, 2022

@StefanKarpinski, thanks for the suggestions. I incorporated them locally since I forgot that we need to do the isvalid check before we call the regexes, since that was the original issue we were running into (i.e. calling the regex on latin1 data). The strategy here then is:

  • Before parsing the first header, check if all headers are isvalid
  • If not, assume latin1 and re-encode
  • If still not isvalid, error out (as we do now on latin1); otherwise, parsing continues successfully

We could maybe try to match some client behavior here and just "skip" the header w/ latin1, parsing other headers, but it seems tricky to try and get exactly right because we're dealing w/ potentially corrupt data, but still trying to find the \r\n to move on to the next header.

@quinnj
Copy link
Member Author

quinnj commented May 25, 2022

Also, FYI, I found this as a decent discussion

@codecov-commenter
Copy link

codecov-commenter commented May 25, 2022

Codecov Report

Merging #830 (86dbb19) into master (54a6c13) will increase coverage by 0.07%.
The diff coverage is 92.85%.

@@            Coverage Diff             @@
##           master     #830      +/-   ##
==========================================
+ Coverage   78.48%   78.56%   +0.07%     
==========================================
  Files          36       36              
  Lines        2524     2538      +14     
==========================================
+ Hits         1981     1994      +13     
- Misses        543      544       +1     
Impacted Files Coverage Δ
src/Parsers.jl 97.50% <92.85%> (-0.62%) ⬇️

📣 Codecov can now indicate which changes are the most critical in Pull Requests. Learn more

@quinnj quinnj merged commit 832185f into master May 25, 2022
@quinnj quinnj deleted the jq/796 branch May 25, 2022 21:21
end
end
bytes = SubString(String(buf))
!isvalid(bytes) && @goto error
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

There's no need to check this—the resulting string is guaranteed to be valid by construction.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good catch.

if !isvalid(bytes)
@warn "malformed HTTP header detected; attempting to re-encode from Latin-1 to UTF8"
rawbytes = codeunits(bytes)
buf = Base.StringVector(length(rawbytes) + count(≥(0x80), rawbytes))
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should we add a fallback for when Base.StringVector does not exist?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We're requiring Julia 1.6 in HTTP.jl now, so I think we should be covered?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It's just that this is not public Julia API (it has no docstring). But I guess that if Stefan recommends it then it should be fine?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It's extremely unlikely to be removed and should probably be made public. @JeffBezanson, do you think there's any real risk that Base.StringVector will be removed?

if !isvalid(bytes)
@warn "malformed HTTP header detected; attempting to re-encode from Latin-1 to UTF8"
rawbytes = codeunits(bytes)
buf = Base.StringVector(length(rawbytes) + count(≥(0x80), rawbytes))
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should we add a fallback for when Base.StringVector does not exist?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't think supporting Julia 0.5 is that important at this point.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants