Attempt to reencode malformed headers from Latin-1 to UTF8 #830

quinnj · 2022-05-25T13:46:20Z

As brought up in #796, headers can contain non-UTF8 bytes (even though
they're supposed to be ASCII). Thanks to @StefanKarpinski for the
Latin-1 -> UTF-8 conversion code and the suggestion to try reencoding before
throwing an error. As proposed in this PR, the normal header parsing path
should be unaffected; we only attempt this reencoding when a header fails to
parse normally.

Note that curl warns on the malformed header and filters it out.
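
For illustration, a minimal sketch of the Latin-1 -> UTF-8 conversion idea. This is not the code merged in this PR: `latin1_to_utf8` is a hypothetical name, and a plain `Vector{UInt8}` is used for the buffer instead of `Base.StringVector`. Each byte below 0x80 is copied unchanged, and each byte at or above 0x80 becomes a two-byte UTF-8 sequence.

```julia
# Hypothetical helper (not the PR's implementation): re-encode Latin-1 bytes as UTF-8.
function latin1_to_utf8(raw::AbstractVector{UInt8})
    # ASCII bytes stay one byte; bytes >= 0x80 expand to two bytes.
    buf = Vector{UInt8}(undef, length(raw) + count(≥(0x80), raw))
    i = 0
    for b in raw
        if b < 0x80
            i += 1
            buf[i] = b
        else
            i += 1
            buf[i] = 0xc0 | (b >> 6)     # lead byte (0xc2 or 0xc3)
            i += 1
            buf[i] = 0x80 | (b & 0x3f)   # continuation byte
        end
    end
    return String(buf)
end

# 0xe9 is "é" in Latin-1; the raw header bytes are not valid UTF-8 on their own.
header = latin1_to_utf8(b"X-Name: caf\xe9\r\n")
```

The output length is known up front (one extra byte per non-ASCII input byte), which is why the buffer can be allocated once.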

src/Parsers.jl

quinnj · 2022-05-25T20:36:31Z

@StefanKarpinski, thanks for the suggestions. I incorporated them locally, since I'd forgotten that we need to run the isvalid check before we call the regexes; that was the original issue we were running into (i.e. calling the regex on latin1 data). The strategy here, then, is:

1. Before parsing the first header, check whether all headers are isvalid.
2. If not, assume latin1 and re-encode.
3. If the result is still not isvalid, error out (as we do now on latin1); otherwise, parsing continues successfully.
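
The steps above can be sketched roughly as follows. This is an assumption-laden illustration, not the PR's code: `ensure_valid_headers` is a hypothetical name, and the Latin-1 decode is done with `Char.(codeunits(...))` for brevity rather than with a preallocated buffer.

```julia
# Hypothetical sketch of the validate / re-encode / fail strategy.
function ensure_valid_headers(raw::String)
    # Step 1: normal path, valid input is returned untouched.
    isvalid(raw) && return raw
    # Step 2: assume Latin-1, where each byte is the code point of the same value.
    fixed = String(Char.(codeunits(raw)))
    # Step 3: defensive check mirroring the list above.
    isvalid(fixed) || error("invalid HTTP headers")
    return fixed
end

ok    = ensure_valid_headers("Host: example.com\r\n")
fixed = ensure_valid_headers("X-Name: caf\xe9\r\n")  # "\xe9" is an invalid UTF-8 byte
```

Only the already-failing case pays for the extra pass; valid headers return on the first line.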

We could maybe try to match some client behavior here and just "skip" the latin1 header while parsing the others, but that seems tricky to get exactly right: we're dealing with potentially corrupt data while still trying to find the \r\n to move on to the next header.
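
To make the "skip" alternative concrete, here is a naive sketch that assumes the \r\n delimiters themselves are intact, which is exactly the assumption that may not hold for corrupt data. `skip_invalid_header_lines` is a hypothetical name and nothing like it is part of this PR.

```julia
# Hypothetical sketch: drop header lines that are not valid UTF-8, keep the rest.
# Assumes every header line is still terminated by an intact "\r\n".
function skip_invalid_header_lines(raw::String)
    lines = split(raw, "\r\n"; keepempty=false)
    return filter(isvalid, lines)
end

kept = skip_invalid_header_lines("Host: a\r\nX-Name: caf\xe9\r\nAccept: */*\r\n")
```

If a corrupt byte run happened to swallow or imitate a \r\n, this would split in the wrong place, which is the difficulty described above.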

quinnj · 2022-05-25T20:37:13Z

Also, FYI, I found this to be a decent discussion

codecov-commenter · 2022-05-25T20:57:12Z

Codecov Report

Merging #830 (86dbb19) into master (54a6c13) will increase coverage by 0.07%.
The diff coverage is 92.85%.

@@            Coverage Diff             @@
##           master     #830      +/-   ##
==========================================
+ Coverage   78.48%   78.56%   +0.07%     
==========================================
  Files          36       36              
  Lines        2524     2538      +14     
==========================================
+ Hits         1981     1994      +13     
- Misses        543      544       +1

| Impacted Files | Coverage Δ | |
|---|---|---|
| src/Parsers.jl | `97.50% <92.85%> (-0.62%)` | ⬇️ |

StefanKarpinski · 2022-05-26T15:18:39Z

src/Parsers.jl

+            end
+        end
+        bytes = SubString(String(buf))
+        !isvalid(bytes) && @goto error


There's no need to check this: the resulting string is guaranteed to be valid by construction.

quinnj · 2022-05-28

Good catch.
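
A quick way to convince yourself of the "valid by construction" point: every possible byte value, read as a Latin-1 code point, is a Unicode code point in U+0000..U+00FF, and all of those encode to valid UTF-8. The check below is only an illustration and is not part of the PR.

```julia
# Each byte 0x00..0xff, taken as a Latin-1 code point, yields a valid UTF-8 string.
all_valid = all(isvalid(string(Char(b))) for b in 0x00:0xff)

# ASCII bytes take one UTF-8 code unit; bytes >= 0x80 take exactly two.
widths_ok = all(ncodeunits(Char(b)) == (b < 0x80 ? 1 : 2) for b in 0x00:0xff)
```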

fonsp · 2022-06-08T15:53:11Z

src/Parsers.jl

+    if !isvalid(bytes)
+        @warn "malformed HTTP header detected; attempting to re-encode from Latin-1 to UTF8"
+        rawbytes = codeunits(bytes)
+        buf = Base.StringVector(length(rawbytes) + count(≥(0x80), rawbytes))


Should we add a fallback for when Base.StringVector does not exist?

quinnj · 2022-06-08

We're requiring Julia 1.6 in HTTP.jl now, so I think we should be covered?

fonsp · 2022-06-09

It's just that this is not public Julia API (it has no docstring). But I guess that if Stefan recommends it then it should be fine?

StefanKarpinski · 2022-06-09

It's extremely unlikely to be removed and should probably be made public. @JeffBezanson, do you think there's any real risk that Base.StringVector will be removed?
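
If a fallback were wanted, one possible shape is sketched below. This is hypothetical and not something the PR adds: `new_string_buffer` is an invented name, and the fallback simply allocates a plain `Vector{UInt8}`, which `String` also accepts.

```julia
# Hypothetical wrapper: use the internal allocator when present, else a plain byte vector.
new_string_buffer(n::Integer) =
    isdefined(Base, :StringVector) ? Base.StringVector(n) : Vector{UInt8}(undef, n)

buf = new_string_buffer(5)
copyto!(buf, codeunits("hello"))
s = String(buf)  # String takes ownership of the buffer
```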

fonsp · 2022-06-08T15:53:11Z

src/Parsers.jl

+    if !isvalid(bytes)
+        @warn "malformed HTTP header detected; attempting to re-encode from Latin-1 to UTF8"
+        rawbytes = codeunits(bytes)
+        buf = Base.StringVector(length(rawbytes) + count(≥(0x80), rawbytes))


Should we add a fallback for when Base.StringVector does not exist?

StefanKarpinski · 2022-06-09

I don't think supporting Julia 0.5 is that important at this point.

quinnj marked this pull request as ready for review May 25, 2022 13:46

quinnj mentioned this pull request May 25, 2022

Inconsistent behaviour among Julia 1.6.5 (works ok) versus Julia 1.7.0 (error) #796

Closed

StefanKarpinski reviewed May 25, 2022

quinnj added 3 commits May 25, 2022 14:25

c250981  cleanup implementation and ensure tests pass
bb74537  fix
86dbb19  one more cleanup

quinnj merged commit 832185f into master May 25, 2022

quinnj deleted the jq/796 branch May 25, 2022 21:21

StefanKarpinski reviewed May 26, 2022

fonsp reviewed Jun 8, 2022
