Sanitize improperly encoded form inputs #1954
Conversation
Force-pushed 1cc5ea4 to b130442
Thanks for working through all this! This PR looks good, but I'm a little worried that we'll be playing whack-a-mole with this family of issues unless we have a more general validation middleware. What do you think? In the other issue you alluded to the example gems/gist not actually working for most scenarios.
I wonder if we can look at the request itself and sanitize there rather than on individual parameters. If we only accept a limited number of formats, this might work; perhaps something like utf8-cleaner. Another option is to detect poor encoding and return a 400 Bad Request.
The problem with utf8-cleaner is that it appears to silently discard unexpected bytes rather than returning a 400 Bad Request, which seems like it would mask real issues. https://github.com/18F/identity-private/issues/2560#issuecomment-361394099
Looking at the exception that's getting raised here.
Agree that silently passing along content that had bad bytes is not great. A 400 would be a lot better. I would shy away from a 500 because 5xx implies the problem requires action on the server side to correct. A 4xx requires action by the client, which is proper if the client is sending bad encodings. |
@jmhooper Unless I'm misunderstanding something, I don't think
So perhaps something modeled after utf8-cleaner that raises
Force-pushed 68b42d2 to 7aea80b
@brodygov @jgsmith-usds Please check out the latest changes. I implemented a simple middleware.
def invalid_string?(string)
  !string.force_encoding('UTF-8').valid_encoding?
end
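For illustration, a check like invalid_string? could sit in a small Rack-style middleware that rejects bad bytes with a 400 before they ever reach the app. This is only a sketch (the class name and wiring are hypothetical, not the PR's actual code), and it inspects only the raw query string; real code would also read the request body and percent-decode parameters first.

```ruby
# Sketch only: a minimal Rack-style middleware built around the
# invalid_string? check above.
class Utf8SanitizerSketch
  def initialize(app)
    @app = app
  end

  def call(env)
    if invalid_string?(env['QUERY_STRING'].to_s)
      [400, { 'Content-Type' => 'text/plain' }, ['Bad Request']]
    else
      @app.call(env)
    end
  end

  private

  # Tag the bytes as UTF-8 and ask Ruby whether the sequence is well-formed.
  def invalid_string?(string)
    !string.dup.force_encoding('UTF-8').valid_encoding?
  end
end

# Tiny downstream app for demonstration purposes.
app = Utf8SanitizerSketch.new(->(env) { [200, {}, ['OK']] })
```

With this in place, a request whose query string carries invalid UTF-8 bytes gets a 400 instead of raising inside the app and producing a 500.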
My worry is that someone might use a browser that sends something other than UTF-8. Assuming that Rails converts non-UTF-8 to UTF-8 before handing it to the application, then this isn't a problem, and we only have to worry about Rails taking in content that is advertised as UTF-8 by the client. I'm still looking through Rack/Rails stuff to find a path that shows it looking at the input encoding.
There's definitely some risk that users will send non-UTF-8 data. I don't think Rails does anything to parse the charset out of the request Content-Type header.
But returning an HTTP 400 is at least a little better than the current behavior of exploding and returning HTTP 500. I'm more concerned about future use cases where we want to accept file uploads or other fully binary data. Perhaps we could add some configuration option to the middleware so that if we want to whitelist certain parameters as non-UTF-8 we can do so?
Example realistic use case: anyone named Boris is going to be unhappy when Борис encoded in Windows-1251 triggers the invalid UTF-8 errors. (Б is 0xC1 in Windows-1251, which is an invalid UTF-8 byte.)
So:
boris = "Борис".encode(Encoding::Windows_1251)
=> "\xC1\xEE\xF0\xE8\xF1"
boris.encoding
=> #<Encoding:Windows-1251>
boris.force_encoding('UTF-8')
=> "\xC1\xEE\xF0\xE8\xF1" # byte sequence hasn't changed - just the encoding
boris.force_encoding('UTF-8').encoding
=> #<Encoding:UTF-8>
boris.force_encoding('UTF-8').valid_encoding?
=> false
boris.encode('UTF-8')
=> "Борис"
boris.encode('UTF-8').valid_encoding?
=> true
This assumes that Rails does tag the parameter strings with the charset that the browser specifies. If that isn't the case, then we have a bit more lifting to do.
The best might be to convert all incoming string parameters (except those we want as straight-up binary) to UTF-8 and raise a 400 if the conversion raises any errors.
This is as close as I got to finding something in Rails that might be able to do some of this: https://github.com/rails/rails/blob/b5db73076914e7103466bd76bec785cbcfe88875/actionpack/lib/action_controller/metal/parameter_encoding.rb . It points to a solution that might be a bit overkill for us for now.
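For reference, that module is what backs the skip_parameter_encoding class method (Rails 5.1+), which tags every parameter sent to an action as ASCII-8BIT (binary) rather than assuming UTF-8, leaving validation and transcoding up to the application. A sketch of how it would be used, with a hypothetical controller and action:

```ruby
# Sketch only: controller/action names are hypothetical.
class UploadsController < ApplicationController
  skip_parameter_encoding :create

  def create
    # params values arrive here tagged with Encoding::ASCII_8BIT
    # instead of UTF-8, so we can transcode or validate them ourselves.
  end
end
```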
All of these tests pass, so it looks like Rails handles this well:
it 'does not return 400 when email has Cyrillic characters' do
params = { user: { email: 'Борис@test.ru' } }
post sign_up_register_path, params: params
expect(response.status).to eq 302
end
it 'does not return 400 when email is in Windows-1251 encoding' do
params = { user: { email: 'Борис@test.ru'.force_encoding('Windows-1251') } }
post sign_up_register_path, params: params
expect(response.status).to eq 302
end
it 'does not return 400 when email is in Windows-1252 encoding' do
params = { user: { email: 'Борис@test.ru'.force_encoding('Windows-1252') } }
post sign_up_register_path, params: params
expect(response.status).to eq 302
end
Those tests as written aren't testing actual Windows-1251 text. It should be 'Борис'.encode('Windows-1251'), not force_encoding().
>> 'Борис'.force_encoding('windows-1251').bytes
=> [208, 145, 208, 190, 209, 128, 208, 184, 209, 129]
>> 'Борис'.encode('windows-1251').bytes
=> [193, 238, 240, 232, 241]
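In other words, encode transcodes the bytes while force_encoding merely relabels them. A self-contained check of the byte counts shown above:

```ruby
utf8   = 'Борис'                      # five Cyrillic letters, two bytes each in UTF-8
cp1251 = utf8.encode('windows-1251')  # transcoded: five bytes, one per letter

# force_encoding only relabels the bytes; the ten UTF-8 bytes are unchanged,
# and read as Windows-1251 they decode to mojibake, not to Борис.
relabeled = utf8.dup.force_encoding('windows-1251')
```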
Right, as Jim notes, I don't think Rails tags parameters with the charset specified by the browser, and you need to know the source encoding in order to convert to UTF-8. In a few quick attempts at testing, the app still assumed everything was UTF-8 even when submitted with a charset specifying Windows-1251.
>> binary_boris = "Борис".encode(Encoding::Windows_1251).force_encoding(Encoding::BINARY)
=> "\xC1\xEE\xF0\xE8\xF1"
>> binary_boris.encoding
=> #<Encoding:ASCII-8BIT>
>> binary_boris.encode('utf-8')
Encoding::UndefinedConversionError: "\xC1" from ASCII-8BIT to UTF-8
I tested this by submitting a password reset using an email address of "andrew.brody+\xC1\xEE\xF0\xE8\[email protected]", which threw an HTTP 500 as expected from the code on master.
Also I feel like Binary Boris is the name of some futurist robot.
I don't think that's an accurate test. If you submit "andrew.brody+\xC1\xEE\xF0\xE8\[email protected]", the best guess for that String's encoding will be windows-1251 (using the rchardet gem for example: https://github.com/jmhodges/rchardet). You can then convert it to UTF-8 without any errors:
windows_andy = "andrew.brody+\xC1\xEE\xF0\xE8\xF1@gsa.gov".encode(Encoding::Windows_1251)
windows_andy.encode('utf-8', 'windows-1251')
=> "andrew.brody+Борис@gsa.gov"
I don't think there's anything wrong with the test. We should never guess the content encoding based on the content, because that can create all kinds of extra problems.
RFC 7231 and others discourage content sniffing. https://tools.ietf.org/html/rfc7231#section-3.1.1.5
The reason our app's behavior is problematic is that the request contained a Content-Type: ...; charset=Windows-1251 header to indicate that the content was encoded with Windows-1251, and the idp ignored this header and treated it as UTF-8 anyway.
Let's jump back a bit. The reason we started this PR was to prevent the app from raising 500 errors when invalid UTF-8 was being submitted. So far, every single instance of these 500 errors has been the result of scans on non-authenticated endpoints: the sign-in form, the register email form, and the password reset form. None of these errors have been caused by legitimate users. The exception was raised by the app calling one of three String methods on a String with invalid UTF-8 bytes.

I don't think any of our LOA1 forms have any inputs that can be anything but valid UTF-8, so it should be safe to enforce that. For LOA3, I suppose first name, last name, and address could allow other encodings, but I don't know for sure; it would depend on what our vendors accept. I say for now let's go with the solution that prevents the 500 errors as we've seen them, and then deal with others on a case-by-case basis. Validating the encoding at the application layer also allows us to display friendly error messages to the user, as opposed to a generic 400 error page.
I don't think we have to accept encodings from the user based on what our vendors accept if we mediate between them. We simply transcode between what the user provides and what the vendor expects. The reason for encouraging the middleware was to prevent the whack-a-mole situation. If the user is sending data from a browser, then the browser should be setting a character encoding in the request and we should be honoring that encoding. If we don't want to do a general solution in this PR, then I'd like to create an issue to track the general case so we don't forget it. It'll become more critical with the LOA3 information we gather.
Can someone please explain why non-UTF-8 characters are a realistic use case? Do any of the documents/services we currently check or are planning to check for ID verification allow characters that are not from an extended Latin-based alphabet? My name in Arabic is منصف, which is valid UTF-8, but I'm not going to type that when trying to prove my identity on login.gov. I'm going to type

Rack does indeed read the charset from the request:

request = Rack::Request.new(env)
charset = request.content_charset

My testing with Chrome shows that it does not include the charset, but maybe that's because browsers only send the charset if it's not UTF-8? A quick search didn't yield any ways to simulate different charsets from Chrome. However, transcoding based on the client's charset is not foolproof. For example, someone could easily send a
Remember that the primary reason we are doing this work is to prevent non-legitimate requests from generating 500 errors and paging team members unnecessarily. To date, those are the only types of requests that have been generating such errors at high frequencies. I suppose we could rescue the conversion error and raise the 400 in that case. For the sake of time, I propose that we leave this PR as is, since currently we do not have any inputs that would accept anything other than UTF-8. Once we know for sure that we will want to accept other encodings for LOA3, we can revisit the middleware. Thoughts?
Right, the priority is definitely just to stop the HTTP 500 errors. But we should be aware of the consequences we are creating for legitimate requests that will be rejected by this. I'm especially worried about being able to accept fully binary data, which is why I suggested adding something like a whitelist of binary-safe parameters to the middleware to make it future-proof. Agreed that we don't appear to have appreciable traffic sending content with invalid UTF-8 bytes today.

We definitely shouldn't guess the encoding based on content, but there may be clients that send well-formed content encoded in charsets other than UTF-8 and correctly send us a Content-Type header saying it's not UTF-8. In those cases we currently treat the content as UTF-8 anyway, even though we have all the information needed to correctly process it. While most computers in the US have UTF-8 or Windows-1252 set as their default system encodings, it's common for computers, especially in Asia, to use other encodings like Shift JIS as the default. It's hard to get global statistics for this, but it seems like as much as 10% of websites serve content in other encodings. https://en.wikipedia.org/wiki/UTF-8#/media/File:Utf8webgrowth.svg

A few examples of legitimate requests that will be rejected:
LGTM, thanks for working through this!
**Why**: To reduce 500 errors.
Force-pushed 7aea80b to 7c00d71
Why: To reduce 500 errors.
Hi! Before submitting your PR for review, and/or before merging it, please
go through the following checklist:
For DB changes, check for missing indexes, check to see if the changes
affect other apps (such as the dashboard), make sure the DB columns in the
various environments are properly populated, coordinate with devops, plan
migrations in separate steps.
For route changes, make sure GET requests don't change state or result in
destructive behavior. GET requests should only result in information being
read, not written.
For encryption changes, make sure it is compatible with data that was
encrypted with the old code.
For secrets changes, make sure to update the S3 secrets bucket with the
new configs in all environments.
Do not disable Rubocop or Reek offenses unless you are absolutely sure
they are false positives. If you're not sure how to fix the offense, please
ask a teammate.
When reading data, write tests for nil values, empty strings,
and invalid formats.
When calling redirect_to in a controller, use _url, not _path.

When adding user data to the session, use the user_session helper instead of the session helper so the data does not persist beyond the user's session.

When adding a new controller that requires the user to be fully authenticated, make sure to add before_action :confirm_two_factor_authenticated.