
Sanitize improperly encoded form inputs #1954

Merged

merged 1 commit into master from mb-utf8-cleaning on Feb 5, 2018
Conversation

monfresh
Contributor

Why: To reduce 500 errors.

Hi! Before submitting your PR for review, and/or before merging it, please
go through the following checklist:

  • For DB changes, check for missing indexes, check to see if the changes
    affect other apps (such as the dashboard), make sure the DB columns in the
    various environments are properly populated, coordinate with devops, plan
    migrations in separate steps.

  • For route changes, make sure GET requests don't change state or result in
    destructive behavior. GET requests should only result in information being
    read, not written.

  • For encryption changes, make sure it is compatible with data that was
    encrypted with the old code.

  • For secrets changes, make sure to update the S3 secrets bucket with the
    new configs in all environments.

  • Do not disable Rubocop or Reek offenses unless you are absolutely sure
    they are false positives. If you're not sure how to fix the offense, please
    ask a teammate.

  • When reading data, write tests for nil values, empty strings,
    and invalid formats.

  • When calling redirect_to in a controller, use _url, not _path.

  • When adding user data to the session, use the user_session helper
    instead of the session helper so the data does not persist beyond the user's
    session.

  • When adding a new controller that requires the user to be fully
    authenticated, make sure to add before_action :confirm_two_factor_authenticated.

@brodygov
Contributor

Thanks for working through all this! This PR looks good, but I'm a little worried that we'll be playing whack-a-mole with this family of issues unless we have a more general validation middleware. What do you think? In the other issue, you alluded to the example gems/gist not actually working for most scenarios.

@jgsmith-usds
Contributor

jgsmith-usds commented Jan 31, 2018

I wonder if we can look at the request itself and sanitize there rather than on individual parameters. If we only accept a limited number of formats, this might work. Perhaps something like utf8-cleaner? It seems to handle json and form urlencoding.

Another option is to detect poor encoding and return a 400 Bad Request.

@brodygov
Contributor

The problem with utf8-cleaner is that it appears to silently discard unexpected bytes rather than returning a 400 Bad Request, which seems like it would mask real issues. https://github.com/18F/identity-private/issues/2560#issuecomment-361394099

@jmhooper
Member

Looking at the exception that's getting raised here: ActionController::BadRequest. That seems to imply a 400 should be rendered, right? What is the case where that is raised and we render a 500 instead of a 400? Perhaps we should just rescue that exception in the application controller and then render a 400 there?

@jgsmith-usds
Contributor

Agree that silently passing along content that had bad bytes is not great. A 400 would be a lot better. I would shy away from a 500 because 5xx implies the problem requires action on the server side to correct. A 4xx requires action by the client, which is proper if the client is sending bad encodings.

@brodygov
Contributor

brodygov commented Jan 31, 2018

@jmhooper Unless I'm misunderstanding something, I don't think ActionController::BadRequest is what was being raised here. BadRequest does indeed result in an HTTP 400. We have been paging ourselves on unhandled ArgumentError, ActiveRecord::StatementInvalid, and NoMethodError.

@jgsmith-usds
Contributor

So perhaps something modeled after utf8-cleaner that raises BadRequest if it finds a bad byte sequence?
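[Editor's note] A minimal sketch of that idea, assuming a plain Rack middleware that only inspects query-string parameters (the class name and 400 response body are illustrative, not the PR's actual code):

```ruby
require 'cgi'

# Sketch: reject any request whose query parameters contain byte
# sequences that are not valid UTF-8, instead of letting them 500 later.
class Utf8Sanitizer
  def initialize(app)
    @app = app
  end

  def call(env)
    values = CGI.parse(env['QUERY_STRING'].to_s).values.flatten
    if values.any? { |v| invalid_string?(v) }
      [400, { 'Content-Type' => 'text/plain' }, ['Bad Request']]
    else
      @app.call(env)
    end
  end

  private

  def invalid_string?(string)
    !string.dup.force_encoding('UTF-8').valid_encoding?
  end
end
```

A real version would also have to cover POST bodies and nested params; this only illustrates the return-400-on-bad-bytes shape.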

@monfresh force-pushed the mb-utf8-cleaning branch 2 times, most recently from 68b42d2 to 7aea80b on February 1, 2018 01:51
@monfresh
Contributor Author

monfresh commented Feb 1, 2018

@brodygov @jgsmith-usds Please check out the latest changes. I implemented a simple middleware.

def invalid_string?(string)
  !string.force_encoding('UTF-8').valid_encoding?
end
Contributor

My worry is that someone might use a browser that sends something other than UTF-8. Assuming that Rails converts non-UTF-8 to UTF-8 before handing it to the application, then this isn't a problem, and we only have to worry about Rails taking in content that is advertised as UTF-8 by the client. I'm still looking through Rack/Rails stuff to find a path that shows it looking at the input encoding.

Contributor

There's definitely some risk that users will send non-UTF-8 data. I don't think Rails does anything to parse the charset out of the request Content-Type header.

But returning an HTTP 400 is at least a little better than the current behavior of exploding and returning HTTP 500. I'm more concerned about future use cases where we want to accept file uploads or other fully binary data. Perhaps we could add some configuration option to the middleware so that if we want to whitelist certain parameters as non-UTF-8 we can do so?

Example realistic use case:

Anyone named Boris is going to be unhappy when Борис encoded in Windows-1251 triggers the invalid UTF-8 errors. (Б is 0xC1 in Windows-1251, which is an invalid UTF-8 byte.)

jgsmith-usds commented Feb 1, 2018

So:

boris = "Борис".encode(Encoding::Windows_1251)
=> "\xC1\xEE\xF0\xE8\xF1"

boris.encoding
=> #<Encoding:Windows-1251>

boris.force_encoding('UTF-8')
=> "\xC1\xEE\xF0\xE8\xF1" # byte sequence hasn't changed - just the encoding

boris.force_encoding('UTF-8').encoding
=> #<Encoding:UTF-8>

boris.force_encoding('UTF-8').valid_encoding?
=> false

boris.encode('UTF-8')
=> "Борис"

boris.encode('UTF-8').valid_encoding?
=> true

This assumes that Rails does tag the parameter strings with the charset that the browser specifies. If that isn't the case, then we have a bit more lifting to do.

The best might be to convert all incoming string parameters (except those we want as straight-up binary) to UTF-8 and raise a 400 if the conversion raises any errors.

This is as close as I got to finding something in Rails that might be able to do some of this: https://github.com/rails/rails/blob/b5db73076914e7103466bd76bec785cbcfe88875/actionpack/lib/action_controller/metal/parameter_encoding.rb . It points to a solution that might be a bit overkill for us for now.
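[Editor's note] The convert-or-400 approach described above might look like this (a sketch only; BadRequestError stands in for ActionController::BadRequest so the snippet is self-contained):

```ruby
# Stand-in for ActionController::BadRequest (which Rails renders as HTTP 400).
BadRequestError = Class.new(StandardError)

# Transcode an incoming value to UTF-8 using the encoding it is tagged
# with; any conversion failure becomes a client error rather than a 500.
def to_utf8!(value)
  converted = value.encode('UTF-8')
  # encode is a no-op when the string is already tagged UTF-8, so the
  # result still has to be validated explicitly.
  raise BadRequestError, 'not convertible to UTF-8' unless converted.valid_encoding?
  converted
rescue Encoding::UndefinedConversionError, Encoding::InvalidByteSequenceError
  raise BadRequestError, 'not convertible to UTF-8'
end
```

With correctly tagged input this transcodes cleanly ('Борис'.encode('Windows-1251') round-trips back to 'Борис'), while untaggable binary bytes surface as a client error.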

Contributor Author

All of these tests pass, so it looks like Rails handles this well:

it 'does not return 400 when email has Cyrillic characters' do
  params = { user: { email: 'Борис@test.ru' } }
  post sign_up_register_path, params: params

  expect(response.status).to eq 302
end

it 'does not return 400 when email is in Windows-1251 encoding' do
  params = { user: { email: 'Борис@test.ru'.force_encoding('Windows-1251') } }
  post sign_up_register_path, params: params

  expect(response.status).to eq 302
end


it 'does not return 400 when email is in Windows-1252 encoding' do
  params = { user: { email: 'Борис@test.ru'.force_encoding('Windows-1252') } }
  post sign_up_register_path, params: params

  expect(response.status).to eq 302
end

Contributor

Those tests as written aren't testing actual Windows-1251 text. It should be 'Борис'.encode('Windows-1251'), not force_encoding().

>> 'Борис'.force_encoding('windows-1251').bytes
=> [208, 145, 208, 190, 209, 128, 208, 184, 209, 129]
>> 'Борис'.encode('windows-1251').bytes
=> [193, 238, 240, 232, 241]

brodygov commented Feb 1, 2018

Right, as Jim notes, I don't think Rails tags parameters with the charset specified by the browser, and you need to know the source encoding in order to convert to UTF-8. In a few quick attempts at testing, the app still assumed everything was UTF-8 even when submitted with a charset specifying Windows-1251.

>> binary_boris = "Борис".encode(Encoding::Windows_1251).force_encoding(Encoding::BINARY)
=> "\xC1\xEE\xF0\xE8\xF1"
>> binary_boris.encoding
=> #<Encoding:ASCII-8BIT>
>> binary_boris.encode('utf-8')
Encoding::UndefinedConversionError: "\xC1" from ASCII-8BIT to UTF-8

I tested this by submitting a password reset using an email address of "andrew.brody+\xC1\xEE\xF0\xE8\xF1@gsa.gov", which threw an HTTP 500 as expected from the code on master.

Also I feel like Binary Boris is the name of some futurist robot.

monfresh commented Feb 2, 2018

I don't think that's an accurate test. If you submit "andrew.brody+\xC1\xEE\xF0\xE8\xF1@gsa.gov", the best guess for that String's encoding will be Windows-1251 (using the rchardet gem, for example: https://github.com/jmhodges/rchardet). You can then convert it to UTF-8 without any errors:

windows_andy = "andrew.brody+\xC1\xEE\xF0\xE8\xF1@gsa.gov".force_encoding(Encoding::Windows_1251)
windows_andy.encode('utf-8', 'windows-1251')
=> "andrew.brody+Борис@gsa.gov"

Contributor

I don't think there's anything wrong with the test. We should never guess the content encoding based on the content, because that can create all kinds of extra problems.

RFC 7231 and others discourage content sniffing. https://tools.ietf.org/html/rfc7231#section-3.1.1.5

The reason our app's behavior is problematic is that the request contained a Content-Type: ...; charset=Windows-1251 header to indicate that the content was encoded with Windows-1251, and the idp ignored this header and treated it as UTF-8 anyway.

https://tools.ietf.org/html/rfc7231#section-3.1.1
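[Editor's note] Honoring the declared charset could look roughly like this (a sketch; in a real middleware declared_charset would come from Rack::Request#content_charset and raw_bytes from the request body):

```ruby
# Reinterpret raw request bytes using the charset the client declared in
# its Content-Type header, then transcode to UTF-8.
def decode_param(raw_bytes, declared_charset)
  charset = declared_charset || 'UTF-8' # assume UTF-8 when no charset is declared
  raw_bytes.dup.force_encoding(charset).encode('UTF-8')
end
```

With this, the Binary Boris bytes \xC1\xEE\xF0\xE8\xF1 decode to Борис when the request says charset=Windows-1251, instead of being rejected as invalid UTF-8.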

@monfresh
Contributor Author

monfresh commented Feb 1, 2018

Let's jump back a bit. The reason we started this PR was to prevent the app from raising 500 errors when invalid UTF-8 was being submitted. So far, every single instance of these 500 errors has been the result of scans on non-authenticated endpoints: sign in form, register email form, and password reset form. None of these errors have been caused by legitimate users.

The exception was raised due to the app calling one of these 3 methods on a String with invalid UTF-8 bytes: .present?, .blank?, .downcase. In most cases, those calls are from our own code, so they are easy to fix, which is what I did in my original version of this PR. The other calls to .present? are made by Devise internals which would require overriding private methods. Instead, I opted to probe for the invalid bytes before Devise makes the call to current_user.

I don't think any of our LOA1 forms have any inputs that can be anything but valid UTF-8, so it should be safe enforcing that. For LOA3, I suppose first name, last name and address could allow other encodings, but I don't know for sure. It would depend on what our vendors accept.

I say for now let's go with the solution that prevents the 500 errors we've seen, and then deal with others on a case-by-case basis. Validating the encoding at the application layer allows us to display friendly error messages to the user, as opposed to a generic 400 error page.
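[Editor's note] The failure mode described above is easy to reproduce in plain Ruby: String#downcase on a string tagged UTF-8 but containing invalid bytes raises ArgumentError (the .present?/.blank? cases fail similarly inside ActiveSupport's regex matching):

```ruby
# Bytes like those submitted by the scans described above, tagged as UTF-8.
bad_input = "\xC1\xEE\xF0\xE8\xF1".dup.force_encoding('UTF-8')

raised = begin
  bad_input.downcase # blows up before any validation can run
  false
rescue ArgumentError # "invalid byte sequence in UTF-8"
  true
end
```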

@jgsmith-usds
Contributor

I don't think we have to accept encodings from the user based on what our vendors accept if we mediate between them. We simply transcode between what the user provides and what the vendor expects.

The reason for encouraging the middleware was to prevent the whack-a-mole situation. If the user is sending data from a browser, then the browser should be setting a character encoding in the request and we should be honoring that encoding. If we don't want to do a general solution in this PR, then I'd like to create an issue to track the general case so we don't forget it. It'll become more critical with the LOA3 information we gather.

@monfresh
Contributor Author

monfresh commented Feb 2, 2018

Can someone please explain why non-UTF-8 characters are a realistic use case? Do any of the documents/services we currently check or are planning to check for ID verification allow characters that are not from an extended Latin-based alphabet?

My name in Arabic is منصف, which is valid UTF-8, but I'm not going to type that when trying to prove my identity on login.gov. I'm going to type Moncef.

Rack does indeed read the charset from the Content-Type header if it's present:

request = Rack::Request.new(env)
charset = request.content_charset

My testing with Chrome shows that it does not include the charset, but maybe that's because browsers only send the charset if it's not UTF-8? A quick search didn't yield any ways to simulate different charsets from Chrome.

However, transcoding based on the client's charset is not foolproof. For example, someone could easily send a cURL request with a charset that doesn't correspond to the data, and if we tried to transcode it, it would fail. For example:

irb(main):002:0> "مندف".encode("utf-8", "Windows-1252")
Encoding::UndefinedConversionError: "\x81" to UTF-8 in conversion from Windows-1252 to UTF-8
	from (irb):2:in `encode'
	from (irb):2

Remember that the primary reason we are doing this work is to prevent non-legitimate requests from generating 500 errors and paging team members unnecessarily. To date, those are the only types of requests that have been generating such errors at high frequencies.

I suppose we could rescue the conversion error and raise the 400 in that case.
We could also use the rchardet gem, which guesses the encoding and provides a confidence level between 0 and 1, which seems pretty cool but probably overkill at the moment.

For the sake of time, I propose that we leave this PR as is, since currently, we do not have any inputs that would accept anything other than UTF-8. Once we know for sure that we will want to accept other encodings for LOA3, then we can revisit the middleware.

Thoughts?

@brodygov
Contributor

brodygov commented Feb 2, 2018

Right, the priority is definitely just to stop the HTTP 500 errors. But we should be aware of the consequences we are creating for legitimate requests that will be rejected by this.

I'm especially worried about being able to accept fully binary data, which is why I suggested adding something like a whitelist of binary-safe parameters to the middleware to make it future-proof. Agreed that we don't appear to have appreciable traffic sending content with invalid UTF-8 bytes today.

We definitely shouldn't guess the encoding based on content, but there may be clients that send well-formed content encoded in charsets other than UTF-8 and correctly send us a Content-Type header saying it's not UTF-8. In those cases we currently treat the content as UTF-8 anyway, even though we have all the information needed to correctly process it.

While most computers in the US have UTF-8 or Windows-1252 set as their default system encodings, it's common for computers especially in Asia to use other encodings like Shift JIS as the default encoding. It's hard to get global statistics for this, but it seems like as much as 10% of websites serve content in other encodings. https://en.wikipedia.org/wiki/UTF-8#/media/File:Utf8webgrowth.svg

A few examples of legitimate requests that will be rejected:

  • Binary data (e.g. photo upload using Content-Transfer-Encoding: binary)
  • Борис encoded in Windows-1251 (contains \xC1)
  • Müller encoded in Windows-1252 (sometimes called ISO-8859-1, contains \xFC)
  • منصف encoded in Windows-1256 ("\xE3\xE4\xD5\xDD")
  • かいと encoded in EUC-JP ("\xA4\xAB\xA4\xA4\xA4\xC8")
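[Editor's note] A quick check with Ruby's bundled encodings confirms that each non-binary example above fails UTF-8 validation once its bytes are reinterpreted as UTF-8:

```ruby
# Each value is well-formed in its own charset, but its bytes are not
# valid UTF-8, so a UTF-8-only check would reject it.
examples = {
  'Борис'  => 'Windows-1251',
  'Müller' => 'Windows-1252',
  'منصف'   => 'Windows-1256',
  'かいと' => 'EUC-JP',
}

invalid = examples.map do |text, charset|
  bytes = text.encode(charset)
  !bytes.dup.force_encoding('UTF-8').valid_encoding?
end
```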

@brodygov left a comment

LGTM, thanks for working through this!

**Why**: To reduce 500 errors.
@monfresh merged commit 42592b4 into master on Feb 5, 2018
@monfresh deleted the mb-utf8-cleaning branch on February 5, 2018 18:38
4 participants