
Sanitize improperly encoded form inputs #1954

Merged

merged 1 commit into master from mb-utf8-cleaning on Feb 5, 2018
Conversation

monfresh
Contributor

Why: To reduce 500 errors.

Hi! Before submitting your PR for review, and/or before merging it, please
go through the following checklist:

  • For DB changes, check for missing indexes, check to see if the changes
    affect other apps (such as the dashboard), make sure the DB columns in the
    various environments are properly populated, coordinate with devops, plan
    migrations in separate steps.

  • For route changes, make sure GET requests don't change state or result in
    destructive behavior. GET requests should only result in information being
    read, not written.

  • For encryption changes, make sure it is compatible with data that was
    encrypted with the old code.

  • For secrets changes, make sure to update the S3 secrets bucket with the
    new configs in all environments.

  • Do not disable Rubocop or Reek offenses unless you are absolutely sure
    they are false positives. If you're not sure how to fix the offense, please
    ask a teammate.

  • When reading data, write tests for nil values, empty strings,
    and invalid formats.

  • When calling redirect_to in a controller, use _url, not _path.

  • When adding user data to the session, use the user_session helper
    instead of the session helper so the data does not persist beyond the user's
    session.

  • When adding a new controller that requires the user to be fully
    authenticated, make sure to add before_action :confirm_two_factor_authenticated.

@brodygov
Contributor

Thanks for working through all this! This PR looks good, but I'm a little worried that we'll be playing whack-a-mole with this family of issues unless we have a more general validation middleware. What do you think? In the other issue, you alluded to the example gems/gist not actually working for most scenarios.

@jgsmith-usds
Contributor

jgsmith-usds commented Jan 31, 2018

I wonder if we can look at the request itself and sanitize there rather than on individual parameters. If we only accept a limited number of formats, this might work. Perhaps something like utf8-cleaner? It seems to handle json and form urlencoding.

Another option is to detect poor encoding and return a 400 Bad Request.

@brodygov
Contributor

The problem with utf8-cleaner is that it appears to silently discard unexpected bytes rather than returning a 400 Bad Request, which seems like it would mask real issues. https://github.com/18F/identity-private/issues/2560#issuecomment-361394099

@jmhooper
Member

Looking at the exception that's getting raised here: ActionController::BadRequest. That seems to imply a 400 should be rendered, right? What is the case where that is raised and we render a 500 instead of a 400? Perhaps we should just rescue that exception in the application controller and then render a 400 there?

@jgsmith-usds
Contributor

Agree that silently passing along content that had bad bytes is not great. A 400 would be a lot better. I would shy away from a 500 because 5xx implies the problem requires action on the server side to correct. A 4xx requires action by the client, which is proper if the client is sending bad encodings.

@brodygov
Contributor

brodygov commented Jan 31, 2018

@jmhooper Unless I'm misunderstanding something, I don't think ActionController::BadRequest is what was being raised here. BadRequest does indeed result in an HTTP 400. We have been paging ourselves on unhandled ArgumentError, ActiveRecord::StatementInvalid, and NoMethodError.

@jgsmith-usds
Contributor

So perhaps something modeled after utf8-cleaner that raises BadRequest if it finds a bad byte sequence?
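[Editor's note] A minimal sketch of that idea, assuming a plain Rack middleware that only inspects query-string parameters (the class name and 400 response body are illustrative, not the PR's actual code):

```ruby
require 'cgi'

# Sketch: reject any request whose query parameters contain byte
# sequences that are not valid UTF-8, instead of letting them 500 later.
class Utf8Sanitizer
  def initialize(app)
    @app = app
  end

  def call(env)
    values = CGI.parse(env['QUERY_STRING'].to_s).values.flatten
    if values.any? { |v| invalid_string?(v) }
      [400, { 'Content-Type' => 'text/plain' }, ['Bad Request']]
    else
      @app.call(env)
    end
  end

  private

  def invalid_string?(string)
    !string.dup.force_encoding('UTF-8').valid_encoding?
  end
end
```

A real version would also have to cover POST bodies and nested params; this only illustrates the return-400-on-bad-bytes shape.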

@monfresh force-pushed the mb-utf8-cleaning branch 2 times, most recently from 68b42d2 to 7aea80b on February 1, 2018 01:51
@monfresh
Contributor Author

monfresh commented Feb 1, 2018

@brodygov @jgsmith-usds Please check out the latest changes. I implemented a simple middleware.

def invalid_string?(string)
  !string.force_encoding('UTF-8').valid_encoding?
end
Contributor

My worry is that someone might use a browser that sends something other than UTF-8. Assuming that Rails converts non-UTF-8 to UTF-8 before handing it to the application, then this isn't a problem, and we only have to worry about Rails taking in content that is advertised as UTF-8 by the client. I'm still looking through Rack/Rails stuff to find a path that shows it looking at the input encoding.

Contributor

There's definitely some risk that users will send non-UTF-8 data. I don't think Rails does anything to parse the charset out of the request Content-Type header.

But returning an HTTP 400 is at least a little better than the current behavior of exploding and returning HTTP 500. I'm more concerned about future use cases where we want to accept file uploads or other fully binary data. Perhaps we could add some configuration option to the middleware so that if we want to whitelist certain parameters as non-UTF-8 we can do so?

Example realistic use case:

Anyone named Boris is going to be unhappy when Борис encoded in Windows-1251 triggers the invalid UTF-8 errors. (Б is 0xC1 in Windows-1251, which is an invalid UTF-8 byte.)

jgsmith-usds commented Feb 1, 2018

So:

boris = "Борис".encode(Encoding::Windows_1251)
=> "\xC1\xEE\xF0\xE8\xF1"

boris.encoding
=> #<Encoding:Windows-1251>

boris.force_encoding('UTF-8')
=> "\xC1\xEE\xF0\xE8\xF1" # byte sequence hasn't changed - just the encoding

boris.force_encoding('UTF-8').encoding
=> #<Encoding:UTF-8>

boris.force_encoding('UTF-8').valid_encoding?
=> false

boris.encode('UTF-8')
=> "Борис"

boris.encode('UTF-8').valid_encoding?
=> true

This assumes that Rails does tag the parameter strings with the charset that the browser specifies. If that isn't the case, then we have a bit more lifting to do.

The best might be to convert all incoming string parameters (except those we want as straight-up binary) to UTF-8 and raise a 400 if the conversion raises any errors.

This is as close as I got to finding something in Rails that might be able to do some of this: https://github.com/rails/rails/blob/b5db73076914e7103466bd76bec785cbcfe88875/actionpack/lib/action_controller/metal/parameter_encoding.rb . It points to a solution that might be a bit overkill for us for now.
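[Editor's note] The convert-or-400 approach described above might look like this (a sketch only; BadRequestError stands in for ActionController::BadRequest so the snippet is self-contained):

```ruby
# Stand-in for ActionController::BadRequest (which Rails renders as HTTP 400).
BadRequestError = Class.new(StandardError)

# Transcode an incoming value to UTF-8 using the encoding it is tagged
# with; any conversion failure becomes a client error rather than a 500.
def to_utf8!(value)
  converted = value.encode('UTF-8')
  # encode is a no-op when the string is already tagged UTF-8, so the
  # result still has to be validated explicitly.
  raise BadRequestError, 'not convertible to UTF-8' unless converted.valid_encoding?
  converted
rescue Encoding::UndefinedConversionError, Encoding::InvalidByteSequenceError
  raise BadRequestError, 'not convertible to UTF-8'
end
```

With correctly tagged input this transcodes cleanly ('Борис'.encode('Windows-1251') round-trips back to 'Борис'), while untaggable binary bytes surface as a client error.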

Contributor Author

All of these tests pass, so it looks like Rails handles this well:

it 'does not return 400 when email has Cyrillic characters' do
  params = { user: { email: 'Борис@test.ru' } }
  post sign_up_register_path, params: params

  expect(response.status).to eq 302
end

it 'does not return 400 when email is in Windows-1251 encoding' do
  params = { user: { email: 'Борис@test.ru'.force_encoding('Windows-1251') } }
  post sign_up_register_path, params: params

  expect(response.status).to eq 302
end


it 'does not return 400 when email is in Windows-1252 encoding' do
  params = { user: { email: 'Борис@test.ru'.force_encoding('Windows-1252') } }
  post sign_up_register_path, params: params

  expect(response.status).to eq 302
end

Contributor

Those tests as written aren't testing actual Windows-1251 text. It should be 'Борис'.encode('Windows-1251'), not force_encoding().

>> 'Борис'.force_encoding('windows-1251').bytes
=> [208, 145, 208, 190, 209, 128, 208, 184, 209, 129]
>> 'Борис'.encode('windows-1251').bytes
=> [193, 238, 240, 232, 241]

brodygov commented Feb 1, 2018

Right, as Jim notes, I don't think Rails tags parameters with the charset specified by the browser, and you need to know the source encoding in order to convert to UTF-8. In a few quick attempts at testing, the app still assumed everything was UTF-8 even when submitted with a charset specifying Windows-1251.

>> binary_boris = "Борис".encode(Encoding::Windows_1251).force_encoding(Encoding::BINARY)
=> "\xC1\xEE\xF0\xE8\xF1"
>> binary_boris.encoding
=> #<Encoding:ASCII-8BIT>
>> binary_boris.encode('utf-8')
Encoding::UndefinedConversionError: "\xC1" from ASCII-8BIT to UTF-8

I tested this by submitting a password reset using an email address of "andrew.brody+\xC1\xEE\xF0\xE8\xF1@gsa.gov", which threw an HTTP 500 as expected from the code on master.

Also I feel like Binary Boris is the name of some futurist robot.

monfresh commented Feb 2, 2018

I don't think that's an accurate test. If you submit "andrew.brody+\xC1\xEE\xF0\xE8\xF1@gsa.gov", the best guess for that String's encoding will be Windows-1251 (using the rchardet gem, for example: https://github.com/jmhodges/rchardet). You can then convert it to UTF-8 without any errors:

windows_andy = "andrew.brody+\xC1\xEE\xF0\xE8\xF1@gsa.gov".force_encoding(Encoding::Windows_1251)
windows_andy.encode('utf-8', 'windows-1251')
=> "andrew.brody+Борис@gsa.gov"

Contributor

I don't think there's anything wrong with the test. We should never guess the content encoding based on the content, because that can create all kinds of extra problems.

RFC 7231 and others discourage content sniffing. https://tools.ietf.org/html/rfc7231#section-3.1.1.5

The reason our app's behavior is problematic is that the request contained a Content-Type: ...; charset=Windows-1251 header to indicate that the content was encoded with Windows-1251, and the idp ignored this header and treated it as UTF-8 anyway.

https://tools.ietf.org/html/rfc7231#section-3.1.1
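[Editor's note] Honoring the declared charset could look roughly like this (a sketch; in a real middleware declared_charset would come from Rack::Request#content_charset and raw_bytes from the request body):

```ruby
# Reinterpret raw request bytes using the charset the client declared in
# its Content-Type header, then transcode to UTF-8.
def decode_param(raw_bytes, declared_charset)
  charset = declared_charset || 'UTF-8' # assume UTF-8 when no charset is declared
  raw_bytes.dup.force_encoding(charset).encode('UTF-8')
end
```

With this, the Binary Boris bytes \xC1\xEE\xF0\xE8\xF1 decode to Борис when the request says charset=Windows-1251, instead of being rejected as invalid UTF-8.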

@monfresh
Contributor Author

monfresh commented Feb 1, 2018

Let's jump back a bit. The reason we started this PR was to prevent the app from raising 500 errors when invalid UTF-8 was being submitted. So far, every single instance of these 500 errors has been the result of scans on non-authenticated endpoints: sign in form, register email form, and password reset form. None of these errors have been caused by legitimate users.

The exception was raised due to the app calling one of these 3 methods on a String with invalid UTF-8 bytes: .present?, .blank?, .downcase. In most cases, those calls are from our own code, so they are easy to fix, which is what I did in my original version of this PR. The other calls to .present? are made by Devise internals which would require overriding private methods. Instead, I opted to probe for the invalid bytes before Devise makes the call to current_user.

I don't think any of our LOA1 forms have any inputs that can be anything but valid UTF-8, so it should be safe enforcing that. For LOA3, I suppose first name, last name and address could allow other encodings, but I don't know for sure. It would depend on what our vendors accept.

I say for now let's go with the solution that prevents the 500 errors we've seen, and then deal with others on a case-by-case basis. Validating the encoding at the application layer allows us to display friendly error messages to the user, as opposed to a generic 400 error page.
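[Editor's note] The failure mode described above is easy to reproduce in plain Ruby: String#downcase on a string tagged UTF-8 but containing invalid bytes raises ArgumentError (the .present?/.blank? cases fail similarly inside ActiveSupport's regex matching):

```ruby
# Bytes like those submitted by the scans described above, tagged as UTF-8.
bad_input = "\xC1\xEE\xF0\xE8\xF1".dup.force_encoding('UTF-8')

raised = begin
  bad_input.downcase # blows up before any validation can run
  false
rescue ArgumentError # "invalid byte sequence in UTF-8"
  true
end
```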

@jgsmith-usds
Contributor

I don't think we have to accept encodings from the user based on what our vendors accept if we mediate between them. We simply transcode between what the user provides and what the vendor expects.

The reason for encouraging the middleware was to prevent the whack-a-mole situation. If the user is sending data from a browser, then the browser should be setting a character encoding in the request and we should be honoring that encoding. If we don't want to do a general solution in this PR, then I'd like to create an issue to track the general case so we don't forget it. It'll become more critical with the LOA3 information we gather.

@monfresh
Contributor Author

monfresh commented Feb 2, 2018

Can someone please explain why non-UTF-8 characters are a realistic use case? Do any of the documents/services we currently check or are planning to check for ID verification allow characters that are not from an extended Latin-based alphabet?

My name in Arabic is منصف, which is valid UTF-8, but I'm not going to type that when trying to prove my identity on login.gov. I'm going to type Moncef.

Rack does indeed read the charset from the Content-Type header if it's present:

request = Rack::Request.new(env)
charset = request.content_charset

My testing with Chrome shows that it does not include the charset, but maybe that's because browsers only send the charset if it's not UTF-8? A quick search didn't yield any ways to simulate different charsets from Chrome.

However, transcoding based on the client's charset is not foolproof. For example, someone could easily send a cURL request with a charset that doesn't correspond to the data, and if we tried to transcode it, it would fail. For example:

irb(main):002:0> "مندف".encode("utf-8", "Windows-1252")
Encoding::UndefinedConversionError: "\x81" to UTF-8 in conversion from Windows-1252 to UTF-8
	from (irb):2:in `encode'
	from (irb):2

Remember that the primary reason we are doing this work is to prevent non-legitimate requests from generating 500 errors and paging team members unnecessarily. To date, those are the only types of requests that have been generating such errors at high frequencies.

I suppose we could rescue the conversion error and raise the 400 in that case.
We could also use the rchardet gem, which guesses the encoding and provides a confidence level between 0 and 1, which seems pretty cool but probably overkill at the moment.

For the sake of time, I propose that we leave this PR as is, since currently, we do not have any inputs that would accept anything other than UTF-8. Once we know for sure that we will want to accept other encodings for LOA3, then we can revisit the middleware.

Thoughts?

@brodygov
Contributor

brodygov commented Feb 2, 2018

Right, the priority is definitely just to stop the HTTP 500 errors. But we should be aware of the consequences we are creating for legitimate requests that will be rejected by this.

I'm especially worried about being able to accept fully binary data, which is why I suggested adding something like a whitelist of binary-safe parameters to the middleware to make it future-proof. Agreed that we don't appear to have appreciable traffic sending content with invalid UTF-8 bytes today.

We definitely shouldn't guess the encoding based on content, but there may be clients that send well-formed content encoded in charsets other than UTF-8 and correctly send us a Content-Type header saying it's not UTF-8. In those cases we currently treat the content as UTF-8 anyway, even though we have all the information needed to correctly process it.

While most computers in the US have UTF-8 or Windows-1252 set as their default system encodings, it's common for computers especially in Asia to use other encodings like Shift JIS as the default encoding. It's hard to get global statistics for this, but it seems like as much as 10% of websites serve content in other encodings. https://en.wikipedia.org/wiki/UTF-8#/media/File:Utf8webgrowth.svg

A few examples of legitimate requests that will be rejected:

  • Binary data (e.g. photo upload using Content-Transfer-Encoding: binary)
  • Борис encoded in Windows-1251 (contains \xC1)
  • Müller encoded in Windows-1252 (sometimes called ISO-8859-1, contains \xFC)
  • منصف encoded in Windows-1256 ("\xE3\xE4\xD5\xDD")
  • かいと encoded in EUC-JP ("\xA4\xAB\xA4\xA4\xA4\xC8")
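[Editor's note] A quick check with Ruby's bundled encodings confirms that each non-binary example above fails UTF-8 validation once its bytes are reinterpreted as UTF-8:

```ruby
# Each value is well-formed in its own charset, but its bytes are not
# valid UTF-8, so a UTF-8-only check would reject it.
examples = {
  'Борис'  => 'Windows-1251',
  'Müller' => 'Windows-1252',
  'منصف'   => 'Windows-1256',
  'かいと' => 'EUC-JP',
}

invalid = examples.map do |text, charset|
  bytes = text.encode(charset)
  !bytes.dup.force_encoding('UTF-8').valid_encoding?
end
```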

@brodygov left a comment

LGTM, thanks for working through this!

**Why**: To reduce 500 errors.
@monfresh merged commit 42592b4 into master on Feb 5, 2018
@monfresh deleted the mb-utf8-cleaning branch on February 5, 2018 18:38
4 participants