
LG-243 Disallow indexing of certain pages #2151

Merged 1 commit into master from mb-robots-lg-243 on May 14, 2018

Conversation

monfresh (Contributor) commented on May 9, 2018:

Why: A couple of our unauthenticated pages can contain sensitive
parameters in the URL, which can be indexed by some search engines.
Although these parameters are only valid for a short period of time,
we want to be cautious and prevent search engines from indexing them.

In addition, we want to disallow indexing and crawling altogether on
our lower environments in order to avoid being dinged by search engines
for having duplicate content across different domains.

How:

  • During app deploy, copy a different robots.txt that disallows all crawling.
  • Use our existing session_with_trust? helper method, along with a new
    Figaro config, to disallow indexing of pages that can contain sensitive
    parameters (see the sketch after this list).
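
To make the second bullet above concrete, here is a minimal sketch (not the
code merged in this PR): it assumes the existing session_with_trust? helper
and a hypothetical Figaro key named noindex_trusted_pages, and it sends an
X-Robots-Tag header so spider-respecting crawlers skip responses that may
carry sensitive URL parameters.

# Sketch only -- the concern name and Figaro key are hypothetical.
module BlockCrawlersConcern
  extend ActiveSupport::Concern

  included do
    before_action :block_indexing_of_sensitive_pages
  end

  private

  # Send a noindex/nofollow header when the feature flag is on and the
  # request belongs to a session that may carry sensitive URL parameters.
  def block_indexing_of_sensitive_pages
    return unless Figaro.env.noindex_trusted_pages == 'true'
    return unless session_with_trust?

    response.headers['X-Robots-Tag'] = 'noindex, nofollow'
  end
end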

Hi! Before submitting your PR for review, and/or before merging it, please
go through the following checklist:

  • For DB changes, check for missing indexes, check to see if the changes
    affect other apps (such as the dashboard), make sure the DB columns in the
    various environments are properly populated, coordinate with devops, plan
    migrations in separate steps.

  • For route changes, make sure GET requests don't change state or result in
    destructive behavior. GET requests should only result in information being
    read, not written.

  • For encryption changes, make sure it is compatible with data that was
    encrypted with the old code.

  • For secrets changes, make sure to update the S3 secrets bucket with the
    new configs in all environments.

  • Do not disable Rubocop or Reek offenses unless you are absolutely sure
    they are false positives. If you're not sure how to fix the offense, please
    ask a teammate.

  • When reading data, write tests for nil values, empty strings,
    and invalid formats.

  • When calling redirect_to in a controller, use _url, not _path.

  • When adding user data to the session, use the user_session helper
    instead of the session helper so the data does not persist beyond the user's
    session.

  • When adding a new controller that requires the user to be fully
    authenticated, make sure to add before_action :confirm_two_factor_authenticated
    (a short example covering these last items follows the checklist).
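
As an illustration of the last few checklist items (a purely hypothetical
controller, route helper, and session key; confirm_two_factor_authenticated,
user_session, and the _url guidance are the helpers and rules named above):

# Hypothetical example for the checklist, not code from this PR.
class SecurityEventsController < ApplicationController
  # Require a fully authenticated (2FA-completed) user for every action.
  before_action :confirm_two_factor_authenticated

  def create
    # user_session is tied to the signed-in user's session, so this value
    # is discarded at sign-out instead of persisting in the plain session.
    user_session[:last_security_event_at] = Time.zone.now
    redirect_to account_url # per the checklist: _url, not _path
  end
end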

@monfresh requested review from abrouilette, brodygov, and konklone and removed the request for abrouilette on May 9, 2018 16:13
konklone (Contributor) commented on May 9, 2018:

Is this maybe a little extreme? Wouldn't we want e.g. archive.org and other spider-respecting services to see the homepage?

brodygov (Contributor) commented on May 9, 2018:

Yeah I would lean toward disallowing access to authenticated pages only. We definitely do want secure.login.gov to be crawled by search engines and have good page rank so that people searching for login.gov will end up in the right place.

monfresh (Contributor, Author) commented:

At first, I thought the request_id parameter that can show up on the homepage was sensitive, but it is not. Looking into this some more, I found a helpful article about how to prevent indexing of certain pages in Rails: https://robots.thoughtbot.com/block-web-crawlers-with-rails. That article also made me realize we should block crawlers completely from the lower-environment sites, which are publicly accessible and contain the same content as secure.login.gov; that duplicate content can hurt our ranking.
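
The article's technique boils down to serving robots.txt from a controller
instead of public/. Roughly (a sketch of that approach, not the code that
ended up in this PR, which uses a static file copied at deploy time):

# config/routes.rb (sketch)
get '/robots.txt' => 'robots#show'

# app/controllers/robots_controller.rb (sketch)
class RobotsController < ApplicationController
  def show
    # Allow crawling only in production; lower environments get
    # "Disallow: /" so their duplicate content never competes with
    # secure.login.gov in search results.
    body = if Rails.env.production?
             "User-agent: *\nDisallow:\n"
           else
             "User-agent: *\nDisallow: /\n"
           end
    render plain: body
  end
end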

konklone (Contributor) left a review comment:

Noting that the thread has some consensus around making this narrower (in production) to allow unauthenticated pages to be indexed.

brodygov (Contributor) commented:

Just for a point of comparison, the Stripe dashboard had similar issues with random pages ending up in search engines some years ago, and settled on:

User-agent: *
disallow: /account/
disallow: /oauth/
disallow: /invoice/

https://dashboard.stripe.com/robots.txt

@monfresh force-pushed the mb-robots-lg-243 branch 2 times, most recently from 4d72fdb to 373f557, on May 11, 2018 00:48
monfresh (Contributor, Author) commented:

@konklone @brodygov This is ready for another review. Please take a look at the updated commit message for details.

konklone (Contributor) commented:

My comments about scope were addressed, so approving from my end. One concern I have is that by moving this from static to dynamic routing, we could potentially take on more load than expected, given how often this route could get hit. But that's just a distant observation on the implementation; others would know better what the impact is.

monfresh (Contributor, Author) commented:

That's a good point. Maybe we can upload a different static robots.txt in the lower envs as part of the deploy process. What do you think, @brodygov?

brodygov (Contributor) commented:

While it would be nicer, all things being equal, to drop a static robots.txt into public, we're way overprovisioned on idp capacity at the moment, so I wouldn't worry too much about load. Dropping it in during either deploy/build or deploy/build-post-config would be plausible.

The more scalable way to address idp server load would be to invest effort in fronting the whole system with a CDN.

monfresh (Contributor, Author) commented:

I agree on the CDN. That should be relatively easy to do on the app side, but in a separate PR, obviously. Would CloudFront be a viable solution?

@monfresh force-pushed the mb-robots-lg-243 branch from 373f557 to 9472e2e on May 11, 2018 20:31
monfresh (Contributor, Author) commented:

OK, I went with the "copy the modified robots.txt to lower envs upon deploy" route and removed the dynamic code. One more look, please? @brodygov @konklone

require 'login_gov/hostdata'

if LoginGov::Hostdata.in_datacenter? && LoginGov::Hostdata.env != 'prod'
system 'cp public/ban-robots.txt public/robots.txt'
Inline review comment (contributor):
I'd suggest FileUtils.cp here instead.


require 'login_gov/hostdata'

if LoginGov::Hostdata.in_datacenter? && LoginGov::Hostdata.env != 'prod'
Inline review comment (contributor):
Slight preference for using LoginGov::Hostdata.domain != 'login.gov' instead of LoginGov::Hostdata.env != 'prod', which would allow us to test that this works in staging before it goes to production.
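
Taken together, the two inline suggestions would make the deploy hook look
roughly like this (a sketch only; it assumes ban-robots.txt and the hostdata
helpers behave as shown in the snippet above):

require 'fileutils'
require 'login_gov/hostdata'

# Compare the hostdata domain rather than the env name so the swap also
# happens (and can be verified) in staging before reaching production, and
# use FileUtils.cp, which raises on failure rather than returning a silent
# non-zero exit status like an unchecked system('cp ...') call.
if LoginGov::Hostdata.in_datacenter? && LoginGov::Hostdata.domain != 'login.gov'
  FileUtils.cp('public/ban-robots.txt', 'public/robots.txt')
end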

brodygov (Contributor) commented:

Regarding CloudFront: yep, that would be a good solution. I think the most significant work in enabling CloudFront would be splitting out the asset compilation so that statics are uploaded to an S3 bucket and served from a different hostname. Or we could keep the current strategy and just accept that some clients will receive 404s during deployments until assets are cached from the new servers, which is at least a little better than what we have now.
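
For reference, the Rails side of the "different hostname" piece is mostly one
setting (sketch only; the environment variable and CloudFront domain below
are placeholders, and the S3 upload itself would live in the deploy tooling):

# config/environments/production.rb (sketch)
Rails.application.configure do
  # Serve precompiled assets from a CDN/S3 hostname instead of the IdP
  # servers; the fallback domain here is a placeholder, not a real host.
  config.action_controller.asset_host = ENV.fetch(
    'ASSET_HOST', 'https://dxxxxxxxxxxxxxx.cloudfront.net'
  )
end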

@monfresh force-pushed the mb-robots-lg-243 branch from 9472e2e to 9fd6178 on May 14, 2018 12:35
@monfresh changed the title from "LG-243 Disallow all site indexing spiders" to "LG-243 Disallow indexing of certain pages" on May 14, 2018
@monfresh merged commit 008ff51 into master on May 14, 2018
@monfresh deleted the mb-robots-lg-243 branch on May 14, 2018 14:10