
LG-243 Disallow indexing of certain pages #2151

Merged 1 commit into master from mb-robots-lg-243 on May 14, 2018

Conversation

monfresh (Contributor) commented on May 9, 2018:

Why: A couple of our unauthenticated pages can contain sensitive
parameters in the URL, which can be indexed by some search engines.
Although these parameters are only valid for a short period of time,
we want to be cautious and prevent search engines from indexing them.

In addition, we want to disallow indexing and crawling altogether on
our lower environments in order to avoid being dinged by search engines
for having duplicate content across different domains.

How:

  • During app deploy, copy a different robots.txt that disallows all crawling.
  • Use our existing session_with_trust? helper method, along with a new
    Figaro config, to disallow indexing of pages that can contain sensitive
    parameters (see the sketch after this list).
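
To make the second bullet above concrete, here is a minimal sketch (not the
code merged in this PR): it assumes the existing session_with_trust? helper
and a hypothetical Figaro key named noindex_trusted_pages, and it sends an
X-Robots-Tag header so spider-respecting crawlers skip responses that may
carry sensitive URL parameters.

# Sketch only -- the concern name and Figaro key are hypothetical.
module BlockCrawlersConcern
  extend ActiveSupport::Concern

  included do
    before_action :block_indexing_of_sensitive_pages
  end

  private

  # Send a noindex/nofollow header when the feature flag is on and the
  # request belongs to a session that may carry sensitive URL parameters.
  def block_indexing_of_sensitive_pages
    return unless Figaro.env.noindex_trusted_pages == 'true'
    return unless session_with_trust?

    response.headers['X-Robots-Tag'] = 'noindex, nofollow'
  end
end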

Hi! Before submitting your PR for review, and/or before merging it, please
go through the following checklist:

  • For DB changes, check for missing indexes, check to see if the changes
    affect other apps (such as the dashboard), make sure the DB columns in the
    various environments are properly populated, coordinate with devops, plan
    migrations in separate steps.

  • For route changes, make sure GET requests don't change state or result in
    destructive behavior. GET requests should only result in information being
    read, not written.

  • For encryption changes, make sure it is compatible with data that was
    encrypted with the old code.

  • For secrets changes, make sure to update the S3 secrets bucket with the
    new configs in all environments.

  • Do not disable Rubocop or Reek offenses unless you are absolutely sure
    they are false positives. If you're not sure how to fix the offense, please
    ask a teammate.

  • When reading data, write tests for nil values, empty strings,
    and invalid formats.

  • When calling redirect_to in a controller, use _url, not _path.

  • When adding user data to the session, use the user_session helper
    instead of the session helper so the data does not persist beyond the user's
    session.

  • When adding a new controller that requires the user to be fully
    authenticated, make sure to add before_action :confirm_two_factor_authenticated
    (a short example covering these last items follows the checklist).
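
As an illustration of the last few checklist items (a purely hypothetical
controller, route helper, and session key; confirm_two_factor_authenticated,
user_session, and the _url guidance are the helpers and rules named above):

# Hypothetical example for the checklist, not code from this PR.
class SecurityEventsController < ApplicationController
  # Require a fully authenticated (2FA-completed) user for every action.
  before_action :confirm_two_factor_authenticated

  def create
    # user_session is tied to the signed-in user's session, so this value
    # is discarded at sign-out instead of persisting in the plain session.
    user_session[:last_security_event_at] = Time.zone.now
    redirect_to account_url # per the checklist: _url, not _path
  end
end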

@monfresh requested review from abrouilette, brodygov, and konklone and removed the request for abrouilette on May 9, 2018 16:13
konklone (Contributor) commented on May 9, 2018:

Is this maybe a little extreme? Wouldn't we want e.g. archive.org and other spider-respecting services to see the homepage?

brodygov (Contributor) commented on May 9, 2018:

Yeah I would lean toward disallowing access to authenticated pages only. We definitely do want secure.login.gov to be crawled by search engines and have good page rank so that people searching for login.gov will end up in the right place.

monfresh (Contributor, Author) commented:

At first, I thought the request_id parameter that can show up on the homepage was sensitive, but it is not. Looking into this some more, I found a helpful article about how to prevent indexing of certain pages in Rails: https://robots.thoughtbot.com/block-web-crawlers-with-rails. That article also made me realize we should block crawlers completely from the lower-environment sites, which are publicly accessible and contain the same content as secure.login.gov; that duplicate content can hurt our ranking.
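
The article's technique boils down to serving robots.txt from a controller
instead of public/. Roughly (a sketch of that approach, not the code that
ended up in this PR, which uses a static file copied at deploy time):

# config/routes.rb (sketch)
get '/robots.txt' => 'robots#show'

# app/controllers/robots_controller.rb (sketch)
class RobotsController < ApplicationController
  def show
    # Allow crawling only in production; lower environments get
    # "Disallow: /" so their duplicate content never competes with
    # secure.login.gov in search results.
    body = if Rails.env.production?
             "User-agent: *\nDisallow:\n"
           else
             "User-agent: *\nDisallow: /\n"
           end
    render plain: body
  end
end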

konklone (Contributor) left a review comment:

Noting that the thread has some consensus around making this narrower (in production) to allow unauthenticated pages to be indexed.

brodygov (Contributor) commented:

Just for a point of comparison, the Stripe dashboard had similar issues with random pages ending up in search engines some years ago, and settled on:

User-agent: *
disallow: /account/
disallow: /oauth/
disallow: /invoice/

https://dashboard.stripe.com/robots.txt

@monfresh force-pushed the mb-robots-lg-243 branch 2 times, most recently from 4d72fdb to 373f557, on May 11, 2018 00:48
monfresh (Contributor, Author) commented:

@konklone @brodygov This is ready for another review. Please take a look at the updated commit message for details.

konklone (Contributor) commented:

My comments about scope were addressed, so approving from my end. One concern I have is that by moving this from static to dynamic routing, we could potentially take on more load than expected, given how often this route could get hit. But that's just a distant observation on the implementation; others would know better what the impact is.

monfresh (Contributor, Author) commented:

That's a good point. Maybe we can upload a different static robots.txt in the lower envs as part of the deploy process. What do you think, @brodygov?

brodygov (Contributor) commented:

While it would be nicer, all things being equal, to drop a static robots.txt into public, we're way overprovisioned on idp capacity at the moment, so I wouldn't worry too much about load. Dropping it in during either deploy/build or deploy/build-post-config would be plausible.

The more scalable way to address idp server load would be to invest effort in fronting the whole system with a CDN.

monfresh (Contributor, Author) commented:

I agree on the CDN. That should be relatively easy to do on the app side, but in a separate PR, obviously. Would CloudFront be a viable solution?

@monfresh force-pushed the mb-robots-lg-243 branch from 373f557 to 9472e2e on May 11, 2018 20:31
monfresh (Contributor, Author) commented:

OK, I went with the "copy the modified robots.txt to lower envs upon deploy" route and removed the dynamic code. One more look, please? @brodygov @konklone

require 'login_gov/hostdata'

if LoginGov::Hostdata.in_datacenter? && LoginGov::Hostdata.env != 'prod'
system 'cp public/ban-robots.txt public/robots.txt'
Inline review comment (contributor):
I'd suggest FileUtils.cp here instead.


require 'login_gov/hostdata'

if LoginGov::Hostdata.in_datacenter? && LoginGov::Hostdata.env != 'prod'
Inline review comment (contributor):
Slight preference for using LoginGov::Hostdata.domain != 'login.gov' instead of LoginGov::Hostdata.env != 'prod', which would allow us to test that this works in staging before it goes to production.
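
Taken together, the two inline suggestions would make the deploy hook look
roughly like this (a sketch only; it assumes ban-robots.txt and the hostdata
helpers behave as shown in the snippet above):

require 'fileutils'
require 'login_gov/hostdata'

# Compare the hostdata domain rather than the env name so the swap also
# happens (and can be verified) in staging before reaching production, and
# use FileUtils.cp, which raises on failure rather than returning a silent
# non-zero exit status like an unchecked system('cp ...') call.
if LoginGov::Hostdata.in_datacenter? && LoginGov::Hostdata.domain != 'login.gov'
  FileUtils.cp('public/ban-robots.txt', 'public/robots.txt')
end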

brodygov (Contributor) commented:

Regarding CloudFront: yep, that would be a good solution. I think the most significant work in enabling CloudFront would be splitting out the asset compilation so that statics are uploaded to an S3 bucket and served from a different hostname. Or we could keep the current strategy and just accept that some clients will receive 404s during deployments until assets are cached from the new servers, which is at least a little better than what we have now.
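
For reference, the Rails side of the "different hostname" piece is mostly one
setting (sketch only; the environment variable and CloudFront domain below
are placeholders, and the S3 upload itself would live in the deploy tooling):

# config/environments/production.rb (sketch)
Rails.application.configure do
  # Serve precompiled assets from a CDN/S3 hostname instead of the IdP
  # servers; the fallback domain here is a placeholder, not a real host.
  config.action_controller.asset_host = ENV.fetch(
    'ASSET_HOST', 'https://dxxxxxxxxxxxxxx.cloudfront.net'
  )
end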

@monfresh force-pushed the mb-robots-lg-243 branch from 9472e2e to 9fd6178 on May 14, 2018 12:35
@monfresh changed the title from "LG-243 Disallow all site indexing spiders" to "LG-243 Disallow indexing of certain pages" on May 14, 2018
@monfresh merged commit 008ff51 into master on May 14, 2018
@monfresh deleted the mb-robots-lg-243 branch on May 14, 2018 14:10