LG-243 Disallow indexing of certain pages #2151
Conversation
Is this maybe a little extreme? Wouldn't we want e.g. archive.org and other spider-respecting services to see the homepage?
Yeah, I would lean toward disallowing access to authenticated pages only. We definitely do want secure.login.gov to be crawled by search engines and have good page rank so that people searching for login.gov will end up in the right place.
At first, I thought the
Noting that the thread has some consensus around making this narrower (in production) to allow unauthenticated pages to be indexed.
Just for a point of comparison, the Stripe dashboard had similar issues with random pages ending up in search engines some years ago, and settled on:
Force-pushed from 4d72fdb to 373f557 (Compare)
My comments about scope were addressed, so approving from my end. One concern I have, though, is that by moving this from static to dynamic routing we could take on more load than expected, given how often this route could get hit. That's just a distant observation on implementation; others would know better what the impact is.
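(For context, the dynamic-routing approach being weighed here would look roughly like the sketch below in a Rails app; the route, controller name, and disallowed paths are illustrative, not taken from this PR.)

    # config/routes.rb
    get '/robots.txt', to: 'robots#index'

    # app/controllers/robots_controller.rb
    class RobotsController < ApplicationController
      # Every crawler request for robots.txt now goes through the Rails stack,
      # which is the load concern raised above.
      def index
        if Rails.env.production?
          render plain: "User-agent: *\nDisallow: /account\n"
        else
          render plain: "User-agent: *\nDisallow: /\n"
        end
      end
    end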
That's a good point. Maybe we can upload a different static robots.txt in the lower envs as part of the deploy process. What do you think, @brodygov?
While it would be nicer, all things being equal, to drop in a static robots.txt, the more scalable way to address idp server load would be to invest effort in fronting the whole system with a CDN.
I agree on the CDN. That should be relatively easy to do on the app side, but in a separate PR obviously. Would CloudFront be a viable solution?
Force-pushed from 373f557 to 9472e2e (Compare)
bin/copy_robots_file (Outdated)

    require 'login_gov/hostdata'

    if LoginGov::Hostdata.in_datacenter? && LoginGov::Hostdata.env != 'prod'
      system 'cp public/ban-robots.txt public/robots.txt'
I'd suggest FileUtils.cp here instead.
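A minimal sketch of that suggestion applied to the copy script, keeping the paths from the diff above:

    require 'fileutils'
    require 'login_gov/hostdata'

    if LoginGov::Hostdata.in_datacenter? && LoginGov::Hostdata.env != 'prod'
      # FileUtils.cp avoids shelling out and raises an error on failure,
      # whereas `system` just returns false.
      FileUtils.cp('public/ban-robots.txt', 'public/robots.txt')
    end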
    require 'login_gov/hostdata'

    if LoginGov::Hostdata.in_datacenter? && LoginGov::Hostdata.env != 'prod'
Slight preference for using LoginGov::Hostdata.domain != 'login.gov' instead of LoginGov::Hostdata.env != 'prod', which would allow us to test that this works in staging before it goes to production.
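If the script adopted that suggestion, the condition might read as below (assuming LoginGov::Hostdata exposes a domain reader, as the comment implies):

    # Ban robots everywhere except the production domain, so staging can
    # exercise the same code path before a production deploy.
    if LoginGov::Hostdata.in_datacenter? && LoginGov::Hostdata.domain != 'login.gov'
      FileUtils.cp('public/ban-robots.txt', 'public/robots.txt')
    end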
Regarding CloudFront, yep, that would be a good solution. I think the most significant work in enabling CloudFront would be splitting out the assets compilation so that statics are uploaded to an S3 bucket and served from a different hostname. Or we could keep the current strategy and just accept that some clients will receive 404s during deployments until assets are cached from the new servers, which is at least a little better than what we have now.
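(For the asset-splitting idea, the Rails-side change is usually just pointing asset_host at the CDN; the distribution hostname below is a placeholder.)

    # config/environments/production.rb
    Rails.application.configure do
      # Serve compiled assets from a CDN hostname instead of the idp servers.
      config.action_controller.asset_host = 'https://dxxxxxxxxxx.cloudfront.net'
    end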
Force-pushed from 9472e2e to 9fd6178 (Compare)
Why: A couple of our unauthenticated pages can contain sensitive
parameters in the URL, which can be indexed by some search engines.
Although these parameters are only valid for a short period of time,
we want to be cautious and prevent search engines from indexing them.
In addition, we want to disallow indexing and crawling altogether on
our lower environments in order to avoid being dinged by search engines
for having duplicate content across different domains.
How:
- During app deploy, copy a different `robots.txt` that disallows all
- Use our existing `session_with_trust?` helper method, along with a new Figaro config, to disallow indexing of pages that can contain sensitive parameters (a sketch follows below).
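Roughly, the two pieces described above could look like the sketch below. The ban-robots.txt content is the conventional disallow-all file; the config key, filter name, and header mechanism are illustrative guesses rather than the PR's exact implementation.

    # public/ban-robots.txt (copied over public/robots.txt outside production):
    #   User-agent: *
    #   Disallow: /

    # Illustrative controller filter; the PR's actual wiring may differ.
    class ApplicationController < ActionController::Base
      before_action :block_indexing_of_sensitive_pages

      private

      def block_indexing_of_sensitive_pages
        # Hypothetical Figaro flag; the key name is not from the PR.
        return unless Figaro.env.disallow_indexing_of_sensitive_pages == 'true'
        return if session_with_trust?

        # Ask crawlers not to index or follow pages reached with sensitive params.
        response.headers['X-Robots-Tag'] = 'noindex, nofollow'
      end
    end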
Hi! Before submitting your PR for review, and/or before merging it, please
go through the following checklist:
For DB changes, check for missing indexes, check to see if the changes
affect other apps (such as the dashboard), make sure the DB columns in the
various environments are properly populated, coordinate with devops, plan
migrations in separate steps.
For route changes, make sure GET requests don't change state or result in
destructive behavior. GET requests should only result in information being
read, not written.
For encryption changes, make sure it is compatible with data that was
encrypted with the old code.
For secrets changes, make sure to update the S3 secrets bucket with the
new configs in all environments.
Do not disable Rubocop or Reek offenses unless you are absolutely sure
they are false positives. If you're not sure how to fix the offense, please
ask a teammate.
When reading data, write tests for nil values, empty strings,
and invalid formats.
When calling redirect_to in a controller, use _url, not _path (see the
sketch after this checklist).
When adding user data to the session, use the user_session helper
instead of the session helper so the data does not persist beyond the
user's session.
When adding a new controller that requires the user to be fully
authenticated, make sure to add before_action :confirm_two_factor_authenticated.
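As a quick illustration of the last few checklist items (the controller, action, and URL helper names here are made up):

    # Illustrative controller, not from this PR.
    class ProfilesController < ApplicationController
      # Fully-authenticated controllers add this filter.
      before_action :confirm_two_factor_authenticated

      def create
        # Keep per-user data in user_session so it does not outlive the
        # user's session, rather than writing to session directly.
        user_session[:profile_pending] = true

        # Redirect with the _url helper, not _path.
        redirect_to account_url
      end
    end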