Entire Zooniverse web site blocked from search engines #6331

Closed
eatyourgreens opened this issue Sep 22, 2024 · 17 comments
Labels
bug Something isn't working

Comments

@eatyourgreens
Contributor

eatyourgreens commented Sep 22, 2024

Describe the bug

Since launching the new home page, the www.zooniverse.org domain is blocked from appearing in search engines.

Search results for Eclipsing Binary Patrol. The Zooniverse entry is a message about being blocked from showing a result.

To Reproduce

https://www.zooniverse.org/robots.txt disallows indexing of all URLs on the Zooniverse domain.

You can reproduce the problem by searching for Eclipsing Binary Patrol in DuckDuckGo.
https://duckduckgo.com/?q=eclipsing+binary+patrol&t=iphone&ia=web

Google doesn't seem to be affected.
https://www.google.co.uk/search?q=eclipsing+binary+patrol
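For readers unfamiliar with how a blanket rule blocks a whole site, here is a minimal sketch of how a crawler might evaluate a robots.txt file. The two file bodies are assumptions written in the standard robots.txt form, not copies of the file served by www.zooniverse.org, and `isCrawlAllowed` is a hypothetical helper that handles only the `User-agent: *` group with plain prefix rules.

```typescript
// Minimal robots.txt check: handles only the "User-agent: *" group and
// plain prefix "Disallow" rules (no Allow, wildcards, or per-bot groups).
function isCrawlAllowed(robotsTxt: string, path: string): boolean {
  let inWildcardGroup = false;
  const disallowed: string[] = [];

  for (const rawLine of robotsTxt.split("\n")) {
    const line = rawLine.replace(/#.*$/, "").trim(); // drop comments
    const colon = line.indexOf(":");
    if (colon === -1) continue; // skip blank or malformed lines

    const field = line.slice(0, colon).trim().toLowerCase();
    const value = line.slice(colon + 1).trim();

    if (field === "user-agent") {
      inWildcardGroup = value === "*";
    } else if (field === "disallow" && inWildcardGroup && value !== "") {
      // An empty Disallow value means "nothing is disallowed".
      disallowed.push(value);
    }
  }
  return !disallowed.some((prefix) => path.startsWith(prefix));
}

// Assumed file bodies, for illustration only.
const blockEverything = "User-agent: *\nDisallow: /";
const allowEverything = "User-agent: *\nDisallow:";

console.log(isCrawlAllowed(blockEverything, "/projects")); // false
console.log(isCrawlAllowed(allowEverything, "/projects")); // true
```

Because every path starts with `/`, a single `Disallow: /` line is enough to block the entire domain.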

Expected behavior

Search engines should be allowed to index Zooniverse, so that people can find new projects.

Additional context

@eatyourgreens eatyourgreens added the bug Something isn't working label Sep 22, 2024
@eatyourgreens
Contributor Author

Google shows links to Zooniverse. Like DuckDuckGo, the description is blocked.

Galaxy Zoo linked from a Google search for 'galaxy zoo website.' The listing says that Google is blocked from showing information about the page.

@lcjohnso
Member

Adding robots.txt originated from wanting to prevent indexing of staging pages (see #2541). Consistent with previous homepage behavior, the restriction should be removed for the main Zooniverse domain.

@eatyourgreens
Contributor Author

eatyourgreens commented Sep 23, 2024

> Adding robots.txt originated from wanting to prevent indexing of staging pages (see #2541). Consistent with previous homepage behavior, the restriction should be removed for the main Zooniverse domain.

The robots.txt files I added in #2541 don't actually work, as I didn't publish them to their respective domain roots. 😞

https://fe-project.zooniverse.org/robots.txt returns 404, and that domain can be indexed by Google.

The Readme is available at https://fe-project.zooniverse.org/projects/assets/README.md and https://www.zooniverse.org/projects/assets/README.md so the public directory is being published, as expected. 🤔

@eatyourgreens
Contributor Author

eatyourgreens commented Sep 24, 2024

I've opened #6335 to publish /robots.txt for the standalone projects app. My mistake in #2541 was that I published the robots file at /projects/robots.txt.
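The reason the file location matters is that crawlers only look for robots.txt at the root of a host. A small sketch of that lookup, using the standard `URL` class (`robotsUrlFor` is a hypothetical helper name):

```typescript
// Crawlers request robots.txt from the root of the host, regardless of
// which page they are about to fetch.
function robotsUrlFor(pageUrl: string): string {
  // A leading "/" resolves against the origin and discards the page's path.
  return new URL("/robots.txt", pageUrl).toString();
}

console.log(
  robotsUrlFor("https://fe-project.zooniverse.org/projects/assets/README.md")
);
// https://fe-project.zooniverse.org/robots.txt
```

A file published at `/projects/robots.txt` is therefore never consulted, which matches the 404 seen at the domain root.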

@snblickhan

A report from Justin Schell re: Mapping Prejudice (project ID: 3877).

Workflow 25524 is appearing in Google search results via an FEM link (frontend.preview), but MP is a PFE project. This has led to one case where someone found this link by searching "washtenaw Zooniverse" and inadvertently submitted FEM classifications to a PFE project, not realizing there was any difference. See screenshot below:

[Screenshot: washtenawzooniverse]

@eatyourgreens
Contributor Author

@snblickhan https://frontend.preview.zooniverse.org/robots.txt blocks search crawlers, but I think it didn’t exist prior to last week.

@eatyourgreens
Contributor Author

eatyourgreens commented Sep 25, 2024

#6340 removes /robots.txt from staging too, so frontend.preview would be crawlable.

@goplayoutside3
Contributor

@eatyourgreens do you have a suggestion on how to solve the scenarios you described? We do not want www.zooniverse.org/robots.txt, but we do want frontend.preview.zooniverse.org/robots.txt. How to make that happen?

@eatyourgreens
Contributor Author

Can you selectively add that file to the staging deploy, but not to the production deploy? I haven't thought about this for a long time, but I believe that staging and production use different Docker images.

@eatyourgreens
Contributor Author

A very quick search of the app router docs found the answer.

https://nextjs.org/docs/app/api-reference/file-conventions/metadata/robots

@goplayoutside3
Contributor

> the answer

I don't see any mention in the docs link about selective deployment. Unless you're considering using a robots.js file to look for certain env variables 🤔

Was there a different question you're looking for the answer to?
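The environment-variable idea could look roughly like the sketch below. It is not the code from any Zooniverse pull request. The `RobotsRules` interface is a local stand-in for the object shape described in the linked Next.js docs, and `APP_ENV` is a hypothetical variable name.

```typescript
// Local stand-in for the object shape described in the Next.js robots
// file convention; a real app would use the type exported by Next.js.
interface RobotsRules {
  rules: { userAgent: string; allow?: string; disallow?: string };
}

// appEnv would come from an environment variable. The name APP_ENV used
// below is hypothetical, not taken from the Zooniverse codebase.
function buildRobots(appEnv: string | undefined): RobotsRules {
  if (appEnv === "production") {
    // Production: let all crawlers fetch everything.
    return { rules: { userAgent: "*", allow: "/" } };
  }
  // Anything not explicitly production (staging, unset) blocks crawlers.
  return { rules: { userAgent: "*", disallow: "/" } };
}

// In a Next.js app router project, the robots file would export
// something like:
//   export default function robots() {
//     return buildRobots(process.env.APP_ENV);
//   }

console.log(buildRobots("production")); // allows "/"
console.log(buildRobots("staging")); // disallows "/"
```

Defaulting to the restrictive rules means a deploy with a missing or misspelled variable stays hidden from crawlers rather than being exposed by accident.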

@eatyourgreens
Contributor Author

eatyourgreens commented Sep 26, 2024

I hacked something together very quickly in #6341. It seems to work. That's probably as much as I can do for free.

Copilot is great for small jobs like this. It's free for anyone with a .edu or .ac.uk email address.
https://education.github.com/discount_requests/application

@eatyourgreens
Contributor Author

eatyourgreens commented Sep 26, 2024

For pages on frontend.preview.zooniverse.org that have already been indexed by Google, you might need to go into Google Search Console (as owners of that subdomain) and explicitly ask Google to remove the frontend.preview subdomain from the search index.

DuckDuckGo seems to be better about hiding the staging domain in search results, but has also indexed it:

@eatyourgreens
Contributor Author

You can also add

<meta name="robots" content="noindex, nofollow">

to pages that you don't want to be indexed, e.g. staged projects.
https://developers.google.com/search/docs/crawling-indexing/robots-meta-tag
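As a rough sketch of generating that tag per page, the helper below builds the comma-separated `content` value from two flags. `robotsMetaTag` and `RobotsDirectives` are hypothetical names for illustration, and how the resulting string reaches the page head depends on the app.

```typescript
interface RobotsDirectives {
  index: boolean;
  follow: boolean;
}

// Builds a robots meta tag with comma-separated directives.
function robotsMetaTag({ index, follow }: RobotsDirectives): string {
  const content = [
    index ? "index" : "noindex",
    follow ? "follow" : "nofollow",
  ].join(", ");
  return `<meta name="robots" content="${content}">`;
}

// For example, a staged project page that should stay out of search results:
console.log(robotsMetaTag({ index: false, follow: false }));
// <meta name="robots" content="noindex, nofollow">
```

Note that a crawler has to be able to fetch the page to see this tag, so it has no effect on pages that robots.txt already blocks from crawling.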

@goplayoutside3
Contributor

Via discussion with @zwolf we decided to use the following strategy:

  • fem apps all serve the restrictive version of /robots.txt
  • host the permissive one in blob storage
  • proxy explicit requests to www.zooniverse.org/robots.txt to the permissive version

See zooniverse/static#380 and #6340
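The routing decision in that strategy can be sketched as a small function. This only illustrates the logic of the three bullets. The real proxy configuration lives in zooniverse/static#380 and is not shown in this thread, and the names `robotsSourceFor`, `permissive-blob`, and `app-restrictive` are hypothetical.

```typescript
type RobotsSource = "permissive-blob" | "app-restrictive";

// Decides where a robots.txt request should be served from.
// Returns null for any request that is not for /robots.txt.
function robotsSourceFor(host: string, path: string): RobotsSource | null {
  if (path !== "/robots.txt") return null;
  // Only the main production host is proxied to the permissive file;
  // every other host gets the restrictive file bundled with the apps.
  return host.toLowerCase() === "www.zooniverse.org"
    ? "permissive-blob"
    : "app-restrictive";
}

console.log(robotsSourceFor("www.zooniverse.org", "/robots.txt"));
// permissive-blob
console.log(robotsSourceFor("frontend.preview.zooniverse.org", "/robots.txt"));
// app-restrictive
```

Keeping the apps restrictive by default means any new staging host is blocked without extra configuration, and only the one production hostname needs a proxy rule.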

@eatyourgreens
Contributor Author

eatyourgreens commented Sep 27, 2024

Looks like static.zooniverse.org is already in the Google index, including some data CSVs.

https://www.google.com/gasearch?q=site:static.zooniverse.org

Including the two-page guide to running a Zooniverse project.
https://static.zooniverse.org/www.citizensciencealliance.org/downloads/zooniverse_guide.pdf

@goplayoutside3
Contributor

Fixed by zooniverse/static#380 and Google is re-crawling.


4 participants