Add documentation on new and improved process of making Dataverse indexed by Google and other search engines #5639

Closed
landreev opened this issue Mar 13, 2019 · 3 comments

@landreev
Contributor

This is the knowledge we acquired after reopening the site to Googlebot and while addressing owners' reports of datasets not being properly indexed (issue IQSS/dataverse.harvard.edu#1).
The new approach is a combination of advertising the datasets and dataverses that we want to be indexed, and blocking the robots from actually crawling the site (i.e., discouraging them from following the URLs of the facets and pages of search results). It appears to be much more efficient and produces better search results. Explaining it in the guide will benefit other Dataverse installations.

@landreev
Contributor Author

landreev commented Mar 15, 2019

I checked in the documentation for the new process now in use in production.
I'm leaving the issue in "this spring" for now, since a) I'm working on other things (BARI, primarily) and b) I'm talking with @mheppler about adding more robots rules to accommodate his work on preview/citation cards for Facebook/Twitter. It looks like their bots will have some "special needs" that need to be addressed in robots.txt as well. So we'll add that stuff to the doc, but it still needs to be finalized.

@mheppler
Contributor

With much input from @landreev and @pdurbin, I was able to confirm in my work for #5637 that the images directory, which contains the favicon images, needs to be added to robots.txt. This allows preview cards to be generated on Facebook, Twitter and other social media sites when datasets are shared. (Further development will be needed to extend this to dataverse and file URLs.)

Allow: /javax.faces.resource/images/
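
As a rough way to sanity-check a rule like this before deploying it, the sketch below runs a small rule set through Python's standard `urllib.robotparser`. Only the `Allow: /javax.faces.resource/images/` line comes from this thread; the `User-agent: *` and `Disallow: /` lines, the `example.org` host, the file name `favicon.png`, and the helper name `is_allowed` are assumptions made up for illustration.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical rule set: only the images Allow line is taken from this
# thread; the other two lines are assumed for the sake of the example.
ROBOTS_TXT = """\
User-agent: *
Allow: /javax.faces.resource/images/
Disallow: /
"""


def is_allowed(robots_txt: str, url: str, user_agent: str = "*") -> bool:
    """Return True if the rules in robots_txt let user_agent fetch url."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


if __name__ == "__main__":
    # Hypothetical URLs on a placeholder host.
    print(is_allowed(ROBOTS_TXT, "http://example.org/javax.faces.resource/images/favicon.png"))
    print(is_allowed(ROBOTS_TXT, "http://example.org/some/other/page"))
```

Note that `urllib.robotparser` applies the first matching rule in file order, while some crawlers pick the most specific matching rule, so this is only an approximation of how a given bot will behave.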


@mheppler
Contributor

Also, as part of feedback investigations for #5641, I discovered there is a problem with the social media robots accessing dataset thumbnails that I was not able to resolve. This needs to be addressed as part of that thumbnails issue, or this robots.txt issue.

See the warnings produced by the Twitter developer tool card validator.

WARN:  The image URL http://ec2-35-175-247-225.compute-1.amazonaws.com/api/datasets/:persistentId/thumbnail?persistentId=doi:10.5072/FK2/ZWINON&imageThumb=400 specified by the 'og:image' metatag may be restricted by the site's robots.txt file, which will prevent Twitter from fetching it.
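
One way to test the hypothesis in that warning is to check the thumbnail URL against the rules offline. The sketch below is only that: `current_rules` and `candidate_rules` are hypothetical rule sets (only the images `Allow` line appears in this thread), the helper `blocked_urls` is made up for illustration, and whether an extra `Allow` line for the thumbnail path actually clears the Twitter warning has not been confirmed here.

```python
from urllib.robotparser import RobotFileParser

# The og:image URL quoted in the card validator warning above.
THUMBNAIL_URL = (
    "http://ec2-35-175-247-225.compute-1.amazonaws.com"
    "/api/datasets/:persistentId/thumbnail"
    "?persistentId=doi:10.5072/FK2/ZWINON&imageThumb=400"
)


def blocked_urls(robots_lines, urls, user_agent="Twitterbot"):
    """Return the URLs that the given rules would keep user_agent from fetching."""
    parser = RobotFileParser()
    parser.parse(robots_lines)
    return [u for u in urls if not parser.can_fetch(user_agent, u)]


# Hypothetical current rules; only the images Allow line is from this thread.
current_rules = [
    "User-agent: *",
    "Allow: /javax.faces.resource/images/",
    "Disallow: /",
]

# Hypothetical candidate: also allow the thumbnail API path (an assumption,
# not a confirmed fix). It is placed before "Disallow: /" because this
# parser applies the first matching rule.
candidate_rules = [
    "User-agent: *",
    "Allow: /api/datasets/:persistentId/thumbnail",
    "Allow: /javax.faces.resource/images/",
    "Disallow: /",
]

if __name__ == "__main__":
    print("blocked under current rules:", blocked_urls(current_rules, [THUMBNAIL_URL]))
    print("blocked under candidate rules:", blocked_urls(candidate_rules, [THUMBNAIL_URL]))
```

If the thumbnail URL shows up as blocked under the rules actually deployed, that would be consistent with the validator's warning; it would not by itself rule out other causes.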

