tell search engines to not index individual versions #4107

Closed
wants to merge 1 commit into from

Conversation

jjb
Contributor

@jjb jjb commented Oct 3, 2023

to improve the search engine experience

[Screenshot: CleanShot 2023-10-03 at 11 09 09@2x]

@mperham

mperham commented Oct 3, 2023

I've been frustrated by this too, excellent idea.

@ryantownsend

It looks like there's a canonical tag in the <head> but it's incorrectly pointing to the currently viewed version rather than the latest version, despite how the code reads:

<link rel="canonical" href="<%= rubygem_version_url(@rubygem.slug, @latest_version.slug) %>" />

Personally, I'd just update the canonical tag to point to the non-versioned gem URL and continue to allow indexing (i.e. don't merge this PR). I'm not an SEO expert, but I think Disallowing access to the individual version URLs (as per this PR currently) may affect ranking.
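
A minimal sketch of that suggestion, written as plain Ruby rather than the app's actual view code. `GEM_HOST` and `canonical_link_tag` are hypothetical names used only for illustration; in the real template the href would presumably come from a route helper rather than string building.

```ruby
require "erb"

# Hypothetical host constant, used here only for illustration.
GEM_HOST = "https://rubygems.org"

# Build a canonical <link> tag that always points at the non-versioned
# gem page, no matter which version page is being rendered.
def canonical_link_tag(gem_slug)
  href = "#{GEM_HOST}/gems/#{ERB::Util.url_encode(gem_slug)}"
  %(<link rel="canonical" href="#{ERB::Util.html_escape(href)}" />)
end

puts canonical_link_tag("puma")
```

With a tag like this, every version page of a gem would declare the same canonical URL, which is the de-duplication behaviour being proposed here.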

@jjb
Contributor Author

jjb commented Oct 3, 2023

ahhhh - nice catch - i think that explains the horrible SEO that is unique to rubygems - an actual bug with the canonical link. will change the other PR. thanks!

#4108

@jjb jjb closed this Oct 3, 2023
@jjb
Contributor Author

jjb commented Oct 3, 2023

although after doing 30 seconds of research i'll backpedal on that... i don't think canonical should necessarily point to the non-version url. canonical is to remove ?.. and /amp cruft

but, i think the fact that the versions pages do not link back to the main page at all currently is probably the main SEO problem, so the other PR as-is will probably make progress.

@jjb jjb reopened this Oct 3, 2023
@ryantownsend

ryantownsend commented Oct 3, 2023

Again, caveat: I'm no SEO expert, but... the trouble is if you Disallow access to the version-specific pages, any links pointing to those pages won't transfer any value in Google's eyes, as its indexer will think those links are just dead-ends.

Canonical tags are designed to de-dupe largely duplicate pages, not just remove query strings etc, meaning most of the value of links to version-specific pages would transfer to the overall gem page.

For example, just say you have an index page listing products in a category (typical ecommerce example), you don't want every page indexing as the content is largely the same, you want to centralise that inbound link value and you ideally want to direct users from Google to the first page so they see the latest products etc, so you'd canonicalise page 2..N back to page 1, regardless of the URL structure.

@codecov

codecov bot commented Oct 28, 2023

Codecov Report

Merging #4107 (87a8265) into master (fa1e13b) will increase coverage by 0.03%.
Report is 54 commits behind head on master.
The diff coverage is n/a.

@@            Coverage Diff             @@
##           master    #4107      +/-   ##
==========================================
+ Coverage   98.86%   98.90%   +0.03%     
==========================================
  Files         275      276       +1     
  Lines        6259     6273      +14     
==========================================
+ Hits         6188     6204      +16     
+ Misses         71       69       -2     

see 6 files with indirect coverage changes

@simi
Member

simi commented Oct 28, 2023

To be honest, I have no idea what the best practice is here. @jjb would you mind checking how, for example, npm or PyPI do this? On one side I do understand that a search for puma rubygem would be best with just 1 top search result, but on the other side a search for puma 6.4.0 should find the related version. 🤔

@jjb
Contributor Author

jjb commented Oct 28, 2023

I'm hoping that #4108 will solve most or all of the problem. before that went live (about 10 days ago?), individual version pages had zero links back to their main gem page.

regarding this PR, i think it's true that we don't want a search for [puma 6.4.0] to have the specific version page completely absent from results, so i'll close this PR

regarding canonical URLs, from my understanding they aren't made to point to a "main" page, they are made to remove query string cruft, or maybe not have m. domains in search results. so adding that for version pages might result in the problem above, a version page completely absent from results

@jjb jjb closed this Oct 28, 2023
@jjb jjb deleted the patch-1 branch October 28, 2023 23:41
@jjb
Contributor Author

jjb commented Oct 28, 2023

maybe one approach is to use "priority" in a sitemap. rubygems.org doesn't currently have a sitemap so this would be a nontrivial project.

https://duckduckgo.com/?q=sitemap+priority
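
For reference, a rough sketch of what a sitemap entry with a priority could look like, generated from Ruby. The `sitemap_url_entry` helper and the weights (1.0 for a gem page, 0.2 for a version page) are assumptions for illustration, not anything rubygems.org implements. The `<priority>` element itself comes from the sitemaps.org protocol and takes a value from 0.0 to 1.0.

```ruby
require "erb"

# Render one <url> entry for a sitemap. <priority> is relative to other
# URLs on the same site and must fall within 0.0..1.0.
def sitemap_url_entry(loc, priority)
  raise ArgumentError, "priority must be within 0.0..1.0" unless (0.0..1.0).cover?(priority)

  <<~XML
    <url>
      <loc>#{ERB::Util.html_escape(loc)}</loc>
      <priority>#{format("%.1f", priority)}</priority>
    </url>
  XML
end

# Assumed weighting for illustration: gem page high, version page low.
puts sitemap_url_entry("https://rubygems.org/gems/puma", 1.0)
puts sitemap_url_entry("https://rubygems.org/gems/puma/versions/6.4.0", 0.2)
```

One caveat worth checking before investing in this: Google's sitemap documentation states that it ignores `<priority>` values, so the effect would mostly be limited to other search engines.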

@simi
Member

simi commented Oct 28, 2023

OK, thanks a lot for insight @jjb. Let's see how #4108 will affect the search result. Feel free to open additional issue/PR later for more tweaks on this.

@jjb
Contributor Author

jjb commented Oct 28, 2023

@ryantownsend do you know if i'm wrong on this?

regarding canonical URLs, from my understanding ... adding that for version pages might result in the problem above, a version page completely absent from results

@ryantownsend

@jjb correct, if you use canonicals, the specific version pages would no longer be in the Google results, if you searched for "Puma 5.0.4" you'd just see a result for the "Puma" gem page.

The question to me is: how valuable is it to land someone on that specific version page vs landing them on the gem page?

  1. There's limited information on the version page that's different to the gem page.
  2. As I mentioned, when new gem versions are introduced, it can take time for Google to discover and index them.
  3. In some cases, Google may even see this as duplicate content given the same description is on every single version page
  4. In some cases, I might be given a very similar, but incorrect result, e.g. "Puma 5.0.2" instead of "5.0.4", given Google also allows for spelling mistakes and there is only one character of difference between the two.

Personally, if I were running the site, I'd just direct everyone to the overall gem page, leaving the version pages linkable for any blogs or sharing links among coworkers etc. but canonicalised back to the overall gem page, that way only Google is affected.

I'd want to concentrate all the value from backlinks into that one page for better rankings, plus it would reduce the crawler overhead.

To most sites this definitely doesn't matter, but you effectively have 12,000 gems (estimated from 402 pages of 30) each with potentially hundreds of versions, so Google is unlikely to crawl and index all of those pages, at least not quickly.

This then means you can easily generate a sitemap file that you can add to Google Search Console (with all 12,000 gems, it may be worth breaking it down into one sitemap per letter with a sitemap index file). This will mean new gems will be discovered and indexed more promptly, given you're not wasting the crawler's time with all the new gem versions every day.

If you decide against canonicalising to the main gem page, you might want to add JSON-LD to the version pages, this will mean Google may show a breadcrumb against the result, giving people the option to jump straight into the gem page too.

Gem Page

<script type="application/ld+json">
{
  "@context": "https://schema.org",
  "@type": "BreadcrumbList",
  "itemListElement": [{
    "@type": "ListItem",
    "position": 1,
    "name": "Gems",
    "item": "https://rubygems.org/gems"
  }, {
    "@type": "ListItem",
    "position": 2,
    "name": "Puma",
    "item": "https://rubygems.org/gems/puma"
  }]
}
</script>

Version Page

<script type="application/ld+json">
{
  "@context": "https://schema.org",
  "@type": "BreadcrumbList",
  "itemListElement": [{
    "@type": "ListItem",
    "position": 1,
    "name": "Gems",
    "item": "https://rubygems.org/gems"
  }, {
    "@type": "ListItem",
    "position": 2,
    "name": "Puma",
    "item": "https://rubygems.org/gems/puma"
  }, {
    "@type": "ListItem",
    "position": 3,
    "name": "Version 6.4.0",
    "item": "https://rubygems.org/gems/puma/versions/6.4.0"
  }]
}
</script>
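
If the breadcrumbs were generated server-side, the JSON above could be built from an ordered list of name/URL pairs instead of being hand-written per page. This is only a sketch: `breadcrumb_json_ld` is a hypothetical helper, and the output would still need to be placed inside a `<script type="application/ld+json">` tag by the view.

```ruby
require "json"

# Build a schema.org BreadcrumbList from an ordered list of [name, url] pairs.
# Positions are 1-based and follow the order of the input.
def breadcrumb_json_ld(crumbs)
  items = crumbs.each_with_index.map do |(name, url), index|
    { "@type" => "ListItem", "position" => index + 1, "name" => name, "item" => url }
  end

  document = {
    "@context" => "https://schema.org",
    "@type" => "BreadcrumbList",
    "itemListElement" => items
  }
  JSON.pretty_generate(document)
end

puts breadcrumb_json_ld([
  ["Gems", "https://rubygems.org/gems"],
  ["Puma", "https://rubygems.org/gems/puma"],
  ["Version 6.4.0", "https://rubygems.org/gems/puma/versions/6.4.0"]
])
```

Dropping the last pair gives the gem-page variant, so one helper would cover both page types.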

@simi
Member

simi commented Oct 29, 2023

@ryantownsend ℹ️ there are 192918 gems with total of 1558031 versions (data from latest dump at https://rubygems.org/pages/data)

@ryantownsend

@simi wow, okay.

At this volume, I'd personally focus on ensuring all the gem pages are indexed.

It would be worthwhile getting Google Search Console set up if you haven't already - it'll tell you how many pages are discovered/indexed.

Given there's no sitemap file, I'd bet many gems might even be missing, let alone having all 1.5mil versions indexed.

A sitemap would definitely need breaking down into multiple files using a sitemap index file: https://developers.google.com/search/docs/crawling-indexing/sitemaps/large-sitemaps

As per Google's size limits, each sitemap file can be at most 50MB uncompressed and contain at most 50,000 entries, so generating 26 alphabetical sitemaps, keyed by the first letter of the gem name (36 if you include numbers), is probably going to be easiest. It'll need to be more granular if you want to include all versions too, though.

I'd put this in a periodic job which regenerates each file at least once per day, rather than having an exceptionally slow request flow through the web server.
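
A sketch of how such a job might shard the gem list, assuming it receives a plain array of gem names. The helper names, the `sitemap-<prefix>-<part>.xml` file naming, and the `https://rubygems.org/sitemaps` base URL are all hypothetical. Splitting each letter group again at the 50,000-entry limit matters at this scale: the 192918 gems quoted above need at least 4 files in total, and a single popular first letter could exceed one file on its own.

```ruby
require "erb"

MAX_URLS_PER_SITEMAP = 50_000 # per-file entry limit cited above

# Group gem names by first character, then split any group that exceeds the
# per-file limit. Returns a hash of sitemap filename => array of gem names.
def shard_gem_names(names, limit: MAX_URLS_PER_SITEMAP)
  shards = {}
  names.sort.group_by { |name| name[0].downcase }.each do |prefix, group|
    group.each_slice(limit).with_index(1) do |slice, part|
      shards["sitemap-#{prefix}-#{part}.xml"] = slice
    end
  end
  shards
end

# Render the sitemap index file that points at every shard.
# base_url is an assumed location, not an existing rubygems.org path.
def sitemap_index(filenames, base_url: "https://rubygems.org/sitemaps")
  entries = filenames.map do |file|
    "  <sitemap><loc>#{ERB::Util.html_escape("#{base_url}/#{file}")}</loc></sitemap>"
  end

  <<~XML
    <?xml version="1.0" encoding="UTF-8"?>
    <sitemapindex xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
    #{entries.join("\n")}
    </sitemapindex>
  XML
end

# Tiny limit so the splitting is visible with only four names.
shards = shard_gem_names(%w[puma pg rails rack], limit: 1)
puts sitemap_index(shards.keys)
```

Each shard would then be written out as its own `<urlset>` file by the periodic job, with only the index file submitted to Google Search Console.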

@jjb
Contributor Author

jjb commented Oct 29, 2023

personally i agree with @ryantownsend's reasoning for why specific versions don't need to be in search results, but i am just one random rubyist. if someone wants to get to a specific version, it's easy to click on "show all versions" and find it.

if we think just nuking all the versions from google is acceptable, then i think no need to implement sitemap, just do it in robots like this closed PR does. ryan's proposal would be better and fancier and offer more options like priorities, but if current status quo is pretty good, no need to add more complication IMO.
