Make dataverse pages more discoverable by search engines #5605
Comments
This problem (empty google search record) appears to be more common for the dataverse urls of the "/dataverse/NAME" format, than for the "/dataverse.xhtml?alias=NAME" one: This may, or may not be related to #3130 (trailing slash in the dataverse URL resulting in a 404). But even if this were the case, adding structured metadata to the dataverse page would still be very useful, and would make the process of getting indexed in the search engines more efficient. |
Huh. I'm surprised that the description of the dataverse isn't indexed. Ever since pull request #4879 was merged (a fix for #4468 which has a screenshot from Google search results), dataverses look much better when you link to them in Slack. @landreev any thoughts on file landing pages? Would helping Google index them better be worth investigating, perhaps in a separate issue? |
As I said, this may simply be the result of that trailing slash issue. |
Hmm, the "worldfish" dataverse has an empty record with the "alias=" URL: Then of course this may be a search result cached from before #4879 was merged. (I'm seeing that the bot has finally re-crawled this dataverse in the last few hours; so hopefully the updated entry will start appearing in searches shortly) |
It might be worth having only one url to resolve to a dataverse page. If I'm remembering correctly, at least some search engines recommend that if there are multiple URLs with the same content, then one should have the rel=canonical meta tag. Having |
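For illustration, the canonical hint mentioned above is normally expressed as a `<link rel="canonical">` element in the page head rather than a `<meta>` tag. A minimal Python sketch of building one, assuming "/dataverse/NAME" is the preferred form; the function name and the example host are hypothetical and not part of the Dataverse code base:

```python
from html import escape
from urllib.parse import quote


def canonical_link_tag(base_url: str, alias: str) -> str:
    """Build a <link rel="canonical"> element for a dataverse page.

    Assumption: "/dataverse/<alias>" is the one URL form we want indexed.
    """
    # Percent-encode the alias so it is safe as a single path segment.
    href = f"{base_url.rstrip('/')}/dataverse/{quote(alias, safe='')}"
    # Escape the attribute value before embedding it in HTML.
    return f'<link rel="canonical" href="{escape(href, quote=True)}"/>'
```

Both the "/dataverse/NAME" and the "dataverse.xhtml?alias=NAME" renderings of a page would emit the same tag, so a crawler that reaches either one is pointed at a single URL.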
It may be a good idea to exclude the "dataverse.xhtml?alias=..." format from crawling, via robots. And to completely exclude ALL the forms of the dataverse page urls except for the canonical "/dataverse/name", without any extra (search) arguments. As of now, we allow/encourage the bot to crawl through all the facets, and through the paginated search results. This is inefficient, and does not result in anything useful being indexed. |
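To make the proposed rule concrete, here is a rough Python sketch of the check a crawl policy would need to encode: only the bare "/dataverse/name" form is allowed, and anything with query arguments (facets, pagination, "alias=") is rejected. The function name and the example host are made up for illustration, and treating a trailing slash as non-canonical is an assumption of this sketch:

```python
import re
from urllib.parse import urlsplit

# Assumption: the canonical form is exactly "/dataverse/<name>", one path
# segment after "/dataverse/", with no trailing slash.
_CANONICAL_PATH = re.compile(r"^/dataverse/[^/]+$")


def is_crawlable_dataverse_url(url: str) -> bool:
    """Return True only for the canonical dataverse URL with no query string."""
    parts = urlsplit(url)
    return bool(_CANONICAL_PATH.match(parts.path)) and parts.query == ""
```

For example, `https://example.org/dataverse/worldfish` passes, while `https://example.org/dataverse.xhtml?alias=worldfish` and `https://example.org/dataverse/worldfish?page=2` do not.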
I would be remiss in my obligations as issue author if I didn't point out Dataset - PrettyFaces URL Format #2486 fitting in both the "dataverse URL forward slash forwarding" and the "dataverse content indexing" story. Especially if we are making changes to block It would make more sense to me, and maybe even to a search engine robot, if we had a URL formatting structure that matched the dataverse > dataset > file hierarchy of our app. Something like:
Maybe this is a bigger ask than I realize, but there is value to improving what we have now. I would very much like to improve the navigation experience of our app. The format of our URLs is a big part of this. Another part, which might be a conversation for another day, is the use of |
@mheppler I think improving the format of the URLs would be a great idea; |
@landreev is going to update this issue and discuss with @jggautier and @mheppler. |
@jggautier @mheppler
So, since there's no description, it's just showing whatever is at the top of the page. Which happens to be the number of datasets in the dataverse plus part of the description of the first dataset. This card looks ok to me (definitely not as bad as the "no information is available..." before). But I guess the remaining question is - is there anything at all that we can do to make it any better/any more useful, if the owner of the dataverse hasn't provided any description? One thing that was suggested (by Mike), maybe we could extract some summary of the data in the dataverse from the facets on the page - since they list all the subjects/authors/categories, etc.? So the way it would work, we could embed a "DC.description" metadata fragment into the html of the page, similarly to what we do on the dataset page. If the dataverse has a description, we use that to populate it. If not, we generate some description on the fly: "This dataverse contains datasets on the subjects of ... by the authors ..." (for example - ?) Also, an alternative is not to bother with any of this - and instead encourage the dataverse owners to enter meaningful description text, to make their data more discoverable. |
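A rough Python sketch of the fallback logic described above: use the owner's description when there is one, otherwise assemble a sentence from facet values. The function names, the `max_items` cap, and the exact wording of the generated sentence are assumptions for illustration, not existing Dataverse code:

```python
from html import escape


def dataverse_description(description, subjects, authors, max_items=3):
    """Return the owner's description, or generate one from facet values."""
    if description and description.strip():
        return description.strip()
    # No description provided: summarize the facets instead.
    parts = ["This dataverse contains datasets"]
    if subjects:
        parts.append("on the subjects of " + ", ".join(subjects[:max_items]))
    if authors:
        parts.append("by the authors " + ", ".join(authors[:max_items]))
    return " ".join(parts) + "."


def description_meta_tag(text):
    """Wrap the description in a DC.description meta element."""
    return f'<meta name="DC.description" content="{escape(text, quote=True)}"/>'
```

Capping the number of subjects and authors keeps the generated text short enough to be useful as a search result snippet; the cap of three is arbitrary.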
Also, |
2024/09/30: @landreev do you know if this issue should be closed because it appears that all (most?) issues may have been addressed? |
Changed the title of the issue, to indicate that the goal is to make individual dataverses more discoverable in google and other search engines.
Embedding structured metadata into the dataverse page is not necessarily the best way to achieve that.
It appears to be more practical to focus on improving the crawl rules (specifically, discouraging the bots from crawling the facets and paginated search results on dataverse pages); in combination with using a sitemap, to point the bots to all the datasets and dataverses directly.
(end update)
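The sitemap half of the update above could be sketched roughly as follows, in Python. This is a minimal illustration that assumes the list of dataverse and dataset URLs is already available; the function name and the example URL are hypothetical, and a real sitemap would likely also carry `lastmod` values and be split once it grows large:

```python
import xml.etree.ElementTree as ET

# Namespace defined by the sitemaps.org protocol.
SITEMAP_NS = "http://www.sitemaps.org/schemas/sitemap/0.9"


def build_sitemap(urls):
    """Return a sitemap XML string with one <url><loc> entry per URL."""
    urlset = ET.Element("urlset", xmlns=SITEMAP_NS)
    for url in urls:
        entry = ET.SubElement(urlset, "url")
        ET.SubElement(entry, "loc").text = url
    return ET.tostring(urlset, encoding="unicode")
```

Listing only the canonical "/dataverse/name" and dataset URLs here, combined with crawl rules that keep bots out of facets and paginated results, points the crawler directly at the pages worth indexing.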
We already go to admirable lengths embedding some structured metadata (DC, schema.org) into our dataset pages, making individual datasets more discoverable.
It would benefit our dataverse pages to have similarly easily indexable metadata as well.