This repository has been archived by the owner on Feb 9, 2022. It is now read-only.

Revert "Add specific start URL" #454

Merged (1 commit) on Jun 12, 2018


Contributor

@s-pace s-pace commented Jun 12, 2018

Reverts #453
@JoelMarcey @endiliey
We already avoid duplicates. The redirection is not needed thanks to your sitemap (we scrape it in order to find an available link).

ref facebook/docusaurus#744

@s-pace s-pace merged commit 48645b7 into master Jun 12, 2018
@JoelMarcey
Contributor

@s-pace Oh ok. Thanks!

I created facebook/docusaurus#765 to hopefully solve the sitemap issue that has been going on.

@endiliey
Contributor

endiliey commented Jun 12, 2018

Thanks @s-pace
I think it also has to do with your stop_urls:

"stop_urls": [
  "/help",
  "/users",
  "https://docusaurus.io/docs/en/[0-9].*",
  "https://docusaurus.io/docs/en/next",
  "^((?!\\.html).)*$"
]

facebook/docusaurus#765 should help. It can avoid duplicates for other crawlers as well (e.g. Google).

@s-pace
Contributor Author

s-pace commented Jun 13, 2018

Yes indeed @endiliey, the stop_urls prevent us from crawling pages from versions other than the latest.

EDIT: It also prevents us from crawling web pages whose URL does not end in .html:

"^((?!\\.html).)*$"

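To make the effect of that last stop_urls entry concrete, here is a minimal sketch. It assumes the pattern is applied to the full URL with Python's `re` module; the helper name `is_stopped` and the sample URLs are illustrative only, not taken from the scraper. Strictly speaking, the pattern matches any URL that contains no ".html" anywhere, which is slightly broader than "does not end in .html".

```python
import re

# Pattern from the stop_urls entry. In the JSON config the backslash is
# doubled ("\\."), which decodes to the regex "\." (a literal dot).
NO_HTML = re.compile(r"^((?!\.html).)*$")


def is_stopped(url):
    """Return True when the URL contains no ".html", i.e. it matches the stop pattern."""
    return NO_HTML.search(url) is not None


# Hypothetical URLs, for illustration only.
print(is_stopped("https://docusaurus.io/docs/en/installation"))       # True
print(is_stopped("https://docusaurus.io/docs/en/installation.html"))  # False
```
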
@endiliey
Contributor

Are we good now @s-pace ?

I'm currently away but if you still need redirections on /docs/ I can try to implement it next week.

@s-pace
Contributor Author

s-pace commented Jun 13, 2018

👋 @endiliey, thank you for being so available.

I had to change the regex to its "opposite", since *.html$ links are still reachable from the links in your footer (.nav-footer); cf. 64ca146.

We are fine for the main pages, as long as you only want to search the English pages for now.

To filter properly by version and language, we should use meta tags as described in facebook/docusaurus#744 (comment), along with a sitemap that exposes the URLs for every language.

Do you think it is doable to have the meta tags, plus a sitemap containing every link for the different languages? That way every page would have its full context embedded and be clearly referenced, with no need to rely on URLs alone.

Let me know

cc @JoelMarcey

@endiliey
Contributor

endiliey commented Jun 13, 2018

Naturally, we want to search the correct language and version of the Docusaurus pages, depending on the version and language the user is currently on.
Examples:

  1. If I'm on version 1.15 and language 'ko', we want to search version 1.15 and language 'ko'.

  2. Search only Chinese docs when the language is set to Chinese
     [screenshot: chinese]

  3. Search only Korean documentation when on Korean
     [screenshot: dynamic language]

  4. Search only French documentation when on French
     [screenshot: screenshot_20180612-174715]

I am okay with relying on URLs because it seems to work so far for many Docusaurus users (see the examples above). Docusaurus is used by many websites, so if we had to use meta tags, all Docusaurus users might need to change their DocSearch config. I'm trying to avoid too many changes, but if the change is really necessary we can work on it.

The next thing I want to mention is that our sitemap actually exposes the URLs for other languages through xhtml:link rel="alternate" hreflang="XX"

Refer to
https://support.google.com/webmasters/answer/2620865?hl=en

Use Chrome developer tools to inspect the sitemap, because the Chrome XML viewer does not display it correctly.
[screenshot]
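
As a rough illustration of the hreflang alternates mentioned above, the sketch below reads `xhtml:link rel="alternate"` entries from a sitemap using only the Python standard library. The sample sitemap, its example.com URLs, and the function name `alternates_by_language` are assumptions made for the example; this is not how the DocSearch scraper currently handles sitemaps.

```python
import xml.etree.ElementTree as ET

# Standard namespaces for sitemaps and for xhtml:link alternates.
NS = {
    "sm": "http://www.sitemaps.org/schemas/sitemap/0.9",
    "xhtml": "http://www.w3.org/1999/xhtml",
}

# Hypothetical sitemap fragment, for illustration only.
SAMPLE_SITEMAP = """<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9"
        xmlns:xhtml="http://www.w3.org/1999/xhtml">
  <url>
    <loc>https://example.com/docs/en/installation</loc>
    <xhtml:link rel="alternate" hreflang="en"
                href="https://example.com/docs/en/installation"/>
    <xhtml:link rel="alternate" hreflang="ko"
                href="https://example.com/docs/ko/installation"/>
  </url>
</urlset>"""


def alternates_by_language(sitemap_xml):
    """Group every alternate URL found in the sitemap by its hreflang value."""
    result = {}
    root = ET.fromstring(sitemap_xml)
    for url in root.findall("sm:url", NS):
        for link in url.findall("xhtml:link", NS):
            lang = link.get("hreflang")
            if link.get("rel") == "alternate" and lang:
                result.setdefault(lang, []).append(link.get("href"))
    return result


print(alternates_by_language(SAMPLE_SITEMAP))
```

A crawler could use such a mapping to enqueue one start URL per language instead of discovering translations through page links.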

What do you think @s-pace ?

@s-pace
Contributor Author

s-pace commented Jun 13, 2018

@endiliey

The current behaviour works on the sites you mentioned because we only scrape one version at a time. You can have a look at the reason config, for example. Since start_urls work as regexes, matching the latest version <URL>/docs/<LANG>/ will also encompass all of the other ones, <URL>/docs/<LANG>/<VERSION>/. For this reason, having something tangible within the content of the web page would be a more stable way to handle it than relying on URLs. If you want to avoid having to change the configuration, you can use DocSearch metadata:

<meta name="docsearch:version" content='latest'>

Editing the config will most likely be needed anyway, since changing the outcome requires changing the customised part.
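
The start_urls overlap described above can be sketched as follows. This assumes a start URL is treated as a regex matched from the beginning of each candidate URL; the function name and the versioned sample URLs are hypothetical.

```python
import re

# Start URL for the latest English docs, treated as a regex (assumption).
START_URL = "https://docusaurus.io/docs/en/"


def matches_start_url(url):
    """Return True when the URL begins with something matching START_URL."""
    return re.match(START_URL, url) is not None


# Hypothetical URLs, for illustration only.
candidates = [
    "https://docusaurus.io/docs/en/installation",         # latest version
    "https://docusaurus.io/docs/en/1.0.14/installation",  # older version
    "https://docusaurus.io/docs/ko/installation",         # other language
]
for url in candidates:
    print(url, matches_start_url(url))
```

The older-version URL matches as well as the latest one, which is why the URL alone cannot tell the two apart without extra stop_urls or page-level metadata.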

Regarding the sitemap, handling these alternate links might indeed be a good approach. We need to investigate adding such a feature to the scraper. I will keep you posted about this one.

@endiliey
Contributor

endiliey commented Jun 13, 2018

@s-pace thanks for explaining.

It seems the problem mostly affects sites with many versions and many languages, like Docusaurus itself.

What do you think @JoelMarcey ? If we are going ahead with the metadata tag I think we should agree on how the metadata tag should be formatted.

@s-pace, would something like this be sufficient ?

<meta name="docsearch:version" content='next'>
<meta name="docsearch:language" content='en'>

Another example

<meta name="docsearch:version" content='1.2.0'>
<meta name="docsearch:language" content='zh-CN'>
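
For illustration, the proposed tag format could be generated by a small helper like the one below. It is written in Python only to keep the sketch self-contained; the function name `docsearch_meta_tags` is hypothetical, and the real implementation would live in Docusaurus itself.

```python
from html import escape


def docsearch_meta_tags(version, language):
    """Return the two proposed docsearch meta tags as HTML strings."""
    fields = {"version": version, "language": language}
    return [
        # escape() guards against quotes or angle brackets in the values.
        f'<meta name="docsearch:{name}" content="{escape(value)}">'
        for name, value in fields.items()
    ]


for tag in docsearch_meta_tags("1.2.0", "zh-CN"):
    print(tag)
```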

@JoelMarcey
Contributor

Given the small subset of Docusaurus sites that use languages or versioning, and the even fewer that use both, I am OK with adding the meta tags to our Head.js (we would have to implement the logic there to get the languages and versions, but I don't think that should be too bad).

This wouldn't be a breaking change, but once it is implemented we could announce that, to get full-fidelity search results, users should update their docsearch config appropriately.

@s-pace
Contributor Author

s-pace commented Jun 18, 2018

Definitely a 💯. Feel free to point out where you need help.

@endiliey
Contributor

@s-pace

As discussed, it would be good to add DocSearch metadata.

We can implement it in Docusaurus.
[screenshot: meta]

The metadata is formatted like this:

<meta name="docsearch:version" content='1.2.1'>
<meta name="docsearch:language" content='en'>

I have a few questions.

What will be the next step? (Do we have to change the config?)
Will we be able to search the correct language and version of the Docusaurus pages, depending on the version and language the user is currently on? Example: if I'm on version 1.15 and language 'ko', we want to search version 1.15 and language 'ko'.

@s-pace
Contributor Author

s-pace commented Jun 18, 2018

You will have to update the search UI in order to restrict the scope of the search to the right facetFilters.
I will update the config; it shouldn't be that hard, since we automatically scrape these meta tags.
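
As a sketch of how the two pieces above could fit together, the example below reads the docsearch meta tags from a page and turns them into facetFilters strings of the form `attribute:value`. It assumes the facet attribute names are the meta names without the `docsearch:` prefix; the class and function names are hypothetical, and the actual search UI would build these filters in JavaScript.

```python
from html.parser import HTMLParser

PREFIX = "docsearch:"


class DocSearchMetaParser(HTMLParser):
    """Collect <meta name="docsearch:..."> tags into a dict."""

    def __init__(self):
        super().__init__()
        self.meta = {}

    def handle_starttag(self, tag, attrs):
        if tag != "meta":
            return
        attributes = dict(attrs)
        name = attributes.get("name") or ""
        if name.startswith(PREFIX):
            self.meta[name[len(PREFIX):]] = attributes.get("content") or ""


def facet_filters_for(page_html):
    """Build a sorted list of "attribute:value" filters from the page's meta tags."""
    parser = DocSearchMetaParser()
    parser.feed(page_html)
    return [f"{key}:{value}" for key, value in sorted(parser.meta.items())]


# Hypothetical page head, for illustration only.
PAGE = (
    "<head>"
    "<meta name=\"docsearch:version\" content='1.15'>"
    "<meta name=\"docsearch:language\" content='ko'>"
    "</head>"
)
print(facet_filters_for(PAGE))
```

With filters like these passed to the search, a user on version 1.15 in Korean would only see results for that version and language.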

@s-pace s-pace deleted the revert-453-specific-page-start-url branch September 28, 2018 11:49