Add lastmod to sitemap #2604
Comments
I think this would be a good addition, but I do know web crawlers that use the priority field. |
@RDIL such as? Honestly I’ve been doing SEO for well over a decade, not seen it used in the last 5 years. |
Fair enough. |
Great idea! Thanks for the suggestion! |
Hello! I want to help solve this issue.
|
Most likely the last build time, since even tiny changes end up changing the chunk hashes, so it's constantly being modified. |
@RDIL FYI Webpack 5 might help to make the js chunks more "stable" (see my recent comment in #3383), we may try to migrate after i18n is ready. Not sure what we should do for this date. Also not sure how the sitemaps plugin could access the "last modification date" of the page, as this plugin is decoupled from the others. Is it mandatory to add it to the sitemaps? It could likely be easier to handle this by adding a meta directly on the page, otherwise, we'd have to find a way to provide such metadata per path to the sitemap plugin. Asking this, because for my work on i18n I'll also have to think about how to set up useful headers for localization (hreflang), and thought about adding them to the page directly instead of the sitemaps. @jdevalk as it seems you know more about SEO than the rest of us, can you give us some insights? |
Last modified is somewhat of a must for XML sitemaps indeed. I think for hreflang I'd go for adding it to the page instead of the XML sitemaps as that makes debugging a lot easier and maybe even makes it accessible to other features within docusaurus, like a language switcher. |
Thanks, will do that. About lastModified, some plugins already read git history to get the last modified date. We can also allow hardcoding it through front matter. I think we should:
If this info can't be obtained (pages might not be generated from FS files), is it better to not add the lastmod entry, or to fall back to build time (which is likely to be a recent value if the site is built often)? We agree that this date should rather be updated when the content changes, but not when the code (i.e. the layout rendering the content etc.) changes? |
I would not add it then. Having it change all the time when it's actually not changing is also not beneficial.
Agreed. |
Hi! |
This would be super useful as we are busy automating spell checking and grammar using AI. I was hoping to use the lastmod to understand when a page has changed to do a spell check and grammar check before deploying to live. I wouldn't want to do this for the entire website. I don't think there should be a distinction between content change and layout change. If a specific page has changed then the lastmod should be updated with that date. Maybe it can be an input in Layout tag: <Layout title="Dataplane Data & Automation Platform | Open Source" lastmod="2020-04-14T11:22:05+00:00"> |
While I understand @saul-data has different needs, for SEO / crawl efficiency reasons I’d only change the lastmod when the content changes. I’d say basing it on the lastmod date of the underlying source document is probably easiest. Note that search engines are putting more emphasis on adding lastmod as of recently, so I’d prioritize this issue a bit higher. |
Would this be linked to https://docusaurus.io/docs/blog#blog-post-date ? I couldn't see a date reference for pages and docs (only versions). I feel this should be an input by the user when the content or page has changed. |
Note: there's a related issue to add an explicit last update date for blog posts, that could be used as the sitemap lastmod |
I have a prototype for adding @slorber Is this how you envisioned the feature in #2604 (comment)? |
I solved this problem for my own site with a post build script; I blogged about it here: https://johnnyreilly.com/adding-lastmod-to-sitemap-git-commit-date |
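A minimal sketch of the idea behind a Git-based lookup, not the script from the linked post: ask Git for the date of the last commit that touched a source file. The `RunGit` type and the `lastCommitDate` function are hypothetical names introduced for illustration. The Git invocation itself (`git log -1 --format=%cI -- <file>`) is standard, and the command runner is injected so the sketch does not depend on any particular process API.

```typescript
// Hypothetical helper type: runs `git` with the given arguments and returns stdout.
type RunGit = (args: string[]) => string;

// Returns the ISO 8601 committer date of the last commit touching `file`,
// or undefined when Git reports no commit (for example, an untracked file).
function lastCommitDate(file: string, runGit: RunGit): string | undefined {
  const stdout = runGit(['log', '-1', '--format=%cI', '--', file]).trim();
  return stdout === '' ? undefined : stdout;
}

// Usage with a fake runner; a real one could wrap Node's child_process.
const fakeGit: RunGit = (args) =>
  args[args.length - 1] === 'blog/post.md' ? '2023-11-12T08:33:51+00:00\n' : '';

const tracked = lastCommitDate('blog/post.md', fakeGit);
const untracked = lastCommitDate('blog/draft.md', fakeGit);
```

Returning `undefined` for files without history lets the caller omit the lastmod entry instead of inventing a date.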
|
Yeah I’m sorry it’s basically a requirement now. |
Hey We have merged support for git/front matter last update metadata for blog posts (#8657) which now means both blog and docs have unified support for this feature. (note that the pages plugin doesn't have support, although we could also add it there) Now is a good time to add "lastmod" to the sitemap as well. I'll review your PR soon @pmarschik, sorry for the delay. In the meantime let's decide what should be implemented exactly here, using the Google sitemap doc as a ref:
@saul-data this is not what we will implement because it's not what Google recommends:
@jdevalk I'd rather keep them for now, and maybe we'll remove those later. I guess we can consider the removal as a breaking change? 🤷♂️
@johnnyreilly note that your solution filters pages from the sitemap such as the tags and paginated lists pages, since they do not match your regexp pattern. To implement this feature properly, we should also consider that there isn't always a Markdown document per sitemap URL, and some pages are also displaying multiple documents at once. It's more difficult to define a "lastmod" date for those URLs for example:
My suggestion is to initially keep things simple, and only add a "lastmod" date when the page is backed by a Markdown document. The Google doc says:
Do we agree on this plan? |
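The plan discussed in this thread could be sketched roughly as follows: use a date only when the page is backed by a source document, and otherwise emit no lastmod at all rather than falling back to build time. The `PageInfo` shape, the `resolveLastmod` name, and the precedence of a front matter date over a Git date are assumptions made for illustration, not Docusaurus APIs.

```typescript
// Hypothetical description of a page known to the sitemap generator.
interface PageInfo {
  url: string;
  frontMatterDate?: number; // timestamp (ms) hardcoded by the author
  gitDate?: number; // timestamp (ms) of the last commit on the source file
}

// Returns the timestamp to use for <lastmod>, or undefined to omit the tag.
// Build time is deliberately never used as a fallback.
function resolveLastmod(page: PageInfo): number | undefined {
  return page.frontMatterDate ?? page.gitDate;
}

const versioningDoc: PageInfo = {url: '/docs/versioning/', gitDate: Date.UTC(2024, 0, 4)};
const tagsPage: PageInfo = {url: '/blog/tags/'}; // not backed by a single document

const docLastmod = resolveLastmod(versioningDoc);
const tagsLastmod = resolveLastmod(tagsPage);
```

Pages that aggregate several documents, such as tag or paginated list pages, simply get no lastmod under this sketch.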
Something important to also consider: reading the file history from We only read from git when the Is it a problem? Are some of you looking to have I'd like to refactor the APIs and do breaking changes to make things less confusing, but I wonder if having the behavior above (a bit awkward) can be a problem to some of you? |
If you are using either the sitemap OR showLastUpdateTime then it should work, it doesn't make sense to require |
Decent plan - happy with it. Do the breaking changes - good default |
Thanks for your feedback. Agree @wparad, will try to find a solution so that the sitemap lastmod can be used independently from the docs/blog plugin options, and yet we need to avoid reading the lastmod date from Git twice for performance reasons (this can be expensive for thousands of files) |
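One way to avoid reading the same date from Git twice is to cache the lookup per file. This is only a sketch under that assumption, not how the plugins actually share data; `memoizeByFile` and `slowGitDate` are hypothetical names.

```typescript
// Wraps an expensive per-file lookup so each file is only resolved once,
// even if several consumers (docs, blog, sitemap) ask for the same date.
function memoizeByFile<T>(lookup: (file: string) => T): (file: string) => T {
  const cache = new Map<string, T>();
  return (file) => {
    if (!cache.has(file)) {
      cache.set(file, lookup(file));
    }
    return cache.get(file) as T;
  };
}

// Stand-in for a costly Git call; counts how often it actually runs.
let gitCalls = 0;
const slowGitDate = (file: string): number => {
  gitCalls += 1;
  return file.length; // placeholder value, not a real date
};

const cachedGitDate = memoizeByFile(slowGitDate);
cachedGitDate('docs/intro.md'); // e.g. requested by the docs plugin
cachedGitDate('docs/intro.md'); // e.g. requested again by the sitemap plugin
```

After both calls the underlying lookup has run once, which matters when a site has thousands of files.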
New sitemap options are implemented in a PR, ready for review: #9954
{
  lastmod: null | 'date' | 'datetime',
  priority: null,
  changefreq: null,
}
Example with our Docusaurus website sitemap:
<urlset
xmlns="http://www.sitemaps.org/schemas/sitemap/0.9"
xmlns:news="http://www.google.com/schemas/sitemap-news/0.9"
xmlns:xhtml="http://www.w3.org/1999/xhtml"
xmlns:image="http://www.google.com/schemas/sitemap-image/1.1"
xmlns:video="http://www.google.com/schemas/sitemap-video/1.1">
<url>
<loc>https://docusaurus.io/blog/</loc>
</url>
<url>
<loc>https://docusaurus.io/blog/2017/12/14/introducing-docusaurus/</loc>
<lastmod>2023-01-05</lastmod>
</url>
<url>
<loc>https://docusaurus.io/blog/2018/04/30/How-I-Converted-Profilo-To-Docusaurus/</loc>
<lastmod>2023-01-04</lastmod>
</url>
<url>
<loc>https://docusaurus.io/blog/2018/09/11/Towards-Docusaurus-2/</loc>
<lastmod>2023-04-21</lastmod>
</url>
<url>
<loc>https://docusaurus.io/docs/versioning/</loc>
<lastmod>2024-01-04</lastmod>
</url>
<url>
<loc>https://docusaurus.io/</loc>
<lastmod>2023-10-31</lastmod>
</url>
<!-- ... Other URLs, this is just a sample -->
</urlset>
You will notice that not all the URLs have a lastmod attribute (ex For now, I'm not changing defaults in Docusaurus v3, and the base sitemap for existing sites will stay the same as before. However, these options should help you remove The sitemap plugin will use in priority the route metadata But the sitemap plugin can also work in isolation, and will also call git history in case Does it look good to you, or do you see any issues with the implementation above? |
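To illustrate how null options and missing dates could translate into the XML above, here is a rough sketch that renders a single `<url>` element and skips every tag whose value is null. The `SitemapItem` interface and `renderUrl` function are hypothetical and are not taken from the PR.

```typescript
// Hypothetical sitemap item: optional fields are omitted from the XML when null.
interface SitemapItem {
  loc: string;
  lastmod: string | null;
  changefreq: string | null;
  priority: number | null;
}

// Renders one <url> element, skipping every tag whose value is null.
// Values are not XML-escaped here; a real implementation should escape them.
function renderUrl(item: SitemapItem): string {
  const tags = [
    `<loc>${item.loc}</loc>`,
    item.lastmod !== null ? `<lastmod>${item.lastmod}</lastmod>` : null,
    item.changefreq !== null ? `<changefreq>${item.changefreq}</changefreq>` : null,
    item.priority !== null ? `<priority>${item.priority}</priority>` : null,
  ].filter((tag): tag is string => tag !== null);
  return `<url>${tags.join('')}</url>`;
}

const withDate = renderUrl({
  loc: 'https://docusaurus.io/',
  lastmod: '2023-10-31',
  changefreq: null,
  priority: null,
});
const withoutDate = renderUrl({
  loc: 'https://docusaurus.io/blog/',
  lastmod: null,
  changefreq: null,
  priority: null,
});
```

The second call mirrors the blog index entry in the sample, which has a `<loc>` but no `<lastmod>`.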
This seems pretty good. I note that <url>
<loc>https://johnnyreilly.com/adding-lastmod-to-sitemap-git-commit-date</loc>
<lastmod>2023-11-12T08:33:51+00:00</lastmod>
</url> I suspect the time portion isn't that important. Most blogs won't be meaningfully updated more than once a day and crawlers may run less frequently than that. Looks good! |
Thanks for the review You can choose either const LastmodFormatters: Record<LastModOption, LastModFormatter> = {
date: (timestamp) => new Date(timestamp).toISOString().split('T')[0]!,
datetime: (timestamp) => new Date(timestamp).toISOString(),
};
That date is "relative" and only helps Google prioritize page crawls within your own site, so I will probably use "date" as a default in v4. datetime takes more space, and I doubt the default Docusaurus sites are updated enough for time to be useful. So if you want datetime, it will remain opt-in. |
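For comparison, a self-contained sketch of the two formats applied to one timestamp. The `LastModOption` and `LastModFormatter` type definitions are assumptions, since the comment above does not show them, and `slice(0, 10)` is used here in place of the `split('T')[0]!` form.

```typescript
// Assumed type definitions; the original comment only shows the formatter map.
type LastModOption = 'date' | 'datetime';
type LastModFormatter = (timestamp: number) => string;

const formatters: Record<LastModOption, LastModFormatter> = {
  // Keeps only the YYYY-MM-DD prefix of the ISO string.
  date: (timestamp) => new Date(timestamp).toISOString().slice(0, 10),
  // Full ISO 8601 string, always expressed in UTC.
  datetime: (timestamp) => new Date(timestamp).toISOString(),
};

const sampleTimestamp = Date.UTC(2023, 10, 12, 8, 33, 51); // 12 Nov 2023, 08:33:51 UTC

const asDate = formatters.date(sampleTimestamp); // "2023-11-12"
const asDatetime = formatters.datetime(sampleTimestamp); // "2023-11-12T08:33:51.000Z"
```

Because `toISOString()` always works in UTC, the "date" format reflects the UTC calendar day, which can differ from the author's local day near midnight.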
I think I'll stick with the default of |
Hey, not related to lastmod, but should Docusaurus support sitemap images? Apparently, this is a thing: |
Oh wow! Never heard of this. Despite all the links, I can't work out if there's a compelling reason to have them. Hmmmmm |
Yes 😄 TIL there are also video and news sitemaps in @stefanjudis's article: I'm not sure it's worth supporting officially or by default, but we could do like the blog plugin and let users provide a |
I think the hook is a good idea - I already manually amend my sitemap to exclude tags and pagination pages. Having a hook in the box would support that use case as well as this. |
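A hook of this kind might look roughly like the sketch below, which drops tag pages and paginated list pages from the sitemap. The `SitemapEntry` shape, the `SitemapItemsHook` signature, and the URL patterns are all assumptions for illustration; the thread does not specify the actual hook API.

```typescript
// Hypothetical item shape and hook signature: the hook receives the default
// items and returns the ones to keep.
interface SitemapEntry {
  url: string;
  lastmod?: string;
}
type SitemapItemsHook = (items: SitemapEntry[]) => SitemapEntry[];

// Drops tag pages and paginated list pages. The URL patterns are assumptions
// and would need adjusting to a site's actual routes.
const excludeTagsAndPagination: SitemapItemsHook = (items) =>
  items.filter(
    (item) => !/\/tags(\/|$)/.test(item.url) && !/\/page\/\d+\/?$/.test(item.url),
  );

const kept = excludeTagsAndPagination([
  {url: 'https://example.com/blog/first-post/', lastmod: '2023-01-05'},
  {url: 'https://example.com/blog/tags/seo/'},
  {url: 'https://example.com/blog/page/2/'},
]);
```

The same shape of hook could also be used to add extra data to items, such as the image entries mentioned above.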
This made me laugh BTW: 🤣
https://www.stefanjudis.com/today-i-learned/image-video-news-sitemaps/ |
🐛 Bug Report
The XML sitemaps currently output loc, changefreq and priority for every url set. I would propose dropping the changefreq and priority fields, as none of the search engines use these, and instead adding the lastmod field, with the last modification date of the file.
Have you read the Contributing Guidelines on issues?
Yes.
To Reproduce
(Write your steps here:)
Expected behavior
The current output would be:
(Write what you thought would happen.)
Actual Behavior
I propose changing it to:
Your Environment