Add generator for a search index #1853

Closed
wants to merge 1 commit into from

digitalcraftsman
Member

The generator creates an index of all content
files and their metadata.

See #1635 #144

@digitalcraftsman
Member Author

The generator is working so far, but the implementation isn't finished yet (see TODO).

Are there any fundamental objections to the generator?

QUESTIONS:

  • Currently, there is only an option to turn the generator on or off. Does it make sense to stop the regeneration in watch mode (hugo server) and only trigger it in production (hugo)?
  • Should I add an option for a custom path and filename for the index? At the moment the path is hardcoded.
  • Should more data be included in the index?
  • How would you set up a test function?

WISHLIST:

  • partial rebuilds of the cached index (not the json file). This would reduce overhead

/cc @rdwatters @bep @spf13

@rdwatters
Contributor

@digitalcraftsman This is pretty fantastic. Thanks for working on what I see as a really powerful feature (and one I have kind of been nagging about).

Question for you: rather than just the ability to create an index that can be built/not built with a flag, how difficult would it be to just extend Hugo's abilities to write to any .json file using the same templating logic? Would a feature like this (Jekyll has the ability to write JSON files, which comes in pretty handy) slow down builds to the point of not being worth it?

Do you think writing to any-file.json rather than site-index.json would provide the most flexibility (ie, w/r/t using ajax, etc), or is the primary objective to allow for client-side search a la something like Tipue or lunr.js?

Also, sorry for the delayed response to your questions (in the order you presented them above):

  1. I think this depends on whether the goal is for search or for the ability to write to json in general. If the latter, seems to make more sense to write to the json file in both environments.
  2. Again, depending on the search-v-json-in-general idea, I think a convention of having a siteIndex.json (or whatever) in static is easy enough.
  3. This is a tough one. I'll have to defer to @bep and @spf13, but this might depend on thoughts for the overall data model for V1. For example, there's talk on Discuss about preventing Hugo from building empty files in content directories that act more like a data directory (similar to a "collection" in Jekyll). If this is the case, I'm not sure how convenient excludefromindex will be on a per-md/yml level, or if it makes sense to exclude whole directories/content types, etc.
  4. These questions are getting progressively more out of my wheelhouse, haha 😄 I'm pretty handy with JavaScript, but my business card says "Digital Content Manager" and not "developer." Totally deferring to you guys on this one.

Thanks again, brother. I think HUGO is easily the best SSG around. Cheers!

@digitalcraftsman
Member Author

rather than just the ability to create an index that can be built/not built with a flag, how difficult would it be to just extend Hugo's abilities to write to any .json file using the same templating logic? Would a feature like this (Jekyll has the ability to write JSON files, which comes in pretty handy) slow down builds to the point of not being worth it?

I think that would be a powerful addition to the current set of template functions. I'm not sure how much it would slow down the generation of pages. A possible implementation could use a global shared object, similar to Scratch, but with a few more parameters (filename, destination). Inside a template, a coder could specify, if necessary with if-else statements, what should be included in which JSON file.

Do you think writing to any-file.json rather than site-index.json would provide the most flexibility (ie, w/r/t using ajax, etc), or is the primary objective to allow for client-side search a la something like Tipue or lunr.js?

This is a good question. Why not take the best of both worlds? Just setting a config variable to true is the most user-friendly way, in my opinion. Your approach would allow much more flexibility, but the user or theme creator may need to include logic in many different places. Imagine a user has different layouts for different content types. They would need to add the logic to each template file of a content type. Shortcodes would be a handy way to avoid redundant code.

But let's wait and see what the others think about this.
Thank you @rdwatters for sharing your thoughts and ideas.

I think HUGO is easily the best SSG around. Cheers!

This project has grown a lot in its rather short lifetime 😄. I'm curious what we will see in the v1.0 release.

@rdwatters
Contributor

@digitalcraftsman Good points all round. I guess that I am ultimately bringing up two separate feature requests, and you are absolutely right that it could be the best of both worlds.

As far as the ability to write to json files in general, you're right that this would have to be a separate process in that forcing devs to write templating to account for all content/section areas in a single site-index.json would be more than a little tedious.

I like where you are going with the todo for excludefromindex at both the page and section/content level.

Oh, and thanks again 😄

Oh, and @bep I just drastically edited this comment after you already replied to it. Sorry about that.

@bep
Member

bep commented Feb 16, 2016

Ability to write to json files in general.

There is an open issue somewhere about rendering custom content-types, like JSON, ical ... whatever. We should do that.

@digitalcraftsman
Member Author

There is an open issue somewhere about rendering custom content-types, like JSON, ical ... whatever. We should do that.

If you, @bep, agree with @rdwatters and me, we should treat these as two different issues. See #1128 (for ical, xcal). But there's no issue about writing content to a JSON file. Should I create a new issue?

@spf13
Contributor

spf13 commented Feb 16, 2016

These are two different issues.

The PR is effectively a sitemap in JSON which will enable lots of nice integrations.

A second issue is for Hugo to support rendering into multiple different formats.
The second issue is a pretty considerable one which would require quite a bit of restructuring, including the integration of the text/template library alongside the html/template one.
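For readers wondering why text/template matters here: html/template escapes interpolated values for HTML contexts, which is the wrong behavior when the output is JSON, iCal, or another non-HTML format. The following is a minimal sketch using only the Go standard library, not Hugo's actual rendering code; renderText and renderHTML are hypothetical helper names.

```go
package main

import (
	"bytes"
	"fmt"
	htmltemplate "html/template"
	texttemplate "text/template"
)

// renderText executes src with text/template, which writes values unescaped.
func renderText(src string, data interface{}) (string, error) {
	t, err := texttemplate.New("text").Parse(src)
	if err != nil {
		return "", err
	}
	var buf bytes.Buffer
	err = t.Execute(&buf, data)
	return buf.String(), err
}

// renderHTML executes src with html/template, which escapes values
// according to the HTML context they appear in.
func renderHTML(src string, data interface{}) (string, error) {
	t, err := htmltemplate.New("html").Parse(src)
	if err != nil {
		return "", err
	}
	var buf bytes.Buffer
	err = t.Execute(&buf, data)
	return buf.String(), err
}

func main() {
	const src = `{{ .Title }}`
	data := map[string]string{"Title": "Tom & Jerry"}

	txt, err := renderText(src, data)
	if err != nil {
		panic(err)
	}
	htm, err := renderHTML(src, data)
	if err != nil {
		panic(err)
	}
	fmt.Println(txt) // ampersand left as-is
	fmt.Println(htm) // ampersand HTML-escaped
}
```

Note that neither package escapes values for JSON; a function such as jsonify would still be needed for that.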

Title string `json:"title"`
Content string `json:"content"`
Permalink string `json:"permalink"`
Tags interface{} `json:"tags"`
Contributor

We shouldn't assume these taxonomies are being used. I think this is a very limiting approach.
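One way to address this concern, sketched under assumptions: keep taxonomies in a generic map keyed by taxonomy name instead of hard-coding a Tags field. IndexEntry, its Taxonomies field, and encodeIndex are hypothetical names for illustration and are not taken from the PR.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// IndexEntry is a hypothetical index record. Taxonomies are stored in a
// generic map so that no particular taxonomy (tags, categories, ...) is assumed.
type IndexEntry struct {
	Title      string              `json:"title"`
	Content    string              `json:"content"`
	Permalink  string              `json:"permalink"`
	Taxonomies map[string][]string `json:"taxonomies,omitempty"`
}

// encodeIndex serializes the entries as a JSON array.
func encodeIndex(entries []IndexEntry) (string, error) {
	b, err := json.Marshal(entries)
	return string(b), err
}

func main() {
	entries := []IndexEntry{
		{
			Title:      "First post",
			Content:    "Hello world",
			Permalink:  "/post/first/",
			Taxonomies: map[string][]string{"series": {"intro"}},
		},
		// An entry without taxonomies omits the key entirely.
		{Title: "About", Content: "About this site", Permalink: "/about/"},
	}
	out, err := encodeIndex(entries)
	if err != nil {
		panic(err)
	}
	fmt.Println(out)
}
```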

@rdwatters
Contributor

@digitalcraftsman Spitballing on this, but is there utility in implementing a stopwords list when creating the index? Here's a decent default list.

http://www.ranks.nl/stopwords

If the intention is client-side search, it looks like it's the same stopwords used by Tipue and similar to the stopword filter for lunr.js. That said, if search results were designed to surface, say, the "description" key in front matter, SERPs would look weird if every definite and indefinite article were omitted from the page. Then again, maybe eliminating stopwords from the index before it's sent could make filesize smaller and potentially reduce demand on the client.

It goes without saying that internationalization efforts being worked on outside this thread would have a different list.
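As a rough illustration of the trade-off described above, here is a minimal stop-word filter in Go. It is only a sketch: defaultStopwords is a tiny made-up list (not the ranks.nl list), and removeStopwords is a hypothetical helper, not part of Hugo or any search library.

```go
package main

import (
	"fmt"
	"strings"
)

// defaultStopwords is a tiny illustrative English list; a real list would be
// much longer and language-specific.
var defaultStopwords = map[string]struct{}{
	"a": {}, "an": {}, "the": {}, "and": {}, "of": {}, "to": {}, "is": {},
}

// removeStopwords drops every word found in stop. Words are compared
// case-insensitively and with surrounding punctuation ignored; the words
// that are kept retain their original form.
func removeStopwords(text string, stop map[string]struct{}) string {
	words := strings.Fields(text)
	kept := make([]string, 0, len(words))
	for _, w := range words {
		key := strings.ToLower(strings.Trim(w, ".,;:!?\"'()"))
		if _, found := stop[key]; found {
			continue
		}
		kept = append(kept, w)
	}
	return strings.Join(kept, " ")
}

func main() {
	fmt.Println(removeStopwords("The index of a site is small.", defaultStopwords))
}
```

The filtered text is shorter but no longer reads naturally, which is why it suits an index field better than a description shown in search results.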

@moorereason
Contributor

If you want to remove stopwords in Go, check out https://github.com/bbalet/stopwords. It's multi-lingual and has already been discussed on the forums for adding a related posts feature (https://github.com/bbalet/gorelated).

@bep
Member

bep commented Feb 22, 2016

The use of stopwords is within the role of the tokenizer/indexer. This PR is badly named, as it doesn't create a search index, it exports the content in a format suitable for indexing.

@digitalcraftsman
Member Author

This PR is badly named, as it doesn't create a search index, it exports the content in a format suitable for indexing.

After revisiting some of the discussion here and in the forum, I agree. As this PR is currently a WIP, it should just output a JSON file that is intended for searching the content (with lunr.js or similar tools). Using a stopword filter would consequently be the next step for optimizations.

As I discussed with @rdwatters before, we should create a separate jsonify template function that can query content the way the user wants.

@digitalcraftsman digitalcraftsman changed the title from "Add generator for JSON-based content index" to "Add generator for a search index" Feb 24, 2016
@digitalcraftsman
Member Author

Since we have a jsonify template func and a good proof-of-concept for a content index (thanks @bep), it would be perfect to do it this way in this PR. This would satisfy @spf13's wish for more flexibility.

However, I saw that @bep needed to create a new content file just to set the URL properly. Wouldn't it be better to add a saveas template func that saves a string, JSON object etc. as a file under a given path:

{{ $contentList | jsonify | saveas "/index.json" }}

The path would be relative to static/.

With localization support in mind, it would be very easy to create a content index for just a single language. Depending on the current locale, scripts like lunr.js could fetch the content index for that locale and use it as an index.

It doesn't make sense to include Spanish content in the results for a Chinese user. But the setup is completely flexible due to the filter options.
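A per-language index could be produced by grouping entries by language before serializing them. The sketch below assumes a hypothetical Entry type with a Lang field and a hypothetical index.<lang>.json naming scheme; neither comes from the PR or from Hugo's localization work.

```go
package main

import (
	"encoding/json"
	"fmt"
	"sort"
)

// Entry is a hypothetical index record; Lang is used only for grouping and
// is not written to the JSON output.
type Entry struct {
	Title     string `json:"title"`
	Permalink string `json:"permalink"`
	Lang      string `json:"-"`
}

// groupByLang buckets entries by their language code.
func groupByLang(entries []Entry) map[string][]Entry {
	groups := make(map[string][]Entry)
	for _, e := range entries {
		groups[e.Lang] = append(groups[e.Lang], e)
	}
	return groups
}

// indexFilename returns the assumed file name for one language's index.
func indexFilename(lang string) string {
	return "index." + lang + ".json"
}

func main() {
	groups := groupByLang([]Entry{
		{Title: "Hello", Permalink: "/en/hello/", Lang: "en"},
		{Title: "Hola", Permalink: "/es/hola/", Lang: "es"},
	})

	// Sort the language codes so the output order is stable.
	langs := make([]string, 0, len(groups))
	for lang := range groups {
		langs = append(langs, lang)
	}
	sort.Strings(langs)

	for _, lang := range langs {
		b, err := json.Marshal(groups[lang])
		if err != nil {
			panic(err)
		}
		fmt.Printf("%s: %s\n", indexFilename(lang), b)
	}
}
```

A client script would then request only the file matching the page's locale.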

/cc @bep @moorereason

@bep
Member

bep commented Mar 12, 2016

Yes, the extra content file is not good. We need better support for custom file types (json, ical etc.), but the answer isn't saveas (where would you call that from?)

@digitalcraftsman
Member Author

but the answer isn't saveas (where would you call that from?)

I would call it from inside a template, like in the example above:

{{ $contentList | jsonify | saveas "/index.json" }}

The function itself would have a signature like

func saveas(path string, data interface{}) error {}

@digitalcraftsman
Member Author

I revisited this issue and implemented the feature with a template, as @spf13 suggested. That gives users maximum flexibility. Kudos to @bep for implementing the jsonify template func and for providing a good starting point for the internal template.

I would appreciate a review. According to the contribution guidelines, the commit message should mention the modified package as a prefix. Since I modified multiple packages, which should I use?

@rdwatters you asked for an option to exclude certain pages. Have a look at the docs 😉

Furthermore, @rdwatters and @moorereason suggested the use of a stop word filter. Should this be realized with a template function (in a separate pull request)?

Last but not least I would like to keep an eye on the localization support (#1744). Having search results in multiple languages doesn't make sense in my opinion. Should we offer an option to generate a content index per locale?

@moorereason
Contributor

First, my handle is moorereason. No need to spam whoever moore is.

Second, for the commit message prefixes, you want to use the primary affected package (I use that phrase in my updated but yet-to-be-merged contributing guide). In this case, commit bb688f7 would use hugolib, in my view, since that's where the most important change is made. Choose which package you feel is most relevant to call out in the commit message.

In your subsequent commits, I'd use commands, hugolib, and docs, respectively. The idea is to give someone looking over the git logs a quick identifier of where the changes are occurring without them having to read the full commit message or look at the diffs.

I get the feeling I'm going to need to update the contributing guide to give a fuller explanation and rationale for the subject prefix.

@digitalcraftsman
Member Author

I'm sorry for misspelling your handle.

The commit messages have been updated with their corresponding package as prefix. However, at first I just wasn't sure whether the commits should be squashed or not.

@digitalcraftsman
Member Author

I implemented the search feature in the material-docs theme and it works like a charm. But the usability of the default template could be improved.

Currently, we only link to the pages that match the search query. It would be much better if we could also link to the headers of the sections that contain (parts of) the query. MkDocs uses the headers as dividers for the content and adds each of them as a new search result.
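The header-based splitting described here could look roughly like the following. This is a simplified sketch, not how MkDocs or Hugo do it: Section, slug, and splitSections are hypothetical names, only ATX-style Markdown headings ("# Title") are recognized, and fenced code blocks that contain lines starting with "#" are not handled.

```go
package main

import (
	"fmt"
	"strings"
)

// Section is one search result candidate: a heading plus the text below it.
type Section struct {
	Heading string
	Anchor  string
	Body    string
}

// slug is a simplistic stand-in for real anchor generation.
func slug(heading string) string {
	return strings.ReplaceAll(strings.ToLower(strings.TrimSpace(heading)), " ", "-")
}

// splitSections cuts Markdown into sections at every heading line.
// Text before the first heading becomes a section with an empty heading.
func splitSections(markdown string) []Section {
	var sections []Section
	var body []string
	current := Section{}

	flush := func() {
		current.Body = strings.TrimSpace(strings.Join(body, "\n"))
		if current.Heading != "" || current.Body != "" {
			sections = append(sections, current)
		}
		body = nil
	}

	for _, line := range strings.Split(markdown, "\n") {
		rest := strings.TrimLeft(line, "#")
		// A heading is one or more '#' characters followed by a space.
		if len(rest) < len(line) && strings.HasPrefix(rest, " ") {
			flush()
			h := strings.TrimSpace(rest)
			current = Section{Heading: h, Anchor: slug(h)}
			continue
		}
		body = append(body, line)
	}
	flush()
	return sections
}

func main() {
	doc := "Intro text\n# Getting Started\nInstall it.\n## Next Steps\nRead the docs."
	for _, s := range splitSections(doc) {
		fmt.Printf("%q #%s: %q\n", s.Heading, s.Anchor, s.Body)
	}
}
```

Each section's anchor can then be appended to the page permalink so that a search result links directly to the matching header.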

@@ -784,6 +791,8 @@ func (s *Site) initializeSiteInfo() {
GoogleAnalytics: viper.GetString("GoogleAnalytics"),
RSSLink: s.permalinkStr(viper.GetString("RSSUri")),
BuildDrafts: viper.GetBool("BuildDrafts"),
DisableSearchJSON: viper.GetBool("DisableSearchJSON"),
SearchIndexLink: viper.GetString("baseURL") + viper.GetString("searchuri"),
Member Author

@digitalcraftsman digitalcraftsman Apr 28, 2016

@bep Are there any helper functions that can prepend the baseurl for the SearchIndexLink?

Contributor

Looks like s.permalinkStr() is used above for the RSSLink.

Member Author

That's what I've done before. I printed the URL in a template and got http://localhost:1313/search/index.json/ instead of http://localhost:1313/search.json

Contributor

That sounds like a bug.

Member Author

@bep do you know if this behavior is intended or how it can be avoided?

Member

As to helper, see what is used by absURL template func.
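For context on the problem in this thread: plain string concatenation of a base URL and a relative path can produce a doubled or missing slash depending on how each side is configured. The helper below is only a sketch of slash normalization; joinURL is a hypothetical name and is not the function behind absURL or s.permalinkStr().

```go
package main

import (
	"fmt"
	"strings"
)

// joinURL joins base and rel with exactly one slash between them and leaves
// the end of rel untouched, so a file name such as search.json does not gain
// a trailing slash.
func joinURL(base, rel string) string {
	return strings.TrimRight(base, "/") + "/" + strings.TrimLeft(rel, "/")
}

func main() {
	fmt.Println(joinURL("http://localhost:1313/", "/search.json"))
	fmt.Println(joinURL("http://localhost:1313", "search.json"))
}
```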

@bep
Member

bep commented Apr 29, 2016

As to the discussion of stop-words:

  1. I think it is in this case the responsibility of the search lib.
  2. Hugo could have used such a feature (but it is hard: I have seen some of the stop-words lists for Norwegian, and they are crappy), but then as a cross-cutting concern that could be used by others, in this case as a filter.

This PR should be about getting the data in a parseable format, aka JSON.

@derekperkins
Contributor

derekperkins commented Sep 16, 2016

Is this going to make it into 0.17?

@digitalcraftsman
Member Author

digitalcraftsman commented Dec 26, 2016

I'm closing this pull request in favor of #2828. Custom output types would be much more flexible. Users could create content in a format they want by using templates and by specifying the output type (e.g. JSON).

My approach would be too specific and de facto deprecated once you can achieve the same with custom output types.

Nonetheless, the long discussion about this topic highlighted some points that should be considered in the future when someone creates a search template.

@digitalcraftsman digitalcraftsman deleted the feature/content-index branch December 26, 2016 17:01
@github-actions

This pull request has been automatically locked since there has not been any recent activity after it was closed. Please open a new issue for related bugs.

@github-actions github-actions bot locked as resolved and limited conversation to collaborators Feb 18, 2022