How to strip html tags before generating searchIndex.json? #12

Open
abhijeetvramgir opened this issue Sep 21, 2016 · 1 comment

Comments

@abhijeetvramgir

This is my lunr snippet from the build file:

.use(lunr({
  preprocess: function(content) {
    // Replace all occurrences of __title__ with the current file's title metadata.
    return content.replace(/__title__/g, this.title);
  }
}))

How do I strip HTML tags?

@janthonyeconomist

I'm doing this to a) strip HTML, b) transliterate, and c) strip punctuation:

preprocess: function(content) {
  // Transliterate via a lookup table; unmapped characters pass through unchanged.
  const tr = (str) => {
    const map = {"а":"a" /* truncated for diff */ };
    let new_str = "";
    for (let i = 0; i < str.length; i++) {
      const char = str[i];
      const substitute = map[char];
      // Compare against undefined so a mapping to "" is still applied.
      new_str += substitute !== undefined ? substitute : char;
    }
    return new_str;
  };
  const text = content.replace(/<[^>]+>/g, ' '); // Strip HTML
  return tr(text) // Transliterate foreign characters
    .replace(/[^\w]/g, ' '); // Strip punctuation (anything outside [A-Za-z0-9_])
}

That seems to remove the HTML and punctuation from the contents; however, I think some punctuation is still getting through to the index in other fields. Is that right?
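For checking this outside the build, here is a minimal standalone sketch of the same three steps. `cleanText` and `translitMap` are hypothetical names, and the three-entry map is only an illustrative subset, not the real table. I haven't verified whether `preprocess` is applied to anything other than the file contents; if it isn't, other indexed fields such as the title would need the same cleaning before they reach the plugin.

```javascript
// Illustrative subset only (assumption): the real transliteration table is much larger.
const translitMap = { "а": "a", "б": "b", "в": "v" };

// Hypothetical helper: strip HTML, transliterate, then strip punctuation.
function cleanText(input) {
  return String(input)
    .replace(/<[^>]+>/g, " ") // strip HTML tags
    .replace(/./gu, (ch) => (translitMap[ch] !== undefined ? translitMap[ch] : ch)) // transliterate mapped characters
    .replace(/[^\w]/g, " ") // replace anything outside [A-Za-z0-9_] with a space
    .replace(/\s+/g, " ") // collapse the runs of spaces left behind
    .trim();
}

console.log(cleanText("<p>Hello, <b>world</b>!</p>")); // prints "Hello world"
```

Two things to keep in mind: `\w` is ASCII-only, so any non-ASCII letter that is missing from the map gets replaced by a space, and `\w` includes the underscore, so a placeholder like `__title__` passes through untouched.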
