Using Stopwords

The removal of stopwords is handled by the stop token filter, which can be used when creating a custom analyzer (see Using the stop Token Filter). However, some out-of-the-box analyzers come with the stop filter pre-integrated:

Language analyzers

Each language analyzer defaults to using the appropriate stopwords list for that language. For instance, the english analyzer uses the english stopwords list.

standard analyzer

Defaults to the empty stopwords list, _none_, which effectively disables stopwords.

pattern analyzer

Defaults to _none_, like the standard analyzer.

Stopwords and the Standard Analyzer

To use custom stopwords with the standard analyzer, all we need to do is create a configured version of the analyzer and pass in the list of stopwords that we require:

PUT /my_index
{
  "settings": {
    "analysis": {
      "analyzer": {
        "my_analyzer": { (1)
          "type": "standard", (2)
          "stopwords": [ "and", "the" ] (3)
        }
      }
    }
  }
}
  1. This is a custom analyzer called my_analyzer.

  2. This analyzer is the standard analyzer with some custom configuration.

  3. The stopwords to filter out are "and" and "the".

Tip
This same technique can be used to configure custom stopword lists for any of the language analyzers.

Maintaining Positions

The output from the analyze API is quite interesting:

GET /my_index/_analyze?analyzer=my_analyzer
The quick and the dead

{
   "tokens": [
      {
         "token":        "quick",
         "start_offset": 4,
         "end_offset":   9,
         "type":         "<ALPHANUM>",
         "position":     2 (1)
      },
      {
         "token":        "dead",
         "start_offset": 18,
         "end_offset":   22,
         "type":         "<ALPHANUM>",
         "position":     5 (1)
      }
   ]
}
  1. Note the position of each token.

The stopwords have been filtered out, as expected, but the interesting part is that the positions of the two remaining terms are unchanged: quick is the second word in the original sentence, and dead is the fifth. This is important for phrase queries: if the positions of the terms had been adjusted to close the gaps, a phrase query for quick dead would have incorrectly matched the preceding example.
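To make the position argument concrete, here is a minimal Python sketch of a position-preserving stop filter. It is not how Elasticsearch or Lucene implements the filter: the function names analyze and phrase_matches are hypothetical, a lowercased whitespace split stands in for the standard analyzer, and positions are 1-based to match the output above.

```python
def analyze(text, stopwords):
    """Return (token, position) pairs, dropping stopwords but keeping positions.

    Assumption: lowercasing plus a whitespace split stands in for the
    standard analyzer; positions are 1-based to match the output above.
    """
    tokens = []
    for position, word in enumerate(text.lower().split(), start=1):
        if word not in stopwords:
            tokens.append((word, position))
    return tokens


def phrase_matches(indexed, phrase_terms):
    """Return True if phrase_terms occur at consecutive positions in indexed."""
    positions = {}
    for token, position in indexed:
        positions.setdefault(token, set()).add(position)
    first, *rest = phrase_terms
    for start in positions.get(first, ()):
        if all(start + offset in positions.get(term, ())
               for offset, term in enumerate(rest, start=1)):
            return True
    return False


STOPWORDS = {"and", "the"}
tokens = analyze("The quick and the dead", STOPWORDS)
print(tokens)  # [('quick', 2), ('dead', 5)]

# With the original positions kept, "quick dead" is not a phrase match.
print(phrase_matches(tokens, ["quick", "dead"]))  # False

# If the gaps were closed up instead, the phrase would match incorrectly.
collapsed = [(token, i) for i, (token, _) in enumerate(tokens, start=1)]
print(phrase_matches(collapsed, ["quick", "dead"]))  # True
```

The last two lines contrast the two behaviours: keeping the gaps left by removed stopwords is what stops quick and dead from looking adjacent.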

Specifying Stopwords

Stopwords can be passed inline, as we did in the previous example, by specifying an array:

"stopwords": [ "and", "the" ]

The default stopword list for a particular language can be specified using the _lang_ notation:

"stopwords": "_english_"

Tip
The predefined language-specific stopword lists available in Elasticsearch can be found in the stop token filter documentation.

Stopwords can be disabled by specifying the special list _none_. For instance, to use the english analyzer without stopwords, you can do the following:

PUT /my_index
{
  "settings": {
    "analysis": {
      "analyzer": {
        "my_english": {
          "type":      "english", (1)
          "stopwords": "_none_" (2)
        }
      }
    }
  }
}
  1. The my_english analyzer is based on the english analyzer.

  2. But stopwords are disabled.

Finally, stopwords can also be listed in a file with one word per line. The file must be present on all nodes in the cluster, and the path can be specified with the stopwords_path parameter:

PUT /my_index
{
  "settings": {
    "analysis": {
      "analyzer": {
        "my_english": {
          "type":           "english",
          "stopwords_path": "stopwords/english.txt" (1)
        }
      }
    }
  }
}
  1. The path to the stopwords file, relative to the Elasticsearch config directory.

Using the stop Token Filter

The stop token filter can be combined with a tokenizer and other token filters when you need to create a custom analyzer. For instance, let’s say that we wanted to create a Spanish analyzer with the following:

  • A custom stopwords list

  • The light_spanish stemmer

  • The asciifolding filter to remove diacritics

We could set that up as follows:

PUT /my_index
{
  "settings": {
    "analysis": {
      "filter": {
        "spanish_stop": {
          "type":        "stop",
          "stopwords": [ "si", "esta", "el", "la" ]  (1)
        },
        "light_spanish": { (2)
          "type":     "stemmer",
          "language": "light_spanish"
        }
      },
      "analyzer": {
        "my_spanish": {
          "tokenizer": "standard",
          "filter": [ (3)
            "lowercase",
            "asciifolding",
            "spanish_stop",
            "light_spanish"
          ]
        }
      }
    }
  }
}
  1. The stop token filter takes the same stopwords and stopwords_path parameters as the standard analyzer.

  2. See [algorithmic-stemmers].

  3. The order of token filters is important, as explained next.

We have placed the spanish_stop filter after the asciifolding filter. This means that esta, ésta, and está will first have their diacritics removed to become just esta, which will then be removed as a stopword. If, instead, we wanted to remove esta and ésta, but not está, we would have to put the spanish_stop filter before the asciifolding filter, and specify both words in the stopwords list.
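The effect of filter order can be sketched in a few lines of Python. This is an illustration only, not the Elasticsearch implementation: asciifold, make_stop, and run_chain are hypothetical helpers, Unicode decomposition is used as a rough stand-in for the asciifolding filter, the lowercase and stemmer steps are left out, and the word casa is added only so that something survives both chains.

```python
import unicodedata


def asciifold(token):
    """Strip combining diacritics: a rough stand-in for asciifolding."""
    decomposed = unicodedata.normalize("NFD", token)
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))


def make_stop(stopwords):
    """Build a stop filter that returns None for any token in stopwords."""
    return lambda token: None if token in stopwords else token


def run_chain(tokens, filters):
    """Apply each filter in order; a filter returns None to drop a token."""
    out = []
    for token in tokens:
        for token_filter in filters:
            token = token_filter(token)
            if token is None:
                break
        if token is not None:
            out.append(token)
    return out


words = ["esta", "ésta", "está", "casa"]

# Fold first, then stop: all three variants become "esta" and are removed.
fold_then_stop = run_chain(words, [asciifold, make_stop({"esta"})])
print(fold_then_stop)  # ['casa']

# Stop first, then fold: only the two listed forms are removed;
# "está" survives the stop filter and is then folded to "esta".
stop_then_fold = run_chain(words, [make_stop({"esta", "ésta"}), asciifold])
print(stop_then_fold)  # ['esta', 'casa']
```

In the second chain, the stop filter sees the tokens with their diacritics intact, which is why both esta and ésta have to be listed explicitly.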

Updating Stopwords

A few techniques can be used to update the list of stopwords used by an analyzer. Analyzers are instantiated at index creation time, when a node is restarted, or when a closed index is reopened.

If you specify stopwords inline with the stopwords parameter, your only option is to close the index and update the analyzer configuration with the update index settings API, then reopen the index.
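The close, update, and reopen sequence can be sketched as a small Python helper that only builds the three requests and sends nothing. The helper name stopword_update_steps is hypothetical, and the extra stopword "or" is an illustrative addition. The paths are the close index, update index settings, and open index endpoints, and each step can be handed to whatever HTTP client you use.

```python
import json


def stopword_update_steps(index, analyzer, analyzer_type, stopwords):
    """Return the (method, path, body) steps for changing inline stopwords.

    Nothing is sent here; each step is plain data for an HTTP client.
    """
    settings = {
        "analysis": {
            "analyzer": {
                analyzer: {"type": analyzer_type, "stopwords": list(stopwords)}
            }
        }
    }
    return [
        ("POST", f"/{index}/_close", None),        # close the index first
        ("PUT", f"/{index}/_settings", settings),  # update index settings API
        ("POST", f"/{index}/_open", None),         # reopening re-creates analyzers
    ]


steps = stopword_update_steps(
    "my_index", "my_analyzer", "standard", ["and", "the", "or"]
)
for method, path, body in steps:
    print(method, path)
    if body is not None:
        print(json.dumps(body, indent=2))
```

Keeping the steps as data makes the required order explicit: the settings update sits between the close and the open.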

Updating stopwords is easier if you specify them in a file with the stopwords_path parameter. You can just update the file (on every node in the cluster) and then force the analyzers to be re-created by either of these actions:

  • Closing and reopening the index (see open/close index), or

  • Restarting each node in the cluster, one by one

Of course, updating the stopwords list will not change any documents that have already been indexed. It will apply only to searches and to new or updated documents. To apply the changes to existing documents, you will need to reindex your data. See [reindex].