Token filters accept a stream of tokens from a tokenizer and can modify tokens (e.g., lowercasing) or delete tokens (e.g., removing stop words).
The format of the token filter definition is as follows:
{
    "name": <TOKEN_FILTER_NAME>,
    "options": <TOKEN_FILTER_OPTIONS>
}
<TOKEN_FILTER_NAME>
: The name of the token filter.

<TOKEN_FILTER_OPTIONS>
: The options for the token filter. Filters that take no options omit this field.
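To make this concrete, here is a minimal, illustrative Python sketch (not the engine's actual implementation) showing a tokenizer's output passing through a chain of token filters, one that modifies tokens and one that deletes them:

```python
# Illustrative only: a token filter is modeled as a function that maps
# a list of tokens to a new list of tokens.

def lower_case(tokens):
    # Modifies tokens (lowercasing).
    return [t.lower() for t in tokens]

def stop_tokens(tokens, stop_words=frozenset({"the", "a", "an"})):
    # Deletes tokens (stop word removal).
    return [t for t in tokens if t not in stop_words]

tokens = ["The", "Quick", "Fox"]  # stream produced by a tokenizer
for token_filter in (lower_case, stop_tokens):
    tokens = token_filter(tokens)
print(tokens)  # ['quick', 'fox']
```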
The following token filters are available:
- Apostrophe
- Camel Case
- Dictionary Compound
- Edge Ngram
- Elision
- Keyword Marker
- Length
- Lower Case
- Ngram
- Porter Stemmer
- Reverse
- Shingle
- Stop Tokens
- Truncate
- Unicode Normalize
- Unique Term
Removes all characters after an apostrophe from each token (including the apostrophe itself).
Example:
{
    "name": "apostrophe"
}
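A minimal sketch of this behavior (assuming the apostrophe itself is also dropped; the function name is illustrative, not part of the engine):

```python
def apostrophe_filter(tokens):
    # Keep only the part of each token before the first apostrophe.
    return [t.split("'", 1)[0] for t in tokens]

print(apostrophe_filter(["John's", "dogs"]))  # ['John', 'dogs']
```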
Splits CamelCase tokens into their component words. For example, `GoLang` will be split into `Go` and `Lang`.
Example:
{
    "name": "camel_case"
}
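A rough sketch of the splitting behavior (real camel-case handling covers more edge cases such as acronyms and digits; this is illustration only):

```python
import re

def camel_case_filter(tokens):
    # Split each token at lower-to-upper case boundaries, e.g. "GoLang" -> ["Go", "Lang"].
    return [part for t in tokens
            for part in re.findall(r"[A-Z]+[a-z0-9]*|[a-z0-9]+", t)]

print(camel_case_filter(["GoLang"]))  # ['Go', 'Lang']
```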
Splits tokens into sub-tokens based on a dictionary of words. In the following example, `softball` will be split into the two tokens `soft` and `ball`.
Example:
{
    "name": "dictionary_compound",
    "options": {
        "words": [
            "soft",
            "softest",
            "ball"
        ],
        "min_word_size": 5,
        "min_sub_word_size": 2,
        "max_sub_word_size": 15,
        "only_longest_match": false
    }
}
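A rough Python sketch of the idea, ignoring the size limits and only_longest_match (illustrative, not the actual algorithm):

```python
def dictionary_compound_filter(tokens, words):
    # Emit every dictionary word found inside the token; keep the token if none match.
    out = []
    for t in tokens:
        sub_words = [w for w in words if w in t]
        out.extend(sub_words if sub_words else [t])
    return out

print(dictionary_compound_filter(["softball"], ["soft", "softest", "ball"]))  # ['soft', 'ball']
```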
Generates edge n-gram tokens of sizes within the given range.
Example:
{
    "name": "edge_ngram",
    "options": {
        "back": false,
        "min_length": 1,
        "max_length": 2
    }
}
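An illustrative sketch of edge n-gram generation with the options above (assuming `back` anchors the n-grams at the end of the token; the function name is hypothetical):

```python
def edge_ngram_filter(tokens, min_length=1, max_length=2, back=False):
    # Emit n-grams anchored at the front (or back) of each token.
    out = []
    for t in tokens:
        for n in range(min_length, min(max_length, len(t)) + 1):
            out.append(t[-n:] if back else t[:n])
    return out

print(edge_ngram_filter(["park"]))             # ['p', 'pa']
print(edge_ngram_filter(["park"], back=True))  # ['k', 'rk']
```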
Removes elided articles, i.e., the prefixes listed in the `articles` option together with the apostrophe that follows them. In the example below, the token `ar'word` will be output as `word`.
Example:
{
    "name": "elision",
    "options": {
        "articles": [
            "ar"
        ]
    }
}
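A minimal sketch of the elision behavior described above (illustrative only):

```python
def elision_filter(tokens, articles):
    # Drop a leading article together with the apostrophe that follows it.
    out = []
    for t in tokens:
        prefix, sep, rest = t.partition("'")
        out.append(rest if sep and prefix in articles else t)
    return out

print(elision_filter(["ar'word", "other"], ["ar"]))  # ['word', 'other']
```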
Sets the token's `KeyWord` flag to `true` for tokens that match one of the strings specified by the `keywords` option. This marks them as special tokens, for example so that a later stemming filter leaves them unchanged.
Example:
{
    "name": "keyword_marker",
    "options": {
        "keywords": [
            "walk",
            "park"
        ]
    }
}
Removes tokens shorter or longer than specified character lengths.
Example:
{
    "name": "length",
    "options": {
        "min_length": 3,
        "max_length": 4
    }
}
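A sketch of the length check with the options above (illustrative only):

```python
def length_filter(tokens, min_length=3, max_length=4):
    # Keep only tokens whose length falls within [min_length, max_length].
    return [t for t in tokens if min_length <= len(t) <= max_length]

print(length_filter(["a", "walk", "in", "the", "park"]))  # ['walk', 'the', 'park']
```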
Changes token text to lowercase.
Example:
{
    "name": "lower_case"
}
Forms n-grams of specified lengths from a token.
Example:
{
    "name": "ngram",
    "options": {
        "min_length": 1,
        "max_length": 2
    }
}
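An illustrative sketch of n-gram generation with `min_length` 1 and `max_length` 2 (the real filter may emit the grams in a different order):

```python
def ngram_filter(tokens, min_length=1, max_length=2):
    # Emit every n-gram of each allowed length from every position in the token.
    out = []
    for t in tokens:
        for n in range(min_length, max_length + 1):
            out.extend(t[i:i + n] for i in range(len(t) - n + 1))
    return out

print(ngram_filter(["park"]))  # ['p', 'a', 'r', 'k', 'pa', 'ar', 'rk']
```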
Provides algorithmic stemming, based on the Porter stemming algorithm.
Example:
{
    "name": "porter_stemmer"
}
Reverses each token in a stream.
Example:
{
    "name": "reverse"
}
Adds shingles, or word n-grams, to a token stream by concatenating adjacent tokens.
Example:
{
    "name": "shingle",
    "options": {
        "min_length": 2,
        "max_length": 2,
        "output_original": true,
        "token_separator": " ",
        "fill": "_"
    }
}
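A simplified sketch of shingling (the `fill` value for missing positions is ignored here; illustrative only, not the engine's implementation):

```python
def shingle_filter(tokens, min_length=2, max_length=2,
                   output_original=False, token_separator=" "):
    # Concatenate each run of adjacent tokens into a shingle.
    out = list(tokens) if output_original else []
    for n in range(min_length, max_length + 1):
        for i in range(len(tokens) - n + 1):
            out.append(token_separator.join(tokens[i:i + n]))
    return out

print(shingle_filter(["please", "divide", "this"]))  # ['please divide', 'divide this']
print(shingle_filter(["please", "divide", "this"], output_original=True))
# ['please', 'divide', 'this', 'please divide', 'divide this']
```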
Removes stop words from a token stream.
Example:
{
    "name": "stop_tokens",
    "options": {
        "stop_tokens": [
            "a",
            "an",
            "and",
            "are",
            "as",
            "at",
            "be",
            "but",
            "by",
            "for",
            "if",
            "in",
            "into",
            "is",
            "it",
            "no",
            "not",
            "of",
            "on",
            "or",
            "such",
            "that",
            "the",
            "their",
            "then",
            "there",
            "these",
            "they",
            "this",
            "to",
            "was",
            "will",
            "with"
        ]
    }
}
Truncates tokens that exceed a specified character limit.
Example:
{
    "name": "truncate",
    "options": {
        "length": 5
    }
}
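A one-line sketch of truncation to 5 characters (illustrative only):

```python
def truncate_filter(tokens, length=5):
    # Cut each token down to at most `length` characters.
    return [t[:length] for t in tokens]

print(truncate_filter(["information", "park"]))  # ['infor', 'park']
```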
Performs Unicode normalization. The `form` option can be set to one of the following values:
- NFD
- NFC
- NFKD
- NFKC
Example:
{
    "name": "unicode_normalize",
    "options": {
        "form": "NFKC"
    }
}
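The effect can be reproduced with Python's standard unicodedata module (a sketch of the behavior, not the engine's code):

```python
import unicodedata

def unicode_normalize_filter(tokens, form="NFKC"):
    # Apply the configured Unicode normalization form to each token.
    return [unicodedata.normalize(form, t) for t in tokens]

print(unicode_normalize_filter(["ｱｯﾌﾟﾙ", "①"]))  # ['アップル', '1']
```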
Removes duplicate tokens from a stream.
Example:
{
    "name": "unique_term"
}
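A sketch of duplicate removal that keeps the first occurrence of each token (illustrative only):

```python
def unique_term_filter(tokens):
    # Keep only the first occurrence of each token, preserving order.
    seen = set()
    out = []
    for t in tokens:
        if t not in seen:
            seen.add(t)
            out.append(t)
    return out

print(unique_term_filter(["to", "be", "or", "not", "to", "be"]))  # ['to', 'be', 'or', 'not']
```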