Character filters are used to preprocess the stream of characters before it is passed to the tokenizer.
The format of the char filter definition is as follows:
{
    "name": <CHAR_FILTER_NAME>,
    "options": <CHAR_FILTER_OPTIONS>
}
- <CHAR_FILTER_NAME>: The name of the char filter.
- <CHAR_FILTER_OPTIONS>: The options for the char filter, if it takes any.
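As a rough illustration of the format above, the following Python sketch builds definitions as plain dictionaries and prints them as JSON. The helper name `char_filter_definition` is hypothetical and is not part of any documented API; leaving out the `options` key for filters without options is an assumption based on the examples in this section.

```python
import json


def char_filter_definition(name, options=None):
    """Build a char filter definition as a plain dict (hypothetical helper)."""
    definition = {"name": name}
    # Assumption: filters without options simply omit the "options" key,
    # as the option-free examples in this section do.
    if options is not None:
        definition["options"] = options
    return definition


definitions = [
    char_filter_definition("ascii_folding"),
    char_filter_definition("unicode_normalize", {"form": "NFKC"}),
]

# Serialize the definitions in the same JSON shape the section describes.
print(json.dumps(definitions, indent=4))
```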
The following char filters are available:
- ASCII folding
- HTML
- Regular Expression
- Unicode Normalize
- Zero width non-joiner
The ASCII folding char filter converts alphabetic, numeric, and symbolic characters that are not in the Basic Latin Unicode block (the first 128 code points, U+0000 to U+007F) to their ASCII equivalents, if one exists. For example, the filter changes à to a.
Example:
{
    "name": "ascii_folding"
}
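To make the effect of ASCII folding concrete, here is a minimal Python sketch that approximates it with the standard library. It is not the filter's implementation: it only strips combining marks after decomposition, so characters without a decomposition (for example ß) pass through unchanged, whereas a real folding table would likely map them too. The function name `fold_to_ascii` is made up for this example.

```python
import unicodedata


def fold_to_ascii(text):
    """Approximate ASCII folding by removing combining marks (illustrative only)."""
    # NFKD splits characters such as "à" into a base letter plus a combining accent.
    decomposed = unicodedata.normalize("NFKD", text)
    # Keep everything except the combining marks produced by the decomposition.
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))


print(fold_to_ascii("à"))
print(fold_to_ascii("Crème Brûlée"))
```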
The HTML char filter replaces HTML tags with whitespace.
Example:
{
    "name": "html"
}
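The behaviour described above can be sketched in Python with a simple regular expression. This is an assumption-laden approximation: the pattern below treats anything between `<` and `>` as a tag, and whether the real filter emits one space per tag or also handles comments and entities is not stated here. The names `TAG_PATTERN` and `replace_html_tags` are hypothetical.

```python
import re

# Naive tag matcher: anything from "<" up to the next ">".
TAG_PATTERN = re.compile(r"<[^>]*>")


def replace_html_tags(text):
    """Replace each HTML tag with a single space (illustrative only)."""
    return TAG_PATTERN.sub(" ", text)


cleaned = replace_html_tags("<p>Hello <b>world</b></p>")
# Splitting on whitespace shows the text that a tokenizer would see.
print(cleaned.split())
```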
The regular expression char filter replaces every part of the input that matches the pattern with the replacement string.
Example:
{
    "name": "regex",
    "options": {
        "pattern": "foo",
        "replacement": "var"
    }
}
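The effect of the example configuration can be sketched with Python's `re` module. This only illustrates the idea of pattern replacement; the regular expression syntax accepted by the actual filter may differ from Python's, and the function name `apply_regex_char_filter` is hypothetical.

```python
import re


def apply_regex_char_filter(text, pattern, replacement):
    """Replace every match of pattern in text with replacement (illustrative only)."""
    return re.sub(pattern, replacement, text)


# Mirrors the example options: pattern "foo", replacement "var".
print(apply_regex_char_filter("foo and food", "foo", "var"))
```

Note that the match is not limited to whole words, so the "foo" inside "food" is replaced as well.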
The Unicode normalize char filter performs Unicode normalization. The form option accepts one of the following values:
- NFD
- NFC
- NFKD
- NFKC
Example:
{
    "name": "unicode_normalize",
    "options": {
        "form": "NFKC"
    }
}
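The four forms are defined by the Unicode standard, so their effect can be shown with Python's `unicodedata` module. This sketch illustrates what each form does to a string; it makes no claim about how the filter itself is implemented, and the names `FORMS` and `normalize` are chosen for this example.

```python
import unicodedata

FORMS = ("NFD", "NFC", "NFKD", "NFKC")


def normalize(text, form="NFKC"):
    """Normalize text to one of the four Unicode normalization forms."""
    if form not in FORMS:
        raise ValueError(f"unsupported form: {form}")
    return unicodedata.normalize(form, text)


# Compatibility forms (NFKC, NFKD) map fullwidth letters to their ASCII counterparts.
print(normalize("\uff21\uff22\uff23", "NFKC"))
# Canonical decomposition (NFD) splits "é" into "e" plus a combining acute accent.
print(len(normalize("\u00e9", "NFD")))
# Canonical composition (NFC) joins them back into a single code point.
print(len(normalize("e\u0301", "NFC")))
```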
The zero width non-joiner char filter replaces each zero width non-joiner character (U+200C) with whitespace.
Example:
{
    "name": "zero_width_non_joiner"
}
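The replacement described above amounts to a single character substitution, sketched here in Python. The assumption is that each U+200C becomes exactly one space character; the names `ZWNJ` and `replace_zwnj` are hypothetical.

```python
ZWNJ = "\u200c"  # zero width non-joiner


def replace_zwnj(text):
    """Replace every zero width non-joiner with a space (illustrative only)."""
    return text.replace(ZWNJ, " ")


# The invisible joiner between the two parts becomes a visible word boundary.
print(replace_zwnj("می\u200cخواهم").split())
```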