Char Filters

Character filters are used to preprocess the stream of characters before it is passed to the tokenizer.

The format of the char filter definition is as follows:

{
    "name": <CHAR_FILTER_NAME>,
    "options": <CHAR_FILTER_OPTIONS>
}

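<CHAR_FILTER_NAME>: The name of the char filter to apply, for example "ascii_folding".
<CHAR_FILTER_OPTIONS>: An object holding the options for that char filter. Char filters that take no options omit this field.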

The following char filters are available:

ASCII folding
HTML
Regular Expression
Unicode Normalize
Zero width non-joiner

ASCII folding

Converts alphabetic, numeric, and symbolic characters that are not in the Basic Latin Unicode block (the first 128 code points, U+0000 to U+007F) to their ASCII equivalents, if they exist. For example, the filter changes à to a.

Example:

{
    "name": "ascii_folding"
}
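To make the folding behavior concrete, here is a minimal Python sketch that approximates it with the standard library. It is an assumption-laden illustration, not the filter's actual implementation: the function name ascii_fold is hypothetical, and the approach (compatibility decomposition followed by dropping combining marks) only covers characters that decompose into an ASCII base letter. Characters such as ß or ø, which a full folding table would typically map, pass through unchanged here.

```python
import unicodedata


def ascii_fold(text: str) -> str:
    """Approximate ASCII folding (hypothetical helper, not the real filter).

    Decomposes each character (NFKD) and drops combining marks, so that
    "à" (U+00E0) becomes "a" + U+0300 and then just "a".
    """
    decomposed = unicodedata.normalize("NFKD", text)
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))


print(ascii_fold("à la carte"))
```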

HTML

Replaces HTML tags with whitespace ( ).

Example:

{
    "name": "html"
}
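The effect can be sketched in Python as follows. This is only an illustration under stated assumptions: the function name strip_html_tags and the simple tag pattern are hypothetical, and the sketch does not try to handle comments, entities, or malformed markup, which the real filter may treat differently.

```python
import re

# Assumed, simplified notion of a tag: "<", any run of non-">" characters, ">".
TAG_PATTERN = re.compile(r"<[^>]+>")


def strip_html_tags(text: str) -> str:
    """Replace every match of the simplified tag pattern with one space."""
    return TAG_PATTERN.sub(" ", text)


print(repr(strip_html_tags("<p>Hello <b>world</b></p>")))
```

Replacing tags with a space instead of deleting them keeps words on either side of a tag from being fused into one token.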

Regular Expression

Replaces every part of the text that matches the regular expression given in "pattern" with the string given in "replacement".

Example:

{
    "name": "regex",
    "options": {
        "pattern": "foo",
        "replacement": "var"
    }
}
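As a rough illustration, the configuration above behaves like a global search and replace. The Python sketch below uses re.sub to show the idea; the function name regex_char_filter is hypothetical, and the regular expression syntax accepted by the actual filter may differ from Python's.

```python
import re


def regex_char_filter(text: str, pattern: str, replacement: str) -> str:
    """Replace every match of `pattern` in `text` with `replacement`."""
    return re.sub(pattern, replacement, text)


# Mirrors the example definition: pattern "foo", replacement "var".
print(regex_char_filter("foo bar foo", "foo", "var"))
```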

Unicode Normalize

Performs Unicode normalization. The "form" option accepts one of the following values:

NFD
NFC
NFKD
NFKC

Example:

{
    "name": "unicode_normalize",
    "options": {
        "form": "NFKC"
    }
}
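The four forms differ in whether they compose or decompose characters and whether they apply compatibility mappings. The following Python sketch shows the difference using the standard unicodedata module; the wrapper function unicode_normalize is a hypothetical stand-in for the filter, not its implementation.

```python
import unicodedata

VALID_FORMS = {"NFC", "NFD", "NFKC", "NFKD"}


def unicode_normalize(text: str, form: str = "NFKC") -> str:
    """Normalize `text` to the given Unicode normalization form."""
    if form not in VALID_FORMS:
        raise ValueError(f"unsupported form: {form}")
    return unicodedata.normalize(form, text)


# Fullwidth "ABC" (U+FF21..U+FF23) maps to ASCII under the compatibility forms.
print(unicode_normalize("\uff21\uff22\uff23", "NFKC"))

# "é" (U+00E9) splits into "e" plus a combining acute accent under NFD.
print(len(unicode_normalize("\u00e9", "NFD")))
```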

Zero width non-joiner

Replaces each zero width non-joiner character (U+200C) with whitespace ( ).

Example:

{
    "name": "zero_width_non_joiner"
}
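In Python terms, the substitution amounts to a plain string replacement. The sketch below is illustrative only; the function name replace_zwnj is hypothetical.

```python
ZWNJ = "\u200c"  # ZERO WIDTH NON-JOINER


def replace_zwnj(text: str) -> str:
    """Replace each zero width non-joiner with a single space."""
    return text.replace(ZWNJ, " ")


print(repr(replace_zwnj("a\u200cb")))
```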
