Char Filters

Character filters are used to preprocess the stream of characters before it is passed to the tokenizer.

The format of the char filter definition is as follows:

{
    "name": <CHAR_FILTER_NAME>,
    "options": <CHAR_FILTER_OPTIONS>
}
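  • <CHAR_FILTER_NAME>: The name of the char filter to use, for example "ascii_folding" or "regex".
  • <CHAR_FILTER_OPTIONS>: A JSON object of filter-specific options. Filters that take no options, such as "html", omit it.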
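
As a rough illustration of the definition format, the sketch below parses a definition and splits it into its name and options. The helper name `parse_char_filter` is hypothetical, and treating "options" as optional is an assumption drawn from the examples in this document, several of which omit it.

```python
import json


def parse_char_filter(definition: str) -> tuple[str, dict]:
    """Parse a char filter definition and return (name, options)."""
    data = json.loads(definition)
    name = data["name"]                # present in every example in this document
    options = data.get("options", {})  # assumption: optional, defaults to empty
    return name, options


name, options = parse_char_filter(
    '{"name": "regex", "options": {"pattern": "foo", "replacement": "var"}}'
)
print(name)     # regex
print(options)  # {'pattern': 'foo', 'replacement': 'var'}
```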

The following char filters are available:

  • ASCII folding
  • HTML
  • Regular Expression
  • Unicode Normalize
  • Zero width non-joiner

ASCII folding

Converts alphabetic, numeric, and symbolic characters that are not in the Basic Latin Unicode block (the first 128 Unicode characters, U+0000 to U+007F) to their ASCII equivalent, if one exists. For example, the filter changes à to a.

Example:

{
    "name": "ascii_folding"
}
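
To give a feel for the effect, here is a minimal Python approximation of ASCII folding. It is not this filter's implementation: it decomposes characters and drops combining marks, so it handles accented letters such as à but leaves characters without a decomposition (for example ø) unchanged. The function name `ascii_fold` is made up for this sketch.

```python
import unicodedata


def ascii_fold(text: str) -> str:
    """Approximate ASCII folding by stripping combining marks."""
    # NFKD splits "à" into "a" followed by U+0300 (combining grave accent).
    decomposed = unicodedata.normalize("NFKD", text)
    # Drop the combining marks and keep the base characters.
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))


print(ascii_fold("à"))     # a
print(ascii_fold("café"))  # cafe
```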

HTML

Replaces HTML tags with whitespace ( ).

Example:

{
    "name": "html"
}
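
The behavior can be sketched in Python as follows. The tag pattern here is a deliberately simple assumption (anything between `<` and `>`), and the real filter may recognize tags differently.

```python
import re

# Simplified assumption: a tag is "<", one or more non-">" characters, then ">".
TAG_PATTERN = re.compile(r"<[^>]+>")


def replace_html_tags(text: str) -> str:
    """Replace each HTML tag with a single space."""
    return TAG_PATTERN.sub(" ", text)


print(repr(replace_html_tags("<p>Hello</p>")))  # ' Hello '
```

Replacing tags with a space rather than deleting them keeps the words on either side of a tag from being merged into one token.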

Regular Expression

Replaces characters that match the regular expression with the specified characters.

Example:

{
    "name": "regex",
    "options": {
        "pattern": "foo",
        "replacement": "var"
    }
}
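
With the options above, every match of "foo" is rewritten to "var". The Python sketch below mirrors that using the standard `re` module. The regular expression syntax supported by the actual filter may differ from Python's, and the function name is illustrative only.

```python
import re


def regex_char_filter(text: str, pattern: str, replacement: str) -> str:
    """Replace every match of `pattern` in `text` with `replacement`."""
    return re.sub(pattern, replacement, text)


print(regex_char_filter("foo foobar", "foo", "var"))  # var varbar
```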

Unicode Normalize

Performs Unicode normalization. The "form" option accepts one of the following values:

  • NFD
  • NFC
  • NFKD
  • NFKC

Example:

{
    "name": "unicode_normalize",
    "options": {
        "form": "NFKC"
    }
}
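
The four forms differ in whether characters are composed or decomposed, and in whether compatibility characters (such as full-width letters or ligatures) are replaced. The sketch below shows the difference using Python's standard `unicodedata` module. The wrapper function is hypothetical and is not this filter's implementation.

```python
import unicodedata

VALID_FORMS = {"NFD", "NFC", "NFKD", "NFKC"}


def unicode_normalize(text: str, form: str) -> str:
    """Normalize `text` to the given Unicode normalization form."""
    if form not in VALID_FORMS:
        raise ValueError(f"unsupported form: {form}")
    return unicodedata.normalize(form, text)


# NFKC folds full-width compatibility letters to their ASCII counterparts.
print(unicode_normalize("ｆｕｌｌ", "NFKC"))       # full
# NFC composes "e" + combining acute accent into one code point.
print(len(unicode_normalize("e\u0301", "NFC")))  # 1
# NFD decomposes "é" into two code points.
print(len(unicode_normalize("\u00e9", "NFD")))   # 2
```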

Zero width non-joiner

Replaces zero width non-joiner characters (U+200C) with whitespace ( ).

Example:

{
    "name": "zero_width_non_joiner"
}
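
The zero width non-joiner is invisible, so its effect is easiest to see in code. The sketch below is a plain Python equivalent of the replacement described above. The function name is illustrative only.

```python
ZWNJ = "\u200c"  # ZERO WIDTH NON-JOINER


def replace_zwnj(text: str) -> str:
    """Replace each zero width non-joiner with a single space."""
    return text.replace(ZWNJ, " ")


print(repr(replace_zwnj("a\u200cb")))  # 'a b'
```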