Configuration

Vincent Foulon edited this page Jan 16, 2020 · 13 revisions

Basic Configuration

You can configure the search engine by passing an array as the first parameter of the constructor:

$engine = new Engine($myConfiguration);

Here's the default configuration array:

$default = [
    "config" => [
        "var_dir" => $_SERVER['DOCUMENT_ROOT'].DIRECTORY_SEPARATOR."var",
        "index_dir" => DIRECTORY_SEPARATOR."engine".DIRECTORY_SEPARATOR."index",
        "documents_dir" => DIRECTORY_SEPARATOR."engine".DIRECTORY_SEPARATOR."documents",
        "cache_dir" => DIRECTORY_SEPARATOR."engine".DIRECTORY_SEPARATOR."cache",
        "fuzzy_cost" => 1,
        "connex" => [
            'threshold' => 0.9,
            'min' => 3,
            'max' => 10,
            'limitToken' => 20,
            'limitDocs' => 10
        ],
        "serializableObjects" => [
            DateTime::class => function($datetime) { /** @var DateTime $datetime */ return $datetime->getTimestamp(); }
        ]
    ],
    "schemas" => [
        "example-post" => [
            "title" => [
                "_type" => "string",
                "_indexed" => true,
                "_boost" => 10
            ],
            "content" => [
                "_type" => "text",
                "_indexed" => true,
                "_boost" => 0.5
            ],
            "date" => [
                "_type" => "datetime",
                "_indexed" => true,
                "_boost" => 2
            ],
            "categories" => [
                "_type" => "list",
                "_type." => "string",
                "_indexed" => true,
                "_filterable" => true,
                "_boost" => 6
            ],
            "comments" => [
                "_type" => "list",
                "_type." => "array",
                "_array" => [
                    "author" => [
                        '_type' => "string",
                        "_indexed" => true,
                        "_filterable" => true,
                        "_boost" => 1
                    ],
                    "date" => [
                        "_type" => "datetime",
                        "_indexed" => true,
                        "_boost" => 0
                    ],
                    "message" => [
                        "_type" => "text",
                        "_indexed" => true,
                        "_boost" => 0.1
                    ]
                ]
            ]
        ]
    ],
    "types" => [
        "datetime" => [
            DateFormatTokenizer::class,
            DateSplitTokenizer::class
        ],
        "_default" => [
            LowerCaseTokenizer::class,
            WhiteSpaceTokenizer::class,
            TrimPunctuationTokenizer::class
        ]
    ]
];
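
In practice you rarely need to repeat the whole array. The sketch below assumes that the engine layers your array over these defaults key by key, which this page does not state explicitly, so verify it against your installed version. It uses PHP's `array_replace_recursive` on a trimmed-down copy of the defaults to show what such a merge would produce. `$defaults`, `$override` and `$merged` are illustrative names.

```php
<?php
// Trimmed-down stand-in for the default configuration shown above.
$defaults = [
    "config" => [
        "fuzzy_cost" => 1,
        "connex" => [
            "threshold" => 0.9,
            "min" => 3,
            "max" => 10,
        ],
    ],
];

// Only the keys you want to change.
$override = [
    "config" => [
        "fuzzy_cost" => 2,
        "connex" => ["max" => 20],
    ],
];

// Assumption: the engine combines both arrays roughly like this.
$merged = array_replace_recursive($defaults, $override);

print_r($merged["config"]);
```

With this kind of merge, "fuzzy_cost" and "connex.max" take the overridden values while "connex.threshold" and "connex.min" keep their defaults.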

Configuring the search engine

The "config" section

This section defines the engine's parameters, such as working directories.

  • var_dir: the root directory where the engine's files will be created.
  • index_dir: the name of the subdirectory where the index will be built.
  • documents_dir: the subdirectory where every document will be stored.
  • cache_dir: the cache subdirectory. Make sure it contains nothing other than the engine's cache files.
  • fuzzy_cost: the cost allowed by the fuzzy search's approximation function, that is, how many characters the user can mistype (see the examples in the release notes for 0.5).
    Note: a greater value is more CPU-intensive, and too high a value makes the results less accurate.
  • connex: (since 1.0) Connex Search configuration. This is an array of the following values:
    • threshold: (ratio between 0 and 1) every document whose score meets this threshold will be included in the connex search.
    • min: minimum number of documents that will be internally included in the connex search.
    • max: maximum number of documents that will be internally included in the connex search.
    • limitToken: maximum number of tokens that will be retained in the connex search.
    • limitDocs: maximum number of documents that will be returned from the connex search.
  • serializableObjects: an array with fully qualified class names (FQCN) as keys and closures as values, which allows objects to be serialized in the index. Make sure each closure returns a string or a number that represents your object as accurately as possible, because this value can be used in a search query. Only DateTime objects are supported natively for now.
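
As an illustration of the options above, here is a minimal sketch of a custom "config" section. The `Money` class, the directory path and the chosen values are hypothetical and exist only for this example. It registers a closure for `Money` in "serializableObjects" and then calls that closure directly to show the value that would end up in the index.

```php
<?php
// Hypothetical value object, used only for this illustration.
final class Money
{
    public $amount;   // integer amount in cents
    public $currency; // currency code such as "EUR"

    public function __construct(int $amount, string $currency)
    {
        $this->amount = $amount;
        $this->currency = $currency;
    }
}

$myConfiguration = [
    "config" => [
        // Root directory for all engine files (hypothetical path).
        "var_dir" => __DIR__ . DIRECTORY_SEPARATOR . "var",
        // Dedicated cache subdirectory holding nothing but the engine's cache.
        "cache_dir" => DIRECTORY_SEPARATOR . "engine" . DIRECTORY_SEPARATOR . "cache",
        // Tolerate up to two mistyped characters, at a higher CPU cost.
        "fuzzy_cost" => 2,
        "connex" => [
            "limitDocs" => 5, // return at most 5 documents from the connex search
        ],
        "serializableObjects" => [
            // Class name as key, closure returning a string or a number as value.
            Money::class => function ($money) {
                /** @var Money $money */
                return $money->amount . " " . $money->currency;
            },
        ],
    ],
];

// Calling the closure directly shows the value that would be serialized.
$serialize = $myConfiguration["config"]["serializableObjects"][Money::class];
echo $serialize(new Money(1999, "EUR")), PHP_EOL; // prints "1999 EUR"
```

Whether the built-in DateTime closure stays active when you supply your own "serializableObjects" depends on how the engine combines your array with its defaults. If in doubt, repeat the DateTime entry in your own configuration.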

The "schemas" section

defining a schema

This section defines every schema that you want to index in the engine. A schema can be as long and as deeply nested as you need.

The "schemas" section is an associative array with the document type name as key and the corresponding schema as value.

Inside a schema array, each key is the name of a field and each value is the configuration array of that field.

Finally, the configuration of a field is also an associative array. Here is the list of available keys:

  • _type: the type of the field. This can be any value; you can customize the behavior of a type in the "types" section below. There are two special values, "list" and "array", described below.
  • _type.: the subtype of the field, required when the type is 'list'. It defines the type of the items in the list (e.g. '_type'=>'list','_type.'=>'datetime' defines the field as a list of dates).
  • _indexed: boolean that tells the engine whether the field is taken into account when indexing. If set to false, the field is stored in the document but you cannot search in it.
  • _filterable: optional boolean (default: false) that makes it possible to filter by the values of this field. Useful for searching within a specific category (in short: it enables faceting for this field).
  • _boost: float value used to compute the score of a document. The higher the boost of a field, the more its values count toward the final score.
  • _array: special parameter, required if the type or the subtype of the field is "array". See below for more information about the special type 'array'.
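
The two "required" rules in this list ('_type.' for a list, '_array' for an array) are easy to get wrong in a large schema. The helper below is an illustrative sketch, not part of the library: `findSchemaProblems` is a hypothetical name, and it only checks the rules stated on this page.

```php
<?php
/**
 * Illustrative helper (not part of the library): returns a list of problems
 * found in a schema, based only on the rules described on this page.
 */
function findSchemaProblems(array $schema, string $path = ""): array
{
    $problems = [];
    foreach ($schema as $name => $field) {
        $here = $path . $name;
        $type = $field["_type"] ?? null;
        $subtype = $field["_type."] ?? null;

        // A list needs a subtype describing its items.
        if ($type === "list" && $subtype === null) {
            $problems[] = "{$here}: '_type.' is required for a list";
        }
        // An array (as type or subtype) needs a nested schema.
        if ($type === "array" || $subtype === "array") {
            if (!isset($field["_array"]) || !is_array($field["_array"])) {
                $problems[] = "{$here}: '_array' is required for an array";
            } else {
                // The nested schema follows the same rules.
                $nested = findSchemaProblems($field["_array"], $here . ".");
                $problems = array_merge($problems, $nested);
            }
        }
    }
    return $problems;
}

$schema = [
    "title" => ["_type" => "string", "_indexed" => true, "_boost" => 10],
    "tags" => ["_type" => "list", "_indexed" => true, "_boost" => 2], // no '_type.'
    "comments" => [
        "_type" => "list",
        "_type." => "array",
        "_array" => [
            "author" => ["_type" => "string", "_indexed" => true, "_boost" => 1],
        ],
    ],
];

print_r(findSchemaProblems($schema));
```

Here only the "tags" field is reported, because it is declared as a list without a subtype.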

special types 'list' and 'array'

These two types cannot be used in the index, so naming them in the "types" section will do nothing.

list

The list type will make a field multivalued. Instead of having one value in the field, you'll have an array of values, whose type will be defined by the subtype key '_type.'

example:

$schema = [
    "categories" => [
        "_type" => "list",
        "_type." => "string",
        "_indexed" => true,
        "_filterable" => true,
        "_boost" => 6
    ]
];

array

The array type defines a sub-schema inside your field. You should use it as the subtype of a 'list'. When a field uses the array type, every parameter other than '_type' and '_type.' is ignored, and an '_array' parameter is required. This parameter contains another schema structure, which is nested into the current schema.

See the 'comments' field in the default configuration above for an example of array.
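
For reference, here is the same kind of field in isolation, next to a document fragment that would fit it. The document layout (one associative array per comment, keyed like the sub-schema) is an assumption made for illustration, and the sample values are invented. The loop at the end is only a sanity check written for this example.

```php
<?php
$schema = [
    "comments" => [
        "_type" => "list",   // the field holds several values...
        "_type." => "array", // ...and each value follows the sub-schema below
        "_array" => [
            "author" => ["_type" => "string", "_indexed" => true, "_filterable" => true, "_boost" => 1],
            "message" => ["_type" => "text", "_indexed" => true, "_boost" => 0.1],
        ],
    ],
];

// Assumed document shape: one entry per comment, keyed like the sub-schema.
$document = [
    "comments" => [
        ["author" => "alice", "message" => "Nice post"],
        ["author" => "bob", "message" => "Thanks for sharing"],
    ],
];

// Check that every comment only uses keys declared in the sub-schema.
$allowed = array_keys($schema["comments"]["_array"]);
foreach ($document["comments"] as $comment) {
    $unknown = array_diff(array_keys($comment), $allowed);
    echo $comment["author"], ": ", $unknown ? "unknown keys" : "ok", PHP_EOL;
}
```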

The "types" section

This section lets you customize the tokenization of any type, that is, how values are sliced into small 'tokens' that help the engine find your documents. The special type "_default" defines the default tokenization: it is used for any type in your schema that is not defined in this section. There is also a special type named "datetime", which converts a value into a DateTime instance. Its tokenization is different as well, as you can see in the default configuration above.

In this section, each key is the name of a type and each value is a list of classes implementing TokenizerInterface. You can define your own tokenizers or use the default ones shipped with the engine.

Special types

  • _default: fallback type used when a "_type" or "_type." is not known.
  • search: type used internally to tokenize search terms. If not configured, it falls back to "_default".
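
Putting this together, a custom "types" section could look like the sketch below. It only uses the tokenizer classes named in the default configuration. The `VFou\Search\Tokenizers` namespace in the imports is an assumption, so adjust it to match your installed version. The "keyword" type is a hypothetical name: with only LowerCaseTokenizer in its list, values would presumably be lowercased but not split on whitespace.

```php
<?php
// Assumption: the tokenizers live in this namespace; adjust the imports
// to match the library version you have installed.
use VFou\Search\Tokenizers\DateFormatTokenizer;
use VFou\Search\Tokenizers\DateSplitTokenizer;
use VFou\Search\Tokenizers\LowerCaseTokenizer;
use VFou\Search\Tokenizers\TrimPunctuationTokenizer;
use VFou\Search\Tokenizers\WhiteSpaceTokenizer;

$types = [
    // Same tokenizers as the default configuration.
    "datetime" => [
        DateFormatTokenizer::class,
        DateSplitTokenizer::class,
    ],
    // Hypothetical custom type, referenced from a schema with "_type" => "keyword".
    "keyword" => [
        LowerCaseTokenizer::class,
    ],
    // Tokenization applied to search terms; falls back to "_default" if omitted.
    "search" => [
        LowerCaseTokenizer::class,
        WhiteSpaceTokenizer::class,
    ],
    // Fallback for every type not listed here.
    "_default" => [
        LowerCaseTokenizer::class,
        WhiteSpaceTokenizer::class,
        TrimPunctuationTokenizer::class,
    ],
];

print_r(array_keys($types));
```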