Configuration
You can configure the search engine by giving an array as the first parameter of the constructor:
$engine = new Engine($myConfiguration);
Here's the default configuration array:
$default = [
    "config" => [
        "var_dir" => $_SERVER['DOCUMENT_ROOT'].DIRECTORY_SEPARATOR."var",
        "index_dir" => DIRECTORY_SEPARATOR."engine".DIRECTORY_SEPARATOR."index",
        "documents_dir" => DIRECTORY_SEPARATOR."engine".DIRECTORY_SEPARATOR."documents",
        "cache_dir" => DIRECTORY_SEPARATOR."engine".DIRECTORY_SEPARATOR."cache",
        "fuzzy_cost" => 1,
        "connex" => [
            'threshold' => 0.9,
            'min' => 3,
            'max' => 10,
            'limitToken' => 20,
            'limitDocs' => 10
        ],
        "serializableObjects" => [
            DateTime::class => function($datetime) { /** @var DateTime $datetime */ return $datetime->getTimestamp(); }
        ]
    ],
    "schemas" => [
        "example-post" => [
            "title" => [
                "_type" => "string",
                "_indexed" => true,
                "_boost" => 10
            ],
            "content" => [
                "_type" => "text",
                "_indexed" => true,
                "_boost" => 0.5
            ],
            "date" => [
                "_type" => "datetime",
                "_indexed" => true,
                "_boost" => 2
            ],
            "categories" => [
                "_type" => "list",
                "_type." => "string",
                "_indexed" => true,
                "_filterable" => true,
                "_boost" => 6
            ],
            "comments" => [
                "_type" => "list",
                "_type." => "array",
                "_array" => [
                    "author" => [
                        "_type" => "string",
                        "_indexed" => true,
                        "_filterable" => true,
                        "_boost" => 1
                    ],
                    "date" => [
                        "_type" => "datetime",
                        "_indexed" => true,
                        "_boost" => 0
                    ],
                    "message" => [
                        "_type" => "text",
                        "_indexed" => true,
                        "_boost" => 0.1
                    ]
                ]
            ]
        ]
    ],
    "types" => [
        "datetime" => [
            DateFormatTokenizer::class,
            DateSplitTokenizer::class
        ],
        "_default" => [
            LowerCaseTokenizer::class,
            WhiteSpaceTokenizer::class,
            TrimPunctuationTokenizer::class
        ]
    ]
];
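As an illustration of how the working directories fit together, here is a minimal sketch that overrides the directory settings and prints the resulting paths. It assumes that each working directory is simply `var_dir` followed by the subdirectory value (which is why the defaults start with a directory separator); the `search-var` folder name is only an example, and the engine's actual path handling may differ.

```php
<?php
// Illustrative override of the "config" directories; the values are examples only.
$myConfiguration = [
    "config" => [
        "var_dir" => sys_get_temp_dir() . DIRECTORY_SEPARATOR . "search-var",
        "index_dir" => DIRECTORY_SEPARATOR . "engine" . DIRECTORY_SEPARATOR . "index",
        "documents_dir" => DIRECTORY_SEPARATOR . "engine" . DIRECTORY_SEPARATOR . "documents",
        "cache_dir" => DIRECTORY_SEPARATOR . "engine" . DIRECTORY_SEPARATOR . "cache",
        "fuzzy_cost" => 1,
    ],
];

// Assumption: a working directory is var_dir followed by the subdirectory value.
$paths = [];
foreach (["index_dir", "documents_dir", "cache_dir"] as $key) {
    $paths[$key] = $myConfiguration["config"]["var_dir"] . $myConfiguration["config"][$key];
}

print_r($paths);

// Once the library is loaded, the array is passed to the constructor:
// $engine = new Engine($myConfiguration);
```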
The "config" section defines the engine's parameters, such as working directories.
- var_dir: The root directory where the engine's files will be created.
- index_dir: The name of the subdirectory where the index will be built.
- documents_dir: The subdirectory where every document will be stored.
- cache_dir: The cache subdirectory. Make sure this subdirectory contains nothing but the engine's cache files.
- fuzzy_cost: The cost of the fuzzy search's approximation function. The number represents how many characters the user can mistype; see the examples in the release note for 0.5. Note: a greater value is more CPU-intensive, and a value that is too high will not help find results accurately.
- connex: (since 1.0) Connex Search configuration. This is an array of the following values:
    - threshold: (percentage, 0-1) Every document with a score that matches this threshold will be included in the connex search.
    - min: Minimum number of documents that will be internally included in the connex search.
    - max: Maximum number of documents that will be internally included in the connex search.
    - limitToken: Maximum number of tokens that will be retained in the connex search.
    - limitDocs: Maximum number of documents that will be returned from the connex search.
- serializableObjects: A list with fully qualified class names as keys and closures as values, which allows objects to be serialized in the index. Be sure to return a string or a number from the closure, with the most accurate representation of your object (because it can be used in a search query). Only DateTime objects are natively supported for now.
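To illustrate the serializableObjects entry, here is a minimal sketch that registers a closure for a custom class next to the default DateTime one. The `Price` class and the `serializeValue()` helper are hypothetical and exist only for this example; how the engine actually looks up and calls the closures is not described on this page.

```php
<?php
// Hypothetical value object, used only for this illustration.
final class Price
{
    private $cents;

    public function __construct(int $cents)
    {
        $this->cents = $cents;
    }

    public function getCents(): int
    {
        return $this->cents;
    }
}

// Class name as key, closure returning a string or a number as value.
$serializableObjects = [
    DateTime::class => function ($datetime) { return $datetime->getTimestamp(); },
    Price::class => function ($price) { return $price->getCents(); },
];

// Hypothetical helper: picks the closure registered for the object's class.
function serializeValue($value, array $serializers)
{
    if (!is_object($value)) {
        return $value; // scalars are left untouched
    }
    $class = get_class($value);
    if (!isset($serializers[$class])) {
        throw new InvalidArgumentException("No serializer registered for " . $class);
    }
    return $serializers[$class]($value);
}

echo serializeValue(new Price(1999), $serializableObjects), PHP_EOL;        // 1999
echo serializeValue(new DateTime('@86400'), $serializableObjects), PHP_EOL; // 86400
```

Returning an integer here keeps the stored value usable in a search query, as recommended above.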
The "schemas" section defines every schema that you want to index in the engine. A schema can be as long and as deep as you want.
The "schemas" section is an associative array with the document type name as key and the corresponding schema as an array.
Inside a schema array, every field is listed with the name of the field as key and the configuration of the field as an array.
Finally, the configuration of a field is also an associative array. Here is the list of its parameters:
- _type: The type of the field. This can be any value; you can customize the behavior of a type with the "types" section below. There are two special values, "list" and "array", described below.
- _type.: The subtype of the field, required when the type is 'list'. It defines the type of the items in the list (e.g. '_type' => 'list', '_type.' => 'date' defines the field as a list of dates).
- _indexed: Boolean that tells the engine whether the field must be taken into account during indexing. If set to false, the field will be stored in the document but you will not be able to search in this field.
- _filterable: Optional boolean (default: false) that adds to the index the possibility to filter by the values of this field. Useful for searching within a specific category (in short: enables faceting for this field).
- _boost: Float value used to determine the score of a document. The more you boost a field, the more its values count toward the final score.
- _array: Special parameter required if the type or the subtype of the field is "array". See below for more information about the special type 'array'.
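As a sketch of how these field parameters combine, the fragment below declares a hypothetical "product" document type. The type name and field names are invented for the example; only the underscore-prefixed keys come from the list above.

```php
<?php
$schemas = [
    "product" => [ // hypothetical document type
        "name" => [
            "_type" => "string",
            "_indexed" => true,
            "_boost" => 5          // matches in the name weigh the most
        ],
        "brand" => [
            "_type" => "string",
            "_indexed" => true,
            "_filterable" => true, // enables filtering (faceting) by brand
            "_boost" => 1
        ],
        "internal_note" => [
            "_type" => "text",
            "_indexed" => false,   // stored with the document, but not searchable
            "_boost" => 0
        ]
    ]
];
```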
The "list" and "array" types are special: they cannot be used in the index, so naming them in the "types" section will do nothing.
The list type will make a field multivalued. Instead of having one value in the field, you'll have an array of values, whose type will be defined by the subtype key '_type.'
Example:
$schema = [
    "categories" => [
        "_type" => "list",
        "_type." => "string",
        "_indexed" => true,
        "_filterable" => true,
        "_boost" => 6
    ]
];
The array type defines a subschema inside your field. You should use it as a subtype of a 'list'. When you give a field the array type, every other parameter except '_type' and '_type.' is ignored, and an '_array' parameter is required. This parameter contains another schema structure, which is nested inside the current schema.
See the 'comments' field in the default schema above for an example of the array type.
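To make the nesting concrete, here is a minimal sketch that walks a schema and lists its indexed fields, descending into '_array' subschemas. The `indexedFieldPaths()` function and the dotted path notation are assumptions made for this illustration and are not part of the engine.

```php
<?php
// Hypothetical helper: collects the indexed fields of a schema, nested ones included.
function indexedFieldPaths(array $schema, string $prefix = ""): array
{
    $paths = [];
    foreach ($schema as $name => $field) {
        $path = $prefix === "" ? $name : $prefix . "." . $name;
        $isArray = ($field["_type"] ?? null) === "array"
            || ($field["_type."] ?? null) === "array";
        if ($isArray) {
            // Other parameters are ignored for "array" fields: descend into the subschema.
            $paths = array_merge($paths, indexedFieldPaths($field["_array"] ?? [], $path));
        } elseif (!empty($field["_indexed"])) {
            $paths[] = $path;
        }
    }
    return $paths;
}

// Reduced version of the "example-post" schema shown above.
$schema = [
    "title" => ["_type" => "string", "_indexed" => true, "_boost" => 10],
    "comments" => [
        "_type" => "list",
        "_type." => "array",
        "_array" => [
            "author" => ["_type" => "string", "_indexed" => true, "_filterable" => true, "_boost" => 1],
            "message" => ["_type" => "text", "_indexed" => true, "_boost" => 0.1]
        ]
    ]
];

print_r(indexedFieldPaths($schema)); // title, comments.author, comments.message
```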
The "types" section lets you customize the tokenization (the slicing of values into small 'tokens' that help the engine find your documents easily) of any type. A special type "_default" can be defined for default tokenization: if a type used in your schema is not defined here, the "_default" type is used. There is also a special type named "datetime" that converts a value into a DateTime instance. Its tokenization is therefore different, as you can see in the default configuration above.
In this section, every type is defined as an entry of an associative array where the key is the name of the type and the value is a list of TokenizerInterface classes. You can define your own tokenizers, and you can use the default ones provided with the engine.
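To show what tokenization means in practice, here is a minimal sketch that chains three steps in the same order as the "_default" type. The closures are simplified stand-ins written for this example, not the engine's LowerCaseTokenizer, WhiteSpaceTokenizer and TrimPunctuationTokenizer classes, whose exact behavior may differ.

```php
<?php
// Stand-in for a lower-casing step.
$lowerCase = function (array $tokens): array {
    return array_map('strtolower', $tokens);
};

// Stand-in for a whitespace-splitting step.
$whiteSpace = function (array $tokens): array {
    $out = [];
    foreach ($tokens as $token) {
        foreach (preg_split('/\s+/', $token, -1, PREG_SPLIT_NO_EMPTY) as $part) {
            $out[] = $part;
        }
    }
    return $out;
};

// Stand-in for a punctuation-trimming step; empty tokens are dropped.
$trimPunctuation = function (array $tokens): array {
    $trimmed = array_map(function (string $token): string {
        return trim($token, ".,;:!?\"'()[]");
    }, $tokens);
    return array_values(array_filter($trimmed, function (string $token): bool {
        return $token !== '';
    }));
};

// Applies each step in order, starting from the raw value as a single token.
function tokenize(string $value, array $pipeline): array
{
    $tokens = [$value];
    foreach ($pipeline as $step) {
        $tokens = $step($tokens);
    }
    return $tokens;
}

$tokens = tokenize("Hello, World!", [$lowerCase, $whiteSpace, $trimPunctuation]);
print_r($tokens); // hello, world
```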
_default : Fallback type when a "_type" or "_type." is not known.
search : Type used internally to tokenize search terms. If not configured, it falls back to "_default".
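As a sketch, a "types" section that adds a custom type and configures the "search" type could look like the fragment below. The "sku" type name is hypothetical, and the tokenizer classes still need the matching `use` statements, whose namespace is not shown on this page.

```php
<?php
$types = [
    "sku" => [ // hypothetical custom type: lower-cased, never split
        LowerCaseTokenizer::class
    ],
    "search" => [ // applied to the search terms themselves
        LowerCaseTokenizer::class,
        WhiteSpaceTokenizer::class,
        TrimPunctuationTokenizer::class
    ],
    "_default" => [ // used for any type that is not listed here
        LowerCaseTokenizer::class,
        WhiteSpaceTokenizer::class,
        TrimPunctuationTokenizer::class
    ]
];
```

A field would then opt in with "_type" => "sku" in its schema.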