Advanced Retriever

⚠️ This is an experimental feature.

The Advanced Retriever is a searcher based on lexical matching and search filters. Like the Sparse Retriever, it supports BM25 and TF-IDF and provides the same resources for multilingual text pre-processing. In addition, it supports search filters, i.e., sets of rules used to filter documents out of the search results.

In the following, we show how to build a search engine employing an advanced retriever, index a document collection, and search it.

Schema

The first step in building an Advanced Retriever is to define the schema of the document collection. The schema is a dictionary describing the documents' fields and their data types. Based on the data types, search filters can be defined and applied to the search results.

retriv supports the following data types:

  • id: field used for the document IDs.
  • text: text field used for lexical matching.
  • number: numeric value.
  • bool: boolean value (True or False).
  • keyword: string or number representing a keyword or a category.
  • keywords: list of keywords.

An example of schema for a collection of books is shown below.
NB: At the time of writing, retriv supports only one text field per schema. Therefore, the content field is used for both the title and the abstract of the books.

schema = {
  "isbn": "id",
  "content": "text",
  "year": "number",
  "is_english": "bool",
  "author": "keyword",
  "genres": "keywords",
}
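To make the data types concrete, the sketch below checks a single book record against the schema above. The `validate_document` helper, the `CHECKS` table, and the sample record are hypothetical and not part of retriv; the Python types accepted for each data type are assumptions based on the descriptions in the list.

```python
# Hypothetical helper, not part of retriv: checks that a document's values
# are compatible with the data types declared in a schema.
schema = {
    "isbn": "id",
    "content": "text",
    "year": "number",
    "is_english": "bool",
    "author": "keyword",
    "genres": "keywords",
}


def is_number(value):
    # bool is a subclass of int in Python, so it is excluded explicitly.
    return isinstance(value, (int, float)) and not isinstance(value, bool)


# Assumed mapping from schema data types to Python-level checks.
CHECKS = {
    "id": lambda v: isinstance(v, (str, int)) and not isinstance(v, bool),
    "text": lambda v: isinstance(v, str),
    "number": is_number,
    "bool": lambda v: isinstance(v, bool),
    "keyword": lambda v: isinstance(v, str) or is_number(v),
    "keywords": lambda v: isinstance(v, list)
    and all(isinstance(k, str) or is_number(k) for k in v),
}


def validate_document(doc, schema):
    """Return the fields that are missing or whose value has the wrong type."""
    return [
        field
        for field, kind in schema.items()
        if field not in doc or not CHECKS[kind](doc[field])
    ]


book = {
    "isbn": "978-0-00-000000-2",  # made-up identifier
    "content": "A Title. A short abstract of the book.",
    "year": 1970,
    "is_english": True,
    "author": "some_author",
    "genres": ["fantasy", "adventure"],
}

print(validate_document(book, schema))  # []
```

Running a check like this before indexing can surface records whose values would not work with the filters defined later, such as a year stored as a string.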

Build

The Advanced Retriever provides several options to tailor its behavior to your preferences, as shown below.

from retriv.experimental import AdvancedRetriever

ar = AdvancedRetriever(
  schema=schema,
  index_name="new-index",
  model="bm25",
  min_df=1,
  tokenizer="whitespace",
  stemmer="english",
  stopwords="english",
  do_lowercasing=True,
  do_ampersand_normalization=True,
  do_special_chars_normalization=True,
  do_acronyms_normalization=True,
  do_punctuation_removal=True,
)
  • schema: the documents' schema.
  • index_name: retriv will use index_name as the identifier of your index.
  • model: defines the retrieval model to use for searching (bm25 or tf-idf).
  • min_df: terms that appear in fewer than min_df documents will be ignored. If an integer, the parameter indicates an absolute count. If a float, it represents a proportion of documents.
  • tokenizer: tokenizer to use during preprocessing. You can pass a custom callable tokenizer or disable tokenization by setting the parameter to None.
  • stemmer: stemmer to use during preprocessing. You can pass a custom callable stemmer or disable stemming by setting the parameter to None.
  • stopwords: stopwords to remove during preprocessing. You can pass a custom stop-word list or disable stop-words removal by setting the parameter to None.
  • do_lowercasing: whether to lowercase texts.
  • do_ampersand_normalization: whether to convert & to and during pre-processing.
  • do_special_chars_normalization: whether to replace special characters with plain letters, e.g., übermensch → ubermensch.
  • do_acronyms_normalization: whether to remove full stops from acronyms without splitting them into multiple words, e.g., P.C.I. → PCI.
  • do_punctuation_removal: whether to remove punctuation.
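The two readings of min_df (absolute count versus proportion) can be illustrated with a toy vocabulary filter. This is a sketch under assumptions, not retriv's implementation: the helper names are hypothetical, and treating a float as `min_df * number_of_documents` is one plausible reading of "a proportion of documents".

```python
from collections import Counter


def document_frequencies(tokenized_docs):
    """Count, for each term, the number of documents it appears in."""
    df = Counter()
    for tokens in tokenized_docs:
        df.update(set(tokens))  # each term counts once per document
    return df


def kept_terms(tokenized_docs, min_df):
    """Return the terms that survive the min_df threshold (hypothetical helper)."""
    df = document_frequencies(tokenized_docs)
    n_docs = len(tokenized_docs)
    # Integer: absolute count. Float: proportion of documents (assumed reading).
    threshold = min_df * n_docs if isinstance(min_df, float) else min_df
    return sorted(term for term, count in df.items() if count >= threshold)


docs = [
    ["black", "masses"],
    ["generals", "masses"],
    ["witches", "masses"],
    ["black", "witches"],
]

print(kept_terms(docs, 1))     # every term is kept
print(kept_terms(docs, 2))     # "generals" appears in one document and is dropped
print(kept_terms(docs, 0.75))  # only terms present in at least 3 of 4 documents
```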

Note: text pre-processing is equally applied to documents during indexing and to queries at search time.
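The effect of the normalization switches can be approximated with the standard library. The `normalize` function below is a hypothetical stand-in, not retriv's code: it skips stemming and stop-word removal, and the regular expression used for acronyms is an assumption. Applying the same function to a document and to a query shows why identical pre-processing on both sides lets them match.

```python
import re
import string
import unicodedata

# Maps every ASCII punctuation character to a space.
PUNCTUATION_TO_SPACE = str.maketrans(
    string.punctuation, " " * len(string.punctuation)
)


def normalize(text):
    """Rough approximation of the normalization switches (not retriv's code)."""
    text = text.lower()                # do_lowercasing
    text = text.replace("&", " and ")  # do_ampersand_normalization
    # do_special_chars_normalization: strip diacritics, e.g. "ü" becomes "u".
    text = unicodedata.normalize("NFKD", text)
    text = "".join(ch for ch in text if not unicodedata.combining(ch))
    # do_acronyms_normalization: "p.c.i." becomes "pci" (assumed pattern).
    text = re.sub(
        r"\b(?:[a-z]\.){2,}", lambda m: m.group().replace(".", ""), text
    )
    # do_punctuation_removal: remaining punctuation becomes whitespace.
    text = text.translate(PUNCTUATION_TO_SPACE)
    return text.split()                # whitespace tokenizer


document_tokens = normalize("Übermensch & P.C.I. compliance!")
query_tokens = normalize("UBERMENSCH pci")

print(document_tokens)
print(set(query_tokens) <= set(document_tokens))  # query terms found in the document
```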

Index

Create

You can index a document collection from JSONL, CSV, or TSV files. CSV and TSV files must have a header. retriv automatically infers the file kind, so there's no need to specify it. Use the callback parameter to pass a function that converts your documents into the format defined by your schema on the fly. Indexes are automatically persisted on disk at the end of the process.

ar = ar.index_file(
  path="path/to/collection",  # File kind is automatically inferred
  show_progress=True,         # Default value
  callback=lambda doc: {      # Callback defaults to None.
    "id": doc["id"],
    "text": doc["title"] + ". " + doc["text"],
    # ...
  },
)
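For anything beyond a couple of fields, a named function is usually easier to read and test than a lambda. The sketch below maps a raw record to the book schema shown earlier. The function name `to_book_schema` and the raw field names (`title`, `abstract`, `pub_year`, `lang`, and a "|"-separated `genres` string) are hypothetical; adapt them to your own files.

```python
def to_book_schema(doc):
    """Hypothetical callback: map a raw record to the book schema fields."""
    return {
        "isbn": doc["isbn"],
        # Only one text field is supported, so title and abstract are merged.
        "content": doc["title"] + ". " + doc["abstract"],
        # CSV/TSV values arrive as strings, so the year is cast to a number.
        "year": int(doc["pub_year"]),
        "is_english": doc["lang"] == "en",
        "author": doc["author"],
        # Assumes genres are stored as one "|"-separated string.
        "genres": doc["genres"].split("|"),
    }


raw = {
    "isbn": "b1",
    "title": "A Title",
    "abstract": "An abstract.",
    "pub_year": "1970",
    "lang": "en",
    "author": "some_author",
    "genres": "fantasy|adventure",
}

print(to_book_schema(raw))
```

The function can then be passed as the callback argument of index_file in place of the lambda.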

Load

ar = AdvancedRetriever.load("index-name")

Delete

AdvancedRetriever.delete("index-name")

Search

Query & Filters

An Advanced Retriever search query can be either a string or a dictionary. In the former case, the string is used as the query text and no filters are applied. In the latter case, the dictionary defines the query text and the filters to apply to the search results. If the query text is omitted from the dictionary, all documents matching the filters are returned.

retriv supports two ways of filtering the search results (where and where_not) and several type-specific operators.

  • where means that only the documents matching the filter will be considered during search.
  • where_not means that the documents matching the filter will be ignored during search.

Below we describe the effects of the supported operators for each data type and way of filtering.

Where

| Field Type | Operator | Value | Meaning |
| --- | --- | --- | --- |
| number | eq | number | Only the documents whose field value is equal to the provided value will be considered during search. |
| number | gt | number | Only the documents whose field value is greater than the provided value will be considered during search. |
| number | gte | number | Only the documents whose field value is greater than or equal to the provided value will be considered during search. |
| number | lt | number | Only the documents whose field value is less than the provided value will be considered during search. |
| number | lte | number | Only the documents whose field value is less than or equal to the provided value will be considered during search. |
| number | between | number | Only the documents whose field value is between the provided values (inclusive) will be considered during search. |
| bool | | True / False | Only the documents whose field value is equal to the provided value will be considered during search. |
| keyword | | any value / list of values | Only the documents whose field value is equal to the provided value or among the provided values will be considered during search. |
| keywords | or | any value / list of values | Only the documents whose field value contains the provided value or at least one of the provided values will be considered during search. |
| keywords | and | any value / list of values | Only the documents whose field value contains all the provided values will be considered during search. |

Query example:

query = {
    "text": "search terms",
    "where": {
        "numeric_field_name": ("gte", 1970),
        "boolean_field_name": True,
        "keyword_field_name": "kw_1",
        "keywords_field_name": ("or", ["kws_23", "kws_666"]),
    }
}

Alternatively, you can omit the where key and use the following syntax:

query = {
    "text": "search terms",
    "numeric_field_name": ("gte", 1970),
    "boolean_field_name": True,
    "keyword_field_name": "kw_1",
    "keywords_field_name": ("or", ["kws_23", "kws_666"]),
}
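The operator semantics listed above can be written down as a small evaluator. This is an illustrative sketch, not retriv's implementation: the function names are hypothetical, combining the conditions of several fields with a logical AND is an assumption, and so is passing the bounds of `between` as a `(low, high)` pair.

```python
import operator

# Comparison operators for "number" fields, using the names from the table.
NUMBER_OPS = {
    "eq": operator.eq,
    "gt": operator.gt,
    "gte": operator.ge,
    "lt": operator.lt,
    "lte": operator.le,
}


def as_list(value):
    return value if isinstance(value, list) else [value]


def field_matches(kind, value, condition):
    """Evaluate one filter condition against one field value."""
    if kind == "number":
        op, operand = condition
        if op == "between":
            low, high = operand          # assumed to be a (low, high) pair
            return low <= value <= high  # bounds are inclusive
        return NUMBER_OPS[op](value, operand)
    if kind == "bool":
        return value == condition
    if kind == "keyword":
        return value in as_list(condition)
    if kind == "keywords":
        op, operand = condition
        wanted = set(as_list(operand))
        if op == "and":
            return wanted <= set(value)   # all provided values are present
        return bool(wanted & set(value))  # at least one provided value is present
    raise ValueError(f"unsupported field type: {kind}")


def matches_where(doc, where, schema):
    """Assumption: a document passes only if every condition holds."""
    return all(
        field_matches(schema[field], doc[field], condition)
        for field, condition in where.items()
    )


schema = {"year": "number", "is_english": "bool", "author": "keyword", "genres": "keywords"}
doc = {"year": 1984, "is_english": True, "author": "kw_1", "genres": ["kws_23", "kws_42"]}

where = {
    "year": ("gte", 1970),
    "is_english": True,
    "author": "kw_1",
    "genres": ("or", ["kws_23", "kws_666"]),
}
print(matches_where(doc, where, schema))  # True: every condition holds

# With "and", the document must list both keywords, which it does not.
print(matches_where(doc, {"genres": ("and", ["kws_23", "kws_666"])}, schema))  # False
```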

Where not

| Field Type | Operator | Value | Meaning |
| --- | --- | --- | --- |
| number | eq | number | The documents whose field value is equal to the provided value will be ignored. |
| number | gt | number | The documents whose field value is greater than the provided value will be ignored. |
| number | gte | number | The documents whose field value is greater than or equal to the provided value will be ignored. |
| number | lt | number | The documents whose field value is less than the provided value will be ignored. |
| number | lte | number | The documents whose field value is less than or equal to the provided value will be ignored. |
| number | between | number | The documents whose field value is between the provided values (inclusive) will be ignored. |
| bool | | True / False | The documents whose field value is equal to the provided value will be ignored. |
| keyword | | any value / list of values | The documents whose field value is equal to the provided value or among the provided values will be ignored. |
| keywords | or | any value / list of values | The documents whose field value contains the provided value or at least one of the provided values will be ignored. |
| keywords | and | any value / list of values | The documents whose field value contains all the provided values will be ignored. |

Query example:

query = {
    "text": "search terms",
    "where_not": {
        "numeric_field_name": ("gte", 1970),
        "boolean_field_name": True,
        "keyword_field_name": "kw_1",
        "keywords_field_name": ("or", ["kws_23", "kws_666"]),
    }
}
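For a single condition, where and where_not select complementary sets of documents. The toy example below shows this on three made-up book records with a plain Python predicate standing in for the `("gte", 1970)` filter; it does not call retriv.

```python
books = [
    {"isbn": "b1", "year": 1954},
    {"isbn": "b2", "year": 1970},
    {"isbn": "b3", "year": 1999},
]


def year_gte_1970(book):
    # Stand-in for the filter ("gte", 1970) on a number field.
    return book["year"] >= 1970


# where: only the matching documents are considered.
kept_by_where = [b["isbn"] for b in books if year_gte_1970(b)]
# where_not: the matching documents are ignored.
kept_by_where_not = [b["isbn"] for b in books if not year_gte_1970(b)]

print(kept_by_where)      # ['b2', 'b3']
print(kept_by_where_not)  # ['b1']
```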

Search

ar.search(
  query=...,            # See the Query & Filters section.
  return_docs=True,     # Default value.
  cutoff=100,           # Default value.
  operator="OR",        # Default value.
  subset_doc_ids=None,  # Default value.
)
  • query: what to search for and which filters to apply. See the section Query & Filters for more details.
  • return_docs: whether to return documents or only their IDs.
  • cutoff: number of results to return.
  • operator: whether to perform conjunctive (AND) or disjunctive (OR) search. Conjunctive search retrieves documents that contain all the query terms. Disjunctive search retrieves documents that contain at least one of the query terms.
  • subset_doc_ids: restrict the search to the subset of documents having the provided IDs.
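The difference between conjunctive and disjunctive search, and the effect of subset_doc_ids, can be seen with plain set operations. The `matching_ids` function below is a hypothetical toy that only decides which documents match; it performs no scoring or ranking and is not how retriv is implemented.

```python
docs = {
    "doc_1": "generals gathered in their masses",
    "doc_2": "just like witches at black masses",
}


def matching_ids(query, operator="OR", subset_doc_ids=None):
    """Toy boolean matching over whitespace tokens (no scoring, no ranking)."""
    terms = set(query.split())
    if subset_doc_ids is None:
        candidates = docs
    else:
        # Restrict the search to the provided document IDs.
        candidates = {i: docs[i] for i in subset_doc_ids if i in docs}
    hits = []
    for doc_id, text in candidates.items():
        tokens = set(text.split())
        if operator == "AND":
            matched = terms <= tokens       # all query terms are present
        else:
            matched = bool(terms & tokens)  # at least one query term is present
        if matched:
            hits.append(doc_id)
    return sorted(hits)


print(matching_ids("witches masses"))                            # ['doc_1', 'doc_2']
print(matching_ids("witches masses", operator="AND"))            # ['doc_2']
print(matching_ids("witches masses", subset_doc_ids=["doc_1"]))  # ['doc_1']
```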

Sample output:

[
  {
    "id": "doc_2",
    "text": "Just like witches at black masses",
    "score": 1.7536403
  },
  {
    "id": "doc_1",
    "text": "Generals gathered in their masses",
    "score": 0.6931472
  }
]