diff --git a/docs/_src/api/api/document_store.md b/docs/_src/api/api/document_store.md index a682e8b4bf..9c3aaf342f 100644 --- a/docs/_src/api/api/document_store.md +++ b/docs/_src/api/api/document_store.md @@ -1,40 +1,95 @@ - -# memory + +# Module elasticsearch - -## InMemoryDocumentStore + +## ElasticsearchDocumentStore Objects ```python -class InMemoryDocumentStore(BaseDocumentStore) +class ElasticsearchDocumentStore(BaseDocumentStore) ``` -In-memory document store + +#### \_\_init\_\_ - +```python + | __init__(host: str = "localhost", port: int = 9200, username: str = "", password: str = "", index: str = "document", label_index: str = "label", search_fields: Union[str, list] = "text", text_field: str = "text", name_field: str = "name", embedding_field: str = "embedding", embedding_dim: int = 768, custom_mapping: Optional[dict] = None, excluded_meta_data: Optional[list] = None, faq_question_field: Optional[str] = None, analyzer: str = "standard", scheme: str = "http", ca_certs: bool = False, verify_certs: bool = True, create_index: bool = True, update_existing_documents: bool = False, refresh_type: str = "wait_for", similarity="dot_product", timeout=30, return_embedding: Optional[bool] = True) +``` + +A DocumentStore using Elasticsearch to store and query the documents for our search. + +* Keeps all the logic to store and query documents from Elastic, incl. mapping of fields, adding filters or boosts to your queries, and storing embeddings +* You can either use an existing Elasticsearch index or create a new one via haystack +* Retrievers operate on top of this DocumentStore to find the relevant documents for a query + +**Arguments**: + +- `host`: url of elasticsearch +- `port`: port of elasticsearch +- `username`: username +- `password`: password +- `index`: Name of index in elasticsearch to use. If not existing yet, we will create one. 
+- `search_fields`: Name of fields used by ElasticsearchRetriever to find matches in the docs to our incoming query (using elastic's multi_match query), e.g. ["title", "full_text"] +- `text_field`: Name of field that might contain the answer and will therefore be passed to the Reader Model (e.g. "full_text"). +If no Reader is used (e.g. in FAQ-Style QA) the plain content of this field will just be returned. +- `name_field`: Name of field that contains the title of the the doc +- `embedding_field`: Name of field containing an embedding vector (Only needed when using a dense retriever (e.g. DensePassageRetriever, EmbeddingRetriever) on top) +- `embedding_dim`: Dimensionality of embedding vector (Only needed when using a dense retriever (e.g. DensePassageRetriever, EmbeddingRetriever) on top) +- `custom_mapping`: If you want to use your own custom mapping for creating a new index in Elasticsearch, you can supply it here as a dictionary. +- `analyzer`: Specify the default analyzer from one of the built-ins when creating a new Elasticsearch Index. +Elasticsearch also has built-in analyzers for different languages (e.g. impacting tokenization). More info at: +https://www.elastic.co/guide/en/elasticsearch/reference/7.9/analysis-analyzers.html +- `excluded_meta_data`: Name of fields in Elasticsearch that should not be returned (e.g. [field_one, field_two]). +Helpful if you have fields with long, irrelevant content that you don't want to display in results (e.g. embedding vectors). +- `scheme`: 'https' or 'http', protocol used to connect to your elasticsearch instance +- `ca_certs`: Root certificates for SSL +- `verify_certs`: Whether to be strict about ca certificates +- `create_index`: Whether to try creating a new index (If the index of that name is already existing, we will just continue in any case) +- `update_existing_documents`: Whether to update any existing documents with the same ID when adding +documents. 
When set as True, any document with an existing ID gets updated. +If set to False, an error is raised if the document ID of the document being +added already exists. +- `refresh_type`: Type of ES refresh used to control when changes made by a request (e.g. bulk) are made visible to search. +Values: +- 'wait_for' => continue only after changes are visible (slow, but safe) +- 'false' => continue directly (fast, but sometimes unintuitive behaviour when docs are not immediately available after ingestion) +More info at https://www.elastic.co/guide/en/elasticsearch/reference/6.8/docs-refresh.html +- `similarity`: The similarity function used to compare document vectors. 'dot_product' is the default since it is +more performant with DPR embeddings. 'cosine' is recommended if you are using a Sentence BERT model. +- `timeout`: Number of seconds after which an ElasticSearch request times out. +- `return_embedding`: To return document embedding + + #### write\_documents ```python | write_documents(documents: Union[List[dict], List[Document]], index: Optional[str] = None) ``` -Indexes documents for later queries. +Indexes documents for later queries in Elasticsearch. +Behaviour if a document with the same ID already exists in ElasticSearch: +a) (Default) Throw Elastic's standard error message for duplicate IDs. +b) If `self.update_existing_documents=True` for DocumentStore: Overwrite existing documents. +(This is only relevant if you pass your own ID when initializing a `Document`. +If you don't set custom IDs for your Documents or just pass a list of dictionaries here, +they will automatically get UUIDs assigned. See the `Document` class for details) **Arguments**: - `documents`: a list of Python dictionaries or a list of Haystack Document objects. For documents as dictionaries, the format is {"text": ""}.
Optionally: Include meta data via {"text": "", -"meta": {"name": ", "author": "somebody", ...}} +"meta":{"name": ", "author": "somebody", ...}} It can be used for filtering and is accessible in the responses of the Finder. -- `index`: write documents to a custom namespace. For instance, documents for evaluation can be indexed in a -separate index than the documents for search. +Advanced: If you are using your own Elasticsearch mapping, the key names in the dictionary +should be changed to what you have set for self.text_field and self.name_field. +- `index`: Elasticsearch index where the documents should be indexed. If not supplied, self.index will be used. **Returns**: None - + #### update\_embeddings ```python @@ -53,11 +108,11 @@ This can be useful if want to add or change the embeddings for your documents (e None - + #### add\_eval\_data ```python - | add_eval_data(filename: str, doc_index: Optional[str] = None, label_index: Optional[str] = None) + | add_eval_data(filename: str, doc_index: str = "eval_document", label_index: str = "label") ``` Adds a SQuAD-formatted file to the DocumentStore in order to be able to perform evaluation on it. @@ -71,94 +126,61 @@ Adds a SQuAD-formatted file to the DocumentStore in order to be able to perform - `label_index`: Elasticsearch index where labeled questions should be stored :type label_index: str - + #### delete\_all\_documents ```python - | delete_all_documents(index: Optional[str] = None) + | delete_all_documents(index: str, filters: Optional[Dict[str, List[str]]] = None) ``` -Delete all documents in a index. +Delete documents in an index. All documents are deleted if no filters are passed. **Arguments**: -- `index`: index name +- `index`: Index name to delete the document from. +- `filters`: Optional filters to narrow down the documents to be deleted. 
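The `filters` argument above has the type `Dict[str, List[str]]`. As a rough illustration of what such a filter selects, here is a minimal plain-Python sketch with no Haystack import. The helper `matches_filters` is hypothetical, and its semantics (all fields must match, any listed value per field is accepted) are an assumption rather than something this diff confirms for each store.

```python
from typing import Dict, List, Optional


def matches_filters(meta: Dict[str, str], filters: Optional[Dict[str, List[str]]]) -> bool:
    """Hypothetical helper: True if `meta` satisfies every filter field.

    Assumed semantics: AND across fields, OR across the values of one field.
    """
    if not filters:
        return True  # no filters passed: every document is selected
    return all(meta.get(field) in allowed for field, allowed in filters.items())


# Filter shape taken from the docs' own example for `filters`.
filters = {"name": ["some", "more"], "category": ["only_one"]}
docs_meta = [
    {"name": "some", "category": "only_one"},
    {"name": "other", "category": "only_one"},
]
# Under the assumed semantics, only the first entry would be selected for deletion.
to_delete = [m for m in docs_meta if matches_filters(m, filters)]
```

With `filters=None` the helper selects everything, which mirrors the documented behaviour that all documents are deleted if no filters are passed.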
**Returns**: None - -# faiss - - -## FAISSDocumentStore - -```python -class FAISSDocumentStore(SQLDocumentStore) -``` - -Document store for very large scale embedding based dense retrievers like the DPR. - -It implements the FAISS library(https://github.com/facebookresearch/faiss) -to perform similarity search on vectors. - -The document text and meta-data (for filtering) are stored using the SQLDocumentStore, while -the vector embeddings are indexed in a FAISS Index. + +# Module memory - -#### \_\_init\_\_ + +## InMemoryDocumentStore Objects ```python - | __init__(sql_url: str = "sqlite:///", index_buffer_size: int = 10_000, vector_dim: int = 768, faiss_index_factory_str: str = "Flat", faiss_index: Optional[faiss.swigfaiss.Index] = None, return_embedding: Optional[bool] = True, **kwargs, ,) +class InMemoryDocumentStore(BaseDocumentStore) ``` -**Arguments**: - -- `sql_url`: SQL connection URL for database. It defaults to local file based SQLite DB. For large scale -deployment, Postgres is recommended. -- `index_buffer_size`: When working with large datasets, the ingestion process(FAISS + SQL) can be buffered in -smaller chunks to reduce memory footprint. -- `vector_dim`: the embedding vector size. -- `faiss_index_factory_str`: Create a new FAISS index of the specified type. -The type is determined from the given string following the conventions -of the original FAISS index factory. -Recommended options: -- "Flat" (default): Best accuracy (= exact). Becomes slow and RAM intense for > 1 Mio docs. -- "HNSW": Graph-based heuristic. If not further specified, -we use a RAM intense, but more accurate config: -HNSW256, efConstruction=256 and efSearch=256 -- "IVFx,Flat": Inverted Index. Replace x with the number of centroids aka nlist. -Rule of thumb: nlist = 10 * sqrt (num_docs) is a good starting point. 
-For more details see: -- Overview of indices https://github.com/facebookresearch/faiss/wiki/Faiss-indexes -- Guideline for choosing an index https://github.com/facebookresearch/faiss/wiki/Guidelines-to-choose-an-index -- FAISS Index factory https://github.com/facebookresearch/faiss/wiki/The-index-factory -Benchmarks: XXX -- `faiss_index`: Pass an existing FAISS Index, i.e. an empty one that you configured manually -or one with docs that you used in Haystack before and want to load again. -- `return_embedding`: To return document embedding +In-memory document store - + #### write\_documents ```python | write_documents(documents: Union[List[dict], List[Document]], index: Optional[str] = None) ``` -Add new documents to the DocumentStore. +Indexes documents for later queries. + **Arguments**: -- `documents`: List of `Dicts` or List of `Documents`. If they already contain the embeddings, we'll index -them right away in FAISS. If not, you can later call update_embeddings() to create & index them. -- `index`: (SQL) index name for storing the docs and metadata +- `documents`: a list of Python dictionaries or a list of Haystack Document objects. +For documents as dictionaries, the format is {"text": ""}. +Optionally: Include meta data via {"text": "", +"meta": {"name": ", "author": "somebody", ...}} +It can be used for filtering and is accessible in the responses of the Finder. +- `index`: write documents to a custom namespace. For instance, documents for evaluation can be indexed in a +separate index than the documents for search. 
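The dictionary format described for `documents` can be sketched in plain Python as follows. The helper name `make_doc` and the sample texts are illustrative assumptions, not part of the library; only the `{"text": ..., "meta": {...}}` shape comes from the documentation above.

```python
from typing import Any, Dict, List, Optional


def make_doc(text: str, meta: Optional[Dict[str, Any]] = None) -> Dict[str, Any]:
    """Hypothetical helper: build one document dict in the documented shape."""
    doc: Dict[str, Any] = {"text": text}
    if meta:
        doc["meta"] = dict(meta)  # copy so the caller's dict is not shared
    return doc


docs: List[Dict[str, Any]] = [
    make_doc("Berlin is the capital of Germany."),
    make_doc("Paris is the capital of France.", {"name": "paris.txt", "author": "somebody"}),
]
# A list shaped like `docs` is what the `documents` argument describes.
```

Passing `index` alongside such a list would place the documents in a separate namespace, for example one reserved for evaluation data.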
**Returns**: +None - - + #### update\_embeddings ```python @@ -170,170 +192,85 @@ This can be useful if want to add or change the embeddings for your documents (e **Arguments**: -- `retriever`: Retriever to use to get embeddings for text -- `index`: (SQL) index name for storing the docs and metadata - -**Returns**: - -None - - -#### train\_index - -```python - | train_index(documents: Optional[Union[List[dict], List[Document]]], embeddings: Optional[np.array] = None) -``` - -Some FAISS indices (e.g. IVF) require initial "training" on a sample of vectors before you can add your final vectors. -The train vectors should come from the same distribution as your final ones. -You can pass either documents (incl. embeddings) or just the plain embeddings that the index shall be trained on. - -**Arguments**: - -- `documents`: Documents (incl. the embeddings) -- `embeddings`: Plain embeddings +- `retriever`: Retriever +- `index`: Index name to update **Returns**: None - -#### query\_by\_embedding + +#### add\_eval\_data ```python - | query_by_embedding(query_emb: np.array, filters: Optional[dict] = None, top_k: int = 10, index: Optional[str] = None, return_embedding: Optional[bool] = None) -> List[Document] + | add_eval_data(filename: str, doc_index: Optional[str] = None, label_index: Optional[str] = None) ``` -Find the document that is most similar to the provided `query_emb` by using a vector similarity metric. +Adds a SQuAD-formatted file to the DocumentStore in order to be able to perform evaluation on it. **Arguments**: -- `query_emb`: Embedding of the query (e.g. gathered from DPR) -- `filters`: Optional filters to narrow down the search space. 
-Example: {"name": ["some", "more"], "category": ["only_one"]} -- `top_k`: How many documents to return -- `index`: (SQL) index name for storing the docs and metadata -- `return_embedding`: To return document embedding - -**Returns**: - - +- `filename`: Name of the file containing evaluation data +:type filename: str +- `doc_index`: Elasticsearch index where evaluation documents should be stored +:type doc_index: str +- `label_index`: Elasticsearch index where labeled questions should be stored +:type label_index: str - -#### save + +#### delete\_all\_documents ```python - | save(file_path: Union[str, Path]) + | delete_all_documents(index: Optional[str] = None, filters: Optional[Dict[str, List[str]]] = None) ``` -Save FAISS Index to the specified file. +Delete documents in an index. All documents are deleted if no filters are passed. **Arguments**: -- `file_path`: Path to save to. +- `index`: Index name to delete the document from. +- `filters`: Optional filters to narrow down the documents to be deleted. **Returns**: None - -#### load - -```python - | @classmethod - | load(cls, faiss_file_path: Union[str, Path], sql_url: str, index_buffer_size: int = 10_000) -``` - -Load a saved FAISS index from a file and connect to the SQL database. -Note: In order to have a correct mapping from FAISS to SQL, -make sure to use the same SQL DB that you used when calling `save()`. - -**Arguments**: - -- `faiss_file_path`: Stored FAISS index file. Can be created via calling `save()` -- `sql_url`: Connection string to the SQL database that contains your docs and metadata. -- `index_buffer_size`: When working with large datasets, the ingestion process(FAISS + SQL) can be buffered in -smaller chunks to reduce memory footprint. 
- -**Returns**: - - - - -# elasticsearch + +# Module sql - -## ElasticsearchDocumentStore + +## SQLDocumentStore Objects ```python -class ElasticsearchDocumentStore(BaseDocumentStore) +class SQLDocumentStore(BaseDocumentStore) ``` - + #### \_\_init\_\_ ```python - | __init__(host: str = "localhost", port: int = 9200, username: str = "", password: str = "", index: str = "document", label_index: str = "label", search_fields: Union[str, list] = "text", text_field: str = "text", name_field: str = "name", embedding_field: str = "embedding", embedding_dim: int = 768, custom_mapping: Optional[dict] = None, excluded_meta_data: Optional[list] = None, faq_question_field: Optional[str] = None, analyzer: str = "standard", scheme: str = "http", ca_certs: bool = False, verify_certs: bool = True, create_index: bool = True, update_existing_documents: bool = False, refresh_type: str = "wait_for", similarity="dot_product", timeout=30, return_embedding: Optional[bool] = True) + | __init__(url: str = "sqlite://", index: str = "document", label_index: str = "label", update_existing_documents: bool = False) ``` -A DocumentStore using Elasticsearch to store and query the documents for our search. - -* Keeps all the logic to store and query documents from Elastic, incl. mapping of fields, adding filters or boosts to your queries, and storing embeddings -* You can either use an existing Elasticsearch index or create a new one via haystack -* Retrievers operate on top of this DocumentStore to find the relevant documents for a query - **Arguments**: -- `host`: url of elasticsearch -- `port`: port of elasticsearch -- `username`: username -- `password`: password -- `index`: Name of index in elasticsearch to use. If not existing yet, we will create one. -- `search_fields`: Name of fields used by ElasticsearchRetriever to find matches in the docs to our incoming query (using elastic's multi_match query), e.g. 
["title", "full_text"] -- `text_field`: Name of field that might contain the answer and will therefore be passed to the Reader Model (e.g. "full_text"). -If no Reader is used (e.g. in FAQ-Style QA) the plain content of this field will just be returned. -- `name_field`: Name of field that contains the title of the the doc -- `embedding_field`: Name of field containing an embedding vector (Only needed when using a dense retriever (e.g. DensePassageRetriever, EmbeddingRetriever) on top) -- `embedding_dim`: Dimensionality of embedding vector (Only needed when using a dense retriever (e.g. DensePassageRetriever, EmbeddingRetriever) on top) -- `custom_mapping`: If you want to use your own custom mapping for creating a new index in Elasticsearch, you can supply it here as a dictionary. -- `analyzer`: Specify the default analyzer from one of the built-ins when creating a new Elasticsearch Index. -Elasticsearch also has built-in analyzers for different languages (e.g. impacting tokenization). More info at: -https://www.elastic.co/guide/en/elasticsearch/reference/7.9/analysis-analyzers.html -- `excluded_meta_data`: Name of fields in Elasticsearch that should not be returned (e.g. [field_one, field_two]). -Helpful if you have fields with long, irrelevant content that you don't want to display in results (e.g. embedding vectors). -- `scheme`: 'https' or 'http', protocol used to connect to your elasticsearch instance -- `ca_certs`: Root certificates for SSL -- `verify_certs`: Whether to be strict about ca certificates -- `create_index`: Whether to try creating a new index (If the index of that name is already existing, we will just continue in any case) +- `url`: URL for SQL database as expected by SQLAlchemy. More info here: https://docs.sqlalchemy.org/en/13/core/engines.html#database-urls +- `index`: The documents are scoped to an index attribute that can be used when writing, querying, or deleting documents. +This parameter sets the default value for document index. 
+- `label_index`: The default value of index attribute for the labels. - `update_existing_documents`: Whether to update any existing documents with the same ID when adding documents. When set as True, any document with an existing ID gets updated. If set to False, an error is raised if the document ID of the document being -added already exists. -- `refresh_type`: Type of ES refresh used to control when changes made by a request (e.g. bulk) are made visible to search. -Values: -- 'wait_for' => continue only after changes are visible (slow, but safe) -- 'false' => continue directly (fast, but sometimes unintuitive behaviour when docs are not immediately available after ingestion) -More info at https://www.elastic.co/guide/en/elasticsearch/reference/6.8/docs-refresh.html -- `similarity`: The similarity function used to compare document vectors. 'dot_product' is the default sine it is -more performant with DPR embeddings. 'cosine' is recommended if you are using a Sentence BERT model. -- `timeout`: Number of seconds after which an ElasticSearch request times out. -- `return_embedding`: To return document embedding +added already exists. Using this parameter could cause performance degradation for document insertion. - + #### write\_documents ```python | write_documents(documents: Union[List[dict], List[Document]], index: Optional[str] = None) ``` -Indexes documents for later queries in Elasticsearch. - -Behaviour if a document with the same ID already exists in ElasticSearch: -a) (Default) Throw Elastic's standard error message for duplicate IDs. -b) If `self.update_existing_documents=True` for DocumentStore: Overwrite existing documents. -(This is only relevant if you pass your own ID when initializing a `Document`. -If don't set custom IDs for your Documents or just pass a list of dictionaries here, -they will automatically get UUIDs assigned. See the `Document` class for details) +Indexes documents for later queries.
**Arguments**: @@ -342,34 +279,28 @@ For documents as dictionaries, the format is {"text": ""}. Optionally: Include meta data via {"text": "", "meta":{"name": ", "author": "somebody", ...}} It can be used for filtering and is accessible in the responses of the Finder. -Advanced: If you are using your own Elasticsearch mapping, the key names in the dictionary -should be changed to what you have set for self.text_field and self.name_field. -- `index`: Elasticsearch index where the documents should be indexed. If not supplied, self.index will be used. +- `index`: add an optional index attribute to documents. It can be later used for filtering. For instance, +documents for evaluation can be indexed in a separate index than the documents for search. **Returns**: None - -#### update\_embeddings + +#### update\_vector\_ids ```python - | update_embeddings(retriever: BaseRetriever, index: Optional[str] = None) + | update_vector_ids(vector_id_map: Dict[str, str], index: Optional[str] = None) ``` -Updates the embeddings in the the document store using the encoding model specified in the retriever. -This can be useful if want to add or change the embeddings for your documents (e.g. after changing the retriever config). +Update vector_ids for given document_ids. **Arguments**: -- `retriever`: Retriever -- `index`: Index name to update - -**Returns**: - -None +- `vector_id_map`: dict containing mapping of document_id -> vector_id. +- `index`: filter documents by the optional index attribute for documents in database. - + #### add\_eval\_data ```python @@ -387,37 +318,41 @@ Adds a SQuAD-formatted file to the DocumentStore in order to be able to perform - `label_index`: Elasticsearch index where labeled questions should be stored :type label_index: str - + #### delete\_all\_documents ```python - | delete_all_documents(index: str) + | delete_all_documents(index: Optional[str] = None, filters: Optional[Dict[str, List[str]]] = None) ``` -Delete all documents in an index. 
+Delete documents in an index. All documents are deleted if no filters are passed. **Arguments**: -- `index`: index name +- `index`: Index name to delete the document from. +- `filters`: Optional filters to narrow down the documents to be deleted. **Returns**: None - -# sql + +# Module base - -## SQLDocumentStore + +## BaseDocumentStore Objects ```python -class SQLDocumentStore(BaseDocumentStore) +class BaseDocumentStore(ABC) ``` - +Base class for implementing Document Stores. + + #### write\_documents ```python + | @abstractmethod | write_documents(documents: Union[List[dict], List[Document]], index: Optional[str] = None) ``` @@ -430,95 +365,186 @@ For documents as dictionaries, the format is {"text": ""}. Optionally: Include meta data via {"text": "", "meta":{"name": ", "author": "somebody", ...}} It can be used for filtering and is accessible in the responses of the Finder. -- `index`: add an optional index attribute to documents. It can be later used for filtering. For instance, -documents for evaluation can be indexed in a separate index than the documents for search. +- `index`: Optional name of index where the documents shall be written to. +If None, the DocumentStore's default index (self.index) will be used. **Returns**: None - -#### update\_vector\_ids + +# Module faiss + + +## FAISSDocumentStore Objects ```python - | update_vector_ids(vector_id_map: Dict[str, str], index: Optional[str] = None) +class FAISSDocumentStore(SQLDocumentStore) ``` -Update vector_ids for given document_ids. +Document store for very large scale embedding based dense retrievers like the DPR. + +It implements the FAISS library(https://github.com/facebookresearch/faiss) +to perform similarity search on vectors. + +The document text and meta-data (for filtering) are stored using the SQLDocumentStore, while +the vector embeddings are indexed in a FAISS Index. 
+ + +#### \_\_init\_\_ + +```python + | __init__(sql_url: str = "sqlite:///", index_buffer_size: int = 10_000, vector_dim: int = 768, faiss_index_factory_str: str = "Flat", faiss_index: Optional[faiss.swigfaiss.Index] = None, return_embedding: Optional[bool] = True, update_existing_documents: bool = False, index: str = "document", **kwargs, ,) +``` **Arguments**: -- `vector_id_map`: dict containing mapping of document_id -> vector_id. -- `index`: filter documents by the optional index attribute for documents in database. +- `sql_url`: SQL connection URL for database. It defaults to local file based SQLite DB. For large scale +deployment, Postgres is recommended. +- `index_buffer_size`: When working with large datasets, the ingestion process(FAISS + SQL) can be buffered in +smaller chunks to reduce memory footprint. +- `vector_dim`: the embedding vector size. +- `faiss_index_factory_str`: Create a new FAISS index of the specified type. +The type is determined from the given string following the conventions +of the original FAISS index factory. +Recommended options: +- "Flat" (default): Best accuracy (= exact). Becomes slow and RAM intense for > 1 Mio docs. +- "HNSW": Graph-based heuristic. If not further specified, +we use a RAM intense, but more accurate config: +HNSW256, efConstruction=256 and efSearch=256 +- "IVFx,Flat": Inverted Index. Replace x with the number of centroids aka nlist. +Rule of thumb: nlist = 10 * sqrt (num_docs) is a good starting point. +For more details see: +- Overview of indices https://github.com/facebookresearch/faiss/wiki/Faiss-indexes +- Guideline for choosing an index https://github.com/facebookresearch/faiss/wiki/Guidelines-to-choose-an-index +- FAISS Index factory https://github.com/facebookresearch/faiss/wiki/The-index-factory +Benchmarks: XXX +- `faiss_index`: Pass an existing FAISS Index, i.e. an empty one that you configured manually +or one with docs that you used in Haystack before and want to load again. 
+- `return_embedding`: To return document embedding +- `update_existing_documents`: Whether to update any existing documents with the same ID when adding +documents. When set as True, any document with an existing ID gets updated. +If set to False, an error is raised if the document ID of the document being +added already exists. +- `index`: Name of index in document store to use. - -#### add\_eval\_data + +#### write\_documents ```python - | add_eval_data(filename: str, doc_index: str = "eval_document", label_index: str = "label") + | write_documents(documents: Union[List[dict], List[Document]], index: Optional[str] = None) ``` -Adds a SQuAD-formatted file to the DocumentStore in order to be able to perform evaluation on it. +Add new documents to the DocumentStore. **Arguments**: -- `filename`: Name of the file containing evaluation data -:type filename: str -- `doc_index`: Elasticsearch index where evaluation documents should be stored -:type doc_index: str -- `label_index`: Elasticsearch index where labeled questions should be stored -:type label_index: str +- `documents`: List of `Dicts` or List of `Documents`. If they already contain the embeddings, we'll index +them right away in FAISS. If not, you can later call update_embeddings() to create & index them. +- `index`: (SQL) index name for storing the docs and metadata - -#### delete\_all\_documents +**Returns**: + + + + +#### update\_embeddings ```python - | delete_all_documents(index=None) + | update_embeddings(retriever: BaseRetriever, index: Optional[str] = None) ``` -Delete all documents in a index. +Updates the embeddings in the the document store using the encoding model specified in the retriever. +This can be useful if want to add or change the embeddings for your documents (e.g. after changing the retriever config). 
**Arguments**: -- `index`: index name +- `retriever`: Retriever to use to get embeddings for text +- `index`: (SQL) index name for storing the docs and metadata **Returns**: None - -# base + +#### train\_index - -## BaseDocumentStore +```python + | train_index(documents: Optional[Union[List[dict], List[Document]]], embeddings: Optional[np.array] = None) +``` + +Some FAISS indices (e.g. IVF) require initial "training" on a sample of vectors before you can add your final vectors. +The train vectors should come from the same distribution as your final ones. +You can pass either documents (incl. embeddings) or just the plain embeddings that the index shall be trained on. + +**Arguments**: + +- `documents`: Documents (incl. the embeddings) +- `embeddings`: Plain embeddings + +**Returns**: + +None + + +#### query\_by\_embedding ```python -class BaseDocumentStore(ABC) + | query_by_embedding(query_emb: np.array, filters: Optional[dict] = None, top_k: int = 10, index: Optional[str] = None, return_embedding: Optional[bool] = None) -> List[Document] ``` -Base class for implementing Document Stores. +Find the document that is most similar to the provided `query_emb` by using a vector similarity metric. - -#### write\_documents +**Arguments**: + +- `query_emb`: Embedding of the query (e.g. gathered from DPR) +- `filters`: Optional filters to narrow down the search space. +Example: {"name": ["some", "more"], "category": ["only_one"]} +- `top_k`: How many documents to return +- `index`: (SQL) index name for storing the docs and metadata +- `return_embedding`: To return document embedding + +**Returns**: + + + + +#### save ```python - | @abstractmethod - | write_documents(documents: Union[List[dict], List[Document]], index: Optional[str] = None) + | save(file_path: Union[str, Path]) ``` -Indexes documents for later queries. +Save FAISS Index to the specified file. **Arguments**: -- `documents`: a list of Python dictionaries or a list of Haystack Document objects. 
-For documents as dictionaries, the format is {"text": ""}. -Optionally: Include meta data via {"text": "", -"meta":{"name": ", "author": "somebody", ...}} -It can be used for filtering and is accessible in the responses of the Finder. -- `index`: Optional name of index where the documents shall be written to. -If None, the DocumentStore's default index (self.index) will be used. +- `file_path`: Path to save to. **Returns**: None + +#### load + +```python + | @classmethod + | load(cls, faiss_file_path: Union[str, Path], sql_url: str, index_buffer_size: int = 10_000) +``` + +Load a saved FAISS index from a file and connect to the SQL database. +Note: In order to have a correct mapping from FAISS to SQL, +make sure to use the same SQL DB that you used when calling `save()`. + +**Arguments**: + +- `faiss_file_path`: Stored FAISS index file. Can be created via calling `save()` +- `sql_url`: Connection string to the SQL database that contains your docs and metadata. +- `index_buffer_size`: When working with large datasets, the ingestion process(FAISS + SQL) can be buffered in +smaller chunks to reduce memory footprint. + +**Returns**: + + + diff --git a/docs/_src/api/api/file_converter.md b/docs/_src/api/api/file_converter.md index 49571ffbdb..8c40b1e3e5 100644 --- a/docs/_src/api/api/file_converter.md +++ b/docs/_src/api/api/file_converter.md @@ -1,38 +1,8 @@ - -# pdf - - -## PDFToTextConverter - -```python -class PDFToTextConverter(BaseConverter) -``` - - -#### \_\_init\_\_ - -```python - | __init__(remove_numeric_tables: Optional[bool] = False, valid_languages: Optional[List[str]] = None) -``` - -**Arguments**: - -- `remove_numeric_tables`: This option uses heuristics to remove numeric rows from the tables. -The tabular structures in documents might be noise for the reader model if it -does not have table parsing capability for finding answers. However, tables -may also have long strings that could possible candidate for searching answers. 
-The rows containing strings are thus retained in this option. -- `valid_languages`: validate languages from a list of languages specified in the ISO 639-1 -(https://en.wikipedia.org/wiki/ISO_639-1) format. -This option can be used to add test for encoding errors. If the extracted text is -not one of the valid languages, then it might likely be encoding error resulting -in garbled text. - -# txt +# Module txt -## TextConverter +## TextConverter Objects ```python class TextConverter(BaseConverter) @@ -77,11 +47,36 @@ Reads text from a txt file and executes optional preprocessing steps. Dict of format {"text": "The text from file", "meta": meta}} + +# Module docx + + +## DocxToTextConverter Objects + +```python +class DocxToTextConverter(BaseConverter) +``` + + +#### convert + +```python + | convert(file_path: Path, meta: Optional[Dict[str, str]] = None) -> Dict[str, Any] +``` + +Extract text from a .docx file. +Note: As docx doesn't contain "page" information, we actually extract and return a list of paragraphs here. +For compliance with other converters we nevertheless opted for keeping the methods name. + +**Arguments**: + +- `file_path`: Path to the .docx file you want to convert + -# tika +# Module tika -## TikaConverter +## TikaConverter Objects ```python class TikaConverter(BaseConverter) @@ -123,36 +118,11 @@ in garbled text. a list of pages and the extracted meta data of the file. - -# docx - - -## DocxToTextConverter - -```python -class DocxToTextConverter(BaseConverter) -``` - - -#### convert - -```python - | convert(file_path: Path, meta: Optional[Dict[str, str]] = None) -> Dict[str, Any] -``` - -Extract text from a .docx file. -Note: As docx doesn't contain "page" information, we actually extract and return a list of paragraphs here. -For compliance with other converters we nevertheless opted for keeping the methods name. 
- -**Arguments**: - -- `file_path`: Path to the .docx file you want to convert - -# base +# Module base -## BaseConverter +## BaseConverter Objects ```python class BaseConverter() @@ -207,3 +177,33 @@ supplied meta data like author, url, external IDs can be supplied as a dictionar Validate if the language of the text is one of valid languages. + +# Module pdf + + +## PDFToTextConverter Objects + +```python +class PDFToTextConverter(BaseConverter) +``` + + +#### \_\_init\_\_ + +```python + | __init__(remove_numeric_tables: Optional[bool] = False, valid_languages: Optional[List[str]] = None) +``` + +**Arguments**: + +- `remove_numeric_tables`: This option uses heuristics to remove numeric rows from the tables. +The tabular structures in documents might be noise for the reader model if it +does not have table parsing capability for finding answers. However, tables +may also have long strings that could possible candidate for searching answers. +The rows containing strings are thus retained in this option. +- `valid_languages`: validate languages from a list of languages specified in the ISO 639-1 +(https://en.wikipedia.org/wiki/ISO_639-1) format. +This option can be used to add test for encoding errors. If the extracted text is +not one of the valid languages, then it might likely be encoding error resulting +in garbled text. + diff --git a/docs/_src/api/api/generator.md b/docs/_src/api/api/generator.md new file mode 100644 index 0000000000..7c72a4f5a6 --- /dev/null +++ b/docs/_src/api/api/generator.md @@ -0,0 +1,137 @@ + +# Module transformers + + +## RAGenerator Objects + +```python +class RAGenerator(BaseGenerator) +``` + +Implementation of Facebook's Retrieval-Augmented Generator (https://arxiv.org/abs/2005.11401) based on +HuggingFace's transformers (https://huggingface.co/transformers/model_doc/rag.html). + +Instead of "finding" the answer within a document, these models **generate** the answer. 
+In that sense, RAG follows a similar approach to GPT-3, but it comes with two huge advantages +for real-world applications: +a) it has a manageable model size +b) the answer generation is conditioned on retrieved documents, +i.e. the model can easily adjust to domain documents even after training has finished +(in contrast: GPT-3 relies on the web data seen during training) + +**Example** + +```python +> question = "who got the first nobel prize in physics?" + +# Retrieve related documents from retriever +> retrieved_docs = retriever.retrieve(query=question) + +> # Now generate answer from question and retrieved documents +> generator.predict( +> question=question, +> documents=retrieved_docs, +> top_k=1 +> ) +{'question': 'who got the first nobel prize in physics', + 'answers': + [{'question': 'who got the first nobel prize in physics', + 'answer': ' albert einstein', + 'meta': { 'doc_ids': [...], + 'doc_scores': [80.42758 ...], + 'doc_probabilities': [40.71379089355469, ... + 'texts': ['Albert Einstein was a ...] + 'titles': ['"Albert Einstein"', ...] + }}]} +``` + + +#### \_\_init\_\_ + +```python + | __init__(model_name_or_path: str = "facebook/rag-token-nq", retriever: Optional[DensePassageRetriever] = None, generator_type: RAGeneratorType = RAGeneratorType.TOKEN, top_k_answers: int = 2, max_length: int = 200, min_length: int = 2, num_beams: int = 2, embed_title: bool = True, prefix: Optional[str] = None, use_gpu: bool = True) +``` + +Load a RAG model from Transformers along with passage_embedding_model. +See https://huggingface.co/transformers/model_doc/rag.html for more details. + +**Arguments**: + +- `model_name_or_path`: Directory of a saved model or the name of a public model, e.g. +'facebook/rag-token-nq', 'facebook/rag-sequence-nq'. +See https://huggingface.co/models for the full list of available models. +- `retriever`: `DensePassageRetriever` used to embed passages +- `generator_type`: Which RAG generator implementation to use?
RAG-TOKEN or RAG-SEQUENCE +- `top_k_answers`: Number of independently generated texts to return +- `max_length`: Maximum length of generated text +- `min_length`: Minimum length of generated text +- `num_beams`: Number of beams for beam search. 1 means no beam search. +- `embed_title`: Whether to embed the title of the passage while generating the embedding +- `prefix`: The prefix used by the generator's tokenizer. +- `use_gpu`: Whether to use GPU (if available) + + +#### predict + +```python + | predict(question: str, documents: List[Document], top_k: Optional[int] = None) -> Dict +``` + +Generate the answer to the input question. The generation will be conditioned on the supplied documents. +These documents can, for example, be retrieved via the Retriever. + +**Arguments**: + +- `question`: Question +- `documents`: Related documents (e.g. coming from a retriever) that the answer shall be conditioned on. +- `top_k`: Number of returned answers + +**Returns**: + +Generated answers plus additional information in a dict like this: + +```python +> {'question': 'who got the first nobel prize in physics', +> 'answers': +> [{'question': 'who got the first nobel prize in physics', +> 'answer': ' albert einstein', +> 'meta': { 'doc_ids': [...], +> 'doc_scores': [80.42758 ...], +> 'doc_probabilities': [40.71379089355469, ... +> 'texts': ['Albert Einstein was a ...] +> 'titles': ['"Albert Einstein"', ...] +> }}]} +``` + + +# Module base + + +## BaseGenerator Objects + +```python +class BaseGenerator(ABC) +``` + +Abstract class for Generators + + +#### predict + +```python + | @abstractmethod + | predict(question: str, documents: List[Document], top_k: Optional[int]) -> Dict +``` + +Abstract method to generate answers. + +**Arguments**: + +- `question`: Question +- `documents`: Related documents (e.g. coming from a retriever) that the answer shall be conditioned on.
+- `top_k`: Number of returned answers + +**Returns**: + +Generated answers plus additional information in a dict + diff --git a/docs/_src/api/api/preprocessor.md b/docs/_src/api/api/preprocessor.md index fa7e3f23fe..04a9e46ae4 100644 --- a/docs/_src/api/api/preprocessor.md +++ b/docs/_src/api/api/preprocessor.md @@ -1,5 +1,44 @@ + +# Module preprocessor + + +## PreProcessor Objects + +```python +class PreProcessor(BasePreProcessor) +``` + + +#### \_\_init\_\_ + +```python + | __init__(clean_whitespace: Optional[bool] = True, clean_header_footer: Optional[bool] = False, clean_empty_lines: Optional[bool] = True, split_by: Optional[str] = "word", split_length: Optional[int] = 1000, split_stride: Optional[int] = None, split_respect_sentence_boundary: Optional[bool] = True) +``` + +**Arguments**: + +- `clean_header_footer`: Use heuristic to remove footers and headers across different pages by searching +for the longest common string. This heuristic uses exact matches and therefore +works well for footers like "Copyright 2019 by XXX", but won't detect "Page 3 of 4" +or similar. +- `clean_whitespace`: Strip whitespaces before or after each line in the text. +- `clean_empty_lines`: Remove more than two empty lines in the text. +- `split_by`: Unit for splitting the document. Can be "word", "sentence", or "passage". Set to None to disable splitting. +- `split_length`: Max. number of the above split unit (e.g. words) that are allowed in one document. For instance, if split_length -> 10 & split_by -> +"sentence", then each output document will have 10 sentences. +- `split_stride`: Length of striding window over the splits. For example, if split_by -> `word`, +split_length -> 5 & split_stride -> 2, then the splits would be like: +[w1 w2 w3 w4 w5, w4 w5 w6 w7 w8, w7 w8 w9 w10 w11]. +Set the value to None to disable striding behaviour. +- `split_respect_sentence_boundary`: Whether to split in partial sentences if split_by -> `word`.
If set +to True, the individual split will always have complete sentences & +the number of words will be <= split_length. + +# Module cleaning + -# utils +# Module utils #### eval\_data\_from\_file @@ -84,45 +123,6 @@ Fetch an archive (zip or tar.gz) from a url via http and extract content to an o bool if anything got fetched - -# preprocessor - - -## PreProcessor - -```python -class PreProcessor(BasePreProcessor) -``` - - -#### \_\_init\_\_ - -```python - | __init__(clean_whitespace: Optional[bool] = True, clean_header_footer: Optional[bool] = False, clean_empty_lines: Optional[bool] = True, split_by: Optional[str] = "word", split_length: Optional[int] = 1000, split_stride: Optional[int] = None, split_respect_sentence_boundary: Optional[bool] = True) -``` - -**Arguments**: - -- `clean_header_footer`: Use heuristic to remove footers and headers across different pages by searching -for the longest common string. This heuristic uses exact matches and therefore -works well for footers like "Copyright 2019 by XXX", but won't detect "Page 3 of 4" -or similar. -- `clean_whitespace`: Strip whitespaces before or after each line in the text. -- `clean_empty_lines`: Remove more than two empty lines in the text. -- `split_by`: Unit for splitting the document. Can be "word", "sentence", or "passage". Set to None to disable splitting. -- `split_length`: Max. number of the above split unit (e.g. words) that are allowed in one document. For instance, if split_length -> 10 & split_by -> -"sentence", then each output document will have 10 sentences. -- `split_stride`: Length of striding window over the splits. For example, if split_by -> `word`, -split_length -> 5 & split_stride -> 2, then the splits would be like: -[w1 w2 w3 w4 w5, w4 w5 w6 w7 w8, w7 w8 w9 w10 w11]. -Set the value to None to disable striding behaviour. -- `split_respect_sentence_boundary`: Whether to split in partial sentences if split_by -> `word`.
If set -to True, the individual split will always have complete sentences & -the number of words will be <= split_length. - -# base - - -# cleaning +# Module base diff --git a/docs/_src/api/api/pydoc-markdown-document-store.yml b/docs/_src/api/api/pydoc-markdown-document-store.yml index 7a52c82466..e359c52935 100644 --- a/docs/_src/api/api/pydoc-markdown-document-store.yml +++ b/docs/_src/api/api/pydoc-markdown-document-store.yml @@ -10,5 +10,8 @@ processor: - skip_empty_modules: true renderer: type: markdown - descriptive_class_title: false + descriptive_class_title: true + descriptive_module_title: true + add_method_class_prefix: false + add_member_class_prefix: false filename: document_store.md diff --git a/docs/_src/api/api/pydoc-markdown-file-converters.yml b/docs/_src/api/api/pydoc-markdown-file-converters.yml index 6e496ae250..2ec184ed61 100644 --- a/docs/_src/api/api/pydoc-markdown-file-converters.yml +++ b/docs/_src/api/api/pydoc-markdown-file-converters.yml @@ -10,5 +10,8 @@ processor: - skip_empty_modules: true renderer: type: markdown - descriptive_class_title: false + descriptive_class_title: true + descriptive_module_title: true + add_method_class_prefix: false + add_member_class_prefix: false filename: file_converter.md diff --git a/docs/_src/api/api/pydoc-markdown-generator.yml b/docs/_src/api/api/pydoc-markdown-generator.yml index bd2bca1441..8774bf8caa 100644 --- a/docs/_src/api/api/pydoc-markdown-generator.yml +++ b/docs/_src/api/api/pydoc-markdown-generator.yml @@ -10,5 +10,8 @@ processor: - skip_empty_modules: true renderer: type: markdown - descriptive_class_title: false + descriptive_class_title: true + descriptive_module_title: true + add_method_class_prefix: false + add_member_class_prefix: false filename: generator.md diff --git a/docs/_src/api/api/pydoc-markdown-preprocessor.yml b/docs/_src/api/api/pydoc-markdown-preprocessor.yml index 165bf1bd4c..973e6dd921 100644 --- a/docs/_src/api/api/pydoc-markdown-preprocessor.yml +++ 
b/docs/_src/api/api/pydoc-markdown-preprocessor.yml @@ -10,5 +10,8 @@ processor: - skip_empty_modules: true renderer: type: markdown - descriptive_class_title: false + descriptive_class_title: true + descriptive_module_title: true + add_method_class_prefix: false + add_member_class_prefix: false filename: preprocessor.md diff --git a/docs/_src/api/api/pydoc-markdown-reader.yml b/docs/_src/api/api/pydoc-markdown-reader.yml index b6683e8c79..5cc2163263 100644 --- a/docs/_src/api/api/pydoc-markdown-reader.yml +++ b/docs/_src/api/api/pydoc-markdown-reader.yml @@ -10,5 +10,8 @@ processor: - skip_empty_modules: true renderer: type: markdown - descriptive_class_title: false + descriptive_class_title: true + descriptive_module_title: true + add_method_class_prefix: false + add_member_class_prefix: false filename: reader.md diff --git a/docs/_src/api/api/pydoc-markdown-retriever.yml b/docs/_src/api/api/pydoc-markdown-retriever.yml index 995255a385..4ef2387d76 100644 --- a/docs/_src/api/api/pydoc-markdown-retriever.yml +++ b/docs/_src/api/api/pydoc-markdown-retriever.yml @@ -10,5 +10,8 @@ processor: - skip_empty_modules: true renderer: type: markdown - descriptive_class_title: false + descriptive_class_title: true + descriptive_module_title: true + add_method_class_prefix: false + add_member_class_prefix: false filename: retriever.md diff --git a/docs/_src/api/api/reader.md b/docs/_src/api/api/reader.md index a7774f4da0..821bc7689d 100644 --- a/docs/_src/api/api/reader.md +++ b/docs/_src/api/api/reader.md @@ -1,8 +1,8 @@ -# farm +# Module farm -## FARMReader +## FARMReader Objects ```python class FARMReader(BaseReader) @@ -279,10 +279,10 @@ float32 could still be be more performant. 
- `opset_version`: ONNX opset version -# transformers +# Module transformers -## TransformersReader +## TransformersReader Objects ```python class TransformersReader(BaseReader) @@ -368,5 +368,5 @@ Example: Dict containing question and answers -# base +# Module base diff --git a/docs/_src/api/api/retriever.md b/docs/_src/api/api/retriever.md index 28eb02aa3b..a63d5f722c 100644 --- a/docs/_src/api/api/retriever.md +++ b/docs/_src/api/api/retriever.md @@ -1,8 +1,8 @@ -# sparse +# Module sparse -## ElasticsearchRetriever +## ElasticsearchRetriever Objects ```python class ElasticsearchRetriever(BaseRetriever) @@ -52,7 +52,7 @@ self.retrieve(query="Why did the revenue increase?", ``` -## ElasticsearchFilterOnlyRetriever +## ElasticsearchFilterOnlyRetriever Objects ```python class ElasticsearchFilterOnlyRetriever(ElasticsearchRetriever) @@ -62,7 +62,7 @@ Naive "Retriever" that returns all documents that match the given filters. No im Helpful for benchmarking, testing and if you want to do QA on small documents without an "active" retriever. -## TfidfRetriever +## TfidfRetriever Objects ```python class TfidfRetriever(BaseRetriever) @@ -76,10 +76,10 @@ computations when text is passed on to a Reader for QA. It uses sklearn's TfidfVectorizer to compute a tf-idf matrix. -# dense +# Module dense -## DensePassageRetriever +## DensePassageRetriever Objects ```python class DensePassageRetriever(BaseRetriever) @@ -201,7 +201,7 @@ train a DensePassageRetrieval model - `passage_encoder_save_dir`: directory inside save_dir where passage_encoder model files are saved -## EmbeddingRetriever +## EmbeddingRetriever Objects ```python class EmbeddingRetriever(BaseRetriever) @@ -286,10 +286,10 @@ Create embeddings for a list of passages. 
For this Retriever type: The same as c Embeddings, one per input passage -# base +# Module base -## BaseRetriever +## BaseRetriever Objects ```python class BaseRetriever(ABC) @@ -330,7 +330,10 @@ position in the ranking of documents the correct document is. - "mrr": Mean of reciprocal rank. Rewards retrievers that give relevant documents a higher rank. Only considers the highest ranked relevant document. - "map": Mean of average precision for each question. Rewards retrievers that give relevant -documents a higher rank. Considers all retrieved relevant documents. (only with ``open_domain=False``) +documents a higher rank. Considers all retrieved relevant documents. If ``open_domain=True``, +average precision is normalized by the number of retrieved relevant documents per query. +If ``open_domain=False``, average precision is normalized by the number of all relevant documents +per query. **Arguments**: