Add support for Multi-Hop Dense Retrieval #2571

Merged
merged 11 commits into from
Jul 5, 2022
247 changes: 247 additions & 0 deletions docs/_src/api/api/retriever.md
@@ -1472,6 +1472,253 @@ Save the model to the given directory

- `save_dir` (`Union[Path, str]`): The directory where the model will be saved

<a id="dense.MultihopEmbeddingRetriever"></a>

## MultihopEmbeddingRetriever

```python
class MultihopEmbeddingRetriever(EmbeddingRetriever)
```

Retriever that applies iterative retrieval using a shared encoder for query and passage.
See original paper for more details:

Xiong, Wenhan, et al. (2020): "Answering complex open-domain questions with multi-hop dense retrieval"
(https://arxiv.org/abs/2009.12756)
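
To make the idea of iterative retrieval with a shared encoder concrete, here is a minimal sketch. It is not the code added in this PR: `toy_encode` is a hypothetical bag-of-words stand-in for the real embedding model, and adding only the best passage of each hop to the next query is an assumption made for illustration.

```python
import math
from collections import Counter
from typing import List


def toy_encode(text: str) -> Counter:
    """Hypothetical stand-in for the shared encoder: a bag-of-words count vector."""
    return Counter(text.lower().split())


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[token] * b[token] for token in a.keys() & b.keys())
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0


def multihop_retrieve(query: str, docs: List[str], num_iterations: int = 2, top_k: int = 2) -> List[str]:
    """Retrieve `top_k` passages, re-encoding the query with retrieved context on every hop."""
    doc_vecs = [toy_encode(doc) for doc in docs]
    context: List[str] = []
    ranked: List[str] = []
    for hop in range(num_iterations):
        # The same encoder embeds both the (expanded) query and the passages.
        query_vec = toy_encode(" ".join([query] + context))
        scores = [cosine(query_vec, vec) for vec in doc_vecs]
        # sorted() is stable, so passages with equal scores keep corpus order.
        order = sorted(range(len(docs)), key=lambda i: -scores[i])[:top_k]
        ranked = [docs[i] for i in order]
        # Assumption: only the best passage of this hop is carried into the next query.
        if hop < num_iterations - 1 and ranked:
            context.append(ranked[0])
    return ranked


corpus = [
    "Ada Lovelace worked with Charles Babbage",
    "Paris is the capital of France",
    "Charles Babbage designed the Analytical Engine",
]
print(multihop_retrieve("who did Ada Lovelace work with", corpus, num_iterations=2))
```

With a single hop, the passage about the Analytical Engine shares no words with the query. After the first hop adds the Babbage passage to the query, the second hop can reach it.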

<a id="dense.MultihopEmbeddingRetriever.__init__"></a>

#### MultihopEmbeddingRetriever.\_\_init\_\_

```python
def __init__(document_store: BaseDocumentStore, embedding_model: str, model_version: Optional[str] = None, num_iterations: int = 2, use_gpu: bool = True, batch_size: int = 32, max_seq_len: int = 512, model_format: str = "farm", pooling_strategy: str = "reduce_mean", emb_extraction_layer: int = -1, top_k: int = 10, progress_bar: bool = True, devices: Optional[List[Union[str, torch.device]]] = None, use_auth_token: Optional[Union[str, bool]] = None, scale_score: bool = True, embed_meta_fields: List[str] = [])
```

**Arguments**:

- `document_store`: An instance of DocumentStore from which to retrieve documents.
- `embedding_model`: Local path or name of model in Hugging Face's model hub such as ``'sentence-transformers/all-MiniLM-L6-v2'``
- `model_version`: The version of model to use from the HuggingFace model hub. Can be tag name, branch name, or commit hash.
- `num_iterations`: The number of times passages are retrieved, i.e., the number of hops. Defaults to 2.
- `use_gpu`: Whether to use all available GPUs or the CPU. Falls back on CPU if no GPU is available.
- `batch_size`: Number of documents to encode at once.
- `max_seq_len`: Maximum number of tokens for the document text. Longer sequences are truncated.
- `model_format`: Name of framework that was used for saving the model or model type. If no model_format is
provided, it will be inferred automatically from the model configuration files.
Options:

- ``'farm'`` (will use `_DefaultEmbeddingEncoder` as embedding encoder)
- ``'transformers'`` (will use `_DefaultEmbeddingEncoder` as embedding encoder)
- ``'sentence_transformers'`` (will use `_SentenceTransformersEmbeddingEncoder` as embedding encoder)
- ``'retribert'`` (will use `_RetribertEmbeddingEncoder` as embedding encoder)
- `pooling_strategy`: Strategy for combining the embeddings from the model (for farm / transformers models only).
Options:

- ``'cls_token'`` (sentence vector)
- ``'reduce_mean'`` (sentence vector)
- ``'reduce_max'`` (sentence vector)
- ``'per_token'`` (individual token vectors)
- `emb_extraction_layer`: Number of layer from which the embeddings shall be extracted (for farm / transformers models only).
Default: -1 (very last layer).
- `top_k`: How many documents to return per query.
- `progress_bar`: If true, displays a progress bar during embedding.
- `devices`: List of GPU (or CPU) devices to limit inference to, instead of using all available ones.
These strings will be converted into PyTorch devices, so use the string notation described here:
https://pytorch.org/docs/stable/tensor_attributes.html?highlight=torch%20device#torch.torch.device
(e.g. ["cuda:0"]). Note: As multi-GPU training is currently not implemented for EmbeddingRetriever,
training will only use the first device provided in this list.
- `use_auth_token`: API token used to download private models from Hugging Face. If this parameter is set to `True`,
the local token will be used, which must be previously created via `transformers-cli login`.
Additional information can be found here: https://huggingface.co/transformers/main_classes/model.html#transformers.PreTrainedModel.from_pretrained
- `scale_score`: Whether to scale the similarity score to the unit interval (range of [0,1]).
If true (default), similarity scores (e.g. cosine or dot_product), which naturally have a different value range, will be scaled to a range of [0,1], where 1 means extremely relevant.
Otherwise, raw similarity scores (e.g. cosine or dot_product) will be used.
- `embed_meta_fields`: Concatenate the provided meta fields and text passage / table to a text pair that is
then used to create the embedding.
This approach is also used in the TableTextRetriever paper and is likely to improve
performance if your titles contain meaningful information for retrieval
(topic, entities etc.).
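
The `pooling_strategy` options listed above can be illustrated with a small, library-free sketch. It assumes the model output is one vector per token with the `[CLS]` vector first, and it ignores padding masks. The function name `pool` is hypothetical and not part of the retriever's API.

```python
from typing import List, Union

Vector = List[float]


def pool(token_vectors: List[Vector], strategy: str = "reduce_mean") -> Union[Vector, List[Vector]]:
    """Combine per-token vectors into a sentence vector (or keep them per token)."""
    if not token_vectors:
        raise ValueError("token_vectors must not be empty")
    if strategy == "cls_token":
        # Assumption: the [CLS] token is the first token of the sequence.
        return list(token_vectors[0])
    if strategy == "reduce_mean":
        # Average each dimension over all tokens.
        return [sum(column) / len(token_vectors) for column in zip(*token_vectors)]
    if strategy == "reduce_max":
        # Take the maximum of each dimension over all tokens.
        return [max(column) for column in zip(*token_vectors)]
    if strategy == "per_token":
        # No pooling: one vector per token is returned.
        return [list(vector) for vector in token_vectors]
    raise ValueError(f"Unknown pooling strategy: {strategy}")


tokens = [[1.0, 4.0], [3.0, 0.0], [2.0, 2.0]]
for name in ("cls_token", "reduce_mean", "reduce_max", "per_token"):
    print(name, pool(tokens, name))
```

The first three strategies return a single sentence vector of the model's hidden size, while `per_token` keeps one vector per token.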

<a id="dense.MultihopEmbeddingRetriever.retrieve"></a>

#### MultihopEmbeddingRetriever.retrieve

```python
def retrieve(query: str, filters: Optional[Dict[str, Union[Dict, List, str, int, float, bool]]] = None, top_k: Optional[int] = None, index: str = None, headers: Optional[Dict[str, str]] = None, scale_score: bool = None) -> List[Document]
```

Scan through documents in DocumentStore and return a small number of documents
that are most relevant to the query.

**Arguments**:

- `query`: The query
- `filters`: Optional filters to narrow down the search space to documents whose metadata fulfill certain
conditions.
Filters are defined as nested dictionaries. The keys of the dictionaries can be a logical
operator (`"$and"`, `"$or"`, `"$not"`), a comparison operator (`"$eq"`, `"$in"`, `"$gt"`,
`"$gte"`, `"$lt"`, `"$lte"`) or a metadata field name.
Logical operator keys take a dictionary of metadata field names and/or logical operators as
value. Metadata field names take a dictionary of comparison operators as value. Comparison
operator keys take a single value or (in case of `"$in"`) a list of values as value.
If no logical operator is provided, `"$and"` is used as default operation. If no comparison
operator is provided, `"$eq"` (or `"$in"` if the comparison value is a list) is used as default
operation.

__Example__:
```python
filters = {
    "$and": {
        "type": {"$eq": "article"},
        "date": {"$gte": "2015-01-01", "$lt": "2021-01-01"},
        "rating": {"$gte": 3},
        "$or": {
            "genre": {"$in": ["economy", "politics"]},
            "publisher": {"$eq": "nytimes"}
        }
    }
}
# or simpler using default operators
filters = {
    "type": "article",
    "date": {"$gte": "2015-01-01", "$lt": "2021-01-01"},
    "rating": {"$gte": 3},
    "$or": {
        "genre": ["economy", "politics"],
        "publisher": "nytimes"
    }
}
```

To use the same logical operator multiple times on the same level, logical operators can
optionally take a list of dictionaries as value.

__Example__:
```python
filters = {
    "$or": [
        {
            "$and": {
                "Type": "News Paper",
                "Date": {
                    "$lt": "2019-01-01"
                }
            }
        },
        {
            "$and": {
                "Type": "Blog Post",
                "Date": {
                    "$gte": "2019-01-01"
                }
            }
        }
    ]
}
```
- `top_k`: How many documents to return per query.
- `index`: The name of the index in the DocumentStore from which to retrieve documents
- `scale_score`: Whether to scale the similarity score to the unit interval (range of [0,1]).
If true, similarity scores (e.g. cosine or dot_product), which naturally have a different value range, will be scaled to a range of [0,1], where 1 means extremely relevant.
Otherwise, raw similarity scores (e.g. cosine or dot_product) will be used.
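
The filter syntax described for `filters` can be read as a small recursive evaluation over a document's metadata. The sketch below is only an illustration of those rules, not the DocumentStore implementation. The helper names (`matches`, `branches`, `match_field`) are hypothetical. Treating `"$not"` over several conditions as "none of them holds" and treating a missing metadata field as a non-match are both assumptions.

```python
from typing import Any, Dict, List

# Comparison operators named in the text, mapped to plain Python checks.
COMPARATORS = {
    "$eq": lambda field, operand: field == operand,
    "$in": lambda field, operand: field in operand,
    "$gt": lambda field, operand: field > operand,
    "$gte": lambda field, operand: field >= operand,
    "$lt": lambda field, operand: field < operand,
    "$lte": lambda field, operand: field <= operand,
}


def match_field(field_value: Any, condition: Any) -> bool:
    """Check one metadata value against its condition."""
    if isinstance(condition, dict):
        # Several comparison operators on one field must all hold.
        return all(COMPARATORS[op](field_value, operand) for op, operand in condition.items())
    if isinstance(condition, list):
        # Default operator for a list value is "$in".
        return field_value in condition
    # Default operator for a single value is "$eq".
    return field_value == condition


def branches(meta: Dict[str, Any], value: Any) -> List[bool]:
    """Evaluate every condition below a logical operator separately."""
    if isinstance(value, list):
        return [matches(meta, item) for item in value]
    return [matches(meta, {key: cond}) for key, cond in value.items()]


def matches(meta: Dict[str, Any], filters: Dict[str, Any]) -> bool:
    """Return True if `meta` satisfies `filters`; sibling keys are combined with "$and"."""
    results = []
    for key, value in filters.items():
        if key == "$and":
            results.append(all(branches(meta, value)))
        elif key == "$or":
            results.append(any(branches(meta, value)))
        elif key == "$not":
            # Assumption: "$not" means that none of the nested conditions holds.
            results.append(not any(branches(meta, value)))
        elif key not in meta:
            # Assumption: a document without the field never matches.
            results.append(False)
        else:
            results.append(match_field(meta[key], value))
    return all(results)


meta = {"type": "article", "date": "2018-05-01", "rating": 4, "genre": "economy", "publisher": "guardian"}
simple = {
    "type": "article",
    "date": {"$gte": "2015-01-01", "$lt": "2021-01-01"},
    "rating": {"$gte": 3},
    "$or": {"genre": ["economy", "politics"], "publisher": "nytimes"},
}
print(matches(meta, simple))
```

Dates are compared as ISO-formatted strings here, which works only because that format sorts chronologically.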

<a id="dense.MultihopEmbeddingRetriever.retrieve_batch"></a>

#### MultihopEmbeddingRetriever.retrieve\_batch

```python
def retrieve_batch(queries: List[str], filters: Optional[
Union[
Dict[str, Union[Dict, List, str, int, float, bool]],
List[Dict[str, Union[Dict, List, str, int, float, bool]]],
]
] = None, top_k: Optional[int] = None, index: str = None, headers: Optional[Dict[str, str]] = None, batch_size: Optional[int] = None, scale_score: bool = None) -> List[List[Document]]
```

Scan through documents in DocumentStore and return a small number of documents
that are most relevant to the supplied queries.

If you supply a single query, a single list of Documents is returned. If you supply a list of queries, a list of
lists of Documents (one per query) is returned.

**Arguments**:

- `queries`: Single query string or list of queries.
- `filters`: Optional filters to narrow down the search space to documents whose metadata fulfill certain
conditions. Can be a single filter that will be applied to each query or a list of filters
(one filter per query).

Filters are defined as nested dictionaries. The keys of the dictionaries can be a logical
operator (`"$and"`, `"$or"`, `"$not"`), a comparison operator (`"$eq"`, `"$in"`, `"$gt"`,
`"$gte"`, `"$lt"`, `"$lte"`) or a metadata field name.
Logical operator keys take a dictionary of metadata field names and/or logical operators as
value. Metadata field names take a dictionary of comparison operators as value. Comparison
operator keys take a single value or (in case of `"$in"`) a list of values as value.
If no logical operator is provided, `"$and"` is used as default operation. If no comparison
operator is provided, `"$eq"` (or `"$in"` if the comparison value is a list) is used as default
operation.

__Example__:
```python
filters = {
    "$and": {
        "type": {"$eq": "article"},
        "date": {"$gte": "2015-01-01", "$lt": "2021-01-01"},
        "rating": {"$gte": 3},
        "$or": {
            "genre": {"$in": ["economy", "politics"]},
            "publisher": {"$eq": "nytimes"}
        }
    }
}
# or simpler using default operators
filters = {
    "type": "article",
    "date": {"$gte": "2015-01-01", "$lt": "2021-01-01"},
    "rating": {"$gte": 3},
    "$or": {
        "genre": ["economy", "politics"],
        "publisher": "nytimes"
    }
}
```

To use the same logical operator multiple times on the same level, logical operators can
optionally take a list of dictionaries as value.

__Example__:
```python
filters = {
    "$or": [
        {
            "$and": {
                "Type": "News Paper",
                "Date": {
                    "$lt": "2019-01-01"
                }
            }
        },
        {
            "$and": {
                "Type": "Blog Post",
                "Date": {
                    "$gte": "2019-01-01"
                }
            }
        }
    ]
}
```
- `top_k`: How many documents to return per query.
- `index`: The name of the index in the DocumentStore from which to retrieve documents
- `batch_size`: Number of queries to embed at a time.
- `scale_score`: Whether to scale the similarity score to the unit interval (range of [0,1]).
If true, similarity scores (e.g. cosine or dot_product), which naturally have a different
value range, will be scaled to a range of [0,1], where 1 means extremely relevant.
Otherwise, raw similarity scores (e.g. cosine or dot_product) will be used.
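
As a rough illustration of what `scale_score` does, the sketch below maps raw scores onto [0,1]. The exact formulas are an assumption made for this example and are not taken from this PR: cosine scores are shifted linearly from [-1, 1], and unbounded dot-product scores are squashed with a logistic function whose divisor of 100 is an arbitrary choice. `scale_to_unit_interval` is a hypothetical helper name.

```python
import math


def scale_to_unit_interval(score: float, similarity: str) -> float:
    """Map a raw similarity score to [0, 1], where 1 means extremely relevant."""
    if similarity == "cosine":
        # Cosine similarity lies in [-1, 1], so a linear shift is enough.
        return (score + 1) / 2
    if similarity == "dot_product":
        # Dot products are unbounded; a logistic curve keeps the ordering.
        # Assumption: the divisor 100 is an illustrative temperature.
        return 1 / (1 + math.exp(-score / 100))
    raise ValueError(f"Unsupported similarity: {similarity}")


for raw in (-1.0, 0.0, 1.0):
    print("cosine", raw, scale_to_unit_interval(raw, "cosine"))
print("dot_product", 250.0, scale_to_unit_interval(250.0, "dot_product"))
```

Both mappings are monotonic, so the ranking of documents is the same whether `scale_score` is on or off; only the reported score values change.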

<a id="text2sparql"></a>

# Module text2sparql
131 changes: 131 additions & 0 deletions haystack/json-schemas/haystack-pipeline-master.schema.json
@@ -124,6 +124,9 @@
{
  "$ref": "#/definitions/MarkdownConverterComponent"
},
{
  "$ref": "#/definitions/MultihopEmbeddingRetrieverComponent"
},
{
  "$ref": "#/definitions/PDFToTextConverterComponent"
},
@@ -3097,6 +3100,134 @@
  ],
  "additionalProperties": false
},
"MultihopEmbeddingRetrieverComponent": {
  "type": "object",
  "properties": {
    "name": {
      "title": "Name",
      "description": "Custom name for the component. Helpful for visualization and debugging.",
      "type": "string"
    },
    "type": {
      "title": "Type",
      "description": "Haystack Class name for the component.",
      "type": "string",
      "const": "MultihopEmbeddingRetriever"
    },
    "params": {
      "title": "Parameters",
      "type": "object",
      "properties": {
        "document_store": {
          "title": "Document Store",
          "type": "string"
        },
        "embedding_model": {
          "title": "Embedding Model",
          "type": "string"
        },
        "model_version": {
          "title": "Model Version",
          "type": "string"
        },
        "num_iterations": {
          "title": "Num Iterations",
          "default": 2,
          "type": "integer"
        },
        "use_gpu": {
          "title": "Use Gpu",
          "default": true,
          "type": "boolean"
        },
        "batch_size": {
          "title": "Batch Size",
          "default": 32,
          "type": "integer"
        },
        "max_seq_len": {
          "title": "Max Seq Len",
          "default": 512,
          "type": "integer"
        },
        "model_format": {
          "title": "Model Format",
          "default": "farm",
          "type": "string"
        },
        "pooling_strategy": {
          "title": "Pooling Strategy",
          "default": "reduce_mean",
          "type": "string"
        },
        "emb_extraction_layer": {
          "title": "Emb Extraction Layer",
          "default": -1,
          "type": "integer"
        },
        "top_k": {
          "title": "Top K",
          "default": 10,
          "type": "integer"
        },
        "progress_bar": {
          "title": "Progress Bar",
          "default": true,
          "type": "boolean"
        },
        "devices": {
          "title": "Devices",
          "type": "array",
          "items": {
            "anyOf": [
              {
                "type": "string"
              },
              {
                "type": "string"
              }
            ]
          }
        },
        "use_auth_token": {
          "title": "Use Auth Token",
          "anyOf": [
            {
              "type": "boolean"
            },
            {
              "type": "string"
            }
          ]
        },
        "scale_score": {
          "title": "Scale Score",
          "default": true,
          "type": "boolean"
        },
        "embed_meta_fields": {
          "title": "Embed Meta Fields",
          "default": [],
          "type": "array",
          "items": {
            "type": "string"
          }
        }
      },
      "required": [
        "document_store",
        "embedding_model"
      ],
      "additionalProperties": false,
      "description": "Each parameter can reference other components defined in the same YAML file."
    }
  },
  "required": [
    "type",
    "name"
  ],
  "additionalProperties": false
},
"PDFToTextConverterComponent": {
  "type": "object",
  "properties": {