Add RouteDocuments and JoinAnswers nodes #2256
Merged
Changes from 11 commits
19 commits
594101d Add SplitDocumentList and JoinAnswer nodes (bogdankostic)
840fede Update Documentation & Code Style (github-actions[bot])
598af88 Add tests + adapt tutorial (bogdankostic)
511f16e Merge remote-tracking branch 'origin/split_tables_and_texts' into spl… (bogdankostic)
e199546 Update Documentation & Code Style (github-actions[bot])
d24fb22 Remove branch from installation path in Tutorial (bogdankostic)
bf55469 Merge remote-tracking branch 'origin/split_tables_and_texts' into spl… (bogdankostic)
a56532c Merge branch 'master' into split_tables_and_texts (bogdankostic)
5674eff Update Documentation & Code Style (github-actions[bot])
48198b7 Fix typing (bogdankostic)
e25834e Merge remote-tracking branch 'origin/split_tables_and_texts' into spl… (bogdankostic)
665133e Update Documentation & Code Style (github-actions[bot])
867d5ef Change name of SplitDocumentList to RouteDocuments (bogdankostic)
4b4c6b0 Update Documentation & Code Style (github-actions[bot])
1842da3 Adapt tutorials to new name (bogdankostic)
13b0297 Add test for JoinAnswers (bogdankostic)
2dec1db Merge remote-tracking branch 'origin/split_tables_and_texts' into spl… (bogdankostic)
a6042b6 Update Documentation & Code Style (github-actions[bot])
2ad75f5 Adapt name of test for JoinAnswers node (bogdankostic)
@@ -1,2 +1,4 @@
from haystack.nodes.other.docs2answers import Docs2Answers
from haystack.nodes.other.join_docs import JoinDocuments
from haystack.nodes.other.split_documents import SplitDocumentList
from haystack.nodes.other.join_answers import JoinAnswers
@@ -0,0 +1,63 @@
from typing import Optional, List, Dict, Tuple

from haystack.schema import Answer
from haystack.nodes import BaseComponent


class JoinAnswers(BaseComponent):
    """
    A node to join `Answer`s produced by multiple `Reader` nodes.
    """

    def __init__(
        self, join_mode: str = "concatenate", weights: Optional[List[float]] = None, top_k_join: Optional[int] = None
    ):
        """
        :param join_mode: `"concatenate"` to combine `Answer`s from multiple `Reader`s. `"merge"` to aggregate scores
            of individual `Answer`s.
        :param weights: A node-wise list (length of list must be equal to the number of input nodes) of weights for
            adjusting `Answer` scores when using the `"merge"` join_mode. By default, equal weight is assigned to each
            `Reader` score. This parameter is not compatible with the `"concatenate"` join_mode.
        :param top_k_join: Limit `Answer`s to top_k based on the resulting scores of the join.
        """

        assert join_mode in ["concatenate", "merge"], f"JoinAnswers node does not support '{join_mode}' join_mode."
        assert not (
            weights is not None and join_mode == "concatenate"
        ), "Weights are not compatible with 'concatenate' join_mode"

        # Save init parameters to enable export of component config as YAML
        self.set_config(join_mode=join_mode, weights=weights, top_k_join=top_k_join)

        self.join_mode = join_mode
        self.weights = [float(i) / sum(weights) for i in weights] if weights else None
        self.top_k_join = top_k_join

    def run(self, inputs: List[Dict], top_k_join: Optional[int] = None) -> Tuple[Dict, str]:  # type: ignore
        reader_results = [inp["answers"] for inp in inputs]

        if self.join_mode == "concatenate":
            concatenated_answers = [answer for cur_reader_result in reader_results for answer in cur_reader_result]
            concatenated_answers = sorted(concatenated_answers, reverse=True)
            return {"answers": concatenated_answers, "labels": inputs[0].get("labels", None)}, "output_1"

        elif self.join_mode == "merge":
            merged_answers = self._merge_answers(reader_results)

            if not top_k_join:
                top_k_join = self.top_k_join if self.top_k_join is not None else len(merged_answers)
            merged_answers = merged_answers[:top_k_join]
            return {"answers": merged_answers, "labels": inputs[0].get("labels", None)}, "output_1"

        else:
            raise ValueError(f"Invalid join_mode: {self.join_mode}")

    def _merge_answers(self, reader_results: List[List[Answer]]) -> List[Answer]:
        weights = self.weights if self.weights else [1 / len(reader_results)] * len(reader_results)

        for result, weight in zip(reader_results, weights):
            for answer in result:
                if isinstance(answer.score, float):
                    answer.score *= weight

        return sorted([answer for cur_reader_result in reader_results for answer in cur_reader_result], reverse=True)
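To make the `"merge"` behaviour easier to follow, here is a minimal, self-contained sketch of the weighting step. It does not import Haystack: `FakeAnswer`, `normalize_weights` and `merge_answers` are hypothetical stand-ins written for illustration only, and unlike `_merge_answers` in the diff the sketch copies answers instead of rescaling their scores in place.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class FakeAnswer:
    """Hypothetical stand-in for haystack.schema.Answer (only the fields used here)."""
    answer: str
    score: float


def normalize_weights(weights: List[float]) -> List[float]:
    # Same idea as in __init__: scale the weights so that they sum to 1.
    total = sum(weights)
    return [float(w) / total for w in weights]


def merge_answers(
    reader_results: List[List[FakeAnswer]],
    weights: Optional[List[float]] = None,
    top_k_join: Optional[int] = None,
) -> List[FakeAnswer]:
    # Without explicit weights, every reader gets the same share.
    if weights:
        node_weights = normalize_weights(weights)
    else:
        node_weights = [1 / len(reader_results)] * len(reader_results)

    merged: List[FakeAnswer] = []
    for result, weight in zip(reader_results, node_weights):
        for ans in result:
            # Assumption of this sketch: build new objects rather than mutate the inputs.
            merged.append(FakeAnswer(ans.answer, ans.score * weight))

    # Highest weighted score first, then optionally truncate.
    merged.sort(key=lambda a: a.score, reverse=True)
    return merged if top_k_join is None else merged[:top_k_join]


text_reader = [FakeAnswer("Berlin", 0.9), FakeAnswer("Munich", 0.4)]
table_reader = [FakeAnswer("3.6 million", 0.8)]
merged = merge_answers([text_reader, table_reader], weights=[3, 1], top_k_join=2)
```

With weights `[3, 1]` the text reader's scores are multiplied by 0.75 and the table reader's by 0.25, so a high-scoring table answer can still rank below a weaker text answer.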
@@ -0,0 +1,70 @@
from typing import List, Tuple, Dict, Optional

from haystack.nodes.base import BaseComponent
from haystack.schema import Document


class SplitDocumentList(BaseComponent):
    """
    A node to split a list of `Document`s by `content_type` or by the values of a metadata field.
    """

    # By default (split_by == "content_type"), the node has two outgoing edges.
    outgoing_edges = 2

    def __init__(self, split_by: str = "content_type", metadata_values: Optional[List[str]] = None):
        """
        :param split_by: Field to split the documents by. Either `"content_type"` or a metadata field name.
[Inline review comment on the line above: "by. Either" should become "by either"]
            If this parameter is set to `"content_type"`, the list of `Document`s will be split into a list containing
            only `Document`s of type `"text"` (will be routed to `"output_1"`) and a list containing only `Document`s of
            type `"table"` (will be routed to `"output_2"`).
            If this parameter is set to a metadata field name, you need to specify the parameter `metadata_values` as
            well.
        :param metadata_values: If the parameter `split_by` is set to a metadata field name, you need to provide a list
            of values to group the `Document`s to. `Document`s whose metadata field is equal to the first value of the
            provided list will be routed to `"output_1"`, `Document`s whose metadata field is equal to the second
            value of the provided list will be routed to `"output_2"`, etc.
        """

        assert split_by == "content_type" or metadata_values is not None, (
            "If split_by is set to the name of a metadata field, you must provide metadata_values "
            "to group the documents to."
        )

        # Save init parameters to enable export of component config as YAML
        self.set_config(split_by=split_by, metadata_values=metadata_values)

        self.split_by = split_by
        self.metadata_values = metadata_values

        # If we split list of Documents by a metadata field, number of outgoing edges might change
        if split_by != "content_type" and metadata_values is not None:
            self.outgoing_edges = len(metadata_values)

    def run(self, documents: List[Document]) -> Tuple[Dict, str]:  # type: ignore
        if self.split_by == "content_type":
            split_documents: Dict[str, List[Document]] = {"output_1": [], "output_2": []}

            for doc in documents:
                if doc.content_type == "text":
                    split_documents["output_1"].append(doc)
                elif doc.content_type == "table":
                    split_documents["output_2"].append(doc)

        else:
            assert isinstance(self.metadata_values, list), "You need to provide metadata_values if you want to split" \
                                                           " a list of Documents by a metadata field."
            split_documents = {f"output_{i+1}": [] for i in range(len(self.metadata_values))}
            for doc in documents:
                current_metadata_value = doc.meta.get(self.split_by, None)
                # Disregard current document if it does not contain the provided metadata field
                if current_metadata_value is not None:
                    try:
                        index = self.metadata_values.index(current_metadata_value)
                    except ValueError:
                        # Disregard current document if current_metadata_value is not in the provided metadata_values
                        continue

                    split_documents[f"output_{index+1}"].append(doc)

        return split_documents, "split_documents"
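The metadata branch of `run` can be illustrated with a small stand-alone sketch. It uses plain dictionaries instead of Haystack `Document` objects, and `route_by_metadata` is a hypothetical helper name chosen for this example; it is meant to mirror the routing rules above (position in `metadata_values` decides the output edge, documents without a matching value are dropped), not to replace the node.

```python
from typing import Any, Dict, List


def route_by_metadata(
    documents: List[Dict[str, Any]], split_by: str, metadata_values: List[str]
) -> Dict[str, List[Dict[str, Any]]]:
    # One output edge per metadata value: the first value maps to "output_1", and so on.
    routed: Dict[str, List[Dict[str, Any]]] = {
        f"output_{i + 1}": [] for i in range(len(metadata_values))
    }
    for doc in documents:
        value = doc.get("meta", {}).get(split_by)
        # Skip documents that lack the field or carry a value that was not listed.
        if value is None or value not in metadata_values:
            continue
        routed[f"output_{metadata_values.index(value) + 1}"].append(doc)
    return routed


docs = [
    {"content": "a", "meta": {"lang": "en"}},
    {"content": "b", "meta": {"lang": "de"}},
    {"content": "c", "meta": {"lang": "fr"}},  # value not listed, so it is dropped
    {"content": "d", "meta": {}},              # field missing, so it is dropped
]
routed = route_by_metadata(docs, split_by="lang", metadata_values=["en", "de"])
```

Note that documents which match none of the listed values are silently discarded rather than sent to a fallback edge, which matches the `continue` in the diff.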
Seeing the different names of the other nodes, I am wondering whether we could have a more consistent naming scheme. Unfortunately, I don't have an alternative for `SplitDocumentList` in mind. Maybe we can briefly talk about it.

What about `RouteDocuments`? Similar to `JoinDocuments`, and in theory there could later be a `RouteAnswers` node.

`DocumentRouter` would be more consistent with the other nodes (TableReader, Summarizer, Retriever), but then I am not immediately convinced by `DocumentJoiner` and `AnswerJoiner`.

`RouteDocuments` it is :)