-
Notifications
You must be signed in to change notification settings - Fork 2k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Add RouteDocuments
and JoinAnswers
nodes
#2256
Conversation
Check out this pull request on See visual diffs & provide feedback on Jupyter Notebooks. Powered by ReviewNB |
…it_tables_and_texts # Conflicts: # tutorials/Tutorial15_TableQA.ipynb
A JoinAnswers node would also be helpful for the use case described in this issue #1081 on combining FAQ and ExtractiveQA in a pipeline by @SasikiranJ |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Looks quite good to me. I am just missing a test case for the JoinAnswers
node, I saw some typing errors in the CI and I would like to briefly talk about the names of the new nodes with you. Feel free to ping me anytime.
@@ -59,6 +59,9 @@ | |||
{ | |||
"$ref": "#/definitions/ImageToTextConverterComponent" | |||
}, | |||
{ | |||
"$ref": "#/definitions/JoinAnswersComponent" |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
As the update of the Haystack documentation website isn't completed yet, we haven't upgraded the Haystack version to 1.2.1rc0 yet. Before merging your PR, we need to make sure that json-schemas/haystack-pipeline-1.2.0.schema.json
is unchanged and a json-schemas/haystack-pipeline-1.2.1rc0.schema.json
is created.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@brandenchan we'll merge this PR now and then we'll need to correct the schema files once the website updates for the v1.2.0 are done.
test/test_pipeline.py
Outdated
@@ -1041,6 +1043,35 @@ def test_documentsearch_document_store_authentication(retriever_with_docs, docum | |||
assert kwargs["headers"] == auth_headers | |||
|
|||
|
|||
def test_split_document_list_content_type(test_docs_xs): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We should add a test case to check both join modes of the JoinAnswers
node.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Added a test case that covers both join modes.
{ | ||
"output_type": "stream", | ||
"name": "stdout", | ||
"text": [ |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is it on purpose to include the outputs in the tutorial?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes, the outputs help the users to understand the output without having to run the tutorial.
haystack/nodes/other/__init__.py
Outdated
@@ -1,2 +1,4 @@ | |||
from haystack.nodes.other.docs2answers import Docs2Answers | |||
from haystack.nodes.other.join_docs import JoinDocuments | |||
from haystack.nodes.other.split_documents import SplitDocumentList |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Seeing the different names of the other nodes, I am wondering whether we could have a more consistent naming scheme. Unfortunately, I don't have an alternative for SplitDocumentList
in mind. Maybe we can briefly talk about it.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
What about RouteDocuments
? Similar to JoinDocuments
and in theory there could later be a RouteAnswers
node.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
DocumentRouter
would be more consistent with the other nodes (TableReader, Summarizer, Retriever) but then I am not immediately convinced by DocumentJoiner
and AnswerJoiner
.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
RouteDocuments
it is :)
|
||
def __init__(self, split_by: str = "content_type", metadata_values: Optional[List[str]] = None): | ||
""" | ||
:param split_by: Field to split the documents by. Either `"content_type"` or a metadata field name. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
"by. Either" should become "by either"
…it_tables_and_texts
SplitDocumentList
and JoinAnswers
nodesRouteDocuments
and JoinAnswers
nodes
test/test_pipeline.py
Outdated
@@ -1072,6 +1074,20 @@ def test_split_document_list_content_type(test_docs_xs): | |||
assert result["output_3"][0].meta["meta_field"] == "test5" | |||
|
|||
|
|||
@pytest.mark.parametrize("join_mode", ["concatenate", "merge"]) | |||
def test_join_answers_concatenate(join_mode): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
test_join_answers_concatenate
is a little bit misleading as you test for "concatenate" and "merge".
This PR adds two nodes: a
RouteDocuments
node and aJoinAnswers
node. Both nodes are needed in order to do QA on the combination of text and tables as the source of information.The
RouteDocuments
takes as input a list of Documents and splits them by eithercontent_type
or a metadata field and routes the resulting splits to different outputs. An alternative to this node would have been to allow both texts and tables for the readers, but make theFARMReader
skip Documents of type tables and theTableReader
skip Documents of type text. However, having a designated node makes this process more explicit and allows splitting not only bycontent_type
, but also other metadata values. (For example, for routing Documents to Readers that are trained on a specific domain or on a specific language)The
JoinAnswers
node takes as input the predicted Answers of two individual Reader nodes and joins them to a single list of Answers.This graph shows how a Pipeline allowing QA on both text and tables would look like: