bug: JoinDocuments nodes produce incorrect results if preceded by another JoinDocuments node #3170
Conversation
@JeffRisberg apologies for the latency here. This is on my radar and I'll come back to you in a couple of days.
Hello @JeffRisberg, thank you for this tricky fix. Much appreciated! I tested out your change and it seems safe to me. However, could you add some tests? They should be added to
@ZanSara I have added a test case as requested, and all PR checks ran successfully in the last 24-48 hours. Is there anything else you need from me?
Hey @JeffRisberg! Thanks for the ping, I lost sight of this PR. I'll review it shortly and be back with some feedback.
Looks good! Thank you and sorry again for the delay
Related Issues
Proposed Changes:
There is a subtle bug in the pipeline execution code when a pipeline includes a JoinDocuments node followed by another JoinDocuments node.
We have a pipeline that has four retrievers. Their outputs are joined in pairs by two JoinDocuments nodes, followed by another JoinDocuments node that uses the results of the prior joins.
However, not all results from the retrievers are processed and returned by the final JoinDocuments node. Documents are lost.
The pipeline is built correctly, because all nodes are connected correctly in the DiGraph of class Pipeline.
However, the code at line 526 of pipelines/base.py builds up a list of inputs. It assumes that the parameters dict does not have a key called "inputs" for the new node.
When a join node is called, though, it does have a parameter key called "inputs".
This value is returned in the output when the node executes.
Hence the second join node in the chain receives inputs that include the inputs of the prior node.
As a result, the number of inputs is not equal to the number of weights in the join, and the documents are not joined together correctly.
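The mechanism described above can be reproduced with a toy model. This is a minimal sketch, not Haystack's actual code: `enqueue`, `queue`, the node name `final_join`, and the dict shapes are hypothetical stand-ins for the input-collection logic the description refers to.

```python
from typing import Any, Dict


def enqueue(queue: Dict[str, Dict[str, Any]], node: str, node_output: Dict[str, Any]) -> None:
    """Toy version of collecting upstream outputs for a downstream node (hypothetical)."""
    if node not in queue:
        # The first upstream output to arrive is stored as-is.
        queue[node] = node_output
        return
    existing = queue[node]
    if "inputs" not in existing:
        # Expected path: wrap both upstream outputs in a fresh list.
        queue[node] = {"inputs": [existing, node_output]}
    else:
        # Path taken when the stored output already carries an "inputs" key.
        existing["inputs"].append(node_output)


# Outputs of two upstream join nodes, each still carrying the "inputs" it received.
join_a_out = {"documents": ["d1", "d2"], "inputs": [{"documents": ["d1"]}, {"documents": ["d2"]}]}
join_b_out = {"documents": ["d3", "d4"], "inputs": [{"documents": ["d3"]}, {"documents": ["d4"]}]}

queue: Dict[str, Dict[str, Any]] = {}
enqueue(queue, "final_join", join_a_out)
enqueue(queue, "final_join", join_b_out)

# The final join was meant to see 2 inputs (one per upstream join).
leaked_input_count = len(queue["final_join"]["inputs"])
print(leaked_input_count)  # 3
```

In this sketch the final join ends up with three inputs (the two retriever outputs that leaked through the first join, plus the second join's output), so a join configured with two weights no longer lines up with its inputs.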
How did you test it?
There is a test located at https://github.com/JeffRisberg/HaystackPipelineTest
Notes for the reviewer
I determined this by putting a breakpoint into the run() method of the JoinNode class, and checking that the inputs are correct.
The solution is at line 258 in nodes/base.py
        # add "extra" args that were not used by the node and are not inputs
        for k, v in arguments.items():
            if k not in output.keys() and k != "inputs":
                output[k] = v
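To illustrate the effect of the extra `k != "inputs"` condition, here is a small self-contained sketch. `merge_extra_args` and its `skip_inputs` flag are hypothetical names that wrap the loop above for demonstration; they are not functions in Haystack.

```python
from typing import Any, Dict


def merge_extra_args(output: Dict[str, Any], arguments: Dict[str, Any], skip_inputs: bool) -> Dict[str, Any]:
    """Copy unused arguments into a node's output (hypothetical helper).

    With skip_inputs=True this mirrors the proposed condition: the "inputs"
    key is never forwarded to the next node.
    """
    merged = dict(output)
    for k, v in arguments.items():
        if k in merged:
            continue  # the node already produced this key
        if skip_inputs and k == "inputs":
            continue  # proposed fix: do not pass "inputs" downstream
        merged[k] = v
    return merged


arguments = {"inputs": [{"documents": ["d1"]}, {"documents": ["d2"]}], "query": "q"}
output = {"documents": ["d1", "d2"]}

before_fix = merge_extra_args(output, arguments, skip_inputs=False)
after_fix = merge_extra_args(output, arguments, skip_inputs=True)

print("inputs" in before_fix)  # True
print("inputs" in after_fix)   # False
print(after_fix["query"])      # q
```

Other unused arguments, such as `query` in this example, are still forwarded; only the `inputs` key is held back.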
Checklist