-
Notifications
You must be signed in to change notification settings - Fork 2k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Prevent losing names of utilized components when loaded from config #2525
Conversation
Changes look good to me! Ping me when the tests are green. At a first glance it seems they're failing because your code is working as expected, maybe is the tests that are slightly faulty. Thanks for the fix btw! |
@ZanSara tests are green now. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Looking good! Left a few minor comments but it's already good to go. Happy to see this could be done with such a small diff 😊
declarations[name] = f"{variable_name} = {class_name}({init_args})" | ||
declaration = f"{variable_name} = {class_name}({init_args})" | ||
# set name of subcomponents explicitly if it's not the default name as it won't be set via Pipeline.add_node() | ||
if name != class_name and name not in (node["name"] for node in pipeline_definition["nodes"]): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Just a curiosity: I see you're using a tuple comprehension here. Are tuples faster in loops, or is just style? If they're faster I'll start to use them too... 😁
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It's not a tuple here: it's a generator. So technically we should save the overhead to create a list object. But I doubt it really has a performance impact unless used a few 10k times. It should however save a little memory and in the end maybe also save some cpu time as the garbage collector has less to do. But that's really minor, a list would be good as well. As it is also just used once a generator is my choice here, but I would even call that flavour.
haystack/pipelines/base.py
Outdated
# E.g. for indexing pipelines it's common to add a retriever first and a document store afterwards. | ||
# The document store is already being used by the retriever however. | ||
# Thus the very same document store will be added twice, first as a subcomponent of the retriever and second as a first level node. | ||
if name in self.components and self.components[name] != component: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Implicit iteration on dict keys make self.components
look like a list. Let's change it to if name in self.components.keys()
haystack/pipelines/base.py
Outdated
component_definition = {"params": component.get_params(), "type": component.type} | ||
component_definitions[name] = component_definition | ||
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm not sure I get why we're "rebuilding" the _component_config
here. In the original code, these two lines were replaced by component_definitions[name] = component._component_config
. Isn't that the same?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
You're right, I originally got confused and thought the name was included in the component_definition, too. But that's not the case so component._component_config
really is the same.
haystack/pipelines/base.py
Outdated
all_components = self._find_all_components(node_components) | ||
return {component.name: component for component in all_components if component.name is not None} | ||
|
||
def _find_all_components(self, components: List[BaseComponent]) -> Set[BaseComponent]: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I would change the name of this function. Something like _discover_utilized_components
(or similar), and move the loop in the caller. Not a big deal, but I would expect a function called _find_all_components
to ask for no parameters, not for a nearly complete list of them 😄
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Moving the loop would make it non-recursive, but I got an idea to get rid of the required components param.
Currently there is a bug such that utilized components (e.g. a document store within a retriever) lose their names defined via pipeline_config.
Proposed changes:
Status (please check what you already did):