Prevent losing names of utilized components when loaded from config #2525

tstadel · 2022-05-10T14:01:31Z

Currently there is a bug such that utilized components (e.g. a document store within a retriever) lose their names defined via pipeline_config.

Proposed changes:

Prevent losing names of utilized components when loaded from config

Status (please check what you already did):

First draft (up for discussions & feedback)
Final code
Added tests

ZanSara · 2022-05-12T07:09:20Z

Changes look good to me! Ping me when the tests are green. At first glance they seem to be failing because your code works as expected; maybe it is the tests that are slightly faulty.

Thanks for the fix btw!

tstadel · 2022-05-17T19:18:02Z

@ZanSara tests are green now.
However, I had to do a bit more to make it work smoothly: for example, indexing pipelines that naturally add the same document store instance twice to the component definitions (first as part of the retriever, then as a node of its own) weren't working anymore.
To make that work, the pipeline now knows (through pipeline.components) which components are part of it even if they are not nodes. We use that information to check whether the very same component instance is already part of the pipeline and can safely be used as a node under the same name. This does not apply to the _add_node_to_pipeline_graph logic, of course: there, adding the same instance twice still fails as intended.
Thus the indexing pipeline use case works again, without allowing the very same component instance to be added twice to the pipeline graph.
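
The check described above can be sketched roughly as follows. This is only an illustration of the idea, not the code from this PR: `ToyComponent`, `ToyPipeline`, the `utilized` attribute, and this `add_node` are hypothetical stand-ins, and the sketch compares instances with `is not` where the diff uses `!=`.

```python
class ToyComponent:
    """Hypothetical stand-in for a pipeline component."""

    def __init__(self, name, utilized=()):
        self.name = name
        self.utilized = list(utilized)  # components this one uses internally


class ToyPipeline:
    """Hypothetical stand-in; not the Haystack Pipeline class."""

    def __init__(self):
        self.nodes = {}  # name -> component placed in the graph

    @property
    def components(self):
        # Nodes plus everything they utilize, keyed by name.
        found = {}
        stack = list(self.nodes.values())
        while stack:
            component = stack.pop()
            if component.name not in found:
                found[component.name] = component
                stack.extend(component.utilized)
        return found

    def add_node(self, component, name):
        known = self.components
        # A name already taken by a *different* instance is a conflict.
        if name in known and known[name] is not component:
            raise ValueError(f"Name '{name}' is already used by another component.")
        # The same instance must not be placed in the graph twice.
        if any(node is component for node in self.nodes.values()):
            raise ValueError(f"Component '{name}' is already a node.")
        component.name = name
        self.nodes[name] = component


store = ToyComponent("DocumentStore")
retriever = ToyComponent("Retriever", utilized=[store])

pipeline = ToyPipeline()
pipeline.add_node(retriever, "Retriever")
# Accepted: the store is already known via the retriever, same instance, same name.
pipeline.add_node(store, "DocumentStore")
```

In this sketch a second `pipeline.add_node(store, "DocumentStore")` raises, as does adding a fresh component under the name "DocumentStore".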

ZanSara

Looking good! Left a few minor comments but it's already good to go. Happy to see this could be done with such a small diff 😊

ZanSara · 2022-05-18T09:42:15Z

haystack/pipelines/utils.py

-        declarations[name] = f"{variable_name} = {class_name}({init_args})"
+        declaration = f"{variable_name} = {class_name}({init_args})"
+        # set name of subcomponents explicitly if it's not the default name as it won't be set via Pipeline.add_node()
+        if name != class_name and name not in (node["name"] for node in pipeline_definition["nodes"]):


Just a curiosity: I see you're using a tuple comprehension here. Are tuples faster in loops, or is it just style? If they're faster I'll start to use them too... 😁

It's not a tuple here: it's a generator expression. So technically we save the overhead of creating a list object, though I doubt it has a real performance impact unless it runs a few 10k times. It should also save a little memory and perhaps some CPU time, since the garbage collector has less to do. But that's really minor; a list would be fine as well. Since it is only used once, a generator is my choice here, but I would call that a matter of taste.
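
To make the distinction concrete, here is a small standalone sketch. The `nodes` list is made-up sample data, not taken from a real pipeline definition.

```python
# Made-up sample data shaped like pipeline_definition["nodes"].
nodes = [{"name": "Retriever"}, {"name": "Reader"}]

# Parentheses around a comprehension create a generator, not a tuple.
names = (node["name"] for node in nodes)
# A tuple needs an explicit tuple(...) call.
as_tuple = tuple(node["name"] for node in nodes)

generator_type = type(names).__name__
tuple_type = type(as_tuple).__name__

# `in` on a generator consumes items lazily and stops at the first match,
# so a generator can only be searched once.
first_check = "Retriever" in names   # True, consumes the first item
second_check = "Retriever" in names  # False, only "Reader" was left
```

The single-use behaviour is why a generator fits here: the membership test in the diff runs exactly once per generator.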

ZanSara · 2022-05-18T09:47:41Z

haystack/pipelines/base.py

+        # E.g. for indexing pipelines it's common to add a retriever first and a document store afterwards.
+        # The document store is already being used by the retriever however.
+        # Thus the very same document store will be added twice, first as a subcomponent of the retriever and second as a first level node.
+        if name in self.components and self.components[name] != component:


Implicit iteration over dict keys makes self.components look like a list. Let's change it to if name in self.components.keys()
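
For reference, both spellings perform the same key lookup, so the suggestion is about readability rather than behaviour. A tiny sketch with a made-up dictionary:

```python
# Made-up stand-in for self.components (name -> component instance).
components = {"DocumentStore": object(), "Retriever": object()}

implicit = "Retriever" in components            # tests the keys
explicit = "Retriever" in components.keys()     # same test, spelled out
value_hit = "Retriever" in components.values()  # values are a different question
```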

ZanSara · 2022-05-18T09:50:16Z

haystack/pipelines/base.py

+        component_definition = {"params": component.get_params(), "type": component.type}
+        component_definitions[name] = component_definition



I'm not sure I get why we're "rebuilding" the _component_config here. The original code had component_definitions[name] = component._component_config in place of these two lines. Isn't that the same?

You're right, I originally got confused and thought the name was included in the component_definition, too. But that's not the case so component._component_config really is the same.

ZanSara · 2022-05-18T09:56:05Z

haystack/pipelines/base.py

+        all_components = self._find_all_components(node_components)
+        return {component.name: component for component in all_components if component.name is not None}
+
+    def _find_all_components(self, components: List[BaseComponent]) -> Set[BaseComponent]:


I would change the name of this function. Something like _discover_utilized_components (or similar), and move the loop in the caller. Not a big deal, but I would expect a function called _find_all_components to ask for no parameters, not for a nearly complete list of them 😄

Moving the loop would make it non-recursive, but I have an idea for getting rid of the required components param.
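
The thread does not show the final shape of this helper. Purely as an assumption, a recursive discovery that visits each instance once could look like the sketch below; `ToyComponent`, `utilized_components`, and `discover_utilized_components` are hypothetical names used for illustration only.

```python
from typing import List, Optional, Set


class ToyComponent:
    """Hypothetical stand-in for a component that may use other components."""

    def __init__(self, name, utilized=()):
        self.name = name
        self.utilized_components = list(utilized)  # hypothetical attribute


def discover_utilized_components(
    roots: List[ToyComponent], seen: Optional[Set[int]] = None
) -> List[ToyComponent]:
    """Return the roots plus everything they utilize, each instance once."""
    if seen is None:
        seen = set()
    found = []
    for component in roots:
        if id(component) in seen:
            continue  # this instance was already reached via another path
        seen.add(id(component))
        found.append(component)
        found.extend(discover_utilized_components(component.utilized_components, seen))
    return found


store = ToyComponent("DocumentStore")
retriever = ToyComponent("Retriever", utilized=[store])

# Same dict comprehension shape as in the diff above.
components = {
    component.name: component
    for component in discover_utilized_components([retriever, store])
    if component.name is not None
}
```

Tracking instances by `id` keeps the store from being listed twice even though it is both a root and a dependency of the retriever.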

Prevent losing names of utilized components when loaded from config

4ed8ee3

tstadel requested review from julian-risch and ZanSara May 10, 2022 14:01

github-actions bot and others added 2 commits May 10, 2022 14:06

Update Documentation & Code Style

bd3d44c

update test

d765b2d

julian-risch added the topic:pipeline label May 11, 2022

tstadel and others added 5 commits May 17, 2022 16:57

fix failing tests

d57a887

Update Documentation & Code Style

3ee7d08

Merge branch 'master' into fix_lost_subcomponent_names

a67e5a2

fix even more tests

48f10be

Update Documentation & Code Style

70ae14d

ZanSara approved these changes May 18, 2022

incorporate review feedback

cbb8d3a

tstadel added the type:bug and topic:save/load labels May 18, 2022

tstadel merged commit f6e3a63 into master May 18, 2022

tstadel deleted the fix_lost_subcomponent_names branch May 18, 2022 12:17
