Prevent losing names of utilized components when loaded from config #2525

Merged
merged 9 commits into from
May 18, 2022

tstadel
Member

@tstadel tstadel commented May 10, 2022

Currently there is a bug that causes utilized components (e.g. a document store used by a retriever) to lose the names defined via pipeline_config when the pipeline is loaded from config.

Proposed changes:

  • Prevent losing names of utilized components when loaded from config

Status (please check what you already did):

  • First draft (up for discussions & feedback)
  • Final code
  • Added tests

@tstadel tstadel requested review from julian-risch and ZanSara May 10, 2022 14:01
@ZanSara
Contributor

ZanSara commented May 12, 2022

Changes look good to me! Ping me when the tests are green. At first glance it seems they're failing because your code is working as expected; maybe it's the tests that are slightly faulty.

Thanks for the fix btw!

@tstadel
Member Author

tstadel commented May 17, 2022

@ZanSara tests are green now.
However, I had to do a bit more to make it work smoothly. For example, indexing pipelines that naturally add the same document store instance twice to the component definitions (first as part of the retriever, then as a node in its own right) weren't working any more.
To make that work, the pipeline now knows (through pipeline.components) which components are part of it even if they are not nodes. We use that information to check whether the very same instance of a component is already part of the pipeline and can therefore safely be used as a node under the same name. This of course does not apply to the _add_node_to_pipeline_graph logic: there, adding the same instance twice still fails as intended.
Thus the indexing pipeline use case works again without allowing the very same component instance to be added twice to the pipeline graph.
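
To illustrate the check described above, here is a minimal sketch. It is not the code from this PR: `Component`, `MiniPipeline`, the `uses` attribute and the error messages are hypothetical stand-ins, and the sketch assumes that "the very same instance" means an identity comparison.

```python
class Component:
    """Hypothetical stand-in for a pipeline component (not Haystack's BaseComponent)."""

    def __init__(self, name=None, uses=()):
        self.name = name
        self.uses = list(uses)  # components this one utilizes, e.g. a document store


class MiniPipeline:
    """Toy pipeline that tracks its nodes and the components those nodes utilize."""

    def __init__(self):
        self.nodes = {}       # name -> component that is a node in the graph
        self.components = {}  # name -> every known named component, node or not

    def _register(self, component):
        # Record the component and, recursively, everything it utilizes.
        if component.name is not None:
            self.components[component.name] = component
        for used in component.uses:
            self._register(used)

    def add_node(self, component, name):
        # A known name may be reused only for the very same instance.
        known = self.components.get(name)
        if known is not None and known is not component:
            raise ValueError(f"A different component named '{name}' already exists.")
        # The same instance must not become a graph node twice.
        if any(node is component for node in self.nodes.values()):
            raise ValueError(f"Component '{name}' is already a node.")
        component.name = name
        self.nodes[name] = component
        self._register(component)


# Indexing-style usage: the store is first seen through the retriever,
# then added as a node under the same name.
store = Component(name="DocumentStore")
retriever = Component(uses=[store])
pipeline = MiniPipeline()
pipeline.add_node(retriever, name="Retriever")
pipeline.add_node(store, name="DocumentStore")  # same instance, same name: accepted
```

In this sketch, adding a different object under the name "DocumentStore", or adding `store` as a node a second time, raises a `ValueError`.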

Contributor

@ZanSara ZanSara left a comment

Looking good! Left a few minor comments but it's already good to go. Happy to see this could be done with such a small diff 😊

declarations[name] = f"{variable_name} = {class_name}({init_args})"
declaration = f"{variable_name} = {class_name}({init_args})"
# set name of subcomponents explicitly if it's not the default name as it won't be set via Pipeline.add_node()
if name != class_name and name not in (node["name"] for node in pipeline_definition["nodes"]):
Contributor

Just a curiosity: I see you're using a tuple comprehension here. Are tuples faster in loops, or is it just style? If they're faster I'll start to use them too... 😁

Member Author

It's not a tuple here: it's a generator. So technically we save the overhead of creating a list object. But I doubt it has a real performance impact unless the code runs tens of thousands of times. It should, however, save a little memory and in the end maybe also some CPU time, as the garbage collector has less to do. But that's really minor; a list would be fine as well. Since it is also used only once, a generator is my choice here, but I would call that a matter of taste.
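
A small sketch of the distinction, using made-up data in place of the real `pipeline_definition`: parentheses around a comprehension produce a generator, which a membership test consumes lazily.

```python
from types import GeneratorType

# Made-up data shaped like the pipeline definition in the diff above.
pipeline_definition = {"nodes": [{"name": "Retriever"}, {"name": "Reader"}]}

# Parentheses around a comprehension create a generator, not a tuple.
names_gen = (node["name"] for node in pipeline_definition["nodes"])
# An actual tuple needs an explicit tuple(...) call.
names_tuple = tuple(node["name"] for node in pipeline_definition["nodes"])

is_generator = isinstance(names_gen, GeneratorType)

# `in` pulls items from the generator one by one and stops at the first match,
# so no intermediate list or tuple is built.
found = "Retriever" in names_gen

# The generator is now partially consumed: only the items after the match remain.
remaining = list(names_gen)
```

Because a generator can only be iterated once, this form fits a single membership test; a list or tuple is the safer choice when the names are needed again.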

# E.g. for indexing pipelines it's common to add a retriever first and a document store afterwards.
# The document store is already being used by the retriever however.
# Thus the very same document store will be added twice, first as a subcomponent of the retriever and second as a first level node.
if name in self.components and self.components[name] != component:
Contributor

Implicit iteration over dict keys makes self.components look like a list. Let's change it to if name in self.components.keys()
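
For reference, the two spellings behave the same for a plain dict; the suggestion is about readability only. A quick sketch with a made-up `components` dict:

```python
# Made-up stand-in for self.components: maps component names to instances.
components = {"Retriever": object(), "DocumentStore": object()}

# Membership on a dict checks its keys...
implicit = "Retriever" in components
# ...so this is equivalent, just more explicit about what is being searched.
explicit = "Retriever" in components.keys()

# Values are never consulted by either form.
value_is_member = components["Retriever"] in components
```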

Comment on lines 402 to 404
component_definition = {"params": component.get_params(), "type": component.type}
component_definitions[name] = component_definition

Contributor

I'm not sure I get why we're "rebuilding" the _component_config here. In the original code, there was a single line in place of these two: component_definitions[name] = component._component_config. Isn't that the same?

Member Author

@tstadel tstadel May 18, 2022

You're right, I originally got confused and thought the name was included in the component_definition, too. But that's not the case, so component._component_config really is the same.

all_components = self._find_all_components(node_components)
return {component.name: component for component in all_components if component.name is not None}

def _find_all_components(self, components: List[BaseComponent]) -> Set[BaseComponent]:
Contributor

I would change the name of this function to something like _discover_utilized_components (or similar), and move the loop into the caller. Not a big deal, but I would expect a function called _find_all_components to take no parameters, not a nearly complete list of them 😄

Member Author

Moving the loop would make it non-recursive, but I have an idea for getting rid of the required components param.
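
To show why the loop and the recursion belong together, here is a rough sketch of recursive discovery. It is not the merged implementation: the `Component` class, its `utilized` attribute and the optional `seen` parameter are assumptions made for illustration.

```python
from typing import Dict, Iterable, List, Optional, Set


class Component:
    """Hypothetical component; `utilized` lists the components it depends on."""

    def __init__(self, name: Optional[str] = None, utilized: Iterable["Component"] = ()):
        self.name = name
        self.utilized: List["Component"] = list(utilized)


def find_all_components(
    components: Iterable[Component], seen: Optional[Set[Component]] = None
) -> Set[Component]:
    """Collect the given components plus everything they utilize, at any depth."""
    seen = set() if seen is None else seen
    for component in components:
        if component not in seen:  # skip instances that were already visited
            seen.add(component)
            # The recursive call needs the loop: each level passes its own
            # list of utilized components back into the same function.
            find_all_components(component.utilized, seen)
    return seen


store = Component("DocumentStore")
retriever = Component("Retriever", utilized=[store])
unnamed = Component(utilized=[store])  # shares the store, has no name

all_components = find_all_components([retriever, unnamed])
# Mirror the dict comprehension from the diff: unnamed components are left out.
by_name: Dict[str, Component] = {c.name: c for c in all_components if c.name is not None}
```

The shared `store` instance is visited once, and the unnamed component is discovered but excluded from the name lookup.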

@tstadel tstadel added the type:bug and topic:save/load labels May 18, 2022
@tstadel tstadel merged commit f6e3a63 into master May 18, 2022
@tstadel tstadel deleted the fix_lost_subcomponent_names branch May 18, 2022 12:17