Skip to content
This repository has been archived by the owner on Nov 28, 2023. It is now read-only.

feat: use full connection data to route I/O #148

Merged
merged 30 commits into from
Nov 14, 2023
Merged

Conversation

ZanSara
Copy link
Contributor

@ZanSara ZanSara commented Oct 26, 2023

Main changes

  • A small, internal (not visible as API) Connection object is introduced to simplify the implementation of the run() method. Access to the same values through networkx's API is just too hard to read and error-prone. The Connection class simplifies access to tuples of (producer_component, producer_socket, consumer_component, consumer_socket), plus a couple of shorthands like a readable __repr__ and a is_mandatory method for the entire connection, that returns True if a connection must be waited for before running a component.

  • InputSocket.is_optional, based on typing, became InputSocket.is_mandatory, based on the presence or absence of a default value

  • The full list of connections (self._connections) and a mapping of all the mandatory connections of each component (self._mandatory_connections) are now Pipeline's state and are populated by the self._direct_connect method instead of being re-computed at every Pipeline.run().

  • The inputs_buffer object has been split into three separate data structures:

    • components_queue: List[str]: contains the list of the components to process, similar to the old inputs_buffer.keys()
    • mandatory_values_buffer: Dict[Connection, Any]: For each connection that exists in the pipeline where the receiving socket is mandatory, it stores the value that is being sent over that connection. Uses Connection objects as keys to ease access.
    • optional_values_buffer: Dict[Connection, Any]: Semantically identical to mandatory_values_buffer, but for connections that are not mandatory.
  • The possible actions for a component have been reduced from four (run, wait, skip, remove) to a boolean value where True == "run" and False == "wait"

  • Skipping branches is now done by not adding a given component to the components_queue instead of signaling the skip with a None. Skipped branches are detected as follows: if from no component in the components queue, networkx.has_path(component, waiting_component) is True, then the connection will never receive a value and therefore it is not waited for.

API Impact

No changes at Pipeline's API level.

Loops

Tests were added to make sure many more looping topologies are now supported. However, note how looping is easier to implement with variadic loop mergers. Fixed loop mergers do work in simple topologies but don't cover all cases, while variadics seem to enable all scenarios I could think of.

While there is a way to make non-variadic components able to behave as loop mergers, we would need to extend the API to allow the user to define alternative sets of mandatory inputs instead of a single one, as it happens now. This change can be performed in a (mostly) non-breaking way later on if we find it necessary, so it was not added in this PR.

Dynamic mandatory + fixed optional inputs

The current PromptBuilder shows a signature that this version would not readily support:

class PromptBuilder:

    def __init__(self, ...):
        ....
        component.set_input_types(self, "messages": Optional[List[ChatMessage]], **dynamic_input_slots)

    def run(self, messages: Optional[List[ChatMessage]] = None, **kwargs): 
        ...

The issue is that input sockets are read from the set_input_types call only, and therefore all count as mandatory. This is fixable in a non-breaking way, so it has not been included in this PR and will be addressed separately.


Fixes #112
Fixes #113
Fixes #111
Fixes #105

@ZanSara ZanSara marked this pull request as draft October 26, 2023 17:15
@ZanSara ZanSara changed the title feat: stop using None to skip branches feat: use full Connection data to route I/O Oct 27, 2023
@ZanSara ZanSara changed the title feat: use full Connection data to route I/O feat: use full connection data to route I/O Oct 27, 2023
@ZanSara ZanSara requested a review from masci October 27, 2023 12:46
@ZanSara ZanSara marked this pull request as ready for review October 27, 2023 13:12
@masci masci self-assigned this Oct 30, 2023
@vblagoje
Copy link
Member

vblagoje commented Nov 2, 2023

Unbeknownst to me that this PR/issue existed (@ZanSara informed me), I encountered its manifestation while working on a demo for a new ConditionalRouter. Here is a small code sample summarizing the connection firing issue.

from haystack.preview import Pipeline, component
pipe = Pipeline()


@component
class ComponentA:

    @ component.output_types(output_a=str, output_b=str)
    def run(self, input_a: str):
        return {"output_a":"A"}


@component
class ComponentB:

    @ component.output_types(output_b=str)
    def run(self, input_b: str):
        return {"output_b":"B"}


a = ComponentA()
b = ComponentB()
pipe.add_component("a", a)
pipe.add_component("b", b)
pipe.connect("a.output_b", "b.input_b")

response = pipe.run(data={"a": {"input_a": "input"}})
assert response == {"a": {"output_a": "A"}}

The current main version of canals returns an empty {} while the canals from this branch returns the correct answer, as asserted at the end of the code sample. 👍 Perhaps this small sample could be added as a unit test. I'll let you decide if that's necessary.

Copy link
Contributor

@masci masci left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Let's split this PR into mulitple smaller ones, we can follow the order of the list I see in the PR description

@silvanocerza
Copy link
Contributor

  • InputSocket.is_optional, based on typing, became InputSocket.is_mandatory, based on the presence or absence of a default value

Just want to stress that this is semantically wrong. I kinda already went over this point on #142.
We keep confusing type optionality with input optionality, they are semantically different.

is_optional tells me that the input can be None and not that it can be omitted. is_mandatory completely glosses this over.

@ZanSara ZanSara marked this pull request as draft November 10, 2023 11:09
@masci masci marked this pull request as ready for review November 14, 2023 07:55
@ZanSara ZanSara closed this Nov 14, 2023
@masci masci reopened this Nov 14, 2023
Copy link
Contributor

@masci masci left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I went through the diff an made some changes:

  • Removed any reference of consumer/producer (we already have sender/receiver, I made them consistent)
  • Moved Connection under the component package - to me the connection looks closer to sockets than pipelines
  • Moved some logic out of the Pipeline methods into the Connection class - this makes the pipeline code more readable
  • Added a minimal, explicit test case for the optional input
  • Added more unit tests to increase coverage
  • Fixed typing to have the pipeline and connection modules clean
  • Added my notes here and there as code comments
  • Changed variable names where I felt it could add clarity

Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Labels
None yet
Projects
None yet
4 participants