Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Validate that the component node names match the referenced nodes in the pipeline definition (yaml) #2338

Closed
ArzelaAscoIi opened this issue Mar 21, 2022 · 4 comments · Fixed by #2438
Assignees
Labels
topic:pipeline type:bug Something isn't working

Comments

@ArzelaAscoIi
Copy link
Member

If we run validate_config and generate_code (from haystack.pipelines.config) on the following yaml which contains a typo in the FARMReader name, validate_config will not find any errors. generate_code will complain about a key error. Therefore, generate_code can be used to verify that the variables used inside the pipeline definition actually exist in the yaml.

It would be helpful to validate this within the validate_config method as well.

version: '0.9'
    name: "test_pipeline_1"

    components:    # define all the building-blocks for Pipeline
    - name: DocumentStore
      type: ElasticsearchDocumentStore
    - name: Retriever
      type: ElasticsearchRetriever
      params:
        document_store: DocumentStore 
        top_k: 5
    - name: Readerrrrrrrr # TYPO
      type: FARMReader
      params:
        model_name_or_path: deepset/minilm-uncased-squad2
    - name: TextFileConverter
      type: TextConverter
    - name: Preprocessor
      type: PreProcessor
      params:
        split_by: word
        split_length: 1000

    pipelines:
    - name: query
      type: Query
      nodes:
        - name: Retriever
          inputs: [Query]
    - name: indexing
      type: Indexing
      nodes:
        - name: TextFileConverter
          inputs: [File]
        - name: Preprocessor
          inputs: [ TextFileConverter]
        - name: Retriever
          inputs: [Preprocessor]
        - name: DocumentStore
          inputs: [Retriever]
@ZanSara
Copy link
Contributor

ZanSara commented Mar 21, 2022

Hey @ArzelaAscoIi, by looking at your YAML I don't see the issue you're describing.

The Readerrrrrrrr node is not referenced in any pipeline below: that's why validate_config does not complain. In fact there's no issue with your YAML!

I am way more interested about what generate_code says instead. The KeyError you see might be completely unrelated. Can you share the stacktrace?

@ArzelaAscoIi
Copy link
Member Author

Hi @ZanSara , sorry for confusion. The yaml is indeed valid. The case I am talking about can be tested with this script:

import yaml
from haystack.pipelines.config import validate_config
from haystack.pipelines.utils import generate_code

PIPELINE_YAML = """
    version: '1.2.1rc0'
    components:
        - name: DocumentStore
          type: ElasticsearchDocumentStore
        - name: Retriever
          type: ElasticsearchRetriever
          params:
            document_store: DocumentStore 
            top_k: 5
        - name: Readerrrrrrrr # TYPO
          type: FARMReader
          params:
            model_name_or_path: deepset/minilm-uncased-squad2
        - name: TextFileConverter
          type: TextConverter
        - name: Preprocessor
          type: PreProcessor
          params:
            split_by: word
            split_length: 1000

    pipelines:
        - name: query
          nodes:
            - name: Retriever
              inputs: [Query]
            - name: Reader
              inputs: [Retriever]
        - name: indexing
          nodes:
            - name: TextFileConverter
              inputs: [File]
            - name: Preprocessor
              inputs: [ TextFileConverter]
            - name: Retriever
              inputs: [Preprocessor]
            - name: DocumentStore
              inputs: [Retriever]
"""
config = yaml.safe_load(PIPELINE_YAML)
validate_config(config)  # this wont throw an exception
generate_code(config, pipeline_name="query")  # this will throw the exception "KeyError: 'Reader'"

@ZanSara ZanSara self-assigned this Apr 14, 2022
@ZanSara
Copy link
Contributor

ZanSara commented Apr 14, 2022

Ok actually a very quick inspection made me realize the issue here.

This one is in between a naming issue and an unintuitive API design. The fact is that validate_config does not validate the graph: that step happens only in load_from_config. validate_config just checks that all the config abides the schema, not that the graph defined by the nodes is actually valid.

However, your issue has no solution at the moment. load_from_config check the graph while building it, so if you want to check that the YAML is valid without loading any node, there's no clean way in the current API. You're cleverly using generate_code here, which simply assumes the graph is valid: so when it is not, it fails, providing you the "validation" step you were looking for.

I will definitely proceed to extract the graph validation step in a separate API that can be invoked stand-alone on any YAML. Being able to fully validate YAML files before loading the pipelines is indeed quite important.

@ZanSara
Copy link
Contributor

ZanSara commented Apr 20, 2022

I think I have a fix for this issue 🙂 Check out #2438 when you have time and let me know if it works for you!

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
topic:pipeline type:bug Something isn't working
Projects
None yet
Development

Successfully merging a pull request may close this issue.

2 participants