Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Validate YAML files without loading the nodes #2438

Merged
merged 80 commits into from
May 4, 2022

Conversation

ZanSara
Copy link
Contributor

@ZanSara ZanSara commented Apr 20, 2022

Closes #2338
Closes #2302 as well

Problem
The validation of YAML pipeline configurations used to simply evaluate if the file respected the schema. Although already quite effective at spotting issues, it had no way to verify whether the resulting pipeline was actually composing a proper graph, and could not even verify whether the node names where spelled correctly between the components and the pipeline.nodes blocks of the configuration file.

Analysis
The validation logic for the graph was entirely performed in the Pipeline.add_node() method, which was taking a fully instantiated instance of the node to add as input. However, such instance was not necessary for the validation step, which could then be extracted and reused by validation functions that would provide only the classname of the node to validate.

Changes

  • The validation logic was extracted from Pipeline.add_node() into the stateless _add_node_to_pipeline_graph in haystack.pipelines.config.py.
  • Pipeline.add_node() was refactored to use this method for validation.
  • validate_config was renamed validate_schema to better highlight what it actually does
  • A new method called validate_pipeline_graph was added, and validate_config now calls both validate_schema and validate_pipeline_graph on each pipeline found in the configuration.
  • pipeline_config was renamed to simply config in several methods, to clarify the fact that it's the whole content of a YAML We decided against that and postponed the decision of how to rename pipeline_config.
  • Additional changes:
    • RayPipeline moved into its separate file
    • BasePipeline is gone
    • RootNode moved with BaseComponent (haystack.nodes.base) due to a circular import
    • A pair of typos were spotted in the JSON schemas and fixed
    • MockDenseRetriever added and used in a FAISS test that was moved from test_pipeline.py into test_faiss_and_milvus.py

@ArzelaAscoIi

@ZanSara
Copy link
Contributor Author

ZanSara commented Apr 20, 2022

Feel free to skim over the changes as they are right now and let me know if you spot some evident/radical issues, or something you don't like. Test are going to be fixed soon (hopefully 😄 )

Copy link
Member

@tstadel tstadel left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Just two things we need to fix till it can be merged:

  • BaseComponent.load should be used in pipelines/base.py
  • The renaming of the dense retriever's load method introduces too much confusion. Better keep this method's name load and rename BaseComponent.load to BaseComponent._load
    And root_node param has been commented out in two places. We should delete that.

pyproject.toml Outdated
"c-extension-no-member",
"too-many-public-methods", # Or raise the limit from 20
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ok, yeah sounds like a good idea. However, it'll slightly complicate the save_to_dc workflow for local pipelines. I'm going to think about it.
Anyways, can't we restrict this too-many-public-methods permission to Pipeline only?

haystack/pipelines/base.py Outdated Show resolved Hide resolved
@@ -486,7 +486,7 @@ def save(
self.passage_tokenizer.save_pretrained(save_dir + f"/{passage_encoder_dir}")

@classmethod
def load(
def load_from_directory(
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That's ugly to change: If someone used it before it's easy to confuse with BaseComponent.load. People might not understand that. What about renaming BaseComponent.load (again, sorry btw) to BaseComponent._load? It should only be called by haystack internals anyways.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yep I was wondering how used this method was. Same reasoning goes for FAISSDocumentStore (load() became load_index()). On the other hand, given that this PR will have some minor breaking changes anyway and I updated the tutorials, do you think it's really worth to shorten it back? At least with the longer name it's harder for it to collide with something else again. Maybe let's see what @julian-risch thinks of this

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

To be honest, I would short it back. I don't like the value-to-risk-profile: having a prominent load function on each component should support easy loading of the component like it was before. The new load function won't be of much value to users as they could simply call the proper __init__ instead. On the other hand we risk frustrating users whiches previous load calls break for no good reason. But let's consult @julian-risch

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Put this way you're right: that load method would be too prominent for no reason. I'll change it back unless there are other opinions 👍

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I agree with you two. 👍

haystack/nodes/_json_schema.py Show resolved Hide resolved
haystack/pipelines/base.py Outdated Show resolved Hide resolved
haystack/pipelines/base.py Show resolved Hide resolved
haystack/pipelines/config.py Outdated Show resolved Hide resolved
Copy link
Member

@tstadel tstadel left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM!

@ZanSara ZanSara merged commit f8e0231 into master May 4, 2022
@ZanSara ZanSara deleted the validate_yaml_without_loading branch May 4, 2022 15:39
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
3 participants