-
Notifications
You must be signed in to change notification settings - Fork 2k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Validate YAML files without loading the nodes #2438
Conversation
…actor load_from_config and add_node to use it
…/haystack into validate_yaml_without_loading
Feel free to skim over the changes as they are right now and let me know if you spot some evident/radical issues, or something you don't like. Test are going to be fixed soon (hopefully 😄 ) |
…/haystack into validate_yaml_without_loading
…/haystack into validate_yaml_without_loading
…/haystack into validate_yaml_without_loading
…/haystack into validate_yaml_without_loading
…/haystack into validate_yaml_without_loading
…/haystack into validate_yaml_without_loading
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Just two things we need to fix till it can be merged:
BaseComponent.load
should be used in pipelines/base.py- The renaming of the dense retriever's
load
method introduces too much confusion. Better keep this method's nameload
and renameBaseComponent.load
toBaseComponent._load
Androot_node
param has been commented out in two places. We should delete that.
pyproject.toml
Outdated
"c-extension-no-member", | ||
"too-many-public-methods", # Or raise the limit from 20 |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ok, yeah sounds like a good idea. However, it'll slightly complicate the save_to_dc workflow for local pipelines. I'm going to think about it.
Anyways, can't we restrict this too-many-public-methods permission to Pipeline only?
haystack/nodes/retriever/dense.py
Outdated
@@ -486,7 +486,7 @@ def save( | |||
self.passage_tokenizer.save_pretrained(save_dir + f"/{passage_encoder_dir}") | |||
|
|||
@classmethod | |||
def load( | |||
def load_from_directory( |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
That's ugly to change: If someone used it before it's easy to confuse with BaseComponent.load
. People might not understand that. What about renaming BaseComponent.load
(again, sorry btw) to BaseComponent._load
? It should only be called by haystack internals anyways.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yep I was wondering how used this method was. Same reasoning goes for FAISSDocumentStore
(load()
became load_index()
). On the other hand, given that this PR will have some minor breaking changes anyway and I updated the tutorials, do you think it's really worth to shorten it back? At least with the longer name it's harder for it to collide with something else again. Maybe let's see what @julian-risch thinks of this
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
To be honest, I would short it back. I don't like the value-to-risk-profile: having a prominent load
function on each component should support easy loading of the component like it was before. The new load
function won't be of much value to users as they could simply call the proper __init__
instead. On the other hand we risk frustrating users whiches previous load
calls break for no good reason. But let's consult @julian-risch
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Put this way you're right: that load method would be too prominent for no reason. I'll change it back unless there are other opinions 👍
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I agree with you two. 👍
Co-authored-by: tstadel <[email protected]>
…/haystack into validate_yaml_without_loading
…/haystack into validate_yaml_without_loading
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM!
Closes #2338
Closes #2302 as well
Problem
The validation of YAML pipeline configurations used to simply evaluate if the file respected the schema. Although already quite effective at spotting issues, it had no way to verify whether the resulting pipeline was actually composing a proper graph, and could not even verify whether the node names where spelled correctly between the
components
and thepipeline.nodes
blocks of the configuration file.Analysis
The validation logic for the graph was entirely performed in the
Pipeline.add_node()
method, which was taking a fully instantiated instance of the node to add as input. However, such instance was not necessary for the validation step, which could then be extracted and reused by validation functions that would provide only the classname of the node to validate.Changes
Pipeline.add_node()
into the stateless_add_node_to_pipeline_graph
inhaystack.pipelines.config.py
.Pipeline.add_node()
was refactored to use this method for validation.validate_config
was renamedvalidate_schema
to better highlight what it actually doesvalidate_pipeline_graph
was added, andvalidate_config
now calls bothvalidate_schema
andvalidate_pipeline_graph
on each pipeline found in the configuration.We decided against that and postponed the decision of how to renamepipeline_config
was renamed to simplyconfig
in several methods, to clarify the fact that it's the whole content of a YAMLpipeline_config
.RayPipeline
moved into its separate fileBasePipeline
is goneRootNode
moved withBaseComponent
(haystack.nodes.base
) due to a circular importMockDenseRetriever
added and used in a FAISS test that was moved fromtest_pipeline.py
intotest_faiss_and_milvus.py
@ArzelaAscoIi