Generate code from pipeline (pipeline.to_code()) #2214
Conversation
```python
    @classmethod
    def _order_components(
        cls, dependency_map: Dict[str, List[str]], components_to_order: Optional[List[str]] = None
    ) -> List[str]:
```
The other methods are fairly straightforward and easy to understand, but for `_order_components` a line describing what it does would be helpful. Something like: *Orders components according to their interdependency. Components with no dependencies come first, components depending on them come next.*
Personally I think all of these methods should have a docstring. The parameter names are slightly obscure to me, so a few hints about what they contain and their expected structure would be welcome 🙂 Minor issue though.
Just a thought: maybe `networkx` has facilities that do this kind of node ordering on DAGs?
I checked, but haven't found one. Maybe I missed something.
This might be it: https://networkx.org/nx-guides/content/algorithms/dag/index.html#topological-sort Have a better look though; I just skimmed through the Topological Sort section and saw this: `list(nx.topological_sort(clothing_graph))`
Ah ok, cool! I'll look into that.
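For reference, a minimal sketch of what such an ordering does, using Kahn's algorithm over a `dependency_map` shaped like the one `_order_components` takes (the function name and the sample map here are hypothetical). `networkx.topological_sort` over a `DiGraph` would give the same result.

```python
from typing import Dict, List


def order_components(dependency_map: Dict[str, List[str]]) -> List[str]:
    """Kahn's algorithm: components with no pending dependencies come first."""
    # How many dependencies each component still has to wait for
    pending = {name: len(deps) for name, deps in dependency_map.items()}
    # Reverse map: which components depend on a given one
    dependents: Dict[str, List[str]] = {name: [] for name in dependency_map}
    for name, deps in dependency_map.items():
        for dep in deps:
            dependents.setdefault(dep, []).append(name)

    ready = sorted(name for name, count in pending.items() if count == 0)
    ordered: List[str] = []
    while ready:
        current = ready.pop(0)
        ordered.append(current)
        for dependent in dependents.get(current, []):
            pending[dependent] -= 1
            if pending[dependent] == 0:
                ready.append(dependent)
    if len(ordered) != len(dependency_map):
        raise ValueError("Dependency graph contains a cycle")
    return ordered


# "retriever" has no dependencies, "reader" depends on it
print(order_components({"retriever": [], "reader": ["retriever"]}))
# → ['retriever', 'reader']
```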
```python
exec(query_pipeline_code)
exec(index_pipeline_code)
assert locals()["query_pipeline_from_code"] is not None
assert locals()["index_pipeline_from_code"] is not None
```
Clever :)
As this is more an integration test, both query and indexing pipelines should work.
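A small side note on the pattern itself: a sketch of a slightly more robust variant (with hypothetical stand-in strings for the generated pipeline code), since `exec` does not reliably write into `locals()` inside a function body. Passing an explicit namespace dict behaves the same at module scope and inside test functions:

```python
# Hypothetical stand-ins for the generated pipeline code
query_pipeline_code = "query_pipeline_from_code = object()"
index_pipeline_code = "index_pipeline_from_code = object()"

# exec() into an explicit dict avoids the locals()-inside-a-function pitfall
namespace = {}
exec(query_pipeline_code, namespace)
exec(index_pipeline_code, namespace)

assert namespace["query_pipeline_from_code"] is not None
assert namespace["index_pipeline_from_code"] is not None
```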
```python
        return "\n \t".join([f"{name}: {value}" for name, value in document_or_answer.items()])

    @classmethod
    def _format_wrong_sample(cls, query: dict):
```
Nitpicking :) I would expect this method to be called `_format_wrong_example`, as it doesn't format the whole sample, but rather just one example from it.
Just a couple of minor comments. The PR looks great!
Ok this was much more dense than it looked! Sorry that it took me so long. A lot of things can be improved here, but there are only two "critical" bugs that I wouldn't like to see in tomorrow's release. If they're too tough to handle we might release without this PR.
Ping me on Slack about any specific comment, some are definitely debatable so I'll be happy to clarify!
```diff
@@ -1446,3 +1395,257 @@ def __call__(self, *args, **kwargs):
         Ray calls this method which is then re-directed to the corresponding component's run().
         """
         return self.node._dispatch_run(*args, **kwargs)


 class _PipelineCodeGen:
```
`pipeline.py` is already very long. How about separate private modules for these two classes, like `_codegen.py` and `_eval_report.py`?
This would introduce some ugly cyclic dependencies I'm afraid.
```python
        return CAMEL_CASE_TO_SNAKE_CASE_REGEX.sub("_", input).lower()

    @classmethod
    def generate_code(
```
This was probably not your main concern, but we should keep in mind that code generation is inherently unsafe. Even though this is a hard problem, this function can be improved a lot on this front.
Let me show how you could game `generate_code()` to inject something evil on DC. The code is tested; this is the actual output.
`evil_pipeline.yml`:

```yaml
version: 1.1.0
components:
  - name: "\b\b\b\b\b\b\b\b\b\b\b\b\b\b\b\b\b\b\b\b\b\b\b\b\b\b\b\b\b\b\b\b\b\b\b\b\b\b\b\b\b\b\b\b\b\b\b\b\b\b\b\b\b\b\b\bfrom evil_package import evil_function;evil_function()#"
    type: FileTypeClassifier
pipelines:
  - name: query
    type: Query
    nodes:
      - name: "\b\b\b\b\b\b\b\b\b\b\b\b\b\b\b\b\b\b\b\b\b\b\b\b\b\b\b\b\b\b\b\b\b\b\b\b\b\b\b\b\b\b\b\b\b\b\b\b\b\b\b\b\b\b\b\bfrom evil_package import evil_function;evil_function()#"
        inputs: [File]
```
Code behind the "Convert to code" button on DC:
```python
p = Pipeline.load_from_yaml(path="evil_pipeline.yml")
code = p.to_code()
# Save code to file and execute it
```
Code getting executed on DC:
```python
from haystack.nodes import FileTypeClassifier
from evil_package import evil_function;evil_function()# = FileTypeClassifier()
pipeline = Pipeline()
from evil_package import evil_function;evil_function()#", inputs=["File"])
```
And here you go for unchecked code execution 😅 Of course this is only part of the attack: `evil_package` has to be installed, but supply-chain vulnerabilities are not rare and `evil_package` could be `numpy` for all we know.
Could this vulnerability be solved by removing all escape characters like `\b`? Sure, but with Unicode there are way more funky escape chars than I could think of. String tampering is very hard to prevent.
All of this to say that I'd prefer a whitelist of allowed chars, rather than a simple camelcase-to-snakecase conversion, and in general an eye to stronger security practices when we start to allow user-generated code to run on DC.
I see, yes, I think we have to take care of this in the future. Currently there is no automated execution planned; instead the code would be saved to a notebook file or a Python file. I wonder where the right place for this validation is. It might make sense to validate the config during loading/storing in DC.
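As a sketch of the whitelist idea (the function name is hypothetical), restricting generated names to plain Python identifiers already defeats the payload above, since backspaces, semicolons, and `#` all fail `str.isidentifier()`:

```python
import keyword


def validate_component_name(name: str) -> str:
    """Allowlist approach: accept only plain, non-keyword Python identifiers."""
    if not name.isidentifier() or keyword.iskeyword(name):
        raise ValueError(f"Invalid component name: {name!r}")
    return name


print(validate_component_name("file_type_classifier"))  # → file_type_classifier

# The injected name from the example above is rejected outright
evil = "\b\bfrom evil_package import evil_function;evil_function()#"
try:
    validate_component_name(evil)
except ValueError:
    print("rejected")  # → rejected
```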
```python
    assert index_pipeline.get_config() == locals()["index_pipeline_from_code"].get_config()


@pytest.mark.elasticsearch
```
Probably a leftover
```python
@pytest.mark.elasticsearch
def test_PipelineCodeGen_simple_sparse_pipeline():
```
Not sure this test covers any different code paths than the above
```python
    es_doc_store = ElasticsearchDocumentStore(index="my-index")
    es_retriever = ElasticsearchRetriever(document_store=es_doc_store, top_k=20)
    dense_doc_store = InMemoryDocumentStore(index="my-index")
    emb_retriever = EmbeddingRetriever(
        document_store=dense_doc_store, embedding_model="sentence-transformers/all-MiniLM-L6-v2"
    )
```
Let's mock these away too 🙂 There are other occurrences below, I think most/all of those can be mocked
`test/test_pipeline.py` (outdated)
```python
    assert code == (
        'in_memory_document_store = InMemoryDocumentStore(index="my-index")\n'
        'retri = EmbeddingRetriever(document_store=in_memory_document_store, embedding_model="sentence-transformers/all-MiniLM-L6-v2")\n'
        "\n"
        "p = Pipeline()\n"
        'p.add_node(component=retri, name="retri", inputs=["Query"])'
    )
```
Minor detail: I'd break the output on newlines and test the lines separately, so if tomorrow we decide to change the spacing the tests won't break for it.
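A sketch of that idea, reusing the expected strings from the assertion above: comparing line by line means a change in blank-line spacing alone no longer breaks the test.

```python
generated = (
    'in_memory_document_store = InMemoryDocumentStore(index="my-index")\n'
    'retri = EmbeddingRetriever(document_store=in_memory_document_store, embedding_model="sentence-transformers/all-MiniLM-L6-v2")\n'
    "\n"
    "p = Pipeline()\n"
    'p.add_node(component=retri, name="retri", inputs=["Query"])'
)

# Drop blank lines before comparing, so spacing changes don't fail the test
lines = [line for line in generated.splitlines() if line.strip()]
assert lines[0] == 'in_memory_document_store = InMemoryDocumentStore(index="my-index")'
assert lines[2] == "p = Pipeline()"
assert lines[3] == 'p.add_node(component=retri, name="retri", inputs=["Query"])'
```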
```python
        'p.add_node(component=retri, name="retri", inputs=["Query"])'
    )
```
You could add tests for non-DAG pipelines, unused nodes, and even meaningless pipelines if it's possible. A little issue with these tests is that only "happy paths" are covered: we should test that exceptions trigger properly too.
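A sketch of such an "unhappy path" test for a non-DAG pipeline, using a hypothetical cycle guard (in a pytest suite the try/except below would normally be `with pytest.raises(ValueError):`):

```python
def check_dag(dependency_map):
    """Hypothetical guard: raise ValueError if the pipeline graph has a cycle."""
    visited, in_progress = set(), set()

    def visit(node):
        if node in in_progress:
            raise ValueError(f"Pipeline is not a DAG: cycle through {node!r}")
        if node in visited:
            return
        in_progress.add(node)
        for dep in dependency_map.get(node, []):
            visit(dep)
        in_progress.discard(node)
        visited.add(node)

    for node in dependency_map:
        visit(node)


# a -> b -> a is a cycle, so the unhappy path must raise
try:
    check_dag({"a": ["b"], "b": ["a"]})
    raise AssertionError("expected ValueError for a cyclic pipeline")
except ValueError:
    print("cycle detected")  # → cycle detected
```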
Proposed changes:
- Adds `to_code()` and `to_notebook_cell()` to `Pipeline`
- `to_code()` returns the code as a string
- `to_notebook_cell()` creates a new cell containing the code
- `pipeline_variable_name` controls the name of the pipeline variable to be generated
- `generate_imports` controls whether the respective imports should be created along with the code
- `add_comment` controls whether to show a comment before the code stating that it has been generated. Defaults to `False` in `to_code()` and to `True` in `to_notebook_cell()`.

Main flow for notebook users:
Main flow for backend code users:
Status (please check what you already did):
closes #2195