Generate code from pipeline (pipeline.to_code()) #2214
Conversation
```python
    @classmethod
    def _order_components(
        cls, dependency_map: Dict[str, List[str]], components_to_order: Optional[List[str]] = None
    ) -> List[str]:
```
The other methods are fairly straightforward and easy to understand, but for `_order_components` a line describing what it does would be helpful. Something like: *Orders components according to their interdependency. Components with no dependencies come first, components depending on them come next.*
Personally I think all of these methods should have a docstring. The parameter names are slightly obscure to me, so a few hints about what they contain and their expected structure would be welcome 🙂 Minor issue though.
Just a thought: maybe `networkx` has facilities that do this kind of node ordering on DAGs?
I checked, but haven't found one. Maybe I missed something.
This might be it: https://networkx.org/nx-guides/content/algorithms/dag/index.html#topological-sort Have a better look though; I just skimmed through the Topological Sort section and saw this: `list(nx.topological_sort(clothing_graph))`
Ah ok, cool! I'll look into that.
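For reference, a minimal sketch of what such an ordering does, using Kahn's algorithm over a `dependency_map` shaped like the one `_order_components` takes (the function name and the sample map here are hypothetical). `networkx.topological_sort` over a `DiGraph` would give the same result.

```python
from typing import Dict, List


def order_components(dependency_map: Dict[str, List[str]]) -> List[str]:
    """Kahn's algorithm: components with no pending dependencies come first."""
    # How many dependencies each component still has to wait for
    pending = {name: len(deps) for name, deps in dependency_map.items()}
    # Reverse map: which components depend on a given one
    dependents: Dict[str, List[str]] = {name: [] for name in dependency_map}
    for name, deps in dependency_map.items():
        for dep in deps:
            dependents.setdefault(dep, []).append(name)

    ready = sorted(name for name, count in pending.items() if count == 0)
    ordered: List[str] = []
    while ready:
        current = ready.pop(0)
        ordered.append(current)
        for dependent in dependents.get(current, []):
            pending[dependent] -= 1
            if pending[dependent] == 0:
                ready.append(dependent)
    if len(ordered) != len(dependency_map):
        raise ValueError("Dependency graph contains a cycle")
    return ordered


# "retriever" has no dependencies, "reader" depends on it
print(order_components({"retriever": [], "reader": ["retriever"]}))
# → ['retriever', 'reader']
```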
```python
exec(query_pipeline_code)
exec(index_pipeline_code)
assert locals()["query_pipeline_from_code"] is not None
assert locals()["index_pipeline_from_code"] is not None
```
Clever :)
As this is more an integration test, both query and indexing pipelines should work.
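A small side note on the pattern itself: a sketch of a slightly more robust variant (with hypothetical stand-in strings for the generated pipeline code), since `exec` does not reliably write into `locals()` inside a function body. Passing an explicit namespace dict behaves the same at module scope and inside test functions:

```python
# Hypothetical stand-ins for the generated pipeline code
query_pipeline_code = "query_pipeline_from_code = object()"
index_pipeline_code = "index_pipeline_from_code = object()"

# exec() into an explicit dict avoids the locals()-inside-a-function pitfall
namespace = {}
exec(query_pipeline_code, namespace)
exec(index_pipeline_code, namespace)

assert namespace["query_pipeline_from_code"] is not None
assert namespace["index_pipeline_from_code"] is not None
```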
```python
        return "\n \t".join([f"{name}: {value}" for name, value in document_or_answer.items()])

    @classmethod
    def _format_wrong_sample(cls, query: dict):
```
Nitpicking :) I would expect this method to be called `_format_wrong_example`, as it doesn't format the whole sample, but rather just one example from it.
Just a couple of minor comments. The PR looks great!
Ok this was much more dense than it looked! Sorry that it took me so long. A lot of things can be improved here, but there are only two "critical" bugs that I wouldn't like to see in tomorrow's release. If they're too tough to handle we might release without this PR.
Ping me on Slack about any specific comment, some are definitely debatable so I'll be happy to clarify!
```diff
@@ -1446,3 +1395,257 @@ def __call__(self, *args, **kwargs):
         Ray calls this method which is then re-directed to the corresponding component's run().
         """
         return self.node._dispatch_run(*args, **kwargs)


 class _PipelineCodeGen:
```
`pipeline.py` is already very long. How about separate private modules for these two classes, like `_codegen.py` and `_eval_report.py`?
This would introduce some ugly cyclic dependencies I'm afraid.
```python
        return CAMEL_CASE_TO_SNAKE_CASE_REGEX.sub("_", input).lower()

    @classmethod
    def generate_code(
```
This was probably not your main concern, but we should keep in mind that code generation is inherently unsafe. Even though this is a hard problem, this function can be improved a lot on this front.
Let me show how you could game `generate_code()` to inject something evil on DC. The code is tested; this is the actual output.
`evil_pipeline.yml`:

```yaml
version: 1.1.0
components:
  - name: "\b\b\b\b\b\b\b\b\b\b\b\b\b\b\b\b\b\b\b\b\b\b\b\b\b\b\b\b\b\b\b\b\b\b\b\b\b\b\b\b\b\b\b\b\b\b\b\b\b\b\b\b\b\b\b\bfrom evil_package import evil_function;evil_function()#"
    type: FileTypeClassifier
pipelines:
  - name: query
    type: Query
    nodes:
      - name: "\b\b\b\b\b\b\b\b\b\b\b\b\b\b\b\b\b\b\b\b\b\b\b\b\b\b\b\b\b\b\b\b\b\b\b\b\b\b\b\b\b\b\b\b\b\b\b\b\b\b\b\b\b\b\b\bfrom evil_package import evil_function;evil_function()#"
        inputs: [File]
```
Code behind the "Convert to code" button on DC:
```python
p = Pipeline.load_from_yaml(path="evil_pipeline.yml")
code = p.to_code()
# Save code to file and execute it
```
Code getting executed on DC:
```python
from haystack.nodes import FileTypeClassifier
from evil_package import evil_function;evil_function()# = FileTypeClassifier()
pipeline = Pipeline()
from evil_package import evil_function;evil_function()#", inputs=["File"])
```
And here you go for unchecked code execution 😅 Of course this is only part of the attack: `evil_package` has to be installed, but supply-chain vulnerabilities are not rare and `evil_package` could be `numpy` for all we know.
Could this vulnerability be solved by removing all escape characters like `\b`? Sure, but with Unicode there are way more funky escape chars than I could think of. String tampering is very hard to prevent.
All of this to say that I'd prefer a whitelist of allowed chars, rather than a simple camelcase-to-snakecase conversion, and in general an eye to stronger security practices when we start to allow user-generated code to run on DC.
I see, yes, I think we have to take care of this in the future. Currently there is no automated execution planned; instead the code would be saved to a notebook file or a Python file. I wonder where the right place for this validation is. It might make sense to validate the config during loading/storing in DC.
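As a sketch of the whitelist idea (the function name is hypothetical), restricting generated names to plain Python identifiers already defeats the payload above, since backspaces, semicolons, and `#` all fail `str.isidentifier()`:

```python
import keyword


def validate_component_name(name: str) -> str:
    """Allowlist approach: accept only plain, non-keyword Python identifiers."""
    if not name.isidentifier() or keyword.iskeyword(name):
        raise ValueError(f"Invalid component name: {name!r}")
    return name


print(validate_component_name("file_type_classifier"))  # → file_type_classifier

# The injected name from the example above is rejected outright
evil = "\b\bfrom evil_package import evil_function;evil_function()#"
try:
    validate_component_name(evil)
except ValueError:
    print("rejected")  # → rejected
```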
```python
    assert index_pipeline.get_config() == locals()["index_pipeline_from_code"].get_config()


@pytest.mark.elasticsearch
```
Probably a leftover
```python
@pytest.mark.elasticsearch
def test_PipelineCodeGen_simple_sparse_pipeline():
```
Not sure this test covers any different code paths than the above
```python
    es_doc_store = ElasticsearchDocumentStore(index="my-index")
    es_retriever = ElasticsearchRetriever(document_store=es_doc_store, top_k=20)
    dense_doc_store = InMemoryDocumentStore(index="my-index")
    emb_retriever = EmbeddingRetriever(
        document_store=dense_doc_store, embedding_model="sentence-transformers/all-MiniLM-L6-v2"
    )
```
Let's mock these away too 🙂 There are other occurrences below, I think most/all of those can be mocked
`test/test_pipeline.py` (outdated)
```python
    assert code == (
        'in_memory_document_store = InMemoryDocumentStore(index="my-index")\n'
        'retri = EmbeddingRetriever(document_store=in_memory_document_store, embedding_model="sentence-transformers/all-MiniLM-L6-v2")\n'
        "\n"
        "p = Pipeline()\n"
        'p.add_node(component=retri, name="retri", inputs=["Query"])'
    )
```
Minor detail: I'd break the output on newlines and test the lines separately, so if tomorrow we decide to change the spacing the tests won't break for it.
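A sketch of that idea, reusing the expected strings from the assertion above: comparing line by line means a change in blank-line spacing alone no longer breaks the test.

```python
generated = (
    'in_memory_document_store = InMemoryDocumentStore(index="my-index")\n'
    'retri = EmbeddingRetriever(document_store=in_memory_document_store, embedding_model="sentence-transformers/all-MiniLM-L6-v2")\n'
    "\n"
    "p = Pipeline()\n"
    'p.add_node(component=retri, name="retri", inputs=["Query"])'
)

# Drop blank lines before comparing, so spacing changes don't fail the test
lines = [line for line in generated.splitlines() if line.strip()]
assert lines[0] == 'in_memory_document_store = InMemoryDocumentStore(index="my-index")'
assert lines[2] == "p = Pipeline()"
assert lines[3] == 'p.add_node(component=retri, name="retri", inputs=["Query"])'
```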
```python
        'p.add_node(component=retri, name="retri", inputs=["Query"])'
    )
```
You could add tests for non-DAG pipelines, unused nodes, and even meaningless pipelines if it's possible. A little issue with these tests is that only "happy paths" are covered: we should test that exceptions trigger properly too.
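A sketch of such an "unhappy path" test for a non-DAG pipeline, using a hypothetical cycle guard (in a pytest suite the try/except below would normally be `with pytest.raises(ValueError):`):

```python
def check_dag(dependency_map):
    """Hypothetical guard: raise ValueError if the pipeline graph has a cycle."""
    visited, in_progress = set(), set()

    def visit(node):
        if node in in_progress:
            raise ValueError(f"Pipeline is not a DAG: cycle through {node!r}")
        if node in visited:
            return
        in_progress.add(node)
        for dep in dependency_map.get(node, []):
            visit(dep)
        in_progress.discard(node)
        visited.add(node)

    for node in dependency_map:
        visit(node)


# a -> b -> a is a cycle, so the unhappy path must raise
try:
    check_dag({"a": ["b"], "b": ["a"]})
    raise AssertionError("expected ValueError for a cyclic pipeline")
except ValueError:
    print("cycle detected")  # → cycle detected
```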
Proposed changes:
- Adds `to_code()` and `to_notebook_cell()` to `Pipeline`
- `to_code()` returns the code as a string
- `to_notebook_cell()` creates a new cell containing the code
- `pipeline_variable_name` controls the name of the pipeline variable to be generated
- `generate_imports` controls whether the respective imports should be created along with the code
- `add_comment` controls whether to show a comment before the code stating that it has been generated. Defaults to `False` in `to_code()` and to `True` in `to_notebook_cell()`.

Main flow for notebook users:
Main flow for backend code users:
Status (please check what you already did):
closes #2195