
Generate code from pipeline (pipeline.to_code()) #2214

Merged Feb 23, 2022 (26 commits)
Conversation

tstadel (Member) commented Feb 17, 2022:

Proposed changes:

  • add methods to_code() and to_notebook_cell() to Pipeline
  • to_code() returns the code as a string
  • to_notebook_cell() creates a new notebook cell containing the code
  • param pipeline_variable_name controls the name of the generated pipeline variable
  • param generate_imports controls whether the required imports are generated along with the code
  • param add_comment controls whether a comment noting that the code has been generated is placed before it; defaults to False in to_code() and to True in to_notebook_cell()

Main flow for notebook users:

p = Pipeline.load_from_deepset_cloud(pipeline_config_name=MY_PIPELINE_CONFIG_NAME)
p.to_notebook_cell()

Generated cell:

# This code has been generated.
from haystack.document_stores import DeepsetCloudDocumentStore
from haystack.nodes import ElasticsearchRetriever

deepset_cloud_document_store = DeepsetCloudDocumentStore(index="document_search_pipeline_2")
retriever = ElasticsearchRetriever(document_store=deepset_cloud_document_store, top_k=5)

pipeline = Pipeline()
pipeline.add_node(component=retriever, name="Retriever", inputs=["Query"])

Main flow for backend code users:

p = Pipeline.load_from_yaml(path=MY_YAML_PATH)
code = p.to_code()
# do whatever you like with the code

Status (please check what you already did):

  • First draft (up for discussions & feedback)
  • Final code
  • Added tests
  • Updated documentation

closes #2195

@tstadel tstadel requested review from ZanSara and dmigo February 21, 2022 19:21
@classmethod
def _order_components(
    cls, dependency_map: Dict[str, List[str]], components_to_order: Optional[List[str]] = None
) -> List[str]:
Member:
The other methods are fairly straightforward and it's easy to understand what they do, but for _order_components it would be helpful to have a line describing what it does.
Something like: "Orders components according to their interdependency. Components with no dependencies come first, components depending on them come next."

ZanSara (Contributor) commented Feb 22, 2022:
Personally I think all of these methods should have a docstring. The parameter names are slightly obscure to me, so a few hints about what they contain and their expected structure would be welcome 🙂 Minor issue though.

ZanSara (Contributor) commented Feb 22, 2022:
Just a thought: maybe networkx has facilities that do this kind of node ordering on DAGs?

tstadel (Member, Author):
I checked, but haven't found one. Maybe I missed something.

ZanSara (Contributor):
This might be it: https://networkx.org/nx-guides/content/algorithms/dag/index.html#topological-sort. Have a closer look though; I only skimmed the Topological Sort section and saw this: list(nx.topological_sort(clothing_graph))

tstadel (Member, Author):
Ah ok, cool! I'll look into that.
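For reference, the ordering discussed here amounts to a topological sort of the dependency graph. Below is a minimal stdlib-only sketch of the idea (the shape of dependency_map — component name to list of dependency names — is assumed from the signature above; this is an illustration, not the PR's actual implementation):

```python
from typing import Dict, List


def order_components(dependency_map: Dict[str, List[str]]) -> List[str]:
    """Kahn's algorithm: components with no dependencies come first,
    components depending on them come next."""
    # Track unresolved dependencies per component
    pending = {name: set(deps) for name, deps in dependency_map.items()}
    ordered: List[str] = []
    while pending:
        # Components whose dependencies are all resolved (sorted for determinism)
        ready = sorted(name for name, deps in pending.items() if not deps)
        if not ready:
            raise ValueError("Dependency graph contains a cycle")
        for name in ready:
            ordered.append(name)
            del pending[name]
        for deps in pending.values():
            deps.difference_update(ready)
    return ordered


deps = {"Retriever": [], "Reader": ["Retriever"], "DocumentStore": []}
print(order_components(deps))  # ['DocumentStore', 'Retriever', 'Reader']
```

With networkx, the equivalent would indeed be list(nx.topological_sort(graph)) on a DiGraph built from the same edges, as suggested above.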

Comment on lines +175 to +178
exec(query_pipeline_code)
exec(index_pipeline_code)
assert locals()["query_pipeline_from_code"] is not None
assert locals()["index_pipeline_from_code"] is not None
Member:
Clever :)

tstadel (Member, Author):
As this is more of an integration test, both the query and indexing pipelines should work.
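As an aside, exec combined with locals() can be fragile inside a function body, since writes to locals() are not guaranteed to be reflected there. A variant that passes an explicit namespace dict avoids this; the sketch below uses a stand-in code string rather than the real output of to_code():

```python
# Stand-in for the generated pipeline code; the real test would use p.to_code()
query_pipeline_code = 'query_pipeline_from_code = {"name": "query"}'

namespace: dict = {}
exec(query_pipeline_code, namespace)  # assignments land in `namespace`

assert namespace["query_pipeline_from_code"] is not None
print(namespace["query_pipeline_from_code"])  # {'name': 'query'}
```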

return "\n \t".join([f"{name}: {value}" for name, value in document_or_answer.items()])

@classmethod
def _format_wrong_sample(cls, query: dict):
Member:
Nitpicking :) :
I would expect this method to be called _format_wrong_example, as it doesn't format the whole sample but rather just one example from it.

dmigo (Member) commented Feb 22, 2022:

Just a couple of minor comments. The PR looks great!

ZanSara (Contributor) left a comment:

Ok this was much more dense than it looked! Sorry that it took me so long. A lot of things can be improved here, but there are only two "critical" bugs that I wouldn't like to see in tomorrow's release. If they're too tough to handle we might release without this PR.

Ping me on Slack about any specific comment, some are definitely debatable so I'll be happy to clarify!

@@ -1446,3 +1395,257 @@ def __call__(self, *args, **kwargs):
Ray calls this method which is then re-directed to the corresponding component's run().
"""
return self.node._dispatch_run(*args, **kwargs)


class _PipelineCodeGen:
ZanSara (Contributor):
pipeline.py is already very long. How about separate private modules for these two classes, like _codegen.py and _eval_report.py?

tstadel (Member, Author):
This would introduce some ugly cyclic dependencies I'm afraid.

return CAMEL_CASE_TO_SNAKE_CASE_REGEX.sub("_", input).lower()

@classmethod
def generate_code(
ZanSara (Contributor) commented Feb 22, 2022:
This was probably not your main concern, but we should keep in mind that code generation is inherently unsafe. Even though this is a hard problem, this function can be improved a lot to help on this front.

Let me show how you could game generate_code() to inject something evil on DC. Code is tested, this is the actual output.

evil_pipeline.yml

version: 1.1.0

components:
  - name: "\b\b\b\b\b\b\b\b\b\b\b\b\b\b\b\b\b\b\b\b\b\b\b\b\b\b\b\b\b\b\b\b\b\b\b\b\b\b\b\b\b\b\b\b\b\b\b\b\b\b\b\b\b\b\b\bfrom evil_package import evil_function;evil_function()#"
    type: FileTypeClassifier

pipelines:
  - name: query
    type: Query
    nodes:
      - name: "\b\b\b\b\b\b\b\b\b\b\b\b\b\b\b\b\b\b\b\b\b\b\b\b\b\b\b\b\b\b\b\b\b\b\b\b\b\b\b\b\b\b\b\b\b\b\b\b\b\b\b\b\b\b\b\bfrom evil_package import evil_function;evil_function()#"
        inputs: [File]

Code behind the "Convert to code" button on DC:

p = Pipeline.load_from_yaml(path="evil_pipeline.yml")
code = p.to_code()
# Save code to file and execute it

Code getting executed on DC:

from haystack.nodes import FileTypeClassifier

from evil_package import evil_function;evil_function()# = FileTypeClassifier()

pipeline = Pipeline()
from evil_package import evil_function;evil_function()#", inputs=["File"])

And here you go for unchecked code execution 😅 Of course this is only part of the attack: evil_package has to be installed, but supply-chain vulnerabilities are not rare, and evil_package could be numpy for all we know.
Could this vulnerability be solved by removing all escape characters like \b? Sure, but with Unicode there are way more funky escape characters than I can think of. String tampering is very hard to prevent.

All of this to say that I'd prefer a whitelist of allowed characters rather than a simple camel-case to snake-case conversion, and in general an eye to stronger security practices when we start to allow user-generated code to run on DC.
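One way the whitelist idea could look: after the snake-case conversion, accept only names matching a strict ASCII identifier pattern that are not Python keywords, and reject everything else. The function name and error handling below are illustrative, not the PR's actual code:

```python
import keyword
import re

# Whitelist: ASCII letters, digits and underscores, not starting with a digit.
# Note: str.isidentifier() alone would not suffice, as it also accepts
# non-ASCII Unicode identifiers.
VALID_VARIABLE_NAME = re.compile(r"[a-zA-Z_][a-zA-Z0-9_]*")


def validate_variable_name(name: str) -> str:
    if not VALID_VARIABLE_NAME.fullmatch(name) or keyword.iskeyword(name):
        raise ValueError(f"Invalid component name for code generation: {name!r}")
    return name


validate_variable_name("file_type_classifier")  # passes
evil = "\b\bfrom evil_package import evil_function;evil_function()#"
# validate_variable_name(evil)  -> raises ValueError
```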

tstadel (Member, Author):
I see, yes, I think we'll have to take care of this in the future. Currently no automated execution is planned; rather, the code is saved to a notebook file or a Python file. I wonder where the right place for this validation is. It might make sense to validate the config during loading/storing in DC.

assert index_pipeline.get_config() == locals()["index_pipeline_from_code"].get_config()


@pytest.mark.elasticsearch
ZanSara (Contributor):
Probably a leftover



@pytest.mark.elasticsearch
def test_PipelineCodeGen_simple_sparse_pipeline():
ZanSara (Contributor):
Not sure this test covers any code paths different from the one above.

Comment on lines +217 to +222
es_doc_store = ElasticsearchDocumentStore(index="my-index")
es_retriever = ElasticsearchRetriever(document_store=es_doc_store, top_k=20)
dense_doc_store = InMemoryDocumentStore(index="my-index")
emb_retriever = EmbeddingRetriever(
    document_store=dense_doc_store, embedding_model="sentence-transformers/all-MiniLM-L6-v2"
)
ZanSara (Contributor):
Let's mock these away too 🙂 There are other occurrences below, I think most/all of those can be mocked

Comment on lines 189 to 195
assert code == (
    'in_memory_document_store = InMemoryDocumentStore(index="my-index")\n'
    'retri = EmbeddingRetriever(document_store=in_memory_document_store, embedding_model="sentence-transformers/all-MiniLM-L6-v2")\n'
    "\n"
    "p = Pipeline()\n"
    'p.add_node(component=retri, name="retri", inputs=["Query"])'
)
ZanSara (Contributor):
Minor detail: I'd break the output on newlines and test the lines separately, so if tomorrow we decide to change the spacing the tests won't break for it.
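A sketch of that suggestion, reusing the expected strings from the assertion above: split the generated code on newlines, drop blank lines, and assert per line, so whitespace-only changes don't break the test:

```python
# Stand-in for the generated code string from the test above
code = (
    'in_memory_document_store = InMemoryDocumentStore(index="my-index")\n'
    'retri = EmbeddingRetriever(document_store=in_memory_document_store, '
    'embedding_model="sentence-transformers/all-MiniLM-L6-v2")\n'
    "\n"
    "p = Pipeline()\n"
    'p.add_node(component=retri, name="retri", inputs=["Query"])'
)

# Compare line by line, ignoring blank lines and surrounding whitespace
lines = [line.strip() for line in code.splitlines() if line.strip()]
assert lines[0] == 'in_memory_document_store = InMemoryDocumentStore(index="my-index")'
assert lines[-1] == 'p.add_node(component=retri, name="retri", inputs=["Query"])'
assert len(lines) == 4
```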


ZanSara (Contributor):
You could add tests for non-DAG pipelines, unused nodes, and even meaningless pipelines if possible. A little issue with these tests is that only "happy paths" are covered: we should test that exceptions trigger properly too.
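A sketch of such an unhappy-path test. The build_pipeline helper is a hypothetical stand-in for the real Pipeline construction (in the actual test suite, pytest.raises would be the idiomatic form):

```python
def build_pipeline(node_inputs):
    """Stand-in for pipeline construction: reject self-referencing nodes."""
    for name, inputs in node_inputs.items():
        if name in inputs:
            raise ValueError(f"Node {name!r} cannot be its own input")
    return node_inputs


def test_self_loop_raises():
    # Unhappy path: a node feeding into itself must raise, not pass silently
    try:
        build_pipeline({"Retriever": ["Retriever"]})
    except ValueError as e:
        assert "own input" in str(e)
    else:
        raise AssertionError("expected ValueError for a self-loop")


test_self_loop_raises()
print("unhappy path covered")
```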

@ZanSara ZanSara closed this Feb 22, 2022
@ZanSara ZanSara reopened this Feb 22, 2022