
Commit

Merge remote-tracking branch 'origin/dev/join_docs' into dev/join_docs
nickprock committed Oct 9, 2023
2 parents acd5009 + 37c3092 commit c47e0c7
Showing 10 changed files with 394 additions and 46 deletions.
52 changes: 51 additions & 1 deletion README.md
@@ -9,7 +9,57 @@
 | Meta | ![Discord](https://img.shields.io/discord/993534733298450452?logo=discord) ![Twitter Follow](https://img.shields.io/twitter/follow/deepset_ai) |
 </div>
 
-[Haystack](https://haystack.deepset.ai/) is an end-to-end NLP framework that enables you to build applications powered by LLMs, Transformer models, vector search and more. Whether you want to perform question answering, answer generation, semantic document search, or build tools that are capable of complex decision making and query resolution, you can use the state-of-the-art NLP models with Haystack to build end-to-end NLP applications solving your use case.
+[Haystack](https://haystack.deepset.ai/) is an end-to-end NLP framework that enables you to build applications powered by LLMs, Transformer models, vector search and more. Whether you want to perform question answering, answer generation, semantic document search, or build tools that are capable of complex decision-making and query resolution, you can use the state-of-the-art NLP models with Haystack to build end-to-end NLP applications solving your use case.
 
+## Quickstart
+
+Haystack is built around the concept of pipelines. A pipeline is a powerful structure that performs an NLP task. It's made up of components connected together. For example, you can connect a `Retriever` and a `PromptNode` to build a Generative Question Answering pipeline that uses your own data.
+
+Try out how Haystack answers questions about Game of Thrones using the Retrieval Augmented Generation (RAG) approach 👇
+
+First, run the minimal Haystack installation:
+
+```sh
+pip install farm-haystack
+```
+
+Then, index your data to the DocumentStore, build a RAG pipeline, and ask a question on your data:
+
+```python
+from haystack.document_stores import InMemoryDocumentStore
+from haystack.utils import build_pipeline, add_example_data, print_answers
+
+# We are model agnostic :) Here, you can choose from: "anthropic", "cohere", "huggingface", and "openai".
+provider = "openai"
+API_KEY = "sk-..."  # ADD YOUR KEY HERE
+
+# We support many different databases. Here, we load a simple and lightweight in-memory database.
+document_store = InMemoryDocumentStore(use_bm25=True)
+
+# Download and add Game of Thrones TXT articles to Haystack DocumentStore.
+# You can also provide a folder with your local documents.
+add_example_data(document_store, "data/GoT_getting_started")
+
+# Build a pipeline with a Retriever to get relevant documents for the query and a PromptNode interacting with LLMs using a custom prompt.
+pipeline = build_pipeline(provider, API_KEY, document_store)
+
+# Ask a question on the data you just added.
+result = pipeline.run(query="Who is the father of Arya Stark?")
+
+# For details, like which documents were used to generate the answer, look into the `result` object.
+print_answers(result, details="medium")
+```
+
+The output of the pipeline will reference the documents used to generate the answer:
+
+```
+'Query: Who is the father of Arya Stark?'
+'Answers:'
+[{'answer': 'The father of Arya Stark is Lord Eddard Stark of '
+            'Winterfell. [Document 1, Document 4, Document 5]'}]
+```
+
+Congratulations, you have just built your first Haystack app!
+
 ## Core Concepts
 
32 changes: 23 additions & 9 deletions e2e/preview/pipelines/test_extractive_qa_pipeline.py
@@ -1,25 +1,39 @@
+import json
+
 from haystack.preview import Pipeline, Document
 from haystack.preview.document_stores import MemoryDocumentStore
 from haystack.preview.components.retrievers import MemoryBM25Retriever
 from haystack.preview.components.readers import ExtractiveReader
 
 
-def test_extractive_qa_pipeline():
-    document_store = MemoryDocumentStore()
+def test_extractive_qa_pipeline(tmp_path):
+    # Create the pipeline
+    qa_pipeline = Pipeline()
+    qa_pipeline.add_component(instance=MemoryBM25Retriever(document_store=MemoryDocumentStore()), name="retriever")
+    qa_pipeline.add_component(instance=ExtractiveReader(model_name_or_path="deepset/tinyroberta-squad2"), name="reader")
+    qa_pipeline.connect("retriever", "reader")
+
+    # Draw the pipeline
+    qa_pipeline.draw(tmp_path / "test_extractive_qa_pipeline.png")
+
+    # Serialize the pipeline to JSON
+    with open(tmp_path / "test_bm25_rag_pipeline.json", "w") as f:
+        print(json.dumps(qa_pipeline.to_dict(), indent=4))
+        json.dump(qa_pipeline.to_dict(), f)
+
+    # Load the pipeline back
+    with open(tmp_path / "test_bm25_rag_pipeline.json", "r") as f:
+        qa_pipeline = Pipeline.from_dict(json.load(f))
 
+    # Populate the document store
     documents = [
         Document(text="My name is Jean and I live in Paris."),
         Document(text="My name is Mark and I live in Berlin."),
         Document(text="My name is Giorgio and I live in Rome."),
     ]
+    qa_pipeline.get_component("retriever").document_store.write_documents(documents)
 
-    document_store.write_documents(documents)
-
-    qa_pipeline = Pipeline()
-    qa_pipeline.add_component(instance=MemoryBM25Retriever(document_store=document_store), name="retriever")
-    qa_pipeline.add_component(instance=ExtractiveReader(model_name_or_path="deepset/tinyroberta-squad2"), name="reader")
-    qa_pipeline.connect("retriever", "reader")
-
+    # Query and assert
     questions = ["Who lives in Paris?", "Who lives in Berlin?", "Who lives in Rome?"]
     answers_spywords = ["Jean", "Mark", "Giorgio"]
 
93 changes: 58 additions & 35 deletions e2e/preview/pipelines/test_rag_pipelines.py
@@ -1,4 +1,5 @@
 import os
+import json
 import pytest
 
 from haystack.preview import Pipeline, Document
@@ -15,15 +16,8 @@
     not os.environ.get("OPENAI_API_KEY", None),
     reason="Export an env var called OPENAI_API_KEY containing the OpenAI API key to run this test.",
 )
-def test_bm25_rag_pipeline():
-    document_store = MemoryDocumentStore()
-
-    documents = [
-        Document(text="My name is Jean and I live in Paris."),
-        Document(text="My name is Mark and I live in Berlin."),
-        Document(text="My name is Giorgio and I live in Rome."),
-    ]
-
+def test_bm25_rag_pipeline(tmp_path):
+    # Create the RAG pipeline
     prompt_template = """
     Given these documents, answer the question.\nDocuments:
     {% for doc in documents %}
@@ -33,11 +27,8 @@ def test_bm25_rag_pipeline():
     \nQuestion: {{question}}
     \nAnswer:
     """
-
-    document_store.write_documents(documents)
-
     rag_pipeline = Pipeline()
-    rag_pipeline.add_component(instance=MemoryBM25Retriever(document_store=document_store), name="retriever")
+    rag_pipeline.add_component(instance=MemoryBM25Retriever(document_store=MemoryDocumentStore()), name="retriever")
    rag_pipeline.add_component(instance=PromptBuilder(template=prompt_template), name="prompt_builder")
    rag_pipeline.add_component(instance=GPTGenerator(api_key=os.environ.get("OPENAI_API_KEY")), name="llm")
    rag_pipeline.add_component(instance=AnswerBuilder(), name="answer_builder")
@@ -47,6 +38,26 @@
     rag_pipeline.connect("llm.metadata", "answer_builder.metadata")
     rag_pipeline.connect("retriever", "answer_builder.documents")
 
+    # Draw the pipeline
+    rag_pipeline.draw(tmp_path / "test_bm25_rag_pipeline.png")
+
+    # Serialize the pipeline to JSON
+    with open(tmp_path / "test_bm25_rag_pipeline.json", "w") as f:
+        json.dump(rag_pipeline.to_dict(), f)
+
+    # Load the pipeline back
+    with open(tmp_path / "test_bm25_rag_pipeline.json", "r") as f:
+        rag_pipeline = Pipeline.from_dict(json.load(f))
+
+    # Populate the document store
+    documents = [
+        Document(text="My name is Jean and I live in Paris."),
+        Document(text="My name is Mark and I live in Berlin."),
+        Document(text="My name is Giorgio and I live in Rome."),
+    ]
+    rag_pipeline.get_component("retriever").document_store.write_documents(documents)
+
+    # Query and assert
     questions = ["Who lives in Paris?", "Who lives in Berlin?", "Who lives in Rome?"]
     answers_spywords = ["Jean", "Mark", "Giorgio"]
 
@@ -71,15 +82,8 @@ def test_bm25_rag_pipeline():
     not os.environ.get("OPENAI_API_KEY", None),
     reason="Export an env var called OPENAI_API_KEY containing the OpenAI API key to run this test.",
 )
-def test_embedding_retrieval_rag_pipeline():
-    document_store = MemoryDocumentStore()
-
-    documents = [
-        Document(text="My name is Jean and I live in Paris."),
-        Document(text="My name is Mark and I live in Berlin."),
-        Document(text="My name is Giorgio and I live in Rome."),
-    ]
-
+def test_embedding_retrieval_rag_pipeline(tmp_path):
+    # Create the RAG pipeline
     prompt_template = """
     Given these documents, answer the question.\nDocuments:
     {% for doc in documents %}
@@ -89,22 +93,14 @@ def test_embedding_retrieval_rag_pipeline():
     \nQuestion: {{question}}
     \nAnswer:
     """
-
-    indexing_pipeline = Pipeline()
-    indexing_pipeline.add_component(
-        instance=SentenceTransformersDocumentEmbedder(model_name_or_path="sentence-transformers/all-mpnet-base-v2"),
-        name="document_embedder",
-    )
-    indexing_pipeline.add_component(instance=DocumentWriter(document_store=document_store), name="document_writer")
-    indexing_pipeline.connect("document_embedder", "document_writer")
-    indexing_pipeline.run({"document_embedder": {"documents": documents}})
-
     rag_pipeline = Pipeline()
     rag_pipeline.add_component(
-        instance=SentenceTransformersTextEmbedder(model_name_or_path="sentence-transformers/all-mpnet-base-v2"),
+        instance=SentenceTransformersTextEmbedder(model_name_or_path="sentence-transformers/all-MiniLM-L6-v2"),
         name="text_embedder",
     )
-    rag_pipeline.add_component(instance=MemoryEmbeddingRetriever(document_store=document_store), name="retriever")
+    rag_pipeline.add_component(
+        instance=MemoryEmbeddingRetriever(document_store=MemoryDocumentStore()), name="retriever"
+    )
     rag_pipeline.add_component(instance=PromptBuilder(template=prompt_template), name="prompt_builder")
     rag_pipeline.add_component(instance=GPTGenerator(api_key=os.environ.get("OPENAI_API_KEY")), name="llm")
     rag_pipeline.add_component(instance=AnswerBuilder(), name="answer_builder")
@@ -115,6 +111,34 @@ def test_embedding_retrieval_rag_pipeline():
     rag_pipeline.connect("llm.metadata", "answer_builder.metadata")
     rag_pipeline.connect("retriever", "answer_builder.documents")
 
+    # Draw the pipeline
+    rag_pipeline.draw(tmp_path / "test_embedding_rag_pipeline.png")
+
+    # Serialize the pipeline to JSON
+    with open(tmp_path / "test_bm25_rag_pipeline.json", "w") as f:
+        json.dump(rag_pipeline.to_dict(), f)
+
+    # Load the pipeline back
+    with open(tmp_path / "test_bm25_rag_pipeline.json", "r") as f:
+        rag_pipeline = Pipeline.from_dict(json.load(f))
+
+    # Populate the document store
+    documents = [
+        Document(text="My name is Jean and I live in Paris."),
+        Document(text="My name is Mark and I live in Berlin."),
+        Document(text="My name is Giorgio and I live in Rome."),
+    ]
+    document_store = rag_pipeline.get_component("retriever").document_store
+    indexing_pipeline = Pipeline()
+    indexing_pipeline.add_component(
+        instance=SentenceTransformersDocumentEmbedder(model_name_or_path="sentence-transformers/all-MiniLM-L6-v2"),
+        name="document_embedder",
+    )
+    indexing_pipeline.add_component(instance=DocumentWriter(document_store=document_store), name="document_writer")
+    indexing_pipeline.connect("document_embedder", "document_writer")
+    indexing_pipeline.run({"document_embedder": {"documents": documents}})
+
+    # Query and assert
     questions = ["Who lives in Paris?", "Who lives in Berlin?", "Who lives in Rome?"]
     answers_spywords = ["Jean", "Mark", "Giorgio"]
 
@@ -129,7 +153,6 @@ def test_embedding_retrieval_rag_pipeline():
 
     assert len(result["answer_builder"]["answers"]) == 1
     generated_answer = result["answer_builder"]["answers"][0]
-    print(generated_answer)
     assert spyword in generated_answer.data
     assert generated_answer.query == question
     assert hasattr(generated_answer, "documents")
7 changes: 7 additions & 0 deletions haystack/preview/README.md
@@ -23,5 +23,12 @@ pip install farm-haystack
 ```
 The `farm-haystack` package includes all new features of Haystack 2.0. Note that updates to this package occur less frequently compared to `haystack-ai`. So, you might not get all the latest Haystack 2.0 features immediately when using `farm-haystack`.
 
+## 🚗 Getting Started
+
+In our **end-to-end tests** you can find example code for the following pipelines:
+- [RAG pipeline](https://github.com/deepset-ai/haystack/blob/main/e2e/preview/pipelines/test_rag_pipelines.py)
+- [Extractive QA pipeline](https://github.com/deepset-ai/haystack/blob/main/e2e/preview/pipelines/test_extractive_qa_pipeline.py)
+- more to come; check out the [folder](https://github.com/deepset-ai/haystack/blob/main/e2e/preview/)
+
 ## 💙 Stay Updated
 To learn how and when components will be migrated to the new major version, have a look at the [Migrate Components to Pipeline v2](https://github.com/deepset-ai/haystack/issues/5265) roadmap item, where we keep track of issues and PRs about Haystack 2.0. When you have questions, you can always contact us using the [Shaping Haystack 2.0](https://github.com/deepset-ai/haystack/discussions/5568) discussion or [Haystack Discord server](https://discord.com/channels/993534733298450452/1141683185458094211).
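
For a quick feel of the preview API, here is a condensed sketch distilled from the extractive QA test linked above. It is illustrative only: the per-component `run` input format and the way answers are read out of the result are assumptions based on the preview tests, and the reader model is downloaded on first use.

```python
from haystack.preview import Pipeline, Document
from haystack.preview.document_stores import MemoryDocumentStore
from haystack.preview.components.retrievers import MemoryBM25Retriever
from haystack.preview.components.readers import ExtractiveReader

# Build the same retriever -> reader pipeline as the extractive QA e2e test.
pipeline = Pipeline()
pipeline.add_component(instance=MemoryBM25Retriever(document_store=MemoryDocumentStore()), name="retriever")
pipeline.add_component(instance=ExtractiveReader(model_name_or_path="deepset/tinyroberta-squad2"), name="reader")
pipeline.connect("retriever", "reader")

# Write documents through the retriever's store, as the tests do.
pipeline.get_component("retriever").document_store.write_documents(
    [Document(text="My name is Jean and I live in Paris.")]
)

# Assumption: preview pipelines take a dict of per-component inputs.
question = "Who lives in Paris?"
result = pipeline.run({"retriever": {"query": question}, "reader": {"query": question}})
print(result["reader"]["answers"])
```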
2 changes: 1 addition & 1 deletion haystack/preview/components/generators/openai/gpt.py
@@ -121,7 +121,7 @@ def from_dict(cls, data: Dict[str, Any]) -> "GPTGenerator":
         """
         init_params = data.get("init_parameters", {})
         streaming_callback = None
-        if "streaming_callback" in init_params:
+        if "streaming_callback" in init_params and init_params["streaming_callback"]:
             parts = init_params["streaming_callback"].split(".")
             module_name = ".".join(parts[:-1])
             function_name = parts[-1]
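
The extra truthiness check above guards against a serialized `GPTGenerator` that was created without a streaming callback: the `init_parameters` dict then presumably carries `"streaming_callback": None`, so the key is present but `None.split(".")` would raise `AttributeError`. A minimal sketch of the guard in isolation, where `resolve_callback` is a hypothetical stand-in for the deserialization logic above:

```python
import importlib


def resolve_callback(init_params: dict):
    # Hypothetical helper mirroring the guard in GPTGenerator.from_dict.
    # Membership alone is not enough: a key that is present but None
    # would crash on None.split(".").
    if "streaming_callback" in init_params and init_params["streaming_callback"]:
        parts = init_params["streaming_callback"].split(".")
        module = importlib.import_module(".".join(parts[:-1]))
        return getattr(module, parts[-1])
    return None


assert resolve_callback({}) is None
assert resolve_callback({"streaming_callback": None}) is None  # the old guard raised here
assert resolve_callback({"streaming_callback": "builtins.print"}) is print
```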
3 changes: 3 additions & 0 deletions haystack/preview/components/samplers/__init__.py
@@ -0,0 +1,3 @@
+from haystack.preview.components.samplers.top_p import TopPSampler
+
+__all__ = ["TopPSampler"]
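
The new `TopPSampler` export points at nucleus (top-p) filtering. As an illustration of the general technique only, not this component's actual implementation: top-p softmax-normalizes candidate scores and keeps the top-ranked candidates until their cumulative probability first reaches a threshold `p`.

```python
import math


def top_p_filter(scores: list[float], top_p: float = 0.9) -> list[int]:
    """Return indices of the highest-scoring items whose cumulative
    softmax probability first reaches top_p (illustrative sketch)."""
    # Softmax-normalize the raw scores into probabilities.
    max_score = max(scores)
    exps = [math.exp(s - max_score) for s in scores]
    total = sum(exps)
    probs = [e / total for e in exps]
    # Greedily keep the most probable items until the mass threshold is hit.
    order = sorted(range(len(scores)), key=lambda i: probs[i], reverse=True)
    kept, cumulative = [], 0.0
    for i in order:
        kept.append(i)
        cumulative += probs[i]
        if cumulative >= top_p:
            break
    return kept


print(top_p_filter([2.0, 1.0, 0.1], top_p=0.8))  # -> [0, 1]
```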
