
Commit

Merge remote-tracking branch 'origin/dev/join_docs' into dev/join_docs
nickprock committed Oct 9, 2023
2 parents acd5009 + 37c3092 commit c47e0c7
Showing 10 changed files with 394 additions and 46 deletions.
52 changes: 51 additions & 1 deletion README.md
@@ -9,7 +9,57 @@
 | Meta | ![Discord](https://img.shields.io/discord/993534733298450452?logo=discord) ![Twitter Follow](https://img.shields.io/twitter/follow/deepset_ai) |
 </div>
 
-[Haystack](https://haystack.deepset.ai/) is an end-to-end NLP framework that enables you to build applications powered by LLMs, Transformer models, vector search and more. Whether you want to perform question answering, answer generation, semantic document search, or build tools that are capable of complex decision making and query resolution, you can use the state-of-the-art NLP models with Haystack to build end-to-end NLP applications solving your use case.
+[Haystack](https://haystack.deepset.ai/) is an end-to-end NLP framework that enables you to build applications powered by LLMs, Transformer models, vector search and more. Whether you want to perform question answering, answer generation, semantic document search, or build tools that are capable of complex decision-making and query resolution, you can use the state-of-the-art NLP models with Haystack to build end-to-end NLP applications solving your use case.
 
+## Quickstart
+
+Haystack is built around the concept of pipelines. A pipeline is a powerful structure that performs an NLP task. It's made up of components connected together. For example, you can connect a `Retriever` and a `PromptNode` to build a Generative Question Answering pipeline that uses your own data.
+
+Try out how Haystack answers questions about Game of Thrones using the Retrieval Augmented Generation (RAG) approach 👇
+
+First, run the minimal Haystack installation:
+
+```sh
+pip install farm-haystack
+```
+
+Then, index your data to the DocumentStore, build a RAG pipeline, and ask a question on your data:
+
+```python
+from haystack.document_stores import InMemoryDocumentStore
+from haystack.utils import build_pipeline, add_example_data, print_answers
+
+# We are model agnostic :) Here, you can choose from: "anthropic", "cohere", "huggingface", and "openai".
+provider = "openai"
+API_KEY = "sk-..."  # ADD YOUR KEY HERE
+
+# We support many different databases. Here, we load a simple and lightweight in-memory database.
+document_store = InMemoryDocumentStore(use_bm25=True)
+
+# Download and add Game of Thrones TXT articles to Haystack DocumentStore.
+# You can also provide a folder with your local documents.
+add_example_data(document_store, "data/GoT_getting_started")
+
+# Build a pipeline with a Retriever to get relevant documents for the query and a PromptNode interacting with LLMs using a custom prompt.
+pipeline = build_pipeline(provider, API_KEY, document_store)
+
+# Ask a question on the data you just added.
+result = pipeline.run(query="Who is the father of Arya Stark?")
+
+# For details, like which documents were used to generate the answer, look into the `result` object.
+print_answers(result, details="medium")
+```
+
+The output of the pipeline will reference the documents used to generate the answer:
+
+```
+'Query: Who is the father of Arya Stark?'
+'Answers:'
+[{'answer': 'The father of Arya Stark is Lord Eddard Stark of '
+            'Winterfell. [Document 1, Document 4, Document 5]'}]
+```
+
+Congratulations, you have just built your first Haystack app!
+
 ## Core Concepts
 
32 changes: 23 additions & 9 deletions e2e/preview/pipelines/test_extractive_qa_pipeline.py
@@ -1,25 +1,39 @@
+import json
+
 from haystack.preview import Pipeline, Document
 from haystack.preview.document_stores import MemoryDocumentStore
 from haystack.preview.components.retrievers import MemoryBM25Retriever
 from haystack.preview.components.readers import ExtractiveReader
 
 
-def test_extractive_qa_pipeline():
-    document_store = MemoryDocumentStore()
+def test_extractive_qa_pipeline(tmp_path):
+    # Create the pipeline
+    qa_pipeline = Pipeline()
+    qa_pipeline.add_component(instance=MemoryBM25Retriever(document_store=MemoryDocumentStore()), name="retriever")
+    qa_pipeline.add_component(instance=ExtractiveReader(model_name_or_path="deepset/tinyroberta-squad2"), name="reader")
+    qa_pipeline.connect("retriever", "reader")
+
+    # Draw the pipeline
+    qa_pipeline.draw(tmp_path / "test_extractive_qa_pipeline.png")
+
+    # Serialize the pipeline to JSON
+    with open(tmp_path / "test_bm25_rag_pipeline.json", "w") as f:
+        print(json.dumps(qa_pipeline.to_dict(), indent=4))
+        json.dump(qa_pipeline.to_dict(), f)
+
+    # Load the pipeline back
+    with open(tmp_path / "test_bm25_rag_pipeline.json", "r") as f:
+        qa_pipeline = Pipeline.from_dict(json.load(f))
 
+    # Populate the document store
     documents = [
         Document(text="My name is Jean and I live in Paris."),
         Document(text="My name is Mark and I live in Berlin."),
         Document(text="My name is Giorgio and I live in Rome."),
     ]
+    qa_pipeline.get_component("retriever").document_store.write_documents(documents)
 
-    document_store.write_documents(documents)
-
-    qa_pipeline = Pipeline()
-    qa_pipeline.add_component(instance=MemoryBM25Retriever(document_store=document_store), name="retriever")
-    qa_pipeline.add_component(instance=ExtractiveReader(model_name_or_path="deepset/tinyroberta-squad2"), name="reader")
-    qa_pipeline.connect("retriever", "reader")
-
+    # Query and assert
     questions = ["Who lives in Paris?", "Who lives in Berlin?", "Who lives in Rome?"]
     answers_spywords = ["Jean", "Mark", "Giorgio"]
 
93 changes: 58 additions & 35 deletions e2e/preview/pipelines/test_rag_pipelines.py
@@ -1,4 +1,5 @@
 import os
+import json
 import pytest
 
 from haystack.preview import Pipeline, Document
@@ -15,15 +16,8 @@
     not os.environ.get("OPENAI_API_KEY", None),
     reason="Export an env var called OPENAI_API_KEY containing the OpenAI API key to run this test.",
 )
-def test_bm25_rag_pipeline():
-    document_store = MemoryDocumentStore()
-
-    documents = [
-        Document(text="My name is Jean and I live in Paris."),
-        Document(text="My name is Mark and I live in Berlin."),
-        Document(text="My name is Giorgio and I live in Rome."),
-    ]
-
+def test_bm25_rag_pipeline(tmp_path):
+    # Create the RAG pipeline
     prompt_template = """
     Given these documents, answer the question.\nDocuments:
     {% for doc in documents %}
@@ -33,11 +27,8 @@ def test_bm25_rag_pipeline():
     \nQuestion: {{question}}
     \nAnswer:
     """
-
-    document_store.write_documents(documents)
-
     rag_pipeline = Pipeline()
-    rag_pipeline.add_component(instance=MemoryBM25Retriever(document_store=document_store), name="retriever")
+    rag_pipeline.add_component(instance=MemoryBM25Retriever(document_store=MemoryDocumentStore()), name="retriever")
    rag_pipeline.add_component(instance=PromptBuilder(template=prompt_template), name="prompt_builder")
    rag_pipeline.add_component(instance=GPTGenerator(api_key=os.environ.get("OPENAI_API_KEY")), name="llm")
    rag_pipeline.add_component(instance=AnswerBuilder(), name="answer_builder")
@@ -47,6 +38,26 @@
     rag_pipeline.connect("llm.metadata", "answer_builder.metadata")
     rag_pipeline.connect("retriever", "answer_builder.documents")
 
+    # Draw the pipeline
+    rag_pipeline.draw(tmp_path / "test_bm25_rag_pipeline.png")
+
+    # Serialize the pipeline to JSON
+    with open(tmp_path / "test_bm25_rag_pipeline.json", "w") as f:
+        json.dump(rag_pipeline.to_dict(), f)
+
+    # Load the pipeline back
+    with open(tmp_path / "test_bm25_rag_pipeline.json", "r") as f:
+        rag_pipeline = Pipeline.from_dict(json.load(f))
+
+    # Populate the document store
+    documents = [
+        Document(text="My name is Jean and I live in Paris."),
+        Document(text="My name is Mark and I live in Berlin."),
+        Document(text="My name is Giorgio and I live in Rome."),
+    ]
+    rag_pipeline.get_component("retriever").document_store.write_documents(documents)
+
+    # Query and assert
     questions = ["Who lives in Paris?", "Who lives in Berlin?", "Who lives in Rome?"]
     answers_spywords = ["Jean", "Mark", "Giorgio"]
 
@@ -71,15 +82,8 @@ def test_bm25_rag_pipeline():
     not os.environ.get("OPENAI_API_KEY", None),
     reason="Export an env var called OPENAI_API_KEY containing the OpenAI API key to run this test.",
 )
-def test_embedding_retrieval_rag_pipeline():
-    document_store = MemoryDocumentStore()
-
-    documents = [
-        Document(text="My name is Jean and I live in Paris."),
-        Document(text="My name is Mark and I live in Berlin."),
-        Document(text="My name is Giorgio and I live in Rome."),
-    ]
-
+def test_embedding_retrieval_rag_pipeline(tmp_path):
+    # Create the RAG pipeline
     prompt_template = """
     Given these documents, answer the question.\nDocuments:
     {% for doc in documents %}
@@ -89,22 +93,14 @@ def test_embedding_retrieval_rag_pipeline():
     \nQuestion: {{question}}
     \nAnswer:
     """
-
-    indexing_pipeline = Pipeline()
-    indexing_pipeline.add_component(
-        instance=SentenceTransformersDocumentEmbedder(model_name_or_path="sentence-transformers/all-mpnet-base-v2"),
-        name="document_embedder",
-    )
-    indexing_pipeline.add_component(instance=DocumentWriter(document_store=document_store), name="document_writer")
-    indexing_pipeline.connect("document_embedder", "document_writer")
-    indexing_pipeline.run({"document_embedder": {"documents": documents}})
-
     rag_pipeline = Pipeline()
     rag_pipeline.add_component(
-        instance=SentenceTransformersTextEmbedder(model_name_or_path="sentence-transformers/all-mpnet-base-v2"),
+        instance=SentenceTransformersTextEmbedder(model_name_or_path="sentence-transformers/all-MiniLM-L6-v2"),
         name="text_embedder",
     )
-    rag_pipeline.add_component(instance=MemoryEmbeddingRetriever(document_store=document_store), name="retriever")
+    rag_pipeline.add_component(
+        instance=MemoryEmbeddingRetriever(document_store=MemoryDocumentStore()), name="retriever"
+    )
     rag_pipeline.add_component(instance=PromptBuilder(template=prompt_template), name="prompt_builder")
     rag_pipeline.add_component(instance=GPTGenerator(api_key=os.environ.get("OPENAI_API_KEY")), name="llm")
     rag_pipeline.add_component(instance=AnswerBuilder(), name="answer_builder")
@@ -115,6 +111,34 @@ def test_embedding_retrieval_rag_pipeline():
     rag_pipeline.connect("llm.metadata", "answer_builder.metadata")
     rag_pipeline.connect("retriever", "answer_builder.documents")
 
+    # Draw the pipeline
+    rag_pipeline.draw(tmp_path / "test_embedding_rag_pipeline.png")
+
+    # Serialize the pipeline to JSON
+    with open(tmp_path / "test_bm25_rag_pipeline.json", "w") as f:
+        json.dump(rag_pipeline.to_dict(), f)
+
+    # Load the pipeline back
+    with open(tmp_path / "test_bm25_rag_pipeline.json", "r") as f:
+        rag_pipeline = Pipeline.from_dict(json.load(f))
+
+    # Populate the document store
+    documents = [
+        Document(text="My name is Jean and I live in Paris."),
+        Document(text="My name is Mark and I live in Berlin."),
+        Document(text="My name is Giorgio and I live in Rome."),
+    ]
+    document_store = rag_pipeline.get_component("retriever").document_store
+    indexing_pipeline = Pipeline()
+    indexing_pipeline.add_component(
+        instance=SentenceTransformersDocumentEmbedder(model_name_or_path="sentence-transformers/all-MiniLM-L6-v2"),
+        name="document_embedder",
+    )
+    indexing_pipeline.add_component(instance=DocumentWriter(document_store=document_store), name="document_writer")
+    indexing_pipeline.connect("document_embedder", "document_writer")
+    indexing_pipeline.run({"document_embedder": {"documents": documents}})
+
+    # Query and assert
     questions = ["Who lives in Paris?", "Who lives in Berlin?", "Who lives in Rome?"]
     answers_spywords = ["Jean", "Mark", "Giorgio"]
 
@@ -129,7 +153,6 @@ def test_embedding_retrieval_rag_pipeline():
 
     assert len(result["answer_builder"]["answers"]) == 1
     generated_answer = result["answer_builder"]["answers"][0]
-    print(generated_answer)
     assert spyword in generated_answer.data
     assert generated_answer.query == question
     assert hasattr(generated_answer, "documents")
7 changes: 7 additions & 0 deletions haystack/preview/README.md
@@ -23,5 +23,12 @@ pip install farm-haystack
 ```
 The `farm-haystack` package includes all new features of Haystack 2.0. Note that updates to this package occur less frequently compared to `haystack-ai`. So, you might not get all the latest Haystack 2.0 features immediately when using `farm-haystack`.
 
+## 🚗 Getting Started
+
+In our **end-to-end tests** you can find example code for the following pipelines:
+- [RAG pipeline](https://github.com/deepset-ai/haystack/blob/main/e2e/preview/pipelines/test_rag_pipelines.py)
+- [Extractive QA pipeline](https://github.com/deepset-ai/haystack/blob/main/e2e/preview/pipelines/test_extractive_qa_pipeline.py)
+- more to come; check out the [folder](https://github.com/deepset-ai/haystack/blob/main/e2e/preview/)
+
 ## 💙 Stay Updated
 To learn how and when components will be migrated to the new major version, have a look at the [Migrate Components to Pipeline v2](https://github.com/deepset-ai/haystack/issues/5265) roadmap item, where we keep track of issues and PRs about Haystack 2.0. When you have questions, you can always contact us using the [Shaping Haystack 2.0](https://github.com/deepset-ai/haystack/discussions/5568) discussion or [Haystack Discord server](https://discord.com/channels/993534733298450452/1141683185458094211).
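
For a quick feel of the preview API, here is a condensed sketch distilled from the extractive QA test linked above. It is illustrative only: the per-component `run` input format and the way answers are read out of the result are assumptions based on the preview tests, and the reader model is downloaded on first use.

```python
from haystack.preview import Pipeline, Document
from haystack.preview.document_stores import MemoryDocumentStore
from haystack.preview.components.retrievers import MemoryBM25Retriever
from haystack.preview.components.readers import ExtractiveReader

# Build the same retriever -> reader pipeline as the extractive QA e2e test.
pipeline = Pipeline()
pipeline.add_component(instance=MemoryBM25Retriever(document_store=MemoryDocumentStore()), name="retriever")
pipeline.add_component(instance=ExtractiveReader(model_name_or_path="deepset/tinyroberta-squad2"), name="reader")
pipeline.connect("retriever", "reader")

# Write documents through the retriever's store, as the tests do.
pipeline.get_component("retriever").document_store.write_documents(
    [Document(text="My name is Jean and I live in Paris.")]
)

# Assumption: preview pipelines take a dict of per-component inputs.
question = "Who lives in Paris?"
result = pipeline.run({"retriever": {"query": question}, "reader": {"query": question}})
print(result["reader"]["answers"])
```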
2 changes: 1 addition & 1 deletion haystack/preview/components/generators/openai/gpt.py
@@ -121,7 +121,7 @@ def from_dict(cls, data: Dict[str, Any]) -> "GPTGenerator":
         """
         init_params = data.get("init_parameters", {})
         streaming_callback = None
-        if "streaming_callback" in init_params:
+        if "streaming_callback" in init_params and init_params["streaming_callback"]:
             parts = init_params["streaming_callback"].split(".")
             module_name = ".".join(parts[:-1])
             function_name = parts[-1]
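
The extra truthiness check above guards against a serialized `GPTGenerator` that was created without a streaming callback: the `init_parameters` dict then presumably carries `"streaming_callback": None`, so the key is present but `None.split(".")` would raise `AttributeError`. A minimal sketch of the guard in isolation, where `resolve_callback` is a hypothetical stand-in for the deserialization logic above:

```python
import importlib


def resolve_callback(init_params: dict):
    # Hypothetical helper mirroring the guard in GPTGenerator.from_dict.
    # Membership alone is not enough: a key that is present but None
    # would crash on None.split(".").
    if "streaming_callback" in init_params and init_params["streaming_callback"]:
        parts = init_params["streaming_callback"].split(".")
        module = importlib.import_module(".".join(parts[:-1]))
        return getattr(module, parts[-1])
    return None


assert resolve_callback({}) is None
assert resolve_callback({"streaming_callback": None}) is None  # the old guard raised here
assert resolve_callback({"streaming_callback": "builtins.print"}) is print
```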
3 changes: 3 additions & 0 deletions haystack/preview/components/samplers/__init__.py
@@ -0,0 +1,3 @@
+from haystack.preview.components.samplers.top_p import TopPSampler
+
+__all__ = ["TopPSampler"]
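
The new `TopPSampler` export points at nucleus (top-p) filtering. As an illustration of the general technique only, not this component's actual implementation: top-p softmax-normalizes candidate scores and keeps the top-ranked candidates until their cumulative probability first reaches a threshold `p`.

```python
import math


def top_p_filter(scores: list[float], top_p: float = 0.9) -> list[int]:
    """Return indices of the highest-scoring items whose cumulative
    softmax probability first reaches top_p (illustrative sketch)."""
    # Softmax-normalize the raw scores into probabilities.
    max_score = max(scores)
    exps = [math.exp(s - max_score) for s in scores]
    total = sum(exps)
    probs = [e / total for e in exps]
    # Greedily keep the most probable items until the mass threshold is hit.
    order = sorted(range(len(scores)), key=lambda i: probs[i], reverse=True)
    kept, cumulative = [], 0.0
    for i in order:
        kept.append(i)
        cumulative += probs[i]
        if cumulative >= top_p:
            break
    return kept


print(top_p_filter([2.0, 1.0, 0.1], top_p=0.8))  # -> [0, 1]
```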
