Releases: deepset-ai/haystack
v1.23.0
⭐️ Highlights
🪨 Amazon Bedrock support for PromptNode (#6226)
Haystack now supports Amazon Bedrock models, including all existing and previously announced
models, like Llama-2-70b-chat. To use these models, simply pass the model ID in the
model_name_or_path parameter, like you do for any other model. For details, see
Amazon Bedrock Documentation.
For example, the following code loads the Llama 2 Chat 13B model:
from haystack.nodes import PromptNode
prompt_node = PromptNode(model_name_or_path="meta.llama2-13b-chat-v1")
🗺️ Support for MongoDB Atlas Document Store (#6471)
With this release, we introduce support for MongoDB Atlas as a Document Store. Try it with:
from haystack.document_stores.mongodb_atlas import MongoDBAtlasDocumentStore
document_store = MongoDBAtlasDocumentStore(
mongo_connection_string="mongodb+srv://USER:PASSWORD@HOST/?retryWrites=true&w=majority",
database_name="database",
collection_name="collection",
)
...
document_store.write_documents(...)
Note that you need MongoDB Atlas credentials to fill the connection string. You can get such credentials by registering here: https://www.mongodb.com/cloud/atlas/register
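The query options in an Atlas connection string are plain key=value pairs joined by &. A small, hypothetical helper (not part of Haystack) that assembles such a URI with URL-escaped credentials:

```python
from urllib.parse import quote_plus

def build_atlas_uri(user: str, password: str, host: str, **options: str) -> str:
    """Assemble a mongodb+srv connection string. Hypothetical helper for
    illustration: credentials are URL-escaped, and options become the
    key=value query string MongoDB expects."""
    query = "&".join(f"{key}={value}" for key, value in options.items())
    return f"mongodb+srv://{quote_plus(user)}:{quote_plus(password)}@{host}/?{query}"

uri = build_atlas_uri("alice", "p@ss", "cluster0.example.mongodb.net",
                      retryWrites="true", w="majority")
print(uri)
# → mongodb+srv://alice:p%40ss@cluster0.example.mongodb.net/?retryWrites=true&w=majority
```

The host name above is made up; substitute the cluster host from your Atlas dashboard.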
⬆️ Upgrade Notes
- Remove deprecated OpenAIAnswerGenerator, BaseGenerator, GenerativeQAPipeline, and related tests. Generative QA pipelines should use PromptNode instead. See https://haystack.deepset.ai/tutorials/22_pipeline_with_promptnode.
🚀 New Features
- Add PptxConverter: a node to convert pptx files to Haystack Documents.
- Add split_length by token in PreProcessor.
- Support for dense embedding instructions used in retrieval models such as BGE and LLM-Embedder.
- You can use Amazon Bedrock models in Haystack.
- Add MongoDBAtlasDocumentStore, providing support for MongoDB Atlas as a document store.
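The new split_length-by-token option in PreProcessor can be pictured with a toy whitespace "tokenizer" (illustrative only; the real PreProcessor uses a model tokenizer and supports overlap and other options):

```python
from typing import List

def split_by_token(text: str, split_length: int) -> List[str]:
    """Toy sketch: chunk text into pieces of at most split_length tokens,
    using whitespace words as a stand-in for real tokenizer tokens."""
    tokens = text.split()
    return [" ".join(tokens[i:i + split_length])
            for i in range(0, len(tokens), split_length)]

chunks = split_by_token("one two three four five", 2)
print(chunks)
# → ['one two', 'three four', 'five']
```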
⚡️ Enhancement Notes
- Change the PromptModel constructor parameter invocation_layer_class to accept a str too. If a str is used, the invocation layer class is imported and used. This should ease serialisation to YAML when using invocation_layer_class with PromptModel.
- Users can now define the number of pods and the pod type directly when creating a PineconeDocumentStore instance.
- Add batch_size to the init method of the FAISS Document Store. It works as the default value for all methods of the FAISS Document Store that support batch_size.
- Introduce a new timeout keyword argument in PromptNode, fixing issue #5380 and giving more control over individual calls to OpenAI.
- Upgrade Transformers to the latest version 4.35.2. This version adds support for DistilWhisper, Fuyu, Kosmos-2, SeamlessM4T, and Owl-v2.
- Upgrade openai-whisper to version 20231106 and simplify installation through the re-introduced audio extra. The latest openai-whisper version unpins its tiktoken dependency, which resolves a version conflict with Haystack's dependencies.
- Make it possible to load additional fields from a SQuAD-format file into the meta field of the Labels.
- Add a new model_kwargs variable to the ExtractiveReader so you can pass the different loading options supported by Hugging Face.
- Add the new token limit for the gpt-4-1106-preview model.
🐛 Bug Fixes
- Fix Pipeline.load_from_deepset_cloud to work with the latest version of deepset Cloud.
- When using JoinDocuments with join_mode=concatenate (the default) and passing duplicate documents, including some with a null score, this node raised an exception. Now this case is handled correctly and the documents are joined as expected.
- Adds LostInTheMiddleRanker, DiversityRanker, and RecentnessRanker to haystack/nodes/__init__.py and thus ensures that they are included in JSON schema generation.
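The JoinDocuments fix boils down to treating a null score as "lowest possible" when deduplicating, instead of crashing on the comparison. A pure-Python sketch of that behaviour (illustrative only, not Haystack's actual implementation):

```python
from typing import Dict, List

def concatenate_join(doc_lists: List[List[Dict]]) -> List[Dict]:
    """Sketch of concatenate-mode joining: duplicates (same id) are merged,
    keeping the highest score; a None score never raises and loses to any number."""
    best: Dict[str, Dict] = {}
    for docs in doc_lists:
        for doc in docs:
            prev = best.get(doc["id"])
            if prev is None:
                best[doc["id"]] = doc
                continue
            prev_score, cur_score = prev.get("score"), doc.get("score")
            # Treat None as the lowest score instead of comparing None with a float.
            if cur_score is not None and (prev_score is None or cur_score > prev_score):
                best[doc["id"]] = doc
    return list(best.values())

merged = concatenate_join([
    [{"id": "a", "score": 0.9}, {"id": "b", "score": None}],
    [{"id": "b", "score": 0.4}],
])
print(merged)
# → [{'id': 'a', 'score': 0.9}, {'id': 'b', 'score': 0.4}]
```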
v2.0.0-beta.1
Introduction
We are happy to officially share Haystack 2.0-beta with you. The new version is a complete rework of the pipeline, our core concept, with production readiness, ease of use, and customizability in mind.
Haystack 2.0-Beta Documentation.
Check the available features in this Beta release (see section below).
Try out Haystack 2.0-Beta in “Advent of Haystack”.
What does the “Beta” mean for me?
Production readiness also means caring about stability. Therefore, we decided to release a beta version now and test it thoroughly in public over the next weeks. We will add more features, and we might introduce breaking changes, until the stable 2.0 release in late Q1 2024.
We invite you to try this beta version and give candid feedback: it will be heard, and we will change Haystack accordingly. We’ve put together 10 code challenges for you in our “Advent of Haystack” to get your hands on it. We don’t recommend migrating your production pipelines to 2.0 beta yet.
We will support Haystack 1.x with updates and important features being added to the codebase even after the final 2.0.0 release, to give users time to migrate.
⭐️ What’s changed?
For a detailed overview of what’s changed in this Beta release, check out our article “Introducing Haystack 2.0 and Advent of Haystack”.
The bulk of the work in this release introduces changes to the fundamental design of Haystack's core concepts: pipelines, components, and document stores.
In the last few months, we've been working with our community members and partners to start adding integrations for Haystack 2.0. Today, along with the beta package, you can also try the integrations tagged with Haystack 2.0 in our Integration inventory!
🚀 Getting started
One way to get started with Haystack 2.0 Beta is to participate in the “Advent of Haystack” and give us feedback on how you got along.
To install the new package:
pip install haystack-ai
To use a simple RAG pipeline:
from haystack import Document
from haystack.document_stores import InMemoryDocumentStore
from haystack.pipeline_utils import build_rag_pipeline
API_KEY = "sk-xxx" # ADD YOUR OPENAI API KEY
# We support many different databases. Here we load a simple and lightweight in-memory document store.
document_store = InMemoryDocumentStore()
# Create some example documents and add them to the document store.
documents = [
Document(content="My name is Jean and I live in Paris."),
Document(content="My name is Mark and I live in Berlin."),
Document(content="My name is Giorgio and I live in Rome."),
]
document_store.write_documents(documents)
# Let's now build a simple RAG pipeline that uses a generative model to answer questions.
rag_pipeline = build_rag_pipeline(llm_api_key=API_KEY, document_store=document_store)
answers = rag_pipeline.run(query="Who lives in Rome?")
print(answers.data)
For more details on how to get started see: https://docs.haystack.deepset.ai/v2.0/docs/get_started
🪶 List of Features
✅ Ready in this Beta release
🏗️ Under construction
Feature | Haystack 2.0-Beta |
---|---|
Document Stores | |
InMemoryDocumentStore | ✅ |
ElasticsearchDocumentStore | ✅ |
OpenSearchDocumentStore | ✅ |
ChromaDocumentStore | ✅ |
MarqoDocumentStore | ✅ |
FAISSDocumentStore | 🏗️ |
PineconeDocumentStore | 🏗️ |
WeaviateDocumentStore | 🏗️ |
MilvusDocumentStore | 🏗️ |
QdrantDocumentStore | 🏗️ |
PGVectorDocumentStore | 🏗️ |
MongoDBAtlasDocumentStore | 🏗️ |
Generators | |
GPTGenerator | ✅ |
HuggingFaceLocalGenerator | ✅ |
HuggingFaceTGIGenerator | ✅ |
GradientGenerator | ✅ |
Anthropic - Claude | 🏗️ |
Cohere - generate | ✅ |
AzureGPT | 🏗️ |
AWS Bedrock | 🏗️ |
AWS SageMaker | 🏗️ |
PromptNode | 🏗️ |
PromptBuilder | ✅ |
AnswerBuilder | ✅ |
Embedders | |
OpenAI Embedder | ✅ |
SentenceTransformers Embedder | ✅ |
Cohere - embed | 🏗️ |
Gradient Embedder (external) | ✅ |
Retrievers | |
InMemoryBM25Retriever | ✅ |
InMemoryEmbeddingRetriever | ✅ |
ElasticsearchBM25Retriever | ✅ |
ElasticsearchEmbeddingRetriever | ✅ |
OpensearchBM25Retriever | ✅ |
OpensearchEmbeddingRetriever | ✅ |
SerperDevWebSearch | ✅ |
MultiModalRetriever | 🏗️ |
TableTextRetriever | 🏗️ |
DensePassageRetriever | 🏗️ |
Rankers | |
TransformersSimilarityRanker | ✅ |
CohereRanker | 🏗️ |
DiversityRanker | 🏗️ |
LostInTheMiddleRanker | 🏗️ |
RecentnessRanker | 🏗️ |
MetaFieldRanker | ✅ |
Readers | |
ExtractiveReader (successor of both FARMReader and TransformersReader) | ✅ |
TableReader | 🏗️ |
Data Processing | |
Local + Remote WhisperTranscriber | ✅ |
UrlCacheChecker | ✅ |
LinkContentFetcher | ✅ |
AzureOCRDocumentConverter | ✅ |
HTMLToDocument | ✅ |
PyPDFToDocument | ✅ |
TikaDocumentConverter | ✅ |
TextFileToDocument | ✅ |
MarkdownToDocument | ✅ |
DocumentCleaner | ✅ |
TextDocumentSplitter | ✅ |
TextLanguageClassifier | ✅ |
FileTypeRouter | ✅ |
MetadataRouter | ✅ |
DocumentWriter | ✅ |
DocumentJoiner | ✅ |
Misc | |
Evaluation | 🏗️ |
Agents | 🏗️ |
Conversational Agent | 🏗️ |
TopPSampler | ✅ |
TransformersSummarizer | 🏗️ |
TransformersTranslator | 🏗️ |
v1.22.1
Enhancement Notes
- Add new token limit for gpt-4-1106-preview model
Bug Fixes
- When using JoinDocuments with join_mode=concatenate (default) and passing duplicate documents, including some with a null score, this node raised an exception. Now this case is handled correctly and the documents are joined as expected.
v1.22.0
⭐️ Highlights
Some additions to the Haystack 2.0 preview:
New additions include a ByteStream type for binary data abstraction and the ChatMessage data class to streamline chat LLM component integration. AzureOCRDocumentConverter, HTMLToDocument, and PyPDFToDocument further expand document conversion capabilities. TransformersSimilarityRanker and TopPSampler improve document ranking and query handling. HuggingFaceLocalGenerator adds to the ever-growing set of LLM components. These updates, along with a host of minor fixes and refinements, mark a significant step towards the upcoming Haystack 2.0 beta release.
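The ByteStream and ChatMessage ideas can be pictured roughly as small data classes: one wrapping binary payloads with metadata, the other pairing message content with a role. This is a simplified sketch of the concept, not the exact Haystack 2.0 classes or their fields:

```python
from dataclasses import dataclass, field
from typing import Any, Dict

@dataclass
class ByteStream:
    """Simplified sketch: raw binary payload plus arbitrary metadata."""
    data: bytes
    metadata: Dict[str, Any] = field(default_factory=dict)

@dataclass
class ChatMessage:
    """Simplified sketch: message content plus a role such as
    'user', 'assistant', or 'system'."""
    content: str
    role: str

stream = ByteStream(data=b"%PDF-1.7 ...", metadata={"file_name": "report.pdf"})
msg = ChatMessage(content="Hello!", role="user")
print(msg.role)
# → user
```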
⬆️ Upgrade Notes
- This update enables all Pinecone index types to be used, including Starter. Previously, the Pinecone Starter index type couldn't be used as a document store. Due to limitations of this index type (https://docs.pinecone.io/docs/starter-environment), in the current implementation fetching documents is limited to the Pinecone query vector limit (10,000 vectors). Accordingly, if the number of documents in the index is above this limit, some PineconeDocumentStore functions will be limited.
- Removes the audio, ray, onnx and beir extras from the extra group all.
🚀 New Features
- Add experimental support for asynchronous Pipeline run
⚡️ Enhancement Notes
- Added support for Apple Silicon GPU acceleration through "mps pytorch", enabling better performance on Apple M1 hardware.
- Document writer returns the number of documents written.
- Added support for using on_final_answer through the Agent callback_manager.
- Add asyncio support to the OpenAI invocation layer. PromptNode can now be run asynchronously by calling the arun method.
- Add the search_engine_kwargs param to WebRetriever so it can be propagated to WebSearch. This is useful, for example, to pass the engine id when using Google Custom Search.
- Upgrade Transformers to the latest version 4.34.1. This version adds support for the new Mistral, Persimmon, BROS, ViTMatte, and Nougat models.
- Make JoinDocuments return only the document with the highest score if there are duplicate documents in the list.
- Add list_of_paths argument to utils.convert_files_to_docs to allow input of list of file paths to be converted, instead of, or as well as, the current dir_path argument.
- Optimize particular methods from PineconeDocumentStore (delete_documents and _get_vector_count)
- Update the deepset Cloud SDK to the new endpoint format for saving new pipeline configs.
- Add alias names for Cohere embed models for an easier map between names
⚠️ Deprecation Notes
- Deprecate OpenAIAnswerGenerator in favor of PromptNode. OpenAIAnswerGenerator will be removed in Haystack 1.23.
🐛 Bug Fixes
- Adds LostInTheMiddleRanker, DiversityRanker, and RecentnessRanker to haystack/nodes/__init__.py and thus ensures that they are included in JSON schema generation.
- Fixed the bug that prevented the correct usage of the ChatGPT invocation layer in 1.21.1. Added async support for the ChatGPT invocation layer.
- Added documents_store.update_embeddings call to pipeline examples so that embeddings are calculated for newly added documents.
- Remove unsupported medium and finance-sentiment models from supported Cohere embed model list
🩵 Haystack 2.0 preview
- Add AzureOCRDocumentConverter to convert files of different types using Azure's Document Intelligence Service.
- Add ByteStream type to send binary raw data across components in a pipeline.
- Introduce ChatMessage data class to facilitate structured handling and processing of message content within LLM chat interactions.
- Adds ChatMessage templating in PromptBuilder.
- Adds HTMLToDocument component to convert HTML to a Document.
- Adds SimilarityRanker, a component that ranks a list of Documents based on their similarity to the query.
- Introduce the StreamingChunk dataclass for efficiently handling chunks of data streamed from a language model, encapsulating both the content and associated metadata for systematic processing.
- Adds TopPSampler, a component that selects documents based on the cumulative probability of the Document scores using top-p (nucleus) sampling.
- Add dumps, dump, loads and load methods to save and load pipelines in Yaml format.
- Adopt Hugging Face token instead of the deprecated use_auth_token. Add this parameter to ExtractiveReader and SimilarityRanker to allow loading private models. Proper handling of token during serialization: if it is a string (a possible valid token) it is not serialized.
- Add mime_type field to ByteStream dataclass.
- The Document dataclass checks if id_hash_keys is None or empty in __post_init__. If so, it uses the default factory to set a default valid value.
- Rework Document.id generation: if an id is not explicitly set, it's generated using all Document fields' values; score is not used.
- Change Document's embedding field type from numpy.ndarray to List[float]
- Fixed a bug that caused TextDocumentSplitter and DocumentCleaner to ignore id_hash_keys and create Documents with duplicate ids if the documents differed only in their metadata.
- Fix TextDocumentSplitter failing when run with an empty list
- Better management of the API key in the GPT Generator. The API key is never serialized. Make the api_base_url parameter actually used (previously it was ignored).
- Add a minimal version of HuggingFaceLocalGenerator, a component that can run Hugging Face models locally to generate text.
- Migrate RemoteWhisperTranscriber to OpenAI SDK.
- Add OpenAI Document Embedder. It computes embeddings of Documents using OpenAI models. The embedding of each Document is stored in the embedding field of the Document.
- Add the TextDocumentSplitter component for Haystack 2.0 that splits a Document with long text into multiple Documents with shorter texts. Thereby the texts match the maximum length that the language models in Embedders or other components can process.
- Refactor OpenAIDocumentEmbedder to enrich documents with embeddings instead of recreating them.
- Refactor SentenceTransformersDocumentEmbedder to enrich documents with embeddings instead of recreating them.
- Remove "api_key" from serialization of AzureOCRDocumentConverter and SerperDevWebSearch.
- Removed implementations of from_dict and to_dict from all components where they had the same effect as the default implementation from Canals: https://github.com/deepset-ai/canals/blob/main/canals/serialization.py#L12-L13 This refactoring does not change the behavior of the components.
- Remove array field from Document dataclass.
- Remove id_hash_keys field from Document dataclass. id_hash_keys has been also removed from Components that were using it:
- DocumentCleaner
- TextDocumentSplitter
- PyPDFToDocument
- AzureOCRDocumentConverter
- HTMLToDocument
- TextFileToDocument
- TikaDocumentConverter
- Enhanced file routing capabilities with the introduction of ByteStream handling, and improved clarity by renaming the router to FileTypeRouter.
- Rename MemoryDocumentStore to InMemoryDocumentStore Rename MemoryBM25Retriever to InMemoryBM25Retriever Rename MemoryEmbeddingRetriever to InMemoryEmbeddingRetriever
- Renamed ExtractiveReader's input from document to documents to match its type List[Document].
- Rename SimilarityRanker to TransformersSimilarityRanker, as there will be more similarity rankers in the future.
- Allow specifying stopwords to stop text generation for HuggingFaceLocalGenerator.
- Add basic telemetry to Haystack 2.0 pipelines
- Added DocumentCleaner, which removes extra whitespace, empty lines, headers, etc. from Documents containing text. Useful as a preprocessing step before splitting into shorter text documents.
- Add TextLanguageClassifier component so that an input string, for example a query, can be routed to different components based on the detected language.
- Upgrade canals to 0.9.0 to support variadic inputs for Joiner c...
v1.22.0-rc3
Release Notes
v1.22.0-rc2
Bug Fixes
- Adds LostInTheMiddleRanker, DiversityRanker, and RecentnessRanker to haystack/nodes/__init__.py and thus ensures that they are included in JSON schema generation.
v1.22.0-rc1
Upgrade Notes
- This update enables all Pinecone index types to be used, including Starter. Previously, Pinecone Starter index type couldn't be used as document store. Due to limitations of this index type (https://docs.pinecone.io/docs/starter-environment), in current implementation fetching documents is limited to Pinecone query vector limit (10000 vectors). Accordingly, if the number of documents in the index is above this limit, some of PineconeDocumentStore functions will be limited.
- Removes the audio, ray, onnx and beir extras from the extra group all.
New Features
- Add experimental support for asynchronous Pipeline run
Enhancement Notes
- Added support for Apple Silicon GPU acceleration through "mps pytorch", enabling better performance on Apple M1 hardware.
- Document writer returns the number of documents written.
- Added support for using on_final_answer through the Agent callback_manager.
- Add asyncio support to the OpenAI invocation layer.
- PromptNode can now be run asynchronously by calling the arun method.
- Add search_engine_kwargs param to WebRetriever so it can be propagated to WebSearch. This is useful, for example, to pass the engine id when using Google Custom Search.
- Upgrade Transformers to the latest version 4.34.1. This version adds support for the new Mistral, Persimmon, BROS, ViTMatte, and Nougat models.
- Make JoinDocuments return only the document with the highest score if there are duplicate documents in the list.
- Add list_of_paths argument to utils.convert_files_to_docs to allow input of list of file paths to be converted, instead of, or as well as, the current dir_path argument.
- Optimize particular methods from PineconeDocumentStore (delete_documents and _get_vector_count)
- Update the deepset Cloud SDK to the new endpoint format for new saving pipeline configs.
- Add alias names for Cohere embed models for an easier map between names
Deprecation Notes
- Deprecate OpenAIAnswerGenerator in favor of PromptNode. OpenAIAnswerGenerator will be removed in Haystack 1.23.
Bug Fixes
- Fixed the bug that prevented the correct usage of ChatGPT invocation layer in 1.21.1. Added async support for ChatGPT invocation layer.
- Added documents_store.update_embeddings call to pipeline examples so that embeddings are calculated for newly added documents.
- Remove unsupported medium and finance-sentiment models from supported Cohere embed model list
Haystack 2.0 preview
- Add AzureOCRDocumentConverter to convert files of different types using Azure's Document Intelligence Service.
- Add ByteStream type to send binary raw data across components in a pipeline.
- Introduce ChatMessage data class to facilitate structured handling and processing of message content within LLM chat interactions.
- Adds ChatMessage templating in PromptBuilder
- Adds HTMLToDocument component to convert HTML to a Document.
- Adds SimilarityRanker, a component that ranks a list of Documents based on their similarity to the query.
- Introduce the StreamingChunk dataclass for efficiently handling chunks of data streamed from a language model, encapsulating both the content and associated metadata for systematic processing.
- Adds TopPSampler, a component that selects documents based on the cumulative probability of the Document scores using top-p (nucleus) sampling.
- Add dumps, dump, loads and load methods to save and load pipelines in Yaml format.
- Adopt Hugging Face token instead of the deprecated use_auth_token. Add this parameter to ExtractiveReader and SimilarityRanker to allow loading private models. Proper handling of token during serialization: if it is a string (a possible valid token) it is not serialized.
- Add mime_type field to ByteStream dataclass.
- The Document dataclass checks if id_hash_keys is None or empty in __post_init__. If so, it uses the default factory to set a default valid value.
- Rework Document.id generation: if an id is not explicitly set, it's generated using all Document fields' values; score is not used.
- Change Document's embedding field type from numpy.ndarray to List[float]
- Fixed a bug that caused TextDocumentSplitter and DocumentCleaner to ignore id_hash_keys and create Documents with duplicate ids if the documents differed only in their metadata.
- Fix TextDocumentSplitter failing when run with an empty list
- Better management of API key in GPT Generator. The API key is never serialized. Make the api_base_url parameter really used (previously it was ignored).
- Add a minimal version of HuggingFaceLocalGenerator, a component that can run Hugging Face models locally to generate text.
- Migrate RemoteWhisperTranscriber to OpenAI SDK.
- Add OpenAI Document Embedder. It computes embeddings of Documents using OpenAI models. The embedding of each Document is stored in the embedding field of the Document.
- Add the TextDocumentSplitter component for Haystack 2.0 that splits a Document with long text into multiple Documents with shorter texts. Thereby the texts match the maximum length that the language models in Embedders or other components can process.
- Refactor OpenAIDocumentEmbedder to enrich documents with embeddings instead of recreating them.
- Refactor SentenceTransformersDocumentEmbedder to enrich documents with embeddings instead of recreating them.
- Remove "api_key" from serialization of AzureOCRDocumentConverter and SerperDevWebSearch.
- Removed implementations of from_dict and to_dict from all components where they had the same effect as the default implementation from Canals: https://github.com/deepset-ai/canals/blob/main/canals/serialization.py#L12-L13 This refactoring does not change the behavior of the components.
- Remove array field from Document dataclass.
- Remove id_hash_keys field from Document dataclass. id_hash_keys has been also removed from Components that were using it:
- DocumentCleaner
- TextDocumentSplitter
- PyPDFToDocument
- AzureOCRDocumentConverter
- HTMLToDocument
- TextFileToDocument
- TikaDocumentConverter
- Enhanced file routing capabilities with the introduction of ByteStream handling, and improved clarity by renaming the router to FileTypeRouter.
- Rename MemoryDocumentStore to InMemoryDocumentStore Rename MemoryBM25Retriever to InMemoryBM25Retriever Rename MemoryEmbeddingRetriever to InMemoryEmbeddingRetriever
- Renamed ExtractiveReader's input from document to documents to match its type List[Document].
- Rename SimilarityRanker to TransformersSimilarityRanker, as there will be more similarity rankers in the future.
- Allow specifying stopwords to stop text generation for HuggingFaceLocalGenerator.
- Add basic telemetry to Haystack 2.0 pipelines
- Added DocumentCleaner, which removes extra whitespace, empty lines, headers, etc. from Documents containing text. Useful as a preprocessing step before sp...
v1.21.2
🐛 Bug Fixes
- Fixed the bug that prevented the correct usage of ChatGPT invocation layer in 1.21.1.
Added async support for ChatGPT invocation layer.
v1.21.1
✨ Enhancements
- Added experimental support for asynchronous Pipeline run.
- Added asyncio support to the OpenAI invocation layer.
- PromptNode can now be run asynchronously by calling the arun method.
⏰ Deprecations
- Deprecated OpenAIAnswerGenerator in favor of PromptNode. OpenAIAnswerGenerator will be removed in Haystack v1.23.0.
v1.21.0
⭐ Highlights
🚀 Support for gpt-3.5-turbo-instruct
We are happy to announce that Haystack now supports OpenAI's new gpt-3.5-turbo-instruct model! Simply provide the model name in the PromptNode to use it:
pn = PromptNode("gpt-3.5-turbo-instruct", api_key=os.environ.get("OPENAI_API_KEY"))
2️⃣ Preview Installation Extra
Excited about the upcoming Haystack 2.0? We have introduced a new installation extra called preview, which you can install to try out the Haystack 2.0 preview! This extra also makes Haystack's core dependencies leaner and thus speeds up installation. If you would like to start experiencing the new Haystack 2.0 components and pipeline design right away, run:
pip install farm-haystack[preview]
⚡️ WeaviateDocumentStore Performance
We fixed a bottleneck in WeaviateDocumentStore that was slowing down indexing. The fix led to a notable performance improvement, reducing the time to index one million documents by a factor of six!
🐣 PineconeDocumentStore Robustness
The PineconeDocumentStore now uses metadata instead of namespaces to distinguish between documents with embeddings, documents without embeddings, and labels. This is a breaking change, and it makes the PineconeDocumentStore more robust to use in Haystack pipelines. If you want to retrieve all documents with an embedding, specify the metadata instead of the namespace as follows:
from haystack.document_stores.pinecone import DOCUMENT_WITH_EMBEDDING
# docs = doc_store.get_all_documents(namespace="vectors") # old way using namespaces
docs = doc_store.get_all_documents(type_metadata=DOCUMENT_WITH_EMBEDDING)
Additionally, if you want to retrieve all documents without an embedding, specify the metadata instead of the namespace:
# docs = doc_store.get_all_documents(namespace="no-vectors") # old way using namespaces
docs = doc_store.get_all_documents(type_metadata="no-vector")
⬆️ Upgrade Notes
- `SklearnQueryClassifier` is removed; users should switch to the more powerful `TransformersQueryClassifier` instead. #5447
- Refactor `PineconeDocumentStore` to use metadata instead of namespaces for the distinction between documents with embeddings, documents without embeddings, and labels.
✨ Enhancements
- ci: Fix typos discovered by codespell running in pre-commit.
- Support OpenAI's new `gpt-3.5-turbo-instruct` model.
🐛 Bug Fixes
- Fix `EntityExtractor` output not being JSON serializable.
- Fix `model_max_length` not being set in the Tokenizer in `DefaultPromptHandler`.
- Fixed a bottleneck in the Weaviate document store that was slowing down indexing.
- The gpt-35-turbo-16k model from Azure now integrates correctly.
- Upgrade tiktoken to 0.5.1 to account for a breaking release.
👁️ Haystack 2.0 preview
- Add the `AnswerBuilder` component for Haystack 2.0 that creates Answer objects from the string output of Generators.
- Adds `LinkContentFetcher` component to Haystack 2.0. `LinkContentFetcher` fetches content from a given URL and converts it into a Document object, which can then be used within the Haystack 2.0 pipeline.
- Add `MetadataRouter`, a component that routes documents to different edges based on the content of their fields.
- Adds support for PDF files to the Document converter via the pypdf library.
- Adds `SerperDevWebSearch` component to retrieve URLs from the web. See https://serper.dev/ for more information.
- Add `TikaDocumentConverter` component to convert files of different types to Documents.
- Adds an `ExtractiveReader` for v2. It should be a replacement where `FARMReader` would have been used before for inference. The confidence scores are calculated differently from `FARMReader` because each span is considered an independent binary classification task.
- Introduce `GPTGenerator`, a class that can generate completions using OpenAI Chat models like GPT-3.5 and GPT-4.
- Remove the `id` parameter from the `Document` constructor, as it was ignored and a new one was generated anyway. This is a backwards-incompatible change.
- Add a generators module for LLM generator components.
- Adds `GPT4Generator`, an LLM component based on `GPT35Generator`.
- Add an `embedding_retrieval` method to `MemoryDocumentStore`, which allows retrieving the relevant Documents given a query embedding. It will be called by the `MemoryEmbeddingRetriever`.
- Rename `MemoryRetriever` to `MemoryBM25Retriever`. Add `MemoryEmbeddingRetriever`, which takes a query embedding as input and retrieves the most relevant Documents from a `MemoryDocumentStore`.
- Adds a proposal for an extended Document class in Haystack 2.0.
- Adds the implementation of said class.
- Add OpenAI Text Embedder, a component that uses OpenAI models to embed strings into vectors.
- Revert #5826 and optionally take the `id` in the `Document` class constructor.
- Create a dedicated dependency list for the preview package, `farm-haystack[preview]`. Using `haystack-ai` is still the recommended way to test Haystack 2.0.
- Add `PromptBuilder` component to render prompts from template strings.
- Add `prefix` and `suffix` attributes to `SentenceTransformersDocumentEmbedder`. They can be used to add a prefix and suffix to the Document text before embedding it. This is necessary to take full advantage of modern embedding models, such as E5.
- Add support for dates in filters.
- Add `UrlCacheChecker` to support web retrieval pipelines. It checks whether documents coming from a given list of URLs are already present in the store and, if so, returns them. All URLs with no matching documents are returned on a separate connection.
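The cache-checking idea behind `UrlCacheChecker` can be sketched as a plain function. This is an illustrative stand-in, not the actual Haystack component; here documents are modeled as dicts with a `url` key:

```python
def check_url_cache(store_docs, urls):
    """Split URLs into documents already present in the store ("found")
    and URLs with no matching documents ("misses"), which would be
    routed onward for fetching."""
    by_url = {}
    for doc in store_docs:
        by_url.setdefault(doc["url"], []).append(doc)
    hits, misses = [], []
    for url in urls:
        if url in by_url:
            hits.extend(by_url[url])
        else:
            misses.append(url)
    return {"found": hits, "misses": misses}
```

A cached URL yields its stored documents on the first output, while an unseen URL lands on the second, which a downstream fetcher component would consume.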