Releases: deepset-ai/haystack

v1.26.3

29 Aug 14:00

Release Notes

v1.26.3

⬆️ Upgrade Notes

  • Upgrades nltk to 3.9.1, as prior versions are affected by https://nvd.nist.gov/vuln/detail/CVE-2024-39705. Because of this security vulnerability, custom NLTK tokenizer models can no longer be used with the new version (for example, in PreProcessor). Users can still use the built-in nltk tokenizers by specifying the language parameter in the PreProcessor. See the PreProcessor documentation for more details.

⚡️ Enhancement Notes

  • Pins sentence-transformers<=3.0.0,>=2.3.1 and python-pptx<=1.0 to avoid some minor typing incompatibilities with newer versions of the respective libraries.

🐛 Bug Fixes

v2.4.0

15 Aug 09:39
8dd610a

Release Notes

v2.4.0

Highlights

🙌 Local LLMs and custom generation parameters in evaluation

The new api_params init parameter added to LLM-based evaluators such as ContextRelevanceEvaluator and FaithfulnessEvaluator can be used to pass in supported OpenAIGenerator parameters, allowing for custom generation parameters (via generation_kwargs) and local LLM support (via api_base_url).

📝 New Joiner

New AnswerJoiner component to combine multiple lists of Answers.

⬆️ Upgrade Notes

  • The ContextRelevanceEvaluator now returns a list of relevant sentences for each context, instead of all the sentences in a context. Also, a score of 1 is now returned if a relevant sentence is found, and 0 otherwise.
  • Removed the deprecated DynamicPromptBuilder and DynamicChatPromptBuilder components. Use PromptBuilder and ChatPromptBuilder instead.
  • OutputAdapter and ConditionalRouter can't return user inputs anymore.
  • Multiplexer is removed and users should switch to BranchJoiner instead.
  • Removed deprecated init parameters extractor_type and try_others from HTMLToDocument.
  • The SentenceWindowRetrieval component has been renamed to SentenceWindowRetriever.
  • The serialize_callback_handler and deserialize_callback_handler utility functions have been removed. Use serialize_callable and deserialize_callable instead. For more information on serialize_callable and deserialize_callable, see the API reference: https://docs.haystack.deepset.ai/reference/utils-api#module-callable_serialization

🚀 New Features

  • LLM-based evaluators can pass in supported OpenAIGenerator parameters via api_params. This allows for custom generation_kwargs, changing the api_base_url (for local evaluation), and all other supported parameters as described in the OpenAIGenerator docs.
  • Introduced a new AnswerJoiner component that allows joining multiple lists of Answers into a single list using the Concatenate join mode.
  • Add truncate_dim parameter to Sentence Transformers Embedders, which allows truncating embeddings. Especially useful for models trained with Matryoshka Representation Learning.
  • Add precision parameter to Sentence Transformers Embedders, which allows quantized embeddings. Especially useful for reducing the size of the embeddings of a corpus for semantic search, among other tasks.
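
As a rough illustration of what truncating an embedding means, the sketch below keeps only the first few dimensions of a vector. It is plain Python, not the Sentence Transformers implementation; the helper name truncate_embedding and the re-normalization step are assumptions made for this example.

```python
import math


def truncate_embedding(vector, truncate_dim):
    """Hypothetical helper: keep the first `truncate_dim` values of a vector.

    The prefix is rescaled to unit length afterwards; that rescaling is a
    choice made for this sketch, not something the release notes specify.
    """
    head = list(vector[:truncate_dim])
    norm = math.sqrt(sum(x * x for x in head))
    # Leave an all-zero prefix untouched to avoid dividing by zero.
    return [x / norm for x in head] if norm else head


full = [0.6, 0.8, 0.0, 0.0]
short = truncate_embedding(full, 2)
print(len(full), "->", len(short))
```

Models trained with Matryoshka Representation Learning are designed so that such prefixes remain useful on their own, which is why truncation pairs well with them.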

⚡️ Enhancement Notes

  • Adds model_kwargs and tokenizer_kwargs to the components TransformersSimilarityRanker, SentenceTransformersDocumentEmbedder, SentenceTransformersTextEmbedder. This allows passing things like model_max_length or torch_dtype for better management of model inference.
  • Added unicode_normalization parameter to the DocumentCleaner, which normalizes the text to NFC, NFD, NFKC, or NFKD.
  • Added ascii_only parameter to the DocumentCleaner, transforming letters with diacritics to their ASCII equivalent and removing other non-ASCII characters.
  • Improved error messages for deserialization errors.
  • TikaDocumentConverter now returns page breaks ("\f") in the output. This only works for PDF files.
  • Enhanced filter application logic to support merging of filters. It facilitates more precise retrieval filtering, allowing for both init and runtime complex filter combinations with logical operators. For more details see https://docs.haystack.deepset.ai/docs/metadata-filtering
  • The streaming_callback parameter can be passed to OpenAIGenerator and OpenAIChatGenerator during pipeline run. This prevents the need to recreate pipelines for streaming callbacks.
  • Add max_retries and timeout parameters to the AzureOpenAIChatGenerator initializations.
  • Document Python 3.11 and 3.12 support in project configuration.
  • Refactor DocumentJoiner to use enum pattern for the 'join_mode' parameter instead of bare string.
  • Add max_retries, timeout parameters to the AzureOpenAIDocumentEmbedder initialization.
  • Add max_retries and timeout parameters to the AzureOpenAITextEmbedder initializations.
  • Introduce a utility function to deserialize a generic Document Store from the init_parameters of a serialized component.
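
The two DocumentCleaner options in the list above correspond to standard Unicode operations. The sketch below shows the underlying idea with Python's built-in unicodedata module. The function names are hypothetical, and the actual DocumentCleaner code may differ.

```python
import unicodedata


def normalize_text(text, form="NFKC"):
    """Apply one of the four Unicode normalization forms."""
    return unicodedata.normalize(form, text)


def to_ascii(text):
    """Hypothetical helper: map accented letters to ASCII, drop the rest."""
    # NFKD splits "é" into "e" plus a combining accent; encoding to ASCII
    # with errors="ignore" then discards the accent and any other
    # non-ASCII character.
    decomposed = unicodedata.normalize("NFKD", text)
    return decomposed.encode("ascii", "ignore").decode("ascii")


print(normalize_text("\ufb01le"))  # the "fi" ligature becomes two letters
print(to_ascii("Café"))
```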

⚠️ Deprecation Notes

  • Haystack 1.x legacy filters are deprecated and will be removed in a future release. Please use the new filter style as described in the documentation - https://docs.haystack.deepset.ai/docs/metadata-filtering
  • Deprecate the method to_openai_format of the ChatMessage dataclass. This method was never intended to be public and was only used internally. Now, each Chat Generator will know internally how to convert the messages to the format of their specific provider.
  • Deprecate the unused debug parameter in the Pipeline.run method.
  • SentenceWindowRetrieval is deprecated and will be removed in a future release. Use SentenceWindowRetriever instead.
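
For readers moving off the legacy filters, the sketch below shows the new filter style as plain dictionaries, together with one way two filters could be combined under a logical AND. The merge_filters helper is hypothetical and only illustrates the idea of merging init and runtime filters. It is not Haystack's own merging code.

```python
def merge_filters(init_filters, runtime_filters):
    """Hypothetical helper: combine two filter dicts with a logical AND."""
    if not init_filters:
        return runtime_filters
    if not runtime_filters:
        return init_filters
    return {"operator": "AND", "conditions": [init_filters, runtime_filters]}


# New-style comparison filters: a field, an operator, and a value.
init_filters = {"field": "meta.type", "operator": "==", "value": "article"}
runtime_filters = {"field": "meta.year", "operator": ">=", "value": 2023}

merged = merge_filters(init_filters, runtime_filters)
print(merged)
```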

Security Notes

  • Fix issue that could lead to remote code execution when using insecure Jinja template in the following Components:

    • PromptBuilder
    • ChatPromptBuilder
    • OutputAdapter
    • ConditionalRouter

    The same issue has been fixed in the PipelineTemplate class too.

🐛 Bug Fixes

  • Fix ChatPromptBuilder from_dict method when template value is None.
  • Fix the DocumentCleaner removing the \f tag from content, which prevented page numbers from being counted (by the Splitter, for example).
  • The DocumentSplitter was incorrectly calculating the split_start_idx and _split_overlap information due to slight miscalculations of appropriate indices. This fixes those so the split_start_idx and _split_overlap information is correct.
  • Fix bug in Pipeline.run() executing Components in a wrong and unexpected order.
  • Fix encoding of HTML files in LinkContentFetcher.
  • Fix OutputAdapter from_dict method when custom_filters value is None.
  • Prevent Pipeline.from_dict from modifying the dictionary parameter passed to it.
  • Fix a bug in Pipeline.run() that would cause it to get stuck in an infinite loop and never return. This was caused by Components waiting forever for their inputs when parts of the Pipeline graph are skipped because a "decision" Component does not return outputs for that side of the Pipeline.
  • Update the from_dict methods of TransformersSimilarityRanker, SentenceTransformersDiversityRanker, SentenceTransformersTextEmbedder, SentenceTransformersDocumentEmbedder and LocalWhisperTranscriber to work when init_parameters contains only required parameters.
  • Pins structlog to <= 24.2.0 to avoid some unit test failures. This is a temporary fix until we can upgrade tests to a newer version of structlog.
  • Correctly expose PPTXToDocument component in haystack namespace.
  • Fix TransformersZeroShotTextRouter and TransformersTextRouter from_dict methods to work when init_parameters only contain required variables.
  • For components that support multiple Document Stores, prioritize using the specific from_dict class method for deserialization when available. Otherwise, fall back to the generic default_from_dict method. This impacts the following generic components: CacheChecker, DocumentWriter, FilterRetriever, and SentenceWindowRetriever.
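
The DocumentCleaner fix above matters because page numbers are derived from form-feed characters in the text. A minimal sketch of that counting idea follows. The helper name page_number_at is an assumption for illustration and does not appear in Haystack.

```python
def page_number_at(text, char_index):
    """Return the 1-based page that contains the character at `char_index`.

    Pages are assumed to be separated by form-feed characters.
    """
    return text.count("\f", 0, char_index) + 1


content = "Page one text.\fPage two text.\fPage three."
print(page_number_at(content, content.index("two")))
```

If a cleaning step strips the form feeds, every position in the text reports page 1, which is the behavior the fix prevents.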

v2.4.0-rc1

14 Aug 13:53
495bf55
Pre-release

Release Notes

v2.4.0-rc1

Highlights

🙌 Local LLMs and custom generation parameters in evaluation

The new api_params init parameter added to LLM-based evaluators such as ContextRelevanceEvaluator and FaithfulnessEvaluator can be used to pass in supported OpenAIGenerator parameters, allowing for custom generation parameters (via generation_kwargs) and local LLM support (via api_base_url).

📝 New Joiner

New AnswerJoiner component to combine multiple lists of Answers.

⬆️ Upgrade Notes

  • The ContextRelevanceEvaluator now returns a list of relevant sentences for each context, instead of all the sentences in a context. Also, a score of 1 is now returned if a relevant sentence is found, and 0 otherwise.
  • Removed the deprecated DynamicPromptBuilder and DynamicChatPromptBuilder components. Use PromptBuilder and ChatPromptBuilder instead.
  • OutputAdapter and ConditionalRouter can't return user inputs anymore.
  • Multiplexer is removed and users should switch to BranchJoiner instead.
  • Removed deprecated init parameters extractor_type and try_others from HTMLToDocument.
  • The SentenceWindowRetrieval component has been renamed to SentenceWindowRetriever.
  • The serialize_callback_handler and deserialize_callback_handler utility functions have been removed. Use serialize_callable and deserialize_callable instead. For more information on serialize_callable and deserialize_callable, see the API reference: https://docs.haystack.deepset.ai/reference/utils-api#module-callable_serialization

🚀 New Features

  • LLM-based evaluators can pass in supported OpenAIGenerator parameters via api_params. This allows for custom generation_kwargs, changing the api_base_url (for local evaluation), and all other supported parameters as described in the OpenAIGenerator docs.
  • Introduced a new AnswerJoiner component that allows joining multiple lists of Answers into a single list using the Concatenate join mode.
  • Add truncate_dim parameter to Sentence Transformers Embedders, which allows truncating embeddings. Especially useful for models trained with Matryoshka Representation Learning.
  • Add precision parameter to Sentence Transformers Embedders, which allows quantized embeddings. Especially useful for reducing the size of the embeddings of a corpus for semantic search, among other tasks.

⚡️ Enhancement Notes

  • Adds model_kwargs and tokenizer_kwargs to the components TransformersSimilarityRanker, SentenceTransformersDocumentEmbedder, SentenceTransformersTextEmbedder. This allows passing things like model_max_length or torch_dtype for better management of model inference.
  • Added unicode_normalization parameter to the DocumentCleaner, which normalizes the text to NFC, NFD, NFKC, or NFKD.
  • Added ascii_only parameter to the DocumentCleaner, transforming letters with diacritics to their ASCII equivalent and removing other non-ASCII characters.
  • Improved error messages for deserialization errors.
  • TikaDocumentConverter now returns page breaks ("\f") in the output. This only works for PDF files.
  • Enhanced filter application logic to support merging of filters. It facilitates more precise retrieval filtering, allowing for both init and runtime complex filter combinations with logical operators. For more details see https://docs.haystack.deepset.ai/docs/metadata-filtering
  • The streaming_callback parameter can be passed to OpenAIGenerator and OpenAIChatGenerator during pipeline run. This prevents the need to recreate pipelines for streaming callbacks.
  • Add max_retries and timeout parameters to the AzureOpenAIChatGenerator initializations.
  • Document Python 3.11 and 3.12 support in project configuration.
  • Refactor DocumentJoiner to use enum pattern for the 'join_mode' parameter instead of bare string.
  • Add max_retries, timeout parameters to the AzureOpenAIDocumentEmbedder initialization.
  • Add max_retries and timeout parameters to the AzureOpenAITextEmbedder initializations.
  • Introduce a utility function to deserialize a generic Document Store from the init_parameters of a serialized component.

⚠️ Deprecation Notes

  • Haystack 1.x legacy filters are deprecated and will be removed in a future release. Please use the new filter style as described in the documentation - https://docs.haystack.deepset.ai/docs/metadata-filtering
  • Deprecate the method to_openai_format of the ChatMessage dataclass. This method was never intended to be public and was only used internally. Now, each Chat Generator will know internally how to convert the messages to the format of their specific provider.
  • Deprecate the unused debug parameter in the Pipeline.run method.
  • SentenceWindowRetrieval is deprecated and will be removed in a future release. Use SentenceWindowRetriever instead.

Security Notes

  • Fix issue that could lead to remote code execution when using insecure Jinja template in the following Components:

    • PromptBuilder
    • ChatPromptBuilder
    • OutputAdapter
    • ConditionalRouter

    The same issue has been fixed in the PipelineTemplate class too.

🐛 Bug Fixes

  • Fix ChatPromptBuilder from_dict method when template value is None.
  • Fix the DocumentCleaner removing the \f tag from content, which prevented page numbers from being counted (by the Splitter, for example).
  • The DocumentSplitter was incorrectly calculating the split_start_idx and _split_overlap information due to slight miscalculations of appropriate indices. This fixes those so the split_start_idx and _split_overlap information is correct.
  • Fix bug in Pipeline.run() executing Components in a wrong and unexpected order.
  • Fix encoding of HTML files in LinkContentFetcher.
  • Fix OutputAdapter from_dict method when custom_filters value is None.
  • Prevent Pipeline.from_dict from modifying the dictionary parameter passed to it.
  • Fix a bug in Pipeline.run() that would cause it to get stuck in an infinite loop and never return. This was caused by Components waiting forever for their inputs when parts of the Pipeline graph are skipped because a "decision" Component does not return outputs for that side of the Pipeline.
  • Update the from_dict methods of TransformersSimilarityRanker, SentenceTransformersDiversityRanker, SentenceTransformersTextEmbedder, SentenceTransformersDocumentEmbedder and LocalWhisperTranscriber to work when init_parameters contains only required parameters.
  • Pins structlog to <= 24.2.0 to avoid some unit test failures. This is a temporary fix until we can upgrade tests to a newer version of structlog.
  • Correctly expose PPTXToDocument component in haystack namespace.
  • Fix TransformersZeroShotTextRouter and TransformersTextRouter from_dict methods to work when init_parameters only contain required variables.
  • For components that support multiple Document Stores, prioritize using the specific from_dict class method for deserialization when available. Otherwise, fall back to the generic default_from_dict method. This impacts the following generic components: CacheChecker, DocumentWriter, FilterRetriever, and SentenceWindowRetriever.

v2.3.1

29 Jul 12:27

Release Notes

v2.3.1

⬆️ Upgrade Notes

  • For security reasons, OutputAdapter and ConditionalRouter can only return the following Python literal structures: strings, bytes, numbers, tuples, lists, dicts, sets, booleans, None and Ellipsis (...). This implies that types like ChatMessage, Document, and Answer cannot be used as output types.
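
The list of permitted structures above matches what Python's standard ast.literal_eval accepts, so that function is a convenient way to see the restriction in action. Whether OutputAdapter and ConditionalRouter call it internally is an assumption here. The parse_literal wrapper is purely illustrative.

```python
import ast


def parse_literal(rendered):
    """Parse a rendered string, accepting only Python literal structures."""
    return ast.literal_eval(rendered)


# Plain containers, numbers, strings, None and Ellipsis are accepted.
print(parse_literal("{'answers': ['a', 'b'], 'score': 0.9}"))

# A constructor call such as Document(...) is not a literal and is rejected.
rejected = False
try:
    parse_literal("Document(content='hello')")
except ValueError:
    rejected = True
print("constructor call rejected:", rejected)
```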

Security Notes

  • Fix issue that could lead to remote code execution when using insecure Jinja template in the following Components:

    • PromptBuilder
    • ChatPromptBuilder
    • DynamicPromptBuilder
    • DynamicChatPromptBuilder
    • OutputAdapter
    • ConditionalRouter

    The same issue has been fixed in the PipelineTemplate class too.

🐛 Bug Fixes

  • Pins structlog to <= 24.2.0 to avoid some unit test failures. This is a temporary fix until we can upgrade tests to a newer version of structlog.

v2.3.1-rc1

26 Jul 16:21

Release Notes

v2.3.1-rc1

⬆️ Upgrade Notes

  • OutputAdapter and ConditionalRouter can't return user inputs anymore.

Security Notes

  • Fix issue that could lead to remote code execution when using insecure Jinja template in the following Components:

    • PromptBuilder
    • ChatPromptBuilder
    • DynamicPromptBuilder
    • DynamicChatPromptBuilder
    • OutputAdapter
    • ConditionalRouter

    The same issue has been fixed in the PipelineTemplate class too.

🐛 Bug Fixes

  • Pins structlog to <= 24.2.0 to avoid some unit test failures. This is a temporary fix until we can upgrade tests to a newer version of structlog.

v2.3.0

15 Jul 12:17

Release Notes

Highlights

🧑‍🔬 Haystack Experimental Package

Alongside this release, we're introducing a new repository and package: haystack-experimental.
This package will be installed alongside haystack-ai and will give you access to experimental components. As the name suggests, these components will be highly exploratory, and may or may not make their way into the main haystack package.

  • Each experimental component in the haystack-experimental repo will have a life-span of 3 months
  • The end of the 3 months marks the end of the experiment, at which point the component will either move to the core haystack package or be discontinued

To learn more about the experimental package, check out the Experimental Package docs[LINK] and the API references[LINK]
To use components in the experimental package, simply from haystack_experimental.component_type import Component
What's in there already?

  • The OpenAIFunctionCaller: Use this component after Chat Generators to call the functions that the LLM returns with
  • The OpenAPITool: The OpenAPITool is a component designed to interact with RESTful endpoints of OpenAPI services. Its primary function is to generate and send appropriate payloads to these endpoints based on human-provided instructions. OpenAPITool bridges the gap between natural language inputs and structured API calls, making it easier for users to interact with complex APIs and thus integrating the structured world of OpenAPI-specified services with LLM apps.
  • The EvaluationHarness - A tool that can wrap pipelines to be evaluated as well as complex evaluation tasks into one simple runnable component

For more information, visit https://github.com/deepset-ai/haystack-experimental or the haystack_experimental reference API at https://docs.haystack.deepset.ai/v2.3/reference/ (bottom left pane)

📝 New Converter

⬆️ Upgrade Notes

  • trafilatura must now be manually installed with pip install trafilatura to use the HTMLToDocument Component.

  • The deprecated converter_name parameter has been removed from PyPDFToDocument.

    To specify a custom converter for PyPDFToDocument, use the converter initialization parameter and pass an instance of a class that implements the PyPDFConverter protocol.

    The PyPDFConverter protocol defines the methods convert, to_dict and from_dict. A default implementation of PyPDFConverter is provided in the DefaultConverter class.

  • Deprecated HuggingFaceTEITextEmbedder and HuggingFaceTEIDocumentEmbedder have been removed. Use HuggingFaceAPITextEmbedder and HuggingFaceAPIDocumentEmbedder instead.

  • Deprecated HuggingFaceTGIGenerator and HuggingFaceTGIChatGenerator have been removed. Use HuggingFaceAPIGenerator and HuggingFaceAPIChatGenerator instead.

🚀 New Features

  • Adding a new SentenceWindowRetrieval component that performs sentence-window retrieval, i.e. it retrieves the surrounding documents of a given document from the document store. This is useful when a document is split into multiple chunks and you want to retrieve the surrounding context of a given chunk.
  • Added custom filters support to ConditionalRouter. Users can now pass in one or more custom Jinja2 filter callables and be able to access those filters when defining condition expressions in routes.
  • Added a new mode in JoinDocuments: distribution-based rank fusion, as described in [the article](https://medium.com/plain-simple-software/distribution-based-score-fusion-dbsf-a-new-approach-to-vector-search-ranking-f87c37488b18)
  • Adding the DocxToDocument component inside the converters category. It uses the python-docx library to convert Docx files to haystack Documents.
  • Add a PPTX to Document converter using the python-pptx library. Extracts all text from each slide. Each slide is separated with a page break "\f" so a Document Splitter could split by slide.
  • The DocumentSplitter now has support for the split_id and split_overlap to allow for more control over the splitting process.
  • Introduces the TransformersTextRouter! This component uses a transformers text classification pipeline to route text inputs onto different output connections based on the labels of the chosen text classification model.
  • Add memory sharing between different instances of InMemoryDocumentStore. Setting the same index argument as another instance will make sure that the memory is shared. For example:

    ```python
    from haystack import Document
    from haystack.document_stores.in_memory import InMemoryDocumentStore

    index = "my_personal_index"
    document_store_1 = InMemoryDocumentStore(index=index)
    document_store_2 = InMemoryDocumentStore(index=index)
    assert document_store_1.count_documents() == 0
    assert document_store_2.count_documents() == 0

    # Writing through one instance is visible through the other.
    document_store_1.write_documents([Document(content="Hello world")])
    assert document_store_1.count_documents() == 1
    assert document_store_2.count_documents() == 1
    ```
  • Add a new missing_meta param to MetaFieldRanker, which determines what to do with documents that lack the ranked meta field. Supported values are "bottom" (which puts documents with missing meta at the bottom of the sorted list), "top" (which puts them at the top), and "drop" (which removes them from the results entirely).
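
The three missing_meta values described in the list above can be pictured with a small stand-alone sketch. It ranks plain dictionaries rather than Haystack Documents, and rank_by_meta is a hypothetical function written for this illustration, not the MetaFieldRanker implementation.

```python
def rank_by_meta(docs, meta_field, missing_meta="bottom"):
    """Sort docs by a meta field (highest first) and place the rest by policy."""
    with_field = [d for d in docs if meta_field in d["meta"]]
    without_field = [d for d in docs if meta_field not in d["meta"]]
    ranked = sorted(with_field, key=lambda d: d["meta"][meta_field], reverse=True)

    if missing_meta == "bottom":
        return ranked + without_field
    if missing_meta == "top":
        return without_field + ranked
    if missing_meta == "drop":
        return ranked
    raise ValueError(f"Unsupported missing_meta value: {missing_meta!r}")


docs = [
    {"content": "a", "meta": {"rating": 0.7}},
    {"content": "b", "meta": {}},
    {"content": "c", "meta": {"rating": 0.9}},
]

for policy in ("bottom", "top", "drop"):
    print(policy, [d["content"] for d in rank_by_meta(docs, "rating", policy)])
```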

⚡️ Enhancement Notes

  • Added the apply_filter_policy function to standardize the application of filter policies across all document store-specific retrievers, allowing for consistent handling of initial and runtime filters based on the chosen policy (replace or merge).
  • Added a new parameter to EvaluationRunResult.comparative_individual_scores_report() to specify columns to keep in the comparative DataFrame.
  • Added the 'remove_component' method in 'PipelineBase' to delete a component and its connections.
  • Added serialization methods save_to_disk and load_from_disk to InMemoryDocumentStore.
  • When using "openai" for the LLM-based evaluators the metadata from OpenAI will be in the output dictionary, under the key "meta".
  • Remove trafilatura as direct dependency and make it a lazily imported one
  • Renamed component from DocxToDocument to DOCXToDocument to follow the naming convention of other converter components.
  • Made JSON schema validator compatible with all LLMs by switching error template handling to a single user message. Also reduce cost by only including the last error instead of the full message history.
  • Enhanced flexibility in HuggingFace API environment variable names across all related components to support both 'HF_API_TOKEN' and 'HF_TOKEN', improving compatibility with the widely used HF environmental variable naming conventions.
  • Updated the ContextRelevance evaluator prompt, explicitly asking to score each statement.
  • Improve LinkContentFetcher to support a broader range of content types, including glob patterns for text, application, audio, and video types. This update introduces a more flexible content handler resolution mechanism, allowing for direct matches and pattern matching, thereby greatly improving the handler's adaptability to various content types encountered on the web.
  • Add max_retries to AzureOpenAIGenerator. AzureOpenAIGenerator can now be initialised by setting max_retries. If not set, it is inferred from the OPENAI_MAX_RETRIES environment variable or set to 5. The timeout for AzureOpenAIGenerator, if not set, is inferred from the OPENAI_TIMEOUT environment variable or set to 30.
  • Introduced a 'filter_policy' init parameter for both InMemoryBM25Retriever and InMemoryEmbeddingRetriever, allowing users to define how runtime filters should be applied with options to either 'replace' the initial filters or 'merge' them, providing greater flexibility in filtering query results.
  • Pipeline serialization to YAML now supports tuples as field values.
  • Add support for [structlog context variables](https://www.structlog.org/en/24.2.0/contextvars.html) to structured logging.
  • AnswerBuilder can now accept ChatMessages as input in addition to strings. When using ChatMessages, metadata will be automatically added to the answer.
  • Update the error message when the sentence-transformers library is not installed and the used component requires it.
  • Add max_retries and timeout parameters to the AzureOpenAIChatGenerator initializations.
  • Add max_retries and timeout parameters to the AzureOpenAITextEmbedder initializations.
  • Add max_retries, timeout parameters to the AzureOpenAIDocumentEmbedder initialization.
  • Improved error messages for deserialization errors.
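
The fallback order described for AzureOpenAIGenerator in the list above (explicit argument first, then environment variable, then a built-in default) can be sketched as follows. The resolve_client_settings helper and its env parameter are hypothetical. Only the two environment variable names and the defaults of 5 and 30 come from the notes.

```python
import os


def resolve_client_settings(max_retries=None, timeout=None, env=os.environ):
    """Hypothetical helper: pick explicit values, then env vars, then defaults."""
    if max_retries is None:
        max_retries = int(env.get("OPENAI_MAX_RETRIES", 5))
    if timeout is None:
        timeout = float(env.get("OPENAI_TIMEOUT", 30.0))
    return max_retries, timeout


# An explicit argument wins over the environment.
print(resolve_client_settings(max_retries=2, env={"OPENAI_MAX_RETRIES": "9"}))
```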

⚠️ Deprecation Notes

  • Haystack 1.x legacy filters are deprecated and will be removed in a future release. Please use the new ...

v2.3.0-rc2

10 Jul 14:55
Pre-release

Release Notes

v2.3.0-rc2

🚀 New Features

  • Adding a new component that performs sentence-window retrieval, i.e. it retrieves the surrounding documents of a given document from the document store. This is useful when a document is split into multiple chunks and you want to retrieve the surrounding context of a given chunk.

⚡️ Enhancement Notes

  • Enhanced the PyPDF converter to ensure backwards compatibility with Pipelines dumped with versions older than 2.3.0. The update includes a conditional check to automatically default to the DefaultConverter if a specific converter is not provided, improving the component's robustness and ease of use.

⚠️ Deprecation Notes

🐛 Bug Fixes

  • Fix encoding of HTML files in LinkContentFetcher.
  • Update the from_dict methods of TransformersSimilarityRanker, SentenceTransformersDiversityRanker, SentenceTransformersTextEmbedder, SentenceTransformersDocumentEmbedder and LocalWhisperTranscriber to work when init_parameters contains only required parameters.
  • Fix TransformersZeroShotTextRouter and TransformersTextRouter from_dict methods to work when init_parameters only contain required variables.

v2.3.0-rc1

08 Jul 13:02
Pre-release

Release Notes

Highlights

Adding the DocxToDocument component to convert Docx files to Documents.

⬆️ Upgrade Notes

  • trafilatura must now be manually installed with pip install trafilatura to use the HTMLToDocument Component.

  • The deprecated converter_name parameter has been removed from PyPDFToDocument.

    To specify a custom converter for PyPDFToDocument, use the converter initialization parameter and pass an instance of a class that implements the PyPDFConverter protocol.

    The PyPDFConverter protocol defines the methods convert, to_dict and from_dict. A default implementation of PyPDFConverter is provided in the DefaultConverter class.

  • Deprecated HuggingFaceTEITextEmbedder and HuggingFaceTEIDocumentEmbedder have been removed. Use HuggingFaceAPITextEmbedder and HuggingFaceAPIDocumentEmbedder instead.

  • Deprecated HuggingFaceTGIGenerator and HuggingFaceTGIChatGenerator have been removed. Use HuggingFaceAPIGenerator and HuggingFaceAPIChatGenerator instead.

🚀 New Features

  • Added custom filters support to ConditionalRouter. Users can now pass in one or more custom Jinja2 filter callables and be able to access those filters when defining condition expressions in routes.
  • Added a new mode in JoinDocuments: distribution-based rank fusion, as described in [the article](https://medium.com/plain-simple-software/distribution-based-score-fusion-dbsf-a-new-approach-to-vector-search-ranking-f87c37488b18)
  • Adding the DocxToDocument component inside the converters category. It uses the python-docx library to convert Docx files to haystack Documents.
  • Added haystack-experimental to the project's dependencies to enable automatic use of cutting-edge features from Haystack. Users can now access components from haystack-experimental by simply importing them from haystack_experimental instead of haystack. For more information, visit https://github.com/deepset-ai/haystack-experimental.
  • Add a PPTX to Document converter using the python-pptx library. Extracts all text from each slide. Each slide is separated with a page break "\f" so a Document Splitter could split by slide.
  • The DocumentSplitter now has support for the split_id and split_overlap to allow for more control over the splitting process.
  • Introduces the TransformersTextRouter! This component uses a transformers text classification pipeline to route text inputs onto different output connections based on the labels of the chosen text classification model.
  • Add memory sharing between different instances of InMemoryDocumentStore. Setting the same index argument as another instance will make sure that the memory is shared. For example:

    ```python
    from haystack import Document
    from haystack.document_stores.in_memory import InMemoryDocumentStore

    index = "my_personal_index"
    document_store_1 = InMemoryDocumentStore(index=index)
    document_store_2 = InMemoryDocumentStore(index=index)
    assert document_store_1.count_documents() == 0
    assert document_store_2.count_documents() == 0

    # Writing through one instance is visible through the other.
    document_store_1.write_documents([Document(content="Hello world")])
    assert document_store_1.count_documents() == 1
    assert document_store_2.count_documents() == 1
    ```
  • Add a new missing_meta param to MetaFieldRanker, which determines what to do with documents that lack the ranked meta field. Supported values are "bottom" (which puts documents with missing meta at the bottom of the sorted list), "top" (which puts them at the top), and "drop" (which removes them from the results entirely).

⚡️ Enhancement Notes

  • Added the apply_filter_policy function to standardize the application of filter policies across all document store-specific retrievers, allowing for consistent handling of initial and runtime filters based on the chosen policy (replace or merge).
  • Added a new parameter to EvaluationRunResult.comparative_individual_scores_report() to specify columns to keep in the comparative DataFrame.
  • Added the 'remove_component' method in 'PipelineBase' to delete a component and its connections.
  • Added serialization methods save_to_disk and load_from_disk to InMemoryDocumentStore.
  • When using "openai" for the LLM-based evaluators the metadata from OpenAI will be in the output dictionary, under the key "meta".
  • Remove trafilatura as direct dependency and make it a lazily imported one
  • Renamed component from DocxToDocument to DOCXToDocument to follow the naming convention of other converter components.
  • Made JSON schema validator compatible with all LLMs by switching error template handling to a single user message. Also reduce cost by only including the last error instead of the full message history.
  • Enhanced flexibility in HuggingFace API environment variable names across all related components to support both 'HF_API_TOKEN' and 'HF_TOKEN', improving compatibility with the widely used HF environmental variable naming conventions.
  • Updated the ContextRelevance evaluator prompt, explicitly asking to score each statement.
  • Improve LinkContentFetcher to support a broader range of content types, including glob patterns for text, application, audio, and video types. This update introduces a more flexible content handler resolution mechanism, allowing for direct matches and pattern matching, thereby greatly improving the handler's adaptability to various content types encountered on the web.
  • Add max_retries to AzureOpenAIGenerator. AzureOpenAIGenerator can now be initialised by setting max_retries. If not set, it is inferred from the OPENAI_MAX_RETRIES environment variable or set to 5. The timeout for AzureOpenAIGenerator, if not set, is inferred from the OPENAI_TIMEOUT environment variable or set to 30.
  • Introduced a 'filter_policy' init parameter for both InMemoryBM25Retriever and InMemoryEmbeddingRetriever, allowing users to define how runtime filters should be applied with options to either 'replace' the initial filters or 'merge' them, providing greater flexibility in filtering query results.
  • Pipeline serialization to YAML now supports tuples as field values.
  • Add support for [structlog context variables](https://www.structlog.org/en/24.2.0/contextvars.html) to structured logging.
  • AnswerBuilder can now accept ChatMessages as input in addition to strings. When using ChatMessages, metadata will be automatically added to the answer.
  • Update the error message when the sentence-transformers library is not installed and the used component requires it.
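
The environment-variable fallbacks described in the notes above (the two HuggingFace token names, and the AzureOpenAIGenerator defaults of 5 retries and a timeout of 30) can be illustrated with a minimal sketch. This is not Haystack's code: the helper names `first_env_value` and `resolve_number` are hypothetical, and the lookup order (HF_API_TOKEN before HF_TOKEN) is an assumption made for illustration.

```python
from typing import Callable, Mapping, Optional, Sequence, Union

Number = Union[int, float]


def first_env_value(names: Sequence[str], env: Mapping[str, str]) -> Optional[str]:
    """Return the value of the first variable in `names` that is set and non-empty."""
    for name in names:
        value = env.get(name)
        if value:
            return value
    return None


def resolve_number(
    explicit: Optional[Number],
    env_name: str,
    default: Number,
    cast: Callable[[str], Number],
    env: Mapping[str, str],
) -> Number:
    """Prefer an explicit argument, then the environment variable, then the default."""
    if explicit is not None:
        return explicit
    raw = env.get(env_name)
    return cast(raw) if raw is not None else default


# Stand-in for os.environ so the example is deterministic.
env = {"HF_TOKEN": "hf_example", "OPENAI_TIMEOUT": "12.5"}

token = first_env_value(["HF_API_TOKEN", "HF_TOKEN"], env)  # assumed lookup order
max_retries = resolve_number(None, "OPENAI_MAX_RETRIES", 5, int, env)
timeout = resolve_number(None, "OPENAI_TIMEOUT", 30.0, float, env)

print(token, max_retries, timeout)  # hf_example 5 12.5
```

In real use, `os.environ` would be passed in place of the `env` dictionary.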

⚠️ Deprecation Notes

  • The output of the ContextRelevanceEvaluator will change in Haystack 2.4.0. Contexts will be scored as a whole instead of as individual statements, and only the relevant sentences will be returned. A score of 1 will be returned if a relevant sentence is found, and 0 otherwise.
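
The scoring rule described in this note can be sketched as follows. It is only an illustration of the rule as stated, assuming the relevant sentences for each context have already been extracted by the LLM; the function name `score_contexts` and the dictionary keys are hypothetical and not taken from the library.

```python
from typing import Any, Dict, List


def score_contexts(relevant_sentences_per_context: List[List[str]]) -> List[Dict[str, Any]]:
    """Score each context as a whole: 1 if any relevant sentence was found, else 0."""
    results = []
    for sentences in relevant_sentences_per_context:
        results.append(
            {
                "relevant_statements": sentences,  # only the relevant sentences are kept
                "score": 1 if sentences else 0,
            }
        )
    return results


results = score_contexts(
    [
        ["Berlin is the capital of Germany."],  # one relevant sentence found
        [],                                     # nothing relevant in this context
    ]
)
print([r["score"] for r in results])  # [1, 0]
```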

🐛 Bug Fixes

  • SASEvaluator now raises a ValueError if a None value is contained in the predicted_answers input.
  • Auto enable tracing upon import if ddtrace or opentelemetry is installed.
  • Meta handling of bytestreams in Azure OCR has been fixed.
  • Use new filter syntax in the CacheChecker component instead of legacy one.
  • Solve serialization bug on 'ChatPromptBuilder' by creating 'to_dict' and 'from_dict' methods on 'ChatMessage' and 'ChatPromptBuilder'.
  • Fix some bugs running a Pipeline that has Components with conditional outputs. Some branches that were expected not to run would run anyway, even if they received no inputs. Some branches instead would cause the Pipeline to get stuck waiting to run that branch, even if they received no inputs. The behaviour would depend on whether the Component not receiving the input has an optional input or not.
  • Fixed the calculation for MRR and MAP scores.
  • Fix the deserialization of pipelines containing evaluator components that were subclasses of LLMEvaluator.
  • Fix recursive JSON type conversion in the schema validator to be less aggressive (no infinite recursion).
  • Adds the missing 'organization' parameter to the serialization function.
  • Correctly serialize tuples and types in the init parameters of the LLMEvaluator component and its subclasses.
  • Pin numpy<2 to avoid breaking changes that cause several core integrations to fail. Pin tenacity too (8.4.0 is broken).
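
For readers unfamiliar with the MRR and MAP metrics mentioned in the fixes above, here is a minimal sketch of their textbook definitions. It is not the library's implementation: each query is represented as a ranked list of booleans (True means the document at that rank is relevant), and average precision is assumed to be normalised by the number of relevant documents retrieved.

```python
from typing import List


def reciprocal_rank(relevance: List[bool]) -> float:
    """1 / rank of the first relevant document, or 0.0 if none is relevant."""
    for rank, is_relevant in enumerate(relevance, start=1):
        if is_relevant:
            return 1.0 / rank
    return 0.0


def average_precision(relevance: List[bool]) -> float:
    """Mean of precision@k over the ranks k at which a relevant document appears."""
    hits = 0
    precision_sum = 0.0
    for rank, is_relevant in enumerate(relevance, start=1):
        if is_relevant:
            hits += 1
            precision_sum += hits / rank
    return precision_sum / hits if hits else 0.0


def mean(values: List[float]) -> float:
    return sum(values) / len(values) if values else 0.0


queries = [
    [False, True, True, False],   # relevant documents at ranks 2 and 3
    [True, False, False, False],  # relevant document at rank 1
]

mrr = mean([reciprocal_rank(q) for q in queries])
map_score = mean([average_precision(q) for q in queries])

print(mrr)                  # 0.75
print(round(map_score, 3))  # 0.792
```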

v2.2.4

04 Jul 14:42

Release Notes

v2.2.4

⚡️ Enhancement Notes

  • Added the apply_filter_policy function to standardize the application of filter policies across all document store-specific retrievers, allowing for consistent handling of initial and runtime filters based on the chosen policy (replace or merge).
  • Introduced a 'filter_policy' init parameter for both InMemoryBM25Retriever and InMemoryEmbeddingRetriever, allowing users to define how runtime filters should be applied with options to either 'replace' the initial filters or 'merge' them, providing greater flexibility in filtering query results.
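
The difference between the 'replace' and 'merge' policies can be sketched as below. This is an illustration under stated assumptions, not the library's `apply_filter_policy`: the `Policy` enum and `combine_filters` function are hypothetical stand-ins, and 'merge' is assumed here to mean that both the initial and the runtime filters must hold. The exact merge semantics are defined by the library.

```python
from enum import Enum
from typing import Any, Dict, Optional

Filters = Dict[str, Any]


class Policy(Enum):
    """Hypothetical stand-in for the policy options named in the release note."""

    REPLACE = "replace"
    MERGE = "merge"


def combine_filters(
    policy: Policy,
    init_filters: Optional[Filters],
    runtime_filters: Optional[Filters],
) -> Optional[Filters]:
    """Decide which filters to apply for a single query."""
    if not runtime_filters:
        # No runtime filters were given: fall back to the initial ones.
        return init_filters
    if policy is Policy.REPLACE or not init_filters:
        # Runtime filters take over completely.
        return runtime_filters
    # MERGE (assumed semantics): require both filter sets to match.
    return {"operator": "AND", "conditions": [init_filters, runtime_filters]}


init = {"field": "meta.lang", "operator": "==", "value": "en"}
runtime = {"field": "meta.year", "operator": ">=", "value": 2023}

replaced = combine_filters(Policy.REPLACE, init, runtime)
merged = combine_filters(Policy.MERGE, init, runtime)

print(replaced == runtime)        # True
print(len(merged["conditions"]))  # 2
```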

🐛 Bug Fixes

  • Meta handling of bytestreams in Azure OCR has been fixed.
  • Fix some bugs running a Pipeline that has Components with conditional outputs. Some branches that were expected not to run would run anyway, even if they received no inputs. Some branches instead would cause the Pipeline to get stuck waiting to run that branch, even if they received no inputs. The behaviour would depend on whether the Component not receiving the input has an optional input or not.

v2.2.4-rc1

03 Jul 11:49
Pre-release

Release Notes

v2.2.4-rc1

⚡️ Enhancement Notes

  • Added the apply_filter_policy function to standardize the application of filter policies across all document store-specific retrievers, allowing for consistent handling of initial and runtime filters based on the chosen policy (replace or merge).
  • Introduced a 'filter_policy' init parameter for both InMemoryBM25Retriever and InMemoryEmbeddingRetriever, allowing users to define how runtime filters should be applied with options to either 'replace' the initial filters or 'merge' them, providing greater flexibility in filtering query results.

🐛 Bug Fixes

  • Fix some bugs running a Pipeline that has Components with conditional outputs. Some branches that were expected not to run would run anyway, even if they received no inputs. Some branches instead would cause the Pipeline to get stuck waiting to run that branch, even if they received no inputs. The behaviour would depend on whether the Component not receiving the input has an optional input or not.