Releases: deepset-ai/haystack
v1.11.1rc1
What's Changed
Full Changelog: v1.11.0...v1.11.1rc1
v1.11.0
⭐ Highlights
Expanding Haystack’s LLM support further with the new CohereEmbeddingEncoder (#3453)
Now you can easily create document and query embeddings using Cohere’s large language models: if you have a Cohere account, all you have to do is set the name of one of the supported models (small, medium, or large) and add your API key to the EmbeddingRetriever component in your pipelines (see docs).
Extracting headlines from Markdown and PDF files (#3445 #3488)
Using the MarkdownConverter or the ParsrConverter, you can set the parameter extract_headlines to True to extract the headlines from your files together with their start position in the file and their level. Headlines are stored as a list of dictionaries in the Document's meta field "headlines" and are structured as follows:
{
"headline": <THE HEADLINE STRING>,
"start_idx": <IDX OF HEADLINE START IN document.content >,
"level": <LEVEL OF THE HEADLINE>
}
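As a rough illustration of how this metadata might be consumed, the sketch below looks up the headline that governs a given character offset in document.content. It works on plain dictionaries in the shape shown above; the helper name headline_at and the sample values are invented for this example and are not part of Haystack.

```python
from typing import Dict, List, Optional


def headline_at(headlines: List[Dict], position: int) -> Optional[Dict]:
    """Return the last headline that starts at or before `position`, if any."""
    current = None
    for headline in sorted(headlines, key=lambda h: h["start_idx"]):
        if headline["start_idx"] > position:
            break
        current = headline
    return current


# Sample data in the documented shape (all values are invented).
sample_headlines = [
    {"headline": "Introduction", "start_idx": 0, "level": 1},
    {"headline": "Setup", "start_idx": 120, "level": 2},
    {"headline": "Usage", "start_idx": 480, "level": 2},
]

section = headline_at(sample_headlines, 200)
print(section["headline"] if section else None)
```

In a real pipeline the list would come from document.meta["headlines"] after converting a file with extract_headlines set to True.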
Introducing the proposals design process (#3333)
We've introduced the proposal design process for substantial changes. A proposal is a single Markdown file that explains why a change is needed and how it would be implemented. You can find a detailed explanation of the process and a proposal template in the proposals directory.
⚠️ Breaking change: removing Milvus1DocumentStore
From this version onwards, Haystack no longer supports version 1 of Milvus. We still support Milvus version 2. We removed Milvus1DocumentStore and renamed Milvus2DocumentStore to MilvusDocumentStore.
What's Changed
Breaking Changes
- bug: removed duplicated meta "name" field addition to content before embedding in update_embeddings workflow by @mayankjobanputra in #3368
- BREAKING CHANGE: remove Milvus1DocumentStore along with support for Milvus < 2.x by @masci in #3552
Pipeline
- fix: Fix the error of wrong page numbers when documents contain empty pages. by @brunnurs in #3330
- bug: change type of split_by to Literal including None by @julian-risch in #3389
- Fix: update pyworld pin by @anakin87 in #3435
- feat: send event if number of queries exceeds threshold by @vblagoje in #3419
- Feat: allow decreasing size of datasets loaded from BEIR by @ugm2 in #3392
- feat: add __contains__ to Span by @ZanSara in #3446
- Bug: Fix prompt length computation by @Timoeller in #3448
- Add indexing pipeline type by @vblagoje in #3461
- fix: warning if doc store similarity function is incompatible with Sentence Transformers model by @anakin87 in #3455
- feat: Add CohereEmbeddingEncoder to EmbeddingRetriever by @vblagoje in #3453
- feat: Extraction of headlines in markdown files by @bogdankostic in #3445
- bug: replace decorator with counter attribute for pipeline event by @julian-risch in #3462
- feat: add document_store to all BaseRetriever.retrieve() and BaseRetriever.retrieve_batch() implementations by @ZanSara in #3379
- refactor: TableReader by @sjrl in #3456
- fix: do not reference package directory in PDFToTextOCRConverter.convert() by @ZanSara in #3478
- feat: Create the TextIndexingPipeline by @brandenchan in #3473
- refactor: remove YAML save/load methods for subclasses of BaseStandardPipeline by @ZanSara in #3443
- fix: strip whitespaces safely from FARMReader's answers by @ZanSara in #3526
DocumentStores
- Document Store test refactoring by @masci in #3449
- fix: support long texts for labels in ElasticsearchDocumentStore by @anakin87 in #3346
- feat: add SQLDocumentStore tests by @masci in #3517
- refactor: Refactor Weaviate tests by @masci in #3541
- refactor: Pinecone tests by @masci in #3555
- fix: write metadata to SQL Document Store when duplicate_documents!="overwrite" by @anakin87 in #3548
- fix: Elasticsearch / OpenSearch brownfield function does not incorporate meta by @tstadel in #3572
- fix: discard metadata fields if not set in Weaviate by @masci in #3578
UI / Demo
Documentation
- docs: Extend utils API docs coverage by @brandenchan in #3402
- refactor: simplify Summarizer, add Document Merger by @anakin87 in #3452
- feat: introduce proposal design process by @masci in #3333
Other Changes
- fix: Update env variable for model caching timeout by @sjrl in #3405
- feat: Add exponential backoff decorator; apply it to OpenAI requests by @vblagoje in #3398
- fix: improve Document __repr__ by @anakin87 in #3385
- fix: disabling telemetry prevents writing config by @julian-risch in #3465
- refactor: Change no_answer attribute by @anakin87 in #3411
- feat: Speed up reader tests by @sjrl in #3476
- fix: pattern to match tags push by @masci in #3469
- fix: using onnx converter on XLMRoberta architecture by @sjrl in #3470
- feat: Add headline extraction to ParsrConverter by @bogdankostic in #3488
- refactor: upgrade actions version by @ZanSara in #3506
- docs: Update docker readme by @brandenchan in #3531
- refactor: refactor FAISS tests by @masci in #3537
- feat: include error message in HaystackError telemetry events by @vblagoje in #3543
- fix: [rest_api] support TableQA in the endpoint /documents/get_by_filters by @ju-gu in #3551
- bug: fix release number by @mayankjobanputra in #3559
- refactor: Generate JSON schema when missing by @masci in #3533
New Contributors
- @brunnurs made their first contribution in #3330
- @mayankjobanputra made their first contribution in #3368
Full Changelog: v1.10.0...v1.11.0rc1
v1.10.0
⭐ Highlights
Expanding Haystack's LLM support with the new OpenAIEmbeddingEncoder (#3356)
Now you can easily create document and query embeddings using large language models: if you have an OpenAI account, all you have to do is set the name of one of the supported models (ada, babbage, davinci, or curie) and add your API key to the EmbeddingRetriever component in your pipelines (see docs).
Multimodal retrieval is here! (#2891)
Multimodality with Haystack just made a big leap forward with the addition of MultiModalRetriever: a Retriever that can handle different modalities for query and documents independently. Take it for a spin and experiment with new Document formats, like images. You can now use the same Retriever for text-to-image, text-to-table, and text-to-text retrieval, but also for image similarity, table similarity, and more! Feed your favorite multimodal model to MultiModalRetriever and see it in action.
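from haystack.document_stores import InMemoryDocumentStore
from haystack.nodes.retriever.multimodal import MultiModalRetriever

# Use the same CLIP model to embed text queries and image documents
retriever = MultiModalRetriever(
    document_store=InMemoryDocumentStore(embedding_dim=512),
    query_embedding_model="sentence-transformers/clip-ViT-B-32",
    query_type="text",
    document_embedding_models={"image": "sentence-transformers/clip-ViT-B-32"},
)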
Multi-platform Docker images
Starting with 1.10, we're making the deepset/haystack images available for linux/amd64 and linux/arm64.
⚠️ Breaking change in embed_queries method (#3252)
We've renamed the text argument of the embed_queries method to queries for both DensePassageRetriever and EmbeddingRetriever.
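For code that still passes the old keyword, one possible migration aid is a thin wrapper that maps text to queries and emits a deprecation warning. The sketch below is an illustration under stated assumptions only: FakeRetriever is a stand-in class, not a Haystack component, and embed_queries_compat is a hypothetical helper that does not exist in Haystack.

```python
import warnings
from typing import Any, List, Optional


class FakeRetriever:
    """Stand-in for a retriever whose embed_queries takes `queries` (not a Haystack class)."""

    def embed_queries(self, queries: List[str]) -> List[List[float]]:
        # Dummy embedding: one number per query, just to keep the sketch runnable.
        return [[float(len(q))] for q in queries]


def embed_queries_compat(
    retriever: Any,
    queries: Optional[List[str]] = None,
    text: Optional[List[str]] = None,
) -> List[List[float]]:
    """Hypothetical shim: accept the legacy `text` keyword and forward it as `queries`."""
    if text is not None:
        if queries is not None:
            raise TypeError("Pass either `queries` or the legacy `text`, not both.")
        warnings.warn("`text` was renamed to `queries`", DeprecationWarning, stacklevel=2)
        queries = text
    if queries is None:
        raise TypeError("`queries` is required.")
    return retriever.embed_queries(queries=queries)


retriever = FakeRetriever()
print(embed_queries_compat(retriever, queries=["What is Haystack?"]))
```

In most codebases a direct search-and-replace of the keyword is simpler; a shim like this is mainly useful when callers cannot all be updated at once.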
What's Changed
Breaking Changes
Pipeline
- fix: ONNX FARMReader model conversion is broken by @vblagoje in #3211
- bug: JoinDocuments nodes produce incorrect results if preceded by another JoinDocuments node by @JeffRisberg in #3170
- fix: eval() with add_isolated_node_eval=True breaks if no node supports it by @tstadel in #3347
- feat: extract label aggregation by @tstadel in #3363
- feat: Add OpenAIEmbeddingEncoder to EmbeddingRetriever by @vblagoje in #3356
- fix: stable YAML schema generation by @ZanSara in #3388
- fix: Update how schema is ordered by @sjrl in #3399
- feat: MultiModalRetriever by @ZanSara in #2891
DocumentStores
- feat: FAISS in OpenSearch: Support HNSW for cosine by @tstadel in #3217
- feat: add support for Elasticsearch 7.16.2 by @masci in #3318
- refactor: remove dead code from FAISSDocumentStore by @anakin87 in #3372
- fix: allow same vector_id in different indexes for SQL-based Document stores by @anakin87 in #3383
UI / Demo
Documentation
- docs: Fix a docstring in ray.py by @tanertopal in #3282
Other Changes
- refactor: make TransformersDocumentClassifier output consistent between different types of classification by @anakin87 in #3224
- Classify pipeline's type based on its components by @vblagoje in #3132
- docs: sync Haystack API with Readme by @brandenchan in #3223
- fix: MostSimilarDocumentsPipeline doesn't have pipeline property by @vblagoje in #3265
- bug: make ElasticSearchDocumentStore use batch_size in get_documents_by_id by @anakin87 in #3166
- refactor: better tests for TransformersDocumentClassifier by @anakin87 in #3270
- fix: AttributeError in TranslationWrapperPipeline by @nickchomey in #3290
- refactor: remove Inferencer multiprocessing by @vblagoje in #3283
- fix: opensearch script score with filters by @tstadel in #3321
- feat: Adding filters param to MostSimilarDocumentsPipeline run and run_batch by @JacdDev in #3301
- feat: add multi-platform Docker images by @masci in #3354
- fix: Added checks for DataParallel and WrappedDataParallel by @sjrl in #3366
- fix: QuestionGenerator generates wrong document questions for non-default num_queries_per_doc parameter by @vblagoje in #3381
- bug: Adds better way of checking query in BaseRetriever and Pipeline.run() by @ugm2 in #3304
- feat: Updated EntityExtractor to handle long texts and added better postprocessing by @sjrl in #3154
- docs: Add comment about the generation of no-answer samples in FARMReader training by @brandenchan in #3404
- feat: Speed up integration tests (nodes) by @sjrl in #3408
- fix: Fix the error of wrong page numbers when documents contain empty pages. by @brunnurs in #3330
- bug: change type of split_by to Literal including None by @julian-risch in #3389
- feat: Add exponential backoff decorator; apply it to OpenAI requests by @vblagoje in #3398
New Contributors
- @tanertopal made their first contribution in #3282
- @JeffRisberg made their first contribution in #3170
- @JacdDev made their first contribution in #3301
- @hsm207 made their first contribution in #3351
- @ugm2 made their first contribution in #3304
- @brunnurs made their first contribution in #3330
Full Changelog: v1.9.1...v1.10.0rc1
v1.9.1
What's Changed
- fix: Allow less restrictive values for parameters in Pipeline configurations by @bogdankostic in #3345
Full Changelog: v1.9.0...v1.9.1rc1
v1.9.0
⭐ Highlights
Haystack 1.9 comes with nice performance improvements and two important pieces of news about its ecosystem. Let's see it in more detail!
Logging speed set to ludicrous (#3212)
This feature alone makes Haystack 1.9 worth testing out, just sayin'... We switched from f-strings to the string formatting operator when composing log messages and observed an astonishing speed-up of up to 120% in some pipelines.
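The gain comes from deferred formatting: when arguments are passed in the % style, the logging module only builds the message if the record is actually emitted, whereas an f-string is always evaluated before the call. A minimal standard-library sketch (the Expensive class and the logger name are made up for illustration):

```python
import logging

logger = logging.getLogger("haystack_logging_demo")
logger.setLevel(logging.WARNING)  # DEBUG records are discarded
logger.addHandler(logging.NullHandler())


class Expensive:
    """Counts how often it is converted to a string."""

    def __init__(self) -> None:
        self.str_calls = 0

    def __str__(self) -> str:
        self.str_calls += 1
        return "expensive value"


eager, lazy = Expensive(), Expensive()

# f-string: the message is built before logger.debug() is even called.
logger.debug(f"Processing {eager}")

# %-style arguments: formatting is skipped because DEBUG is disabled.
logger.debug("Processing %s", lazy)

print(eager.str_calls, lazy.str_calls)  # prints: 1 0
```

The saving grows with the cost of converting the logged objects to strings and with the number of log calls below the active level.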
Tutorials moved out! (#3244)
They grow up so fast! Tutorials now have their own git repository, CI, and release cycle, making it easier than ever to contribute ideas, fixes, and bug reports. Have a look at the tutorials repo, Star it, and open an issue if you have an idea for a new tutorial!
Docker pull deepset/haystack (#3162)
A new Docker image is ready to be pulled shipping Haystack 1.9, providing different flavors and versions that you can specify with the proper Docker tag - have a look at the README.
On this occasion, we also revamped the build process so that it's now using bake, while the older images are deprecated (see below).
⚠️ Deprecation notice
With the release of the new Docker image deepset/haystack, the following images are now deprecated and won't be updated any more starting with Haystack 1.10:
New Documentation Site and Haystack Website Revamp:
The Haystack website is going through a make-over to become a developer portal that surrounds Haystack and NLP topics beyond pure documentation. With that, we've published our new documentation site. From now on, content surrounding pure developer documentation will live under Haystack Documentation, while the Haystack website becomes a place for the community with tutorials, learning material and soon, a place where the community can share their own content too.
What's Changed
Pipeline
- feat: standardize devices parameter and device initialization by @vblagoje in #3062
- fix: Reduce GPU to CPU copies at inference by @sjrl in #3127
- test: lower low boundary for accuracy in test_calculate_context_similarity_on_non_matching_contexts by @ZanSara in #3199
- bug: fix pdftotext installation verification by @banjocustard in #3233
- chore: remove f-strings from logs for performance reasons by @ZanSara in #3212
- bug: reactivate benchmarks with quick fixes by @tholor in #2766
Models
DocumentStores
- bug: OpensearchDocumentStore.custom_mapping should accept JSON strings at validation by @ZanSara in #3065
- feat: Add warnings to PineconeDocumentStore about indexing metadata if filters return no documents by @Namoush in #3086
- bug: validate custom_mapping as an object by @ZanSara in #3189
Tutorials
- docs: Fix the word length splitting; should be set to 100 not 1,000 by @stevenhaley in #3133
- chore: remove tutorials from the repo by @masci in #3244
Other Changes
- chore: Upgrade and pin transformers to 4.21.2 by @vblagoje in #3098
- bug: adapt UI random question for streamlit 1.12 and pin to streamlit>=1.9.0 by @anakin87 in #3121
- build: pin pydantic to 1.9.2 by @masci in #3126
- fix: document FARMReader.train() evaluation report log level by @brandenchan in #3129
- feat: add a security policy for Haystack by @masci in #3130
- refactor: update dependencies and remove pins by @danielbichuetti in #3147
- refactor: update package strategy in rest_api by @masci in #3148
- fix: give default index for torch.device('cuda') in initialize_device_settings by @sjrl in #3161
- fix: add type hints to all component init constructor parameters by @vblagoje in #3152
- fix: Add 15 min timeout for downloading cached HF models by @vblagoje in #3179
- fix: replace torch.device("cuda") with torch.device("cuda:0") in devices initialization by @vblagoje in #3184
- feat: add health check endpoint to rest api by @danielbichuetti in #3168
- refactor: improve support for dataclasses by @danielbichuetti in #3142
- feat: Updates docs and types for language param in PreProcessor by @sjrl in #3186
- feat: Add option to use MultipleNegativesRankingLoss for EmbeddingRetriever training with sentence-transformers by @bglearning in #3164
- refactoring: reimplement Docker strategy by @masci in #3162
- refactor: remove pre haystack-1.0 import paths support by @ZanSara in #3204
- feat: exponential backoff with exp decreasing batch size for opensearch and elasticsearch client by @ArzelaAscoIi in #3194
- feat: add public layout-base extraction support on PDFToTextConverter by @danielbichuetti in #3137
- bug: fix embedding_dim mismatch in DocumentStore by @kalki7 in #3183
- fix: update rest_api Docker Compose yamls for recent refactoring of rest_api by @nickchomey in #3197
- chore: fix Windows CI by @masci in #3222
- fix: type of temperature param and adjust defaults for OpenAIAnswerGenerator by @tholor in #3073
- fix: handle Documents containing dataframes in Multilabel constructor by @masci in #3237
- fix: make pydoc-markdown hook correctly resolve paths relative to repo root by @masci in #3238
- fix: proper retrieval of answers for batch eval by @vblagoje in #3245
- chore: updating colab links in older docs versions by @TuanaCelik in #3250
- docs: establish API docs sync between v1.9.x and Readme by @brandenchan in #3266
New Contributors
- @Namoush made their first contribution in #3086
- @kalki7 made their first contribution in #3183
- @nickchomey made their first contribution in #3197
- @banjocustard made their first contribution in #3233
Full Changelog: v1.8.0...v1.9.0
v1.8.0
⭐ Highlights
This release comes with a bunch of new features, improvements and bug fixes. Let us know how you like it on our brand new Haystack Discord server! Here are the highlights of the release:
Pipeline Evaluation in Batch Mode #2942
Pipeline evaluation often uses large datasets. With this new feature, batches of queries can be processed at the same time on a GPU, which reduces the time needed for an evaluation run; we are working on further speed improvements. To try it out, you only need to replace the call to pipeline.eval() with pipeline.eval_batch() when you evaluate your question answering pipeline:
...
pipeline = ExtractiveQAPipeline(reader=reader, retriever=retriever)
eval_result = pipeline.eval_batch(labels=eval_labels, params={"Retriever": {"top_k": 5}})
Early Stopping in Reader and Retriever Training #3071
When training a reader or retriever model, you need to specify the number of training epochs. If the model doesn't improve further after the first few epochs, training usually still continues for the rest of the specified number of epochs. Early Stopping can now automatically monitor how much the model improves during training and stop the process when there is no significant improvement. Various metrics can be monitored, including loss, EM, f1, and top_n_accuracy for FARMReader, or loss, acc, f1, and average_rank for DensePassageRetriever. For example, reader training can be stopped when loss doesn't decrease by at least 0.001 compared to the previous epoch:
from haystack.nodes import FARMReader
from haystack.utils.early_stopping import EarlyStopping
reader = FARMReader(model_name_or_path="deepset/roberta-base-squad2-distilled")
reader.train(data_dir="data/squad20", train_filename="dev-v2.0.json", early_stopping=EarlyStopping(min_delta=0.001), use_gpu=True, n_epochs=8, save_dir="my_model")
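To make the min_delta criterion concrete, here is a minimal, library-independent sketch of the idea: training stops once the monitored loss fails to improve by at least min_delta. This is not Haystack's EarlyStopping implementation; the class name, the patience parameter, the choice to compare against the best loss seen so far, and the loss values are all assumptions made for illustration.

```python
class SimpleEarlyStopping:
    """Illustrative stand-in: stop when the loss stops improving by `min_delta`."""

    def __init__(self, min_delta: float = 0.001, patience: int = 0) -> None:
        self.min_delta = min_delta
        self.patience = patience  # extra epochs tolerated without improvement
        self.best_loss = float("inf")
        self.epochs_without_improvement = 0

    def should_stop(self, loss: float) -> bool:
        # An epoch counts as an improvement only if it beats the best loss by min_delta.
        if self.best_loss - loss >= self.min_delta:
            self.best_loss = loss
            self.epochs_without_improvement = 0
            return False
        self.epochs_without_improvement += 1
        return self.epochs_without_improvement > self.patience


stopper = SimpleEarlyStopping(min_delta=0.001)
losses = [0.90, 0.70, 0.65, 0.6495, 0.60]  # invented per-epoch losses

stopped_at = None
for epoch, loss in enumerate(losses, start=1):
    if stopper.should_stop(loss):
        stopped_at = epoch
        break

print(stopped_at)  # prints: 4
```

With these values the fourth epoch improves the loss by only about 0.0005, so training stops there even though a later epoch would have improved again; a larger patience trades extra training time for robustness against such plateaus.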
PineconeDocumentStore Without SQL Database #2749
Thanks to @jamescalam, the PineconeDocumentStore no longer depends on a local SQL database. So when you initialize a PineconeDocumentStore from now on, all you need to provide is a Pinecone API key:
from haystack import Document
from haystack.document_stores import PineconeDocumentStore

document_store = PineconeDocumentStore(api_key="...")
docs = [Document(content="...")]
document_store.write_documents(docs)
FAISS in OpenSearchDocumentStore: #3101 #3029
OpenSearch supports different approximate k-NN libraries for indexing and search. In Haystack's OpenSearchDocumentStore, you can now set the knn_engine parameter to choose between nmslib and faiss. When loading an existing index, you can also specify a knn_engine, and Haystack checks whether the same engine was used to create the index. If not, it falls back to slow exact vector calculation.
Highlighted Bug Fixes
A bug was fixed that prevented users from loading private models in some components because the authentication token wasn't passed on correctly. A second bug was fixed in the schema files affecting parameters of type Optional[List[]], in which case the validation failed if the parameter was explicitly set to None.
- fix: Use use_auth_token in all cases when loading from the HF Hub by @sjrl in #3094
- bug: handle Optional params in schema validation by @anakin87 in #2980
Other Changes
DocumentStores
Documentation
- refactor: rename master into main in documentation and links by @ZanSara in #3063
- docs: fixed typo (or old documentation) in ipynb tutorial 3 by @DavidGerva in #3033
- docs: Add OpenAI Answer Generator API by @brandenchan in #3050
Crawler
- fix: update ChromeDriver options on restricted environments and add ChromeDriver options as function parameter by @danielbichuetti in #3043
- fix: Crawler quits ChromeDriver on destruction by @danielbichuetti in #3070
Other Changes
- fix(translator): write translated text to output documents, while keeping input untouched by @danielbichuetti in #3077
- test: Use random_sample instead of ndarray for random array in OpenSearchDocumentStore test by @bogdankostic in #3083
- feat: add progressbar to upload_files() for deepset Cloud client by @tholor in #3069
- refactor: update package metadata by @ofek in #3079
New Contributors
- @DavidGerva made their first contribution in #3033
- @ofek made their first contribution in #3079
❤️ Big thanks to all contributors and the whole community!
Full Changelog: v1.7.1...v1.8.0
v1.7.1
Patch Release
Main Changes
Other Changes
- fix: pin version of pyworld to 0.2.12 by @sjrl in #3047
- test: update filtering of Pinecone mock to imitate doc store by @jamescalam in #3020
Full Changelog: v1.7.0...v1.7.1