10 Oct 12:37

masci

256321d

v1.9.1rc1 Pre-release

Pre-release

What's Changed

fix: Allow less restrictive values for parameters in Pipeline configurations by @bogdankostic in #3345

Full Changelog: v1.9.0...v1.9.1rc1

Contributors

bogdankostic

Assets 2

21 Sep 11:23

masci

v1.9.0

ce36be8

v1.9.0

⭐ Highlights

Haystack 1.9 comes with nice performance improvements and two important pieces of news about its ecosystem. Let's see it in more detail!

Logging speed set to ludicrous (#3212)

This feature alone makes Haystack 1.9 worth testing out, just sayin'... We switched from f-strings to the string formatting operator when composing a log message, observing an astonishing speed of up to 120% in some pipelines.

Tutorials moved out! (#3244)

They grow up so fast! Tutorials now have their own git repository, CI, and release cycle, making it easier than ever to contribute ideas, fixes, and bug reports. Have a look at the tutorials repo, Star it, and open an issue if you have an idea for a new tutorial!

Docker pull deepset/haystack (#3162)

A new Docker image is ready to be pulled shipping Haystack 1.9, providing different flavors and versions that you can specify with the proper Docker tag - have a look at the README.
On this occasion, we also revamped the build process so that it's now using bake, while the older images are deprecated (see below).

⚠️ Deprecation notice

With the release of the new Docker image deepset/haystack, the following images are now deprecated and won't be updated any more starting with Haystack 1.10:

New Documentation Site and Haystack Website Revamp:

The Haystack website is going through a make-over to become a developer portal that surrounds Haystack and NLP topics beyond pure documentation. With that, we've published our new documentation site. From now on, content surrounding pure developer documentation will live under Haystack Documentation, while the Haystack website becomes a place for the community with tutorials, learning material and soon, a place where the community can share their own content too.

What's Changed

Pipeline

feat: standardize devices parameter and device initialization by @vblagoje in #3062
fix: Reduce GPU to CPU copies at inference by @sjrl in #3127
test: lower low boundary for accuracy in test_calculate_context_similarity_on_non_matching_contexts by @ZanSara in #3199
bug: fix pdftotext installation verification by @banjocustard in #3233
chore: remove f-strings from logs for performance reasons by @ZanSara in #3212
bug: reactivate benchmarks with quick fixes by @tholor in #2766

Models

fix: Replace multiprocessing tokenization with batched fast tokenization by @vblagoje in #3089

DocumentStores

bug: OpensearchDocumentStore.custom_mapping should accept JSON strings at validation by @ZanSara in #3065
feat: Add warnings to PineconeDocumentStore about indexing metadata if filters return no documents by @Namoush in #3086
bug: validate custom_mapping as an object by @ZanSara in #3189

Tutorials

docs: Fix the word length splitting; should be set to 100 not 1,000 by @stevenhaley in #3133
chore: remove tutorials from the repo by @masci in #3244

Other Changes

chore: Upgrade and pin transformers to 4.21.2 by @vblagoje in #3098
bug: adapt UI random question for streamlit 1.12 and pin to streamlit>=1.9.0 by @anakin87 in #3121
build: pin pydantic to 1.9.2 by @masci in #3126
fix: document FARMReader.train() evaluation report log level by @brandenchan in #3129
feat: add a security policy for Haystack by @masci in #3130
refactor: update dependencies and remove pins by @danielbichuetti in #3147
refactor: update package strategy in rest_api by @masci in #3148
fix: give default index for torch.device('cuda') in initialize_device_settings by @sjrl in #3161
fix: add type hints to all component init constructor parameters by @vblagoje in #3152
fix: Add 15 min timeout for downloading cached HF models by @vblagoje in #3179
fix: replace torch.device("cuda") with torch.device("cuda:0") in devices initialization by @vblagoje in #3184
feat: add health check endpoint to rest api by @danielbichuetti in #3168
refactor: improve support for dataclasses by @danielbichuetti in #3142
feat: Updates docs and types for language param in PreProcessor by @sjrl in #3186
feat: Add option to use MultipleNegativesRankingLoss for EmbeddingRetriever training with sentence-transformers by @bglearning in #3164
refactoring: reimplement Docker strategy by @masci in #3162
refactor: remove pre haystack-1.0 import paths support by @ZanSara in #3204
feat: exponential backoff with exp decreasing batch size for opensearch and elasticsearch client by @ArzelaAscoIi in #3194
feat: add public layout-base extraction support on PDFToTextConverter by @danielbichuetti in #3137
bug: fix embedding_dim mismatch in DocumentStore by @kalki7 in #3183
fix: update rest_api Docker Compose yamls for recent refactoring of rest_api by @nickchomey in #3197
chore: fix Windows CI by @masci in #3222
fix: type of temperature param and adjust defaults for OpenAIAnswerGenerator by @tholor in #3073
fix: handle Documents containing dataframes in Multilabel constructor by @masci in #3237
fix: make pydoc-markdown hook correctly resolve paths relative to repo root by @masci in #3238
fix: proper retrieval of answers for batch eval by @vblagoje in #3245
chore: updating colab links in older docs versions by @TuanaCelik in #3250
docs: establish API docs sync between v1.9.x and Readme by @brandenchan in #3266

New Contributors

@Namoush made their first contribution in #3086
@kalki7 made their first contribution in #3183
@nickchomey made their first contribution in #3197
@banjocustard made their first contribution in #3233

Full Changelog: v1.8.0...v1.9.0

Contributors

masci, vblagoje, and 14 other contributors

Assets 2

26 Aug 16:08

julian-risch

v1.8.0

4e518cd

v1.8.0

⭐ Highlights

This release comes with a bunch of new features, improvements and bug fixes. Let us know how you like it on our brand new Haystack Discord server! Here are the highlights of the release:

Pipeline Evaluation in Batch Mode #2942

The evaluation of pipelines often uses large datasets and with this new feature batches of queries can be processed at the same time on a GPU. Thereby, the time needed for an evaluation run is decreased and we are working on further speed improvements. To try it out, you only need to replace the call to pipeline.eval() with pipeline.eval_batch() when you evaluate your question answering pipeline:

...
pipeline = ExtractiveQAPipeline(reader=reader, retriever=retriever)
eval_result = pipeline.eval_batch(labels=eval_labels, params={"Retriever": {"top_k": 5}})

Early Stopping in Reader and Retriever Training #3071

When training a reader or retriever model, you need to specify the number of training epochs. If the model doesn't further improve after the first few epochs, the training usually still continues for the rest of the specified number of epochs. Early Stopping can now automatically monitor how much the model improves during training and stop the process when there is no significant improvement. Various metrics can be monitored, including loss, EM, f1, and top_n_accuracy for FARMReader or loss, acc, f1, and average_rank for DensePassageRetriever. For example, reader training can be stopped when loss doesn't further decrease by at least 0.001 compared to the previous epoch:

from haystack.nodes import FARMReader
from haystack.utils.early_stopping import EarlyStopping
reader = FARMReader(model_name_or_path="deepset/roberta-base-squad2-distilled")
reader.train(data_dir="data/squad20", train_filename="dev-v2.0.json", early_stopping=EarlyStopping(min_delta=0.001), use_gpu=True, n_epochs=8, save_dir="my_model")

PineconeDocumentStore Without SQL Database #2749

Thanks to @jamescalam the PineconeDocumentStore does not depend on a local SQL database anymore. So when you initialize a PineconeDocumentStore from now on, all you need to provide is a Pinecone API key:

from haystack.document_stores import PineconeDocumentStore
document_store = PineconeDocumentStore(api_key="...")
docs = [Document(content="...")]
document_store.write_documents(docs)

FAISS in OpenSearchDocumentStore: #3101 #3029

OpenSearch supports different approximate k-NN libraries for indexing and search. In Haystack's OpenSearchDocumentStore you can now set the knn_engine parameter to choose between nmslib and faiss. When loading an existing index you can also specify a knn_engine and Haystack checks if the same engine was used to create the index. If not, it falls back to slow exact vector calculation.

Highlighted Bug Fixes

A bug was fixed that prevented users from loading private models in some components because the authentication token wasn't passed on correctly. A second bug was fixed in the schema files affecting parameters that are of type Optional[List[]], in which case the validation failed if the parameter was explicitly set to None.

fix: Use use_auth_token in all cases when loading from the HF Hub by @sjrl in #3094
bug: handle Optional params in schema validation by @anakin87 in #2980

Other Changes

DocumentStores

feat: Allow exact list matching with field in Elasticsearch filtering by @masci in #2988

Documentation

refactor: rename master into main in documentation and links by @ZanSara in #3063
docs:fixed typo (or old documentation) in ipynb tutorial 3 by @DavidGerva in #3033
docs: Add OpenAI Answer Generator API by @brandenchan in #3050

Crawler

fix: update ChromeDriver options on restricted environments and add ChromeDriver options as function parameter by @danielbichuetti in #3043
fix: Crawler quits ChromeDriver on destruction by @danielbichuetti in #3070

Other Changes

fix(translator): write translated text to output documents, while keeping input untouched by @danielbichuetti in #3077
test: Use random_sample instead of ndarray for random array in OpenSearchDocumentStore test by @bogdankostic in #3083
feat: add progressbar to upload_files() for deepset Cloud client by @tholor in #3069
refactor: update package metadata by @ofek in #3079

New Contributors

@DavidGerva made their first contribution in #3033
@ofek made their first contribution in #3079

❤️ Big thanks to all contributors and the whole community!

Full Changelog: v1.7.1...v1.8.0

Contributors

masci, tholor, and 9 other contributors

Assets 2

19 Aug 11:32

julian-risch

v1.7.1

eb0f0da

v1.7.1

Patch Release

Main Changes

feat: take the list of models to cache instead of hardcoding one by @masci in #3060

Other Changes

fix: pin version of pyworld to 0.2.12 by @sjrl in #3047
test: update filtering of Pinecone mock to imitate doc store by @jamescalam in #3020

Full Changelog: v1.7.0...v1.7.1

Contributors

masci, sjrl, and jamescalam

Assets 2

15 Aug 12:43

tstadel

v1.7.0

baefd32

v1.7.0

⭐ Highlights

This time we have a couple of smaller yet important feature highlights: lots of them coming from you, our amazing community!
🥂 Alongside that, as we notice more frequent and great contributions from our community, we are also announcing our brand new Haystack Discord server to help us interact better with the people that make Haystack what it is! 🥳

Here's what you'll find in Haystack 1.7:

Support for OpenAI GPT-3

If you always wanted to know how OpenAI's famous GPT-3 model compares to other models, now your time has come. It's been fully integrated into Haystack, so you can use it as any other model. Just sign up to OpenAI, copy your API key from here and run the following code.To compare it to other models, check out our evaluation guide.

from haystack.nodes import OpenAIAnswerGenerator
from haystack import Document

reader = OpenAIAnswerGenerator(api_key="<your-api-token>", max_tokens=15, temperature=0.3)

docs = [Document(content="""The Big Bang Theory is an American sitcom.
                            The four main characters are all avid fans of nerd culture. 
                            Among their shared interests are science fiction, fantasy, comic books and collecting memorabilia. 
                            Star Trek in particular is frequently referenced""")]
res = reader.predict(query="Do the main characters of big bang theory like Star Trek?", documents=docs)
print(res)

#2605
#3036

Zero-Shot Query Classification

Till now, TransformersQueryClassifier was very closely built around the excellent binary query-type classifier model of hahrukhx01. Although it was already possible to use other Transformer models, the choice was restricted to the models that output binary labels. One of our amazing community contributions now lifted this restriction.
But that's not all: @anakin87 added support for zero-shot classification models as well!
So now that you're completely free to choose the classification categories you want, you can let your creativity run wild. One thing you could do is customize the behavior of your pipeline based on the semantic category of the query, like this:

from haystack.nodes import TransformersQueryClassifier

# In zero-shot-classification, you are free to choose the labels
labels = ["music", "cinema", "food"]

query_classifier = TransformersQueryClassifier(
    model_name_or_path="typeform/distilbert-base-uncased-mnli",
    use_gpu=True,
    task="zero-shot-classification",
    labels=labels,
)

queries = [
    "In which films does John Travolta appear?",  # query about cinema
    "What is the Rolling Stones first album?",  # query about music
    "Who was Sergio Leone?",  # query about cinema
]

for query in queries:
    result = query_classifier.run(query=query)
    print(f'Query "{query}" was sent to {result[1]}')

#2965

Adding Page Numbers to Document Meta

Sometimes it's not enough to find the right answer or paragraph inside a document and just print it on the screen. Context matters and thus, for search applications, it's essential to send the user exactly to the place where the information came from. For huge documents, we're just halfway there if the user clicks a result and the document opens. To get to the right position, they still need to search the document using the document viewer. To make it easier, we added the parameter add_page_number to ParsrConverter, AzureConverter and PreProcessor. If you set it to True, it adds a meta field "page" to documents containing the page number of the text snippet or a table within the original file.

from haystack.nodes import PDFToTextConverter, PreProcessor
from haystack.document_stores import InMemoryDocumentStore

converter = PDFToTextConverter()
preprocessor = PreProcessor(add_page_number=True)
document_store = InMemoryDocumentStore()

pipeline = Pipeline()
pipeline.add_node(component=converter, name="Converter", inputs=["File"])
pipeline.add_node(component=preprocessor, name="Preprocessor", inputs=["Converter"])
pipeline.add_node(component=document_store, name="DocumentStore", inputs=["PreProcessor"])

#2932

Gradient Accumulation for FARMReader

Training big Transformer models in low-resource environments is hard. Batch size plays a significant role when it comes to hyper-parameter tuning during the training process. The number of batches you can run on your machine is restricted by the amount of memory that fits into your GPUs. Gradient accumulation is a well-known technique to work around that restriction: adding up the gradients across iterations and running the backward pass only once after a certain number of iterations.
We tested it when we fine-tuned roberta-base on SQuAD, which led to nearly the same results as using a higher batch size. We also used it for training deepset/deberta-v3-large, which significantly outperformed its predecessors (see Question Answering on SQuAD).

from haystack.nodes import FARMReader

reader = FARMReader(model_name_or_path="distilbert-base-uncased-distilled-squad", use_gpu=True)
data_dir = "data/squad20"
reader.train(
    data_dir=data_dir, 
    train_filename="dev-v2.0.json", 
    use_gpu=True, n_epochs=1, 
    save_dir="my_model", 
    grad_acc_steps=8
)

#2925

Extended Ray Support

Another great contribution from our community comes from @zoltan-fedor: it's now possible to run more complex pipelines with dual-retriever setup on Ray. Also, we now support ray serve deployment arguments in Pipeline YAMLs so that you can fully control your ray deployments.

pipelines:
  - name: ray_query_pipeline
    nodes:
      - name: EmbeddingRetriever
        replicas: 2
        inputs: [ Query ]
        serve_deployment_kwargs:
          num_replicas: 2
          version: Twenty
          ray_actor_options:
            num_gpus: 0.25
            num_cpus: 0.5
          max_concurrent_queries: 17
      - name: Reader
        inputs: [ EmbeddingRetriever ]

#2981
#2918

Support for Custom Sentence Tokenizers in Preprocessor

On some specific domains (for example, legal with lots of custom abbreviations), the default sentence tokenizer can be improved by some extra training on the domain data. To support a custom model for sentence splitting, @danielbichuetti added the tokenizer_model_folder parameter to Preprocessor.

from haystack.nodes import PreProcessor

preprocessor = PreProcessor(
        split_length=10,
        split_overlap=0,
        split_by="sentence",
        split_respect_sentence_boundary=False,
        language="pt",
        tokenizer_model_folder="/home/user/custom_tokenizer_models",
    )

#2783

Making it Easier to Switch Document Stores

We had yet another amazing community contribution by @zoltan-fedor about the support for BM25 with the Weaviate document store.
Besides that we streamlined methods of BaseDocumentStore and added update_document_meta() to InMemoryDocumentStore. These are all steps to make it easier for you to run the same pipeline with different document stores (for example, for quick prototyping, use in-memory, then head to something more production-ready).
#2860
#2689

Almost 2x Performance Gain for Electra Reader Models

We did a major refactoring of our language_modeling module resolving a bug that caused Electra models to execute the forward pass twice.
#2703.

⚠️ Breaking Changes

Add update_document_meta to InMemoryDocumentStore by @bogdankostic in #2689
Add support for BM25 with the Weaviate document store by @zoltan-fedor in #2860
Extending the Ray Serve integration to allow attributes for Serve deployments by @zoltan-fedor in #2918
bug: make MultiLabel ids consistent across python interpreters by @camillepradel in #2998

⚠️ Breaking Changes for Contributors

Default Branch will be Renamed to `main` on Tuesday, 16th of August

We will rename the default branch from master to main after this release. For a nice recap about good reasons for doing this, have a look at the Software Freedom Conservancy's blog.
Whether coming from this repository or from a fork, local clones of the Haystack repository will need to be updated by running the following commands:

git branch -m master main
git fetch origin
git branch -u origin/main main
git remote set-head origin -a

Pre-Commit Hooks Instead of CI Jobs

To give you full control over your changes, we switched from CI jobs that automatically reformat files, generate schemas, and so on, to pre-commit hooks. To install them, run:

pre-commit install

For more information, check our contributing guidelines.
#2819

Other Changes

Pipelin...

Contributors

masci, vblagoje, and 23 other contributors

Assets 2

06 Jul 09:00

julian-risch

v1.6.0

c80336c

v1.6.0

⭐ Highlights

Make Your QA Pipelines Talk with Audio Nodes! (#2584)

Indexing pipelines can use a new DocumentToSpeech node, which generates an audio file for each indexed document and stores it alongside the text content in a SpeechDocument. A GPU is recommended for this step to increase indexing speed. During querying, SpeechDocuments allow accessing the stored audio version of the documents the answers are extracted from. There is also a new AnswerToSpeech node that can be used in QA pipelines to generate the audio of an answer on the fly. See the new tutorial for a step by step guide on how to make your QA pipelines talk!

Save Models to Remote (#2618)

A new save_to_remote method was introduced to the FARMReader, so that you can easily upload a trained model to the Hugging Face Model Hub. More of this to come in the following releases!

from haystack.nodes import FARMReader

reader = FARMReader(model_name_or_path="roberta-base")
reader.train(data_dir="my_squad_data", train_filename="squad2.json", n_epochs=1, save_dir="my_model")

reader.save_to_remote(repo_id="your-user-name/roberta-base-squad2", private=True, commit_message="First version of my qa model trained with Haystack")

Note that you need to be logged in with transformers-cli login. Otherwise there will be an error message with instructions how to log in. Further, if you make your model private by setting private=True, others won't be able to use it and you will need to pass an authentication token when you reload the model from the Model Hub, which is created also via transformers-cli login.

new_reader = FARMReader(model_name_or_path="your-user-name/roberta-base-squad2", use_auth_token=True)

Multi-Hop Dense Retrieval (#2571)

There is a new MultihopEmbeddingRetriever node that applies iterative retrieval steps and a shared encoder for the query and the documents. Used together with a reader node in a QA pipeline, it is suited for answering complex open-domain questions that require "hopping" multiple relevant documents. See the original paper by Xiong et al. for more details: "Answering complex open-domain questions with multi-hop dense retrieval".

from haystack.nodes import MultihopEmbeddingRetriever
from haystack.document_stores import InMemoryDocumentStore

document_store = InMemoryDocumentStore()
retriever = MultihopEmbeddingRetriever(
            document_store=document_store,
            embedding_model="deutschmann/mdr_roberta_q_encoder",
        )

Big thanks to our community member @deutschmn for the PR!

InMemoryKnowledgeGraph (#2678)

Besides querying texts and tables, Haystack also allows querying knowledge graphs with the help of pre-trained models that translate text queries to graph queries. The latest Haystack release adds an InMemoryKnowledgeGraph allowing to store knowledge graphs without setting up complex graph databases. Try out the tutorial as a notebook on colab!

from pathlib import Path
from haystack.nodes import Text2SparqlRetriever
from haystack.document_stores import InMemoryKnowledgeGraph
from haystack.utils import fetch_archive_from_http

# Fetch data represented as triples of subject, predicate, and object statements
fetch_archive_from_http(url="https://fandom-qa.s3-eu-west-1.amazonaws.com/triples_and_config.zip", output_dir="data/tutorial10")

# Fetch a pre-trained BART model that translates text queries to SPARQL queries
fetch_archive_from_http(url="https://fandom-qa.s3-eu-west-1.amazonaws.com/saved_models/hp_v3.4.zip", output_dir="../saved_models/tutorial10/")

# Initialize knowledge graph and import triples from a ttl file
kg = InMemoryKnowledgeGraph(index="tutorial10")
kg.create_index()
kg.import_from_ttl_file(index="tutorial10", path=Path("data/tutorial10/triples.ttl"))

# Initialize retriever from pre-trained model
kgqa_retriever = Text2SparqlRetriever(knowledge_graph=kg, model_name_or_path=Path("../saved_models/tutorial10/hp_v3.4"))

# Translate a text query to a SPARQL query and execute it on the knowledge graph
print(kgqa_retriever.retrieve(query="In which house is Harry Potter?"))

Big thanks to our community member @anakin87 for the PR!

Torch 1.12 and Transformers 4.20.1 Support

Haystack is now compatible with last week's PyTorch v1.12 release so that you can take advantage of Apple silicon GPUs (Apple M1) for accelerated training and evaluation. PyTorch shared an impressive analysis of speedups over CPU-only here.
Haystack is also compatible with the latest Transformers v4.20.1 release and we will continuously ensure that you can benefit from the latest features in Haystack!

Other Changes

Pipeline

Fix JoinAnswer/JoinNode by @MichelBartels in #2612
Reduce logging messages and simplify logging by @julian-risch in #2682
Correct docstring parameter name by @julian-risch in #2757
AnswerToSpeech by @ZanSara in #2584
Fix params being changed during pipeline.eval() by @tstadel in #2638
Make crawler extract also hidden text by @anakin87 in #2642
Update document scores based on ranker node by @mathislucka in #2048
Improved crawler support for dynamically loaded pages by @danielbichuetti in #2710
Replace deprecated Selenium methods by @ZanSara in #2724
Fix EvaluationSetCliet.get_labels() by @tstadel in #2690
Show warning in reader.eval() about differences compared to pipeline.eval() by @tstadel in #2477
Fix using id_hash_keys as pipeline params by @tstadel in #2717
Fix loading of tokenizers in DPR by @bogdankostic in #2755
Add support for Multi-Hop Dense Retrieval by @deutschmn in #2571
Create target folder if not exists in EvalResult.save() by @tstadel in #2647
Validate max_seq_length in SquadProcessor by @francescocastelli in #2740

Models

Use AutoTokenizer by default, to easily adapt to new models and token… by @apohllo in #1902
first version of save_to_remote for HF from FarmReader by @TuanaCelik in #2618

DocumentStores

Move Opensearch document store in its own module by @masci in #2603
Extract common code for ES and OS into a base class by @masci in #2664
Fix bugs in loading code from yaml by @masci in #2705
fix error in log message by @anakin87 in #2719
Pin es client to include bugfixes by @masci in #2735
Make check of document & embedding count optional in FAISS and Pinecone by @julian-risch in #2677
In memory knowledge graph by @anakin87 in #2678
Pinecone unary queries upgrade by @jamescalam in #2657
wait for postgres to be ready before data migrations by @masci in #2654

Documentation & Tutorials

Update docstrings for GPL by @agnieszka-m in #2633
Add GPL API docs, unit tests update by @vblagoje in #2634
Add GPL adaptation tutorial by @vblagoje in #2632
GPL tutorial - add GPU header and open in colab button by @vblagoje in #2736
Add execute_eval_run example to Tutorial 5 by @tstadel in #2459
Tutorial 14 edit by @robpasternak in #2663

Misc

Replace question issue with link to discussions by @masci in #2697
Upgrade transformers to 4.20.1 by @julian-risch in #2702
Upgrade torch to 1.12 by @julian-risch in #2741
Remove rapidfuzz version pin by @tstadel in #2730

New Contributors

@ryanrussell made their first contribution in #2617
@apohllo made their first contribution in #1902
@robpasternak made their first contribution in #2663
@danielbichuetti made their first contribution in #2710
@francescocastelli made their first contribution in #2740
@deutschmn made their first contribution in https://github.com/deepset-...

Contributors

masci, apohllo, and 16 other contributors

Assets 2

02 Jun 15:37

julian-risch

v1.5.0

4ca331c

v1.5.0

⭐ Highlights

Generative Pseudo Labeling

Dense retrievers excel when finetuned on a labeled dataset of the target domain. However, such datasets rarely exist and are costly to create from scratch with human annotators. Generative Pseudo Labeling solves this dilemma by creating labels automatically for you, which makes it a super fast and low-cost alternative to manual annotation. Technically speaking, it is an unsupervised approach for domain adaptation of dense retrieval models. Given a corpus of unlabeled documents from that domain, it automatically generates queries on that corpus and then uses a cross-encoder model to create pseudo labels for these queries. The pseudo labels can be used to adapt retriever models that domain. Here is a code example that shows how to do that in Haystack:

from haystack.nodes.retriever import EmbeddingRetriever
from haystack.document_stores import InMemoryDocumentStore
from haystack.nodes.question_generator.question_generator import QuestionGenerator
from haystack.nodes.label_generator.pseudo_label_generator import PseudoLabelGenerator

# Initialize any document store and fill it with documents from your domain - no labels needed.
document_store = InMemoryDocumentStore()
document_store.write_documents(...) 

# Calculate and store a dense embedding for each document
retriever = EmbeddingRetriever(document_store=document_store, 
                               embedding_model="sentence-transformers/msmarco-distilbert-base-tas-b", 
                               max_seq_len=200)
document_store.update_embeddings(retriever)

# Use the new PseudoLabelGenerator to automatically generate labels and train the retriever on them
qg = QuestionGenerator(model_name_or_path="doc2query/msmarco-t5-base-v1", max_length=64, split_length=200, batch_size=12)
psg = PseudoLabelGenerator(qg, retriever)
output, _ = psg.run(documents=document_store.get_all_documents()) 
retriever.train(output["gpl_labels"])

#2388

Batch Processing with Query Pipelines

Every query pipeline now has a run_batch() method, which allows to pass multiple queries to the pipeline at once.
Together with a list of queries, you can either provide a single list of documents or a list of lists of documents. In the first case, answers are returned for each query-document pair. In the second case, each query is applied to its corresponding list of documents based on same index in the list. A third option is to have a list containing a single query, which is then applied to each list of documents separately.
Here is an example with a pipeline:

from haystack.pipelines import ExtractiveQAPipeline
...
pipe = ExtractiveQAPipeline(reader, retriever)
predictions = pipe.pipeline.run_batch(
        queries=["Who is the father of Arya Stark?","Who is the mother of Arya Stark?"], params={"Retriever": {"top_k": 10}, "Reader": {"top_k": 5}}
    )

And here is an example with a single reader node:

from haystack.nodes import FARMReader
from haystack.schema import Document

FARMReader.predict_batch(
    queries=["1st sample query", "2nd sample query"]
    documents=[[Document(content="sample doc1"), Document(content="sample doc2")], [Document(content="sample doc3"), Document(content="sample doc4")]]

{"queries": ["1st sample query", "2nd sample query"], "answers": [[Answers from doc1 and doc2], [Answers from doc3 and doc4]], ...]}

#2481 #2575

Pipeline Evaluation with Advanced Label Scopes

Typically, a predicted answer is considered correct if it matches the gold answer in the set of evaluation labels. Similarly, a retrieved document is considered correct if its ID matches the gold document ID in the labels. Sometimes however, these simple definitions of "correctness" are not sufficient and you want to further specify the "scope" within which an answer or a document is considered correct.
For this reason, EvaluationResult.calculate_metrics() accepts the parameters answer_scope and document_scope.

As an example, you might consider an answer to be correct only if it stems from a specific context of surrounding words. You can specify answer_scope="context" in calculate_metrics() in that case. See the updated docstrings with a description of the different label scopes or the updated tutorial on evaluation.

...
document_store.add_eval_data(
        filename="data/tutorial5/nq_dev_subset_v2.json",
        preprocessor=preprocessor,
    )
...
eval_labels = document_store.get_all_labels_aggregated(drop_negative_labels=True, drop_no_answers=True)
eval_result = pipeline.eval(labels=eval_labels, params={"Retriever": {"top_k": 5}})
metrics = eval_result.calculate_metrics(answer_scope="context")
print(f'Reader - F1-Score: {metrics["Reader"]["f1"]}')

#2482

Support of DeBERTa Models

Haystack now supports DeBERTa models! These kind of models come with some smart architectural improvements over BERT and RoBERTa, such as encoding the relative and absolute position of a token in the input sequence. Only the following three lines are needed to train a DeBERTa reader model on the SQuAD 2.0 dataset. And compared to a RoBERTa model trained on that dataset, you can expect a boost in F1-score from ~84% to ~88% ("microsoft/deberta-v3-large" even gets you to an F1-score as high as ~92%).

from haystack.nodes import FARMReader
reader = FARMReader(model_name_or_path="microsoft/deberta-v3-base")
reader.train(data_dir="data/squad20", train_filename="train-v2.0.json", dev_filename="dev-v2.0.json", save_dir="my_model")

#2097

⚠️ Breaking Changes

Validation for Ray pipelines by @ZanSara in #2545
Add run_batch method to all nodes and Pipeline to allow batch querying by @bogdankostic in #2481
Support context matching in pipeline.eval() by @tstadel in #2482

Other Changes

Pipeline

Add sort arg to JoinAnswers by @brandenchan in #2436
Update run() and run_batch() params descriptions in API by @agnieszka-m in #2542
[CI refactoring] Avoid ray==1.12.0 on Windows by @ZanSara in #2562
Prevent losing names of utilized components when loaded from config by @tstadel in #2525
Do not copy _component_config in get_components_definitions by @ZanSara in #2574
Add run_batch for standard pipelines by @bogdankostic in #2595
Fix Pipeline.get_config() for forked pipelines by @tstadel in #2616
Remove wrong retriever top_1 metrics from print_eval_report by @tstadel in #2510
Handle transformers pipeline flattening lists of length 1 by @MichelBartels in #2531
Fix pipeline.eval with context matching for Table-QA by @tstadel in #2597
set top_k to 5 in SAS to be consistent by @ClaMnc in #2550

DocumentStores

Make DeepsetCloudDocumentStore work with non-existing index by @bogdankostic in #2513
[Weaviate] Exit the while loop when we query less documents than available by @masci in #2537
Fix knn params for aws managed opensearch by @tstadel in #2581
Fix number of returned values in get_metadata_values_by_key by @bogdankostic in #2614

Retriever

Simplify loading of EmbeddingRetriever by @bogdankostic in #2619
Add training checkpoint in retriever trainer by @dimitrisna in #2543
Include meta data when computing embeddings in EmbeddingRetriever by @MichelBartels in #2559

Documentation

fix small typo in Document doc string by @galtay in #2520
rearrange contributing guidelines by @masci in #2515
Documenting output score of JoinDocuments when using concatenation by @MichelBartels in #2561
Minor lg updates to doc strings by @agnieszka-m in #2585
Adjust pydoc markdown config so methods shown with classes by @brandenchan in #2511
Update Ray pipeline docs with validation info by @agnieszka-m in #2590

Other Changes

Upgrade transformers version to 4.18.0 by @bogdankostic in #2514
Upgrade torch version to 1.11 by @bogdankostic in #2538
Fix tutorials 4, 7 and 8 by @bogdankostic in #2526
Tutorial1: convert_files_to_dicts --> convert_files_to_docs by @ZanSara in #2546
Fix docker image tag with semantic version for releases by @askainet in https://github.com/deepset-ai/haystack/pull/...

Contributors

masci, askainet, and 11 other contributors

Assets 2

05 May 10:48

julian-risch

v1.4.0

081b886

v1.4.0

⭐ Highlights

Logging Evaluation Results to MLflow

Logging and comparing the evaluation results of multiple different pipeline configurations is much easier now thanks to the newly implemented MLflowTrackingHead. With our public MLflow instance you can log evaluation metrics and metadata about pipeline, evaluation set and corpus. Here is an example log file. If you have your own MLflow instance you can even store the pipeline YAML file and the evaluation set as artifacts. In Haystack, all you need is the execute_eval_run() method:

eval_result = Pipeline.execute_eval_run(
    index_pipeline=index_pipeline,
    query_pipeline=query_pipeline,
    evaluation_set_labels=labels,
    corpus_file_paths=file_paths,
    corpus_file_metas=file_metas,
    experiment_tracking_tool="mlflow",
    experiment_tracking_uri="http://localhost:5000",
    experiment_name="my-query-pipeline-experiment",
    experiment_run_name="run_1",
    pipeline_meta={"name": "my-pipeline-1"},
    evaluation_set_meta={"name": "my-evalset"},
    corpus_meta={"name": "my-corpus"}.
    add_isolated_node_eval=True,
    reuse_index=False
)

#2337

Filtering Answers by Confidence in FARMReader

The FARMReader got a parameter confidence_threshold to filter out predictions below this threshold.
The threshold is disabled by default but can be set between 0 and 1 when initializing the FARMReader:

from haystack.nodes import FARMReader
model = "deepset/roberta-base-squad2"
reader = FARMReader(model, confidence_threshold=0.5)

#2376

Deprecating Milvus1DocumentStore & Renaming ElasticsearchRetriever

The Milvus1DocumentStore is deprecated in favor of the newer Milvus2DocumentStore. Besides big architectural changes that impact performance and reliability Milvus version 2.0 supports the filtering by scalar data types.
For Haystack users this means you can now run a query using vector similarity and filter for some meta data at the same time! See the Milvus documentation for more details if you need to migrate from Milvus1DocumentStore to Milvus2DocumentStore. #2495

The ElasticsearchRetriever node does not only work with the ElasticsearchDocumentStore but also with the OpenSearchDocumentStore and so it is only logical to rename the ElasticsearchRetriever. Now it is called
BM25Retriever after the underlying BM25 ranking function. For the same reason, ElasticsearchFilterOnlyRetriever is now called FilterRetriever. The deprecated names and the new names are both working but we will drop support of the deprecated names in a future release. An overview of the different DocumentStores in Haystack can be found here. #2423 #2461

Fixing Evaluation Discrepancies

The evaluation of pipeline nodes with pipeline.eval(add_isolated_node_eval=True) and alternatively with retriever.eval() and reader.eval() gave slightly different results due to a bug in handling no_answers. This bug is fixed now and all different ways to run the evaluation give the same results. #2381

⚠️ Breaking Changes

Change return types of indexing pipeline nodes by @bogdankostic in #2342
Upgrade weaviate-client to 3.3.3 and fix get_all_documents by @ZanSara in #1895
Align TransformersReader defaults with FARMReader by @julian-risch in #2490
Change default encoding for PDFToTextConverter from Latin 1 to UTF-8 by @ZanSara in #2420
Validate YAML files without loading the nodes by @ZanSara in #2438

Other Changes

Pipeline

Add tests for missing __init__ and super().__init__() in custom nodes by @ZanSara in #2350
Forbid usage of *args and **kwargs in any node's __init__ by @ZanSara in #2362
Change YAML version exception into a warning by @ZanSara in #2385
Make sure that debug=True and params={'debug': True} behaves the same way by @ZanSara in #2442
Add support for positional args in pipeline.get_config() by @tstadel in #2478
enforce same index values before and after saving/loading eval dataframes by @tstadel in #2398

DocumentStores

Fix sparse retrieval with filters returns results without any text-match by @tstadel in #2359
EvaluationSetClient for deepset cloud to fetch evaluation sets and la… by @FHardow in #2345
Update launch script for Milvus from 1.x to 2.x by @ZanSara in #2378
Use ElasticsearchDocumentStore.get_all_documents in ElasticsearchFilterOnlyRetriever.retrieve by @adri1wald in #2151
Fix and use delete_index instead of delete_documents in tests by @tstadel in #2453
Update docs of DeepsetCloudDocumentStore by @tholor in #2460
Add support for aliases in elasticsearch document store by @ZeJ0hn in #2448
fix dot_product metric by @jamescalam in #2494
Deprecate Milvus1DocumentStore by @bogdankostic in #2495
Fix OpenSearchDocumentStore's __init__ by @ZanSara in #2498

Retriever

Rename dataset to evaluation_set when logging to mlflow by @tstadel in #2457
Linearize tables in EmbeddingRetriever by @MichelBartels in #2462
Print warning in EmbeddingRetriever if sentence-transformers model used with different model format by @mpangrazzi in #2377
Add flag to disable scaling scores to probabilities by @tstadel in #2454
changing the name of the retrievers from es_retriever to retriever by @TuanaCelik in #2487
Replace dpr with embeddingretriever tut14 by @mkkuemmel in #2336
Support conjunctive queries in sparse retrieval by @tstadel in #2361
Fix: Auth token not passed for EmbeddingRetriever by @mathislucka in #2404
Pass use_auth_token to sentence transformers EmbeddingRetriever by @MichelBartels in #2284

Reader

Fix TableReader for tables without rows by @bogdankostic in #2369
Match answer sorting in QuestionAnsweringHead with FARMReader by @tstadel in #2414
Fix reader.eval() and reader.eval_on_file() output by @tstadel in #2476
Raise error if torch-scatter is not installed or wrong version is installed by @MichelBartels in #2486

Documentation

Fix link to squad_to_dpr.py in DPR train tutorial by @raphaelmerx in #2334
Add evaluation and document conversion to tutorial 15 by @MichelBartels in #2325
Replace TableTextRetriever with EmbeddingRetriever in Tutorial 15 by @MichelBartels in #2479
Fix RouteDocuments documentation by @MichelBartels in #2380

Other Changes

extract extension based on file's content by @GiannisKitsos in #2330
Reduce num REST API workers to accommodate smaller machines by @brandenchan in #2400
Add devices alongside use_gpu in FARMReader by @ZanSara in #2294
Delete files in docs/_src by @brandenchan in #2322
Add apt update in Linux CI by @ZanSara in #2415
Exclude beir from Windows install by @ZanSara in #2419
Added macos version of xpdf in tutorial 8 by @seduerr91 in #2424
Make python-magic fully optional by @ZanSara in #2412
Upgrade xpdf to 4.0.4 by @tholor in #2443
Update xpdfreader package installation by @AI-Ahmed in #2491

New Contributors

@raphaelmerx made their first contribution in #2334
@FHardow made their first contribution in #2345
@GiannisKitsos made their first contribution in #2330
@mpangrazzi made their first contribution in #2377
@seduerr91 made their first contribution i...

Contributors

mpangrazzi, tholor, and 17 other contributors

Assets 2

23 Mar 16:46

julian-risch

v1.3.0

bf71f03

v1.3.0

⭐ Highlights

Pipeline YAML Syntax Validation

The syntax of pipeline configurations as defined in YAML files can now be validated. If the validation fails, erroneous components/parameters are identified to make it simple to fix them. Here is a code snippet to manually validate a file:

from pathlib import Path
from haystack.pipelines.config import validate_yaml
validate_yaml(Path("rest_api/pipeline/pipelines.haystack-pipeline.yml"))

Your IDE can also take care of the validation when you edit a pipeline YAML file. The suffix *.haystack-pipeline.yml tells your IDE that this YAML contains a Haystack pipeline configuration and enables some checks and autocompletion features if the IDE is configured that way (YAML plugin for VSCode, Configuration Guide for PyCharm). The schema used for validation can be found in SchemaStore pointing to the schema files for the different Haystack versions. Note that an update of the Haystack version might sometimes require to do small changes to the pipeline YAML files. You can set version: 'unstable' in the pipeline YAML to circumvent the validation or set it to the latest Haystack version if the components and parameters that you use are compatible with the latest version. #2226

Pinecone DocumentStore

We added another DocumentStore to Haystack: PineconeDocumentStore! 🎉 Pinecone is a fully managed service for very large scale dense retrieval. To this end, embeddings and metadata are stored in a hosted Pinecone vector database while the document content is stored in a local SQL database. This separation simplifies infrastructure setup and maintenance. In order to use this new document store, all you need is an API key, which you can obtain by creating an account on the Pinecone website. #2254

import os
from haystack.document_stores import PineconeDocumentStore
document_store = PineconeDocumentStore(api_key=os.environ["PINECONE_API_KEY"])

BEIR Integration

Fresh from the 🍻 cellar, Haystack now has an integration with our favorite BEnchmarking Information Retrieval tool BEIR. It contains preprocessed datasets for zero-shot evaluation of retrieval models in 17 different languages, which you can use to benchmark your pipelines. For example, a DocumentSearchPipeline can now be evaluated by calling Pipeline.eval_beir() after having installed Haystack with the BEIR dependency via pip install farm-haystack[beir]. Cheers! #2333

from haystack.pipelines import DocumentSearchPipeline, Pipeline
from haystack.nodes import TextConverter, ElasticsearchRetriever
from haystack.document_stores.elasticsearch import ElasticsearchDocumentStore

text_converter = TextConverter()
document_store = ElasticsearchDocumentStore(search_fields=["content", "name"], index="scifact_beir")
retriever = ElasticsearchRetriever(document_store=document_store, top_k=1000)

index_pipeline = Pipeline()
index_pipeline.add_node(text_converter, name="TextConverter", inputs=["File"])
index_pipeline.add_node(document_store, name="DocumentStore", inputs=["TextConverter"])

query_pipeline = DocumentSearchPipeline(retriever=retriever)

ndcg, _map, recall, precision = Pipeline.eval_beir(
    index_pipeline=index_pipeline, query_pipeline=query_pipeline, dataset="scifact"
)

Breaking Changes

Make Milvus2DocumentStore compatible with pymilvus>=2.0.0 by @MichelBartels in #2126
Set provider parameter when instantiating onnxruntime.InferenceSession and make device a torch.device in internal methods by @cjb06776 in #1976

Pipeline

Generate haystack-pipeline-1.2.0.schema.json by @ZanSara in #2239
Add RouteDocuments and JoinAnswers nodes by @bogdankostic in #2256
Refactor Pipeline peripherals by @tstadel in #2253
Allow to deploy and undeploy Pipelines on Deepset Cloud by @tstadel in #2285
Reintroduce debug as a valid global key for Pipeline's params by @ZanSara in #2298
Replace dpr with embeddingretriever tut11 by @mkkuemmel in #2287
Package JSON schemas properly in Haystack by @ZanSara in #2316
Fix dependency graph for indexing pipelines during codegen by @tstadel in #2311
Fix YAML pipeline paths in docker-compose.yml by @ZanSara in #2335
Improve error message for nodes failing validation by @ZanSara in #2313
Fix Pipeline.print_eval_report by @tstadel in #2271
save_to_deepset_cloud: automatically convert document stores by @tstadel in #2283
Sas gpu additions by @thimo72 in #2308

Models

Update LFQA with the latest LFQA seq2seq and retriever models by @vblagoje in #2210

DocumentStores

Bulk insert in sql document stores by @OmniscienceAcademy in #2264
'os' wrapper to function for brownfield support by @TuanaCelik in #2282
Using default OpenSearch parameters by @TuanaCelik in #2327
Fix docker launch scripts by @tstadel in #2341
Fix normalize_embedding using numba by @tstadel in #2347

Documentation

Update other.yml with new node names by @agnieszka-m in #2286
Bring back init defs to api in v1.2 and latest by @brandenchan in #2296
Remove unneeded files in docs directory by @brandenchan in #2237
change old text to content argument for translator examples by @ju-gu in #2240

Tutorials

Fix tutorial dataset paths by @julian-risch in #2340
Polish Evaluation Tutorial by @brandenchan in #2212
Comment out Milvus cell on Tutorial6 by @ZanSara in #2243
Change document attribute from text to content by @julian-risch in #2352
Replace dpr with embeddingretriever tut5 by @mkkuemmel in #2274
ipynb: inserted links to graph images by @mkkuemmel in #2309

Other Changes

Implement Context Matching by @tstadel in #2293
Fix surrounding context extraction in ParsrConverter by @bogdankostic in #2162
Fix table extraction in ParsrConverter by @bogdankostic in #2262
Api pages by @brandenchan in #2248
fix pip backtracking issue by @tstadel in #2281
Update reader/base.py to fix UnboundLocalError in #2273 by @thimo72 in #2275
Remove substrings basic implementation by @dmigo in #2152
adding quotes for zsh shell issue by @TuanaCelik in #2289
Prevent Preprocessor from changing existing documents by @tstadel in #2297
Fix install because of missing jsonschema dependency by @tstadel in #2315
Add basic telemetry features by @julian-risch in #2314
Let SquadData support data from Annotation Tool by @brandenchan in #2329

New Contributors

@thimo72 made their first contribution in #2275
@agnieszka-m made their first contribution in #2286
@TuanaCelik made their first contribution in #2289
@OmniscienceAcademy made their first contribution in #2264
@jamescalam made their first contribution in #2254
@cjb06776 made their first contribution in #1976

❤️ Big thanks to all contributors and the whole community!

Contributors

vblagoje, julian-risch, and 14 other contributors

Assets 2

23 Feb 16:02

julian-risch

v1.2.0

d21b6a5

v1.2.0

⭐ Highlights

Brownfield Support of Existing Elasticsearch Indices

You have an existing Elasticsearch index from other projects and now want to try out Haystack? The newly added method es_index_to_document_store provides brownfield support of existing Elasticsearch indices by converting each of the records in the provided index to Haystack Document objects and writing them to the specified DocumentStore.

document_store = es_index_to_document_store(
    document_store=InMemoryDocumentStore(), #or any other Haystack DocumentStore
    original_index_name="existing_index",
    original_content_field="content",
    original_name_field="name",
    included_metadata_fields=["date_field"],
    index="new_index",
)

It can even be used on a regular basis in order to add new records of the Elasticsearch index to the DocumentStore! #2229

Tapas Reader With Scores

The new model class TapasForScoredQA introduced in #1997 supports Tapas Reader models that return confidence scores. When you load a Tapas Reader model, Haystack automatically infers whether the model supports confidence scores and chooses the correct model class under the hood. The returned answers are sorted first by a general table score and then by answer span scores. To try it out, just use one of the new TableReader models:

reader = TableReader(model_name_or_path="deepset/tapas-large-nq-reader", max_seq_len=512) #or
reader = TableReader(model_name_or_path="deepset/tapas-large-nq-hn-reader", max_seq_len=512)

Extended Meta Data Filtering

We extended the filter capabilities of all(*) document stores to support more complex filter expressions than previously. Besides simple selections on multiple fields you can now use more complex comparison expressions and connect these using boolean operators. For people having used mongodb the new syntax should look familiar. Filters are defined as nested dictionaries. The keys of the dictionaries can be a logical operator ("$and", "$or", "$not"), a comparison operator ("$eq", "$in", "$gt", "$gte", "$lt", "$lte") or a metadata field name.

Logical operator keys take a dictionary of metadata field names and/or logical operators as value. Metadata field names take a dictionary of comparison operators as value. Comparison operator keys take a single value or (in case of "$in") a list of values as value.

If no logical operator is provided, "$and" is used as default operation.
If no comparison operator is provided, "$eq" (or "$in" if the comparison value is a list) is used as default operation.

Therefore, we don't have any breaking changes and you can keep on using your existing filter expressions.

Example:

filters = {
    "$and": {
        "type": {"$eq": "article"},
        "date": {"$gte": "2015-01-01", "$lt": "2021-01-01"},
        "rating": {"$gte": 3},
        "$or": {
            "genre": {"$in": ["economy", "politics"]},
            "publisher": {"$eq": "nytimes"}
        }
    }
}
# or simpler using default operators
filters = {
    "type": "article",
    "date": {"$gte": "2015-01-01", "$lt": "2021-01-01"},
    "rating": {"$gte": 3},
    "$or": {
        "genre": ["economy", "politics"],
        "publisher": "nytimes"
    }
}

(*) FAISSDocumentStore and MilvusDocumentStore currently do not support filters during search.

Code Style and Linting

In addition to mypy we already had for static type checking, we now use pylint for linting and the Haystack code base does now comply with Black formatting standards. As a result, the code is formatted in a consistent way and easier to read. When you would like to contribute to Haystack you don't need to worry about that though - our CI will automatically format your code changes correctly. Our contributor guidelines give more details in case you would like to run the checks locally. #2115 #2130

Installation with fewer dependencies

Installing Haystack has become easier and faster thanks to optional dependencies. From now on, there is no need to install all dependencies if you don't need them. For example, pip3 install farm-haystack will install the latest release together with only a small subset of packages required for basic Pipelines with an ElasticsearchDocumentStore. As another example, if you are experimenting with FAISSDocumentStore in a colab notebook, you can install Haystack from the master branch together with FAISS dependency by running: !pip install git+https://github.com/deepset-ai/haystack.git#egg=farm-haystack[colab,faiss]. The installation guide reflects these updates and the full list of subsets of dependencies can be found here. Keep in mind, though, that this system works best with pip versions above 22 #1994

⚠️ Known Issues

Installing haystack with all dependencies results in heavy pip backtracking that might never finish.
This is due to a dependency conflict that was introduced by a new release of one of our sub dependencies.
To circumvent this problem install haystack like this:

pip install farm-haystack[all] "azure-core<1.23"

This might also be needed for other non-default dependencies (e.g. farm-haystack[dev] "azure-core<1.23").
See #2280 for more information.

⚠️ Breaking Changes

Improve dependency management by @ZanSara in #1994
Make ui and rest proper packages by @ZanSara in #2098
Add aiorwlock to 'ray' extra & fix maximum version for some dependencies by @ZanSara in #2140

🤓 Detailed Changes

Pipeline

Add top_k_join parameter to JoinDocuments.run by @adri1wald in #2065
✨ Add JSON Schema autogeneration for Pipeline YAML files by @tiangolo in #2020
Make FileTypeClassifier more flexible by @ZanSara in #2101
Query response without answers by @ZanSara in #2161
Generate JSON schema index for Schemastore by @ZanSara in #2225
Fix Pipeline.components by @tstadel in #2215
Join node should allow reciprocal rank fusion as additional merging method by @mathislucka in #2133
Apply filter in eval only if no gold docs are given as input by @julian-risch in #2154
pipeline.save_to_deepset_cloud() by @tstadel in #2145
Fix typo in save_to_deepset_cloud() by @tstadel in #2189
Generate code from pipeline (pipeline.to_code()) by @tstadel in #2214
Allow different filters per query in pipeline evaluation by @julian-risch in #2068
List all pipeline(_configs) on Deepset Cloud by @tstadel in #2102
Evaluating a pipeline consisting only of a reader node by @julian-risch in #2132
DC SDK - load pipeline from deepset cloud by @ArzelaAscoIi in #2013
YAML versioning by @ZanSara in #2209

Models

Add Tapas reader with scores by @bogdankostic in #1997
Fix finetuning notebook augmentation by @MichelBartels in #2071
Fix Seq2SeqGenerator return type by @tstadel in #2099
Distribute intermediate layer distillation loss calculation over multiple GPUs by @MichelBartels in #2090
Do not apply DataParallel twice by @MichelBartels in #2095

DocumentStores

Pin Milvus to <2.0.0 by @ZanSara in #2063
fix: get_documents_by_id should return docs for all passed ids by @mathislucka in #2064
Supported Highlighting in Elasticsearch by @SjSnowball in #1930
pass faiss batch_size to sqldocumentstore by @AhmedIdr in #2061
Fixed the Search Field mapping in ElasticSearch DocumentStore by @SjSnowball in #2080
Provide option to recreate es doc store on initialization by @mathislucka in #2084
Fixed performance bug. Using a list where a set is needed. by @baregawi in #2125
Extend metadata filtering support in ElasticsearchDocumentStore by @bogdankostic in #2108
OpenSearchDocumentStore: Extend similarity support by @tstadel in #2070
Speed up query_by_embedding in InMemoryDocumentStore. by @baregawi in https://github.com/deepset-ai/haystack/p...

Contributors

tiangolo, tholor, and 15 other contributors

Assets 2

Releases: deepset-ai/haystack

v1.9.1rc1

What's Changed

Contributors

v1.9.0

⭐ Highlights

Logging speed set to ludicrous (#3212)

Tutorials moved out! (#3244)

Docker pull deepset/haystack (#3162)

⚠️ Deprecation notice

New Documentation Site and Haystack Website Revamp:

What's Changed

Pipeline

Models

DocumentStores

Tutorials

Other Changes

New Contributors

Contributors

v1.8.0

⭐ Highlights

Pipeline Evaluation in Batch Mode #2942

Early Stopping in Reader and Retriever Training #3071

PineconeDocumentStore Without SQL Database #2749

FAISS in OpenSearchDocumentStore: #3101 #3029

Highlighted Bug Fixes

Other Changes

DocumentStores

Documentation

Crawler

Other Changes

New Contributors

Contributors

v1.7.1

Patch Release

Main Changes

Other Changes

Contributors

v1.7.0

⭐ Highlights

Support for OpenAI GPT-3

Zero-Shot Query Classification

Adding Page Numbers to Document Meta

Gradient Accumulation for FARMReader

Extended Ray Support

Support for Custom Sentence Tokenizers in Preprocessor

Making it Easier to Switch Document Stores

Almost 2x Performance Gain for Electra Reader Models

⚠️ Breaking Changes

⚠️ Breaking Changes for Contributors

Default Branch will be Renamed to main on Tuesday, 16th of August

Pre-Commit Hooks Instead of CI Jobs

Other Changes

Pipelin...

Contributors

v1.6.0

⭐ Highlights

Make Your QA Pipelines Talk with Audio Nodes! (#2584)

Save Models to Remote (#2618)

Multi-Hop Dense Retrieval (#2571)

InMemoryKnowledgeGraph (#2678)

Torch 1.12 and Transformers 4.20.1 Support

Other Changes

Pipeline

Models

DocumentStores

Documentation & Tutorials

Misc

New Contributors

Contributors

v1.5.0

⭐ Highlights

Generative Pseudo Labeling

Batch Processing with Query Pipelines

Pipeline Evaluation with Advanced Label Scopes

Support of DeBERTa Models

⚠️ Breaking Changes

Other Changes

Pipeline

DocumentStores

Default Branch will be Renamed to `main` on Tuesday, 16th of August