Releases: deepset-ai/haystack
v0.8.0
⭐ Highlights
This is a major Haystack release with many new features. The release blog post has a detailed summary. Below are the top highlights:
Milvus Document Store
Milvus is an open-source vector database. With the MilvusDocumentStore contributed by @lalitpagaria, embedding-based Retrievers like the DensePassageRetriever or EmbeddingRetriever can use production-ready Milvus servers for large-scale deployments.
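A minimal sketch of how this could look in a dense retrieval setup (assumes a Milvus server running locally with default settings; exact import paths and defaults may differ, so check the MilvusDocumentStore docs):
from haystack.document_store.milvus import MilvusDocumentStore
from haystack.retriever.dense import DensePassageRetriever

document_store = MilvusDocumentStore()  # connects to a local Milvus server by default
document_store.write_documents([{"text": "Milvus is an open-source vector database.", "meta": {"source": "example"}}])
retriever = DensePassageRetriever(document_store=document_store)
document_store.update_embeddings(retriever)  # embeddings are computed by the retriever and stored in Milvus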
Knowledge Graph
An experimental integration for Knowledge Graphs is introduced using GraphDB. The GraphDBKnowledgeGraph stores Triples and executes SPARQL queries. It can be integrated with the Text2SparqlRetriever to convert natural language queries to SPARQL.
Pipeline configuration with YAML
Pipelines can now be configured with YAML. This enables easier sharing of query & indexing configurations, reproducible setups, A/B testing of Pipelines, and a smoother path from development to production.
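For illustration, loading such a configuration in Python could look like this (file name and pipeline name are placeholders; see rest_api/pipeline.yaml for a complete example):
from pathlib import Path
from haystack.pipeline import Pipeline

pipeline = Pipeline.load_from_yaml(Path("pipeline.yaml"), pipeline_name="query")
res = pipeline.run(query="Why did the revenue change?")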
REST APIs
The REST APIs are revamped to use Pipelines for querying and for indexing files. The YAML configuration is in rest_api/pipeline.yaml. The new API endpoints are more generic to accommodate custom Pipeline configurations.
Confidence Scores
The answers now have a probability score that is better calibrated to the model's confidence. The score ranges from 0 to 1; 0 signifies very low confidence and 1 very high confidence.
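A small sketch of reading the calibrated score from a prediction (assumes an existing ExtractiveQAPipeline named pipe):
prediction = pipe.run(query="Why did the revenue change?", top_k_retriever=10, top_k_reader=5)
for answer in prediction["answers"]:
    print(answer["answer"], answer["probability"])  # probability ranges from 0 to 1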
Web Crawler
A Selenium-based web crawler is now part of Haystack, thanks to @DIVYA-19 for the contribution. It takes a list of URLs as input and converts extracted text to Haystack Documents.
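A rough sketch of how it can be used; note that the module path and parameter names here are assumptions, so check the Crawler documentation for the exact signature:
from haystack.connector import Crawler

crawler = Crawler(output_dir="crawled_files")  # extracted text is written to this directory
docs = crawler.crawl(urls=["https://haystack.deepset.ai"], crawler_depth=1)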
⚠️ Breaking Changes
REST APIs
The REST APIs got a major revamp with this release.
- The /doc-qa & /faq-qa endpoints are replaced with a more generic POST /query endpoint. This new endpoint uses Pipelines under the hood that can be configured at rest_api/pipeline.yaml.
- The new /query endpoint expects a single query per request instead of a list of query strings. The new request format is:
{ "query": "Why did the revenue change?" }
and the response looks like this (a minimal client sketch of the endpoint follows this list):
{
  "query": "Why did the revenue change?",
  "answers": [
    {
      "answer": "rapid technological change and evolving industry standards",
      "question": null,
      "score": 0.543937623500824,
      "probability": 0.014070278964936733,
      "context": "tion process. The market for our products is intensely competitive and is characterized by rapid technological change and evolving industry standards.",
      "offset_start": 91,
      "offset_end": 149,
      "offset_start_in_doc": 511,
      "offset_end_in_doc": 569,
      "document_id": "f30273b2-4d49-40d8-8824-43b3b6a0ea57",
      "meta": { "_split_id": "7" }
    },
    { // other answers }
  ]
}
- The /doc-qa-feedback & /faq-qa-feedback endpoints are replaced with a new generic /feedback endpoint.
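A minimal client sketch for the new /query endpoint (host and port are placeholders for a local deployment of the REST API):
import requests

response = requests.post("http://localhost:8000/query", json={"query": "Why did the revenue change?"})
print(response.json()["answers"][0]["answer"])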
Created At Timestamp
Previously, all documents/labels in the SQLDocumentStore and FAISSDocumentStore had a field called created to store the creation timestamp, while the ElasticsearchDocumentStore did not have any timestamp field. Now, all document stores have a created_at field for documents and labels.
RAGenerator
The top_k_answers parameter in the RAGenerator is renamed to top_k for consistency across Haystack components.
Custom Query for Elasticsearch
The placeholder terms in custom_query should not have quotes around them. See more details here.
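An illustrative sketch of the new placeholder style (assumes an existing ElasticsearchDocumentStore named document_store; the query body itself is only an example):
from haystack.retriever.sparse import ElasticsearchRetriever

custom_query = """{
    "query": {
        "bool": {
            "must": [{"match": {"text": ${query}}}]
        }
    }
}"""
retriever = ElasticsearchRetriever(document_store=document_store, custom_query=custom_query)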
🤓 Detailed Changes
Pipeline
- Fix execution of Pipelines with parallel nodes #901 (@oryx1729)
- Add abstract run method to basecomponent #887 (@tholor)
- Add support for parallel paths in Pipeline #884 (@oryx1729)
- Add runtime parameters to component initialization #873 (@oryx1729 )
- Add support for indexing pipelines #816 (@oryx1729 )
- Adding translator with many generic input parameter support #782 (@lalitpagaria)
- Fix building Pipeline with YAML #800 (@oryx1729)
- Load Pipeline with YAML config file #785 (@oryx1729)
- Add evaluation nodes for Pipelines #904 (@brandenchan)
- Fix passing a list as parameter value in Pipeline YAML #952 (@oryx1729)
Document Store
- Fixes elasticsearch auth #871 (@grafke)
- Allow more options for elasticsearch client (auth, multiple hosts) #845 (@tholor)
- Fix ElasticsearchDocumentStore.query_by_embedding() #823 (@oryx1729)
- Introduce incremental updates for embeddings in document stores #812 (@oryx1729)
- Add method to get metadata values for a key from Elasticsearch #776 (@oryx1729)
- Fix refresh behaviour for Elasticsearch delete #794 (@oryx1729)
- Milvus integration #771 (@lalitpagaria)
- Add flag for use of window queries in SQLDocumentStore #768 (@oryx1729)
- Remove quotes around placeholders in Elasticsearch custom query #762 (@oryx1729)
- Fix delete_all_documents for the SQLDocumentStore #761 (@oryx1729)
Retriever
- Improve dpr conversion #826 (@Timoeller)
- Fix DPR training batch size #898 (@brandenchan)
- Upgrade FAISS to 1.7.0 #834 (@tholor)
- Allow non-standard Tokenizers (e.g. CamemBERT) for DPR via new arg #811 (@psorianom)
Modeling
- Add model versioning support #784 (@brandenchan)
- Improve preprocessing and adding of eval data #780 (@Timoeller)
- SQuAD to DPR dataset converter #765 (@psorianom)
- Remove RAG todos after transformers update #781 (@Timoeller)
- Update farm version #936 (@Timoeller)
REST API
- Refactor REST APIs to use Pipelines #922 (@oryx1729)
- Add PDF converter in Dockerfiles #877 (@oryx1729)
- Update GPU Dockerimage (Cuda 11, Fix faiss) #836 (@tholor)
- Add API endpoint to export accuracy metrics from user feedback + created_at timestamp #803 (@tholor)
- Fix file upload API #808 (@oryx1729)
File Converter
- Add Markdown file convertor #875 (@lalitpagaria)
- Fix encoding for pdftotext (Russian characters, German umlauts etc). Fix version in download instructions #813 (@tholor)
Crawler
Knowledge Graph
- knowledge graph example #934 (@julian-risch)
Annotation Tool
- Annotation Tool: data is not persisted when using local version #853 #855 (@venuraja79)
Search UI
CI
- Revamp CI #825 (@oryx1729)
- Fix mypy typing #792 (@oryx1729)
- Fix pdftotext dependency in CI #788 (@tholor)
Misc Fixes
- Adding indentation to markup files #947 (@julian-risch)
- Reduce precision in pipeline eval print functions #943 (@lewtun)
- Fix division by zero error in EvalRetriever #938 (@lewtun)
- Logged warning in Faiss and Milvus for filters #913 (@peteradorjan)
- fixed "cannot allocate memory" exception by specifying max_processes #910(@mosheber)
- Fix error when is_impossible not exist [#870](https://github.com/deepset-ai/haystack/pu...
v0.7.0
⭐ Highlights
New Slack Channel
As many people in the community asked us for it, we decided to open a Slack channel!
Join us to ask questions, show what you've built with Haystack, and exchange ideas with like-minded folks!
👉 https://haystack.deepset.ai/community/join
Optimizing Memory + CPU consumption of document stores for large datasets (#733)
Interacting with large datasets can be challenging for the local memory. Therefore, we ...
- ... add batch_size parameters for most methods of the document store that allow loading only smaller chunks of documents at a time
- ... add a get_all_documents_generator() method that "streams" documents one by one from your document store.
Both help to lower the memory footprint significantly, especially when calling methods like update_embeddings() on datasets with more than 1 million docs.
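A small sketch of both options (assumes an already populated document store and an initialized retriever; exact defaults may differ):
# stream documents one by one instead of loading everything into memory
for doc in document_store.get_all_documents_generator(batch_size=1_000):
    print(doc.id)

# compute and store embeddings in chunks
document_store.update_embeddings(retriever, batch_size=10_000)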
Add Simple Demo UI (#671)
Thanks to our community member @tanmaylaud, we now have a great and simple UI that allows you to easily try your search pipelines. Ask questions, see the results, change basic config params, debug the API response and give your colleagues a better flavor of what you are building ...
Support for summarization models (#698)
Thanks to another community contribution from @lalitpagaria we now also support summarization models like PEGASUS in Haystack. You can use them ...
... standalone:
docs = [Document(text="PG&E stated it scheduled the blackouts in response to forecasts for high winds amid dry conditions.
The aim is to reduce the risk of wildfires. Nearly 800 thousand customers were scheduled to be affected by
the shutoffs which were expected to last through at least midday tomorrow.")]
summarizer = TransformersSummarizer(model_name_or_path="google/pegasus-xsum")
summary = summarizer.predict(documents=docs, generate_single_summary=False)
... as a node in your pipeline:
...
pipeline.add_node(component=summarizer, name="Summarizer", inputs=["Retriever"])
... by simply calling a predefined pipeline that first retrieves and then summarizes the resulting docs:
...
pipe = SearchSummarizationPipeline(summarizer=summarizer, retriever=retriever)
pipe.run(query="What did PG&E announce?")
We see many interesting use cases around search for it. For example, running semantic document search and displaying the summary of docs as a "preview" in the results.
New Tutorials
- Wonder how to train a DPR retriever on your own domain dataset? Check out this new tutorial!
- Proper preprocessing (Cleaning, Splitting etc.) of docs can have a big impact on your performance. Check out this new tutorial to learn more about it.
⚠️ Breaking Changes
Dropping index_buffer_size from FAISSDocumentStore
We removed the arg index_buffer_size from the init of FAISSDocumentStore. "Buffering" is now handled via the new batch_size arguments that you can pass to most methods like write_documents(), update_embeddings() and get_all_documents().
Renaming of Preprocessor arg
Old:
PreProcessor(..., split_stride=5)
New:
PreProcessor(..., split_overlap=5)
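For context, a small sketch of the renamed argument in a typical setup (the other parameters are common choices, not requirements, and the import path may differ slightly in your version):
from haystack.preprocessor import PreProcessor

processor = PreProcessor(split_by="word", split_length=200, split_overlap=5)
docs = processor.process(my_document)  # my_document is a dict like {"text": "...", "meta": {...}}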
🤓 Detailed Changes
Preprocessing / File Conversion
- Using PreProcessor functions on eval data #751
DocumentStore
- Support filters for DensePassageRetriever + InMemoryDocumentStore #754
- use Path class in add_eval_data of haystack.document_store.base.py #745
- Make batchwise adding of evaluation data possible #717
- Change signature and docstring for ca_certs parameter #730
- Rename label id field for elastic & add UPDATE_EXISTING_DOCUMENTS to API config #728
- Fix SQLite errors in tests #723
- Add support for custom embedding field for InMemoryDocumentStore #640
- Using Columns names instead of ORM to get all documents #620
Other
- Generate docstrings and deploy to branches to Staging (Website) #731
- Script for releasing docs #736
- Increase FARM to Version 0.6.2 #755
- Reduce memory consumption of fetch_archive_from_http #737
- Add links to more resources #746
- Fix Tutorial 9 #734
- Adding a guard that prevents the tutorial from being executed in every subprocess on windows #729
- Add ID to Label schema #727
- Automate docstring and tutorial generation with every push to master #718
- Pass custom label index name to REST API #724
- Correcting pypi download badge #722
- Fix GPU docker build #703
- Remove sourcerer.io widget #702
- Haystack logo is not visible on github mobile app #697
- Update pipeline documentation and readme #693
- Enable GPU args in tutorials #692
- Add docs v0.6.0 #689
Big thanks to all contributors ❤️ !
@Rob192 @antoniolanza1996 @tanmaylaud @lalitpagaria @Timoeller @tanaysoni @bogdankostic @aantti @brandenchan @PiffPaffM @julian-risch
v0.6.0
⭐ Highlights
Flexible Pipelines powered by DAGs (#596)
In order to build modern search pipelines, you need two things: powerful building blocks and a flexible way to stick them together.
While we always had great building blocks in Haystack, we didn't have a good way to stick them together so far. That's why we put a lot of thought into it in the last weeks and came up with a new Pipeline class that enables many new search scenarios beyond QA. The core idea: you can build a Directed Acyclic Graph (DAG) where each node is one "building block" (Reader, Retriever, Generator ...). Here's a simple example for a "standard" Open-Domain QA Pipeline:
p = Pipeline()
p.add_node(component=retriever, name="ESRetriever1", inputs=["Query"])
p.add_node(component=reader, name="QAReader", inputs=["ESRetriever1"])
res = p.run(query="What did Einstein work on?", top_k_retriever=1)
You can draw the DAG to better inspect what you are building:
p.draw(path="custom_pipe.png")
Multiple retrievers
You can now also use multiple Retrievers and join their results:
p = Pipeline()
p.add_node(component=es_retriever, name="ESRetriever", inputs=["Query"])
p.add_node(component=dpr_retriever, name="DPRRetriever", inputs=["Query"])
p.add_node(component=JoinDocuments(join_mode="concatenate"), name="JoinResults", inputs=["ESRetriever", "DPRRetriever"])
p.add_node(component=reader, name="QAReader", inputs=["JoinResults"])
res = p.run(query="What did Einstein work on?", top_k_retriever=1)
Custom nodes
You can easily build your own custom nodes. Just respect the following requirements:
- Add a method run(self, **kwargs) to your class. **kwargs will contain the output from the previous node in your graph.
- Do whatever you want within run() (e.g. reformatting the query).
- Return a tuple that contains your output data (for the next node) and the name of the outgoing edge, e.g. (output_dict, "output_1").
- Add a class attribute outgoing_edges = 1 that defines the number of output options from your node. You only need a higher number here if you have a decision node (see below).
Decision nodes
Or you can add decision nodes where only one "branch" is executed afterwards. This allows, for example, to classify an incoming query and depending on the result routing it to different modules:
class QueryClassifier():
    outgoing_edges = 2

    def run(self, **kwargs):
        if "?" in kwargs["query"]:
            return (kwargs, "output_1")
        else:
            return (kwargs, "output_2")

pipe = Pipeline()
pipe.add_node(component=QueryClassifier(), name="QueryClassifier", inputs=["Query"])
pipe.add_node(component=es_retriever, name="ESRetriever", inputs=["QueryClassifier.output_1"])
pipe.add_node(component=dpr_retriever, name="DPRRetriever", inputs=["QueryClassifier.output_2"])
pipe.add_node(component=JoinDocuments(join_mode="concatenate"), name="JoinResults",
              inputs=["ESRetriever", "DPRRetriever"])
pipe.add_node(component=reader, name="QAReader", inputs=["JoinResults"])
res = pipe.run(query="What did Einstein work on?", top_k_retriever=1)
Default Pipelines (replacing the "Finder")
Last but not least, we added some "Default Pipelines" that allow you to run standard patterns with very few lines of code.
This replaces the Finder class, which is now deprecated.
from haystack.pipeline import DocumentSearchPipeline, ExtractiveQAPipeline, GenerativeQAPipeline, FAQPipeline, Pipeline, JoinDocuments
# Extractive QA
qa_pipe = ExtractiveQAPipeline(reader=reader, retriever=retriever)
res = qa_pipe.run(query="When was Kant born?", top_k_retriever=3, top_k_reader=5)
# Document Search
doc_pipe = DocumentSearchPipeline(retriever=retriever)
res = doc_pipe.run(query="Physics Einstein", top_k_retriever=1)
# Generative QA
doc_pipe = GenerativeQAPipeline(generator=rag_generator, retriever=retriever)
res = doc_pipe.run(query="Physics Einstein", top_k_retriever=1)
# FAQ based QA
doc_pipe = FAQPipeline(retriever=retriever)
res = doc_pipe.run(query="How can I change my address?", top_k_retriever=3)
We plan many more features around the new pipelines incl. parallelized execution, distributed execution, definition via YAML files, dry runs ...
New DocumentStore for the Open Distro of Elasticsearch (#676)
From now on we also support the Open Distro of Elasticsearch. This allows you to use many of the hosted Elasticsearch services (e.g. from AWS) more easily with Haystack. Usage is similar to the regular ElasticsearchDocumentStore:
document_store = OpenDistroElasticsearchDocumentStore(host="localhost", port="9200", ...)
⚠️ Breaking Changes
As Haystack is extending from QA to further search types, we decided to rename all parameters from question to query. This includes, for example, the predict() methods of the Readers, but also several other places. See #614 for details.
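A small sketch of what the renaming looks like for a Reader (illustrative; see #614 for all affected methods):
# old
result = reader.predict(question="Who is the father of Arya Stark?", documents=docs, top_k=5)
# new
result = reader.predict(query="Who is the father of Arya Stark?", documents=docs, top_k=5)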
🤓 Detailed Changes
Preprocessing / File Conversion
- Redone: Fix concatenation of sentences in PreProcessor. Add stride for word-based splits with sentence boundaries #641
- Add needed whitespace before sentence start #582
DocumentStore
- Scale dot product into probabilities #667
- Add refresh_type param for Elasticsearch update_embeddings() #630
- Add return_embedding parameter for get_all_documents() #615
- Adding support for update_existing_documents to sql and faiss document stores #584
- Add filters for delete_all_documents() #591
Retriever
- Fix saving tokenizers in DPR training + unify save and load dirs #682
- fix a typo, num_negatives -> num_positives #681
- Refactor DensePassageRetriever._get_predictions #642
- Move DPR embeddings from GPU to CPU straight away #618
- Add MAP retriever metric for open-domain case #572
Reader / Generator
- add GPU support for rag #669
- Enable dynamic parameter updates for the FARMReader #650
- Add option in API Config to configure if reader can return "No Answer" #609
- Fix various generator issues #590
Pipeline
- Add support for building custom Search Pipelines #596
- Add set_node() for Pipeline #659
- Add support for aggregating scores in JoinDocuments node #683
- Add pipelines for GenerativeQA & FAQs #645
Other
- Cleanup Pytest Fixtures #639
- Add latest benchmark run #652
- Fix image links in tutorial #663
- Update query arg in Tutorial 7 #656
- Fix benchmarks #648
- Add link to FAISS Info in documentation #643
- Improve User Feedback Documentation #539
- Add formatting checks for shell scripts #627
- Update md files for API docs #631
- Clean API docs and increase coverage #621
- Add boxes for recommendations #629
- Automate benchmarks via CML #518
- Add contributor hall of fame #628
- README: Fix link to roadmap #626
- Fix docstring examples #604
- Cleaning the api docs #616
- Fix link to DocumentStore page #613
- Make more changes to documentation #578
- Remove column in benchmark website #608
- Make benchmarks clearer #606
- Fixing defaults configs for rest_apis #583
- Allow list of filter values in REST API #568
- Fix CI bug due to new Elasticsearch release and new model release #579
- Update Colab Torch Version [#576](https://github.com/deepset...
v0.5.0
Highlights
💬 Generative Question Answering via RAG (#484)
Thanks to our community member @lalitpagaria, Haystack now also supports generative QA via Retrieval Augmented Generation ("RAG").
Instead of "finding" the answer within a document, these models generate the answer. In that sense, RAG follows a similar approach to GPT-3, but it comes with two huge advantages for real-world applications:
a) it has a manageable model size
b) the answer generation is conditioned on retrieved documents, i.e. the model can easily adjust to domain documents even after training has finished (in contrast: GPT-3 relies on the web data seen during training)
Example:
question = "who got the first nobel prize in physics?"
# Retrieve related documents from retriever
retrieved_docs = retriever.retrieve(query=question)
# Now generate answer from question and retrieved documents
predicted_result = generator.predict(
    question=question,
    documents=retrieved_docs,
    top_k=1
)
You can already play around with it in this minimal tutorial.
We are looking forward to improving this class of models further in the next months and already plan a tighter integration into the Finder class.
↗️ Better DPR (incl. training) (#527)
We migrated the existing DensePassageRetriever to our own pipeline based on FARM. This allows better modularization and, most importantly, simple training of DPR models! You can either train models from scratch or take an existing DPR model and fine-tune it on your own domain data. The required training data consists of queries and positive passages (i.e. passages that are related to your query / contain the answer), and the format complies with the one in the original DPR codebase.
Example:
dense_passage_retriever.train(data_dir: str,
                              train_filename: str,
                              dev_filename: str = None,
                              test_filename: str = None,
                              batch_size: int = 16,
                              embed_title: bool = True,
                              num_hard_negatives: int = 1,
                              n_epochs: int = 3)
Future improvements: At the moment training is only supported on single GPUs. We will add support for Multi-GPU Training via DDP soon.
📊 New Benchmarks
Happy to introduce a new benchmark section on our website!
Do you wonder if you should use BERT, RoBERTa or MiniLM for your reader? Is it worth using DPR for retrieval instead of Elastic's BM25? How would this impact speed and accuracy?
See the relevant metrics here to guide your decision:
👉 https://haystack.deepset.ai/bm/benchmarks
We will extend this section over time with more models, metrics and key parameters.
⚠️ Breaking Changes
Consistent parameter naming for TransformersReader #510
# old
TransformersReader(model="distilbert-base-uncased-distilled-squad", ...)
# new
TransformersReader(model_name_or_path="distilbert-base-uncased-distilled-squad", ...)
FAISS: Remove phi normalization, support more index types #467
New default index type is "Flat" and params have changed slightly:
# old
FAISSDocumentStore(
    sql_url: str = "sqlite:///",
    index_buffer_size: int = 10_000,
    vector_size: int = 768,
    faiss_index: Optional[IndexHNSWFlat] = None,
)

# new
FAISSDocumentStore(
    sql_url: str = "sqlite:///",
    index_buffer_size: int = 10_000,
    vector_dim: int = 768,
    faiss_index_factory_str: str = "Flat",
    faiss_index: Optional[faiss.swigfaiss.Index] = None,
    return_embedding: Optional[bool] = True,
    **kwargs,
)
DPR signature
Splitting max_seq_len into two independent params: max_seq_len_query and max_seq_len_passage. Removing the remove_sep_tok_from_untitled_passages param.
# old
DensePassageRetriever(
document_store: BaseDocumentStore,
query_embedding_model: str = "facebook/dpr-question_encoder-single-nq-base",
passage_embedding_model: str = "facebook/dpr-ctx_encoder-single-nq-base",
max_seq_len: int = 256,
use_gpu: bool = True,
batch_size: int = 16,
embed_title: bool = True,
remove_sep_tok_from_untitled_passages: bool = True
)
# new
DensePassageRetriever(
document_store: BaseDocumentStore,
query_embedding_model: Union[Path, str] = "facebook/dpr-question_encoder-single-nq-base",
passage_embedding_model: Union[Path, str] = "facebook/dpr-ctx_encoder-single-nq-base",
max_seq_len_query: int = 64,
max_seq_len_passage: int = 256,
use_gpu: bool = True,
batch_size: int = 16,
embed_title: bool = True,
use_fast_tokenizers: bool = True,
similarity_function: str = "dot_product"
):
Detailed Changes
Preprocessing / File Conversion
- Add preprocessing pipeline #473
- Restructure checks in PreProcessor #504
- Updated the example code to Indexing PDF / Docx files #502
- Fix meta data = None in PreProcessor #496
- add explicit encoding mode to file_converter/txt.py #478
- Skip file conversion if file type is not supported #456
DocumentStore
- Add support for MySQL database #556
- Allow configuration of Elasticsearch Analyzer (e.g. for other languages) #554
- Add support to return embedding #514
- Fix scoring in Elasticsearch for dot product #517
- Allow filters for get_document_count() #512
- Make creation of label index optional #490
- Fix update_embeddings function in FAISSDocumentStore #481
- FAISS Store: allow multiple write calls and fix potential memory leak in update_embeddings #422
- Enable bulk operations on vector IDs for FAISSDocumentStore #460
- fixing ElasticsearchDocumentStore initialisation #415
- bug: filters on a query_by_embedding #464
Retriever
- DensePassageRetriever: Add Training, Refactor Inference to FARM modules #527
- Fix retriever evaluation metrics #547
- Add save and load method for DPR #550
- Typo in dense.py comment #545
- Make returning predictions in Finder & Retriever eval() possible #524
- Make title info optional when evaluating on QA data #494
- Make sentence-transformers usage more user-friendly #439
Reader
- Fix FARMReader.eval() handling of no_answers #531
- Added automatic mixed precision (AMP) support for reader training from Haystack side #463
- Update ONNX conversion for FARMReader #438
Other
- Fix sentencepiece dependencies in Dockerfiles #553
- Update Dockerfile #537
- Removing (deprecated) warnings from the Haystack codebase. #530
- Pytest fix memory leak and put pytest marker on slow tests #520
- [enhancement] Create deploy_website.yml #450
- Add Docker Images & Setup for the Annotation Tool #444
REST API
- Make filter value optional in REST API #497
- Add Elasticsearch Query DSL compliant Query API #471
- Allow configuration of log level in REST API #541
- Add create_index and similarity metric to api config #493
- Add deepcopy for meta dicts in answers #485
- Fix windows platform installation #480
- Update GPU docker & fix race condition with multiple workers #436
Documentation / Benchmarks / Tutorials
- New readme #534
- Add ...
v0.4.0
Highlights
💥 New Project Website & Documentation
As the project is growing, we have more and more content that doesn't fit in GitHub.
In this first version of the website, we focused on documentation incl. quick start, usage guides and the API reference.
In the future, we plan to extend this with benchmarks, FAQs, use cases, and other content that helps you to build your QA system.
👉 https://haystack.deepset.ai
📈 Scalable dense retrieval: FAISSDocumentStore
With recent performance gains of dense retrieval methods (learn more about them here), we need document stores that efficiently store vectors and find the most similar ones at query time. While Elasticsearch can also handle vectors, it quickly reaches its limits when dealing with larger datasets. We evaluated a couple of projects (FAISS, Scann, Milvus, Jina ...) that specialize in approximate nearest neighbour (ANN) algorithms for vector similarity. We decided to implement FAISS as it's easy to run in most environments.
We will likely add one of the heavier solutions (e.g. Jina or Milvus) later this year.
The FAISSDocumentStore uses FAISS to handle embeddings and SQL to store the actual texts and meta data.
Usage:
document_store = FAISSDocumentStore(sql_url="sqlite:///",  # SQL DB for text + meta data
                                    vector_size=768)       # dimensionality of your embeddings
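A typical indexing flow with this store could then look like this (a sketch; assumes dicts is a list of {"text": ..., "meta": ...} dicts and retriever is an embedding-based retriever such as DensePassageRetriever):
document_store.write_documents(dicts)        # store texts + meta data in SQL
document_store.update_embeddings(retriever)  # compute embeddings and add them to the FAISS index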
📃 More input file formats: Apache Tika File Converter (#314 )
Thanks to @dany-nonstop you can now extract text from many file formats (docx, pptx, html, epub, odf ...) via Apache Tika.
Usage:
- Start Apache Tika Server
docker run -d -p 9998:9998 apache/tika
- Do Conversion in Haystack
tika_converter = TikaConverter(
tika_url = "http://localhost:9998/tika",
remove_numeric_tables = False,
remove_whitespace = False,
remove_empty_lines = False,
remove_header_footer = False,
valid_languages = None,
)
>>> dict = tika_converter.convert(file_path=Path("test/samples/pdf/sample_pdf_1.pdf"))
>>> dict
{
    "text": "everything on page one \f then page two \f ...",
    "meta": {"Content-Type": "application/pdf", "Creation-Date": "2020-06-02T12:27:28Z", ...}
}
Breaking changes
Restructuring / Renaming of modules (Breaking changes!) (#379)
We've restructured the package to make the usage more intuitive and terminology more consistent.
- Rename database module -> document_store
- Split indexing module into -> file_converter and preprocessor
- Move Document, Label and MultiLabel classes into -> schema and simplify the import to from haystack import Document, Label, MultiLabel
File converters (#393)
Refactoring major parts of the file converters. Not returning pages anymore, but rather adding page break symbols that can be accessed further down the pipeline.
Old:
>>> pages, meta = Fileconverter.extract_pages(file_path=Path("..."))
New:
>>> dict = Fileconverter.convert(file_path="...", meta={"name": "some_name", "category": "news"})
>>> dict
{
    "text": "everything on page one \f then page two \f ...",
    "meta": {"name": "..."}
}
DensePassageRetriever (#308)
Refactored from the original FB code to the transformers code base; the models are now loaded from the Hugging Face model hub.
Signature has therefore changed to:
retriever = DensePassageRetriever(document_store=document_store,
query_embedding_model="facebook/dpr-question_encoder-single-nq-base",
passage_embedding_model="facebook/dpr-ctx_encoder-single-nq-base",
use_gpu=True,
embed_title=True,
remove_sep_tok_from_untitled_passages=True)
Deprecate Tags for Document Stores (#286)
We removed the "tags" field that in the past could be associated with Documents and used for filtering your search.
Insead, we use now the more general concept of "meta", where you can supply any custom fields and filter for them at runtime
Old:
dict = {"text": "some", "tags": ["category1", "category2"]}
New
dict = {"text": "some", "meta": {"category": ["1", "2"] }}
Details
Document Stores
- Add FAISS Document Store #253
- Fix type casting for vectors in FAISS #399
- Fix duplicate vector ids in FAISS #395
- Fix document filtering in SQLDocumentStore #396
- Move retriever probability calculations to document_store #389
- Add FAISS query scores #368
- Raise Exception if filters used for FAISSDocumentStore query #338
- Add refresh_type arg to ElasticsearchDocumentStore #326
- Improve speed for SQLDocumentStore #330
- Fix indexing of metadata for FAISS/SQL Document Store #310
- Ensure exact match when filtering by meta in Elasticsearch #311
- Deprecate Tags for Document Stores #286
- Add option to update existing documents when indexing #285
- Cast document_ids as strings #284
- Add method to update meta fields for documents in Elasticsearch #242
- Custom mapping write doc fix #297
Retriever
- DPR (Dense Retriever) for InMemoryDocumentStore #316 #332
- Refactor DPR from FB to Transformers codebase #308
- Restructure update embeddings #304
- Added title during DPR passage embedding && ElasticsearchDocumentStore #298
- Add eval for Dense Passage Retriever & Refactor handling of labels/feedback #243
- Fix type of query_emb in DPR.retrieve() #247
- Fix return type of EmbeddingRetriever to numpy array #245
Reader
- More robust Reader eval by limiting max answers and creating no answer labels #331
- Aggregate multiple no answers in MultiLabel #324
- Add "no answer" aggregation to Transformersreader #259
- Align TransformersReader with FARMReader #319
- Datasilo use all cores for preprocessing #303
- Batch prediction in evaluation #137
- Aggregate label objects for same questions #292
- Add num_processes to reader.train() to configure multiprocessing #271
- Added support for unanswerable questions in TransformersReader #258
Preprocessing
Finder
- Add index arg to Finder.get_answers() and _via_similar_questions() #362
Documentation
- Create documentation website #272
- Use port 8000 in documentation #357
- Documentation #343
- Convert Documentation to markdown #386
- Add logo to readme #384
- Refactor the DPR tutorial to use FAISS #317
- Make Tutorials Work on Colab GPUs #322
Other
- Exclude embedding fields from the REST API #390
- Fix test suite dependency issue on MacOS #374
- Add Gunicorn timeout #364
- Bump FARM version to 0.4.7 #340
- Add Tests for MultiLabel #318
- Modified search endpoints logs to dump json #290
- Add export answers to CSV function #266
Big thanks to all contributors ♥️
@antoniolanza1996, @dany-nonstop, @philipp-bode, @lalitpagaria , @PiffPaffM , @brandenchan , @tanaysoni , @Timoeller , @tholor, @bogdankostic , @maxupp, @kolk , @venuraja79 , @karimjp
0.3.0
🔍 Dense Passage Retrieval
Glad to introduce the new Dense Passage Retriever (aka DPR).
Using dense embeddings of texts is a powerful alternative for scoring the similarity of texts. This retriever uses two BERT models - one to embed your query, one to embed your passage. This Dual-Encoder architecture can deal much better with the different nature of queries and texts (length, syntax ...). It was published by Karpukhin et al. and shows impressive performance - especially if there's no direct overlap between the tokens in your queries and your texts.
retriever = DensePassageRetriever(document_store=document_store,
embedding_model="dpr-bert-base-nq",
do_lower_case=True, use_gpu=True)
retriever.retrieve(query="What is cosine similarity?")
# returns: [Document, Document]
See Tutorial 6 for more details
📊 Evaluation
We introduce the option to evaluate your reader, retriever, and the combination of both. While there's usually a good understanding of the reader's performance, the interplay with the retriever is what really matters in practice. You want to answer: Is my retriever a bottleneck? Is it worth increasing top_k for the retriever? How do different retrievers compare in performance? What is the effect on speed? The new eval() is a first step towards answering those questions and gives a comprehensive picture of your pipeline. Stay tuned for more enhancements here.
document_store.add_eval_data("../data/nq/nq_dev_subset_v2.json")
...
retriever.eval(top_k=10)
reader.eval(document_store=document_store, device=device)
finder.eval(top_k_retriever=10, top_k_reader=10)
See Tutorial 5 for more details
📄 Basic Support for PDF and Docx Files
You can now index PDF and docx files more easily to your DocumentStore. We introduce a new BaseConverter class that offers basic cleaning functions (e.g. removing footers or tables). Its file-format-specific child classes (e.g. PDFToTextConverter) handle the actual extraction of the text.
#PDF
from haystack.indexing.file_converters.pdf import PDFToTextConverter
converter = PDFToTextConverter(remove_header_footer=True, remove_numeric_tables=True, valid_languages=["de","en"])
pages = converter.extract_pages(file_path=file)
# => list of str, one per page
#DOCX
from haystack.indexing.file_converters.docx import DocxToTextConverter
converter = DocxToTextConverter()
paragraphs = converter.extract_pages(file_path=file)
# => list of str, one per paragraph (as docx has no direct notion of pages)
And there's much more that happened ...
Preprocessing
- Added Support for Docx Files #225
- Add PDF parser for indexing #109
- Adjust PDF conversion subprocess for Python v3.6 #194
- Fix boundary condition in detection of header/footer in file converters #165
Retriever
- Refactor DPR for latest transformers version & change init arg gpu -> use_gpu for DPR and EmbeddingRetriever #239
- Add dummy retriever for benchmarking / reader-only settings #235
- Fix id for documents returned by the TfidfRetriever #232
- Tutorial for Dense Passage Retriever #186
- Fix device arg for sentence transformers #124
- Fix embeddings from sentence-transformers (type cast & gpu flags) #121
- Adding metadata to be returned from tfidf retreiver #122
Reader
- Add ONNXRuntime support #157
- Fix multi gpu training via Dataparallel #234
- Fix document id missing in farm inference output #174
- Add document meta for Transformer Reader #114
- Fix naming of offset in answers of TransformersReader (for consistency with FARMReader) #204
- Adjust to farm handling of no answer #170
DocumentStores
- Move document_name attribute to meta #217
- Remove meta field when writing documents in Elasticsearch #240
- Harmonize meta data handling across doc stores #214
- Add filtering by tags for InMemoryDocumentStore #108
- Make FAQ question field customizable #146
- Increase timeout for Elasticsearch bulk indexing #119
- Add embedding query for InMemoryDocumentStore #112
- Increase timeout for bulk indexing in ES #130
- Add custom port to ElasticsearchDocumentStore #129
- Remove hard-coded embedding field #107
REST API
- Move out REST API from PyPI package #160
- Fix format of /export-doc-qa-feedback to comply with SQuAD #241
- Create file upload directory in the REST API #166
- Add API endpoint to upload files #154
- Missing PORT and SCHEME for elasticsearch to run the API #134
- Add EMBEDDING_MODEL_FORMAT in API config #152
- Add success response for successful file upload API #195
- Add response time in logs #201
- Fix rest api in Docker image after refactoring #178
Other
- Upgrade to new FARM / Transformers / PyTorch versions #212
- Fix Evaluation Dataset #233
- Remove mutation of documents in write_documents() #231
- Remove mutation of results dict in print_answers() #230
- Make doc name optional #100
- Fix Dockerfile to build successfully without models directory #210
- Docker available for TransformsReader Class #180
- Fix embedding method in FAQ-QA Tutorial #220
- Add more tests #213
- Update docstring for embedding_field and embedding_dim #208
- Make "meta" field generic for Document Schema #102
- Update tutorials #200
- Upgrade FARM version #172
- Fix for installing PyTorch on Windows OS #159
- Remove Literal type hint #156
- Remove PyMuPDF dependency #148
- Add missing type hints #138
- Add a GitHub Action to start Elasticsearch instance for Build workflow #142
- Correct field in evaluation tutorial #139
- Update Haystack version in tutorials #136
- Fix evaluation #132
- Add stalebot #131
- Add Reader/Retriever validations in Finder #113
- Add document metadata for FAQ style QA #106
- Add basic tutorial for FAQ-based QA & batch comp. of embeddings #98
- Make saving more explicit in tutorial #95
Thanks to all contributors for working on this and shaping Haystack together: @skirdey @guillim @antoniolanza1996 @F4r1n @arthurbarros @elyase @anirbansaha96 @Timoeller @bogdankostic @tanaysoni @brandenchan
0.2.1
In our release notes, we will always highlight a few important changes first and list the detailed PRs below.
🎉 First release
Happy to announce our first proper release incl. many of the features that we found absolutely crucial for a QA system. While we still have countless exciting features on our roadmap, we are confident that this version already significantly accelerates the development phase of your QA systems, and it was also tested successfully in the first production deployments.
From now on, we will switch to a more regular release cycle.
📜 ElasticsearchDocumentStore
We recommend the new ElasticsearchDocumentStore for all production deployments. While we will keep more light-weight options (SQL, In-Memory) for easy prototyping, new features will be implemented first for Elasticsearch.
🚀 New Retrievers
Besides plain TF-IDF (in memory), we introduced the ElasticsearchRetriever that supports Elasticsearch's native scoring (BM25) or custom queries (e.g. using boosting).
As a further option, we also added the EmbeddingRetriever that encodes texts into embeddings (e.g. via Sentence-BERT) and retrieves via cosine similarity. Especially the latter is very promising, and you will likely see more features in this direction.
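A minimal sketch of the embedding-based option (the model name is only an example; assumes an existing document_store with indexed texts and embeddings, and the exact parameters may differ in this early version):
retriever = EmbeddingRetriever(document_store=document_store,
                               embedding_model="deepset/sentence_bert")
results = retriever.retrieve(query="How can I change my address?")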
⁉️ FAQ-style QA
Besides extractive QA, you can now also index existing question-answer pairs (e.g. from FAQs) and find answers by matching the incoming user question with the indexed questions and returning the related answer from that pair. This can be an interesting alternative or addition to extractive QA if you already have huge collections of FAQs and/or need a solution that works with low computational resources.
🔁 Modular API based on FastAPI
We changed the basic REST API from Flask to FastAPI and modularized it.
You can now:
- search answers in texts (extractive QA)
- search answers by comparing user question to existing questions (FAQ-style QA)
- collect & export user feedback on answers to gain domain-specific training data (feedback)
- do basic monitoring of requests (currently via APM in Kibana)
Detailed changes:
Document Stores
- Add Elasticsearch Datastore #13
- Refactor database layer #10
- Add test for Elasticsearch document store #88
- Make filters optional for Elasticsearch query #80
- Inmemory store #76
- Fix get_all_documents() in ElasticsearchDocumentStore #77
- Fix get_all_documents query for Elasticsearch #21
Retrievers
- Add FAQ-style QA #44
- added option for custom elasticsearch queries and filters #52
- More flexbile es config & support for filters #29
- Add more ES connection params #35
- Simplify Retriever query #73
- Refactor ElasticsearchRetriever into separate class #72
- Add params to create_embeddings in retriever #45
- fix scaling of pseudo probs for es scores. fix filtering of embedding retrieval #46
- Fixing doc_name for TFIDF Retriever #33
Readers
- Refactor pipeline for better generalizability & Add TransformersReader #1
- Add method to train a reader on custom data #5
- Add no answer handling #26
- Add no_answer option to results #24
- Fix offsets in reader #4
- FARMReader.train() now takes default values from FARMReader #47
- Update inferencer args (num_processes, chunksize) to latest FARM version #54
- update readme & rename arg in TransformersReader for consistency #86
- Fixing typo in transformer. use_gpu provides ordinal of the gpu, not … #83
- Add document_id with Transformers Reader #60
- Make eval during reader.train() more verbose #28
- Removed "document_name" from farm.py #31
- Add a document_name field in answers #30
REST API / Deployment
- Move API from flask to fastAPI #3
- Modularize API components #55
- Return more meta data & restructure reponse format #66
- Log API responses in APM #70
- Make Elastic-APM optional #65
- Update Python version in Dockerfile-GPU #71
- Update Dockerfiles to use Gunicorn for deployment #69
- Add limit on concurrent requests for doc-qa #64
- Add Docker Images for running Haystack #85
- Fix cyclic import of Elasticsearch client #59
- Add Feedback export API #56
- Add gpu dockerfile, improve logging, fix minor bug with filtering #36
- Improve deployment of REST API (Configs, logging, minor bugs) #40
Others
- Standardize Finder, Readers, and Retriever interfaces #62
- pin haystack version in tutorials until release #87
- Update tutorials to use Elasticsearch, new Retrievers #79
- Adding coverage reports and a few more tests #78
- Added Jupyter notebooks of Tutorials #43
- Add minimal tutorial for ES #19
- Update tutorials #12
Thanks to all contributors for your great work 👏
@tanaysoni, @Timoeller , @brandenchan, @bogdankostic , @skirdey , @stedomedo , @karthik19967829 , @aadil-srivastava01 , @tholor
Initial Release
0.1.0