Releases: deepset-ai/haystack
v0.8.0
⭐ Highlights
This is a major Haystack release with many new features. The release blog post has a detailed summary. Below are the top highlights:
Milvus Document Store
Milvus is an open-source vector database. With the MilvusDocumentStore contributed by @lalitpagaria, embedding-based Retrievers like the DensePassageRetriever or EmbeddingRetriever can use production-ready Milvus servers for large-scale deployments.
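A minimal sketch of how this could look in a dense retrieval setup (assumes a Milvus server running locally with default settings; exact import paths and defaults may differ, so check the MilvusDocumentStore docs):
from haystack.document_store.milvus import MilvusDocumentStore
from haystack.retriever.dense import DensePassageRetriever

document_store = MilvusDocumentStore()  # connects to a local Milvus server by default
document_store.write_documents([{"text": "Milvus is an open-source vector database.", "meta": {"source": "example"}}])
retriever = DensePassageRetriever(document_store=document_store)
document_store.update_embeddings(retriever)  # embeddings are computed by the retriever and stored in Milvus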
Knowledge Graph
An experimental integration for Knowledge Graphs is introduced using GraphDB. The GraphDBKnowledgeGraph stores Triples and executes SPARQL queries. It can be integrated with the Text2SparqlRetriever to convert natural language queries to SPARQL.
Pipeline configuration with YAML
Pipelines can now be configured with YAML. This enables easier sharing of query & indexing configurations, reproducible setups, A/B testing of Pipelines, and a smoother path from development to production.
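For illustration, loading such a configuration in Python could look like this (file name and pipeline name are placeholders; see rest_api/pipeline.yaml for a complete example):
from pathlib import Path
from haystack.pipeline import Pipeline

pipeline = Pipeline.load_from_yaml(Path("pipeline.yaml"), pipeline_name="query")
res = pipeline.run(query="Why did the revenue change?")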
REST APIs
The REST APIs are revamped to use Pipelines for querying and for indexing files. The YAML configuration is in rest_api/pipeline.yaml. The new API endpoints are more generic to accommodate custom Pipeline configurations.
Confidence Scores
The answers now have a probability score that is better calibrated to the model's confidence. The score ranges from 0 to 1; 0 signifies very low confidence and 1 very high confidence.
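A small sketch of reading the calibrated score from a prediction (assumes an existing ExtractiveQAPipeline named pipe):
prediction = pipe.run(query="Why did the revenue change?", top_k_retriever=10, top_k_reader=5)
for answer in prediction["answers"]:
    print(answer["answer"], answer["probability"])  # probability ranges from 0 to 1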
Web Crawler
A Selenium-based web crawler is now part of Haystack, thanks to @DIVYA-19 for the contribution. It takes a list of URLs as input and converts extracted text to Haystack Documents.
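A rough sketch of how it can be used; note that the module path and parameter names here are assumptions, so check the Crawler documentation for the exact signature:
from haystack.connector import Crawler

crawler = Crawler(output_dir="crawled_files")  # extracted text is written to this directory
docs = crawler.crawl(urls=["https://haystack.deepset.ai"], crawler_depth=1)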
⚠️ Breaking Changes
REST APIs
The REST APIs got a major revamp with this release.
- The /doc-qa & /faq-qa endpoints are replaced with a more generic POST /query endpoint. This new endpoint uses Pipelines under the hood that can be configured at rest_api/pipeline.yaml.
- The new /query endpoint expects a single query per request instead of a list of query strings. The new request format is:
{ "query": "Why did the revenue change?" }
and the response looks like this (a minimal client sketch of the endpoint follows this list):
{
  "query": "Why did the revenue change?",
  "answers": [
    {
      "answer": "rapid technological change and evolving industry standards",
      "question": null,
      "score": 0.543937623500824,
      "probability": 0.014070278964936733,
      "context": "tion process. The market for our products is intensely competitive and is characterized by rapid technological change and evolving industry standards.",
      "offset_start": 91,
      "offset_end": 149,
      "offset_start_in_doc": 511,
      "offset_end_in_doc": 569,
      "document_id": "f30273b2-4d49-40d8-8824-43b3b6a0ea57",
      "meta": { "_split_id": "7" }
    },
    { // other answers }
  ]
}
- The /doc-qa-feedback & /faq-qa-feedback endpoints are replaced with a new generic /feedback endpoint.
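A minimal client sketch for the new /query endpoint (host and port are placeholders for a local deployment of the REST API):
import requests

response = requests.post("http://localhost:8000/query", json={"query": "Why did the revenue change?"})
print(response.json()["answers"][0]["answer"])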
Created At Timestamp
Previously, all documents/labels in the SQLDocumentStore and FAISSDocumentStore had a field called created to store the creation timestamp, while the ElasticsearchDocumentStore did not have any timestamp field. Now, all document stores have a created_at field for documents and labels.
RAGenerator
The top_k_answers parameter in the RAGenerator is renamed to top_k for consistency across Haystack components.
Custom Query for Elasticsearch
The placeholder terms in custom_query should not have quotes around them. See more details here.
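An illustrative sketch of the new placeholder style (assumes an existing ElasticsearchDocumentStore named document_store; the query body itself is only an example):
from haystack.retriever.sparse import ElasticsearchRetriever

custom_query = """{
    "query": {
        "bool": {
            "must": [{"match": {"text": ${query}}}]
        }
    }
}"""
retriever = ElasticsearchRetriever(document_store=document_store, custom_query=custom_query)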
🤓 Detailed Changes
Pipeline
- Fix execution of Pipelines with parallel nodes #901 (@oryx1729)
- Add abstract run method to basecomponent #887 (@tholor)
- Add support for parallel paths in Pipeline #884 (@oryx1729)
- Add runtime parameters to component initialization #873 (@oryx1729 )
- Add support for indexing pipelines #816 (@oryx1729 )
- Adding translator with many generic input parameter support #782 (@lalitpagaria)
- Fix building Pipeline with YAML #800 (@oryx1729)
- Load Pipeline with YAML config file #785 (@oryx1729)
- Add evaluation nodes for Pipelines #904 (@brandenchan)
- Fix passing a list as parameter value in Pipeline YAML #952 (@oryx1729)
Document Store
- Fixes elasticsearch auth #871 (@grafke)
- Allow more options for elasticsearch client (auth, multiple hosts) #845 (@tholor)
- Fix ElasticsearchDocumentStore.query_by_embedding() #823 (@oryx1729)
- Introduce incremental updates for embeddings in document stores #812 (@oryx1729)
- Add method to get metadata values for a key from Elasticsearch #776 (@oryx1729)
- Fix refresh behaviour for Elasticsearch delete #794 (@oryx1729)
- Milvus integration #771 (@lalitpagaria)
- Add flag for use of window queries in SQLDocumentStore #768 (@oryx1729)
- Remove quotes around placeholders in Elasticsearch custom query #762 (@oryx1729)
- Fix delete_all_documents for the SQLDocumentStore #761 (@oryx1729)
Retriever
- Improve dpr conversion #826 (@Timoeller)
- Fix DPR training batch size #898 (@brandenchan)
- Upgrade FAISS to 1.7.0 #834 (@tholor)
- Allow non-standard Tokenizers (e.g. CamemBERT) for DPR via new arg #811 (@psorianom)
Modeling
- Add model versioning support #784 (@brandenchan)
- Improve preprocessing and adding of eval data #780 (@Timoeller)
- SQuAD to DPR dataset converter #765 (@psorianom)
- Remove RAG todos after transformers update #781 (@Timoeller)
- Update farm version #936 (@Timoeller)
REST API
- Refactor REST APIs to use Pipelines #922 (@oryx1729)
- Add PDF converter in Dockerfiles #877 (@oryx1729)
- Update GPU Dockerimage (Cuda 11, Fix faiss) #836 (@tholor)
- Add API endpoint to export accuracy metrics from user feedback + created_at timestamp #803 (@tholor)
- Fix file upload API #808 (@oryx1729)
File Converter
- Add Markdown file convertor #875 (@lalitpagaria)
- Fix encoding for pdftotext (Russian characters, German umlauts etc). Fix version in download instructions #813 (@tholor)
Crawler
Knowledge Graph
- knowledge graph example #934 (@julian-risch)
Annotation Tool
- Annotation Tool: data is not persisted when using local version #853 #855 (@venuraja79)
Search UI
CI
- Revamp CI #825 (@oryx1729)
- Fix mypy typing #792 (@oryx1729)
- Fix pdftotext dependency in CI #788 (@tholor)
Misc Fixes
- Adding indentation to markup files #947 (@julian-risch)
- Reduce precision in pipeline eval print functions #943 (@lewtun)
- Fix division by zero error in EvalRetriever #938 (@lewtun)
- Logged warning in Faiss and Milvus for filters #913 (@peteradorjan)
- fixed "cannot allocate memory" exception by specifying max_processes #910(@mosheber)
- Fix error when is_impossible not exist [#870](https://github.com/deepset-ai/haystack/pu...
v0.7.0
⭐ Highlights
New Slack Channel
As many people in the community asked us for it, we decided to open a Slack channel!
Join us to ask questions, show what you've built with Haystack, and exchange ideas with like-minded folks!
👉 https://haystack.deepset.ai/community/join
Optimizing Memory + CPU consumption of document stores for large datasets (#733)
Interacting with large datasets can be challenging for the local memory. Therefore, we ...
- ... add batch_size parameters for most methods of the document store that allow loading only smaller chunks of documents at a time
- ... add a get_all_documents_generator() method that "streams" documents one by one from your document store.
Both help to lower the memory footprint significantly, especially when calling methods like update_embeddings() on datasets with more than 1 million docs.
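A small sketch of both options (assumes an already populated document store and an initialized retriever; exact defaults may differ):
# stream documents one by one instead of loading everything into memory
for doc in document_store.get_all_documents_generator(batch_size=1_000):
    print(doc.id)

# compute and store embeddings in chunks
document_store.update_embeddings(retriever, batch_size=10_000)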
Add Simple Demo UI (#671)
Thanks to our community member @tanmaylaud, we now have a great and simple UI that allows you to easily try your search pipelines. Ask questions, see the results, change basic config params, debug the API response and give your colleagues a better flavor of what you are building ...
Support for summarization models (#698)
Thanks to another community contribution from @lalitpagaria we now also support summarization models like PEGASUS in Haystack. You can use them ...
... standalone:
docs = [Document(text="PG&E stated it scheduled the blackouts in response to forecasts for high winds amid dry conditions.
The aim is to reduce the risk of wildfires. Nearly 800 thousand customers were scheduled to be affected by
the shutoffs which were expected to last through at least midday tomorrow.")]
summarizer = TransformersSummarizer(model_name_or_path="google/pegasus-xsum")
summary = summarizer.predict(documents=docs, generate_single_summary=False)
... as a node in your pipeline:
...
pipeline.add_node(component=summarizer, name="Summarizer", inputs=["Retriever"])
... by simply calling a predefined pipeline that first retrieves and then summarizes the resulting docs:
...
pipe = SearchSummarizationPipeline(summarizer=summarizer, retriever=retriever)
pipe.run(query="What did PG&E announce?")
We see many interesting use cases around search for it. For example, running semantic document search and displaying the summary of docs as a "preview" in the results.
New Tutorials
- Wonder how to train a DPR retriever on your own domain dataset? Check out this new tutorial!
- Proper preprocessing (Cleaning, Splitting etc.) of docs can have a big impact on your performance. Check out this new tutorial to learn more about it.
⚠️ Breaking Changes
Dropping index_buffer_size from FAISSDocumentStore
We removed the arg index_buffer_size from the init of FAISSDocumentStore. "Buffering" is now handled via the new batch_size arguments that you can pass to most methods like write_documents(), update_embeddings() and get_all_documents().
Renaming of Preprocessor arg
Old:
PreProcessor(..., split_stride=5)
New:
PreProcessor(..., split_overlap=5)
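For context, a small sketch of the renamed argument in a typical setup (the other parameters are common choices, not requirements, and the import path may differ slightly in your version):
from haystack.preprocessor import PreProcessor

processor = PreProcessor(split_by="word", split_length=200, split_overlap=5)
docs = processor.process(my_document)  # my_document is a dict like {"text": "...", "meta": {...}}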
🤓 Detailed Changes
Preprocessing / File Conversion
- Using PreProcessor functions on eval data #751
DocumentStore
- Support filters for DensePassageRetriever + InMemoryDocumentStore #754
- use Path class in add_eval_data of haystack.document_store.base.py #745
- Make batchwise adding of evaluation data possible #717
- Change signature and docstring for ca_certs parameter #730
- Rename label id field for elastic & add UPDATE_EXISTING_DOCUMENTS to API config #728
- Fix SQLite errors in tests #723
- Add support for custom embedding field for InMemoryDocumentStore #640
- Using Columns names instead of ORM to get all documents #620
Other
- Generate docstrings and deploy to branches to Staging (Website) #731
- Script for releasing docs #736
- Increase FARM to Version 0.6.2 #755
- Reduce memory consumption of fetch_archive_from_http #737
- Add links to more resources #746
- Fix Tutorial 9 #734
- Adding a guard that prevents the tutorial from being executed in every subprocess on windows #729
- Add ID to Label schema #727
- Automate docstring and tutorial generation with every push to master #718
- Pass custom label index name to REST API #724
- Correcting pypi download badge #722
- Fix GPU docker build #703
- Remove sourcerer.io widget #702
- Haystack logo is not visible on github mobile app #697
- Update pipeline documentation and readme #693
- Enable GPU args in tutorials #692
- Add docs v0.6.0 #689
Big thanks to all contributors ❤️ !
@Rob192 @antoniolanza1996 @tanmaylaud @lalitpagaria @Timoeller @tanaysoni @bogdankostic @aantti @brandenchan @PiffPaffM @julian-risch
v0.6.0
⭐ Highlights
Flexible Pipelines powered by DAGs (#596)
In order to build modern search pipelines, you need two things: powerful building blocks and a flexible way to stick them together.
While we always had great building blocks in Haystack, we didn't have a good way to stick them together so far. That's why we put a lot of thought into it in the last weeks and came up with a new Pipeline class that enables many new search scenarios beyond QA. The core idea: you can build a Directed Acyclic Graph (DAG) where each node is one "building block" (Reader, Retriever, Generator ...). Here's a simple example for a "standard" Open-Domain QA Pipeline:
p = Pipeline()
p.add_node(component=retriever, name="ESRetriever1", inputs=["Query"])
p.add_node(component=reader, name="QAReader", inputs=["ESRetriever1"])
res = p.run(query="What did Einstein work on?", top_k_retriever=1)
You can draw the DAG to better inspect what you are building:
p.draw(path="custom_pipe.png")
Multiple retrievers
You can now also use multiple Retrievers and join their results:
p = Pipeline()
p.add_node(component=es_retriever, name="ESRetriever", inputs=["Query"])
p.add_node(component=dpr_retriever, name="DPRRetriever", inputs=["Query"])
p.add_node(component=JoinDocuments(join_mode="concatenate"), name="JoinResults", inputs=["ESRetriever", "DPRRetriever"])
p.add_node(component=reader, name="QAReader", inputs=["JoinResults"])
res = p.run(query="What did Einstein work on?", top_k_retriever=1)
Custom nodes
You can easily build your own custom nodes. Just respect the following requirements:
- Add a method run(self, **kwargs) to your class. **kwargs will contain the output from the previous node in your graph.
- Do whatever you want within run() (e.g. reformatting the query).
- Return a tuple that contains your output data (for the next node) and the name of the outgoing edge, e.g. (output_dict, "output_1").
- Add a class attribute outgoing_edges = 1 that defines the number of output options from your node. You only need a higher number here if you have a decision node (see below).
Decision nodes
Or you can add decision nodes where only one "branch" is executed afterwards. This allows, for example, to classify an incoming query and depending on the result routing it to different modules:
class QueryClassifier():
    outgoing_edges = 2

    def run(self, **kwargs):
        if "?" in kwargs["query"]:
            return (kwargs, "output_1")
        else:
            return (kwargs, "output_2")

pipe = Pipeline()
pipe.add_node(component=QueryClassifier(), name="QueryClassifier", inputs=["Query"])
pipe.add_node(component=es_retriever, name="ESRetriever", inputs=["QueryClassifier.output_1"])
pipe.add_node(component=dpr_retriever, name="DPRRetriever", inputs=["QueryClassifier.output_2"])
pipe.add_node(component=JoinDocuments(join_mode="concatenate"), name="JoinResults",
              inputs=["ESRetriever", "DPRRetriever"])
pipe.add_node(component=reader, name="QAReader", inputs=["JoinResults"])
res = pipe.run(query="What did Einstein work on?", top_k_retriever=1)
Default Pipelines (replacing the "Finder")
Last but not least, we added some "Default Pipelines" that allow you to run standard patterns with very few lines of code.
This replaces the Finder class, which is now deprecated.
from haystack.pipeline import DocumentSearchPipeline, ExtractiveQAPipeline, GenerativeQAPipeline, FAQPipeline, Pipeline, JoinDocuments
# Extractive QA
qa_pipe = ExtractiveQAPipeline(reader=reader, retriever=retriever)
res = qa_pipe.run(query="When was Kant born?", top_k_retriever=3, top_k_reader=5)
# Document Search
doc_pipe = DocumentSearchPipeline(retriever=retriever)
res = doc_pipe.run(query="Physics Einstein", top_k_retriever=1)
# Generative QA
doc_pipe = GenerativeQAPipeline(generator=rag_generator, retriever=retriever)
res = doc_pipe.run(query="Physics Einstein", top_k_retriever=1)
# FAQ based QA
doc_pipe = FAQPipeline(retriever=retriever)
res = doc_pipe.run(query="How can I change my address?", top_k_retriever=3)
We plan many more features around the new pipelines incl. parallelized execution, distributed execution, definition via YAML files, dry runs ...
New DocumentStore for the Open Distro of Elasticsearch (#676)
From now on we also support the Open Distro of Elasticsearch. This allows you to use many of the hosted Elasticsearch services (e.g. from AWS) more easily with Haystack. Usage is similar to the regular ElasticsearchDocumentStore:
document_store = OpenDistroElasticsearchDocumentStore(host="localhost", port="9200", ...)
⚠️ Breaking Changes
As Haystack is extending from QA to further search types, we decided to rename all parameters from question to query. This includes, for example, the predict() methods of the Readers, but also several other places. See #614 for details.
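A small sketch of what the renaming looks like for a Reader (illustrative; see #614 for all affected methods):
# old
result = reader.predict(question="Who is the father of Arya Stark?", documents=docs, top_k=5)
# new
result = reader.predict(query="Who is the father of Arya Stark?", documents=docs, top_k=5)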
🤓 Detailed Changes
Preprocessing / File Conversion
- Redone: Fix concatenation of sentences in PreProcessor. Add stride for word-based splits with sentence boundaries #641
- Add needed whitespace before sentence start #582
DocumentStore
- Scale dot product into probabilities #667
- Add refresh_type param for Elasticsearch update_embeddings() #630
- Add return_embedding parameter for get_all_documents() #615
- Adding support for update_existing_documents to sql and faiss document stores #584
- Add filters for delete_all_documents() #591
Retriever
- Fix saving tokenizers in DPR training + unify save and load dirs #682
- fix a typo, num_negatives -> num_positives #681
- Refactor DensePassageRetriever._get_predictions #642
- Move DPR embeddings from GPU to CPU straight away #618
- Add MAP retriever metric for open-domain case #572
Reader / Generator
- add GPU support for rag #669
- Enable dynamic parameter updates for the FARMReader #650
- Add option in API Config to configure if reader can return "No Answer" #609
- Fix various generator issues #590
Pipeline
- Add support for building custom Search Pipelines #596
- Add set_node() for Pipeline #659
- Add support for aggregating scores in JoinDocuments node #683
- Add pipelines for GenerativeQA & FAQs #645
Other
- Cleanup Pytest Fixtures #639
- Add latest benchmark run #652
- Fix image links in tutorial #663
- Update query arg in Tutorial 7 #656
- Fix benchmarks #648
- Add link to FAISS Info in documentation #643
- Improve User Feedback Documentation #539
- Add formatting checks for shell scripts #627
- Update md files for API docs #631
- Clean API docs and increase coverage #621
- Add boxes for recommendations #629
- Automate benchmarks via CML #518
- Add contributor hall of fame #628
- README: Fix link to roadmap #626
- Fix docstring examples #604
- Cleaning the api docs #616
- Fix link to DocumentStore page #613
- Make more changes to documentation #578
- Remove column in benchmark website #608
- Make benchmarks clearer #606
- Fixing defaults configs for rest_apis #583
- Allow list of filter values in REST API #568
- Fix CI bug due to new Elasticsearch release and new model release #579
- Update Colab Torch Version [#576](https://github.com/deepset...
v0.5.0
Highlights
💬 Generative Question Answering via RAG (#484)
Thanks to our community member @lalitpagaria, Haystack now also supports generative QA via Retrieval Augmented Generation ("RAG").
Instead of "finding" the answer within a document, these models generate the answer. In that sense, RAG follows a similar approach to GPT-3, but it comes with two huge advantages for real-world applications:
a) it has a manageable model size
b) the answer generation is conditioned on retrieved documents, i.e. the model can easily adjust to domain documents even after training has finished (in contrast: GPT-3 relies on the web data seen during training)
Example:
question = "who got the first nobel prize in physics?"
# Retrieve related documents from retriever
retrieved_docs = retriever.retrieve(query=question)
# Now generate answer from question and retrieved documents
predicted_result = generator.predict(
    question=question,
    documents=retrieved_docs,
    top_k=1
)
You can already play around with it in this minimal tutorial.
We are looking forward to improving this class of models further in the next months and already plan a tighter integration into the Finder class.
↗️ Better DPR (incl. training) (#527)
We migrated the existing DensePassageRetriever to our own pipeline based on FARM. This allows better modularization and, most importantly, simple training of DPR models! You can either train models from scratch or take an existing DPR model and fine-tune it on your own domain data. The required training data consists of queries and positive passages (i.e. passages that are related to your query / contain the answer), and the format complies with the one in the original DPR codebase.
Example:
dense_passage_retriever.train(data_dir: str,
                              train_filename: str,
                              dev_filename: str = None,
                              test_filename: str = None,
                              batch_size: int = 16,
                              embed_title: bool = True,
                              num_hard_negatives: int = 1,
                              n_epochs: int = 3)
Future improvements: At the moment training is only supported on single GPUs. We will add support for Multi-GPU Training via DDP soon.
📊 New Benchmarks
Happy to introduce a new benchmark section on our website!
Do you wonder if you should use BERT, RoBERTa or MiniLM for your reader? Is it worth using DPR for retrieval instead of Elastic's BM25? How would this impact speed and accuracy?
See the relevant metrics here to guide your decision:
👉 https://haystack.deepset.ai/bm/benchmarks
We will extend this section over time with more models, metrics and key parameters.
⚠️ Breaking Changes
Consistent parameter naming for TransformersReader #510
# old
TransformersReader(model="distilbert-base-uncased-distilled-squad", ...)
# new
TransformersReader(model_name_or_path="distilbert-base-uncased-distilled-squad", ...)
FAISS: Remove phi normalization, support more index types #467
New default index type is "Flat" and params have changed slightly:
# old
FAISSDocumentStore(
    sql_url: str = "sqlite:///",
    index_buffer_size: int = 10_000,
    vector_size: int = 768,
    faiss_index: Optional[IndexHNSWFlat] = None,
)

# new
FAISSDocumentStore(
    sql_url: str = "sqlite:///",
    index_buffer_size: int = 10_000,
    vector_dim: int = 768,
    faiss_index_factory_str: str = "Flat",
    faiss_index: Optional[faiss.swigfaiss.Index] = None,
    return_embedding: Optional[bool] = True,
    **kwargs,
)
DPR signature
Splitting max_seq_len into two independent params: max_seq_len_query and max_seq_len_passage. Removing the remove_sep_tok_from_untitled_passages param.
# old
DensePassageRetriever(
document_store: BaseDocumentStore,
query_embedding_model: str = "facebook/dpr-question_encoder-single-nq-base",
passage_embedding_model: str = "facebook/dpr-ctx_encoder-single-nq-base",
max_seq_len: int = 256,
use_gpu: bool = True,
batch_size: int = 16,
embed_title: bool = True,
remove_sep_tok_from_untitled_passages: bool = True
)
# new
DensePassageRetriever(
document_store: BaseDocumentStore,
query_embedding_model: Union[Path, str] = "facebook/dpr-question_encoder-single-nq-base",
passage_embedding_model: Union[Path, str] = "facebook/dpr-ctx_encoder-single-nq-base",
max_seq_len_query: int = 64,
max_seq_len_passage: int = 256,
use_gpu: bool = True,
batch_size: int = 16,
embed_title: bool = True,
use_fast_tokenizers: bool = True,
similarity_function: str = "dot_product"
):
Detailed Changes
Preprocessing / File Conversion
- Add preprocessing pipeline #473
- Restructure checks in PreProcessor #504
- Updated the example code to Indexing PDF / Docx files #502
- Fix meta data = None in PreProcessor #496
- add explicit encoding mode to file_converter/txt.py #478
- Skip file conversion if file type is not supported #456
DocumentStore
- Add support for MySQL database #556
- Allow configuration of Elasticsearch Analyzer (e.g. for other languages) #554
- Add support to return embedding #514
- Fix scoring in Elasticsearch for dot product #517
- Allow filters for get_document_count() #512
- Make creation of label index optional #490
- Fix update_embeddings function in FAISSDocumentStore #481
- FAISS Store: allow multiple write calls and fix potential memory leak in update_embeddings #422
- Enable bulk operations on vector IDs for FAISSDocumentStore #460
- fixing ElasticsearchDocumentStore initialisation #415
- bug: filters on a query_by_embedding #464
Retriever
- DensePassageRetriever: Add Training, Refactor Inference to FARM modules #527
- Fix retriever evaluation metrics #547
- Add save and load method for DPR #550
- Typo in dense.py comment #545
- Make returning predictions in Finder & Retriever eval() possible #524
- Make title info optional when evaluating on QA data #494
- Make sentence-transformers usage more user-friendly #439
Reader
- Fix FARMReader.eval() handling of no_answers #531
- Added automatic mixed precision (AMP) support for reader training from Haystack side #463
- Update ONNX conversion for FARMReader #438
Other
- Fix sentencepiece dependencies in Dockerfiles #553
- Update Dockerfile #537
- Removing (deprecated) warnings from the Haystack codebase. #530
- Pytest fix memory leak and put pytest marker on slow tests #520
- [enhancement] Create deploy_website.yml #450
- Add Docker Images & Setup for the Annotation Tool #444
REST API
- Make filter value optional in REST API #497
- Add Elasticsearch Query DSL compliant Query API #471
- Allow configuration of log level in REST API #541
- Add create_index and similarity metric to api config #493
- Add deepcopy for meta dicts in answers #485
- Fix windows platform installation #480
- Update GPU docker & fix race condition with multiple workers #436
Documentation / Benchmarks / Tutorials
- New readme #534
- Add ...
v0.4.0
Highlights
💥 New Project Website & Documentation
As the project is growing, we have more and more content that doesn't fit in GitHub.
In this first version of the website, we focused on documentation incl. quick start, usage guides and the API reference.
In the future, we plan to extend this with benchmarks, FAQs, use cases, and other content that helps you to build your QA system.
👉 https://haystack.deepset.ai
📈 Scalable dense retrieval: FAISSDocumentStore
With recent performance gains of dense retrieval methods (learn more about them here), we need document stores that efficiently store vectors and find the most similar ones at query time. While Elasticsearch can also handle vectors, it quickly reaches its limits when dealing with larger datasets. We evaluated a couple of projects (FAISS, Scann, Milvus, Jina ...) that specialize in approximate nearest neighbour (ANN) algorithms for vector similarity. We decided to implement FAISS as it's easy to run in most environments.
We will likely add one of the heavier solutions (e.g. Jina or Milvus) later this year.
The FAISSDocumentStore uses FAISS to handle embeddings and SQL to store the actual texts and meta data.
Usage:
document_store = FAISSDocumentStore(sql_url="sqlite:///",  # SQL DB for text + meta data
                                    vector_size=768)       # dimensionality of your embeddings
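A typical indexing flow with this store could then look like this (a sketch; assumes dicts is a list of {"text": ..., "meta": ...} dicts and retriever is an embedding-based retriever such as DensePassageRetriever):
document_store.write_documents(dicts)        # store texts + meta data in SQL
document_store.update_embeddings(retriever)  # compute embeddings and add them to the FAISS index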
📃 More input file formats: Apache Tika File Converter (#314 )
Thanks to @dany-nonstop you can now extract text from many file formats (docx, pptx, html, epub, odf ...) via Apache Tika.
Usage:
- Start Apache Tika Server
docker run -d -p 9998:9998 apache/tika
- Do Conversion in Haystack
tika_converter = TikaConverter(
tika_url = "http://localhost:9998/tika",
remove_numeric_tables = False,
remove_whitespace = False,
remove_empty_lines = False,
remove_header_footer = False,
valid_languages = None,
)
>>> dict = tika_converter.convert(file_path=Path("test/samples/pdf/sample_pdf_1.pdf"))
>>> dict
{
    "text": "everything on page one \f then page two \f ...",
    "meta": {"Content-Type": "application/pdf", "Creation-Date": "2020-06-02T12:27:28Z", ...}
}
Breaking changes
Restructuring / Renaming of modules (Breaking changes!) (#379)
We've restructured the package to make the usage more intuitive and terminology more consistent.
- Rename database module -> document_store
- Split indexing module into -> file_converter and preprocessor
- Move Document, Label and MultiLabel classes into -> schema and simplify the import to from haystack import Document, Label, MultiLabel
File converters (#393)
Refactoring major parts of the file converters. Not returning pages anymore, but rather adding page break symbols that can be accessed further down the pipeline.
Old:
>>> pages, meta = Fileconverter.extract_pages(file_path=Path("..."))
New:
>>> dict = Fileconverter.convert(file_path="...", meta={"name": "some_name", "category": "news"})
>>> dict
{
    "text": "everything on page one \f then page two \f ...",
    "meta": {"name": "..."}
}
DensePassageRetriever (#308)
Refactored from the original FB code to the transformers code base; the models are now loaded from the Hugging Face model hub.
Signature has therefore changed to:
retriever = DensePassageRetriever(document_store=document_store,
query_embedding_model="facebook/dpr-question_encoder-single-nq-base",
passage_embedding_model="facebook/dpr-ctx_encoder-single-nq-base",
use_gpu=True,
embed_title=True,
remove_sep_tok_from_untitled_passages=True)
Deprecate Tags for Document Stores (#286)
We removed the "tags" field that in the past could be associated with Documents and used for filtering your search.
Insead, we use now the more general concept of "meta", where you can supply any custom fields and filter for them at runtime
Old:
dict = {"text": "some", "tags": ["category1", "category2"]}
New
dict = {"text": "some", "meta": {"category": ["1", "2"] }}
Details
Document Stores
- Add FAISS Document Store #253
- Fix type casting for vectors in FAISS #399
- Fix duplicate vector ids in FAISS #395
- Fix document filtering in SQLDocumentStore #396
- Move retriever probability calculations to document_store #389
- Add FAISS query scores #368
- Raise Exception if filters used for FAISSDocumentStore query #338
- Add refresh_type arg to ElasticsearchDocumentStore #326
- Improve speed for SQLDocumentStore #330
- Fix indexing of metadata for FAISS/SQL Document Store #310
- Ensure exact match when filtering by meta in Elasticsearch #311
- Deprecate Tags for Document Stores #286
- Add option to update existing documents when indexing #285
- Cast document_ids as strings #284
- Add method to update meta fields for documents in Elasticsearch #242
- Custom mapping write doc fix #297
Retriever
- DPR (Dense Retriever) for InMemoryDocumentStore #316 #332
- Refactor DPR from FB to Transformers codebase #308
- Restructure update embeddings #304
- Added title during DPR passage embedding && ElasticsearchDocumentStore #298
- Add eval for Dense Passage Retriever & Refactor handling of labels/feedback #243
- Fix type of query_emb in DPR.retrieve() #247
- Fix return type of EmbeddingRetriever to numpy array #245
Reader
- More robust Reader eval by limiting max answers and creating no answer labels #331
- Aggregate multiple no answers in MultiLabel #324
- Add "no answer" aggregation to Transformersreader #259
- Align TransformersReader with FARMReader #319
- Datasilo use all cores for preprocessing #303
- Batch prediction in evaluation #137
- Aggregate label objects for same questions #292
- Add num_processes to reader.train() to configure multiprocessing #271
- Added support for unanswerable questions in TransformersReader #258
Preprocessing
Finder
- Add index arg to Finder.get_answers() and _via_similar_questions() #362
Documentation
- Create documentation website #272
- Use port 8000 in documentation #357
- Documentation #343
- Convert Documentation to markdown #386
- Add logo to readme #384
- Refactor the DPR tutorial to use FAISS #317
- Make Tutorials Work on Colab GPUs #322
Other
- Exclude embedding fields from the REST API #390
- Fix test suite dependency issue on MacOS #374
- Add Gunicorn timeout #364
- Bump FARM version to 0.4.7 #340
- Add Tests for MultiLabel #318
- Modified search endpoints logs to dump json #290
- Add export answers to CSV function #266
Big thanks to all contributors ♥️
@antoniolanza1996, @dany-nonstop, @philipp-bode, @lalitpagaria , @PiffPaffM , @brandenchan , @tanaysoni , @Timoeller , @tholor, @bogdankostic , @maxupp, @kolk , @venuraja79 , @karimjp
0.3.0
🔍 Dense Passage Retrieval
Glad to introduce the new Dense Passage Retriever (aka DPR).
Using dense embeddings of texts is a powerful alternative for scoring the similarity of texts. This retriever uses two BERT models - one to embed your query, one to embed your passage. This Dual-Encoder architecture can deal much better with the different nature of queries and texts (length, syntax ...). It was published by Karpukhin et al. and shows impressive performance - especially if there's no direct overlap between the tokens in your queries and your texts.
retriever = DensePassageRetriever(document_store=document_store,
embedding_model="dpr-bert-base-nq",
do_lower_case=True, use_gpu=True)
retriever.retrieve(query="What is cosine similarity?")
# returns: [Document, Document]
See Tutorial 6 for more details
📊 Evaluation
We introduce the option to evaluate your reader, retriever, and the combination of both. While there's usually a good understanding of the reader's performance, the interplay with the retriever is what really matters in practice. You want to answer: Is my retriever a bottleneck? Is it worth increasing top_k for the retriever? How do different retrievers compare in performance? What is the effect on speed? The new eval() is a first step towards answering those questions and gives a comprehensive picture of your pipeline. Stay tuned for more enhancements here.
document_store.add_eval_data("../data/nq/nq_dev_subset_v2.json")
...
retriever.eval(top_k=10)
reader.eval(document_store=document_store, device=device)
finder.eval(top_k_retriever=10, top_k_reader=10)
See Tutorial 5 for more details
📄 Basic Support for PDF and Docx Files
You can now index PDF and docx files more easily to your DocumentStore. We introduce a new BaseConverter class that offers basic cleaning functions (e.g. removing footers or tables). Its file-format-specific child classes (e.g. PDFToTextConverter) handle the actual extraction of the text.
#PDF
from haystack.indexing.file_converters.pdf import PDFToTextConverter
converter = PDFToTextConverter(remove_header_footer=True, remove_numeric_tables=True, valid_languages=["de","en"])
pages = converter.extract_pages(file_path=file)
# => list of str, one per page
#DOCX
from haystack.indexing.file_converters.docx import DocxToTextConverter
converter = DocxToTextConverter()
paragraphs = converter.extract_pages(file_path=file)
# => list of str, one per paragraph (as docx has no direct notion of pages)
And there's much more that happened ...
Preprocessing
- Added Support for Docx Files #225
- Add PDF parser for indexing #109
- Adjust PDF conversion subprocess for Python v3.6 #194
- Fix boundary condition in detection of header/footer in file converters #165
Retriever
- Refactor DPR for latest transformers version & change init arg gpu -> use_gpu for DPR and EmbeddingRetriever #239
- Add dummy retriever for benchmarking / reader-only settings #235
- Fix id for documents returned by the TfidfRetriever #232
- Tutorial for Dense Passage Retriever #186
- Fix device arg for sentence transformers #124
- Fix embeddings from sentence-transformers (type cast & gpu flags) #121
- Adding metadata to be returned from tfidf retreiver #122
Reader
- Add ONNXRuntime support #157
- Fix multi gpu training via Dataparallel #234
- Fix document id missing in farm inference output #174
- Add document meta for Transformer Reader #114
- Fix naming of offset in answers of TransformersReader (for consistency with FARMReader) #204
- Adjust to farm handling of no answer #170
DocumentStores
- Move document_name attribute to meta #217
- Remove meta field when writing documents in Elasticsearch #240
- Harmonize meta data handling across doc stores #214
- Add filtering by tags for InMemoryDocumentStore #108
- Make FAQ question field customizable #146
- Increase timeout for Elasticsearch bulk indexing #119
- Add embedding query for InMemoryDocumentStore #112
- Increase timeout for bulk indexing in ES #130
- Add custom port to ElasticsearchDocumentStore #129
- Remove hard-coded embedding field #107
REST API
- Move out REST API from PyPI package #160
- Fix format of /export-doc-qa-feedback to comply with SQuAD #241
- Create file upload directory in the REST API #166
- Add API endpoint to upload files #154
- Missing PORT and SCHEME for elasticsearch to run the API #134
- Add EMBEDDING_MODEL_FORMAT in API config #152
- Add success response for successful file upload API #195
- Add response time in logs #201
- Fix rest api in Docker image after refactoring #178
Other
- Upgrade to new FARM / Transformers / PyTorch versions #212
- Fix Evaluation Dataset #233
- Remove mutation of documents in write_documents() #231
- Remove mutation of results dict in print_answers() #230
- Make doc name optional #100
- Fix Dockerfile to build successfully without models directory #210
- Docker available for TransformsReader Class #180
- Fix embedding method in FAQ-QA Tutorial #220
- Add more tests #213
- Update docstring for embedding_field and embedding_dim #208
- Make "meta" field generic for Document Schema #102
- Update tutorials #200
- Upgrade FARM version #172
- Fix for installing PyTorch on Windows OS #159
- Remove Literal type hint #156
- Remove PyMuPDF dependency #148
- Add missing type hints #138
- Add a GitHub Action to start Elasticsearch instance for Build workflow #142
- Correct field in evaluation tutorial #139
- Update Haystack version in tutorials #136
- Fix evaluation #132
- Add stalebot #131
- Add Reader/Retriever validations in Finder #113
- Add document metadata for FAQ style QA #106
- Add basic tutorial for FAQ-based QA & batch comp. of embeddings #98
- Make saving more explicit in tutorial #95
Thanks to all contributors for working on this and shaping Haystack together: @skirdey @guillim @antoniolanza1996 @F4r1n @arthurbarros @elyase @anirbansaha96 @Timoeller @bogdankostic @tanaysoni @brandenchan
0.2.1
In our release notes, we will always highlight a few important changes first and list the detailed PRs below.
🎉 First release
Happy to announce our first proper release incl. many of the features that we found absolutely crucial for a QA system. While we still have countless exciting features on our roadmap, we are confident that this version already significantly accelerates the development phase of your QA systems, and it was also tested successfully in the first production deployments.
From now on, we will switch to a more regular release cycle.
📜 ElasticsearchDocumentStore
We recommend the new ElasticsearchDocumentStore for all production deployments. While we will keep more light-weight options (SQL, In-Memory) for easy prototyping, new features will be implemented first for Elasticsearch.
🚀 New Retrievers
Besides plain TF-IDF (in memory), we introduced the ElasticsearchRetriever that supports Elasticsearch's native scoring (BM25) or custom queries (e.g. using boosting).
As a further option, we also added the EmbeddingRetriever that encodes texts into embeddings (e.g. via Sentence-BERT) and retrieves via cosine similarity. Especially the latter is very promising, and you will likely see more features in this direction.
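A minimal sketch of the embedding-based option (the model name is only an example; assumes an existing document_store with indexed texts and embeddings, and the exact parameters may differ in this early version):
retriever = EmbeddingRetriever(document_store=document_store,
                               embedding_model="deepset/sentence_bert")
results = retriever.retrieve(query="How can I change my address?")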
⁉️ FAQ-style QA
Besides extractive QA, you can now also index existing question-answer pairs (e.g. from FAQs) and find answers by matching the incoming user question with the indexed questions and returning the related answer from that pair. This can be an interesting alternative or addition to extractive QA if you already have huge collections of FAQs and/or need a solution that works with low computational resources.
🔁 Modular API based on FastAPI
We changed the basic REST API from Flask to FastAPI and modularized it.
You can now:
- search answers in texts (extractive QA)
- search answers by comparing user question to existing questions (FAQ-style QA)
- collect & export user feedback on answers to gain domain-specific training data (feedback)
- do basic monitoring of requests (currently via APM in Kibana)
Detailed changes:
Document Stores
- Add Elasticsearch Datastore #13
- Refactor database layer #10
- Add test for Elasticsearch document store #88
- Make filters optional for Elasticsearch query #80
- Inmemory store #76
- Fix get_all_documents() in ElasticsearchDocumentStore #77
- Fix get_all_documents query for Elasticsearch #21
Retrievers
- Add FAQ-style QA #44
- added option for custom elasticsearch queries and filters #52
- More flexbile es config & support for filters #29
- Add more ES connection params #35
- Simplify Retriever query #73
- Refactor ElasticsearchRetriever into separate class #72
- Add params to create_embeddings in retriever #45
- fix scaling of pseudo probs for es scores. fix filtering of embedding retrieval #46
- Fixing doc_name for TFIDF Retriever #33
Readers
- Refactor pipeline for better generalizability & Add TransformersReader #1
- Add method to train a reader on custom data #5
- Add no answer handling #26
- Add no_answer option to results #24
- Fix offsets in reader #4
- FARMReader.train() now takes default values from FARMReader #47
- Update inferencer args (num_processes, chunksize) to latest FARM version #54
- update readme & rename arg in TransformersReader for consistency #86
- Fixing typo in transformer. use_gpu provides ordinal of the gpu, not … #83
- Add document_id with Transformers Reader #60
- Make eval during reader.train() more verbose #28
- Removed "document_name" from farm.py #31
- Add a document_name field in answers #30
REST API / Deployment
- Move API from flask to fastAPI #3
- Modularize API components #55
- Return more meta data & restructure reponse format #66
- Log API responses in APM #70
- Make Elastic-APM optional #65
- Update Python version in Dockerfile-GPU #71
- Update Dockerfiles to use Gunicorn for deployment #69
- Add limit on concurrent requests for doc-qa #64
- Add Docker Images for running Haystack #85
- Fix cyclic import of Elasticsearch client #59
- Add Feedback export API #56
- Add gpu dockerfile, improve logging, fix minor bug with filtering #36
- Improve deployment of REST API (Configs, logging, minor bugs) #40
Others
- Standardize Finder, Readers, and Retriever interfaces #62
- pin haystack version in tutorials until release #87
- Update tutorials to use Elasticsearch, new Retrievers #79
- Adding coverage reports and a few more tests #78
- Added Jupyter notebooks of Tutorials #43
- Add minimal tutorial for ES #19
- Update tutorials #12
Thanks to all contributors for your great work 👏
@tanaysoni, @Timoeller , @brandenchan, @bogdankostic , @skirdey , @stedomedo , @karthik19967829 , @aadil-srivastava01 , @tholor
Initial Release
0.1.0