0.3.0
🔍 Dense Passage Retrieval
We are glad to introduce the new Dense Passage Retriever (DPR).
Dense embeddings are a powerful alternative for scoring the similarity of texts. This retriever uses two BERT models: one to embed your query and one to embed your passage. This dual-encoder architecture copes much better with the different nature of queries and passages (length, syntax, ...). It was published by Karpukhin et al. and shows impressive performance, especially when there is no direct token overlap between your queries and your texts.
retriever = DensePassageRetriever(document_store=document_store,
embedding_model="dpr-bert-base-nq",
do_lower_case=True, use_gpu=True)
retriever.retrieve(query="What is cosine similarity?")
# returns: [Document, Document]
See Tutorial 6 for more details
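To make the dual-encoder idea concrete, here is a minimal sketch of the scoring step only. It assumes the query and the passages have already been embedded by their respective encoders and ranks passages by inner product, the similarity used in the DPR paper. The vectors, the ids, and the helper `rank_passages` are made up for illustration and are not part of Haystack.

```python
from typing import Dict, List, Tuple


def dot(a: List[float], b: List[float]) -> float:
    """Inner product of two equally sized vectors."""
    return sum(x * y for x, y in zip(a, b))


def rank_passages(
    query_vec: List[float],
    passage_vecs: Dict[str, List[float]],
    top_k: int,
) -> List[Tuple[str, float]]:
    """Hypothetical helper: score every passage against the query, best first."""
    scored = [(pid, dot(query_vec, vec)) for pid, vec in passage_vecs.items()]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:top_k]


# Toy 3-dimensional embeddings (assumption: real DPR vectors are much larger
# and come from the passage encoder and the query encoder respectively).
passage_vecs = {
    "p1": [0.9, 0.1, 0.0],
    "p2": [0.1, 0.8, 0.1],
    "p3": [0.0, 0.2, 0.9],
}
query_vec = [0.8, 0.2, 0.1]

top = rank_passages(query_vec, passage_vecs, top_k=2)
print([pid for pid, _ in top])  # ['p1', 'p2']
```

Because passages are embedded independently of the query, their vectors can be computed once at indexing time; only the query has to be embedded at search time.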
📊 Evaluation
We introduce the option to evaluate your reader, retriever, and the combination of both. While there's usually a good understanding of the reader's performance, the interplay with the retriever is what really matters in practice. You want to answer: Is my retriever a bottleneck? Is it worth increasing top_k for the retriever? How do different retrievers compare in performance? What is the effect on speed?
The new eval() is a first step towards answering those questions and gives a comprehensive picture of your pipeline. Stay tuned for more enhancements here.
document_store.add_eval_data("../data/nq/nq_dev_subset_v2.json")
...
retriever.eval(top_k=10)
reader.eval(document_store=document_store, device=device)
finder.eval(top_k_retriever=10, top_k_reader=10)
See Tutorial 5 for more details
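One way to reason about the "is it worth increasing top_k" question is to look at retriever recall for several values of k. The sketch below is a standalone illustration with made-up data; `recall_at_k` is a hypothetical helper, not the metric code behind eval(), and it assumes exactly one gold document per question.

```python
from typing import Dict, List


def recall_at_k(
    retrieved: Dict[str, List[str]],
    relevant: Dict[str, str],
    k: int,
) -> float:
    """Fraction of questions whose gold document id is among the top k results."""
    hits = sum(
        1 for question, gold in relevant.items()
        if gold in retrieved.get(question, [])[:k]
    )
    return hits / len(relevant)


# Ranked document ids per question (toy data, best match first).
retrieved = {
    "q1": ["d3", "d1", "d7"],
    "q2": ["d2", "d5", "d9"],
    "q3": ["d8", "d6", "d4"],
}
# Gold document id per question.
relevant = {"q1": "d1", "q2": "d2", "q3": "d4"}

for k in (1, 2, 3):
    print(k, recall_at_k(retrieved, relevant, k))
```

In this toy data recall climbs from 1/3 at k=1 to 1.0 at k=3. If recall stops growing well before your current top_k, a larger value mostly adds reader latency without surfacing more answers.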
📄 Basic Support for PDF and Docx Files
You can now index PDF and docx files more easily to your DocumentStore. We introduce a new BaseConverter class that offers basic cleaning functions (e.g. removing footers or tables). Its file-format-specific child classes (e.g. PDFToTextConverter) handle the actual extraction of the text.
#PDF
from haystack.indexing.file_converters.pdf import PDFToTextConverter
converter = PDFToTextConverter(remove_header_footer=True, remove_numeric_tables=True, valid_languages=["de","en"])
pages = converter.extract_pages(file_path=file)
# => list of str, one per page
#DOCX
from haystack.indexing.file_converters.docx import DocxToTextConverter
converter = DocxToTextConverter()
paragraphs = converter.extract_pages(file_path=file)
# => list of str, one per paragraph (as docx has no direct notion of pages)
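As an illustration of what a cleaning step such as footer removal can look like on a list of page strings, here is a deliberately simple sketch. It is not the converter's actual implementation: `strip_repeated_footer` is a hypothetical helper that only handles a footer that is the identical last line on every page.

```python
from typing import List


def strip_repeated_footer(pages: List[str]) -> List[str]:
    """Drop the last line of every page when that line is identical on all pages."""
    if len(pages) < 2:
        return pages  # nothing to compare against
    last_lines = [page.rstrip("\n").split("\n")[-1].strip() for page in pages]
    first = last_lines[0]
    if not first or any(line != first for line in last_lines):
        return pages  # no common footer detected, leave pages untouched
    return ["\n".join(page.rstrip("\n").split("\n")[:-1]) for page in pages]


# Toy pages in the same shape as the extracted output: one string per page.
pages = [
    "Intro text\nACME Corp - Confidential",
    "More text\nsecond line\nACME Corp - Confidential",
]
cleaned = strip_repeated_footer(pages)
print(cleaned)  # ['Intro text', 'More text\nsecond line']
```

Footers that contain page numbers differ from page to page, so a real detector needs fuzzier matching than this exact comparison.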
And there's much more that happened ...
Preprocessing
- Added Support for Docx Files #225
- Add PDF parser for indexing #109
- Adjust PDF conversion subprocess for Python v3.6 #194
- Fix boundary condition in detection of header/footer in file converters #165
Retriever
- Refactor DPR for latest transformers version & change init arg gpu -> use_gpu for DPR and EmbeddingRetriever #239
- Add dummy retriever for benchmarking / reader-only settings #235
- Fix id for documents returned by the TfidfRetriever #232
- Tutorial for Dense Passage Retriever #186
- Fix device arg for sentence transformers #124
- Fix embeddings from sentence-transformers (type cast & gpu flags) #121
- Adding metadata to be returned from tfidf retriever #122
Reader
- Add ONNXRuntime support #157
- Fix multi gpu training via Dataparallel #234
- Fix document id missing in farm inference output #174
- Add document meta for Transformer Reader #114
- Fix naming of offset in answers of TransformersReader (for consistency with FARMReader) #204
- Adjust to farm handling of no answer #170
DocumentStores
- Move document_name attribute to meta #217
- Remove meta field when writing documents in Elasticsearch #240
- Harmonize meta data handling across doc stores #214
- Add filtering by tags for InMemoryDocumentStore #108
- Make FAQ question field customizable #146
- Increase timeout for Elasticsearch bulk indexing #119
- Add embedding query for InMemoryDocumentStore #112
- Increase timeout for bulk indexing in ES #130
- Add custom port to ElasticsearchDocumentStore #129
- Remove hard-coded embedding field #107
REST API
- Move out REST API from PyPI package #160
- Fix format of /export-doc-qa-feedback to comply with SQuAD #241
- Create file upload directory in the REST API #166
- Add API endpoint to upload files #154
- Missing PORT and SCHEME for elasticsearch to run the API #134
- Add EMBEDDING_MODEL_FORMAT in API config #152
- Add success response for successful file upload API #195
- Add response time in logs #201
- Fix rest api in Docker image after refactoring #178
Other
- Upgrade to new FARM / Transformers / PyTorch versions #212
- Fix Evaluation Dataset #233
- Remove mutation of documents in write_documents() #231
- Remove mutation of results dict in print_answers() #230
- Make doc name optional #100
- Fix Dockerfile to build successfully without models directory #210
- Docker available for TransformersReader Class #180
- Fix embedding method in FAQ-QA Tutorial #220
- Add more tests #213
- Update docstring for embedding_field and embedding_dim #208
- Make "meta" field generic for Document Schema #102
- Update tutorials #200
- Upgrade FARM version #172
- Fix for installing PyTorch on Windows OS #159
- Remove Literal type hint #156
- Remove PyMuPDF dependency #148
- Add missing type hints #138
- Add a GitHub Action to start Elasticsearch instance for Build workflow #142
- Correct field in evaluation tutorial #139
- Update Haystack version in tutorials #136
- Fix evaluation #132
- Add stalebot #131
- Add Reader/Retriever validations in Finder #113
- Add document metadata for FAQ style QA #106
- Add basic tutorial for FAQ-based QA & batch comp. of embeddings #98
- Make saving more explicit in tutorial #95
Thanks to all contributors for working on this and shaping Haystack together: @skirdey @guillim @antoniolanza1996 @F4r1n @arthurbarros @elyase @anirbansaha96 @Timoeller @bogdankostic @tanaysoni @brandenchan