0.3.0

@tholor tholor released this 16 Jul 12:30

🔍 Dense Passage Retrieval

We're glad to introduce the new Dense Passage Retriever (aka DPR).
Dense text embeddings are a powerful alternative way to score the similarity of texts. This retriever uses two BERT models: one to embed your query and one to embed your passage. This dual-encoder architecture copes much better with the differing nature of queries and texts (length, syntax ...). It was published by Karpukhin et al. and shows impressive performance, especially if there's no direct overlap between tokens in your queries and your texts.

retriever = DensePassageRetriever(document_store=document_store,
                                  embedding_model="dpr-bert-base-nq",
                                  do_lower_case=True, use_gpu=True)
retriever.retrieve(query="What is cosine similarity?")
# returns: [Document, Document]

See Tutorial 6 for more details
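
To make the dual-encoder idea concrete, here is a minimal sketch of the scoring step only. It assumes the query and each passage have already been embedded by their respective encoders; the toy vectors and the helper names `dot` and `rank_passages` are made up for illustration and are not part of Haystack. Passages are ranked by the dot product of their embedding with the query embedding.

```python
from typing import Dict, List, Tuple


def dot(u: List[float], v: List[float]) -> float:
    """Dot product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))


def rank_passages(query_vec: List[float],
                  passage_vecs: Dict[str, List[float]],
                  top_k: int = 2) -> List[Tuple[str, float]]:
    """Score every passage embedding against the query embedding and
    return the top_k (passage_id, score) pairs, highest score first."""
    scored = [(pid, dot(query_vec, vec)) for pid, vec in passage_vecs.items()]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:top_k]


# Toy 3-dimensional embeddings (hypothetical; real encoders emit far larger vectors).
query_embedding = [0.9, 0.1, 0.0]
passage_embeddings = {
    "p1": [0.8, 0.2, 0.1],
    "p2": [0.0, 0.1, 0.9],
    "p3": [0.5, 0.5, 0.0],
}

top = rank_passages(query_embedding, passage_embeddings, top_k=2)
print([pid for pid, _ in top])  # ['p1', 'p3']
```

Because query and passage vectors come from separate encoders, no token overlap is needed for a passage to score highly; only the closeness of the two embeddings matters.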

📊 Evaluation

We introduce the option to evaluate your reader, retriever, and the combination of both. While there's usually a good understanding of the reader's performance, the interplay with the retriever is what really matters in practice. You want to answer: Is my retriever a bottleneck? Is it worth increasing top_k for the retriever? How do different retrievers compare in performance? What is the effect on speed?
The new eval() is a first step towards answering those questions and gives a comprehensive picture of your pipeline. Stay tuned for more enhancements here.

document_store.add_eval_data("../data/nq/nq_dev_subset_v2.json")
...
retriever.eval(top_k=10)
reader.eval(document_store=document_store, device=device)
finder.eval(top_k_retriever=10, top_k_reader=10)

See Tutorial 5 for more details
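
As an illustration of the "is it worth increasing top_k?" question, the sketch below computes a simple retriever recall for several values of top_k. It is not the code behind eval(); the function `retriever_recall` and the toy data are assumptions chosen to show the idea: a question counts as a hit if at least one relevant document is among the first top_k retrieved.

```python
from typing import Dict, List, Set


def retriever_recall(retrieved: Dict[str, List[str]],
                     relevant: Dict[str, Set[str]],
                     top_k: int) -> float:
    """Fraction of questions for which at least one relevant document
    appears among the first top_k retrieved documents."""
    if not relevant:
        return 0.0
    hits = 0
    for question, gold_ids in relevant.items():
        candidates = retrieved.get(question, [])[:top_k]
        if any(doc_id in gold_ids for doc_id in candidates):
            hits += 1
    return hits / len(relevant)


# Hypothetical ranked retrieval results and gold labels.
retrieved = {
    "q1": ["d3", "d1", "d7"],
    "q2": ["d4", "d9", "d2"],
}
relevant = {"q1": {"d1"}, "q2": {"d2"}}

for k in (1, 2, 3):
    print(k, retriever_recall(retrieved, relevant, top_k=k))
# 1 0.0
# 2 0.5
# 3 1.0
```

If recall keeps climbing as top_k grows, the retriever is likely the bottleneck, and a larger top_k trades reader speed for a better chance of passing on the right document.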

📄 Basic Support for PDF and Docx Files

You can now index PDF and docx files more easily into your DocumentStore. We introduce a new BaseConverter class that offers basic cleaning functions (e.g. removing footers or tables). Its file-format-specific child classes (e.g. PDFToTextConverter) handle the actual extraction of the text.

# PDF
from haystack.indexing.file_converters.pdf import PDFToTextConverter
converter = PDFToTextConverter(remove_header_footer=True, remove_numeric_tables=True, valid_languages=["de","en"])
pages = converter.extract_pages(file_path=file)
# => list of str, one per page

# DOCX
from haystack.indexing.file_converters.docx import DocxToTextConverter
converter = DocxToTextConverter()
paragraphs = converter.extract_pages(file_path=file)
#  => list of str, one per paragraph (as docx has no direct notion of pages)
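
To give a feel for what a cleaning step such as header/footer removal might involve, here is a rough sketch that works on a list of page strings like the one returned above. It is an assumption for illustration, not Haystack's implementation: the function `strip_repeated_edges` and its `min_share` parameter are hypothetical, and it only catches lines that repeat verbatim.

```python
from collections import Counter
from typing import List


def strip_repeated_edges(pages: List[str], min_share: float = 0.8) -> List[str]:
    """Drop the first/last line of a page when that exact line opens/closes
    at least min_share of all non-empty pages (a likely header or footer)."""
    split = [page.splitlines() for page in pages]
    non_empty = [lines for lines in split if lines]
    if len(non_empty) < 2:
        # With fewer than two pages, repetition cannot be detected.
        return list(pages)

    first = Counter(lines[0] for lines in non_empty)
    last = Counter(lines[-1] for lines in non_empty)
    threshold = min_share * len(non_empty)
    headers = {line for line, n in first.items() if n >= threshold}
    footers = {line for line, n in last.items() if n >= threshold}

    cleaned = []
    for lines in split:
        if lines and lines[0] in headers:
            lines = lines[1:]
        if lines and lines[-1] in footers:
            lines = lines[:-1]
        cleaned.append("\n".join(lines))
    return cleaned


# Hypothetical extracted pages sharing a header and a footer.
pages = [
    "ACME Report\nIntro text\nConfidential",
    "ACME Report\nMore text\nConfidential",
    "ACME Report\nFinal text\nConfidential",
]
print(strip_repeated_edges(pages))  # ['Intro text', 'More text', 'Final text']
```

Real footers often contain changing page numbers, so a practical version would need fuzzier matching than the exact comparison used here.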

And there's much more that happened ...

Preprocessing

  • Added Support for Docx Files #225
  • Add PDF parser for indexing #109
  • Adjust PDF conversion subprocess for Python v3.6 #194
  • Fix boundary condition in detection of header/footer in file converters #165

Retriever

  • Refactor DPR for latest transformers version & change init arg gpu -> use_gpu for DPR and EmbeddingRetriever #239
  • Add dummy retriever for benchmarking / reader-only settings #235
  • Fix id for documents returned by the TfidfRetriever #232
  • Tutorial for Dense Passage Retriever #186
  • Fix device arg for sentence transformers #124
  • Fix embeddings from sentence-transformers (type cast & gpu flags) #121
  • Adding metadata to be returned from tfidf retriever #122

Reader

  • Add ONNXRuntime support #157
  • Fix multi gpu training via Dataparallel #234
  • Fix document id missing in farm inference output #174
  • Add document meta for Transformer Reader #114
  • Fix naming of offset in answers of TransformersReader (for consistency with FARMReader) #204
  • Adjust to farm handling of no answer #170

DocumentStores

  • Move document_name attribute to meta #217
  • Remove meta field when writing documents in Elasticsearch #240
  • Harmonize meta data handling across doc stores #214
  • Add filtering by tags for InMemoryDocumentStore #108
  • Make FAQ question field customizable #146
  • Increase timeout for Elasticsearch bulk indexing #119
  • Add embedding query for InMemoryDocumentStore #112
  • Increase timeout for bulk indexing in ES #130
  • Add custom port to ElasticsearchDocumentStore #129
  • Remove hard-coded embedding field #107

REST API

  • Move out REST API from PyPI package #160
  • Fix format of /export-doc-qa-feedback to comply with SQuAD #241
  • Create file upload directory in the REST API #166
  • Add API endpoint to upload files #154
  • Missing PORT and SCHEME for elasticsearch to run the API #134
  • Add EMBEDDING_MODEL_FORMAT in API config #152
  • Add success response for successful file upload API #195
  • Add response time in logs #201
  • Fix rest api in Docker image after refactoring #178

Other

  • Upgrade to new FARM / Transformers / PyTorch versions #212
  • Fix Evaluation Dataset #233
  • Remove mutation of documents in write_documents() #231
  • Remove mutation of results dict in print_answers() #230
  • Make doc name optional #100
  • Fix Dockerfile to build successfully without models directory #210
  • Docker available for TransformersReader Class #180
  • Fix embedding method in FAQ-QA Tutorial #220
  • Add more tests #213
  • Update docstring for embedding_field and embedding_dim #208
  • Make "meta" field generic for Document Schema #102
  • Update tutorials #200
  • Upgrade FARM version #172
  • Fix for installing PyTorch on Windows OS #159
  • Remove Literal type hint #156
  • Remove PyMuPDF dependency #148
  • Add missing type hints #138
  • Add a GitHub Action to start Elasticsearch instance for Build workflow #142
  • Correct field in evaluation tutorial #139
  • Update Haystack version in tutorials #136
  • Fix evaluation #132
  • Add stalebot #131
  • Add Reader/Retriever validations in Finder #113
  • Add document metadata for FAQ style QA #106
  • Add basic tutorial for FAQ-based QA & batch comp. of embeddings #98
  • Make saving more explicit in tutorial #95

Thanks to all contributors for working on this and shaping Haystack together: @skirdey @guillim @antoniolanza1996 @F4r1n @arthurbarros @elyase @anirbansaha96 @Timoeller @bogdankostic @tanaysoni @brandenchan