<a name="transformers"></a> | ||
# Module transformers | ||
|
||
<a name="transformers.RAGenerator"></a> | ||
## RAGenerator Objects | ||
|
||
```python | ||
class RAGenerator(BaseGenerator) | ||
``` | ||

Implementation of Facebook's Retrieval-Augmented Generator (https://arxiv.org/abs/2005.11401) based on
HuggingFace's transformers (https://huggingface.co/transformers/model_doc/rag.html).

Instead of "finding" the answer within a document, these models **generate** the answer.
In that sense, RAG follows a similar approach to GPT-3, but it comes with two huge advantages
for real-world applications:
a) it has a manageable model size
b) the answer generation is conditioned on retrieved documents,
i.e. the model can easily adjust to domain documents even after training has finished
(in contrast: GPT-3 relies on the web data seen during training)

**Example**

```python
> question = "who got the first nobel prize in physics?"

> # Retrieve related documents from the retriever
> retrieved_docs = retriever.retrieve(query=question)

> # Now generate an answer from the question and the retrieved documents
> generator.predict(
>    question=question,
>    documents=retrieved_docs,
>    top_k=1
> )
{'question': 'who got the first nobel prize in physics',
 'answers':
    [{'question': 'who got the first nobel prize in physics',
      'answer': ' albert einstein',
      'meta': {'doc_ids': [...],
               'doc_scores': [80.42758 ...],
               'doc_probabilities': [40.71379089355469, ...
               'texts': ['Albert Einstein was a ...]
               'titles': ['"Albert Einstein"', ...]
               }}]}
```
<a name="transformers.RAGenerator.__init__"></a> | ||
#### \_\_init\_\_ | ||
|
||
```python | ||
| __init__(model_name_or_path: str = "facebook/rag-token-nq", retriever: Optional[DensePassageRetriever] = None, generator_type: RAGeneratorType = RAGeneratorType.TOKEN, top_k_answers: int = 2, max_length: int = 200, min_length: int = 2, num_beams: int = 2, embed_title: bool = True, prefix: Optional[str] = None, use_gpu: bool = True) | ||
``` | ||
|
||
Load a RAG model from Transformers along with passage_embedding_model. | ||
See https://huggingface.co/transformers/model_doc/rag.html for more details | ||
|
||

**Arguments**:

- `model_name_or_path`: Directory of a saved model or the name of a public model, e.g.
  'facebook/rag-token-nq' or 'facebook/rag-sequence-nq'.
  See https://huggingface.co/models for a full list of available models.
- `retriever`: `DensePassageRetriever` used to embed passages.
- `generator_type`: Which RAG generator implementation to use: RAG-TOKEN or RAG-SEQUENCE.
- `top_k_answers`: Number of independently generated texts to return.
- `max_length`: Maximum length of generated text.
- `min_length`: Minimum length of generated text.
- `num_beams`: Number of beams for beam search. 1 means no beam search.
- `embed_title`: Whether to embed the title of the passage while generating its embedding.
- `prefix`: The prefix used by the generator's tokenizer.
- `use_gpu`: Whether to use GPU (if available).
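
A minimal construction sketch (illustrative only, not part of the generated reference; the import path below is an assumption and may differ in your installation):

```python
# Illustrative sketch -- the import path is an assumption, adjust it to your installation.
from haystack.generator.transformers import RAGenerator, RAGeneratorType

generator = RAGenerator(
    model_name_or_path="facebook/rag-token-nq",  # public model from the HuggingFace model hub
    generator_type=RAGeneratorType.TOKEN,        # token-level RAG (default); a sequence-level variant also exists per the docs above
    top_k_answers=2,     # number of independently generated answers to return
    max_length=200,      # maximum length of generated text
    min_length=2,        # minimum length of generated text
    num_beams=2,         # beam search width (1 disables beam search)
    embed_title=True,    # embed passage titles together with their text
    use_gpu=True,        # use a GPU if one is available
)
```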
<a name="transformers.RAGenerator.predict"></a> | ||
#### predict | ||
|
||
```python | ||
| predict(question: str, documents: List[Document], top_k: Optional[int] = None) -> Dict | ||
``` | ||
|
||
Generate the answer to the input question. The generation will be conditioned on the supplied documents. | ||
These document can for example be retrieved via the Retriever. | ||
|
||

**Arguments**:

- `question`: Question
- `documents`: Related documents (e.g. coming from a retriever) that the answer shall be conditioned on.
- `top_k`: Number of returned answers

**Returns**:

Generated answers plus additional info in a dict like this:

```python
{'question': 'who got the first nobel prize in physics',
 'answers':
    [{'question': 'who got the first nobel prize in physics',
      'answer': ' albert einstein',
      'meta': {'doc_ids': [...],
               'doc_scores': [80.42758 ...],
               'doc_probabilities': [40.71379089355469, ...
               'texts': ['Albert Einstein was a ...]
               'titles': ['"Albert Einstein"', ...]
               }}]}
```
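
To make the structure above concrete, a short illustrative snippet (not part of the generated reference) that consumes the returned dict, using only the keys documented above, could look like this:

```python
# Illustrative only: 'question' and 'retrieved_docs' are assumed to exist as in the class-level example above.
result = generator.predict(question=question, documents=retrieved_docs, top_k=1)

for answer in result["answers"]:
    print(answer["answer"])           # e.g. ' albert einstein'
    print(answer["meta"]["doc_ids"])  # ids of the documents the answer was conditioned on
    print(answer["meta"]["titles"])   # titles of those documents
```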
<a name="base"></a> | ||
# Module base | ||
|
||
<a name="base.BaseGenerator"></a> | ||
## BaseGenerator Objects | ||
|
||
```python | ||
class BaseGenerator(ABC) | ||
``` | ||
|
||
Abstract class for Generators | ||
|
||
<a name="base.BaseGenerator.predict"></a> | ||
#### predict | ||
|
||
```python | ||
| @abstractmethod | ||
| predict(question: str, documents: List[Document], top_k: Optional[int]) -> Dict | ||
``` | ||
|
||
Abstract method to generate answers. | ||
|
||
**Arguments**: | ||
|
||
- `question`: Question | ||
- `documents`: Related documents (e.g. coming from a retriever) that the answer shall be conditioned on. | ||
- `top_k`: Number of returned answers | ||
|
||
**Returns**: | ||
|
||
Generated answers plus additional infos in a dict | ||
|
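
As a rough illustration of this contract (not part of the generated reference), a custom generator could subclass `BaseGenerator` roughly as sketched below; the import paths, the `Document.text`/`Document.id` attributes, and the trivial "echo" logic are assumptions for demonstration only:

```python
# Hypothetical sketch of a custom generator -- imports and Document attributes are assumptions.
from typing import Dict, List, Optional

from haystack import Document
from haystack.generator.base import BaseGenerator


class EchoGenerator(BaseGenerator):
    """Toy generator that 'answers' with the text of the top retrieved documents."""

    def predict(self, question: str, documents: List[Document], top_k: Optional[int] = None) -> Dict:
        top_k = top_k or 1
        answers = [
            {"question": question, "answer": doc.text, "meta": {"doc_ids": [doc.id]}}
            for doc in documents[:top_k]
        ]
        return {"question": question, "answers": answers}
```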