Cleaning the api docs #616

Merged (1 commit) on Nov 24, 2020

532 changes: 279 additions & 253 deletions docs/_src/api/api/document_store.md

Large diffs are not rendered by default.

122 changes: 61 additions & 61 deletions docs/_src/api/api/file_converter.md
@@ -1,38 +1,8 @@
<a name="pdf"></a>
# pdf

<a name="pdf.PDFToTextConverter"></a>
## PDFToTextConverter

```python
class PDFToTextConverter(BaseConverter)
```

<a name="pdf.PDFToTextConverter.__init__"></a>
#### \_\_init\_\_

```python
| __init__(remove_numeric_tables: Optional[bool] = False, valid_languages: Optional[List[str]] = None)
```

**Arguments**:

- `remove_numeric_tables`: This option uses heuristics to remove numeric rows from the tables.
The tabular structures in documents might be noise for the reader model if it
does not have table parsing capability for finding answers. However, tables
may also have long strings that could be possible candidates for answers.
Rows containing strings are therefore retained when this option is enabled.
- `valid_languages`: Validate the language of the extracted text against a list of languages specified in the ISO 639-1
(https://en.wikipedia.org/wiki/ISO_639-1) format.
This option can be used to add a check for encoding errors. If the extracted text is
not in one of the valid languages, it is likely the result of an encoding error
that produced garbled text.

<a name="txt"></a>
# txt
# Module txt

<a name="txt.TextConverter"></a>
## TextConverter
## TextConverter Objects

```python
class TextConverter(BaseConverter)
```

@@ -77,11 +47,36 @@

Reads text from a txt file and executes optional preprocessing steps.

Dict of format {"text": "The text from file", "meta": meta}
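
A minimal usage sketch (the import path and `sample.txt` are assumptions for illustration):

```python
from pathlib import Path

from haystack.file_converter.txt import TextConverter  # assumed import path

converter = TextConverter(remove_numeric_tables=False, valid_languages=["en"])
doc = converter.convert(file_path=Path("sample.txt"))  # placeholder file
print(doc["text"][:100])  # dict of format {"text": ..., "meta": ...}
```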

<a name="docx"></a>
# Module docx

<a name="docx.DocxToTextConverter"></a>
## DocxToTextConverter Objects

```python
class DocxToTextConverter(BaseConverter)
```

<a name="docx.DocxToTextConverter.convert"></a>
#### convert

```python
| convert(file_path: Path, meta: Optional[Dict[str, str]] = None) -> Dict[str, Any]
```

Extract text from a .docx file.
Note: As .docx files don't contain "page" information, we actually extract and return a list of paragraphs here.
For consistency with the other converters we nevertheless kept the method name.

**Arguments**:

- `file_path`: Path to the .docx file you want to convert
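
A minimal usage sketch along the same lines (the import path and file name are assumptions):

```python
from pathlib import Path

from haystack.file_converter.docx import DocxToTextConverter  # assumed import path

converter = DocxToTextConverter()
result = converter.convert(file_path=Path("report.docx"))  # placeholder file
# Per the note above, the text is assembled from paragraphs rather than pages
print(result["text"][:100])
```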

<a name="tika"></a>
# tika
# Module tika

<a name="tika.TikaConverter"></a>
## TikaConverter
## TikaConverter Objects

```python
class TikaConverter(BaseConverter)
```

@@ -123,36 +118,11 @@

in garbled text.

a list of pages and the extracted meta data of the file.
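
A minimal usage sketch; the converter talks to an external Apache Tika server, so the endpoint below is an assumption about the local setup:

```python
from pathlib import Path

from haystack.file_converter.tika import TikaConverter  # assumed import path

# Assumes a Tika server is running locally on the conventional port
converter = TikaConverter(tika_url="http://localhost:9998/tika")
result = converter.convert(file_path=Path("sample.pdf"))  # placeholder file
# result holds the extracted pages plus the file's meta data (see above)
```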

<a name="docx"></a>
# docx

<a name="docx.DocxToTextConverter"></a>
## DocxToTextConverter

```python
class DocxToTextConverter(BaseConverter)
```

<a name="docx.DocxToTextConverter.convert"></a>
#### convert

```python
| convert(file_path: Path, meta: Optional[Dict[str, str]] = None) -> Dict[str, Any]
```

Extract text from a .docx file.
Note: As .docx files don't contain "page" information, we actually extract and return a list of paragraphs here.
For consistency with the other converters we nevertheless kept the method name.

**Arguments**:

- `file_path`: Path to the .docx file you want to convert

<a name="base"></a>
# base
# Module base

<a name="base.BaseConverter"></a>
## BaseConverter
## BaseConverter Objects

```python
class BaseConverter()
```

@@ -207,3 +177,33 @@

supplied meta data like author, url, external IDs can be supplied as a dictionary.

Validate whether the language of the text is one of the valid languages.
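
The check itself can be sketched with a language detector; a minimal version (the `langdetect` dependency is an assumption for illustration, not necessarily what the class uses internally):

```python
from typing import List, Optional

from langdetect import detect  # assumed dependency
from langdetect.lang_detect_exception import LangDetectException


def validate_language(text: str, valid_languages: Optional[List[str]]) -> bool:
    """Return True if the detected ISO 639-1 code is among valid_languages."""
    if not valid_languages:
        return True
    try:
        return detect(text) in valid_languages
    except LangDetectException:  # raised for empty or undetectable input
        return False
```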

<a name="pdf"></a>
# Module pdf

<a name="pdf.PDFToTextConverter"></a>
## PDFToTextConverter Objects

```python
class PDFToTextConverter(BaseConverter)
```

<a name="pdf.PDFToTextConverter.__init__"></a>
#### \_\_init\_\_

```python
| __init__(remove_numeric_tables: Optional[bool] = False, valid_languages: Optional[List[str]] = None)
```

**Arguments**:

- `remove_numeric_tables`: This option uses heuristics to remove numeric rows from the tables.
The tabular structures in documents might be noise for the reader model if it
does not have table parsing capability for finding answers. However, tables
may also have long strings that could be possible candidates for answers.
Rows containing strings are therefore retained when this option is enabled.
- `valid_languages`: Validate the language of the extracted text against a list of languages specified in the ISO 639-1
(https://en.wikipedia.org/wiki/ISO_639-1) format.
This option can be used to add a check for encoding errors. If the extracted text is
not in one of the valid languages, it is likely the result of an encoding error
that produced garbled text.
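
A minimal usage sketch (the import path and file name are assumptions; PDF extraction also requires a suitable system-level text-extraction backend to be installed):

```python
from pathlib import Path

from haystack.file_converter.pdf import PDFToTextConverter  # assumed import path

# remove_numeric_tables and valid_languages behave as documented above
converter = PDFToTextConverter(remove_numeric_tables=True, valid_languages=["en", "de"])
doc = converter.convert(file_path=Path("sample.pdf"))  # placeholder file
```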

137 changes: 137 additions & 0 deletions docs/_src/api/api/generator.md
@@ -0,0 +1,137 @@
<a name="transformers"></a>
# Module transformers

<a name="transformers.RAGenerator"></a>
## RAGenerator Objects

```python
class RAGenerator(BaseGenerator)
```

Implementation of Facebook's Retrieval-Augmented Generator (https://arxiv.org/abs/2005.11401) based on
HuggingFace's transformers (https://huggingface.co/transformers/model_doc/rag.html).

Instead of "finding" the answer within a document, these models **generate** the answer.
In that sense, RAG follows a similar approach to GPT-3, but it comes with two huge advantages
for real-world applications:
a) it has a manageable model size
b) the answer generation is conditioned on retrieved documents,
i.e. the model can easily adjust to domain documents even after training has finished
(in contrast: GPT-3 relies on the web data seen during training)

**Example**

```python
> question = "who got the first nobel prize in physics?"

> # Retrieve related documents from retriever
> retrieved_docs = retriever.retrieve(query=question)

> # Now generate answer from question and retrieved documents
> generator.predict(
> question=question,
> documents=retrieved_docs,
> top_k=1
> )
{'question': 'who got the first nobel prize in physics',
'answers':
[{'question': 'who got the first nobel prize in physics',
'answer': ' albert einstein',
'meta': { 'doc_ids': [...],
'doc_scores': [80.42758 ...],
'doc_probabilities': [40.71379089355469, ...
'texts': ['Albert Einstein was a ...]
'titles': ['"Albert Einstein"', ...]
}}]}
```

<a name="transformers.RAGenerator.__init__"></a>
#### \_\_init\_\_

```python
| __init__(model_name_or_path: str = "facebook/rag-token-nq", retriever: Optional[DensePassageRetriever] = None, generator_type: RAGeneratorType = RAGeneratorType.TOKEN, top_k_answers: int = 2, max_length: int = 200, min_length: int = 2, num_beams: int = 2, embed_title: bool = True, prefix: Optional[str] = None, use_gpu: bool = True)
```

Load a RAG model from Transformers along with passage_embedding_model.
See https://huggingface.co/transformers/model_doc/rag.html for more details.

**Arguments**:

- `model_name_or_path`: Directory of a saved model or the name of a public model e.g.
'facebook/rag-token-nq', 'facebook/rag-sequence-nq'.
See https://huggingface.co/models for full list of available models.
- `retriever`: `DensePassageRetriever` used to embed passages
- `generator_type`: Which RAG generator implementation to use (RAG-TOKEN or RAG-SEQUENCE)
- `top_k_answers`: Number of independently generated answers to return
- `max_length`: Maximum length of generated text
- `min_length`: Minimum length of generated text
- `num_beams`: Number of beams for beam search. 1 means no beam search.
- `embed_title`: Embed the title of the passage while generating its embedding
- `prefix`: The prefix used by the generator's tokenizer.
- `use_gpu`: Whether to use GPU (if available)
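
A sketch of wiring the generator to a retriever (the import paths are assumptions, and the document store is assumed to be filled with indexed passages elsewhere):

```python
from haystack.document_store.memory import InMemoryDocumentStore  # assumed import path
from haystack.generator.transformers import RAGenerator, RAGeneratorType  # assumed import path
from haystack.retriever.dense import DensePassageRetriever  # assumed import path

document_store = InMemoryDocumentStore()  # assumed to already hold passages
retriever = DensePassageRetriever(document_store=document_store)
generator = RAGenerator(
    model_name_or_path="facebook/rag-token-nq",
    retriever=retriever,
    generator_type=RAGeneratorType.TOKEN,
    top_k_answers=1,
    max_length=200,
)
```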

<a name="transformers.RAGenerator.predict"></a>
#### predict

```python
| predict(question: str, documents: List[Document], top_k: Optional[int] = None) -> Dict
```

Generate the answer to the input question. The generation will be conditioned on the supplied documents.
These documents can, for example, be retrieved via the Retriever.

**Arguments**:

- `question`: Question
- `documents`: Related documents (e.g. coming from a retriever) that the answer shall be conditioned on.
- `top_k`: Number of returned answers

**Returns**:

Generated answers plus additional info in a dict like this:

```python
> {'question': 'who got the first nobel prize in physics',
> 'answers':
> [{'question': 'who got the first nobel prize in physics',
> 'answer': ' albert einstein',
> 'meta': { 'doc_ids': [...],
> 'doc_scores': [80.42758 ...],
> 'doc_probabilities': [40.71379089355469, ...
> 'texts': ['Albert Einstein was a ...]
> 'titles': ['"Albert Einstein"', ...]
> }}]}
```

<a name="base"></a>
# Module base

<a name="base.BaseGenerator"></a>
## BaseGenerator Objects

```python
class BaseGenerator(ABC)
```

Abstract class for Generators

<a name="base.BaseGenerator.predict"></a>
#### predict

```python
| @abstractmethod
| predict(question: str, documents: List[Document], top_k: Optional[int]) -> Dict
```

Abstract method to generate answers.

**Arguments**:

- `question`: Question
- `documents`: Related documents (e.g. coming from a retriever) that the answer shall be conditioned on.
- `top_k`: Number of returned answers

**Returns**:

Generated answers plus additional info in a dict
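
Concrete generators implement this interface; a toy sketch (the class and its echo-style logic are purely illustrative, and the import paths are assumptions):

```python
from typing import Dict, List, Optional

from haystack import Document  # assumed import path
from haystack.generator.base import BaseGenerator  # assumed import path


class EchoGenerator(BaseGenerator):
    """Toy generator that 'answers' with a snippet of each supplied document."""

    def predict(self, question: str, documents: List[Document], top_k: Optional[int] = None) -> Dict:
        top_k = top_k or 1
        answers = [
            {"question": question, "answer": doc.text[:50], "meta": {"doc_ids": [doc.id]}}
            for doc in documents[:top_k]
        ]
        return {"question": question, "answers": answers}
```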

82 changes: 41 additions & 41 deletions docs/_src/api/api/preprocessor.md
@@ -1,5 +1,44 @@
<a name="preprocessor"></a>
# Module preprocessor

<a name="preprocessor.PreProcessor"></a>
## PreProcessor Objects

```python
class PreProcessor(BasePreProcessor)
```

<a name="preprocessor.PreProcessor.__init__"></a>
#### \_\_init\_\_

```python
| __init__(clean_whitespace: Optional[bool] = True, clean_header_footer: Optional[bool] = False, clean_empty_lines: Optional[bool] = True, split_by: Optional[str] = "word", split_length: Optional[int] = 1000, split_stride: Optional[int] = None, split_respect_sentence_boundary: Optional[bool] = True)
```

**Arguments**:

- `clean_header_footer`: Use heuristic to remove footers and headers across different pages by searching
for the longest common string. This heuristic uses exact matches and therefore
works well for footers like "Copyright 2019 by XXX", but won't detect "Page 3 of 4"
or similar.
- `clean_whitespace`: Strip whitespace before and after each line in the text.
- `clean_empty_lines`: Remove runs of more than two consecutive empty lines from the text.
- `split_by`: Unit for splitting the document. Can be "word", "sentence", or "passage". Set to None to disable splitting.
- `split_length`: Max. number of the above split unit (e.g. words) that are allowed in one document. For instance, if split_length -> 10 & split_by ->
"sentence", then each output document will have 10 sentences.
- `split_stride`: Length of striding window over the splits. For example, if split_by -> `word`,
split_length -> 5 & split_stride -> 2, then the splits would be like:
[w1 w2 w3 w4 w5, w4 w5 w6 w7 w8, w7 w8 w9 w10 w11].
Set the value to None to disable striding behaviour.
- `split_respect_sentence_boundary`: Whether to avoid splitting inside a sentence when split_by -> `word`. If set
to True, each individual split will always contain complete sentences &
the number of words will be <= split_length.
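
A sketch of the striding split described above (the import path and the `process` call are assumptions; check the module for the exact entry point):

```python
from haystack.preprocessor.preprocessor import PreProcessor  # assumed import path

processor = PreProcessor(
    split_by="word",
    split_length=5,
    split_stride=2,
    split_respect_sentence_boundary=False,
)
doc = {"text": "w1 w2 w3 w4 w5 w6 w7 w8 w9 w10 w11", "meta": {}}
splits = processor.process(doc)  # expected overlap: [w1..w5], [w4..w8], [w7..w11]
```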

<a name="cleaning"></a>
# Module cleaning

<a name="utils"></a>
# utils
# Module utils

<a name="utils.eval_data_from_file"></a>
#### eval\_data\_from\_file
@@ -84,45 +123,6 @@

Fetch an archive (zip or tar.gz) from a url via http and extract content to an output directory.

bool if anything got fetched

<a name="preprocessor"></a>
# preprocessor

<a name="preprocessor.PreProcessor"></a>
## PreProcessor

```python
class PreProcessor(BasePreProcessor)
```

<a name="preprocessor.PreProcessor.__init__"></a>
#### \_\_init\_\_

```python
| __init__(clean_whitespace: Optional[bool] = True, clean_header_footer: Optional[bool] = False, clean_empty_lines: Optional[bool] = True, split_by: Optional[str] = "word", split_length: Optional[int] = 1000, split_stride: Optional[int] = None, split_respect_sentence_boundary: Optional[bool] = True)
```

**Arguments**:

- `clean_header_footer`: Use heuristic to remove footers and headers across different pages by searching
for the longest common string. This heuristic uses exact matches and therefore
works well for footers like "Copyright 2019 by XXX", but won't detect "Page 3 of 4"
or similar.
- `clean_whitespace`: Strip whitespace before and after each line in the text.
- `clean_empty_lines`: Remove runs of more than two consecutive empty lines from the text.
- `split_by`: Unit for splitting the document. Can be "word", "sentence", or "passage". Set to None to disable splitting.
- `split_length`: Max. number of the above split unit (e.g. words) that are allowed in one document. For instance, if split_length -> 10 & split_by ->
"sentence", then each output document will have 10 sentences.
- `split_stride`: Length of striding window over the splits. For example, if split_by -> `word`,
split_length -> 5 & split_stride -> 2, then the splits would be like:
[w1 w2 w3 w4 w5, w4 w5 w6 w7 w8, w7 w8 w9 w10 w11].
Set the value to None to disable striding behaviour.
- `split_respect_sentence_boundary`: Whether to avoid splitting inside a sentence when split_by -> `word`. If set
to True, each individual split will always contain complete sentences &
the number of words will be <= split_length.

<a name="base"></a>
# base

<a name="cleaning"></a>
# cleaning
# Module base

5 changes: 4 additions & 1 deletion docs/_src/api/api/pydoc-markdown-document-store.yml
@@ -10,5 +10,8 @@
processor:
- skip_empty_modules: true
renderer:
type: markdown
descriptive_class_title: false
descriptive_class_title: true
descriptive_module_title: true
add_method_class_prefix: false
add_member_class_prefix: false
filename: document_store.md