Change return types of indexing pipeline nodes #2342

Merged · 37 commits · Mar 29, 2022

Commits
7a7c1cd
Change return types of file converters
bogdankostic Mar 21, 2022
a549ff5
Change return types of preprocessor
bogdankostic Mar 21, 2022
638df72
Change return types of crawler
bogdankostic Mar 21, 2022
d15756f
Adapt utils to functions to new return types
bogdankostic Mar 22, 2022
fbf69c2
Adapt __init__.py to new method names
bogdankostic Mar 22, 2022
c07b622
Prevent circular imports
bogdankostic Mar 22, 2022
fcfc646
Update Documentation & Code Style
github-actions[bot] Mar 22, 2022
dec74a2
Let DocStores' run method accept Documents
bogdankostic Mar 22, 2022
6edd010
Adapt tests to new return types
bogdankostic Mar 22, 2022
44e8ce3
Update Documentation & Code Style
github-actions[bot] Mar 22, 2022
0c8784d
Put "# type: ignore" to right place
bogdankostic Mar 22, 2022
65fc302
Remove id_hash_keys property from Document primitive
bogdankostic Mar 22, 2022
dc44b8d
Update Documentation & Code Style
github-actions[bot] Mar 22, 2022
1b7d066
Adapt tests to new return types and missing id_hash_keys property
bogdankostic Mar 23, 2022
8d9e923
Merge remote-tracking branch 'origin/change_return_types' into change…
bogdankostic Mar 23, 2022
61bedc6
Fix mypy
bogdankostic Mar 23, 2022
b73cb6a
Fix mypy
bogdankostic Mar 23, 2022
0088d36
Adapt PDFToTextOCRConverter
bogdankostic Mar 23, 2022
3d2454c
Remove id_hash_keys from RestAPI tests
bogdankostic Mar 23, 2022
b6e2075
Update Documentation & Code Style
github-actions[bot] Mar 23, 2022
eae97f7
Rename tests
bogdankostic Mar 25, 2022
2d681c7
Remove redundant setting of content_type="text"
bogdankostic Mar 25, 2022
2a67a09
Add DeprecationWarning
bogdankostic Mar 25, 2022
19ea446
Add id_hash_keys to elasticsearch_index_to_document_store
bogdankostic Mar 25, 2022
a3432d0
Change document type from dict to Docuemnt in PreProcessor test
bogdankostic Mar 25, 2022
ec8ccf0
Fix file path in Tutorial 5
bogdankostic Mar 25, 2022
eb0ad8a
Remove added output in Tutorial 5
bogdankostic Mar 25, 2022
1aa48c9
Update Documentation & Code Style
github-actions[bot] Mar 25, 2022
54783cf
Fix file_paths in Tutorial 9 + fix gz files in fetch_archive_from_http
bogdankostic Mar 28, 2022
a028f2a
Adapt tutorials to new return types
bogdankostic Mar 28, 2022
b69d391
Merge remote-tracking branch 'origin/master' into change_return_types
bogdankostic Mar 28, 2022
d2f389d
Adapt tutorial 14 to new return types
bogdankostic Mar 28, 2022
f8b3630
Merge remote-tracking branch 'origin/change_return_types' into change…
bogdankostic Mar 28, 2022
cf065e8
Update Documentation & Code Style
github-actions[bot] Mar 28, 2022
8be88b8
Change assertions to HaystackErrors
bogdankostic Mar 29, 2022
2ab2644
Merge remote-tracking branch 'origin/change_return_types' into change…
bogdankostic Mar 29, 2022
b895772
Import HaystackError correctly
bogdankostic Mar 29, 2022
2 changes: 0 additions & 2 deletions .github/workflows/linux_ci.yml
@@ -298,8 +298,6 @@ jobs:
pip install ui/

- name: Run tests
env:
PINECONE_API_KEY: ${{ secrets.PINECONE_API_KEY }}

Member:

Could you please explain why we have this change here?

Contributor (author):

I forgot to remove this in the Pinecone PR. We don't need the API key here in these tests, as we don't test Pinecone inside this job but inside the `test-pinecone` job, where the API key is already used:

PINECONE_API_KEY: ${{ secrets.PINECONE_API_KEY }}

run: pytest -s ${{ matrix.test-path }}


18 changes: 15 additions & 3 deletions docs/_src/api/api/crawler.md
@@ -27,7 +27,7 @@ Crawl texts from a website so that we can use them later in Haystack as a corpus
#### \_\_init\_\_

```python
def __init__(output_dir: str, urls: Optional[List[str]] = None, crawler_depth: int = 1, filter_urls: Optional[List] = None, overwrite_existing_files=True)
def __init__(output_dir: str, urls: Optional[List[str]] = None, crawler_depth: int = 1, filter_urls: Optional[List] = None, overwrite_existing_files=True, id_hash_keys: Optional[List[str]] = None)
```

Init object with basic params for crawling (can be overwritten later).
@@ -42,13 +42,17 @@ Init object with basic params for crawling (can be overwritten later).
- `filter_urls`: Optional list of regular expressions that the crawled URLs must comply with.
All URLs not matching at least one of the regular expressions will be dropped.
- `overwrite_existing_files`: Whether to overwrite existing files in output_dir with new content
- `id_hash_keys`: Generate the document id from a custom list of strings that refer to the document's
attributes. To make sure you don't have duplicate documents in your DocumentStore when texts are
not unique, you can modify the metadata and pass, for example, `["content", "meta"]` to this field.
In that case the id is generated from both the content and the defined metadata (see the sketch below).
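
A rough usage sketch, not part of this diff and assuming the `haystack.nodes` import path: the new parameter can be passed when the node is constructed. The output directory and depth are placeholders, and actually crawling also needs the Crawler's Selenium/Chrome dependencies:

```python
from haystack.nodes import Crawler

# Placeholder values for illustration only.
crawler = Crawler(
    output_dir="crawled_files",        # where the JSON files get written
    crawler_depth=1,                   # follow links one level deep
    id_hash_keys=["content", "meta"],  # derive document ids from content + metadata
)
```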

<a id="crawler.Crawler.crawl"></a>

#### crawl

```python
def crawl(output_dir: Union[str, Path, None] = None, urls: Optional[List[str]] = None, crawler_depth: Optional[int] = None, filter_urls: Optional[List] = None, overwrite_existing_files: Optional[bool] = None) -> List[Path]
def crawl(output_dir: Union[str, Path, None] = None, urls: Optional[List[str]] = None, crawler_depth: Optional[int] = None, filter_urls: Optional[List] = None, overwrite_existing_files: Optional[bool] = None, id_hash_keys: Optional[List[str]] = None) -> List[Path]
```

Crawl URL(s), extract the text from the HTML, create a Haystack Document object out of it and save it (one JSON
@@ -68,6 +72,10 @@ If no parameters are provided to this method, the instance attributes that were
- `filter_urls`: Optional list of regular expressions that the crawled URLs must comply with.
All URLs not matching at least one of the regular expressions will be dropped.
- `overwrite_existing_files`: Whether to overwrite existing files in output_dir with new content
- `id_hash_keys`: Generate the document id from a custom list of strings that refer to the document's
attributes. To make sure you don't have duplicate documents in your DocumentStore when texts are
not unique, you can modify the metadata and pass, for example, `["content", "meta"]` to this field.
In that case the id is generated from both the content and the defined metadata.

**Returns**:

@@ -78,7 +86,7 @@ List of paths where the crawled webpages got stored
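
A rough sketch of the new return type, assuming the `haystack.nodes` import path, a working Selenium setup, and a placeholder URL and output directory. `crawl()` now hands back `Path` objects pointing at the JSON files it wrote:

```python
from pathlib import Path

from haystack.nodes import Crawler

crawler = Crawler(output_dir="crawled_files")
# crawler_depth=0 keeps the crawl to the given URLs only; the URL is a placeholder.
paths = crawler.crawl(urls=["https://haystack.deepset.ai/overview/intro"], crawler_depth=0)
for path in paths:
    assert isinstance(path, Path)
    print(path)
```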
#### run

```python
def run(output_dir: Union[str, Path, None] = None, urls: Optional[List[str]] = None, crawler_depth: Optional[int] = None, filter_urls: Optional[List] = None, overwrite_existing_files: Optional[bool] = None, return_documents: Optional[bool] = False) -> Tuple[Dict, str]
def run(output_dir: Union[str, Path, None] = None, urls: Optional[List[str]] = None, crawler_depth: Optional[int] = None, filter_urls: Optional[List] = None, overwrite_existing_files: Optional[bool] = None, return_documents: Optional[bool] = False, id_hash_keys: Optional[List[str]] = None) -> Tuple[Dict, str]
```

Method to be executed when the Crawler is used as a Node within a Haystack pipeline.
@@ -94,6 +102,10 @@ Method to be executed when the Crawler is used as a Node within a Haystack pipeline.
All URLs not matching at least one of the regular expressions will be dropped.
- `overwrite_existing_files`: Whether to overwrite existing files in output_dir with new content
- `return_documents`: Whether to return the JSON files' content
- `id_hash_keys`: Generate the document id from a custom list of strings that refer to the document's
attributes. To make sure you don't have duplicate documents in your DocumentStore when texts are
not unique, you can modify the metadata and pass, for example, `["content", "meta"]` to this field.
In that case the id is generated from both the content and the defined metadata.

**Returns**:

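A sketch of the node interface under the same assumptions as above: per the signature, `run()` returns the output dict together with the name of the outgoing edge. The dict's keys depend on `return_documents` and are not part of this diff, so they are only listed here, not accessed:

```python
from haystack.nodes import Crawler

crawler = Crawler(output_dir="crawled_files")
# Placeholder URL; return_documents=True asks for the JSON files' content (see the parameter above).
output, edge = crawler.run(urls=["https://haystack.deepset.ai"], crawler_depth=0, return_documents=True)
print(edge, sorted(output.keys()))
```
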
14 changes: 11 additions & 3 deletions docs/_src/api/api/document_store.md
@@ -272,7 +272,7 @@ None
#### run

```python
def run(documents: List[dict], index: Optional[str] = None, headers: Optional[Dict[str, str]] = None, id_hash_keys: Optional[List[str]] = None)
def run(documents: List[Union[dict, Document]], index: Optional[str] = None, headers: Optional[Dict[str, str]] = None, id_hash_keys: Optional[List[str]] = None)
```

Run requests of document stores
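
A minimal sketch of the widened input type, using `InMemoryDocumentStore` purely for illustration: `run()` should now accept `Document` objects as well as plain dicts:

```python
from haystack import Document
from haystack.document_stores import InMemoryDocumentStore

document_store = InMemoryDocumentStore()
docs = [
    Document(content="Indexing pipeline nodes now pass Document objects between each other."),
    {"content": "Plain dicts are still accepted."},  # dicts remain valid input
]
document_store.run(documents=docs)
print(document_store.get_document_count())
```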
@@ -4669,7 +4669,7 @@ and filter_utils.py.
#### open\_search\_index\_to\_document\_store

```python
def open_search_index_to_document_store(document_store: "BaseDocumentStore", original_index_name: str, original_content_field: str, original_name_field: Optional[str] = None, included_metadata_fields: Optional[List[str]] = None, excluded_metadata_fields: Optional[List[str]] = None, store_original_ids: bool = True, index: Optional[str] = None, preprocessor: Optional[PreProcessor] = None, batch_size: int = 10_000, host: Union[str, List[str]] = "localhost", port: Union[int, List[int]] = 9200, username: str = "admin", password: str = "admin", api_key_id: Optional[str] = None, api_key: Optional[str] = None, aws4auth=None, scheme: str = "https", ca_certs: Optional[str] = None, verify_certs: bool = False, timeout: int = 30, use_system_proxy: bool = False) -> "BaseDocumentStore"
def open_search_index_to_document_store(document_store: "BaseDocumentStore", original_index_name: str, original_content_field: str, original_name_field: Optional[str] = None, included_metadata_fields: Optional[List[str]] = None, excluded_metadata_fields: Optional[List[str]] = None, store_original_ids: bool = True, index: Optional[str] = None, preprocessor: Optional[PreProcessor] = None, id_hash_keys: Optional[List[str]] = None, batch_size: int = 10_000, host: Union[str, List[str]] = "localhost", port: Union[int, List[int]] = 9200, username: str = "admin", password: str = "admin", api_key_id: Optional[str] = None, api_key: Optional[str] = None, aws4auth=None, scheme: str = "https", ca_certs: Optional[str] = None, verify_certs: bool = False, timeout: int = 30, use_system_proxy: bool = False) -> "BaseDocumentStore"
```

This function provides brownfield support of existing OpenSearch indexes by converting each of the records in
@@ -4700,6 +4700,10 @@ all the indexed Documents in the `DocumentStore` will be overwritten in the seco
- `index`: Name of index in `document_store` to use to store the resulting haystack `Document` objects.
- `preprocessor`: Optional PreProcessor that will be applied on the content field of the original OpenSearch
record.
- `id_hash_keys`: Generate the document id from a custom list of strings that refer to the document's
attributes. To make sure you don't have duplicate documents in your DocumentStore when texts are
not unique, you can modify the metadata and pass, for example, `["content", "meta"]` to this field.
In that case the id is generated from both the content and the defined metadata.
- `batch_size`: Number of records to process at once.
- `host`: URL(s) of OpenSearch nodes.
- `port`: Port(s) of OpenSearch nodes.
@@ -4721,7 +4725,7 @@ You can use certifi package with `certifi.where()` to find where the CA certs fi
#### elasticsearch\_index\_to\_document\_store

```python
def elasticsearch_index_to_document_store(document_store: "BaseDocumentStore", original_index_name: str, original_content_field: str, original_name_field: Optional[str] = None, included_metadata_fields: Optional[List[str]] = None, excluded_metadata_fields: Optional[List[str]] = None, store_original_ids: bool = True, index: Optional[str] = None, preprocessor: Optional[PreProcessor] = None, batch_size: int = 10_000, host: Union[str, List[str]] = "localhost", port: Union[int, List[int]] = 9200, username: str = "", password: str = "", api_key_id: Optional[str] = None, api_key: Optional[str] = None, aws4auth=None, scheme: str = "http", ca_certs: Optional[str] = None, verify_certs: bool = True, timeout: int = 30, use_system_proxy: bool = False) -> "BaseDocumentStore"
def elasticsearch_index_to_document_store(document_store: "BaseDocumentStore", original_index_name: str, original_content_field: str, original_name_field: Optional[str] = None, included_metadata_fields: Optional[List[str]] = None, excluded_metadata_fields: Optional[List[str]] = None, store_original_ids: bool = True, index: Optional[str] = None, preprocessor: Optional[PreProcessor] = None, id_hash_keys: Optional[List[str]] = None, batch_size: int = 10_000, host: Union[str, List[str]] = "localhost", port: Union[int, List[int]] = 9200, username: str = "", password: str = "", api_key_id: Optional[str] = None, api_key: Optional[str] = None, aws4auth=None, scheme: str = "http", ca_certs: Optional[str] = None, verify_certs: bool = True, timeout: int = 30, use_system_proxy: bool = False) -> "BaseDocumentStore"
```

This function provides brownfield support of existing Elasticsearch indexes by converting each of the records in
@@ -4752,6 +4756,10 @@ all the indexed Documents in the `DocumentStore` will be overwritten in the seco
- `index`: Name of index in `document_store` to use to store the resulting haystack `Document` objects.
- `preprocessor`: Optional PreProcessor that will be applied on the content field of the original Elasticsearch
record.
- `id_hash_keys`: Generate the document id from a custom list of strings that refer to the document's
attributes. To make sure you don't have duplicate documents in your DocumentStore when texts are
not unique, you can modify the metadata and pass, for example, `["content", "meta"]` to this field.
In that case the id is generated from both the content and the defined metadata.
- `batch_size`: Number of records to process at once.
- `host`: URL(s) of Elasticsearch nodes.
- `port`: Port(s) of Elasticsearch nodes.
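
A rough sketch of passing the new argument when migrating an existing index; the import path `haystack.document_stores.utils`, the index and field names, and the target store are assumptions here, and a running Elasticsearch instance is required. The OpenSearch variant above takes the same parameter:

```python
from haystack.document_stores import InMemoryDocumentStore
from haystack.document_stores.utils import elasticsearch_index_to_document_store

document_store = elasticsearch_index_to_document_store(
    document_store=InMemoryDocumentStore(),  # target store, chosen only for illustration
    original_index_name="existing_index",    # assumed name of the index to migrate
    original_content_field="text",           # assumed field holding the document text
    host="localhost",
    port=9200,
    id_hash_keys=["content", "meta"],        # derive document ids from content + metadata
)
print(document_store.get_document_count())
```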