docs: updating docstrings to say OpenSearch and backlink to correct docs #5000

Merged
merged 2 commits into from
May 25, 2023
44 changes: 22 additions & 22 deletions haystack/document_stores/opensearch.py
@@ -78,45 +78,45 @@ def __init__(
"""
Document Store using OpenSearch (https://opensearch.org/). It is compatible with the Amazon OpenSearch Service.

In addition to native Elasticsearch query & filtering, it provides efficient vector similarity search using
In addition to native OpenSearch query & filtering, it provides efficient vector similarity search using
the KNN plugin that can scale to a large number of documents.

:param host: url(s) of elasticsearch nodes
:param port: port(s) of elasticsearch nodes
:param host: url(s) of OpenSearch nodes
:param port: port(s) of OpenSearch nodes
:param username: username (standard authentication via http_auth)
:param password: password (standard authentication via http_auth)
:param api_key_id: ID of the API key (alternative authentication mode to the above http_auth)
:param api_key: Secret value of the API key (alternative authentication mode to the above http_auth)
:param aws4auth: Authentication for usage with aws elasticsearch (can be generated with the requests-aws4auth package)
:param index: Name of index in elasticsearch to use for storing the documents that we want to search. If not existing yet, we will create one.
:param label_index: Name of index in elasticsearch to use for storing labels. If not existing yet, we will create one.
:param search_fields: Name of fields used by BM25Retriever to find matches in the docs to our incoming query (using elastic's multi_match query), e.g. ["title", "full_text"]
:param aws4auth: Authentication for usage with AWS OpenSearch Service (can be generated with the requests-aws4auth package)
:param index: Name of index in OpenSearch to use for storing the documents that we want to search. If not existing yet, we will create one.
:param label_index: Name of index in OpenSearch to use for storing labels. If not existing yet, we will create one.
:param search_fields: Name of fields used by BM25Retriever to find matches in the docs to our incoming query (using OpenSearch's multi_match query), e.g. ["title", "full_text"]
:param content_field: Name of field that might contain the answer and will therefore be passed to the Reader Model (e.g. "full_text").
If no Reader is used (e.g. in FAQ-Style QA) the plain content of this field will just be returned.
:param name_field: Name of field that contains the title of the doc
:param embedding_field: Name of field containing an embedding vector (Only needed when using a dense retriever (e.g. DensePassageRetriever, EmbeddingRetriever) on top)
Note, that in OpenSearch the similarity type for efficient approximate vector similarity calculations is tied to the embedding field's data type which cannot be changed after creation.
:param embedding_dim: Dimensionality of embedding vector (Only needed when using a dense retriever (e.g. DensePassageRetriever, EmbeddingRetriever) on top)
:param custom_mapping: If you want to use your own custom mapping for creating a new index in Elasticsearch, you can supply it here as a dictionary.
:param analyzer: Specify the default analyzer from one of the built-ins when creating a new Elasticsearch Index.
Elasticsearch also has built-in analyzers for different languages (e.g. impacting tokenization). More info at:
https://www.elastic.co/guide/en/elasticsearch/reference/7.9/analysis-analyzers.html
:param excluded_meta_data: Name of fields in Elasticsearch that should not be returned (e.g. [field_one, field_two]).
:param custom_mapping: If you want to use your own custom mapping for creating a new index in OpenSearch, you can supply it here as a dictionary.
:param analyzer: Specify the default analyzer from one of the built-ins when creating a new OpenSearch Index.
OpenSearch also has built-in analyzers for different languages (e.g. impacting tokenization). More info at:
https://opensearch.org/docs/latest/analyzers/text-analyzers/
:param excluded_meta_data: Name of fields in OpenSearch that should not be returned (e.g. [field_one, field_two]).
Helpful if you have fields with long, irrelevant content that you don't want to display in results (e.g. embedding vectors).
:param scheme: 'https' or 'http', protocol used to connect to your elasticsearch instance
:param scheme: 'https' or 'http', protocol used to connect to your OpenSearch instance
:param ca_certs: Root certificates for SSL: it is a path to certificate authority (CA) certs on disk. You can use certifi package with certifi.where() to find where the CA certs file is located in your machine.
:param verify_certs: Whether to be strict about ca certificates
:param create_index: Whether to try creating a new index (if an index of that name already exists, we will just continue in any case)
:param refresh_type: Type of ES refresh used to control when changes made by a request (e.g. bulk) are made visible to search.
:param refresh_type: Type of OpenSearch refresh used to control when changes made by a request (e.g. bulk) are made visible to search.
If set to 'wait_for', continue only after changes are visible (slow, but safe).
If set to 'false', continue directly (fast, but sometimes unintuitive behaviour when docs are not immediately available after ingestion).
More info at https://www.elastic.co/guide/en/elasticsearch/reference/6.8/docs-refresh.html
More info at https://opensearch.org/docs/latest/api-reference/document-apis/bulk/#url-parameters
:param similarity: The similarity function used to compare document vectors. 'dot_product' is the default since it is
more performant with DPR embeddings. 'cosine' is recommended if you are using a Sentence BERT model.
Note, that the use of efficient approximate vector calculations in OpenSearch is tied to embedding_field's data type which cannot be changed after creation.
You won't be able to use approximate vector calculations on an embedding_field which was created with a different similarity value.
In such cases a fallback to exact but slow vector calculations will happen and a warning will be displayed.
:param timeout: Number of seconds after which an ElasticSearch request times out.
:param timeout: Number of seconds after which an OpenSearch request times out.
:param return_embedding: To return document embedding
:param duplicate_documents: Handle duplicates document based on parameter options.
Parameter options : ( 'skip','overwrite','fail')
@@ -135,12 +135,12 @@ def __init__(
the best balance between nearly as good accuracy and latency.
:param scroll: Determines how long the current index is fixed, e.g. during updating all documents with embeddings.
Defaults to "1d" and should not be larger than this. Can also be in minutes "5m" or hours "15h"
For details, see https://www.elastic.co/guide/en/elasticsearch/reference/current/scroll-api.html
For details, see https://opensearch.org/docs/latest/api-reference/scroll/
:param skip_missing_embeddings: Parameter to control queries based on vector similarity when indexed documents miss embeddings.
Parameter options: (True, False)
False: Raises exception if one or more documents do not have embeddings at query time
True: Query will ignore all documents without embeddings (recommended if you concurrently index and query)
:param synonyms: List of synonyms can be passed while elasticsearch initialization.
:param synonyms: List of synonyms can be passed while OpenSearch initialization.
For example: [ "foo, bar => baz",
"foozball , foosball" ]
More info at https://www.elastic.co/guide/en/elasticsearch/reference/current/analysis-synonym-tokenfilter.html
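
The `synonyms` list above uses the Solr-style rule format: `"foo, bar => baz"` is an explicit mapping and `"foozball , foosball"` declares equivalent terms. The following is a minimal sketch of how such a list could be placed into index settings as a synonym token filter. The helper name `build_synonym_settings`, the filter name `synonym_filter`, and the choice of tokenizer are assumptions made for illustration; the diff does not show how `OpenSearchDocumentStore` wires this internally.

```python
from typing import Any, Dict, List


def build_synonym_settings(synonyms: List[str], synonym_type: str = "synonym") -> Dict[str, Any]:
    """Return index settings whose default analyzer applies the given synonym rules.

    Hypothetical helper: filter/analyzer names and the tokenizer are illustrative.
    """
    return {
        "analysis": {
            "filter": {
                # Token filter that holds the Solr-style synonym rules
                "synonym_filter": {"type": synonym_type, "synonyms": synonyms}
            },
            "analyzer": {
                "default": {
                    "type": "custom",
                    "tokenizer": "whitespace",
                    # Lowercase first so the rules match case-insensitively
                    "filter": ["lowercase", "synonym_filter"],
                }
            },
        }
    }


settings = build_synonym_settings(["foo, bar => baz", "foozball , foosball"])
```

A dictionary like this would go under the `settings` key of an index-creation body, for example as part of a `custom_mapping`.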
@@ -329,7 +329,7 @@ def write_documents(
Indexes documents for later queries in OpenSearch.

If a document with the same ID already exists in OpenSearch:
a) (Default) Throw Elastic's standard error message for duplicate IDs.
a) (Default) Throw OpenSearch's standard error message for duplicate IDs.
b) If `self.update_existing_documents=True` for DocumentStore: Overwrite existing documents.
(This is only relevant if you pass your own ID when initializing a `Document`.
If you don't set custom IDs for your Documents or just pass a list of dictionaries here,
@@ -478,7 +478,7 @@ def query_by_embedding(
:param top_k: How many documents to return
:param index: Index name for storing the docs and metadata
:param return_embedding: To return document embedding
:param headers: Custom HTTP headers to pass to elasticsearch client (e.g. {'Authorization': 'Basic YWRtaW46cm9vdA=='})
:param headers: Custom HTTP headers to pass to OpenSearch client (e.g. {'Authorization': 'Basic YWRtaW46cm9vdA=='})
Check out https://www.elastic.co/guide/en/elasticsearch/reference/current/http-clients.html for more information.
:param scale_score: Whether to scale the similarity score to the unit interval (range of [0,1]).
If true (default) similarity scores (e.g. cosine or dot_product) which naturally have a different value range will be scaled to a range of [0,1], where 1 means extremely relevant.
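
To make the `scale_score` description concrete, here is a minimal sketch of one way to map raw similarity scores onto [0, 1]. The function name, the linear mapping for cosine, and the logistic squashing with a divisor of 100 for dot product are assumptions for illustration; the diff does not show the actual scaling code.

```python
import math


def scale_to_unit_interval(score: float, similarity: str) -> float:
    """Map a raw similarity score to [0, 1] (illustrative, not the library's code)."""
    if similarity == "cosine":
        # Cosine similarity lies in [-1, 1]; shift and halve it
        return (score + 1) / 2
    # Dot products are unbounded; squash with a logistic curve.
    # The divisor 100 is an assumed temperature that keeps typical scores off the extremes.
    return 1 / (1 + math.exp(-score / 100))


print(scale_to_unit_interval(1.0, "cosine"))       # 1.0
print(scale_to_unit_interval(0.0, "dot_product"))  # 0.5
```

With a mapping like this, scores from different similarity functions become roughly comparable, at the cost of no longer being raw cosine or dot-product values.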
@@ -601,7 +601,7 @@ def query_by_embedding_batch(
:param top_k: How many documents to return
:param index: Index name for storing the docs and metadata
:param return_embedding: To return document embedding
:param headers: Custom HTTP headers to pass to elasticsearch client (e.g. {'Authorization': 'Basic YWRtaW46cm9vdA=='})
:param headers: Custom HTTP headers to pass to OpenSearch client (e.g. {'Authorization': 'Basic YWRtaW46cm9vdA=='})
Check out https://www.elastic.co/guide/en/elasticsearch/reference/current/http-clients.html for more information.
:param scale_score: Whether to scale the similarity score to the unit interval (range of [0,1]).
If true (default) similarity scores (e.g. cosine or dot_product) which naturally have a different value range will be scaled to a range of [0,1], where 1 means extremely relevant.
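
The example value in the `headers` docstring is a standard HTTP Basic credential. As a small sketch, the helper below (the name `basic_auth_header` is hypothetical) shows how such a header dictionary can be built with the standard library; the `admin`/`root` pair is simply what the docstring's Base64 string decodes to.

```python
import base64
from typing import Dict


def basic_auth_header(username: str, password: str) -> Dict[str, str]:
    """Build an HTTP Basic Authorization header (hypothetical helper)."""
    # Basic auth is "username:password", Base64-encoded
    token = base64.b64encode(f"{username}:{password}".encode("utf-8")).decode("ascii")
    return {"Authorization": f"Basic {token}"}


headers = basic_auth_header("admin", "root")
print(headers)  # {'Authorization': 'Basic YWRtaW46cm9vdA=='}
```

The resulting dictionary has the shape the `headers` parameter describes. Base64 is an encoding, not encryption, so such headers should only be sent over `https`.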
@@ -1271,7 +1271,7 @@ def _create_label_index(self, index_name: str, headers: Optional[Dict[str, str]]

def _get_vector_similarity_query(self, query_emb: np.ndarray, top_k: int):
"""
Generate Elasticsearch query for vector similarity.
Generate OpenSearch query for vector similarity.
"""
if self.knn_engine == "score_script":
query: dict = {
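
The hunk above is cut off right after the `score_script` branch begins, and the sketch below does not try to reconstruct the elided body. It only illustrates the two query shapes the OpenSearch k-NN plugin documents: an exact `script_score` query using the `knn_score` script, and an approximate `knn` query. The function name, the default field name `embedding`, and the default `space_type` are assumptions, and a plain list of floats stands in for the `np.ndarray` used in the real signature.

```python
from typing import Any, Dict, List


def vector_similarity_query(
    query_emb: List[float],
    top_k: int,
    embedding_field: str = "embedding",   # assumed field name
    knn_engine: str = "nmslib",
    space_type: str = "innerproduct",     # assumed; e.g. "cosinesimil" or "l2" also exist
) -> Dict[str, Any]:
    """Return the `query` part of a search body for vector similarity (illustrative)."""
    if knn_engine == "score_script":
        # Exact (brute-force) scoring over all matching documents
        return {
            "script_score": {
                "query": {"match_all": {}},
                "script": {
                    "source": "knn_score",
                    "lang": "knn",
                    "params": {
                        "field": embedding_field,
                        "query_value": query_emb,
                        "space_type": space_type,
                    },
                },
            }
        }
    # Approximate nearest-neighbour search on a knn_vector field
    return {"knn": {embedding_field: {"vector": query_emb, "k": top_k}}}


body = {"size": 3, "query": vector_similarity_query([0.1, 0.2, 0.3], top_k=3)}
```

The exact variant trades speed for flexibility, which matches the docstring's note about falling back to "exact but slow vector calculations" when the embedding field was created with a different similarity.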