refactor: Pinecone tests #3555

Merged on Nov 14, 2022 (8 commits)
96 changes: 31 additions & 65 deletions .github/workflows/tests.yml
@@ -324,7 +324,6 @@ jobs:
channel: '#haystack'
if: failure() && github.repository_owner == 'deepset-ai' && github.ref == 'refs/heads/main'


integration-tests-weaviate:
name: Integration / Weaviate / ${{ matrix.os }}
needs:
@@ -363,6 +362,37 @@ jobs:
channel: '#haystack'
if: failure() && github.repository_owner == 'deepset-ai' && github.ref == 'refs/heads/main'

integration-tests-pinecone:
name: Integration / pinecone / ${{ matrix.os }}
needs:
- unit-tests
strategy:
fail-fast: false
matrix:
os: [ubuntu-latest,macos-latest,windows-latest]
runs-on: ${{ matrix.os }}
steps:
- uses: actions/checkout@v3

- name: Setup Python
uses: ./.github/actions/python_cache/

- name: Install Haystack
run: pip install .[pinecone]

- name: Run tests
env:
PINECONE_API_KEY: ${{ secrets.PINECONE_API_KEY }}
run: |
pytest --maxfail=5 -m "document_store and integration" test/document_stores/test_pinecone.py

- uses: act10ns/slack@v1
with:
status: ${{ job.status }}
channel: '#haystack'
if: failure() && github.repository_owner == 'deepset-ai' && github.ref == 'refs/heads/main'
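The new job selects tests with pytest's `-m` expression `"document_store and integration"`. As a rough illustration of the AND semantics (a simplification, not pytest's actual implementation, which also supports `or` and `not`), a test is selected only if it carries every marker named in the expression:

```python
# Simplified model of how a pytest -m AND-expression filters tests.
# Marker names match the workflow's -m flag; the helper itself is
# illustrative only.
def runs_under(test_markers, expression=("document_store", "integration")):
    """True if a test carrying `test_markers` matches the AND of `expression`."""
    return set(expression).issubset(test_markers)

print(runs_under({"document_store", "integration"}))  # True: both markers present
print(runs_under({"document_store", "unit"}))         # False: 'integration' missing
```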


#
# TODO: the following steps need to be revisited
#
@@ -540,71 +570,7 @@ jobs:
# pytest ${{ env.PYTEST_PARAMS }} -m "milvus and not integration" ${{ env.SUITES_EXCLUDED_FROM_WINDOWS }} test/document_stores/ --document_store_type=milvus


pinecone-tests-linux:
needs: [mypy, pylint, black]
runs-on: ubuntu-latest
if: contains(github.event.pull_request.labels.*.name, 'topic:pinecone') || !github.event.pull_request.draft

steps:
- uses: actions/checkout@v3

- name: Setup Python
uses: ./.github/actions/python_cache/

# TODO Let's try to remove this one from the unit tests
- name: Install pdftotext
run: wget --no-check-certificate https://dl.xpdfreader.com/xpdf-tools-linux-4.04.tar.gz && tar -xvf xpdf-tools-linux-4.04.tar.gz && sudo cp xpdf-tools-linux-4.04/bin64/pdftotext /usr/local/bin

- name: Install Haystack
run: pip install .[pinecone]

- name: Run tests
env:
PINECONE_API_KEY: ${{ secrets.PINECONE_API_KEY }}
TOKENIZERS_PARALLELISM: 'false'
run: |
pytest ${{ env.PYTEST_PARAMS }} -m "pinecone and not integration" test/document_stores/ --document_store_type=pinecone

- uses: act10ns/slack@v1
with:
status: ${{ job.status }}
channel: '#haystack'
if: failure() && github.repository_owner == 'deepset-ai' && github.ref == 'refs/heads/main'

pinecone-tests-windows:
needs: [mypy, pylint, black]
runs-on: windows-latest
if: contains(github.event.pull_request.labels.*.name, 'topic:pinecone') && contains(github.event.pull_request.labels.*.name, 'topic:windows') || !github.event.pull_request.draft

steps:
- uses: actions/checkout@v3

- name: Setup Python
uses: ./.github/actions/python_cache/
with:
prefix: windows

- name: Install pdftotext
run: |
choco install xpdf-utils
choco install openjdk11
refreshenv

- name: Install Haystack
run: pip install .[pinecone]

- name: Run tests
env:
TOKENIZERS_PARALLELISM: 'false'
PINECONE_API_KEY: ${{ secrets.PINECONE_API_KEY }}
run: |
pytest ${{ env.PYTEST_PARAMS }} -m "pinecone and not integration" ${{ env.SUITES_EXCLUDED_FROM_WINDOWS }} test/document_stores/ --document_store_type=pinecone

- uses: act10ns/slack@v1
with:
status: ${{ job.status }}
channel: '#haystack'
if: failure() && github.repository_owner == 'deepset-ai' && github.ref == 'refs/heads/main'

rest-and-ui:
needs: [mypy, pylint, black]
3 changes: 3 additions & 0 deletions conftest.py
@@ -7,6 +7,9 @@ def pytest_addoption(parser):
parser.addoption(
"--mock-dc", action="store_true", default=True, help="Mock HTTP requests to dC while running tests"
)
parser.addoption(
"--mock-pinecone", action="store_true", default=True, help="Mock HTTP requests to Pinecone while running tests"
)


def pytest_generate_tests(metafunc):
103 changes: 62 additions & 41 deletions haystack/document_stores/pinecone.py
@@ -1340,11 +1340,6 @@ def _label_to_meta(self, labels: list) -> dict:
meta = {
"label-id": label.id,
"query": label.query,
"label-answer-answer": label.answer.answer,
"label-answer-type": label.answer.type,
"label-answer-score": label.answer.score,
"label-answer-context": label.answer.context,
"label-answer-document-id": label.answer.document_id,
"label-is-correct-answer": label.is_correct_answer,
"label-is-correct-document": label.is_correct_document,
"label-document-content": label.document.content,
@@ -1355,19 +1350,38 @@
"label-updated-at": label.updated_at,
"label-pipeline-id": label.pipeline_id,
}
# Get offset data
if label.answer.offsets_in_document:
meta["label-answer-offsets-in-document-start"] = label.answer.offsets_in_document[0].start
meta["label-answer-offsets-in-document-end"] = label.answer.offsets_in_document[0].end
else:
meta["label-answer-offsets-in-document-start"] = None
meta["label-answer-offsets-in-document-end"] = None
if label.answer.offsets_in_context:
meta["label-answer-offsets-in-context-start"] = label.answer.offsets_in_context[0].start
meta["label-answer-offsets-in-context-end"] = label.answer.offsets_in_context[0].end
else:
meta["label-answer-offsets-in-context-start"] = None
meta["label-answer-offsets-in-context-end"] = None
# Get document metadata
if label.document.meta is not None:
for k, v in label.document.meta.items():
meta[f"label-document-meta-{k}"] = v
# Get label metadata
if label.meta is not None:
for k, v in label.meta.items():
meta[f"label-meta-{k}"] = v
# Get Answer data
if label.answer is not None:
meta.update(
{
"label-answer-answer": label.answer.answer,
"label-answer-type": label.answer.type,
"label-answer-score": label.answer.score,
"label-answer-context": label.answer.context,
"label-answer-document-id": label.answer.document_id,
}
)
# Get offset data
if label.answer.offsets_in_document:
meta["label-answer-offsets-in-document-start"] = label.answer.offsets_in_document[0].start
meta["label-answer-offsets-in-document-end"] = label.answer.offsets_in_document[0].end
else:
meta["label-answer-offsets-in-document-start"] = None
meta["label-answer-offsets-in-document-end"] = None
if label.answer.offsets_in_context:
meta["label-answer-offsets-in-context-start"] = label.answer.offsets_in_context[0].start
meta["label-answer-offsets-in-context-end"] = label.answer.offsets_in_context[0].end
else:
meta["label-answer-offsets-in-context-start"] = None
meta["label-answer-offsets-in-context-end"] = None
metadata[label.id] = meta
metadata = self._meta_for_pinecone(metadata)
return metadata
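The core of this hunk is the new `label.answer is not None` guard: answer fields are flattened into the Pinecone metadata only when the label actually has an answer. A simplified sketch of that pattern (not the Haystack implementation; only two answer keys shown):

```python
# Simplified sketch of the None-guard introduced above: unanswered
# labels produce no label-answer-* keys at all.
def label_to_meta(label: dict) -> dict:
    meta = {"label-id": label["id"], "query": label["query"]}
    answer = label.get("answer")
    if answer is not None:
        # Only answered labels carry the label-answer-* keys.
        meta["label-answer-answer"] = answer["answer"]
        meta["label-answer-document-id"] = answer["document_id"]
    return meta

print(label_to_meta({"id": "1", "query": "q", "answer": None}))
# -> {'label-id': '1', 'query': 'q'}
```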
@@ -1377,40 +1391,47 @@ def _meta_to_labels(self, documents: List[Document]) -> List[Label]:
Converts a list of metadata dictionaries to a list of Labels.
"""
labels = []
for doc in documents:
label_meta = {k: v for k, v in doc.meta.items() if k[:6] == "label-" or k == "query"}
other_meta = {k: v for k, v in doc.meta.items() if k[:6] != "label-" and k != "query"}
for d in documents:
label_meta = {k: v for k, v in d.meta.items() if k[:6] == "label-" or k == "query"}
other_meta = {k: v for k, v in d.meta.items() if k[:6] != "label-" and k != "query"}
# Create document
doc = Document(
id=label_meta["label-document-id"],
content=doc.content,
meta=other_meta,
score=doc.score,
embedding=doc.embedding,
id=label_meta["label-document-id"], content=d.content, meta={}, score=d.score, embedding=d.embedding
)
# Extract document metadata
for k, v in d.meta.items():
if k.startswith("label-document-meta-"):
doc.meta[k[20:]] = v
# Extract offsets
offsets: Dict[str, Optional[List[Span]]] = {"document": None, "context": None}
for mode in offsets.keys():
if label_meta[f"label-answer-offsets-in-{mode}-start"] is not None:
if label_meta.get(f"label-answer-offsets-in-{mode}-start") is not None:
offsets[mode] = [
Span(
label_meta[f"label-answer-offsets-in-{mode}-start"],
label_meta[f"label-answer-offsets-in-{mode}-end"],
)
]
# if label_meta["label-answer-answer"] is None:
# label_meta["label-answer-answer"] = ""
answer = Answer(
answer=label_meta["label-answer-answer"]
or "", # If we leave as None a schema validation error will be thrown
type=label_meta["label-answer-type"],
score=label_meta["label-answer-score"],
context=label_meta["label-answer-context"],
offsets_in_document=offsets["document"],
offsets_in_context=offsets["context"],
document_id=label_meta["label-answer-document-id"],
meta=other_meta,
)
# Extract Answer
answer = None
if label_meta.get("label-answer-answer") is not None:
answer = Answer(
answer=label_meta["label-answer-answer"]
or "", # If we leave as None a schema validation error will be thrown
type=label_meta["label-answer-type"],
score=label_meta["label-answer-score"],
context=label_meta["label-answer-context"],
offsets_in_document=offsets["document"],
offsets_in_context=offsets["context"],
document_id=label_meta["label-answer-document-id"],
meta=other_meta,
)
# Extract Label metadata
label_meta_metadata = {}
for k, v in d.meta.items():
if k.startswith("label-meta-"):
label_meta_metadata[k[11:]] = v
# Rebuild Label object
label = Label(
id=label_meta["label-id"],
query=label_meta["query"],
@@ -1422,7 +1443,7 @@ def _meta_to_labels(self, documents: List[Document]) -> List[Label]:
is_correct_answer=label_meta["label-is-correct-answer"],
is_correct_document=label_meta["label-is-correct-document"],
origin=label_meta["label-origin"],
meta={},
meta=label_meta_metadata,
filters=None,
)
labels.append(label)
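The diff relies on a prefix round-trip: document metadata is stored flattened under `label-document-meta-*` keys and label metadata under `label-meta-*`, then recovered by slicing off the prefix (`k[20:]` and `k[11:]` above are exactly the prefix lengths). A self-contained sketch of that recovery step, with illustrative keys:

```python
# Recover document metadata and label metadata from the flattened,
# prefix-namespaced keys used for Pinecone storage. Prefix lengths
# match the k[20:] / k[11:] slices in the diff.
def split_namespaced_meta(flat: dict):
    doc_meta, label_meta = {}, {}
    for k, v in flat.items():
        if k.startswith("label-document-meta-"):
            doc_meta[k[len("label-document-meta-"):]] = v
        elif k.startswith("label-meta-"):
            label_meta[k[len("label-meta-"):]] = v
    return doc_meta, label_meta

print(split_namespaced_meta({"label-document-meta-name": "doc1",
                             "label-meta-user": "alice",
                             "query": "q"}))
# -> ({'name': 'doc1'}, {'user': 'alice'})
```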
2 changes: 1 addition & 1 deletion test/document_stores/test_base.py
@@ -318,7 +318,7 @@ def test_duplicate_documents_skip(self, ds, documents):

ds.write_documents(updated_docs, duplicate_documents="skip")
for d in ds.get_all_documents():
assert d.meta["name"] != "Updated"
assert d.meta.get("name") != "Updated"
Member:
I understand that this change ensures no KeyError is raised if "name" is not part of d.meta, and that None is returned instead. My question is whether this was an issue before. The documents all have metadata with a name set, don't they?
There is nothing wrong with this change; I'd just like to understand it a bit better. 🙂

Contributor (author):

Pinecone tests have their own set of documents provided by the documents fixture, and some of them don't have any metadata attached. I think all the document stores should be able to pass tests on the same set of documents, but I didn't want to change too much functionality in this PR so I left pinecone tests with their custom set of docs, which requires this change.
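The behavioural difference the reply describes is plain `dict` indexing versus `dict.get`: indexing a missing key raises, while `.get()` returns None, so the assertion still evaluates for fixture documents that carry no metadata at all. A tiny illustration:

```python
# Why the assertion switched from d.meta["name"] to d.meta.get("name"):
meta = {}                      # a fixture document with no metadata
print(meta.get("name"))        # None: the new assertion compares None != "Updated"
try:
    meta["name"]               # the old assertion would raise here
except KeyError:
    print("KeyError")
```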


@pytest.mark.integration
def test_duplicate_documents_overwrite(self, ds, documents):