Milvus vector db #244

Merged
merged 6 commits into from
Dec 3, 2024

Conversation

dexters1 (Collaborator) commented Dec 3, 2024

Summary by CodeRabbit

Release Notes

  • New Features

    • Added support for the Milvus vector database in the Cognee library.
    • Updated installation instructions to include Milvus support for both pip and Poetry.
    • Introduced a new GitHub Actions workflow for automated testing of the Milvus integration.
  • Bug Fixes

    • Improved error handling for missing credentials in various vector database providers.
  • Documentation

    • Enhanced the .env.template to include "milvus" as an option for the VECTOR_DB_PROVIDER.
  • Tests

    • Added asynchronous tests specifically for the Milvus integration.

jinhonglin-ryan and others added 5 commits December 3, 2024 03:40
Make Milvus an optional dependency, expand docs with Milvus information

Chore
…milvus gh action

Resolved if statement resolution issue regarding api key,
Added vector db config to milvus test,
Added milvus gh action

Fix
Rewrite batch search to work as async gather

Fix
Feature: Integrate Milvus as a Vector Database Provider
coderabbitai bot (Contributor) commented Dec 3, 2024

Walkthrough

The changes in this pull request include updates to several files to integrate support for the Milvus vector database within the Cognee framework. Key modifications involve the addition of a new GitHub Actions workflow for testing, enhancements to installation instructions in the README, and the introduction of the MilvusAdapter class to facilitate interactions with the Milvus database. Additionally, the .env.template file was updated to include "milvus" as a database provider option, and a new asynchronous testing script was created to validate the integration.

Changes

| File | Change Summary |
|------|----------------|
| .env.template | Updated comment for VECTOR_DB_PROVIDER to include "milvus". |
| .github/workflows/test_milvus.yml | Introduced a new workflow for automated testing of the Milvus project, including job definitions and conditions. |
| README.md | Added installation options for Milvus support in both pip and poetry sections. |
| cognee/infrastructure/databases/vector/create_vector_engine.py | Added support for "milvus" in create_vector_engine, standardized argument formatting, and improved error handling. |
| cognee/infrastructure/databases/vector/milvus/MilvusAdapter.py | Introduced MilvusAdapter class with methods for managing collections and data points in Milvus. |
| cognee/infrastructure/databases/vector/milvus/__init__.py | Imported MilvusAdapter class to make it accessible from the package's __init__.py. |
| cognee/tests/test_milvus.py | Created a new asynchronous test script for integration with Milvus, including data operations and search queries. |
| pyproject.toml | Updated version to 0.1.19, added pymilvus dependency, and created an extra group for Milvus support. |
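
As a rough illustration of the dispatch-and-validate pattern the summary describes for create_vector_engine, here is a minimal sketch. The function name `pick_vector_provider`, the required keys listed per provider, and the return value are all assumptions made for illustration; none of them are taken from the PR's code.

```python
from typing import Dict, Tuple

# Hypothetical mapping of provider name -> config keys that must be non-empty.
# The real requirements in cognee may differ.
REQUIRED_KEYS: Dict[str, Tuple[str, ...]] = {
    "weaviate": ("vector_db_url", "vector_db_key"),
    "qdrant": ("vector_db_url", "vector_db_key"),
    "milvus": ("vector_db_url",),
}


def pick_vector_provider(config: Dict[str, str]) -> str:
    """Validate credentials for the configured provider and return its name."""
    provider = config.get("vector_db_provider", "")
    if provider not in REQUIRED_KEYS:
        raise ValueError(f"Unsupported vector DB provider: {provider!r}")

    # Check each required key separately so one empty value cannot be masked by another.
    missing = [key for key in REQUIRED_KEYS[provider] if not config.get(key)]
    if missing:
        raise EnvironmentError(
            f"Missing required {provider} credentials: {', '.join(missing)}"
        )
    return provider
```

Checking each key on its own, rather than combining them in one boolean expression, keeps the error message specific about which credential is absent.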

Suggested reviewers

  • borisarzentar
  • Vasilije1990

Poem

🐇 In the garden of code, a new friend we see,
Milvus joins the party, as happy as can be!
With adapters and workflows, we leap and we bound,
Testing and building, new features abound!
Let's hop to the future, with joy in our hearts,
For every new change, a fresh start imparts! 🌱


📜 Recent review details

Configuration used: CodeRabbit UI
Review profile: CHILL

📥 Commits

Reviewing files that changed from the base of the PR and between 747a6b9 and e462ebe.

📒 Files selected for processing (1)
  • README.md (3 hunks)
🔇 Additional comments (3)
README.md (3)

56-60: LGTM! Pip installation instructions for Milvus are correct.

The installation command follows the consistent format used for other database providers.


92-96: LGTM! Poetry installation instructions for Milvus are correct.

The poetry command follows the consistent format used for other database providers.


371-380: LGTM! Database implementation state table is properly updated.

The Milvus entry is correctly added with appropriate type and status indicators, maintaining consistency with other stable database implementations.



@dexters1 dexters1 marked this pull request as ready for review December 3, 2024 12:11
coderabbitai bot (Contributor) left a comment

Actionable comments posted: 3

🧹 Outside diff range and nitpick comments (4)
cognee/infrastructure/databases/vector/create_vector_engine.py (2)

9-10: Correct misspelling of 'required' in error messages

The word "required" is misspelled as "requred" in several error messages. This may cause confusion when these errors are encountered.

Apply this diff to fix the typos:

-    raise EnvironmentError("Missing requred Weaviate credentials!")
+    raise EnvironmentError("Missing required Weaviate credentials!")

...

-    raise EnvironmentError("Missing requred Qdrant credentials!")
+    raise EnvironmentError("Missing required Qdrant credentials!")

...

-    raise EnvironmentError("Missing requred pgvector credentials!")
+    raise EnvironmentError("Missing required pgvector credentials!")

...

-    raise EnvironmentError("Missing requred FalkorDB credentials!")
+    raise EnvironmentError("Missing required FalkorDB credentials!")

Also applies to: 17-18, 30-31, 41-42


3-7: Use appropriate base class for VectorConfig

The VectorConfig class currently inherits from Dict, which is not intended for subclassing in this context. If you need a dictionary with defined keys, consider using typing.TypedDict or use a dataclass for better type safety and clarity.

Consider changing the class definition to one of the following:

Option 1: Use TypedDict

from typing import TypedDict

class VectorConfig(TypedDict):
    vector_db_url: str
    vector_db_port: str
    vector_db_key: str
    vector_db_provider: str

Option 2: Use dataclass

from dataclasses import dataclass

@dataclass
class VectorConfig:
    vector_db_url: str
    vector_db_port: str
    vector_db_key: str
    vector_db_provider: str

Using TypedDict or dataclass provides better type checking and improves code readability.
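
To make the trade-off concrete, the sketch below shows how the two options behave at runtime. The class names `VectorConfigDict` and `VectorConfigData` and the sample values are made up for this comparison.

```python
from dataclasses import asdict, dataclass
from typing import TypedDict


class VectorConfigDict(TypedDict):
    vector_db_url: str
    vector_db_port: str
    vector_db_key: str
    vector_db_provider: str


@dataclass
class VectorConfigData:
    vector_db_url: str
    vector_db_port: str
    vector_db_key: str
    vector_db_provider: str


# A TypedDict is a plain dict at runtime; its keys are only checked by a type checker.
as_dict = VectorConfigDict(
    vector_db_url="./milvus.db",
    vector_db_port="",
    vector_db_key="",
    vector_db_provider="milvus",
)

# A dataclass is a regular class with a generated __init__ and attribute access.
as_data = VectorConfigData(**as_dict)

print(type(as_dict) is dict)       # the TypedDict instance is an ordinary dict
print(as_data.vector_db_provider)  # attribute access instead of key lookup
print(asdict(as_data) == as_dict)  # both carry the same fields
```

If existing callers index the config with `config["vector_db_url"]`, TypedDict is the smaller change; a dataclass would require switching those call sites to attribute access.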

cognee/tests/test_milvus.py (1)

35-41: Avoid unintended indentation in multi-line string

The multi-line string assigned to text is indented, which includes leading whitespace in each line. This may affect text processing or embedding results, as the extra spaces become part of the string.

Consider dedenting the string to remove unnecessary whitespace:

import textwrap

    text = textwrap.dedent("""\
        A quantum computer is a computer that takes advantage of quantum mechanical phenomena.
        At small scales, physical matter exhibits properties of both particles and waves, and quantum computing leverages this behavior, specifically quantum superposition and entanglement, using specialized hardware that supports the preparation and manipulation of quantum states.
        Classical physics cannot explain the operation of these quantum devices, and a scalable quantum computer could perform some calculations exponentially faster (with respect to input size scaling) than any modern "classical" computer. In particular, a large-scale quantum computer could break widely used encryption schemes and aid physicists in performing physical simulations; however, the current state of the technology is largely experimental and impractical, with several obstacles to useful applications. Moreover, scalable quantum computers do not hold promise for many practical tasks, and for many important tasks quantum speedups are proven impossible.
        The basic unit of information in quantum computing is the qubit, similar to the bit in traditional digital electronics. Unlike a classical bit, a qubit can exist in a superposition of its two "basis" states. When measuring a qubit, the result is a probabilistic output of a classical bit, therefore making quantum computers nondeterministic in general. If a quantum computer manipulates the qubit in a particular way, wave interference effects can amplify the desired measurement results. The design of quantum algorithms involves creating procedures that allow a quantum computer to perform calculations efficiently and quickly.
        Physically engineering high-quality qubits has proven challenging. If a physical qubit is not sufficiently isolated from its environment, it suffers from quantum decoherence, introducing noise into calculations. Paradoxically, perfectly isolating qubits is also undesirable because quantum computations typically need to initialize qubits, perform controlled qubit interactions, and measure the resulting quantum states. Each of those operations introduces errors and suffers from noise, and such inaccuracies accumulate.
        In principle, a non-quantum (classical) computer can solve the same computational problems as a quantum computer, given enough time. Quantum advantage comes in the form of time complexity rather than computability, and quantum complexity theory shows that some quantum algorithms for carefully selected tasks require exponentially fewer computational steps than the best known non-quantum algorithms. Such tasks can in theory be solved on a large-scale quantum computer whereas classical computers would not finish computations in any reasonable amount of time. However, quantum speedup is not universal or even typical across computational tasks, since basic tasks such as sorting are proven to not allow any asymptotic quantum speedup. Claims of quantum supremacy have drawn significant attention to the discipline, but are demonstrated on contrived tasks, while near-term practical use cases remain limited.
        """)

Alternatively, you can adjust the indentation:

-        text = """A quantum computer is a computer that takes advantage of quantum mechanical phenomena.
-            At small scales, physical matter exhibits properties of both particles and waves, and quantum computing leverages this behavior, specifically quantum superposition and entanglement, using specialized hardware that supports the preparation and manipulation of quantum states.
-            Classical physics cannot explain the operation of these quantum devices, and a scalable quantum computer could perform some calculations exponentially faster (with respect to input size scaling) than any modern "classical" computer. In particular, a large-scale quantum computer could break widely used encryption schemes and aid physicists in performing physical simulations; however, the current state of the technology is largely experimental and impractical, with several obstacles to useful applications. Moreover, scalable quantum computers do not hold promise for many practical tasks, and for many important tasks quantum speedups are proven impossible.
-            The basic unit of information in quantum computing is the qubit, similar to the bit in traditional digital electronics. Unlike a classical bit, a qubit can exist in a superposition of its two "basis" states. When measuring a qubit, the result is a probabilistic output of a classical bit, therefore making quantum computers nondeterministic in general. If a quantum computer manipulates the qubit in a particular way, wave interference effects can amplify the desired measurement results. The design of quantum algorithms involves creating procedures that allow a quantum computer to perform calculations efficiently and quickly.
-            Physically engineering high-quality qubits has proven challenging. If a physical qubit is not sufficiently isolated from its environment, it suffers from quantum decoherence, introducing noise into calculations. Paradoxically, perfectly isolating qubits is also undesirable because quantum computations typically need to initialize qubits, perform controlled qubit interactions, and measure the resulting quantum states. Each of those operations introduces errors and suffers from noise, and such inaccuracies accumulate.
-            In principle, a non-quantum (classical) computer can solve the same computational problems as a quantum computer, given enough time. Quantum advantage comes in the form of time complexity rather than computability, and quantum complexity theory shows that some quantum algorithms for carefully selected tasks require exponentially fewer computational steps than the best known non-quantum algorithms. Such tasks can in theory be solved on a large-scale quantum computer whereas classical computers would not finish computations in any reasonable amount of time. However, quantum speedup is not universal or even typical across computational tasks, since basic tasks such as sorting are proven to not allow any asymptotic quantum speedup. Claims of quantum supremacy have drawn significant attention to the discipline, but are demonstrated on contrived tasks, while near-term practical use cases remain limited.
-        """
+        text = """A quantum computer is a computer that takes advantage of quantum mechanical phenomena.
+At small scales, physical matter exhibits properties of both particles and waves, and quantum computing leverages this behavior, specifically quantum superposition and entanglement, using specialized hardware that supports the preparation and manipulation of quantum states.
+Classical physics cannot explain the operation of these quantum devices, and a scalable quantum computer could perform some calculations exponentially faster (with respect to input size scaling) than any modern "classical" computer. In particular, a large-scale quantum computer could break widely used encryption schemes and aid physicists in performing physical simulations; however, the current state of the technology is largely experimental and impractical, with several obstacles to useful applications. Moreover, scalable quantum computers do not hold promise for many practical tasks, and for many important tasks quantum speedups are proven impossible.
+The basic unit of information in quantum computing is the qubit, similar to the bit in traditional digital electronics. Unlike a classical bit, a qubit can exist in a superposition of its two "basis" states. When measuring a qubit, the result is a probabilistic output of a classical bit, therefore making quantum computers nondeterministic in general. If a quantum computer manipulates the qubit in a particular way, wave interference effects can amplify the desired measurement results. The design of quantum algorithms involves creating procedures that allow a quantum computer to perform calculations efficiently and quickly.
+Physically engineering high-quality qubits has proven challenging. If a physical qubit is not sufficiently isolated from its environment, it suffers from quantum decoherence, introducing noise into calculations. Paradoxically, perfectly isolating qubits is also undesirable because quantum computations typically need to initialize qubits, perform controlled qubit interactions, and measure the resulting quantum states. Each of those operations introduces errors and suffers from noise, and such inaccuracies accumulate.
+In principle, a non-quantum (classical) computer can solve the same computational problems as a quantum computer, given enough time. Quantum advantage comes in the form of time complexity rather than computability, and quantum complexity theory shows that some quantum algorithms for carefully selected tasks require exponentially fewer computational steps than the best known non-quantum algorithms. Such tasks can in theory be solved on a large-scale quantum computer whereas classical computers would not finish computations in any reasonable amount of time. However, quantum speedup is not universal or even typical across computational tasks, since basic tasks such as sorting are proven to not allow any asymptotic quantum speedup. Claims of quantum supremacy have drawn significant attention to the discipline, but are demonstrated on contrived tasks, while near-term practical use cases remain limited.
+"""

This ensures that the text content is processed correctly without unintended leading spaces.
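
The effect is easy to reproduce with a two-line string. The following sketch uses placeholder text rather than the test's paragraph, and the function names are illustrative only.

```python
import textwrap


def build_indented() -> str:
    # Continuation lines keep the source indentation as part of the string.
    text = """First line.
        Second line.
        """
    return text


def build_dedented() -> str:
    # The backslash suppresses the leading newline; dedent strips the common margin.
    text = textwrap.dedent("""\
        First line.
        Second line.
        """)
    return text


indented = build_indented()
dedented = build_dedented()
print(repr(indented))
print(repr(dedented))
```

Printing with `repr` makes the embedded spaces visible, which is a quick way to check what an embedding model would actually receive.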

README.md (1)

Line range hint 1-400: Documentation needs to be updated to fully reflect Milvus integration.

Please update the following sections to include Milvus:

  1. In the "Vector Stores" section under "Vector retrieval, Graphs and LLMs", add Milvus to the list:
- **Vector Stores**: Cognee supports LanceDB, Qdrant, PGVector and Weaviate for vector storage.
+ **Vector Stores**: Cognee supports LanceDB, Qdrant, PGVector, Weaviate, and Milvus for vector storage.
  2. Add Milvus to the Vector & Graph Databases Implementation State table:
 | Name             | Type               | Current state     | Known Issues                          |
 |------------------|--------------------|-------------------|---------------------------------------|
 | Qdrant           | Vector             | Stable &#x2705;   |                                       |
 | Weaviate         | Vector             | Stable &#x2705;   |                                       |
 | LanceDB          | Vector             | Stable &#x2705;   |                                       |
+| Milvus           | Vector             | Stable &#x2705;   |                                       |
 | Neo4j            | Graph              | Stable &#x2705;   |                                       |
📜 Review details

Configuration used: CodeRabbit UI
Review profile: CHILL

📥 Commits

Reviewing files that changed from the base of the PR and between 42ab601 and 747a6b9.

⛔ Files ignored due to path filters (1)
  • poetry.lock is excluded by !**/*.lock
📒 Files selected for processing (8)
  • .env.template (1 hunks)
  • .github/workflows/test_milvus.yml (1 hunks)
  • README.md (2 hunks)
  • cognee/infrastructure/databases/vector/create_vector_engine.py (4 hunks)
  • cognee/infrastructure/databases/vector/milvus/MilvusAdapter.py (1 hunks)
  • cognee/infrastructure/databases/vector/milvus/__init__.py (1 hunks)
  • cognee/tests/test_milvus.py (1 hunks)
  • pyproject.toml (2 hunks)
✅ Files skipped from review due to trivial changes (2)
  • cognee/infrastructure/databases/vector/milvus/__init__.py
  • .env.template
🧰 Additional context used
🪛 actionlint (1.7.4)
.github/workflows/test_milvus.yml

26-26: if: condition "needs.get_docs_changes.outputs.changes_outside_docs == 'true' && ${{ github.event.label.name == 'run-checks' }}" is always evaluated to true because extra characters are around ${{ }}

(if-cond)

🪛 Ruff (0.8.0)
cognee/infrastructure/databases/vector/milvus/MilvusAdapter.py

35-35: Undefined name MilvusClient

(F821)
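
Ruff raises F821 when a name is used, including inside a string annotation, without being bound at module level. One common way this arises with optional dependencies is a lazy import inside a method combined with a return annotation that names the imported class. Whether that is what happens at line 35 of MilvusAdapter.py is an assumption; the class `LazyAdapter` below is hypothetical and only sketches one way to satisfy the linter while keeping the import lazy.

```python
from typing import TYPE_CHECKING, Optional

if TYPE_CHECKING:
    # Evaluated only by type checkers and linters, so the optional
    # package is not needed just to import this module.
    from pymilvus import MilvusClient


class LazyAdapter:
    """Hypothetical adapter that defers the pymilvus import until first use."""

    def __init__(self, url: str, api_key: Optional[str] = None):
        self.url = url
        self.api_key = api_key

    def get_milvus_client(self) -> "MilvusClient":
        # Imported here so users without the milvus extra can still import the module.
        from pymilvus import MilvusClient

        if self.api_key:
            return MilvusClient(uri=self.url, token=self.api_key)
        return MilvusClient(uri=self.url)
```

Constructing `LazyAdapter` works without pymilvus installed; only calling `get_milvus_client()` requires the package.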

🔇 Additional comments (4)
cognee/infrastructure/databases/vector/milvus/MilvusAdapter.py (1)

159-174: Handle exceptions when retrieving data points

In the retrieve method, if the client.query operation does not return any results, the method may not handle the empty response appropriately, potentially leading to errors downstream.

Please verify that the method handles empty query results without causing exceptions.

This script searches for the retrieve method across the codebase to review its implementation and usage, focusing on error handling when no results are found.
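
A minimal sketch of how a retrieve method could treat an empty result as a normal outcome follows. `FakeClient`, its `query` signature, and the record shape are assumptions made so the example runs without a Milvus server; they do not reproduce the adapter's code.

```python
import asyncio
from typing import Any, Dict, List


class FakeClient:
    """Stand-in for a vector DB client; returns rows whose id was requested."""

    def __init__(self, rows: List[Dict[str, Any]]):
        self._rows = rows

    def query(self, collection_name: str, ids: List[str]) -> List[Dict[str, Any]]:
        return [row for row in self._rows if row["id"] in ids]


async def retrieve(
    client: FakeClient, collection_name: str, ids: List[str]
) -> List[Dict[str, Any]]:
    if not ids:
        # Nothing to look up; skip the round trip entirely.
        return []
    results = await asyncio.to_thread(client.query, collection_name, ids)
    # Normalize a None or empty response to a list so callers can iterate safely.
    return list(results or [])


client = FakeClient([{"id": "a", "text": "hello"}])
found = asyncio.run(retrieve(client, "docs", ["a"]))
missing = asyncio.run(retrieve(client, "docs", ["zzz"]))
print(found, missing)
```

Returning an empty list rather than raising lets callers distinguish "no match" from a genuine failure, which would still surface as an exception from the client.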

README.md (1)

56-61: LGTM: Installation instructions for Milvus are clear and consistent.

The new installation instructions for Milvus follow the established pattern and provide clear guidance for both pip and poetry users.

Also applies to: 92-97

pyproject.toml (2)

88-88: LGTM! The milvus extras group is well-structured.

The addition of the milvus extras group follows the established pattern in the project for optional database providers.


73-73: Verify pymilvus version compatibility

The version constraint ^2.5.0 allows updates to minor versions. Let's verify this version for compatibility and security.

✅ Verification successful

Version constraint ^2.5.0 is appropriate and secure

The version constraint is well-chosen because:

  • 2.5.0 is the latest stable release (released on Nov 26, 2024)
  • No known security vulnerabilities are reported
  • The caret (^) permits minor and patch updates (>=2.5.0, <3.0.0) while excluding the next major version
  • The version follows the same pattern as other vector DB dependencies in the project
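
For readers unfamiliar with caret constraints, the range they describe can be sketched in a few lines. This is a simplified illustration that handles only full three-part versions and ignores pre-release tags; the helper names are made up and this is not Poetry's implementation.

```python
from typing import Tuple

Version = Tuple[int, int, int]


def parse(version: str) -> Version:
    major, minor, patch = (int(part) for part in version.split("."))
    return (major, minor, patch)


def caret_upper_bound(base: Version) -> Version:
    """Exclusive upper bound: bump the leftmost non-zero component."""
    major, minor, patch = base
    if major > 0:
        return (major + 1, 0, 0)
    if minor > 0:
        return (0, minor + 1, 0)
    return (0, 0, patch + 1)


def caret_allows(constraint: str, candidate: str) -> bool:
    base = parse(constraint.lstrip("^"))
    # Tuples compare component by component, matching version ordering.
    return base <= parse(candidate) < caret_upper_bound(base)


print(caret_allows("^2.5.0", "2.9.1"))
print(caret_allows("^2.5.0", "3.0.0"))
```

Under these rules `^2.5.0` accepts any 2.x release from 2.5.0 onward and rejects 3.0.0, so a breaking major release of pymilvus would not be picked up automatically.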
🏁 Scripts executed

The following scripts were executed for the analysis:

Script:

#!/bin/bash
# Check PyPI for latest versions and release dates
curl -s https://pypi.org/pypi/pymilvus/json | jq -r '.releases | keys_unsorted[-5:] as $versions | $versions[] as $v | "\($v): \(.[$v][0].upload_time)"'

# Check for security advisories
gh api graphql -f query='
{
  securityVulnerabilities(first: 5, ecosystem: PIP, package: "pymilvus") {
    nodes {
      advisory {
        summary
        severity
        publishedAt
      }
      vulnerableVersionRange
      firstPatchedVersion {
        identifier
      }
    }
  }
}'

Length of output: 631

Comment on lines +46 to +51
async def has_collection(self, collection_name: str) -> bool:
    future = asyncio.Future()
    client = self.get_milvus_client()
    future.set_result(client.has_collection(collection_name=collection_name))

    return await future

🛠️ Refactor suggestion

Simplify asynchronous method that wraps synchronous call

In the has_collection method, using asyncio.Future to wrap a synchronous operation adds unnecessary complexity and does not provide actual asynchronous benefits.

Consider refactoring the method to either make it synchronous or properly handle the synchronous call without unnecessary wrapping:

-    async def has_collection(self, collection_name: str) -> bool:
-        future = asyncio.Future()
-        client = self.get_milvus_client()
-        future.set_result(client.has_collection(collection_name=collection_name))
-        return await future
+    def has_collection(self, collection_name: str) -> bool:
+        client = self.get_milvus_client()
+        return client.has_collection(collection_name=collection_name)

If you need to keep the method asynchronous to conform to the interface, you can run the synchronous operation in an executor to avoid blocking the event loop:

    async def has_collection(self, collection_name: str) -> bool:
        client = self.get_milvus_client()
        # loop.run_in_executor() accepts only positional arguments, so use
        # asyncio.to_thread (Python 3.9+), which forwards keyword arguments.
        return await asyncio.to_thread(client.has_collection, collection_name=collection_name)
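
The thread-offloading approach can be tried without a Milvus server. In the sketch below, `SlowClient` is a made-up stand-in whose `has_collection` blocks briefly, and `Adapter` is a hypothetical wrapper. It uses `asyncio.to_thread` (Python 3.9+), which forwards keyword arguments, whereas `loop.run_in_executor` accepts only positional ones.

```python
import asyncio
import time


class SlowClient:
    """Hypothetical synchronous client; sleep() simulates a blocking network call."""

    def has_collection(self, collection_name: str) -> bool:
        time.sleep(0.05)
        return collection_name == "docs"


class Adapter:
    def __init__(self, client: SlowClient):
        self._client = client

    async def has_collection(self, collection_name: str) -> bool:
        # Runs the blocking call in a worker thread so the event loop stays free.
        return await asyncio.to_thread(
            self._client.has_collection, collection_name=collection_name
        )


async def main() -> list:
    adapter = Adapter(SlowClient())
    # Both checks run concurrently because neither blocks the loop.
    return await asyncio.gather(
        adapter.has_collection("docs"),
        adapter.has_collection("other"),
    )


results = asyncio.run(main())
print(results)
```

On Python versions before 3.9, wrapping the call in `functools.partial` and passing it to `loop.run_in_executor` achieves the same effect.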

.github/workflows/test_milvus.yml (conversation resolved)
Update README.md with state of stable databases

Docs
@Vasilije1990 Vasilije1990 self-requested a review December 3, 2024 13:11
@dexters1 dexters1 merged commit a117852 into main Dec 3, 2024
35 of 38 checks passed
@dexters1 dexters1 deleted the milvus-vector-db branch December 3, 2024 13:14
This was referenced Dec 9, 2024
3 participants