Cog 170 pgvector adapter #158

Merged
merged 34 commits into from
Oct 22, 2024
Conversation

dexters1
Collaborator

@dexters1 commented Oct 18, 2024

Add PGVector support for cognee

Summary by CodeRabbit

Release Notes

  • New Features

    • Added support for the "pgvector" vector database provider.
    • Introduced a new GitHub Actions workflow for automated testing of the pgvector integration.
    • Enhanced documentation in the README to include new vector storage options.
    • Expanded configuration options for vector databases in the settings.
  • Bug Fixes

    • Improved clarity in configuration settings for various database providers.
  • Documentation

    • Updated README with expanded installation instructions and additional supported technologies.
  • Chores

    • Updated docker-compose.yml to use the new pgvector PostgreSQL image.
    • Modified pyproject.toml to include new dependencies for pgvector.
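
As a rough usage sketch (not taken from this PR's diff; the key names are assumptions based on the dictionary-based config methods and the vector_db_provider setting introduced here), selecting the new provider from user code might look like:

import cognee

cognee.config.set_vector_db_config({
    "vector_db_provider": "pgvector",  # select the new adapter
})
cognee.config.set_relational_db_config({
    "db_host": "127.0.0.1",
    "db_port": "5432",
    "db_username": "cognee",
    "db_password": "cognee",
    "db_name": "cognee_db",
    "db_provider": "postgres",
})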

dexters1 and others added 15 commits October 11, 2024 13:23
Added config support for using pgvector

Feature #COG-170
Added PGVectorAdapter

Feature #COG-170
Saving state of pgvector integration development so far

Feature #COG-170
Added the first working iteration of PGVector for cognee; some important functionality is still missing, but the core is there. Some refactoring will also be necessary.

Feature: #COG-170
Remove VECTOR_DB_NAME env parameter as it's not needed

Refactor #COG-170
Removed echoing of database operations

Refactor #COG-170
Added batch search to PGVectorAdapter

Feature #COG-170
Optimize search query for PGVector

Refactor #COG-170
Formatted PGVectorAdapter

Refactor #COG-170
Added formatting for PGVector part of create_vector_engine

Refactor #COG-170
Change formatting for connection string for PGVectorAdapter

Refactor #COG-170
Changed raw SQL queries to use the SQLAlchemy ORM in PGVectorAdapter

Refactor #COG-170
Fixed issue with newly added tables not being pruned from postgres database

Fix #COG-170
Added test for PGVectorAdapter

Test #COG-170

gitguardian bot commented Oct 21, 2024

⚠️ GitGuardian has uncovered 6 secrets following the scan of your pull request.

Please consider investigating the findings and remediating the incidents. Failure to do so may lead to compromising the associated services or software components.

🔎 Detected hardcoded secrets in your pull request
| GitGuardian id | GitGuardian status | Secret | Commit | Filename |
|---|---|---|---|---|
| 9573981 | Triggered | Generic Password | f8babba | cognee/tests/test_pgvector.py |
| 9573981 | Triggered | Generic Password | 9f4b8f2 | .github/workflows/test_pgvector.yml |
| 9573981 | Triggered | Generic Password | 2cedcbe | .github/workflows/test_pgvector.yml |
| 9573981 | Triggered | Generic Password | a358168 | cognee/tests/test_pgvector.py |
| 9573981 | Triggered | Generic Password | a358168 | cognee/tests/test_pgvector.py |
| 9573981 | Triggered | Generic Password | 6b9a142 | cognee/tests/test_pgvector.py |
🛠 Guidelines to remediate hardcoded secrets
  1. Understand the implications of revoking this secret by investigating where it is used in your code.
  2. Replace and store your secrets safely. Learn here the best practices.
  3. Revoke and rotate these secrets.
  4. If possible, rewrite git history. Rewriting git history is not a trivial act. You might completely break other contributing developers' workflow and you risk accidentally deleting legitimate data.

🦉 GitGuardian detects secrets in your source code to help developers and security teams secure the modern development process. You are seeing this because you or someone else with access to this repository has authorized GitGuardian to scan your pull request.
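
As a hedged illustration of step 2 above, test credentials can be resolved from the environment instead of being hardcoded (the variable names below are illustrative, not the ones this repository uses):

import os

# Required variables fail fast with a KeyError if unset; the rest fall
# back to local-development defaults.
db_config = {
    "db_host": os.environ.get("POSTGRES_HOST", "127.0.0.1"),
    "db_port": os.environ.get("POSTGRES_PORT", "5432"),
    "db_username": os.environ["POSTGRES_USER"],
    "db_password": os.environ["POSTGRES_PASSWORD"],
    "db_name": os.environ.get("POSTGRES_DB", "cognee_db"),
}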

Added GitHub Actions workflow to run the PGVectorAdapter integration test

Test #COG-170
Added pgvector dependency for PGVectorAdapter

Chore #COG-170
Add psycopg2, a PostgreSQL database adapter, as a dependency

Chore #COG-170
Changed database name in test and workflow to be the same as in the .env.template

Refactor #COG-170
cognee/tests/test_pgvector.py (outdated)
from .get_vector_engine import get_vector_engine, get_vectordb_config
from sqlalchemy import text

async def create_db_and_tables():
Contributor

We could try to move this inside pgvector adapter.

Collaborator Author

Done

Move serialize_datetime function into its own file

Refactor #COG-170
Moved create_db_and_tables module to pgvector so it only handles pgvector related db creation

Refactor #COG-170
Added the ability to set database configurations through dictionary for ease of use.
Updated test_pgvector to use this way of setting configuration

Refactor #COG-170
Made psycopg2 an optional dependency for postgres so that library issues it might cause don't block other users

Refactor #COG-170
Since the postgres-related libraries are no longer mandatory, they must be explicitly specified in order to be installed

Chore #COG-170
pgvector is imported directly in the PGVector adapter, so it needs to be installed with the mandatory dependencies

Chore #COG-170
Renamed vector_engine_provider to vector_db_provider

Fix #COG-170
}
)
cognee.config.set_relational_db_config({"db_path": "",
"db_name": "cognee_db",
Contributor

Spacings are off here

@@ -39,6 +39,41 @@ def set_classification_model(classification_model: object):
    cognify_config = get_cognify_config()
    cognify_config.classification_model = classification_model

    @staticmethod
    def set_db_name(db_name: str):
Contributor

Let's remove this, we don't want to have to maintain it.

from ..get_vector_engine import get_vector_engine, get_vectordb_config
from sqlalchemy import text

async def create_db_and_tables():
Contributor

If you merge with main, you'll see that create_database exists in SqlAlchemyAdapter. Is that something we can reuse here?

Collaborator Author

Yes and no. By that I mean we can use create_database from SqlAlchemyAdapter as the initial step, but there is an extra step to enable pgvector for a postgres database, which we have to run in an engine connection context.
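
A minimal sketch of that two-step setup, assuming the vector engine exposes the SQLAlchemy adapter's create_database and an async engine attribute (attribute names are illustrative):

from sqlalchemy import text

from .get_vector_engine import get_vector_engine, get_vectordb_config

async def create_db_and_tables():
    vector_config = get_vectordb_config()
    vector_engine = get_vector_engine()

    if vector_config.vector_db_provider == "pgvector":
        # Step 1: reuse the SqlAlchemyAdapter logic to create the database itself.
        await vector_engine.create_database()

        # Step 2: enable the pgvector extension inside an engine connection context.
        async with vector_engine.engine.begin() as connection:
            await connection.execute(text("CREATE EXTENSION IF NOT EXISTS vector"))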

Remove unused config methods so we don't have to maintain them.
Fixed spacing in test_pgvector integration test.

Refactor #COG-170
…endencies through extras

To be uniform with how we install optional dependencies, postgres optional dependencies are now installed through extras

Chore #COG-170
Reused the SQLAlchemy adapter's create_database method instead of writing database creation code directly

Refactor #COG-170
…test for pgvector

Updated how postgres dependencies are installed due to change in pyproject.toml

Chore #COG-170
@dexters1 marked this pull request as ready for review October 22, 2024 12:34
Contributor

coderabbitai bot commented Oct 22, 2024

Walkthrough

The changes in this pull request encompass updates to various files, primarily enhancing the configuration and documentation for the integration of the pgvector database within the cognee project. Key modifications include clarifications in the .env.template file regarding database providers, the introduction of a new GitHub Actions workflow for testing pgvector, and the addition of new classes and methods for managing vector database configurations. Documentation updates in the README.md provide expanded support information for new technologies, while several files related to the API and database management reflect these enhancements.

Changes

| File | Change Summary |
|---|---|
| .env.template | Updated comments for GRAPH_DATABASE_PROVIDER, VECTOR_DB_PROVIDER, and DB_PROVIDER to clarify acceptable values. |
| .github/workflows/test_pgvector.yml | Added new workflow to automate testing for pgvector integration with defined jobs and PostgreSQL service setup. |
| README.md | Expanded documentation on vector storage to include "LanceDB" and "PGVector"; minor formatting adjustments made. |
| cognee/api/client.py | Updated VectorDBConfigDTO to include pgvector, added LLMConfigDTO class, modified SettingsPayloadDTO to include optional fields for LLM and vector DB. |
| cognee/api/v1/add/add.py | Modified add function to call both relational and vector database creation functions asynchronously. |
| cognee/api/v1/add/add_v2.py | Renamed and updated add function to use new database creation functions. |
| cognee/api/v1/config/config.py | Added methods for setting relational and vector database configurations using dictionaries. |
| cognee/infrastructure/databases/relational/sqlalchemy/SqlAlchemyAdapter.py | Updated delete_database method to reflect current schema before dropping tables. |
| cognee/infrastructure/databases/vector/create_vector_engine.py | Added handling for pgvector in create_vector_engine function. |
| cognee/infrastructure/databases/vector/pgvector/PGVectorAdapter.py | Introduced PGVectorAdapter class implementing vector data management with various methods for database operations. |
| cognee/infrastructure/databases/vector/pgvector/__init__.py | Added imports for PGVectorAdapter and create_db_and_tables. |
| cognee/infrastructure/databases/vector/pgvector/create_db_and_tables.py | Introduced a function to set up a PostgreSQL vector database. |
| cognee/infrastructure/databases/vector/pgvector/serialize_datetime.py | Added serialize_datetime function for converting datetime objects to ISO 8601 strings. |
| cognee/modules/settings/get_settings.py | Added "PGVector" option to vector_dbs list in get_settings function. |
| cognee/modules/settings/save_vector_db_config.py | Expanded provider attribute in VectorDBConfig to include "pgvector". |
| cognee/tests/test_pgvector.py | Introduced a new test file with an asynchronous main function to validate pgvector functionality. |
| docker-compose.yml | Updated postgres service image to pgvector/pgvector:pg17 and made minor formatting adjustments. |
| pyproject.toml | Added pgvector dependency, modified psycopg2 to be optional, and introduced a new extras group for PostgreSQL. |

Possibly related issues

  • [COG-170] Add PGvector to the infrastructure vector-store adapters: This PR addresses the integration of pgvector, aligning with the objectives of the related issue.

Possibly related PRs

Suggested reviewers

  • borisarzentar
  • Vasilije1990

Poem

🐰 In the garden of code, changes bloom bright,
With pgvector now ready, our future's in sight.
Comments refined, workflows in play,
A hop and a skip, we’re paving the way!
So let’s raise our carrots to new paths we tread,
In the world of databases, we leap ahead! 🌱


Contributor

@coderabbitai bot left a comment

Actionable comments posted: 17

🧹 Outside diff range and nitpick comments (15)
cognee/infrastructure/databases/vector/pgvector/__init__.py (1)

1-2: LGTM! Consider using absolute imports for consistency.

The imports look good and align with the PR objective of adding PGVector support. They make the PGVectorAdapter class and create_db_and_tables function easily accessible when importing from this package.

For consistency with Python best practices and to avoid potential issues with relative imports, consider using absolute imports:

-from .PGVectorAdapter import PGVectorAdapter
-from .create_db_and_tables import create_db_and_tables
+from cognee.infrastructure.databases.vector.pgvector.PGVectorAdapter import PGVectorAdapter
+from cognee.infrastructure.databases.vector.pgvector.create_db_and_tables import create_db_and_tables

This change would make the imports more explicit and less prone to issues if the package structure changes in the future.

cognee/infrastructure/databases/vector/pgvector/serialize_datetime.py (1)

3-12: LGTM: Well-implemented serialization function with a minor suggestion.

The serialize_datetime function is well-implemented and correctly handles nested structures. It effectively converts datetime objects to ISO format strings while preserving other data types.

Consider adding type hints to improve code readability and enable better static type checking:

from typing import Union, List, Dict, Any

def serialize_datetime(data: Union[Dict, List, datetime, Any]) -> Union[Dict, List, str, Any]:
    # ... rest of the function remains the same

This change would make the function's input and output types more explicit, which can be helpful for maintainability and catching potential type-related issues early.

cognee/infrastructure/databases/vector/pgvector/create_db_and_tables.py (2)

5-13: LGTM: Well-structured implementation for pgvector setup.

The function effectively sets up the pgvector database, addressing the previous review comment by using vector_engine.create_database() and then enabling the vector extension. The use of async/await and SQLAlchemy's text() function for raw SQL is appropriate.

Consider adding error handling to manage potential exceptions during database operations. This will improve the robustness of the function.
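
For example, the body of create_db_and_tables could wrap the two database steps like this (a sketch; the logger setup is an assumption, not existing project code):

import logging

logger = logging.getLogger(__name__)

try:
    await vector_engine.create_database()
    async with vector_engine.engine.begin() as connection:
        await connection.execute(text("CREATE EXTENSION IF NOT EXISTS vector"))
except Exception:
    # Surface setup failures with context instead of failing silently.
    logger.exception("Failed to create the pgvector database or enable the vector extension")
    raise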

Also, adding a docstring would enhance the function's documentation. Here's a suggested docstring:

async def create_db_and_tables():
    """
    Asynchronously create the database and necessary tables for the pgvector provider.
    
    This function retrieves the vector database configuration and engine, creates the database
    if the provider is pgvector, and enables the vector extension in the database.
    
    Raises:
        Exception: If there's an error during database creation or extension enabling.
    """
    # ... (rest of the function)

1-14: LGTM: Well-structured and focused implementation.

The file adheres to the Single Responsibility Principle, focusing solely on creating the database and tables for pgvector. Its structure is clean and easy to understand.

Consider adding more detailed comments or logging statements to enhance maintainability and debugging capabilities. For example, you could log the successful creation of the database and enabling of the vector extension.

cognee/api/v1/add/add_v2.py (3)

6-7: LGTM! Consider using absolute imports.

The new import statements are clear and align with the PR objective of adding PGVector support. The renaming of create_db_and_tables to create_relational_db_and_tables improves clarity.

Consider using absolute imports for better maintainability:

-from cognee.infrastructure.databases.relational import create_db_and_tables as create_relational_db_and_tables
-from cognee.infrastructure.databases.vector.pgvector import create_db_and_tables as create_pgvector_db_and_tables
+from cognee.infrastructure.databases.relational.create_db_and_tables import create_db_and_tables as create_relational_db_and_tables
+from cognee.infrastructure.databases.vector.pgvector.create_db_and_tables import create_db_and_tables as create_pgvector_db_and_tables

10-11: LGTM! Consider error handling and potential performance optimization.

The addition of database creation calls aligns with the PR objective and ensures proper setup before data processing.

Consider the following improvements:

  1. Add error handling:
try:
    await create_relational_db_and_tables()
    await create_pgvector_db_and_tables()
except Exception as e:
    logger.error(f"Failed to create databases: {e}")
    raise
  1. For potential performance improvement, consider running these operations concurrently:
await asyncio.gather(
    create_relational_db_and_tables(),
    create_pgvector_db_and_tables()
)

This change would require importing asyncio at the top of the file.


Line range hint 9-24: Consider updating the function's docstring.

The changes to the add function are minimal and focused on adding PGVector support, which is good. However, the function's docstring (if it exists) might need an update to reflect the new database setup process.

Consider adding or updating the function's docstring to include information about the database setup:

async def add(data: Union[BinaryIO, list[BinaryIO], str, list[str]], dataset_name: str = "main_dataset", user: User = None):
    """
    Add data to the system, setting up both relational and vector databases.

    This function initializes the necessary database tables for both relational
    and vector (PGVector) databases before processing the input data.

    Args:
        data (Union[BinaryIO, list[BinaryIO], str, list[str]]): The data to be added.
        dataset_name (str, optional): The name of the dataset. Defaults to "main_dataset".
        user (User, optional): The user associated with this operation. If None, uses the default user.

    Returns:
        Asynchronous generator yielding results from the add pipeline.
    """
    # ... (rest of the function)
.github/workflows/test_pgvector.yml (3)

13-24: LGTM: Job structure and conditions are well-defined.

The environment variable setup and job definitions are appropriate. The dependency between jobs and the condition for running the integration test are logically structured.

Consider adding a comment explaining the purpose of the get_docs_changes job for better clarity:

jobs:
  get_docs_changes:
    name: docs changes
    uses: ./.github/workflows/get_docs_changes.yml
    # This job checks if there are any changes outside of documentation files

29-42: LGTM: PostgreSQL service is well-configured.

The PostgreSQL service setup using the pgvector image is appropriate. The health check and port mapping are correctly configured.

For improved security, consider using GitHub Secrets for the database credentials:

env:
  POSTGRES_USER: ${{ secrets.POSTGRES_USER }}
  POSTGRES_PASSWORD: ${{ secrets.POSTGRES_PASSWORD }}
  POSTGRES_DB: ${{ secrets.POSTGRES_DB }}

This change would prevent exposing the credentials in the workflow file.


63-67: LGTM: Test execution is properly configured.

The test execution step is well-defined, using Poetry to run the pgvector test script. The use of GitHub Secrets for the API key is a good security practice.

Consider adding a comment explaining the purpose of the LLM_API_KEY, and possibly make the ENV value configurable:

env:
  ENV: ${{ inputs.environment || 'dev' }}  # Allow overriding the environment
  LLM_API_KEY: ${{ secrets.OPENAI_API_KEY }}  # API key for OpenAI integration

This change would provide more context and flexibility for different test scenarios.

cognee/modules/settings/get_settings.py (1)

44-46: LGTM! Consider a minor formatting adjustment.

The addition of PGVector to the vector_dbs list is correct and aligns with the PR objective. The new entry follows the existing structure, maintaining consistency.

For improved readability and consistency with the existing entries, consider adjusting the indentation slightly:

    }, {
-        "value": "pgvector",
-        "label": "PGVector",
+    "value": "pgvector",
+    "label": "PGVector",
    }]

This minor change would align the new entry's indentation with the existing entries in the list.

cognee/infrastructure/databases/relational/sqlalchemy/SqlAlchemyAdapter.py (1)

122-123: Approve changes with suggestions for improvement

The addition of schema reflection before dropping tables is a good improvement. It ensures that tables are dropped in the correct order, avoiding potential foreign key constraint violations.

Consider the following enhancements:

  1. Add a comment explaining the purpose of the schema reflection:
# Reflect the current database schema to ensure all tables are accounted for
await connection.run_sync(Base.metadata.reflect)
  1. Implement error handling for individual table drops:
for table in Base.metadata.sorted_tables:
    try:
        drop_table_query = text(f"DROP TABLE IF EXISTS {table.name} CASCADE")
        await connection.execute(drop_table_query)
    except Exception as e:
        print(f"Error dropping table {table.name}: {e}")
        # Decide whether to continue or break the loop based on your requirements

These changes will improve code readability and robustness.

README.md (1)

197-197: LGTM: NetworkX added as a supported graph store.

The addition of NetworkX as a supported graph store alongside Neo4j is a valuable update to the documentation. It provides a more comprehensive view of the project's capabilities.

Consider updating the PR description to mention this addition of NetworkX support, as it's a significant enhancement that wasn't explicitly stated in the original PR objectives.

cognee/infrastructure/databases/vector/pgvector/PGVectorAdapter.py (2)

190-192: Use proper logging instead of print statements

Using print statements for error handling is not ideal in production code. It can lead to unstructured logs and missed error tracking.

Replace print with structured logging:

 except Exception as e:
-    print(f"Error during search: {e}")
+    logger.exception("Error during search")
     return []

Ensure you have configured a logger, e.g., by importing logging and setting up a logger instance.
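
For instance, a minimal module-level setup might be (a sketch; the adapter does not currently configure one):

import logging

# Logger named after the adapter's module, so output can be filtered per-module.
logger = logging.getLogger(__name__)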


195-200: Inconsistent parameter name: with_vectors vs. with_vector

In batch_search, the parameter is named with_vectors, whereas in search, it's with_vector. This inconsistency can cause confusion and potential bugs.

Consider standardizing the parameter name across both methods:

 async def batch_search(
     self,
     collection_name: str,
     query_texts: List[str],
     limit: int = None,
-    with_vectors: bool = False,
+    with_vector: bool = False,
 ):

And ensure you update all references to this parameter within the method.

📜 Review details

Configuration used: CodeRabbit UI
Review profile: CHILL

📥 Commits

Files that changed from the base of the PR and between 2f832b1 and 0e1533a.

⛔ Files ignored due to path filters (1)
  • poetry.lock is excluded by !**/*.lock
📒 Files selected for processing (18)
  • .env.template (1 hunks)
  • .github/workflows/test_pgvector.yml (1 hunks)
  • README.md (1 hunks)
  • cognee/api/client.py (1 hunks)
  • cognee/api/v1/add/add.py (1 hunks)
  • cognee/api/v1/add/add_v2.py (1 hunks)
  • cognee/api/v1/config/config.py (1 hunks)
  • cognee/infrastructure/databases/relational/sqlalchemy/SqlAlchemyAdapter.py (1 hunks)
  • cognee/infrastructure/databases/vector/create_vector_engine.py (2 hunks)
  • cognee/infrastructure/databases/vector/pgvector/PGVectorAdapter.py (1 hunks)
  • cognee/infrastructure/databases/vector/pgvector/__init__.py (1 hunks)
  • cognee/infrastructure/databases/vector/pgvector/create_db_and_tables.py (1 hunks)
  • cognee/infrastructure/databases/vector/pgvector/serialize_datetime.py (1 hunks)
  • cognee/modules/settings/get_settings.py (1 hunks)
  • cognee/modules/settings/save_vector_db_config.py (1 hunks)
  • cognee/tests/test_pgvector.py (1 hunks)
  • docker-compose.yml (1 hunks)
  • pyproject.toml (1 hunks)
✅ Files skipped from review due to trivial changes (1)
  • .env.template
🧰 Additional context used
🔇 Additional comments (25)
cognee/infrastructure/databases/vector/pgvector/serialize_datetime.py (2)

1-1: LGTM: Import statement is correct and sufficient.

The import of datetime from the datetime module is appropriate for the function's needs.


1-12: Overall, well-implemented serialization function with room for optional enhancements.

The serialize_datetime function is a solid implementation that effectively handles the serialization of datetime objects in nested structures. It's a valuable addition to the project, particularly for working with databases that require serialized datetime values.

The suggested improvements (adding type hints, handling additional date-related types, and custom objects) are optional enhancements that could further improve the robustness and flexibility of the function. Consider implementing these based on the specific needs of your project and the types of data you expect to handle.

Great job on this implementation!

cognee/infrastructure/databases/vector/pgvector/create_db_and_tables.py (1)

1-3: LGTM: Imports are appropriate and consistent.

The imports are well-structured and relevant to the file's functionality. The use of relative imports maintains consistency with the project structure.

cognee/modules/settings/save_vector_db_config.py (1)

8-8: LGTM! Verify consistency across the codebase.

The addition of "pgvector" as a new option for the provider attribute correctly implements PGVector support at the configuration level. This change aligns well with the PR objective.

To ensure consistency, please run the following script to check for any other occurrences of vector database provider definitions that might need updating:

Consider updating any relevant documentation (e.g., README, API docs) to reflect this new provider option.

✅ Verification successful

Verification Successful: "pgvector" is consistently integrated across the codebase.

🏁 Scripts executed

The following scripts were executed for the analysis:

Script:

#!/bin/bash
# Description: Search for other vector database provider definitions

# Test: Search for similar provider definitions
rg -i 'provider.*(?:lancedb|qdrant|weaviate)'

# Test: Search for pgvector mentions to ensure it's added everywhere necessary
rg -i 'pgvector'

Length of output: 5953

docker-compose.yml (1)

65-65: Approve the update to pgvector image, with verification steps.

The change from postgres:latest to pgvector/pgvector:pg17 aligns well with the PR objective of adding PGVector support. This update is crucial for enabling vector operations in PostgreSQL.

However, please consider the following:

  1. Ensure that existing data and schemas are compatible with this new image.
  2. Verify that all components interacting with the database are compatible with PostgreSQL 17.
  3. Consider documenting this change in the project's documentation or README.

To verify the compatibility and impact of this change, you can run the following script:

This script will help identify any potential compatibility issues or areas that might need adjustment due to the PostgreSQL version change and pgvector introduction.

✅ Verification successful

Verified the pgvector image update. No compatibility issues detected.

🏁 Scripts executed

The following scripts were executed for the analysis:

Script:

#!/bin/bash
# Description: Check for PostgreSQL version dependencies and pgvector usage in the codebase

# Check for explicit PostgreSQL version dependencies
echo "Checking for explicit PostgreSQL version dependencies:"
rg -i 'postgres(ql)?\s*(version|v)?\s*([0-9]+)' --type py

# Check for pgvector usage in Python files
echo "Checking for pgvector usage:"
rg 'pgvector' --type py

# Check for SQL commands related to vector operations
echo "Checking for vector-related SQL commands:"
rg -i 'create extension vector|<\[.*\]>|l2_distance|inner_product|cosine_distance' --type sql

Length of output: 2371


Script:

#!/bin/bash
# Description: Re-check for vector-related SQL commands without type filtering

echo "Re-checking for vector-related SQL commands without file type restrictions:"
rg -i 'create extension vector|l2_distance|inner_product|cosine_distance'

Length of output: 391

.github/workflows/test_pgvector.yml (3)

1-11: LGTM: Workflow configuration is well-defined.

The workflow name, trigger conditions, and concurrency settings are appropriately configured for testing pgvector integration on pull requests and manual runs.


45-61: LGTM: Test environment setup is comprehensive.

The setup steps for the test environment are well-defined. The use of Poetry for dependency management and the inclusion of the postgres extra ensure a consistent and appropriate test environment.


1-67: Overall, excellent implementation of pgvector integration testing workflow.

This GitHub Actions workflow file is well-structured and comprehensive. It effectively sets up a pgvector-enabled PostgreSQL environment, manages dependencies using Poetry, and runs integration tests. The workflow aligns perfectly with the PR objective of adding PGVector support for cognee.

Key strengths:

  1. Clear job dependencies and conditions
  2. Proper use of GitHub Secrets for sensitive data
  3. Comprehensive setup of the test environment
  4. Specific pgvector test execution

The minor suggestions provided earlier will further enhance security and clarity. Great job on implementing this crucial testing workflow!

pyproject.toml (4)

73-73: LGTM: Addition of pgvector dependency

The addition of the pgvector dependency aligns with the PR objective of adding PGVector support to the project. The version constraint "^0.3.5" allows for compatible updates, which is a good practice.


74-74: LGTM: psycopg2 dependency made optional

Making the psycopg2 dependency optional is a good practice. It allows users to install the package without PostgreSQL support if they don't need it, reducing unnecessary dependencies for some use cases.


82-82: LGTM: Addition of 'postgres' extra

The addition of the 'postgres' extra is a good practice. It allows users to easily install PostgreSQL support when needed, which is consistent with the PR objective of adding PGVector support. This change, combined with making psycopg2 optional, provides flexibility in installation options.


73-82: Summary: Well-structured changes for PGVector support

The changes in this file are well-structured and consistent with the PR objective of adding PGVector support. The addition of the pgvector dependency, making psycopg2 optional, and introducing the 'postgres' extra work together to provide the necessary functionality while maintaining flexibility in installation options. These modifications follow poetry best practices and should improve the project's usability for different use cases.

cognee/modules/settings/get_settings.py (2)

Line range hint 1-146: Overall, the change is well-implemented and focused.

The addition of PGVector to the vector_dbs list is the only change in this file, and it's implemented correctly. The modification is minimal, focused, and doesn't disrupt the existing functionality of the get_settings function. This change aligns well with the PR objective of adding PGVector support to the project.


44-46: Verify handling of PGVector option in related code.

While the addition of PGVector to the vector_dbs list is correct, it's important to ensure that other parts of the codebase can handle this new option properly.

Let's verify the integration of PGVector in related files:

Please review the output of these checks to ensure proper integration of PGVector across the codebase.

✅ Verification successful

PGVector integration verified successfully.

All related codebase areas handle PGVector appropriately, and no TODOs or FIXMEs were found.

🏁 Scripts executed

The following scripts were executed for the analysis:

Script:

#!/bin/bash
# Description: Check for PGVector handling in related files

# Test 1: Check if PGVector is handled in vector database configuration
echo "Checking vector database configuration handling:"
rg -i 'pgvector|PGVector' --type py

# Test 2: Verify if there are any TODOs or FIXMEs related to PGVector
echo "\nChecking for TODOs or FIXMEs related to PGVector:"
rg -i '(TODO|FIXME).*pgvector' --type py

# Test 3: Look for potential switch statements or if-else blocks that might need updating
echo "\nChecking for potential switch statements or if-else blocks:"
rg -i 'if.*vector_db.*=|match.*vector_db' --type py

Length of output: 5346

cognee/api/v1/config/config.py (3)

Line range hint 1-121: Consider overall improvements for consistency and documentation.

After reviewing the entire file, here are some suggestions for overall improvement:

  1. Docstrings: Add docstrings to all methods for better documentation. This will help users understand the purpose and usage of each method.

  2. Consistent error handling: Implement a consistent approach to error handling across all methods. For example, decide whether to use exceptions or return error codes, and apply this consistently.

  3. Type hinting: Add type hints to all method parameters and return values for better code clarity and to enable static type checking.

  4. Further refactoring: Consider creating a base method for updating configuration objects, which could be used by multiple setter methods to reduce code duplication.

  5. Naming convention: Ensure all method names follow a consistent naming convention. For example, some methods start with "set_" while others don't.

Here's an example of how you might implement some of these suggestions:

from typing import Any, Dict

class config:
    @staticmethod
    def _update_config(config_object: Any, config_dict: Dict[str, Any]) -> bool:
        """
        Updates the given config object with values from config_dict.
        
        Args:
            config_object: The configuration object to update.
            config_dict: A dictionary of configuration key-value pairs.
        
        Returns:
            bool: True if successful, False otherwise.
        
        Raises:
            AttributeError: If an invalid attribute is provided in config_dict.
        """
        valid_attrs = set(dir(config_object)) - set(dir(object))
        for key, value in config_dict.items():
            if key in valid_attrs:
                setattr(config_object, key, value)
            else:
                raise AttributeError(f"'{key}' is not a valid attribute. Valid attributes are: {', '.join(valid_attrs)}")
        return True

    @staticmethod
    def set_relational_db_config(config_dict: Dict[str, Any]) -> bool:
        """
        Updates the relational database configuration.
        
        Args:
            config_dict: A dictionary of configuration key-value pairs.
        
        Returns:
            bool: True if successful, False otherwise.
        """
        return config._update_config(get_relational_config(), config_dict)

    @staticmethod
    def set_vector_db_config(config_dict: Dict[str, Any]) -> bool:
        """
        Updates the vector database configuration.
        
        Args:
            config_dict: A dictionary of configuration key-value pairs.
        
        Returns:
            bool: True if successful, False otherwise.
        """
        return config._update_config(get_vectordb_config(), config_dict)

    # ... (apply similar patterns to other methods)

This refactoring introduces consistent docstrings, type hinting, and a shared _update_config method to reduce duplication.

Could you confirm if these suggestions align with the project's coding standards and documentation practices?


98-109: 🛠️ Refactor suggestion

Consider enhancing error handling and adding return value.

The set_relational_db_config method looks good overall, but there are a few suggestions for improvement:

  1. Consider adding type checking for the values being set. This could prevent issues with incorrect data types.
  2. The error message for invalid attributes could be more informative. Perhaps include a list of valid attributes in the error message.
  3. The method doesn't return anything. Consider returning the updated config object or a boolean indicating success. This would make it clearer to users that the operation was successful.

Here's a suggested improvement:

@staticmethod
def set_relational_db_config(config_dict: dict) -> bool:
    """
    Updates the relational db config with values from config_dict.
    Returns True if successful, False otherwise.
    """
    relational_db_config = get_relational_config()
    valid_attrs = set(dir(relational_db_config)) - set(dir(object))
    for key, value in config_dict.items():
        if key in valid_attrs:
            setattr(relational_db_config, key, value)
        else:
            raise AttributeError(f"'{key}' is not a valid attribute. Valid attributes are: {', '.join(valid_attrs)}")
    return True

This version includes a return value, uses setattr() for clarity, and provides a more informative error message.

Could you clarify if there's a specific reason for using object.__setattr__() instead of the standard setattr() function?


110-121: 🛠️ Refactor suggestion

Consider refactoring to reduce code duplication.

The set_vector_db_config method is nearly identical to set_relational_db_config. This duplication suggests an opportunity for refactoring to improve maintainability and reduce the risk of inconsistencies.

Consider creating a generic method for updating configurations:

@staticmethod
def _update_config(config_object, config_dict: dict) -> bool:
    """
    Updates the given config object with values from config_dict.
    Returns True if successful, False otherwise.
    """
    valid_attrs = set(dir(config_object)) - set(dir(object))
    for key, value in config_dict.items():
        if key in valid_attrs:
            setattr(config_object, key, value)
        else:
            raise AttributeError(f"'{key}' is not a valid attribute. Valid attributes are: {', '.join(valid_attrs)}")
    return True

@staticmethod
def set_relational_db_config(config_dict: dict) -> bool:
    return config._update_config(get_relational_config(), config_dict)

@staticmethod
def set_vector_db_config(config_dict: dict) -> bool:
    return config._update_config(get_vectordb_config(), config_dict)

This refactoring addresses the duplication issue and incorporates the improvements suggested in the previous comment.

As with the previous method, could you clarify if there's a specific reason for using object.__setattr__() instead of the standard setattr() function? If not, we should consistently use setattr() for better readability and maintainability.

README.md (2)

193-194: LGTM: PGVector added to supported vector stores.

The addition of PGVector to the list of supported vector stores is consistent with the PR objective and accurately updates the project's capabilities.


193-197: Overall README.md updates look good.

The changes to the README.md file accurately reflect the project's new capabilities, including PGVector support (the main PR objective) and NetworkX as an additional graph store option. These updates enhance the documentation's completeness and provide valuable information to users about the project's supported technologies.

cognee/api/client.py (1)

377-377: LGTM! Verify consistency with other parts of the codebase.

The addition of "pgvector" as a new literal option for the provider field in VectorDBConfigDTO is correct and aligns with the PR objective of adding PGVector support for cognee.

To ensure full compatibility, please run the following script to check for any other occurrences of vector database provider definitions that might need updating:

This will help identify any other places in the codebase where vector database providers are defined or used, ensuring consistency across the project.

✅ Verification successful

Consistency with other parts of the codebase confirmed.
The addition of "pgvector" as a new literal option for the provider field in VectorDBConfigDTO is consistent and fully supported across the codebase.

🏁 Scripts executed

The following scripts were executed for the analysis:

Script:

#!/bin/bash
# Description: Check for other occurrences of vector database provider definitions

# Search for vector database provider definitions
echo "Searching for vector database provider definitions:"
rg -i "provider.*(?:lancedb|qdrant|weaviate)"

# Search for pgvector-related code
echo "\nSearching for pgvector-related code:"
rg -i "pgvector"

Length of output: 6153

cognee/infrastructure/databases/vector/create_vector_engine.py (2)

3-3: Import statement is appropriate

The import of get_relational_config is necessary for accessing relational database configurations required for the pgvector provider.


31-49: Verify that config["vector_db_key"] is necessary for PGVectorAdapter

Ensure that the vector_db_key is required by PGVectorAdapter. If it's not used, consider removing it to simplify the code.

Run the following script to check if vector_db_key is utilized in PGVectorAdapter:

✅ Verification successful

vector_db_key is not utilized in PGVectorAdapter

The vector_db_key parameter is not used within PGVectorAdapter.py. Consider removing it to simplify the code.

🏁 Scripts executed

The following scripts were executed for the analysis:

Script:

#!/bin/bash
# Description: Search for the usage of 'vector_db_key' in PGVectorAdapter.

# Test: Find occurrences of 'vector_db_key' in PGVectorAdapter.py
# Expect: The term should be referenced if it's necessary.

rg 'vector_db_key' cognee/infrastructure/databases/vector/pgvector/PGVectorAdapter.py

Length of output: 85

cognee/api/v1/add/add.py (2)

11-11: [Approved] Importing get_relational_engine

The import statement for get_relational_engine is appropriate and necessary for interacting with the relational database.


17-18: Maintain Clear Aliases for Database Initialization Functions

Aliasing create_db_and_tables as create_relational_db_and_tables and create_pgvector_db_and_tables improves code clarity by distinguishing between the relational and vector database initialization functions. This is a good practice to prevent confusion.

cognee/infrastructure/databases/vector/pgvector/PGVectorAdapter.py (1)

225-227: Ensure delete_database method is implemented

The prune method calls await self.delete_database(), but it's unclear whether delete_database is implemented in this class or inherited properly.

Confirm that delete_database exists and functions as intended. If not, implement this method or adjust the prune method accordingly.
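
If it does need to be added, a minimal sketch could follow the SqlAlchemyAdapter pattern shown earlier (assuming the adapter holds an async SQLAlchemy engine and the shared declarative Base; both names are assumptions):

from sqlalchemy import text

async def delete_database(self):
    async with self.engine.begin() as connection:
        # Reflect the live schema so every table is known before dropping.
        await connection.run_sync(Base.metadata.reflect)
        for table in Base.metadata.sorted_tables:
            await connection.execute(
                text(f"DROP TABLE IF EXISTS {table.name} CASCADE")
            )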

Comment on lines +3 to +12
def serialize_datetime(data):
    """Recursively convert datetime objects in dictionaries/lists to ISO format."""
    if isinstance(data, dict):
        return {key: serialize_datetime(value) for key, value in data.items()}
    elif isinstance(data, list):
        return [serialize_datetime(item) for item in data]
    elif isinstance(data, datetime):
        return data.isoformat()  # Convert datetime to ISO 8601 string
    else:
        return data
Contributor

🛠️ Refactor suggestion

Consider handling additional date-related types and custom objects.

While the current implementation is robust for common use cases, consider the following enhancements:

  1. Handle other date-related types like date and time.
  2. Provide a mechanism to handle custom objects with datetime attributes.

Here's a potential implementation that addresses these points:

from datetime import datetime, date, time
from typing import Union, List, Dict, Any

def serialize_datetime(data: Any) -> Any:
    """Recursively convert datetime, date, and time objects in dictionaries/lists to ISO format."""
    if isinstance(data, dict):
        return {key: serialize_datetime(value) for key, value in data.items()}
    elif isinstance(data, list):
        return [serialize_datetime(item) for item in data]
    elif isinstance(data, (datetime, date, time)):
        return data.isoformat()
    elif hasattr(data, '__dict__'):
        return serialize_datetime(data.__dict__)
    else:
        return data

This implementation:

  1. Handles date and time objects in addition to datetime.
  2. Attempts to serialize custom objects by converting them to dictionaries.

Please consider if these additional cases are relevant to your use case.

Comment on lines +42 to +44
connection_string: str = (
    f"postgresql+asyncpg://{db_username}:{db_password}@{db_host}:{db_port}/{db_name}"
)
Contributor

🛠️ Refactor suggestion

Use SQLAlchemy's URL object for safer connection strings

Leveraging SQLAlchemy's URL object improves readability and ensures proper handling of credentials and connection parameters.

Refactor the code as follows:

+from sqlalchemy.engine import URL
+
         # Get configuration for postgres database
         relational_config = get_relational_config()
         db_username = relational_config.db_username
         db_password = relational_config.db_password
         db_host = relational_config.db_host
         db_port = relational_config.db_port
         db_name = relational_config.db_name

-        connection_string: str = (
-            f"postgresql+asyncpg://{db_username}:{db_password}@{db_host}:{db_port}/{db_name}"
-        )
+        connection_url = URL.create(
+            drivername="postgresql+asyncpg",
+            username=db_username,
+            password=db_password,
+            host=db_host,
+            port=db_port,
+            database=db_name
+        )

Update the adapter initialization:

-        return PGVectorAdapter(connection_string,
+        return PGVectorAdapter(connection_url,
                               config["vector_db_key"],
                               embedding_engine
        )

Committable suggestion was skipped due to low confidence.


⚠️ Potential issue

Ensure special characters in credentials are URL-encoded

When constructing the connection string, special characters in db_username and db_password could lead to connection issues or security vulnerabilities if not properly URL-encoded.

Apply this diff to URL-encode the credentials:

+from urllib.parse import quote_plus
+
         # Get configuration for postgres database
         relational_config = get_relational_config()
         db_username = relational_config.db_username
         db_password = relational_config.db_password
         db_host = relational_config.db_host
         db_port = relational_config.db_port
         db_name = relational_config.db_name

         connection_string: str = (
-            f"postgresql+asyncpg://{db_username}:{db_password}@{db_host}:{db_port}/{db_name}"
+            f"postgresql+asyncpg://{quote_plus(db_username)}:{quote_plus(db_password)}@{db_host}:{db_port}/{db_name}"
         )

Committable suggestion was skipped due to low confidence.


search_results = await cognee.search(SearchType.SUMMARIES, query=random_node_name)
assert len(search_results) != 0, "Query related summaries don't exist."
print("\n\Extracted summaries are:\n")
Contributor

⚠️ Potential issue

Fix the typo in the print statement

There's an unintended backslash in the string, which may cause incorrect output.

Apply this diff to fix the typo:

-    print("\n\Extracted summaries are:\n")
+    print("\n\nExtracted summaries are:\n")

Comment on lines +68 to +69
random_node = (await vector_engine.search("entities", "AI"))[0]
random_node_name = random_node.payload["name"]
Contributor

⚠️ Potential issue

Handle possible empty search results to avoid IndexError

The code does not check if the search result from vector_engine.search("entities", "AI") is empty before accessing the first element. If the search returns an empty list, accessing [0] will raise an IndexError.

Apply this diff to handle the case when no results are returned:

     vector_engine = get_vector_engine()
-    random_node = (await vector_engine.search("entities", "AI"))[0]
+    search_results = await vector_engine.search("entities", "AI")
+    if not search_results:
+        raise ValueError("No entities found for 'AI'")
+    random_node = search_results[0]
     random_node_name = random_node.payload["name"]


await cognee.cognify([dataset_name])

from cognee.infrastructure.databases.vector import get_vector_engine
Contributor

🛠️ Refactor suggestion

Move import statement to the top of the file

It's best practice in Python to place all import statements at the top of the file. Moving get_vector_engine import improves code readability and follows PEP 8 guidelines.

Apply this diff to move the import:

 import os
 import logging
 import pathlib
+from cognee.infrastructure.databases.vector import get_vector_engine
 import cognee
 from cognee.api.v1.search import SearchType

 ...

-    from cognee.infrastructure.databases.vector import get_vector_engine
Comment on lines 178 to 181
for vector in closest_items:
    # TODO: Add normalization of similarity score
    vector_list.append(vector)

Contributor

⚠️ Potential issue

Incomplete implementation: Normalization of similarity scores

The search method includes a TODO comment for normalizing similarity scores, but the normalization logic is not implemented. This may affect the accuracy of the search results.

Consider implementing the normalization of the similarity scores or remove the TODO comment if it's not required before merging.
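
One hedged option: pgvector's cosine distance operator returns values in [0, 2], so a score in [0, 1] could be derived before appending each result (the helper name is illustrative):

def normalize_cosine_distance(distance: float) -> float:
    # Map pgvector cosine distance ([0, 2]) to a similarity score in [0, 1],
    # where 1.0 means the vectors point in the same direction.
    return 1.0 - distance / 2.0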

Comment on lines 194 to 214
async def batch_search(
    self,
    collection_name: str,
    query_texts: List[str],
    limit: int = None,
    with_vectors: bool = False,
):
    query_vectors = await self.embedding_engine.embed_text(query_texts)

    return asyncio.gather(
        *[
            self.search(
                collection_name=collection_name,
                query_vector=query_vector,
                limit=limit,
                with_vector=with_vectors,
            )
            for query_vector in query_vectors
        ]
    )

Contributor

⚠️ Potential issue

Missing await when calling asyncio.gather in batch_search

In the batch_search method, asyncio.gather is called without await, which means the coroutines are not executed.

Apply this diff to fix the issue:

-    return asyncio.gather(
+    return await asyncio.gather(
         *[
             self.search(
                 collection_name=collection_name,
                 query_vector=query_vector,
                 limit=limit,
                 with_vector=with_vectors,
             )
             for query_vector in query_vectors
         ]
     )

Comment on lines +73 to +80
    self, collection_name: str, data_points: List[DataPoint]
):
    async with self.get_async_session() as session:
        if not await self.has_collection(collection_name):
            await self.create_collection(
                collection_name=collection_name,
                payload_schema=type(data_points[0].payload),
            )
Contributor

⚠️ Potential issue

Passing unnecessary payload_schema to create_collection

In create_data_points, the payload_schema is passed to create_collection, but since it's unused in create_collection, this parameter can be omitted.

Apply this diff to streamline the method call:

 if not await self.has_collection(collection_name):
     await self.create_collection(
         collection_name=collection_name,
-        payload_schema=type(data_points[0].payload),
     )

Comment on lines +21 to +24
    api_key: Optional[str],
    embedding_engine: EmbeddingEngine,
):
    self.api_key = api_key

⚠️ Potential issue

Unused parameter api_key in constructor

The api_key parameter is stored in self.api_key but is not used elsewhere in the class. Consider removing it if it is unnecessary, or implementing its usage as needed.

Apply this diff to remove the unused parameter:

     def __init__(
         self,
         connection_string: str,
-        api_key: Optional[str],
         embedding_engine: EmbeddingEngine,
     ):
-        self.api_key = api_key
         self.embedding_engine = embedding_engine
         self.db_uri: str = connection_string

Comment on lines +44 to +47
async def create_collection(self, collection_name: str, payload_schema=None):
    data_point_types = get_type_hints(DataPoint)
    vector_size = self.embedding_engine.get_vector_size()


⚠️ Potential issue

Unused payload_schema parameter in create_collection

The payload_schema parameter in the create_collection method is not utilized. If it's not required, consider removing it to simplify the method signature.

Apply this diff to remove the unused parameter:

-async def create_collection(self, collection_name: str, payload_schema=None):
+async def create_collection(self, collection_name: str):
     data_point_types = get_type_hints(DataPoint)
     vector_size = self.embedding_engine.get_vector_size()

Also, update the method calls accordingly.


Refactor fix typo in pgvector integration test

Refactor #COG-170
Added missing await statement to batch search for LanceDB and PGVector adapters

Fix #COG-170

@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 1

🧹 Outside diff range and nitpick comments (2)
cognee/infrastructure/databases/vector/pgvector/PGVectorAdapter.py (2)

140-142: Clarify the use of score=0 in retrieve method.

In the retrieve method, ScoredResult instances are created with score=0. Since the retrieval operation does not involve scoring, setting a default score may be misleading. Consider setting score=None or modifying the ScoredResult class to make the score attribute optional when it's not applicable.
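
As a sketch of the second option — making the score optional — assuming ScoredResult is a Pydantic model (its actual definition is not shown in this diff, so field names are illustrative):

from typing import Any, Dict, Optional
from pydantic import BaseModel

class ScoredResult(BaseModel):
    id: str
    payload: Dict[str, Any]
    # None signals "no score", e.g. for plain retrievals by id.
    score: Optional[float] = None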


190-192: Use logging instead of print statements for error handling.

Currently, errors are printed directly to the console using print(f"Error during search: {e}"). Using the logging module allows for better control over log levels and integrates well with various logging systems. This change enhances maintainability and operational monitoring.

Apply this diff to replace print with logging:

 import asyncio
+import logging
 from pgvector.sqlalchemy import Vector

 ...

             except Exception as e:
-                print(f"Error during search: {e}")
+                logging.error(f"Error during search: {e}")
                 return []

Be sure to configure the logging system appropriately in your application.
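
For example, a common pattern is a module-level logger plus one-time configuration at application startup (a minimal sketch; the format string is a placeholder):

import logging

# Module-level logger, named after the module for easy filtering.
logger = logging.getLogger(__name__)

# Typically done once at application startup, not inside library code.
logging.basicConfig(
    level=logging.INFO,
    format="%(asctime)s %(name)s %(levelname)s %(message)s",
)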

📜 Review details

Configuration used: CodeRabbit UI
Review profile: CHILL

📥 Commits

Files that changed from the base of the PR and between 0e1533a and dc46304.

📒 Files selected for processing (3)
  • cognee/infrastructure/databases/vector/lancedb/LanceDBAdapter.py (1 hunks)
  • cognee/infrastructure/databases/vector/pgvector/PGVectorAdapter.py (1 hunks)
  • cognee/tests/test_pgvector.py (1 hunks)
🚧 Files skipped from review as they are similar to previous changes (1)
  • cognee/tests/test_pgvector.py
🧰 Additional context used
🔇 Additional comments (1)
cognee/infrastructure/databases/vector/lancedb/LanceDBAdapter.py (1)

Line range hint 155-161: Correct usage of await with asyncio.gather

The addition of await before asyncio.gather is a crucial improvement. This change ensures that the method waits for all search operations to complete before returning the results, rather than returning a coroutine object. This modification aligns with best practices for asynchronous programming in Python.

Comment on lines 160 to 192
try:
    # Get PGVectorDataPoint Table from database
    PGVectorDataPoint = await self.get_table(collection_name)

    # Find closest vectors to query_vector
    closest_items = await session.execute(
        select(
            PGVectorDataPoint,
            PGVectorDataPoint.c.vector.cosine_distance(query_vector).label(
                "similarity"
            ),
        )
        .order_by("similarity")
        .limit(limit)
    )

    vector_list = []
    # Extract distances and find min/max for normalization
    for vector in closest_items:
        # TODO: Add normalization of similarity score
        vector_list.append(vector)

    # Create and return ScoredResult objects
    return [
        ScoredResult(
            id=str(row.id), payload=row.payload, score=row.similarity
        )
        for row in vector_list
    ]

except Exception as e:
    print(f"Error during search: {e}")
    return []

🛠️ Refactor suggestion

Avoid catching broad exceptions in search method.

Catching all exceptions using except Exception can mask unexpected errors and make debugging more difficult. It's better to catch specific exceptions that you anticipate might occur during database operations or vector computations.

Consider modifying the exception handling:

             try:
                 # Database operations
+            except (SpecificException1, SpecificException2) as e:
+                logging.error(f"Error during search: {e}")
+                return []
             except Exception as e:
-                print(f"Error during search: {e}")
-                return []
+                # Re-raise unexpected exceptions
+                raise e

Replace SpecificException1, SpecificException2, etc., with the actual exceptions you expect (e.g., SQLAlchemyError).
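
For instance, a minimal sketch using SQLAlchemy's base exception (the wrapper name and logger setup are illustrative, not from the PR):

import logging

from sqlalchemy.exc import SQLAlchemyError

logger = logging.getLogger(__name__)

async def safe_search(session, statement):
    # Handle anticipated database errors; let anything unexpected propagate.
    try:
        return await session.execute(statement)
    except SQLAlchemyError as e:
        logger.error(f"Error during search: {e}")
        return []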


Removed broad exception handling for the search function in PGVectorAdapter

Refactor #COG-170