Cog 170 pgvector adapter #158
Conversation
Added config support for using pgvector Feature #COG-170
Added PGVectorAdapter Feature #COG-170
Saving state of pgvector integration development so far Feature #COG-170
Added first working iteration of PGVector for cognee, some important functionality is still missing, but the core is there. Also some refactoring will be necessary. Feature: #COG-170
Remove VECTOR_DB_NAME env parameter as it's not needed Refactor #COG-170
Removed echoing of database operations Refactor #COG-170
Added batch search to PGVectorAdapter Feature #COG-170
Optimize search query for PGVector Refactor #COG-170
Formatted PGVectorAdapter Refactor #COG-170
Added formatting for PGVector part of create_vector_engine Refactor #COG-170
Change formatting for connection string for PGVectorAdapter Refactor #COG-170
Changed raw SQL queries to use SQLAlchemy ORM for PGVectorAdapter Refactor #COG-170
Fixed issue with newly added tables not being pruned from postgres database Fix #COG-170
Added test for PGVectorAdapter Test #COG-170
GitGuardian id | GitGuardian status | Secret | Commit | Filename
---|---|---|---|---
9573981 | Triggered | Generic Password | f8babba | cognee/tests/test_pgvector.py
9573981 | Triggered | Generic Password | 9f4b8f2 | .github/workflows/test_pgvector.yml
9573981 | Triggered | Generic Password | 2cedcbe | .github/workflows/test_pgvector.yml
9573981 | Triggered | Generic Password | a358168 | cognee/tests/test_pgvector.py
9573981 | Triggered | Generic Password | a358168 | cognee/tests/test_pgvector.py
9573981 | Triggered | Generic Password | 6b9a142 | cognee/tests/test_pgvector.py
🛠 Guidelines to remediate hardcoded secrets
- Understand the implications of revoking this secret by investigating where it is used in your code.
- Replace and store your secrets safely. Learn here the best practices.
- Revoke and rotate these secrets.
- If possible, rewrite git history. Rewriting git history is not a trivial act. You might completely break other contributing developers' workflow and you risk accidentally deleting legitimate data.
To avoid such incidents in the future, consider:
- following these best practices for managing and storing secrets, including API keys and other credentials
- installing secret detection on pre-commit to catch secrets before they leave your machine and ease remediation.
🦉 GitGuardian detects secrets in your source code to help developers and security teams secure the modern development process. You are seeing this because you or someone else with access to this repository has authorized GitGuardian to scan your pull request.
Added github action workflow to run PGVectorAdapter integration test Test #COG-170
Added pgvector dependency for PGVectorAdapter Chore #COG-170
Add psycopg2, a PostgreSQL database adapter, as a dependency Chore #COG-170
Changed database name in test and workflow to be the same as in the .env.template Refactor #COG-170
cognee/infrastructure/databases/vector/pgvector/PGVectorAdapter.py
from .get_vector_engine import get_vector_engine, get_vectordb_config
from sqlalchemy import text

async def create_db_and_tables():
We could try to move this inside pgvector adapter.
Done
Move serialize_datetime function to be in its own file Refactor #COG-170
Moved create_db_and_tables module to pgvector so it only handles pgvector related db creation Refactor #COG-170
Added the ability to set database configurations through dictionary for ease of use. Updated test_pgvector to use this way of setting configuration Refactor #COG-170
Moved psycopg2 as an optional dependency for postgres to not block users with lib issues it might cause Refactor #COG-170
Since the postgres-related libraries are no longer mandatory, they need to be explicitly specified to be installed as well Chore #COG-170
pgvector is imported directly in the PGVector adapter, therefore it needs to be installed with the mandatory dependencies Chore #COG-170
Renamed vector_engine_provider to vector_db_provider Fix #COG-170
}
)
cognee.config.set_relational_db_config({"db_path": "",
"db_name": "cognee_db",
Spacing is off here
cognee/api/v1/config/config.py
@@ -39,6 +39,41 @@ def set_classification_model(classification_model: object):
cognify_config = get_cognify_config()
cognify_config.classification_model = classification_model

@staticmethod
def set_db_name(db_name: str):
Let's remove this, we don't want to have to maintain it.
from ..get_vector_engine import get_vector_engine, get_vectordb_config
from sqlalchemy import text

async def create_db_and_tables():
If you merge with the main branch, you can see that create_database exists in SqlAlchemyAdapter. Is that something we can reuse here?
Yes and no. By that I mean we can use create_database from SqlAlchemyAdapter as the initial step, but there is an extra step to enable pgvector for a Postgres database, which we have to run in an engine connection context (sketched below).
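For reference, a minimal sketch of that extra step, assuming a plain SQLAlchemy async engine is available (the helper name enable_pgvector_extension is illustrative and not part of this PR):
from sqlalchemy import text
from sqlalchemy.ext.asyncio import create_async_engine

async def enable_pgvector_extension(connection_string: str) -> None:
    # Run the extension setup inside an engine connection context, as described above.
    engine = create_async_engine(connection_string)
    async with engine.begin() as connection:
        # Idempotent: does nothing if the extension is already enabled.
        await connection.execute(text("CREATE EXTENSION IF NOT EXISTS vector;"))
    await engine.dispose()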
Remove unused config methods so we don't have to maintain them. Fixed spacing in test_pgvector integration test. Refactor #COG-170
…endencies through extras: To be uniform with how we install optional dependencies, a change has been made so that postgres optional dependencies are installed through extras Chore #COG-170
Added reusing of sqlalchemy adapters create_database method instead of directly writing code for database creation Refactor #COG-170
…test for pgvector: Updated how postgres dependencies are installed due to the change in pyproject.toml Chore #COG-170
Walkthrough: The changes in this pull request encompass updates to various files, primarily enhancing the configuration and documentation for the pgvector integration.
Actionable comments posted: 17
🧹 Outside diff range and nitpick comments (15)
cognee/infrastructure/databases/vector/pgvector/__init__.py (1)
1-2: LGTM! Consider using absolute imports for consistency.
The imports look good and align with the PR objective of adding PGVector support. They make the PGVectorAdapter class and create_db_and_tables function easily accessible when importing from this package. For consistency with Python best practices and to avoid potential issues with relative imports, consider using absolute imports:
-from .PGVectorAdapter import PGVectorAdapter
-from .create_db_and_tables import create_db_and_tables
+from cognee.infrastructure.databases.vector.pgvector.PGVectorAdapter import PGVectorAdapter
+from cognee.infrastructure.databases.vector.pgvector.create_db_and_tables import create_db_and_tables
This change would make the imports more explicit and less prone to issues if the package structure changes in the future.
cognee/infrastructure/databases/vector/pgvector/serialize_datetime.py (1)
3-12: LGTM: Well-implemented serialization function with a minor suggestion.
The serialize_datetime function is well-implemented and correctly handles nested structures. It effectively converts datetime objects to ISO format strings while preserving other data types. Consider adding type hints to improve code readability and enable better static type checking:
from typing import Union, List, Dict, Any

def serialize_datetime(data: Union[Dict, List, datetime, Any]) -> Union[Dict, List, str, Any]:
    # ... rest of the function remains the same
This change would make the function's input and output types more explicit, which can be helpful for maintainability and catching potential type-related issues early.
cognee/infrastructure/databases/vector/pgvector/create_db_and_tables.py (2)
5-13: LGTM: Well-structured implementation for pgvector setup.
The function effectively sets up the pgvector database, addressing the previous review comment by using vector_engine.create_database() and then enabling the vector extension. The use of async/await and SQLAlchemy's text() function for raw SQL is appropriate.
Consider adding error handling to manage potential exceptions during database operations. This will improve the robustness of the function.
Also, adding a docstring would enhance the function's documentation. Here's a suggested docstring:
async def create_db_and_tables():
    """
    Asynchronously create the database and necessary tables for the pgvector provider.

    This function retrieves the vector database configuration and engine, creates the
    database if the provider is pgvector, and enables the vector extension in the database.

    Raises:
        Exception: If there's an error during database creation or extension enabling.
    """
    # ... (rest of the function)
1-14: LGTM: Well-structured and focused implementation.
The file adheres to the Single Responsibility Principle, focusing solely on creating the database and tables for pgvector. Its structure is clean and easy to understand.
Consider adding more detailed comments or logging statements to enhance maintainability and debugging capabilities. For example, you could log the successful creation of the database and enabling of the vector extension (a sketch of this follows).
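A minimal sketch of how the suggested logging and error handling could look, reusing the helpers already imported in this module; the vector_engine.engine attribute is an assumption, not something this PR confirms:
import logging
from sqlalchemy import text

logger = logging.getLogger(__name__)

async def create_db_and_tables():
    vector_config = get_vectordb_config()
    vector_engine = get_vector_engine()

    if vector_config.vector_db_provider == "pgvector":
        try:
            await vector_engine.create_database()
            # The engine attribute name is assumed; the real adapter may expose it differently.
            async with vector_engine.engine.begin() as connection:
                await connection.execute(text("CREATE EXTENSION IF NOT EXISTS vector"))
            logger.info("pgvector database created and vector extension enabled")
        except Exception:
            logger.exception("Failed to set up the pgvector database")
            raise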
cognee/api/v1/add/add_v2.py (3)
6-7
: LGTM! Consider using absolute imports.The new import statements are clear and align with the PR objective of adding PGVector support. The renaming of
create_db_and_tables
tocreate_relational_db_and_tables
improves clarity.Consider using absolute imports for better maintainability:
-from cognee.infrastructure.databases.relational import create_db_and_tables as create_relational_db_and_tables -from cognee.infrastructure.databases.vector.pgvector import create_db_and_tables as create_pgvector_db_and_tables +from cognee.infrastructure.databases.relational.create_db_and_tables import create_db_and_tables as create_relational_db_and_tables +from cognee.infrastructure.databases.vector.pgvector.create_db_and_tables import create_db_and_tables as create_pgvector_db_and_tables
10-11
: LGTM! Consider error handling and potential performance optimization.The addition of database creation calls aligns with the PR objective and ensures proper setup before data processing.
Consider the following improvements:
- Add error handling:
try: await create_relational_db_and_tables() await create_pgvector_db_and_tables() except Exception as e: logger.error(f"Failed to create databases: {e}") raise
- For potential performance improvement, consider running these operations concurrently:
await asyncio.gather( create_relational_db_and_tables(), create_pgvector_db_and_tables() )This change would require importing
asyncio
at the top of the file.
Line range hint
9-24
: Consider updating the function's docstring.The changes to the
add
function are minimal and focused on adding PGVector support, which is good. However, the function's docstring (if it exists) might need an update to reflect the new database setup process.Consider adding or updating the function's docstring to include information about the database setup:
async def add(data: Union[BinaryIO, list[BinaryIO], str, list[str]], dataset_name: str = "main_dataset", user: User = None): """ Add data to the system, setting up both relational and vector databases. This function initializes the necessary database tables for both relational and vector (PGVector) databases before processing the input data. Args: data (Union[BinaryIO, list[BinaryIO], str, list[str]]): The data to be added. dataset_name (str, optional): The name of the dataset. Defaults to "main_dataset". user (User, optional): The user associated with this operation. If None, uses the default user. Returns: Asynchronous generator yielding results from the add pipeline. """ # ... (rest of the function).github/workflows/test_pgvector.yml (3)
13-24
: LGTM: Job structure and conditions are well-defined.The environment variable setup and job definitions are appropriate. The dependency between jobs and the condition for running the integration test are logically structured.
Consider adding a comment explaining the purpose of the
get_docs_changes
job for better clarity:jobs: get_docs_changes: name: docs changes uses: ./.github/workflows/get_docs_changes.yml # This job checks if there are any changes outside of documentation files
29-42
: LGTM: PostgreSQL service is well-configured.The PostgreSQL service setup using the pgvector image is appropriate. The health check and port mapping are correctly configured.
For improved security, consider using GitHub Secrets for the database credentials:
env: POSTGRES_USER: ${{ secrets.POSTGRES_USER }} POSTGRES_PASSWORD: ${{ secrets.POSTGRES_PASSWORD }} POSTGRES_DB: ${{ secrets.POSTGRES_DB }}This change would prevent exposing the credentials in the workflow file.
63-67
: LGTM: Test execution is properly configured.The test execution step is well-defined, using Poetry to run the pgvector test script. The use of GitHub Secrets for the API key is a good security practice.
Consider adding a comment explaining the purpose of the LLM_API_KEY, and possibly make the ENV value configurable:
env: ENV: ${{ inputs.environment || 'dev' }} # Allow overriding the environment LLM_API_KEY: ${{ secrets.OPENAI_API_KEY }} # API key for OpenAI integrationThis change would provide more context and flexibility for different test scenarios.
cognee/modules/settings/get_settings.py (1)
44-46
: LGTM! Consider a minor formatting adjustment.The addition of PGVector to the vector_dbs list is correct and aligns with the PR objective. The new entry follows the existing structure, maintaining consistency.
For improved readability and consistency with the existing entries, consider adjusting the indentation slightly:
}, { - "value": "pgvector", - "label": "PGVector", + "value": "pgvector", + "label": "PGVector", }]This minor change would align the new entry's indentation with the existing entries in the list.
cognee/infrastructure/databases/relational/sqlalchemy/SqlAlchemyAdapter.py (1)
122-123
: Approve changes with suggestions for improvementThe addition of schema reflection before dropping tables is a good improvement. It ensures that tables are dropped in the correct order, avoiding potential foreign key constraint violations.
Consider the following enhancements:
- Add a comment explaining the purpose of the schema reflection:
# Reflect the current database schema to ensure all tables are accounted for await connection.run_sync(Base.metadata.reflect)
- Implement error handling for individual table drops:
for table in Base.metadata.sorted_tables: try: drop_table_query = text(f"DROP TABLE IF EXISTS {table.name} CASCADE") await connection.execute(drop_table_query) except Exception as e: print(f"Error dropping table {table.name}: {e}") # Decide whether to continue or break the loop based on your requirementsThese changes will improve code readability and robustness.
README.md (1)
197-197
: LGTM: NetworkX added as a supported graph store.The addition of NetworkX as a supported graph store alongside Neo4j is a valuable update to the documentation. It provides a more comprehensive view of the project's capabilities.
Consider updating the PR description to mention this addition of NetworkX support, as it's a significant enhancement that wasn't explicitly stated in the original PR objectives.
cognee/infrastructure/databases/vector/pgvector/PGVectorAdapter.py (2)
190-192: Use proper logging instead of print.
Replace the print call in the search error handler with a logger call:
except Exception as e:
-    print(f"Error during search: {e}")
+    logger.exception("Error during search")
     return []
Ensure you have configured a logger, e.g., by importing logging and setting up a logger instance.
195-200: Inconsistent parameter name: with_vectors vs. with_vector
In batch_search, the parameter is named with_vectors, whereas in search, it's with_vector. This inconsistency can cause confusion and potential bugs. Consider standardizing the parameter name across both methods:
async def batch_search(
    self,
    collection_name: str,
    query_texts: List[str],
    limit: int = None,
-   with_vectors: bool = False,
+   with_vector: bool = False,
):
And ensure you update all references to this parameter within the method.
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
⛔ Files ignored due to path filters (1)
poetry.lock
is excluded by!**/*.lock
📒 Files selected for processing (18)
- .env.template (1 hunks)
- .github/workflows/test_pgvector.yml (1 hunks)
- README.md (1 hunks)
- cognee/api/client.py (1 hunks)
- cognee/api/v1/add/add.py (1 hunks)
- cognee/api/v1/add/add_v2.py (1 hunks)
- cognee/api/v1/config/config.py (1 hunks)
- cognee/infrastructure/databases/relational/sqlalchemy/SqlAlchemyAdapter.py (1 hunks)
- cognee/infrastructure/databases/vector/create_vector_engine.py (2 hunks)
- cognee/infrastructure/databases/vector/pgvector/PGVectorAdapter.py (1 hunks)
- cognee/infrastructure/databases/vector/pgvector/init.py (1 hunks)
- cognee/infrastructure/databases/vector/pgvector/create_db_and_tables.py (1 hunks)
- cognee/infrastructure/databases/vector/pgvector/serialize_datetime.py (1 hunks)
- cognee/modules/settings/get_settings.py (1 hunks)
- cognee/modules/settings/save_vector_db_config.py (1 hunks)
- cognee/tests/test_pgvector.py (1 hunks)
- docker-compose.yml (1 hunks)
- pyproject.toml (1 hunks)
✅ Files skipped from review due to trivial changes (1)
- .env.template
🧰 Additional context used
🔇 Additional comments (25)
cognee/infrastructure/databases/vector/pgvector/serialize_datetime.py (2)
1-1
: LGTM: Import statement is correct and sufficient.The import of
datetime
from thedatetime
module is appropriate for the function's needs.
1-12
: Overall, well-implemented serialization function with room for optional enhancements.The
serialize_datetime
function is a solid implementation that effectively handles the serialization of datetime objects in nested structures. It's a valuable addition to the project, particularly for working with databases that require serialized datetime values.The suggested improvements (adding type hints, handling additional date-related types, and custom objects) are optional enhancements that could further improve the robustness and flexibility of the function. Consider implementing these based on the specific needs of your project and the types of data you expect to handle.
Great job on this implementation!
cognee/infrastructure/databases/vector/pgvector/create_db_and_tables.py (1)
1-3
: LGTM: Imports are appropriate and consistent.The imports are well-structured and relevant to the file's functionality. The use of relative imports maintains consistency with the project structure.
cognee/modules/settings/save_vector_db_config.py (1)
8-8
: LGTM! Verify consistency across the codebase.The addition of "pgvector" as a new option for the
provider
attribute correctly implements PGVector support at the configuration level. This change aligns well with the PR objective.To ensure consistency, please run the following script to check for any other occurrences of vector database provider definitions that might need updating:
Consider updating any relevant documentation (e.g., README, API docs) to reflect this new provider option.
✅ Verification successful
Verification Successful: "pgvector" is consistently integrated across the codebase.
🏁 Scripts executed
The following scripts were executed for the analysis:
Script:
#!/bin/bash # Description: Search for other vector database provider definitions # Test: Search for similar provider definitions rg -i 'provider.*(?:lancedb|qdrant|weaviate)' # Test: Search for pgvector mentions to ensure it's added everywhere necessary rg -i 'pgvector'Length of output: 5953
docker-compose.yml (1)
65-65
: Approve the update to pgvector image, with verification steps.The change from
postgres:latest
topgvector/pgvector:pg17
aligns well with the PR objective of adding PGVector support. This update is crucial for enabling vector operations in PostgreSQL.However, please consider the following:
- Ensure that existing data and schemas are compatible with this new image.
- Verify that all components interacting with the database are compatible with PostgreSQL 17.
- Consider documenting this change in the project's documentation or README.
To verify the compatibility and impact of this change, you can run the following script:
This script will help identify any potential compatibility issues or areas that might need adjustment due to the PostgreSQL version change and pgvector introduction.
✅ Verification successful
Verified the pgvector image update. No compatibility issues detected.
🏁 Scripts executed
The following scripts were executed for the analysis:
Script:
#!/bin/bash # Description: Check for PostgreSQL version dependencies and pgvector usage in the codebase # Check for explicit PostgreSQL version dependencies echo "Checking for explicit PostgreSQL version dependencies:" rg -i 'postgres(ql)?\s*(version|v)?\s*([0-9]+)' --type py # Check for pgvector usage in Python files echo "Checking for pgvector usage:" rg 'pgvector' --type py # Check for SQL commands related to vector operations echo "Checking for vector-related SQL commands:" rg -i 'create extension vector|<\[.*\]>|l2_distance|inner_product|cosine_distance' --type sqlLength of output: 2371
Script:
#!/bin/bash # Description: Re-check for vector-related SQL commands without type filtering echo "Re-checking for vector-related SQL commands without file type restrictions:" rg -i 'create extension vector|l2_distance|inner_product|cosine_distance'Length of output: 391
.github/workflows/test_pgvector.yml (3)
1-11
: LGTM: Workflow configuration is well-defined.The workflow name, trigger conditions, and concurrency settings are appropriately configured for testing pgvector integration on pull requests and manual runs.
45-61
: LGTM: Test environment setup is comprehensive.The setup steps for the test environment are well-defined. The use of Poetry for dependency management and the inclusion of the postgres extra ensure a consistent and appropriate test environment.
1-67
: Overall, excellent implementation of pgvector integration testing workflow.This GitHub Actions workflow file is well-structured and comprehensive. It effectively sets up a pgvector-enabled PostgreSQL environment, manages dependencies using Poetry, and runs integration tests. The workflow aligns perfectly with the PR objective of adding PGVector support for cognee.
Key strengths:
- Clear job dependencies and conditions
- Proper use of GitHub Secrets for sensitive data
- Comprehensive setup of the test environment
- Specific pgvector test execution
The minor suggestions provided earlier will further enhance security and clarity. Great job on implementing this crucial testing workflow!
pyproject.toml (4)
73-73
: LGTM: Addition of pgvector dependencyThe addition of the pgvector dependency aligns with the PR objective of adding PGVector support to the project. The version constraint "^0.3.5" allows for compatible updates, which is a good practice.
74-74
: LGTM: psycopg2 dependency made optionalMaking the psycopg2 dependency optional is a good practice. It allows users to install the package without PostgreSQL support if they don't need it, reducing unnecessary dependencies for some use cases.
82-82
: LGTM: Addition of 'postgres' extraThe addition of the 'postgres' extra is a good practice. It allows users to easily install PostgreSQL support when needed, which is consistent with the PR objective of adding PGVector support. This change, combined with making psycopg2 optional, provides flexibility in installation options.
73-82
: Summary: Well-structured changes for PGVector supportThe changes in this file are well-structured and consistent with the PR objective of adding PGVector support. The addition of the pgvector dependency, making psycopg2 optional, and introducing the 'postgres' extra work together to provide the necessary functionality while maintaining flexibility in installation options. These modifications follow poetry best practices and should improve the project's usability for different use cases.
cognee/modules/settings/get_settings.py (2)
Line range hint
1-146
: Overall, the change is well-implemented and focused.The addition of PGVector to the vector_dbs list is the only change in this file, and it's implemented correctly. The modification is minimal, focused, and doesn't disrupt the existing functionality of the get_settings function. This change aligns well with the PR objective of adding PGVector support to the project.
44-46
: Verify handling of PGVector option in related code.While the addition of PGVector to the vector_dbs list is correct, it's important to ensure that other parts of the codebase can handle this new option properly.
Let's verify the integration of PGVector in related files:
Please review the output of these checks to ensure proper integration of PGVector across the codebase.
✅ Verification successful
PGVector integration verified successfully.
All related codebase areas handle PGVector appropriately, and no TODOs or FIXMEs were found.
🏁 Scripts executed
The following scripts were executed for the analysis:
Script:
#!/bin/bash # Description: Check for PGVector handling in related files # Test 1: Check if PGVector is handled in vector database configuration echo "Checking vector database configuration handling:" rg -i 'pgvector|PGVector' --type py # Test 2: Verify if there are any TODOs or FIXMEs related to PGVector echo "\nChecking for TODOs or FIXMEs related to PGVector:" rg -i '(TODO|FIXME).*pgvector' --type py # Test 3: Look for potential switch statements or if-else blocks that might need updating echo "\nChecking for potential switch statements or if-else blocks:" rg -i 'if.*vector_db.*=|match.*vector_db' --type pyLength of output: 5346
cognee/api/v1/config/config.py (3)
Line range hint
1-121
: Consider overall improvements for consistency and documentation.After reviewing the entire file, here are some suggestions for overall improvement:
Docstrings: Add docstrings to all methods for better documentation. This will help users understand the purpose and usage of each method.
Consistent error handling: Implement a consistent approach to error handling across all methods. For example, decide whether to use exceptions or return error codes, and apply this consistently.
Type hinting: Add type hints to all method parameters and return values for better code clarity and to enable static type checking.
Further refactoring: Consider creating a base method for updating configuration objects, which could be used by multiple setter methods to reduce code duplication.
Naming convention: Ensure all method names follow a consistent naming convention. For example, some methods start with "set_" while others don't.
Here's an example of how you might implement some of these suggestions:
from typing import Any, Dict class config: @staticmethod def _update_config(config_object: Any, config_dict: Dict[str, Any]) -> bool: """ Updates the given config object with values from config_dict. Args: config_object: The configuration object to update. config_dict: A dictionary of configuration key-value pairs. Returns: bool: True if successful, False otherwise. Raises: AttributeError: If an invalid attribute is provided in config_dict. """ valid_attrs = set(dir(config_object)) - set(dir(object)) for key, value in config_dict.items(): if key in valid_attrs: setattr(config_object, key, value) else: raise AttributeError(f"'{key}' is not a valid attribute. Valid attributes are: {', '.join(valid_attrs)}") return True @staticmethod def set_relational_db_config(config_dict: Dict[str, Any]) -> bool: """ Updates the relational database configuration. Args: config_dict: A dictionary of configuration key-value pairs. Returns: bool: True if successful, False otherwise. """ return config._update_config(get_relational_config(), config_dict) @staticmethod def set_vector_db_config(config_dict: Dict[str, Any]) -> bool: """ Updates the vector database configuration. Args: config_dict: A dictionary of configuration key-value pairs. Returns: bool: True if successful, False otherwise. """ return config._update_config(get_vectordb_config(), config_dict) # ... (apply similar patterns to other methods)This refactoring introduces consistent docstrings, type hinting, and a shared
_update_config
method to reduce duplication.Could you confirm if these suggestions align with the project's coding standards and documentation practices?
98-109
: 🛠️ Refactor suggestionConsider enhancing error handling and adding return value.
The
set_relational_db_config
method looks good overall, but there are a few suggestions for improvement:
- Consider adding type checking for the values being set. This could prevent issues with incorrect data types.
- The error message for invalid attributes could be more informative. Perhaps include a list of valid attributes in the error message.
- The method doesn't return anything. Consider returning the updated config object or a boolean indicating success. This would make it clearer to users that the operation was successful.
Here's a suggested improvement:
@staticmethod def set_relational_db_config(config_dict: dict) -> bool: """ Updates the relational db config with values from config_dict. Returns True if successful, False otherwise. """ relational_db_config = get_relational_config() valid_attrs = set(dir(relational_db_config)) - set(dir(object)) for key, value in config_dict.items(): if key in valid_attrs: setattr(relational_db_config, key, value) else: raise AttributeError(f"'{key}' is not a valid attribute. Valid attributes are: {', '.join(valid_attrs)}") return TrueThis version includes a return value, uses
setattr()
for clarity, and provides a more informative error message.Could you clarify if there's a specific reason for using
object.__setattr__()
instead of the standardsetattr()
function?
110-121
: 🛠️ Refactor suggestionConsider refactoring to reduce code duplication.
The
set_vector_db_config
method is nearly identical toset_relational_db_config
. This duplication suggests an opportunity for refactoring to improve maintainability and reduce the risk of inconsistencies.Consider creating a generic method for updating configurations:
@staticmethod def _update_config(config_object, config_dict: dict) -> bool: """ Updates the given config object with values from config_dict. Returns True if successful, False otherwise. """ valid_attrs = set(dir(config_object)) - set(dir(object)) for key, value in config_dict.items(): if key in valid_attrs: setattr(config_object, key, value) else: raise AttributeError(f"'{key}' is not a valid attribute. Valid attributes are: {', '.join(valid_attrs)}") return True @staticmethod def set_relational_db_config(config_dict: dict) -> bool: return config._update_config(get_relational_config(), config_dict) @staticmethod def set_vector_db_config(config_dict: dict) -> bool: return config._update_config(get_vectordb_config(), config_dict)This refactoring addresses the duplication issue and incorporates the improvements suggested in the previous comment.
As with the previous method, could you clarify if there's a specific reason for using
object.__setattr__()
instead of the standardsetattr()
function? If not, we should consistently usesetattr()
for better readability and maintainability.README.md (2)
193-194
: LGTM: PGVector added to supported vector stores.The addition of PGVector to the list of supported vector stores is consistent with the PR objective and accurately updates the project's capabilities.
193-197
: Overall README.md updates look good.The changes to the README.md file accurately reflect the project's new capabilities, including PGVector support (the main PR objective) and NetworkX as an additional graph store option. These updates enhance the documentation's completeness and provide valuable information to users about the project's supported technologies.
cognee/api/client.py (1)
377-377
: LGTM! Verify consistency with other parts of the codebase.The addition of "pgvector" as a new literal option for the provider field in VectorDBConfigDTO is correct and aligns with the PR objective of adding PGVector support for cognee.
To ensure full compatibility, please run the following script to check for any other occurrences of vector database provider definitions that might need updating:
This will help identify any other places in the codebase where vector database providers are defined or used, ensuring consistency across the project.
✅ Verification successful
Consistency with other parts of the codebase confirmed.
The addition of"pgvector"
as a new literal option for theprovider
field inVectorDBConfigDTO
is consistent and fully supported across the codebase.🏁 Scripts executed
The following scripts were executed for the analysis:
Script:
#!/bin/bash # Description: Check for other occurrences of vector database provider definitions # Search for vector database provider definitions echo "Searching for vector database provider definitions:" rg -i "provider.*(?:lancedb|qdrant|weaviate)" # Search for pgvector-related code echo "\nSearching for pgvector-related code:" rg -i "pgvector"Length of output: 6153
cognee/infrastructure/databases/vector/create_vector_engine.py (2)
3-3
: Import statement is appropriateThe import of
get_relational_config
is necessary for accessing relational database configurations required for the pgvector provider.
31-49
: Verify thatconfig["vector_db_key"]
is necessary for PGVectorAdapterEnsure that the
vector_db_key
is required byPGVectorAdapter
. If it's not used, consider removing it to simplify the code.Run the following script to check if
vector_db_key
is utilized inPGVectorAdapter
:✅ Verification successful
vector_db_key
is not utilized in PGVectorAdapterThe
vector_db_key
parameter is not used withinPGVectorAdapter.py
. Consider removing it to simplify the code.🏁 Scripts executed
The following scripts were executed for the analysis:
Script:
#!/bin/bash # Description: Search for the usage of 'vector_db_key' in PGVectorAdapter. # Test: Find occurrences of 'vector_db_key' in PGVectorAdapter.py # Expect: The term should be referenced if it's necessary. rg 'vector_db_key' cognee/infrastructure/databases/vector/pgvector/PGVectorAdapter.pyLength of output: 85
cognee/api/v1/add/add.py (2)
11-11: [Approved] Importing get_relational_engine
The import statement for get_relational_engine is appropriate and necessary for interacting with the relational database.
17-18: Maintain Clear Aliases for Database Initialization Functions
Aliasing create_db_and_tables as create_relational_db_and_tables and create_pgvector_db_and_tables improves code clarity by distinguishing between the relational and vector database initialization functions. This is a good practice to prevent confusion.
cognee/infrastructure/databases/vector/pgvector/PGVectorAdapter.py (1)
225-227: Ensure the delete_database method is implemented
The prune method calls await self.delete_database(), but it's unclear whether delete_database is implemented in this class or inherited properly. Confirm that delete_database exists and functions as intended. If not, implement this method (see the sketch below) or adjust the prune method accordingly.
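Purely as an illustration, a delete_database implementation could be sketched roughly as follows; the self.engine attribute and the schema-reset approach are assumptions, not the PR's actual code:
from sqlalchemy import text

async def delete_database(self):
    # Drop everything in the public schema (including pgvector tables), then
    # recreate an empty schema so the database can be reused after pruning.
    async with self.engine.begin() as connection:
        await connection.execute(text("DROP SCHEMA public CASCADE;"))
        await connection.execute(text("CREATE SCHEMA public;"))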
def serialize_datetime(data):
    """Recursively convert datetime objects in dictionaries/lists to ISO format."""
    if isinstance(data, dict):
        return {key: serialize_datetime(value) for key, value in data.items()}
    elif isinstance(data, list):
        return [serialize_datetime(item) for item in data]
    elif isinstance(data, datetime):
        return data.isoformat()  # Convert datetime to ISO 8601 string
    else:
        return data
🛠️ Refactor suggestion
Consider handling additional date-related types and custom objects.
While the current implementation is robust for common use cases, consider the following enhancements:
- Handle other date-related types like
date
andtime
. - Provide a mechanism to handle custom objects with datetime attributes.
Here's a potential implementation that addresses these points:
from datetime import datetime, date, time
from typing import Union, List, Dict, Any
def serialize_datetime(data: Any) -> Any:
"""Recursively convert datetime, date, and time objects in dictionaries/lists to ISO format."""
if isinstance(data, dict):
return {key: serialize_datetime(value) for key, value in data.items()}
elif isinstance(data, list):
return [serialize_datetime(item) for item in data]
elif isinstance(data, (datetime, date, time)):
return data.isoformat()
elif hasattr(data, '__dict__'):
return serialize_datetime(data.__dict__)
else:
return data
This implementation:
- Handles
date
andtime
objects in addition todatetime
. - Attempts to serialize custom objects by converting them to dictionaries.
Please consider if these additional cases are relevant to your use case.
connection_string: str = (
    f"postgresql+asyncpg://{db_username}:{db_password}@{db_host}:{db_port}/{db_name}"
)
🛠️ Refactor suggestion
Use SQLAlchemy's URL
object for safer connection strings
Leveraging SQLAlchemy's URL
object improves readability and ensures proper handling of credentials and connection parameters.
Refactor the code as follows:
+from sqlalchemy.engine import URL
+
# Get configuration for postgres database
relational_config = get_relational_config()
db_username = relational_config.db_username
db_password = relational_config.db_password
db_host = relational_config.db_host
db_port = relational_config.db_port
db_name = relational_config.db_name
- connection_string: str = (
- f"postgresql+asyncpg://{db_username}:{db_password}@{db_host}:{db_port}/{db_name}"
- )
+ connection_url = URL.create(
+ drivername="postgresql+asyncpg",
+ username=db_username,
+ password=db_password,
+ host=db_host,
+ port=db_port,
+ database=db_name
+ )
Update the adapter initialization:
- return PGVectorAdapter(connection_string,
+ return PGVectorAdapter(connection_url,
config["vector_db_key"],
embedding_engine
)
Committable suggestion was skipped due to low confidence.
Ensure special characters in credentials are URL-encoded
When constructing the connection string, special characters in db_username
and db_password
could lead to connection issues or security vulnerabilities if not properly URL-encoded.
Apply this diff to URL-encode the credentials:
+from urllib.parse import quote_plus
+
# Get configuration for postgres database
relational_config = get_relational_config()
db_username = relational_config.db_username
db_password = relational_config.db_password
db_host = relational_config.db_host
db_port = relational_config.db_port
db_name = relational_config.db_name
connection_string: str = (
- f"postgresql+asyncpg://{db_username}:{db_password}@{db_host}:{db_port}/{db_name}"
+ f"postgresql+asyncpg://{quote_plus(db_username)}:{quote_plus(db_password)}@{db_host}:{db_port}/{db_name}"
)
Committable suggestion was skipped due to low confidence.
cognee/tests/test_pgvector.py
search_results = await cognee.search(SearchType.SUMMARIES, query=random_node_name)
assert len(search_results) != 0, "Query related summaries don't exist."
print("\n\Extracted summaries are:\n")
Fix the typo in the print statement
There's an unintended backslash in the string, which may cause incorrect output.
Apply this diff to fix the typo:
- print("\n\Extracted summaries are:\n")
+ print("\n\nExtracted summaries are:\n")
📝 Committable suggestion
‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.
print("\n\Extracted summaries are:\n") | |
print("\n\nExtracted summaries are:\n") |
random_node = (await vector_engine.search("entities", "AI"))[0]
random_node_name = random_node.payload["name"]
Handle possible empty search results to avoid IndexError
The code does not check if the search result from vector_engine.search("entities", "AI")
is empty before accessing the first element. If the search returns an empty list, accessing [0]
will raise an IndexError
.
Apply this diff to handle the case when no results are returned:
vector_engine = get_vector_engine()
- random_node = (await vector_engine.search("entities", "AI"))[0]
+ search_results = await vector_engine.search("entities", "AI")
+ if not search_results:
+ raise ValueError("No entities found for 'AI'")
+ random_node = search_results[0]
random_node_name = random_node.payload["name"]
📝 Committable suggestion
‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.
random_node = (await vector_engine.search("entities", "AI"))[0] | |
random_node_name = random_node.payload["name"] | |
vector_engine = get_vector_engine() | |
search_results = await vector_engine.search("entities", "AI") | |
if not search_results: | |
raise ValueError("No entities found for 'AI'") | |
random_node = search_results[0] | |
random_node_name = random_node.payload["name"] |
await cognee.cognify([dataset_name])

from cognee.infrastructure.databases.vector import get_vector_engine
🛠️ Refactor suggestion
Move import statement to the top of the file
It's best practice in Python to place all import statements at the top of the file. Moving get_vector_engine
import improves code readability and follows PEP 8 guidelines.
Apply this diff to move the import:
import os
import logging
import pathlib
+from cognee.infrastructure.databases.vector import get_vector_engine
import cognee
from cognee.api.v1.search import SearchType
...
- from cognee.infrastructure.databases.vector import get_vector_engine
📝 Committable suggestion
‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.
from cognee.infrastructure.databases.vector import get_vector_engine | |
import os | |
import logging | |
import pathlib | |
from cognee.infrastructure.databases.vector import get_vector_engine | |
import cognee | |
from cognee.api.v1.search import SearchType | |
for vector in closest_items:
    # TODO: Add normalization of similarity score
    vector_list.append(vector)
Incomplete implementation: Normalization of similarity scores
The search method includes a TODO comment for normalizing similarity scores, but the normalization logic is not implemented. This may affect the accuracy of the search results. Consider implementing the normalization of the similarity scores (one possible approach is sketched below) or removing the TODO comment if it's not required before merging.
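One possible approach, shown only as a sketch: min-max scale the cosine distances returned by the query above into [0, 1] (rows is a stand-in for the fetched result rows with their similarity column):
def normalize_scores(rows):
    distances = [row.similarity for row in rows]
    if not distances:
        return []
    lo, hi = min(distances), max(distances)
    span = hi - lo
    # If all distances are identical, treat every result as equally relevant.
    return [0.0 if span == 0 else (distance - lo) / span for distance in distances]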
async def batch_search(
    self,
    collection_name: str,
    query_texts: List[str],
    limit: int = None,
    with_vectors: bool = False,
):
    query_vectors = await self.embedding_engine.embed_text(query_texts)

    return asyncio.gather(
        *[
            self.search(
                collection_name=collection_name,
                query_vector=query_vector,
                limit=limit,
                with_vector=with_vectors,
            )
            for query_vector in query_vectors
        ]
    )
Missing await
when calling asyncio.gather
in batch_search
In the batch_search
method, asyncio.gather
is called without await
, which means the coroutines are not executed.
Apply this diff to fix the issue:
return asyncio.gather(
*[
self.search(
collection_name=collection_name,
query_vector=query_vector,
limit=limit,
with_vector=with_vectors,
)
for query_vector in query_vectors
]
)
+await asyncio.gather(...)
Replace the return statement with:
return await asyncio.gather(
*[
self.search(
collection_name=collection_name,
query_vector=query_vector,
limit=limit,
with_vector=with_vectors,
)
for query_vector in query_vectors
]
)
    self, collection_name: str, data_points: List[DataPoint]
):
    async with self.get_async_session() as session:
        if not await self.has_collection(collection_name):
            await self.create_collection(
                collection_name=collection_name,
                payload_schema=type(data_points[0].payload),
            )
Passing unnecessary payload_schema
to create_collection
In create_data_points
, the payload_schema
is passed to create_collection
, but since it's unused in create_collection
, this parameter can be omitted.
Apply this diff to streamline the method call:
if not await self.has_collection(collection_name):
await self.create_collection(
collection_name=collection_name,
- payload_schema=type(data_points[0].payload),
)
📝 Committable suggestion
‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.
self, collection_name: str, data_points: List[DataPoint] | |
): | |
async with self.get_async_session() as session: | |
if not await self.has_collection(collection_name): | |
await self.create_collection( | |
collection_name=collection_name, | |
payload_schema=type(data_points[0].payload), | |
) | |
self, collection_name: str, data_points: List[DataPoint] | |
): | |
async with self.get_async_session() as session: | |
if not await self.has_collection(collection_name): | |
await self.create_collection( | |
collection_name=collection_name, | |
) |
    api_key: Optional[str],
    embedding_engine: EmbeddingEngine,
):
    self.api_key = api_key
Unused parameter api_key
in constructor
The api_key
parameter is stored in self.api_key
but is not used elsewhere in the class. Consider removing it if it's unnecessary or implement its usage as needed.
Apply this diff to remove the unused parameter:
def __init__(
self,
connection_string: str,
- api_key: Optional[str],
embedding_engine: EmbeddingEngine,
):
- self.api_key = api_key
self.embedding_engine = embedding_engine
self.db_uri: str = connection_string
📝 Committable suggestion
‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.
api_key: Optional[str], | |
embedding_engine: EmbeddingEngine, | |
): | |
self.api_key = api_key | |
def __init__( | |
self, | |
connection_string: str, | |
embedding_engine: EmbeddingEngine, | |
): | |
self.embedding_engine = embedding_engine | |
self.db_uri: str = connection_string |
async def create_collection(self, collection_name: str, payload_schema=None):
    data_point_types = get_type_hints(DataPoint)
    vector_size = self.embedding_engine.get_vector_size()
Unused payload_schema
parameter in create_collection
The payload_schema
parameter in the create_collection
method is not utilized. If it's not required, consider removing it to simplify the method signature.
Apply this diff to remove the unused parameter:
async def create_collection(self, collection_name: str, payload_schema=None):
data_point_types = get_type_hints(DataPoint)
vector_size = self.embedding_engine.get_vector_size()
+async def create_collection(self, collection_name: str):
data_point_types = get_type_hints(DataPoint)
vector_size = self.embedding_engine.get_vector_size()
Also, update the method calls accordingly.
📝 Committable suggestion
‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.
async def create_collection(self, collection_name: str, payload_schema=None): | |
data_point_types = get_type_hints(DataPoint) | |
vector_size = self.embedding_engine.get_vector_size() | |
async def create_collection(self, collection_name: str): | |
data_point_types = get_type_hints(DataPoint) | |
vector_size = self.embedding_engine.get_vector_size() | |
Refactor fix typo in pgvector integration test Refactor #COG-170
Added missing await statement to batch search for LanceDB and PGVector adapters Fix #COG-170
Actionable comments posted: 1
🧹 Outside diff range and nitpick comments (2)
cognee/infrastructure/databases/vector/pgvector/PGVectorAdapter.py (2)
140-142: Clarify the use of score=0 in the retrieve method.
In the retrieve method, ScoredResult instances are created with score=0. Since the retrieval operation does not involve scoring, setting a default score may be misleading. Consider setting score=None or modifying the ScoredResult class to make the score attribute optional when it's not applicable (see the sketch after these comments).
190-192: Use logging instead of print statements for error handling.
Currently, errors are printed directly to the console using print(f"Error during search: {e}"). Using the logging module allows for better control over log levels and integrates well with various logging systems. This change enhances maintainability and operational monitoring. Apply this diff to replace the print statement:
import asyncio
+import logging
from pgvector.sqlalchemy import Vector

except Exception as e:
-    print(f"Error during search: {e}")
+    logging.error(f"Error during search: {e}")
Ensure to configure the logging system appropriately in your application.
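Relating to the score=0 nitpick above, one way to make the score optional could look like the sketch below; whether ScoredResult is a Pydantic model, and its exact fields, are assumptions based only on how it is constructed in this PR:
from typing import Optional
from pydantic import BaseModel

class ScoredResult(BaseModel):
    id: str
    payload: dict
    # None signals "no score available", e.g. for plain retrieval without ranking.
    score: Optional[float] = None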
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
📒 Files selected for processing (3)
- cognee/infrastructure/databases/vector/lancedb/LanceDBAdapter.py (1 hunks)
- cognee/infrastructure/databases/vector/pgvector/PGVectorAdapter.py (1 hunks)
- cognee/tests/test_pgvector.py (1 hunks)
🚧 Files skipped from review as they are similar to previous changes (1)
- cognee/tests/test_pgvector.py
🧰 Additional context used
🔇 Additional comments (1)
cognee/infrastructure/databases/vector/lancedb/LanceDBAdapter.py (1)
Line range hint 155-161: Correct usage of await with asyncio.gather
The addition of await before asyncio.gather is a crucial improvement. This change ensures that the method waits for all search operations to complete before returning the results, rather than returning a coroutine object. This modification aligns with best practices for asynchronous programming in Python.
try:
    # Get PGVectorDataPoint Table from database
    PGVectorDataPoint = await self.get_table(collection_name)

    # Find closest vectors to query_vector
    closest_items = await session.execute(
        select(
            PGVectorDataPoint,
            PGVectorDataPoint.c.vector.cosine_distance(query_vector).label(
                "similarity"
            ),
        )
        .order_by("similarity")
        .limit(limit)
    )

    vector_list = []
    # Extract distances and find min/max for normalization
    for vector in closest_items:
        # TODO: Add normalization of similarity score
        vector_list.append(vector)

    # Create and return ScoredResult objects
    return [
        ScoredResult(
            id=str(row.id), payload=row.payload, score=row.similarity
        )
        for row in vector_list
    ]

except Exception as e:
    print(f"Error during search: {e}")
    return []
🛠️ Refactor suggestion
Avoid catching broad exceptions in search
method.
Catching all exceptions using except Exception
can mask unexpected errors and make debugging more difficult. It's better to catch specific exceptions that you anticipate might occur during database operations or vector computations.
Consider modifying the exception handling:
try:
# Database operations
+ except (SpecificException1, SpecificException2) as e:
+ logging.error(f"Error during search: {e}")
+ return []
except Exception as e:
- print(f"Error during search: {e}")
- return []
+ # Re-raise unexpected exceptions
+ raise e
Replace SpecificException1
, SpecificException2
, etc., with the actual exceptions you expect (e.g., SQLAlchemyError
).
Committable suggestion was skipped due to low confidence.
Removed broad exception handling for the search function in PGVectorAdapter Refactor #COG-170
Add PGVector support for cognee
Summary by CodeRabbit
Release Notes
New Features
Bug Fixes
Documentation
Chores
- docker-compose.yml updated to use the new pgvector PostgreSQL image.
- pyproject.toml updated to include new dependencies for pgvector.
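For completeness, a hedged example of how a user might opt into the new pgvector backend through the dictionary-based configuration added in this PR; the host, port, and credential values are placeholders, and the exact keys may differ from the final API:
import cognee

cognee.config.set_vector_db_config({
    "vector_db_provider": "pgvector",
})
cognee.config.set_relational_db_config({
    "db_name": "cognee_db",
    "db_host": "127.0.0.1",
    "db_port": "5432",
    "db_username": "cognee",  # placeholder credentials, not real secrets
    "db_password": "cognee",
})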