Cog 170 pgvector adapter #158
Conversation
Added config support for using pgvector Feature #COG-170
Added PGVectorAdapter Feature #COG-170
Saving state of pgvector integration development so far Feature #COG-170
Added first working iteration of PGVector for cognee, some important functionality is still missing, but the core is there. Also some refactoring will be necessary. Feature: #COG-170
Remove VECTOR_DB_NAME env parameter as it's not needed Refactor #COG-170
Removed echoing of database operations Refactor #COG-170
Added batch search to PGVectorAdapter Feature #COG-170
Optimize search query for PGVector Refactor #COG-170
Formatted PGVectorAdapter Refactor #COG-170
Added formatting for PGVector part of create_vector_engine Refactor #COG-170
Change formatting for connection string for PGVectorAdapter Refactor #COG-170
Changed raw SQL queries to use SQLAlchemy ORM for PGVectorAdapter Refactor #COG-170
Fixed issue with newly added tables not being pruned from postgres database Fix #COG-170
Added test for PGVectorAdapter Test #COG-170
GitGuardian id | GitGuardian status | Secret | Commit | Filename
---|---|---|---|---
9573981 | Triggered | Generic Password | f8babba | cognee/tests/test_pgvector.py
9573981 | Triggered | Generic Password | 9f4b8f2 | .github/workflows/test_pgvector.yml
9573981 | Triggered | Generic Password | 2cedcbe | .github/workflows/test_pgvector.yml
9573981 | Triggered | Generic Password | a358168 | cognee/tests/test_pgvector.py
9573981 | Triggered | Generic Password | a358168 | cognee/tests/test_pgvector.py
9573981 | Triggered | Generic Password | 6b9a142 | cognee/tests/test_pgvector.py
🛠 Guidelines to remediate hardcoded secrets
- Understand the implications of revoking this secret by investigating where it is used in your code.
- Replace and store your secrets safely. Learn here the best practices.
- Revoke and rotate these secrets.
- If possible, rewrite git history. Rewriting git history is not a trivial act. You might completely break other contributing developers' workflow and you risk accidentally deleting legitimate data.
To avoid such incidents in the future, consider:
- following these best practices for managing and storing secrets, including API keys and other credentials
- installing secret detection on pre-commit to catch secrets before they leave your machine and ease remediation.
🦉 GitGuardian detects secrets in your source code to help developers and security teams secure the modern development process. You are seeing this because you or someone else with access to this repository has authorized GitGuardian to scan your pull request.
Added github action workflow to run PGVectorAdapter integration test Test #COG-170
Added pgvector dependency for PGVectorAdapter Chore #COG-170
Add psycopg2, a PostgreSQL database adapter, as a dependency Chore #COG-170
Changed database name in test and workflow to be the same as in the .env.template Refactor #COG-170
cognee/infrastructure/databases/vector/pgvector/PGVectorAdapter.py
from .get_vector_engine import get_vector_engine, get_vectordb_config
from sqlalchemy import text

async def create_db_and_tables():
We could try to move this inside pgvector adapter.
Done
Move serialize_datetime function to be in its own file Refactor #COG-170
Moved create_db_and_tables module to pgvector so it only handles pgvector related db creation Refactor #COG-170
Added the ability to set database configurations through dictionary for ease of use. Updated test_pgvector to use this way of setting configuration Refactor #COG-170
Moved psycopg2 as an optional dependency for postgres to not block users with lib issues it might cause Refactor #COG-170
Since the postgres-related libraries are no longer mandatory, they need to be explicitly specified to be installed as well Chore #COG-170
pgvector is imported directly in the PGVector adapter, therefore it needs to be installed with the mandatory dependencies Chore #COG-170
Renamed vector_engine_provider to vector_db_provider Fix #COG-170
}
)
cognee.config.set_relational_db_config({"db_path": "",
"db_name": "cognee_db",
Spacing is off here
cognee/api/v1/config/config.py
@@ -39,6 +39,41 @@ def set_classification_model(classification_model: object):
cognify_config = get_cognify_config()
cognify_config.classification_model = classification_model

@staticmethod
def set_db_name(db_name: str):
Let's remove this, we don't want to have to maintain it.
from ..get_vector_engine import get_vector_engine, get_vectordb_config
from sqlalchemy import text

async def create_db_and_tables():
If you merge with the main branch, you can see that create_database exists in SqlAlchemyAdapter. Is that something we can reuse here?
Yes and no. By that I mean we can use create_database from SqlAlchemyAdapter as the initial step, but there is an extra step to enable pgvector for a Postgres database, which we have to run in an engine connection context (sketched below).
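For reference, a minimal sketch of that extra step, assuming a plain SQLAlchemy async engine is available (the helper name enable_pgvector_extension is illustrative and not part of this PR):
from sqlalchemy import text
from sqlalchemy.ext.asyncio import create_async_engine

async def enable_pgvector_extension(connection_string: str) -> None:
    # Run the extension setup inside an engine connection context, as described above.
    engine = create_async_engine(connection_string)
    async with engine.begin() as connection:
        # Idempotent: does nothing if the extension is already enabled.
        await connection.execute(text("CREATE EXTENSION IF NOT EXISTS vector;"))
    await engine.dispose()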
Remove unused config methods so we don't have to maintain them. Fixed spacing in test_pgvector integration test. Refactor #COG-170
…endencies through extras: To be uniform with how we install optional dependencies, a change has been made so that postgres optional dependencies are installed through extras Chore #COG-170
Added reusing of sqlalchemy adapters create_database method instead of directly writing code for database creation Refactor #COG-170
…test for pgvector: Updated how postgres dependencies are installed due to the change in pyproject.toml Chore #COG-170
Walkthrough: The changes in this pull request encompass updates to various files, primarily enhancing the configuration and documentation for the pgvector integration.
Actionable comments posted: 17
🧹 Outside diff range and nitpick comments (15)
cognee/infrastructure/databases/vector/pgvector/__init__.py (1)
1-2: LGTM! Consider using absolute imports for consistency.
The imports look good and align with the PR objective of adding PGVector support. They make the PGVectorAdapter class and create_db_and_tables function easily accessible when importing from this package. For consistency with Python best practices and to avoid potential issues with relative imports, consider using absolute imports:
-from .PGVectorAdapter import PGVectorAdapter
-from .create_db_and_tables import create_db_and_tables
+from cognee.infrastructure.databases.vector.pgvector.PGVectorAdapter import PGVectorAdapter
+from cognee.infrastructure.databases.vector.pgvector.create_db_and_tables import create_db_and_tables
This change would make the imports more explicit and less prone to issues if the package structure changes in the future.
cognee/infrastructure/databases/vector/pgvector/serialize_datetime.py (1)
3-12: LGTM: Well-implemented serialization function with a minor suggestion.
The serialize_datetime function is well-implemented and correctly handles nested structures. It effectively converts datetime objects to ISO format strings while preserving other data types. Consider adding type hints to improve code readability and enable better static type checking:
from typing import Union, List, Dict, Any

def serialize_datetime(data: Union[Dict, List, datetime, Any]) -> Union[Dict, List, str, Any]:
    # ... rest of the function remains the same
This change would make the function's input and output types more explicit, which can be helpful for maintainability and catching potential type-related issues early.
cognee/infrastructure/databases/vector/pgvector/create_db_and_tables.py (2)
5-13: LGTM: Well-structured implementation for pgvector setup.
The function effectively sets up the pgvector database, addressing the previous review comment by using vector_engine.create_database() and then enabling the vector extension. The use of async/await and SQLAlchemy's text() function for raw SQL is appropriate.
Consider adding error handling to manage potential exceptions during database operations. This will improve the robustness of the function.
Also, adding a docstring would enhance the function's documentation. Here's a suggested docstring:
async def create_db_and_tables():
    """
    Asynchronously create the database and necessary tables for the pgvector provider.

    This function retrieves the vector database configuration and engine, creates the
    database if the provider is pgvector, and enables the vector extension in the database.

    Raises:
        Exception: If there's an error during database creation or extension enabling.
    """
    # ... (rest of the function)
1-14: LGTM: Well-structured and focused implementation.
The file adheres to the Single Responsibility Principle, focusing solely on creating the database and tables for pgvector. Its structure is clean and easy to understand.
Consider adding more detailed comments or logging statements to enhance maintainability and debugging capabilities. For example, you could log the successful creation of the database and enabling of the vector extension (a sketch of this follows).
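A minimal sketch of how the suggested logging and error handling could look, reusing the helpers already imported in this module; the vector_engine.engine attribute is an assumption, not something this PR confirms:
import logging
from sqlalchemy import text

logger = logging.getLogger(__name__)

async def create_db_and_tables():
    vector_config = get_vectordb_config()
    vector_engine = get_vector_engine()

    if vector_config.vector_db_provider == "pgvector":
        try:
            await vector_engine.create_database()
            # The engine attribute name is assumed; the real adapter may expose it differently.
            async with vector_engine.engine.begin() as connection:
                await connection.execute(text("CREATE EXTENSION IF NOT EXISTS vector"))
            logger.info("pgvector database created and vector extension enabled")
        except Exception:
            logger.exception("Failed to set up the pgvector database")
            raise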
cognee/api/v1/add/add_v2.py (3)
6-7
: LGTM! Consider using absolute imports.The new import statements are clear and align with the PR objective of adding PGVector support. The renaming of
create_db_and_tables
tocreate_relational_db_and_tables
improves clarity.Consider using absolute imports for better maintainability:
-from cognee.infrastructure.databases.relational import create_db_and_tables as create_relational_db_and_tables -from cognee.infrastructure.databases.vector.pgvector import create_db_and_tables as create_pgvector_db_and_tables +from cognee.infrastructure.databases.relational.create_db_and_tables import create_db_and_tables as create_relational_db_and_tables +from cognee.infrastructure.databases.vector.pgvector.create_db_and_tables import create_db_and_tables as create_pgvector_db_and_tables
10-11
: LGTM! Consider error handling and potential performance optimization.The addition of database creation calls aligns with the PR objective and ensures proper setup before data processing.
Consider the following improvements:
- Add error handling:
try: await create_relational_db_and_tables() await create_pgvector_db_and_tables() except Exception as e: logger.error(f"Failed to create databases: {e}") raise
- For potential performance improvement, consider running these operations concurrently:
await asyncio.gather( create_relational_db_and_tables(), create_pgvector_db_and_tables() )This change would require importing
asyncio
at the top of the file.
Line range hint
9-24
: Consider updating the function's docstring.The changes to the
add
function are minimal and focused on adding PGVector support, which is good. However, the function's docstring (if it exists) might need an update to reflect the new database setup process.Consider adding or updating the function's docstring to include information about the database setup:
async def add(data: Union[BinaryIO, list[BinaryIO], str, list[str]], dataset_name: str = "main_dataset", user: User = None): """ Add data to the system, setting up both relational and vector databases. This function initializes the necessary database tables for both relational and vector (PGVector) databases before processing the input data. Args: data (Union[BinaryIO, list[BinaryIO], str, list[str]]): The data to be added. dataset_name (str, optional): The name of the dataset. Defaults to "main_dataset". user (User, optional): The user associated with this operation. If None, uses the default user. Returns: Asynchronous generator yielding results from the add pipeline. """ # ... (rest of the function).github/workflows/test_pgvector.yml (3)
13-24
: LGTM: Job structure and conditions are well-defined.The environment variable setup and job definitions are appropriate. The dependency between jobs and the condition for running the integration test are logically structured.
Consider adding a comment explaining the purpose of the
get_docs_changes
job for better clarity:jobs: get_docs_changes: name: docs changes uses: ./.github/workflows/get_docs_changes.yml # This job checks if there are any changes outside of documentation files
29-42
: LGTM: PostgreSQL service is well-configured.The PostgreSQL service setup using the pgvector image is appropriate. The health check and port mapping are correctly configured.
For improved security, consider using GitHub Secrets for the database credentials:
env: POSTGRES_USER: ${{ secrets.POSTGRES_USER }} POSTGRES_PASSWORD: ${{ secrets.POSTGRES_PASSWORD }} POSTGRES_DB: ${{ secrets.POSTGRES_DB }}This change would prevent exposing the credentials in the workflow file.
63-67
: LGTM: Test execution is properly configured.The test execution step is well-defined, using Poetry to run the pgvector test script. The use of GitHub Secrets for the API key is a good security practice.
Consider adding a comment explaining the purpose of the LLM_API_KEY, and possibly make the ENV value configurable:
env: ENV: ${{ inputs.environment || 'dev' }} # Allow overriding the environment LLM_API_KEY: ${{ secrets.OPENAI_API_KEY }} # API key for OpenAI integrationThis change would provide more context and flexibility for different test scenarios.
cognee/modules/settings/get_settings.py (1)
44-46
: LGTM! Consider a minor formatting adjustment.The addition of PGVector to the vector_dbs list is correct and aligns with the PR objective. The new entry follows the existing structure, maintaining consistency.
For improved readability and consistency with the existing entries, consider adjusting the indentation slightly:
}, { - "value": "pgvector", - "label": "PGVector", + "value": "pgvector", + "label": "PGVector", }]This minor change would align the new entry's indentation with the existing entries in the list.
cognee/infrastructure/databases/relational/sqlalchemy/SqlAlchemyAdapter.py (1)
122-123
: Approve changes with suggestions for improvementThe addition of schema reflection before dropping tables is a good improvement. It ensures that tables are dropped in the correct order, avoiding potential foreign key constraint violations.
Consider the following enhancements:
- Add a comment explaining the purpose of the schema reflection:
# Reflect the current database schema to ensure all tables are accounted for await connection.run_sync(Base.metadata.reflect)
- Implement error handling for individual table drops:
for table in Base.metadata.sorted_tables: try: drop_table_query = text(f"DROP TABLE IF EXISTS {table.name} CASCADE") await connection.execute(drop_table_query) except Exception as e: print(f"Error dropping table {table.name}: {e}") # Decide whether to continue or break the loop based on your requirementsThese changes will improve code readability and robustness.
README.md (1)
197-197
: LGTM: NetworkX added as a supported graph store.The addition of NetworkX as a supported graph store alongside Neo4j is a valuable update to the documentation. It provides a more comprehensive view of the project's capabilities.
Consider updating the PR description to mention this addition of NetworkX support, as it's a significant enhancement that wasn't explicitly stated in the original PR objectives.
cognee/infrastructure/databases/vector/pgvector/PGVectorAdapter.py (2)
190-192: Use proper logging instead of print.
Replace the print call in the search error handler with a logger call:
except Exception as e:
-    print(f"Error during search: {e}")
+    logger.exception("Error during search")
     return []
Ensure you have configured a logger, e.g., by importing logging and setting up a logger instance.
195-200: Inconsistent parameter name: with_vectors vs. with_vector
In batch_search, the parameter is named with_vectors, whereas in search, it's with_vector. This inconsistency can cause confusion and potential bugs. Consider standardizing the parameter name across both methods:
async def batch_search(
    self,
    collection_name: str,
    query_texts: List[str],
    limit: int = None,
-   with_vectors: bool = False,
+   with_vector: bool = False,
):
And ensure you update all references to this parameter within the method.
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
⛔ Files ignored due to path filters (1)
poetry.lock
is excluded by!**/*.lock
📒 Files selected for processing (18)
- .env.template (1 hunks)
- .github/workflows/test_pgvector.yml (1 hunks)
- README.md (1 hunks)
- cognee/api/client.py (1 hunks)
- cognee/api/v1/add/add.py (1 hunks)
- cognee/api/v1/add/add_v2.py (1 hunks)
- cognee/api/v1/config/config.py (1 hunks)
- cognee/infrastructure/databases/relational/sqlalchemy/SqlAlchemyAdapter.py (1 hunks)
- cognee/infrastructure/databases/vector/create_vector_engine.py (2 hunks)
- cognee/infrastructure/databases/vector/pgvector/PGVectorAdapter.py (1 hunks)
- cognee/infrastructure/databases/vector/pgvector/init.py (1 hunks)
- cognee/infrastructure/databases/vector/pgvector/create_db_and_tables.py (1 hunks)
- cognee/infrastructure/databases/vector/pgvector/serialize_datetime.py (1 hunks)
- cognee/modules/settings/get_settings.py (1 hunks)
- cognee/modules/settings/save_vector_db_config.py (1 hunks)
- cognee/tests/test_pgvector.py (1 hunks)
- docker-compose.yml (1 hunks)
- pyproject.toml (1 hunks)
✅ Files skipped from review due to trivial changes (1)
- .env.template
🧰 Additional context used
🔇 Additional comments (25)
cognee/infrastructure/databases/vector/pgvector/serialize_datetime.py (2)
1-1
: LGTM: Import statement is correct and sufficient.The import of
datetime
from thedatetime
module is appropriate for the function's needs.
1-12
: Overall, well-implemented serialization function with room for optional enhancements.The
serialize_datetime
function is a solid implementation that effectively handles the serialization of datetime objects in nested structures. It's a valuable addition to the project, particularly for working with databases that require serialized datetime values.The suggested improvements (adding type hints, handling additional date-related types, and custom objects) are optional enhancements that could further improve the robustness and flexibility of the function. Consider implementing these based on the specific needs of your project and the types of data you expect to handle.
Great job on this implementation!
cognee/infrastructure/databases/vector/pgvector/create_db_and_tables.py (1)
1-3
: LGTM: Imports are appropriate and consistent.The imports are well-structured and relevant to the file's functionality. The use of relative imports maintains consistency with the project structure.
cognee/modules/settings/save_vector_db_config.py (1)
8-8
: LGTM! Verify consistency across the codebase.The addition of "pgvector" as a new option for the
provider
attribute correctly implements PGVector support at the configuration level. This change aligns well with the PR objective.To ensure consistency, please run the following script to check for any other occurrences of vector database provider definitions that might need updating:
Consider updating any relevant documentation (e.g., README, API docs) to reflect this new provider option.
✅ Verification successful
Verification Successful: "pgvector" is consistently integrated across the codebase.
🏁 Scripts executed
The following scripts were executed for the analysis:
Script:
#!/bin/bash # Description: Search for other vector database provider definitions # Test: Search for similar provider definitions rg -i 'provider.*(?:lancedb|qdrant|weaviate)' # Test: Search for pgvector mentions to ensure it's added everywhere necessary rg -i 'pgvector'Length of output: 5953
docker-compose.yml (1)
65-65
: Approve the update to pgvector image, with verification steps.The change from
postgres:latest
topgvector/pgvector:pg17
aligns well with the PR objective of adding PGVector support. This update is crucial for enabling vector operations in PostgreSQL.However, please consider the following:
- Ensure that existing data and schemas are compatible with this new image.
- Verify that all components interacting with the database are compatible with PostgreSQL 17.
- Consider documenting this change in the project's documentation or README.
To verify the compatibility and impact of this change, you can run the following script:
This script will help identify any potential compatibility issues or areas that might need adjustment due to the PostgreSQL version change and pgvector introduction.
✅ Verification successful
Verified the pgvector image update. No compatibility issues detected.
🏁 Scripts executed
The following scripts were executed for the analysis:
Script:
#!/bin/bash # Description: Check for PostgreSQL version dependencies and pgvector usage in the codebase # Check for explicit PostgreSQL version dependencies echo "Checking for explicit PostgreSQL version dependencies:" rg -i 'postgres(ql)?\s*(version|v)?\s*([0-9]+)' --type py # Check for pgvector usage in Python files echo "Checking for pgvector usage:" rg 'pgvector' --type py # Check for SQL commands related to vector operations echo "Checking for vector-related SQL commands:" rg -i 'create extension vector|<\[.*\]>|l2_distance|inner_product|cosine_distance' --type sqlLength of output: 2371
Script:
#!/bin/bash # Description: Re-check for vector-related SQL commands without type filtering echo "Re-checking for vector-related SQL commands without file type restrictions:" rg -i 'create extension vector|l2_distance|inner_product|cosine_distance'Length of output: 391
.github/workflows/test_pgvector.yml (3)
1-11
: LGTM: Workflow configuration is well-defined.The workflow name, trigger conditions, and concurrency settings are appropriately configured for testing pgvector integration on pull requests and manual runs.
45-61
: LGTM: Test environment setup is comprehensive.The setup steps for the test environment are well-defined. The use of Poetry for dependency management and the inclusion of the postgres extra ensure a consistent and appropriate test environment.
1-67
: Overall, excellent implementation of pgvector integration testing workflow.This GitHub Actions workflow file is well-structured and comprehensive. It effectively sets up a pgvector-enabled PostgreSQL environment, manages dependencies using Poetry, and runs integration tests. The workflow aligns perfectly with the PR objective of adding PGVector support for cognee.
Key strengths:
- Clear job dependencies and conditions
- Proper use of GitHub Secrets for sensitive data
- Comprehensive setup of the test environment
- Specific pgvector test execution
The minor suggestions provided earlier will further enhance security and clarity. Great job on implementing this crucial testing workflow!
pyproject.toml (4)
73-73
: LGTM: Addition of pgvector dependencyThe addition of the pgvector dependency aligns with the PR objective of adding PGVector support to the project. The version constraint "^0.3.5" allows for compatible updates, which is a good practice.
74-74
: LGTM: psycopg2 dependency made optionalMaking the psycopg2 dependency optional is a good practice. It allows users to install the package without PostgreSQL support if they don't need it, reducing unnecessary dependencies for some use cases.
82-82
: LGTM: Addition of 'postgres' extraThe addition of the 'postgres' extra is a good practice. It allows users to easily install PostgreSQL support when needed, which is consistent with the PR objective of adding PGVector support. This change, combined with making psycopg2 optional, provides flexibility in installation options.
73-82
: Summary: Well-structured changes for PGVector supportThe changes in this file are well-structured and consistent with the PR objective of adding PGVector support. The addition of the pgvector dependency, making psycopg2 optional, and introducing the 'postgres' extra work together to provide the necessary functionality while maintaining flexibility in installation options. These modifications follow poetry best practices and should improve the project's usability for different use cases.
cognee/modules/settings/get_settings.py (2)
Line range hint
1-146
: Overall, the change is well-implemented and focused.The addition of PGVector to the vector_dbs list is the only change in this file, and it's implemented correctly. The modification is minimal, focused, and doesn't disrupt the existing functionality of the get_settings function. This change aligns well with the PR objective of adding PGVector support to the project.
44-46
: Verify handling of PGVector option in related code.While the addition of PGVector to the vector_dbs list is correct, it's important to ensure that other parts of the codebase can handle this new option properly.
Let's verify the integration of PGVector in related files:
Please review the output of these checks to ensure proper integration of PGVector across the codebase.
✅ Verification successful
PGVector integration verified successfully.
All related codebase areas handle PGVector appropriately, and no TODOs or FIXMEs were found.
🏁 Scripts executed
The following scripts were executed for the analysis:
Script:
#!/bin/bash # Description: Check for PGVector handling in related files # Test 1: Check if PGVector is handled in vector database configuration echo "Checking vector database configuration handling:" rg -i 'pgvector|PGVector' --type py # Test 2: Verify if there are any TODOs or FIXMEs related to PGVector echo "\nChecking for TODOs or FIXMEs related to PGVector:" rg -i '(TODO|FIXME).*pgvector' --type py # Test 3: Look for potential switch statements or if-else blocks that might need updating echo "\nChecking for potential switch statements or if-else blocks:" rg -i 'if.*vector_db.*=|match.*vector_db' --type pyLength of output: 5346
cognee/api/v1/config/config.py (3)
Line range hint
1-121
: Consider overall improvements for consistency and documentation.After reviewing the entire file, here are some suggestions for overall improvement:
Docstrings: Add docstrings to all methods for better documentation. This will help users understand the purpose and usage of each method.
Consistent error handling: Implement a consistent approach to error handling across all methods. For example, decide whether to use exceptions or return error codes, and apply this consistently.
Type hinting: Add type hints to all method parameters and return values for better code clarity and to enable static type checking.
Further refactoring: Consider creating a base method for updating configuration objects, which could be used by multiple setter methods to reduce code duplication.
Naming convention: Ensure all method names follow a consistent naming convention. For example, some methods start with "set_" while others don't.
Here's an example of how you might implement some of these suggestions:
from typing import Any, Dict class config: @staticmethod def _update_config(config_object: Any, config_dict: Dict[str, Any]) -> bool: """ Updates the given config object with values from config_dict. Args: config_object: The configuration object to update. config_dict: A dictionary of configuration key-value pairs. Returns: bool: True if successful, False otherwise. Raises: AttributeError: If an invalid attribute is provided in config_dict. """ valid_attrs = set(dir(config_object)) - set(dir(object)) for key, value in config_dict.items(): if key in valid_attrs: setattr(config_object, key, value) else: raise AttributeError(f"'{key}' is not a valid attribute. Valid attributes are: {', '.join(valid_attrs)}") return True @staticmethod def set_relational_db_config(config_dict: Dict[str, Any]) -> bool: """ Updates the relational database configuration. Args: config_dict: A dictionary of configuration key-value pairs. Returns: bool: True if successful, False otherwise. """ return config._update_config(get_relational_config(), config_dict) @staticmethod def set_vector_db_config(config_dict: Dict[str, Any]) -> bool: """ Updates the vector database configuration. Args: config_dict: A dictionary of configuration key-value pairs. Returns: bool: True if successful, False otherwise. """ return config._update_config(get_vectordb_config(), config_dict) # ... (apply similar patterns to other methods)This refactoring introduces consistent docstrings, type hinting, and a shared
_update_config
method to reduce duplication.Could you confirm if these suggestions align with the project's coding standards and documentation practices?
98-109
: 🛠️ Refactor suggestionConsider enhancing error handling and adding return value.
The
set_relational_db_config
method looks good overall, but there are a few suggestions for improvement:
- Consider adding type checking for the values being set. This could prevent issues with incorrect data types.
- The error message for invalid attributes could be more informative. Perhaps include a list of valid attributes in the error message.
- The method doesn't return anything. Consider returning the updated config object or a boolean indicating success. This would make it clearer to users that the operation was successful.
Here's a suggested improvement:
@staticmethod def set_relational_db_config(config_dict: dict) -> bool: """ Updates the relational db config with values from config_dict. Returns True if successful, False otherwise. """ relational_db_config = get_relational_config() valid_attrs = set(dir(relational_db_config)) - set(dir(object)) for key, value in config_dict.items(): if key in valid_attrs: setattr(relational_db_config, key, value) else: raise AttributeError(f"'{key}' is not a valid attribute. Valid attributes are: {', '.join(valid_attrs)}") return TrueThis version includes a return value, uses
setattr()
for clarity, and provides a more informative error message.Could you clarify if there's a specific reason for using
object.__setattr__()
instead of the standardsetattr()
function?
110-121
: 🛠️ Refactor suggestionConsider refactoring to reduce code duplication.
The
set_vector_db_config
method is nearly identical toset_relational_db_config
. This duplication suggests an opportunity for refactoring to improve maintainability and reduce the risk of inconsistencies.Consider creating a generic method for updating configurations:
@staticmethod def _update_config(config_object, config_dict: dict) -> bool: """ Updates the given config object with values from config_dict. Returns True if successful, False otherwise. """ valid_attrs = set(dir(config_object)) - set(dir(object)) for key, value in config_dict.items(): if key in valid_attrs: setattr(config_object, key, value) else: raise AttributeError(f"'{key}' is not a valid attribute. Valid attributes are: {', '.join(valid_attrs)}") return True @staticmethod def set_relational_db_config(config_dict: dict) -> bool: return config._update_config(get_relational_config(), config_dict) @staticmethod def set_vector_db_config(config_dict: dict) -> bool: return config._update_config(get_vectordb_config(), config_dict)This refactoring addresses the duplication issue and incorporates the improvements suggested in the previous comment.
As with the previous method, could you clarify if there's a specific reason for using
object.__setattr__()
instead of the standardsetattr()
function? If not, we should consistently usesetattr()
for better readability and maintainability.README.md (2)
193-194
: LGTM: PGVector added to supported vector stores.The addition of PGVector to the list of supported vector stores is consistent with the PR objective and accurately updates the project's capabilities.
193-197
: Overall README.md updates look good.The changes to the README.md file accurately reflect the project's new capabilities, including PGVector support (the main PR objective) and NetworkX as an additional graph store option. These updates enhance the documentation's completeness and provide valuable information to users about the project's supported technologies.
cognee/api/client.py (1)
377-377
: LGTM! Verify consistency with other parts of the codebase.The addition of "pgvector" as a new literal option for the provider field in VectorDBConfigDTO is correct and aligns with the PR objective of adding PGVector support for cognee.
To ensure full compatibility, please run the following script to check for any other occurrences of vector database provider definitions that might need updating:
This will help identify any other places in the codebase where vector database providers are defined or used, ensuring consistency across the project.
✅ Verification successful
Consistency with other parts of the codebase confirmed.
The addition of"pgvector"
as a new literal option for theprovider
field inVectorDBConfigDTO
is consistent and fully supported across the codebase.🏁 Scripts executed
The following scripts were executed for the analysis:
Script:
#!/bin/bash # Description: Check for other occurrences of vector database provider definitions # Search for vector database provider definitions echo "Searching for vector database provider definitions:" rg -i "provider.*(?:lancedb|qdrant|weaviate)" # Search for pgvector-related code echo "\nSearching for pgvector-related code:" rg -i "pgvector"Length of output: 6153
cognee/infrastructure/databases/vector/create_vector_engine.py (2)
3-3
: Import statement is appropriateThe import of
get_relational_config
is necessary for accessing relational database configurations required for the pgvector provider.
31-49
: Verify thatconfig["vector_db_key"]
is necessary for PGVectorAdapterEnsure that the
vector_db_key
is required byPGVectorAdapter
. If it's not used, consider removing it to simplify the code.Run the following script to check if
vector_db_key
is utilized inPGVectorAdapter
:✅ Verification successful
vector_db_key
is not utilized in PGVectorAdapterThe
vector_db_key
parameter is not used withinPGVectorAdapter.py
. Consider removing it to simplify the code.🏁 Scripts executed
The following scripts were executed for the analysis:
Script:
#!/bin/bash # Description: Search for the usage of 'vector_db_key' in PGVectorAdapter. # Test: Find occurrences of 'vector_db_key' in PGVectorAdapter.py # Expect: The term should be referenced if it's necessary. rg 'vector_db_key' cognee/infrastructure/databases/vector/pgvector/PGVectorAdapter.pyLength of output: 85
cognee/api/v1/add/add.py (2)
11-11: [Approved] Importing get_relational_engine
The import statement for get_relational_engine is appropriate and necessary for interacting with the relational database.
17-18: Maintain Clear Aliases for Database Initialization Functions
Aliasing create_db_and_tables as create_relational_db_and_tables and create_pgvector_db_and_tables improves code clarity by distinguishing between the relational and vector database initialization functions. This is a good practice to prevent confusion.
cognee/infrastructure/databases/vector/pgvector/PGVectorAdapter.py (1)
225-227: Ensure the delete_database method is implemented
The prune method calls await self.delete_database(), but it's unclear whether delete_database is implemented in this class or inherited properly. Confirm that delete_database exists and functions as intended. If not, implement this method (see the sketch below) or adjust the prune method accordingly.
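Purely as an illustration, a delete_database implementation could be sketched roughly as follows; the self.engine attribute and the schema-reset approach are assumptions, not the PR's actual code:
from sqlalchemy import text

async def delete_database(self):
    # Drop everything in the public schema (including pgvector tables), then
    # recreate an empty schema so the database can be reused after pruning.
    async with self.engine.begin() as connection:
        await connection.execute(text("DROP SCHEMA public CASCADE;"))
        await connection.execute(text("CREATE SCHEMA public;"))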
def serialize_datetime(data):
    """Recursively convert datetime objects in dictionaries/lists to ISO format."""
    if isinstance(data, dict):
        return {key: serialize_datetime(value) for key, value in data.items()}
    elif isinstance(data, list):
        return [serialize_datetime(item) for item in data]
    elif isinstance(data, datetime):
        return data.isoformat()  # Convert datetime to ISO 8601 string
    else:
        return data
🛠️ Refactor suggestion
Consider handling additional date-related types and custom objects.
While the current implementation is robust for common use cases, consider the following enhancements:
- Handle other date-related types like
date
andtime
. - Provide a mechanism to handle custom objects with datetime attributes.
Here's a potential implementation that addresses these points:
from datetime import datetime, date, time
from typing import Union, List, Dict, Any
def serialize_datetime(data: Any) -> Any:
"""Recursively convert datetime, date, and time objects in dictionaries/lists to ISO format."""
if isinstance(data, dict):
return {key: serialize_datetime(value) for key, value in data.items()}
elif isinstance(data, list):
return [serialize_datetime(item) for item in data]
elif isinstance(data, (datetime, date, time)):
return data.isoformat()
elif hasattr(data, '__dict__'):
return serialize_datetime(data.__dict__)
else:
return data
This implementation:
- Handles
date
andtime
objects in addition todatetime
. - Attempts to serialize custom objects by converting them to dictionaries.
Please consider if these additional cases are relevant to your use case.
connection_string: str = (
    f"postgresql+asyncpg://{db_username}:{db_password}@{db_host}:{db_port}/{db_name}"
)
🛠️ Refactor suggestion
Use SQLAlchemy's URL
object for safer connection strings
Leveraging SQLAlchemy's URL
object improves readability and ensures proper handling of credentials and connection parameters.
Refactor the code as follows:
+from sqlalchemy.engine import URL
+
# Get configuration for postgres database
relational_config = get_relational_config()
db_username = relational_config.db_username
db_password = relational_config.db_password
db_host = relational_config.db_host
db_port = relational_config.db_port
db_name = relational_config.db_name
- connection_string: str = (
- f"postgresql+asyncpg://{db_username}:{db_password}@{db_host}:{db_port}/{db_name}"
- )
+ connection_url = URL.create(
+ drivername="postgresql+asyncpg",
+ username=db_username,
+ password=db_password,
+ host=db_host,
+ port=db_port,
+ database=db_name
+ )
Update the adapter initialization:
- return PGVectorAdapter(connection_string,
+ return PGVectorAdapter(connection_url,
config["vector_db_key"],
embedding_engine
)
Committable suggestion was skipped due to low confidence.
Ensure special characters in credentials are URL-encoded
When constructing the connection string, special characters in db_username
and db_password
could lead to connection issues or security vulnerabilities if not properly URL-encoded.
Apply this diff to URL-encode the credentials:
+from urllib.parse import quote_plus
+
# Get configuration for postgres database
relational_config = get_relational_config()
db_username = relational_config.db_username
db_password = relational_config.db_password
db_host = relational_config.db_host
db_port = relational_config.db_port
db_name = relational_config.db_name
connection_string: str = (
- f"postgresql+asyncpg://{db_username}:{db_password}@{db_host}:{db_port}/{db_name}"
+ f"postgresql+asyncpg://{quote_plus(db_username)}:{quote_plus(db_password)}@{db_host}:{db_port}/{db_name}"
)
Committable suggestion was skipped due to low confidence.
cognee/tests/test_pgvector.py
search_results = await cognee.search(SearchType.SUMMARIES, query=random_node_name)
assert len(search_results) != 0, "Query related summaries don't exist."
print("\n\Extracted summaries are:\n")
Fix the typo in the print statement
There's an unintended backslash in the string, which may cause incorrect output.
Apply this diff to fix the typo:
- print("\n\Extracted summaries are:\n")
+ print("\n\nExtracted summaries are:\n")
📝 Committable suggestion
‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.
print("\n\Extracted summaries are:\n") | |
print("\n\nExtracted summaries are:\n") |
random_node = (await vector_engine.search("entities", "AI"))[0]
random_node_name = random_node.payload["name"]
Handle possible empty search results to avoid IndexError
The code does not check if the search result from vector_engine.search("entities", "AI")
is empty before accessing the first element. If the search returns an empty list, accessing [0]
will raise an IndexError
.
Apply this diff to handle the case when no results are returned:
vector_engine = get_vector_engine()
- random_node = (await vector_engine.search("entities", "AI"))[0]
+ search_results = await vector_engine.search("entities", "AI")
+ if not search_results:
+ raise ValueError("No entities found for 'AI'")
+ random_node = search_results[0]
random_node_name = random_node.payload["name"]
📝 Committable suggestion
‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.
random_node = (await vector_engine.search("entities", "AI"))[0] | |
random_node_name = random_node.payload["name"] | |
vector_engine = get_vector_engine() | |
search_results = await vector_engine.search("entities", "AI") | |
if not search_results: | |
raise ValueError("No entities found for 'AI'") | |
random_node = search_results[0] | |
random_node_name = random_node.payload["name"] |
await cognee.cognify([dataset_name])

from cognee.infrastructure.databases.vector import get_vector_engine
🛠️ Refactor suggestion
Move import statement to the top of the file
It's best practice in Python to place all import statements at the top of the file. Moving get_vector_engine
import improves code readability and follows PEP 8 guidelines.
Apply this diff to move the import:
import os
import logging
import pathlib
+from cognee.infrastructure.databases.vector import get_vector_engine
import cognee
from cognee.api.v1.search import SearchType
...
- from cognee.infrastructure.databases.vector import get_vector_engine
📝 Committable suggestion
‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.
from cognee.infrastructure.databases.vector import get_vector_engine | |
import os | |
import logging | |
import pathlib | |
from cognee.infrastructure.databases.vector import get_vector_engine | |
import cognee | |
from cognee.api.v1.search import SearchType | |
for vector in closest_items:
    # TODO: Add normalization of similarity score
    vector_list.append(vector)
Incomplete implementation: Normalization of similarity scores
The search method includes a TODO comment for normalizing similarity scores, but the normalization logic is not implemented. This may affect the accuracy of the search results. Consider implementing the normalization of the similarity scores (one possible approach is sketched below) or removing the TODO comment if it's not required before merging.
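One possible approach, shown only as a sketch: min-max scale the cosine distances returned by the query above into [0, 1] (rows is a stand-in for the fetched result rows with their similarity column):
def normalize_scores(rows):
    distances = [row.similarity for row in rows]
    if not distances:
        return []
    lo, hi = min(distances), max(distances)
    span = hi - lo
    # If all distances are identical, treat every result as equally relevant.
    return [0.0 if span == 0 else (distance - lo) / span for distance in distances]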
async def batch_search(
    self,
    collection_name: str,
    query_texts: List[str],
    limit: int = None,
    with_vectors: bool = False,
):
    query_vectors = await self.embedding_engine.embed_text(query_texts)

    return asyncio.gather(
        *[
            self.search(
                collection_name=collection_name,
                query_vector=query_vector,
                limit=limit,
                with_vector=with_vectors,
            )
            for query_vector in query_vectors
        ]
    )
Missing await
when calling asyncio.gather
in batch_search
In the batch_search
method, asyncio.gather
is called without await
, which means the coroutines are not executed.
Apply this diff to fix the issue:
return asyncio.gather(
*[
self.search(
collection_name=collection_name,
query_vector=query_vector,
limit=limit,
with_vector=with_vectors,
)
for query_vector in query_vectors
]
)
+await asyncio.gather(...)
Replace the return statement with:
return await asyncio.gather(
*[
self.search(
collection_name=collection_name,
query_vector=query_vector,
limit=limit,
with_vector=with_vectors,
)
for query_vector in query_vectors
]
)
    self, collection_name: str, data_points: List[DataPoint]
):
    async with self.get_async_session() as session:
        if not await self.has_collection(collection_name):
            await self.create_collection(
                collection_name=collection_name,
                payload_schema=type(data_points[0].payload),
            )
Passing unnecessary payload_schema
to create_collection
In create_data_points
, the payload_schema
is passed to create_collection
, but since it's unused in create_collection
, this parameter can be omitted.
Apply this diff to streamline the method call:
if not await self.has_collection(collection_name):
await self.create_collection(
collection_name=collection_name,
- payload_schema=type(data_points[0].payload),
)
📝 Committable suggestion
‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.
self, collection_name: str, data_points: List[DataPoint] | |
): | |
async with self.get_async_session() as session: | |
if not await self.has_collection(collection_name): | |
await self.create_collection( | |
collection_name=collection_name, | |
payload_schema=type(data_points[0].payload), | |
) | |
self, collection_name: str, data_points: List[DataPoint] | |
): | |
async with self.get_async_session() as session: | |
if not await self.has_collection(collection_name): | |
await self.create_collection( | |
collection_name=collection_name, | |
) |
    api_key: Optional[str],
    embedding_engine: EmbeddingEngine,
):
    self.api_key = api_key
Unused parameter api_key
in constructor
The api_key
parameter is stored in self.api_key
but is not used elsewhere in the class. Consider removing it if it's unnecessary or implement its usage as needed.
Apply this diff to remove the unused parameter:
def __init__(
self,
connection_string: str,
- api_key: Optional[str],
embedding_engine: EmbeddingEngine,
):
- self.api_key = api_key
self.embedding_engine = embedding_engine
self.db_uri: str = connection_string
📝 Committable suggestion
‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.
api_key: Optional[str], | |
embedding_engine: EmbeddingEngine, | |
): | |
self.api_key = api_key | |
def __init__( | |
self, | |
connection_string: str, | |
embedding_engine: EmbeddingEngine, | |
): | |
self.embedding_engine = embedding_engine | |
self.db_uri: str = connection_string |
async def create_collection(self, collection_name: str, payload_schema=None):
    data_point_types = get_type_hints(DataPoint)
    vector_size = self.embedding_engine.get_vector_size()
Unused payload_schema
parameter in create_collection
The payload_schema
parameter in the create_collection
method is not utilized. If it's not required, consider removing it to simplify the method signature.
Apply this diff to remove the unused parameter:
async def create_collection(self, collection_name: str, payload_schema=None):
data_point_types = get_type_hints(DataPoint)
vector_size = self.embedding_engine.get_vector_size()
+async def create_collection(self, collection_name: str):
data_point_types = get_type_hints(DataPoint)
vector_size = self.embedding_engine.get_vector_size()
Also, update the method calls accordingly.
📝 Committable suggestion
‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.
async def create_collection(self, collection_name: str, payload_schema=None): | |
data_point_types = get_type_hints(DataPoint) | |
vector_size = self.embedding_engine.get_vector_size() | |
async def create_collection(self, collection_name: str): | |
data_point_types = get_type_hints(DataPoint) | |
vector_size = self.embedding_engine.get_vector_size() | |
Refactor fix typo in pgvector integration test Refactor #COG-170
Added missing await statement to batch search for LanceDB and PGVector adapters Fix #COG-170
Actionable comments posted: 1
🧹 Outside diff range and nitpick comments (2)
cognee/infrastructure/databases/vector/pgvector/PGVectorAdapter.py (2)
140-142: Clarify the use of score=0 in the retrieve method.
In the retrieve method, ScoredResult instances are created with score=0. Since the retrieval operation does not involve scoring, setting a default score may be misleading. Consider setting score=None or modifying the ScoredResult class to make the score attribute optional when it's not applicable (see the sketch after these comments).
190-192: Use logging instead of print statements for error handling.
Currently, errors are printed directly to the console using print(f"Error during search: {e}"). Using the logging module allows for better control over log levels and integrates well with various logging systems. This change enhances maintainability and operational monitoring. Apply this diff to replace the print statement:
import asyncio
+import logging
from pgvector.sqlalchemy import Vector

except Exception as e:
-    print(f"Error during search: {e}")
+    logging.error(f"Error during search: {e}")
Ensure to configure the logging system appropriately in your application.
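Relating to the score=0 nitpick above, one way to make the score optional could look like the sketch below; whether ScoredResult is a Pydantic model, and its exact fields, are assumptions based only on how it is constructed in this PR:
from typing import Optional
from pydantic import BaseModel

class ScoredResult(BaseModel):
    id: str
    payload: dict
    # None signals "no score available", e.g. for plain retrieval without ranking.
    score: Optional[float] = None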
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
📒 Files selected for processing (3)
- cognee/infrastructure/databases/vector/lancedb/LanceDBAdapter.py (1 hunks)
- cognee/infrastructure/databases/vector/pgvector/PGVectorAdapter.py (1 hunks)
- cognee/tests/test_pgvector.py (1 hunks)
🚧 Files skipped from review as they are similar to previous changes (1)
- cognee/tests/test_pgvector.py
🧰 Additional context used
🔇 Additional comments (1)
cognee/infrastructure/databases/vector/lancedb/LanceDBAdapter.py (1)
Line range hint 155-161: Correct usage of await with asyncio.gather
The addition of await before asyncio.gather is a crucial improvement. This change ensures that the method waits for all search operations to complete before returning the results, rather than returning a coroutine object. This modification aligns with best practices for asynchronous programming in Python.
try:
    # Get PGVectorDataPoint Table from database
    PGVectorDataPoint = await self.get_table(collection_name)

    # Find closest vectors to query_vector
    closest_items = await session.execute(
        select(
            PGVectorDataPoint,
            PGVectorDataPoint.c.vector.cosine_distance(query_vector).label(
                "similarity"
            ),
        )
        .order_by("similarity")
        .limit(limit)
    )

    vector_list = []
    # Extract distances and find min/max for normalization
    for vector in closest_items:
        # TODO: Add normalization of similarity score
        vector_list.append(vector)

    # Create and return ScoredResult objects
    return [
        ScoredResult(
            id=str(row.id), payload=row.payload, score=row.similarity
        )
        for row in vector_list
    ]

except Exception as e:
    print(f"Error during search: {e}")
    return []
🛠️ Refactor suggestion
Avoid catching broad exceptions in search
method.
Catching all exceptions using except Exception
can mask unexpected errors and make debugging more difficult. It's better to catch specific exceptions that you anticipate might occur during database operations or vector computations.
Consider modifying the exception handling:
try:
# Database operations
+ except (SpecificException1, SpecificException2) as e:
+ logging.error(f"Error during search: {e}")
+ return []
except Exception as e:
- print(f"Error during search: {e}")
- return []
+ # Re-raise unexpected exceptions
+ raise e
Replace SpecificException1
, SpecificException2
, etc., with the actual exceptions you expect (e.g., SQLAlchemyError
).
Committable suggestion was skipped due to low confidence.
Removed broad exception handling for the search function in PGVectorAdapter Refactor #COG-170
Add PGVector support for cognee
Summary by CodeRabbit
Release Notes
New Features
Bug Fixes
Documentation
Chores
- docker-compose.yml updated to use the new pgvector PostgreSQL image.
- pyproject.toml updated to include new dependencies for pgvector.
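For completeness, a hedged example of how a user might opt into the new pgvector backend through the dictionary-based configuration added in this PR; the host, port, and credential values are placeholders, and the exact keys may differ from the final API:
import cognee

cognee.config.set_vector_db_config({
    "vector_db_provider": "pgvector",
})
cognee.config.set_relational_db_config({
    "db_name": "cognee_db",
    "db_host": "127.0.0.1",
    "db_port": "5432",
    "db_username": "cognee",  # placeholder credentials, not real secrets
    "db_password": "cognee",
})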