fix: Resolve reflection issue when running cognee a second time after… #373
Conversation
… pruning data

When running cognee a second time after pruning data, some metadata doesn't get pruned. This makes cognee believe that tables which have already been deleted still exist.

Fix
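The failure mode can be illustrated with a stripped-down sketch, using plain sqlite3 rather than cognee's actual SQLAlchemy code (the chunks table is a made-up example): a schema registry cached at startup, like SQLAlchemy's Base.metadata, still lists tables that pruning has since dropped.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE chunks (id INTEGER PRIMARY KEY)")

# Cache the schema once, the way a declarative Base.metadata effectively does.
cached_tables = {row[0] for row in conn.execute(
    "SELECT name FROM sqlite_master WHERE type='table'")}

# Pruning drops the table behind the cache's back.
conn.execute("DROP TABLE chunks")

print("chunks" in cached_tables)  # → True (stale: the cache was never updated)

# A fresh reflection of the live schema gives the correct answer.
live_tables = {row[0] for row in conn.execute(
    "SELECT name FROM sqlite_master WHERE type='table'")}
print("chunks" in live_tables)  # → False
```

Reflecting the schema at check time, as the patch does with a fresh MetaData instance, is the equivalent of the second query here.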
Added the fix when reflecting metadata for SQLite as well.

Fix
This reverts commit 394a0b2.
Actionable comments posted: 2
🔭 Outside diff range comments (1)
cognee/infrastructure/databases/vector/pgvector/PGVectorAdapter.py (1)
Potential issue with metadata cleanup during pruning operations
The implementation has a few concerns regarding proper cleanup during pruning:
- The prune() method in PGVectorAdapter only calls delete_database(), which isn't implemented in the shown code
- There's no explicit cleanup of Base.metadata after table deletions
- The engine disposal is missing in the pruning operation

This could lead to stale metadata state after pruning operations. Specific changes needed:

- Implement a proper delete_database() method with table cleanup
- Add Base.metadata.clear() after table deletions
- Include engine.dispose() in cleanup

🔗 Analysis chain
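In stripped-down form, the cleanup order recommended above can be sketched as follows. Plain sqlite3 stands in for cognee's async SQLAlchemy engine; TinyAdapter and its attribute names are made up for illustration, not cognee's API.

```python
import sqlite3

class TinyAdapter:
    """Toy stand-in for an adapter: a live connection plus a schema cache."""

    def __init__(self):
        self.conn = sqlite3.connect(":memory:")
        self.metadata_cache = {}  # stands in for Base.metadata.tables

    def create_collection(self, name):
        self.conn.execute(f"CREATE TABLE {name} (id INTEGER PRIMARY KEY)")
        self.metadata_cache[name] = name

    def delete_database(self):
        # 1. Drop every table the cache knows about.
        for name in list(self.metadata_cache):
            self.conn.execute(f"DROP TABLE {name}")
        # 2. Clear the cached metadata so no stale entries survive
        #    (the Base.metadata.clear() step).
        self.metadata_cache.clear()
        # 3. Release the underlying connection (the engine.dispose() step).
        self.conn.close()

adapter = TinyAdapter()
adapter.create_collection("chunks")
adapter.delete_database()
print(adapter.metadata_cache)  # → {}
```

Skipping step 2 is exactly what leaves phantom table entries behind after pruning.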
Line range hint 51-155: Verify fix handles all pruning scenarios.

While the changes address the reflection issue, let's verify that it works correctly in all pruning scenarios.
🏁 Scripts executed
The following scripts were executed for the analysis:
Script:
#!/bin/bash
# Description: Verify the interaction between pruning and reflection

# Check for any remaining references to Base.metadata in the codebase
echo "Checking for remaining Base.metadata references..."
rg "Base\.metadata" --type py

# Check for proper cleanup in prune method
echo "Checking prune method implementation..."
rg "async def prune" -A 10 --type py

# Look for any TODO/FIXME comments related to pruning
echo "Checking for pruning-related TODOs..."
rg "TODO.*prun|FIXME.*prun" --type py

Length of output: 7480
Script:
#!/bin/bash
# Verify the PGVectorAdapter's delete_database implementation and its interaction with Base.metadata

# Check delete_database implementation
echo "Checking delete_database implementation in PGVectorAdapter..."
rg "async def delete_database" -A 15 --type py cognee/infrastructure/databases/vector/pgvector/PGVectorAdapter.py

# Check if there's proper cleanup of Base.metadata after deletion
echo "Checking Base.metadata cleanup after deletion..."
rg "Base\.metadata\.clear|Base\.metadata\.remove" --type py

# Check for any drop table implementations
echo "Checking drop table implementations..."
rg "drop.*table|DROP.*TABLE" -A 5 --type py

Length of output: 4315
Script:
#!/bin/bash
# Check the complete implementation of PGVectorAdapter to understand table cleanup

echo "Checking PGVectorAdapter implementation..."
cat cognee/infrastructure/databases/vector/pgvector/PGVectorAdapter.py

echo "Checking for any database cleanup utilities..."
rg "async def.*clean|async def.*delete" -A 10 cognee/infrastructure/databases/vector/pgvector/PGVectorAdapter.py

echo "Checking for any database connection disposal..."
rg "dispose|close_all" --type py

Length of output: 12500
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (1)
cognee/infrastructure/databases/vector/pgvector/PGVectorAdapter.py
(3 hunks)
🔇 Additional comments (1)
cognee/infrastructure/databases/vector/pgvector/PGVectorAdapter.py (1)
Line 5: LGTM: Import addition is appropriate.

The addition of the MetaData import is necessary for implementing the new reflection mechanism.
  # Create a MetaData instance to load table information
  metadata = MetaData()
  # Load table information from schema into MetaData
  await connection.run_sync(metadata.reflect)

- if collection_name in Base.metadata.tables:
+ if collection_name in metadata.tables:
🛠️ Refactor suggestion
Good fix for reflection issue, but consider performance optimization.
The change correctly addresses the reflection issue by dynamically loading table information instead of relying on cached metadata. However, there are two concerns to address:
- Performance: Reflecting the entire schema on each collection check could be expensive.
- Error handling: Database reflection could fail and should be handled gracefully.
Consider implementing these improvements:
async def has_collection(self, collection_name: str) -> bool:
async with self.engine.begin() as connection:
# Create a MetaData instance to load table information
metadata = MetaData()
- # Load table information from schema into MetaData
- await connection.run_sync(metadata.reflect)
+ try:
+ # Only reflect the specific table we're interested in
+ await connection.run_sync(lambda conn: metadata.reflect(conn, only=[collection_name]))
+ return collection_name in metadata.tables
+ except Exception as e:
+ logger.error(f"Failed to reflect table {collection_name}: {str(e)}")
+ return False
-
- if collection_name in metadata.tables:
- return True
- else:
- return False
📝 Committable suggestion
‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.
# Create a MetaData instance to load table information
metadata = MetaData()
try:
    # Only reflect the specific table we're interested in
    await connection.run_sync(lambda conn: metadata.reflect(conn, only=[collection_name]))
    return collection_name in metadata.tables
except Exception as e:
    logger.error(f"Failed to reflect table {collection_name}: {str(e)}")
    return False
# Create a MetaData instance to load table information
metadata = MetaData()
# Load table information from schema into MetaData
await connection.run_sync(metadata.reflect)
if collection_name in metadata.tables:
    return metadata.tables[collection_name]
🛠️ Refactor suggestion
Consider refactoring to reduce code duplication and improve performance.
The table reflection logic is duplicated between has_collection and get_table. Consider extracting this into a shared private method for better maintainability and consistent error handling.
Here's a suggested refactor:
+ async def _reflect_table(self, connection, table_name: str) -> MetaData:
+ """
+ Reflect a specific table from the database.
+
+ Args:
+ connection: The database connection
+ table_name: Name of the table to reflect
+
+ Returns:
+ MetaData object containing the reflected table
+
+ Raises:
+ EntityNotFoundError if table doesn't exist
+ """
+ metadata = MetaData()
+ try:
+ await connection.run_sync(lambda conn: metadata.reflect(conn, only=[table_name]))
+ if table_name not in metadata.tables:
+ raise EntityNotFoundError(f"Table '{table_name}' not found.")
+ return metadata
+ except Exception as e:
+ if isinstance(e, EntityNotFoundError):
+ raise
+ raise EntityNotFoundError(f"Failed to reflect table '{table_name}': {str(e)}")
async def has_collection(self, collection_name: str) -> bool:
async with self.engine.begin() as connection:
- metadata = MetaData()
- await connection.run_sync(metadata.reflect)
- if collection_name in metadata.tables:
+ try:
+ await self._reflect_table(connection, collection_name)
return True
- else:
+ except EntityNotFoundError:
return False
async def get_table(self, collection_name: str) -> Table:
async with self.engine.begin() as connection:
- metadata = MetaData()
- await connection.run_sync(metadata.reflect)
- if collection_name in metadata.tables:
- return metadata.tables[collection_name]
- else:
- raise EntityNotFoundError(message=f"Table '{collection_name}' not found.")
+ metadata = await self._reflect_table(connection, collection_name)
+ return metadata.tables[collection_name]
This refactor:
- Reduces code duplication
- Provides consistent error handling
- Improves performance by only reflecting the needed table
- Makes the code more maintainable
📝 Committable suggestion
async def _reflect_table(self, connection, table_name: str) -> MetaData:
    """
    Reflect a specific table from the database.

    Args:
        connection: The database connection
        table_name: Name of the table to reflect

    Returns:
        MetaData object containing the reflected table

    Raises:
        EntityNotFoundError if table doesn't exist
    """
    metadata = MetaData()
    try:
        await connection.run_sync(lambda conn: metadata.reflect(conn, only=[table_name]))
        if table_name not in metadata.tables:
            raise EntityNotFoundError(f"Table '{table_name}' not found.")
        return metadata
    except Exception as e:
        if isinstance(e, EntityNotFoundError):
            raise
        raise EntityNotFoundError(f"Failed to reflect table '{table_name}': {str(e)}")

async def has_collection(self, collection_name: str) -> bool:
    async with self.engine.begin() as connection:
        try:
            await self._reflect_table(connection, collection_name)
            return True
        except EntityNotFoundError:
            return False

async def get_table(self, collection_name: str) -> Table:
    async with self.engine.begin() as connection:
        metadata = await self._reflect_table(connection, collection_name)
        return metadata.tables[collection_name]