-
Notifications
You must be signed in to change notification settings - Fork 89
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Cog 475 local file endpoint deletion #391
Conversation
Delete local files when deleting data from cognee Feature COG-475
Save current development status Refactor
Add temp development state to branch Chore
Delete local files made by cognee when deleting data from database through endpoint Feature COG-475
WalkthroughThe pull request introduces modifications to the data deletion process in the Changes
Sequence DiagramsequenceDiagram
participant User
participant DeleteMethod
participant SqlAlchemyAdapter
participant Database
participant FileSystem
User->>DeleteMethod: Request entity deletion
DeleteMethod->>SqlAlchemyAdapter: Call delete_data_entity
SqlAlchemyAdapter->>Database: Delete entity record
SqlAlchemyAdapter->>FileSystem: Remove associated local files
SqlAlchemyAdapter-->>DeleteMethod: Deletion confirmation
DeleteMethod-->>User: Deletion complete
Possibly related PRs
Poem
Tip CodeRabbit's docstrings feature is now available as part of our Early Access Program! Simply use the command Thank you for using CodeRabbit. We offer it for free to the OSS community and would appreciate your support in helping us grow. If you find it useful, would you consider giving us a shout-out on your favorite social media? 🪧 TipsChatThere are 3 ways to chat with CodeRabbit:
Note: Be mindful of the bot's finite context window. It's strongly recommended to break down tasks such as reading entire modules into smaller chunks. For a focused discussion, use review comments to chat about specific files and their changes, instead of using the PR comments. CodeRabbit Commands (Invoked using PR comments)
Other keywords and placeholders
CodeRabbit Configuration File (
|
Revert changes on test_pgvector which were made to test deletion of local files Test COG-475
Added test that checks local files created by cognee will be deleted and those not created by cognee won't Test COG-475
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Actionable comments posted: 5
🔭 Outside diff range comments (1)
cognee/infrastructure/databases/relational/sqlalchemy/SqlAlchemyAdapter.py (1)
Line range hint
93-113
: Add error handling for entity not found scenarios.While the rename improves clarity, the method should handle cases where the entity doesn't exist, similar to
delete_data_entity
.Consider applying this improvement:
async def delete_entity_by_id(self, table_name: str, data_id: UUID, schema_name: Optional[str] = "public"): """ Delete entity in given table based on id. Table must have an id Column. """ - if self.engine.dialect.name == "sqlite": - async with self.get_async_session() as session: - TableModel = await self.get_table(table_name, schema_name) + async with self.get_async_session() as session: + try: + TableModel = await self.get_table(table_name, schema_name) + if self.engine.dialect.name == "sqlite": + # Foreign key constraints are disabled by default in SQLite + await session.execute(text("PRAGMA foreign_keys = ON;")) - # Foreign key constraints are disabled by default in SQLite (for backwards compatibility), - # so must be enabled for each database connection/session separately. - await session.execute(text("PRAGMA foreign_keys = ON;")) - await session.execute(TableModel.delete().where(TableModel.c.id == data_id)) - await session.commit() - else: - async with self.get_async_session() as session: - TableModel = await self.get_table(table_name, schema_name) - await session.execute(TableModel.delete().where(TableModel.c.id == data_id)) - await session.commit() + result = await session.execute(TableModel.delete().where(TableModel.c.id == data_id)) + if result.rowcount == 0: + raise EntityNotFoundError(f"Entity with id {data_id} not found in table {table_name}") + await session.commit() + except Exception as e: + await session.rollback() + raise
🧹 Nitpick comments (2)
cognee/tests/test_pgvector.py (1)
41-49
: Remove duplicate code in test_getting_of_documentsThe user retrieval code is duplicated. Consider refactoring to reduce duplication:
async def test_getting_of_documents(dataset_name_1): from cognee.modules.users.permissions.methods import get_document_ids_for_user user = await get_default_user() + + # Test with specific dataset document_ids = await get_document_ids_for_user(user.id, [dataset_name_1]) assert len(document_ids) == 1, f"Number of expected documents doesn't match {len(document_ids)} != 1" - # Test getting of documents for search when no dataset is provided - user = await get_default_user() + # Test without dataset filter document_ids = await get_document_ids_for_user(user.id) assert len(document_ids) == 2, f"Number of expected documents doesn't match {len(document_ids)} != 2"cognee/infrastructure/databases/relational/sqlalchemy/SqlAlchemyAdapter.py (1)
130-131
: Consider optimizing database queries.The current implementation uses separate queries for fetching the entity and checking references. This could be optimized into a single query.
Consider this optimization:
-raw_data_location_entities = (await session.execute( - select(Data.raw_data_location).where(Data.raw_data_location == data_entity.raw_data_location))).all() - -# Don't delete local file unless this is the only reference to the file in the database -if len(raw_data_location_entities) == 1: +# Use a CTE to get reference count and delete in one query +delete_query = delete(Data).where(Data.id == data_id).returning( + Data.raw_data_location, + select(func.count('*')) + .where(Data.raw_data_location == Data.raw_data_location) + .scalar_subquery().label('ref_count') +) +result = await session.execute(delete_query) +deleted_row = result.fetchone() +if deleted_row and deleted_row.ref_count == 1:Also applies to: 141-142
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (4)
cognee/infrastructure/databases/relational/sqlalchemy/SqlAlchemyAdapter.py
(3 hunks)cognee/modules/data/methods/delete_data.py
(1 hunks)cognee/modules/data/methods/delete_dataset.py
(1 hunks)cognee/tests/test_pgvector.py
(3 hunks)
🔇 Additional comments (4)
cognee/modules/data/methods/delete_dataset.py (1)
7-7
: Verify consistent usage of the renamed method
The method call has been updated from delete_data_by_id
to delete_entity_by_id
. Let's ensure this renaming is consistent across the codebase.
✅ Verification successful
The results show that the old method name delete_data_by_id
is not found anywhere in the codebase, while the new method name delete_entity_by_id
is consistently used in both the implementation (SqlAlchemyAdapter.py
) and the caller (delete_dataset.py
). This confirms that the renaming has been done consistently.
Method renaming has been consistently applied
The verification confirms that:
- The old method name
delete_data_by_id
has been completely removed - The new method name
delete_entity_by_id
is properly implemented inSqlAlchemyAdapter.py
and correctly used indelete_dataset.py
🏁 Scripts executed
The following scripts were executed for the analysis:
Script:
#!/bin/bash
# Description: Check for any remaining occurrences of the old method name
# and verify consistent usage of the new method name
echo "Checking for any remaining occurrences of delete_data_by_id..."
rg "delete_data_by_id"
echo "\nVerifying usage of delete_entity_by_id..."
rg "delete_entity_by_id"
Length of output: 594
cognee/tests/test_pgvector.py (1)
110-110
: LGTM! Well-structured test organization
The test execution order is well thought out, with proper setup and teardown sequence. The new test functions are integrated appropriately within the existing test flow.
Also applies to: 140-141
cognee/infrastructure/databases/relational/sqlalchemy/SqlAlchemyAdapter.py (2)
7-13
: LGTM: Import changes align with new functionality.
The new imports support the local file deletion feature with proper SQL operations and error handling.
124-127
: LGTM: Proper error handling for entity not found scenarios.
The error handling using try-except with EntityNotFoundError is well implemented.
@@ -17,4 +17,4 @@ async def delete_data(data: Data): | |||
|
|||
db_engine = get_relational_engine() | |||
|
|||
return await db_engine.delete_data_by_id(data.__tablename__, data.id) | |||
return await db_engine.delete_data_entity(data.id) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
💡 Codebase verification
Based on the gathered information, I can now provide the final response:
Error handling improvements needed for file operations
The delete_data_entity
implementation has proper error handling for:
- Non-existent data IDs (raises EntityNotFoundError)
- Concurrent deletion (handled via foreign key constraints)
However, the file deletion operation using os.remove()
lacks try-catch error handling, which could fail due to:
- Permission issues
- File already being deleted
- File system errors
Action items:
- Wrap the
os.remove()
operation in a try-catch block - Handle potential OSError/IOError exceptions
- Consider logging file deletion failures
🔗 Analysis chain
Verify error handling in delete_data_entity
The function now uses delete_data_entity
which should handle local file deletion. Ensure that error handling is comprehensive, particularly for:
- Non-existent data IDs
- Failed file deletions
- Concurrent deletion attempts
Let me gather more information about the error handling implementation since the previous results were truncated.
Let me check one more aspect to see how the file deletion errors are handled.
🏁 Scripts executed
The following scripts were executed for the analysis:
Script:
#!/bin/bash
# Description: Look for error handling in the delete_data_entity implementation
echo "Checking error handling in delete_data_entity implementation..."
rg -A 10 "async def delete_data_entity"
echo "\nChecking for try-catch blocks around file operations..."
ast-grep --pattern 'try {
$$$
delete_data_entity
$$$
}'
Length of output: 1758
Script:
#!/bin/bash
# Get full implementation of delete_data_entity including try-catch blocks
rg -A 20 "async def delete_data_entity"
# Look for any file deletion operations in the same file
echo "\nChecking file deletion operations..."
rg "unlink|remove|delete.*file" cognee/infrastructure/databases/relational/sqlalchemy/SqlAlchemyAdapter.py
Length of output: 3324
Script:
#!/bin/bash
# Look for try-catch blocks around the file deletion operation
rg -B 5 -A 5 "os.remove\(.*raw_data_location" cognee/infrastructure/databases/relational/sqlalchemy/SqlAlchemyAdapter.py
# Check if there are any custom exceptions defined for file operations
echo "\nChecking for file operation related exceptions..."
rg "class.*Error" cognee/infrastructure/databases/relational/sqlalchemy/SqlAlchemyAdapter.py
Length of output: 940
encoded_text = data_text.encode("utf-8") | ||
data_hash = hashlib.md5(encoded_text).hexdigest() |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
🛠️ Refactor suggestion
Consider using a more secure hashing algorithm
MD5 is cryptographically weak and susceptible to hash collisions. Consider using a more secure algorithm like SHA-256.
- encoded_text = data_text.encode("utf-8")
- data_hash = hashlib.md5(encoded_text).hexdigest()
+ encoded_text = data_text.encode("utf-8")
+ data_hash = hashlib.sha256(encoded_text).hexdigest()
📝 Committable suggestion
‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.
encoded_text = data_text.encode("utf-8") | |
data_hash = hashlib.md5(encoded_text).hexdigest() | |
encoded_text = data_text.encode("utf-8") | |
data_hash = hashlib.sha256(encoded_text).hexdigest() |
cognee/infrastructure/databases/relational/sqlalchemy/SqlAlchemyAdapter.py
Outdated
Show resolved
Hide resolved
async def delete_data_entity(self, data_id: UUID): | ||
""" | ||
Delete data and local files related to data if there are no references to it anymore. | ||
""" | ||
async with self.get_async_session() as session: | ||
if self.engine.dialect.name == "sqlite": | ||
# Foreign key constraints are disabled by default in SQLite (for backwards compatibility), | ||
# so must be enabled for each database connection/session separately. | ||
await session.execute(text("PRAGMA foreign_keys = ON;")) | ||
|
||
try: | ||
data_entity = (await session.scalars(select(Data).where(Data.id == data_id))).one() | ||
except (ValueError, NoResultFound) as e: | ||
raise EntityNotFoundError(message=f"Entity not found: {str(e)}") | ||
|
||
# Check if other data objects point to the same raw data location | ||
raw_data_location_entities = (await session.execute( | ||
select(Data.raw_data_location).where(Data.raw_data_location == data_entity.raw_data_location))).all() | ||
|
||
# Don't delete local file unless this is the only reference to the file in the database | ||
if len(raw_data_location_entities) == 1: | ||
# delete local file only if it's created by cognee | ||
from cognee.base_config import get_base_config | ||
config = get_base_config() | ||
if config.data_root_directory in raw_data_location_entities[0].raw_data_location: | ||
os.remove(raw_data_location_entities[0].raw_data_location) | ||
|
||
await session.execute(delete(Data).where(Data.id == data_id)) | ||
await session.commit() |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
💡 Codebase verification
Add error handling and validation for file deletion operations
The verification revealed several issues with file deletion implementations:
- In
SqlAlchemyAdapter.py
, file deletion lacks both existence validation and error handling, which could lead to unhandled exceptions. - While
LocalStorage.py
andnetworkx/adapter.py
properly check file existence before deletion, only the networkx adapter implements proper error handling.
Key locations requiring attention:
cognee/infrastructure/databases/relational/sqlalchemy/SqlAlchemyAdapter.py
: Add file existence check and try-catch block aroundos.remove
cognee/infrastructure/files/storage/LocalStorage.py
: Add error handling around file deletion operation
🔗 Analysis chain
Verify file deletion implementation across codebase.
Let's verify that all file deletion scenarios are properly handled and there are no orphaned files.
🏁 Scripts executed
The following scripts were executed for the analysis:
Script:
#!/bin/bash
# Description: Check for potential file handling issues in the codebase
# Look for other file deletion operations that might need similar handling
echo "Checking for other file deletion operations:"
rg -l "os\.remove|unlink" --type py
# Look for file path handling that might need validation
echo -e "\nChecking for file path handling:"
rg -l "raw_data_location|file_path|path\.join" --type py
# Look for TODO/FIXME comments related to file handling
echo -e "\nChecking for TODO/FIXME comments related to file handling:"
rg -i "todo|fixme|hack|xxx" -C 2 --type py | rg -i "file|path|delete"
Length of output: 3747
Script:
#!/bin/bash
# Let's examine the file deletion implementations in the identified files
echo "Checking SqlAlchemyAdapter.py implementation:"
rg "os\.remove" -B 5 -A 5 cognee/infrastructure/databases/relational/sqlalchemy/SqlAlchemyAdapter.py
echo -e "\nChecking LocalStorage.py implementation:"
rg "os\.remove" -B 5 -A 5 cognee/infrastructure/files/storage/LocalStorage.py
echo -e "\nChecking networkx adapter implementation:"
rg "os\.remove" -B 5 -A 5 cognee/infrastructure/databases/graph/networkx/adapter.py
echo -e "\nChecking for file existence validation before deletion:"
rg "os\.path\.exists|os\.path\.isfile" -B 2 -A 2 cognee/infrastructure/files/storage/LocalStorage.py cognee/infrastructure/databases/relational/sqlalchemy/SqlAlchemyAdapter.py cognee/infrastructure/databases/graph/networkx/adapter.py
echo -e "\nChecking for exception handling around file operations:"
rg "try.*:.*os\.remove|except.*os\." -B 2 -A 2 cognee/infrastructure/files/storage/LocalStorage.py cognee/infrastructure/databases/relational/sqlalchemy/SqlAlchemyAdapter.py cognee/infrastructure/databases/graph/networkx/adapter.py
Length of output: 5352
cognee/infrastructure/databases/relational/sqlalchemy/SqlAlchemyAdapter.py
Show resolved
Hide resolved
from cognee.base_config import get_base_config | ||
config = get_base_config() | ||
if config.data_root_directory in raw_data_location_entities[0].raw_data_location: | ||
os.remove(raw_data_location_entities[0].raw_data_location) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Should we check if the file exists? If it doesn't, then we have a bug somewhere and it should be reported to sentry.
|
GitGuardian id | GitGuardian status | Secret | Commit | Filename | |
---|---|---|---|---|---|
9573981 | Triggered | Generic Password | 7232b04 | notebooks/hr_demo.ipynb | View secret |
🛠 Guidelines to remediate hardcoded secrets
- Understand the implications of revoking this secret by investigating where it is used in your code.
- Replace and store your secret safely. Learn here the best practices.
- Revoke and rotate this secret.
- If possible, rewrite git history. Rewriting git history is not a trivial act. You might completely break other contributing developers' workflow and you risk accidentally deleting legitimate data.
To avoid such incidents in the future consider
- following these best practices for managing and storing secrets including API keys and other credentials
- install secret detection on pre-commit to catch secret before it leaves your machine and ease remediation.
🦉 GitGuardian detects secrets in your source code to help developers and security teams secure the modern development process. You are seeing this because you or someone else with access to this repository has authorized GitGuardian to scan your pull request.
Update assertion message, add veryfing of file existence
Log to sentry that a file which should exist can't be found Chore COG-475
…oteretes/cognee into COG-475-local-file-endpoint-deletion
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Actionable comments posted: 1
♻️ Duplicate comments (1)
cognee/tests/test_pgvector.py (1)
25-26
:⚠️ Potential issueAdd error handling for database queries
The use of
.one()
will raise exceptions if no results are found. Consider adding proper error handling.- data = (await session.scalars(select(Data).where(Data.content_hash == data_hash))).one() + try: + data = (await session.scalars(select(Data).where(Data.content_hash == data_hash))).one() + except NoResultFound: + raise AssertionError(f"No data found with hash {data_hash}")Also applies to: 33-34
🧹 Nitpick comments (3)
cognee/tests/test_pgvector.py (2)
13-39
: Add docstring to explain test scenariosThe test function should have a docstring explaining the two test scenarios: deletion of cognee-created files and preservation of external files.
async def test_local_file_deletion(data_text, file_location): + """ + Test local file deletion behavior: + 1. Verify that files created by cognee are deleted when their data entity is deleted + 2. Verify that files not created by cognee are preserved when their data entity is deleted + """
29-30
: Improve assertion messagesThe assertion messages should clearly indicate the expected vs actual state.
- assert not os.path.exists(data.raw_data_location), f"Data location still exists after deletion: {data.raw_data_location}" + assert not os.path.exists(data.raw_data_location), f"Expected file to be deleted but it still exists at: {data.raw_data_location}" - assert os.path.exists(data.raw_data_location), f"Data location doesn't exists: {data.raw_data_location}" + assert os.path.exists(data.raw_data_location), f"Expected file to be preserved but it was deleted at: {data.raw_data_location}"Also applies to: 37-38
cognee/infrastructure/databases/relational/sqlalchemy/SqlAlchemyAdapter.py (1)
133-138
: Consider race conditions in reference countingThe current implementation checks for file references but doesn't handle potential race conditions where another process might add/remove references between the check and deletion.
Consider implementing a proper reference counting mechanism or using database transactions to ensure atomicity.
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (2)
cognee/infrastructure/databases/relational/sqlalchemy/SqlAlchemyAdapter.py
(3 hunks)cognee/tests/test_pgvector.py
(3 hunks)
🔇 Additional comments (3)
cognee/tests/test_pgvector.py (1)
111-111
: Consider test order dependency
The test execution order matters here as test_getting_of_documents
relies on data being present, while test_local_file_deletion
deletes data. Consider adding cleanup and setup methods to ensure test isolation.
Also applies to: 141-141
cognee/infrastructure/databases/relational/sqlalchemy/SqlAlchemyAdapter.py (2)
Line range hint 97-117
: LGTM! Method rename improves clarity
The rename from delete_data_by_id
to delete_entity_by_id
better reflects the generic nature of this method, as it can delete any entity type, not just data.
144-146
:
Add path validation and error handling for file operations
The current implementation is vulnerable to path traversal attacks and lacks proper error handling for file operations.
- if config.data_root_directory in raw_data_location_entities[0].raw_data_location:
- if os.path.exists(raw_data_location_entities[0].raw_data_location):
- os.remove(raw_data_location_entities[0].raw_data_location)
+ file_path = raw_data_location_entities[0].raw_data_location
+ try:
+ # Validate file path is within allowed directory
+ real_path = os.path.realpath(file_path)
+ if not real_path.startswith(os.path.realpath(config.data_root_directory)):
+ logger.error(f"Attempted file deletion outside allowed directory: {file_path}")
+ return
+
+ if os.path.exists(file_path):
+ os.remove(file_path)
+ except OSError as e:
+ logger.error(f"Error deleting file {file_path}: {e}")
Likely invalid or redundant comment.
async def test_getting_of_documents(dataset_name_1): | ||
# Test getting of documents for search per dataset | ||
from cognee.modules.users.permissions.methods import get_document_ids_for_user | ||
user = await get_default_user() | ||
document_ids = await get_document_ids_for_user(user.id, [dataset_name_1]) | ||
assert len(document_ids) == 1, f"Number of expected documents doesn't match {len(document_ids)} != 1" | ||
|
||
# Test getting of documents for search when no dataset is provided | ||
user = await get_default_user() | ||
document_ids = await get_document_ids_for_user(user.id) | ||
assert len(document_ids) == 2, f"Number of expected documents doesn't match {len(document_ids)} != 2" |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
🛠️ Refactor suggestion
Improve test function structure and assertions
Several improvements needed:
- Add docstring
- Improve assertion messages
- Remove duplicate user retrieval
async def test_getting_of_documents(dataset_name_1):
+ """
+ Test document retrieval functionality:
+ 1. Verify correct document count when filtering by dataset
+ 2. Verify correct document count when no dataset filter is applied
+ """
# Test getting of documents for search per dataset
from cognee.modules.users.permissions.methods import get_document_ids_for_user
user = await get_default_user()
document_ids = await get_document_ids_for_user(user.id, [dataset_name_1])
- assert len(document_ids) == 1, f"Number of expected documents doesn't match {len(document_ids)} != 1"
+ assert len(document_ids) == 1, f"Expected 1 document when filtering by dataset, but got {len(document_ids)}"
# Test getting of documents for search when no dataset is provided
- user = await get_default_user() # Duplicate user retrieval
document_ids = await get_document_ids_for_user(user.id)
- assert len(document_ids) == 2, f"Number of expected documents doesn't match {len(document_ids)} != 2"
+ assert len(document_ids) == 2, f"Expected 2 documents without dataset filter, but got {len(document_ids)}"
📝 Committable suggestion
‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.
async def test_getting_of_documents(dataset_name_1): | |
# Test getting of documents for search per dataset | |
from cognee.modules.users.permissions.methods import get_document_ids_for_user | |
user = await get_default_user() | |
document_ids = await get_document_ids_for_user(user.id, [dataset_name_1]) | |
assert len(document_ids) == 1, f"Number of expected documents doesn't match {len(document_ids)} != 1" | |
# Test getting of documents for search when no dataset is provided | |
user = await get_default_user() | |
document_ids = await get_document_ids_for_user(user.id) | |
assert len(document_ids) == 2, f"Number of expected documents doesn't match {len(document_ids)} != 2" | |
async def test_getting_of_documents(dataset_name_1): | |
""" | |
Test document retrieval functionality: | |
1. Verify correct document count when filtering by dataset | |
2. Verify correct document count when no dataset filter is applied | |
""" | |
# Test getting of documents for search per dataset | |
from cognee.modules.users.permissions.methods import get_document_ids_for_user | |
user = await get_default_user() | |
document_ids = await get_document_ids_for_user(user.id, [dataset_name_1]) | |
assert len(document_ids) == 1, f"Expected 1 document when filtering by dataset, but got {len(document_ids)}" | |
# Test getting of documents for search when no dataset is provided | |
document_ids = await get_document_ids_for_user(user.id) | |
assert len(document_ids) == 2, f"Expected 2 documents without dataset filter, but got {len(document_ids)}" |
Add deletion of local files made by cognee on deletion of data through endpoints
Summary by CodeRabbit
New Features
Bug Fixes
Tests
Documentation