Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Cog 475 local file endpoint deletion #391

Merged
merged 13 commits into from
Dec 26, 2024

Conversation

dexters1
Copy link
Collaborator

@dexters1 dexters1 commented Dec 19, 2024

Add deletion of local files made by cognee on deletion of data through endpoints

Summary by CodeRabbit

  • New Features

    • Introduced a new method for deleting data entities along with associated local files.
    • Enhanced error handling for entity deletion operations.
  • Bug Fixes

    • Updated logic for deleting data records to improve functionality and efficiency.
  • Tests

    • Added new tests for verifying local file deletion and document retrieval based on user datasets.
  • Documentation

    • Updated method signatures and documentation to reflect changes in functionality.

Delete local files when deleting data from cognee

Feature COG-475
Save current development status

Refactor
Add temp development state to branch

Chore
Delete local files made by cognee when deleting data from database through endpoint

Feature COG-475
@dexters1 dexters1 self-assigned this Dec 19, 2024
Copy link
Contributor

coderabbitai bot commented Dec 19, 2024

Walkthrough

The pull request introduces modifications to the data deletion process in the cognee system. The changes primarily focus on the SqlAlchemyAdapter class, updating methods for entity deletion and improving error handling. A new method delete_data_entity is added to handle complete data entity deletion, including associated local files. The existing delete_data_by_id method is renamed to delete_entity_by_id, and related methods in data deletion modules are updated accordingly. Additionally, new test functions are introduced to verify local file deletion and document retrieval functionality.

Changes

File Change Summary
cognee/infrastructure/databases/relational/sqlalchemy/SqlAlchemyAdapter.py - Added delete_data_entity method for comprehensive entity deletion
- Renamed delete_data_by_id to delete_entity_by_id
- Updated import statements to include delete and NoResultFound
cognee/modules/data/methods/delete_data.py - Updated method call from delete_data_by_id to delete_data_entity
cognee/modules/data/methods/delete_dataset.py - Changed method call to delete_entity_by_id
cognee/tests/test_pgvector.py - Added test_local_file_deletion test function
- Added test_getting_of_documents test function

Sequence Diagram

sequenceDiagram
    participant User
    participant DeleteMethod
    participant SqlAlchemyAdapter
    participant Database
    participant FileSystem

    User->>DeleteMethod: Request entity deletion
    DeleteMethod->>SqlAlchemyAdapter: Call delete_data_entity
    SqlAlchemyAdapter->>Database: Delete entity record
    SqlAlchemyAdapter->>FileSystem: Remove associated local files
    SqlAlchemyAdapter-->>DeleteMethod: Deletion confirmation
    DeleteMethod-->>User: Deletion complete
Loading

Possibly related PRs

Poem

🐰 A Rabbit's Deletion Dance 🗑️
With SQLAlchemy's mighty hand,
We sweep through data's shifting sand,
Deleting entities with grace,
Leaving no trace, no file misplaced,
A clean slate for our code's embrace!

Tip

CodeRabbit's docstrings feature is now available as part of our Early Access Program! Simply use the command @coderabbitai generate docstrings to have CodeRabbit automatically generate docstrings for your pull request. We would love to hear your feedback on Discord.


Thank you for using CodeRabbit. We offer it for free to the OSS community and would appreciate your support in helping us grow. If you find it useful, would you consider giving us a shout-out on your favorite social media?

❤️ Share
🪧 Tips

Chat

There are 3 ways to chat with CodeRabbit:

  • Review comments: Directly reply to a review comment made by CodeRabbit. Example:
    • I pushed a fix in commit <commit_id>, please review it.
    • Generate unit testing code for this file.
    • Open a follow-up GitHub issue for this discussion.
  • Files and specific lines of code (under the "Files changed" tab): Tag @coderabbitai in a new review comment at the desired location with your query. Examples:
    • @coderabbitai generate unit testing code for this file.
    • @coderabbitai modularize this function.
  • PR comments: Tag @coderabbitai in a new PR comment to ask questions about the PR branch. For the best results, please provide a very specific query, as very limited context is provided in this mode. Examples:
    • @coderabbitai gather interesting stats about this repository and render them as a table. Additionally, render a pie chart showing the language distribution in the codebase.
    • @coderabbitai read src/utils.ts and generate unit testing code.
    • @coderabbitai read the files in the src/scheduler package and generate a class diagram using mermaid and a README in the markdown format.
    • @coderabbitai help me debug CodeRabbit configuration file.

Note: Be mindful of the bot's finite context window. It's strongly recommended to break down tasks such as reading entire modules into smaller chunks. For a focused discussion, use review comments to chat about specific files and their changes, instead of using the PR comments.

CodeRabbit Commands (Invoked using PR comments)

  • @coderabbitai pause to pause the reviews on a PR.
  • @coderabbitai resume to resume the paused reviews.
  • @coderabbitai review to trigger an incremental review. This is useful when automatic reviews are disabled for the repository.
  • @coderabbitai full review to do a full review from scratch and review all the files again.
  • @coderabbitai summary to regenerate the summary of the PR.
  • @coderabbitai generate docstrings to generate docstrings for this PR. (Beta)
  • @coderabbitai resolve resolve all the CodeRabbit review comments.
  • @coderabbitai configuration to show the current CodeRabbit configuration for the repository.
  • @coderabbitai help to get help.

Other keywords and placeholders

  • Add @coderabbitai ignore anywhere in the PR description to prevent this PR from being reviewed.
  • Add @coderabbitai summary to generate the high-level summary at a specific location in the PR description.
  • Add @coderabbitai anywhere in the PR title to generate the title automatically.

CodeRabbit Configuration File (.coderabbit.yaml)

  • You can programmatically configure CodeRabbit by adding a .coderabbit.yaml file to the root of your repository.
  • Please see the configuration documentation for more information.
  • If your editor has YAML language server enabled, you can add the path at the top of this file to enable auto-completion and validation: # yaml-language-server: $schema=https://coderabbit.ai/integrations/schema.v2.json

Documentation and Community

  • Visit our Documentation for detailed information on how to use CodeRabbit.
  • Join our Discord Community to get help, request features, and share feedback.
  • Follow us on X/Twitter for updates and announcements.

Revert changes on test_pgvector which were made to test deletion of local files

Test COG-475
Added test that checks local files created by cognee will be deleted and those not created by cognee won't

Test COG-475
@dexters1 dexters1 marked this pull request as ready for review December 19, 2024 16:34
Copy link
Contributor

@coderabbitai coderabbitai bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 5

🔭 Outside diff range comments (1)
cognee/infrastructure/databases/relational/sqlalchemy/SqlAlchemyAdapter.py (1)

Line range hint 93-113: Add error handling for entity not found scenarios.

While the rename improves clarity, the method should handle cases where the entity doesn't exist, similar to delete_data_entity.

Consider applying this improvement:

 async def delete_entity_by_id(self, table_name: str, data_id: UUID, schema_name: Optional[str] = "public"):
     """
     Delete entity in given table based on id. Table must have an id Column.
     """
-    if self.engine.dialect.name == "sqlite":
-        async with self.get_async_session() as session:
-            TableModel = await self.get_table(table_name, schema_name)
+    async with self.get_async_session() as session:
+        try:
+            TableModel = await self.get_table(table_name, schema_name)
+            if self.engine.dialect.name == "sqlite":
+                # Foreign key constraints are disabled by default in SQLite
+                await session.execute(text("PRAGMA foreign_keys = ON;"))

-            # Foreign key constraints are disabled by default in SQLite (for backwards compatibility),
-            # so must be enabled for each database connection/session separately.
-            await session.execute(text("PRAGMA foreign_keys = ON;"))

-            await session.execute(TableModel.delete().where(TableModel.c.id == data_id))
-            await session.commit()
-    else:
-        async with self.get_async_session() as session:
-            TableModel = await self.get_table(table_name, schema_name)
-            await session.execute(TableModel.delete().where(TableModel.c.id == data_id))
-            await session.commit()
+            result = await session.execute(TableModel.delete().where(TableModel.c.id == data_id))
+            if result.rowcount == 0:
+                raise EntityNotFoundError(f"Entity with id {data_id} not found in table {table_name}")
+            await session.commit()
+        except Exception as e:
+            await session.rollback()
+            raise
🧹 Nitpick comments (2)
cognee/tests/test_pgvector.py (1)

41-49: Remove duplicate code in test_getting_of_documents

The user retrieval code is duplicated. Consider refactoring to reduce duplication:

 async def test_getting_of_documents(dataset_name_1):
     from cognee.modules.users.permissions.methods import get_document_ids_for_user
     user = await get_default_user()
+    
+    # Test with specific dataset
     document_ids = await get_document_ids_for_user(user.id, [dataset_name_1])
     assert len(document_ids) == 1, f"Number of expected documents doesn't match {len(document_ids)} != 1"

-    # Test getting of documents for search when no dataset is provided
-    user = await get_default_user()
+    # Test without dataset filter
     document_ids = await get_document_ids_for_user(user.id)
     assert len(document_ids) == 2, f"Number of expected documents doesn't match {len(document_ids)} != 2"
cognee/infrastructure/databases/relational/sqlalchemy/SqlAlchemyAdapter.py (1)

130-131: Consider optimizing database queries.

The current implementation uses separate queries for fetching the entity and checking references. This could be optimized into a single query.

Consider this optimization:

-raw_data_location_entities = (await session.execute(
-    select(Data.raw_data_location).where(Data.raw_data_location == data_entity.raw_data_location))).all()
-
-# Don't delete local file unless this is the only reference to the file in the database
-if len(raw_data_location_entities) == 1:
+# Use a CTE to get reference count and delete in one query
+delete_query = delete(Data).where(Data.id == data_id).returning(
+    Data.raw_data_location,
+    select(func.count('*'))
+    .where(Data.raw_data_location == Data.raw_data_location)
+    .scalar_subquery().label('ref_count')
+)
+result = await session.execute(delete_query)
+deleted_row = result.fetchone()
+if deleted_row and deleted_row.ref_count == 1:

Also applies to: 141-142

📜 Review details

Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 14cd1b1 and 6cb7fef.

📒 Files selected for processing (4)
  • cognee/infrastructure/databases/relational/sqlalchemy/SqlAlchemyAdapter.py (3 hunks)
  • cognee/modules/data/methods/delete_data.py (1 hunks)
  • cognee/modules/data/methods/delete_dataset.py (1 hunks)
  • cognee/tests/test_pgvector.py (3 hunks)
🔇 Additional comments (4)
cognee/modules/data/methods/delete_dataset.py (1)

7-7: Verify consistent usage of the renamed method

The method call has been updated from delete_data_by_id to delete_entity_by_id. Let's ensure this renaming is consistent across the codebase.

✅ Verification successful

The results show that the old method name delete_data_by_id is not found anywhere in the codebase, while the new method name delete_entity_by_id is consistently used in both the implementation (SqlAlchemyAdapter.py) and the caller (delete_dataset.py). This confirms that the renaming has been done consistently.

Method renaming has been consistently applied

The verification confirms that:

  • The old method name delete_data_by_id has been completely removed
  • The new method name delete_entity_by_id is properly implemented in SqlAlchemyAdapter.py and correctly used in delete_dataset.py
🏁 Scripts executed

The following scripts were executed for the analysis:

Script:

#!/bin/bash
# Description: Check for any remaining occurrences of the old method name
# and verify consistent usage of the new method name

echo "Checking for any remaining occurrences of delete_data_by_id..."
rg "delete_data_by_id"

echo "\nVerifying usage of delete_entity_by_id..."
rg "delete_entity_by_id"

Length of output: 594

cognee/tests/test_pgvector.py (1)

110-110: LGTM! Well-structured test organization

The test execution order is well thought out, with proper setup and teardown sequence. The new test functions are integrated appropriately within the existing test flow.

Also applies to: 140-141

cognee/infrastructure/databases/relational/sqlalchemy/SqlAlchemyAdapter.py (2)

7-13: LGTM: Import changes align with new functionality.

The new imports support the local file deletion feature with proper SQL operations and error handling.


124-127: LGTM: Proper error handling for entity not found scenarios.

The error handling using try-except with EntityNotFoundError is well implemented.

@@ -17,4 +17,4 @@ async def delete_data(data: Data):

db_engine = get_relational_engine()

return await db_engine.delete_data_by_id(data.__tablename__, data.id)
return await db_engine.delete_data_entity(data.id)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

💡 Codebase verification

Based on the gathered information, I can now provide the final response:

Error handling improvements needed for file operations

The delete_data_entity implementation has proper error handling for:

  • Non-existent data IDs (raises EntityNotFoundError)
  • Concurrent deletion (handled via foreign key constraints)

However, the file deletion operation using os.remove() lacks try-catch error handling, which could fail due to:

  • Permission issues
  • File already being deleted
  • File system errors

Action items:

  • Wrap the os.remove() operation in a try-catch block
  • Handle potential OSError/IOError exceptions
  • Consider logging file deletion failures
🔗 Analysis chain

Verify error handling in delete_data_entity

The function now uses delete_data_entity which should handle local file deletion. Ensure that error handling is comprehensive, particularly for:

  1. Non-existent data IDs
  2. Failed file deletions
  3. Concurrent deletion attempts

Let me gather more information about the error handling implementation since the previous results were truncated.


Let me check one more aspect to see how the file deletion errors are handled.

🏁 Scripts executed

The following scripts were executed for the analysis:

Script:

#!/bin/bash
# Description: Look for error handling in the delete_data_entity implementation

echo "Checking error handling in delete_data_entity implementation..."
rg -A 10 "async def delete_data_entity"

echo "\nChecking for try-catch blocks around file operations..."
ast-grep --pattern 'try {
  $$$
  delete_data_entity
  $$$
}'

Length of output: 1758


Script:

#!/bin/bash
# Get full implementation of delete_data_entity including try-catch blocks
rg -A 20 "async def delete_data_entity"

# Look for any file deletion operations in the same file
echo "\nChecking file deletion operations..."
rg "unlink|remove|delete.*file" cognee/infrastructure/databases/relational/sqlalchemy/SqlAlchemyAdapter.py

Length of output: 3324


Script:

#!/bin/bash
# Look for try-catch blocks around the file deletion operation
rg -B 5 -A 5 "os.remove\(.*raw_data_location" cognee/infrastructure/databases/relational/sqlalchemy/SqlAlchemyAdapter.py

# Check if there are any custom exceptions defined for file operations
echo "\nChecking for file operation related exceptions..."
rg "class.*Error" cognee/infrastructure/databases/relational/sqlalchemy/SqlAlchemyAdapter.py

Length of output: 940

cognee/tests/test_pgvector.py Show resolved Hide resolved
Comment on lines +22 to +23
encoded_text = data_text.encode("utf-8")
data_hash = hashlib.md5(encoded_text).hexdigest()
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🛠️ Refactor suggestion

Consider using a more secure hashing algorithm

MD5 is cryptographically weak and susceptible to hash collisions. Consider using a more secure algorithm like SHA-256.

-        encoded_text = data_text.encode("utf-8")
-        data_hash = hashlib.md5(encoded_text).hexdigest()
+        encoded_text = data_text.encode("utf-8")
+        data_hash = hashlib.sha256(encoded_text).hexdigest()
📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change
encoded_text = data_text.encode("utf-8")
data_hash = hashlib.md5(encoded_text).hexdigest()
encoded_text = data_text.encode("utf-8")
data_hash = hashlib.sha256(encoded_text).hexdigest()

Comment on lines 114 to 142
async def delete_data_entity(self, data_id: UUID):
"""
Delete data and local files related to data if there are no references to it anymore.
"""
async with self.get_async_session() as session:
if self.engine.dialect.name == "sqlite":
# Foreign key constraints are disabled by default in SQLite (for backwards compatibility),
# so must be enabled for each database connection/session separately.
await session.execute(text("PRAGMA foreign_keys = ON;"))

try:
data_entity = (await session.scalars(select(Data).where(Data.id == data_id))).one()
except (ValueError, NoResultFound) as e:
raise EntityNotFoundError(message=f"Entity not found: {str(e)}")

# Check if other data objects point to the same raw data location
raw_data_location_entities = (await session.execute(
select(Data.raw_data_location).where(Data.raw_data_location == data_entity.raw_data_location))).all()

# Don't delete local file unless this is the only reference to the file in the database
if len(raw_data_location_entities) == 1:
# delete local file only if it's created by cognee
from cognee.base_config import get_base_config
config = get_base_config()
if config.data_root_directory in raw_data_location_entities[0].raw_data_location:
os.remove(raw_data_location_entities[0].raw_data_location)

await session.execute(delete(Data).where(Data.id == data_id))
await session.commit()
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

💡 Codebase verification

Add error handling and validation for file deletion operations

The verification revealed several issues with file deletion implementations:

  • In SqlAlchemyAdapter.py, file deletion lacks both existence validation and error handling, which could lead to unhandled exceptions.
  • While LocalStorage.py and networkx/adapter.py properly check file existence before deletion, only the networkx adapter implements proper error handling.

Key locations requiring attention:

  • cognee/infrastructure/databases/relational/sqlalchemy/SqlAlchemyAdapter.py: Add file existence check and try-catch block around os.remove
  • cognee/infrastructure/files/storage/LocalStorage.py: Add error handling around file deletion operation
🔗 Analysis chain

Verify file deletion implementation across codebase.

Let's verify that all file deletion scenarios are properly handled and there are no orphaned files.

🏁 Scripts executed

The following scripts were executed for the analysis:

Script:

#!/bin/bash
# Description: Check for potential file handling issues in the codebase

# Look for other file deletion operations that might need similar handling
echo "Checking for other file deletion operations:"
rg -l "os\.remove|unlink" --type py

# Look for file path handling that might need validation
echo -e "\nChecking for file path handling:"
rg -l "raw_data_location|file_path|path\.join" --type py

# Look for TODO/FIXME comments related to file handling
echo -e "\nChecking for TODO/FIXME comments related to file handling:"
rg -i "todo|fixme|hack|xxx" -C 2 --type py | rg -i "file|path|delete"

Length of output: 3747


Script:

#!/bin/bash
# Let's examine the file deletion implementations in the identified files

echo "Checking SqlAlchemyAdapter.py implementation:"
rg "os\.remove" -B 5 -A 5 cognee/infrastructure/databases/relational/sqlalchemy/SqlAlchemyAdapter.py

echo -e "\nChecking LocalStorage.py implementation:"
rg "os\.remove" -B 5 -A 5 cognee/infrastructure/files/storage/LocalStorage.py

echo -e "\nChecking networkx adapter implementation:"
rg "os\.remove" -B 5 -A 5 cognee/infrastructure/databases/graph/networkx/adapter.py

echo -e "\nChecking for file existence validation before deletion:"
rg "os\.path\.exists|os\.path\.isfile" -B 2 -A 2 cognee/infrastructure/files/storage/LocalStorage.py cognee/infrastructure/databases/relational/sqlalchemy/SqlAlchemyAdapter.py cognee/infrastructure/databases/graph/networkx/adapter.py

echo -e "\nChecking for exception handling around file operations:"
rg "try.*:.*os\.remove|except.*os\." -B 2 -A 2 cognee/infrastructure/files/storage/LocalStorage.py cognee/infrastructure/databases/relational/sqlalchemy/SqlAlchemyAdapter.py cognee/infrastructure/databases/graph/networkx/adapter.py

Length of output: 5352

from cognee.base_config import get_base_config
config = get_base_config()
if config.data_root_directory in raw_data_location_entities[0].raw_data_location:
os.remove(raw_data_location_entities[0].raw_data_location)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should we check if the file exists? If it doesn't, then we have a bug somewhere and it should be reported to sentry.

cognee/tests/test_pgvector.py Outdated Show resolved Hide resolved
Copy link

gitguardian bot commented Dec 20, 2024

⚠️ GitGuardian has uncovered 1 secret following the scan of your pull request.

Please consider investigating the findings and remediating the incidents. Failure to do so may lead to compromising the associated services or software components.

🔎 Detected hardcoded secret in your pull request
GitGuardian id GitGuardian status Secret Commit Filename
9573981 Triggered Generic Password 7232b04 notebooks/hr_demo.ipynb View secret
🛠 Guidelines to remediate hardcoded secrets
  1. Understand the implications of revoking this secret by investigating where it is used in your code.
  2. Replace and store your secret safely. Learn here the best practices.
  3. Revoke and rotate this secret.
  4. If possible, rewrite git history. Rewriting git history is not a trivial act. You might completely break other contributing developers' workflow and you risk accidentally deleting legitimate data.

To avoid such incidents in the future consider


🦉 GitGuardian detects secrets in your source code to help developers and security teams secure the modern development process. You are seeing this because you or someone else with access to this repository has authorized GitGuardian to scan your pull request.

Log to sentry that a file which should exist can't be found

Chore COG-475
…oteretes/cognee into COG-475-local-file-endpoint-deletion
Copy link
Contributor

@coderabbitai coderabbitai bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 1

♻️ Duplicate comments (1)
cognee/tests/test_pgvector.py (1)

25-26: ⚠️ Potential issue

Add error handling for database queries

The use of .one() will raise exceptions if no results are found. Consider adding proper error handling.

-        data = (await session.scalars(select(Data).where(Data.content_hash == data_hash))).one()
+        try:
+            data = (await session.scalars(select(Data).where(Data.content_hash == data_hash))).one()
+        except NoResultFound:
+            raise AssertionError(f"No data found with hash {data_hash}")

Also applies to: 33-34

🧹 Nitpick comments (3)
cognee/tests/test_pgvector.py (2)

13-39: Add docstring to explain test scenarios

The test function should have a docstring explaining the two test scenarios: deletion of cognee-created files and preservation of external files.

 async def test_local_file_deletion(data_text, file_location):
+    """
+    Test local file deletion behavior:
+    1. Verify that files created by cognee are deleted when their data entity is deleted
+    2. Verify that files not created by cognee are preserved when their data entity is deleted
+    """

29-30: Improve assertion messages

The assertion messages should clearly indicate the expected vs actual state.

-        assert not os.path.exists(data.raw_data_location), f"Data location still exists after deletion: {data.raw_data_location}"
+        assert not os.path.exists(data.raw_data_location), f"Expected file to be deleted but it still exists at: {data.raw_data_location}"

-        assert os.path.exists(data.raw_data_location), f"Data location doesn't exists: {data.raw_data_location}"
+        assert os.path.exists(data.raw_data_location), f"Expected file to be preserved but it was deleted at: {data.raw_data_location}"

Also applies to: 37-38

cognee/infrastructure/databases/relational/sqlalchemy/SqlAlchemyAdapter.py (1)

133-138: Consider race conditions in reference counting

The current implementation checks for file references but doesn't handle potential race conditions where another process might add/remove references between the check and deletion.

Consider implementing a proper reference counting mechanism or using database transactions to ensure atomicity.

📜 Review details

Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 6cb7fef and 14dbe6c.

📒 Files selected for processing (2)
  • cognee/infrastructure/databases/relational/sqlalchemy/SqlAlchemyAdapter.py (3 hunks)
  • cognee/tests/test_pgvector.py (3 hunks)
🔇 Additional comments (3)
cognee/tests/test_pgvector.py (1)

111-111: Consider test order dependency

The test execution order matters here as test_getting_of_documents relies on data being present, while test_local_file_deletion deletes data. Consider adding cleanup and setup methods to ensure test isolation.

Also applies to: 141-141

cognee/infrastructure/databases/relational/sqlalchemy/SqlAlchemyAdapter.py (2)

Line range hint 97-117: LGTM! Method rename improves clarity

The rename from delete_data_by_id to delete_entity_by_id better reflects the generic nature of this method, as it can delete any entity type, not just data.


144-146: ⚠️ Potential issue

Add path validation and error handling for file operations

The current implementation is vulnerable to path traversal attacks and lacks proper error handling for file operations.

-                if config.data_root_directory in raw_data_location_entities[0].raw_data_location:
-                    if os.path.exists(raw_data_location_entities[0].raw_data_location):
-                        os.remove(raw_data_location_entities[0].raw_data_location)
+                file_path = raw_data_location_entities[0].raw_data_location
+                try:
+                    # Validate file path is within allowed directory
+                    real_path = os.path.realpath(file_path)
+                    if not real_path.startswith(os.path.realpath(config.data_root_directory)):
+                        logger.error(f"Attempted file deletion outside allowed directory: {file_path}")
+                        return
+                    
+                    if os.path.exists(file_path):
+                        os.remove(file_path)
+                except OSError as e:
+                    logger.error(f"Error deleting file {file_path}: {e}")

Likely invalid or redundant comment.

Comment on lines +40 to +50
async def test_getting_of_documents(dataset_name_1):
# Test getting of documents for search per dataset
from cognee.modules.users.permissions.methods import get_document_ids_for_user
user = await get_default_user()
document_ids = await get_document_ids_for_user(user.id, [dataset_name_1])
assert len(document_ids) == 1, f"Number of expected documents doesn't match {len(document_ids)} != 1"

# Test getting of documents for search when no dataset is provided
user = await get_default_user()
document_ids = await get_document_ids_for_user(user.id)
assert len(document_ids) == 2, f"Number of expected documents doesn't match {len(document_ids)} != 2"
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🛠️ Refactor suggestion

Improve test function structure and assertions

Several improvements needed:

  1. Add docstring
  2. Improve assertion messages
  3. Remove duplicate user retrieval
 async def test_getting_of_documents(dataset_name_1):
+    """
+    Test document retrieval functionality:
+    1. Verify correct document count when filtering by dataset
+    2. Verify correct document count when no dataset filter is applied
+    """
     # Test getting of documents for search per dataset
     from cognee.modules.users.permissions.methods import get_document_ids_for_user
     user = await get_default_user()
     document_ids = await get_document_ids_for_user(user.id, [dataset_name_1])
-    assert len(document_ids) == 1, f"Number of expected documents doesn't match {len(document_ids)} != 1"
+    assert len(document_ids) == 1, f"Expected 1 document when filtering by dataset, but got {len(document_ids)}"

     # Test getting of documents for search when no dataset is provided
-    user = await get_default_user()  # Duplicate user retrieval
     document_ids = await get_document_ids_for_user(user.id)
-    assert len(document_ids) == 2, f"Number of expected documents doesn't match {len(document_ids)} != 2"
+    assert len(document_ids) == 2, f"Expected 2 documents without dataset filter, but got {len(document_ids)}"
📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change
async def test_getting_of_documents(dataset_name_1):
# Test getting of documents for search per dataset
from cognee.modules.users.permissions.methods import get_document_ids_for_user
user = await get_default_user()
document_ids = await get_document_ids_for_user(user.id, [dataset_name_1])
assert len(document_ids) == 1, f"Number of expected documents doesn't match {len(document_ids)} != 1"
# Test getting of documents for search when no dataset is provided
user = await get_default_user()
document_ids = await get_document_ids_for_user(user.id)
assert len(document_ids) == 2, f"Number of expected documents doesn't match {len(document_ids)} != 2"
async def test_getting_of_documents(dataset_name_1):
"""
Test document retrieval functionality:
1. Verify correct document count when filtering by dataset
2. Verify correct document count when no dataset filter is applied
"""
# Test getting of documents for search per dataset
from cognee.modules.users.permissions.methods import get_document_ids_for_user
user = await get_default_user()
document_ids = await get_document_ids_for_user(user.id, [dataset_name_1])
assert len(document_ids) == 1, f"Expected 1 document when filtering by dataset, but got {len(document_ids)}"
# Test getting of documents for search when no dataset is provided
document_ids = await get_document_ids_for_user(user.id)
assert len(document_ids) == 2, f"Expected 2 documents without dataset filter, but got {len(document_ids)}"

@Vasilije1990 Vasilije1990 merged commit 51559f0 into dev Dec 26, 2024
22 of 26 checks passed
@Vasilije1990 Vasilije1990 deleted the COG-475-local-file-endpoint-deletion branch December 26, 2024 20:18
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants