
Cog 505 data dataset model changes #260

Merged: 14 commits from COG-505-data-dataset-model-changes into main on Dec 6, 2024
Conversation

dexters1 (Collaborator) commented Dec 5, 2024

  • Allow Data to exist outside of Dataset
  • Allow one Data item to be a part of multiple Datasets
  • Add ownership information to Data
  • Add text deduplication for owners of Data items
  • Fix issue with dlt merging of data
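
To make the first three bullets concrete, here is a minimal sketch of what the reworked relationship could look like, assuming SQLAlchemy declarative models (consistent with the Data model changes listed below); the table and relationship names are illustrative, not the exact cognee schema:

from uuid import uuid4
from sqlalchemy import Column, ForeignKey, String, Table
from sqlalchemy.dialects.postgresql import UUID
from sqlalchemy.orm import DeclarativeBase, relationship

class Base(DeclarativeBase):
    pass

# Association table: one Data row can belong to many Datasets and vice versa,
# and a Data row can exist with no Dataset membership at all.
dataset_data = Table(
    "dataset_data",
    Base.metadata,
    Column("dataset_id", UUID(as_uuid=True), ForeignKey("datasets.id"), primary_key=True),
    Column("data_id", UUID(as_uuid=True), ForeignKey("data.id"), primary_key=True),
)

class Data(Base):
    __tablename__ = "data"
    id = Column(UUID(as_uuid=True), primary_key=True, default=uuid4)
    name = Column(String)
    owner_id = Column(UUID(as_uuid=True), index=True)  # ownership information
    content_hash = Column(String, index=True)          # deduplication key
    datasets = relationship("Dataset", secondary=dataset_data, back_populates="data")

class Dataset(Base):
    __tablename__ = "datasets"
    id = Column(UUID(as_uuid=True), primary_key=True, default=uuid4)
    name = Column(String)
    data = relationship("Data", secondary=dataset_data, back_populates="datasets")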

Summary by CodeRabbit

  • New Features

    • Introduced a new GitHub Actions workflow for automated deduplication testing.
    • Enhanced metadata handling by adding content_hash and owner_id to data entries.
    • Updated the identifier generation for binary data to use a content hash instead of a name.
    • Added a new utility function to compute the MD5 hash of file contents.
  • Bug Fixes

    • Improved logic for saving data to prevent overwriting existing files.
  • Tests

    • Added comprehensive asynchronous tests for deduplication functionality across various data types and databases.
  • Documentation

    • Updated content in the natural language processing overview file to provide detailed insights into the field.
    • Added a module-level docstring to enhance documentation of custom exceptions.

Added ability to store a single data instance in multiple datasets

Feature COG-505

Text added to cognee is now saved by content hash, so the same text cannot be stored multiple times

Feature COG-505

Data is deduplicated per user: if a user adds data that already exists, the add is redirected to the existing record in the database

Feature COG-505

Resolve issue with dlt data not being merged for data_id

Fix COG-505
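
A rough sketch of the dedup flow those commits describe, reusing the Data model sketched above; resolve_data_id and the session handling are assumptions for illustration, not the exact cognee API:

from uuid import NAMESPACE_OID, uuid5

async def resolve_data_id(content_hash: str, user_id, session):
    # The data id is derived from content hash + owner, so the same user
    # adding the same content always lands on the same UUID.
    data_id = uuid5(NAMESPACE_OID, f"{content_hash}{user_id}")
    existing = await session.get(Data, data_id)
    if existing is not None:
        # Redirect the add to the existing record instead of duplicating it.
        return existing
    return Data(id=data_id, owner_id=user_id, content_hash=content_hash)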
dexters1 self-assigned this Dec 5, 2024

coderabbitai bot (Contributor) commented Dec 5, 2024

Walkthrough

This pull request introduces a new GitHub Actions workflow for testing deduplication functionality, triggered by manual dispatch or pull requests to the main branch. It includes jobs for checking documentation changes and running deduplication tests with PostgreSQL support. Additionally, several modifications are made across various modules, including simplifying function calls by removing the dataset_name parameter, enhancing metadata handling, and adjusting the Data model to include new attributes. A new test file for deduplication is also added, encapsulating tests for various data types and database configurations.

Changes

File Path: Change Summary
.github/workflows/test_deduplication.yml: New workflow for testing deduplication functionality, triggered by manual dispatch and PRs.
cognee/api/v1/add/add.py: Simplified save_data_to_file calls by removing the dataset_name parameter.
cognee/infrastructure/files/utils/get_file_metadata.py: Added content_hash field to FileMetadata and computed MD5 hash of file content.
cognee/modules/data/models/Data.py: Added owner_id (UUID) and content_hash (String) attributes to the Data class.
cognee/modules/data/operations/write_metadata.py: Updated write_metadata to check for existing metadata and streamline logic in get_metadata_dict.
cognee/modules/ingestion/data_types/BinaryData.py: Changed get_identifier to return content_hash instead of the concatenated name and extension.
cognee/modules/ingestion/identify.py: Updated identify function to include a user parameter, changing the UUID generation logic.
cognee/modules/ingestion/save_data_to_file.py: Removed dataset_name parameter from the save_data_to_file signature and adjusted file naming logic.
cognee/tasks/ingestion/ingest_data_with_metadata.py: Enhanced data_resources and ingest_data_with_metadata functions to include user context.
cognee/tasks/ingestion/save_data_item_to_storage.py: Removed dataset_name from calls to save_data_to_file.
cognee/tasks/ingestion/save_data_item_with_metadata_to_storage.py: Removed dataset_name from calls to save_data_to_file.
cognee/tasks/ingestion/transform_data.py: Removed dataset_name from save_data_to_file calls for Document and ImageDocument.
cognee/tests/test_data/Natural_language_processing_copy.txt: Added detailed overview content on natural language processing.
cognee/tests/test_deduplication.py: Introduced tests for deduplication functionality, including tests for PostgreSQL and SQLite.

Suggested reviewers

  • borisarzentar
  • Vasilije1990

🐰 In the meadow, changes bloom,
New tests for deduplication loom.
With workflows set to run and play,
Data saved in a smarter way.
Hashes and owners, all in line,
In our code, the bunnies shine! 🌼



dexters1 and others added 3 commits December 5, 2024 17:10
gitguardian bot commented Dec 5, 2024

⚠️ GitGuardian has uncovered 2 secrets following the scan of your pull request.

Please consider investigating the findings and remediating the incidents. Failure to do so may lead to compromising the associated services or software components.

🔎 Detected hardcoded secrets in your pull request

GitGuardian id 9573981 | Status: Triggered | Secret: Generic Password | Commit: 813b76c | File: .github/workflows/test_deduplication.yml
GitGuardian id 9573981 | Status: Triggered | Secret: Generic Password | Commit: 813b76c | File: cognee/tests/test_deduplication.py
🛠 Guidelines to remediate hardcoded secrets
  1. Understand the implications of revoking this secret by investigating where it is used in your code.
  2. Replace and store your secrets safely. Learn here the best practices.
  3. Revoke and rotate these secrets.
  4. If possible, rewrite git history. Rewriting git history is not trivial: you might break other contributors' workflows, and you risk accidentally deleting legitimate data.


Add testing of dataset data table content

Test COG-505
dexters1 marked this pull request as ready for review December 5, 2024 18:43

coderabbitai bot (Contributor) left a comment

Actionable comments posted: 14

🧹 Outside diff range and nitpick comments (14)
cognee/tasks/ingestion/ingest_data_with_metadata.py (1)

28-28: Use a safer method to remove 'file://' prefix from file_path

Using file_path.replace("file://", "") may unintentionally replace occurrences of 'file://' elsewhere in the path. It's safer to remove the prefix only if it exists at the start of the string.

Apply this diff to make the change:

- with open(file_path.replace("file://", ""), mode="rb") as file:
+ if file_path.startswith("file://"):
+     file_path = file_path[len("file://"):]
+ with open(file_path, mode="rb") as file:

Also, update the similar code in line 62:

- with open(file_path.replace("file://", ""), mode="rb") as file:
+ if file_path.startswith("file://"):
+     file_path = file_path[len("file://"):]
+ with open(file_path, mode="rb") as file:

Also applies to: 62-62
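
As a side note, on Python 3.9+ the same prefix-safe removal is a built-in one-liner:

file_path = file_path.removeprefix("file://")  # no-op when the prefix is absent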

cognee/modules/ingestion/identify.py (1)

6-9: Ensure user.id is a string when generating UUID

When concatenating data_content_hash with user.id to generate the UUID, ensure that user.id is converted to a string to prevent any type errors.

Apply this diff:

def identify(data: IngestionData, user: User) -> str:
    data_content_hash: str = data.get_identifier()
    # Return UUID hash of file contents + owner id
-   return uuid5(NAMESPACE_OID, f"{data_content_hash}{user.id}")
+   return uuid5(NAMESPACE_OID, f"{data_content_hash}{str(user.id)}")
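A related nit: uuid5 returns a uuid.UUID instance rather than a str, so the -> str annotation on identify is itself slightly inaccurate; returning str(uuid5(...)) or annotating the return type as UUID would make the signature consistent.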
cognee/tasks/ingestion/save_data_item_to_storage.py (1)

13-13: Use a safer method to handle 'file://' prefix in data_item

Using data_item.replace("file://", "") may unintentionally replace 'file://' elsewhere in the string. It's better to check if the string starts with 'file://' and remove the prefix accordingly.

Apply this diff:

elif isinstance(data_item, str):
    # Data is a file path
-   if data_item.startswith("file://") or data_item.startswith("/"):
-       file_path = data_item.replace("file://", "")
+   if data_item.startswith("file://"):
+       file_path = data_item[len("file://"):]
+   elif data_item.startswith("/"):
+       file_path = data_item
    # Data is text
    else:
        file_path = save_data_to_file(data_item)

Ensure that file_path is correctly assigned in all cases.

cognee/modules/ingestion/data_types/BinaryData.py (1)

20-20: Add return type hint for get_identifier

For better type safety and documentation, add a return type hint to the method signature.

Apply this diff:

-   def get_identifier(self):
+   def get_identifier(self) -> str:
cognee/modules/ingestion/save_data_to_file.py (1)

24-26: Add type hints and improve file existence check

The file existence check could be improved with better type hints and path handling.

Apply this diff:

-    # Don't save file if it already exists
-    if not os.path.isfile(os.path.join(storage_path, file_name)):
-        LocalStorage(storage_path).store(file_name, classified_data.get_data())
+    file_path: str = os.path.join(storage_path, file_name)
+    # Don't save file if it already exists
+    if not os.path.isfile(file_path):
+        LocalStorage(storage_path).store(file_name, classified_data.get_data())
cognee/tasks/ingestion/save_data_item_with_metadata_to_storage.py (1)

Line range hint 6-8: Consider removing unused dataset_name parameter

The dataset_name parameter is only used for llama_index data type. Consider refactoring to remove this parameter or document why it's needed.

Either remove the parameter if not needed or document its purpose:

-async def save_data_item_with_metadata_to_storage(
-    data_item: Union[BinaryIO, str, Any], dataset_name: str
-) -> str:
+async def save_data_item_with_metadata_to_storage(
+    data_item: Union[BinaryIO, str, Any],
+    dataset_name: str | None = None  # Only used for llama_index data type
+) -> str:
cognee/modules/data/models/Data.py (1)

Line range hint 44-54: Update to_json() method to include new fields

The to_json() method should be updated to include the new owner_id and content_hash fields for consistency.

     def to_json(self) -> dict:
         return {
             "id": str(self.id),
             "name": self.name,
             "extension": self.extension,
             "mimeType": self.mime_type,
             "rawDataLocation": self.raw_data_location,
+            "ownerId": str(self.owner_id),
+            "contentHash": self.content_hash,
             "createdAt": self.created_at.isoformat(),
             "updatedAt": self.updated_at.isoformat() if self.updated_at else None,
         }
.github/workflows/test_deduplication.yml (1)

1-14: Add workflow timeout and dependency caching

Consider adding these improvements to the workflow:

  1. Add a timeout to prevent hanging jobs
  2. Add caching for Poetry dependencies
name: test | deduplication

on:
  workflow_dispatch:
  pull_request:
    branches:
      - main
    types: [labeled, synchronize]

+timeout-minutes: 30

concurrency:
  group: ${{ github.workflow }}-${{ github.event.pull_request.number || github.ref }}
  cancel-in-progress: true

+    steps:
+      # Add after Setup Python step
+      - name: Cache Poetry dependencies
+        uses: actions/cache@v3
+        with:
+          path: .venv
+          key: venv-${{ runner.os }}-${{ hashFiles('**/poetry.lock') }}
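
One placement caveat: in GitHub Actions, timeout-minutes is a job-level (or step-level) key, so the +timeout-minutes: 30 line would need to sit under the job definition (jobs.<job_id>.timeout-minutes), not at the workflow top level as diffed above.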
cognee/api/v1/add/add.py (3)

Line range hint 116-127: Update Data creation to include new required fields

The Data creation logic needs to be updated to support the new data model changes:

  1. Set owner_id from the user parameter
  2. Add content_hash from file_metadata
                        data = Data(
                            id = data_id,
                            name = file_metadata["name"],
                            raw_data_location = file_metadata["file_path"],
                            extension = file_metadata["extension"],
                            mime_type = file_metadata["mime_type"],
+                            owner_id = user.id,
+                            content_hash = file_metadata.get("content_hash"),
                        )

Line range hint 98-106: Add deduplication check using content_hash

Before creating a new Data record, check if content with the same hash already exists to implement deduplication.

                    data = (await session.execute(
-                        select(Data).filter(Data.id == data_id)
+                        select(Data).filter(
+                            (Data.id == data_id) |
+                            (Data.content_hash == file_metadata.get("content_hash"))
+                        )
                    )).scalar_one_or_none()
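
A caveat on this combined filter: scalar_one_or_none() raises MultipleResultsFound if the id match and the content-hash match hit different rows, and matching on content_hash alone would deduplicate across users; adding an owner_id condition to the hash branch would preserve the per-user semantics this PR describes.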

Line range hint 134-143: Include new fields in yielded metadata

Update the yielded metadata to include the new owner_id and content_hash fields for consistency with the Data model.

                yield {
                    "id": data_id,
                    "name": file_metadata["name"],
                    "file_path": file_metadata["file_path"],
                    "extension": file_metadata["extension"],
                    "mime_type": file_metadata["mime_type"],
+                    "owner_id": str(user.id),
+                    "content_hash": file_metadata.get("content_hash"),
                }
cognee/tests/test_deduplication.py (2)

11-97: Consider enhancing test robustness and error handling

While the test structure is good, consider these improvements:

  1. Add error handling for file operations
  2. Add cleanup between different file type tests to ensure complete isolation
  3. Add negative test cases (e.g., different content with same names)

137-154: Add test environment cleanup

Consider adding cleanup of test directories after tests complete to prevent test data accumulation.

async def main():
+   try:
        data_directory_path = str(
            pathlib.Path(
                os.path.join(pathlib.Path(__file__).parent, ".data_storage/test_deduplication")
            ).resolve()
        )
        cognee.config.data_root_directory(data_directory_path)
        cognee_directory_path = str(
            pathlib.Path(
                os.path.join(pathlib.Path(__file__).parent, ".cognee_system/test_deduplication")
            ).resolve()
        )
        cognee.config.system_root_directory(cognee_directory_path)

        await test_deduplication_postgres()
        await test_deduplication_sqlite()
+   finally:
+       # Clean up test directories
+       import shutil
+       shutil.rmtree(data_directory_path, ignore_errors=True)
+       shutil.rmtree(cognee_directory_path, ignore_errors=True)
cognee/tests/test_data/Natural_language_processing_copy.txt (1)

1-1: Fix grammatical issue with missing comma

Add a comma before "as well as" in the compound predicate for better readability and grammatical correctness.

- The technology can then accurately extract information and insights contained in the documents as well as categorize and organize the documents themselves.
+ The technology can then accurately extract information and insights contained in the documents, as well as categorize and organize the documents themselves.
📜 Review details

Configuration used: CodeRabbit UI
Review profile: CHILL

📥 Commits

Reviewing files that changed from the base of the PR and between acf0368 and add6730.

⛔ Files ignored due to path filters (4)
  • cognee/tests/test_data/example.png is excluded by !**/*.png
  • cognee/tests/test_data/example_copy.png is excluded by !**/*.png
  • cognee/tests/test_data/text_to_speech.mp3 is excluded by !**/*.mp3
  • cognee/tests/test_data/text_to_speech_copy.mp3 is excluded by !**/*.mp3
📒 Files selected for processing (14)
  • .github/workflows/test_deduplication.yml (1 hunks)
  • cognee/api/v1/add/add.py (1 hunks)
  • cognee/infrastructure/files/utils/get_file_metadata.py (3 hunks)
  • cognee/modules/data/models/Data.py (1 hunks)
  • cognee/modules/data/operations/write_metadata.py (0 hunks)
  • cognee/modules/ingestion/data_types/BinaryData.py (1 hunks)
  • cognee/modules/ingestion/identify.py (1 hunks)
  • cognee/modules/ingestion/save_data_to_file.py (1 hunks)
  • cognee/tasks/ingestion/ingest_data_with_metadata.py (6 hunks)
  • cognee/tasks/ingestion/save_data_item_to_storage.py (1 hunks)
  • cognee/tasks/ingestion/save_data_item_with_metadata_to_storage.py (2 hunks)
  • cognee/tasks/ingestion/transform_data.py (1 hunks)
  • cognee/tests/test_data/Natural_language_processing_copy.txt (1 hunks)
  • cognee/tests/test_deduplication.py (1 hunks)
💤 Files with no reviewable changes (1)
  • cognee/modules/data/operations/write_metadata.py
🧰 Additional context used
🪛 Ruff (0.8.0)
cognee/tasks/ingestion/transform_data.py

14-14: Use is and is not for type comparisons, or isinstance() for isinstance checks

(E721)

🪛 actionlint (1.7.4)
.github/workflows/test_deduplication.yml

26-26: if: condition "needs.get_docs_changes.outputs.changes_outside_docs == 'true' && ${{ github.event.label.name == 'run-checks' }}" is always evaluated to true because extra characters are around ${{ }}

(if-cond)

🪛 LanguageTool
cognee/tests/test_data/Natural_language_processing_copy.txt

[uncategorized] ~1-~1: Possible missing comma found.
Context: ...formation and insights contained in the documents as well as categorize and organize the ...

(AI_HYDRA_LEO_MISSING_COMMA)

🔇 Additional comments (1)
cognee/tests/test_deduplication.py (1)

1-10: LGTM: Test setup and imports are appropriate

The imports and logging configuration are well-structured for testing purposes.

Resolved review threads: cognee/tasks/ingestion/ingest_data_with_metadata.py (outdated), cognee/infrastructure/files/utils/get_file_metadata.py (outdated), cognee/tasks/ingestion/transform_data.py, cognee/modules/ingestion/save_data_to_file.py, .github/workflows/test_deduplication.yml

    steps:
      - name: Check out
        uses: actions/checkout@master
coderabbitai bot (Contributor) commented:
⚠️ Potential issue

Security: Use specific version for checkout action

Using @master for actions can be dangerous as it may introduce breaking changes or security issues.

-        uses: actions/checkout@master
+        uses: actions/checkout@v4
📝 Committable suggestion


Suggested change:
-        uses: actions/checkout@master
+        uses: actions/checkout@v4

Resolved review threads: cognee/tests/test_deduplication.py (2)
Add missing function to get data from database to multimedia deduplication test

Test COG-505
coderabbitai bot (Contributor) left a comment

Actionable comments posted: 2

🧹 Outside diff range and nitpick comments (5)
cognee/tests/test_deduplication.py (5)

9-9: Consider adjusting logging level for tests

Setting logging to DEBUG level might generate too much output during test runs. Consider using INFO level for tests unless actively debugging.

-logging.basicConfig(level=logging.DEBUG)
+logging.basicConfig(level=logging.INFO)

11-11: Add docstring to explain test purpose and requirements

Add a docstring to document the test's purpose, prerequisites (test data files), and expected behavior.

 async def test_deduplication():
+    """Test deduplication functionality across different data types.
+    
+    Prerequisites:
+        - Test data files in ./test_data/ directory:
+          - Natural_language_processing.txt and its copy
+          - example.png and its copy
+          - text_to_speech.mp3 and its copy
+    
+    Tests:
+        - Deduplication of text files
+        - Deduplication of text content
+        - Deduplication of image files
+        - Deduplication of sound files
+    """

48-54: Move test text content to a separate file

Long text content should be moved to a test data file to improve readability and maintainability.

-    text = """A quantum computer is a computer that takes advantage...
-        """
+    text_path = os.path.join(
+        pathlib.Path(__file__).parent, "test_data/quantum_computing.txt"
+    )
+    verify_test_file(text_path, "quantum computing text")
+    with open(text_path, 'r') as f:
+        text = f.read()

44-45: Reduce redundant cleanup calls

Consider using pytest fixtures for setup and cleanup to avoid repetition and ensure consistent cleanup.

import pytest

@pytest.fixture(autouse=True)
async def cleanup():
    await cognee.prune.prune_data()
    await cognee.prune.prune_system(metadata=True)
    yield
    await cognee.prune.prune_data()
    await cognee.prune.prune_system(metadata=True)

Also applies to: 63-64, 80-81, 97-98


101-137: Use pytest parametrize for database configurations

Instead of separate functions, use pytest's parametrize feature to run the same test with different database configurations.

@pytest.mark.parametrize("db_config", [
    {
        "vector_db_config": {
            "vector_db_url": "",
            "vector_db_key": "",
            "vector_db_provider": "pgvector"
        },
        "relational_db_config": {
            "db_name": "cognee_db",
            "db_host": "127.0.0.1",
            "db_port": "5432",
            "db_username": "cognee",
            "db_password": "cognee",
            "db_provider": "postgres",
        }
    },
    {
        "vector_db_config": {
            "vector_db_url": "",
            "vector_db_key": "",
            "vector_db_provider": "lancedb"
        },
        "relational_db_config": {
            "db_provider": "sqlite",
        }
    }
])
async def test_deduplication_with_db(db_config):
    cognee.config.set_vector_db_config(db_config["vector_db_config"])
    cognee.config.set_relational_db_config(db_config["relational_db_config"])
    await test_deduplication()
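
Note that such parametrized async tests also need an async-aware runner: plain pytest collects but does not await coroutine tests, so pytest-asyncio (via @pytest.mark.asyncio or asyncio_mode = "auto" in the pytest config) would be required for this sketch to actually execute.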
📜 Review details

Configuration used: CodeRabbit UI
Review profile: CHILL

📥 Commits

Reviewing files that changed from the base of the PR and between add6730 and 9ba5d49.

📒 Files selected for processing (1)
  • cognee/tests/test_deduplication.py (1 hunks)
🔇 Additional comments (1)
cognee/tests/test_deduplication.py (1)

101-121: Enhance database configuration security and validation

The previous review comment about using environment variables for database credentials and adding configuration validation is still valid.

Resolved review threads: cognee/tests/test_deduplication.py (2)
Moved hash calculation of file to shared utils, added better typing

Refactor COG-505
coderabbitai bot (Contributor) left a comment

Actionable comments posted: 1

🧹 Outside diff range and nitpick comments (1)
cognee/shared/utils.py (1)

76-95: Add docstring to explain function usage and purpose.

While the implementation is solid, please add a docstring that explains:

  • The function's purpose
  • Parameters and their types
  • Return value
  • Usage examples

Example docstring:

def get_file_content_hash(file_obj: Union[str, BinaryIO]) -> str:
    """
    Compute MD5 hash of file content, reading in chunks for memory efficiency.

    Args:
        file_obj: Either a file path (str) or a file-like object (BinaryIO)

    Returns:
        str: Hexadecimal representation of the MD5 hash

    Example:
        >>> hash1 = get_file_content_hash("path/to/file.txt")
        >>> with open("path/to/file.txt", "rb") as f:
        ...     hash2 = get_file_content_hash(f)
    """
📜 Review details

Configuration used: CodeRabbit UI
Review profile: CHILL

📥 Commits

Reviewing files that changed from the base of the PR and between 9ba5d49 and e80377b.

📒 Files selected for processing (3)
  • cognee/infrastructure/files/utils/get_file_metadata.py (2 hunks)
  • cognee/shared/utils.py (2 hunks)
  • cognee/tasks/ingestion/ingest_data_with_metadata.py (6 hunks)
🚧 Files skipped from review as they are similar to previous changes (2)
  • cognee/infrastructure/files/utils/get_file_metadata.py
  • cognee/tasks/ingestion/ingest_data_with_metadata.py
🔇 Additional comments (1)
cognee/shared/utils.py (1)

3-6: LGTM: Imports are correctly placed and necessary.

The new imports support the added functionality for file content hashing.

Resolved review thread: cognee/shared/utils.py (outdated)
Added error handling to reading of file in hash util

Refactor COG-505
coderabbitai bot (Contributor) left a comment

Actionable comments posted: 0

🧹 Outside diff range and nitpick comments (1)
cognee/shared/exceptions/exceptions.py (1)

4-11: Consider enhancing error handling and documentation.

While the implementation is functionally correct, consider these improvements:

  1. Add a docstring to document the exception's purpose and usage
  2. Consider adding an optional details parameter for more specific error information
  3. Allow for different status codes based on specific ingestion failure scenarios

Here's a suggested enhancement:

 class IngestionError(CogneeApiError):
+    """Exception raised for errors during data ingestion process.
+
+    Attributes:
+        message -- explanation of the error
+        name -- name of the error type
+        status_code -- HTTP status code
+        details -- additional error details
+    """
     def __init__(
             self,
             message: str = "Failed to load data.",
             name: str = "IngestionError",
             status_code=status.HTTP_422_UNPROCESSABLE_ENTITY,
+            details: dict = None
     ):
-        super().__init__(message, name, status_code)
+        self.details = details
+        full_message = f"{message} Details: {details}" if details else message
+        super().__init__(full_message, name, status_code)

This enhancement would allow for more detailed error reporting:

# Usage example
raise IngestionError(
    message="Failed to load CSV data",
    details={"reason": "Invalid column format", "column": "timestamp"}
)
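
A small follow-on: typing the new parameter as details: Optional[dict] = None (with Optional imported from typing) would match the None default under strict type checking.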
📜 Review details

Configuration used: CodeRabbit UI
Review profile: CHILL

📥 Commits

Reviewing files that changed from the base of the PR and between e80377b and 1e098ae.

📒 Files selected for processing (3)
  • cognee/shared/exceptions/__init__.py (1 hunks)
  • cognee/shared/exceptions/exceptions.py (1 hunks)
  • cognee/shared/utils.py (3 hunks)
✅ Files skipped from review due to trivial changes (1)
  • cognee/shared/exceptions/__init__.py
🚧 Files skipped from review as they are similar to previous changes (1)
  • cognee/shared/utils.py
🔇 Additional comments (1)
cognee/shared/exceptions/exceptions.py (1)

1-2: LGTM! Appropriate imports for exception handling.

The imports are correctly specified, bringing in the necessary base exception class and HTTP status codes.

borisarzentar (Contributor) left a comment
Great job with this implementation and writing that extensive test!

Resolved review thread: cognee/modules/data/models/Data.py
def identify(data: IngestionData, user: User) -> str:
    data_content_hash: str = data.get_identifier()
    # return UUID hash of file contents + owner id
    return uuid5(NAMESPACE_OID,f"{data_content_hash}{user.id}")
A contributor commented:
Some space is missing.

dexters1 (Collaborator, Author) replied:
I would say we need owner_id for permissions on Data operations, since Data is no longer tied to Datasets.

We don't need content_hash right now, but it might be nice to have, as it's the only way to verify the content of the Data.
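
For illustration, the kind of permission guard that owner_id enables might look like this; the helper name and call site are hypothetical:

def assert_user_owns_data(data, user) -> None:
    # Data-level operations can now be authorized without going through a Dataset.
    if data.owner_id != user.id:
        raise PermissionError(f"user {user.id} does not own data {data.id}")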

Add space and newline to ingest function

Refactor COG-505
…eretes/cognee into COG-505-data-dataset-model-changes
hajdul88 self-requested a review December 6, 2024 13:41
dexters1 merged commit 8415279 into main Dec 6, 2024
40 checks passed
dexters1 deleted the COG-505-data-dataset-model-changes branch December 6, 2024 13:42