Cog 505 data dataset model changes #260
Conversation
Added ability to store single data instance in multiple datasets Feature COG-505
If text is added to cognee, it is saved by hash, so the same text can't be stored multiple times Feature COG-505
Data is deduplicated per user, so if a user tries to add data that already exists, the request is redirected to the existing data in the database Feature COG-505
Resolve issue with dlt data not being merged for data_id Fix COG-505
Walkthrough

This pull request introduces a new GitHub Actions workflow for testing deduplication functionality, triggered by manual dispatch or pull requests to the main branch.

Changes
Added end to end test for text deduplication Test COG-505
…eretes/cognee into COG-505-data-dataset-model-changes
| GitGuardian id | GitGuardian status | Secret | Commit | Filename |
|---|---|---|---|---|
| 9573981 | Triggered | Generic Password | 813b76c | .github/workflows/test_deduplication.yml |
| 9573981 | Triggered | Generic Password | 813b76c | cognee/tests/test_deduplication.py |
🛠 Guidelines to remediate hardcoded secrets
- Understand the implications of revoking this secret by investigating where it is used in your code.
- Replace and store your secrets safely. Learn here the best practices.
- Revoke and rotate these secrets.
- If possible, rewrite git history. Rewriting git history is not a trivial act. You might completely break other contributing developers' workflow and you risk accidentally deleting legitimate data.
To avoid such incidents in the future, consider:
- following these best practices for managing and storing secrets, including API keys and other credentials
- installing secret detection on pre-commit to catch secrets before they leave your machine and ease remediation.
🦉 GitGuardian detects secrets in your source code to help developers and security teams secure the modern development process. You are seeing this because you or someone else with access to this repository has authorized GitGuardian to scan your pull request.
Add testing of dataset data table content Test COG-505
Actionable comments posted: 14
🧹 Outside diff range and nitpick comments (14)
cognee/tasks/ingestion/ingest_data_with_metadata.py (1)

28-28: Use a safer method to remove the 'file://' prefix from file_path

Using file_path.replace("file://", "") may unintentionally replace occurrences of 'file://' elsewhere in the path. It's safer to remove the prefix only if it exists at the start of the string. Apply this diff:

```diff
-    with open(file_path.replace("file://", ""), mode="rb") as file:
+    if file_path.startswith("file://"):
+        file_path = file_path[len("file://"):]
+    with open(file_path, mode="rb") as file:
```

The same change applies to line 62. Also applies to: 62-62
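A related note: on Python 3.9+, str.removeprefix achieves the same effect more concisely, stripping "file://" only when it occurs at the start of the string. A minimal sketch (the helper name is illustrative, not the project's code):

```python
def strip_file_scheme(path: str) -> str:
    # removeprefix (Python 3.9+) removes "file://" only when it starts the string
    return path.removeprefix("file://")

print(strip_file_scheme("file:///tmp/data.txt"))    # /tmp/data.txt
print(strip_file_scheme("/docs/file://notes.txt"))  # unchanged
```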
cognee/modules/ingestion/identify.py (1)

6-9: Ensure user.id is a string when generating the UUID

When concatenating data_content_hash with user.id to generate the UUID, ensure that user.id is converted to a string to prevent type errors. Apply this diff:

```diff
 def identify(data: IngestionData, user: User) -> str:
     data_content_hash: str = data.get_identifier()
     # Return UUID hash of file contents + owner id
-    return uuid5(NAMESPACE_OID, f"{data_content_hash}{user.id}")
+    return uuid5(NAMESPACE_OID, f"{data_content_hash}{str(user.id)}")
```

cognee/tasks/ingestion/save_data_item_to_storage.py (1)
13-13: Use a safer method to handle the 'file://' prefix in data_item

Using data_item.replace("file://", "") may unintentionally replace 'file://' elsewhere in the string. It's better to check whether the string starts with 'file://' and remove only that prefix. Apply this diff:

```diff
 elif isinstance(data_item, str):
     # Data is a file path
-    if data_item.startswith("file://") or data_item.startswith("/"):
-        file_path = data_item.replace("file://", "")
+    if data_item.startswith("file://"):
+        file_path = data_item[len("file://"):]
+    elif data_item.startswith("/"):
+        file_path = data_item
     # Data is text
     else:
         file_path = save_data_to_file(data_item)
```

Ensure that file_path is correctly assigned in all cases.

cognee/modules/ingestion/data_types/BinaryData.py (1)
20-20: Add a return type hint to get_identifier

For better type safety and documentation, add a return type hint to the method signature. Apply this diff:

```diff
-    def get_identifier(self):
+    def get_identifier(self) -> str:
```

cognee/modules/ingestion/save_data_to_file.py (1)
24-26: Add type hints and improve the file existence check

The file existence check could be improved with better type hints and path handling. Apply this diff:

```diff
-    # Don't save file if it already exists
-    if not os.path.isfile(os.path.join(storage_path, file_name)):
-        LocalStorage(storage_path).store(file_name, classified_data.get_data())
+    file_path: str = os.path.join(storage_path, file_name)
+    # Don't save file if it already exists
+    if not os.path.isfile(file_path):
+        LocalStorage(storage_path).store(file_name, classified_data.get_data())
```

cognee/tasks/ingestion/save_data_item_with_metadata_to_storage.py (1)
Line range hint 6-8: Consider removing the unused dataset_name parameter

The dataset_name parameter is only used for the llama_index data type. Either remove the parameter if it is not needed, or document its purpose:

```diff
-async def save_data_item_with_metadata_to_storage(
-    data_item: Union[BinaryIO, str, Any], dataset_name: str
-) -> str:
+async def save_data_item_with_metadata_to_storage(
+    data_item: Union[BinaryIO, str, Any],
+    dataset_name: str | None = None  # Only used for llama_index data type
+) -> str:
```

cognee/modules/data/models/Data.py (1)
Line range hint 44-54: Update the to_json() method to include the new fields

The to_json() method should be updated to include the new owner_id and content_hash fields for consistency.

```diff
 def to_json(self) -> dict:
     return {
         "id": str(self.id),
         "name": self.name,
         "extension": self.extension,
         "mimeType": self.mime_type,
         "rawDataLocation": self.raw_data_location,
+        "ownerId": str(self.owner_id),
+        "contentHash": self.content_hash,
         "createdAt": self.created_at.isoformat(),
         "updatedAt": self.updated_at.isoformat() if self.updated_at else None,
     }
```

.github/workflows/test_deduplication.yml (1)
1-14: Add a workflow timeout and dependency caching

Consider adding these improvements to the workflow:

- Add a timeout to prevent hanging jobs
- Add caching for Poetry dependencies

```diff
 name: test | deduplication

 on:
   workflow_dispatch:
   pull_request:
     branches:
       - main
     types: [labeled, synchronize]

+timeout-minutes: 30

 concurrency:
   group: ${{ github.workflow }}-${{ github.event.pull_request.number || github.ref }}
   cancel-in-progress: true

+    steps:
+      # Add after Setup Python step
+      - name: Cache Poetry dependencies
+        uses: actions/cache@v3
+        with:
+          path: .venv
+          key: venv-${{ runner.os }}-${{ hashFiles('**/poetry.lock') }}
```

cognee/api/v1/add/add.py (3)
Line range hint 116-127: Update Data creation to include the new required fields

The Data creation logic needs to be updated to support the new data model changes:

- Set owner_id from the user parameter
- Add content_hash from file_metadata

```diff
 data = Data(
     id = data_id,
     name = file_metadata["name"],
     raw_data_location = file_metadata["file_path"],
     extension = file_metadata["extension"],
     mime_type = file_metadata["mime_type"],
+    owner_id = user.id,
+    content_hash = file_metadata.get("content_hash"),
 )
```
Line range hint 98-106: Add a deduplication check using content_hash

Before creating a new Data record, check whether content with the same hash already exists, to implement deduplication.

```diff
 data = (await session.execute(
-    select(Data).filter(Data.id == data_id)
+    select(Data).filter(
+        (Data.id == data_id) |
+        (Data.content_hash == file_metadata.get("content_hash"))
+    )
 )).scalar_one_or_none()
```
Line range hint 134-143: Include the new fields in the yielded metadata

Update the yielded metadata to include the new owner_id and content_hash fields for consistency with the Data model.

```diff
 yield {
     "id": data_id,
     "name": file_metadata["name"],
     "file_path": file_metadata["file_path"],
     "extension": file_metadata["extension"],
     "mime_type": file_metadata["mime_type"],
+    "owner_id": str(user.id),
+    "content_hash": file_metadata.get("content_hash"),
 }
```

cognee/tests/test_deduplication.py (2)
11-97: Consider enhancing test robustness and error handling

While the test structure is good, consider these improvements:

- Add error handling for file operations
- Add cleanup between different file type tests to ensure complete isolation
- Add negative test cases (e.g., different content with the same names)

137-154: Add test environment cleanup

Consider adding cleanup of the test directories after the tests complete, to prevent test data accumulation.

```diff
 async def main():
+    try:
         data_directory_path = str(
             pathlib.Path(
                 os.path.join(pathlib.Path(__file__).parent, ".data_storage/test_deduplication")
             ).resolve()
         )
         cognee.config.data_root_directory(data_directory_path)
         cognee_directory_path = str(
             pathlib.Path(
                 os.path.join(pathlib.Path(__file__).parent, ".cognee_system/test_deduplication")
             ).resolve()
         )
         cognee.config.system_root_directory(cognee_directory_path)

         await test_deduplication_postgres()
         await test_deduplication_sqlite()
+    finally:
+        # Clean up test directories
+        import shutil
+        shutil.rmtree(data_directory_path, ignore_errors=True)
+        shutil.rmtree(cognee_directory_path, ignore_errors=True)
```

cognee/tests/test_data/Natural_language_processing_copy.txt (1)
1-1: Fix a grammatical issue with a missing comma

Add a comma before "as well as" in the compound predicate for better readability and grammatical correctness.

```diff
- The technology can then accurately extract information and insights contained in the documents as well as categorize and organize the documents themselves.
+ The technology can then accurately extract information and insights contained in the documents, as well as categorize and organize the documents themselves.
```

🧰 Tools
🪛 LanguageTool
[uncategorized] ~1-~1: Possible missing comma found.
Context: ...formation and insights contained in the documents as well as categorize and organize the ...(AI_HYDRA_LEO_MISSING_COMMA)
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
⛔ Files ignored due to path filters (4)

- cognee/tests/test_data/example.png is excluded by !**/*.png
- cognee/tests/test_data/example_copy.png is excluded by !**/*.png
- cognee/tests/test_data/text_to_speech.mp3 is excluded by !**/*.mp3
- cognee/tests/test_data/text_to_speech_copy.mp3 is excluded by !**/*.mp3
📒 Files selected for processing (14)
- .github/workflows/test_deduplication.yml (1 hunks)
- cognee/api/v1/add/add.py (1 hunks)
- cognee/infrastructure/files/utils/get_file_metadata.py (3 hunks)
- cognee/modules/data/models/Data.py (1 hunks)
- cognee/modules/data/operations/write_metadata.py (0 hunks)
- cognee/modules/ingestion/data_types/BinaryData.py (1 hunks)
- cognee/modules/ingestion/identify.py (1 hunks)
- cognee/modules/ingestion/save_data_to_file.py (1 hunks)
- cognee/tasks/ingestion/ingest_data_with_metadata.py (6 hunks)
- cognee/tasks/ingestion/save_data_item_to_storage.py (1 hunks)
- cognee/tasks/ingestion/save_data_item_with_metadata_to_storage.py (2 hunks)
- cognee/tasks/ingestion/transform_data.py (1 hunks)
- cognee/tests/test_data/Natural_language_processing_copy.txt (1 hunks)
- cognee/tests/test_deduplication.py (1 hunks)
💤 Files with no reviewable changes (1)
- cognee/modules/data/operations/write_metadata.py
🧰 Additional context used
🪛 Ruff (0.8.0)
cognee/tasks/ingestion/transform_data.py
14-14: Use `is` and `is not` for type comparisons, or `isinstance()` for isinstance checks (E721)
🪛 actionlint (1.7.4)
.github/workflows/test_deduplication.yml
26-26: if: condition "needs.get_docs_changes.outputs.changes_outside_docs == 'true' && ${{ github.event.label.name == 'run-checks' }}" is always evaluated to true because extra characters are around ${{ }}
(if-cond)
🪛 LanguageTool
cognee/tests/test_data/Natural_language_processing_copy.txt
[uncategorized] ~1-~1: Possible missing comma found.
Context: ...formation and insights contained in the documents as well as categorize and organize the ...
(AI_HYDRA_LEO_MISSING_COMMA)
🔇 Additional comments (1)
cognee/tests/test_deduplication.py (1)
1-10: LGTM: Test setup and imports are appropriate
The imports and logging configuration are well-structured for testing purposes.
```yaml
steps:
  - name: Check out
    uses: actions/checkout@master
```
Security: Use a specific version for the checkout action

Using @master for actions can be dangerous, as it may introduce breaking changes or security issues.

```diff
- uses: actions/checkout@master
+ uses: actions/checkout@v4
```
Add missing function to get data from database to multimedia deduplication test Test COG-505
Actionable comments posted: 2
🧹 Outside diff range and nitpick comments (5)
cognee/tests/test_deduplication.py (5)
9-9: Consider adjusting the logging level for tests

Setting logging to DEBUG level might generate too much output during test runs. Consider using INFO level for tests unless actively debugging.

```diff
-logging.basicConfig(level=logging.DEBUG)
+logging.basicConfig(level=logging.INFO)
```

11-11: Add a docstring to explain the test's purpose and requirements

Add a docstring documenting the test's purpose, prerequisites (test data files), and expected behavior.

```diff
 async def test_deduplication():
+    """Test deduplication functionality across different data types.
+
+    Prerequisites:
+    - Test data files in ./test_data/ directory:
+      - Natural_language_processing.txt and its copy
+      - example.png and its copy
+      - text_to_speech.mp3 and its copy
+
+    Tests:
+    - Deduplication of text files
+    - Deduplication of text content
+    - Deduplication of image files
+    - Deduplication of sound files
+    """
```

48-54: Move the test text content to a separate file

Long text content should be moved to a test data file to improve readability and maintainability.

```diff
-    text = """A quantum computer is a computer that takes advantage...
-    """
+    text_path = os.path.join(
+        pathlib.Path(__file__).parent, "test_data/quantum_computing.txt"
+    )
+    with open(text_path, 'r') as f:
+        text = f.read()
```

44-45: Reduce redundant cleanup calls

Consider using pytest fixtures for setup and cleanup to avoid repetition and ensure consistent cleanup.

```python
import pytest

@pytest.fixture(autouse=True)
async def cleanup():
    await cognee.prune.prune_data()
    await cognee.prune.prune_system(metadata=True)
    yield
    await cognee.prune.prune_data()
    await cognee.prune.prune_system(metadata=True)
```

Also applies to: 63-64, 80-81, 97-98

101-137: Use pytest parametrize for the database configurations

Instead of separate functions, use pytest's parametrize feature to run the same test with different database configurations.

```python
@pytest.mark.parametrize("db_config", [
    {
        "vector_db_config": {
            "vector_db_url": "",
            "vector_db_key": "",
            "vector_db_provider": "pgvector"
        },
        "relational_db_config": {
            "db_name": "cognee_db",
            "db_host": "127.0.0.1",
            "db_port": "5432",
            "db_username": "cognee",
            "db_password": "cognee",
            "db_provider": "postgres",
        }
    },
    {
        "vector_db_config": {
            "vector_db_url": "",
            "vector_db_key": "",
            "vector_db_provider": "lancedb"
        },
        "relational_db_config": {
            "db_provider": "sqlite",
        }
    }
])
async def test_deduplication_with_db(db_config):
    cognee.config.set_vector_db_config(db_config["vector_db_config"])
    cognee.config.set_relational_db_config(db_config["relational_db_config"])
    await test_deduplication()
```
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
📒 Files selected for processing (1)
- cognee/tests/test_deduplication.py (1 hunks)
🔇 Additional comments (1)
cognee/tests/test_deduplication.py (1)
101-121: Enhance database configuration security and validation
The previous review comment about using environment variables for database credentials and adding configuration validation is still valid.
Moved hash calculation of file to shared utils, added better typing Refactor COG-505
Actionable comments posted: 1
🧹 Outside diff range and nitpick comments (1)
cognee/shared/utils.py (1)
76-95: Add a docstring to explain the function's usage and purpose

While the implementation is solid, please add a docstring that explains:

- The function's purpose
- Parameters and their types
- Return value
- Usage examples

Example docstring:

```python
def get_file_content_hash(file_obj: Union[str, BinaryIO]) -> str:
    """
    Compute MD5 hash of file content, reading in chunks for memory efficiency.

    Args:
        file_obj: Either a file path (str) or a file-like object (BinaryIO)

    Returns:
        str: Hexadecimal representation of the MD5 hash

    Example:
        >>> hash1 = get_file_content_hash("path/to/file.txt")
        >>> with open("path/to/file.txt", "rb") as f:
        ...     hash2 = get_file_content_hash(f)
    """
```
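The chunked-hashing approach this docstring describes can be sketched as follows. This is a minimal illustration of the technique, not the project's actual implementation — the helper name and chunk size are assumptions:

```python
import hashlib
from typing import BinaryIO, Union

def content_hash(file_obj: Union[str, BinaryIO], chunk_size: int = 65536) -> str:
    # Hash in fixed-size chunks so large files never load fully into memory
    md5 = hashlib.md5()
    if isinstance(file_obj, str):
        with open(file_obj, "rb") as f:
            for chunk in iter(lambda: f.read(chunk_size), b""):
                md5.update(chunk)
    else:
        for chunk in iter(lambda: file_obj.read(chunk_size), b""):
            md5.update(chunk)
    return md5.hexdigest()
```

Because the digest depends only on the bytes read, hashing via a path or via an already-open binary file yields the same result.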
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
📒 Files selected for processing (3)
- cognee/infrastructure/files/utils/get_file_metadata.py (2 hunks)
- cognee/shared/utils.py (2 hunks)
- cognee/tasks/ingestion/ingest_data_with_metadata.py (6 hunks)
🚧 Files skipped from review as they are similar to previous changes (2)
- cognee/infrastructure/files/utils/get_file_metadata.py
- cognee/tasks/ingestion/ingest_data_with_metadata.py
🔇 Additional comments (1)
cognee/shared/utils.py (1)
3-6: LGTM: Imports are correctly placed and necessary.
The new imports support the added functionality for file content hashing.
Added error handling to reading of file in hash util Refactor COG-505
Actionable comments posted: 0
🧹 Outside diff range and nitpick comments (1)
cognee/shared/exceptions/exceptions.py (1)
4-11: Consider enhancing error handling and documentation

While the implementation is functionally correct, consider these improvements:

- Add a docstring to document the exception's purpose and usage
- Consider adding an optional details parameter for more specific error information
- Allow for different status codes based on specific ingestion failure scenarios

Here's a suggested enhancement:

```diff
 class IngestionError(CogneeApiError):
+    """Exception raised for errors during data ingestion process.
+
+    Attributes:
+        message -- explanation of the error
+        name -- name of the error type
+        status_code -- HTTP status code
+        details -- additional error details
+    """
     def __init__(
         self,
         message: str = "Failed to load data.",
         name: str = "IngestionError",
         status_code=status.HTTP_422_UNPROCESSABLE_ENTITY,
+        details: dict = None
     ):
-        super().__init__(message, name, status_code)
+        self.details = details
+        full_message = f"{message} Details: {details}" if details else message
+        super().__init__(full_message, name, status_code)
```

This enhancement would allow for more detailed error reporting:

```python
# Usage example
raise IngestionError(
    message="Failed to load CSV data",
    details={"reason": "Invalid column format", "column": "timestamp"}
)
```
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
📒 Files selected for processing (3)
- cognee/shared/exceptions/__init__.py (1 hunks)
- cognee/shared/exceptions/exceptions.py (1 hunks)
- cognee/shared/utils.py (3 hunks)
✅ Files skipped from review due to trivial changes (1)
- cognee/shared/exceptions/__init__.py
🚧 Files skipped from review as they are similar to previous changes (1)
- cognee/shared/utils.py
🔇 Additional comments (1)
cognee/shared/exceptions/exceptions.py (1)
1-2: LGTM! Appropriate imports for exception handling.
The imports are correctly specified, bringing in the necessary base exception class and HTTP status codes.
Great job with this implementation and writing that extensive test!
cognee/modules/ingestion/identify.py (Outdated)

```python
def identify(data: IngestionData, user: User) -> str:
    data_content_hash: str = data.get_identifier()
    # return UUID hash of file contents + owner id
    return uuid5(NAMESPACE_OID,f"{data_content_hash}{user.id}")
```
Some space is missing.
I would say we need owner_id for permissions regarding some Data operations as they are no longer tied to Datasets.
content_hash we don't need right now, but might be nice to have as it's the only way to verify the content of the Data
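The identity scheme discussed here — a deterministic UUID derived from the content hash plus the owner id — can be illustrated with a small sketch (the helper name is hypothetical; only the uuid5 call mirrors the PR's approach):

```python
from uuid import NAMESPACE_OID, uuid5

def data_id(content_hash: str, owner_id: str) -> str:
    # Same content + same owner always yields the same id, so re-adding
    # identical data resolves to the existing Data row instead of a new one.
    return str(uuid5(NAMESPACE_OID, f"{content_hash}{owner_id}"))

a = data_id("abc123", "user-1")
b = data_id("abc123", "user-1")
c = data_id("abc123", "user-2")
assert a == b  # deduplicated per user
assert a != c  # different owners get distinct records
```

This is why deduplication is scoped per user: the owner id is mixed into the hash input, so two users adding identical content still get separate Data records.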
Add space and newline to ingest function Refactor COG-505
…eretes/cognee into COG-505-data-dataset-model-changes
Summary by CodeRabbit

New Features

- content_hash and owner_id added to data entries.

Bug Fixes

Tests

Documentation