Cog 505 data dataset model changes #260
Conversation
Added ability to store single data instance in multiple datasets Feature COG-505
If text is added to cognee, it is saved by hash, so the same text can't be stored multiple times Feature COG-505
Data is deduplicated per user, so if a user tries to add data that already exists, the request is redirected to the existing data in the database Feature COG-505
Resolve issue with dlt data not being merged for data_id Fix COG-505
Walkthrough

This pull request introduces a new GitHub Actions workflow for testing deduplication functionality, triggered by manual dispatch or pull requests to the main branch.

Changes
Added end to end test for text deduplication Test COG-505
…eretes/cognee into COG-505-data-dataset-model-changes
| GitGuardian id | GitGuardian status | Secret | Commit | Filename |
|---|---|---|---|---|
| 9573981 | Triggered | Generic Password | 813b76c | .github/workflows/test_deduplication.yml |
| 9573981 | Triggered | Generic Password | 813b76c | cognee/tests/test_deduplication.py |
🛠 Guidelines to remediate hardcoded secrets
- Understand the implications of revoking this secret by investigating where it is used in your code.
- Replace and store your secrets safely. Learn here the best practices.
- Revoke and rotate these secrets.
- If possible, rewrite git history. Rewriting git history is not a trivial act. You might completely break other contributing developers' workflow and you risk accidentally deleting legitimate data.
To avoid such incidents in the future, consider:
- following these best practices for managing and storing secrets, including API keys and other credentials
- installing secret detection on pre-commit to catch secrets before they leave your machine and ease remediation.
🦉 GitGuardian detects secrets in your source code to help developers and security teams secure the modern development process. You are seeing this because you or someone else with access to this repository has authorized GitGuardian to scan your pull request.
Add testing of dataset data table content Test COG-505
Actionable comments posted: 14
🧹 Outside diff range and nitpick comments (14)
cognee/tasks/ingestion/ingest_data_with_metadata.py (1)

28-28: Use a safer method to remove the 'file://' prefix from file_path

Using file_path.replace("file://", "") may unintentionally replace occurrences of 'file://' elsewhere in the path. It's safer to remove the prefix only if it exists at the start of the string. Apply this diff:

```diff
-    with open(file_path.replace("file://", ""), mode="rb") as file:
+    if file_path.startswith("file://"):
+        file_path = file_path[len("file://"):]
+    with open(file_path, mode="rb") as file:
```

The same change applies to line 62. Also applies to: 62-62
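A related note: on Python 3.9+, str.removeprefix achieves the same effect more concisely, stripping "file://" only when it occurs at the start of the string. A minimal sketch (the helper name is illustrative, not the project's code):

```python
def strip_file_scheme(path: str) -> str:
    # removeprefix (Python 3.9+) removes "file://" only when it starts the string
    return path.removeprefix("file://")

print(strip_file_scheme("file:///tmp/data.txt"))    # /tmp/data.txt
print(strip_file_scheme("/docs/file://notes.txt"))  # unchanged
```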
cognee/modules/ingestion/identify.py (1)

6-9: Ensure user.id is a string when generating the UUID

When concatenating data_content_hash with user.id to generate the UUID, ensure that user.id is converted to a string to prevent type errors. Apply this diff:

```diff
 def identify(data: IngestionData, user: User) -> str:
     data_content_hash: str = data.get_identifier()
     # Return UUID hash of file contents + owner id
-    return uuid5(NAMESPACE_OID, f"{data_content_hash}{user.id}")
+    return uuid5(NAMESPACE_OID, f"{data_content_hash}{str(user.id)}")
```

cognee/tasks/ingestion/save_data_item_to_storage.py (1)
13-13: Use a safer method to handle the 'file://' prefix in data_item

Using data_item.replace("file://", "") may unintentionally replace 'file://' elsewhere in the string. It's better to check whether the string starts with 'file://' and remove only that prefix. Apply this diff:

```diff
 elif isinstance(data_item, str):
     # Data is a file path
-    if data_item.startswith("file://") or data_item.startswith("/"):
-        file_path = data_item.replace("file://", "")
+    if data_item.startswith("file://"):
+        file_path = data_item[len("file://"):]
+    elif data_item.startswith("/"):
+        file_path = data_item
     # Data is text
     else:
         file_path = save_data_to_file(data_item)
```

Ensure that file_path is correctly assigned in all cases.

cognee/modules/ingestion/data_types/BinaryData.py (1)
20-20: Add a return type hint to get_identifier

For better type safety and documentation, add a return type hint to the method signature. Apply this diff:

```diff
-    def get_identifier(self):
+    def get_identifier(self) -> str:
```

cognee/modules/ingestion/save_data_to_file.py (1)
24-26: Add type hints and improve the file existence check

The file existence check could be improved with better type hints and path handling. Apply this diff:

```diff
-    # Don't save file if it already exists
-    if not os.path.isfile(os.path.join(storage_path, file_name)):
-        LocalStorage(storage_path).store(file_name, classified_data.get_data())
+    file_path: str = os.path.join(storage_path, file_name)
+    # Don't save file if it already exists
+    if not os.path.isfile(file_path):
+        LocalStorage(storage_path).store(file_name, classified_data.get_data())
```

cognee/tasks/ingestion/save_data_item_with_metadata_to_storage.py (1)
Line range hint 6-8: Consider removing the unused dataset_name parameter

The dataset_name parameter is only used for the llama_index data type. Either remove the parameter if it is not needed, or document its purpose:

```diff
-async def save_data_item_with_metadata_to_storage(
-    data_item: Union[BinaryIO, str, Any], dataset_name: str
-) -> str:
+async def save_data_item_with_metadata_to_storage(
+    data_item: Union[BinaryIO, str, Any],
+    dataset_name: str | None = None  # Only used for llama_index data type
+) -> str:
```

cognee/modules/data/models/Data.py (1)
Line range hint 44-54: Update the to_json() method to include the new fields

The to_json() method should be updated to include the new owner_id and content_hash fields for consistency.

```diff
 def to_json(self) -> dict:
     return {
         "id": str(self.id),
         "name": self.name,
         "extension": self.extension,
         "mimeType": self.mime_type,
         "rawDataLocation": self.raw_data_location,
+        "ownerId": str(self.owner_id),
+        "contentHash": self.content_hash,
         "createdAt": self.created_at.isoformat(),
         "updatedAt": self.updated_at.isoformat() if self.updated_at else None,
     }
```

.github/workflows/test_deduplication.yml (1)
1-14: Add a workflow timeout and dependency caching

Consider adding these improvements to the workflow:

- Add a timeout to prevent hanging jobs
- Add caching for Poetry dependencies

```diff
 name: test | deduplication

 on:
   workflow_dispatch:
   pull_request:
     branches:
       - main
     types: [labeled, synchronize]

+timeout-minutes: 30

 concurrency:
   group: ${{ github.workflow }}-${{ github.event.pull_request.number || github.ref }}
   cancel-in-progress: true

+    steps:
+      # Add after Setup Python step
+      - name: Cache Poetry dependencies
+        uses: actions/cache@v3
+        with:
+          path: .venv
+          key: venv-${{ runner.os }}-${{ hashFiles('**/poetry.lock') }}
```

cognee/api/v1/add/add.py (3)
Line range hint 116-127: Update Data creation to include the new required fields

The Data creation logic needs to be updated to support the new data model changes:

- Set owner_id from the user parameter
- Add content_hash from file_metadata

```diff
 data = Data(
     id = data_id,
     name = file_metadata["name"],
     raw_data_location = file_metadata["file_path"],
     extension = file_metadata["extension"],
     mime_type = file_metadata["mime_type"],
+    owner_id = user.id,
+    content_hash = file_metadata.get("content_hash"),
 )
```
Line range hint 98-106: Add a deduplication check using content_hash

Before creating a new Data record, check whether content with the same hash already exists, to implement deduplication.

```diff
 data = (await session.execute(
-    select(Data).filter(Data.id == data_id)
+    select(Data).filter(
+        (Data.id == data_id) |
+        (Data.content_hash == file_metadata.get("content_hash"))
+    )
 )).scalar_one_or_none()
```
Line range hint 134-143: Include the new fields in the yielded metadata

Update the yielded metadata to include the new owner_id and content_hash fields for consistency with the Data model.

```diff
 yield {
     "id": data_id,
     "name": file_metadata["name"],
     "file_path": file_metadata["file_path"],
     "extension": file_metadata["extension"],
     "mime_type": file_metadata["mime_type"],
+    "owner_id": str(user.id),
+    "content_hash": file_metadata.get("content_hash"),
 }
```

cognee/tests/test_deduplication.py (2)
11-97: Consider enhancing test robustness and error handling

While the test structure is good, consider these improvements:

- Add error handling for file operations
- Add cleanup between different file type tests to ensure complete isolation
- Add negative test cases (e.g., different content with the same names)

137-154: Add test environment cleanup

Consider adding cleanup of the test directories after the tests complete, to prevent test data accumulation.

```diff
 async def main():
+    try:
         data_directory_path = str(
             pathlib.Path(
                 os.path.join(pathlib.Path(__file__).parent, ".data_storage/test_deduplication")
             ).resolve()
         )
         cognee.config.data_root_directory(data_directory_path)
         cognee_directory_path = str(
             pathlib.Path(
                 os.path.join(pathlib.Path(__file__).parent, ".cognee_system/test_deduplication")
             ).resolve()
         )
         cognee.config.system_root_directory(cognee_directory_path)

         await test_deduplication_postgres()
         await test_deduplication_sqlite()
+    finally:
+        # Clean up test directories
+        import shutil
+        shutil.rmtree(data_directory_path, ignore_errors=True)
+        shutil.rmtree(cognee_directory_path, ignore_errors=True)
```

cognee/tests/test_data/Natural_language_processing_copy.txt (1)
1-1: Fix a grammatical issue with a missing comma

Add a comma before "as well as" in the compound predicate for better readability and grammatical correctness.

```diff
- The technology can then accurately extract information and insights contained in the documents as well as categorize and organize the documents themselves.
+ The technology can then accurately extract information and insights contained in the documents, as well as categorize and organize the documents themselves.
```

🧰 Tools
🪛 LanguageTool
[uncategorized] ~1-~1: Possible missing comma found.
Context: ...formation and insights contained in the documents as well as categorize and organize the ...(AI_HYDRA_LEO_MISSING_COMMA)
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
⛔ Files ignored due to path filters (4)

- cognee/tests/test_data/example.png is excluded by !**/*.png
- cognee/tests/test_data/example_copy.png is excluded by !**/*.png
- cognee/tests/test_data/text_to_speech.mp3 is excluded by !**/*.mp3
- cognee/tests/test_data/text_to_speech_copy.mp3 is excluded by !**/*.mp3
📒 Files selected for processing (14)
- .github/workflows/test_deduplication.yml (1 hunks)
- cognee/api/v1/add/add.py (1 hunks)
- cognee/infrastructure/files/utils/get_file_metadata.py (3 hunks)
- cognee/modules/data/models/Data.py (1 hunks)
- cognee/modules/data/operations/write_metadata.py (0 hunks)
- cognee/modules/ingestion/data_types/BinaryData.py (1 hunks)
- cognee/modules/ingestion/identify.py (1 hunks)
- cognee/modules/ingestion/save_data_to_file.py (1 hunks)
- cognee/tasks/ingestion/ingest_data_with_metadata.py (6 hunks)
- cognee/tasks/ingestion/save_data_item_to_storage.py (1 hunks)
- cognee/tasks/ingestion/save_data_item_with_metadata_to_storage.py (2 hunks)
- cognee/tasks/ingestion/transform_data.py (1 hunks)
- cognee/tests/test_data/Natural_language_processing_copy.txt (1 hunks)
- cognee/tests/test_deduplication.py (1 hunks)
💤 Files with no reviewable changes (1)
- cognee/modules/data/operations/write_metadata.py
🧰 Additional context used
🪛 Ruff (0.8.0)
cognee/tasks/ingestion/transform_data.py
14-14: Use `is` and `is not` for type comparisons, or `isinstance()` for isinstance checks (E721)
🪛 actionlint (1.7.4)
.github/workflows/test_deduplication.yml
26-26: if: condition "needs.get_docs_changes.outputs.changes_outside_docs == 'true' && ${{ github.event.label.name == 'run-checks' }}" is always evaluated to true because extra characters are around ${{ }}
(if-cond)
🪛 LanguageTool
cognee/tests/test_data/Natural_language_processing_copy.txt
[uncategorized] ~1-~1: Possible missing comma found.
Context: ...formation and insights contained in the documents as well as categorize and organize the ...
(AI_HYDRA_LEO_MISSING_COMMA)
🔇 Additional comments (1)
cognee/tests/test_deduplication.py (1)
1-10: LGTM: Test setup and imports are appropriate
The imports and logging configuration are well-structured for testing purposes.
```yaml
steps:
  - name: Check out
    uses: actions/checkout@master
```
Security: Use a specific version for the checkout action

Using @master for actions can be dangerous, as it may introduce breaking changes or security issues.

```diff
- uses: actions/checkout@master
+ uses: actions/checkout@v4
```
Add missing function to get data from database to multimedia deduplication test Test COG-505
Actionable comments posted: 2
🧹 Outside diff range and nitpick comments (5)
cognee/tests/test_deduplication.py (5)
9-9: Consider adjusting the logging level for tests

Setting logging to DEBUG level might generate too much output during test runs. Consider using INFO level for tests unless actively debugging.

```diff
-logging.basicConfig(level=logging.DEBUG)
+logging.basicConfig(level=logging.INFO)
```

11-11: Add a docstring to explain the test's purpose and requirements

Add a docstring documenting the test's purpose, prerequisites (test data files), and expected behavior.

```diff
 async def test_deduplication():
+    """Test deduplication functionality across different data types.
+
+    Prerequisites:
+    - Test data files in ./test_data/ directory:
+      - Natural_language_processing.txt and its copy
+      - example.png and its copy
+      - text_to_speech.mp3 and its copy
+
+    Tests:
+    - Deduplication of text files
+    - Deduplication of text content
+    - Deduplication of image files
+    - Deduplication of sound files
+    """
```

48-54: Move the test text content to a separate file

Long text content should be moved to a test data file to improve readability and maintainability.

```diff
-    text = """A quantum computer is a computer that takes advantage...
-    """
+    text_path = os.path.join(
+        pathlib.Path(__file__).parent, "test_data/quantum_computing.txt"
+    )
+    with open(text_path, 'r') as f:
+        text = f.read()
```

44-45: Reduce redundant cleanup calls

Consider using pytest fixtures for setup and cleanup to avoid repetition and ensure consistent cleanup.

```python
import pytest

@pytest.fixture(autouse=True)
async def cleanup():
    await cognee.prune.prune_data()
    await cognee.prune.prune_system(metadata=True)
    yield
    await cognee.prune.prune_data()
    await cognee.prune.prune_system(metadata=True)
```

Also applies to: 63-64, 80-81, 97-98

101-137: Use pytest parametrize for the database configurations

Instead of separate functions, use pytest's parametrize feature to run the same test with different database configurations.

```python
@pytest.mark.parametrize("db_config", [
    {
        "vector_db_config": {
            "vector_db_url": "",
            "vector_db_key": "",
            "vector_db_provider": "pgvector"
        },
        "relational_db_config": {
            "db_name": "cognee_db",
            "db_host": "127.0.0.1",
            "db_port": "5432",
            "db_username": "cognee",
            "db_password": "cognee",
            "db_provider": "postgres",
        }
    },
    {
        "vector_db_config": {
            "vector_db_url": "",
            "vector_db_key": "",
            "vector_db_provider": "lancedb"
        },
        "relational_db_config": {
            "db_provider": "sqlite",
        }
    }
])
async def test_deduplication_with_db(db_config):
    cognee.config.set_vector_db_config(db_config["vector_db_config"])
    cognee.config.set_relational_db_config(db_config["relational_db_config"])
    await test_deduplication()
```
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
📒 Files selected for processing (1)
- cognee/tests/test_deduplication.py (1 hunks)
🔇 Additional comments (1)
cognee/tests/test_deduplication.py (1)
101-121: Enhance database configuration security and validation
The previous review comment about using environment variables for database credentials and adding configuration validation is still valid.
Moved hash calculation of file to shared utils, added better typing Refactor COG-505
Actionable comments posted: 1
🧹 Outside diff range and nitpick comments (1)
cognee/shared/utils.py (1)
76-95: Add a docstring to explain the function's usage and purpose

While the implementation is solid, please add a docstring that explains:

- The function's purpose
- Parameters and their types
- Return value
- Usage examples

Example docstring:

```python
def get_file_content_hash(file_obj: Union[str, BinaryIO]) -> str:
    """
    Compute MD5 hash of file content, reading in chunks for memory efficiency.

    Args:
        file_obj: Either a file path (str) or a file-like object (BinaryIO)

    Returns:
        str: Hexadecimal representation of the MD5 hash

    Example:
        >>> hash1 = get_file_content_hash("path/to/file.txt")
        >>> with open("path/to/file.txt", "rb") as f:
        ...     hash2 = get_file_content_hash(f)
    """
```
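The chunked-hashing approach this docstring describes can be sketched as follows. This is a minimal illustration of the technique, not the project's actual implementation — the helper name and chunk size are assumptions:

```python
import hashlib
from typing import BinaryIO, Union

def content_hash(file_obj: Union[str, BinaryIO], chunk_size: int = 65536) -> str:
    # Hash in fixed-size chunks so large files never load fully into memory
    md5 = hashlib.md5()
    if isinstance(file_obj, str):
        with open(file_obj, "rb") as f:
            for chunk in iter(lambda: f.read(chunk_size), b""):
                md5.update(chunk)
    else:
        for chunk in iter(lambda: file_obj.read(chunk_size), b""):
            md5.update(chunk)
    return md5.hexdigest()
```

Because the digest depends only on the bytes read, hashing via a path or via an already-open binary file yields the same result.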
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
📒 Files selected for processing (3)
- cognee/infrastructure/files/utils/get_file_metadata.py (2 hunks)
- cognee/shared/utils.py (2 hunks)
- cognee/tasks/ingestion/ingest_data_with_metadata.py (6 hunks)
🚧 Files skipped from review as they are similar to previous changes (2)
- cognee/infrastructure/files/utils/get_file_metadata.py
- cognee/tasks/ingestion/ingest_data_with_metadata.py
🔇 Additional comments (1)
cognee/shared/utils.py (1)
3-6: LGTM: Imports are correctly placed and necessary.
The new imports support the added functionality for file content hashing.
Added error handling to reading of file in hash util Refactor COG-505
Actionable comments posted: 0
🧹 Outside diff range and nitpick comments (1)
cognee/shared/exceptions/exceptions.py (1)
4-11: Consider enhancing error handling and documentation

While the implementation is functionally correct, consider these improvements:

- Add a docstring to document the exception's purpose and usage
- Consider adding an optional details parameter for more specific error information
- Allow for different status codes based on specific ingestion failure scenarios

Here's a suggested enhancement:

```diff
 class IngestionError(CogneeApiError):
+    """Exception raised for errors during data ingestion process.
+
+    Attributes:
+        message -- explanation of the error
+        name -- name of the error type
+        status_code -- HTTP status code
+        details -- additional error details
+    """
     def __init__(
         self,
         message: str = "Failed to load data.",
         name: str = "IngestionError",
         status_code=status.HTTP_422_UNPROCESSABLE_ENTITY,
+        details: dict = None
     ):
-        super().__init__(message, name, status_code)
+        self.details = details
+        full_message = f"{message} Details: {details}" if details else message
+        super().__init__(full_message, name, status_code)
```

This enhancement would allow for more detailed error reporting:

```python
# Usage example
raise IngestionError(
    message="Failed to load CSV data",
    details={"reason": "Invalid column format", "column": "timestamp"}
)
```
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
📒 Files selected for processing (3)
- cognee/shared/exceptions/__init__.py (1 hunks)
- cognee/shared/exceptions/exceptions.py (1 hunks)
- cognee/shared/utils.py (3 hunks)
✅ Files skipped from review due to trivial changes (1)
- cognee/shared/exceptions/__init__.py
🚧 Files skipped from review as they are similar to previous changes (1)
- cognee/shared/utils.py
🔇 Additional comments (1)
cognee/shared/exceptions/exceptions.py (1)
1-2: LGTM! Appropriate imports for exception handling.
The imports are correctly specified, bringing in the necessary base exception class and HTTP status codes.
Great job with this implementation and writing that extensive test!
cognee/modules/ingestion/identify.py (Outdated)

```python
def identify(data: IngestionData, user: User) -> str:
    data_content_hash: str = data.get_identifier()
    # return UUID hash of file contents + owner id
    return uuid5(NAMESPACE_OID,f"{data_content_hash}{user.id}")
```
Some space is missing.
I would say we need owner_id for permissions regarding some Data operations as they are no longer tied to Datasets.
content_hash we don't need right now, but might be nice to have as it's the only way to verify the content of the Data
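The identity scheme discussed here — a deterministic UUID derived from the content hash plus the owner id — can be illustrated with a small sketch (the helper name is hypothetical; only the uuid5 call mirrors the PR's approach):

```python
from uuid import NAMESPACE_OID, uuid5

def data_id(content_hash: str, owner_id: str) -> str:
    # Same content + same owner always yields the same id, so re-adding
    # identical data resolves to the existing Data row instead of a new one.
    return str(uuid5(NAMESPACE_OID, f"{content_hash}{owner_id}"))

a = data_id("abc123", "user-1")
b = data_id("abc123", "user-1")
c = data_id("abc123", "user-2")
assert a == b  # deduplicated per user
assert a != c  # different owners get distinct records
```

This is why deduplication is scoped per user: the owner id is mixed into the hash input, so two users adding identical content still get separate Data records.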
Add space and newline to ingest function Refactor COG-505
…eretes/cognee into COG-505-data-dataset-model-changes
Summary by CodeRabbit

New Features

- content_hash and owner_id added to data entries.

Bug Fixes

Tests

Documentation