
Cog 337 llama index support #186

Merged
merged 21 commits into from
Nov 17, 2024

Conversation

@dexters1 dexters1 (Collaborator) commented Nov 8, 2024

Summary by CodeRabbit

Release Notes

  • New Features

    • Introduced the ingest_data_with_metadata function for ingesting data along with metadata.
    • Added the save_data_item_to_storage function to handle various data item types for storage.
    • Created a new Jupyter notebook for utilizing LlamaIndex documents within the Cognee framework.
    • Added the save_data_item_with_metadata_to_storage function for enhanced data storage capabilities.
  • Improvements

    • Updated error messages in permission checks to use user email instead of username.
    • Streamlined the save_data_to_storage function for improved efficiency.
    • Simplified parameter passing in the cognee.search function.
  • Dependencies

    • Updated version constraints for instructor and tenacity packages.
    • Added new dependency llama-index-core.
    • Updated psycopg2 to include an optional attribute.
  • Chores

    • Added a new GitHub Actions workflow for testing the LlamaIndex notebook.

Added support for LlamaIndex Document type

Feature #COG-337
Added jupyter notebook which demonstrates cognee with LlamaIndex document type usage

Docs #COG-337
Allow usage of metadata from LlamaIndex documents

Feature #COG-337
Change name of llama index function

Refactor #COG-337
Downgrade needed on tenacity and instructor modules to support llama index

Chore #COG-337
Added task that will have access to metadata if data is provided from different data ingestion tools

Feature #COG-337
Explained why specific type checking is used instead of isinstance, as isinstance returns True for child classes as well

Docs #COG-337
Added missing parameter to function call

Fix #COG-337
Moved data storing from async to sync

Refactor #COG-337
Refactor so ingest_data file was modified instead of having two ingest tasks

Refactor #COG-337
Merged new and old data ingestion tasks into one

Refactor #COG-337
Returned ingest_data and save_data_to_storage tasks

Refactor #COG-337
Returned previous ingestion tasks to add function

Refactor #COG-337
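
Taken together, these commits let LlamaIndex Document objects flow straight into cognee. A minimal usage sketch, assuming the entry points shown in the notebook and examples later in this PR (cognee.add, cognee.cognify, cognee.search with the SearchType enum); it illustrates the intended flow rather than reproducing the notebook verbatim:

import asyncio

import cognee
from cognee import SearchType
from llama_index.core import Document

async def main():
    # Any LlamaIndex Document (including ones carrying metadata) can be passed to add()
    documents = [Document(text="Example title: example article body.")]

    await cognee.add(documents)   # routed through the new LlamaIndex-aware ingestion path
    await cognee.cognify()        # build the knowledge graph from the ingested documents

    results = await cognee.search(SearchType.SUMMARIES, query="What is the article about?")
    print(results)

asyncio.run(main())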
coderabbitai bot (Contributor) commented Nov 8, 2024

Walkthrough

This pull request introduces several changes across multiple files in the cognee project. Key modifications include the addition of new functions for data ingestion and storage, updates to error handling messages, and adjustments to import statements. A new Jupyter notebook is also included to demonstrate the integration of LlamaIndex documents. Additionally, the pyproject.toml file has updated dependency versions and added a new dependency.

Changes

  • cognee/api/v1/add/add_v2.py: Added a newline at the end of the file; no changes to function logic.
  • cognee/modules/users/permissions/methods/check_permission_on_documents.py: Updated exception message to use user's email instead of username in PermissionDeniedException.
  • cognee/tasks/ingestion/__init__.py: Added imports for save_data_item_to_storage and save_data_item_with_metadata_to_storage functions.
  • cognee/tasks/ingestion/ingest_data.py: Removed import for get_relational_config; no changes to function logic.
  • cognee/tasks/ingestion/ingest_data_with_metadata.py: Introduced ingest_data_with_metadata function for data ingestion with metadata.
  • cognee/tasks/ingestion/save_data_item_to_storage.py: Introduced save_data_item_to_storage function for saving various data items.
  • cognee/tasks/ingestion/save_data_to_storage.py: Replaced import for save_data_to_file with save_data_item_to_storage; simplified logic in save_data_to_storage.
  • cognee/tasks/ingestion/transform_data.py: Introduced get_data_from_llama_index function for handling document types.
  • notebooks/cognee_llama_index.ipynb: New notebook demonstrating LlamaIndex integration with a sample dataset.
  • pyproject.toml: Updated dependency versions for instructor and tenacity, and added new dependency llama-index-core.
  • .github/workflows/test_cognee_llama_index_notebook.yml: New workflow added for testing the LlamaIndex notebook.
  • examples/python/simple_example.py: Modified parameter passing to cognee.search, switching from dictionary to keyword arguments.

Possibly related PRs

  • Cog 170 pgvector adapter #158: The changes in this PR add support for PGVector, which is directly related to the modifications in the main PR that involve the add function in add_v2.py, where the new function create_pgvector_db_and_tables is introduced.
  • Feat/cog 553 graph memory projection #196: This PR introduces the CogneeGraph class, which is relevant to the main PR's focus on graph-related functionalities, particularly in how data is structured and managed within the graph context.
  • Cog 577 add unit test task #197: This PR adds unit tests, which may relate to the changes in the main PR by ensuring that the functionalities introduced, such as those in the add function, are properly tested and validated.

Poem

🐰 In the code where bunnies hop,
New functions bloom, they never stop.
With data stored and permissions clear,
Our project grows, let's give a cheer!
LlamaIndex leads the way,
In Cognee's world, we play all day! 🌼


📜 Recent review details

Configuration used: CodeRabbit UI
Review profile: CHILL

📥 Commits

Reviewing files that changed from the base of the PR and between c4700b1 and f6e8294.

📒 Files selected for processing (1)
  • pyproject.toml (4 hunks)
🚧 Files skipped from review as they are similar to previous changes (1)
  • pyproject.toml


gitguardian bot commented Nov 8, 2024

⚠️ GitGuardian has uncovered 2 secrets following the scan of your pull request.

Please consider investigating the findings and remediating the incidents. Failure to do so may lead to compromising the associated services or software components.

🔎 Detected hardcoded secrets in your pull request
  • 9573981 (Triggered): Generic Password in commit b05a07f, notebooks/cognee_llama_index.ipynb
  • 13947935 (Triggered): Generic High Entropy Secret in commit f6e8294, cognee/shared/utils.py
🛠 Guidelines to remediate hardcoded secrets
  1. Understand the implications of revoking this secret by investigating where it is used in your code.
  2. Replace and store your secrets safely, following best practices for secret storage.
  3. Revoke and rotate these secrets.
  4. If possible, rewrite git history. Rewriting git history is not a trivial act. You might completely break other contributing developers' workflow and you risk accidentally deleting legitimate data.

To avoid such incidents in the future consider


🦉 GitGuardian detects secrets in your source code to help developers and security teams secure the modern development process. You are seeing this because you or someone else with access to this repository has authorized GitGuardian to scan your pull request.

@coderabbitai coderabbitai bot (Contributor) left a comment

Actionable comments posted: 12

🧹 Outside diff range and nitpick comments (10)
cognee/tasks/ingestion/save_data_item_to_storage.py (2)

6-6: Add docstring to improve code documentation.

The function lacks a docstring explaining its purpose, parameters, return value, and possible exceptions.

 def save_data_item_to_storage(data_item: Union[BinaryIO, Document, str], dataset_name: str) -> str:
+    """Save a data item to storage based on its type.
+
+    Args:
+        data_item: The data to save. Can be a file object, LlamaIndex Document, or string path/content
+        dataset_name: Name of the dataset to save the data under
+
+    Returns:
+        str: Path where the data was saved
+
+    Raises:
+        ValueError: If the data_item type is not supported
+    """

22-23: Enhance error message with supported types.

The current error message could be more helpful by listing the supported types.

-        raise ValueError(f"Data type not supported: {type(data_item)}")
+        raise ValueError(
+            f"Data type not supported: {type(data_item)}. "
+            "Supported types are: BinaryIO, Document, or str (file path or content)"
+        )
cognee/modules/users/permissions/methods/check_permission_on_documents.py (1)

36-36: Add null check for user.email

While changing from username to email makes sense, we should ensure the email field is not null to prevent unclear error messages.

-            raise PermissionDeniedException(f"User {user.email} does not have {permission_type} permission on documents")
+            user_identifier = user.email or "Unknown user"
+            raise PermissionDeniedException(f"User {user_identifier} does not have {permission_type} permission on documents")
cognee/tasks/ingestion/ingest_data.py (4)

Line range hint 11-24: Consider splitting the ingest_data function for better maintainability.

The function handles multiple responsibilities including pipeline setup, data ingestion, and telemetry. Consider breaking it down into smaller, focused functions.

Suggested structure:

async def setup_pipeline(destination):
    return dlt.pipeline(
        pipeline_name="file_load_from_filesystem",
        destination=destination,
    )

async def ingest_data(file_paths: list[str], dataset_name: str, user: User):
    destination = get_dlt_destination()
    pipeline = await setup_pipeline(destination)
    # ... rest of the code

Line range hint 31-63: Add explicit error handling for database operations.

The database operations lack explicit error handling which could lead to inconsistent states in case of failures. Consider adding try-catch blocks and explicit transaction management.

Suggested improvement:

async def handle_data_record(session, data_id, file_metadata, dataset):
    try:
        data = (await session.execute(
            select(Data).filter(Data.id == data_id)
        )).scalar_one_or_none()

        if data is not None:
            data.name = file_metadata["name"]
            data.raw_data_location = file_metadata["file_path"]
            data.extension = file_metadata["extension"]
            data.mime_type = file_metadata["mime_type"]
            await session.merge(data)
        else:
            data = Data(
                id=data_id,
                name=file_metadata["name"],
                raw_data_location=file_metadata["file_path"],
                extension=file_metadata["extension"],
                mime_type=file_metadata["mime_type"],
            )
            dataset.data.append(data)
        
        await session.commit()
        return data
    except Exception as e:
        await session.rollback()
        raise RuntimeError(f"Failed to handle data record: {str(e)}")

Line range hint 26-30: Add input validation for file paths and secure file handling.

The current implementation directly uses file paths with minimal validation, which could pose security risks. Consider adding path validation and sanitization.

from pathlib import Path
import os

def validate_file_path(file_path: str) -> str:
    # Remove file:// prefix and normalize path
    cleaned_path = file_path.replace("file://", "")
    abs_path = os.path.abspath(cleaned_path)
    
    # Ensure path is within allowed directory
    if not abs_path.startswith('/allowed/base/path'):
        raise ValueError("File path is outside allowed directory")
    
    if not os.path.exists(abs_path):
        raise FileNotFoundError(f"File not found: {abs_path}")
        
    return abs_path

# Use in the function:
validated_path = validate_file_path(file_path)
with open(validated_path, mode="rb") as file:
    # ... rest of the code

Line range hint 65-67: Ensure atomic operations for permission assignment.

The permission assignment happens after yielding data, which could lead to a race condition where data is accessible before permissions are properly set.

Consider moving the permission assignment before yielding the data:

await give_permission_on_document(user, data_id, "read")
await give_permission_on_document(user, data_id, "write")

yield {
    "id": data_id,
    "name": file_metadata["name"],
    "file_path": file_metadata["file_path"],
    "extension": file_metadata["extension"],
    "mime_type": file_metadata["mime_type"],
}
notebooks/cognee_llama_index.ipynb (3)

27-34: Add error handling and data validation for robust data loading.

Consider these improvements for production readiness:

  1. Add try-except block for data loading
  2. Validate the loaded data (check for required columns, non-empty values)
  3. Make the sample size configurable
  4. Consider caching the dataset locally
+try:
     news = pd.read_csv(
         "https://raw.githubusercontent.com/tomasonjo/blog-datasets/main/news_articles.csv"
-    )[:5]
+    ).head(5)  # Using head() is more explicit than slice
+    
+    # Validate required columns
+    required_columns = ['title', 'text']
+    if not all(col in news.columns for col in required_columns):
+        raise ValueError(f"Dataset missing required columns: {required_columns}")
+        
+    # Validate non-empty values
+    if news[required_columns].isnull().any().any():
+        raise ValueError("Dataset contains null values in required columns")
+except Exception as e:
+    raise RuntimeError(f"Failed to load news dataset: {str(e)}")

50-53: Enhance document preparation with metadata and better formatting.

The current document preparation could be improved by:

  1. Adding clear separation between title and text
  2. Including metadata for better document tracking
  3. Adding error handling
 documents = [
-    Document(text=f"{row['title']}: {row['text']}")
+    Document(
+        text=f"Title: {row['title']}\n\nContent: {row['text']}",
+        metadata={
+            "title": row['title'],
+            "doc_id": f"news_{i}",
+            "source": "diffbot_news_dataset"
+        }
+    )
     for i, row in news.iterrows()
 ]

1-205: Consider converting this notebook into a production-ready tutorial.

While this notebook effectively demonstrates the LlamaIndex integration, consider these improvements for production use:

  1. Add a requirements.txt or environment.yml file
  2. Include proper error handling throughout
  3. Add detailed documentation about prerequisites and setup
  4. Consider creating a proper Python package with the core functionality
  5. Add example outputs and expected results

Would you like me to help create a production-ready version of this tutorial?

📜 Review details

Configuration used: CodeRabbit UI
Review profile: CHILL

📥 Commits

Reviewing files that changed from the base of the PR and between 1e01e84 and 8c7fdc4.

⛔ Files ignored due to path filters (1)
  • poetry.lock is excluded by !**/*.lock
📒 Files selected for processing (10)
  • cognee/api/v1/add/add_v2.py (1 hunks)
  • cognee/modules/users/permissions/methods/check_permission_on_documents.py (1 hunks)
  • cognee/tasks/ingestion/__init__.py (1 hunks)
  • cognee/tasks/ingestion/ingest_data.py (1 hunks)
  • cognee/tasks/ingestion/ingest_data_with_metadata.py (1 hunks)
  • cognee/tasks/ingestion/save_data_item_to_storage.py (1 hunks)
  • cognee/tasks/ingestion/save_data_to_storage.py (2 hunks)
  • cognee/tasks/ingestion/transform_data.py (1 hunks)
  • notebooks/cognee_llama_index.ipynb (1 hunks)
  • pyproject.toml (3 hunks)
✅ Files skipped from review due to trivial changes (1)
  • cognee/api/v1/add/add_v2.py
🧰 Additional context used
🪛 Ruff
cognee/tasks/ingestion/transform_data.py

8-8: Use is and is not for type comparisons, or isinstance() for isinstance checks

(E721)


14-14: Use is and is not for type comparisons, or isinstance() for isinstance checks

(E721)

🔇 Additional comments (10)
cognee/tasks/ingestion/__init__.py (1)

3-3: LGTM! Verify the imported module exists.

The new import follows the consistent pattern of the module and aligns with the PR's objective of adding LlamaIndex support.

Let's verify the imported module exists:

✅ Verification successful

Import verified and correctly used across the codebase

The module save_data_item_to_storage.py exists in the expected location and the import statement is consistently used across multiple files in the ingestion package.

🏁 Scripts executed

The following scripts were executed for the analysis:

Script:

#!/bin/bash
# Description: Verify the existence of the imported module
# Expected: The file should exist in the same directory

# Test 1: Check if the module file exists
fd --type f "save_data_item_to_storage.py" cognee/tasks/ingestion/

# Test 2: Verify there are no typos in the import statement across the codebase
rg "from.*save_data_item_to_storage import" --type py

Length of output: 525

cognee/tasks/ingestion/save_data_to_storage.py (2)

1-2: LGTM! Import changes align with the new storage abstraction.

The switch to save_data_item_to_storage suggests a more flexible storage approach that can accommodate different types of data items, including LlamaIndex documents.


12-13: 🛠️ Refactor suggestion

Verify type handling and error propagation.

While the simplified code improves readability, there are a few concerns to address:

  1. The function signature suggests it only handles Union[BinaryIO, str], but the AI summary mentions Document support. Consider updating the type hints to reflect all supported types.
  2. There's no visible error handling for failed storage operations.

Let's verify the type handling in save_data_item_to_storage:

Consider adding error handling:

 def save_data_to_storage(data: Union[BinaryIO, str], dataset_name) -> list[str]:
     if not isinstance(data, list):
         data = [data]
     file_paths = []
     for data_item in data:
-        file_path = save_data_item_to_storage(data_item, dataset_name)
-        file_paths.append(file_path)
+        try:
+            file_path = save_data_item_to_storage(data_item, dataset_name)
+            file_paths.append(file_path)
+        except Exception as e:
+            raise RuntimeError(f"Failed to save data item: {str(e)}") from e
     return file_paths
cognee/tasks/ingestion/transform_data.py (1)

1-4: LGTM! Imports are well-organized and specific.

The imports are properly structured with external packages first (llama_index) followed by internal modules (cognee).

cognee/tasks/ingestion/save_data_item_to_storage.py (1)

1-4: LGTM! Imports and type hints are well-structured.

The imports are appropriate and type hints are properly defined.

cognee/modules/users/permissions/methods/check_permission_on_documents.py (1)

36-36: Verify consistency of user identification in error messages

Let's check if other permission-related error messages in the codebase use email consistently.

✅ Verification successful

Error message format is consistent with codebase patterns

The error message using user.email is consistent with the codebase's patterns. This is the only permission-related error message that includes user identification, and it appropriately uses the email attribute which is a standard identifier for users in the system.

🏁 Scripts executed

The following scripts were executed for the analysis:

Script:

#!/bin/bash
# Search for permission-related error messages to verify consistency
# Look for patterns that might use username or email in error messages

# Search for PermissionDeniedException usage
echo "Checking PermissionDeniedException messages:"
rg -A 1 "PermissionDeniedException.*User.*" --type py

# Search for similar permission-related error messages
echo -e "\nChecking other permission-related error messages:"
rg "permission.*User \{.*\}" --type py

Length of output: 506


Script:

#!/bin/bash
# Let's search more broadly for permission-related error messages and user references
echo "Checking for permission-related exceptions and messages:"
rg "Permission.*Exception.*" -B 1 -A 1 --type py

echo -e "\nChecking for user-related error messages:"
rg "User.*does not have|has no.*permission" --type py

echo -e "\nChecking for any error messages with user references:"
rg "raise.*User" -B 1 -A 1 --type py

Length of output: 2538

pyproject.toml (3)

73-73: LGTM: llama-index-core dependency.

The addition of llama-index-core ^0.11.22 aligns with the PR objectives for adding LlamaIndex support.


49-49: Verify impact of tenacity downgrade.

The downgrade from ^9.0.0 to ^8.4.1 is a breaking change that could affect retry mechanisms and error handling.

Let's check for tenacity usage that might be impacted:

#!/bin/bash
# Description: Search for tenacity usage patterns that might be affected by the version downgrade

# Test: Look for tenacity decorators and retry configurations
rg -A 5 "@retry" 
rg -A 5 "from tenacity"

39-39: Verify impact of instructor downgrade.

The downgrade from 1.6.3 to ^1.5.2 is a breaking change that could affect existing functionality.

Let's check for instructor usage that might be impacted:

✅ Verification successful

Let me generate additional verification scripts to check the specific instructor features being used and compare them with the changelog between versions 1.5.2 and 1.6.3.

The results will help us determine if the downgrade impacts any of the used features by comparing them with the version differences.


Instructor downgrade is safe to proceed

The codebase uses basic instructor features that are stable across versions 1.5.2 to 1.6.3:

  • Core initialization methods: from_openai() and patch()
  • Basic mode enums: Mode.JSON, Mode.MD_JSON, and Mode.ANTHROPIC_TOOLS
  • Standard structured output generation via OpenAI and Anthropic clients

These fundamental features were introduced before 1.5.2 and remain unchanged in the downgrade.
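
As a concrete illustration of those features, here is a hedged sketch of typical instructor 1.x usage; the model name, schema, and client setup are placeholder assumptions rather than cognee's actual adapter code:

# Hypothetical illustration; requires an OpenAI API key in the environment.
import instructor
from openai import OpenAI
from pydantic import BaseModel

class Summary(BaseModel):
    text: str

client = instructor.from_openai(OpenAI(), mode=instructor.Mode.JSON)

summary = client.chat.completions.create(
    model="gpt-4o-mini",          # placeholder model name
    response_model=Summary,       # structured output via instructor
    messages=[{"role": "user", "content": "Summarize this PR in one sentence."}],
)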

🏁 Scripts executed

The following scripts were executed for the analysis:

Script:

#!/bin/bash
# Description: Search for instructor usage patterns that might be affected by the version downgrade

# Test: Look for instructor imports and usage
rg -A 5 "from instructor" 
rg -A 5 "import instructor"

# Test: Look for specific instructor features that might have changed between versions
ast-grep --pattern 'instructor.$_'

Length of output: 9685


Script:

#!/bin/bash
# Look for imports to understand which instructor features are being used
rg -l "import instructor"

# Look for specific instructor Mode usage
rg "instructor\.Mode\."

# Look for instructor initialization patterns
rg "instructor\.(patch|from_openai)"

Length of output: 1060

cognee/tasks/ingestion/ingest_data.py (1)

6-6: Verify the removal of get_relational_config dependency.

The removal of get_relational_config import while retaining get_relational_engine suggests an architectural change. Let's verify there are no remaining references to the removed import.

cognee/tasks/ingestion/transform_data.py (review thread resolved)
Comment on lines +6 to +18
def get_data_from_llama_index(data_point: Union[Document, ImageDocument], dataset_name: str) -> str:
    # Specific type checking is used to ensure it's not a child class from Document
    if type(data_point) == Document:
        file_path = data_point.metadata.get("file_path")
        if file_path is None:
            file_path = save_data_to_file(data_point.text, dataset_name)
            return file_path
        return file_path
    elif type(data_point) == ImageDocument:
        if data_point.image_path is None:
            file_path = save_data_to_file(data_point.text, dataset_name)
            return file_path
        return data_point.image_path
🛠️ Refactor suggestion

Refactor to reduce code duplication and improve error handling

The function has similar logic patterns that could be consolidated, and it needs better error handling.

Consider this improved implementation:

 def get_data_from_llama_index(data_point: Union[Document, ImageDocument], dataset_name: str) -> str:
+    """Extract or generate a file path from a LlamaIndex document.
+
+    Args:
+        data_point: A Document or ImageDocument from LlamaIndex
+        dataset_name: Name of the dataset for file path generation
+
+    Returns:
+        str: Path to the file containing the document's content
+
+    Raises:
+        TypeError: If data_point is neither Document nor ImageDocument
+    """
+    if isinstance(data_point, ImageDocument):
+        return data_point.image_path or save_data_to_file(data_point.text, dataset_name)
+    elif isinstance(data_point, Document):
+        return data_point.metadata.get("file_path") or save_data_to_file(data_point.text, dataset_name)
+    else:
+        raise TypeError(f"Expected Document or ImageDocument, got {type(data_point).__name__}")

This refactored version:

  1. Uses the or operator for cleaner null-coalescing
  2. Adds proper error handling for invalid types
  3. Reduces code duplication
  4. Handles ImageDocument check first to avoid isinstance ambiguity
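
On point 4: ImageDocument subclasses Document in llama-index-core, so an isinstance check against Document also matches image documents; checking ImageDocument first (or using exact type checks, as the original code does) avoids that ambiguity. A small hedged demonstration, where the ImageDocument import path and constructor arguments are assumptions:

from llama_index.core import Document
from llama_index.core.schema import ImageDocument

image_doc = ImageDocument(text="caption", image_path="/tmp/example.png")

print(isinstance(image_doc, Document))   # True: ImageDocument is a Document subclass
print(type(image_doc) is Document)       # False: exact type check tells them apart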
🧰 Tools
🪛 Ruff

8-8: Use is and is not for type comparisons, or isinstance() for isinstance checks

(E721)


14-14: Use is and is not for type comparisons, or isinstance() for isinstance checks

(E721)

cognee/tasks/ingestion/save_data_item_to_storage.py (review thread outdated, resolved)
Comment on lines +17 to +18
if data_item.startswith("file://") or data_item.startswith("/"):
    file_path = data_item.replace("file://", "")
💡 Codebase verification

⚠️ Potential issue

Based on the gathered information, I can now provide the final response:

Add path validation and sanitization to prevent path traversal vulnerabilities

The current implementation lacks proper path validation, which could lead to security issues:

  • The file_path = data_item.replace("file://", "") operation is insufficient for security
  • No validation is performed to ensure paths are within allowed boundaries
  • No sanitization against directory traversal attacks (e.g., ../../../etc/passwd)

Required changes:

  • Use os.path.abspath() and verify the resolved path is within the allowed data directory
  • Add path sanitization to remove any potential directory traversal sequences
  • Implement path validation before file operations in save_data_item_to_storage.py

Example vulnerable operations found:

  • Direct path usage in ingest_data.py: open(file_path.replace("file://", ""), mode = "rb")
  • Unchecked path handling in save_data_item_to_storage.py
  • Raw path concatenation in save_data_to_file.py: storage_path = data_directory_path + "/" + dataset_name.replace(".", "/")
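
A minimal containment check along the lines of the required changes might look like the sketch below; the allowed base directory and the use of Path.is_relative_to (Python 3.9+) are assumptions rather than details taken from this PR:

from pathlib import Path

def resolve_within(base_dir: str, file_uri: str) -> Path:
    allowed_base = Path(base_dir).resolve()
    candidate = Path(file_uri.replace("file://", "")).resolve()

    # Reject anything that escapes the allowed data directory (e.g. via "../").
    if not candidate.is_relative_to(allowed_base):
        raise ValueError(f"Path outside allowed directory: {candidate}")

    return candidate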
🔗 Analysis chain

Add path validation for security.

Direct handling of file paths without validation could lead to path traversal vulnerabilities. Consider:

  1. Validating the path is within allowed boundaries
  2. Sanitizing the path to prevent directory traversal

Let's check if there are any path validation utilities in the codebase:


Let me search for more specific path-related security implementations:


Let me check if there are any security-related utilities or path validation functions in the codebase:


Let me check how the file paths are being used after they're processed:

🏁 Scripts executed

The following scripts were executed for the analysis:

Script:

#!/bin/bash
# Search for path validation utilities
rg -l "(?i)(validate|sanitize|clean|check).*path"

Length of output: 63


Script:

#!/bin/bash
# Search for path-related security checks and normalizations
rg -i "path|directory" -A 3 -B 3 cognee/tasks/ingestion/

Length of output: 11512


Script:

#!/bin/bash
# Search for security-related utilities and path validation functions
rg -i "def.*secure|sanitize|validate|normalize.*path" -g "*.py"

# Also check for any security-related imports
rg -i "from.*security|import.*pathlib|import.*os\.path" -g "*.py"

Length of output: 1185


Script:

#!/bin/bash
# Check for file operations and how paths are used
rg -A 5 "open\(.*file_path" -g "*.py"

# Also check for the save_data_to_file implementation
rg -A 10 "def save_data_to_file" -g "*.py"

Length of output: 6430

Comment on lines 124 to 132
"# Create a clean slate for cognee -- reset data and system state\n",
"await cognee.prune.prune_data()\n",
"await cognee.prune.prune_system(metadata=True)\n",
"\n",
"# Add the LlamaIndex documents, and make it available for cognify\n",
"await cognee.add(documents)\n",
"\n",
"# Use LLMs and cognee to create knowledge graph\n",
"await cognee.cognify()"
🛠️ Refactor suggestion

Add error handling and progress tracking for Cognee operations.

The current implementation lacks error handling and progress feedback for potentially long-running operations.

+from tqdm.notebook import tqdm
+
 # Create a clean slate for cognee -- reset data and system state
+try:
     await cognee.prune.prune_data()
     await cognee.prune.prune_system(metadata=True)
+    print("✓ System reset complete")
 
-# Add the LlamaIndex documents, and make it available for cognify
-await cognee.add(documents)
+    # Add documents with progress tracking
+    print("\nAdding documents...")
+    for doc in tqdm(documents, desc="Processing"):
+        await cognee.add([doc])
+    print("✓ Documents added successfully")
 
-# Use LLMs and cognee to create knowledge graph
-await cognee.cognify()
+    # Create knowledge graph
+    print("\nGenerating knowledge graph...")
+    await cognee.cognify()
+    print("✓ Knowledge graph generated successfully")
+except Exception as e:
+    print(f"❌ Error during Cognee operations: {str(e)}")
+    raise

Comment on lines 148 to 157
"from cognee import SearchType\n",
"\n",
"# Query cognee for summaries\n",
"search_results = await cognee.search(\n",
" SearchType.SUMMARIES,\n",
" {'query': \"What are the main news discussed in the document?\"}\n",
")\n",
"# Display search results\n",
"print(\"\\n Summary of main news discussed:\\n\")\n",
"print(search_results[0][\"text\"])"
🛠️ Refactor suggestion

Improve search results handling and display.

The current implementation only shows the first result without proper error handling or validation.

-# Query cognee for summaries
-search_results = await cognee.search(
-    SearchType.SUMMARIES,
-    {'query': "What are the main news discussed in the document?"}
-)
-# Display search results
-print("\n Summary of main news discussed:\n")
-print(search_results[0]["text"])
+try:
+    # Query cognee for summaries
+    search_results = await cognee.search(
+        SearchType.SUMMARIES,
+        {'query': "What are the main news discussed in the document?"}
+    )
+    
+    if not search_results:
+        print("No results found")
+        return
+        
+    # Display search results
+    print("\nSummary of main news discussed:\n")
+    for i, result in enumerate(search_results, 1):
+        print(f"Result {i}:")
+        print(f"Score: {result.get('score', 'N/A')}")
+        print(f"Text: {result['text']}\n")
+except Exception as e:
+    print(f"Error during search: {str(e)}")

Comment on lines +32 to +34
with open(file_path.replace("file://", ""), mode = "rb") as file:
    classified_data = ingestion.classify(file)

⚠️ Potential issue

Add Exception Handling for File Operations

Opening a file can raise exceptions such as FileNotFoundError or PermissionError. To prevent the program from crashing, add exception handling around file operations to handle these potential errors gracefully.

I can help implement appropriate exception handling if needed.
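
One possible shape for that handling, sketched around the call shown above; the wrapper name and the choice to re-raise as RuntimeError are assumptions, not part of this PR:

def classify_file(file_path: str, ingestion):
    # "ingestion" is the module used in the snippet above, passed in to keep the sketch self-contained.
    try:
        with open(file_path.replace("file://", ""), mode="rb") as file:
            return ingestion.classify(file)
    except FileNotFoundError as error:
        raise RuntimeError(f"Ingestion input not found: {file_path}") from error
    except PermissionError as error:
        raise RuntimeError(f"No permission to read ingestion input: {file_path}") from error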

Comment on lines +44 to +70
async with db_engine.get_async_session() as session:
    dataset = await create_dataset(dataset_name, user.id, session)

    data_point = (await session.execute(
        select(Data).filter(Data.id == data_id)
    )).scalar_one_or_none()

    if data_point is not None:
        data_point.name = file_metadata["name"]
        data_point.raw_data_location = file_metadata["file_path"]
        data_point.extension = file_metadata["extension"]
        data_point.mime_type = file_metadata["mime_type"]

        await session.merge(data_point)
        await session.commit()
    else:
        data_point = Data(
            id = data_id,
            name = file_metadata["name"],
            raw_data_location = file_metadata["file_path"],
            extension = file_metadata["extension"],
            mime_type = file_metadata["mime_type"],
        )

        dataset.data.append(data_point)
        await session.commit()

⚠️ Potential issue

Ensure Proper Transaction Management and Exception Handling in Database Operations

While using async sessions, it's important to handle exceptions that may occur during database operations to prevent partial commits or data inconsistencies. Consider adding exception handling and ensuring that transactions are properly managed.

I can assist in implementing robust exception handling for database transactions if you'd like.
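
For illustration, a hedged sketch of one way to wrap the session work shown above; it assumes the session is a SQLAlchemy AsyncSession and reuses the names from the snippet rather than the actual diff:

async with db_engine.get_async_session() as session:
    try:
        dataset = await create_dataset(dataset_name, user.id, session)
        # ... create or update the Data record exactly as in the snippet above ...
        await session.commit()
    except Exception:
        # Roll back so a failed ingest cannot leave a partially written record behind.
        await session.rollback()
        raise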

file_path = save_data_item_to_storage(data_item, dataset_name)

# Ingest data and add metadata
with open(file_path.replace("file://", ""), mode = "rb") as file:
🛠️ Refactor suggestion

Use a Robust Method to Parse File URIs

The current approach of using file_path.replace("file://", "") may not handle all edge cases of file URIs, especially across different operating systems. Consider using the urllib.parse or pathlib modules to reliably parse the file path from the URI.

Here's how you might adjust the code:

+from urllib.parse import urlparse
+parsed_url = urlparse(file_path)
+file_system_path = parsed_url.path
-with open(file_path.replace("file://", ""), mode="rb") as file:
+with open(file_system_path, mode="rb") as file:

Committable suggestion skipped: line range outside the PR's diff.

Comment on lines +51 to +70
if data_point is not None:
    data_point.name = file_metadata["name"]
    data_point.raw_data_location = file_metadata["file_path"]
    data_point.extension = file_metadata["extension"]
    data_point.mime_type = file_metadata["mime_type"]

    await session.merge(data_point)
    await session.commit()
else:
    data_point = Data(
        id = data_id,
        name = file_metadata["name"],
        raw_data_location = file_metadata["file_path"],
        extension = file_metadata["extension"],
        mime_type = file_metadata["mime_type"],
    )

    dataset.data.append(data_point)
    await session.commit()

🛠️ Refactor suggestion

Refactor to Reduce Code Duplication in Data Handling

The code blocks for updating an existing data_point and creating a new one share similar assignments to data_point attributes. To improve maintainability and readability, refactor the code to eliminate duplication.

Here is a suggested refactor:

 if data_point is not None:
-    data_point.name = file_metadata["name"]
-    data_point.raw_data_location = file_metadata["file_path"]
-    data_point.extension = file_metadata["extension"]
-    data_point.mime_type = file_metadata["mime_type"]
-
-    await session.merge(data_point)
-    await session.commit()
+    pass  # Existing data_point, no need to create a new one
 else:
     data_point = Data(
         id=data_id,
-        # These fields will be set below
     )
     dataset.data.append(data_point)
-    await session.commit()

+# Set or update shared attributes
+data_point.name = file_metadata["name"]
+data_point.raw_data_location = file_metadata["file_path"]
+data_point.extension = file_metadata["extension"]
+data_point.mime_type = file_metadata["mime_type"]

+await session.merge(data_point)
+await session.commit()

@0xideas 0xideas (Contributor) left a comment
Looks really good overall, I left a few questions/suggestions. Can you think of any documentation that we should update/expand?

@@ -0,0 +1,205 @@
{
Contributor:
Should we add a test analogous to test_notebook for this new notebook?

Contributor:
@dexters1 , what do you think?

Collaborator Author:
We can. If we plan on having more notebooks in the future, we can make my test_notebook action reusable, as there will be little difference between the notebooks.

Contributor:
Could you do this in this MR? I think we should keep track of whether our published notebooks behave as expected.

Collaborator Author:
I've added a test for this notebook in this MR, but the reusable action will be made in the next sprint. I don't have time to take care of it in this sprint.

@dexters1 dexters1 requested a review from 0xideas November 14, 2024 10:55
@dexters1 dexters1 self-assigned this Nov 14, 2024
@coderabbitai coderabbitai bot (Contributor) left a comment

Actionable comments posted: 1

🧹 Outside diff range and nitpick comments (3)
.github/workflows/test_cognee_llama_index_notebook.yml (3)

15-17: Consider using DEBUG log level for better visibility.

Setting RUNTIME__LOG_LEVEL to ERROR might hide useful information during notebook execution. Consider using DEBUG level in the workflow for better troubleshooting capabilities.

-  RUNTIME__LOG_LEVEL: ERROR
+  RUNTIME__LOG_LEVEL: DEBUG

47-51: Consider adding Jupyter to project dependencies.

Instead of adding Jupyter during the workflow execution, consider adding it to the project's pyproject.toml as a development dependency.

This would make the development environment setup more consistent:

# In pyproject.toml
[tool.poetry.group.dev.dependencies]
+ jupyter = "^1.0.0"

52-63: Add newline at end of file.

Add a newline character at the end of the file to comply with POSIX standards.

          --execute notebooks/cognee_llama_index.ipynb \
          --output executed_notebook.ipynb \
-          --ExecutePreprocessor.timeout=1200
+          --ExecutePreprocessor.timeout=1200
+
🧰 Tools
🪛 yamllint

[error] 63-63: no new line character at the end of file

(new-line-at-end-of-file)

📜 Review details

Configuration used: CodeRabbit UI
Review profile: CHILL

📥 Commits

Reviewing files that changed from the base of the PR and between 8c7fdc4 and f13146d.

⛔ Files ignored due to path filters (1)
  • poetry.lock is excluded by !**/*.lock
📒 Files selected for processing (2)
  • .github/workflows/test_cognee_llama_index_notebook.yml (1 hunks)
  • pyproject.toml (3 hunks)
🚧 Files skipped from review as they are similar to previous changes (1)
  • pyproject.toml
🧰 Additional context used
🪛 actionlint
.github/workflows/test_cognee_llama_index_notebook.yml

26-26: if: condition "needs.get_docs_changes.outputs.changes_outside_docs == 'true' && ${{ github.event.label.name == 'run-checks' }}" is always evaluated to true because extra characters are around ${{ }}

(if-cond)

🪛 yamllint
.github/workflows/test_cognee_llama_index_notebook.yml

[error] 63-63: no new line character at the end of file

(new-line-at-end-of-file)

🔇 Additional comments (1)
.github/workflows/test_cognee_llama_index_notebook.yml (1)

1-14: LGTM! Well-configured workflow triggers and concurrency settings.

The workflow configuration properly handles manual triggers and PR events, with appropriate concurrency controls to prevent parallel runs.

Comment on lines 18 to 27
jobs:
  get_docs_changes:
    name: docs changes
    uses: ./.github/workflows/get_docs_changes.yml

  run_notebook_test:
    name: test
    needs: get_docs_changes
    if: needs.get_docs_changes.outputs.changes_outside_docs == 'true' && ${{ github.event.label.name == 'run-checks' }}
    runs-on: ubuntu-latest
@coderabbitai coderabbitai bot (Contributor) commented Nov 14, 2024
⚠️ Potential issue

Fix the condition syntax in the if statement.

The current condition mixes template syntax unnecessarily, which could cause evaluation issues.

-    if: needs.get_docs_changes.outputs.changes_outside_docs == 'true' && ${{ github.event.label.name == 'run-checks' }}
+    if: needs.get_docs_changes.outputs.changes_outside_docs == 'true' && github.event.label.name == 'run-checks'
🧰 Tools
🪛 actionlint

26-26: if: condition "needs.get_docs_changes.outputs.changes_outside_docs == 'true' && ${{ github.event.label.name == 'run-checks' }}" is always evaluated to true because extra characters are around ${{ }}

(if-cond)

Contributor:
@dexters1 This

Contributor:
Seems like the humans are having a chat. I'll hop back into my burrow for now. If you need me again, just tag @coderabbitai in a new comment, and I'll come hopping out!

Collaborator Author:
@borisarzentar Added the change. Maybe we should change this for the rest of the actions as well?

Remove dictionary and use string for query in notebook and simple example

Fix COG-337
@coderabbitai coderabbitai bot (Contributor) left a comment

Actionable comments posted: 1

🧹 Outside diff range and nitpick comments (4)
examples/python/simple_example.py (2)

Line range hint 1-40: Consider enhancing the example documentation

While the example is well-documented with prerequisites, it would be valuable to add a comment about LlamaIndex integration since this PR is specifically about adding LlamaIndex support.

Consider adding a comment like:

# Prerequisites:
# 1. Copy `.env.template` and rename it to `.env`.
# 2. Add your OpenAI API key to the `.env` file in the `LLM_API_KEY` field:
#    LLM_API_KEY = "your_key_here"
# 3. (Optional) To minimize setup effort, set `VECTOR_DB_PROVIDER="lancedb"` in `.env".
+# 4. (Optional) This example supports LlamaIndex documents for enhanced document processing

API parameter style inconsistency found in documentation and examples

The search API usage shows inconsistency in parameter style:

  • simple_example.py uses named parameters: SearchType.INSIGHTS, query='Tell me about NLP'
  • docs/quickstart.md uses positional parameters: "INSIGHTS", {'query': 'Tell me about NLP'}
  • Test files consistently use named parameters: SearchType.INSIGHTS, query = random_node_name

The documentation in quickstart.md needs to be updated to match the current API style using named parameters and SearchType enum.
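
To make the inconsistency concrete, the two styles that appear in this PR look roughly like this (a hedged side-by-side sketch assembled from the examples and docs quoted below, to be run inside an async function):

from cognee import SearchType

# Style still shown in docs/quickstart.md: positional type string plus a dict payload
search_results = await cognee.search("INSIGHTS", {'query': 'Tell me about NLP'})

# Style used by simple_example.py and the tests after this PR: SearchType enum plus a keyword argument
search_results = await cognee.search(SearchType.INSIGHTS, query='Tell me about NLP')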

🔗 Analysis chain

Line range hint 1-40: Verify the impact of API changes

Since this example demonstrates the core API usage, we should verify that all other examples and documentation are consistent with this new parameter style.

🏁 Scripts executed

The following scripts were executed for the analysis:

Script:

#!/bin/bash
# Search for other instances of cognee.search to ensure consistent parameter style
rg -l "cognee.search" --type python | while read -r file; do
  if [ "$file" != "examples/python/simple_example.py" ]; then
    echo "=== $file ==="
    rg -A 2 "cognee\.search" "$file"
  fi
done

Length of output: 81


Script:

#!/bin/bash
# Search for other instances of cognee.search using ripgrep without file type filter
rg -l "cognee\.search"

Length of output: 403


Script:

#!/bin/bash
# Search for Python files in the repository
fd '\.py$'

Length of output: 14317


Script:

#!/bin/bash
# Search for documentation files that might contain API examples
fd -e md -e rst

Length of output: 607


Script:

#!/bin/bash
# Check the search API usage in documentation and test files
echo "=== Documentation examples ==="
rg -A 5 "cognee\.search" docs/
echo -e "\n=== Test files ==="
rg -A 5 "cognee\.search" "cognee/tests/"

Length of output: 7591

notebooks/cognee_llama_index.ipynb (2)

50-53: Add validation for document creation

The document creation process should validate the input data to handle potential empty or malformed entries.

+def create_document(row: pd.Series) -> Document:
+    if not row['title'].strip() or not row['text'].strip():
+        raise ValueError(f"Empty title or text in row: {row.name}")
+    return Document(text=f"{row['title']}: {row['text']}")
+
 documents = [
-    Document(text=f"{row['title']}: {row['text']}")
-    for i, row in news.iterrows()
+    create_document(row)
+    for _, row in news.iterrows()
+    if pd.notna(row['title']) and pd.notna(row['text'])
 ]

199-199: Consider updating Python version

The notebook uses Python 3.9.6 which is quite old. Consider updating to a newer version (3.11+) to benefit from performance improvements, security patches, and ensure compatibility with latest package versions.

📜 Review details

Configuration used: CodeRabbit UI
Review profile: CHILL

📥 Commits

Reviewing files that changed from the base of the PR and between f13146d and cc53693.

📒 Files selected for processing (2)
  • examples/python/simple_example.py (1 hunks)
  • notebooks/cognee_llama_index.ipynb (1 hunks)
🔇 Additional comments (2)
examples/python/simple_example.py (1)

29-31: LGTM! API parameter style improvement

The change from dictionary-style parameters to keyword arguments makes the API more Pythonic and easier to use.

notebooks/cognee_llama_index.ipynb (1)

1-1: Verify notebook test implementation

The conversation indicates that a test for this notebook has been added, but let's verify its implementation.

✅ Verification successful

Test implementation is properly configured

The notebook test has been correctly implemented with:

  • Proper workflow triggers (PR with 'run-checks' label)
  • Required environment setup (Python, Poetry)
  • Necessary secrets configuration (LLM_API_KEY, Graphistry credentials)
  • Appropriate notebook execution timeout (1200s)
  • Correct notebook path and execution parameters
🏁 Scripts executed

The following scripts were executed for the analysis:

Script:

#!/bin/bash
# Description: Verify the existence and configuration of notebook test

# Check if test file exists
echo "Checking for notebook test file..."
fd "test_cognee_llama_index_notebook.yml" .github/workflows/

# Check test configuration
echo -e "\nChecking test configuration..."
rg -A 10 "cognee_llama_index.ipynb" .github/workflows/

Length of output: 638


Script:

#!/bin/bash
# Let's check the complete test configuration to ensure it's properly set up
echo "Complete test configuration:"
cat .github/workflows/test_cognee_llama_index_notebook.yml

Length of output: 1780

Comment on lines +27 to +34
"import pandas as pd\n",
"from llama_index.core import Document\n",
"\n",
"news = pd.read_csv(\n",
" \"https://raw.githubusercontent.com/tomasonjo/blog-datasets/main/news_articles.csv\"\n",
")[:5]\n",
"\n",
"news.head()"

🛠️ Refactor suggestion

Add error handling and data validation for dataset loading

The current implementation lacks error handling for dataset loading and validation of the expected structure.

 import pandas as pd
 from llama_index.core import Document
+from typing import List, Dict
+
+def load_news_dataset(url: str, sample_size: int = 5) -> pd.DataFrame:
+    try:
+        df = pd.read_csv(url)
+        if not all(col in df.columns for col in ['title', 'text']):
+            raise ValueError("Dataset missing required columns: title, text")
+        return df.head(sample_size)
+    except Exception as e:
+        raise RuntimeError(f"Failed to load news dataset: {str(e)}")
 
-news = pd.read_csv(
-    "https://raw.githubusercontent.com/tomasonjo/blog-datasets/main/news_articles.csv"
-)[:5]
+# Load and validate dataset
+news = load_news_dataset(
+    "https://raw.githubusercontent.com/tomasonjo/blog-datasets/main/news_articles.csv"
+)

@0xideas 0xideas left a comment

Looks good to me, except for the small thing in GitHub actions, for consistency's sake :)

pull_request:
  branches:
    - main
  types: [labeled]

currently, we apply [labeled, synchronize] in the other GitHub actions


0xideas commented Nov 14, 2024

(@dexters1 could you also merge the latest main?) <- actually, never mind, it's not necessary!

pyproject.toml Outdated
@@ -69,7 +69,8 @@ fastapi-users = {version = "*", extras = ["sqlalchemy"]}
alembic = "^1.13.3"
asyncpg = "^0.29.0"
pgvector = "^0.3.5"
psycopg2 = "^2.9.10"
psycopg2 = {version = "^2.9.10", optional = true}
llama-index-core = "^0.11.22"

Make this optional please

pyproject.toml Outdated
@@ -36,7 +36,7 @@ boto3 = "^1.26.125"
botocore="^1.35.54"
gunicorn = "^20.1.0"
sqlalchemy = "2.0.35"
instructor = "1.6.3"
instructor = "^1.5.2"

why not fix the version? it is not a very reliable lib, as it changed a few times already

Added the following changes that were requested in the pull request:

Added synchronize label,
Made the if-statement syntax in the workflow uniform,
Fixed the instructor dependency,
Made llama-index optional

Refactor COG-337
@dexters1 dexters1 requested a review from 0xideas November 15, 2024 14:52
@coderabbitai coderabbitai bot left a comment

Actionable comments posted: 2

🧹 Outside diff range and nitpick comments (1)
.github/workflows/test_cognee_llama_index_notebook.yml (1)

15-17: Consider using INFO log level for better debugging capabilities.

Setting RUNTIME__LOG_LEVEL to ERROR might make it difficult to debug workflow failures. Consider using INFO level in the CI environment.

 env:
-  RUNTIME__LOG_LEVEL: ERROR
+  RUNTIME__LOG_LEVEL: INFO
📜 Review details

Configuration used: CodeRabbit UI
Review profile: CHILL

📥 Commits

Reviewing files that changed from the base of the PR and between cc53693 and 417b119.

⛔ Files ignored due to path filters (1)
  • poetry.lock is excluded by !**/*.lock
📒 Files selected for processing (2)
  • .github/workflows/test_cognee_llama_index_notebook.yml (1 hunks)
  • pyproject.toml (4 hunks)
🚧 Files skipped from review as they are similar to previous changes (1)
  • pyproject.toml
🧰 Additional context used
🪛 yamllint
.github/workflows/test_cognee_llama_index_notebook.yml

[error] 63-63: no new line character at the end of file

(new-line-at-end-of-file)

🔇 Additional comments (2)
.github/workflows/test_cognee_llama_index_notebook.yml (2)

1-9: LGTM! Workflow trigger configuration is well-structured.

The workflow triggers and event types are properly configured.


18-26: LGTM! Job structure and conditions are well-defined.

The job dependencies and conditions are properly configured.

Comment on lines +35 to +51
- name: Setup Python
  uses: actions/setup-python@v5
  with:
    python-version: '3.11.x'

- name: Install Poetry
  uses: snok/[email protected]
  with:
    virtualenvs-create: true
    virtualenvs-in-project: true
    installer-parallel: true

- name: Install dependencies
  run: |
    poetry install --no-interaction --all-extras --no-root
    poetry add jupyter --no-interaction


🛠️ Refactor suggestion

Add Poetry caching for faster workflow execution.

Consider these improvements for the Python/Poetry setup:

  1. Add Poetry cache to speed up dependency installation
  2. Pin Poetry version more specifically for better reproducibility
 - name: Setup Python
   uses: actions/setup-python@v5
   with:
     python-version: '3.11.x'

+- name: Cache Poetry
+  uses: actions/cache@v3
+  with:
+    path: ~/.cache/pypoetry
+    key: ${{ runner.os }}-poetry-${{ hashFiles('**/poetry.lock') }}
+    restore-keys: |
+      ${{ runner.os }}-poetry-

 - name: Install Poetry
-  uses: snok/[email protected]
+  uses: snok/[email protected]
   with:
     virtualenvs-create: true
     virtualenvs-in-project: true
     installer-parallel: true

Comment on lines +52 to +63
- name: Execute Jupyter Notebook
  env:
    ENV: 'dev'
    LLM_API_KEY: ${{ secrets.OPENAI_API_KEY }}
    GRAPHISTRY_USERNAME: ${{ secrets.GRAPHISTRY_USERNAME }}
    GRAPHISTRY_PASSWORD: ${{ secrets.GRAPHISTRY_PASSWORD }}
  run: |
    poetry run jupyter nbconvert \
    --to notebook \
    --execute notebooks/cognee_llama_index.ipynb \
    --output executed_notebook.ipynb \
    --ExecutePreprocessor.timeout=1200

🛠️ Refactor suggestion

Enhance notebook execution error handling and fix EOF.

  1. The 20-minute timeout suggests potential performance concerns. Consider adding error handling and output capture.
  2. Add a newline at the end of file.
       run: |
         poetry run jupyter nbconvert \
         --to notebook \
         --execute notebooks/cognee_llama_index.ipynb \
         --output executed_notebook.ipynb \
-        --ExecutePreprocessor.timeout=1200
+        --ExecutePreprocessor.timeout=1200 \
+        || {
+          echo "::error::Notebook execution failed"
+          cat executed_notebook.ipynb
+          exit 1
+        }
+
🧰 Tools
🪛 yamllint

[error] 63-63: no new line character at the end of file

(new-line-at-end-of-file)

Resolve issue with llama-index being mandatory to run cognee

Fix COG-337
Removed additional references to llama-index from core cognee lib.
Added llama-index-core install from notebook

Fix COG-337
@coderabbitai coderabbitai bot left a comment

Actionable comments posted: 4

🧹 Outside diff range and nitpick comments (7)
cognee/tasks/ingestion/save_data_item_with_metadata_to_storage.py (4)

1-2: Add docstring with type hint documentation.

The function's type hints should be documented in a docstring to improve code maintainability and IDE support.

Add a docstring like this:

 def save_data_item_with_metadata_to_storage(data_item: Union[BinaryIO, str, Any], dataset_name: str) -> str:
+    """Save data item with metadata to storage.
+    
+    Args:
+        data_item (Union[BinaryIO, str, Any]): The data to save. Can be:
+            - A file-like object
+            - A string (file path or text content)
+            - A Document from llama_index
+        dataset_name (str): Name of the dataset
+    
+    Returns:
+        str: Path where the data was saved
+        
+    Raises:
+        ValueError: If the data type is not supported
+    """

10-12: Add input validation for dataset_name and return value.

The dataset_name parameter and the return value from get_data_from_llama_index should be validated.

+    if not dataset_name or not isinstance(dataset_name, str):
+        raise ValueError("dataset_name must be a non-empty string")
+
     # Check if data is of type Document or any of it's subclasses
     if isinstance(data_item, Document):
-        file_path = get_data_from_llama_index(data_item, dataset_name)
+        file_path = get_data_from_llama_index(data_item, dataset_name)
+        if not file_path:
+            raise ValueError("Failed to get file path from Document")

25-26: Improve error message with supported type information.

The error message could be more helpful by listing the supported types.

-        raise ValueError(f"Data type not supported: {type(data_item)}")
+        supported_types = [
+            "llama_index.core.Document",
+            "File object with 'file' and 'filename' attributes",
+            "String (file path or text content)"
+        ]
+        raise ValueError(
+            f"Data type not supported: {type(data_item)}. "
+            f"Supported types are: {', '.join(supported_types)}"
+        )

4-28: Consider architectural improvements.

The current implementation could benefit from:

  1. Separation of concerns by splitting into smaller, focused functions for each data type
  2. Adding logging for better debugging and monitoring
  3. Consider implementing a strategy pattern for different data types

This would improve maintainability and make it easier to add support for new data types.
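
On the third point, one possible shape for such a strategy dispatch is sketched below; the registry, handler names, and predicates are hypothetical and only illustrate the pattern, they are not part of this PR:

from typing import Any, Callable, List, Tuple

# Hypothetical registry pairing a type-check predicate with a save strategy.
SaveHandler = Callable[[Any, str], str]
HANDLERS: List[Tuple[Callable[[Any], bool], SaveHandler]] = []

def register_handler(predicate: Callable[[Any], bool]):
    def decorator(handler: SaveHandler) -> SaveHandler:
        HANDLERS.append((predicate, handler))
        return handler
    return decorator

@register_handler(lambda item: isinstance(item, str))
def save_text(data_item: str, dataset_name: str) -> str:
    # Placeholder: a real handler would delegate to save_data_to_file or path handling.
    return f"text saved to dataset '{dataset_name}'"

def save_data_item(data_item: Any, dataset_name: str) -> str:
    # Dispatch to the first handler whose predicate matches the data item.
    for predicate, handler in HANDLERS:
        if predicate(data_item):
            return handler(data_item, dataset_name)
    raise ValueError(f"Data type not supported: {type(data_item)}")

New item types (for example a LlamaIndex Document) could then plug in their own handler instead of extending the if/elif chain in save_data_item_with_metadata_to_storage.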

notebooks/cognee_llama_index.ipynb (3)

16-16: Improve dependency installation and data source management

  1. The pip install command should check for successful installation
  2. The data source URL should be configurable and its availability should be verified

Consider these improvements:

- %pip install llama-index-core
+ %%capture pip_output
+ %pip install llama-index-core
+ if "Successfully installed" not in pip_output.stdout:
+     raise RuntimeError("Failed to install llama-index-core")

+ # Define data source URL as a configuration
+ NEWS_DATASET_URL = "https://raw.githubusercontent.com/tomasonjo/blog-datasets/main/news_articles.csv"
+ SAMPLE_SIZE = 5

- news = pd.read_csv(
-     "https://raw.githubusercontent.com/tomasonjo/blog-datasets/main/news_articles.csv"
- )[:5]
+ # Validate URL and load data
+ try:
+     response = requests.head(NEWS_DATASET_URL)
+     response.raise_for_status()
+     news = pd.read_csv(NEWS_DATASET_URL).head(SAMPLE_SIZE)
+ except Exception as e:
+     raise RuntimeError(f"Failed to access dataset: {str(e)}")

Also applies to: 39-41


59-62: Add validation and progress tracking for document preparation

The document creation process should validate row contents and track progress, especially important when working with larger datasets.

+ from tqdm.notebook import tqdm
+ 
+ def create_document(row):
+     if not isinstance(row['title'], str) or not isinstance(row['text'], str):
+         raise ValueError(f"Invalid row format: title and text must be strings")
+     return Document(text=f"{row['title']}: {row['text']}")

- documents = [
-     Document(text=f"{row['title']}: {row['text']}")
-     for i, row in news.iterrows()
- ]
+ documents = []
+ for i, row in tqdm(news.iterrows(), desc="Creating documents", total=len(news)):
+     try:
+         documents.append(create_document(row))
+     except Exception as e:
+         print(f"Warning: Failed to create document for row {i}: {str(e)}")

224-224: Update Python version

The notebook uses Python 3.9.6 which is quite old. Consider updating to a more recent version (3.11+) to benefit from performance improvements, security fixes, and new features.

📜 Review details

Configuration used: CodeRabbit UI
Review profile: CHILL

📥 Commits

Reviewing files that changed from the base of the PR and between 417b119 and c4700b1.

📒 Files selected for processing (5)
  • cognee/tasks/ingestion/__init__.py (1 hunks)
  • cognee/tasks/ingestion/ingest_data_with_metadata.py (1 hunks)
  • cognee/tasks/ingestion/save_data_item_to_storage.py (1 hunks)
  • cognee/tasks/ingestion/save_data_item_with_metadata_to_storage.py (1 hunks)
  • notebooks/cognee_llama_index.ipynb (1 hunks)
🚧 Files skipped from review as they are similar to previous changes (3)
  • cognee/tasks/ingestion/__init__.py
  • cognee/tasks/ingestion/ingest_data_with_metadata.py
  • cognee/tasks/ingestion/save_data_item_to_storage.py

Comment on lines +14 to +16
    # data is a file object coming from upload.
    elif hasattr(data_item, "file"):
        file_path = save_data_to_file(data_item.file, dataset_name, filename=data_item.filename)

⚠️ Potential issue

Add robust file object validation.

The current implementation makes assumptions about the file object structure without proper validation.

-    elif hasattr(data_item, "file"):
-        file_path = save_data_to_file(data_item.file, dataset_name, filename=data_item.filename)
+    elif hasattr(data_item, "file") and hasattr(data_item, "filename"):
+        if not data_item.file:
+            raise ValueError("File object is empty")
+        if not data_item.filename:
+            raise ValueError("Filename is missing")
+        file_path = save_data_to_file(data_item.file, dataset_name, filename=data_item.filename)

Comment on lines +4 to +8
def save_data_item_with_metadata_to_storage(data_item: Union[BinaryIO, str, Any], dataset_name: str) -> str:
    # Dynamic import is used because the llama_index module is optional.
    # For the same reason Any is accepted as a data item
    from llama_index.core import Document
    from .transform_data import get_data_from_llama_index

🛠️ Refactor suggestion

Add error handling for dynamic imports.

The dynamic imports could fail silently if the optional dependencies are not installed. Consider adding explicit error handling.

-    from llama_index.core import Document
-    from .transform_data import get_data_from_llama_index
+    try:
+        from llama_index.core import Document
+        from .transform_data import get_data_from_llama_index
+    except ImportError as e:
+        raise ImportError(
+            "Failed to import required dependencies. "
+            "Please ensure llama_index is installed for Document support."
+        ) from e

Comment on lines +18 to +24
    elif isinstance(data_item, str):
        # data is a file path
        if data_item.startswith("file://") or data_item.startswith("/"):
            file_path = data_item.replace("file://", "")
        # data is text
        else:
            file_path = save_data_to_file(data_item, dataset_name)

🛠️ Refactor suggestion

Enhance string input validation and handling.

The current string handling could be improved with better file path validation and text content limits.

     elif isinstance(data_item, str):
+        if not data_item.strip():
+            raise ValueError("Empty string input")
+
         # data is a file path
         if data_item.startswith("file://") or data_item.startswith("/"):
             file_path = data_item.replace("file://", "")
+            # Verify file exists and is accessible
+            import os
+            if not os.path.isfile(file_path):
+                raise FileNotFoundError(f"File not found: {file_path}")
         # data is text
         else:
+            # Add reasonable size limit for text content
+            if len(data_item) > 10_000_000:  # 10MB limit
+                raise ValueError("Text content exceeds size limit")
             file_path = save_data_to_file(data_item, dataset_name)

Comment on lines +145 to +152
"async def add(data: Union[BinaryIO, list[BinaryIO], str, list[str]], dataset_name: str = \"main_dataset\", user: User = None):\n",
" await create_relational_db_and_tables()\n",
" await create_pgvector_db_and_tables()\n",
"\n",
" if user is None:\n",
" user = await get_default_user()\n",
"\n",
" await ingest_data_with_metadata(data, dataset_name, user)\n",

🛠️ Refactor suggestion

Improve function definition and documentation

The add function should be properly documented and defined in a module rather than a notebook cell.

- async def add(data: Union[BinaryIO, list[BinaryIO], str, list[str]], dataset_name: str = "main_dataset", user: User = None):
+ async def add(
+     data: Union[BinaryIO, list[BinaryIO], str, list[str]],
+     dataset_name: str = "main_dataset",
+     user: User = None
+ ) -> None:
+     """Add data to Cognee for processing.
+     
+     Args:
+         data: Input data in various formats (file, string, or lists thereof)
+         dataset_name: Name to identify the dataset (default: "main_dataset")
+         user: User performing the operation (default: None, uses default user)
+         
+     Raises:
+         RuntimeError: If database setup fails
+     """
     await create_relational_db_and_tables()
     await create_pgvector_db_and_tables()

Consider moving this function to a proper module in the codebase.

