Cog 337 llama index support #186
Conversation
Added support for LlamaIndex Document type Feature #COG-337
Added jupyter notebook which demonstrates cognee with LlamaIndex document type usage Docs #COG-337
Allow usage of metadata from LlamaIndex documents Feature #COG-337
Change name of llama index function Refactor #COG-337
Downgrade needed on tenacity and instructor modules to support llama index Chore #COG-337
…cognee into COG-337-llama-index-metadata
Added task that will have access to metadata if data is provided from different data ingestion tools Feature #COG-337
Explained why specific type checking is used instead of isinstance, as isinstance returns True for child classes as well Docs #COG-337
Added missing parameter to function call Fix #COG-337
Moved data storing from async to sync Refactor #COG-337
Refactor so ingest_data file was modified instead of having two ingest tasks Refactor #COG-337
Merged new and old data ingestion tasks into one Refactor #COG-337
Returned ingest_data and save_data_to_storage tasks Refactor #COG-337
Returned previous ingestion tasks to add function Refactor #COG-337
Walkthrough: This pull request introduces several changes across multiple files.
| GitGuardian id | GitGuardian status | Secret | Commit | Filename |
|---|---|---|---|---|
| 9573981 | Triggered | Generic Password | b05a07f | notebooks/cognee_llama_index.ipynb |
| 13947935 | Triggered | Generic High Entropy Secret | f6e8294 | cognee/shared/utils.py |
🛠 Guidelines to remediate hardcoded secrets
- Understand the implications of revoking this secret by investigating where it is used in your code.
- Replace and store your secrets safely. Learn the best practices here.
- Revoke and rotate these secrets.
- If possible, rewrite git history. Rewriting git history is not a trivial act. You might completely break other contributing developers' workflow and you risk accidentally deleting legitimate data.
To avoid such incidents in the future, consider:
- following these best practices for managing and storing secrets, including API keys and other credentials (a minimal example follows this list)
- installing secret detection on pre-commit to catch secrets before they leave your machine and ease remediation.
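For instance, the usual way to keep keys like the ones flagged above out of notebooks and source files in a Python project is to load them from the environment (for example, from a git-ignored .env file). A minimal sketch: the LLM_API_KEY variable name is borrowed from this PR's workflow config, and the python-dotenv dependency is an assumption, not part of this repository.

import os
from dotenv import load_dotenv  # assumes python-dotenv is available

load_dotenv()  # reads variables from a local, git-ignored .env file

llm_api_key = os.environ.get("LLM_API_KEY")
if llm_api_key is None:
    raise RuntimeError("LLM_API_KEY is not set; add it to your environment or .env file")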
🦉 GitGuardian detects secrets in your source code to help developers and security teams secure the modern development process. You are seeing this because you or someone else with access to this repository has authorized GitGuardian to scan your pull request.
Actionable comments posted: 12
🧹 Outside diff range and nitpick comments (10)
cognee/tasks/ingestion/save_data_item_to_storage.py (2)
Line 6: Add a docstring to improve code documentation. The function lacks a docstring explaining its purpose, parameters, return value, and possible exceptions.

 def save_data_item_to_storage(data_item: Union[BinaryIO, Document, str], dataset_name: str) -> str:
+    """Save a data item to storage based on its type.
+
+    Args:
+        data_item: The data to save. Can be a file object, LlamaIndex Document, or string path/content
+        dataset_name: Name of the dataset to save the data under
+
+    Returns:
+        str: Path where the data was saved
+
+    Raises:
+        ValueError: If the data_item type is not supported
+    """
Lines 22-23: Enhance the error message with supported types. The current error message could be more helpful by listing the supported types.

-    raise ValueError(f"Data type not supported: {type(data_item)}")
+    raise ValueError(
+        f"Data type not supported: {type(data_item)}. "
+        "Supported types are: BinaryIO, Document, or str (file path or content)"
+    )

cognee/modules/users/permissions/methods/check_permission_on_documents.py (1)
Line 36: Add a null check for user.email. While changing from username to email makes sense, we should ensure the email field is not null to prevent unclear error messages.

-    raise PermissionDeniedException(f"User {user.email} does not have {permission_type} permission on documents")
+    user_identifier = user.email or "Unknown user"
+    raise PermissionDeniedException(f"User {user_identifier} does not have {permission_type} permission on documents")

cognee/tasks/ingestion/ingest_data.py (4)
Line range hint 11-24: Consider splitting the ingest_data function for better maintainability. The function handles multiple responsibilities, including pipeline setup, data ingestion, and telemetry. Consider breaking it down into smaller, focused functions.
Suggested structure:

async def setup_pipeline(destination):
    return dlt.pipeline(
        pipeline_name="file_load_from_filesystem",
        destination=destination,
    )

async def ingest_data(file_paths: list[str], dataset_name: str, user: User):
    destination = get_dlt_destination()
    pipeline = await setup_pipeline(destination)
    # ... rest of the code
Line range hint 31-63: Add explicit error handling for database operations. The database operations lack explicit error handling, which could lead to inconsistent states in case of failures. Consider adding try-catch blocks and explicit transaction management.
Suggested improvement:

async def handle_data_record(session, data_id, file_metadata, dataset):
    try:
        data = (await session.execute(
            select(Data).filter(Data.id == data_id)
        )).scalar_one_or_none()

        if data is not None:
            data.name = file_metadata["name"]
            data.raw_data_location = file_metadata["file_path"]
            data.extension = file_metadata["extension"]
            data.mime_type = file_metadata["mime_type"]
            await session.merge(data)
        else:
            data = Data(
                id=data_id,
                name=file_metadata["name"],
                raw_data_location=file_metadata["file_path"],
                extension=file_metadata["extension"],
                mime_type=file_metadata["mime_type"],
            )
            dataset.data.append(data)

        await session.commit()
        return data
    except Exception as e:
        await session.rollback()
        raise RuntimeError(f"Failed to handle data record: {str(e)}")
Line range hint 26-30: Add input validation for file paths and secure file handling. The current implementation directly uses file paths with minimal validation, which could pose security risks. Consider adding path validation and sanitization.

from pathlib import Path
import os

def validate_file_path(file_path: str) -> str:
    # Remove file:// prefix and normalize path
    cleaned_path = file_path.replace("file://", "")
    abs_path = os.path.abspath(cleaned_path)

    # Ensure path is within allowed directory
    if not abs_path.startswith('/allowed/base/path'):
        raise ValueError("File path is outside allowed directory")

    if not os.path.exists(abs_path):
        raise FileNotFoundError(f"File not found: {abs_path}")

    return abs_path

# Use in the function:
validated_path = validate_file_path(file_path)
with open(validated_path, mode="rb") as file:
    # ... rest of the code
Line range hint 65-67: Ensure atomic operations for permission assignment. The permission assignment happens after yielding data, which could lead to a race condition where data is accessible before permissions are properly set.
Consider moving the permission assignment before yielding the data:

await give_permission_on_document(user, data_id, "read")
await give_permission_on_document(user, data_id, "write")

yield {
    "id": data_id,
    "name": file_metadata["name"],
    "file_path": file_metadata["file_path"],
    "extension": file_metadata["extension"],
    "mime_type": file_metadata["mime_type"],
}

notebooks/cognee_llama_index.ipynb (3)
Lines 27-34: Add error handling and data validation for robust data loading. Consider these improvements for production readiness:
- Add a try-except block for data loading
- Validate the loaded data (check for required columns, non-empty values)
- Make the sample size configurable
- Consider caching the dataset locally

+try:
     news = pd.read_csv(
         "https://raw.githubusercontent.com/tomasonjo/blog-datasets/main/news_articles.csv"
-    )[:5]
+    ).head(5)  # Using head() is more explicit than slice
+
+    # Validate required columns
+    required_columns = ['title', 'text']
+    if not all(col in news.columns for col in required_columns):
+        raise ValueError(f"Dataset missing required columns: {required_columns}")
+
+    # Validate non-empty values
+    if news[required_columns].isnull().any().any():
+        raise ValueError("Dataset contains null values in required columns")
+except Exception as e:
+    raise RuntimeError(f"Failed to load news dataset: {str(e)}")
Lines 50-53: Enhance document preparation with metadata and better formatting. The current document preparation could be improved by:
- Adding clear separation between title and text
- Including metadata for better document tracking
- Adding error handling

 documents = [
-    Document(text=f"{row['title']}: {row['text']}")
+    Document(
+        text=f"Title: {row['title']}\n\nContent: {row['text']}",
+        metadata={
+            "title": row['title'],
+            "doc_id": f"news_{i}",
+            "source": "diffbot_news_dataset"
+        }
+    )
     for i, row in news.iterrows()
 ]
Lines 1-205: Consider converting this notebook into a production-ready tutorial. While this notebook effectively demonstrates the LlamaIndex integration, consider these improvements for production use:
- Add a requirements.txt or environment.yml file
- Include proper error handling throughout
- Add detailed documentation about prerequisites and setup
- Consider creating a proper Python package with the core functionality
- Add example outputs and expected results
Would you like me to help create a production-ready version of this tutorial?
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
⛔ Files ignored due to path filters (1)
- poetry.lock is excluded by !**/*.lock
📒 Files selected for processing (10)
- cognee/api/v1/add/add_v2.py (1 hunks)
- cognee/modules/users/permissions/methods/check_permission_on_documents.py (1 hunks)
- cognee/tasks/ingestion/__init__.py (1 hunks)
- cognee/tasks/ingestion/ingest_data.py (1 hunks)
- cognee/tasks/ingestion/ingest_data_with_metadata.py (1 hunks)
- cognee/tasks/ingestion/save_data_item_to_storage.py (1 hunks)
- cognee/tasks/ingestion/save_data_to_storage.py (2 hunks)
- cognee/tasks/ingestion/transform_data.py (1 hunks)
- notebooks/cognee_llama_index.ipynb (1 hunks)
- pyproject.toml (3 hunks)
✅ Files skipped from review due to trivial changes (1)
- cognee/api/v1/add/add_v2.py
🧰 Additional context used
🪛 Ruff
cognee/tasks/ingestion/transform_data.py
8-8: Use `is` and `is not` for type comparisons, or `isinstance()` for isinstance checks (E721)
14-14: Use `is` and `is not` for type comparisons, or `isinstance()` for isinstance checks (E721)
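For context on why this PR deliberately trips the rule: `isinstance()` also matches child classes, while an exact `type(...)` comparison does not, which is the behaviour the commit message above relies on. A minimal, self-contained illustration; the classes below are stand-ins mirroring the LlamaIndex Document/ImageDocument hierarchy, not the real ones.

from dataclasses import dataclass
from typing import Optional

@dataclass
class Document:
    text: str

@dataclass
class ImageDocument(Document):  # stand-in child class of Document
    image_path: Optional[str] = None

doc = ImageDocument(text="hello")

print(isinstance(doc, Document))   # True  -- isinstance also accepts child classes
print(type(doc) == Document)       # False -- exact type check excludes child classes
print(type(doc) is ImageDocument)  # True  -- the comparison style Ruff's E721 prefers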
🔇 Additional comments (10)
cognee/tasks/ingestion/__init__.py (1)
3-3
: LGTM! Verify the imported module exists.
The new import follows the consistent pattern of the module and aligns with the PR's objective of adding LlamaIndex support.
Let's verify the imported module exists:
✅ Verification successful
Import verified and correctly used across the codebase
The module save_data_item_to_storage.py
exists in the expected location and the import statement is consistently used across multiple files in the ingestion package.
🏁 Scripts executed
The following scripts were executed for the analysis:
Script:
#!/bin/bash
# Description: Verify the existence of the imported module
# Expected: The file should exist in the same directory
# Test 1: Check if the module file exists
fd --type f "save_data_item_to_storage.py" cognee/tasks/ingestion/
# Test 2: Verify there are no typos in the import statement across the codebase
rg "from.*save_data_item_to_storage import" --type py
Length of output: 525
cognee/tasks/ingestion/save_data_to_storage.py (2)
1-2
: LGTM! Import changes align with the new storage abstraction.
The switch to save_data_item_to_storage
suggests a more flexible storage approach that can accommodate different types of data items, including LlamaIndex documents.
12-13
: 🛠️ Refactor suggestion
Verify type handling and error propagation.
While the simplified code improves readability, there are a few concerns to address:
- The function signature suggests it only handles Union[BinaryIO, str], but the AI summary mentions Document support. Consider updating the type hints to reflect all supported types.
- There's no visible error handling for failed storage operations.
Let's verify the type handling in save_data_item_to_storage:
Consider adding error handling:
def save_data_to_storage(data: Union[BinaryIO, str], dataset_name) -> list[str]:
if not isinstance(data, list):
data = [data]
file_paths = []
for data_item in data:
- file_path = save_data_item_to_storage(data_item, dataset_name)
- file_paths.append(file_path)
+ try:
+ file_path = save_data_item_to_storage(data_item, dataset_name)
+ file_paths.append(file_path)
+ except Exception as e:
+ raise RuntimeError(f"Failed to save data item: {str(e)}") from e
return file_paths
cognee/tasks/ingestion/transform_data.py (1)
1-4
: LGTM! Imports are well-organized and specific.
The imports are properly structured with external packages first (llama_index) followed by internal modules (cognee).
cognee/tasks/ingestion/save_data_item_to_storage.py (1)
1-4
: LGTM! Imports and type hints are well-structured.
The imports are appropriate and type hints are properly defined.
cognee/modules/users/permissions/methods/check_permission_on_documents.py (1)
36-36
: Verify consistency of user identification in error messages
Let's check if other permission-related error messages in the codebase use email consistently.
✅ Verification successful
Error message format is consistent with codebase patterns
The error message using user.email
is consistent with the codebase's patterns. This is the only permission-related error message that includes user identification, and it appropriately uses the email attribute which is a standard identifier for users in the system.
🏁 Scripts executed
The following scripts were executed for the analysis:
Script:
#!/bin/bash
# Search for permission-related error messages to verify consistency
# Look for patterns that might use username or email in error messages
# Search for PermissionDeniedException usage
echo "Checking PermissionDeniedException messages:"
rg -A 1 "PermissionDeniedException.*User.*" --type py
# Search for similar permission-related error messages
echo -e "\nChecking other permission-related error messages:"
rg "permission.*User \{.*\}" --type py
Length of output: 506
Script:
#!/bin/bash
# Let's search more broadly for permission-related error messages and user references
echo "Checking for permission-related exceptions and messages:"
rg "Permission.*Exception.*" -B 1 -A 1 --type py
echo -e "\nChecking for user-related error messages:"
rg "User.*does not have|has no.*permission" --type py
echo -e "\nChecking for any error messages with user references:"
rg "raise.*User" -B 1 -A 1 --type py
Length of output: 2538
pyproject.toml (3)
73-73
: LGTM: llama-index-core dependency.
The addition of llama-index-core ^0.11.22 aligns with the PR objectives for adding LlamaIndex support.
49-49
: Verify impact of tenacity downgrade.
The downgrade from ^9.0.0 to ^8.4.1 is a breaking change that could affect retry mechanisms and error handling.
Let's check for tenacity usage that might be impacted:
#!/bin/bash
# Description: Search for tenacity usage patterns that might be affected by the version downgrade
# Test: Look for tenacity decorators and retry configurations
rg -A 5 "@retry"
rg -A 5 "from tenacity"
39-39
: Verify impact of instructor downgrade.
The downgrade from 1.6.3 to ^1.5.2 is a breaking change that could affect existing functionality.
Let's check for instructor usage that might be impacted:
✅ Verification successful
Let me generate additional verification scripts to check the specific instructor features being used and compare them with the changelog between versions 1.5.2 and 1.6.3.
The results will help us determine if the downgrade impacts any of the used features by comparing them with the version differences.
Instructor downgrade is safe to proceed
The codebase uses basic instructor features that are stable across versions 1.5.2 to 1.6.3:
- Core initialization methods: from_openai() and patch()
- Basic mode enums: Mode.JSON, Mode.MD_JSON, and Mode.ANTHROPIC_TOOLS
- Standard structured output generation via OpenAI and Anthropic clients
These fundamental features were introduced before 1.5.2 and remain unchanged in the downgrade.
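For reference, the usage pattern being checked looks roughly like the sketch below. This is a hedged example based on instructor's documented API (client setup plus a response_model for structured output), not code taken from this repository; the model name and the Pydantic schema are illustrative.

import instructor
from openai import OpenAI
from pydantic import BaseModel

class ArticleSummary(BaseModel):
    title: str
    key_points: list[str]

# Core initialization method and mode enum referenced in the list above
client = instructor.from_openai(OpenAI(), mode=instructor.Mode.JSON)

# Standard structured output generation: the response is parsed into ArticleSummary
summary = client.chat.completions.create(
    model="gpt-4o-mini",  # illustrative model name
    response_model=ArticleSummary,
    messages=[{"role": "user", "content": "Summarize: cognee now ingests LlamaIndex documents."}],
)
print(summary.title, summary.key_points)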
🏁 Scripts executed
The following scripts were executed for the analysis:
Script:
#!/bin/bash
# Description: Search for instructor usage patterns that might be affected by the version downgrade
# Test: Look for instructor imports and usage
rg -A 5 "from instructor"
rg -A 5 "import instructor"
# Test: Look for specific instructor features that might have changed between versions
ast-grep --pattern 'instructor.$_'
Length of output: 9685
Script:
#!/bin/bash
# Look for imports to understand which instructor features are being used
rg -l "import instructor"
# Look for specific instructor Mode usage
rg "instructor\.Mode\."
# Look for instructor initialization patterns
rg "instructor\.(patch|from_openai)"
Length of output: 1060
cognee/tasks/ingestion/ingest_data.py (1)
6-6
: Verify the removal of get_relational_config dependency.
The removal of the get_relational_config import while retaining get_relational_engine suggests an architectural change. Let's verify there are no remaining references to the removed import.
def get_data_from_llama_index(data_point: Union[Document, ImageDocument], dataset_name: str) -> str:
    # Specific type checking is used to ensure it's not a child class from Document
    if type(data_point) == Document:
        file_path = data_point.metadata.get("file_path")
        if file_path is None:
            file_path = save_data_to_file(data_point.text, dataset_name)
            return file_path
        return file_path
    elif type(data_point) == ImageDocument:
        if data_point.image_path is None:
            file_path = save_data_to_file(data_point.text, dataset_name)
            return file_path
        return data_point.image_path
🛠️ Refactor suggestion
Refactor to reduce code duplication and improve error handling
The function has similar logic patterns that could be consolidated, and it needs better error handling.
Consider this improved implementation:
def get_data_from_llama_index(data_point: Union[Document, ImageDocument], dataset_name: str) -> str:
+ """Extract or generate a file path from a LlamaIndex document.
+
+ Args:
+ data_point: A Document or ImageDocument from LlamaIndex
+ dataset_name: Name of the dataset for file path generation
+
+ Returns:
+ str: Path to the file containing the document's content
+
+ Raises:
+ TypeError: If data_point is neither Document nor ImageDocument
+ """
+ if isinstance(data_point, ImageDocument):
+ return data_point.image_path or save_data_to_file(data_point.text, dataset_name)
+ elif isinstance(data_point, Document):
+ return data_point.metadata.get("file_path") or save_data_to_file(data_point.text, dataset_name)
+ else:
+ raise TypeError(f"Expected Document or ImageDocument, got {type(data_point).__name__}")
This refactored version:
- Uses the `or` operator for cleaner null-coalescing
- Adds proper error handling for invalid types
- Reduces code duplication
- Handles the ImageDocument check first to avoid isinstance ambiguity
if data_item.startswith("file://") or data_item.startswith("/"):
    file_path = data_item.replace("file://", "")
💡 Codebase verification
Based on the gathered information, I can now provide the final response:
Add path validation and sanitization to prevent path traversal vulnerabilities
The current implementation lacks proper path validation, which could lead to security issues:
- The file_path = data_item.replace("file://", "") operation is insufficient for security
- No validation is performed to ensure paths are within allowed boundaries
- No sanitization against directory traversal attacks (e.g., ../../../etc/passwd)
Required changes:
- Use os.path.abspath() and verify the resolved path is within the allowed data directory
- Add path sanitization to remove any potential directory traversal sequences
- Implement path validation before file operations in save_data_item_to_storage.py
Example vulnerable operations found:
- Direct path usage in ingest_data.py: open(file_path.replace("file://", ""), mode = "rb")
- Unchecked path handling in save_data_item_to_storage.py
- Raw path concatenation in save_data_to_file.py: storage_path = data_directory_path + "/" + dataset_name.replace(".", "/")
🔗 Analysis chain
Add path validation for security.
Direct handling of file paths without validation could lead to path traversal vulnerabilities. Consider:
- Validating the path is within allowed boundaries
- Sanitizing the path to prevent directory traversal
Let's check if there are any path validation utilities in the codebase:
Let me search for more specific path-related security implementations:
Let me check if there are any security-related utilities or path validation functions in the codebase:
Let me check how the file paths are being used after they're processed:
🏁 Scripts executed
The following scripts were executed for the analysis:
Script:
#!/bin/bash
# Search for path validation utilities
rg -l "(?i)(validate|sanitize|clean|check).*path"
Length of output: 63
Script:
#!/bin/bash
# Search for path-related security checks and normalizations
rg -i "path|directory" -A 3 -B 3 cognee/tasks/ingestion/
Length of output: 11512
Script:
#!/bin/bash
# Search for security-related utilities and path validation functions
rg -i "def.*secure|sanitize|validate|normalize.*path" -g "*.py"
# Also check for any security-related imports
rg -i "from.*security|import.*pathlib|import.*os\.path" -g "*.py"
Length of output: 1185
Script:
#!/bin/bash
# Check for file operations and how paths are used
rg -A 5 "open\(.*file_path" -g "*.py"
# Also check for the save_data_to_file implementation
rg -A 10 "def save_data_to_file" -g "*.py"
Length of output: 6430
notebooks/cognee_llama_index.ipynb
Outdated
"# Create a clean slate for cognee -- reset data and system state\n", | ||
"await cognee.prune.prune_data()\n", | ||
"await cognee.prune.prune_system(metadata=True)\n", | ||
"\n", | ||
"# Add the LlamaIndex documents, and make it available for cognify\n", | ||
"await cognee.add(documents)\n", | ||
"\n", | ||
"# Use LLMs and cognee to create knowledge graph\n", | ||
"await cognee.cognify()" |
🛠️ Refactor suggestion
Add error handling and progress tracking for Cognee operations.
The current implementation lacks error handling and progress feedback for potentially long-running operations.
+from tqdm.notebook import tqdm
+
# Create a clean slate for cognee -- reset data and system state
+try:
await cognee.prune.prune_data()
await cognee.prune.prune_system(metadata=True)
+ print("✓ System reset complete")
-# Add the LlamaIndex documents, and make it available for cognify
-await cognee.add(documents)
+ # Add documents with progress tracking
+ print("\nAdding documents...")
+ for doc in tqdm(documents, desc="Processing"):
+ await cognee.add([doc])
+ print("✓ Documents added successfully")
-# Use LLMs and cognee to create knowledge graph
-await cognee.cognify()
+ # Create knowledge graph
+ print("\nGenerating knowledge graph...")
+ await cognee.cognify()
+ print("✓ Knowledge graph generated successfully")
+except Exception as e:
+ print(f"❌ Error during Cognee operations: {str(e)}")
+ raise
notebooks/cognee_llama_index.ipynb
Outdated
"from cognee import SearchType\n", | ||
"\n", | ||
"# Query cognee for summaries\n", | ||
"search_results = await cognee.search(\n", | ||
" SearchType.SUMMARIES,\n", | ||
" {'query': \"What are the main news discussed in the document?\"}\n", | ||
")\n", | ||
"# Display search results\n", | ||
"print(\"\\n Summary of main news discussed:\\n\")\n", | ||
"print(search_results[0][\"text\"])" |
🛠️ Refactor suggestion
Improve search results handling and display.
The current implementation only shows the first result without proper error handling or validation.
-# Query cognee for summaries
-search_results = await cognee.search(
- SearchType.SUMMARIES,
- {'query': "What are the main news discussed in the document?"}
-)
-# Display search results
-print("\n Summary of main news discussed:\n")
-print(search_results[0]["text"])
+try:
+ # Query cognee for summaries
+ search_results = await cognee.search(
+ SearchType.SUMMARIES,
+ {'query': "What are the main news discussed in the document?"}
+ )
+
+ if not search_results:
+ print("No results found")
+ return
+
+ # Display search results
+ print("\nSummary of main news discussed:\n")
+ for i, result in enumerate(search_results, 1):
+ print(f"Result {i}:")
+ print(f"Score: {result.get('score', 'N/A')}")
+ print(f"Text: {result['text']}\n")
+except Exception as e:
+ print(f"Error during search: {str(e)}")
with open(file_path.replace("file://", ""), mode = "rb") as file:
    classified_data = ingestion.classify(file)
Add Exception Handling for File Operations
Opening a file can raise exceptions such as FileNotFoundError or PermissionError. To prevent the program from crashing, add exception handling around file operations to handle these potential errors gracefully.
I can help implement appropriate exception handling if needed.
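A minimal sketch of the kind of guard being suggested, reusing the names from the snippet above; the exact handling strategy (re-raise, log, or skip the item) is up to the authors.

try:
    with open(file_path.replace("file://", ""), mode="rb") as file:
        classified_data = ingestion.classify(file)
except (FileNotFoundError, PermissionError) as error:
    # Surface a clear, contextual error instead of letting the ingestion task crash opaquely
    raise RuntimeError(f"Could not read ingested file at {file_path}: {error}") from error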
async with db_engine.get_async_session() as session:
    dataset = await create_dataset(dataset_name, user.id, session)

    data_point = (await session.execute(
        select(Data).filter(Data.id == data_id)
    )).scalar_one_or_none()

    if data_point is not None:
        data_point.name = file_metadata["name"]
        data_point.raw_data_location = file_metadata["file_path"]
        data_point.extension = file_metadata["extension"]
        data_point.mime_type = file_metadata["mime_type"]

        await session.merge(data_point)
        await session.commit()
    else:
        data_point = Data(
            id = data_id,
            name = file_metadata["name"],
            raw_data_location = file_metadata["file_path"],
            extension = file_metadata["extension"],
            mime_type = file_metadata["mime_type"],
        )

        dataset.data.append(data_point)
        await session.commit()
Ensure Proper Transaction Management and Exception Handling in Database Operations
While using async sessions, it's important to handle exceptions that may occur during database operations to prevent partial commits or data inconsistencies. Consider adding exception handling and ensuring that transactions are properly managed.
I can assist in implementing robust exception handling for database transactions if you'd like.
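A sketch of one way to do that with the async SQLAlchemy session used above: wrap the work in try/except, roll back on failure, and commit once. The structure mirrors the snippet above; the error type and message are illustrative, not the repository's code.

async with db_engine.get_async_session() as session:
    try:
        dataset = await create_dataset(dataset_name, user.id, session)

        data_point = (await session.execute(
            select(Data).filter(Data.id == data_id)
        )).scalar_one_or_none()

        if data_point is not None:
            data_point.name = file_metadata["name"]
            data_point.raw_data_location = file_metadata["file_path"]
            data_point.extension = file_metadata["extension"]
            data_point.mime_type = file_metadata["mime_type"]
            await session.merge(data_point)
        else:
            data_point = Data(
                id = data_id,
                name = file_metadata["name"],
                raw_data_location = file_metadata["file_path"],
                extension = file_metadata["extension"],
                mime_type = file_metadata["mime_type"],
            )
            dataset.data.append(data_point)

        # A single commit keeps the dataset/data changes atomic
        await session.commit()
    except Exception as error:
        await session.rollback()
        raise RuntimeError(f"Failed to store data record {data_id}: {error}") from error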
file_path = save_data_item_to_storage(data_item, dataset_name)

# Ingest data and add metadata
with open(file_path.replace("file://", ""), mode = "rb") as file:
🛠️ Refactor suggestion
Use a Robust Method to Parse File URIs
The current approach of using file_path.replace("file://", "") may not handle all edge cases of file URIs, especially across different operating systems. Consider using the urllib.parse or pathlib modules to reliably parse the file path from the URI.
Here's how you might adjust the code:
+from urllib.parse import urlparse
+parsed_url = urlparse(file_path)
+file_system_path = parsed_url.path
-with open(file_path.replace("file://", ""), mode="rb") as file:
+with open(file_system_path, mode="rb") as file:
Committable suggestion skipped: line range outside the PR's diff.
if data_point is not None:
    data_point.name = file_metadata["name"]
    data_point.raw_data_location = file_metadata["file_path"]
    data_point.extension = file_metadata["extension"]
    data_point.mime_type = file_metadata["mime_type"]

    await session.merge(data_point)
    await session.commit()
else:
    data_point = Data(
        id = data_id,
        name = file_metadata["name"],
        raw_data_location = file_metadata["file_path"],
        extension = file_metadata["extension"],
        mime_type = file_metadata["mime_type"],
    )

    dataset.data.append(data_point)
    await session.commit()
🛠️ Refactor suggestion
Refactor to Reduce Code Duplication in Data Handling
The code blocks for updating an existing data_point and creating a new one share similar assignments to data_point attributes. To improve maintainability and readability, refactor the code to eliminate the duplication.
Here is a suggested refactor:
if data_point is not None:
- data_point.name = file_metadata["name"]
- data_point.raw_data_location = file_metadata["file_path"]
- data_point.extension = file_metadata["extension"]
- data_point.mime_type = file_metadata["mime_type"]
-
- await session.merge(data_point)
- await session.commit()
+ pass # Existing data_point, no need to create a new one
else:
data_point = Data(
id=data_id,
- # These fields will be set below
)
dataset.data.append(data_point)
- await session.commit()
+# Set or update shared attributes
+data_point.name = file_metadata["name"]
+data_point.raw_data_location = file_metadata["file_path"]
+data_point.extension = file_metadata["extension"]
+data_point.mime_type = file_metadata["mime_type"]
+await session.merge(data_point)
+await session.commit()
Looks really good overall, I left a few questions/suggestions. Can you think of any documentation that we should update/expand?
cognee/modules/users/permissions/methods/check_permission_on_documents.py
@@ -0,0 +1,205 @@
{ |
Should we add a test analogous to test_notebook for this new notebook?
@dexters1 , what do you think?
We can. If we plan on having more notebooks in the future, we can make my test_notebook action reusable, as there will be little difference between notebooks.
Could you do this in this MR? I think we should keep track of our published notebooks behaving as expected
I've added a test for this notebook in this MR, but the reusable action will be made in the next sprint. I don't have time to take care of it in this sprint.
Actionable comments posted: 1
🧹 Outside diff range and nitpick comments (3)
.github/workflows/test_cognee_llama_index_notebook.yml (3)
Lines 15-17: Consider using the DEBUG log level for better visibility. Setting RUNTIME__LOG_LEVEL to ERROR might hide useful information during notebook execution. Consider using DEBUG level in the workflow for better troubleshooting capabilities.
-  RUNTIME__LOG_LEVEL: ERROR
+  RUNTIME__LOG_LEVEL: DEBUG
Lines 47-51: Consider adding Jupyter to the project dependencies. Instead of adding Jupyter during the workflow execution, consider adding it to the project's pyproject.toml as a development dependency. This would make the development environment setup more consistent:
 # In pyproject.toml
 [tool.poetry.group.dev.dependencies]
+ jupyter = "^1.0.0"
Lines 52-63: Add a newline character at the end of the file to comply with POSIX standards.
     --execute notebooks/cognee_llama_index.ipynb \
     --output executed_notebook.ipynb \
-    --ExecutePreprocessor.timeout=1200
+    --ExecutePreprocessor.timeout=1200
+
🧰 Tools
🪛 yamllint
[error] 63-63: no new line character at the end of file (new-line-at-end-of-file)
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
⛔ Files ignored due to path filters (1)
- poetry.lock is excluded by !**/*.lock
📒 Files selected for processing (2)
- .github/workflows/test_cognee_llama_index_notebook.yml (1 hunks)
- pyproject.toml (3 hunks)
- pyproject.toml
🧰 Additional context used
🪛 actionlint
.github/workflows/test_cognee_llama_index_notebook.yml
26-26: if: condition "needs.get_docs_changes.outputs.changes_outside_docs == 'true' && ${{ github.event.label.name == 'run-checks' }}" is always evaluated to true because extra characters are around ${{ }}
(if-cond)
🪛 yamllint
.github/workflows/test_cognee_llama_index_notebook.yml
[error] 63-63: no new line character at the end of file
(new-line-at-end-of-file)
🔇 Additional comments (1)
.github/workflows/test_cognee_llama_index_notebook.yml (1)
1-14
: LGTM! Well-configured workflow triggers and concurrency settings.
The workflow configuration properly handles manual triggers and PR events, with appropriate concurrency controls to prevent parallel runs.
jobs:
  get_docs_changes:
    name: docs changes
    uses: ./.github/workflows/get_docs_changes.yml

  run_notebook_test:
    name: test
    needs: get_docs_changes
    if: needs.get_docs_changes.outputs.changes_outside_docs == 'true' && ${{ github.event.label.name == 'run-checks' }}
    runs-on: ubuntu-latest
Fix the condition syntax in the if statement.
The current condition mixes template syntax unnecessarily, which could cause evaluation issues.
- if: needs.get_docs_changes.outputs.changes_outside_docs == 'true' && ${{ github.event.label.name == 'run-checks' }}
+ if: needs.get_docs_changes.outputs.changes_outside_docs == 'true' && github.event.label.name == 'run-checks'
@dexters1 This
Seems like the humans are having a chat. I'll hop back into my burrow for now. If you need me again, just tag @coderabbitai in a new comment, and I'll come hopping out!
@borisarzentar Added change, maybe we should change this for the rest of the actions as well?
Remove dictionary and use string for query in notebook and simple example Fix COG-337
Actionable comments posted: 1
🧹 Outside diff range and nitpick comments (4)
examples/python/simple_example.py (2)
Line range hint 1-40: Consider enhancing the example documentation. While the example is well-documented with prerequisites, it would be valuable to add a comment about LlamaIndex integration since this PR is specifically about adding LlamaIndex support.
Consider adding a comment like:
 # Prerequisites:
 # 1. Copy `.env.template` and rename it to `.env`.
 # 2. Add your OpenAI API key to the `.env` file in the `LLM_API_KEY` field:
 #    LLM_API_KEY = "your_key_here"
 # 3. (Optional) To minimize setup effort, set `VECTOR_DB_PROVIDER="lancedb"` in `.env`.
+# 4. (Optional) This example supports LlamaIndex documents for enhanced document processing
API parameter style inconsistency found in documentation and examples
The search API usage shows inconsistency in parameter style:
- simple_example.py uses named parameters: SearchType.INSIGHTS, query='Tell me about NLP'
- docs/quickstart.md uses positional parameters: "INSIGHTS", {'query': 'Tell me about NLP'}
- Test files consistently use named parameters: SearchType.INSIGHTS, query = random_node_name
The documentation in quickstart.md needs to be updated to match the current API style using named parameters and the SearchType enum.
🔗 Analysis chain
1-40
: Verify the impact of API changesSince this example demonstrates the core API usage, we should verify that all other examples and documentation are consistent with this new parameter style.
🏁 Scripts executed
The following scripts were executed for the analysis:
Script:
#!/bin/bash # Search for other instances of cognee.search to ensure consistent parameter style rg -l "cognee.search" --type python | while read -r file; do if [ "$file" != "examples/python/simple_example.py" ]; then echo "=== $file ===" rg -A 2 "cognee\.search" "$file" fi doneLength of output: 81
Script:
#!/bin/bash # Search for other instances of cognee.search using ripgrep without file type filter rg -l "cognee\.search"Length of output: 403
Script:
#!/bin/bash # Search for Python files in the repository fd '\.py$'Length of output: 14317
Script:
#!/bin/bash # Search for documentation files that might contain API examples fd -e md -e rstLength of output: 607
Script:
#!/bin/bash # Check the search API usage in documentation and test files echo "=== Documentation examples ===" rg -A 5 "cognee\.search" docs/ echo -e "\n=== Test files ===" rg -A 5 "cognee\.search" "cognee/tests/"Length of output: 7591
notebooks/cognee_llama_index.ipynb (2)
Lines 50-53: Add validation for document creation. The document creation process should validate the input data to handle potential empty or malformed entries.

+def create_document(row: pd.Series) -> Document:
+    if not row['title'].strip() or not row['text'].strip():
+        raise ValueError(f"Empty title or text in row: {row.name}")
+    return Document(text=f"{row['title']}: {row['text']}")
+
 documents = [
-    Document(text=f"{row['title']}: {row['text']}")
-    for i, row in news.iterrows()
+    create_document(row)
+    for _, row in news.iterrows()
+    if pd.notna(row['title']) and pd.notna(row['text'])
 ]

Line 199: Consider updating the Python version. The notebook uses Python 3.9.6, which is quite old. Consider updating to a newer version (3.11+) to benefit from performance improvements and security patches, and to ensure compatibility with the latest package versions.
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
📒 Files selected for processing (2)
- examples/python/simple_example.py (1 hunks)
- notebooks/cognee_llama_index.ipynb (1 hunks)
🔇 Additional comments (2)
examples/python/simple_example.py (1)
29-31
: LGTM! API parameter style improvement
The change from dictionary-style parameters to keyword arguments makes the API more Pythonic and easier to use.
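Concretely, the change being approved here goes from passing the query as a positional dict to a keyword argument. An illustrative before/after based on the calls discussed in this PR, not a verbatim copy of the file; run inside an async context such as a notebook cell.

import cognee
from cognee import SearchType

# Before: query passed positionally as a dict
results = await cognee.search(SearchType.INSIGHTS, {"query": "Tell me about NLP"})

# After: query passed as a keyword argument
results = await cognee.search(SearchType.INSIGHTS, query="Tell me about NLP")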
notebooks/cognee_llama_index.ipynb (1)
1-1
: Verify notebook test implementation
The conversation indicates that a test for this notebook has been added, but let's verify its implementation.
✅ Verification successful
Test implementation is properly configured ✅
The notebook test has been correctly implemented with:
- Proper workflow triggers (PR with 'run-checks' label)
- Required environment setup (Python, Poetry)
- Necessary secrets configuration (LLM_API_KEY, Graphistry credentials)
- Appropriate notebook execution timeout (1200s)
- Correct notebook path and execution parameters
🏁 Scripts executed
The following scripts were executed for the analysis:
Script:
#!/bin/bash
# Description: Verify the existence and configuration of notebook test
# Check if test file exists
echo "Checking for notebook test file..."
fd "test_cognee_llama_index_notebook.yml" .github/workflows/
# Check test configuration
echo -e "\nChecking test configuration..."
rg -A 10 "cognee_llama_index.ipynb" .github/workflows/
Length of output: 638
Script:
#!/bin/bash
# Let's check the complete test configuration to ensure it's properly set up
echo "Complete test configuration:"
cat .github/workflows/test_cognee_llama_index_notebook.yml
Length of output: 1780
"import pandas as pd\n", | ||
"from llama_index.core import Document\n", | ||
"\n", | ||
"news = pd.read_csv(\n", | ||
" \"https://raw.githubusercontent.com/tomasonjo/blog-datasets/main/news_articles.csv\"\n", | ||
")[:5]\n", | ||
"\n", | ||
"news.head()" |
🛠️ Refactor suggestion
Add error handling and data validation for dataset loading
The current implementation lacks error handling for dataset loading and validation of the expected structure.
import pandas as pd
from llama_index.core import Document
+from typing import List, Dict
+
+def load_news_dataset(url: str, sample_size: int = 5) -> pd.DataFrame:
+ try:
+ df = pd.read_csv(url)
+ if not all(col in df.columns for col in ['title', 'text']):
+ raise ValueError("Dataset missing required columns: title, text")
+ return df.head(sample_size)
+ except Exception as e:
+ raise RuntimeError(f"Failed to load news dataset: {str(e)}")
-news = pd.read_csv(
- "https://raw.githubusercontent.com/tomasonjo/blog-datasets/main/news_articles.csv"
-)[:5]
+# Load and validate dataset
+news = load_news_dataset(
+ "https://raw.githubusercontent.com/tomasonjo/blog-datasets/main/news_articles.csv"
+)
Looks good to me, except for the small thing in GitHub actions, for consistency's sake :)
pull_request:
  branches:
    - main
  types: [labeled]
Currently, we apply [labeled, synchronize] in the other GitHub actions.
(@dexters1 could you also merge the latest main?) <- actually, never mind, it's not necessary!
pyproject.toml
Outdated
@@ -69,7 +69,8 @@ fastapi-users = {version = "*", extras = ["sqlalchemy"]}
 alembic = "^1.13.3"
 asyncpg = "^0.29.0"
 pgvector = "^0.3.5"
-psycopg2 = "^2.9.10"
+psycopg2 = {version = "^2.9.10", optional = true}
+llama-index-core = "^0.11.22"
Make this optional please
pyproject.toml
Outdated
@@ -36,7 +36,7 @@ boto3 = "^1.26.125"
 botocore="^1.35.54"
 gunicorn = "^20.1.0"
 sqlalchemy = "2.0.35"
-instructor = "1.6.3"
+instructor = "^1.5.2"
Why not fix the version? It is not a very reliable lib, as it has changed a few times already.
Added the following changes that were requested in pull request: Added synchronize label, Made uniform syntax in if statement in workflow, fixed instructor dependency, added llama-index to be optional Refactor COG-337
Actionable comments posted: 2
🧹 Outside diff range and nitpick comments (1)
.github/workflows/test_cognee_llama_index_notebook.yml (1)
Lines 15-17: Consider using the INFO log level for better debugging capabilities. Setting RUNTIME__LOG_LEVEL to ERROR might make it difficult to debug workflow failures. Consider using INFO level in the CI environment.
 env:
-  RUNTIME__LOG_LEVEL: ERROR
+  RUNTIME__LOG_LEVEL: INFO
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
⛔ Files ignored due to path filters (1)
- poetry.lock is excluded by !**/*.lock
📒 Files selected for processing (2)
- .github/workflows/test_cognee_llama_index_notebook.yml (1 hunks)
- pyproject.toml (4 hunks)
🚧 Files skipped from review as they are similar to previous changes (1)
- pyproject.toml
🧰 Additional context used
🪛 yamllint
.github/workflows/test_cognee_llama_index_notebook.yml
[error] 63-63: no new line character at the end of file
(new-line-at-end-of-file)
🔇 Additional comments (2)
.github/workflows/test_cognee_llama_index_notebook.yml (2)
1-9
: LGTM! Workflow trigger configuration is well-structured.
The workflow triggers and event types are properly configured.
18-26
: LGTM! Job structure and conditions are well-defined.
The job dependencies and conditions are properly configured.
- name: Setup Python
  uses: actions/setup-python@v5
  with:
    python-version: '3.11.x'

- name: Install Poetry
  uses: snok/[email protected]
  with:
    virtualenvs-create: true
    virtualenvs-in-project: true
    installer-parallel: true

- name: Install dependencies
  run: |
    poetry install --no-interaction --all-extras --no-root
    poetry add jupyter --no-interaction
🛠️ Refactor suggestion
Add Poetry caching for faster workflow execution.
Consider these improvements for the Python/Poetry setup:
- Add Poetry cache to speed up dependency installation
- Pin Poetry version more specifically for better reproducibility
- name: Setup Python
uses: actions/setup-python@v5
with:
python-version: '3.11.x'
+- name: Cache Poetry
+ uses: actions/cache@v3
+ with:
+ path: ~/.cache/pypoetry
+ key: ${{ runner.os }}-poetry-${{ hashFiles('**/poetry.lock') }}
+ restore-keys: |
+ ${{ runner.os }}-poetry-
- name: Install Poetry
- uses: snok/[email protected]
+ uses: snok/[email protected]
with:
virtualenvs-create: true
virtualenvs-in-project: true
installer-parallel: true
- name: Execute Jupyter Notebook
  env:
    ENV: 'dev'
    LLM_API_KEY: ${{ secrets.OPENAI_API_KEY }}
    GRAPHISTRY_USERNAME: ${{ secrets.GRAPHISTRY_USERNAME }}
    GRAPHISTRY_PASSWORD: ${{ secrets.GRAPHISTRY_PASSWORD }}
  run: |
    poetry run jupyter nbconvert \
      --to notebook \
      --execute notebooks/cognee_llama_index.ipynb \
      --output executed_notebook.ipynb \
      --ExecutePreprocessor.timeout=1200
🛠️ Refactor suggestion
Enhance notebook execution error handling and fix EOF.
- The 20-minute timeout suggests potential performance concerns. Consider adding error handling and output capture.
- Add a newline at the end of file.
run: |
poetry run jupyter nbconvert \
--to notebook \
--execute notebooks/cognee_llama_index.ipynb \
--output executed_notebook.ipynb \
- --ExecutePreprocessor.timeout=1200
+ --ExecutePreprocessor.timeout=1200 \
+ || {
+ echo "::error::Notebook execution failed"
+ cat executed_notebook.ipynb
+ exit 1
+ }
+
📝 Committable suggestion
‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.
- name: Execute Jupyter Notebook
  env:
    ENV: 'dev'
    LLM_API_KEY: ${{ secrets.OPENAI_API_KEY }}
    GRAPHISTRY_USERNAME: ${{ secrets.GRAPHISTRY_USERNAME }}
    GRAPHISTRY_PASSWORD: ${{ secrets.GRAPHISTRY_PASSWORD }}
  run: |
    poetry run jupyter nbconvert \
      --to notebook \
      --execute notebooks/cognee_llama_index.ipynb \
      --output executed_notebook.ipynb \
      --ExecutePreprocessor.timeout=1200 \
      || {
        echo "::error::Notebook execution failed"
        cat executed_notebook.ipynb
        exit 1
      }
🧰 Tools
🪛 yamllint
[error] 63-63: no new line character at the end of file
(new-line-at-end-of-file)
Resolve issue with llama-index being mandatory to run cognee Fix COG-337
Removed additional references to llama-index from core cognee lib. Added llama-index-core install from notebook Fix COG-337
Actionable comments posted: 4
🧹 Outside diff range and nitpick comments (7)
cognee/tasks/ingestion/save_data_item_with_metadata_to_storage.py (4)
1-2: Add docstring with type hint documentation. The function's type hints should be documented in a docstring to improve code maintainability and IDE support.
Add a docstring like this:
 def save_data_item_with_metadata_to_storage(data_item: Union[BinaryIO, str, Any], dataset_name: str) -> str:
+    """Save data item with metadata to storage.
+
+    Args:
+        data_item (Union[BinaryIO, str, Any]): The data to save. Can be:
+            - A file-like object
+            - A string (file path or text content)
+            - A Document from llama_index
+        dataset_name (str): Name of the dataset
+
+    Returns:
+        str: Path where the data was saved
+
+    Raises:
+        ValueError: If the data type is not supported
+    """
10-12: Add input validation for dataset_name and return value. The dataset_name parameter and the return value from get_data_from_llama_index should be validated.

+    if not dataset_name or not isinstance(dataset_name, str):
+        raise ValueError("dataset_name must be a non-empty string")
+
     # Check if data is of type Document or any of it's subclasses
     if isinstance(data_item, Document):
-        file_path = get_data_from_llama_index(data_item, dataset_name)
+        file_path = get_data_from_llama_index(data_item, dataset_name)
+        if not file_path:
+            raise ValueError("Failed to get file path from Document")
25-26: Improve error message with supported type information. The error message could be more helpful by listing the supported types.

-    raise ValueError(f"Data type not supported: {type(data_item)}")
+    supported_types = [
+        "llama_index.core.Document",
+        "File object with 'file' and 'filename' attributes",
+        "String (file path or text content)"
+    ]
+    raise ValueError(
+        f"Data type not supported: {type(data_item)}. "
+        f"Supported types are: {', '.join(supported_types)}"
+    )
4-28: Consider architectural improvements. The current implementation could benefit from:
- Separation of concerns by splitting into smaller, focused functions for each data type
- Adding logging for better debugging and monitoring
- Consider implementing a strategy pattern for different data types
This would improve maintainability and make it easier to add support for new data types.
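To illustrate the strategy-pattern idea from the list above, here is a minimal dispatch sketch; the registry, decorator, and handler names are hypothetical and not part of the cognee codebase:

from typing import Any, Callable, Dict, Type

# Hypothetical registry mapping data-item types to save handlers.
_handlers: Dict[Type, Callable[[Any, str], str]] = {}

def register_handler(data_type: Type) -> Callable:
    """Register a save handler for a given data-item type."""
    def decorator(func: Callable[[Any, str], str]) -> Callable[[Any, str], str]:
        _handlers[data_type] = func
        return func
    return decorator

@register_handler(str)
def save_text_or_path(data_item: str, dataset_name: str) -> str:
    # The existing file-path/text branching would live here; placeholder return for the sketch.
    return f"/placeholder/{dataset_name}"

def save_data_item(data_item: Any, dataset_name: str) -> str:
    """Dispatch to the first handler whose registered type matches the item."""
    for data_type, handler in _handlers.items():
        if isinstance(data_item, data_type):
            return handler(data_item, dataset_name)
    raise ValueError(f"Data type not supported: {type(data_item)}")

New types (for example an optional llama_index Document handler) could then be registered without touching the dispatch function.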
notebooks/cognee_llama_index.ipynb (3)
16-16: Improve dependency installation and data source management.
- The pip install command should check for successful installation
- The data source URL should be configurable and its availability should be verified
Consider these improvements:
-%pip install llama-index-core
+%%capture pip_output
+%pip install llama-index-core
+if "Successfully installed" not in pip_output.stdout:
+    raise RuntimeError("Failed to install llama-index-core")
+# Define data source URL as a configuration
+NEWS_DATASET_URL = "https://raw.githubusercontent.com/tomasonjo/blog-datasets/main/news_articles.csv"
+SAMPLE_SIZE = 5

-news = pd.read_csv(
-    "https://raw.githubusercontent.com/tomasonjo/blog-datasets/main/news_articles.csv"
-)[:5]
+# Validate URL and load data
+try:
+    response = requests.head(NEWS_DATASET_URL)
+    response.raise_for_status()
+    news = pd.read_csv(NEWS_DATASET_URL).head(SAMPLE_SIZE)
+except Exception as e:
+    raise RuntimeError(f"Failed to access dataset: {str(e)}")

Also applies to: 39-41
59-62: Add validation and progress tracking for document preparation. The document creation process should validate row contents and track progress, especially important when working with larger datasets.

+from tqdm.notebook import tqdm
+
+def create_document(row):
+    if not isinstance(row['title'], str) or not isinstance(row['text'], str):
+        raise ValueError(f"Invalid row format: title and text must be strings")
+    return Document(text=f"{row['title']}: {row['text']}")

-documents = [
-    Document(text=f"{row['title']}: {row['text']}")
-    for i, row in news.iterrows()
-]
+documents = []
+for i, row in tqdm(news.iterrows(), desc="Creating documents", total=len(news)):
+    try:
+        documents.append(create_document(row))
+    except Exception as e:
+        print(f"Warning: Failed to create document for row {i}: {str(e)}")
224-224: Update Python version. The notebook uses Python 3.9.6, which is quite old. Consider updating to a more recent version (3.11+) to benefit from performance improvements, security fixes, and new features.
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
📒 Files selected for processing (5)
cognee/tasks/ingestion/__init__.py (1 hunks)
cognee/tasks/ingestion/ingest_data_with_metadata.py (1 hunks)
cognee/tasks/ingestion/save_data_item_to_storage.py (1 hunks)
cognee/tasks/ingestion/save_data_item_with_metadata_to_storage.py (1 hunks)
notebooks/cognee_llama_index.ipynb (1 hunks)
🚧 Files skipped from review as they are similar to previous changes (3)
- cognee/tasks/ingestion/__init__.py
- cognee/tasks/ingestion/ingest_data_with_metadata.py
- cognee/tasks/ingestion/save_data_item_to_storage.py
    # data is a file object coming from upload.
    elif hasattr(data_item, "file"):
        file_path = save_data_to_file(data_item.file, dataset_name, filename=data_item.filename)
Add robust file object validation.
The current implementation makes assumptions about the file object structure without proper validation.
- elif hasattr(data_item, "file"):
- file_path = save_data_to_file(data_item.file, dataset_name, filename=data_item.filename)
+ elif hasattr(data_item, "file") and hasattr(data_item, "filename"):
+ if not data_item.file:
+ raise ValueError("File object is empty")
+ if not data_item.filename:
+ raise ValueError("Filename is missing")
+ file_path = save_data_to_file(data_item.file, dataset_name, filename=data_item.filename)
📝 Committable suggestion
‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.
    # data is a file object coming from upload.
    elif hasattr(data_item, "file") and hasattr(data_item, "filename"):
        if not data_item.file:
            raise ValueError("File object is empty")
        if not data_item.filename:
            raise ValueError("Filename is missing")
        file_path = save_data_to_file(data_item.file, dataset_name, filename=data_item.filename)
def save_data_item_with_metadata_to_storage(data_item: Union[BinaryIO, str, Any], dataset_name: str) -> str:
    # Dynamic import is used because the llama_index module is optional.
    # For the same reason Any is accepted as a data item
    from llama_index.core import Document
    from .transform_data import get_data_from_llama_index
🛠️ Refactor suggestion
Add error handling for dynamic imports.
The dynamic imports could fail silently if the optional dependencies are not installed. Consider adding explicit error handling.
- from llama_index.core import Document
- from .transform_data import get_data_from_llama_index
+ try:
+ from llama_index.core import Document
+ from .transform_data import get_data_from_llama_index
+ except ImportError as e:
+ raise ImportError(
+ "Failed to import required dependencies. "
+ "Please ensure llama_index is installed for Document support."
+ ) from e
📝 Committable suggestion
‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.
def save_data_item_with_metadata_to_storage(data_item: Union[BinaryIO, str, Any], dataset_name: str) -> str:
    # Dynamic import is used because the llama_index module is optional.
    # For the same reason Any is accepted as a data item
    try:
        from llama_index.core import Document
        from .transform_data import get_data_from_llama_index
    except ImportError as e:
        raise ImportError(
            "Failed to import required dependencies. "
            "Please ensure llama_index is installed for Document support."
        ) from e
    elif isinstance(data_item, str):
        # data is a file path
        if data_item.startswith("file://") or data_item.startswith("/"):
            file_path = data_item.replace("file://", "")
        # data is text
        else:
            file_path = save_data_to_file(data_item, dataset_name)
🛠️ Refactor suggestion
Enhance string input validation and handling.
The current string handling could be improved with better file path validation and text content limits.
elif isinstance(data_item, str):
+ if not data_item.strip():
+ raise ValueError("Empty string input")
+
# data is a file path
if data_item.startswith("file://") or data_item.startswith("/"):
file_path = data_item.replace("file://", "")
+ # Verify file exists and is accessible
+ import os
+ if not os.path.isfile(file_path):
+ raise FileNotFoundError(f"File not found: {file_path}")
# data is text
else:
+ # Add reasonable size limit for text content
+ if len(data_item) > 10_000_000: # 10MB limit
+ raise ValueError("Text content exceeds size limit")
file_path = save_data_to_file(data_item, dataset_name)
📝 Committable suggestion
‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.
    elif isinstance(data_item, str):
        if not data_item.strip():
            raise ValueError("Empty string input")

        # data is a file path
        if data_item.startswith("file://") or data_item.startswith("/"):
            file_path = data_item.replace("file://", "")
            # Verify file exists and is accessible
            import os
            if not os.path.isfile(file_path):
                raise FileNotFoundError(f"File not found: {file_path}")
        # data is text
        else:
            # Add reasonable size limit for text content
            if len(data_item) > 10_000_000:  # 10MB limit
                raise ValueError("Text content exceeds size limit")
            file_path = save_data_to_file(data_item, dataset_name)
"async def add(data: Union[BinaryIO, list[BinaryIO], str, list[str]], dataset_name: str = \"main_dataset\", user: User = None):\n", | ||
" await create_relational_db_and_tables()\n", | ||
" await create_pgvector_db_and_tables()\n", | ||
"\n", | ||
" if user is None:\n", | ||
" user = await get_default_user()\n", | ||
"\n", | ||
" await ingest_data_with_metadata(data, dataset_name, user)\n", |
🛠️ Refactor suggestion
Improve function definition and documentation
The add function should be properly documented and defined in a module rather than a notebook cell.
- async def add(data: Union[BinaryIO, list[BinaryIO], str, list[str]], dataset_name: str = "main_dataset", user: User = None):
+ async def add(
+ data: Union[BinaryIO, list[BinaryIO], str, list[str]],
+ dataset_name: str = "main_dataset",
+ user: User = None
+ ) -> None:
+ """Add data to Cognee for processing.
+
+ Args:
+ data: Input data in various formats (file, string, or lists thereof)
+ dataset_name: Name to identify the dataset (default: "main_dataset")
+ user: User performing the operation (default: None, uses default user)
+
+ Raises:
+ RuntimeError: If database setup fails
+ """
await create_relational_db_and_tables()
await create_pgvector_db_and_tables()
Consider moving this function to a proper module in the codebase.
📝 Committable suggestion
‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.
"async def add(data: Union[BinaryIO, list[BinaryIO], str, list[str]], dataset_name: str = \"main_dataset\", user: User = None):\n", | |
" await create_relational_db_and_tables()\n", | |
" await create_pgvector_db_and_tables()\n", | |
"\n", | |
" if user is None:\n", | |
" user = await get_default_user()\n", | |
"\n", | |
" await ingest_data_with_metadata(data, dataset_name, user)\n", | |
async def add( | |
data: Union[BinaryIO, list[BinaryIO], str, list[str]], | |
dataset_name: str = "main_dataset", | |
user: User = None | |
) -> None: | |
"""Add data to Cognee for processing. | |
Args: | |
data: Input data in various formats (file, string, or lists thereof) | |
dataset_name: Name to identify the dataset (default: "main_dataset") | |
user: User performing the operation (default: None, uses default user) | |
Raises: | |
RuntimeError: If database setup fails | |
""" | |
await create_relational_db_and_tables() | |
await create_pgvector_db_and_tables() | |
if user is None: | |
user = await get_default_user() | |
await ingest_data_with_metadata(data, dataset_name, user) |
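For context, a minimal usage sketch of this coroutine from a notebook cell; the call site is illustrative and assumes the documents list built earlier in the notebook plus Jupyter's support for top-level await:

# Hypothetical notebook cell: ingest the LlamaIndex documents under the default dataset name.
await add(documents, dataset_name="main_dataset")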
Summary by CodeRabbit
Release Notes
New Features
- ingest_data_with_metadata function for ingesting data along with metadata.
- save_data_item_to_storage function to handle various data item types for storage.
- save_data_item_with_metadata_to_storage function for enhanced data storage capabilities.

Improvements

- save_data_to_storage function for improved efficiency.
- cognee.search function.
- instructor and tenacity packages.
- llama-index-core.
- psycopg2 to include an optional attribute.

Chores