Cog 337 llama index support #186
Conversation
Added support for LlamaIndex Document type Feature #COG-337
Added jupyter notebook which demonstrates cognee with LlamaIndex document type usage Docs #COG-337
Allow usage of metadata from LlamaIndex documents Feature #COG-337
Change name of llama index function Refactor #COG-337
Downgrade needed on tenacity and instructor modules to support llama index Chore #COG-337
…cognee into COG-337-llama-index-metadata
Added task that will have access to metadata if data is provided from different data ingestion tools Feature #COG-337
Explained why specific type checking is used instead of isinstance, as isinstance returns True for child classes as well Docs #COG-337
Added missing parameter to function call Fix #COG-337
Moved data storing from async to sync Refactor #COG-337
Refactor so ingest_data file was modified instead of having two ingest tasks Refactor #COG-337
Merged new and old data ingestion tasks into one Refactor #COG-337
Returned ingest_data and save_data_to_storage tasks Refactor #COG-337
Returned previous ingestion tasks to add function Refactor #COG-337
Walkthrough: This pull request introduces several changes across multiple files.
| GitGuardian id | GitGuardian status | Secret | Commit | Filename |
|---|---|---|---|---|
| 9573981 | Triggered | Generic Password | b05a07f | notebooks/cognee_llama_index.ipynb |
| 13947935 | Triggered | Generic High Entropy Secret | f6e8294 | cognee/shared/utils.py |
🛠 Guidelines to remediate hardcoded secrets
- Understand the implications of revoking this secret by investigating where it is used in your code.
- Replace and store your secrets safely. Learn the best practices here.
- Revoke and rotate these secrets.
- If possible, rewrite git history. Rewriting git history is not a trivial act. You might completely break other contributing developers' workflow and you risk accidentally deleting legitimate data.
To avoid such incidents in the future, consider:
- following these best practices for managing and storing secrets, including API keys and other credentials (a minimal example follows this list)
- installing secret detection on pre-commit to catch secrets before they leave your machine and ease remediation.
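For instance, the usual way to keep keys like the ones flagged above out of notebooks and source files in a Python project is to load them from the environment (for example, from a git-ignored .env file). A minimal sketch: the LLM_API_KEY variable name is borrowed from this PR's workflow config, and the python-dotenv dependency is an assumption, not part of this repository.

import os
from dotenv import load_dotenv  # assumes python-dotenv is available

load_dotenv()  # reads variables from a local, git-ignored .env file

llm_api_key = os.environ.get("LLM_API_KEY")
if llm_api_key is None:
    raise RuntimeError("LLM_API_KEY is not set; add it to your environment or .env file")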
🦉 GitGuardian detects secrets in your source code to help developers and security teams secure the modern development process. You are seeing this because you or someone else with access to this repository has authorized GitGuardian to scan your pull request.
Actionable comments posted: 12
🧹 Outside diff range and nitpick comments (10)
cognee/tasks/ingestion/save_data_item_to_storage.py (2)
Line 6: Add a docstring to improve code documentation. The function lacks a docstring explaining its purpose, parameters, return value, and possible exceptions.

 def save_data_item_to_storage(data_item: Union[BinaryIO, Document, str], dataset_name: str) -> str:
+    """Save a data item to storage based on its type.
+
+    Args:
+        data_item: The data to save. Can be a file object, LlamaIndex Document, or string path/content
+        dataset_name: Name of the dataset to save the data under
+
+    Returns:
+        str: Path where the data was saved
+
+    Raises:
+        ValueError: If the data_item type is not supported
+    """
Lines 22-23: Enhance the error message with supported types. The current error message could be more helpful by listing the supported types.

-    raise ValueError(f"Data type not supported: {type(data_item)}")
+    raise ValueError(
+        f"Data type not supported: {type(data_item)}. "
+        "Supported types are: BinaryIO, Document, or str (file path or content)"
+    )

cognee/modules/users/permissions/methods/check_permission_on_documents.py (1)
Line 36: Add a null check for user.email. While changing from username to email makes sense, we should ensure the email field is not null to prevent unclear error messages.

-    raise PermissionDeniedException(f"User {user.email} does not have {permission_type} permission on documents")
+    user_identifier = user.email or "Unknown user"
+    raise PermissionDeniedException(f"User {user_identifier} does not have {permission_type} permission on documents")

cognee/tasks/ingestion/ingest_data.py (4)
Line range hint 11-24: Consider splitting the ingest_data function for better maintainability. The function handles multiple responsibilities, including pipeline setup, data ingestion, and telemetry. Consider breaking it down into smaller, focused functions.
Suggested structure:

async def setup_pipeline(destination):
    return dlt.pipeline(
        pipeline_name="file_load_from_filesystem",
        destination=destination,
    )

async def ingest_data(file_paths: list[str], dataset_name: str, user: User):
    destination = get_dlt_destination()
    pipeline = await setup_pipeline(destination)
    # ... rest of the code
Line range hint 31-63: Add explicit error handling for database operations. The database operations lack explicit error handling, which could lead to inconsistent states in case of failures. Consider adding try-catch blocks and explicit transaction management.
Suggested improvement:

async def handle_data_record(session, data_id, file_metadata, dataset):
    try:
        data = (await session.execute(
            select(Data).filter(Data.id == data_id)
        )).scalar_one_or_none()

        if data is not None:
            data.name = file_metadata["name"]
            data.raw_data_location = file_metadata["file_path"]
            data.extension = file_metadata["extension"]
            data.mime_type = file_metadata["mime_type"]
            await session.merge(data)
        else:
            data = Data(
                id=data_id,
                name=file_metadata["name"],
                raw_data_location=file_metadata["file_path"],
                extension=file_metadata["extension"],
                mime_type=file_metadata["mime_type"],
            )
            dataset.data.append(data)

        await session.commit()
        return data
    except Exception as e:
        await session.rollback()
        raise RuntimeError(f"Failed to handle data record: {str(e)}")
Line range hint 26-30: Add input validation for file paths and secure file handling. The current implementation directly uses file paths with minimal validation, which could pose security risks. Consider adding path validation and sanitization.

from pathlib import Path
import os

def validate_file_path(file_path: str) -> str:
    # Remove file:// prefix and normalize path
    cleaned_path = file_path.replace("file://", "")
    abs_path = os.path.abspath(cleaned_path)

    # Ensure path is within allowed directory
    if not abs_path.startswith('/allowed/base/path'):
        raise ValueError("File path is outside allowed directory")

    if not os.path.exists(abs_path):
        raise FileNotFoundError(f"File not found: {abs_path}")

    return abs_path

# Use in the function:
validated_path = validate_file_path(file_path)
with open(validated_path, mode="rb") as file:
    # ... rest of the code
Line range hint 65-67: Ensure atomic operations for permission assignment. The permission assignment happens after yielding data, which could lead to a race condition where data is accessible before permissions are properly set.
Consider moving the permission assignment before yielding the data:

await give_permission_on_document(user, data_id, "read")
await give_permission_on_document(user, data_id, "write")

yield {
    "id": data_id,
    "name": file_metadata["name"],
    "file_path": file_metadata["file_path"],
    "extension": file_metadata["extension"],
    "mime_type": file_metadata["mime_type"],
}

notebooks/cognee_llama_index.ipynb (3)
Lines 27-34: Add error handling and data validation for robust data loading. Consider these improvements for production readiness:
- Add a try-except block for data loading
- Validate the loaded data (check for required columns, non-empty values)
- Make the sample size configurable
- Consider caching the dataset locally

+try:
     news = pd.read_csv(
         "https://raw.githubusercontent.com/tomasonjo/blog-datasets/main/news_articles.csv"
-    )[:5]
+    ).head(5)  # Using head() is more explicit than slice
+
+    # Validate required columns
+    required_columns = ['title', 'text']
+    if not all(col in news.columns for col in required_columns):
+        raise ValueError(f"Dataset missing required columns: {required_columns}")
+
+    # Validate non-empty values
+    if news[required_columns].isnull().any().any():
+        raise ValueError("Dataset contains null values in required columns")
+except Exception as e:
+    raise RuntimeError(f"Failed to load news dataset: {str(e)}")
Lines 50-53: Enhance document preparation with metadata and better formatting. The current document preparation could be improved by:
- Adding clear separation between title and text
- Including metadata for better document tracking
- Adding error handling

 documents = [
-    Document(text=f"{row['title']}: {row['text']}")
+    Document(
+        text=f"Title: {row['title']}\n\nContent: {row['text']}",
+        metadata={
+            "title": row['title'],
+            "doc_id": f"news_{i}",
+            "source": "diffbot_news_dataset"
+        }
+    )
     for i, row in news.iterrows()
 ]
Lines 1-205: Consider converting this notebook into a production-ready tutorial. While this notebook effectively demonstrates the LlamaIndex integration, consider these improvements for production use:
- Add a requirements.txt or environment.yml file
- Include proper error handling throughout
- Add detailed documentation about prerequisites and setup
- Consider creating a proper Python package with the core functionality
- Add example outputs and expected results
Would you like me to help create a production-ready version of this tutorial?
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
⛔ Files ignored due to path filters (1)
- poetry.lock is excluded by !**/*.lock
📒 Files selected for processing (10)
- cognee/api/v1/add/add_v2.py (1 hunks)
- cognee/modules/users/permissions/methods/check_permission_on_documents.py (1 hunks)
- cognee/tasks/ingestion/__init__.py (1 hunks)
- cognee/tasks/ingestion/ingest_data.py (1 hunks)
- cognee/tasks/ingestion/ingest_data_with_metadata.py (1 hunks)
- cognee/tasks/ingestion/save_data_item_to_storage.py (1 hunks)
- cognee/tasks/ingestion/save_data_to_storage.py (2 hunks)
- cognee/tasks/ingestion/transform_data.py (1 hunks)
- notebooks/cognee_llama_index.ipynb (1 hunks)
- pyproject.toml (3 hunks)
✅ Files skipped from review due to trivial changes (1)
- cognee/api/v1/add/add_v2.py
🧰 Additional context used
🪛 Ruff
cognee/tasks/ingestion/transform_data.py
8-8: Use `is` and `is not` for type comparisons, or `isinstance()` for isinstance checks (E721)
14-14: Use `is` and `is not` for type comparisons, or `isinstance()` for isinstance checks (E721)
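For context on why this PR deliberately trips the rule: `isinstance()` also matches child classes, while an exact `type(...)` comparison does not, which is the behaviour the commit message above relies on. A minimal, self-contained illustration; the classes below are stand-ins mirroring the LlamaIndex Document/ImageDocument hierarchy, not the real ones.

from dataclasses import dataclass
from typing import Optional

@dataclass
class Document:
    text: str

@dataclass
class ImageDocument(Document):  # stand-in child class of Document
    image_path: Optional[str] = None

doc = ImageDocument(text="hello")

print(isinstance(doc, Document))   # True  -- isinstance also accepts child classes
print(type(doc) == Document)       # False -- exact type check excludes child classes
print(type(doc) is ImageDocument)  # True  -- the comparison style Ruff's E721 prefers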
🔇 Additional comments (10)
cognee/tasks/ingestion/__init__.py (1)
3-3
: LGTM! Verify the imported module exists.
The new import follows the consistent pattern of the module and aligns with the PR's objective of adding LlamaIndex support.
Let's verify the imported module exists:
✅ Verification successful
Import verified and correctly used across the codebase
The module save_data_item_to_storage.py
exists in the expected location and the import statement is consistently used across multiple files in the ingestion package.
🏁 Scripts executed
The following scripts were executed for the analysis:
Script:
#!/bin/bash
# Description: Verify the existence of the imported module
# Expected: The file should exist in the same directory
# Test 1: Check if the module file exists
fd --type f "save_data_item_to_storage.py" cognee/tasks/ingestion/
# Test 2: Verify there are no typos in the import statement across the codebase
rg "from.*save_data_item_to_storage import" --type py
Length of output: 525
cognee/tasks/ingestion/save_data_to_storage.py (2)
1-2
: LGTM! Import changes align with the new storage abstraction.
The switch to save_data_item_to_storage
suggests a more flexible storage approach that can accommodate different types of data items, including LlamaIndex documents.
12-13
: 🛠️ Refactor suggestion
Verify type handling and error propagation.
While the simplified code improves readability, there are a few concerns to address:
- The function signature suggests it only handles Union[BinaryIO, str], but the AI summary mentions Document support. Consider updating the type hints to reflect all supported types.
- There's no visible error handling for failed storage operations.
Let's verify the type handling in save_data_item_to_storage:
Consider adding error handling:
def save_data_to_storage(data: Union[BinaryIO, str], dataset_name) -> list[str]:
if not isinstance(data, list):
data = [data]
file_paths = []
for data_item in data:
- file_path = save_data_item_to_storage(data_item, dataset_name)
- file_paths.append(file_path)
+ try:
+ file_path = save_data_item_to_storage(data_item, dataset_name)
+ file_paths.append(file_path)
+ except Exception as e:
+ raise RuntimeError(f"Failed to save data item: {str(e)}") from e
return file_paths
cognee/tasks/ingestion/transform_data.py (1)
1-4
: LGTM! Imports are well-organized and specific.
The imports are properly structured with external packages first (llama_index) followed by internal modules (cognee).
cognee/tasks/ingestion/save_data_item_to_storage.py (1)
1-4
: LGTM! Imports and type hints are well-structured.
The imports are appropriate and type hints are properly defined.
cognee/modules/users/permissions/methods/check_permission_on_documents.py (1)
36-36
: Verify consistency of user identification in error messages
Let's check if other permission-related error messages in the codebase use email consistently.
✅ Verification successful
Error message format is consistent with codebase patterns
The error message using user.email
is consistent with the codebase's patterns. This is the only permission-related error message that includes user identification, and it appropriately uses the email attribute which is a standard identifier for users in the system.
🏁 Scripts executed
The following scripts were executed for the analysis:
Script:
#!/bin/bash
# Search for permission-related error messages to verify consistency
# Look for patterns that might use username or email in error messages
# Search for PermissionDeniedException usage
echo "Checking PermissionDeniedException messages:"
rg -A 1 "PermissionDeniedException.*User.*" --type py
# Search for similar permission-related error messages
echo -e "\nChecking other permission-related error messages:"
rg "permission.*User \{.*\}" --type py
Length of output: 506
Script:
#!/bin/bash
# Let's search more broadly for permission-related error messages and user references
echo "Checking for permission-related exceptions and messages:"
rg "Permission.*Exception.*" -B 1 -A 1 --type py
echo -e "\nChecking for user-related error messages:"
rg "User.*does not have|has no.*permission" --type py
echo -e "\nChecking for any error messages with user references:"
rg "raise.*User" -B 1 -A 1 --type py
Length of output: 2538
pyproject.toml (3)
73-73
: LGTM: llama-index-core dependency.
The addition of llama-index-core ^0.11.22 aligns with the PR objectives for adding LlamaIndex support.
49-49
: Verify impact of tenacity downgrade.
The downgrade from ^9.0.0 to ^8.4.1 is a breaking change that could affect retry mechanisms and error handling.
Let's check for tenacity usage that might be impacted:
#!/bin/bash
# Description: Search for tenacity usage patterns that might be affected by the version downgrade
# Test: Look for tenacity decorators and retry configurations
rg -A 5 "@retry"
rg -A 5 "from tenacity"
39-39
: Verify impact of instructor downgrade.
The downgrade from 1.6.3 to ^1.5.2 is a breaking change that could affect existing functionality.
Let's check for instructor usage that might be impacted:
✅ Verification successful
Let me generate additional verification scripts to check the specific instructor features being used and compare them with the changelog between versions 1.5.2 and 1.6.3.
The results will help us determine if the downgrade impacts any of the used features by comparing them with the version differences.
Instructor downgrade is safe to proceed
The codebase uses basic instructor features that are stable across versions 1.5.2 to 1.6.3:
- Core initialization methods: from_openai() and patch()
- Basic mode enums: Mode.JSON, Mode.MD_JSON, and Mode.ANTHROPIC_TOOLS
- Standard structured output generation via OpenAI and Anthropic clients
These fundamental features were introduced before 1.5.2 and remain unchanged in the downgrade.
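For reference, the usage pattern being checked looks roughly like the sketch below. This is a hedged example based on instructor's documented API (client setup plus a response_model for structured output), not code taken from this repository; the model name and the Pydantic schema are illustrative.

import instructor
from openai import OpenAI
from pydantic import BaseModel

class ArticleSummary(BaseModel):
    title: str
    key_points: list[str]

# Core initialization method and mode enum referenced in the list above
client = instructor.from_openai(OpenAI(), mode=instructor.Mode.JSON)

# Standard structured output generation: the response is parsed into ArticleSummary
summary = client.chat.completions.create(
    model="gpt-4o-mini",  # illustrative model name
    response_model=ArticleSummary,
    messages=[{"role": "user", "content": "Summarize: cognee now ingests LlamaIndex documents."}],
)
print(summary.title, summary.key_points)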
🏁 Scripts executed
The following scripts were executed for the analysis:
Script:
#!/bin/bash
# Description: Search for instructor usage patterns that might be affected by the version downgrade
# Test: Look for instructor imports and usage
rg -A 5 "from instructor"
rg -A 5 "import instructor"
# Test: Look for specific instructor features that might have changed between versions
ast-grep --pattern 'instructor.$_'
Length of output: 9685
Script:
#!/bin/bash
# Look for imports to understand which instructor features are being used
rg -l "import instructor"
# Look for specific instructor Mode usage
rg "instructor\.Mode\."
# Look for instructor initialization patterns
rg "instructor\.(patch|from_openai)"
Length of output: 1060
cognee/tasks/ingestion/ingest_data.py (1)
6-6
: Verify the removal of get_relational_config dependency.
The removal of the get_relational_config import while retaining get_relational_engine suggests an architectural change. Let's verify there are no remaining references to the removed import.
def get_data_from_llama_index(data_point: Union[Document, ImageDocument], dataset_name: str) -> str:
    # Specific type checking is used to ensure it's not a child class from Document
    if type(data_point) == Document:
        file_path = data_point.metadata.get("file_path")
        if file_path is None:
            file_path = save_data_to_file(data_point.text, dataset_name)
            return file_path
        return file_path
    elif type(data_point) == ImageDocument:
        if data_point.image_path is None:
            file_path = save_data_to_file(data_point.text, dataset_name)
            return file_path
        return data_point.image_path
🛠️ Refactor suggestion
Refactor to reduce code duplication and improve error handling
The function has similar logic patterns that could be consolidated, and it needs better error handling.
Consider this improved implementation:
def get_data_from_llama_index(data_point: Union[Document, ImageDocument], dataset_name: str) -> str:
+ """Extract or generate a file path from a LlamaIndex document.
+
+ Args:
+ data_point: A Document or ImageDocument from LlamaIndex
+ dataset_name: Name of the dataset for file path generation
+
+ Returns:
+ str: Path to the file containing the document's content
+
+ Raises:
+ TypeError: If data_point is neither Document nor ImageDocument
+ """
+ if isinstance(data_point, ImageDocument):
+ return data_point.image_path or save_data_to_file(data_point.text, dataset_name)
+ elif isinstance(data_point, Document):
+ return data_point.metadata.get("file_path") or save_data_to_file(data_point.text, dataset_name)
+ else:
+ raise TypeError(f"Expected Document or ImageDocument, got {type(data_point).__name__}")
This refactored version:
- Uses the `or` operator for cleaner null-coalescing
- Adds proper error handling for invalid types
- Reduces code duplication
- Handles the ImageDocument check first to avoid isinstance ambiguity
if data_item.startswith("file://") or data_item.startswith("/"):
    file_path = data_item.replace("file://", "")
💡 Codebase verification
Based on the gathered information, I can now provide the final response:
Add path validation and sanitization to prevent path traversal vulnerabilities
The current implementation lacks proper path validation, which could lead to security issues:
- The file_path = data_item.replace("file://", "") operation is insufficient for security
- No validation is performed to ensure paths are within allowed boundaries
- No sanitization against directory traversal attacks (e.g., ../../../etc/passwd)
Required changes:
- Use os.path.abspath() and verify the resolved path is within the allowed data directory
- Add path sanitization to remove any potential directory traversal sequences
- Implement path validation before file operations in save_data_item_to_storage.py
Example vulnerable operations found:
- Direct path usage in ingest_data.py: open(file_path.replace("file://", ""), mode = "rb")
- Unchecked path handling in save_data_item_to_storage.py
- Raw path concatenation in save_data_to_file.py: storage_path = data_directory_path + "/" + dataset_name.replace(".", "/")
🔗 Analysis chain
Add path validation for security.
Direct handling of file paths without validation could lead to path traversal vulnerabilities. Consider:
- Validating the path is within allowed boundaries
- Sanitizing the path to prevent directory traversal
Let's check if there are any path validation utilities in the codebase:
Let me search for more specific path-related security implementations:
Let me check if there are any security-related utilities or path validation functions in the codebase:
Let me check how the file paths are being used after they're processed:
🏁 Scripts executed
The following scripts were executed for the analysis:
Script:
#!/bin/bash
# Search for path validation utilities
rg -l "(?i)(validate|sanitize|clean|check).*path"
Length of output: 63
Script:
#!/bin/bash
# Search for path-related security checks and normalizations
rg -i "path|directory" -A 3 -B 3 cognee/tasks/ingestion/
Length of output: 11512
Script:
#!/bin/bash
# Search for security-related utilities and path validation functions
rg -i "def.*secure|sanitize|validate|normalize.*path" -g "*.py"
# Also check for any security-related imports
rg -i "from.*security|import.*pathlib|import.*os\.path" -g "*.py"
Length of output: 1185
Script:
#!/bin/bash
# Check for file operations and how paths are used
rg -A 5 "open\(.*file_path" -g "*.py"
# Also check for the save_data_to_file implementation
rg -A 10 "def save_data_to_file" -g "*.py"
Length of output: 6430
notebooks/cognee_llama_index.ipynb
Outdated
"# Create a clean slate for cognee -- reset data and system state\n", | ||
"await cognee.prune.prune_data()\n", | ||
"await cognee.prune.prune_system(metadata=True)\n", | ||
"\n", | ||
"# Add the LlamaIndex documents, and make it available for cognify\n", | ||
"await cognee.add(documents)\n", | ||
"\n", | ||
"# Use LLMs and cognee to create knowledge graph\n", | ||
"await cognee.cognify()" |
🛠️ Refactor suggestion
Add error handling and progress tracking for Cognee operations.
The current implementation lacks error handling and progress feedback for potentially long-running operations.
+from tqdm.notebook import tqdm
+
# Create a clean slate for cognee -- reset data and system state
+try:
await cognee.prune.prune_data()
await cognee.prune.prune_system(metadata=True)
+ print("✓ System reset complete")
-# Add the LlamaIndex documents, and make it available for cognify
-await cognee.add(documents)
+ # Add documents with progress tracking
+ print("\nAdding documents...")
+ for doc in tqdm(documents, desc="Processing"):
+ await cognee.add([doc])
+ print("✓ Documents added successfully")
-# Use LLMs and cognee to create knowledge graph
-await cognee.cognify()
+ # Create knowledge graph
+ print("\nGenerating knowledge graph...")
+ await cognee.cognify()
+ print("✓ Knowledge graph generated successfully")
+except Exception as e:
+ print(f"❌ Error during Cognee operations: {str(e)}")
+ raise
notebooks/cognee_llama_index.ipynb
Outdated
"from cognee import SearchType\n", | ||
"\n", | ||
"# Query cognee for summaries\n", | ||
"search_results = await cognee.search(\n", | ||
" SearchType.SUMMARIES,\n", | ||
" {'query': \"What are the main news discussed in the document?\"}\n", | ||
")\n", | ||
"# Display search results\n", | ||
"print(\"\\n Summary of main news discussed:\\n\")\n", | ||
"print(search_results[0][\"text\"])" |
🛠️ Refactor suggestion
Improve search results handling and display.
The current implementation only shows the first result without proper error handling or validation.
-# Query cognee for summaries
-search_results = await cognee.search(
- SearchType.SUMMARIES,
- {'query': "What are the main news discussed in the document?"}
-)
-# Display search results
-print("\n Summary of main news discussed:\n")
-print(search_results[0]["text"])
+try:
+ # Query cognee for summaries
+ search_results = await cognee.search(
+ SearchType.SUMMARIES,
+ {'query': "What are the main news discussed in the document?"}
+ )
+
+ if not search_results:
+ print("No results found")
+ return
+
+ # Display search results
+ print("\nSummary of main news discussed:\n")
+ for i, result in enumerate(search_results, 1):
+ print(f"Result {i}:")
+ print(f"Score: {result.get('score', 'N/A')}")
+ print(f"Text: {result['text']}\n")
+except Exception as e:
+ print(f"Error during search: {str(e)}")
with open(file_path.replace("file://", ""), mode = "rb") as file:
    classified_data = ingestion.classify(file)
Add Exception Handling for File Operations
Opening a file can raise exceptions such as FileNotFoundError or PermissionError. To prevent the program from crashing, add exception handling around file operations to handle these potential errors gracefully.
I can help implement appropriate exception handling if needed.
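A minimal sketch of the kind of guard being suggested, reusing the names from the snippet above; the exact handling strategy (re-raise, log, or skip the item) is up to the authors.

try:
    with open(file_path.replace("file://", ""), mode="rb") as file:
        classified_data = ingestion.classify(file)
except (FileNotFoundError, PermissionError) as error:
    # Surface a clear, contextual error instead of letting the ingestion task crash opaquely
    raise RuntimeError(f"Could not read ingested file at {file_path}: {error}") from error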
async with db_engine.get_async_session() as session:
    dataset = await create_dataset(dataset_name, user.id, session)

    data_point = (await session.execute(
        select(Data).filter(Data.id == data_id)
    )).scalar_one_or_none()

    if data_point is not None:
        data_point.name = file_metadata["name"]
        data_point.raw_data_location = file_metadata["file_path"]
        data_point.extension = file_metadata["extension"]
        data_point.mime_type = file_metadata["mime_type"]

        await session.merge(data_point)
        await session.commit()
    else:
        data_point = Data(
            id = data_id,
            name = file_metadata["name"],
            raw_data_location = file_metadata["file_path"],
            extension = file_metadata["extension"],
            mime_type = file_metadata["mime_type"],
        )

        dataset.data.append(data_point)
        await session.commit()
Ensure Proper Transaction Management and Exception Handling in Database Operations
While using async sessions, it's important to handle exceptions that may occur during database operations to prevent partial commits or data inconsistencies. Consider adding exception handling and ensuring that transactions are properly managed.
I can assist in implementing robust exception handling for database transactions if you'd like.
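A sketch of one way to do that with the async SQLAlchemy session used above: wrap the work in try/except, roll back on failure, and commit once. The structure mirrors the snippet above; the error type and message are illustrative, not the repository's code.

async with db_engine.get_async_session() as session:
    try:
        dataset = await create_dataset(dataset_name, user.id, session)

        data_point = (await session.execute(
            select(Data).filter(Data.id == data_id)
        )).scalar_one_or_none()

        if data_point is not None:
            data_point.name = file_metadata["name"]
            data_point.raw_data_location = file_metadata["file_path"]
            data_point.extension = file_metadata["extension"]
            data_point.mime_type = file_metadata["mime_type"]
            await session.merge(data_point)
        else:
            data_point = Data(
                id = data_id,
                name = file_metadata["name"],
                raw_data_location = file_metadata["file_path"],
                extension = file_metadata["extension"],
                mime_type = file_metadata["mime_type"],
            )
            dataset.data.append(data_point)

        # A single commit keeps the dataset/data changes atomic
        await session.commit()
    except Exception as error:
        await session.rollback()
        raise RuntimeError(f"Failed to store data record {data_id}: {error}") from error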
file_path = save_data_item_to_storage(data_item, dataset_name)

# Ingest data and add metadata
with open(file_path.replace("file://", ""), mode = "rb") as file:
🛠️ Refactor suggestion
Use a Robust Method to Parse File URIs
The current approach of using file_path.replace("file://", "") may not handle all edge cases of file URIs, especially across different operating systems. Consider using the urllib.parse or pathlib modules to reliably parse the file path from the URI.
Here's how you might adjust the code:
+from urllib.parse import urlparse
+parsed_url = urlparse(file_path)
+file_system_path = parsed_url.path
-with open(file_path.replace("file://", ""), mode="rb") as file:
+with open(file_system_path, mode="rb") as file:
Committable suggestion skipped: line range outside the PR's diff.
if data_point is not None:
    data_point.name = file_metadata["name"]
    data_point.raw_data_location = file_metadata["file_path"]
    data_point.extension = file_metadata["extension"]
    data_point.mime_type = file_metadata["mime_type"]

    await session.merge(data_point)
    await session.commit()
else:
    data_point = Data(
        id = data_id,
        name = file_metadata["name"],
        raw_data_location = file_metadata["file_path"],
        extension = file_metadata["extension"],
        mime_type = file_metadata["mime_type"],
    )

    dataset.data.append(data_point)
    await session.commit()
🛠️ Refactor suggestion
Refactor to Reduce Code Duplication in Data Handling
The code blocks for updating an existing data_point and creating a new one share similar assignments to data_point attributes. To improve maintainability and readability, refactor the code to eliminate the duplication.
Here is a suggested refactor:
if data_point is not None:
- data_point.name = file_metadata["name"]
- data_point.raw_data_location = file_metadata["file_path"]
- data_point.extension = file_metadata["extension"]
- data_point.mime_type = file_metadata["mime_type"]
-
- await session.merge(data_point)
- await session.commit()
+ pass # Existing data_point, no need to create a new one
else:
data_point = Data(
id=data_id,
- # These fields will be set below
)
dataset.data.append(data_point)
- await session.commit()
+# Set or update shared attributes
+data_point.name = file_metadata["name"]
+data_point.raw_data_location = file_metadata["file_path"]
+data_point.extension = file_metadata["extension"]
+data_point.mime_type = file_metadata["mime_type"]
+await session.merge(data_point)
+await session.commit()
Looks really good overall, I left a few questions/suggestions. Can you think of any documentation that we should update/expand?
cognee/modules/users/permissions/methods/check_permission_on_documents.py
@@ -0,0 +1,205 @@
{ |
Should we add a test analogous to test_notebook for this new notebook?
@dexters1 , what do you think?
We can. If we plan on having more notebooks in the future, we can make my test_notebook action reusable, as there will be little difference between notebooks.
Could you do this in this MR? I think we should keep track of our published notebooks behaving as expected
I've added a test for this notebook in this MR, but the reusable action will be made in the next sprint. I don't have time to take care of it in this sprint.
Actionable comments posted: 1
🧹 Outside diff range and nitpick comments (3)
.github/workflows/test_cognee_llama_index_notebook.yml (3)
Lines 15-17: Consider using the DEBUG log level for better visibility. Setting RUNTIME__LOG_LEVEL to ERROR might hide useful information during notebook execution. Consider using DEBUG level in the workflow for better troubleshooting capabilities.
-  RUNTIME__LOG_LEVEL: ERROR
+  RUNTIME__LOG_LEVEL: DEBUG
Lines 47-51: Consider adding Jupyter to the project dependencies. Instead of adding Jupyter during the workflow execution, consider adding it to the project's pyproject.toml as a development dependency. This would make the development environment setup more consistent:
 # In pyproject.toml
 [tool.poetry.group.dev.dependencies]
+ jupyter = "^1.0.0"
Lines 52-63: Add a newline character at the end of the file to comply with POSIX standards.
     --execute notebooks/cognee_llama_index.ipynb \
     --output executed_notebook.ipynb \
-    --ExecutePreprocessor.timeout=1200
+    --ExecutePreprocessor.timeout=1200
+
🧰 Tools
🪛 yamllint
[error] 63-63: no new line character at the end of file (new-line-at-end-of-file)
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
⛔ Files ignored due to path filters (1)
- poetry.lock is excluded by !**/*.lock
📒 Files selected for processing (2)
- .github/workflows/test_cognee_llama_index_notebook.yml (1 hunks)
- pyproject.toml (3 hunks)
- pyproject.toml
🧰 Additional context used
🪛 actionlint
.github/workflows/test_cognee_llama_index_notebook.yml
26-26: if: condition "needs.get_docs_changes.outputs.changes_outside_docs == 'true' && ${{ github.event.label.name == 'run-checks' }}" is always evaluated to true because extra characters are around ${{ }}
(if-cond)
🪛 yamllint
.github/workflows/test_cognee_llama_index_notebook.yml
[error] 63-63: no new line character at the end of file
(new-line-at-end-of-file)
🔇 Additional comments (1)
.github/workflows/test_cognee_llama_index_notebook.yml (1)
1-14
: LGTM! Well-configured workflow triggers and concurrency settings.
The workflow configuration properly handles manual triggers and PR events, with appropriate concurrency controls to prevent parallel runs.
jobs:
  get_docs_changes:
    name: docs changes
    uses: ./.github/workflows/get_docs_changes.yml

  run_notebook_test:
    name: test
    needs: get_docs_changes
    if: needs.get_docs_changes.outputs.changes_outside_docs == 'true' && ${{ github.event.label.name == 'run-checks' }}
    runs-on: ubuntu-latest
Fix the condition syntax in the if statement.
The current condition mixes template syntax unnecessarily, which could cause evaluation issues.
- if: needs.get_docs_changes.outputs.changes_outside_docs == 'true' && ${{ github.event.label.name == 'run-checks' }}
+ if: needs.get_docs_changes.outputs.changes_outside_docs == 'true' && github.event.label.name == 'run-checks'
@dexters1 This
Seems like the humans are having a chat. I'll hop back into my burrow for now. If you need me again, just tag @coderabbitai in a new comment, and I'll come hopping out!
@borisarzentar Added change, maybe we should change this for the rest of the actions as well?
Remove dictionary and use string for query in notebook and simple example Fix COG-337
Actionable comments posted: 1
🧹 Outside diff range and nitpick comments (4)
examples/python/simple_example.py (2)
Line range hint 1-40: Consider enhancing the example documentation. While the example is well-documented with prerequisites, it would be valuable to add a comment about LlamaIndex integration since this PR is specifically about adding LlamaIndex support.
Consider adding a comment like:
 # Prerequisites:
 # 1. Copy `.env.template` and rename it to `.env`.
 # 2. Add your OpenAI API key to the `.env` file in the `LLM_API_KEY` field:
 #    LLM_API_KEY = "your_key_here"
 # 3. (Optional) To minimize setup effort, set `VECTOR_DB_PROVIDER="lancedb"` in `.env`.
+# 4. (Optional) This example supports LlamaIndex documents for enhanced document processing
API parameter style inconsistency found in documentation and examples
The search API usage shows inconsistency in parameter style:
- simple_example.py uses named parameters: SearchType.INSIGHTS, query='Tell me about NLP'
- docs/quickstart.md uses positional parameters: "INSIGHTS", {'query': 'Tell me about NLP'}
- Test files consistently use named parameters: SearchType.INSIGHTS, query = random_node_name
The documentation in quickstart.md needs to be updated to match the current API style using named parameters and the SearchType enum.
🔗 Analysis chain
1-40
: Verify the impact of API changesSince this example demonstrates the core API usage, we should verify that all other examples and documentation are consistent with this new parameter style.
🏁 Scripts executed
The following scripts were executed for the analysis:
Script:
#!/bin/bash # Search for other instances of cognee.search to ensure consistent parameter style rg -l "cognee.search" --type python | while read -r file; do if [ "$file" != "examples/python/simple_example.py" ]; then echo "=== $file ===" rg -A 2 "cognee\.search" "$file" fi doneLength of output: 81
Script:
#!/bin/bash # Search for other instances of cognee.search using ripgrep without file type filter rg -l "cognee\.search"Length of output: 403
Script:
#!/bin/bash # Search for Python files in the repository fd '\.py$'Length of output: 14317
Script:
#!/bin/bash # Search for documentation files that might contain API examples fd -e md -e rstLength of output: 607
Script:
#!/bin/bash # Check the search API usage in documentation and test files echo "=== Documentation examples ===" rg -A 5 "cognee\.search" docs/ echo -e "\n=== Test files ===" rg -A 5 "cognee\.search" "cognee/tests/"Length of output: 7591
notebooks/cognee_llama_index.ipynb (2)
Lines 50-53: Add validation for document creation. The document creation process should validate the input data to handle potential empty or malformed entries.

+def create_document(row: pd.Series) -> Document:
+    if not row['title'].strip() or not row['text'].strip():
+        raise ValueError(f"Empty title or text in row: {row.name}")
+    return Document(text=f"{row['title']}: {row['text']}")
+
 documents = [
-    Document(text=f"{row['title']}: {row['text']}")
-    for i, row in news.iterrows()
+    create_document(row)
+    for _, row in news.iterrows()
+    if pd.notna(row['title']) and pd.notna(row['text'])
 ]

Line 199: Consider updating the Python version. The notebook uses Python 3.9.6, which is quite old. Consider updating to a newer version (3.11+) to benefit from performance improvements and security patches, and to ensure compatibility with the latest package versions.
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
📒 Files selected for processing (2)
- examples/python/simple_example.py (1 hunks)
- notebooks/cognee_llama_index.ipynb (1 hunks)
🔇 Additional comments (2)
examples/python/simple_example.py (1)
29-31
: LGTM! API parameter style improvement
The change from dictionary-style parameters to keyword arguments makes the API more Pythonic and easier to use.
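Concretely, the change being approved here goes from passing the query as a positional dict to a keyword argument. An illustrative before/after based on the calls discussed in this PR, not a verbatim copy of the file; run inside an async context such as a notebook cell.

import cognee
from cognee import SearchType

# Before: query passed positionally as a dict
results = await cognee.search(SearchType.INSIGHTS, {"query": "Tell me about NLP"})

# After: query passed as a keyword argument
results = await cognee.search(SearchType.INSIGHTS, query="Tell me about NLP")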
notebooks/cognee_llama_index.ipynb (1)
1-1
: Verify notebook test implementation
The conversation indicates that a test for this notebook has been added, but let's verify its implementation.
✅ Verification successful
Test implementation is properly configured ✅
The notebook test has been correctly implemented with:
- Proper workflow triggers (PR with 'run-checks' label)
- Required environment setup (Python, Poetry)
- Necessary secrets configuration (LLM_API_KEY, Graphistry credentials)
- Appropriate notebook execution timeout (1200s)
- Correct notebook path and execution parameters
🏁 Scripts executed
The following scripts were executed for the analysis:
Script:
#!/bin/bash
# Description: Verify the existence and configuration of notebook test
# Check if test file exists
echo "Checking for notebook test file..."
fd "test_cognee_llama_index_notebook.yml" .github/workflows/
# Check test configuration
echo -e "\nChecking test configuration..."
rg -A 10 "cognee_llama_index.ipynb" .github/workflows/
Length of output: 638
Script:
#!/bin/bash
# Let's check the complete test configuration to ensure it's properly set up
echo "Complete test configuration:"
cat .github/workflows/test_cognee_llama_index_notebook.yml
Length of output: 1780
"import pandas as pd\n", | ||
"from llama_index.core import Document\n", | ||
"\n", | ||
"news = pd.read_csv(\n", | ||
" \"https://raw.githubusercontent.com/tomasonjo/blog-datasets/main/news_articles.csv\"\n", | ||
")[:5]\n", | ||
"\n", | ||
"news.head()" |
🛠️ Refactor suggestion
Add error handling and data validation for dataset loading
The current implementation lacks error handling for dataset loading and validation of the expected structure.
import pandas as pd
from llama_index.core import Document
+from typing import List, Dict
+
+def load_news_dataset(url: str, sample_size: int = 5) -> pd.DataFrame:
+ try:
+ df = pd.read_csv(url)
+ if not all(col in df.columns for col in ['title', 'text']):
+ raise ValueError("Dataset missing required columns: title, text")
+ return df.head(sample_size)
+ except Exception as e:
+ raise RuntimeError(f"Failed to load news dataset: {str(e)}")
-news = pd.read_csv(
- "https://raw.githubusercontent.com/tomasonjo/blog-datasets/main/news_articles.csv"
-)[:5]
+# Load and validate dataset
+news = load_news_dataset(
+ "https://raw.githubusercontent.com/tomasonjo/blog-datasets/main/news_articles.csv"
+)
Looks good to me, except for the small thing in GitHub actions, for consistency's sake :)
pull_request:
  branches:
    - main
  types: [labeled]
Currently, we apply [labeled, synchronize] in the other GitHub actions.
(@dexters1 could you also merge the latest main?) <- actually, never mind, it's not necessary!
pyproject.toml
Outdated
@@ -69,7 +69,8 @@ fastapi-users = {version = "*", extras = ["sqlalchemy"]}
 alembic = "^1.13.3"
 asyncpg = "^0.29.0"
 pgvector = "^0.3.5"
-psycopg2 = "^2.9.10"
+psycopg2 = {version = "^2.9.10", optional = true}
+llama-index-core = "^0.11.22"
Make this optional please
pyproject.toml
Outdated
@@ -36,7 +36,7 @@ boto3 = "^1.26.125"
 botocore="^1.35.54"
 gunicorn = "^20.1.0"
 sqlalchemy = "2.0.35"
-instructor = "1.6.3"
+instructor = "^1.5.2"
Why not fix the version? It is not a very reliable lib, as it has changed a few times already.
Added the following changes that were requested in pull request: Added synchronize label, Made uniform syntax in if statement in workflow, fixed instructor dependency, added llama-index to be optional Refactor COG-337
Actionable comments posted: 2
🧹 Outside diff range and nitpick comments (1)
.github/workflows/test_cognee_llama_index_notebook.yml (1)
Lines 15-17: Consider using the INFO log level for better debugging capabilities. Setting RUNTIME__LOG_LEVEL to ERROR might make it difficult to debug workflow failures. Consider using INFO level in the CI environment.
 env:
-  RUNTIME__LOG_LEVEL: ERROR
+  RUNTIME__LOG_LEVEL: INFO
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
⛔ Files ignored due to path filters (1)
- poetry.lock is excluded by !**/*.lock
📒 Files selected for processing (2)
- .github/workflows/test_cognee_llama_index_notebook.yml (1 hunks)
- pyproject.toml (4 hunks)
🚧 Files skipped from review as they are similar to previous changes (1)
- pyproject.toml
🧰 Additional context used
🪛 yamllint
.github/workflows/test_cognee_llama_index_notebook.yml
[error] 63-63: no new line character at the end of file
(new-line-at-end-of-file)
🔇 Additional comments (2)
.github/workflows/test_cognee_llama_index_notebook.yml (2)
1-9
: LGTM! Workflow trigger configuration is well-structured.
The workflow triggers and event types are properly configured.
18-26
: LGTM! Job structure and conditions are well-defined.
The job dependencies and conditions are properly configured.
- name: Setup Python
  uses: actions/setup-python@v5
  with:
    python-version: '3.11.x'

- name: Install Poetry
  uses: snok/[email protected]
  with:
    virtualenvs-create: true
    virtualenvs-in-project: true
    installer-parallel: true

- name: Install dependencies
  run: |
    poetry install --no-interaction --all-extras --no-root
    poetry add jupyter --no-interaction
🛠️ Refactor suggestion
Add Poetry caching for faster workflow execution.
Consider these improvements for the Python/Poetry setup:
- Add Poetry cache to speed up dependency installation
- Pin Poetry version more specifically for better reproducibility
- name: Setup Python
uses: actions/setup-python@v5
with:
python-version: '3.11.x'
+- name: Cache Poetry
+ uses: actions/cache@v3
+ with:
+ path: ~/.cache/pypoetry
+ key: ${{ runner.os }}-poetry-${{ hashFiles('**/poetry.lock') }}
+ restore-keys: |
+ ${{ runner.os }}-poetry-
- name: Install Poetry
- uses: snok/[email protected]
+ uses: snok/[email protected]
with:
virtualenvs-create: true
virtualenvs-in-project: true
installer-parallel: true
- name: Execute Jupyter Notebook
  env:
    ENV: 'dev'
    LLM_API_KEY: ${{ secrets.OPENAI_API_KEY }}
    GRAPHISTRY_USERNAME: ${{ secrets.GRAPHISTRY_USERNAME }}
    GRAPHISTRY_PASSWORD: ${{ secrets.GRAPHISTRY_PASSWORD }}
  run: |
    poetry run jupyter nbconvert \
      --to notebook \
      --execute notebooks/cognee_llama_index.ipynb \
      --output executed_notebook.ipynb \
      --ExecutePreprocessor.timeout=1200
🛠️ Refactor suggestion
Enhance notebook execution error handling and fix EOF.
- The 20-minute timeout suggests potential performance concerns. Consider adding error handling and output capture.
- Add a newline at the end of file.
run: |
poetry run jupyter nbconvert \
--to notebook \
--execute notebooks/cognee_llama_index.ipynb \
--output executed_notebook.ipynb \
- --ExecutePreprocessor.timeout=1200
+ --ExecutePreprocessor.timeout=1200 \
+ || {
+ echo "::error::Notebook execution failed"
+ cat executed_notebook.ipynb
+ exit 1
+ }
+
📝 Committable suggestion
‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.
- name: Execute Jupyter Notebook
  env:
    ENV: 'dev'
    LLM_API_KEY: ${{ secrets.OPENAI_API_KEY }}
    GRAPHISTRY_USERNAME: ${{ secrets.GRAPHISTRY_USERNAME }}
    GRAPHISTRY_PASSWORD: ${{ secrets.GRAPHISTRY_PASSWORD }}
  run: |
    poetry run jupyter nbconvert \
      --to notebook \
      --execute notebooks/cognee_llama_index.ipynb \
      --output executed_notebook.ipynb \
      --ExecutePreprocessor.timeout=1200 \
      || {
        echo "::error::Notebook execution failed"
        cat executed_notebook.ipynb
        exit 1
      }
🧰 Tools
🪛 yamllint
[error] 63-63: no new line character at the end of file
(new-line-at-end-of-file)
Resolve issue with llama-index being mandatory to run cognee Fix COG-337
Removed additional references to llama-index from core cognee lib. Added llama-index-core install from notebook Fix COG-337
Actionable comments posted: 4
🧹 Outside diff range and nitpick comments (7)
cognee/tasks/ingestion/save_data_item_with_metadata_to_storage.py (4)
1-2: Add docstring with type hint documentation. The function's type hints should be documented in a docstring to improve code maintainability and IDE support.
Add a docstring like this:
 def save_data_item_with_metadata_to_storage(data_item: Union[BinaryIO, str, Any], dataset_name: str) -> str:
+    """Save data item with metadata to storage.
+
+    Args:
+        data_item (Union[BinaryIO, str, Any]): The data to save. Can be:
+            - A file-like object
+            - A string (file path or text content)
+            - A Document from llama_index
+        dataset_name (str): Name of the dataset
+
+    Returns:
+        str: Path where the data was saved
+
+    Raises:
+        ValueError: If the data type is not supported
+    """
10-12: Add input validation for dataset_name and return value. The dataset_name parameter and the return value from get_data_from_llama_index should be validated.

+    if not dataset_name or not isinstance(dataset_name, str):
+        raise ValueError("dataset_name must be a non-empty string")
+
     # Check if data is of type Document or any of it's subclasses
     if isinstance(data_item, Document):
-        file_path = get_data_from_llama_index(data_item, dataset_name)
+        file_path = get_data_from_llama_index(data_item, dataset_name)
+        if not file_path:
+            raise ValueError("Failed to get file path from Document")
25-26: Improve error message with supported type information. The error message could be more helpful by listing the supported types.

-    raise ValueError(f"Data type not supported: {type(data_item)}")
+    supported_types = [
+        "llama_index.core.Document",
+        "File object with 'file' and 'filename' attributes",
+        "String (file path or text content)"
+    ]
+    raise ValueError(
+        f"Data type not supported: {type(data_item)}. "
+        f"Supported types are: {', '.join(supported_types)}"
+    )
4-28: Consider architectural improvements. The current implementation could benefit from:
- Separation of concerns by splitting into smaller, focused functions for each data type
- Adding logging for better debugging and monitoring
- Consider implementing a strategy pattern for different data types
This would improve maintainability and make it easier to add support for new data types.
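To illustrate the strategy-pattern idea from the list above, here is a minimal dispatch sketch; the registry, decorator, and handler names are hypothetical and not part of the cognee codebase:

from typing import Any, Callable, Dict, Type

# Hypothetical registry mapping data-item types to save handlers.
_handlers: Dict[Type, Callable[[Any, str], str]] = {}

def register_handler(data_type: Type) -> Callable:
    """Register a save handler for a given data-item type."""
    def decorator(func: Callable[[Any, str], str]) -> Callable[[Any, str], str]:
        _handlers[data_type] = func
        return func
    return decorator

@register_handler(str)
def save_text_or_path(data_item: str, dataset_name: str) -> str:
    # The existing file-path/text branching would live here; placeholder return for the sketch.
    return f"/placeholder/{dataset_name}"

def save_data_item(data_item: Any, dataset_name: str) -> str:
    """Dispatch to the first handler whose registered type matches the item."""
    for data_type, handler in _handlers.items():
        if isinstance(data_item, data_type):
            return handler(data_item, dataset_name)
    raise ValueError(f"Data type not supported: {type(data_item)}")

New types (for example an optional llama_index Document handler) could then be registered without touching the dispatch function.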
notebooks/cognee_llama_index.ipynb (3)
16-16: Improve dependency installation and data source management.
- The pip install command should check for successful installation
- The data source URL should be configurable and its availability should be verified
Consider these improvements:
-%pip install llama-index-core
+%%capture pip_output
+%pip install llama-index-core
+if "Successfully installed" not in pip_output.stdout:
+    raise RuntimeError("Failed to install llama-index-core")
+# Define data source URL as a configuration
+NEWS_DATASET_URL = "https://raw.githubusercontent.com/tomasonjo/blog-datasets/main/news_articles.csv"
+SAMPLE_SIZE = 5

-news = pd.read_csv(
-    "https://raw.githubusercontent.com/tomasonjo/blog-datasets/main/news_articles.csv"
-)[:5]
+# Validate URL and load data
+try:
+    response = requests.head(NEWS_DATASET_URL)
+    response.raise_for_status()
+    news = pd.read_csv(NEWS_DATASET_URL).head(SAMPLE_SIZE)
+except Exception as e:
+    raise RuntimeError(f"Failed to access dataset: {str(e)}")

Also applies to: 39-41
59-62: Add validation and progress tracking for document preparation. The document creation process should validate row contents and track progress, especially important when working with larger datasets.

+from tqdm.notebook import tqdm
+
+def create_document(row):
+    if not isinstance(row['title'], str) or not isinstance(row['text'], str):
+        raise ValueError(f"Invalid row format: title and text must be strings")
+    return Document(text=f"{row['title']}: {row['text']}")

-documents = [
-    Document(text=f"{row['title']}: {row['text']}")
-    for i, row in news.iterrows()
-]
+documents = []
+for i, row in tqdm(news.iterrows(), desc="Creating documents", total=len(news)):
+    try:
+        documents.append(create_document(row))
+    except Exception as e:
+        print(f"Warning: Failed to create document for row {i}: {str(e)}")
224-224: Update Python version. The notebook uses Python 3.9.6, which is quite old. Consider updating to a more recent version (3.11+) to benefit from performance improvements, security fixes, and new features.
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
📒 Files selected for processing (5)
cognee/tasks/ingestion/__init__.py (1 hunks)
cognee/tasks/ingestion/ingest_data_with_metadata.py (1 hunks)
cognee/tasks/ingestion/save_data_item_to_storage.py (1 hunks)
cognee/tasks/ingestion/save_data_item_with_metadata_to_storage.py (1 hunks)
notebooks/cognee_llama_index.ipynb (1 hunks)
🚧 Files skipped from review as they are similar to previous changes (3)
- cognee/tasks/ingestion/__init__.py
- cognee/tasks/ingestion/ingest_data_with_metadata.py
- cognee/tasks/ingestion/save_data_item_to_storage.py
    # data is a file object coming from upload.
    elif hasattr(data_item, "file"):
        file_path = save_data_to_file(data_item.file, dataset_name, filename=data_item.filename)
Add robust file object validation.
The current implementation makes assumptions about the file object structure without proper validation.
- elif hasattr(data_item, "file"):
- file_path = save_data_to_file(data_item.file, dataset_name, filename=data_item.filename)
+ elif hasattr(data_item, "file") and hasattr(data_item, "filename"):
+ if not data_item.file:
+ raise ValueError("File object is empty")
+ if not data_item.filename:
+ raise ValueError("Filename is missing")
+ file_path = save_data_to_file(data_item.file, dataset_name, filename=data_item.filename)
📝 Committable suggestion
‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.
    # data is a file object coming from upload.
    elif hasattr(data_item, "file") and hasattr(data_item, "filename"):
        if not data_item.file:
            raise ValueError("File object is empty")
        if not data_item.filename:
            raise ValueError("Filename is missing")
        file_path = save_data_to_file(data_item.file, dataset_name, filename=data_item.filename)
def save_data_item_with_metadata_to_storage(data_item: Union[BinaryIO, str, Any], dataset_name: str) -> str:
    # Dynamic import is used because the llama_index module is optional.
    # For the same reason Any is accepted as a data item
    from llama_index.core import Document
    from .transform_data import get_data_from_llama_index
🛠️ Refactor suggestion
Add error handling for dynamic imports.
The dynamic imports could fail silently if the optional dependencies are not installed. Consider adding explicit error handling.
- from llama_index.core import Document
- from .transform_data import get_data_from_llama_index
+ try:
+ from llama_index.core import Document
+ from .transform_data import get_data_from_llama_index
+ except ImportError as e:
+ raise ImportError(
+ "Failed to import required dependencies. "
+ "Please ensure llama_index is installed for Document support."
+ ) from e
📝 Committable suggestion
‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.
def save_data_item_with_metadata_to_storage(data_item: Union[BinaryIO, str, Any], dataset_name: str) -> str:
    # Dynamic import is used because the llama_index module is optional.
    # For the same reason Any is accepted as a data item
    try:
        from llama_index.core import Document
        from .transform_data import get_data_from_llama_index
    except ImportError as e:
        raise ImportError(
            "Failed to import required dependencies. "
            "Please ensure llama_index is installed for Document support."
        ) from e
    elif isinstance(data_item, str):
        # data is a file path
        if data_item.startswith("file://") or data_item.startswith("/"):
            file_path = data_item.replace("file://", "")
        # data is text
        else:
            file_path = save_data_to_file(data_item, dataset_name)
🛠️ Refactor suggestion
Enhance string input validation and handling.
The current string handling could be improved with better file path validation and text content limits.
elif isinstance(data_item, str):
+ if not data_item.strip():
+ raise ValueError("Empty string input")
+
# data is a file path
if data_item.startswith("file://") or data_item.startswith("/"):
file_path = data_item.replace("file://", "")
+ # Verify file exists and is accessible
+ import os
+ if not os.path.isfile(file_path):
+ raise FileNotFoundError(f"File not found: {file_path}")
# data is text
else:
+ # Add reasonable size limit for text content
+ if len(data_item) > 10_000_000: # 10MB limit
+ raise ValueError("Text content exceeds size limit")
file_path = save_data_to_file(data_item, dataset_name)
📝 Committable suggestion
‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.
    elif isinstance(data_item, str):
        if not data_item.strip():
            raise ValueError("Empty string input")

        # data is a file path
        if data_item.startswith("file://") or data_item.startswith("/"):
            file_path = data_item.replace("file://", "")
            # Verify file exists and is accessible
            import os
            if not os.path.isfile(file_path):
                raise FileNotFoundError(f"File not found: {file_path}")
        # data is text
        else:
            # Add reasonable size limit for text content
            if len(data_item) > 10_000_000:  # 10MB limit
                raise ValueError("Text content exceeds size limit")
            file_path = save_data_to_file(data_item, dataset_name)
"async def add(data: Union[BinaryIO, list[BinaryIO], str, list[str]], dataset_name: str = \"main_dataset\", user: User = None):\n", | ||
" await create_relational_db_and_tables()\n", | ||
" await create_pgvector_db_and_tables()\n", | ||
"\n", | ||
" if user is None:\n", | ||
" user = await get_default_user()\n", | ||
"\n", | ||
" await ingest_data_with_metadata(data, dataset_name, user)\n", |
🛠️ Refactor suggestion
Improve function definition and documentation
The add function should be properly documented and defined in a module rather than a notebook cell.
- async def add(data: Union[BinaryIO, list[BinaryIO], str, list[str]], dataset_name: str = "main_dataset", user: User = None):
+ async def add(
+ data: Union[BinaryIO, list[BinaryIO], str, list[str]],
+ dataset_name: str = "main_dataset",
+ user: User = None
+ ) -> None:
+ """Add data to Cognee for processing.
+
+ Args:
+ data: Input data in various formats (file, string, or lists thereof)
+ dataset_name: Name to identify the dataset (default: "main_dataset")
+ user: User performing the operation (default: None, uses default user)
+
+ Raises:
+ RuntimeError: If database setup fails
+ """
await create_relational_db_and_tables()
await create_pgvector_db_and_tables()
Consider moving this function to a proper module in the codebase.
📝 Committable suggestion
‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.
"async def add(data: Union[BinaryIO, list[BinaryIO], str, list[str]], dataset_name: str = \"main_dataset\", user: User = None):\n", | |
" await create_relational_db_and_tables()\n", | |
" await create_pgvector_db_and_tables()\n", | |
"\n", | |
" if user is None:\n", | |
" user = await get_default_user()\n", | |
"\n", | |
" await ingest_data_with_metadata(data, dataset_name, user)\n", | |
async def add( | |
data: Union[BinaryIO, list[BinaryIO], str, list[str]], | |
dataset_name: str = "main_dataset", | |
user: User = None | |
) -> None: | |
"""Add data to Cognee for processing. | |
Args: | |
data: Input data in various formats (file, string, or lists thereof) | |
dataset_name: Name to identify the dataset (default: "main_dataset") | |
user: User performing the operation (default: None, uses default user) | |
Raises: | |
RuntimeError: If database setup fails | |
""" | |
await create_relational_db_and_tables() | |
await create_pgvector_db_and_tables() | |
if user is None: | |
user = await get_default_user() | |
await ingest_data_with_metadata(data, dataset_name, user) |
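For context, a minimal usage sketch of this coroutine from a notebook cell; the call site is illustrative and assumes the documents list built earlier in the notebook plus Jupyter's support for top-level await:

# Hypothetical notebook cell: ingest the LlamaIndex documents under the default dataset name.
await add(documents, dataset_name="main_dataset")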
Summary by CodeRabbit
Release Notes
New Features
- ingest_data_with_metadata function for ingesting data along with metadata.
- save_data_item_to_storage function to handle various data item types for storage.
- save_data_item_with_metadata_to_storage function for enhanced data storage capabilities.

Improvements

- save_data_to_storage function for improved efficiency.
- cognee.search function.
- instructor and tenacity packages.
- llama-index-core.
- psycopg2 to include an optional attribute.

Chores