Feat/cog 184 add falkordb #192

Merged
merged 26 commits on Nov 11, 2024

Conversation

borisarzentar (Contributor) commented Nov 11, 2024

Summary by CodeRabbit

Release Notes

  • New Features

    • Introduced new asynchronous functions for processing datasets and extracting graphs from code and data.
    • Added functionality for translating text using AWS Translate.
    • Enhanced document classification and chunk extraction processes.
    • Added new classes and methods for managing entities and types within the graph structure.
  • Improvements

    • Updated error handling and logging mechanisms across various functions.
    • Improved the clarity of import statements and function parameters.
    • Enhanced the functionality of existing methods for better data handling and serialization.
    • Updated workflow configurations to trigger on pull request synchronization.
  • Bug Fixes

    • Resolved issues related to identifier handling in graph queries and connections.
  • Removals

    • Deprecated several unused files and functions related to previous implementations, including context retrieval and answer generation functionalities.

coderabbitai bot commented Nov 11, 2024

Important

Review skipped

Review was skipped as selected files did not have any reviewable changes.

💤 Files selected but had no reviewable changes (1)
  • notebooks/cognee_demo.ipynb

You can disable this status message by setting the reviews.review_status to false in the CodeRabbit configuration file.

Walkthrough

This pull request introduces several changes across multiple files, primarily focusing on enhancing documentation and implementing new functionalities. Key updates include detailed comments in the README.md for better clarity, the addition of an asynchronous pipeline in code_graph_pipeline.py, and modifications to existing functions in cognify_v2.py and search_v2.py. Additionally, several classes and methods related to graph and vector database interactions have been added or updated, while some files have been removed, indicating a restructuring of the codebase.

Changes

  • README.md: Added comments to code snippets in the "Basic Usage" section for clarity.
  • cognee/api/v1/cognify/code_graph_pipeline.py: Introduced an asynchronous pipeline for processing datasets, including new functions and a custom exception.
  • cognee/api/v1/cognify/cognify_v2.py: Updated task execution flow, removing parallel execution and modifying task handling.
  • cognee/api/v1/search/search_v2.py: Changed import path for query_chunks function; no functional changes.
  • cognee/infrastructure/databases/graph/falkordb/adapter.py: Deleted file containing FalcorDBAdapter class, which provided methods for graph database interaction.
  • cognee/infrastructure/databases/graph/get_graph_engine.py: Updated to use a new approach for handling the "falkordb" provider, removing error handling.
  • cognee/infrastructure/databases/graph/graph_db_interface.py: Replaced abstract method graph with query for parameterized queries.
  • cognee/infrastructure/databases/graph/neo4j_driver/adapter.py: Updated node handling to use UUIDs and refactored methods for adding nodes.
  • cognee/infrastructure/databases/graph/networkx/adapter.py: Added new methods for graph data retrieval and updated existing methods to use UUIDs.
  • cognee/infrastructure/databases/hybrid/falkordb/FalkorDBAdapter.py: Introduced a new adapter class for FalkorDB with methods for data management and querying.
  • cognee/infrastructure/databases/vector/__init__.py: Removed import for DataPoint, indicating it is no longer needed.
  • cognee/modules/chunking/TextChunker.py: Updated constructor and error handling for DocumentChunk instantiation.
  • cognee/modules/data/operations/translate_text.py: Added a new function for translating text using the AWS Translate service.
  • cognee/modules/search/CogneeSearch.py: Deleted file containing the CogneeSearch class and related functionality.
  • cognee/modules/search/graph/search_adjacent.py: Deleted file with functionality for finding graph neighbors.
  • cognee/modules/search/graph/search_cypher.py: Deleted file for executing Cypher queries against a graph database.
  • cognee/modules/search/llm/extraction/categorize_relevant_category.py: Deleted function for categorizing queries based on summaries.
  • cognee/modules/storage/utils/__init__.py: Introduced a JSON encoder class and utility functions for handling data models.
  • cognee/shared/SourceCodeGraph.py: Transitioned classes to inherit from DataPoint and updated attributes for various graph-related classes.
  • cognee/shared/utils.py: Enhanced utility functions for preparing nodes and edges, including updated parameter handling.
  • cognee/tasks/__init__.py: Removed multiple import statements related to document processing and ontology inference.

Possibly related PRs

Suggested reviewers

  • Vasilije1990
  • dexters1

Poem

🐇 In the garden of code, where rabbits play,
New paths are forged, brightening the way.
With comments and clarity, our work is refined,
In the world of data, new treasures we find.
So hop along, friends, let's celebrate cheer,
For each little change brings us closer, I fear! 🌼



coderabbitai bot left a comment


Actionable comments posted: 61

🧹 Outside diff range and nitpick comments (77)
cognee/modules/data/processing/document_types/Document.py (1)

Line range hint 8-9: Consider implementing abstract method

The read method is currently a pass-through implementation. Since this is a base class, consider:

  1. Making it an abstract method using @abstractmethod
  2. Adding docstring to document the expected behavior
+from abc import abstractmethod
 from cognee.infrastructure.engine import DataPoint

 class Document(DataPoint):
     type: str
     name: str
     raw_data_location: str

+    @abstractmethod
     def read(self, chunk_size: int) -> str:
+        """Read document content in chunks.
+        
+        Args:
+            chunk_size: Size of each chunk to read
+            
+        Returns:
+            str: Content chunk from the document
+        """
         pass
cognee/infrastructure/databases/vector/models/ScoredResult.py (1)

6-8: Consider adding field documentation

While the type change from str to UUID is a good improvement for type safety, consider adding docstrings to clarify the purpose of each field, especially the payload dictionary's expected content and structure.

Example improvement:

 class ScoredResult(BaseModel):
+    """Represents a scored result from a vector search operation.
+
+    Attributes:
+        id (UUID): Unique identifier for the result
+        score (float): Similarity score (lower is better)
+        payload (Dict[str, Any]): Additional metadata associated with the result
+    """
     id: UUID
     score: float # Lower score is better
     payload: Dict[str, Any]
cognee/tasks/documents/extract_chunks_from_documents.py (1)

4-4: Consider making chunk_size configurable via settings

The default chunk_size of 1024 bytes might not be optimal for all use cases. Consider:

  1. Making it configurable through environment variables or settings
  2. Adjusting based on available system memory
  3. Adding documentation about choosing appropriate chunk sizes
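A minimal sketch of option 1, shown only to illustrate the idea; the COGNEE_CHUNK_SIZE environment variable and DEFAULT_CHUNK_SIZE constant are hypothetical names, not existing cognee settings:

import os

# Hypothetical: read the default from an environment variable instead of hard-coding 1024.
DEFAULT_CHUNK_SIZE = int(os.getenv("COGNEE_CHUNK_SIZE", "1024"))

async def extract_chunks_from_documents(documents, chunk_size: int = DEFAULT_CHUNK_SIZE):
    ...  # existing body unchanged; only the default now comes from the environment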
cognee/modules/engine/models/EntityType.py (1)

6-6: Consider renaming the 'type' field to avoid shadowing built-in type

The attribute name 'type' could potentially cause confusion or issues as it shadows Python's built-in type() function.

Consider renaming to something more specific like entity_type or type_category.

-    type: str
+    entity_type: str
cognee/modules/engine/models/Entity.py (1)

10-12: Consider additional index fields

The current implementation only indexes the name field. Consider whether is_a should also be indexed for efficient entity type filtering.
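If is_a should indeed be searchable, the change would mirror the DocumentChunk suggestion below; a sketch, assuming Entity declares its index fields the same way:

from cognee.infrastructure.engine import DataPoint

class Entity(DataPoint):
    ...  # existing fields unchanged

    _metadata: dict = {
        "index_fields": ["name", "is_a"],  # assumption: is_a is stored in an indexable form
    }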

cognee/modules/chunking/models/DocumentChunk.py (1)

12-14: Consider indexing additional fields for search optimization

The metadata configuration only indexes the text field. Consider whether other fields like chunk_index should also be indexed for performance optimization, especially if they're used in queries or filtering operations.

Consider expanding the index_fields if these fields are frequently queried:

     _metadata: Optional[dict] = {
-        "index_fields": ["text"],
+        "index_fields": ["text", "chunk_index", "cut_type"],
     }
cognee/tasks/chunks/query_chunks.py (1)

Line range hint 1-17: Consider these improvements for robustness.

While the basic functionality looks good, consider these enhancements:

  1. Make the search limit configurable instead of hard-coding to 5
  2. Add error handling for vector engine operations
  3. Add input validation for the query parameter

Here's a suggested implementation:

+from typing import Optional
 from cognee.infrastructure.databases.vector import get_vector_engine
+from cognee.config import settings  # Assuming you have a settings module
+
+DEFAULT_SEARCH_LIMIT = 5

-async def query_chunks(query: str) -> list[dict]:
+async def query_chunks(query: str, limit: Optional[int] = None) -> list[dict]:
     """
     Parameters:
     - query (str): The query string to filter nodes by.
+    - limit (Optional[int]): Maximum number of results to return. Defaults to settings or 5.

     Returns:
     - list(dict): A list of objects providing information about the chunks related to query.
+    
+    Raises:
+    - ValueError: If query is empty or limit is negative
+    - VectorEngineError: If search operation fails
     """
+    if not query or not query.strip():
+        raise ValueError("Query string cannot be empty")
+
+    search_limit = limit or getattr(settings, 'VECTOR_SEARCH_LIMIT', DEFAULT_SEARCH_LIMIT)
+    if search_limit < 1:
+        raise ValueError("Search limit must be positive")
+
     vector_engine = get_vector_engine()
+    try:
+        found_chunks = await vector_engine.search(
+            "DocumentChunk_text",
+            query,
+            limit=search_limit
+        )
+        return [result.payload for result in found_chunks]
+    except Exception as e:
+        raise VectorEngineError(f"Failed to search chunks: {str(e)}") from e
-    found_chunks = await vector_engine.search("DocumentChunk_text", query, limit = 5)
-    chunks = [result.payload for result in found_chunks]
-    return chunks
cognee/modules/data/processing/document_types/ImageDocument.py (1)

13-13: Document the TextChunker parameter change

The change from self.id to self as the first parameter of TextChunker suggests that the chunker now operates on the entire document object. Consider adding a comment explaining this design choice for better maintainability.

-        chunker = TextChunker(self, chunk_size = chunk_size, get_text = lambda: text)
+        # TextChunker now takes the entire document object to access all document properties
+        chunker = TextChunker(self, chunk_size = chunk_size, get_text = lambda: text)
cognee/modules/data/processing/document_types/PdfDocument.py (2)

8-20: Add method documentation

The read method lacks documentation explaining its purpose, parameters, and return value.

  def read(self, chunk_size: int):
+     """Read PDF document and yield text chunks.
+
+     Args:
+         chunk_size (int): The maximum size of each text chunk
+
+     Yields:
+         str: Text chunks from the PDF document
+
+     Raises:
+         ValueError: If raw_data_location is not set
+         FileNotFoundError: If PDF file doesn't exist
+         RuntimeError: If PDF file cannot be read
+     """

18-20: Improve resource cleanup with context manager

The file cleanup could be more robust by using a context manager pattern.

  def read(self, chunk_size: int):
-     file = PdfReader(self.raw_data_location)
+     with open(self.raw_data_location, 'rb') as pdf_file:
+         file = PdfReader(pdf_file)
+         def get_text():
+             for page in file.pages:
+                 page_text = page.extract_text()
+                 yield page_text
 
-     def get_text():
-         for page in file.pages:
-             page_text = page.extract_text()
-             yield page_text
+         chunker = TextChunker(self, chunk_size=chunk_size, get_text=get_text)
+         yield from chunker.read()
-     chunker = TextChunker(self, chunk_size = chunk_size, get_text = get_text)
-
-     yield from chunker.read()
-
-     file.stream.close()
cognee/infrastructure/databases/vector/pgvector/serialize_data.py (1)

4-5: Update docstring to reflect UUID handling capability.

The current docstring only mentions datetime objects, but the function now handles UUID objects as well.

-    """Recursively convert datetime objects in dictionaries/lists to ISO format."""
+    """Recursively serialize data by converting datetime objects to ISO format and UUID objects to strings."""
cognee/modules/data/processing/document_types/TextDocument.py (1)

Line range hint 8-16: Add error handling and logging for file operations

The file read in get_text already uses a with-block for resource management, but it lacks error handling. Consider these improvements:

  1. Handle potential IOError/OSError exceptions
  2. Log read failures with the file path so unreadable files are easy to diagnose (assumes a module-level logger)
     def get_text():
-        with open(self.raw_data_location, mode = "r", encoding = "utf-8") as file:
-            while True:
-                text = file.read(1024)
-
-                if len(text.strip()) == 0:
-                    break
-
-                yield text
+        try:
+            with open(self.raw_data_location, mode="r", encoding="utf-8") as file:
+                while True:
+                    text = file.read(1024)
+                    if not text.strip():
+                        break
+                    yield text
+        except IOError as e:
+            logger.error(f"Error reading file {self.raw_data_location}: {e}")
+            raise
cognee/tasks/graph/extract_graph_from_code.py (2)

17-17: Consider returning processing status information

The function currently returns the original data_chunks without any indication of which chunks were successfully processed. Consider enhancing the return type to include processing status.

-    return data_chunks
+    return {
+        'chunks': data_chunks,
+        'processing_status': [
+            {'chunk_index': i, 'success': chunk_graphs[i] is not None}
+            for i in range(len(data_chunks))
+        ]
+    }

8-8: Improve type hints specificity

The type hint for graph_model could be more specific about what kind of BaseModel is expected.

-async def extract_graph_from_code(data_chunks: list[DocumentChunk], graph_model: Type[BaseModel]):
+async def extract_graph_from_code(
+    data_chunks: list[DocumentChunk],
+    graph_model: Type[BaseModel]  # TODO: Replace BaseModel with your specific graph model class
+) -> dict[str, list]:
cognee/infrastructure/engine/models/DataPoint.py (2)

7-8: Add docstring to explain MetaData's purpose and usage

The MetaData TypedDict would benefit from documentation explaining the purpose of index_fields and how they are used in the data model.

 class MetaData(TypedDict):
+    """Metadata configuration for DataPoint models.
+    
+    Attributes:
+        index_fields: List of field names that should be indexed for searching/embedding
+    """
     index_fields: list[str]

17-18: Decide on attribute privacy strategy

The commented-out configuration suggests uncertainty about attribute privacy. If _metadata should be private, we should either:

  1. Enable the private attributes configuration, or
  2. Use property decorators to control access

Please clarify the intended privacy model for this class. If you need help implementing either approach, I can provide specific code examples.
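A sketch of option 1 using pydantic private attributes (pydantic v2 is assumed here since model_dump is referenced elsewhere in this review; the real DataPoint definition differs):

from typing import Optional
from pydantic import BaseModel, PrivateAttr

class DataPoint(BaseModel):  # illustrative excerpt only
    # PrivateAttr keeps _metadata out of the model's public fields and serialized output.
    _metadata: Optional[dict] = PrivateAttr(default_factory = lambda: {"index_fields": []})

Instances would then read the value via self._metadata, and it would no longer appear in model_dump() output.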

cognee/infrastructure/databases/vector/config.py (1)

11-11: Consider using Redis default port and adding port validation

Since FalkorDB is Redis-based, consider:

  1. Using Redis default port 6379 instead of 1234
  2. Adding port range validation (1-65535)
-    vector_db_port: int = 1234
+    vector_db_port: int = 6379
+
+    # assumes a pydantic v2 settings class (from pydantic import field_validator);
+    # on pydantic v1, use @validator instead
+    @field_validator("vector_db_port")
+    @classmethod
+    def validate_vector_db_port(cls, value: int) -> int:
+        if not 1 <= value <= 65535:
+            raise ValueError("Port must be between 1 and 65535")
+        return value
cognee/tasks/chunks/remove_disconnected_chunks.py (3)

4-4: Add docstring to clarify function behavior.

The function lacks documentation explaining its purpose, parameters, and return value.

Add a descriptive docstring:

 async def remove_disconnected_chunks(data_chunks: list[DocumentChunk]) -> list[DocumentChunk]:
+    """Remove disconnected chunks from the graph.
+
+    Args:
+        data_chunks (list[DocumentChunk]): List of document chunks to process
+
+    Returns:
+        list[DocumentChunk]: The input chunks (unchanged)
+    """

Line range hint 4-27: Consider essential improvements for robustness and performance.

  1. Error Handling: Add try-catch blocks for graph operations to handle potential failures gracefully.
  2. Performance: Consider batching graph operations to reduce the number of await calls.
  3. Return Value: The function returns the input unchanged, which might be misleading. Consider returning only the connected chunks or adding a success indicator.

Here's a suggested implementation:

 async def remove_disconnected_chunks(data_chunks: list[DocumentChunk]) -> list[DocumentChunk]:
+    """Remove disconnected chunks from the graph.
+
+    Args:
+        data_chunks (list[DocumentChunk]): List of document chunks to process
+
+    Returns:
+        list[DocumentChunk]: The input chunks (unchanged)
+
+    Raises:
+        GraphEngineError: If graph operations fail
+    """
     graph_engine = await get_graph_engine()
     document_ids = set((data_chunk.document_id for data_chunk in data_chunks))
     obsolete_chunk_ids = []
 
-    for document_id in document_ids:
-        chunks = await graph_engine.get_successors(document_id, edge_label = "has_chunk")
-
-        for chunk in chunks:
-            previous_chunks = await graph_engine.get_predecessors(chunk["uuid"], edge_label = "next_chunk")
-
-            if len(previous_chunks) == 0:
-                obsolete_chunk_ids.append(chunk["uuid"])
-
-    if len(obsolete_chunk_ids) > 0:
-        await graph_engine.delete_nodes(obsolete_chunk_ids)
-
-    disconnected_nodes = await graph_engine.get_disconnected_nodes()
-    if len(disconnected_nodes) > 0:
-        await graph_engine.delete_nodes(disconnected_nodes)
+    try:
+        # Batch get_successors for all document_ids
+        chunks_by_doc = await graph_engine.get_successors_batch(
+            list(document_ids),
+            edge_label="has_chunk"
+        )
+        
+        # Collect all chunk UUIDs
+        chunk_uuids = [
+            chunk["uuid"]
+            for doc_chunks in chunks_by_doc.values()
+            for chunk in doc_chunks
+        ]
+        
+        # Batch get_predecessors for all chunks
+        previous_chunks_by_uuid = await graph_engine.get_predecessors_batch(
+            chunk_uuids,
+            edge_label="next_chunk"
+        )
+        
+        # Identify obsolete chunks
+        obsolete_chunk_ids = [
+            uuid for uuid, prev_chunks in previous_chunks_by_uuid.items()
+            if len(prev_chunks) == 0
+        ]
+        
+        # Delete obsolete chunks if any
+        if obsolete_chunk_ids:
+            await graph_engine.delete_nodes(obsolete_chunk_ids)
+        
+        # Delete disconnected nodes
+        disconnected_nodes = await graph_engine.get_disconnected_nodes()
+        if disconnected_nodes:
+            await graph_engine.delete_nodes(disconnected_nodes)
+            
+    except Exception as e:
+        raise GraphEngineError(f"Failed to remove disconnected chunks: {str(e)}") from e
 
     return data_chunks

Batch operations are not available in the current implementation

Based on the complete interface and implementation review:

  • The GraphDBInterface does not define get_successors_batch or get_predecessors_batch methods
  • The FalkorDB implementation only has batch operations for search, but not for graph traversal
  • The current implementation would need to be extended to support batch operations for graph traversal
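To make the suggestion concrete, one possible shape for such an extension is sketched below. The *_batch method names come from the proposal above and do not exist in the current interface; a real adapter would likely override them with a single query rather than fanning out per node.

import asyncio
from abc import abstractmethod

class GraphDBInterface:  # illustrative excerpt, not the actual interface definition
    @abstractmethod
    async def get_successors(self, node_id, edge_label: str): ...

    @abstractmethod
    async def get_predecessors(self, node_id, edge_label: str): ...

    async def get_successors_batch(self, node_ids: list, edge_label: str) -> dict:
        # Default: run the existing per-node lookups concurrently and key the results by node id.
        results = await asyncio.gather(
            *[self.get_successors(node_id, edge_label = edge_label) for node_id in node_ids]
        )
        return dict(zip(node_ids, results))

    async def get_predecessors_batch(self, node_ids: list, edge_label: str) -> dict:
        results = await asyncio.gather(
            *[self.get_predecessors(node_id, edge_label = edge_label) for node_id in node_ids]
        )
        return dict(zip(node_ids, results))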
🔗 Analysis chain

Line range hint 4-27: Verify availability of batch operations in graph engine.

The suggested performance optimization assumes the existence of batch operations (get_successors_batch, get_predecessors_batch). Let's verify if these methods are available or need to be implemented.


Let me check the graph engine implementation more specifically to verify the batch operations.


Let me check the specific implementation of get_successors and get_predecessors methods.


Let me check the complete GraphDBInterface and FalkorDB implementation to understand all available methods.

🏁 Scripts executed

The following scripts were executed for the analysis:

Script:

#!/bin/bash
# Search for batch operation methods in the graph engine implementation
ast-grep --pattern 'get_successors_batch' 
ast-grep --pattern 'get_predecessors_batch'

# Search for similar batch operations that might be used instead
rg -l "batch" --type py

Length of output: 1008


Script:

#!/bin/bash
# Check the FalkorDB adapter implementation for batch operations
rg -A 5 "class FalkorDBAdapter" cognee/infrastructure/databases/hybrid/falkordb/FalkorDBAdapter.py

# Search for any batch-related methods in the FalkorDB adapter
rg "batch|get_successors|get_predecessors" cognee/infrastructure/databases/hybrid/falkordb/FalkorDBAdapter.py

Length of output: 432


Script:

#!/bin/bash
# Search for get_successors and get_predecessors implementation
ast-grep --pattern 'async def get_successors' 
ast-grep --pattern 'async def get_predecessors'

# Also check the GraphDBInterface definition
rg -A 10 "class GraphDBInterface" 

Length of output: 1078


Script:

#!/bin/bash
# Get the complete GraphDBInterface definition
cat cognee/infrastructure/databases/graph/graph_db_interface.py

# Get the complete FalkorDB implementation focusing on graph operations
rg -A 5 "async def" cognee/infrastructure/databases/hybrid/falkordb/FalkorDBAdapter.py

Length of output: 6508

cognee/modules/data/operations/detect_language.py (2)

17-18: Extract magic number to a constant.

The 100-character limit should be defined as a module-level constant for better maintainability.

Apply this diff:

import logging
from langdetect import detect, LangDetectException

+# Maximum text length for language detection to ensure efficient processing
+MAX_DETECTION_LENGTH = 100
+
logger = logging.getLogger(__name__)

async def detect_language(text: str) -> str | None:
    # Trim the text to the first 100 characters
-    trimmed_text = text[:100]
+    trimmed_text = text[:MAX_DETECTION_LENGTH]

30-36: Consolidate error handling returns.

The duplicate return None statements can be consolidated by moving it after both except blocks.

Apply this diff:

     except LangDetectException as e:
         logger.error(f"Language detection error: {e}")
-
     except Exception as e:
         logger.error(f"Unexpected error: {e}")

-    return None
+        return None
cognee/modules/graph/utils/get_model_instance_from_graph.py (2)

1-6: Add docstring documentation for better clarity.

The function signature is well-typed but lacks documentation. Consider adding a docstring that explains:

  • Purpose of the function
  • Parameter descriptions
  • Return value description
  • Example usage
 def get_model_instance_from_graph(nodes: list[DataPoint], edges: list, entity_id: str):
+    """Reconstructs a model instance from its graph representation.
+
+    Args:
+        nodes (list[DataPoint]): List of DataPoint nodes representing model instances
+        edges (list): List of edges connecting nodes, each edge is [source_id, target_id, label, properties?]
+        entity_id (str): ID of the root entity to reconstruct
+
+    Returns:
+        Model instance reconstructed from the graph representation
+    """

6-29: Consider performance optimizations for large graphs.

For better scalability:

  1. Consider implementing caching for frequently accessed model instances
  2. For large graphs, consider implementing batch processing or streaming to reduce memory usage
  3. Add performance metrics/logging for monitoring reconstruction time

Would you like help implementing any of these optimizations?
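For item 3, a small timing decorator is often enough to see where reconstruction time goes; a sketch (the decorator is hypothetical, not part of the module):

import logging
import time
from functools import wraps

logger = logging.getLogger(__name__)

def log_duration(func):
    # Hypothetical helper: logs how long each reconstruction call takes.
    @wraps(func)
    def wrapper(*args, **kwargs):
        started = time.perf_counter()
        try:
            return func(*args, **kwargs)
        finally:
            logger.debug("%s took %.3fs", func.__name__, time.perf_counter() - started)
    return wrapper

@log_duration
def get_model_instance_from_graph(nodes, edges, entity_id):
    ...  # existing reconstruction logic unchanged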

cognee/infrastructure/databases/graph/graph_db_interface.py (1)

6-7: LGTM! Consider enhancing type hints and documentation.

The new query method with parameterized queries is a good improvement over the previous graph() method. Consider these enhancements:

  1. Add return type hint
  2. Use more specific type for params
  3. Add docstring explaining the expected query format
     @abstractmethod
-    async def query(self, query: str, params: dict):
+    async def query(self, query: str, params: dict[str, Any]) -> Any:
+        """Execute a graph database query with parameters.
+        
+        Args:
+            query: The query string in the graph database's native query language
+            params: Dictionary of query parameters to prevent injection
+        
+        Returns:
+            Query results in the format specific to the implementation
+        """
         raise NotImplementedError
cognee/infrastructure/databases/graph/get_graph_engine.py (1)

Line range hint 1-42: Consider enhancing the factory pattern implementation

The current implementation mixes several concerns that could be better separated:

  1. Consider extracting provider-specific initialization logic into separate factory classes
  2. Implement a proper strategy pattern for different graph database providers
  3. Consider adding a registration mechanism for new providers

This would make the code more maintainable and easier to extend with new providers in the future.

Example structure:

class GraphEngineFactory:
    _providers = {}
    
    @classmethod
    def register_provider(cls, name: str, provider_class: Type[GraphDBInterface]):
        cls._providers[name] = provider_class
    
    @classmethod
    async def create(cls, config: GraphConfig) -> GraphDBInterface:
        provider = cls._providers.get(config.graph_database_provider)
        if not provider:
            return cls._get_fallback_provider(config)
        return await provider.create(config)
🧰 Tools
🪛 Ruff

20-20: Do not use bare except

(E722)

cognee/modules/storage/utils/__init__.py (1)

34-46: Improve robustness and readability of property extraction

The current implementation has a few areas that could be enhanced:

  1. The list check assumes non-empty lists without validation
  2. Direct iteration over data_point could be replaced with .items()
  3. Complex condition could be extracted into a helper function

Consider this refactor:

 def get_own_properties(data_point: DataPoint):
     properties = {}
 
-    for field_name, field_value in data_point:
-        if field_name == "_metadata" \
-            or isinstance(field_value, dict) \
-            or isinstance(field_value, DataPoint) \
-            or (isinstance(field_value, list) and isinstance(field_value[0], DataPoint)):
+    def is_complex_type(value) -> bool:
+        return (isinstance(value, (dict, DataPoint)) or
+                (isinstance(value, list) and value and isinstance(value[0], DataPoint)))
+
+    for field_name, field_value in data_point.model_dump().items():
+        if field_name == "_metadata" or is_complex_type(field_value):
             continue
 
         properties[field_name] = field_value
 
     return properties
cognee/infrastructure/engine/__tests__/model_to_graph_to_model.test.py (3)

23-23: Use class variable syntax for _metadata

Using dict() creates a new instance for each class. For class variables, use the proper class variable syntax.

-    _metadata: dict = dict(index_fields = ["name"])
+    _metadata: dict = {"index_fields": ["name"]}

Also applies to: 39-39


38-38: Fix inconsistent spelling of 'license'

The field name uses British spelling ('licence') while the sample data uses American spelling ('license'). Maintain consistency across the codebase.

-        driving_licence: Optional[dict]
+        driving_license: Optional[dict]

55-60: Use dynamic dates for driving license

The driving license dates are hardcoded to 2025. This could cause the test to fail after that date. Consider using dynamic dates relative to the current date.

+from datetime import datetime, timedelta
+
+current_date = datetime.now()
+expiry_date = (current_date + timedelta(days=3 * 365)).strftime("%Y-%m-%d")  # timedelta has no "years" argument; approximate three years
+
 driving_licence = {
     "issued_by": "PU Vrsac",
-    "issued_on": "2025-11-06",
+    "issued_on": current_date.strftime("%Y-%m-%d"),
     "number": "1234567890",
-    "expires_on": "2025-11-06",
+    "expires_on": expiry_date,
 }
cognee/tasks/graph/query_graph_connections.py (3)

Line range hint 5-5: Remove unused parameter exploration_levels

The parameter exploration_levels is declared but never used in the function implementation.

-async def query_graph_connections(query: str, exploration_levels = 1) -> list[(str, str, str)]:
+async def query_graph_connections(query: str) -> list[(str, str, str)]:

30-31: Extract magic numbers into named constants

The hardcoded values for limit and score threshold should be configuration parameters.

+SEARCH_RESULT_LIMIT = 5
+RELEVANCE_THRESHOLD = 0.5

 vector_engine = get_vector_engine()
 results = await asyncio.gather(
-    vector_engine.search("Entity_name", query_text = query, limit = 5),
-    vector_engine.search("EntityType_name", query_text = query, limit = 5),
+    vector_engine.search("Entity_name", query_text = query, limit = SEARCH_RESULT_LIMIT),
+    vector_engine.search("EntityType_name", query_text = query, limit = SEARCH_RESULT_LIMIT),
 )
 results = [*results[0], *results[1]]
-relevant_results = [result for result in results if result.score < 0.5][:5]
+relevant_results = [result for result in results if result.score < RELEVANCE_THRESHOLD][:SEARCH_RESULT_LIMIT]

51-54: Make unique_id construction more robust

The current unique_id construction might be susceptible to edge cases where the components contain spaces.

-        unique_id = f"{node_connection[0]['id']} {node_connection[1]['relationship_name']} {node_connection[2]['id']}"
+        unique_id = f"{node_connection[0]['id']}::{node_connection[1]['relationship_name']}::{node_connection[2]['id']}"
cognee/infrastructure/databases/vector/create_vector_engine.py (2)

5-5: Consider using int type for port and add validation

The vector_db_port is defined as a string, but ports are typically integers within a specific range (0-65535). Consider:

  1. Changing the type to int
  2. Adding validation to ensure the port is within valid range
-    vector_db_port: str
+    vector_db_port: int

Line range hint 9-58: Standardize error handling across providers

Currently, only Weaviate has explicit configuration validation. Consider adding similar validation for other providers to ensure consistent error handling:

  • PGVector: Validate relational config
  • FalkorDB: Validate URL and port
  • Qdrant: Add error for missing config (similar to Weaviate)

Example implementation for FalkorDB:

     elif config["vector_db_provider"] == "falkordb":
         from ..hybrid.falkordb.FalkorDBAdapter import FalkorDBAdapter
 
+        if config["vector_db_url"] is None or config["vector_db_port"] is None:
+            raise EnvironmentError("FalkorDB is not configured!")
+
         return FalkorDBAdapter(
             database_url = config["vector_db_url"],
             database_port = config["vector_db_port"],
             embedding_engine = embedding_engine,
         )
cognee/tasks/chunk_naive_llm_classifier/chunk_naive_llm_classifier.py (1)

Line range hint 124-130: Fix inconsistent relationship naming and direction.

There are two issues in the subtype relationship edge creation:

  1. The edge type is "is_subtype_of" but the relationship_name is "contains"
  2. The edge direction seems reversed - for "is_subtype_of", the subtype should point to the type, not vice versa

Apply this fix:

                    "is_subtype_of",
                    dict(
-                       relationship_name="contains",
+                       relationship_name="is_subtype_of",
-                       source_node_id=str(classification_type_id),
-                       target_node_id=str(classification_subtype_id),
+                       source_node_id=str(classification_subtype_id),
+                       target_node_id=str(classification_type_id),
                    ),
cognee/infrastructure/databases/vector/weaviate_db/WeaviateAdapter.py (4)

13-19: Consider making IndexSchema more flexible

The IndexSchema implementation could be more flexible to handle multiple index fields and different data types.

Consider this improvement:

 class IndexSchema(DataPoint):
     uuid: str
     text: str
 
     _metadata: dict = {
-        "index_fields": ["text"]
+        "index_fields": ["text"],
+        "field_types": {"text": str},
+        "vectorize_fields": ["text"]
     }

93-93: Remove commented-out batch implementation code

The commented-out batch implementation code should be removed if it's no longer needed. If it's kept for future reference, consider moving it to documentation.

Also applies to: 102-116


Line range hint 1-194: Standardize error handling across all methods

The class needs a more consistent approach to error handling. Some methods have try-catch blocks while others don't handle errors at all.

Consider:

  1. Creating custom exception classes for different error types
  2. Adding consistent error handling to all async methods
  3. Implementing proper error logging throughout

Example custom exception:

class WeaviateAdapterError(Exception):
    """Base exception for WeaviateAdapter errors."""
    pass

class CollectionNotFoundError(WeaviateAdapterError):
    """Raised when a collection is not found."""
    pass

Async implementation needs improvement for non-blocking operations

The WeaviateAdapter's async implementation has several blocking operations that should be handled differently:

  • The adapter wraps synchronous Weaviate client operations in asyncio.Future() which doesn't provide true asynchronous behavior
  • In contrast, PGVectorAdapter properly uses async context managers and awaits database operations
  • Critical points to fix:
    • has_collection(): Direct client call wrapped in Future
    • create_collection(): Direct client call wrapped in Future
    • retrieve(): Direct client call wrapped in Future
    • search(): Direct blocking call to client
    • delete_data_points(): Direct client call wrapped in Future

The implementation should either:

  1. Use Weaviate's async client if available
  2. Or use asyncio.to_thread() to properly offload blocking operations to a thread pool
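A sketch of option 2; sync_client_exists is a stand-in for whichever blocking Weaviate client call the adapter currently wraps in a Future (asyncio.to_thread requires Python 3.9+):

import asyncio

def sync_client_exists(collection_name: str) -> bool:
    # Stand-in for a blocking client call such as the one inside has_collection().
    return collection_name == "example_collection"

async def has_collection(collection_name: str) -> bool:
    # to_thread runs the blocking call in the default thread pool,
    # so the event loop stays free while waiting on the client.
    return await asyncio.to_thread(sync_client_exists, collection_name)

# Usage: asyncio.run(has_collection("example_collection"))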
🔗 Analysis chain

Line range hint 1-194: Verify async implementation efficiency

Some operations might be blocking despite the async interface. Let's verify the implementation.

🏁 Scripts executed

The following scripts were executed for the analysis:

Script:

#!/bin/bash
# Search for potential blocking operations in async methods
echo "Checking for potential blocking operations in async methods..."
rg -n "self\.client\." cognee/infrastructure/databases/vector/weaviate_db/WeaviateAdapter.py

# Check for other adapter implementations to compare async patterns
echo "Checking async patterns in other adapters..."
fd -t f ".*Adapter\.py" cognee/infrastructure/databases/vector/ -x rg -l "async def"

Length of output: 917


Script:

#!/bin/bash
# Let's check the async patterns in PGVectorAdapter for comparison
echo "Checking PGVectorAdapter async implementation..."
rg -A 5 "async def" cognee/infrastructure/databases/vector/pgvector/PGVectorAdapter.py

# Let's also check how client operations are handled in PGVectorAdapter
echo "Checking client operations in PGVectorAdapter..."
rg "self\." cognee/infrastructure/databases/vector/pgvector/PGVectorAdapter.py

# Let's examine the interface requirements
echo "Checking VectorDBInterface..."
rg -A 2 "async def" cognee/infrastructure/databases/vector/vector_db_interface.py

Length of output: 6003

cognee/modules/pipelines/operations/run_tasks.py (3)

10-10: Update logger string to match new type hint

The type hint has been updated to use list[Task], but the logger string still shows the old format [Task]. Update it for consistency:

-logger = logging.getLogger("run_tasks(tasks: [Task], data)")
+logger = logging.getLogger("run_tasks(tasks: list[Task], data)")

Also applies to: 8-8


Line range hint 25-27: Standardize telemetry call parameter naming

There are inconsistencies in how telemetry parameters are passed:

# Some calls use named parameters
send_telemetry("Generator Task Completed", user_id=user.id, additional_properties={...})

# Others use positional parameters
send_telemetry("Async Generator Task Started", user.id, {...})

Standardize all telemetry calls to use named parameters for clarity:

send_telemetry(
    event_name,
    user_id=user.id,
    additional_properties={...}
)

Also applies to: 84-86, 124-126


Line range hint 22-147: Consider refactoring task execution patterns

The code contains significant duplication in error handling, logging, and telemetry across different task types. This makes maintenance more difficult and increases the chance of inconsistencies.

Consider introducing helper functions to handle common patterns:

async def execute_task(task_type: str, task: Task, args: list, next_processor, user: User):
    logger.info(f"{task_type} task started: `{task.executable.__name__}`")
    send_telemetry(f"{task_type} Task Started", user_id=user.id, additional_properties={
        "task_name": task.executable.__name__
    })
    
    try:
        result = await process_task(task_type, task, args)
        await next_processor(result)
        
        logger.info(f"{task_type} task completed: `{task.executable.__name__}`")
        send_telemetry(f"{task_type} Task Completed", user_id=user.id, additional_properties={
            "task_name": task.executable.__name__
        })
    except Exception as error:
        logger.error(
            f"{task_type} task errored: `{task.executable.__name__}`\n{str(error)}\n",
            exc_info=True
        )
        send_telemetry(f"{task_type} Task Errored", user_id=user.id, additional_properties={
            "task_name": task.executable.__name__
        })
        raise

This would significantly reduce code duplication and make the code more maintainable.

cognee/tasks/graph/infer_data_ontology.py (4)

Line range hint 89-157: Consider refactoring the add_graph_ontology method

The method is handling two distinct flows (file-based and document-based) which makes it complex and harder to maintain. Consider splitting it into two separate methods for better maintainability and testability.

Here's a suggested refactoring approach:

- async def add_graph_ontology(self, file_path: str = None, documents: list = None):
+ async def add_graph_ontology_from_documents(self, documents: list) -> str:
+     """Add graph ontology from documents content."""
+     initial_chunks_and_ids = await self._process_documents(documents)
+     ontology = await extract_ontology(str(initial_chunks_and_ids), GraphOntology)
+     await self._add_nodes_and_edges(ontology)
+     return ontology.root_node_id
+
+ async def add_graph_ontology_from_file(self, file_path: str, documents: list):
+     """Add graph ontology from a JSON or CSV file."""
+     dataset_level_information = documents[0][1]
+     valid_ids = {item["id"] for item in dataset_level_information}
+     data = await self.load_data(file_path)
+     await self._process_file_data(data, valid_ids)

Also, consider extracting common node/edge property creation logic into helper methods to reduce code duplication.


Line range hint 64-81: Enhance error handling in load_data method

While the error handling has been improved, consider making it more specific by:

  1. Using custom exceptions for different error cases
  2. Adding validation for empty files
  3. Moving file type validation to a separate method

Here's a suggested improvement:

class OntologyFileError(Exception):
    """Base exception for ontology file operations."""
    pass

class UnsupportedFileTypeError(OntologyFileError):
    """Exception raised for unsupported file types."""
    pass

class EmptyFileError(OntologyFileError):
    """Exception raised for empty files."""
    pass

async def validate_file_type(self, file_path: str) -> str:
    if not file_path.endswith(('.json', '.csv')):
        raise UnsupportedFileTypeError(f"Unsupported file format: {file_path}")
    return file_path.split('.')[-1]

async def load_data(self, file_path: str) -> Union[List[Dict[str, Any]], Dict[str, Any]]:
    """Load data from a JSON or CSV file."""
    try:
        file_type = await self.validate_file_type(file_path)
        async with aiofiles.open(file_path, mode="r") as f:
            content = await f.read()
            if not content.strip():
                raise EmptyFileError(f"File is empty: {file_path}")
            
            if file_type == "json":
                return json.loads(content)
            else:  # csv
                reader = csv.DictReader(content.splitlines())
                return list(reader)
    except OntologyFileError as e:
        raise
    except Exception as e:
        raise RuntimeError(f"Failed to load data from {file_path}: {e}")

Line range hint 134-137: Improve node ID validation

While the validation check is good, consider:

  1. Validating all node IDs upfront before processing
  2. Providing more informative error messages
  3. Adding logging for skipped/invalid nodes

Here's a suggested improvement:

# Add this method to OntologyEngine class
async def validate_node_ids(self, nodes: pd.DataFrame, valid_ids: set) -> None:
    """Validate all node IDs before processing."""
    invalid_ids = set(nodes['node_id'].unique()) - valid_ids
    if invalid_ids:
        logger.error(f"Found {len(invalid_ids)} invalid node IDs")
        raise ValueError(
            f"Invalid node IDs found: {invalid_ids}. "
            f"Valid IDs must be one of: {valid_ids}"
        )

Then update the validation in add_graph_ontology:

- if node_id not in valid_ids:
-     raise ValueError(f"Node ID {node_id} not found in the dataset")
+ # At the start of processing
+ await self.validate_node_ids(df, valid_ids)

Line range hint 1-157: Add comprehensive tests for ontology processing

Given the complexity of the ontology processing logic, consider adding comprehensive tests covering:

  1. Different input formats (JSON, CSV)
  2. Edge cases (empty files, invalid node IDs)
  3. Both file-based and document-based processing paths
  4. Error handling scenarios

Would you like me to help create a test suite for this module?
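As a starting point, case 2 might look like this with pytest and pytest-asyncio, written against the refactor suggested above (so OntologyEngine's constructor details and the custom exception names are assumptions, not existing code):

import pytest

from cognee.tasks.graph.infer_data_ontology import (  # exception classes are the proposed ones above
    OntologyEngine,
    UnsupportedFileTypeError,
    EmptyFileError,
)

@pytest.mark.asyncio
async def test_load_data_rejects_unsupported_file_type(tmp_path):
    ontology_file = tmp_path / "ontology.txt"
    ontology_file.write_text("not json or csv")

    engine = OntologyEngine()  # construction details may differ in the real module
    with pytest.raises(UnsupportedFileTypeError):
        await engine.load_data(str(ontology_file))

@pytest.mark.asyncio
async def test_load_data_rejects_empty_file(tmp_path):
    ontology_file = tmp_path / "ontology.json"
    ontology_file.write_text("")

    engine = OntologyEngine()
    with pytest.raises(EmptyFileError):
        await engine.load_data(str(ontology_file))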

cognee/shared/utils.py (2)

139-142: Extract node size configuration to settings.

The keywords list for determining node sizes is hardcoded. Consider moving this configuration to a settings file or environment variables for better maintainability.

+# In config.py or settings.py
+NODE_SIZE_CONFIG = {
+    "default_size": 10,
+    "larger_size": 20,
+    "keywords": ["DOCUMENT", "User"]
+}

 # In current file
-            keywords = ["DOCUMENT", "User"]
-            node_size = larger_size if any(keyword in str(node) for keyword in keywords) else default_size
+            node_size = NODE_SIZE_CONFIG["larger_size"] if any(
+                keyword in str(node) for keyword in NODE_SIZE_CONFIG["keywords"]
+            ) else NODE_SIZE_CONFIG["default_size"]

175-179: Remove commented out code and redundant pass statement.

The commented code for color encoding and the redundant pass statement should be removed if this functionality is no longer needed. If the color functionality will be implemented later, consider adding a TODO comment with a ticket reference.

         if include_color:
-            pass
-            # unique_layers = nodes["layer_description"].unique()
-            # color_palette = generate_color_palette(unique_layers)
-            # plotter = plotter.encode_point_color("layer_description", categorical_mapping=color_palette,
-            #                                      default_mapping="silver")
+            # TODO: Implement color encoding in ticket COG-XXX
+            pass
cognee/modules/data/operations/translate_text.py (4)

17-18: Move imports to the top of the file for better organization.

Importing modules inside a function can lead to repeated imports and is generally discouraged unless necessary. Consider moving the boto3 and botocore.exceptions imports to the module level.

Apply this diff:

+import boto3
+from botocore.exceptions import BotoCoreError, ClientError

...

-    import boto3
-    from botocore.exceptions import BotoCoreError, ClientError

27-33: Handle AWS Translate response more explicitly.

Using result.get("TranslatedText", "No translation found.") may mask issues if the TranslatedText key is missing. Since AWS Translate should always return TranslatedText on success, consider accessing it directly and handling any potential exceptions.

Apply this diff:

...

-        return result.get("TranslatedText", "No translation found.")
+        return result["TranslatedText"]

...

+    except KeyError as e:
+        logger.error(f"KeyError occurred: {e}")
+        return None

35-41: Improve error handling by raising exceptions.

Returning None on exceptions may make it difficult for the calling code to handle errors appropriately. Consider re-raising the exceptions after logging, allowing the calling code to deal with them directly.

Apply this diff:

...

     except BotoCoreError as e:
         logger.error(f"BotoCoreError occurred: {e}")
-        return None
+        raise

     except ClientError as e:
         logger.error(f"ClientError occurred: {e}")
-        return None
+        raise

5-15: Document exceptions in the docstring.

Including an "Exceptions" section in the docstring helps users understand what errors to expect and how to handle them.

Update the docstring as follows:

     Returns:
-    str: Translated text or an error message.
+    str: Translated text.

+    Raises:
+        ValueError: If input validation fails.
+        BotoCoreError: If an error occurs within the AWS SDK.
+        ClientError: If AWS Translate returns an error.
cognee/shared/SourceCodeGraph.py (2)

31-31: Inconsistent use of List and list in type annotations

At line 31, has_methods is annotated using list["Function"] instead of List["Function"]. For consistency and clarity, it's recommended to use List from the typing module for type annotations.

Apply this diff to fix the inconsistency:

 class Class(DataPoint):
     id: str
     name: str
     type: Literal["Class"] = "Class"
     description: str
     constructor_parameters: List[Variable]
     extended_from_class: Optional["Class"] = None
-    has_methods: list["Function"]
+    has_methods: List["Function"]

43-43: Unnecessary use of Union with a single type

At line 43, instantiated_by is annotated as Union["Function"]. Since there is only one type, using Union is unnecessary. If instantiated_by can only be a Function, you can annotate it directly. If it can be None, consider using Optional["Function"].

Apply this diff to correct the type annotation:

 class ClassInstance(DataPoint):
     id: str
     name: str
     type: Literal["ClassInstance"] = "ClassInstance"
     description: str
     from_class: Class
-    instantiated_by: Union["Function"]
+    instantiated_by: "Function"
cognee/modules/chunking/TextChunker.py (1)

50-50: Simplify list indexing with negative indices

Instead of using self.paragraph_chunks[len(self.paragraph_chunks) - 1], you can use self.paragraph_chunks[-1] for a more idiomatic and concise approach.

Apply this diff:

 # At line 50
-                    cut_type = self.paragraph_chunks[len(self.paragraph_chunks) - 1]["cut_type"],
+                    cut_type = self.paragraph_chunks[-1]["cut_type"],

 # At line 67
-                    cut_type = self.paragraph_chunks[len(self.paragraph_chunks) - 1]["cut_type"],
+                    cut_type = self.paragraph_chunks[-1]["cut_type"],

Also applies to: 67-67

cognee/modules/graph/utils/get_graph_from_model.py (5)

27-31: Use tuples instead of string concatenation for edge_key to ensure uniqueness and improve performance.

Currently, edge_key is constructed by concatenating strings:

edge_key = str(edge[0]) + str(edge[1]) + edge[2]

This approach can lead to key collisions and is less efficient than using tuples as dictionary keys.

Consider using a tuple:

-edge_key = str(edge[0]) + str(edge[1]) + edge[2]
+edge_key = (edge[0], edge[1], edge[2])

Update the if condition and the dictionary assignment accordingly:

-if str(edge_key) not in added_edges:
+if edge_key not in added_edges:
    edges.append(edge)
-   added_edges[str(edge_key)] = True
+   added_edges[edge_key] = True

65-68: Consistently use tuples for edge_key to avoid key collisions and improve code clarity.

Again, constructing edge_key by concatenating strings may lead to collisions and is less efficient. Recommend using tuples as keys.

Update the code as follows:

-edge_key = str(data_point.id) + str(property_node.id) + field_name
+edge_key = (data_point.id, property_node.id, field_name)

And update the if condition and the dictionary assignment:

-if str(edge_key) not in added_edges:
+if edge_key not in added_edges:
    edges.append((
        data_point.id,
        property_node.id,
        field_name,
        {
            "source_node_id": data_point.id,
            "target_node_id": property_node.id,
            "relationship_name": field_name,
            "updated_at": datetime.now(timezone.utc).strftime("%Y-%m-%d %H:%M:%S"),
            "metadata": {
                "type": "list"
            },
        }
    ))
    added_edges[edge_key] = True

41-41: Use ISO 8601 format for updated_at timestamps for consistency and precision.

Currently, the timestamp is formatted without time zone information:

"updated_at": datetime.now(timezone.utc).strftime("%Y-%m-%d %H:%M:%S"),

Consider using .isoformat() to include time zone information and ensure consistency:

-"updated_at": datetime.now(timezone.utc).strftime("%Y-%m-%d %H:%M:%S"),
+"updated_at": datetime.now(timezone.utc).isoformat(),

72-72: Use ISO 8601 format for updated_at timestamps for consistency and precision.

As above, update the timestamp formatting to include time zone information.

Apply this diff:

-"updated_at": datetime.now(timezone.utc).strftime("%Y-%m-%d %H:%M:%S"),
+"updated_at": datetime.now(timezone.utc).isoformat(),

99-100: Use a set for destination_nodes to improve lookup performance.

Currently, destination_nodes is a list, and checking if str(node.id) in destination_nodes is an O(N) operation.

Convert destination_nodes to a set for O(1) lookup:

-destination_nodes = [str(property_edge[1]) for property_edge in property_edges]
+destination_nodes = {str(property_edge[1]) for property_edge in property_edges}
cognee/api/v1/cognify/code_graph_pipeline.py (1)

90-91: Replace print statements with proper logging

Using print() statements in production code is not recommended. Consider using the logger to log pipeline results for better control over output and log levels.

Apply this diff to replace print() with logger.info():

     async for result in pipeline:
-        print(result)
+        logger.info(result)
cognee/tasks/graph/extract_graph_from_data.py (2)

27-35: Optimize node processing by using sets

Using a dictionary processed_nodes with string keys and True values is less efficient and may be less readable. Consider using a set to track processed node IDs.

Refactor as follows:

-    processed_nodes = {}
+    processed_node_ids = set()

...

-        if str(type_node_id) not in processed_nodes:
+        if str(type_node_id) not in processed_node_ids:

...

-            processed_nodes[str(type_node_id)] = True
+            processed_node_ids.add(str(type_node_id))

...

-        if str(entity_node_id) not in processed_nodes:
+        if str(entity_node_id) not in processed_node_ids:

...

-            processed_nodes[str(entity_node_id)] = True
+            processed_node_ids.add(str(entity_node_id))

12-14: Handle exceptions in asynchronous graph extraction

When using asyncio.gather, if one of the coroutines raises an exception, it cancels the remaining tasks, which may not be desirable. To prevent this, consider using return_exceptions=True or handling exceptions within extract_content_graph.

Apply this diff to handle exceptions:

     chunk_graphs = await asyncio.gather(
-        *[extract_content_graph(chunk.text, graph_model) for chunk in data_chunks]
+        *[extract_content_graph(chunk.text, graph_model) for chunk in data_chunks],
+        return_exceptions=True
     )

Then, handle exceptions in the results:

for idx, result in enumerate(chunk_graphs):
    if isinstance(result, Exception):
        # Handle the exception (e.g., log it)
        chunk_graphs[idx] = None  # Or handle as appropriate
cognee/infrastructure/databases/vector/qdrant/QDrantAdapter.py (3)

13-19: Clarify the use of _metadata as a class attribute

The _metadata dictionary is declared as a class attribute in IndexSchema. If _metadata is intended to store instance-specific information, it should be defined as an instance attribute to prevent shared state across instances. Additionally, leading underscores typically denote private attributes. Consider adjusting its usage for clarity and to avoid potential issues.

Consider defining _metadata within the __init__ method:

 class IndexSchema(DataPoint):
     text: str

-    _metadata: dict = {
+    def __init__(self, **kwargs):
+        super().__init__(**kwargs)
+        self._metadata = {
             "index_fields": ["text"]
         }
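
If DataPoint is a Pydantic model (an assumption based on how it is used elsewhere in this PR), pydantic's PrivateAttr achieves the same per-instance default without overriding __init__. A minimal sketch, reusing the existing DataPoint import:

from pydantic import PrivateAttr

class IndexSchema(DataPoint):
    text: str

    # per-instance private attribute; avoids shared mutable class-level state
    _metadata: dict = PrivateAttr(default_factory=lambda: {"index_fields": ["text"]})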

128-129: Validate collection naming to avoid conflicts

Using f"{index_name}_{index_property_name}" as the collection name could lead to naming collisions if index_name or index_property_name contain underscores or duplicate values. Ensure that the combination of these parameters results in unique collection names and consider sanitizing or validating the inputs.

You might implement a standardized naming convention or use a delimiter less likely to appear in names:

 await self.create_collection(f"{index_name}__{index_property_name}")

2-2: Remove unused imports if not necessary

Ensure that all imported modules, such as UUID, are being utilized in the code. If UUID is no longer needed due to changes in the codebase, consider removing the import to keep the code clean.

cognee/infrastructure/databases/vector/lancedb/LanceDBAdapter.py (3)

76-77: Avoid reassigning payload_schema to enhance clarity

Reassigning payload_schema after calling self.get_data_point_schema(payload_schema) can be confusing. Consider using a new variable name for the transformed schema to improve code readability.
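
For example, assuming the reassignment currently looks like the line below, renaming the result keeps the original argument and the transformed schema distinct (data_point_schema is a suggested name, not existing code):

-payload_schema = self.get_data_point_schema(payload_schema)
+data_point_schema = self.get_data_point_schema(payload_schema)

Subsequent references in the method would then use data_point_schema.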


100-109: Simplify dynamic typing in create_lance_data_point

The use of generics with LanceDataPoint[str, self.get_data_point_schema(type(data_point))] introduces complexity and may lead to type-checking issues. Defining LanceDataPoint with explicit types can enhance maintainability.


167-176: Re-evaluate normalization when distances are equal

Setting all normalized values to 0 when max_value == min_value might not accurately reflect similarity. Consider assigning a neutral value like 0.5 to indicate equal distances or maintain the original distances to preserve relative differences.
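
A sketch of the normalization with a neutral fallback (the function name and placement are illustrative):

def normalize_distances(distances: list[float]) -> list[float]:
    min_value, max_value = min(distances), max(distances)
    if max_value == min_value:
        # every hit is equally distant: 0.5 signals "no ranking information",
        # whereas 0 would read as a perfect match for every result
        return [0.5 for _ in distances]
    return [(d - min_value) / (max_value - min_value) for d in distances]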

cognee/infrastructure/databases/vector/pgvector/PGVectorAdapter.py (1)

Line range hint 91-96: Unused parameter payload_schema in create_collection

The payload_schema parameter is passed to create_collection, but it is not utilized within the method. To maintain clean and understandable code, consider removing the unused parameter or implementing its functionality if it is intended for future use.

cognee/infrastructure/databases/hybrid/falkordb/FalkorDBAdapter.py (2)

153-153: Remove unnecessary f prefix in string

The string at line 153 is marked as an f-string but doesn't contain any placeholders. The f prefix is unnecessary and can be removed.

Apply this diff to fix the issue:

-            f"MATCH (node) WHERE node.id IN $node_ids RETURN node",
+            "MATCH (node) WHERE node.id IN $node_ids RETURN node",
🧰 Tools
🪛 Ruff

153-153: f-string without any placeholders

Remove extraneous f prefix

(F541)


241-241: Remove unnecessary f prefix in string

The string at line 241 is marked as an f-string but doesn't contain any placeholders. Removing the f prefix cleans up the code.

Apply this diff to fix the issue:

-            f"MATCH (node) WHERE node.id IN $node_ids DETACH DELETE node",
+            "MATCH (node) WHERE node.id IN $node_ids DETACH DELETE node",
🧰 Tools
🪛 Ruff

241-241: f-string without any placeholders

Remove extraneous f prefix

(F541)

cognee/infrastructure/databases/graph/networkx/adapter.py (2)

Line range hint 138-154: Ensure consistent use of node_id parameter types across methods

In the methods get_predecessors and get_successors, the node_id parameter is typed as UUID, while other methods like has_node, delete_node, and get_edges use node_id: str. This inconsistency could lead to type errors or unexpected behavior. Consider standardizing the node_id parameter to use UUID across all methods for consistency.

For example, update the has_node method signature:

-    async def has_node(self, node_id: str) -> bool:
+    async def has_node(self, node_id: UUID) -> bool:

And ensure all usages of node_id are appropriately updated.


49-49: Optimize file save operations to improve performance

Calling await self.save_graph_to_file(self.filename) after every node or edge addition or deletion can lead to performance bottlenecks due to frequent disk I/O operations. Consider batching these operations or implementing a buffering mechanism to reduce the number of writes.

One approach is to save the graph at strategic points or after a certain number of operations. Alternatively, you could introduce a flag to indicate when the graph has been modified and needs to be saved.

Example modification:

# Remove individual save calls
-    await self.save_graph_to_file(self.filename)

# Implement a periodic save mechanism or manual trigger

This change would require adjusting the workflow to ensure data integrity without compromising performance.
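
One possible shape for the dirty-flag idea, shown as a hypothetical mixin (names are illustrative; save_graph_to_file and filename come from the existing adapter):

class GraphPersistenceMixin:
    _dirty: bool = False

    def _mark_dirty(self):
        # call from add_node / add_edge / delete_node instead of saving immediately
        self._dirty = True

    async def flush_if_dirty(self):
        # call at strategic points: end of a pipeline step, shutdown, or every N mutations
        if self._dirty:
            await self.save_graph_to_file(self.filename)
            self._dirty = False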

Also applies to: 56-56, 97-97, 108-108

cognee/infrastructure/databases/graph/neo4j_driver/adapter.py (2)

204-205: Avoid redundant conversion and ensure consistent typing of node identifiers

In the add_edges method, you are converting edge[0] and edge[1] to strings multiple times. Ensure that node identifiers are consistently handled as strings throughout the method to improve clarity and prevent potential type-related issues.

Refactor the code to reduce redundancy:

edges = [{
-  "from_node": str(edge[0]),
-  "to_node": str(edge[1]),
+  "from_node": edge_id_strs[0],
+  "to_node": edge_id_strs[1],
   "relationship_name": edge[2],
   "properties": {
       **(edge[3] if edge[3] else {}),
-      "source_node_id": str(edge[0]),
-      "target_node_id": str(edge[1]),
+      "source_node_id": edge_id_strs[0],
+      "target_node_id": edge_id_strs[1],
   },
} for edge_id_strs, edge in [ (list(map(str, edge[:2])), edge) for edge in edges ]]

This refactoring ensures that all node IDs are converted to strings once and reused, enhancing code readability.

Also applies to: 209-210


400-409: Efficiently serialize properties without unnecessary empty lines

The serialize_properties method contains several unnecessary empty lines that can be removed to improve readability.

Apply this diff to clean up the method:

 def serialize_properties(self, properties = dict()):
     serialized_properties = {}
-    
     for property_key, property_value in properties.items():
         if isinstance(property_value, UUID):
             serialized_properties[property_key] = str(property_value)
             continue
         serialized_properties[property_key] = property_value
-    
     return serialized_properties
📜 Review details

Configuration used: CodeRabbit UI
Review profile: CHILL

📥 Commits

Reviewing files that changed from the base of the PR and between f097fcf and d733bfd.

⛔ Files ignored due to path filters (1)
  • poetry.lock is excluded by !**/*.lock
📒 Files selected for processing (82)
  • README.md (1 hunks)
  • cognee/api/v1/cognify/code_graph_pipeline.py (1 hunks)
  • cognee/api/v1/cognify/cognify_v2.py (2 hunks)
  • cognee/api/v1/search/search_v2.py (1 hunks)
  • cognee/infrastructure/databases/graph/falkordb/adapter.py (0 hunks)
  • cognee/infrastructure/databases/graph/get_graph_engine.py (1 hunks)
  • cognee/infrastructure/databases/graph/graph_db_interface.py (1 hunks)
  • cognee/infrastructure/databases/graph/neo4j_driver/adapter.py (9 hunks)
  • cognee/infrastructure/databases/graph/networkx/adapter.py (9 hunks)
  • cognee/infrastructure/databases/hybrid/falkordb/FalkorDBAdapter.py (1 hunks)
  • cognee/infrastructure/databases/vector/__init__.py (0 hunks)
  • cognee/infrastructure/databases/vector/config.py (2 hunks)
  • cognee/infrastructure/databases/vector/create_vector_engine.py (3 hunks)
  • cognee/infrastructure/databases/vector/falkordb/FalkorDBAdapter.py (0 hunks)
  • cognee/infrastructure/databases/vector/lancedb/LanceDBAdapter.py (7 hunks)
  • cognee/infrastructure/databases/vector/models/DataPoint.py (0 hunks)
  • cognee/infrastructure/databases/vector/models/ScoredResult.py (1 hunks)
  • cognee/infrastructure/databases/vector/pgvector/PGVectorAdapter.py (5 hunks)
  • cognee/infrastructure/databases/vector/pgvector/serialize_data.py (1 hunks)
  • cognee/infrastructure/databases/vector/qdrant/QDrantAdapter.py (5 hunks)
  • cognee/infrastructure/databases/vector/vector_db_interface.py (1 hunks)
  • cognee/infrastructure/databases/vector/weaviate_db/WeaviateAdapter.py (4 hunks)
  • cognee/infrastructure/engine/__init__.py (1 hunks)
  • cognee/infrastructure/engine/__tests__/model_to_graph_to_model.test.py (1 hunks)
  • cognee/infrastructure/engine/models/DataPoint.py (1 hunks)
  • cognee/modules/chunking/TextChunker.py (2 hunks)
  • cognee/modules/chunking/__init__.py (0 hunks)
  • cognee/modules/chunking/models/DocumentChunk.py (1 hunks)
  • cognee/modules/data/extraction/__init__.py (1 hunks)
  • cognee/modules/data/extraction/knowledge_graph/__init__.py (1 hunks)
  • cognee/modules/data/operations/detect_language.py (1 hunks)
  • cognee/modules/data/operations/translate_text.py (1 hunks)
  • cognee/modules/data/processing/document_types/AudioDocument.py (1 hunks)
  • cognee/modules/data/processing/document_types/Document.py (1 hunks)
  • cognee/modules/data/processing/document_types/ImageDocument.py (1 hunks)
  • cognee/modules/data/processing/document_types/PdfDocument.py (1 hunks)
  • cognee/modules/data/processing/document_types/TextDocument.py (1 hunks)
  • cognee/modules/data/processing/document_types/__init__.py (1 hunks)
  • cognee/modules/engine/models/Entity.py (1 hunks)
  • cognee/modules/engine/models/EntityType.py (1 hunks)
  • cognee/modules/engine/models/__init__.py (1 hunks)
  • cognee/modules/engine/utils/__init__.py (1 hunks)
  • cognee/modules/engine/utils/generate_edge_name.py (1 hunks)
  • cognee/modules/engine/utils/generate_node_id.py (1 hunks)
  • cognee/modules/engine/utils/generate_node_name.py (1 hunks)
  • cognee/modules/graph/utils.py (0 hunks)
  • cognee/modules/graph/utils/__init__.py (1 hunks)
  • cognee/modules/graph/utils/get_graph_from_model.py (1 hunks)
  • cognee/modules/graph/utils/get_model_instance_from_graph.py (1 hunks)
  • cognee/modules/pipelines/operations/run_tasks.py (2 hunks)
  • cognee/modules/search/CogneeSearch.py (0 hunks)
  • cognee/modules/search/graph/search_adjacent.py (0 hunks)
  • cognee/modules/search/graph/search_cypher.py (0 hunks)
  • cognee/modules/search/graph/search_similarity.py (0 hunks)
  • cognee/modules/search/graph/search_summary.py (0 hunks)
  • cognee/modules/search/llm/extraction/categorize_relevant_category.py (0 hunks)
  • cognee/modules/search/llm/extraction/categorize_relevant_summary.py (0 hunks)
  • cognee/modules/search/llm/get_relevant_summary.py (0 hunks)
  • cognee/modules/search/vector/bm25.py (0 hunks)
  • cognee/modules/search/vector/fusion.py (0 hunks)
  • cognee/modules/search/vector/search_traverse.py (0 hunks)
  • cognee/modules/storage/utils/__init__.py (1 hunks)
  • cognee/shared/SourceCodeGraph.py (1 hunks)
  • cognee/shared/utils.py (4 hunks)
  • cognee/tasks/__init__.py (0 hunks)
  • cognee/tasks/chunk_naive_llm_classifier/chunk_naive_llm_classifier.py (3 hunks)
  • cognee/tasks/chunk_translate/translate_chunk.py (0 hunks)
  • cognee/tasks/chunk_update_check/chunk_update_check.py (0 hunks)
  • cognee/tasks/chunks/__init__.py (1 hunks)
  • cognee/tasks/chunks/__tests__/chunk_by_paragraph.test.py (1 hunks)
  • cognee/tasks/chunks/query_chunks.py (1 hunks)
  • cognee/tasks/chunks/remove_disconnected_chunks.py (1 hunks)
  • cognee/tasks/classify_documents/classify_documents.py (0 hunks)
  • cognee/tasks/documents/__init__.py (1 hunks)
  • cognee/tasks/documents/classify_documents.py (1 hunks)
  • cognee/tasks/documents/extract_chunks_from_documents.py (1 hunks)
  • cognee/tasks/graph/__init__.py (1 hunks)
  • cognee/tasks/graph/chunks_into_graph.py (0 hunks)
  • cognee/tasks/graph/extract_graph_from_code.py (1 hunks)
  • cognee/tasks/graph/extract_graph_from_data.py (1 hunks)
  • cognee/tasks/graph/infer_data_ontology.py (1 hunks)
  • cognee/tasks/graph/query_graph_connections.py (3 hunks)
⛔ Files not processed due to max files limit (17)
  • cognee/tasks/save_chunks_to_store/save_chunks_to_store.py
  • cognee/tasks/source_documents_to_chunks/source_documents_to_chunks.py
  • cognee/tasks/storage/__init__.py
  • cognee/tasks/storage/add_data_points.py
  • cognee/tasks/storage/index_data_points.py
  • cognee/tasks/storage/save_to_vector_storage.py
  • cognee/tasks/summarization/models/TextSummary.py
  • cognee/tasks/summarization/query_summaries.py
  • cognee/tasks/summarization/summarize_text.py
  • cognee/tests/test_code_generation.py
  • cognee/tests/test_data/code.txt
  • cognee/tests/test_library.py
  • cognee/tests/test_neo4j.py
  • cognee/tests/test_pgvector.py
  • cognee/tests/test_qdrant.py
  • cognee/tests/test_weaviate.py
  • pyproject.toml
💤 Files with no reviewable changes (22)
  • cognee/infrastructure/databases/graph/falkordb/adapter.py
  • cognee/infrastructure/databases/vector/__init__.py
  • cognee/infrastructure/databases/vector/falkordb/FalkorDBAdapter.py
  • cognee/infrastructure/databases/vector/models/DataPoint.py
  • cognee/modules/chunking/__init__.py
  • cognee/modules/graph/utils.py
  • cognee/modules/search/CogneeSearch.py
  • cognee/modules/search/graph/search_adjacent.py
  • cognee/modules/search/graph/search_cypher.py
  • cognee/modules/search/graph/search_similarity.py
  • cognee/modules/search/graph/search_summary.py
  • cognee/modules/search/llm/extraction/categorize_relevant_category.py
  • cognee/modules/search/llm/extraction/categorize_relevant_summary.py
  • cognee/modules/search/llm/get_relevant_summary.py
  • cognee/modules/search/vector/bm25.py
  • cognee/modules/search/vector/fusion.py
  • cognee/modules/search/vector/search_traverse.py
  • cognee/tasks/__init__.py
  • cognee/tasks/chunk_translate/translate_chunk.py
  • cognee/tasks/chunk_update_check/chunk_update_check.py
  • cognee/tasks/classify_documents/classify_documents.py
  • cognee/tasks/graph/chunks_into_graph.py
✅ Files skipped from review due to trivial changes (14)
  • README.md
  • cognee/api/v1/search/search_v2.py
  • cognee/infrastructure/databases/vector/vector_db_interface.py
  • cognee/infrastructure/engine/__init__.py
  • cognee/modules/data/extraction/__init__.py
  • cognee/modules/data/extraction/knowledge_graph/__init__.py
  • cognee/modules/data/processing/document_types/__init__.py
  • cognee/modules/engine/models/__init__.py
  • cognee/modules/engine/utils/__init__.py
  • cognee/modules/graph/utils/__init__.py
  • cognee/tasks/chunks/__init__.py
  • cognee/tasks/chunks/__tests__/chunk_by_paragraph.test.py
  • cognee/tasks/documents/__init__.py
  • cognee/tasks/graph/__init__.py
🧰 Additional context used
📓 Learnings (1)
cognee/tasks/chunks/query_chunks.py (3)
Learnt from: borisarzentar
PR: topoteretes/cognee#144
File: cognee/tasks/chunking/query_chunks.py:1-17
Timestamp: 2024-10-08T16:12:50.908Z
Learning: The `query_chunks` function in `cognee/tasks/chunking/query_chunks.py` is used within the `search` function in `cognee/api/v1/search/search_v2.py`.
Learnt from: borisarzentar
PR: topoteretes/cognee#144
File: cognee/tasks/chunking/query_chunks.py:1-17
Timestamp: 2024-10-08T16:12:49.108Z
Learning: The `query_chunks` function in `cognee/tasks/chunking/query_chunks.py` is used within the `search` function in `cognee/api/v1/search/search_v2.py`.
Learnt from: borisarzentar
PR: topoteretes/cognee#144
File: cognee/tasks/chunking/query_chunks.py:1-17
Timestamp: 2024-10-07T11:20:44.876Z
Learning: The `query_chunks` function in `cognee/tasks/chunking/query_chunks.py` is used within the `search` function in `cognee/api/v1/search/search_v2.py`.
🪛 Ruff
cognee/api/v1/cognify/code_graph_pipeline.py

40-40: Use is and is not for type comparisons, or isinstance() for isinstance checks

(E721)

cognee/infrastructure/databases/graph/networkx/adapter.py

256-256: Do not use bare except

(E722)


270-270: Do not use bare except

(E722)

cognee/infrastructure/databases/hybrid/falkordb/FalkorDBAdapter.py

108-108: Do not use bare except

(E722)


153-153: f-string without any placeholders

Remove extraneous f prefix

(F541)


241-241: f-string without any placeholders

Remove extraneous f prefix

(F541)

cognee/modules/storage/utils/__init__.py

18-18: Module level import not at top of file

(E402)

🔇 Additional comments (58)
cognee/modules/engine/utils/generate_node_name.py (1)

1-2: Verify FalkorDB node naming requirements

Let's ensure our node name generation aligns with FalkorDB's specific requirements.

✅ Verification successful

Let me gather more specific information about node name handling in FalkorDB adapter.


Based on the search results, I can now see how nodes are handled in FalkorDB. Let me check one more aspect about string handling in queries.


Based on the gathered information, I can now provide the final response:

Node name generation appears compatible with FalkorDB requirements

The current implementation of generate_node_name() which converts to lowercase and removes single quotes aligns well with FalkorDB's handling of node names. From the codebase analysis:

  • FalkorDB adapter properly escapes string values in queries using single quotes
  • Node labels and properties are handled safely through the stringify_properties method
  • No additional name restrictions are enforced in the FalkorDB adapter implementation
🏁 Scripts executed

The following scripts were executed for the analysis:

Script:

#!/bin/bash
# Search for FalkorDB node naming patterns or constraints in the codebase
rg -l "FalkorDB" | xargs rg -i "node.*name|label.*name"

# Look for any existing node creation patterns
ast-grep --pattern 'CREATE ($node_name)'

Length of output: 1077


Script:

#!/bin/bash
# Look for node creation and naming patterns in FalkorDB adapter
rg -A 5 -B 5 "CREATE" cognee/infrastructure/databases/hybrid/falkordb/FalkorDBAdapter.py

# Check for any node name validation or processing
rg -A 3 -B 3 "node.*name|label.*name" cognee/infrastructure/databases/hybrid/falkordb/FalkorDBAdapter.py

# Look for any error handling related to node names
rg -i "invalid.*name|name.*invalid" cognee/infrastructure/databases/hybrid/falkordb/FalkorDBAdapter.py

Length of output: 2611


Script:

#!/bin/bash
# Look for any string escaping or sanitization in FalkorDB queries
rg -A 5 "stringify|escape|sanitize|replace" cognee/infrastructure/databases/hybrid/falkordb/FalkorDBAdapter.py

# Check if there are any FalkorDB specific configuration or constants
cat cognee/infrastructure/databases/hybrid/falkordb/FalkorDBAdapter.py | head -n 30

Length of output: 1914

cognee/modules/engine/utils/generate_edge_name.py (1)

1-2: Verify consistency with node name generation

Since this is part of a larger change involving node/edge naming utilities, we should verify consistency with the node name generation approach.

#!/bin/bash
# Description: Compare edge name generation with node name generation

# Look for node name generation implementation
echo "Searching for node name generation..."
rg "generate_node_name" -A 5

# Look for other name generation utilities
echo "Searching for similar utilities..."
fd -g "*.py" -x grep -l "generate.*name"
cognee/modules/engine/utils/generate_node_id.py (1)

1-4: Verify consistent node ID generation across the codebase

Since this is a refactored utility moved from cognee/modules/graph/utils.py, we should verify that the new implementation maintains consistency with existing node IDs in the system.

✅ Verification successful

Node ID generation is consistent across the codebase

The verification shows that the implementation in generate_node_id.py maintains consistency with existing node ID generation patterns in the codebase:

  • The function properly normalizes node IDs by converting to lowercase and replacing spaces/apostrophes, which is crucial for consistent ID generation
  • All direct usages of uuid5 with NAMESPACE_OID in other files serve different purposes (chunk IDs, dataset IDs) and don't conflict with node ID generation
  • The generate_node_id function is correctly imported and used in graph-related tasks (extract_graph_from_data.py and infer_data_ontology.py) for generating consistent node identifiers
🏁 Scripts executed

The following scripts were executed for the analysis:

Script:

#!/bin/bash
# Search for any other node ID generation implementations or usages
echo "Searching for other node ID generation patterns..."
rg -l "generate.*node.*id" --type py

echo "\nSearching for direct UUID5 usage with NAMESPACE_OID..."
rg -l "uuid5.*NAMESPACE_OID" --type py

echo "\nSearching for potential test files..."
fd "test.*node.*" --type py

Length of output: 1060


Script:

#!/bin/bash
# Let's examine the actual implementation in the files that use uuid5 with NAMESPACE_OID
echo "Checking uuid5 implementations in other files:"
rg -A 3 "uuid5.*NAMESPACE_OID" --type py

echo "\nChecking how generate_node_id is imported and used:"
rg -A 2 "from.*generate_node_id import|import.*generate_node_id" --type py

echo "\nChecking actual usage of generate_node_id function:"
rg -A 2 "generate_node_id\(" --type py

Length of output: 10351

cognee/modules/data/processing/document_types/Document.py (1)

4-6: Verify attribute access patterns after id removal

The removal of the id attribute and renaming of title to name could impact existing code that references these attributes. Let's verify all access patterns.

#!/bin/bash
# Search for any remaining references to document.id or document.title
rg -A 2 "document\.(?:id|title)\b" 

# Search for references to the new name attribute
rg -A 2 "document\.name\b"
cognee/infrastructure/databases/vector/models/ScoredResult.py (2)

2-2: LGTM: UUID import added correctly

The UUID import is appropriately added to support the type change.


6-6: Verify UUID migration impact

The change from str to UUID for the id field may require data migration or updates in other components that interact with ScoredResult.

Let's check for potential impacts:

✅ Verification successful

Let me analyze the usage of ScoredResult to verify the impact. I'll check the adapter implementations to see how they handle the id field.


UUID type change is safely handled across adapters

The change from str to UUID for the id field in ScoredResult is already properly handled in all vector database adapters:

  • QDrant: Uses UUID(result.id) for conversion
  • PGVector: Uses UUID(result.id) and UUID(str(row.id))
  • Weaviate: Uses UUID(result.uuid)
  • LanceDB: Uses UUID(result["id"])

All adapters are already converting their string IDs to UUID when constructing ScoredResult objects, making this type change backward compatible.

🏁 Scripts executed

The following scripts were executed for the analysis:

Script:

#!/bin/bash
# Description: Find potential places that might need updates due to the UUID change

# Look for direct usage of ScoredResult
echo "=== Direct ScoredResult usage ==="
rg -l "ScoredResult" --type py

# Look for string IDs being passed that might need conversion to UUID
echo -e "\n=== Potential string ID usage ==="
ast-grep --pattern 'id: str'

# Look for potential serialization/deserialization of ScoredResult
echo -e "\n=== Potential serialization points ==="
rg "json|dumps|loads.*ScoredResult" --type py

Length of output: 4134


Script:

#!/bin/bash
# Check how the id field is handled in adapter implementations
echo "=== QDrant Adapter Implementation ==="
rg -A 5 "ScoredResult" cognee/infrastructure/databases/vector/qdrant/QDrantAdapter.py

echo -e "\n=== PGVector Adapter Implementation ==="
rg -A 5 "ScoredResult" cognee/infrastructure/databases/vector/pgvector/PGVectorAdapter.py

echo -e "\n=== Weaviate Adapter Implementation ==="
rg -A 5 "ScoredResult" cognee/infrastructure/databases/vector/weaviate_db/WeaviateAdapter.py

echo -e "\n=== LanceDB Adapter Implementation ==="
rg -A 5 "ScoredResult" cognee/infrastructure/databases/vector/lancedb/LanceDBAdapter.py

Length of output: 3307

cognee/tasks/documents/extract_chunks_from_documents.py (2)

1-1: LGTM!

The import statement is clean and specific.


6-6: Verify Document.read() implementation

Let's verify that the Document class properly implements the read method with chunk_size parameter.

✅ Verification successful

Based on the shell script results, I can now generate the final response as we have found the relevant implementations. The Document class is properly implemented with specialized document type classes (TextDocument, PdfDocument, ImageDocument, AudioDocument) that handle the read method with chunk_size parameter appropriately:

Document.read() implementation verified and working as expected

The codebase shows a well-structured implementation where:

  • Base Document class in modules/data/processing/document_types/Document.py defines the interface
  • Each document type implements read() with chunk_size parameter:
    • TextDocument: Reads file in 1024-byte chunks
    • PdfDocument: Processes page by page using TextChunker
    • ImageDocument: Transcribes image and uses TextChunker
    • AudioDocument: Transcribes audio and uses TextChunker

All implementations properly handle the chunk_size parameter by utilizing the TextChunker class which enforces the specified chunk size constraints.

🏁 Scripts executed

The following scripts were executed for the analysis:

Script:

#!/bin/bash
# Search for Document class implementation and its read method
ast-grep --pattern 'class Document {
  $$$
  read($$$) {
    $$$
  }
  $$$
}'

# Backup: Search using ripgrep
rg -A 10 "class Document.*:|def read.*:"

Length of output: 8723

cognee/modules/engine/models/EntityType.py (2)

1-2: LGTM!

The imports are clean and necessary for the implementation.


9-11: Verify index field configuration for FalkorDB

The metadata configuration specifies 'name' as an indexed field. Let's verify if this aligns with FalkorDB's indexing capabilities and if additional fields should be indexed for optimal query performance.

✅ Verification successful

Let me gather more specific information about the FalkorDB indexing implementation and other DataPoint models.


Based on the analyzed code, I can now provide a final response about the index field configuration.

Index field configuration is correctly implemented for FalkorDB

The 'name' field indexing in EntityType is consistent with FalkorDB's implementation and the system's design:

  • The base DataPoint class properly supports index_fields in its metadata
  • FalkorDB adapter correctly handles indexed fields during node creation and vector indexing
  • The configuration aligns with related Entity model which also indexes the 'name' field
  • The implementation follows the pattern used across other DataPoint implementations
🏁 Scripts executed

The following scripts were executed for the analysis:

Script:

#!/bin/bash
# Description: Check for other EntityType-related index configurations in the codebase
# and FalkorDB integration code

# Look for FalkorDB index configurations
rg -l "index_fields|createIndex|CREATE INDEX" --type py

# Look for similar DataPoint implementations
ast-grep --pattern 'class $_ (DataPoint) {
  $$$
  _metadata: dict = {
    $$$
  }
}'

Length of output: 921


Script:

#!/bin/bash
# Check FalkorDB adapter implementation for index handling
cat cognee/infrastructure/databases/hybrid/falkordb/FalkorDBAdapter.py

# Check base DataPoint class for index field handling
cat cognee/infrastructure/engine/models/DataPoint.py

# Check Entity model since EntityType is related to it
cat cognee/modules/engine/models/Entity.py

Length of output: 10652

cognee/modules/engine/models/Entity.py (1)

1-3: Verify circular import dependencies

The relative import of EntityType from the same directory could potentially create circular dependencies if EntityType.py also imports Entity.

✅ Verification successful

No circular import dependencies detected

The codebase analysis shows that while __init__.py imports both Entity and EntityType, there is no circular dependency as EntityType.py does not import from Entity. The imports are clean and well-structured:

  • EntityType.py only imports DataPoint and DocumentChunk
  • The relative import of EntityType in Entity.py is safe to use
🏁 Scripts executed

The following scripts were executed for the analysis:

Script:

#!/bin/bash
# Check for potential circular imports
echo "Checking EntityType.py for potential circular imports with Entity:"
rg -l "from.*Entity import" "$(dirname $(fd -t f "EntityType.py$"))"

Length of output: 325


Script:

#!/bin/bash
# Let's check the actual content of both files to verify the imports
echo "Content of __init__.py:"
cat "$(fd -t f "__init__.py$" cognee/modules/engine/models)"

echo -e "\nContent of EntityType.py:"
cat "$(fd -t f "EntityType.py$")"

Length of output: 658

cognee/modules/chunking/models/DocumentChunk.py (1)

1-5: Verify DataPoint base class usage across the codebase

The transition from BaseModel to DataPoint is a significant architectural change. Let's ensure this pattern is consistently applied across related classes.

✅ Verification successful

DataPoint inheritance pattern is consistently applied

The transition from BaseModel to DataPoint is correctly implemented and follows a consistent pattern across the codebase. The evidence shows:

  • No remaining BaseModel usage in the chunking modules
  • DocumentChunk and Document both properly inherit from DataPoint
  • DataPoint is widely and consistently used across the codebase (20+ classes)
  • The inheritance structure aligns with the architectural pattern seen in other components (Entity, TextSummary, etc.)
🏁 Scripts executed

The following scripts were executed for the analysis:

Script:

#!/bin/bash
# Description: Check for any remaining BaseModel usage that should be DataPoint
# and verify consistent DataPoint usage in related classes

# Check for any remaining BaseModel usage in chunking modules
rg -l "class.*\(BaseModel\)" "cognee/modules/chunking"

# Check DataPoint usage in related document processing classes
rg -l "class.*\(DataPoint\)" "cognee/modules/data/processing"

Length of output: 174


Script:

#!/bin/bash
# Let's check the actual implementation of both DocumentChunk and Document classes
# to understand the inheritance pattern better

# Check DocumentChunk implementation
echo "=== DocumentChunk.py ==="
cat cognee/modules/chunking/models/DocumentChunk.py

echo -e "\n=== Document.py ==="
cat cognee/modules/data/processing/document_types/Document.py

# Also check if there are any other DataPoint usages
echo -e "\n=== Other DataPoint usages ==="
rg "class.*\(DataPoint\)" -A 2

Length of output: 5701

cognee/modules/data/processing/document_types/AudioDocument.py (2)

13-13: Verify TextChunker interface changes

The TextChunker instantiation has been modified to accept the document instance instead of just the ID. Please ensure this change is consistent across all document types.

#!/bin/bash
# Description: Check TextChunker usage across all document types
# Expected: All document types should use the new interface

# Check TextChunker instantiation pattern in all document types
rg -A 2 "TextChunker\(" --type py

# Verify TextChunker class definition accepts document instance
ast-grep --pattern 'class TextChunker {
  $$$
  def __init__($_, document, $_) {
    $$$
  }
  $$$
}'

Line range hint 1-16: Verify attribute availability in Document base class

The code uses self.raw_data_location but this attribute was removed from AudioDocument. Please ensure the Document base class provides this attribute.

✅ Verification successful

The previous scripts didn't show us the Document base class definition. Let's search for it specifically:


The raw_data_location attribute is properly defined in the base class

The Document base class (in cognee/modules/data/processing/document_types/Document.py) clearly defines the raw_data_location attribute that AudioDocument inherits and uses. The attribute removal from AudioDocument is intentional as it's now provided by the parent class.

🏁 Scripts executed

The following scripts were executed for the analysis:

Script:

#!/bin/bash
# Description: Check if Document base class provides raw_data_location attribute
# Expected: Find the attribute definition in Document class

ast-grep --pattern 'class Document {
  $$$
  raw_data_location
  $$$
}'

# Also check for property decorator or __init__ assignment
rg -A 5 "raw_data_location" --type py

Length of output: 1556


Script:

#!/bin/bash
# Search for Document base class definition and its attributes
ast-grep --pattern 'class Document {
  $$$
}'

# Also check for __init__ method and property definitions
rg -A 10 "class Document" --type py

Length of output: 3051

cognee/tasks/chunks/query_chunks.py (1)

13-13: Verify the "DocumentChunk_text" collection exists in the vector engine.

The search target has been changed from "chunks" to "DocumentChunk_text". Let's verify this collection exists and is properly populated.

✅ Verification successful

Let me gather more information about the DocumentChunk_text collection setup and usage.


Let me check the vector database adapters to understand how collections are handled.


Let me check one final thing about the collection initialization and the DocumentChunk model.


Collection name change is correct and properly configured

The change from "chunks" to "DocumentChunk_text" is valid because:

  1. The DocumentChunk model explicitly defines "text" as an index field in its metadata
  2. All vector database adapters support dynamic collection creation
  3. The collection name follows the pattern {model_name}_{index_field} which is consistent with the codebase's indexing strategy
🏁 Scripts executed

The following scripts were executed for the analysis:

Script:

#!/bin/bash
# Description: Check for references to DocumentChunk_text collection and its setup

# Check for collection setup/creation
rg -A 5 "DocumentChunk_text" 

# Check for vector engine implementation details
ast-grep --pattern 'class $_ implements VectorEngine {
  $$$
  search($collection, $query, $$$) {
    $$$
  }
  $$$
}'

Length of output: 492


Script:

#!/bin/bash
# Check for collection initialization or setup
rg -l "DocumentChunk" 

# Check for vector engine implementations and collection setup
ast-grep --pattern 'class DocumentChunk {
  $$$
}'

# Check for any migration or setup scripts
fd -e py -e sql | xargs rg -l "collection|index.*DocumentChunk"

Length of output: 1257


Script:

#!/bin/bash
# Check vector database adapters for collection setup
rg -A 10 "DocumentChunk.*text" cognee/infrastructure/databases/vector/

# Check the base interface
cat cognee/infrastructure/databases/vector/vector_db_interface.py

# Look for collection setup in database adapters
rg -A 5 "create.*collection|setup.*collection" cognee/infrastructure/databases/vector/

Length of output: 11726


Script:

#!/bin/bash
# Check DocumentChunk model and its text field
cat cognee/modules/chunking/models/DocumentChunk.py

# Check for collection initialization in startup or setup code
rg -l "init|setup|start" | xargs rg -l "DocumentChunk.*text"

# Check for any migration or initialization scripts
fd migration init setup | xargs rg -l "DocumentChunk"

Length of output: 1236

cognee/modules/data/processing/document_types/ImageDocument.py (1)

13-13: Verify initialization of required attributes

The code uses self.raw_data_location in the read method, but its initialization is not visible in this class. Please ensure this attribute is properly initialized by the parent Document class.

✅ Verification successful

Initialization of raw_data_location is properly handled

The verification shows that raw_data_location is correctly managed:

  • It's defined as a required field in the base Document class: raw_data_location: str
  • All document instances (including ImageDocument) receive this attribute during instantiation in classify_documents.py
  • The attribute is properly used across all document types for their respective read() operations
🏁 Scripts executed

The following scripts were executed for the analysis:

Script:

#!/bin/bash
# Check Document parent class for raw_data_location initialization
ast-grep --pattern 'class Document {
  $$$
  def __init__($$$) {
    $$$
    self.raw_data_location = $$$
    $$$
  }
  $$$
}'

# Check for any other initialization patterns
rg -l "raw_data_location" --type py

Length of output: 710


Script:

#!/bin/bash
# Check Document class implementation for raw_data_location
rg -A 10 "class Document" cognee/modules/data/processing/document_types/Document.py

# Check Data model class since it's referenced in multiple files
rg -A 10 "class Data" cognee/modules/data/models/Data.py

# Check if raw_data_location is passed during document creation
rg "raw_data_location" -A 3 -B 3

Length of output: 10193

cognee/modules/data/processing/document_types/PdfDocument.py (1)

16-16: Verify TextChunker API compatibility

The TextChunker instantiation now passes the entire document instance instead of just the ID. Please ensure this change is compatible with TextChunker's implementation.

✅ Verification successful

TextChunker API is compatible with document instance

The TextChunker class is correctly implemented to accept and utilize the document instance. The constructor signature __init__(self, document, get_text: callable, chunk_size: int = 1024) matches the usage pattern, and the document instance is properly used throughout the class for:

  • Storing the document reference (self.document)
  • Generating chunk IDs using document ID
  • Setting document reference in chunk metadata
🏁 Scripts executed

The following scripts were executed for the analysis:

Script:

#!/bin/bash
# Description: Check TextChunker's constructor signature and usage patterns

# Search for TextChunker class definition and its usage
ast-grep --pattern 'class TextChunker {
  $$$
  __init__($$$) {
    $$$
  }
  $$$
}'

# Find other document types using TextChunker
rg "TextChunker\(" --type py

Length of output: 762


Script:

#!/bin/bash
# Let's check the TextChunker's constructor implementation
rg -A 10 "class TextChunker" cognee/modules/chunking/TextChunker.py

# Also check how the document instance is used within TextChunker
rg "self\." cognee/modules/chunking/TextChunker.py

Length of output: 2213

cognee/infrastructure/databases/vector/pgvector/serialize_data.py (4)

1-2: LGTM! Clean and necessary imports.


6-7: LGTM! Clean dictionary handling with proper recursion.


8-9: LGTM! Clean list handling with proper recursion.


10-15: Consider handling additional edge cases.

The type handling looks good, but consider handling these additional cases:

  1. None values
  2. Custom objects with __str__ method
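
A sketch of how those cases could be folded into the recursive structure (the datetime/UUID branches are an assumption about what the existing type handling covers):

from datetime import datetime
from uuid import UUID

def serialize_data(data):
    if data is None:
        return None
    if isinstance(data, dict):
        return {key: serialize_data(value) for key, value in data.items()}
    if isinstance(data, list):
        return [serialize_data(item) for item in data]
    if isinstance(data, datetime):
        return data.isoformat()
    if isinstance(data, UUID):
        return str(data)
    if isinstance(data, (str, int, float, bool)):
        return data
    # last resort: objects that rely on a meaningful __str__
    return str(data)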

Let's verify the usage of this function in the codebase:

cognee/modules/data/processing/document_types/TextDocument.py (2)

Line range hint 1-21: Verify parent class contract after constructor removal

The removal of the constructor means this class relies on the parent Document class for initialization. Need to verify if required attributes are properly initialized.

#!/bin/bash
# Description: Check Document parent class implementation and usage
# Expected: Document class should initialize raw_data_location

# Check Document class implementation
ast-grep --pattern 'class Document {
  $$$
  def __init__($params) {
    $$$
  }
  $$$
}'

# Find all Document subclasses to ensure consistent pattern
rg -l "class \w+\(Document\)"

18-18: Verify TextChunker compatibility with the new signature

The TextChunker instantiation has been changed to pass self instead of self.id. This is a breaking change that needs verification.

✅ Verification successful

Let me gather more specific information about the TextChunker implementation.


TextChunker implementation is compatible with the changes

The TextChunker class is correctly implemented to accept a document instance as its first parameter, as shown in its constructor: def __init__(self, document, get_text: callable, chunk_size: int = 1024). The change from self.id to self in TextDocument is valid and matches the expected interface.

🏁 Scripts executed

The following scripts were executed for the analysis:

Script:

#!/bin/bash
# Description: Verify TextChunker implementation accepts Document instance
# Expected: TextChunker should have updated its constructor to accept Document instance

# Check TextChunker implementation
ast-grep --pattern 'class TextChunker {
  $$$
  def __init__($params) {
    $$$
  }
  $$$
}'

# Check for any other usages of TextChunker in the codebase
rg -l "TextChunker"

Length of output: 754


Script:

#!/bin/bash
# Check TextChunker's constructor implementation
ast-grep --pattern 'class TextChunker:
  $$$
  def __init__($params):
    $$$
  }' cognee/modules/chunking/TextChunker.py

# Get the full TextChunker class implementation for context
rg -A 10 "class TextChunker" cognee/modules/chunking/TextChunker.py

Length of output: 477

cognee/tasks/graph/extract_graph_from_code.py (2)

1-6: LGTM: Imports are well-organized and specific

The imports are clear, specific, and properly organized with standard library imports first, followed by internal modules.


9-11: Verify the concurrency impact on the system

The concurrent processing of chunks could potentially overwhelm the system if the input list is very large. Consider implementing batch processing or adding a concurrency limit.
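
A bounded-concurrency variant could look roughly like this; the helper name and the limit of 10 are illustrative, and the factory lambda stands in for the real extract_content_graph call:

import asyncio

async def gather_with_limit(coroutine_factories, max_concurrency: int = 10):
    # coroutine_factories: zero-argument callables that return coroutines,
    # e.g. lambda: extract_content_graph(chunk_text, graph_model)
    semaphore = asyncio.Semaphore(max_concurrency)

    async def run(factory):
        async with semaphore:
            return await factory()

    return await asyncio.gather(*[run(factory) for factory in coroutine_factories])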

cognee/infrastructure/engine/models/DataPoint.py (1)

10-24: Verify DataPoint usage across codebase

Since this is a base class, let's verify its implementation in derived classes.

✅ Verification successful

DataPoint class implementation appears to be used correctly

The verification shows that:

  • The _metadata field is accessed consistently across the codebase, primarily for indexing fields and data extraction
  • get_embeddable_data() is used appropriately in vector database operations for embedding data points
  • No direct inheritance of DataPoint was found, suggesting it's used as a standalone model rather than a base class

The implementation is safe and working as intended. The commented-out Config class can remain commented as the current usage pattern doesn't require private attribute protection.

🏁 Scripts executed

The following scripts were executed for the analysis:

Script:

#!/bin/bash
# Find all classes that inherit from DataPoint
echo "Classes inheriting from DataPoint:"
ast-grep --pattern 'class $NAME($$$DataPoint$$$)' 

# Check for any direct access to _metadata
echo -e "\nDirect _metadata access:"
rg '\._metadata\b' --type python

# Check for get_embeddable_data usage
echo -e "\nget_embeddable_data usage:"
rg '\.get_embeddable_data\b' --type python

Length of output: 406


Script:

#!/bin/bash
# Find Python files first
echo "Python files in the repository:"
fd '\.py$'

# Then use the file paths to search for patterns
echo -e "\nClasses inheriting from DataPoint:"
fd '\.py$' --exec ast-grep --pattern 'class $NAME($$$DataPoint$$$)' {}

echo -e "\nDirect _metadata access:"
fd '\.py$' --exec rg '\._metadata\b' {}

echo -e "\nget_embeddable_data usage:"
fd '\.py$' --exec rg '\.get_embeddable_data\b' {}

Length of output: 15903

cognee/infrastructure/databases/vector/config.py (1)

20-20: LGTM!

The dictionary serialization is correctly updated to include the new port configuration.

cognee/tasks/documents/classify_documents.py (1)

1-2: LGTM! Clean and focused imports.

The imports are well-organized and specific to the module's needs.

cognee/tasks/chunks/remove_disconnected_chunks.py (1)

1-2: LGTM! Clean import organization.

The imports are well-structured, with infrastructure imports preceding module-specific ones.

cognee/modules/data/operations/detect_language.py (2)

3-3: LGTM! Good logging practice.

Using logging.getLogger(__name__) is the recommended way to create module-level loggers.


24-28: LGTM! Clear and well-documented language mapping.

The special case handling for Croatian to Serbian mapping is clear and well-commented.

cognee/modules/graph/utils/get_model_instance_from_graph.py (2)

7-10: LGTM!

The node mapping implementation is efficient and provides O(1) lookup complexity.


12-28: ⚠️ Potential issue

Add validation and error handling for edge processing.

The edge processing logic needs additional safeguards:

  1. Edge array access could raise IndexError
  2. Missing nodes in node_map could raise KeyError
  3. In-place node_map mutation might cause issues with cyclic relationships
  4. No validation of edge_type values

Consider applying these improvements:

 for edge in edges:
+    if len(edge) < 3:
+        raise ValueError(f"Edge must have at least 3 elements: {edge}")
+    
+    source_id, target_id = edge[0], edge[1]
+    if source_id not in node_map:
+        raise KeyError(f"Source node {source_id} not found in nodes")
+    if target_id not in node_map:
+        raise KeyError(f"Target node {target_id} not found in nodes")
+    
     source_node = node_map[edge[0]]
     target_node = node_map[edge[1]]
     edge_label = edge[2]
     edge_properties = edge[3] if len(edge) == 4 else {}
     edge_metadata = edge_properties.get("metadata", {})
     edge_type = edge_metadata.get("type")
+    
+    if edge_type and edge_type not in ["list", "single"]:
+        raise ValueError(f"Invalid edge type: {edge_type}")

Also consider:

  1. Creating a new node_map copy instead of mutating in-place
  2. Adding cycle detection to prevent infinite recursion

Let's verify if there are any cyclic relationships in the test cases:

cognee/infrastructure/databases/graph/graph_db_interface.py (1)

6-7: Verify proper parameter usage in implementations.

Since this is a protocol class, we should verify that implementations properly use the parameterized queries to prevent injection attacks.

✅ Verification successful

Let me check the specific implementations of the query method in the identified files.


Security review of query parameter handling in implementations

Based on the implementations found:

  • NetworkX adapter has an empty implementation (pass statement) which is safe by default
  • Neo4j adapter correctly uses parameterized queries by passing parameters to session.run()
  • FalkorDB adapter was found but no query implementation was visible

Let's verify the FalkorDB implementation specifically:


Query parameter handling is properly implemented across adapters

All implementations handle query parameters securely:

  • Neo4j adapter uses parameterized queries via session.run(query, parameters=params)
  • FalkorDB adapter properly passes params to graph.query(query, params)
  • NetworkX adapter has an empty implementation (pass) which is safe by default

The parameterization is correctly used in critical operations like:

  • Node retrieval: Using $node_ids parameter
  • Connection queries: Using $node_id parameter
  • Edge existence checks: Using $edges parameter
  • Node deletion: Using $node_ids parameter
🏁 Scripts executed

The following scripts were executed for the analysis:

Script:

#!/bin/bash
# Description: Find implementations of GraphDBInterface to verify query parameter usage

# Find potential implementation files
echo "Looking for GraphDBInterface implementations..."
rg -l "class.*[\(: ]GraphDBInterface" 

# Look for query method implementations
echo -e "\nChecking query method implementations..."
ast-grep --pattern 'class $CLASS {
  $$$
  async def query(self, query: str, params: dict):
    $IMPL
  $$$
}'

Length of output: 601


Script:

#!/bin/bash
# Check implementations in the identified files
echo "=== NetworkX Adapter Implementation ==="
rg -A 10 "async def query" cognee/infrastructure/databases/graph/networkx/adapter.py

echo -e "\n=== Neo4j Driver Implementation ==="
rg -A 10 "async def query" cognee/infrastructure/databases/graph/neo4j_driver/adapter.py

echo -e "\n=== FalkorDB Implementation ==="
rg -A 10 "async def query" cognee/infrastructure/databases/hybrid/falkordb/FalkorDBAdapter.py

Length of output: 1207


Script:

#!/bin/bash
# Get full FalkorDB implementation to check query method
echo "=== FalkorDB Full Implementation ==="
cat cognee/infrastructure/databases/hybrid/falkordb/FalkorDBAdapter.py

Length of output: 9549

cognee/infrastructure/databases/graph/get_graph_engine.py (1)

35-35: Verify NetworkX adapter usage across the codebase

The NetworkX adapter import has been moved down as a fallback option. Let's verify if this adapter is still being used elsewhere in the codebase.

✅ Verification successful

Let me gather more context about the NetworkX adapter's implementation and usage patterns.


Based on the comprehensive review of the implementation, I can now provide the final response:

NetworkX adapter's position as fallback is correctly implemented

The NetworkX adapter is properly maintained as a fallback option with its import placement being appropriate because:

  • It serves as the default fallback when Neo4j and FalkorDB providers are not available
  • The implementation is complete and actively maintained with proper file persistence
  • The adapter is still actively used in the codebase, referenced in configuration and engine initialization
  • All necessary interfaces and functionality are properly implemented
🏁 Scripts executed

The following scripts were executed for the analysis:

Script:

#!/bin/bash
# Description: Check for NetworkX adapter usage across the codebase

# Search for NetworkX adapter imports and usage
echo "Searching for NetworkX adapter imports and usage..."
rg -l "NetworkXAdapter" 

# Search for graph file path configuration usage
echo "Searching for graph file path configuration usage..."
rg "graph_file_path"

Length of output: 831


Script:

#!/bin/bash
# Check NetworkXAdapter implementation and its dependencies
echo "Checking NetworkXAdapter implementation:"
cat cognee/infrastructure/databases/graph/networkx/adapter.py

# Check the get_graph_engine implementation to understand the fallback logic
echo -e "\nChecking get_graph_engine implementation:"
cat cognee/infrastructure/databases/graph/get_graph_engine.py

Length of output: 13139

cognee/modules/storage/utils/__init__.py (2)

8-15: LGTM! Well-implemented custom JSON encoder

The implementation correctly handles datetime and UUID serialization with proper fallback to the parent encoder.
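
For readers who have not opened the diff, the pattern being praised looks roughly like this (a sketch; the actual class lives in cognee/modules/storage/utils/__init__.py and may differ in detail):

import json
from datetime import datetime
from uuid import UUID

class JSONEncoder(json.JSONEncoder):
    def default(self, obj):
        if isinstance(obj, datetime):
            return obj.isoformat()
        if isinstance(obj, UUID):
            return str(obj)
        # anything else falls back to the parent encoder
        return super().default(obj)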


20-32: Consider adding type validation for include_fields values

While the function implementation is solid, it might be beneficial to validate that the values in include_fields are proper type annotations to prevent runtime errors when creating the model.

Consider adding validation:

 def copy_model(model: DataPoint, include_fields: dict = {}, exclude_fields: list = []):
+    for field_type in include_fields.values():
+        if not isinstance(field_type, tuple) or len(field_type) != 2:
+            raise ValueError("include_fields values must be tuples of (type, default)")
     fields = {
cognee/tasks/graph/query_graph_connections.py (3)

25-26: LGTM! Proper handling of node IDs

The defensive str() conversion ensures compatibility with different ID types.


40-40: LGTM! Efficient parallel connection gathering

Good use of asyncio.gather for parallel execution of connection queries.


30-31: Verify vector store schema changes

The search indices have been changed from "entities"/"classification" to "Entity_name"/"EntityType_name". Ensure all existing data has been migrated.

cognee/infrastructure/databases/vector/create_vector_engine.py (2)

Line range hint 31-50: Verify async implementation and add error handling

A few concerns about the PGVector adapter implementation:

  1. The connection string uses asyncpg, but it's unclear if PGVectorAdapter supports async operations
  2. Consider handling missing relational configuration
  3. Sensitive information (password) in connection string could be logged

Let's verify the async implementation:

✅ Verification successful

Let me gather more information about the PGVectorAdapter implementation.


Based on the gathered information, I can now provide the final response:

PGVectorAdapter fully supports async operations with proper error handling

The implementation is correctly using async/await throughout:

  • Uses create_async_engine and async_sessionmaker in constructor
  • All methods are properly implemented as async (embed_data, has_collection, create_collection, etc.)
  • Proper error handling in RelationalConfig through Pydantic's BaseSettings

The connection string using asyncpg is appropriate for this async implementation. Regarding the security concern about password in connection string - this is a standard practice when using SQLAlchemy, and the password is obtained from environment variables through RelationalConfig.

🏁 Scripts executed

The following scripts were executed for the analysis:

Script:

#!/bin/bash
# Check if PGVectorAdapter has async methods
ast-grep --pattern 'class PGVectorAdapter {
  $$$
  async def $_($$$) {
    $$$
  }
  $$$
}'

# Check for error handling in get_relational_config
rg -A 5 "def get_relational_config"

Length of output: 297


Script:

#!/bin/bash
# Check PGVectorAdapter implementation
rg -A 20 "class PGVectorAdapter"

# Check if there are any async operations in the file
rg "async|await" --type py

# Check RelationalConfig implementation
rg -A 10 "class RelationalConfig"

Length of output: 105200


51-57: Verify FalkorDBAdapter interface implementation

The adapter is imported from the hybrid package, suggesting it supports both vector and graph operations. Let's verify it implements all required interfaces.

✅ Verification successful

FalkorDBAdapter correctly implements VectorDBInterface

The FalkorDBAdapter properly implements all required methods from VectorDBInterface:

  • Collection operations: has_collection, create_collection
  • Data point operations: create_data_points, retrieve
  • Search operations: search, batch_search
  • Deletion operations: delete_data_points, prune

Additionally, it extends functionality by implementing GraphDBInterface, making it a true hybrid database adapter.

🏁 Scripts executed

The following scripts were executed for the analysis:

Script:

#!/bin/bash
# Check if FalkorDBAdapter implements required interfaces
ast-grep --pattern 'class FalkorDBAdapter implements $_Interface {
  $$$
}'

# Verify required methods are implemented
ast-grep --pattern 'class FalkorDBAdapter {
  $$$
  def $_vector_method($$$) {
    $$$
  }
  $$$
}'

Length of output: 179


Script:

#!/bin/bash
# Let's first locate the FalkorDBAdapter file
fd "FalkorDBAdapter.py" --type f

# Then examine its content to verify interface implementation
rg -A 2 "^class FalkorDBAdapter" 

# Check for vector operation methods
rg -A 1 "def (add_vectors|search_vectors|delete_vectors)" 

# Look for any base class or interface it might be inheriting from
rg "class.*\(" --type py | grep -i "vector"

Length of output: 3402


Script:

#!/bin/bash
# Let's examine the VectorDBInterface to see required methods
rg -A 10 "class VectorDBInterface" cognee/infrastructure/databases/vector/vector_db_interface.py

# Now check FalkorDBAdapter for these required methods
rg -B 1 -A 3 "def (add_vectors|search_vectors|delete_vectors)" cognee/infrastructure/databases/hybrid/falkordb/FalkorDBAdapter.py

Length of output: 563


Script:

#!/bin/bash
# Let's check the complete VectorDBInterface definition
cat cognee/infrastructure/databases/vector/vector_db_interface.py

# And then check the complete FalkorDBAdapter implementation
cat cognee/infrastructure/databases/hybrid/falkordb/FalkorDBAdapter.py

Length of output: 11151

cognee/tasks/chunk_naive_llm_classifier/chunk_naive_llm_classifier.py (1)

68-68: Verify vector engine compatibility with index_fields parameter.

The change from embed_field to index_fields parameter in DataPoint instantiation could impact vector indexing functionality. Please ensure:

  1. The vector engine supports this new parameter
  2. Search and retrieval operations remain functional
  3. All other DataPoint instantiations are updated accordingly

Also applies to: 107-107

✅ Verification successful

Based on the results, I can generate the final response as we have comprehensive information about the parameter usage across the codebase.

Parameter change is consistent with vector engine implementation

The change from embed_field to index_fields is verified to be correct and consistent:

  • No remaining usage of embed_field found in the codebase
  • index_fields is properly implemented in the DataPoint model (cognee/infrastructure/engine/models/DataPoint.py)
  • All vector engine adapters (LanceDB, Weaviate, QDrant, PGVector, FalkorDB) support and utilize the index_fields parameter
  • The parameter is consistently used across all DataPoint instantiations with the same pattern ["text"] or ["name"]
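A minimal sketch of the pattern (a simplified stand-in; in the real model the review notes the index_fields list is carried in _metadata rather than as a plain field):

from dataclasses import dataclass, field
from uuid import UUID, uuid4

@dataclass
class DataPointSketch:
    """Illustrative stand-in for cognee's DataPoint model, not the real class."""
    id: UUID
    text: str
    # Which fields the vector engine should embed; per the review this
    # replaces the older single embed_field parameter.
    index_fields: list = field(default_factory=lambda: ["text"])

    def get_embeddable_data(self) -> str:
        return " ".join(str(getattr(self, name)) for name in self.index_fields)

point = DataPointSketch(id=uuid4(), text="Quantum computers use qubits.")
print(point.get_embeddable_data())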
🏁 Scripts executed

The following scripts were executed for the analysis:

Script:

#!/bin/bash
# Description: Check for any remaining usage of embed_field and verify index_fields adoption

echo "Checking for remaining embed_field usage..."
rg "embed_field" --type py

echo "Verifying index_fields adoption..."
rg "index_fields" --type py

# Check for DataPoint instantiations
echo "Checking DataPoint instantiations..."
ast-grep --pattern 'DataPoint[$_]($_)'

Length of output: 3525

cognee/infrastructure/databases/vector/weaviate_db/WeaviateAdapter.py (2)

87-90: LGTM! Clean property handling

The explicit conversion of 'id' to 'uuid' in properties is a good practice for maintaining consistency with Weaviate's expectations.


171-173: LGTM! Proper UUID handling

The conversion of string IDs to UUID objects is correct and maintains type consistency across the system.

cognee/tasks/graph/infer_data_ontology.py (1)

23-23: Verify the utility functions import path change

The import path for generate_node_id and generate_node_name has been moved from graph.utils to engine.utils. Let's verify this change is consistent across the codebase.

✅ Verification successful

Let me gather more information to verify if these utility functions actually exist in the new location and if there are any other files that might be using these functions with different import patterns.


Let me check one final thing to verify if there are any imports of these functions in the other file that uses them.


Import path change is correctly implemented

The utility functions generate_node_id and generate_node_name are properly located in the cognee/modules/engine/utils/ directory and all files using these functions (infer_data_ontology.py and extract_graph_from_data.py) are consistently importing them from the new location cognee.modules.engine.utils. No references to the old import path remain in the codebase.

🏁 Scripts executed

The following scripts were executed for the analysis:

Script:

#!/bin/bash
# Search for any remaining imports from the old path
rg "from cognee.modules.graph.utils import (generate_node_id|generate_node_name)"

# Search for imports from the new path to confirm consistency
rg "from cognee.modules.engine.utils import (generate_node_id|generate_node_name)"

Length of output: 287


Script:

#!/bin/bash
# Check if the utility functions exist in the new location
rg "def (generate_node_id|generate_node_name)" --type py

# Look for any other usage patterns of these functions
rg "(generate_node_id|generate_node_name)\(" --type py

Length of output: 1999


Script:

#!/bin/bash
# Check imports in extract_graph_from_data.py
rg "from.*import.*(generate_node_id|generate_node_name)" cognee/tasks/graph/extract_graph_from_data.py

Length of output: 201

cognee/shared/utils.py (3)

3-3: LGTM! Explicit imports improve code clarity.

The change to import specific datetime components follows Python best practices and supports the timezone-aware datetime usage.


Line range hint 48-56: LGTM! Improved timestamp handling with UTC timezone.

The change ensures consistent timezone-aware timestamps for telemetry events, which is crucial for accurate event tracking across different regions.
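The endorsed pattern is presumably along these lines (a generic sketch, not the exact telemetry code):

from datetime import datetime, timezone

def current_utc_timestamp() -> str:
    # Timezone-aware UTC timestamp serialized as ISO 8601.
    return datetime.now(timezone.utc).isoformat()

print(current_utc_timestamp())  # e.g. 2024-11-11T12:34:56.789012+00:00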


162-164: Verify graph visualization compatibility with FalkorDB.

Since this PR is adding FalkorDB support (based on PR title), ensure that the graph visualization parameters align with FalkorDB's edge and node property names.

✅ Verification successful

Based on the search results, I can generate the final response as I have enough information to verify the compatibility:

Graph visualization parameters are compatible with FalkorDB's edge properties

The visualization code in utils.py uses "source_node", "target_node", and "relationship_name" which perfectly aligns with FalkorDB's edge property naming in FalkorDBAdapter.py. The adapter explicitly maps edges using these same property names:

  • from_node → source_node
  • to_node → target_node
  • relationship_name → relationship_name
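A small sketch of that mapping (the edge tuples and dictionary construction are assumptions, not the adapter's code):

def prepare_edges_for_visualization(edges):
    """Convert (from_node, to_node, relationship_name) tuples into the
    property names the visualization code expects."""
    return [
        {
            "source_node": str(from_node),
            "target_node": str(to_node),
            "relationship_name": relationship_name,
        }
        for from_node, to_node, relationship_name in edges
    ]

print(prepare_edges_for_visualization([("a", "b", "mentions")]))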
🏁 Scripts executed

The following scripts were executed for the analysis:

Script:

#!/bin/bash
# Description: Check for FalkorDB-specific edge property names in the codebase

# Search for FalkorDB edge property definitions
rg -l "FalkorDB|falkordb" | xargs rg -A 5 "edge.*property|relationship.*name"

# Search for graph visualization calls
rg "render_graph|prepare_edges" -A 5

Length of output: 8098

cognee/modules/chunking/TextChunker.py (1)

45-45: Ensure self.document.id is not None before using it

When generating the UUID for the chunk id, the code assumes that self.document.id is not None. If self.document.id is None, this could raise an exception. It's important to verify that self.document.id is set and valid before using it.

A quick check of the call sites would confirm whether self.document.id can ever be None; in any case, ensure that self.document.id is assigned a valid value before it is used.

Also applies to: 62-62
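A guard along the following lines would address it (a sketch only; the exact chunk id derivation used by TextChunker is an assumption):

from uuid import NAMESPACE_OID, uuid5

def make_chunk_id(document_id, chunk_index: int):
    # Fail fast instead of deriving a chunk id from a missing document id.
    if document_id is None:
        raise ValueError("document.id must be set before chunking")
    return uuid5(NAMESPACE_OID, f"{document_id}-{chunk_index}")

print(make_chunk_id("doc-123", 0))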

cognee/api/v1/cognify/cognify_v2.py (3)

12-12: Import Statement Added Correctly

The import of run_tasks is appropriate and necessary for the pipeline execution.


18-20: Imports Added for Task Functions

The added imports for classify_documents, check_permissions_on_documents, extract_chunks_from_documents, extract_graph_from_data, and add_data_points are appropriate and correspond to their usage in the updated pipeline tasks.


87-94: Pipeline Tasks Configured Appropriately

The updated tasks in the pipeline reflect the new processing logic and are correctly configured. The sequence of tasks—classifying documents, checking permissions, extracting chunks, adding data points, extracting graphs, and summarizing text—is logically ordered and aligns with the intended data processing flow.
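Schematically the ordering reads like this (Task here is an illustrative stand-in, not the exact cognee wrapper or task signatures):

from dataclasses import dataclass
from typing import Callable

@dataclass
class Task:
    """Illustrative stand-in for cognee's task wrapper."""
    run: Callable

def classify_documents(data): return data
def check_permissions_on_documents(data): return data
def extract_chunks_from_documents(data): return data
def add_data_points(data): return data
def extract_graph_from_data(data): return data
def summarize_text(data): return data

tasks = [
    Task(classify_documents),
    Task(check_permissions_on_documents),
    Task(extract_chunks_from_documents),
    Task(add_data_points),
    Task(extract_graph_from_data),
    Task(summarize_text),
]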

cognee/infrastructure/databases/vector/qdrant/QDrantAdapter.py (1)

108-109: Ensure consistent ID types across the system

Converting data_point.id to a string when assigning to id in PointStruct might introduce inconsistencies if other parts of the system expect different ID types (e.g., UUIDs). Verify that all components use the same ID format to prevent potential issues.

cognee/infrastructure/databases/vector/lancedb/LanceDBAdapter.py (1)

227-234: Verify proper field handling in copy_model

Ensure that the copy_model function accurately includes the id field as (str, ...) and excludes _metadata. Incorrect configurations may lead to unexpected behaviors in the generated schema.

Run the following script to check the fields in the copied models:

✅ Verification successful

Based on the gathered information, I can now provide a final response. The code shows that:

  1. The base DataPoint class (in cognee/infrastructure/engine/models/DataPoint.py) has id as UUID and _metadata as an optional field
  2. The copy_model function (in cognee/modules/storage/utils/__init__.py) correctly handles field copying with include/exclude functionality
  3. All DataPoint implementations consistently have id as a required field and _metadata as an optional dictionary

The copy_model configuration is correct

The implementation in LanceDBAdapter.py properly:

  • Includes the id field as (str, ...) to ensure it's a required string field
  • Excludes _metadata as specified
  • Aligns with the base DataPoint model structure and its implementations throughout the codebase
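In rough terms the pattern being reviewed behaves like this sketch (assumes pydantic v2; not the actual copy_model implementation):

from pydantic import BaseModel, create_model

class DataPointSketch(BaseModel):
    # Illustrative stand-in for the DataPoint model discussed above.
    id: str
    text: str = ""

def copy_model(model, include_fields=None, exclude_fields=None):
    """Build a new pydantic model from `model`, overriding or dropping fields."""
    exclude_fields = set(exclude_fields or [])
    fields = {
        name: (f.annotation, ... if f.is_required() else f.default)
        for name, f in model.model_fields.items()
        if name not in exclude_fields
    }
    fields.update(include_fields or {})
    return create_model(f"{model.__name__}Copy", **fields)

# Force `id` to be a required string; a private `_metadata` field, if present,
# would simply be left out of the copied schema.
LanceSchemaSketch = copy_model(DataPointSketch, include_fields={"id": (str, ...)})
print(list(LanceSchemaSketch.model_fields))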
🏁 Scripts executed

The following scripts were executed for the analysis:

Script:

#!/bin/bash
# Description: Verify that the 'id' field in copied models is of type 'str' and '_metadata' is excluded.

ast-grep --pattern '
class $_(BaseModel) {
  id: str
}
' --ignore '*.pyc' .

Length of output: 325


Script:

#!/bin/bash
# Let's first find the copy_model function definition
rg "def copy_model" -A 10

# Then let's search for BaseModel classes with id field
rg "class.*BaseModel.*id.*str" -A 5

# Also search for any usage of copy_model to understand the pattern
rg "copy_model\(" -A 3

Length of output: 3005


Script:

#!/bin/bash
# Let's examine the DataPoint class and its implementations to understand the base schema
rg "class.*DataPoint" -A 10

# Let's also check if there are any test files that verify the copy_model functionality
rg "test.*copy_model" -A 10

Length of output: 21207

cognee/infrastructure/databases/vector/pgvector/PGVectorAdapter.py (3)

161-165: Ensure result.id is a valid UUID string before conversion

When converting result.id to a UUID with UUID(result.id), it's important to ensure that result.id is indeed a valid UUID string. If result.id is already a UUID or not in the correct format, this conversion may raise an exception. Consider adding validation or exception handling to manage potential errors during the conversion.
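A defensive helper along these lines would cover both cases (a sketch, not the adapter's code):

from uuid import UUID

def to_uuid(value) -> UUID:
    # Pass through values that are already UUIDs; otherwise parse, surfacing
    # a clear error when the string is not a valid UUID.
    if isinstance(value, UUID):
        return value
    try:
        return UUID(str(value))
    except ValueError as error:
        raise ValueError(f"Invalid UUID value: {value!r}") from error

print(to_uuid("12345678-1234-5678-1234-567812345678"))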


Line range hint 235-239: Review the conversion of row.id to UUID

In the search method, you're converting row.id to a string before passing it to UUID(). If row.id is not a string (e.g., an integer or already a UUID), this could cause unexpected behavior. Verify the type of row.id to ensure that the conversion to UUID is appropriate and won't result in errors.


116-116: Confirm correct serialization of data_point payload

Using serialize_data(data_point.model_dump()) ensures that the payload is properly serialized before storage. Confirm that serialize_data handles all necessary data types within data_point.model_dump() to prevent any serialization issues.

Comment on lines +3 to +4
def generate_node_id(node_id: str) -> str:
return uuid5(NAMESPACE_OID, node_id.lower().replace(" ", "_").replace("'", ""))

⚠️ Potential issue

Fix return type to match type hint

The function declares it returns str but uuid5() returns a UUID object. This violates the type hint and could cause issues in strongly typed contexts.

Apply this fix:

def generate_node_id(node_id: str) -> str:
-    return uuid5(NAMESPACE_OID, node_id.lower().replace(" ", "_").replace("'", ""))
+    return str(uuid5(NAMESPACE_OID, node_id.lower().replace(" ", "_").replace("'", "")))
📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change
def generate_node_id(node_id: str) -> str:
return uuid5(NAMESPACE_OID, node_id.lower().replace(" ", "_").replace("'", ""))
def generate_node_id(node_id: str) -> str:
return str(uuid5(NAMESPACE_OID, node_id.lower().replace(" ", "_").replace("'", "")))

🛠️ Refactor suggestion

Add input validation and handle more special characters

The function should validate input and handle additional special characters that could cause issues in node IDs.

Consider this enhanced implementation:

def generate_node_id(node_id: str) -> str:
+    if not node_id:
+        raise ValueError("Node ID cannot be empty")
+    
+    # Replace or remove common special characters that might cause issues
+    processed_id = (
+        node_id.lower()
+        .replace(" ", "_")
+        .replace("'", "")
+        .replace("-", "_")
+        .replace(".", "_")
+        .replace("/", "_")
+        .replace("\\", "_")
+    )
+    
+    # Remove any remaining non-alphanumeric characters except underscore
+    processed_id = "".join(c for c in processed_id if c.isalnum() or c == "_")
+    
+    return str(uuid5(NAMESPACE_OID, processed_id))
📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change
def generate_node_id(node_id: str) -> str:
return uuid5(NAMESPACE_OID, node_id.lower().replace(" ", "_").replace("'", ""))
def generate_node_id(node_id: str) -> str:
if not node_id:
raise ValueError("Node ID cannot be empty")
# Replace or remove common special characters that might cause issues
processed_id = (
node_id.lower()
.replace(" ", "_")
.replace("'", "")
.replace("-", "_")
.replace(".", "_")
.replace("/", "_")
.replace("\\", "_")
)
# Remove any remaining non-alphanumeric characters except underscore
processed_id = "".join(c for c in processed_id if c.isalnum() or c == "_")
return str(uuid5(NAMESPACE_OID, processed_id))

@@ -1,10 +1,8 @@
from uuid import UUID
from typing import Protocol
from cognee.infrastructure.engine import DataPoint

💡 Codebase verification

Breaking changes detected in Document class usage

The verification reveals incompatibility between the new Document implementation and its current usage in the codebase:

  • Several files still reference document.id which is no longer available after the Protocol to DataPoint migration:
    • cognee/tasks/documents/check_permissions_on_documents.py
    • cognee/modules/chunking/TextChunker.py
    • cognee/api/v1/cognify/code_graph_pipeline.py
    • cognee/api/v1/cognify/cognify_v2.py

The change from Protocol to DataPoint needs to be accompanied by updates to all code that depends on the removed id attribute.

🔗 Analysis chain

Verify the impact of Protocol removal

The change from Protocol to DataPoint inheritance represents a significant architectural shift. While this aligns with the broader refactoring effort, we should ensure all implementations of Document continue to provide required functionality.

Also applies to: 3-3


The previous results showed Document usage and id/title references, but we need to verify:

  1. The actual Document subclass implementations to ensure they're compatible
  2. The Protocol removal impact
  3. The DataPoint base class implementation to understand what functionality it provides

This will help us make a definitive assessment about the architectural change impact.


The previous results showed all Document subclasses implement a consistent interface with type and read() method, but we need to:

  1. Locate and understand the DataPoint base class since it wasn't in the expected location
  2. Verify how Document implements/uses DataPoint's functionality
  3. Check for any breaking changes in Document usage patterns

This will help us make a final assessment about the compatibility and impact of this change.

🏁 Scripts executed

The following scripts were executed for the analysis:

Script:

#!/bin/bash
# Find all classes that inherit from Document to verify they remain compatible
ast-grep --pattern 'class $_(Document):
  $$$'

# Find all locations where Document is instantiated to verify usage patterns
rg -l "Document\(" 

Length of output: 3825


Script:

#!/bin/bash
# Let's examine the actual implementations of Document subclasses
ast-grep --pattern 'class $_(Document):
  $$$'

# Check for any direct Protocol usage in Document-related files
rg -A 2 "Protocol" cognee/modules/data/processing/document_types/

# Check DataPoint implementation to understand the new base class
rg -A 5 "class DataPoint" cognee/infrastructure/engine.py

Length of output: 5462


Script:

#!/bin/bash
# Let's find the DataPoint class as it seems to have moved
fd -g "*engine*.py" -x rg -A 5 "class DataPoint" {}

# Check the Document class implementation to understand what it inherits from DataPoint
rg -A 10 "class Document" cognee/modules/data/processing/document_types/Document.py

# Look for any breaking changes in Document usage
rg -A 2 "document\.(id|title|name)" --type py

Length of output: 1738

Comment on lines +4 to +7
async def extract_chunks_from_documents(documents: list[Document], chunk_size: int = 1024):
for document in documents:
for document_chunk in document.read(chunk_size = chunk_size):
yield document_chunk

🛠️ Refactor suggestion

Add documentation and improve error handling

While the implementation is efficient using async generators, there are several improvements needed:

  1. Add docstring explaining:
    • Function purpose
    • Parameter descriptions
    • Return value type
    • Example usage
  2. Add error handling for document.read()
  3. Add input validation
  4. Add return type hint using AsyncGenerator

Here's the suggested implementation:

-async def extract_chunks_from_documents(documents: list[Document], chunk_size: int = 1024):
-    for document in documents:
-        for document_chunk in document.read(chunk_size = chunk_size):
-            yield document_chunk
+import logging
+from typing import AsyncGenerator
+
+async def extract_chunks_from_documents(
+    documents: list[Document],
+    chunk_size: int = 1024
+) -> AsyncGenerator[str, None]:
+    """Extract text chunks from a list of documents.
+
+    Args:
+        documents: List of Document objects to process
+        chunk_size: Size of each chunk in bytes (default: 1024)
+
+    Yields:
+        Text chunks from the documents
+
+    Raises:
+        ValueError: If chunk_size <= 0 or documents list is empty
+    """
+    if not documents:
+        raise ValueError("Documents list cannot be empty")
+    if chunk_size <= 0:
+        raise ValueError("Chunk size must be positive")
+
+    for document in documents:
+        try:
+            for document_chunk in document.read(chunk_size=chunk_size):
+                yield document_chunk
+        except Exception as e:
+            # Log the error and continue with next document
+            logging.error(f"Error processing document: {str(e)}")
+            continue

Committable suggestion skipped: line range outside the PR's diff.

    query = f"""
-        MATCH (from_node:`{from_node}`)-[relationship:`{edge_label}`]->(to_node:`{to_node}`)
+        MATCH (from_node:`{str(from_node)}`)-[relationship:`{edge_label}`]->(to_node:`{str(to_node)}`)

⚠️ Potential issue

Prevent Cypher injection by avoiding direct interpolation of variables

Directly interpolating variables into Cypher queries can lead to injection vulnerabilities. Instead, use query parameters to safely pass variables into your queries.

Refactor the queries to use parameters:

For has_edge method:

-    query = f"""
-        MATCH (from_node:`{str(from_node)}`)-[relationship:`{edge_label}`]->(to_node:`{str(to_node)}`)
-        RETURN COUNT(relationship) > 0 AS edge_exists
-    """
+    query = """
+        MATCH (from_node)-[relationship]->(to_node)
+        WHERE from_node.id = $from_node_id AND to_node.id = $to_node_id AND type(relationship) = $edge_label
+        RETURN COUNT(relationship) > 0 AS edge_exists
+    """
    params = {
-       # No parameters were previously defined
+       "from_node_id": str(from_node),
+       "to_node_id": str(to_node),
+       "edge_label": edge_label,
    }
    edge_exists = await self.query(query, params)

For add_edge method:

-    query = f"""MATCH (from_node:`{str(from_node)}` {{id: $from_node}}), 
-     (to_node:`{str(to_node)}` {{id: $to_node}})
-     MERGE (from_node)-[r:`{relationship_name}`]->(to_node)
-     ON CREATE SET r += $properties, r.updated_at = timestamp()
-     ON MATCH SET r += $properties, r.updated_at = timestamp()
-     RETURN r"""
+    query = """MATCH (from_node {id: $from_node}),
+     (to_node {id: $to_node})
+     MERGE (from_node)-[r]->(to_node)
+     ON CREATE SET r += $properties, r.updated_at = timestamp(), r.type = $relationship_name
+     ON MATCH SET r += $properties, r.updated_at = timestamp()
+     RETURN r"""
    params = {
-       "from_node": str(from_node),
-       "to_node": str(to_node),
+       "from_node": str(from_node),
+       "to_node": str(to_node),
+       "relationship_name": relationship_name,
        "properties": serialized_properties
    }

By using parameters, you enhance security and maintainability of your code.

Also applies to: 177-182

@coderabbitai coderabbitai bot left a comment

Actionable comments posted: 3

🧹 Outside diff range and nitpick comments (5)
cognee/tests/test_neo4j.py (1)

39-40: Make the test more robust and data-independent.

The test is tightly coupled to specific search terms and data. Consider these improvements:

  1. Add assertions to verify the content of search results, not just their existence
  2. Add error handling for failed searches
  3. Make the test data-independent by using test fixtures or parameterization

Example improvement:

-    random_node = (await vector_engine.search("Entity_name", "Quantum computer"))[0]
-    random_node_name = random_node.payload["text"]
+    try:
+        search_results = await vector_engine.search("Entity_name", "Quantum computer")
+        assert len(search_results) > 0, "No search results found"
+        random_node = search_results[0]
+        assert "text" in random_node.payload, "Missing 'text' in payload"
+        random_node_name = random_node.payload["text"]
+        assert random_node_name, "Empty node name"
+    except Exception as e:
+        logging.error(f"Search failed: {str(e)}")
+        raise
cognee/infrastructure/databases/graph/neo4j_driver/adapter.py (1)

Line range hint 341-355: Optimize connection queries by combining them

Instead of running two separate queries for predecessors and successors, consider using a single query with UNION to reduce database round trips.

Consider this optimization:

 async def get_connections(self, node_id: UUID) -> list:
-    predecessors_query = """
-    MATCH (node)<-[relation]-(neighbour)
-    WHERE node.id = $node_id
-    RETURN neighbour, relation, node
-    """
-    successors_query = """
-    MATCH (node)-[relation]->(neighbour)
-    WHERE node.id = $node_id
-    RETURN node, relation, neighbour
-    """
+    query = """
+    MATCH (node)
+    WHERE node.id = $node_id
+    CALL {
+        WITH node
+        MATCH (node)<-[relation]-(neighbour)
+        RETURN neighbour, relation, node
+        UNION
+        MATCH (node)-[relation]->(neighbour)
+        RETURN node, relation, neighbour
+    }
+    RETURN neighbour, relation, node
+    """

-    predecessors, successors = await asyncio.gather(
-        self.query(predecessors_query, dict(node_id = str(node_id))),
-        self.query(successors_query, dict(node_id = str(node_id))),
-    )
+    results = await self.query(query, dict(node_id = str(node_id)))
cognee/infrastructure/databases/vector/weaviate_db/WeaviateAdapter.py (3)

13-18: Clarify the Purpose and Usage of IndexSchema

The IndexSchema class extends DataPoint and introduces a text field along with _metadata specifying index_fields. Ensure that this class aligns with your indexing strategy and that all necessary fields are correctly defined for Weaviate's schema requirements.


90-95: Avoid Mutating the Original Properties Dictionary

Modifying the properties dictionary in place might lead to unintended side effects if data_point is reused elsewhere. Consider creating a copy of the properties before making modifications to prevent potential issues.

Apply this diff to create a copy before modification:

- properties = data_point.model_dump()
+ properties = data_point.model_dump().copy()

if "id" in properties:
    properties["uuid"] = str(data_point.id)
    del properties["id"]

103-116: Catch Specific Exceptions During Data Insertion

In the create_data_points method, catching a generic Exception may hide unexpected errors. It's better to catch specific exceptions related to Weaviate operations to handle known error cases effectively.

Apply this diff to catch specific exceptions:

- except Exception as error:
+ except weaviate.WeaviateException as error:

Ensure that WeaviateException is imported from the Weaviate client library:

from weaviate import WeaviateException
📜 Review details

Configuration used: CodeRabbit UI
Review profile: CHILL

📥 Commits

Reviewing files that changed from the base of the PR and between d733bfd and 08a8442.

📒 Files selected for processing (4)
  • cognee/infrastructure/databases/graph/neo4j_driver/adapter.py (10 hunks)
  • cognee/infrastructure/databases/vector/weaviate_db/WeaviateAdapter.py (4 hunks)
  • cognee/tests/test_neo4j.py (1 hunks)
  • cognee/tests/test_weaviate.py (1 hunks)
🔇 Additional comments (6)
cognee/tests/test_neo4j.py (1)

40-40: Verify payload schema changes across the codebase.

The change from accessing "name" to "text" in the payload suggests a schema change. Let's verify this change is consistent.

#!/bin/bash
# Search for payload access patterns
echo "Searching for old 'name' payload access:"
rg -l "payload\[\"name\"\]|payload\['name'\]" --type python

echo -e "\nSearching for new 'text' payload access:"
rg -l "payload\[\"text\"\]|payload\['text'\]" --type python

# Look for any schema definitions
echo -e "\nSearching for payload schema definitions:"
rg -l "payload.*=.*{.*name|payload.*=.*{.*text" --type python
cognee/infrastructure/databases/graph/neo4j_driver/adapter.py (3)

66-69: Set updated_at timestamp on node creation

The updated_at timestamp is only set in the ON MATCH clause. For consistency, it should also be set during node creation.
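The fix presumably amounts to setting the timestamp in both branches, roughly as follows (Cypher shown inside a Python string to match the adapter's style; parameter names are assumptions):

add_node_query = """
MERGE (n {id: $node_id})
ON CREATE SET n += $properties, n.updated_at = timestamp()
ON MATCH SET n += $properties, n.updated_at = timestamp()
RETURN n
"""
params = {"node_id": "some-uuid", "properties": {"name": "example"}}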


139-144: Use parameterized queries for edge operations

Direct string interpolation in Cypher queries can lead to injection vulnerabilities.


400-409: LGTM! Clean UUID serialization implementation

The property serialization logic properly handles UUID conversion while maintaining type safety.

cognee/infrastructure/databases/vector/weaviate_db/WeaviateAdapter.py (2)

121-131: Previous Comments Still Apply Regarding Error Handling and Naming

The previous review comments about adding error handling and using a more robust naming convention in create_vector_index and index_data_points are still applicable. Specifically, checking for the existence of a collection before creation and handling empty data points would improve robustness.


175-177: Verify Correctness of Score Adjustment in Search Results

The score is adjusted using 1 - float(result.metadata.score). Please verify that this calculation correctly reflects the intended scoring mechanism. In some cases, Weaviate returns distance metrics where a lower score indicates higher similarity, so inverting the score might not be necessary.

@@ -36,8 +36,8 @@ async def main():

from cognee.infrastructure.databases.vector import get_vector_engine
vector_engine = get_vector_engine()
-random_node = (await vector_engine.search("entities", "AI"))[0]
-random_node_name = random_node.payload["name"]
+random_node = (await vector_engine.search("Entity_name", "Quantum computer"))[0]

💡 Codebase verification

Schema change is inconsistent across the codebase

There appears to be an inconsistency in the schema usage:

  • Most test files and tasks use Entity_name for vector searches
  • The demo notebook (notebooks/cognee_demo.ipynb) still uses the old entities schema
  • The payload access has changed from payload["name"] to payload["text"]
  • notebooks/cognee_demo.ipynb: Update vector search to use Entity_name instead of entities and update payload access from name to text
🔗 Analysis chain

Verify schema changes across the codebase.

The change from searching "entities" to "Entity_name" suggests a schema change. Let's verify this change is consistent across the codebase.

🏁 Scripts executed

The following scripts were executed for the analysis:

Script:

#!/bin/bash
# Search for any remaining references to the old schema
echo "Searching for old 'entities' references in vector searches:"
rg -l "vector_engine.*search.*entities" --type python

echo -e "\nSearching for new 'Entity_name' usage:"
rg -l "vector_engine.*search.*Entity_name" --type python

# Check for any potential schema definitions
echo -e "\nSearching for schema definitions:"
rg -l "Entity_name.*=|entities.*=" --type python

Length of output: 786


Script:

#!/bin/bash
# Search for any remaining references to the old schema
echo "Searching for old 'entities' references in vector searches:"
rg "vector_engine.*search.*entities" -A 2 -B 2

echo -e "\nSearching for new 'Entity_name' usage:"
rg "vector_engine.*search.*Entity_name" -A 2 -B 2

# Look for schema or model definitions
echo -e "\nSearching for schema/model definitions:"
rg "class.*Entity|Entity_name" -A 5 -B 5

Length of output: 9628

Comment on lines +38 to +39
random_node = (await vector_engine.search("Entity_name", "Quantum computer"))[0]
random_node_name = random_node.payload["text"]

💡 Codebase verification

Schema changes need attention: inconsistent field names across codebase

The verification reveals inconsistencies that need to be addressed:

  • entities in utils.py and data_models.py refers to NLP/data model concepts, not the schema field - these are fine
  • The schema field Entity_name is consistently used in all test files and query_graph_connections.py
  • The payload field text is consistently used in all test files
  • However, EntityType_name appears in query_graph_connections.py which suggests a broader schema pattern that should be documented

The schema changes in the test file are correct, but the presence of both Entity_name and EntityType_name in production code suggests a need to:

  1. Document the schema field naming convention
  2. Consider adding schema validation to prevent accidental use of incorrect field names
🔗 Analysis chain

LGTM! The test is now more focused and aligned with the test data.

The changes make the test more relevant by searching for "Quantum computer" which exists in the test data, improving the test's reliability.

Let's verify if these schema changes are consistently applied across the codebase:

🏁 Scripts executed

The following scripts were executed for the analysis:

Script:

#!/bin/bash
# Description: Verify the schema changes across the codebase
# Check for any remaining references to the old schema ("entities" field and "name" in payload)

echo "Checking for old field name 'entities'..."
rg -l "entities" --type py

echo "Checking for payload references using 'name'..."
rg -l 'payload\["name"\]' --type py

echo "Checking for new field name 'Entity_name'..."
rg -l "Entity_name" --type py

echo "Checking for payload references using 'text'..."
rg -l 'payload\["text"\]' --type py

Length of output: 944


Script:

#!/bin/bash
# Let's check the actual usage context of "entities" in the files found
echo "Checking usage of 'entities' in utils.py:"
rg "entities" -C 2 cognee/shared/utils.py

echo -e "\nChecking usage of 'entities' in data_models.py:"
rg "entities" -C 2 cognee/shared/data_models.py

echo -e "\nChecking Entity_name usage in tasks/graph/query_graph_connections.py:"
rg "Entity_name" -C 2 cognee/tasks/graph/query_graph_connections.py

Length of output: 1268

Comment on lines 79 to 85
async def add_nodes(self, nodes: list[DataPoint]) -> None:
query = """
UNWIND $nodes AS node
MERGE (n {id: node.node_id})
ON CREATE SET n += node.properties
ON MATCH SET n += node.properties
ON MATCH SET n.updated_at = timestamp()

⚠️ Potential issue

Prevent label injection in add_nodes query

The apoc.create.addLabels call could be vulnerable to injection if node_id contains malicious input. Consider validating the node_id format before using it as a label.

Apply this diff to add validation:

 async def add_nodes(self, nodes: list[DataPoint]) -> None:
+    def validate_label(label: str) -> bool:
+        return bool(re.match(r'^[A-Za-z0-9_]+$', label))
+
     query = """
     UNWIND $nodes AS node
+    WITH node WHERE $validate_label(node.node_id)
     MERGE (n {id: node.node_id})
     ON CREATE SET n += node.properties
     ON MATCH SET n += node.properties
     ON MATCH SET n.updated_at = timestamp()
     WITH n, node.node_id AS label
     CALL apoc.create.addLabels(n, [label]) YIELD node AS labeledNode
     RETURN ID(labeledNode) AS internal_id, labeledNode.id AS nodeId
     """

Committable suggestion skipped: line range outside the PR's diff.

@coderabbitai coderabbitai bot left a comment

Actionable comments posted: 8

🧹 Outside diff range and nitpick comments (13)
.github/workflows/test_qdrant.yml (1)

8-8: LGTM! Enhanced workflow trigger configuration.

The addition of the 'synchronize' event type is a good improvement. This ensures that tests will run both when the PR is labeled and when new commits are pushed, providing better continuous integration coverage.

This change follows GitHub Actions best practices by ensuring that tests are re-run whenever the code changes, helping maintain code quality throughout the PR lifecycle.

.github/workflows/test_weaviate.yml (1)

8-8: Consider optimizing GitHub Actions usage.

While running tests on both 'labeled' and 'synchronize' events improves reliability, it might lead to redundant workflow runs. Consider adding a conditional check to skip the workflow if the commit is very recent (e.g., within the last few minutes) to optimize GitHub Actions usage.

Example optimization:

jobs:
  check_recent_runs:
    runs-on: ubuntu-latest
    outputs:
      should_run: ${{ steps.check.outputs.should_run }}
    steps:
      - id: check
        run: |
          # Check if there's been a successful run in the last 5 minutes
          recent_run=$(gh run list --workflow=test_weaviate.yml --json conclusion,createdAt --jq '[.[] | select(.conclusion=="success" and ((now - (.createdAt | fromdate)) < 300))] | length')
          echo "should_run=$([ "$recent_run" -eq 0 ] && echo 'true' || echo 'false')" >> $GITHUB_OUTPUT

  run_weaviate_integration_test:
    needs: [check_recent_runs, get_docs_changes]
    if: needs.check_recent_runs.outputs.should_run == 'true' && needs.get_docs_changes.outputs.changes_outside_docs == 'true' && github.event.label.name == 'run-checks'
    # ... rest of the job configuration
.github/workflows/test_neo4j.yml (1)

Line range hint 1-58: Consider adding workflow status badges to README.

Since this is a critical integration test workflow, it would be helpful to add a workflow status badge to your README.md file. This provides quick visibility into the build status.

Add this markdown to your README.md:

![Neo4j Tests](https://github.com/{owner}/{repo}/actions/workflows/test_neo4j.yml/badge.svg)
.github/workflows/test_pgvector.yml (1)

8-8: LGTM! Good improvement to workflow triggers.

Adding the 'synchronize' event type ensures that tests run automatically when new commits are pushed to the PR, not just when labels are added. This helps catch issues earlier in the development cycle.

Consider also adding the 'opened' event type to run tests immediately when PRs are created, providing even faster feedback to developers.

.github/workflows/test_python_3_10.yml (1)

8-8: Consider adding a comment to explain the event types.

For better maintainability, it would be helpful to document why both event types are needed.

    branches:
      - main
-    types: [labeled, synchronize]
+    # Trigger on:
+    # - labeled: when "run-checks" label is added
+    # - synchronize: when PR branch is updated
+    types: [labeled, synchronize]
.github/workflows/test_python_3_11.yml (1)

8-8: LGTM! Adding 'synchronize' trigger improves CI workflow.

The addition of the 'synchronize' event type is a good practice as it ensures tests run automatically when new commits are pushed to the PR. Note that the workflow will still only execute if the 'run-checks' label is present, maintaining controlled test execution.

pyproject.toml (3)

61-61: Consider pinning lancedb to exact version.

The upgrade from 0.8.0 to 0.15.0 is a significant jump that might introduce breaking changes. Consider pinning to exact version (removing the ^) for better reproducibility.

-lancedb = "0.15.0"
+lancedb = "=0.15.0"

70-72: Review database adapter dependencies organization.

The addition of asyncpg and pgvector alongside psycopg2 suggests a move towards async PostgreSQL support. However:

  1. These dependencies should ideally be optional and moved to extras to keep the base installation light
  2. Version constraints should be aligned with the minimum supported Python version (3.9)

Consider reorganizing these dependencies:

-asyncpg = "^0.29.0"
-pgvector = "^0.3.5"
-psycopg2 = "^2.9.10"
+psycopg2 = {version = "^2.9.10", optional = true}
+asyncpg = {version = "^0.29.0", optional = true}
+pgvector = {version = "^0.3.5", optional = true}

83-83: Remove unnecessary blank line.

There's an extra blank line that doesn't follow the file's formatting pattern.

-
cognee/infrastructure/databases/vector/weaviate_db/WeaviateAdapter.py (3)

85-123: Improve error handling specificity

The current generic error handling could be more specific to handle different types of failures (e.g., batch processing errors vs single update errors).

Consider this improvement:

         try:
             if len(data_points) > 1:
                 with collection.batch.dynamic() as batch:
                     for data_point in data_points:
                         batch.add_object(
                             uuid = data_point.uuid,
                             vector = data_point.vector,
                             properties = data_point.properties,
                             references = data_point.references,
                         )
             else:
                 data_point: DataObject = data_points[0]
                 return collection.data.update(
                     uuid = data_point.uuid,
                     vector = data_point.vector,
                     properties = data_point.properties,
                     references = data_point.references,
                 )
-        except Exception as error:
-            logger.error("Error creating data points: %s", str(error))
-            raise error
+        except weaviate.exceptions.WeaviateBatchError as error:
+            logger.error("Batch processing failed: %s", str(error))
+            raise
+        except weaviate.exceptions.WeaviateUpdateError as error:
+            logger.error("Single data point update failed: %s", str(error))
+            raise
+        except Exception as error:
+            logger.error("Unexpected error while creating data points: %s", str(error))
+            raise

182-184: Document score calculation logic

The score inversion (1 - score) needs documentation to explain why this transformation is necessary.

Add a comment explaining the score calculation:

             ScoredResult(
                 id = UUID(str(result.uuid)),
                 payload = result.properties,
+                # Weaviate returns distance-based scores (0 = perfect match), 
+                # converting to similarity score (1 = perfect match)
                 score = 1 - float(result.metadata.score)
             ) for result in search_result.objects

Line range hint 21-59: Add connection management capabilities

Consider adding connection management features to handle disconnections and health checks.

Some suggestions:

  1. Add a health check method to verify the connection status
  2. Implement connection retry logic
  3. Add connection pooling for better resource management
  4. Consider implementing a connection timeout

Would you like me to provide a detailed implementation for these features?
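One possible shape for the first two suggestions, as a sketch (the readiness probe is caller-supplied because the exact client call depends on the Weaviate client version):

import asyncio
import logging

logger = logging.getLogger(__name__)

async def is_healthy(probe) -> bool:
    """Run a caller-supplied readiness probe (for example a ping or ready
    check on the Weaviate client) and report failures as unhealthy."""
    try:
        return bool(await probe())
    except Exception:
        return False

async def with_retries(operation, retries: int = 3, base_delay: float = 0.5):
    """Retry an async operation with exponential backoff."""
    for attempt in range(1, retries + 1):
        try:
            return await operation()
        except Exception as error:
            if attempt == retries:
                raise
            delay = base_delay * 2 ** (attempt - 1)
            logger.warning("Attempt %d failed (%s); retrying in %.1fs", attempt, error, delay)
            await asyncio.sleep(delay)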

cognee/infrastructure/databases/graph/neo4j_driver/adapter.py (1)

404-413: Consider optimizing property serialization for large datasets

The current implementation creates a new dictionary for each property set. For large datasets, consider using a generator pattern or batch processing.

 def serialize_properties(self, properties = dict()):
-    serialized_properties = {}
+    return {
+        k: str(v) if isinstance(v, UUID) else v
+        for k, v in properties.items()
+    }
-    for property_key, property_value in properties.items():
-        if isinstance(property_value, UUID):
-            serialized_properties[property_key] = str(property_value)
-            continue
-
-        serialized_properties[property_key] = property_value
-
-    return serialized_properties
📜 Review details

Configuration used: CodeRabbit UI
Review profile: CHILL

📥 Commits

Reviewing files that changed from the base of the PR and between 08a8442 and ed9036a.

⛔ Files ignored due to path filters (1)
  • poetry.lock is excluded by !**/*.lock
📒 Files selected for processing (13)
  • .github/workflows/test_neo4j.yml (1 hunks)
  • .github/workflows/test_notebook.yml (1 hunks)
  • .github/workflows/test_pgvector.yml (1 hunks)
  • .github/workflows/test_python_3_10.yml (1 hunks)
  • .github/workflows/test_python_3_11.yml (1 hunks)
  • .github/workflows/test_python_3_9.yml (1 hunks)
  • .github/workflows/test_qdrant.yml (1 hunks)
  • .github/workflows/test_weaviate.yml (1 hunks)
  • cognee/infrastructure/databases/graph/neo4j_driver/adapter.py (10 hunks)
  • cognee/infrastructure/databases/vector/pgvector/PGVectorAdapter.py (5 hunks)
  • cognee/infrastructure/databases/vector/weaviate_db/WeaviateAdapter.py (4 hunks)
  • cognee/tests/test_pgvector.py (2 hunks)
  • pyproject.toml (2 hunks)
🔇 Additional comments (14)
.github/workflows/test_weaviate.yml (1)

8-8: LGTM! Enhanced PR trigger configuration.

Adding the 'synchronize' event type is a good improvement as it ensures tests run automatically when new commits are pushed to the PR, maintaining code quality throughout the PR lifecycle.

.github/workflows/test_neo4j.yml (1)

8-8: LGTM! Enhanced workflow trigger configuration.

Adding the 'synchronize' event type is a good improvement as it ensures tests run automatically when new commits are pushed to the PR, while maintaining the manual control through labels.

.github/workflows/test_notebook.yml (1)

8-8: LGTM! Good improvement to workflow automation.

Adding the 'synchronize' event type ensures tests run automatically when new commits are pushed to the PR, reducing manual intervention while maintaining the label-based control.

Let's verify this change is consistent across other workflow files:

.github/workflows/test_python_3_9.yml (1)

8-8: LGTM! Enhanced workflow trigger configuration.

Adding the 'synchronize' event type is a good improvement as it ensures tests run automatically when new commits are pushed to the PR, providing faster feedback on code changes.

Let's verify this change is consistent across other workflow files:

✅ Verification successful

✓ Confirmed: PR trigger types are consistent across all workflow files

The verification shows that all test workflow files have identical pull request trigger configurations with both labeled and synchronize event types, ensuring consistent behavior across the test suite.

🏁 Scripts executed

The following scripts were executed for the analysis:

Script:

#!/bin/bash
# Description: Check if all test workflow files have consistent PR trigger types

# Search for pull_request trigger types in all test workflow files
echo "Checking PR trigger types in test workflow files:"
rg --type yaml "pull_request:(\n|\s)+.*\n.*types:" .github/workflows/test_*.yml -A 2

Length of output: 649


Script:

#!/bin/bash
# Description: Check if all test workflow files have consistent PR trigger types
# Using simpler pattern matching approach

echo "Checking PR trigger types in test workflow files:"
for file in .github/workflows/test_*.yml; do
  echo "=== $file ==="
  grep -A 2 "pull_request:" "$file"
done

Length of output: 2077


Script:

#!/bin/bash
# Description: Check for 'types' configuration in pull_request triggers
# Using a wider context to capture the types field

echo "Checking PR trigger types configuration in test workflow files:"
for file in .github/workflows/test_*.yml; do
  echo "=== $file ==="
  grep -A 4 "pull_request:" "$file"
done

Length of output: 2385

.github/workflows/test_python_3_10.yml (1)

8-8: LGTM! Good improvement to the CI workflow.

Adding the 'synchronize' event type ensures that tests automatically run when the PR branch is updated, improving the feedback loop for developers.

pyproject.toml (2)

80-80: Verify postgres extras completeness.

The postgres extras section has been expanded to include all database-related dependencies. However, let's verify that all necessary dependencies for PostgreSQL support are included.

#!/bin/bash
# Search for PostgreSQL-related imports in the codebase
rg -t python "(?:import|from) (?:psycopg2|asyncpg|pgvector)"

# Search for other potential PostgreSQL-related packages
rg -t python "(?:import|from) (?:sqlalchemy\.dialects\.postgresql|psycopg|aiopg)"

45-45: Verify compatibility with SQLAlchemy version.

The dlt package upgrade to ^1.3.0 with SQLAlchemy extras needs verification against the fixed SQLAlchemy version (2.0.35) specified in dependencies.

✅ Verification successful

SQLAlchemy integration is compatible with dlt upgrade

The codebase extensively uses SQLAlchemy with version 2.0.35, and the dlt package with SQLAlchemy extras (^1.3.0) is compatible with this setup. This is evidenced by:

  1. The codebase uses modern SQLAlchemy 2.0 features (async support, new style ORM)
  2. No version conflicts are present in pyproject.toml
  3. The extensive SQLAlchemy usage across multiple modules shows stable integration
🏁 Scripts executed

The following scripts were executed for the analysis:

Script:

#!/bin/bash
# Check dlt package dependencies compatibility
curl -s https://pypi.org/pypi/dlt/1.3.0/json | jq '.info.requires_dist[] | select(contains("sqlalchemy"))'

Length of output: 208


Script:

#!/bin/bash
# Let's try a different approach to check SQLAlchemy usage and version requirements

# Check for SQLAlchemy imports and usage in the codebase
rg --type-add 'python:*.py' -t python 'from sqlalchemy|import sqlalchemy'

# Look for any SQLAlchemy version specifications in requirements files
fd -e txt -e in requirements | xargs cat

# Check the actual pyproject.toml content for SQLAlchemy version
cat pyproject.toml

Length of output: 8744

cognee/tests/test_pgvector.py (1)

44-44: LGTM! Style improvement.

The spacing around the assignment operator improves readability and follows PEP 8 style guidelines.

cognee/infrastructure/databases/vector/weaviate_db/WeaviateAdapter.py (1)

4-18: LGTM! Well-structured data model implementation.

The IndexSchema class is well-defined with proper type hints and clear metadata specification.

cognee/infrastructure/databases/vector/pgvector/PGVectorAdapter.py (2)

18-23: LGTM: Well-structured index schema definition.

The IndexSchema class provides a clean abstraction for text indexing with proper metadata configuration.


Line range hint 1-250: Verify vector operations across different vector sizes.

The implementation assumes vector sizes will match between stored and query vectors. Let's verify this assumption.

✅ Verification successful

Based on the search results, I can now generate the final response:

Vector size validation is properly implemented

The implementation correctly handles vector size consistency:

  • Vector dimensions are properly initialized during collection creation using vector_size = self.embedding_engine.get_vector_size()
  • The vector size is consistently applied when creating Vector columns: vector = Column(Vector(vector_size))
  • All vector operations are performed on vectors created with matching dimensions from the same embedding engine instance
  • Other adapters (FalkorDB, QDrant, LanceDB) follow the same pattern of dimension validation
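A minimal sketch of the kind of guard that keeps stored and query vectors aligned (function and variable names are assumptions):

def assert_vector_size(expected_size: int, vector: list) -> None:
    # Guard against inserting or querying with a vector of the wrong dimension.
    if len(vector) != expected_size:
        raise ValueError(
            f"Vector has dimension {len(vector)}, expected {expected_size}"
        )

assert_vector_size(3, [0.1, 0.2, 0.3])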
🏁 Scripts executed

The following scripts were executed for the analysis:

Script:

#!/bin/bash
# Search for vector size mismatches or validations across the codebase

# Check for vector size validations
rg -A 2 "vector_size|embedding_size|dimension" 

# Check for potential vector operations without size validation
ast-grep --pattern 'cosine_distance($x)' 

Length of output: 6714

cognee/infrastructure/databases/graph/neo4j_driver/adapter.py (3)

Line range hint 345-359: LGTM! Proper UUID handling implemented

The implementation correctly handles UUID conversion and uses parameterized queries as recommended.


78-92: ⚠️ Potential issue

Implement label validation in add_nodes method

As noted in past reviews, the label handling in add_nodes could be vulnerable to injection. Consider implementing the suggested validation.

 async def add_nodes(self, nodes: list[DataPoint]) -> None:
+    def validate_label(label: str) -> bool:
+        return bool(re.match(r'^[A-Za-z0-9_]+$', label))
+
     query = """
     UNWIND $nodes AS node
+    WITH node WHERE $validate_label(node.node_id)
     MERGE (n {id: node.node_id})
     ON CREATE SET n += node.properties, n.updated_at = timestamp()
     ON MATCH SET n += node.properties, n.updated_at = timestamp()
     WITH n, node.node_id AS label
     CALL apoc.create.addLabels(n, [label]) YIELD node AS labeledNode
     RETURN ID(labeledNode) AS internal_id, labeledNode.id AS nodeId
     """

63-64: Consider filtering sensitive data before storing

The model_dump() method might expose sensitive data. Consider implementing a whitelist of allowed properties or explicitly excluding sensitive fields.
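A whitelist along these lines would do it (field names are illustrative; assumes a pydantic v2 model):

from pydantic import BaseModel

class NodeSketch(BaseModel):
    id: str
    name: str
    api_key: str  # example of a field that should never be persisted

ALLOWED_NODE_PROPERTIES = {"id", "name"}

node = NodeSketch(id="n1", name="example", api_key="secret")
safe_properties = node.model_dump(include=ALLOWED_NODE_PROPERTIES)
print(safe_properties)  # {'id': 'n1', 'name': 'example'}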

Comment on lines +68 to +69
random_node = (await vector_engine.search("Entity_name", "Quantum computer"))[0]
random_node_name = random_node.payload["text"]

💡 Codebase verification

Inconsistency found in payload field access between demo notebook and tests

The search parameter "Entity_name" is consistently used across the codebase, but there's an inconsistency in accessing the payload field:

  • Demo notebook (notebooks/cognee_demo.ipynb) uses payload["name"]
  • All test files use payload["text"]

This needs to be synchronized to use either "name" or "text" consistently.

🔗 Analysis chain

Verify vector engine search parameter changes across the codebase.

The search parameters have been updated:

  • Search term changed from "entities" to "Entity_name"
  • Payload field changed from "name" to "text"

Let's ensure these changes are consistent with the vector engine's API and other usages.

🏁 Scripts executed

The following scripts were executed for the analysis:

Script:

#!/bin/bash
# Description: Check for other occurrences of vector engine search calls to ensure consistency

# Search for vector engine search calls
rg -A 2 "vector_engine\.search\(" 

# Search for references to the old field names
rg '"entities"'
rg '\.payload\["name"\]'

# Search for references to the new field names
rg '"Entity_name"'
rg '\.payload\["text"\]'

Length of output: 3518

Comment on lines +128 to +138
async def create_vector_index(self, index_name: str, index_property_name: str):
await self.create_collection(f"{index_name}_{index_property_name}")

async def index_data_points(self, index_name: str, index_property_name: str, data_points: list[DataPoint]):
await self.create_data_points(f"{index_name}_{index_property_name}", [
IndexSchema(
id = data_point.id,
text = data_point.get_embeddable_data(),
) for data_point in data_points
])


🛠️ Refactor suggestion

Add validation checks to index methods

The index methods should validate inputs and verify preconditions.

Consider these improvements:

     async def create_vector_index(self, index_name: str, index_property_name: str):
+        if not index_name or not index_property_name:
+            raise ValueError("Index name and property name must not be empty")
+        collection_name = f"{index_name}_{index_property_name}"
+        if await self.has_collection(collection_name):
+            logger.warning(f"Index {collection_name} already exists")
+            return
         await self.create_collection(f"{index_name}_{index_property_name}")

     async def index_data_points(self, index_name: str, index_property_name: str, data_points: list[DataPoint]):
+        if not data_points:
+            logger.warning("No data points to index")
+            return
+        collection_name = f"{index_name}_{index_property_name}"
+        if not await self.has_collection(collection_name):
+            raise ValueError(f"Index {collection_name} does not exist")
         await self.create_data_points(f"{index_name}_{index_property_name}", [
             IndexSchema(
                 id = data_point.id,
                 text = data_point.get_embeddable_data(),
             ) for data_point in data_points
         ])

Committable suggestion skipped: line range outside the PR's diff.

Comment on lines +60 to +74
if not self.client.collections.exists(collection_name):
future.set_result(
self.client.collections.create(
name = collection_name,
properties = [
wvcc.Property(
name = "text",
data_type = wvcc.DataType.TEXT,
skip_vectorization = True
)
]
)
)
)
else:
future.set_result(self.get_collection(collection_name))

⚠️ Potential issue

Add error handling for collection creation

While the idempotent check is good, the collection creation needs error handling to catch potential Weaviate API errors.

Consider wrapping the collection creation in a try-catch:

         if not self.client.collections.exists(collection_name):
+            try:
                 future.set_result(
                     self.client.collections.create(
                         name = collection_name,
                         properties = [
                             wvcc.Property(
                                 name = "text",
                                 data_type = wvcc.DataType.TEXT,
                                 skip_vectorization = True
                             )
                         ]
                     )
                 )
+            except Exception as e:
+                logger.error(f"Failed to create collection {collection_name}: {str(e)}")
+                raise
         else:
             future.set_result(self.get_collection(collection_name))

Committable suggestion skipped: line range outside the PR's diff.

Comment on lines +148 to +162
# Get PGVectorDataPoint Table from database
PGVectorDataPoint = await self.get_table(collection_name)

async with self.get_async_session() as session:
results = await session.execute(
select(PGVectorDataPoint).where(PGVectorDataPoint.c.id.in_(data_point_ids))
)
results = results.all()

return [
-    ScoredResult(id=result.id, payload=result.payload, score=0)
-    for result in results
+    ScoredResult(
+        id = UUID(result.id),
+        payload = result.payload,
+        score = 0
+    ) for result in results

🛠️ Refactor suggestion

Improve type safety and error handling for UUID conversion.

The UUID conversion could fail with invalid strings, and the return type hint is missing.

Consider these improvements:

-async def retrieve(self, collection_name: str, data_point_ids: List[str]):
+async def retrieve(self, collection_name: str, data_point_ids: List[str]) -> List[ScoredResult]:
     # Get PGVectorDataPoint Table from database
     PGVectorDataPoint = await self.get_table(collection_name)
 
     async with self.get_async_session() as session:
         results = await session.execute(
             select(PGVectorDataPoint).where(PGVectorDataPoint.c.id.in_(data_point_ids))
         )
         results = results.all()
 
         return [
             ScoredResult(
-                id = UUID(result.id),
+                id = UUID(str(result.id)) if isinstance(result.id, str) else result.id,
                 payload = result.payload,
                 score = 0
             ) for result in results
         ]
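A small helper along these lines would keep the conversion in one place. This is only a sketch; coerce_uuid is a hypothetical name, not something in the codebase.

from uuid import UUID

def coerce_uuid(value) -> UUID:
    # Hypothetical helper: accept either a UUID or its string form and fail
    # with a descriptive error instead of a bare ValueError from UUID().
    if isinstance(value, UUID):
        return value
    try:
        return UUID(str(value))
    except (ValueError, TypeError) as error:
        raise ValueError(f"Invalid data point id: {value!r}") from error

The list comprehension could then build ScoredResult with id = coerce_uuid(result.id).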

Comment on lines +200 to +212
# Extract distances and find min/max for normalization
for vector in closest_items:
    # TODO: Add normalization of similarity score
    vector_list.append(vector)

# Create and return ScoredResult objects
return [
    ScoredResult(
        id = UUID(str(row.id)),
        payload = row.payload,
        score = row.similarity
    ) for row in vector_list
]

⚠️ Potential issue

Address similarity score normalization and add error handling.

The code carries a TODO for score normalization and does not guard against missing (None) similarity values returned by the cosine-distance query.

Consider normalizing the scores and handling missing similarities:

 # Extract distances and find min/max for normalization
+min_similarity = float('inf')
+max_similarity = float('-inf')
 for vector in closest_items:
-    # TODO: Add normalization of similarity score
+    if vector.similarity is not None:
+        min_similarity = min(min_similarity, vector.similarity)
+        max_similarity = max(max_similarity, vector.similarity)
     vector_list.append(vector)

+# Normalize scores between 0 and 1 if we have valid similarities
+score_range = max_similarity - min_similarity if max_similarity > min_similarity else 1
 return [
     ScoredResult(
         id = UUID(str(row.id)),
         payload = row.payload,
-        score = row.similarity
+        score = 1 - ((row.similarity - min_similarity) / score_range) if row.similarity is not None else 0
     ) for row in vector_list
 ]

Would you like me to create an issue to track the implementation of similarity score normalization?
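As a standalone illustration of the min-max scheme above (the function name and the treatment of missing values are assumptions, not part of the PR):

from typing import List, Optional

def normalize_distances(distances: List[Optional[float]]) -> List[float]:
    # Map cosine distances to scores in [0, 1]: smaller distance -> higher score,
    # missing values -> 0. Mirrors the min-max normalization sketched above.
    known = [d for d in distances if d is not None]
    if not known:
        return [0.0 for _ in distances]
    lo, hi = min(known), max(known)
    spread = (hi - lo) or 1.0  # avoid division by zero when all distances are equal
    return [0.0 if d is None else 1.0 - ((d - lo) / spread) for d in distances]

For example, normalize_distances([0.1, 0.4, None]) returns [1.0, 0.0, 0.0].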


Comment on lines +123 to +132
async def create_vector_index(self, index_name: str, index_property_name: str):
    await self.create_collection(f"{index_name}_{index_property_name}")

async def index_data_points(self, index_name: str, index_property_name: str, data_points: list[DataPoint]):
    await self.create_data_points(f"{index_name}_{index_property_name}", [
        IndexSchema(
            id = data_point.id,
            text = data_point.get_embeddable_data(),
        ) for data_point in data_points
    ])

⚠️ Potential issue

Add validation for index and property names.

The concatenation of index_name and index_property_name to form collection names could cause issues if these contain special characters or SQL-unsafe characters.

Consider adding validation:

 async def create_vector_index(self, index_name: str, index_property_name: str):
+    if not all(name.replace("_", "").isalnum() for name in [index_name, index_property_name]):
+        raise ValueError("Index and property names must contain only alphanumeric characters and underscores")
     await self.create_collection(f"{index_name}_{index_property_name}")
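A reusable validator is another way to express this; the sketch below is illustrative only (the regex, function name and error text are assumptions, not part of the PR).

import re

SAFE_NAME = re.compile(r"^[A-Za-z_][A-Za-z0-9_]*$")

def validate_name_part(name: str) -> str:
    # Hypothetical validator: letters, digits and underscores only,
    # and the name may not start with a digit.
    if not SAFE_NAME.match(name):
        raise ValueError(f"Unsafe index or property name: {name!r}")
    return name

# validate_name_part("document_chunk")  -> "document_chunk"
# validate_name_part("chunk-text")      -> raises ValueError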

Comment on lines +82 to +117
     if not await self.has_collection(collection_name):
         await self.create_collection(
             collection_name = collection_name,
             payload_schema = type(data_points[0]),
         )

-    vector_size = self.embedding_engine.get_vector_size()

-    class PGVectorDataPoint(Base):
-        __tablename__ = collection_name
-        __table_args__ = {"extend_existing": True}
-        # PGVector requires one column to be the primary key
-        primary_key: Mapped[int] = mapped_column(
-            primary_key=True, autoincrement=True
-        )
-        id: Mapped[type(data_points[0].id)]
-        payload = Column(JSON)
-        vector = Column(Vector(vector_size))
+    data_vectors = await self.embed_data(
+        [data_point.get_embeddable_data() for data_point in data_points]
+    )

-        def __init__(self, id, payload, vector):
-            self.id = id
-            self.payload = payload
-            self.vector = vector
+    vector_size = self.embedding_engine.get_vector_size()

-    pgvector_data_points = [
-        PGVectorDataPoint(
-            id=data_point.id,
-            vector=data_vectors[data_index],
-            payload=serialize_datetime(data_point.payload.dict()),
-        )
-        for (data_index, data_point) in enumerate(data_points)
-    ]
+    class PGVectorDataPoint(Base):
+        __tablename__ = collection_name
+        __table_args__ = {"extend_existing": True}
+        # PGVector requires one column to be the primary key
+        primary_key: Mapped[int] = mapped_column(
+            primary_key=True, autoincrement=True
+        )
+        id: Mapped[type(data_points[0].id)]
+        payload = Column(JSON)
+        vector = Column(Vector(vector_size))

+        def __init__(self, id, payload, vector):
+            self.id = id
+            self.payload = payload
+            self.vector = vector

+    pgvector_data_points = [
+        PGVectorDataPoint(
+            id = data_point.id,
+            vector = data_vectors[data_index],
+            payload = serialize_data(data_point.model_dump()),
+        )
+        for (data_index, data_point) in enumerate(data_points)
+    ]

⚠️ Potential issue

Add validation for empty data_points list.

The method assumes data_points is non-empty when accessing data_points[0]. This could raise an IndexError if an empty list is provided.

Add this validation at the start of the method:

 async def create_data_points(
     self, collection_name: str, data_points: List[DataPoint]
 ):
+    if not data_points:
+        raise ValueError("data_points list cannot be empty")
     if not await self.has_collection(collection_name):
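A quick regression test could pin this behaviour down. The sketch assumes pytest with pytest-asyncio and an adapter fixture wired to the PGVector adapter, none of which are part of this PR.

import pytest

@pytest.mark.asyncio
async def test_create_data_points_rejects_empty_list(adapter):
    # With the guard above in place, an empty payload should fail fast
    # instead of raising IndexError on data_points[0].
    with pytest.raises(ValueError, match="cannot be empty"):
        await adapter.create_data_points("test_collection", [])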

Comment on lines +177 to +186
 async def add_edge(self, from_node: UUID, to_node: UUID, relationship_name: str, edge_properties: Optional[Dict[str, Any]] = {}):
     serialized_properties = self.serialize_properties(edge_properties)
-    from_node = from_node.replace(":", "_")
-    to_node = to_node.replace(":", "_")

-    query = f"""MATCH (from_node:`{from_node}`
-        {{id: $from_node}}),
-        (to_node:`{to_node}` {{id: $to_node}})
-    MERGE (from_node)-[r:`{relationship_name}`]->(to_node)
-    SET r += $properties
-    RETURN r"""
+    query = dedent("""MATCH (from_node {id: $from_node}),
+        (to_node {id: $to_node})
+        MERGE (from_node)-[r]->(to_node)
+        ON CREATE SET r += $properties, r.updated_at = timestamp(), r.type = $relationship_name
+        ON MATCH SET r += $properties, r.updated_at = timestamp()
+        RETURN r
+    """)

🛠️ Refactor suggestion

Add explicit error handling for node not found scenarios

The MATCH clause assumes both nodes exist. If either node is missing, the MATCH yields no rows, so the MERGE is silently skipped and the caller gets no indication that the edge was not created.

 query = dedent("""MATCH (from_node {id: $from_node}),
     (to_node {id: $to_node})
+    WITH from_node, to_node
+    WHERE from_node IS NOT NULL AND to_node IS NOT NULL
     MERGE (from_node)-[r]->(to_node)
     ON CREATE SET r += $properties, r.updated_at = timestamp(), r.type = $relationship_name
     ON MATCH SET r += $properties, r.updated_at = timestamp()
     RETURN r
 """)

borisarzentar merged commit 52180eb into main on Nov 11, 2024
26 checks passed
borisarzentar deleted the feat/COG-184-add-falkordb branch on November 11, 2024 at 17:20