Creates edge embeddings collection #251

Merged

Conversation

hajdul88
Contributor

@hajdul88 hajdul88 commented Dec 4, 2024

Summary by CodeRabbit

Release Notes

  • New Features

    • Introduced a new asynchronous function to index graph edges, enhancing the pipeline functionality.
    • Added a new class to manage edge types within the graph module.
  • Improvements

    • Enhanced graph loading and deletion processes with improved logging and data integrity.
  • Bug Fixes

    • Updated error handling and logging for graph operations to improve user feedback.
  • Tests

    • Added comprehensive unit tests for the new indexing functionality, covering various scenarios.

Contributor

coderabbitai bot commented Dec 4, 2024

Walkthrough

The pull request changes several files. It refactors dynamic_steps_example.py to use direct boolean flag assignments, adds the index_graph_edges function in cognee/tasks/storage/index_graph_edges.py, and introduces the EdgeType class in cognee/modules/graph/models/EdgeType.py. Changes to the NetworkXAdapter class improve graph loading and deletion. Asynchronous unit tests for index_graph_edges cover the success, no-relationships, and initialization-error scenarios.

Changes

File | Change Summary
examples/python/dynamic_steps_example.py | Removed commented-out flags; replaced with direct boolean assignments for rebuild_kg and retrieve. Updated steps_to_enable dictionary accordingly.
cognee/api/v1/cognify/cognify_v2.py | Added import for index_graph_edges; included await index_graph_edges() in run_cognify_pipeline.
cognee/modules/graph/models/EdgeType.py | Introduced EdgeType class inheriting from DataPoint, defining attributes and metadata for edge types.
cognee/tasks/storage/index_graph_edges.py | Created index_graph_edges() function to manage vector indexes for relationship types; includes error handling and data indexing logic.
cognee/infrastructure/databases/graph/networkx/adapter.py | Updated load_graph_from_file to ensure node IDs are set; added logging for file existence and graph deletion confirmation.
cognee/tests/unit/infrastructure/databases/test_index_graph_edges.py | Added asynchronous unit tests for index_graph_edges(), covering success, no relationships, and initialization error scenarios.
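
As a rough illustration of the indexing step summarized above, the sketch below counts relationship names in edge data and builds one record per edge type. `EdgeTypeRecord`, `count_edge_types`, and `build_edge_types` are hypothetical stand-ins, not the cognee `EdgeType` model or the `index_graph_edges` implementation. The nested edge shape is assumed from the mock data in this PR's unit tests.

```python
from collections import Counter
from dataclasses import dataclass


@dataclass
class EdgeTypeRecord:
    """Hypothetical stand-in for cognee's EdgeType model."""

    relationship_name: str
    number_of_edges: int


def count_edge_types(edges_data):
    """Count how often each relationship name occurs in the edge data.

    Assumes each edge is an iterable whose dict items may carry a
    "relationship_name" key, as in the test mocks from this PR.
    """
    return Counter(
        item["relationship_name"]
        for edge in edges_data
        for item in edge
        if isinstance(item, dict) and "relationship_name" in item
    )


def build_edge_types(edges_data):
    """Create one record per distinct relationship name."""
    return [
        EdgeTypeRecord(relationship_name=name, number_of_edges=count)
        for name, count in count_edge_types(edges_data).items()
    ]


# Same shape as the mocked get_graph_data() edges in the unit tests.
edges = [
    [{"relationship_name": "rel1"}, {"relationship_name": "rel1"}],
    [{"relationship_name": "rel2"}],
]
edge_types = build_edge_types(edges)
```

Each record carries a relationship name and its edge count, which is the information the review comments below refer to as `relationship_name` and `number_of_edges`.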

Suggested reviewers

  • Vasilije1990
  • borisarzentar

Poem

In the code where rabbits play,
New paths and steps have found their way.
With edges indexed, graphs align,
A hop, a skip, all functions shine!
Through tests we leap, with joy we cheer,
For every change, a new frontier! 🐇✨



Vasilije1990
Vasilije1990 previously approved these changes Dec 4, 2024
Contributor

@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 0

🧹 Outside diff range and nitpick comments (3)
cognee/tasks/storage/index_graph_edges.py (2)

48-51: Use more descriptive variable names for clarity

Consider renaming the variables text and count to relationship_name and edge_count to enhance readability and better reflect their purpose.

Suggested change:

- for text, count in edge_types.items():
-     edge = EdgeType(relationship_name=text, number_of_edges=count)
+ for relationship_name, edge_count in edge_types.items():
+     edge = EdgeType(relationship_name=relationship_name, number_of_edges=edge_count)

66-68: Avoid variable shadowing to improve code maintainability

The variable index_name is being reused after splitting, which can lead to confusion. It's better to use a different variable name to prevent shadowing and enhance clarity.

Suggested change:

- for index_name, indexable_points in index_points.items():
-     index_name, field_name = index_name.split(".")
+ for index_key, indexable_points in index_points.items():
+     index_name, field_name = index_key.split(".")
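
The renaming suggested above can be sketched in isolation. The `group_by_index` helper and the example key `"EdgeType.relationship_name"` are assumptions made for illustration; only the `"<index>.<field>"` key format is implied by the `split(".")` call in the diff.

```python
def group_by_index(index_points):
    """Split "index.field" keys without rebinding the loop variable."""
    grouped = {}
    for index_key, indexable_points in index_points.items():
        # index_key stays intact; the split parts get their own names.
        index_name, field_name = index_key.split(".")
        grouped[(index_name, field_name)] = indexable_points
    return grouped


# Hypothetical input: one key in "<index>.<field>" form.
example = {"EdgeType.relationship_name": ["rel1", "rel2"]}
grouped = group_by_index(example)
```

Note that unpacking `split(".")` into two names raises `ValueError` if a key contains more or fewer than one dot, so this form only suits keys known to have exactly two parts.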
cognee/modules/graph/models/EdgeType.py (1)

9-11: Redundant use of Optional for an initialized class variable

Since _metadata is initialized with a default value, using Optional[dict] is unnecessary. Typing it simply as dict improves type clarity.

Suggested change:

- _metadata: Optional[dict] = {
+ _metadata: dict = {
      "index_fields": ["relationship_name"],
  }
📜 Review details

Configuration used: CodeRabbit UI
Review profile: CHILL

📥 Commits

Reviewing files that changed from the base of the PR and between 46ee513 and c20ee11.

📒 Files selected for processing (3)
  • cognee/api/v1/cognify/cognify_v2.py (2 hunks)
  • cognee/modules/graph/models/EdgeType.py (1 hunks)
  • cognee/tasks/storage/index_graph_edges.py (1 hunks)
🔇 Additional comments (2)
cognee/api/v1/cognify/cognify_v2.py (2)

21-21: Properly importing index_graph_edges function

The import statement correctly includes the index_graph_edges function, ensuring it's available for use in the pipeline.


98-98: Integrate index_graph_edges into the cognify pipeline

Calling await index_graph_edges() after the pipeline tasks appropriately incorporates the edge indexing step into the overall process.

@Vasilije1990 Vasilije1990 self-requested a review December 4, 2024 15:32
Vasilije1990
Vasilije1990 previously approved these changes Dec 4, 2024
Contributor

@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 0

🧹 Outside diff range and nitpick comments (2)
cognee/infrastructure/databases/graph/networkx/adapter.py (2)

277-279: Consider validating existing node IDs before overwriting

While explicitly setting node IDs is good practice, consider adding validation to detect and log cases where a node's existing 'id' attribute differs from its node_id in the graph, as this might indicate data inconsistency.

 for node_id, node_data in self.graph.nodes(data=True):
+    if 'id' in node_data and node_data['id'] != node_id:
+        logger.warning(f"Node {node_id} had mismatched ID attribute: {node_data['id']}. Updating to match node_id.")
     node_data['id'] = node_id
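
A minimal standalone sketch of this validation follows. It assumes only that the input is an iterable of `(node_id, attribute_dict)` pairs, which is the shape that networkx's `graph.nodes(data=True)` yields. The `sync_node_ids` helper is hypothetical and not part of the adapter.

```python
import logging

logger = logging.getLogger(__name__)


def sync_node_ids(nodes_with_data):
    """Set each node's 'id' attribute to its graph key, logging mismatches.

    Returns the IDs of nodes whose stored 'id' differed from the key.
    """
    mismatched = []
    for node_id, node_data in nodes_with_data:
        if "id" in node_data and node_data["id"] != node_id:
            logger.warning(
                "Node %s had mismatched id attribute %r; updating to match.",
                node_id,
                node_data["id"],
            )
            mismatched.append(node_id)
        node_data["id"] = node_id
    return mismatched


# Plain tuples stand in for graph.nodes(data=True) here.
nodes = [("a", {"id": "a"}), ("b", {"id": "stale"}), ("c", {})]
mismatched = sync_node_ids(nodes)
```

Returning the mismatched IDs is a design choice made for this sketch so that callers and tests can inspect inconsistencies without parsing log output.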

Line range hint 287-289: Consider more specific error handling

The broad Exception catch could mask specific issues. Consider handling different error types separately for better debugging.

-        except Exception:
-            logger.error("Failed to load graph from file: %s", file_path)
+        except json.JSONDecodeError as e:
+            logger.error("Failed to parse graph file %s: %s", file_path, e)
+        except IOError as e:
+            logger.error("IO error while loading graph from %s: %s", file_path, e)
+        except Exception as e:
+            logger.error("Unexpected error loading graph from %s: %s", file_path, e)
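
The layered handling suggested above could look like the following sketch. `load_json_graph` is a hypothetical helper that assumes the graph is stored as JSON; it is not the adapter's `load_graph_from_file`. `OSError` is used because `IOError` is an alias for it in Python 3.

```python
import json
import logging
import os
import tempfile

logger = logging.getLogger(__name__)


def load_json_graph(file_path):
    """Return parsed graph data, or None if the file cannot be loaded."""
    try:
        with open(file_path, "r", encoding="utf-8") as file:
            return json.load(file)
    except json.JSONDecodeError as error:
        # The file exists but does not contain valid JSON.
        logger.error("Failed to parse graph file %s: %s", file_path, error)
    except OSError as error:
        # Missing file, permission problem, or other I/O failure.
        logger.error("IO error while loading graph from %s: %s", file_path, error)
    except Exception as error:
        # Anything unanticipated still gets logged rather than raised.
        logger.error("Unexpected error loading graph from %s: %s", file_path, error)
    return None


with tempfile.TemporaryDirectory() as tmp_dir:
    bad_path = os.path.join(tmp_dir, "bad.json")
    with open(bad_path, "w", encoding="utf-8") as file:
        file.write("{not json")
    good_path = os.path.join(tmp_dir, "good.json")
    with open(good_path, "w", encoding="utf-8") as file:
        json.dump({"nodes": []}, file)

    bad_result = load_json_graph(bad_path)
    good_result = load_json_graph(good_path)
    missing_result = load_json_graph(os.path.join(tmp_dir, "missing.json"))
```

The specific handlers come before the generic one because Python takes the first matching `except` clause.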
📜 Review details

Configuration used: CodeRabbit UI
Review profile: CHILL

📥 Commits

Reviewing files that changed from the base of the PR and between c20ee11 and 36a5a27.

📒 Files selected for processing (1)
  • cognee/infrastructure/databases/graph/networkx/adapter.py (1 hunks)
🔇 Additional comments (2)
cognee/infrastructure/databases/graph/networkx/adapter.py (2)

Line range hint 280-286: LGTM: Robust empty graph initialization

The changes properly handle missing files by:

  • Logging clear warning messages
  • Creating necessary directories
  • Maintaining graph type consistency
  • Saving the empty graph state
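
The four steps listed above can be sketched as follows. `ensure_graph_file` and the empty-graph dictionary are assumptions made for illustration: the layout is a placeholder loosely modeled on node-link JSON, not the adapter's actual file format.

```python
import json
import logging
import os
import tempfile

logger = logging.getLogger(__name__)


def ensure_graph_file(file_path):
    """Create an empty graph file if none exists; return True if created."""
    if os.path.exists(file_path):
        return False
    logger.warning("Graph file not found at %s; creating an empty graph.", file_path)
    directory = os.path.dirname(file_path)
    if directory:
        os.makedirs(directory, exist_ok=True)
    # Placeholder layout: flags record that the graph is a directed multigraph.
    empty_graph = {"directed": True, "multigraph": True, "nodes": [], "links": []}
    with open(file_path, "w", encoding="utf-8") as file:
        json.dump(empty_graph, file)
    return True


with tempfile.TemporaryDirectory() as tmp_dir:
    graph_path = os.path.join(tmp_dir, "nested", "dir", "graph.json")
    created_first = ensure_graph_file(graph_path)
    created_second = ensure_graph_file(graph_path)
    with open(graph_path, "r", encoding="utf-8") as file:
        saved_graph = json.load(file)
```

The second call is a no-op because the file already exists, so the helper can be called on every load without overwriting data.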

Line range hint 277-286: Verify node ID usage across the codebase

The changes to node ID handling could affect code that assumes specific node ID behavior.

✅ Verification successful

Let me gather more information about the graph initialization and node handling to make a comprehensive assessment.


Node ID handling is consistent with codebase requirements

The changes to node ID handling in the graph loading code are safe and consistent with the codebase's requirements:

  • The node ID assignment (node_data['id'] = node_id) maintains consistency with the MultiDiGraph initialization pattern used across the codebase
  • UUID handling for node IDs is properly implemented, as evidenced by the UUID conversion in the adapter and its usage in edge operations
  • Graph loading error handling follows the established pattern of initializing an empty MultiDiGraph when needed
🏁 Scripts executed

The following scripts were executed for the analysis:

Script:

#!/bin/bash
# Description: Check for direct node ID access patterns and graph loading error handling

# Search for direct node ID access patterns
echo "Checking node ID access patterns..."
rg -A 2 "node\[.?['\"]id['\"]|node_data\[.?['\"]id['\"]"

# Search for graph loading error handling
echo "Checking graph loading error handling..."
rg -A 2 "load_graph_from_file"

Length of output: 2879


Script:

#!/bin/bash
# Check for graph initialization and node operations
echo "Checking graph initialization patterns..."
rg -A 3 "MultiDiGraph\(\)"

echo "Checking node operations and modifications..."
ast-grep --pattern 'def $_(self, node$_)'

echo "Checking UUID handling in nodes..."
rg -A 2 "UUID.*node"

Length of output: 4247

@hajdul88 hajdul88 changed the title from "chore: deletes comment from dynamic_steps_example" to "Creates edge embeddings collection" Dec 4, 2024
Contributor

@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 1

🧹 Outside diff range and nitpick comments (3)
cognee/tests/infrastructure/databases/test_index_graph_edges.py (3)

4-24: Consider enhancing test coverage with additional assertions

While the basic test structure is good, consider these improvements:

  1. Verify the actual data being passed to index_data_points
  2. Assert the structure of created EdgeType objects
  3. Add assertions for the parameters passed to create_vector_index

Here's a suggested enhancement:

 @pytest.mark.asyncio
 async def test_index_graph_edges_success():
     """Test that index_graph_edges uses the index datapoints and creates vector index."""
     mock_graph_engine = AsyncMock()
+    test_relationships = [
+        [{"relationship_name": "rel1", "properties": {"key": "value"}}, 
+         {"relationship_name": "rel1", "properties": {"key2": "value2"}}],
+        [{"relationship_name": "rel2", "properties": {"key3": "value3"}}]
+    ]
-    mock_graph_engine.get_graph_data.return_value = (None, [
-        [{"relationship_name": "rel1"}, {"relationship_name": "rel1"}],
-        [{"relationship_name": "rel2"}]
-    ])
+    mock_graph_engine.get_graph_data.return_value = (None, test_relationships)

     mock_vector_engine = AsyncMock()

     with patch("cognee.tasks.storage.index_graph_edges.get_graph_engine", return_value=mock_graph_engine), \
          patch("cognee.tasks.storage.index_graph_edges.get_vector_engine", return_value=mock_vector_engine):

         from cognee.tasks.storage.index_graph_edges import index_graph_edges
         await index_graph_edges()

         mock_graph_engine.get_graph_data.assert_awaited_once()
         assert mock_vector_engine.create_vector_index.await_count == 1
         assert mock_vector_engine.index_data_points.await_count == 1
+        # Verify the data passed to create_vector_index
+        mock_vector_engine.create_vector_index.assert_awaited_with(
+            "graph_edges", dimension=1536  # Adjust dimension as per your config
+        )
+        # Verify the structure of indexed data
+        calls = mock_vector_engine.index_data_points.await_args_list
+        assert len(calls) == 1
+        indexed_data = calls[0].args[0]
+        assert all(isinstance(item, EdgeType) for item in indexed_data)
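
For readers unfamiliar with the mock APIs used in this suggestion, here is a small self-contained sketch of inspecting awaited calls on an `AsyncMock`. The `index_items` function and its argument order are hypothetical; the real `create_vector_index` and `index_data_points` signatures should be checked against the vector engine before copying assertions like these.

```python
import asyncio
from unittest.mock import AsyncMock


async def index_items(vector_engine, index_name, field_name, items):
    """Hypothetical function under test: create an index, then index items."""
    await vector_engine.create_vector_index(index_name, field_name)
    await vector_engine.index_data_points(index_name, field_name, items)


mock_vector_engine = AsyncMock()
asyncio.run(
    index_items(mock_vector_engine, "EdgeType", "relationship_name", ["rel1", "rel2"])
)

# Verify the exact arguments of the single awaited call.
mock_vector_engine.create_vector_index.assert_awaited_once_with(
    "EdgeType", "relationship_name"
)

# await_args_list holds one call object per await; .args is the positional tuple.
calls = mock_vector_engine.index_data_points.await_args_list
indexed_items = calls[0].args[2]
```

Which positional slot holds the indexed data depends on the real signature, so the index used with `.args` has to match it.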

26-43: Enhance test documentation for clarity

The test logic is correct, but the docstring could be more descriptive about the expected behavior.

 @pytest.mark.asyncio
 async def test_index_graph_edges_no_relationships():
-    """Test that index_graph_edges handles empty relationships correctly."""
+    """Test that index_graph_edges handles empty relationships correctly.
+    
+    Ensures that when no relationships are returned from the graph engine:
+    1. No vector index is created
+    2. No data points are indexed
+    3. The function completes without errors
+    """

1-56: Consider adding idempotency test

The current test suite covers the basic scenarios well. However, consider adding a test to verify that the indexing operation is idempotent - i.e., running it multiple times with the same data doesn't cause issues or duplicate entries.

This would help ensure the function can be safely retried in case of partial failures in a distributed system.
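
One possible shape for such a check is sketched below with an in-memory fake. `InMemoryVectorStore` and its `index_data_points(index_name, points)` signature are hypothetical and only illustrate upsert-by-ID behavior; a real idempotency test would call `index_graph_edges` twice against mocked or real engines.

```python
import asyncio


class InMemoryVectorStore:
    """Hypothetical fake store that upserts points keyed by (index, ID)."""

    def __init__(self):
        self.points = {}

    async def index_data_points(self, index_name, points):
        for point_id, payload in points.items():
            # Writing the same key again overwrites instead of duplicating.
            self.points[(index_name, point_id)] = payload


async def run_indexing_twice(store):
    """Index the same points twice and return the state after each run."""
    points = {"rel1": {"number_of_edges": 2}, "rel2": {"number_of_edges": 1}}
    await store.index_data_points("edge_types", points)
    first_snapshot = dict(store.points)
    await store.index_data_points("edge_types", points)
    return first_snapshot, dict(store.points)


first_snapshot, second_snapshot = asyncio.run(
    run_indexing_twice(InMemoryVectorStore())
)
```

The comparison only holds when point IDs are deterministic. If IDs were generated randomly on each run, the second pass would add new entries and the snapshots would differ.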

📜 Review details

Configuration used: CodeRabbit UI
Review profile: CHILL

📥 Commits

Reviewing files that changed from the base of the PR and between 36a5a27 and e6bf428.

📒 Files selected for processing (1)
  • cognee/tests/infrastructure/databases/test_index_graph_edges.py (1 hunks)
🔇 Additional comments (1)
cognee/tests/infrastructure/databases/test_index_graph_edges.py (1)

1-2: LGTM! Imports are appropriate for async testing

The necessary testing utilities are imported correctly.

Contributor

@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 1

🧹 Outside diff range and nitpick comments (4)
cognee/tests/unit/infrastructure/databases/test_index_graph_edges.py (4)

8-11: Consider using type hints and constants for test data

The test data structure could be more explicit and maintainable:

  1. Use type hints to clarify the expected data structure
  2. Define relationship names as constants

+from typing import List, Dict
+
+TEST_RELATIONSHIP_1 = "rel1"
+TEST_RELATIONSHIP_2 = "rel2"
+
+test_data: List[List[Dict[str, str]]] = [
+    [{"relationship_name": TEST_RELATIONSHIP_1}, {"relationship_name": TEST_RELATIONSHIP_1}],
+    [{"relationship_name": TEST_RELATIONSHIP_2}]
+]
-    mock_graph_engine.get_graph_data.return_value = (None, [
-        [{"relationship_name": "rel1"}, {"relationship_name": "rel1"}],
-        [{"relationship_name": "rel2"}]
-    ])
+    mock_graph_engine.get_graph_data.return_value = (None, test_data)

27-29: Enhance test documentation and return value verification

The test documentation could be more explicit about the expected behavior and verify the return value.

-    """Test that index_graph_edges handles empty relationships correctly."""
+    """
+    Test that index_graph_edges handles empty relationships correctly.
+    
+    Expected behavior:
+    - Should not create vector index when no relationships exist
+    - Should not attempt to index any data points
+    - Should return None without raising exceptions
+    """

45-54: Consider adding more error scenarios and cleanup verification

While the basic error case is covered, consider adding tests for:

  1. Vector engine initialization failure
  2. Partial failures during indexing
  3. Cleanup/rollback behavior when errors occur

Example additional test:

@pytest.mark.asyncio
async def test_index_graph_edges_vector_engine_failure():
    """Test handling of vector engine initialization failure."""
    mock_graph_engine = AsyncMock()
    
    with patch("cognee.tasks.storage.index_graph_edges.get_graph_engine", 
              return_value=mock_graph_engine), \
         patch("cognee.tasks.storage.index_graph_edges.get_vector_engine", 
              side_effect=Exception("Vector engine failed")):

        from cognee.tasks.storage.index_graph_edges import index_graph_edges
        
        with pytest.raises(RuntimeError, match="Vector engine initialization failed"):
            await index_graph_edges()
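
The test above presumes that initialization failures are re-raised as `RuntimeError` with a particular message. The sketch below shows that wrapping pattern in isolation. `initialize_engines`, its injected factory arguments, and the message text are all hypothetical and not taken from `index_graph_edges`.

```python
import asyncio


async def initialize_engines(get_vector_engine, get_graph_engine):
    """Hypothetical initializer that wraps any failure in RuntimeError."""
    try:
        vector_engine = get_vector_engine()
        graph_engine = await get_graph_engine()
    except Exception as error:
        # Chain the original exception so the root cause stays visible.
        raise RuntimeError("Engine initialization failed") from error
    return vector_engine, graph_engine


def failing_vector_engine():
    raise ValueError("Vector engine failed")


async def working_graph_engine():
    return object()


def capture_error():
    """Run the initializer and return the RuntimeError it raises, if any."""
    try:
        asyncio.run(initialize_engines(failing_vector_engine, working_graph_engine))
    except RuntimeError as error:
        return error
    return None


captured = capture_error()
```

When asserting with `pytest.raises(..., match=...)`, the pattern is searched against the wrapping exception's message, not the message of the original `side_effect` exception, so the two strings need to be kept in sync with the function under test.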

1-56: Consider adding integration tests

While the unit tests provide good coverage of the index_graph_edges function, consider adding integration tests to verify:

  1. Actual interaction with the vector store
  2. End-to-end behavior in the cognify pipeline
  3. Performance characteristics with larger datasets

This would help ensure the feature works correctly in a production-like environment.

📜 Review details

Configuration used: CodeRabbit UI
Review profile: CHILL

📥 Commits

Reviewing files that changed from the base of the PR and between e6bf428 and 59035c3.

📒 Files selected for processing (1)
  • cognee/tests/unit/infrastructure/databases/test_index_graph_edges.py (1 hunks)
🔇 Additional comments (1)
cognee/tests/unit/infrastructure/databases/test_index_graph_edges.py (1)

1-3: LGTM: Imports are appropriate for async testing

The imports are minimal and correctly include the necessary components for async testing with pytest.

@hajdul88 hajdul88 merged commit acf0368 into main Dec 5, 2024
40 checks passed
@hajdul88 hajdul88 deleted the feature/cog-717-create-edge-embeddings-in-vector-databases branch December 5, 2024 08:13