Creates edge embeddings collection #251

hajdul88 · 2024-12-04T14:00:05Z

Summary by CodeRabbit

Release Notes

New Features
- Introduced a new asynchronous function to index graph edges, enhancing the pipeline functionality.
- Added a new class to manage edge types within the graph module.
Improvements
- Enhanced graph loading and deletion processes with improved logging and data integrity.
Bug Fixes
- Updated error handling and logging for graph operations to improve user feedback.
Tests
- Added comprehensive unit tests for the new indexing functionality, covering various scenarios.

coderabbitai · 2024-12-04T14:00:13Z

Walkthrough

This pull request changes several files. It refactors dynamic_steps_example.py to use direct boolean flag assignments, adds the index_graph_edges function in cognee/tasks/storage/index_graph_edges.py, and introduces the EdgeType class in cognee/modules/graph/models/EdgeType.py. The NetworkXAdapter class gains improved graph loading and deletion handling, and asynchronous unit tests cover index_graph_edges in several scenarios.

Changes

| File | Change Summary |
| --- | --- |
| `examples/python/dynamic_steps_example.py` | Removed commented-out flags; replaced with direct boolean assignments for `rebuild_kg` and `retrieve`. Updated `steps_to_enable` dictionary accordingly. |
| `cognee/api/v1/cognify/cognify_v2.py` | Added import for `index_graph_edges`; included `await index_graph_edges()` in `run_cognify_pipeline`. |
| `cognee/modules/graph/models/EdgeType.py` | Introduced `EdgeType` class inheriting from `DataPoint`, defining attributes and metadata for edge types. |
| `cognee/tasks/storage/index_graph_edges.py` | Created `index_graph_edges()` function to manage vector indexes for relationship types; includes error handling and data indexing logic. |
| `cognee/infrastructure/databases/graph/networkx/adapter.py` | Updated `load_graph_from_file` to ensure node IDs are set; added logging for file existence and graph deletion confirmation. |
| `cognee/tests/unit/infrastructure/databases/test_index_graph_edges.py` | Added asynchronous unit tests for `index_graph_edges()`, covering success, no relationships, and initialization error scenarios. |

Possibly related PRs

Remove graph overwriting on exception in NetworkXAdapter #190: Modifications in the NetworkXAdapter class in cognee/infrastructure/databases/graph/networkx/adapter.py relate to graph handling similar to changes in dynamic_steps_example.py.
Cog 417 chunking unit tests #205: Changes in the TextChunker class involve control flow and data handling that conceptually connect to the enabling steps logic in the main PR.

Suggested reviewers

Vasilije1990
borisarzentar

Poem

In the code where rabbits play,
New paths and steps have found their way.
With edges indexed, graphs align,
A hop, a skip, all functions shine!
Through tests we leap, with joy we cheer,
For every change, a new frontier! 🐇✨

coderabbitai

Actionable comments posted: 0

🧹 Outside diff range and nitpick comments (3)

cognee/tasks/storage/index_graph_edges.py (2)
48-51: Use more descriptive variable names for clarity

Consider renaming the variables text and count to relationship_name and edge_count to enhance readability and better reflect their purpose.

Suggested change:
- for text, count in edge_types.items():
-     edge = EdgeType(relationship_name=text, number_of_edges=count)
+ for relationship_name, edge_count in edge_types.items():
+     edge = EdgeType(relationship_name=relationship_name, number_of_edges=edge_count)
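For context, the loop above turns a mapping of relationship names to counts into EdgeType objects. A minimal sketch of how such a mapping might be built is shown here. It assumes edges have the nested-list-of-dicts shape used in the unit test mocks; `EdgeTypeSketch` and `count_edge_types` are hypothetical names, not part of cognee.

```python
from collections import Counter
from dataclasses import dataclass


@dataclass
class EdgeTypeSketch:
    """Hypothetical stand-in for EdgeType; the real class inherits from DataPoint."""

    relationship_name: str
    number_of_edges: int


def count_edge_types(edges):
    """Count edges per relationship name.

    Assumes each edge is a sequence whose items may be dicts carrying a
    "relationship_name" key, mirroring the mock data in the unit tests.
    """
    counts = Counter()
    for edge in edges:
        for item in edge:
            if isinstance(item, dict) and "relationship_name" in item:
                counts[item["relationship_name"]] += 1
    return counts


edges = [
    [{"relationship_name": "rel1"}, {"relationship_name": "rel1"}],
    [{"relationship_name": "rel2"}],
]

# Descriptive loop variable names, as the review comment suggests.
edge_types = [
    EdgeTypeSketch(relationship_name=relationship_name, number_of_edges=edge_count)
    for relationship_name, edge_count in count_edge_types(edges).items()
]
```

The real function may read relationship names from a different position in the edge structure, so treat the traversal as illustrative only.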
66-68: Avoid variable shadowing to improve code maintainability

The variable index_name is being reused after splitting, which can lead to confusion. It's better to use a different variable name to prevent shadowing and enhance clarity.

Suggested change:
- for index_name, indexable_points in index_points.items():
-     index_name, field_name = index_name.split(".")
+ for index_key, indexable_points in index_points.items():
+     index_name, field_name = index_key.split(".")
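A small sketch of the renamed loop, assuming the dictionary keys are composite strings of the form `"<index>.<field>"` (the exact key format is an assumption, and `split_index_keys` is a hypothetical helper):

```python
def split_index_keys(index_points):
    """Return (index_name, field_name, point_count) for each composite key."""
    parsed = []
    for index_key, indexable_points in index_points.items():
        # index_key keeps the composite value; the split parts get their own names.
        index_name, field_name = index_key.split(".")
        parsed.append((index_name, field_name, len(indexable_points)))
    return parsed


# Hypothetical input: one index keyed by "EdgeType.relationship_name".
index_points = {"EdgeType.relationship_name": ["rel1", "rel2"]}
parsed = split_index_keys(index_points)
```

Keeping `index_key` separate means the original composite key is still available inside the loop body, for example in a log message.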
cognee/modules/graph/models/EdgeType.py (1)
9-11: Redundant use of Optional for an initialized class variable

Since _metadata is initialized with a default value, using Optional[dict] is unnecessary. Typing it simply as dict improves type clarity.

Suggested change:
- _metadata: Optional[dict] = {
+ _metadata: dict = {
    "index_fields": ["relationship_name"],
}

📜 Review details

Configuration used: CodeRabbit UI
Review profile: CHILL

📥 Commits

Reviewing files that changed from the base of the PR and between 46ee513 and c20ee11.

📒 Files selected for processing (3)

cognee/api/v1/cognify/cognify_v2.py (2 hunks)
cognee/modules/graph/models/EdgeType.py (1 hunks)
cognee/tasks/storage/index_graph_edges.py (1 hunks)

🔇 Additional comments (2)

cognee/api/v1/cognify/cognify_v2.py (2)

21-21: Properly importing index_graph_edges function

The import statement correctly includes the index_graph_edges function, ensuring it's available for use in the pipeline.

98-98: Integrate index_graph_edges into the cognify pipeline

Calling await index_graph_edges() after the pipeline tasks appropriately incorporates the edge indexing step into the overall process.

coderabbitai

Actionable comments posted: 0

🧹 Outside diff range and nitpick comments (2)

cognee/infrastructure/databases/graph/networkx/adapter.py (2)
277-279: Consider validating existing node IDs before overwriting

While explicitly setting node IDs is good practice, consider adding validation to detect and log cases where a node's existing 'id' attribute differs from its node_id in the graph, as this might indicate data inconsistency.
 for node_id, node_data in self.graph.nodes(data=True):
+    if 'id' in node_data and node_data['id'] != node_id:
+        logger.warning(f"Node {node_id} had mismatched ID attribute: {node_data['id']}. Updating to match node_id.")
     node_data['id'] = node_id
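The same check can be tried without a graph library. The sketch below works on any iterable of (node_id, node_data) pairs, which is the shape that `graph.nodes(data=True)` yields in networkx; `sync_node_ids` is a hypothetical helper name.

```python
import logging

logger = logging.getLogger(__name__)


def sync_node_ids(nodes):
    """Set each node's 'id' attribute to its key and return the mismatched keys."""
    mismatched = []
    for node_id, node_data in nodes:
        if "id" in node_data and node_data["id"] != node_id:
            logger.warning(
                "Node %s had mismatched id attribute %r; overwriting.",
                node_id,
                node_data["id"],
            )
            mismatched.append(node_id)
        node_data["id"] = node_id
    return mismatched


# Plain dict standing in for graph node storage.
nodes = {"a": {"id": "a"}, "b": {"id": "stale"}, "c": {}}
mismatched = sync_node_ids(nodes.items())
```

Returning the mismatched keys is an addition for illustration; the suggested diff only logs a warning.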
Line range hint 287-289: Consider more specific error handling

The broad Exception catch could mask specific issues. Consider handling different error types separately for better debugging.
-        except Exception:
-            logger.error("Failed to load graph from file: %s", file_path)
+        except json.JSONDecodeError as e:
+            logger.error("Failed to parse graph file %s: %s", file_path, e)
+        except IOError as e:
+            logger.error("IO error while loading graph from %s: %s", file_path, e)
+        except Exception as e:
+            logger.error("Unexpected error loading graph from %s: %s", file_path, e)
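As a self-contained illustration of the split handlers, here is a sketch that reads a JSON file with the standard library only. `read_graph_json` is a hypothetical function, not the adapter's method. In Python 3, `IOError` is an alias of `OSError`, so the sketch catches `OSError` directly.

```python
import json
import logging

logger = logging.getLogger(__name__)


def read_graph_json(file_path):
    """Return the parsed JSON payload, or None if it cannot be read or parsed."""
    try:
        with open(file_path, "r", encoding="utf-8") as handle:
            return json.load(handle)
    except json.JSONDecodeError as error:
        # The file exists and is readable, but its content is not valid JSON.
        logger.error("Failed to parse graph file %s: %s", file_path, error)
    except OSError as error:
        # Missing file, permission problem, or another I/O failure.
        logger.error("I/O error while loading graph from %s: %s", file_path, error)
    return None
```

Unlike the suggested diff, this sketch has no catch-all `except Exception`, so unexpected errors propagate to the caller. Whether that is preferable depends on how the adapter is expected to recover.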

📜 Review details

Configuration used: CodeRabbit UI
Review profile: CHILL

📥 Commits

Reviewing files that changed from the base of the PR and between c20ee11 and 36a5a27.

📒 Files selected for processing (1)

cognee/infrastructure/databases/graph/networkx/adapter.py (1 hunks)

🔇 Additional comments (2)

cognee/infrastructure/databases/graph/networkx/adapter.py (2)

Line range hint 280-286: LGTM: Robust empty graph initialization

The changes properly handle missing files by:

Logging clear warning messages
Creating necessary directories
Maintaining graph type consistency
Saving the empty graph state
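The four steps listed above could look roughly like the following sketch. It uses plain JSON instead of networkx, and both `load_or_create_graph_file` and the `EMPTY_GRAPH` payload are assumptions for illustration, not the adapter's actual code.

```python
import json
import logging
import os

logger = logging.getLogger(__name__)

# Hypothetical serialized form of an empty directed multigraph.
EMPTY_GRAPH = {"directed": True, "multigraph": True, "graph": {}, "nodes": [], "links": []}


def load_or_create_graph_file(file_path):
    """Load a graph payload, creating and saving an empty one if the file is missing."""
    if os.path.exists(file_path):
        with open(file_path, "r", encoding="utf-8") as handle:
            return json.load(handle)

    logger.warning("File %s not found. Initializing an empty graph.", file_path)
    directory = os.path.dirname(file_path)
    if directory:
        # Create any missing parent directories before writing.
        os.makedirs(directory, exist_ok=True)
    with open(file_path, "w", encoding="utf-8") as handle:
        json.dump(EMPTY_GRAPH, handle)
    return dict(EMPTY_GRAPH)
```

Saving the empty state immediately means a second call takes the normal load path rather than logging the warning again.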

Line range hint 277-286: Verify node ID usage across the codebase

The changes to node ID handling could affect code that assumes specific node ID behavior.

✅ Verification successful

Let me gather more information about the graph initialization and node handling to make a comprehensive assessment.

Node ID handling is consistent with codebase requirements

The changes to node ID handling in the graph loading code are safe and consistent with the codebase's requirements:

The node ID assignment (node_data['id'] = node_id) maintains consistency with the MultiDiGraph initialization pattern used across the codebase
UUID handling for node IDs is properly implemented, as evidenced by the UUID conversion in the adapter and its usage in edge operations
Graph loading error handling follows the established pattern of initializing an empty MultiDiGraph when needed

🏁 Scripts executed

The following scripts were executed for the analysis:

Script:

#!/bin/bash
# Description: Check for direct node ID access patterns and graph loading error handling

# Search for direct node ID access patterns
echo "Checking node ID access patterns..."
rg -A 2 "node\[.?['\"]id['\"]|node_data\[.?['\"]id['\"]"

# Search for graph loading error handling
echo "Checking graph loading error handling..."
rg -A 2 "load_graph_from_file"

Length of output: 2879

Script:

#!/bin/bash
# Check for graph initialization and node operations
echo "Checking graph initialization patterns..."
rg -A 3 "MultiDiGraph\(\)"

echo "Checking node operations and modifications..."
ast-grep --pattern 'def $_(self, node$_)'

echo "Checking UUID handling in nodes..."
rg -A 2 "UUID.*node"

Length of output: 4247

coderabbitai

Actionable comments posted: 1

🧹 Outside diff range and nitpick comments (3)

cognee/tests/infrastructure/databases/test_index_graph_edges.py (3)

4-24: Consider enhancing test coverage with additional assertions

While the basic test structure is good, consider these improvements:

Verify the actual data being passed to index_data_points
Assert the structure of created EdgeType objects
Add assertions for the parameters passed to create_vector_index

Here's a suggested enhancement:

 @pytest.mark.asyncio
 async def test_index_graph_edges_success():
     """Test that index_graph_edges uses the index datapoints and creates vector index."""
     mock_graph_engine = AsyncMock()
+    test_relationships = [
+        [{"relationship_name": "rel1", "properties": {"key": "value"}}, 
+         {"relationship_name": "rel1", "properties": {"key2": "value2"}}],
+        [{"relationship_name": "rel2", "properties": {"key3": "value3"}}]
+    ]
-    mock_graph_engine.get_graph_data.return_value = (None, [
-        [{"relationship_name": "rel1"}, {"relationship_name": "rel1"}],
-        [{"relationship_name": "rel2"}]
-    ])
+    mock_graph_engine.get_graph_data.return_value = (None, test_relationships)

     mock_vector_engine = AsyncMock()

     with patch("cognee.tasks.storage.index_graph_edges.get_graph_engine", return_value=mock_graph_engine), \
          patch("cognee.tasks.storage.index_graph_edges.get_vector_engine", return_value=mock_vector_engine):

         from cognee.tasks.storage.index_graph_edges import index_graph_edges
         await index_graph_edges()

         mock_graph_engine.get_graph_data.assert_awaited_once()
         assert mock_vector_engine.create_vector_index.await_count == 1
         assert mock_vector_engine.index_data_points.await_count == 1
+        # Verify the data passed to create_vector_index
+        mock_vector_engine.create_vector_index.assert_awaited_with(
+            "graph_edges", dimension=1536  # Adjust dimension as per your config
+        )
+        # Verify the structure of indexed data
+        calls = mock_vector_engine.index_data_points.await_args_list
+        assert len(calls) == 1
+        indexed_data = calls[0].args[0]
+        assert all(isinstance(item, EdgeType) for item in indexed_data)
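The added assertions rely on `AsyncMock` recording the arguments of each awaited call. A minimal standalone sketch of that mechanism follows. The argument order passed to `index_data_points` here is hypothetical and should be checked against the real call site before asserting on a position.

```python
import asyncio
from unittest.mock import AsyncMock


async def exercise_mock():
    """Await a mocked method once so its arguments are recorded."""
    vector_engine = AsyncMock()
    await vector_engine.index_data_points(
        "EdgeType", "relationship_name", [{"relationship_name": "rel1"}]
    )
    return vector_engine


vector_engine = asyncio.run(exercise_mock())

# One entry per awaited call; .args holds the positional arguments.
calls = vector_engine.index_data_points.await_args_list
first_call_args = calls[0].args
```

If the data points are not the first positional argument in the real code, `calls[0].args[0]` in the suggested test would inspect the wrong value.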

26-43: Enhance test documentation for clarity

The test logic is correct, but the docstring could be more descriptive about the expected behavior.

 @pytest.mark.asyncio
 async def test_index_graph_edges_no_relationships():
-    """Test that index_graph_edges handles empty relationships correctly."""
+    """Test that index_graph_edges handles empty relationships correctly.
+    
+    Ensures that when no relationships are returned from the graph engine:
+    1. No vector index is created
+    2. No data points are indexed
+    3. The function completes without errors
+    """

1-56: Consider adding idempotency test

The current test suite covers the basic scenarios well. However, consider adding a test to verify that the indexing operation is idempotent - i.e., running it multiple times with the same data doesn't cause issues or duplicate entries.

This would help ensure the function can be safely retried in case of partial failures in a distributed system.
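One possible shape for such a test, sketched against a hypothetical in-memory index that upserts by id. The real vector engine's behavior on repeated writes is not shown in this PR, so this only illustrates the assertion pattern.

```python
import asyncio


class InMemoryIndex:
    """Hypothetical stand-in for a vector collection that upserts by point id."""

    def __init__(self):
        self.points = {}

    async def index_data_points(self, points):
        for point in points:
            # Writing the same id twice replaces the entry instead of duplicating it.
            self.points[point["id"]] = point


async def index_twice():
    index = InMemoryIndex()
    points = [{"id": "rel1", "text": "rel1"}, {"id": "rel2", "text": "rel2"}]
    await index.index_data_points(points)
    after_first = dict(index.points)
    await index.index_data_points(points)
    return after_first, dict(index.points)


after_first, after_second = asyncio.run(index_twice())
```

An idempotent run leaves `after_first` and `after_second` equal. A test against the real function would compare the collection contents in the same way.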

📜 Review details

Configuration used: CodeRabbit UI
Review profile: CHILL

📥 Commits

Reviewing files that changed from the base of the PR and between 36a5a27 and e6bf428.

📒 Files selected for processing (1)

cognee/tests/infrastructure/databases/test_index_graph_edges.py (1 hunks)

🔇 Additional comments (1)

cognee/tests/infrastructure/databases/test_index_graph_edges.py (1)

1-2: LGTM! Imports are appropriate for async testing

The necessary testing utilities are imported correctly.

cognee/tests/infrastructure/databases/test_index_graph_edges.py

coderabbitai

Actionable comments posted: 1

🧹 Outside diff range and nitpick comments (4)

cognee/tests/unit/infrastructure/databases/test_index_graph_edges.py (4)
8-11: Consider using type hints and constants for test data

The test data structure could be more explicit and maintainable:

Use type hints to clarify the expected data structure

Define relationship names as constants
+from typing import List, Dict
+
+TEST_RELATIONSHIP_1 = "rel1"
+TEST_RELATIONSHIP_2 = "rel2"
+
+test_data: List[List[Dict[str, str]]] = [
+    [{"relationship_name": TEST_RELATIONSHIP_1}, {"relationship_name": TEST_RELATIONSHIP_1}],
+    [{"relationship_name": TEST_RELATIONSHIP_2}]
+]
-    mock_graph_engine.get_graph_data.return_value = (None, [
-        [{"relationship_name": "rel1"}, {"relationship_name": "rel1"}],
-        [{"relationship_name": "rel2"}]
-    ])
+    mock_graph_engine.get_graph_data.return_value = (None, test_data)
27-29: Enhance test documentation and return value verification

The test documentation could be more explicit about the expected behavior and verify the return value.
-    """Test that index_graph_edges handles empty relationships correctly."""
+    """
+    Test that index_graph_edges handles empty relationships correctly.
+    
+    Expected behavior:
+    - Should not create vector index when no relationships exist
+    - Should not attempt to index any data points
+    - Should return None without raising exceptions
+    """
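The three expectations in that docstring can be expressed with `AsyncMock` assertions. The sketch below uses `index_edges_sketch`, a simplified hypothetical stand-in for `index_graph_edges`, and the vector engine call signatures are assumptions.

```python
import asyncio
from unittest.mock import AsyncMock


async def index_edges_sketch(graph_engine, vector_engine):
    """Hypothetical, simplified stand-in for index_graph_edges."""
    _, edges = await graph_engine.get_graph_data()
    names = sorted(
        {
            item["relationship_name"]
            for edge in edges
            for item in edge
            if isinstance(item, dict) and "relationship_name" in item
        }
    )
    if not names:
        # Nothing to index: skip index creation entirely.
        return None
    await vector_engine.create_vector_index("EdgeType", "relationship_name")
    await vector_engine.index_data_points("EdgeType", "relationship_name", names)
    return None


graph_engine = AsyncMock()
graph_engine.get_graph_data.return_value = (None, [])
vector_engine = AsyncMock()

result = asyncio.run(index_edges_sketch(graph_engine, vector_engine))

# Both assertions raise AssertionError if the mock was awaited.
vector_engine.create_vector_index.assert_not_awaited()
vector_engine.index_data_points.assert_not_awaited()
```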
45-54: Consider adding more error scenarios and cleanup verification

While the basic error case is covered, consider adding tests for:

Vector engine initialization failure

Partial failures during indexing

Cleanup/rollback behavior when errors occur

Example additional test:
@pytest.mark.asyncio
async def test_index_graph_edges_vector_engine_failure():
    """Test handling of vector engine initialization failure."""
    mock_graph_engine = AsyncMock()
    
    with patch("cognee.tasks.storage.index_graph_edges.get_graph_engine", 
              return_value=mock_graph_engine), \
         patch("cognee.tasks.storage.index_graph_edges.get_vector_engine", 
              side_effect=Exception("Vector engine failed")):

        from cognee.tasks.storage.index_graph_edges import index_graph_edges
        
        with pytest.raises(RuntimeError, match="Vector engine initialization failed"):
            await index_graph_edges()
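Note that the example test passes only if the function under test catches the original exception and re-raises a `RuntimeError` whose message matches "Vector engine initialization failed". A sketch of that wrapping pattern, with the hypothetical helper `init_engines` and an assumed message:

```python
def init_engines(get_graph_engine, get_vector_engine):
    """Return both engines, wrapping any vector engine failure in RuntimeError."""
    graph_engine = get_graph_engine()
    try:
        vector_engine = get_vector_engine()
    except Exception as error:
        # Chain the original error so the root cause stays visible in tracebacks.
        raise RuntimeError("Vector engine initialization failed") from error
    return graph_engine, vector_engine


def failing_vector_engine():
    raise Exception("Vector engine failed")


caught = None
try:
    init_engines(lambda: object(), failing_vector_engine)
except RuntimeError as error:
    caught = error
```

If the real function lets the original exception propagate unchanged, the test would need to expect that exception type and message instead.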
1-56: Consider adding integration tests

While the unit tests provide good coverage of the index_graph_edges function, consider adding integration tests to verify:

Actual interaction with the vector store

End-to-end behavior in the cognify pipeline

Performance characteristics with larger datasets

This would help ensure the feature works correctly in a production-like environment.

📜 Review details

Configuration used: CodeRabbit UI
Review profile: CHILL

📥 Commits

Reviewing files that changed from the base of the PR and between e6bf428 and 59035c3.

📒 Files selected for processing (1)

cognee/tests/unit/infrastructure/databases/test_index_graph_edges.py (1 hunks)

🔇 Additional comments (1)

cognee/tests/unit/infrastructure/databases/test_index_graph_edges.py (1)

1-3: LGTM: Imports are appropriate for async testing

The imports are minimal and correctly include the necessary components for async testing with pytest.

cognee/tests/unit/infrastructure/databases/test_index_graph_edges.py

chore: deletes comment from dynamic_steps_example

46ee513

dexters1 added the run-checks label Dec 4, 2024

Vasilije1990 previously approved these changes Dec 4, 2024

feat: implements graph edge indexing

c20ee11

hajdul88 dismissed Vasilije1990’s stale review via c20ee11 December 4, 2024 14:37

coderabbitai bot reviewed Dec 4, 2024

Vasilije1990 self-requested a review December 4, 2024 15:32

Merge branch 'main' into feature/cog-717-create-edge-embeddings-in-ve…

080143c

…ctor-databases

Vasilije1990 previously approved these changes Dec 4, 2024

fix: adds back the ids to the nodes after node_link_graph

f444ae2

hajdul88 dismissed Vasilije1990’s stale review via f444ae2 December 4, 2024 17:14

Merge branch 'main' into feature/cog-717-create-edge-embeddings-in-ve…

36a5a27

…ctor-databases

coderabbitai bot reviewed Dec 4, 2024

feat: implements tests for index_graph_edges method

e6bf428

hajdul88 changed the title ~~chore: deletes comment from dynamic_steps_example~~ Creates edge embeddings collection Dec 4, 2024

coderabbitai bot reviewed Dec 4, 2024

fix: puts index_graph_edges unit tests under unit test directory

59035c3

coderabbitai bot reviewed Dec 4, 2024

Vasilije1990 approved these changes Dec 4, 2024

hajdul88 added 2 commits December 4, 2024 20:49

Merge branch 'main' into feature/cog-717-create-edge-embeddings-in-ve…

7f192e1

…ctor-databases

Merge branch 'main' into feature/cog-717-create-edge-embeddings-in-ve…

68c3f42

…ctor-databases

hajdul88 merged commit acf0368 into main Dec 5, 2024
40 checks passed

hajdul88 deleted the feature/cog-717-create-edge-embeddings-in-vector-databases branch December 5, 2024 08:13

hajdul88 mentioned this pull request Dec 18, 2024

feat: First draft of relationship embeddings #379

Closed
