-
Notifications
You must be signed in to change notification settings - Fork 85
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Creates edge embeddings collection #251
Creates edge embeddings collection #251
Conversation
WalkthroughThe pull request introduces several changes across multiple files. Key modifications include the refactoring of the Changes
Possibly related PRs
Suggested reviewers
Poem
Thank you for using CodeRabbit. We offer it for free to the OSS community and would appreciate your support in helping us grow. If you find it useful, would you consider giving us a shout-out on your favorite social media? 🪧 TipsChatThere are 3 ways to chat with CodeRabbit:
Note: Be mindful of the bot's finite context window. It's strongly recommended to break down tasks such as reading entire modules into smaller chunks. For a focused discussion, use review comments to chat about specific files and their changes, instead of using the PR comments. CodeRabbit Commands (Invoked using PR comments)
Other keywords and placeholders
CodeRabbit Configuration File (
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Actionable comments posted: 0
🧹 Outside diff range and nitpick comments (3)
cognee/tasks/storage/index_graph_edges.py (2)
48-51
: Use more descriptive variable names for clarityConsider renaming the variables
text
andcount
torelationship_name
andedge_count
to enhance readability and better reflect their purpose.Suggested change:
- for text, count in edge_types.items(): - edge = EdgeType(relationship_name=text, number_of_edges=count) + for relationship_name, edge_count in edge_types.items(): + edge = EdgeType(relationship_name=relationship_name, number_of_edges=edge_count)
66-68
: Avoid variable shadowing to improve code maintainabilityThe variable
index_name
is being reused after splitting, which can lead to confusion. It's better to use a different variable name to prevent shadowing and enhance clarity.Suggested change:
- for index_name, indexable_points in index_points.items(): - index_name, field_name = index_name.split(".") + for index_key, indexable_points in index_points.items(): + index_name, field_name = index_key.split(".")cognee/modules/graph/models/EdgeType.py (1)
9-11
: Redundant use ofOptional
for an initialized class variableSince
_metadata
is initialized with a default value, usingOptional[dict]
is unnecessary. Typing it simply asdict
improves type clarity.Suggested change:
- _metadata: Optional[dict] = { + _metadata: dict = { "index_fields": ["relationship_name"], }
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
📒 Files selected for processing (3)
cognee/api/v1/cognify/cognify_v2.py
(2 hunks)cognee/modules/graph/models/EdgeType.py
(1 hunks)cognee/tasks/storage/index_graph_edges.py
(1 hunks)
🔇 Additional comments (2)
cognee/api/v1/cognify/cognify_v2.py (2)
21-21
: Properly importing index_graph_edges
function
The import statement correctly includes the index_graph_edges
function, ensuring it's available for use in the pipeline.
98-98
: Integrate index_graph_edges
into the cognify pipeline
Calling await index_graph_edges()
after the pipeline tasks appropriately incorporates the edge indexing step into the overall process.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Actionable comments posted: 0
🧹 Outside diff range and nitpick comments (2)
cognee/infrastructure/databases/graph/networkx/adapter.py (2)
277-279
: Consider validating existing node IDs before overwritingWhile explicitly setting node IDs is good practice, consider adding validation to detect and log cases where a node's existing 'id' attribute differs from its node_id in the graph, as this might indicate data inconsistency.
for node_id, node_data in self.graph.nodes(data=True): + if 'id' in node_data and node_data['id'] != node_id: + logger.warning(f"Node {node_id} had mismatched ID attribute: {node_data['id']}. Updating to match node_id.") node_data['id'] = node_id
Line range hint
287-289
: Consider more specific error handlingThe broad Exception catch could mask specific issues. Consider handling different error types separately for better debugging.
- except Exception: - logger.error("Failed to load graph from file: %s", file_path) + except json.JSONDecodeError as e: + logger.error("Failed to parse graph file %s: %s", file_path, e) + except IOError as e: + logger.error("IO error while loading graph from %s: %s", file_path, e) + except Exception as e: + logger.error("Unexpected error loading graph from %s: %s", file_path, e)
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
📒 Files selected for processing (1)
cognee/infrastructure/databases/graph/networkx/adapter.py
(1 hunks)
🔇 Additional comments (2)
cognee/infrastructure/databases/graph/networkx/adapter.py (2)
Line range hint 280-286
: LGTM: Robust empty graph initialization
The changes properly handle missing files by:
- Logging clear warning messages
- Creating necessary directories
- Maintaining graph type consistency
- Saving the empty graph state
Line range hint 277-286
: Verify node ID usage across the codebase
The changes to node ID handling could affect code that assumes specific node ID behavior.
✅ Verification successful
Let me gather more information about the graph initialization and node handling to make a comprehensive assessment.
Node ID handling is consistent with codebase requirements
The changes to node ID handling in the graph loading code are safe and consistent with the codebase's requirements:
- The node ID assignment (
node_data['id'] = node_id
) maintains consistency with the MultiDiGraph initialization pattern used across the codebase - UUID handling for node IDs is properly implemented, as evidenced by the UUID conversion in the adapter and its usage in edge operations
- Graph loading error handling follows the established pattern of initializing an empty MultiDiGraph when needed
🏁 Scripts executed
The following scripts were executed for the analysis:
Script:
#!/bin/bash
# Description: Check for direct node ID access patterns and graph loading error handling
# Search for direct node ID access patterns
echo "Checking node ID access patterns..."
rg -A 2 "node\[.?['\"]id['\"]|node_data\[.?['\"]id['\"]"
# Search for graph loading error handling
echo "Checking graph loading error handling..."
rg -A 2 "load_graph_from_file"
Length of output: 2879
Script:
#!/bin/bash
# Check for graph initialization and node operations
echo "Checking graph initialization patterns..."
rg -A 3 "MultiDiGraph\(\)"
echo "Checking node operations and modifications..."
ast-grep --pattern 'def $_(self, node$_)'
echo "Checking UUID handling in nodes..."
rg -A 2 "UUID.*node"
Length of output: 4247
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Actionable comments posted: 1
🧹 Outside diff range and nitpick comments (3)
cognee/tests/infrastructure/databases/test_index_graph_edges.py (3)
4-24
: Consider enhancing test coverage with additional assertionsWhile the basic test structure is good, consider these improvements:
- Verify the actual data being passed to
index_data_points
- Assert the structure of created EdgeType objects
- Add assertions for the parameters passed to
create_vector_index
Here's a suggested enhancement:
@pytest.mark.asyncio async def test_index_graph_edges_success(): """Test that index_graph_edges uses the index datapoints and creates vector index.""" mock_graph_engine = AsyncMock() + test_relationships = [ + [{"relationship_name": "rel1", "properties": {"key": "value"}}, + {"relationship_name": "rel1", "properties": {"key2": "value2"}}], + [{"relationship_name": "rel2", "properties": {"key3": "value3"}}] + ] - mock_graph_engine.get_graph_data.return_value = (None, [ - [{"relationship_name": "rel1"}, {"relationship_name": "rel1"}], - [{"relationship_name": "rel2"}] - ]) + mock_graph_engine.get_graph_data.return_value = (None, test_relationships) mock_vector_engine = AsyncMock() with patch("cognee.tasks.storage.index_graph_edges.get_graph_engine", return_value=mock_graph_engine), \ patch("cognee.tasks.storage.index_graph_edges.get_vector_engine", return_value=mock_vector_engine): from cognee.tasks.storage.index_graph_edges import index_graph_edges await index_graph_edges() mock_graph_engine.get_graph_data.assert_awaited_once() assert mock_vector_engine.create_vector_index.await_count == 1 assert mock_vector_engine.index_data_points.await_count == 1 + # Verify the data passed to create_vector_index + mock_vector_engine.create_vector_index.assert_awaited_with( + "graph_edges", dimension=1536 # Adjust dimension as per your config + ) + # Verify the structure of indexed data + calls = mock_vector_engine.index_data_points.await_args_list + assert len(calls) == 1 + indexed_data = calls[0].args[0] + assert all(isinstance(item, EdgeType) for item in indexed_data)
26-43
: Enhance test documentation for clarityThe test logic is correct, but the docstring could be more descriptive about the expected behavior.
@pytest.mark.asyncio async def test_index_graph_edges_no_relationships(): - """Test that index_graph_edges handles empty relationships correctly.""" + """Test that index_graph_edges handles empty relationships correctly. + + Ensures that when no relationships are returned from the graph engine: + 1. No vector index is created + 2. No data points are indexed + 3. The function completes without errors + """
1-56
: Consider adding idempotency testThe current test suite covers the basic scenarios well. However, consider adding a test to verify that the indexing operation is idempotent - i.e., running it multiple times with the same data doesn't cause issues or duplicate entries.
This would help ensure the function can be safely retried in case of partial failures in a distributed system.
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
📒 Files selected for processing (1)
cognee/tests/infrastructure/databases/test_index_graph_edges.py
(1 hunks)
🔇 Additional comments (1)
cognee/tests/infrastructure/databases/test_index_graph_edges.py (1)
1-2
: LGTM! Imports are appropriate for async testing
The necessary testing utilities are imported correctly.
cognee/tests/infrastructure/databases/test_index_graph_edges.py
Outdated
Show resolved
Hide resolved
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Actionable comments posted: 1
🧹 Outside diff range and nitpick comments (4)
cognee/tests/unit/infrastructure/databases/test_index_graph_edges.py (4)
8-11
: Consider using type hints and constants for test dataThe test data structure could be more explicit and maintainable:
- Use type hints to clarify the expected data structure
- Define relationship names as constants
+from typing import List, Dict + +TEST_RELATIONSHIP_1 = "rel1" +TEST_RELATIONSHIP_2 = "rel2" + +test_data: List[List[Dict[str, str]]] = [ + [{"relationship_name": TEST_RELATIONSHIP_1}, {"relationship_name": TEST_RELATIONSHIP_1}], + [{"relationship_name": TEST_RELATIONSHIP_2}] +] - mock_graph_engine.get_graph_data.return_value = (None, [ - [{"relationship_name": "rel1"}, {"relationship_name": "rel1"}], - [{"relationship_name": "rel2"}] - ]) + mock_graph_engine.get_graph_data.return_value = (None, test_data)
27-29
: Enhance test documentation and return value verificationThe test documentation could be more explicit about the expected behavior and verify the return value.
- """Test that index_graph_edges handles empty relationships correctly.""" + """ + Test that index_graph_edges handles empty relationships correctly. + + Expected behavior: + - Should not create vector index when no relationships exist + - Should not attempt to index any data points + - Should return None without raising exceptions + """
45-54
: Consider adding more error scenarios and cleanup verificationWhile the basic error case is covered, consider adding tests for:
- Vector engine initialization failure
- Partial failures during indexing
- Cleanup/rollback behavior when errors occur
Example additional test:
@pytest.mark.asyncio async def test_index_graph_edges_vector_engine_failure(): """Test handling of vector engine initialization failure.""" mock_graph_engine = AsyncMock() with patch("cognee.tasks.storage.index_graph_edges.get_graph_engine", return_value=mock_graph_engine), \ patch("cognee.tasks.storage.index_graph_edges.get_vector_engine", side_effect=Exception("Vector engine failed")): from cognee.tasks.storage.index_graph_edges import index_graph_edges with pytest.raises(RuntimeError, match="Vector engine initialization failed"): await index_graph_edges()
1-56
: Consider adding integration testsWhile the unit tests provide good coverage of the
index_graph_edges
function, consider adding integration tests to verify:
- Actual interaction with the vector store
- End-to-end behavior in the cognify pipeline
- Performance characteristics with larger datasets
This would help ensure the feature works correctly in a production-like environment.
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
📒 Files selected for processing (1)
cognee/tests/unit/infrastructure/databases/test_index_graph_edges.py
(1 hunks)
🔇 Additional comments (1)
cognee/tests/unit/infrastructure/databases/test_index_graph_edges.py (1)
1-3
: LGTM: Imports are appropriate for async testing
The imports are minimal and correctly include the necessary components for async testing with pytest.
Summary by CodeRabbit
Release Notes
New Features
Improvements
Bug Fixes
Tests