feat: deletes on the fly embeddings and uses edge collections #436

hajdul88 · 2025-01-13T10:51:06Z

Summary by CodeRabbit

Performance Improvements
- Streamlined vector distance mapping in graph processing
- Simplified result collection in vector search functionality
Code Optimization
- Removed duplicate vector database element filtering
- Enhanced efficiency of search and mapping methods
Refactor
- Simplified asynchronous method implementation
- Removed unnecessary commented code segments

coderabbitai · 2025-01-13T10:51:14Z

Walkthrough

The pull request introduces modifications to two key files in the Cognee project. In CogneeGraph.py, the map_vector_distances_to_graph_edges method has been streamlined with a more direct approach to retrieving and mapping vector distances. In brute_force_triplet_search.py, the duplicate filtering function delete_duplicated_vector_db_elements has been removed, replacing it with a simplified result mapping mechanism that may allow duplicate entries.

Changes

File	Change Summary
`cognee/modules/graph/cognee_graph/CogneeGraph.py`	- Added `asyncio` import - Simplified `map_vector_distances_to_graph_edges` method - Replaced multi-step embedding process with direct distance retrieval
`cognee/modules/retrieval/brute_force_triplet_search.py`	- Removed `delete_duplicated_vector_db_elements` function - Replaced duplicate filtering with direct collection-to-results mapping

Sequence Diagram

sequenceDiagram
    participant VectorEngine
    participant CogneeGraph
    participant GraphEdges

    CogneeGraph->>VectorEngine: get_distance_from_collection_elements
    VectorEngine-->>CogneeGraph: Return distances
    CogneeGraph->>GraphEdges: Assign distances to edges

Poem

🐰 In the graph of code, a rabbit's delight,
Distances mapped with algorithmic might!
Duplicates vanish, efficiency gleams,
Async and smooth, like moonlit streams 🌙
Cognee evolves, with each clever line! 🚀

Finishing Touches

📝 Generate Docstrings (Beta)

Thank you for using CodeRabbit. We offer it for free to the OSS community and would appreciate your support in helping us grow. If you find it useful, would you consider giving us a shout-out on your favorite social media?

❤️ Share

🪧 Tips

Chat

There are 3 ways to chat with CodeRabbit:

Review comments: Directly reply to a review comment made by CodeRabbit. Example:
- I pushed a fix in commit <commit_id>, please review it.
- Generate unit testing code for this file.
- Open a follow-up GitHub issue for this discussion.
Files and specific lines of code (under the "Files changed" tab): Tag @coderabbitai in a new review comment at the desired location with your query. Examples:
- @coderabbitai generate unit testing code for this file.
- @coderabbitai modularize this function.
PR comments: Tag @coderabbitai in a new PR comment to ask questions about the PR branch. For the best results, please provide a very specific query, as very limited context is provided in this mode. Examples:
- @coderabbitai gather interesting stats about this repository and render them as a table. Additionally, render a pie chart showing the language distribution in the codebase.
- @coderabbitai read src/utils.ts and generate unit testing code.
- @coderabbitai read the files in the src/scheduler package and generate a class diagram using mermaid and a README in the markdown format.
- @coderabbitai help me debug CodeRabbit configuration file.

Note: Be mindful of the bot's finite context window. It's strongly recommended to break down tasks such as reading entire modules into smaller chunks. For a focused discussion, use review comments to chat about specific files and their changes, instead of using the PR comments.

CodeRabbit Commands (Invoked using PR comments)

@coderabbitai pause to pause the reviews on a PR.
@coderabbitai resume to resume the paused reviews.
@coderabbitai review to trigger an incremental review. This is useful when automatic reviews are disabled for the repository.
@coderabbitai full review to do a full review from scratch and review all the files again.
@coderabbitai summary to regenerate the summary of the PR.
@coderabbitai generate docstrings to generate docstrings for this PR. (Beta)
@coderabbitai resolve resolve all the CodeRabbit review comments.
@coderabbitai configuration to show the current CodeRabbit configuration for the repository.
@coderabbitai help to get help.

Other keywords and placeholders

Add @coderabbitai ignore anywhere in the PR description to prevent this PR from being reviewed.
Add @coderabbitai summary to generate the high-level summary at a specific location in the PR description.
Add @coderabbitai anywhere in the PR title to generate the title automatically.

CodeRabbit Configuration File (`.coderabbit.yaml`)

You can programmatically configure CodeRabbit by adding a .coderabbit.yaml file to the root of your repository.
Please see the configuration documentation for more information.
If your editor has YAML language server enabled, you can add the path at the top of this file to enable auto-completion and validation: # yaml-language-server: $schema=https://coderabbit.ai/integrations/schema.v2.json

Documentation and Community

Visit our Documentation for detailed information on how to use CodeRabbit.
Join our Discord Community to get help, request features, and share feedback.
Follow us on X/Twitter for updates and announcements.

coderabbitai

Actionable comments posted: 2

📜 Review details

Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 30c69bd and 07fcce7.

📒 Files selected for processing (2)

cognee/modules/graph/cognee_graph/CogneeGraph.py (2 hunks)
cognee/modules/retrieval/brute_force_triplet_search.py (1 hunks)

⏰ Context from checks skipped due to timeout of 90000ms (17)

GitHub Check: run_notebook_test / test
GitHub Check: run_simple_example_test / test
GitHub Check: run_notebook_test / test
GitHub Check: run_dynamic_steps_example_test / test
GitHub Check: run_multimedia_example_test / test
GitHub Check: run_notebook_test / test
GitHub Check: test
GitHub Check: test
GitHub Check: test
GitHub Check: test
GitHub Check: test
GitHub Check: test
GitHub Check: test
GitHub Check: test
GitHub Check: test
GitHub Check: profiler
GitHub Check: docker-compose-test

🔇 Additional comments (3)

cognee/modules/retrieval/brute_force_triplet_search.py (1)

110-110: Verify impact of removing duplicate filtering

The simplified dictionary comprehension allows duplicates to persist in the results. While this aligns with the PR objective, we should verify that:

Downstream processing in CogneeGraph can handle duplicate entries

This change doesn't negatively impact the quality of search results

Let's check for potential duplicate handling in the codebase:

✅ Verification successful

Duplicate handling verified - safe to remove filtering

The removal of duplicate filtering is safe as the graph storage layer handles deduplication through the deduplicate_nodes_and_edges function, which is called during data point addition. Node uniqueness is also enforced at the graph level.
🏁 Scripts executed

The following scripts were executed for the analysis:

Script:
#!/bin/bash
# Search for duplicate handling patterns in vector and graph operations
rg -g '!{test_*,*.test}*' -A 3 'duplicat|unique' cognee/
Length of output: 8469
cognee/modules/graph/cognee_graph/CogneeGraph.py (2)

11-11: LGTM: Required import for async operations

The addition of asyncio import is necessary for the async operations in the refactored code.

130-149: Verify edge collection usage

The method now uses a single edge collection ("edge_type_relationship_name") instead of on-the-fly embeddings. Let's verify this collection exists and contains the expected data.

✅ Verification successful

Collection usage is properly managed

The "edge_type_relationship_name" collection is dynamically populated by the index_graph_edges task, which collects relationship types from graph edges, transforms them into EdgeType objects, and indexes them in the vector database. The collection name is a constant used consistently between the indexing task and the graph querying code.
🏁 Scripts executed

The following scripts were executed for the analysis:

Script:
#!/bin/bash
# Check for edge collection references and their structure
rg -g '!{test_*,*.test}*' "edge_type_relationship_name" cognee/

# Look for vector engine collection initialization
rg -g '!{test_*,*.test}*' -B 3 -A 3 "create_collection|init.*collection" cognee/infrastructure/databases/vector/
Length of output: 11001

Script:
#!/bin/bash
# Look for edge relationship initialization and configuration
rg -g '!{test_*,*.test}*' -B 3 -A 3 "relationship.*type|edge.*type" cognee/

# Check for configuration files and initialization
fd -e yaml -e yml -e json . cognee/ --exec cat {}
Length of output: 22780

coderabbitai · 2025-01-13T10:53:58Z

cognee/modules/graph/cognee_graph/CogneeGraph.py

+            embedding_map = {result.payload["text"]: result.score for result in edge_distances}

-            # Step 4: Assign precomputed distances to edges
            for edge in self.edges:
                relationship_type = edge.attributes.get("relationship_type")
                if not relationship_type or relationship_type not in embedding_map:
                    print(f"Edge {edge} has an unknown or missing relationship type.")
                    continue

-                # Assign the precomputed distance
                edge.attributes["vector_distance"] = embedding_map[relationship_type]


🛠️ Refactor suggestion

Improve edge distance mapping robustness

The current implementation has potential issues:

Silent failures with print statements

No validation of embedding_map values

No handling of edge cases (empty results, invalid scores)

Consider this improvement:

- embedding_map = {result.payload["text"]: result.score for result in edge_distances} + embedding_map = {} + for result in edge_distances: + if "text" not in result.payload: + raise InvalidValueError(f"Missing 'text' in payload: {result.payload}") + if not isinstance(result.score, (int, float)): + raise InvalidValueError(f"Invalid score type: {type(result.score)}") + embedding_map[result.payload["text"]] = result.score for edge in self.edges: relationship_type = edge.attributes.get("relationship_type") if not relationship_type or relationship_type not in embedding_map: - print(f"Edge {edge} has an unknown or missing relationship type.") + logging.warning("Edge %s has an unknown or missing relationship type", edge) continue edge.attributes["vector_distance"] = embedding_map[relationship_type]

Committable suggestion skipped: line range outside the PR's diff.

coderabbitai · 2025-01-13T10:53:58Z

cognee/modules/graph/cognee_graph/CogneeGraph.py

+    async def map_vector_distances_to_graph_edges(self, vector_engine, query) -> None:
        try:
-            # Step 1: Generate the query embedding
            query_vector = await vector_engine.embed_data([query])
            query_vector = query_vector[0]
            if query_vector is None or len(query_vector) == 0:
                raise ValueError("Failed to generate query embedding.")

-            # Step 2: Collect all unique relationship types
-            unique_relationship_types = set()
-            for edge in self.edges:
-                relationship_type = edge.attributes.get("relationship_type")
-                if relationship_type:
-                    unique_relationship_types.add(relationship_type)
-
-            # Step 3: Embed all unique relationship types
-            unique_relationship_types = list(unique_relationship_types)
-            relationship_type_embeddings = await vector_engine.embed_data(unique_relationship_types)
-
-            # Step 4: Map relationship types to their embeddings and calculate distances
-            embedding_map = {}
-            for relationship_type, embedding in zip(
-                unique_relationship_types, relationship_type_embeddings
-            ):
-                edge_vector = np.array(embedding)
-
-                # Calculate cosine similarity
-                similarity = np.dot(query_vector, edge_vector) / (
-                    np.linalg.norm(query_vector) * np.linalg.norm(edge_vector)
-                )
-                distance = 1 - similarity
+            edge_distances = await vector_engine.get_distance_from_collection_elements(
+                "edge_type_relationship_name", query_text=query
+            )


🛠️ Refactor suggestion

Enhance error handling and logging

The method has several areas that could benefit from improved error handling:

Query vector validation could raise a custom exception

Collection name should be configurable

Consider using proper logging instead of print statements

Consider this improvement:

async def map_vector_distances_to_graph_edges(self, vector_engine, query) -> None: + EDGE_TYPE_COLLECTION = "edge_type_relationship_name" # Move to config try: query_vector = await vector_engine.embed_data([query]) query_vector = query_vector[0] if query_vector is None or len(query_vector) == 0: - raise ValueError("Failed to generate query embedding.") + raise InvalidValueError("Failed to generate query embedding: empty or null vector") edge_distances = await vector_engine.get_distance_from_collection_elements( - "edge_type_relationship_name", query_text=query + EDGE_TYPE_COLLECTION, query_text=query ) + if not edge_distances: + raise InvalidValueError(f"No distances retrieved from {EDGE_TYPE_COLLECTION}")

Committable suggestion skipped: line range outside the PR's diff.

…-from-bruteforce-search-and

feat: deletes on the fly embeddings as uses edge collections

07fcce7

hajdul88 added the run-checks label Jan 13, 2025

coderabbitai bot reviewed Jan 13, 2025

View reviewed changes

Merge branch 'dev' into feature/cog-762-deleting-in-memory-embeddings…

d2f06e2

…-from-bruteforce-search-and

hajdul88 requested review from borisarzentar, Vasilije1990, dexters1, alekszievr and lxobr January 13, 2025 13:26

Merge branch 'dev' into feature/cog-762-deleting-in-memory-embeddings…

8621127

…-from-bruteforce-search-and

borisarzentar approved these changes Jan 13, 2025

View reviewed changes

hajdul88 merged commit 25d8f5e into dev Jan 13, 2025
27 checks passed

hajdul88 deleted the feature/cog-762-deleting-in-memory-embeddings-from-bruteforce-search-and branch January 13, 2025 16:24

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

feat: deletes on the fly embeddings and uses edge collections #436

feat: deletes on the fly embeddings and uses edge collections #436

hajdul88 commented Jan 13, 2025 •

edited by coderabbitai bot

Loading

coderabbitai bot commented Jan 13, 2025 •

edited

Loading

Chat

CodeRabbit Commands (Invoked using PR comments)

Other keywords and placeholders

CodeRabbit Configuration File (`.coderabbit.yaml`)

Documentation and Community

coderabbitai bot left a comment

coderabbitai bot Jan 13, 2025

coderabbitai bot Jan 13, 2025

feat: deletes on the fly embeddings and uses edge collections #436

feat: deletes on the fly embeddings and uses edge collections #436

Conversation

hajdul88 commented Jan 13, 2025 • edited by coderabbitai bot Loading

Summary by CodeRabbit

coderabbitai bot commented Jan 13, 2025 • edited Loading

Walkthrough

Changes

Sequence Diagram

Poem

Finishing Touches

Chat

CodeRabbit Commands (Invoked using PR comments)

Other keywords and placeholders

CodeRabbit Configuration File (.coderabbit.yaml)

Documentation and Community

coderabbitai bot left a comment

Choose a reason for hiding this comment

coderabbitai bot Jan 13, 2025

Choose a reason for hiding this comment

coderabbitai bot Jan 13, 2025

Choose a reason for hiding this comment

hajdul88 commented Jan 13, 2025 •

edited by coderabbitai bot

Loading

coderabbitai bot commented Jan 13, 2025 •

edited

Loading

CodeRabbit Configuration File (`.coderabbit.yaml`)