Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

feat: deletes on the fly embeddings and uses edge collections #436

Conversation

hajdul88
Copy link
Contributor

@hajdul88 hajdul88 commented Jan 13, 2025

Summary by CodeRabbit

  • Performance Improvements

    • Streamlined vector distance mapping in graph processing
    • Simplified result collection in vector search functionality
  • Code Optimization

    • Removed duplicate vector database element filtering
    • Enhanced efficiency of search and mapping methods
  • Refactor

    • Simplified asynchronous method implementation
    • Removed unnecessary commented code segments

Copy link
Contributor

coderabbitai bot commented Jan 13, 2025

Walkthrough

The pull request introduces modifications to two key files in the Cognee project. In CogneeGraph.py, the map_vector_distances_to_graph_edges method has been streamlined with a more direct approach to retrieving and mapping vector distances. In brute_force_triplet_search.py, the duplicate filtering function delete_duplicated_vector_db_elements has been removed, replacing it with a simplified result mapping mechanism that may allow duplicate entries.

Changes

File Change Summary
cognee/modules/graph/cognee_graph/CogneeGraph.py - Added asyncio import
- Simplified map_vector_distances_to_graph_edges method
- Replaced multi-step embedding process with direct distance retrieval
cognee/modules/retrieval/brute_force_triplet_search.py - Removed delete_duplicated_vector_db_elements function
- Replaced duplicate filtering with direct collection-to-results mapping

Sequence Diagram

sequenceDiagram
    participant VectorEngine
    participant CogneeGraph
    participant GraphEdges

    CogneeGraph->>VectorEngine: get_distance_from_collection_elements
    VectorEngine-->>CogneeGraph: Return distances
    CogneeGraph->>GraphEdges: Assign distances to edges
Loading

Poem

🐰 In the graph of code, a rabbit's delight,
Distances mapped with algorithmic might!
Duplicates vanish, efficiency gleams,
Async and smooth, like moonlit streams 🌙
Cognee evolves, with each clever line! 🚀

Finishing Touches

  • 📝 Generate Docstrings (Beta)

Thank you for using CodeRabbit. We offer it for free to the OSS community and would appreciate your support in helping us grow. If you find it useful, would you consider giving us a shout-out on your favorite social media?

❤️ Share
🪧 Tips

Chat

There are 3 ways to chat with CodeRabbit:

  • Review comments: Directly reply to a review comment made by CodeRabbit. Example:
    • I pushed a fix in commit <commit_id>, please review it.
    • Generate unit testing code for this file.
    • Open a follow-up GitHub issue for this discussion.
  • Files and specific lines of code (under the "Files changed" tab): Tag @coderabbitai in a new review comment at the desired location with your query. Examples:
    • @coderabbitai generate unit testing code for this file.
    • @coderabbitai modularize this function.
  • PR comments: Tag @coderabbitai in a new PR comment to ask questions about the PR branch. For the best results, please provide a very specific query, as very limited context is provided in this mode. Examples:
    • @coderabbitai gather interesting stats about this repository and render them as a table. Additionally, render a pie chart showing the language distribution in the codebase.
    • @coderabbitai read src/utils.ts and generate unit testing code.
    • @coderabbitai read the files in the src/scheduler package and generate a class diagram using mermaid and a README in the markdown format.
    • @coderabbitai help me debug CodeRabbit configuration file.

Note: Be mindful of the bot's finite context window. It's strongly recommended to break down tasks such as reading entire modules into smaller chunks. For a focused discussion, use review comments to chat about specific files and their changes, instead of using the PR comments.

CodeRabbit Commands (Invoked using PR comments)

  • @coderabbitai pause to pause the reviews on a PR.
  • @coderabbitai resume to resume the paused reviews.
  • @coderabbitai review to trigger an incremental review. This is useful when automatic reviews are disabled for the repository.
  • @coderabbitai full review to do a full review from scratch and review all the files again.
  • @coderabbitai summary to regenerate the summary of the PR.
  • @coderabbitai generate docstrings to generate docstrings for this PR. (Beta)
  • @coderabbitai resolve resolve all the CodeRabbit review comments.
  • @coderabbitai configuration to show the current CodeRabbit configuration for the repository.
  • @coderabbitai help to get help.

Other keywords and placeholders

  • Add @coderabbitai ignore anywhere in the PR description to prevent this PR from being reviewed.
  • Add @coderabbitai summary to generate the high-level summary at a specific location in the PR description.
  • Add @coderabbitai anywhere in the PR title to generate the title automatically.

CodeRabbit Configuration File (.coderabbit.yaml)

  • You can programmatically configure CodeRabbit by adding a .coderabbit.yaml file to the root of your repository.
  • Please see the configuration documentation for more information.
  • If your editor has YAML language server enabled, you can add the path at the top of this file to enable auto-completion and validation: # yaml-language-server: $schema=https://coderabbit.ai/integrations/schema.v2.json

Documentation and Community

  • Visit our Documentation for detailed information on how to use CodeRabbit.
  • Join our Discord Community to get help, request features, and share feedback.
  • Follow us on X/Twitter for updates and announcements.

Copy link
Contributor

@coderabbitai coderabbitai bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 2

📜 Review details

Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 30c69bd and 07fcce7.

📒 Files selected for processing (2)
  • cognee/modules/graph/cognee_graph/CogneeGraph.py (2 hunks)
  • cognee/modules/retrieval/brute_force_triplet_search.py (1 hunks)
⏰ Context from checks skipped due to timeout of 90000ms (17)
  • GitHub Check: run_notebook_test / test
  • GitHub Check: run_simple_example_test / test
  • GitHub Check: run_notebook_test / test
  • GitHub Check: run_dynamic_steps_example_test / test
  • GitHub Check: run_multimedia_example_test / test
  • GitHub Check: run_notebook_test / test
  • GitHub Check: test
  • GitHub Check: test
  • GitHub Check: test
  • GitHub Check: test
  • GitHub Check: test
  • GitHub Check: test
  • GitHub Check: test
  • GitHub Check: test
  • GitHub Check: test
  • GitHub Check: profiler
  • GitHub Check: docker-compose-test
🔇 Additional comments (3)
cognee/modules/retrieval/brute_force_triplet_search.py (1)

110-110: Verify impact of removing duplicate filtering

The simplified dictionary comprehension allows duplicates to persist in the results. While this aligns with the PR objective, we should verify that:

  1. Downstream processing in CogneeGraph can handle duplicate entries
  2. This change doesn't negatively impact the quality of search results

Let's check for potential duplicate handling in the codebase:

✅ Verification successful

Duplicate handling verified - safe to remove filtering

The removal of duplicate filtering is safe as the graph storage layer handles deduplication through the deduplicate_nodes_and_edges function, which is called during data point addition. Node uniqueness is also enforced at the graph level.

🏁 Scripts executed

The following scripts were executed for the analysis:

Script:

#!/bin/bash
# Search for duplicate handling patterns in vector and graph operations
rg -g '!{test_*,*.test}*' -A 3 'duplicat|unique' cognee/

Length of output: 8469

cognee/modules/graph/cognee_graph/CogneeGraph.py (2)

11-11: LGTM: Required import for async operations

The addition of asyncio import is necessary for the async operations in the refactored code.


130-149: Verify edge collection usage

The method now uses a single edge collection ("edge_type_relationship_name") instead of on-the-fly embeddings. Let's verify this collection exists and contains the expected data.

✅ Verification successful

Collection usage is properly managed

The "edge_type_relationship_name" collection is dynamically populated by the index_graph_edges task, which collects relationship types from graph edges, transforms them into EdgeType objects, and indexes them in the vector database. The collection name is a constant used consistently between the indexing task and the graph querying code.

🏁 Scripts executed

The following scripts were executed for the analysis:

Script:

#!/bin/bash
# Check for edge collection references and their structure
rg -g '!{test_*,*.test}*' "edge_type_relationship_name" cognee/

# Look for vector engine collection initialization
rg -g '!{test_*,*.test}*' -B 3 -A 3 "create_collection|init.*collection" cognee/infrastructure/databases/vector/

Length of output: 11001


Script:

#!/bin/bash
# Look for edge relationship initialization and configuration
rg -g '!{test_*,*.test}*' -B 3 -A 3 "relationship.*type|edge.*type" cognee/

# Check for configuration files and initialization
fd -e yaml -e yml -e json . cognee/ --exec cat {}

Length of output: 22780

Comment on lines +141 to 149
embedding_map = {result.payload["text"]: result.score for result in edge_distances}

# Step 4: Assign precomputed distances to edges
for edge in self.edges:
relationship_type = edge.attributes.get("relationship_type")
if not relationship_type or relationship_type not in embedding_map:
print(f"Edge {edge} has an unknown or missing relationship type.")
continue

# Assign the precomputed distance
edge.attributes["vector_distance"] = embedding_map[relationship_type]
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🛠️ Refactor suggestion

Improve edge distance mapping robustness

The current implementation has potential issues:

  1. Silent failures with print statements
  2. No validation of embedding_map values
  3. No handling of edge cases (empty results, invalid scores)

Consider this improvement:

-        embedding_map = {result.payload["text"]: result.score for result in edge_distances}
+        embedding_map = {}
+        for result in edge_distances:
+            if "text" not in result.payload:
+                raise InvalidValueError(f"Missing 'text' in payload: {result.payload}")
+            if not isinstance(result.score, (int, float)):
+                raise InvalidValueError(f"Invalid score type: {type(result.score)}")
+            embedding_map[result.payload["text"]] = result.score

         for edge in self.edges:
             relationship_type = edge.attributes.get("relationship_type")
             if not relationship_type or relationship_type not in embedding_map:
-                print(f"Edge {edge} has an unknown or missing relationship type.")
+                logging.warning("Edge %s has an unknown or missing relationship type", edge)
                 continue

             edge.attributes["vector_distance"] = embedding_map[relationship_type]

Committable suggestion skipped: line range outside the PR's diff.

Comment on lines +130 to +139
async def map_vector_distances_to_graph_edges(self, vector_engine, query) -> None:
try:
# Step 1: Generate the query embedding
query_vector = await vector_engine.embed_data([query])
query_vector = query_vector[0]
if query_vector is None or len(query_vector) == 0:
raise ValueError("Failed to generate query embedding.")

# Step 2: Collect all unique relationship types
unique_relationship_types = set()
for edge in self.edges:
relationship_type = edge.attributes.get("relationship_type")
if relationship_type:
unique_relationship_types.add(relationship_type)

# Step 3: Embed all unique relationship types
unique_relationship_types = list(unique_relationship_types)
relationship_type_embeddings = await vector_engine.embed_data(unique_relationship_types)

# Step 4: Map relationship types to their embeddings and calculate distances
embedding_map = {}
for relationship_type, embedding in zip(
unique_relationship_types, relationship_type_embeddings
):
edge_vector = np.array(embedding)

# Calculate cosine similarity
similarity = np.dot(query_vector, edge_vector) / (
np.linalg.norm(query_vector) * np.linalg.norm(edge_vector)
)
distance = 1 - similarity
edge_distances = await vector_engine.get_distance_from_collection_elements(
"edge_type_relationship_name", query_text=query
)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🛠️ Refactor suggestion

Enhance error handling and logging

The method has several areas that could benefit from improved error handling:

  1. Query vector validation could raise a custom exception
  2. Collection name should be configurable
  3. Consider using proper logging instead of print statements

Consider this improvement:

 async def map_vector_distances_to_graph_edges(self, vector_engine, query) -> None:
+    EDGE_TYPE_COLLECTION = "edge_type_relationship_name"  # Move to config
     try:
         query_vector = await vector_engine.embed_data([query])
         query_vector = query_vector[0]
         if query_vector is None or len(query_vector) == 0:
-            raise ValueError("Failed to generate query embedding.")
+            raise InvalidValueError("Failed to generate query embedding: empty or null vector")

         edge_distances = await vector_engine.get_distance_from_collection_elements(
-            "edge_type_relationship_name", query_text=query
+            EDGE_TYPE_COLLECTION, query_text=query
         )
+        if not edge_distances:
+            raise InvalidValueError(f"No distances retrieved from {EDGE_TYPE_COLLECTION}")

Committable suggestion skipped: line range outside the PR's diff.

@hajdul88 hajdul88 merged commit 25d8f5e into dev Jan 13, 2025
27 checks passed
@hajdul88 hajdul88 deleted the feature/cog-762-deleting-in-memory-embeddings-from-bruteforce-search-and branch January 13, 2025 16:24
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants