Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Feature/cog 539 implementing additional retriever approaches #262

Merged

Conversation

hajdul88
Copy link
Contributor

@hajdul88 hajdul88 commented Dec 6, 2024

Summary by CodeRabbit

  • New Features

    • Introduced a new CodeSummary class for improved data handling related to code summaries.
    • Added asynchronous search functionality for retrieving code parts based on description queries.
    • Enhanced Node and Edge classes with new methods for better data retrieval.
  • Bug Fixes

    • Improved error handling in the add_edge method to log messages instead of raising exceptions.
  • Tests

    • Removed the test_add_duplicate_edge function from the test suite for the CogneeGraph class.
    • Added asynchronous unit tests for the code_description_to_code_part function, covering various scenarios including input validation and error handling.

Copy link
Contributor

coderabbitai bot commented Dec 6, 2024

Walkthrough

The changes introduce a new class named CodeSummary in the cognee/tasks/summarization/models.py file. This class inherits from DataPoint and is associated with a new table called code_summary. It features two attributes: text, which is a string, and made_from, which is of type CodeFile. The _metadata dictionary for CodeSummary includes index_fields set to ["text"], aligning it with the existing TextSummary class, which remains unchanged. Additionally, modifications were made to the CogneeGraph class to alter error handling, and new methods were introduced in the Node and Edge classes to enhance functionality. A new file for asynchronous code part searching was also added.

Changes

File Path Change Summary
cognee/tasks/summarization/models.py Added new class CodeSummary inheriting from DataPoint, with table name code_summary, attributes text and made_from, and _metadata specifying index_fields.
cognee/modules/graph/cognee_graph/CogneeGraph.py Modified add_edge method to log existing edges instead of raising an error; updated project_graph_from_db for better error handling.
cognee/modules/graph/cognee_graph/CogneeGraphElements.py Added methods get_skeleton_edges and get_skeleton_neighbours in Node; modified get_attribute method in Edge and added get_source_node and get_destination_node methods.
cognee/modules/retrieval/description_to_codepart_search.py Introduced asynchronous functions code_description_to_code_part_search and code_description_to_code_part for searching code parts based on descriptions.
cognee/tests/unit/modules/graph/cognee_graph_test.py Removed test_add_duplicate_edge function, which tested for EntityAlreadyExistsError when adding duplicate edges.
cognee/tests/unit/modules/retriever/test_description_to_codepart_search.py Added asynchronous unit tests for code_description_to_code_part function covering various scenarios and error handling.

Possibly related PRs

  • Small cleanup pull request #201: The changes in this PR involve modifications to the classify_documents function, which may relate to how documents are processed and classified, potentially impacting the functionality of the CodeSummary class introduced in the main PR.
  • fix: refactor get_graph_from_model to return nodes and edges correctly #257: This PR refactors the get_graph_from_model function, which may interact with the CodeSummary class if it is involved in graph-related operations, suggesting a connection to the changes made in the main PR.

Suggested reviewers

  • Vasilije1990

🐇 "In the code, a new class hops,
With fields to store, it never stops.
CodeSummary joins the data dance,
In the database, it takes a chance.
With text and files, it finds its way,
A bright new feature here to stay!" 🐇


📜 Recent review details

Configuration used: CodeRabbit UI
Review profile: CHILL

📥 Commits

Reviewing files that changed from the base of the PR and between 00a3668 and a39e692.

📒 Files selected for processing (3)
  • cognee/modules/graph/cognee_graph/CogneeGraphElements.py (2 hunks)
  • cognee/modules/retrieval/description_to_codepart_search.py (1 hunks)
  • cognee/tests/unit/modules/retriever/test_description_to_codepart_search.py (1 hunks)
🚧 Files skipped from review as they are similar to previous changes (3)
  • cognee/modules/retrieval/description_to_codepart_search.py
  • cognee/tests/unit/modules/retriever/test_description_to_codepart_search.py
  • cognee/modules/graph/cognee_graph/CogneeGraphElements.py

Thank you for using CodeRabbit. We offer it for free to the OSS community and would appreciate your support in helping us grow. If you find it useful, would you consider giving us a shout-out on your favorite social media?

❤️ Share
🪧 Tips

Chat

There are 3 ways to chat with CodeRabbit:

  • Review comments: Directly reply to a review comment made by CodeRabbit. Example:
    • I pushed a fix in commit <commit_id>, please review it.
    • Generate unit testing code for this file.
    • Open a follow-up GitHub issue for this discussion.
  • Files and specific lines of code (under the "Files changed" tab): Tag @coderabbitai in a new review comment at the desired location with your query. Examples:
    • @coderabbitai generate unit testing code for this file.
    • @coderabbitai modularize this function.
  • PR comments: Tag @coderabbitai in a new PR comment to ask questions about the PR branch. For the best results, please provide a very specific query, as very limited context is provided in this mode. Examples:
    • @coderabbitai gather interesting stats about this repository and render them as a table. Additionally, render a pie chart showing the language distribution in the codebase.
    • @coderabbitai read src/utils.ts and generate unit testing code.
    • @coderabbitai read the files in the src/scheduler package and generate a class diagram using mermaid and a README in the markdown format.
    • @coderabbitai help me debug CodeRabbit configuration file.

Note: Be mindful of the bot's finite context window. It's strongly recommended to break down tasks such as reading entire modules into smaller chunks. For a focused discussion, use review comments to chat about specific files and their changes, instead of using the PR comments.

CodeRabbit Commands (Invoked using PR comments)

  • @coderabbitai pause to pause the reviews on a PR.
  • @coderabbitai resume to resume the paused reviews.
  • @coderabbitai review to trigger an incremental review. This is useful when automatic reviews are disabled for the repository.
  • @coderabbitai full review to do a full review from scratch and review all the files again.
  • @coderabbitai summary to regenerate the summary of the PR.
  • @coderabbitai generate docstrings to generate docstrings for this PR. (Experiment)
  • @coderabbitai resolve resolve all the CodeRabbit review comments.
  • @coderabbitai configuration to show the current CodeRabbit configuration for the repository.
  • @coderabbitai help to get help.

Other keywords and placeholders

  • Add @coderabbitai ignore anywhere in the PR description to prevent this PR from being reviewed.
  • Add @coderabbitai summary to generate the high-level summary at a specific location in the PR description.
  • Add @coderabbitai anywhere in the PR title to generate the title automatically.

CodeRabbit Configuration File (.coderabbit.yaml)

  • You can programmatically configure CodeRabbit by adding a .coderabbit.yaml file to the root of your repository.
  • Please see the configuration documentation for more information.
  • If your editor has YAML language server enabled, you can add the path at the top of this file to enable auto-completion and validation: # yaml-language-server: $schema=https://coderabbit.ai/integrations/schema.v2.json

Documentation and Community

  • Visit our Documentation for detailed information on how to use CodeRabbit.
  • Join our Discord Community to get help, request features, and share feedback.
  • Follow us on X/Twitter for updates and announcements.

Base automatically changed from fix/graph-extraction-from-model to main December 6, 2024 11:52
Copy link
Contributor

@coderabbitai coderabbitai bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 0

🧹 Outside diff range and nitpick comments (3)
cognee/tasks/summarization/models.py (3)

17-17: Add class documentation

Please add a docstring to document the purpose and usage of the CodeSummary class.

 class CodeSummary(DataPoint):
+    """Represents a summary of code content with indexable text.
+
+    Attributes:
+        text (str): The summary text content
+        made_from (CodeFile): The source code file this summary is derived from
+    """
     __tablename__ = "code_summary"

Line range hint 17-22: Consider using more explicit type hints

The type hints could be more explicit using the typing module for better static type checking.

+from typing import ClassVar, Dict, List
+
 class CodeSummary(DataPoint):
     __tablename__ = "code_summary"
-    text: str
-    made_from: CodeFile
+    text: str
+    made_from: CodeFile
 
-    _metadata: dict = {
+    _metadata: ClassVar[Dict[str, List[str]]] = {
         "index_fields": ["text"],
     }

Line range hint 21-22: Consider code-specific indexing strategy

The current indexing strategy is identical to TextSummary. Consider whether code summaries might benefit from additional indexed fields specific to code content (e.g., function names, class names).

📜 Review details

Configuration used: CodeRabbit UI
Review profile: CHILL

📥 Commits

Reviewing files that changed from the base of the PR and between 348610e and fb0f97c.

📒 Files selected for processing (1)
  • cognee/tasks/summarization/models.py (1 hunks)
🔇 Additional comments (1)
cognee/tasks/summarization/models.py (1)

Line range hint 17-22: Verify CodeFile relationship implementation

Let's verify that the CodeFile relationship is properly implemented in the database schema and that all necessary foreign key constraints are in place.

@hajdul88
Copy link
Contributor Author

hajdul88 commented Dec 6, 2024

This is work in progress please do not merge to main!

Copy link

gitguardian bot commented Dec 6, 2024

⚠️ GitGuardian has uncovered 4 secrets following the scan of your pull request.

Please consider investigating the findings and remediating the incidents. Failure to do so may lead to compromising the associated services or software components.

🔎 Detected hardcoded secrets in your pull request
GitGuardian id GitGuardian status Secret Commit Filename
9573981 Triggered Generic Password 51b825e .github/workflows/test_deduplication.yml View secret
9573981 Triggered Generic Password 51b825e cognee/tests/test_deduplication.py View secret
9573981 Triggered Generic Password 3ef3376 .github/workflows/test_deduplication.yml View secret
9573981 Triggered Generic Password 3ef3376 cognee/tests/test_deduplication.py View secret
🛠 Guidelines to remediate hardcoded secrets
  1. Understand the implications of revoking this secret by investigating where it is used in your code.
  2. Replace and store your secrets safely. Learn here the best practices.
  3. Revoke and rotate these secrets.
  4. If possible, rewrite git history. Rewriting git history is not a trivial act. You might completely break other contributing developers' workflow and you risk accidentally deleting legitimate data.

To avoid such incidents in the future consider


🦉 GitGuardian detects secrets in your source code to help developers and security teams secure the modern development process. You are seeing this because you or someone else with access to this repository has authorized GitGuardian to scan your pull request.

…aches' of https://github.com/topoteretes/cognee into feature/cog-539-implementing-additional-retriever-approaches
Copy link
Contributor

@coderabbitai coderabbitai bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 4

🧹 Outside diff range and nitpick comments (9)
cognee/modules/retrieval/description_to_codepart_search.py (3)

12-12: PEP 8 Style: Remove spaces around '=' in default arguments

In the function definition, there are extra spaces around the = in default arguments. According to PEP 8 style guidelines, there should be no spaces around = in function signatures with default parameter values.

Apply this diff to fix the style issue:

- async def code_description_to_code_part_search(query: str, user: User = None, top_k = 2) -> list:
+ async def code_description_to_code_part_search(query: str, user: User=None, top_k=2) -> list:

69-69: Reminder: Address the TODO comment related to codegraph structure

The TODO comment indicates that this section needs updating when the codegraph structure changes. Proactively addressing this will prevent future compatibility issues.

Would you like assistance in updating this section to accommodate the upcoming codegraph structure changes?


84-94: Consider moving test code under __main__ to a separate test module

While the test code in the __main__ block is useful for demonstration, it's better practice to place tests in a separate test file using a testing framework like pytest. This improves maintainability and aligns with best testing practices.

cognee/modules/graph/cognee_graph/CogneeGraphElements.py (2)

68-73: Add return type annotations to new methods

For better code clarity and type checking, consider adding return type annotations to the get_skeleton_edges and get_skeleton_neighbours methods in the Node class.

Apply this diff:

def get_skeleton_edges(self) -> List["Edge"]:
    return self.skeleton_edges

def get_skeleton_neighbours(self) -> List["Node"]:
    return self.skeleton_neighbours

121-126: Add return type annotations to new methods

To improve code readability and facilitate type checking, add return type annotations to the get_node_from and get_node_to methods in the Edge class.

Apply this diff:

def get_node_from(self) -> "Node":
    return self.node1

def get_node_to(self) -> "Node":
    return self.node2
cognee/modules/graph/cognee_graph/CogneeGraph.py (4)

Line range hint 98-100: Replace print statements with logging and reconsider exception handling in project_graph_from_db

The method catches exceptions and prints error messages, which is not ideal for production code. Use the logging module to log errors, and consider re-raising exceptions or handling them appropriately to avoid silent failures.

Apply this diff:

except (ValueError, TypeError) as e:
-     print(f"Error projecting graph: {e}")
+     logging.error(f"Error projecting graph: {e}")
+     raise

except Exception as ex:
-     print(f"Unexpected error: {ex}")
+     logging.exception("Unexpected error in project_graph_from_db")
+     raise

Make sure to import the logging module if it's not already imported.


Line range hint 109-110: Replace print statements with logging in map_vector_distances_to_graph_nodes

Using print statements for logging is not recommended in production code. Replace them with appropriate logging calls to control log levels and outputs.

Apply this diff:

else:
-     print(f"Node with id {node_id} not found in the graph.")
+     logging.warning(f"Node with id {node_id} not found in the graph.")

Line range hint 142-143: Replace print statements with logging in map_vector_distances_to_graph_edges

Replace print statements with logging to improve log management and avoid cluttering the console.

Apply this diff:

-                     print(f"Edge {edge} has an unknown or missing relationship type.")
+                     logging.warning(f"Edge {edge} has an unknown or missing relationship type.")

Line range hint 145-146: Replace print statement with logging in exception handling

Using print statements in exception handling is not recommended. Use logging.exception to log the error along with the traceback, which aids in debugging.

Apply this diff:

except Exception as ex:
-     print(f"Error mapping vector distances to edges: {ex}")
+     logging.exception("Error mapping vector distances to edges")
📜 Review details

Configuration used: CodeRabbit UI
Review profile: CHILL

📥 Commits

Reviewing files that changed from the base of the PR and between fb0f97c and 3ef3376.

📒 Files selected for processing (3)
  • cognee/modules/graph/cognee_graph/CogneeGraph.py (1 hunks)
  • cognee/modules/graph/cognee_graph/CogneeGraphElements.py (2 hunks)
  • cognee/modules/retrieval/description_to_codepart_search.py (1 hunks)

Copy link
Contributor

@coderabbitai coderabbitai bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 0

🧹 Outside diff range and nitpick comments (7)
cognee/tests/unit/modules/retriever/test_description_to_codepart_search.py (7)

4-6: Remove unnecessary empty lines

Multiple consecutive empty lines can be reduced to one for better readability.

 from unittest.mock import AsyncMock, patch

-

-
 @pytest.mark.asyncio

7-21: Enhance test coverage with parameter verification

While the test correctly verifies the empty result case, it would be more robust to also verify that the mock was called with the expected parameters.

         from cognee.modules.retrieval.description_to_codepart_search import code_description_to_code_part
         result = await code_description_to_code_part("search query", mock_user, 2)
 
         assert result == []
+        mock_vector_engine.search.assert_called_once_with("search query", 2)

24-31: Consider using parametrize for comprehensive input validation

The test could be enhanced to cover more invalid input cases using pytest's parametrize decorator.

+@pytest.mark.parametrize("invalid_query", [
+    "",
+    None,
+    123,
+    [],
+])
 @pytest.mark.asyncio
-async def test_code_description_to_code_part_invalid_query():
+async def test_code_description_to_code_part_invalid_query(invalid_query):
     """Test that code_description_to_code_part raises ValueError for invalid query."""
     mock_user = AsyncMock()
     with pytest.raises(ValueError, match="The query must be a non-empty string."):
         from cognee.modules.retrieval.description_to_codepart_search import code_description_to_code_part
-        await code_description_to_code_part("", mock_user, 2)
+        await code_description_to_code_part(invalid_query, mock_user, 2)

33-40: Extend top_k validation tests

The test should cover more invalid cases for the top_k parameter.

+@pytest.mark.parametrize("invalid_top_k,expected_message", [
+    (0, "top_k must be a positive integer."),
+    (-1, "top_k must be a positive integer."),
+    (1.5, "top_k must be a positive integer."),
+    ("2", "top_k must be a positive integer."),
+])
 @pytest.mark.asyncio
-async def test_code_description_to_code_part_invalid_top_k():
+async def test_code_description_to_code_part_invalid_top_k(invalid_top_k, expected_message):
     """Test that code_description_to_code_part raises ValueError for invalid top_k."""
     mock_user = AsyncMock()
-    with pytest.raises(ValueError, match="top_k must be a positive integer."):
+    with pytest.raises(ValueError, match=expected_message):
         from cognee.modules.retrieval.description_to_codepart_search import code_description_to_code_part
-        await code_description_to_code_part("search query", mock_user, 0)
+        await code_description_to_code_part("search query", mock_user, invalid_top_k)

42-52: Verify exception chain and test different initialization scenarios

Consider testing different initialization failure scenarios and verifying the exception chain.

+@pytest.mark.parametrize("init_error,expected_message", [
+    (Exception("Engine init failed"), "System initialization error. Please try again later."),
+    (ConnectionError("Database unreachable"), "System initialization error. Please try again later."),
+    (ValueError("Invalid configuration"), "System initialization error. Please try again later."),
+])
 @pytest.mark.asyncio
-async def test_code_description_to_code_part_initialization_error():
+async def test_code_description_to_code_part_initialization_error(init_error, expected_message):
     """Test that code_description_to_code_part raises RuntimeError for engine initialization errors."""
     mock_user = AsyncMock()
-    with patch("cognee.modules.retrieval.description_to_codepart_search.get_vector_engine", side_effect=Exception("Engine init failed")), \
+    with patch("cognee.modules.retrieval.description_to_codepart_search.get_vector_engine", side_effect=init_error), \
          patch("cognee.modules.retrieval.description_to_codepart_search.get_graph_engine", return_value=AsyncMock()):
 
         from cognee.modules.retrieval.description_to_codepart_search import code_description_to_code_part
-        with pytest.raises(RuntimeError, match="System initialization error. Please try again later."):
+        with pytest.raises(RuntimeError, match=expected_message) as exc_info:
             await code_description_to_code_part("search query", mock_user, 2)
+            assert exc_info.value.__cause__ == init_error

54-68: Enhance execution error testing

Consider testing different execution error scenarios and verifying the search attempt was made.

+@pytest.mark.parametrize("exec_error,expected_message", [
+    (Exception("Execution error"), "An error occurred while processing your request."),
+    (TimeoutError("Search timeout"), "An error occurred while processing your request."),
+    (ValueError("Invalid search parameters"), "An error occurred while processing your request."),
+])
 @pytest.mark.asyncio
-async def test_code_description_to_code_part_execution_error():
+async def test_code_description_to_code_part_execution_error(exec_error, expected_message):
     """Test that code_description_to_code_part raises RuntimeError for execution errors."""
     mock_user = AsyncMock()
     mock_user.id = "user123"
     mock_vector_engine = AsyncMock()
-    mock_vector_engine.search.side_effect = Exception("Execution error")
+    mock_vector_engine.search.side_effect = exec_error
 
     with patch("cognee.modules.retrieval.description_to_codepart_search.get_vector_engine", return_value=mock_vector_engine), \
          patch("cognee.modules.retrieval.description_to_codepart_search.get_graph_engine", return_value=AsyncMock()), \
          patch("cognee.modules.retrieval.description_to_codepart_search.CogneeGraph", return_value=AsyncMock()):
 
         from cognee.modules.retrieval.description_to_codepart_search import code_description_to_code_part
-        with pytest.raises(RuntimeError, match="An error occurred while processing your request."):
+        with pytest.raises(RuntimeError, match=expected_message) as exc_info:
             await code_description_to_code_part("search query", mock_user, 2)
+            # Verify search was attempted
+            mock_vector_engine.search.assert_called_once()
+            assert exc_info.value.__cause__ == exec_error

1-68: Consider adding tests for missing scenarios

While the current test coverage is good for error cases, consider adding tests for:

  1. Success case with actual search results (including result transformation)
  2. Edge cases with large top_k values
  3. Authentication/authorization scenarios
  4. Rate limiting scenarios
📜 Review details

Configuration used: CodeRabbit UI
Review profile: CHILL

📥 Commits

Reviewing files that changed from the base of the PR and between 3cbcfc1 and 00a3668.

📒 Files selected for processing (3)
  • cognee/modules/graph/cognee_graph/CogneeGraphElements.py (2 hunks)
  • cognee/modules/retrieval/description_to_codepart_search.py (1 hunks)
  • cognee/tests/unit/modules/retriever/test_description_to_codepart_search.py (1 hunks)
🚧 Files skipped from review as they are similar to previous changes (2)
  • cognee/modules/retrieval/description_to_codepart_search.py
  • cognee/modules/graph/cognee_graph/CogneeGraphElements.py

@borisarzentar borisarzentar merged commit 6d85165 into main Dec 10, 2024
42 checks passed
@borisarzentar borisarzentar deleted the feature/cog-539-implementing-additional-retriever-approaches branch December 10, 2024 10:07
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants