COG 870 Remove duplicate edges from the code graph #293

Merged
merged 13 commits into from
Dec 17, 2024

Conversation

lxobr
Collaborator

@lxobr lxobr commented Dec 10, 2024

  • Refactored the summarize_code task into a generator that no longer calls add_data_points itself.

  • summarize_code now:

    • Yields input data points.
    • Creates and yields summary data points.

    This ensures all data points reach the add_data_points task.

  • Moved the add_data_points task to the end of the pipeline.

  • Extracted run_code_graph_pipeline into a standalone function.

  • Added a script example for running the code graph pipeline with the new function.
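
A minimal sketch of the pass-through generator pattern described above. It is not the code from this PR: DataPoint, CodeSummary and fake_summarize are hypothetical stand-ins, and the real task calls an LLM-backed summarizer. The point is only that each input node is yielded alongside its summary, so a single storage task at the end of the pipeline receives both.

```python
import asyncio
from dataclasses import dataclass
from typing import AsyncGenerator, Iterable


@dataclass
class DataPoint:
    """Hypothetical stand-in for a graph node; not cognee's DataPoint."""
    id: str
    text: str


@dataclass
class CodeSummary(DataPoint):
    """Hypothetical summary node that records which node it was made from."""
    made_from: str = ""


async def fake_summarize(text: str) -> str:
    # Placeholder for the summarization call; it only truncates the text.
    return text[:20]


async def summarize_code(nodes: Iterable[DataPoint]) -> AsyncGenerator[DataPoint, None]:
    nodes = list(nodes)
    if not nodes:
        return  # empty input: the generator ends without yielding

    summaries = await asyncio.gather(*(fake_summarize(n.text) for n in nodes))
    for node, summary in zip(nodes, summaries):
        yield node  # pass the input node through unchanged
        yield CodeSummary(id=f"{node.id}-summary", text=summary, made_from=node.id)


async def collect(nodes: Iterable[DataPoint]) -> list:
    # Drain the generator the way a downstream task would.
    return [point async for point in summarize_code(nodes)]
```

Because nothing is written inside the generator, storing the points (and therefore deduplicating edges) happens in one place downstream.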

Summary by CodeRabbit

  • New Features

    • Introduced a new Python script for executing a code graph pipeline.
    • Added a new function to handle the code graph processing pipeline.
  • Improvements

    • Updated method signatures for clarity and consistency.
    • Enhanced error handling and streamlined function calls for better performance.
  • Bug Fixes

    • Adjusted formatting and logic within existing functions to improve functionality.
  • Deprecation Notices

    • Marked existing functions as deprecated in favor of new implementations.
  • Documentation

    • Updated comments and structure for better readability and understanding.

@lxobr lxobr self-assigned this Dec 10, 2024
Contributor

coderabbitai bot commented Dec 10, 2024

Walkthrough

The pull request introduces significant modifications across four files, primarily focusing on the summarize_code function in cognee/tasks/summarization/summarize_code.py. The parameter name has been updated to reflect a new expected input type, and the function now yields results instead of returning a list. Additionally, evals/eval_swe_bench.py has undergone restructuring, with the introduction of a new function to manage code graph processing tasks. A new script in examples/python/code_graph_example.py has also been added to facilitate the execution of the code graph pipeline. Furthermore, cognee/api/v1/cognify/code_graph_pipeline.py has been updated to include the new run_code_graph_pipeline function.

Changes

File Path | Change Summary
--- | ---
cognee/tasks/summarization/summarize_code.py | Renamed parameter code_files to code_graph_nodes, updated logic to return None for empty input, and changed summary construction to yield results instead of returning a list.
evals/eval_swe_bench.py | Replaced generate_patch_with_cognee with run_code_graph_pipeline, redefined function calls, streamlined error handling, and modified main function for better package management.
examples/python/code_graph_example.py | Added a new script that defines an asynchronous main function to execute the code graph pipeline and handle command-line arguments.
cognee/api/v1/cognify/code_graph_pipeline.py | Introduced run_code_graph_pipeline function to manage code graph processing, including data setup and task execution.

Possibly related PRs

  • Add code_graph_demo notebook #191: The cognee_code_graph_demo.ipynb notebook introduces functionality for generating knowledge graphs from code, which aligns with the changes in the summarize_code function that also processes code files and generates summaries.
  • Feat/cog-544 eval on swe bench #232: This PR implements functionality for evaluating the cognee code graph pipeline, which involves processing datasets and could relate to the changes made in the summarize_code function regarding how code files are summarized and processed.
  • Feature/cog 539 implementing additional retriever approaches #262: The addition of the CodeSummary class in cognee/tasks/summarization/models.py is directly related to the changes in the summarize_code function, which now yields CodeSummary objects based on the new mapping of summaries.

Suggested labels

run-checks

Suggested reviewers

  • hajdul88
  • Vasilije1990
  • alekszievr

🐰 In the code where the graphs grow,
A new path for summaries flows.
With nodes that yield, not just return,
The rabbit hops, and lessons learn.
From files to graphs, a tale unfolds,
In every line, a story told! 🌟

Contributor

@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 3

🧹 Outside diff range and nitpick comments (2)
evals/eval_swe_bench.py (2)

Line range hint 44-49: Consider moving the function-local imports to module level

The imports within the run_code_graph_pipeline function can be moved to the top of the module unless there is a specific reason to keep them inside the function (e.g., to avoid circular dependencies). This improves readability and follows Python's best practices.

Proposed code change:

-async def run_code_graph_pipeline(repo_path):
-    import os
-    import pathlib
-    import cognee
-    from cognee.infrastructure.databases.relational import create_db_and_tables
+import os
+import pathlib
+import cognee
+from cognee.infrastructure.databases.relational import create_db_and_tables

+async def run_code_graph_pipeline(repo_path):

71-71: Remove unused parameter search_type

The parameter search_type in the function generate_patch_with_cognee is not used in the function body. Consider removing it to clean up the function signature.

Proposed code change:

-async def generate_patch_with_cognee(instance, llm_client, search_type=SearchType.CHUNKS):
+async def generate_patch_with_cognee(instance, llm_client):
📜 Review details

Configuration used: CodeRabbit UI
Review profile: CHILL

📥 Commits

Reviewing files that changed from the base of the PR and between 6d85165 and 15cf708.

📒 Files selected for processing (3)
  • cognee/tasks/summarization/summarize_code.py (1 hunks)
  • evals/eval_swe_bench.py (3 hunks)
  • examples/python/code_graph_example.py (1 hunks)

cognee/tasks/summarization/summarize_code.py (resolved)
cognee/tasks/summarization/summarize_code.py (resolved)
examples/python/code_graph_example.py (resolved)
@borisarzentar borisarzentar changed the base branch from main to dev December 11, 2024 20:54
evals/eval_swe_bench.py (outdated, resolved)
Contributor

@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 0

♻️ Duplicate comments (2)
cognee/tasks/summarization/summarize_code.py (2)

14-14: 🛠️ Refactor suggestion

Update return type annotation to reflect generator

The function summarize_code now yields DataPoint objects instead of returning a list. Update the return type annotation to -> AsyncGenerator[DataPoint, None] to accurately represent its behavior.

Apply this diff:

- ) -> list[DataPoint]:
+ ) -> AsyncGenerator[DataPoint, None]:

3-3: 🛠️ Refactor suggestion

Import AsyncGenerator from typing

Since summarize_code is now an async generator, import AsyncGenerator from the typing module to properly annotate the return type.

Apply this diff:

- from typing import Type
+ from typing import Type, AsyncGenerator
🧹 Nitpick comments (6)
cognee/api/v1/cognify/code_graph_pipeline.py (5)

113-116: Consider moving imports to the module level

The imports inside the run_code_graph_pipeline function (lines 113-116) can be moved to the top of the file. This enhances readability and adheres to standard Python conventions.

Apply this diff to move the imports:

+ import os
+ import pathlib
+ import cognee
+ from cognee.infrastructure.databases.relational import create_db_and_tables

 async def run_code_graph_pipeline(repo_path):
-     import os
-     import pathlib
-     import cognee
-     from cognee.infrastructure.databases.relational import create_db_and_tables

114-114: Remove redundant import of pathlib

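The module-level import on line 3 is from pathlib import Path, which binds Path but not the name pathlib. The function-local import pathlib on line 114 becomes unnecessary once the function body uses Path directly instead of pathlib.Path, and can then be removed.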

Apply this diff to remove the redundant import:

-     import pathlib

116-116: Move create_db_and_tables import to the module level

For consistency and clarity, consider moving the import of create_db_and_tables to the module level with other imports.

Apply this diff:

+ from cognee.infrastructure.databases.relational import create_db_and_tables

 async def run_code_graph_pipeline(repo_path):
-     from cognee.infrastructure.databases.relational import create_db_and_tables

119-122: Simplify path manipulations using Path objects

Currently, both os.path.join and pathlib.Path are used for path manipulations. For better readability and consistency, consider using Path objects throughout.

Apply this diff to refactor the path handling:

-     data_directory_path = str(pathlib.Path(os.path.join(file_path, ".data_storage/code_graph")).resolve())
-     cognee_directory_path = str(pathlib.Path(os.path.join(file_path, ".cognee_system/code_graph")).resolve())
+     data_directory_path = (file_path / ".data_storage" / "code_graph").resolve()
+     cognee_directory_path = (file_path / ".cognee_system" / "code_graph").resolve()
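
A small sketch of the Path-only approach, assuming the base directory is already a pathlib.Path. The helper name storage_dirs is hypothetical; the directory names are copied from the diff above. Note that the original code wrapped the result in str(), so callers that expect strings would still need that conversion.

```python
from pathlib import Path


def storage_dirs(base: Path) -> tuple[Path, Path]:
    """Return absolute (data_dir, system_dir) paths under the given base directory."""
    # The / operator joins path segments; resolve() makes the result absolute.
    data_dir = (base / ".data_storage" / "code_graph").resolve()
    system_dir = (base / ".cognee_system" / "code_graph").resolve()
    return data_dir, system_dir


# Convert to str only at the boundary where a string is required.
data_dir, system_dir = storage_dirs(Path(__file__).parent if "__file__" in globals() else Path.cwd())
data_directory_path = str(data_dir)
```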

112-136: Add a docstring to run_code_graph_pipeline

Adding a docstring to the run_code_graph_pipeline function will improve code readability and maintainability by explaining the purpose and usage of the function.

cognee/tasks/summarization/summarize_code.py (1)

18-18: Ensure generator function concludes appropriately

The return statement at line 18 exits the generator function without yielding any values if code_graph_nodes is empty. This is acceptable. Keep it as a bare return: Python rejects return with a value, including return None, inside an async generator with a SyntaxError.
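
For reference, a short sketch of how an async generator handles an early exit. The generator below is illustrative and not taken from the PR. It shows that a bare return ends the generator cleanly, while the same function written with return None does not compile.

```python
import asyncio

# Source text of an async generator that tries to return a value.
SOURCE_WITH_VALUE = """
async def gen(items):
    if not items:
        return None
    for item in items:
        yield item
"""


def compiles(source: str) -> bool:
    """Report whether the given source compiles without a SyntaxError."""
    try:
        compile(source, "<sketch>", "exec")
        return True
    except SyntaxError:
        return False


async def gen(items):
    if not items:
        return  # bare return: ends the async generator without yielding
    for item in items:
        yield item


async def drain(items):
    # Collect everything the generator yields.
    return [item async for item in gen(items)]
```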

📜 Review details

Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 15cf708 and 4023b4d.

📒 Files selected for processing (4)
  • cognee/api/v1/cognify/code_graph_pipeline.py (3 hunks)
  • cognee/tasks/summarization/summarize_code.py (1 hunks)
  • evals/eval_swe_bench.py (2 hunks)
  • examples/python/code_graph_example.py (1 hunks)
🚧 Files skipped from review as they are similar to previous changes (1)
  • examples/python/code_graph_example.py
🔇 Additional comments (2)
evals/eval_swe_bench.py (2)

10-10: Refactored run_code_graph_pipeline import improves modularity

Good job moving run_code_graph_pipeline to code_graph_pipeline.py and importing it here. This enhances code organization and reusability.


39-44: Function generate_patch_with_cognee correctly utilizes the pipeline

The updated function properly calls run_code_graph_pipeline and processes the results asynchronously, which aligns with best practices for async operations.

Contributor

@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 2

🧹 Nitpick comments (3)
cognee/api/v1/cognify/code_graph_pipeline.py (3)

1-4: Enhance deprecation notice with timeline

The deprecation notice is clear but could be more helpful by including:

  • When these functions will be removed
  • Migration guide or link to documentation
 # NOTICE: This module contains deprecated functions.
 # Use only the run_code_graph_pipeline function; all other functions are deprecated.
+# These functions will be removed in version X.Y.Z. See migration guide at <link>
 # Related issue: COG-906

133-139: Make task configurations configurable

The task configurations use hardcoded batch sizes which might not be optimal for all scenarios.

Consider making these configurable:

+    def get_batch_size(task_name: str) -> int:
+        return cognee.config.get(f"tasks.{task_name}.batch_size", default=50)
+
     tasks = [
         Task(get_repo_file_dependencies),
-        Task(enrich_dependency_graph, task_config={"batch_size": 50}),
-        Task(expand_dependency_graph, task_config={"batch_size": 50}),
-        Task(summarize_code, summarization_model=SummarizedContent, task_config={"batch_size": 50}),
-        Task(add_data_points, task_config={"batch_size": 50}),
+        Task(enrich_dependency_graph, task_config={"batch_size": get_batch_size("enrich_dependency_graph")}),
+        Task(expand_dependency_graph, task_config={"batch_size": get_batch_size("expand_dependency_graph")}),
+        Task(summarize_code, summarization_model=SummarizedContent, task_config={"batch_size": get_batch_size("summarize_code")}),
+        Task(add_data_points, task_config={"batch_size": get_batch_size("add_data_points")}),
     ]
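
The diff above reads batch sizes through cognee.config.get, which may not match the project's actual configuration API. As an alternative sketch that relies only on the standard library, the lookup could read environment variables. The variable naming scheme (CODE_GRAPH_<TASK>_BATCH_SIZE) is an assumption made for illustration; the default of 50 comes from the existing task configuration.

```python
import os

DEFAULT_BATCH_SIZE = 50  # value used by the existing task_config entries


def get_batch_size(task_name: str, default: int = DEFAULT_BATCH_SIZE) -> int:
    """Look up a per-task batch size from the environment, falling back to a default."""
    # Hypothetical naming scheme, e.g. CODE_GRAPH_SUMMARIZE_CODE_BATCH_SIZE.
    env_name = f"CODE_GRAPH_{task_name.upper()}_BATCH_SIZE"
    raw = os.environ.get(env_name)
    if raw is None:
        return default
    value = int(raw)  # raises ValueError for non-numeric input
    if value <= 0:
        raise ValueError(f"{env_name} must be a positive integer, got {value}")
    return value
```

Each Task would then receive task_config={"batch_size": get_batch_size("summarize_code")}, and so on for the other tasks.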

124-127: Consider using temporary directories for isolation

Using fixed directory names under the package directory could cause conflicts between different runs or in different environments.

Consider using tempfile.mkdtemp() to create isolated directories for each run, or make the paths configurable through environment variables.
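
A minimal sketch of the tempfile.mkdtemp() option, assuming each run only needs a data directory and a system directory. The helper name make_run_dirs and the prefix are hypothetical; the subdirectory names mirror the ones used in the pipeline.

```python
import tempfile
from pathlib import Path


def make_run_dirs(prefix: str = "code_graph_") -> tuple[Path, Path]:
    """Create a fresh, uniquely named directory pair for one pipeline run."""
    # mkdtemp creates the directory and returns its absolute path as a string.
    root = Path(tempfile.mkdtemp(prefix=prefix))
    data_dir = root / ".data_storage"
    system_dir = root / ".cognee_system"
    data_dir.mkdir()
    system_dir.mkdir()
    return data_dir, system_dir
```

Directories created with mkdtemp are not removed automatically, so the caller is responsible for cleanup, for example with shutil.rmtree on the parent directory once the run finishes.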

📜 Review details

Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 4023b4d and ad1b435.

📒 Files selected for processing (1)
  • cognee/api/v1/cognify/code_graph_pipeline.py (4 hunks)
🔇 Additional comments (1)
cognee/api/v1/cognify/code_graph_pipeline.py (1)

62-62: Verify error handling migration

While the deprecation notice is good, ensure that the error handling and telemetry capabilities from run_pipeline are properly migrated to run_code_graph_pipeline.

✅ Verification successful

Error handling and telemetry are properly migrated

The new run_code_graph_pipeline implementation includes comprehensive error handling and telemetry:

  • Sends telemetry events at pipeline start, completion, and error states
  • Wraps core execution in try-except block
  • Updates pipeline status appropriately for both success and error cases
  • Maintains consistent error reporting pattern with the deprecated version
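
The pattern listed above can be sketched roughly as follows. This is an illustration, not the code in code_graph_pipeline.py: send_telemetry here is a hypothetical stand-in that records events in a list, and its signature and the event names are assumptions.

```python
import asyncio
from typing import Awaitable, Callable

events: list = []


def send_telemetry(event: str) -> None:
    """Hypothetical stand-in: records the event instead of sending it anywhere."""
    events.append(event)


async def run_with_telemetry(pipeline: Callable[[], Awaitable[None]]) -> None:
    """Report start, completion, and error states around one pipeline run."""
    send_telemetry("pipeline started")
    try:
        await pipeline()
    except Exception:
        send_telemetry("pipeline errored")
        raise  # re-raise so the caller still sees the original failure
    send_telemetry("pipeline completed")
```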
🏁 Scripts executed

The following scripts were executed for the analysis:

Script:

#!/bin/bash
# Search for error handling and telemetry in the new implementation
rg -A 5 "try|except|send_telemetry" cognee/api/v1/cognify/code_graph_pipeline.py

Length of output: 1600

cognee/api/v1/cognify/code_graph_pipeline.py (resolved)
cognee/api/v1/cognify/code_graph_pipeline.py (resolved)
@lxobr lxobr merged commit da5e3ab into dev Dec 17, 2024
23 of 24 checks passed
@lxobr lxobr deleted the COG-870-deduplicate-code-graph-edges branch December 17, 2024 11:02
5 participants