
Ingest non-code files in code graph pipeline #380

Closed
alekszievr wants to merge 2 commits from the COG-697-ingest-non-code-files branch

Conversation

@alekszievr (Contributor) commented Dec 18, 2024

In this first version, I assume for simplicity that every non-py file is documentation.

Summary by CodeRabbit

  • New Features

    • Enhanced processing capabilities for both code and non-code files in the pipeline.
    • New functions added for retrieving non-Python files and user-specific datasets.
  • Bug Fixes

    • Improved error handling for repository path existence checks.
  • Documentation

    • Updated method signatures and added new function descriptions for clarity.

coderabbitai bot (Contributor) commented Dec 18, 2024

Walkthrough

This pull request enhances the code graph pipeline functionality by introducing a new parameter include_docs to the run_code_graph_pipeline function. The changes enable more comprehensive data processing by conditionally handling non-code files and documents. New utility functions are added to retrieve non-Python files and user-specific datasets, expanding the pipeline's flexibility in processing repository contents.

Changes

  • cognee/api/v1/cognify/code_graph_pipeline.py: Added the include_docs parameter to run_code_graph_pipeline; introduced conditional logic for processing non-code files; added new imports for the embedding engine and knowledge graph.
  • cognee/tasks/repo_processor/__init__.py: Imported the get_data_list_for_user and get_non_py_files functions.
  • cognee/tasks/repo_processor/get_non_code_files.py: Added get_non_py_files() to retrieve non-Python files and get_data_list_for_user() to fetch user-specific datasets.

Sequence Diagram

sequenceDiagram
    participant User
    participant CodeGraphPipeline
    participant EmbeddingEngine
    participant KnowledgeGraph
    
    User->>CodeGraphPipeline: run_code_graph_pipeline(repo_path, include_docs=True)
    CodeGraphPipeline->>EmbeddingEngine: Get embedding engine
    CodeGraphPipeline->>CodeGraphPipeline: Fetch default user
    
    alt include_docs is True
        CodeGraphPipeline->>CodeGraphPipeline: Retrieve non-code files
        CodeGraphPipeline->>KnowledgeGraph: Extract graphs from data
        CodeGraphPipeline->>CodeGraphPipeline: Summarize text
    end
    
    CodeGraphPipeline-->>User: Return processed results


Suggested reviewers

  • lxobr
  • Vasilije1990

Poem

🐰 Hopping through code with glee and might,
Docs and files now dance in light!
Non-Python treasures, come what may,
Our pipeline grows smarter today!
Rabbit's code magic takes its flight! 🚀


coderabbitai bot (Contributor) left a comment

Actionable comments posted: 0

🧹 Nitpick comments (5)
examples/python/code_graph_example.py (1)

13-14: Boolean CLI parameter note.
argparse has no real boolean handling: type=bool converts any non-empty string (including "false") to True. The approach here should work for the documented usage, but users may pass "true" or "false" in various formats, so consider a more robust converter if usage becomes widespread.
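A minimal sketch of a more robust converter (the flag name mirrors this example's include_docs parameter; str2bool is illustrative, not part of the codebase):

import argparse

def str2bool(value: str) -> bool:
    # bool("false") is True, so parse the common spellings explicitly.
    if value.lower() in ("yes", "true", "t", "1"):
        return True
    if value.lower() in ("no", "false", "f", "0"):
        return False
    raise argparse.ArgumentTypeError(f"expected a boolean, got {value!r}")

parser = argparse.ArgumentParser()
parser.add_argument("--include_docs", type=str2bool, default=True)
print(parser.parse_args(["--include_docs", "false"]).include_docs)  # False

On Python 3.9+, argparse.BooleanOptionalAction provides paired --include-docs/--no-include-docs flags without a custom converter.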

cognee/tasks/repo_processor/get_non_code_files.py (3)

18-28: Efficient retrieval of non-Python files.
Traversing the directory to filter out “.py” files is a straightforward approach. Consider adding logging or metrics for large repositories to track performance if needed.
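As a sketch of that instrumentation, assuming a plain os.walk traversal (the helper name is illustrative, not the PR's actual function):

import logging
import os
import time

logger = logging.getLogger("task:repo_processor")

def walk_non_py_files(repo_path: str) -> list[str]:
    # Time the traversal and log the count so large repositories show up in logs.
    start = time.monotonic()
    paths = [
        os.path.join(root, name)
        for root, _, files in os.walk(repo_path)
        for name in files
        if not name.endswith(".py")
    ]
    logger.info("Found %d non-Python files in %.2fs", len(paths), time.monotonic() - start)
    return paths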


30-39: Streamlined function for returning non-Python file paths.
Locating the repository path from the DataPoint list is succinct. If multiple DataPoints represent different repos, consider clarifying or extending the logic in future versions to handle multiple repos.
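If that extension is ever needed, a hypothetical grouping step could look like the following (it assumes repository data points expose a path attribute, which this PR does not guarantee):

from collections import defaultdict

def group_data_points_by_repo(data_points):
    # Bucket data points by repository path so each repository is handled separately.
    repos = defaultdict(list)
    for data_point in data_points:
        repo_path = getattr(data_point, "path", None)
        if repo_path is not None:
            repos[repo_path].append(data_point)
    return repos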


40-46: Retrieving data documents for user’s dataset.
This method effectively aggregates all relevant data documents. Potentially, adding pagination or partial retrieval mechanisms might help avoid large memory usage with very large datasets.
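A minimal sketch of such batched retrieval, with get_batch standing in as a hypothetical paginated dataset query:

async def iter_data_in_batches(get_batch, batch_size: int = 100):
    # Pull the dataset page by page instead of materializing everything at once.
    offset = 0
    while True:
        batch = await get_batch(limit=batch_size, offset=offset)
        if not batch:
            break
        yield batch
        offset += batch_size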

cognee/tasks/repo_processor/__init__.py (1)

7-7: Place module-level import at top.
Per linting suggestions (E402), module-level imports belong at the top of the file; moving the logger assignment below the import block resolves the warning:

 import logging
 
-logger = logging.getLogger("task:repo_processor")
-
 from .enrich_dependency_graph import enrich_dependency_graph
 from .expand_dependency_graph import expand_dependency_graph
 from .get_non_code_files import get_data_list_for_user, get_non_code_files
 from .get_repo_file_dependencies import get_repo_file_dependencies
+
+logger = logging.getLogger("task:repo_processor")
🧰 Tools
🪛 Ruff (0.8.2)

7-7: Module level import not at top of file

(E402)

📜 Review details


📥 Commits

Reviewing files that changed from the base of the PR and between 45cb2c3 and e93dc12.

📒 Files selected for processing (4)
  • cognee/api/v1/cognify/code_graph_pipeline.py (5 hunks)
  • cognee/tasks/repo_processor/__init__.py (1 hunks)
  • cognee/tasks/repo_processor/get_non_code_files.py (1 hunks)
  • examples/python/code_graph_example.py (1 hunks)
🧰 Additional context used
🪛 Ruff (0.8.2)
cognee/tasks/repo_processor/__init__.py

7-7: Module level import not at top of file

(E402)

🔇 Additional comments (10)
cognee/api/v1/cognify/code_graph_pipeline.py (8)

10-10: Good addition of the cognify config import.
This import is necessary for retrieving the summarization model configuration used later in the code.


23-23: Useful import of KnowledgeGraph for non-code data extraction.
This addition aligns well with the new tasks that process and extract graphs from non-code files.


29-30: Expanded graph and ingestion tasks.
Importing these new tasks (extract_graph_from_data and ingest_data_with_metadata) is essential for handling documentation. Everything looks consistent.


33-34: Import for new helper functions.
The new non-code file retrieval and dataset retrieval methods are important for the pipeline expansion.


37-37: Summarization tasks included.
The imported summarize_text complements summarize_code for non-code file summarizations.


127-130: New parameter for handling non-code files.
Updating the function signature to include 'include_docs' is a clean and logical approach. The documentation for the parameter usage is concise and clear.


144-146: Retrieving config and default user.
These lines ensure the pipeline has access to the relevant config and user context. No issues observed.


155-171: Conditional pipeline tasks for non-code files.
This block effectively groups tasks required for retrieving, ingesting, classifying, summarizing, and building graphs for non-Python files. The approach is modular, making the pipeline clearer and easier to maintain.

examples/python/code_graph_example.py (2)

7-8: New parameter 'include_docs' for main.
Allowing users to control non-code file processing from the CLI is a practical extension. The asynchronous call to 'run_code_graph_pipeline' with this new parameter is straightforward.


16-16: Coordinates with new 'include_docs' parameter.
The call to main with both arguments is correct and fulfills the pipeline configuration.

cognee/api/v1/cognify/code_graph_pipeline.py (resolved review thread)


cognee/tasks/repo_processor/get_non_code_files.py

async def get_non_code_files(data_points: list[DataPoint]):
    for data_point in data_points:

borisarzentar (Contributor) commented:

This task will run every time for every batch, which can be a bit slow with a huge number of data points. If we split the pipeline as I mentioned above, you don't need to do this here; this pipeline can instead receive the repo_path as its entry data.

alekszievr (Contributor, Author) replied:
@borisarzentar I made the suggested changes, could you please look at it again?

coderabbitai bot (Contributor) left a comment

Actionable comments posted: 1

🧹 Nitpick comments (1)
cognee/api/v1/cognify/code_graph_pipeline.py (1)

Line range hint 128-146: Add docstring to document the include_docs parameter

The function signature change is good, but it would benefit from documentation explaining the purpose and impact of the include_docs parameter.

Example docstring:

async def run_code_graph_pipeline(repo_path, include_docs=True):
    """
    Run the code graph pipeline on a repository.
    
    Args:
        repo_path: Path to the repository
        include_docs: If True, processes non-Python files as documentation (default: True)
    """
📜 Review details


📥 Commits

Reviewing files that changed from the base of the PR and between e93dc12 and 3f96416.

📒 Files selected for processing (4)
  • cognee/api/v1/cognify/code_graph_pipeline.py (5 hunks)
  • cognee/tasks/repo_processor/__init__.py (1 hunks)
  • cognee/tasks/repo_processor/get_non_code_files.py (1 hunks)
  • examples/python/code_graph_example.py (1 hunks)
🚧 Files skipped from review as they are similar to previous changes (2)
  • examples/python/code_graph_example.py
  • cognee/tasks/repo_processor/get_non_code_files.py
🧰 Additional context used
🪛 Ruff (0.8.2)
cognee/tasks/repo_processor/__init__.py

7-7: Module level import not at top of file

(E402)

🔇 Additional comments (3)
cognee/tasks/repo_processor/__init__.py (1)

7-7: LGTM: Import aligns with PR objectives

The addition of get_data_list_for_user and get_non_py_files imports is appropriate for enabling non-code file processing functionality.

🧰 Tools
🪛 Ruff (0.8.2)

7-7: Module level import not at top of file

(E402)

cognee/api/v1/cognify/code_graph_pipeline.py (2)

7-7: LGTM: Imports support non-code file processing

The new imports appropriately include all necessary modules for document processing, graph extraction, and data ingestion.

Also applies to: 11-11, 24-24, 30-31, 34-35, 38-38


156-170: Consider splitting into separate pipelines

As suggested in a previous review, consider splitting this into separate pipelines for code and documentation processing. This would provide several benefits:

  1. Better separation of concerns
  2. Possibility of parallel execution
  3. Easier maintenance and testing
  4. Independent scaling of each pipeline
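A rough sketch of that split, reusing the task lists assembled in this PR (an outline under those assumptions, not a final API):

async def run_code_pipeline(repo_path):
    # Code-only pipeline: dependency graph enrichment, code summarization, etc.
    async for result in run_tasks(tasks, repo_path, "cognify_code_pipeline"):
        yield result

async def run_docs_pipeline(repo_path):
    # Documentation-only pipeline: ingest, classify, and summarize non-code files.
    async for result in run_tasks(non_code_tasks, repo_path):
        yield result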

Comment on lines 172 to 88
async for result in run_tasks(tasks, repo_path, "cognify_code_pipeline"):
    yield result

if include_docs:
    async for result in run_tasks(non_code_tasks, repo_path):
        yield result
coderabbitai bot (Contributor) commented:

🛠️ Refactor suggestion

Add error handling and status tracking for document pipeline

The document processing pipeline would benefit from:

  1. Error handling similar to the deprecated run_pipeline
  2. Status tracking for document processing tasks
  3. Telemetry for document pipeline execution

Example structure:

try:
    async for result in run_tasks(tasks, repo_path, "cognify_code_pipeline"):
        yield result
    
    if include_docs:
        send_telemetry("doc_pipeline_started", user.id)
        try:
            async for result in run_tasks(non_code_tasks, repo_path):
                yield result
            send_telemetry("doc_pipeline_completed", user.id)
        except Exception as error:
            send_telemetry("doc_pipeline_errored", user.id)
            raise error
except Exception as error:
    send_telemetry("code_pipeline_errored", user.id)
    raise error

@alekszievr force-pushed the COG-697-ingest-non-code-files branch from 8f4d480 to 428b2cf on December 18, 2024 at 22:15
coderabbitai bot (Contributor) left a comment

Actionable comments posted: 1

♻️ Duplicate comments (1)
cognee/api/v1/cognify/code_graph_pipeline.py (1)

156-174: ⚠️ Potential issue

Add error handling and status tracking for document pipeline

The document processing pipeline would benefit from:

  1. Error handling similar to the deprecated run_pipeline
  2. Status tracking for document processing tasks
  3. Telemetry for document pipeline execution

Example structure:

try:
    send_telemetry("doc_pipeline_started", user.id)
    async for result in run_tasks(non_code_tasks, repo_path):
        yield result
    send_telemetry("doc_pipeline_completed", user.id)
except Exception as error:
    send_telemetry("doc_pipeline_errored", user.id)
    raise error
📜 Review details


📥 Commits

Reviewing files that changed from the base of the PR and between 3f96416 and 428b2cf.

📒 Files selected for processing (4)
  • cognee/api/v1/cognify/code_graph_pipeline.py (5 hunks)
  • cognee/tasks/repo_processor/__init__.py (1 hunks)
  • cognee/tasks/repo_processor/get_non_code_files.py (1 hunks)
  • examples/python/code_graph_example.py (1 hunks)
🚧 Files skipped from review as they are similar to previous changes (2)
  • cognee/tasks/repo_processor/get_non_code_files.py
  • examples/python/code_graph_example.py
🧰 Additional context used
🪛 Ruff (0.8.2)
cognee/tasks/repo_processor/__init__.py

7-7: Module level import not at top of file

(E402)

🔇 Additional comments (4)
cognee/tasks/repo_processor/__init__.py (1)

7-7: LGTM: Import statement aligns with PR objectives

The import of get_data_list_for_user and get_non_py_files functions is correctly placed and supports the new non-code file processing functionality.

🧰 Tools
🪛 Ruff (0.8.2)

7-7: Module level import not at top of file

(E402)

cognee/api/v1/cognify/code_graph_pipeline.py (3)

7-7: LGTM: New imports support non-code file processing

The added imports provide necessary functionality for processing non-code files, including document handling, graph extraction, and summarization capabilities.

Also applies to: 11-11, 24-24, 30-31, 34-35, 38-38


Line range hint 128-147: LGTM: Function signature and initialization look good

The addition of the include_docs parameter with a default value of True maintains backward compatibility while enabling the new functionality. The initialization of config and user is properly placed.


156-169: Consider splitting document processing into a separate pipeline

As suggested in the previous review, separating this into a dedicated pipeline would allow for better:

  • Parallel execution
  • Independent error handling
  • Separate status tracking

Comment on lines 171 to 194
if include_docs:
    async for result in run_tasks(non_code_tasks, repo_path):
        yield result

async for result in run_tasks(tasks, repo_path, "cognify_code_pipeline"):
    yield result
coderabbitai bot (Contributor) commented:

🛠️ Refactor suggestion

Consider parallel execution of document and code tasks

Since there's no apparent dependency between document and code tasks, consider running them in parallel using asyncio.gather() to improve performance:

import asyncio

async def collect(pipeline):
    return [result async for result in pipeline]

pipelines = []
if include_docs:
    pipelines.append(run_tasks(non_code_tasks, repo_path))
pipelines.append(run_tasks(tasks, repo_path, "cognify_code_pipeline"))

# Consume the pipelines concurrently, then yield the buffered results.
for results in await asyncio.gather(*(collect(p) for p in pipelines)):
    for result in results:
        yield result

coderabbitai bot (Contributor) left a comment

Actionable comments posted: 1

🔭 Outside diff range comments (2)
cognee/api/v1/cognify/code_graph_pipeline.py (2)

Line range hint 135-154: Consider extracting initialization logic to a separate function.

The function is mixing pipeline execution with system initialization concerns. This could lead to unexpected side effects and makes testing more difficult.

Consider extracting the initialization logic:

async def initialize_pipeline():
    file_path = Path(__file__).parent
    data_directory_path = str(pathlib.Path(os.path.join(file_path, ".data_storage/code_graph")).resolve())
    cognee.config.data_root_directory(data_directory_path)
    cognee_directory_path = str(pathlib.Path(os.path.join(file_path, ".cognee_system/code_graph")).resolve())
    cognee.config.system_root_directory(cognee_directory_path)

    await cognee.prune.prune_data()
    await cognee.prune.prune_system(metadata=True)
    await create_db_and_tables()

async def run_code_graph_pipeline(repo_path, include_docs=True):
    await initialize_pipeline()
    cognee_config = get_cognify_config()
    user = await get_default_user()
    # ... rest of the function

Line range hint 146-147: Avoid hardcoding paths relative to the file location.

The paths should be configurable through environment variables or configuration files to support different deployment environments.

🧹 Nitpick comments (1)
cognee/api/v1/cognify/code_graph_pipeline.py (1)

7-7: Consider organizing imports by type for better maintainability.

The new imports could be better organized into logical groups:

  • Standard library imports (e.g., chain)
  • Third-party imports
  • Local application imports (grouped by module)

Also applies to: 11-11, 24-24, 30-31, 34-35, 38-38

📜 Review details


📥 Commits

Reviewing files that changed from the base of the PR and between 428b2cf and ea67ac9.

📒 Files selected for processing (1)
  • cognee/api/v1/cognify/code_graph_pipeline.py (5 hunks)
🔇 Additional comments (2)
cognee/api/v1/cognify/code_graph_pipeline.py (2)

163-176: Consider previous review suggestion about separate pipelines.

As suggested in the previous review by @borisarzentar, splitting this into separate pipelines would allow for better:

  • Parallel execution
  • Independent error handling
  • Separate monitoring and telemetry

Additionally, consider externalizing task configurations:

  1. Move batch sizes to configuration
  2. Allow task-specific configuration overrides

Example configuration structure:

task_config = {
    "default": {"batch_size": 50},
    "get_non_py_files": {"batch_size": 100},
    "extract_chunks_from_documents": {"batch_size": 25}
}

178-183: Add error handling and consider parallel execution.

The current implementation:

  1. Lacks error handling for pipeline failures
  2. Could benefit from parallel execution as suggested in previous reviews
  3. Missing telemetry for tracking pipeline progress

Consider implementing the error handling and parallel execution pattern from the previous review:

try:
    async def collect(pipeline):
        return [result async for result in pipeline]

    pipelines = []
    if include_docs:
        pipelines.append(run_tasks(non_code_tasks, repo_path))
    pipelines.append(run_tasks(tasks, repo_path, "cognify_code_pipeline"))

    # Run both pipelines concurrently and yield their buffered results.
    for results in await asyncio.gather(*(collect(p) for p in pipelines)):
        for result in results:
            yield result
except Exception as error:
    send_telemetry("pipeline_errored", {"error": str(error)})
    raise error

cognee/api/v1/cognify/code_graph_pipeline.py (resolved review thread)
@alekszievr force-pushed the COG-697-ingest-non-code-files branch 2 times, most recently from 7aa6c0f to f30b17d on December 19, 2024 at 17:02
coderabbitai bot (Contributor) left a comment

Caution

Inline review comments failed to post. This is likely due to GitHub's limits when posting large numbers of comments.

Actionable comments posted: 3

🔭 Outside diff range comments (1)
evals/eval_swe_bench.py (1)

Line range hint 133-134: Remove dataset slicing used for testing

The [:1] slice limits processing to just one item, which appears to be debugging code that shouldn't be in production.

-        swe_dataset = load_swebench_dataset(
-            dataset_name, split='test')[:1]
+        swe_dataset = load_swebench_dataset(
+            dataset_name, split='test')
♻️ Duplicate comments (1)
cognee/api/v1/cognify/code_graph_pipeline.py (1)

163-181: 🛠️ Refactor suggestion

Implement parallel execution and error handling for document pipeline

The document pipeline could be optimized and made more robust:

  1. Consider parallel execution since code and document tasks are independent
  2. Add proper error handling and telemetry

Example implementation:

if include_docs:
    try:
        send_telemetry("doc_pipeline_started")

        async def collect(pipeline):
            return [result async for result in pipeline]

        # Run the document and code pipelines concurrently.
        doc_results, code_results = await asyncio.gather(
            collect(run_tasks(non_code_tasks, repo_path)),
            collect(run_tasks(tasks, repo_path, "cognify_code_pipeline")),
        )
        for result in doc_results + code_results:
            yield result

        send_telemetry("doc_pipeline_completed")
    except Exception as error:
        send_telemetry("doc_pipeline_errored")
        raise error
🧹 Nitpick comments (8)
cognee/tasks/repo_processor/get_source_code_chunks.py (2)

35-72: Potential performance concern with recursive calls.
The function _get_subchunk_token_counts calls itself recursively for large child nodes (line 69). This might lead to deep recursion for unusually large or nested code structures, possibly affecting performance or stack usage.


74-99: Clarify overlap logic in _get_chunk_source_code.
The logic for determining the cutoff index (lines 92–97) works well for progressive chunking with overlap, but the variable naming and inline arithmetic might benefit from extra comments to increase maintainability.

cognee/tasks/repo_processor/__init__.py (1)

7-7: Reorder imports to comply with style guidelines (E402).
According to static analysis, module-level imports should appear before other statements. Moving the logger assignment after the import block resolves the warning:

 import logging
-logger = logging.getLogger("task:repo_processor")
 from .enrich_dependency_graph import enrich_dependency_graph
 from .expand_dependency_graph import expand_dependency_graph
 from .get_non_code_files import get_data_list_for_user, get_non_py_files
 from .get_repo_file_dependencies import get_repo_file_dependencies
+
+logger = logging.getLogger("task:repo_processor")
🧰 Tools
🪛 Ruff (0.8.2)

7-7: Module level import not at top of file

(E402)

cognee/shared/CodeGraphEntities.py (2)

28-29: Consider documenting the commented-out code

The commented-out part_of relationship and optional source_code suggest a significant architectural change. Consider either removing the comment or documenting the rationale.


35-43: Review the SourceCodeChunk implementation

The new SourceCodeChunk class introduces a linked-list structure through previous_chunk but lacks:

  1. Documentation explaining the chunking strategy
  2. Validation for circular references
  3. Clear relationship with the parent CodePart

Consider:

  • Adding docstring explaining the chunking strategy
  • Implementing validation for circular references
  • Clarifying the relationship between chunks and their parent CodePart
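For the circular-reference point, a minimal validation sketch (assuming SourceCodeChunk exposes id and previous_chunk, as the linked-list structure implies):

def has_circular_reference(chunk) -> bool:
    # Walk the previous_chunk links; revisiting an id means the chain loops.
    visited = set()
    current = chunk
    while current is not None:
        if current.id in visited:
            return True
        visited.add(current.id)
        current = current.previous_chunk
    return False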
cognee/tasks/repo_processor/get_repo_file_dependencies.py (1)

Line range hint 97-110: Optimize UUID generation and dependency handling

The nested UUID generation and CodeFile creation for dependencies could be optimized:

  1. Consider caching UUIDs for frequently referenced files
  2. Batch create dependencies instead of inline creation

Consider extracting dependency creation to a separate method and implementing caching:

def create_dependency_file(dependency: str, repo: Repository, uuid_cache: dict, file_dict: dict) -> CodeFile:
    if dependency in uuid_cache:
        return uuid_cache[dependency]
    
    code_file = CodeFile(
        id=uuid5(NAMESPACE_OID, dependency),
        extracted_id=dependency,
        part_of=repo,
        source_code=file_dict.get(dependency, {}).get("source_code"),
    )
    uuid_cache[dependency] = code_file
    return code_file
evals/eval_swe_bench.py (1)

40-46: Remove commented out code if no longer needed

The render_graph call is commented out. If this functionality is no longer needed, the code should be removed rather than left as a comment.

-    # await render_graph(None, include_labels=True, include_nodes=True)
cognee/api/v1/cognify/code_graph_pipeline.py (1)

164-176: Centralize task configurations

Consider moving hardcoded batch sizes and other task configurations to a central configuration:

  1. Makes it easier to tune performance
  2. Ensures consistency across tasks
  3. Allows for environment-specific configurations

Example:

# config.py
TASK_CONFIG = {
    "default_batch_size": 50,
    "task_specific_config": {
        "get_non_code_files": {"batch_size": 50},
        "extract_chunks_from_documents": {"batch_size": 30},
        # ... other task-specific configs
    }
}
🛑 Comments failed to post (3)
cognee/tasks/repo_processor/get_source_code_chunks.py (1)

128-129: 🛠️ Refactor suggestion

Type annotation discrepancy in get_source_code_chunks.
The function is annotated to return AsyncGenerator[list[DataPoint], None], but it yields individual DataPoint or SourceCodeChunk objects. Update the return type to accurately match the emitted values, e.g. AsyncGenerator[DataPoint | SourceCodeChunk, None].

-async def get_source_code_chunks(data_points: list[DataPoint], embedding_model="text-embedding-3-large") \
-        -> AsyncGenerator[list[DataPoint], None]:
+async def get_source_code_chunks(data_points: list[DataPoint], embedding_model="text-embedding-3-large") \
+        -> AsyncGenerator[DataPoint | SourceCodeChunk, None]:

Also applies to: 131-140

cognee/modules/data/extraction/extract_summary.py (1)

20-24: 🛠️ Refactor suggestion

Enhance error handling with proper logging

While the error handling is a good addition, consider these improvements:

  1. Use proper logging instead of print statements
  2. Add telemetry for tracking LLM failures
  3. Consider adding type hints for the mock data return
- print(str(e))
+ logger.error(f"Failed to generate structured output: {str(e)}", exc_info=True)
+ metrics.increment("llm.structured_output.failure", tags={"error_type": type(e).__name__})

Committable suggestion skipped: line range outside the PR's diff.

cognee/tasks/repo_processor/get_repo_file_dependencies.py (1)

74-74: 🛠️ Refactor suggestion

Consider memory and performance optimizations

The current implementation has several potential issues:

  1. Fixed worker count (12) might not be optimal for all systems
  2. Collecting all code files in memory before yielding could be problematic for large repositories
  3. Lack of batch size control for the yield operation

Consider these improvements:

  1. Make worker count configurable based on system resources
  2. Implement batched processing to control memory usage
  3. Add progress tracking for large repositories
- with ProcessPoolExecutor(max_workers = 12) as executor:
+ worker_count = min(os.cpu_count() or 1, MAX_WORKERS)
+ with ProcessPoolExecutor(max_workers=worker_count) as executor:

- code_files = []
+ batch_size = 100  # or make configurable
+ code_files_batch = []
  for (file_path, metadata), dependencies in zip(py_files_dict.items(), results):
      # ... existing code ...
-     code_files.append(CodeFile(...))
+     code_files_batch.append(CodeFile(...))
+     if len(code_files_batch) >= batch_size:
+         yield code_files_batch
+         code_files_batch = []
  
- yield code_files
+ if code_files_batch:  # yield remaining files
+     yield code_files_batch

Also applies to: 93-112

@alekszievr force-pushed the COG-697-ingest-non-code-files branch 2 times, most recently from 85f8aac to 800a507 on December 19, 2024 at 17:05
coderabbitai bot (Contributor) left a comment

Actionable comments posted: 1

🧹 Nitpick comments (2)
cognee/api/v1/cognify/code_graph_pipeline.py (2)

101-105: Consider removing task configurations from deprecated function

While the batch size configurations improve performance, adding new features to deprecated functions may discourage users from migrating to run_code_graph_pipeline. Consider moving these improvements to the new pipeline only.


164-176: Consider separating document processing into a dedicated pipeline

The current implementation mixes document processing tasks within the code graph pipeline. Consider:

  1. Creating a separate document processing pipeline module
  2. Adding proper telemetry and logging for document processing
  3. Implementing document-specific configuration options

This would improve:

  • Maintainability through separation of concerns
  • Ability to run pipelines independently
  • Configuration management for each pipeline type
📜 Review details


📥 Commits

Reviewing files that changed from the base of the PR and between 7aa6c0f and 800a507.

📒 Files selected for processing (7)
  • cognee/api/v1/cognify/code_graph_pipeline.py (6 hunks)
  • cognee/modules/data/extraction/extract_summary.py (1 hunks)
  • cognee/shared/data_models.py (0 hunks)
  • cognee/tasks/repo_processor/get_source_code_chunks.py (1 hunks)
  • cognee/tasks/summarization/models.py (2 hunks)
  • cognee/tasks/summarization/summarize_code.py (3 hunks)
  • evals/eval_swe_bench.py (1 hunks)
💤 Files with no reviewable changes (1)
  • cognee/shared/data_models.py
🚧 Files skipped from review as they are similar to previous changes (5)
  • cognee/tasks/summarization/models.py
  • evals/eval_swe_bench.py
  • cognee/modules/data/extraction/extract_summary.py
  • cognee/tasks/repo_processor/get_source_code_chunks.py
  • cognee/tasks/summarization/summarize_code.py
🔇 Additional comments (2)
cognee/api/v1/cognify/code_graph_pipeline.py (2)

Line range hint 10-41: LGTM: Imports are well-organized and support the new functionality

The new imports are logically grouped and necessary for supporting document processing capabilities.


Line range hint 146-149: Add safety controls for pruning operations

The pruning operations lack critical safety controls and could be dangerous in a production environment.

Comment on lines +178 to +91
if include_docs:
    async for result in run_tasks(non_code_tasks, repo_path):
        yield result

async for result in run_tasks(tasks, repo_path, "cognify_code_pipeline"):
    yield result
coderabbitai bot (Contributor) commented:

🛠️ Refactor suggestion

Improve pipeline execution and error handling

The current implementation runs document and code tasks sequentially and lacks proper error handling. Consider:

  1. Running document and code tasks in parallel for better performance
  2. Adding specific error handling for document processing
  3. Implementing status tracking for document pipeline tasks

Example implementation:

try:
    async def collect(pipeline):
        return [result async for result in pipeline]

    pipelines = []
    if include_docs:
        pipelines.append(run_tasks(non_code_tasks, repo_path))
    pipelines.append(run_tasks(tasks, repo_path, "cognify_code_pipeline"))

    # Consume the pipelines concurrently, then yield their buffered results.
    for results in await asyncio.gather(*(collect(p) for p in pipelines)):
        for result in results:
            yield result
except Exception as error:
    logger.error(f"Pipeline execution failed: {error}")
    # Add specific error handling for document vs code pipeline
    raise

@alekszievr force-pushed the COG-697-ingest-non-code-files branch from 800a507 to 6ed7730 on December 20, 2024 at 09:38
gitguardian bot commented Dec 20, 2024

⚠️ GitGuardian has uncovered 1 secret following the scan of your pull request.

Please consider investigating the findings and remediating the incidents. Failure to do so may lead to compromising the associated services or software components.

🔎 Detected hardcoded secret in your pull request

  • GitGuardian id: 9573981
  • GitGuardian status: Triggered
  • Secret: Generic Password
  • Commit: 9ce09a8
  • Filename: notebooks/hr_demo.ipynb (View secret)
🛠 Guidelines to remediate hardcoded secrets
  1. Understand the implications of revoking this secret by investigating where it is used in your code.
  2. Replace and store your secret safely. Learn here the best practices.
  3. Revoke and rotate this secret.
  4. If possible, rewrite git history. Rewriting git history is not a trivial act. You might completely break other contributing developers' workflow and you risk accidentally deleting legitimate data.


@alekszievr force-pushed the COG-697-ingest-non-code-files branch from 6ed7730 to 719d621 on December 20, 2024 at 09:39
coderabbitai bot (Contributor) left a comment

Actionable comments posted: 0

♻️ Duplicate comments (3)
cognee/api/v1/cognify/code_graph_pipeline.py (3)

55-56: ⚠️ Potential issue

Add safety controls for pruning operations

This is a duplicate of a previously raised concern about dangerous pruning operations.

The pruning operations need immediate safety controls:

  1. Environment checks
  2. Logging
  3. User permission validation
  4. Confirmation mechanism
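A minimal sketch of such a guard, reusing the prune calls shown earlier in this thread (the environment-variable name is hypothetical):

import os

async def guarded_prune(user):
    # Refuse to prune outside explicitly allowed environments.
    if os.getenv("COGNEE_ENV", "development") == "production":
        raise RuntimeError("Refusing to prune data in production")
    logger.warning("Pruning data and system metadata for user %s", user.id)
    await cognee.prune.prune_data()
    await cognee.prune.prune_system(metadata=True)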

71-84: 🛠️ Refactor suggestion

Consider parallel execution of document and code tasks

This is a duplicate of a previously raised suggestion for parallel execution.

The document and code tasks could be executed in parallel using asyncio.gather() for better performance, as they appear to be independent.


86-91: 🛠️ Refactor suggestion

Add error handling and status tracking

This is a duplicate of a previously raised concern about pipeline execution.

The pipeline execution needs:

  1. Proper error handling for both document and code pipelines
  2. Status tracking for document processing tasks
  3. Telemetry for pipeline execution stages
🧹 Nitpick comments (3)
cognee/tasks/repo_processor/__init__.py (1)

7-7: Reorder imports to follow PEP 8

Module-level imports should precede other statements (Ruff E402); moving the logger initialization below the import block satisfies this:

 import logging
 
-logger = logging.getLogger("task:repo_processor")
-
 from .enrich_dependency_graph import enrich_dependency_graph
 from .expand_dependency_graph import expand_dependency_graph
 from .get_non_code_files import get_data_list_for_user, get_non_py_files
 from .get_repo_file_dependencies import get_repo_file_dependencies
+
+logger = logging.getLogger("task:repo_processor")
🧰 Tools
🪛 Ruff (0.8.2)

7-7: Module level import not at top of file

(E402)

cognee/api/v1/cognify/code_graph_pipeline.py (2)

35-35: Move summarization import to main import block

The summarization import should be grouped with other task-related imports.

from cognee.tasks.storage import add_data_points
+from cognee.tasks.summarization import summarize_code, summarize_text

monitoring = get_base_config().monitoring_tool
if monitoring == MonitoringTool.LANGFUSE:
    from langfuse.decorators import observe

-from cognee.tasks.summarization import summarize_code, summarize_text
🧰 Tools
🪛 Ruff (0.8.2)

35-35: Module level import not at top of file

(E402)


41-41: Consider defaulting include_docs to False

Since this is a new feature that processes additional files, it might be safer to default include_docs to False and require explicit opt-in.

-async def run_code_graph_pipeline(repo_path, include_docs=True):
+async def run_code_graph_pipeline(repo_path, include_docs=False):
📜 Review details


📥 Commits

Reviewing files that changed from the base of the PR and between 800a507 and 719d621.

📒 Files selected for processing (3)
  • cognee/api/v1/cognify/code_graph_pipeline.py (2 hunks)
  • cognee/tasks/repo_processor/__init__.py (1 hunks)
  • cognee/tasks/repo_processor/get_non_code_files.py (1 hunks)
🚧 Files skipped from review as they are similar to previous changes (1)
  • cognee/tasks/repo_processor/get_non_code_files.py
🧰 Additional context used
🪛 Ruff (0.8.2)
cognee/api/v1/cognify/code_graph_pipeline.py

35-35: Module level import not at top of file

(E402)

cognee/tasks/repo_processor/__init__.py

7-7: Module level import not at top of file

(E402)

🔇 Additional comments (1)
cognee/api/v1/cognify/code_graph_pipeline.py (1)

1-4: LGTM! Clear deprecation notice

The deprecation notice is well-documented with a reference to the related issue.

Task(expand_dependency_graph, task_config={"batch_size": 50}),
Task(get_source_code_chunks, embedding_model=embedding_engine.model, task_config={"batch_size": 50}),
borisarzentar (Contributor) commented:

Where does get_source_code_chunks come from?

alekszievr (Contributor, Author) replied:

hey @borisarzentar I messed up the commits on this PR, moved the meaningful part here: #395

I am closing this one now, sorry for the confusion

@alekszievr closed this on Dec 20, 2024
@alekszievr deleted the COG-697-ingest-non-code-files branch on December 20, 2024 at 11:01