
Ingest non-code files in code graph pipeline #380

Closed
alekszievr wants to merge 2 commits from the COG-697-ingest-non-code-files branch

Conversation

@alekszievr (Contributor) commented Dec 18, 2024

In this first version, I assume for simplicity that every non-py file is documentation.

Summary by CodeRabbit

  • New Features

    • Enhanced processing capabilities for both code and non-code files in the pipeline.
    • New functions added for retrieving non-Python files and user-specific datasets.
  • Bug Fixes

    • Improved error handling for repository path existence checks.
  • Documentation

    • Updated method signatures and added new function descriptions for clarity.

coderabbitai bot (Contributor) commented Dec 18, 2024

Walkthrough

This pull request enhances the code graph pipeline functionality by introducing a new parameter include_docs to the run_code_graph_pipeline function. The changes enable more comprehensive data processing by conditionally handling non-code files and documents. New utility functions are added to retrieve non-Python files and user-specific datasets, expanding the pipeline's flexibility in processing repository contents.

Changes

  • cognee/api/v1/cognify/code_graph_pipeline.py: Added the include_docs parameter to run_code_graph_pipeline; introduced conditional logic for processing non-code files; added new imports for the embedding engine and knowledge graph.
  • cognee/tasks/repo_processor/__init__.py: Imported the get_data_list_for_user and get_non_py_files functions.
  • cognee/tasks/repo_processor/get_non_code_files.py: Added get_non_py_files() to retrieve non-Python files and get_data_list_for_user() to fetch user-specific datasets.

Sequence Diagram

sequenceDiagram
    participant User
    participant CodeGraphPipeline
    participant EmbeddingEngine
    participant KnowledgeGraph
    
    User->>CodeGraphPipeline: run_code_graph_pipeline(repo_path, include_docs=True)
    CodeGraphPipeline->>EmbeddingEngine: Get embedding engine
    CodeGraphPipeline->>CodeGraphPipeline: Fetch default user
    
    alt include_docs is True
        CodeGraphPipeline->>CodeGraphPipeline: Retrieve non-code files
        CodeGraphPipeline->>KnowledgeGraph: Extract graphs from data
        CodeGraphPipeline->>CodeGraphPipeline: Summarize text
    end
    
    CodeGraphPipeline-->>User: Return processed results


Suggested reviewers

  • lxobr
  • Vasilije1990

Poem

🐰 Hopping through code with glee and might,
Docs and files now dance in light!
Non-Python treasures, come what may,
Our pipeline grows smarter today!
Rabbit's code magic takes its flight! 🚀


coderabbitai bot (Contributor) left a comment

Actionable comments posted: 0

🧹 Nitpick comments (5)
examples/python/code_graph_example.py (1)

13-14: Boolean CLI parameter note.
argparse has no real boolean handling: type=bool converts any non-empty string (including "false") to True. The approach here should work for the documented usage, but users may pass "true" or "false" in various formats, so consider a more robust converter if usage becomes widespread.
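A minimal sketch of a more robust converter (the flag name mirrors this example's include_docs parameter; str2bool is illustrative, not part of the codebase):

import argparse

def str2bool(value: str) -> bool:
    # bool("false") is True, so parse the common spellings explicitly.
    if value.lower() in ("yes", "true", "t", "1"):
        return True
    if value.lower() in ("no", "false", "f", "0"):
        return False
    raise argparse.ArgumentTypeError(f"expected a boolean, got {value!r}")

parser = argparse.ArgumentParser()
parser.add_argument("--include_docs", type=str2bool, default=True)
print(parser.parse_args(["--include_docs", "false"]).include_docs)  # False

On Python 3.9+, argparse.BooleanOptionalAction provides paired --include-docs/--no-include-docs flags without a custom converter.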

cognee/tasks/repo_processor/get_non_code_files.py (3)

18-28: Efficient retrieval of non-Python files.
Traversing the directory to filter out “.py” files is a straightforward approach. Consider adding logging or metrics for large repositories to track performance if needed.
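As a sketch of that instrumentation, assuming a plain os.walk traversal (the helper name is illustrative, not the PR's actual function):

import logging
import os
import time

logger = logging.getLogger("task:repo_processor")

def walk_non_py_files(repo_path: str) -> list[str]:
    # Time the traversal and log the count so large repositories show up in logs.
    start = time.monotonic()
    paths = [
        os.path.join(root, name)
        for root, _, files in os.walk(repo_path)
        for name in files
        if not name.endswith(".py")
    ]
    logger.info("Found %d non-Python files in %.2fs", len(paths), time.monotonic() - start)
    return paths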


30-39: Streamlined function for returning non-Python file paths.
Locating the repository path from the DataPoint list is succinct. If multiple DataPoints represent different repos, consider clarifying or extending the logic in future versions to handle multiple repos.
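If that extension is ever needed, a hypothetical grouping step could look like the following (it assumes repository data points expose a path attribute, which this PR does not guarantee):

from collections import defaultdict

def group_data_points_by_repo(data_points):
    # Bucket data points by repository path so each repository is handled separately.
    repos = defaultdict(list)
    for data_point in data_points:
        repo_path = getattr(data_point, "path", None)
        if repo_path is not None:
            repos[repo_path].append(data_point)
    return repos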


40-46: Retrieving data documents for user’s dataset.
This method effectively aggregates all relevant data documents. Potentially, adding pagination or partial retrieval mechanisms might help avoid large memory usage with very large datasets.
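A minimal sketch of such batched retrieval, with get_batch standing in as a hypothetical paginated dataset query:

async def iter_data_in_batches(get_batch, batch_size: int = 100):
    # Pull the dataset page by page instead of materializing everything at once.
    offset = 0
    while True:
        batch = await get_batch(limit=batch_size, offset=offset)
        if not batch:
            break
        yield batch
        offset += batch_size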

cognee/tasks/repo_processor/__init__.py (1)

7-7: Place module-level import at top.
Per linting suggestions (E402), module-level imports belong at the top of the file; moving the logger assignment below the import block resolves the warning:

 import logging
 
-logger = logging.getLogger("task:repo_processor")
-
 from .enrich_dependency_graph import enrich_dependency_graph
 from .expand_dependency_graph import expand_dependency_graph
 from .get_non_code_files import get_data_list_for_user, get_non_code_files
 from .get_repo_file_dependencies import get_repo_file_dependencies
+
+logger = logging.getLogger("task:repo_processor")
🧰 Tools
🪛 Ruff (0.8.2)

7-7: Module level import not at top of file

(E402)

📜 Review details


📥 Commits

Reviewing files that changed from the base of the PR and between 45cb2c3 and e93dc12.

📒 Files selected for processing (4)
  • cognee/api/v1/cognify/code_graph_pipeline.py (5 hunks)
  • cognee/tasks/repo_processor/__init__.py (1 hunks)
  • cognee/tasks/repo_processor/get_non_code_files.py (1 hunks)
  • examples/python/code_graph_example.py (1 hunks)
🧰 Additional context used
🪛 Ruff (0.8.2)
cognee/tasks/repo_processor/__init__.py

7-7: Module level import not at top of file

(E402)

🔇 Additional comments (10)
cognee/api/v1/cognify/code_graph_pipeline.py (8)

10-10: Good addition of the cognify config import.
This import is necessary for retrieving the summarization model configuration used later in the code.


23-23: Useful import of KnowledgeGraph for non-code data extraction.
This addition aligns well with the new tasks that process and extract graphs from non-code files.


29-30: Expanded graph and ingestion tasks.
Importing these new tasks (extract_graph_from_data and ingest_data_with_metadata) is essential for handling documentation. Everything looks consistent.


33-34: Import for new helper functions.
The new non-code file retrieval and dataset retrieval methods are important for the pipeline expansion.


37-37: Summarization tasks included.
The imported summarize_text complements summarize_code for non-code file summarizations.


127-130: New parameter for handling non-code files.
Updating the function signature to include 'include_docs' is a clean and logical approach. The documentation for the parameter usage is concise and clear.


144-146: Retrieving config and default user.
These lines ensure the pipeline has access to the relevant config and user context. No issues observed.


155-171: Conditional pipeline tasks for non-code files.
This block effectively groups tasks required for retrieving, ingesting, classifying, summarizing, and building graphs for non-Python files. The approach is modular, making the pipeline clearer and easier to maintain.

examples/python/code_graph_example.py (2)

7-8: New parameter 'include_docs' for main.
Allowing users to control non-code file processing from the CLI is a practical extension. The asynchronous call to 'run_code_graph_pipeline' with this new parameter is straightforward.


16-16: Coordinates with new 'include_docs' parameter.
The call to main with both arguments is correct and fulfills the pipeline configuration.

cognee/api/v1/cognify/code_graph_pipeline.py (resolved review thread)


cognee/tasks/repo_processor/get_non_code_files.py

async def get_non_code_files(data_points: list[DataPoint]):
    for data_point in data_points:

borisarzentar (Contributor) commented:

This task will run every time for every batch, which can be a bit slow with a huge number of data points. If we split the pipeline as I mentioned above, you don't need to do this here; this pipeline can instead receive the repo_path as its entry data.

alekszievr (Contributor, Author) replied:
@borisarzentar I made the suggested changes, could you please look at it again?

coderabbitai bot (Contributor) left a comment

Actionable comments posted: 1

🧹 Nitpick comments (1)
cognee/api/v1/cognify/code_graph_pipeline.py (1)

Line range hint 128-146: Add docstring to document the include_docs parameter

The function signature change is good, but it would benefit from documentation explaining the purpose and impact of the include_docs parameter.

Example docstring:

async def run_code_graph_pipeline(repo_path, include_docs=True):
    """
    Run the code graph pipeline on a repository.
    
    Args:
        repo_path: Path to the repository
        include_docs: If True, processes non-Python files as documentation (default: True)
    """
📜 Review details


📥 Commits

Reviewing files that changed from the base of the PR and between e93dc12 and 3f96416.

📒 Files selected for processing (4)
  • cognee/api/v1/cognify/code_graph_pipeline.py (5 hunks)
  • cognee/tasks/repo_processor/__init__.py (1 hunks)
  • cognee/tasks/repo_processor/get_non_code_files.py (1 hunks)
  • examples/python/code_graph_example.py (1 hunks)
🚧 Files skipped from review as they are similar to previous changes (2)
  • examples/python/code_graph_example.py
  • cognee/tasks/repo_processor/get_non_code_files.py
🧰 Additional context used
🪛 Ruff (0.8.2)
cognee/tasks/repo_processor/__init__.py

7-7: Module level import not at top of file

(E402)

🔇 Additional comments (3)
cognee/tasks/repo_processor/__init__.py (1)

7-7: LGTM: Import aligns with PR objectives

The addition of get_data_list_for_user and get_non_py_files imports is appropriate for enabling non-code file processing functionality.

🧰 Tools
🪛 Ruff (0.8.2)

7-7: Module level import not at top of file

(E402)

cognee/api/v1/cognify/code_graph_pipeline.py (2)

7-7: LGTM: Imports support non-code file processing

The new imports appropriately include all necessary modules for document processing, graph extraction, and data ingestion.

Also applies to: 11-11, 24-24, 30-31, 34-35, 38-38


156-170: Consider splitting into separate pipelines

As suggested in a previous review, consider splitting this into separate pipelines for code and documentation processing. This would provide several benefits:

  1. Better separation of concerns
  2. Possibility of parallel execution
  3. Easier maintenance and testing
  4. Independent scaling of each pipeline
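A rough sketch of that split, reusing the task lists assembled in this PR (an outline under those assumptions, not a final API):

async def run_code_pipeline(repo_path):
    # Code-only pipeline: dependency graph enrichment, code summarization, etc.
    async for result in run_tasks(tasks, repo_path, "cognify_code_pipeline"):
        yield result

async def run_docs_pipeline(repo_path):
    # Documentation-only pipeline: ingest, classify, and summarize non-code files.
    async for result in run_tasks(non_code_tasks, repo_path):
        yield result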

Comment on lines 172 to 88
async for result in run_tasks(tasks, repo_path, "cognify_code_pipeline"):
    yield result

if include_docs:
    async for result in run_tasks(non_code_tasks, repo_path):
        yield result
coderabbitai bot (Contributor) commented:

🛠️ Refactor suggestion

Add error handling and status tracking for document pipeline

The document processing pipeline would benefit from:

  1. Error handling similar to the deprecated run_pipeline
  2. Status tracking for document processing tasks
  3. Telemetry for document pipeline execution

Example structure:

try:
    async for result in run_tasks(tasks, repo_path, "cognify_code_pipeline"):
        yield result
    
    if include_docs:
        send_telemetry("doc_pipeline_started", user.id)
        try:
            async for result in run_tasks(non_code_tasks, repo_path):
                yield result
            send_telemetry("doc_pipeline_completed", user.id)
        except Exception as error:
            send_telemetry("doc_pipeline_errored", user.id)
            raise error
except Exception as error:
    send_telemetry("code_pipeline_errored", user.id)
    raise error

@alekszievr force-pushed the COG-697-ingest-non-code-files branch from 8f4d480 to 428b2cf on December 18, 2024 at 22:15
coderabbitai bot (Contributor) left a comment

Actionable comments posted: 1

♻️ Duplicate comments (1)
cognee/api/v1/cognify/code_graph_pipeline.py (1)

156-174: ⚠️ Potential issue

Add error handling and status tracking for document pipeline

The document processing pipeline would benefit from:

  1. Error handling similar to the deprecated run_pipeline
  2. Status tracking for document processing tasks
  3. Telemetry for document pipeline execution

Example structure:

try:
    send_telemetry("doc_pipeline_started", user.id)
    async for result in run_tasks(non_code_tasks, repo_path):
        yield result
    send_telemetry("doc_pipeline_completed", user.id)
except Exception as error:
    send_telemetry("doc_pipeline_errored", user.id)
    raise error
📜 Review details


📥 Commits

Reviewing files that changed from the base of the PR and between 3f96416 and 428b2cf.

📒 Files selected for processing (4)
  • cognee/api/v1/cognify/code_graph_pipeline.py (5 hunks)
  • cognee/tasks/repo_processor/__init__.py (1 hunks)
  • cognee/tasks/repo_processor/get_non_code_files.py (1 hunks)
  • examples/python/code_graph_example.py (1 hunks)
🚧 Files skipped from review as they are similar to previous changes (2)
  • cognee/tasks/repo_processor/get_non_code_files.py
  • examples/python/code_graph_example.py
🧰 Additional context used
🪛 Ruff (0.8.2)
cognee/tasks/repo_processor/__init__.py

7-7: Module level import not at top of file

(E402)

🔇 Additional comments (4)
cognee/tasks/repo_processor/__init__.py (1)

7-7: LGTM: Import statement aligns with PR objectives

The import of get_data_list_for_user and get_non_py_files functions is correctly placed and supports the new non-code file processing functionality.

🧰 Tools
🪛 Ruff (0.8.2)

7-7: Module level import not at top of file

(E402)

cognee/api/v1/cognify/code_graph_pipeline.py (3)

7-7: LGTM: New imports support non-code file processing

The added imports provide necessary functionality for processing non-code files, including document handling, graph extraction, and summarization capabilities.

Also applies to: 11-11, 24-24, 30-31, 34-35, 38-38


Line range hint 128-147: LGTM: Function signature and initialization look good

The addition of the include_docs parameter with a default value of True maintains backward compatibility while enabling the new functionality. The initialization of config and user is properly placed.


156-169: Consider splitting document processing into a separate pipeline

As suggested in the previous review, separating this into a dedicated pipeline would allow for better:

  • Parallel execution
  • Independent error handling
  • Separate status tracking

Comment on lines 171 to 194
if include_docs:
    async for result in run_tasks(non_code_tasks, repo_path):
        yield result

async for result in run_tasks(tasks, repo_path, "cognify_code_pipeline"):
    yield result
coderabbitai bot (Contributor) commented:

🛠️ Refactor suggestion

Consider parallel execution of document and code tasks

Since there's no apparent dependency between document and code tasks, consider running them in parallel using asyncio.gather() to improve performance:

import asyncio

async def collect(pipeline):
    return [result async for result in pipeline]

pipelines = []
if include_docs:
    pipelines.append(run_tasks(non_code_tasks, repo_path))
pipelines.append(run_tasks(tasks, repo_path, "cognify_code_pipeline"))

# Consume the pipelines concurrently, then yield the buffered results.
for results in await asyncio.gather(*(collect(p) for p in pipelines)):
    for result in results:
        yield result

coderabbitai bot (Contributor) left a comment

Actionable comments posted: 1

🔭 Outside diff range comments (2)
cognee/api/v1/cognify/code_graph_pipeline.py (2)

Line range hint 135-154: Consider extracting initialization logic to a separate function.

The function is mixing pipeline execution with system initialization concerns. This could lead to unexpected side effects and makes testing more difficult.

Consider extracting the initialization logic:

async def initialize_pipeline():
    file_path = Path(__file__).parent
    data_directory_path = str(pathlib.Path(os.path.join(file_path, ".data_storage/code_graph")).resolve())
    cognee.config.data_root_directory(data_directory_path)
    cognee_directory_path = str(pathlib.Path(os.path.join(file_path, ".cognee_system/code_graph")).resolve())
    cognee.config.system_root_directory(cognee_directory_path)

    await cognee.prune.prune_data()
    await cognee.prune.prune_system(metadata=True)
    await create_db_and_tables()

async def run_code_graph_pipeline(repo_path, include_docs=True):
    await initialize_pipeline()
    cognee_config = get_cognify_config()
    user = await get_default_user()
    # ... rest of the function

Line range hint 146-147: Avoid hardcoding paths relative to the file location.

The paths should be configurable through environment variables or configuration files to support different deployment environments.

🧹 Nitpick comments (1)
cognee/api/v1/cognify/code_graph_pipeline.py (1)

7-7: Consider organizing imports by type for better maintainability.

The new imports could be better organized into logical groups:

  • Standard library imports (e.g., chain)
  • Third-party imports
  • Local application imports (grouped by module)

Also applies to: 11-11, 24-24, 30-31, 34-35, 38-38

📜 Review details


📥 Commits

Reviewing files that changed from the base of the PR and between 428b2cf and ea67ac9.

📒 Files selected for processing (1)
  • cognee/api/v1/cognify/code_graph_pipeline.py (5 hunks)
🔇 Additional comments (2)
cognee/api/v1/cognify/code_graph_pipeline.py (2)

163-176: Consider previous review suggestion about separate pipelines.

As suggested in the previous review by @borisarzentar, splitting this into separate pipelines would allow for better:

  • Parallel execution
  • Independent error handling
  • Separate monitoring and telemetry

Additionally, consider externalizing task configurations:

  1. Move batch sizes to configuration
  2. Allow task-specific configuration overrides

Example configuration structure:

task_config = {
    "default": {"batch_size": 50},
    "get_non_py_files": {"batch_size": 100},
    "extract_chunks_from_documents": {"batch_size": 25}
}

178-183: Add error handling and consider parallel execution.

The current implementation:

  1. Lacks error handling for pipeline failures
  2. Could benefit from parallel execution as suggested in previous reviews
  3. Missing telemetry for tracking pipeline progress

Consider implementing the error handling and parallel execution pattern from the previous review:

try:
    async def collect(pipeline):
        return [result async for result in pipeline]

    pipelines = []
    if include_docs:
        pipelines.append(run_tasks(non_code_tasks, repo_path))
    pipelines.append(run_tasks(tasks, repo_path, "cognify_code_pipeline"))

    # Run both pipelines concurrently and yield their buffered results.
    for results in await asyncio.gather(*(collect(p) for p in pipelines)):
        for result in results:
            yield result
except Exception as error:
    send_telemetry("pipeline_errored", {"error": str(error)})
    raise error

cognee/api/v1/cognify/code_graph_pipeline.py (resolved review thread)
@alekszievr force-pushed the COG-697-ingest-non-code-files branch 2 times, most recently from 7aa6c0f to f30b17d on December 19, 2024 at 17:02
coderabbitai bot (Contributor) left a comment

Caution

Inline review comments failed to post. This is likely due to GitHub's limits when posting large numbers of comments.

Actionable comments posted: 3

🔭 Outside diff range comments (1)
evals/eval_swe_bench.py (1)

Line range hint 133-134: Remove dataset slicing used for testing

The [:1] slice limits processing to just one item, which appears to be debugging code that shouldn't be in production.

-        swe_dataset = load_swebench_dataset(
-            dataset_name, split='test')[:1]
+        swe_dataset = load_swebench_dataset(
+            dataset_name, split='test')
♻️ Duplicate comments (1)
cognee/api/v1/cognify/code_graph_pipeline.py (1)

163-181: 🛠️ Refactor suggestion

Implement parallel execution and error handling for document pipeline

The document pipeline could be optimized and made more robust:

  1. Consider parallel execution since code and document tasks are independent
  2. Add proper error handling and telemetry

Example implementation:

if include_docs:
    try:
        send_telemetry("doc_pipeline_started")

        async def collect(pipeline):
            return [result async for result in pipeline]

        # Run the document and code pipelines concurrently.
        doc_results, code_results = await asyncio.gather(
            collect(run_tasks(non_code_tasks, repo_path)),
            collect(run_tasks(tasks, repo_path, "cognify_code_pipeline")),
        )
        for result in doc_results + code_results:
            yield result

        send_telemetry("doc_pipeline_completed")
    except Exception as error:
        send_telemetry("doc_pipeline_errored")
        raise error
🧹 Nitpick comments (8)
cognee/tasks/repo_processor/get_source_code_chunks.py (2)

35-72: Potential performance concern with recursive calls.
The function _get_subchunk_token_counts calls itself recursively for large child nodes (line 69). This might lead to deep recursion for unusually large or nested code structures, possibly affecting performance or stack usage.


74-99: Clarify overlap logic in _get_chunk_source_code.
The logic for determining the cutoff index (lines 92–97) works well for progressive chunking with overlap, but the variable naming and inline arithmetic might benefit from extra comments to increase maintainability.

cognee/tasks/repo_processor/__init__.py (1)

7-7: Reorder imports to comply with style guidelines (E402).
According to static analysis, module-level imports should appear before other statements. Moving the logger assignment after the import block resolves the warning:

 import logging
-logger = logging.getLogger("task:repo_processor")
 from .enrich_dependency_graph import enrich_dependency_graph
 from .expand_dependency_graph import expand_dependency_graph
 from .get_non_code_files import get_data_list_for_user, get_non_py_files
 from .get_repo_file_dependencies import get_repo_file_dependencies
+
+logger = logging.getLogger("task:repo_processor")
🧰 Tools
🪛 Ruff (0.8.2)

7-7: Module level import not at top of file

(E402)

cognee/shared/CodeGraphEntities.py (2)

28-29: Consider documenting the commented-out code

The commented-out part_of relationship and optional source_code suggest a significant architectural change. Consider either removing the comment or documenting the rationale.


35-43: Review the SourceCodeChunk implementation

The new SourceCodeChunk class introduces a linked-list structure through previous_chunk but lacks:

  1. Documentation explaining the chunking strategy
  2. Validation for circular references
  3. Clear relationship with the parent CodePart

Consider:

  • Adding docstring explaining the chunking strategy
  • Implementing validation for circular references
  • Clarifying the relationship between chunks and their parent CodePart
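For the circular-reference point, a minimal validation sketch (assuming SourceCodeChunk exposes id and previous_chunk, as the linked-list structure implies):

def has_circular_reference(chunk) -> bool:
    # Walk the previous_chunk links; revisiting an id means the chain loops.
    visited = set()
    current = chunk
    while current is not None:
        if current.id in visited:
            return True
        visited.add(current.id)
        current = current.previous_chunk
    return False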
cognee/tasks/repo_processor/get_repo_file_dependencies.py (1)

Line range hint 97-110: Optimize UUID generation and dependency handling

The nested UUID generation and CodeFile creation for dependencies could be optimized:

  1. Consider caching UUIDs for frequently referenced files
  2. Batch create dependencies instead of inline creation

Consider extracting dependency creation to a separate method and implementing caching:

def create_dependency_file(dependency: str, repo: Repository, uuid_cache: dict, file_dict: dict) -> CodeFile:
    if dependency in uuid_cache:
        return uuid_cache[dependency]
    
    code_file = CodeFile(
        id=uuid5(NAMESPACE_OID, dependency),
        extracted_id=dependency,
        part_of=repo,
        source_code=file_dict.get(dependency, {}).get("source_code"),
    )
    uuid_cache[dependency] = code_file
    return code_file
evals/eval_swe_bench.py (1)

40-46: Remove commented out code if no longer needed

The render_graph call is commented out. If this functionality is no longer needed, the code should be removed rather than left as a comment.

-    # await render_graph(None, include_labels=True, include_nodes=True)
cognee/api/v1/cognify/code_graph_pipeline.py (1)

164-176: Centralize task configurations

Consider moving hardcoded batch sizes and other task configurations to a central configuration:

  1. Makes it easier to tune performance
  2. Ensures consistency across tasks
  3. Allows for environment-specific configurations

Example:

# config.py
TASK_CONFIG = {
    "default_batch_size": 50,
    "task_specific_config": {
        "get_non_code_files": {"batch_size": 50},
        "extract_chunks_from_documents": {"batch_size": 30},
        # ... other task-specific configs
    }
}
🛑 Comments failed to post (3)
cognee/tasks/repo_processor/get_source_code_chunks.py (1)

128-129: 🛠️ Refactor suggestion

Type annotation discrepancy in get_source_code_chunks.
The function is annotated to return AsyncGenerator[list[DataPoint], None], but it yields individual DataPoint or SourceCodeChunk objects. Update the return type to accurately match the emitted values, e.g. AsyncGenerator[DataPoint | SourceCodeChunk, None].

-async def get_source_code_chunks(data_points: list[DataPoint], embedding_model="text-embedding-3-large") \
-        -> AsyncGenerator[list[DataPoint], None]:
+async def get_source_code_chunks(data_points: list[DataPoint], embedding_model="text-embedding-3-large") \
+        -> AsyncGenerator[DataPoint | SourceCodeChunk, None]:

Also applies to: 131-140

cognee/modules/data/extraction/extract_summary.py (1)

20-24: 🛠️ Refactor suggestion

Enhance error handling with proper logging

While the error handling is a good addition, consider these improvements:

  1. Use proper logging instead of print statements
  2. Add telemetry for tracking LLM failures
  3. Consider adding type hints for the mock data return
- print(str(e))
+ logger.error(f"Failed to generate structured output: {str(e)}", exc_info=True)
+ metrics.increment("llm.structured_output.failure", tags={"error_type": type(e).__name__})

Committable suggestion skipped: line range outside the PR's diff.

cognee/tasks/repo_processor/get_repo_file_dependencies.py (1)

74-74: 🛠️ Refactor suggestion

Consider memory and performance optimizations

The current implementation has several potential issues:

  1. Fixed worker count (12) might not be optimal for all systems
  2. Collecting all code files in memory before yielding could be problematic for large repositories
  3. Lack of batch size control for the yield operation

Consider these improvements:

  1. Make worker count configurable based on system resources
  2. Implement batched processing to control memory usage
  3. Add progress tracking for large repositories
- with ProcessPoolExecutor(max_workers = 12) as executor:
+ worker_count = min(os.cpu_count() or 1, MAX_WORKERS)
+ with ProcessPoolExecutor(max_workers=worker_count) as executor:

- code_files = []
+ batch_size = 100  # or make configurable
+ code_files_batch = []
  for (file_path, metadata), dependencies in zip(py_files_dict.items(), results):
      # ... existing code ...
-     code_files.append(CodeFile(...))
+     code_files_batch.append(CodeFile(...))
+     if len(code_files_batch) >= batch_size:
+         yield code_files_batch
+         code_files_batch = []
  
- yield code_files
+ if code_files_batch:  # yield remaining files
+     yield code_files_batch

Also applies to: 93-112

@alekszievr force-pushed the COG-697-ingest-non-code-files branch 2 times, most recently from 85f8aac to 800a507 on December 19, 2024 at 17:05
coderabbitai bot (Contributor) left a comment

Actionable comments posted: 1

🧹 Nitpick comments (2)
cognee/api/v1/cognify/code_graph_pipeline.py (2)

101-105: Consider removing task configurations from deprecated function

While the batch size configurations improve performance, adding new features to deprecated functions may discourage users from migrating to run_code_graph_pipeline. Consider moving these improvements to the new pipeline only.


164-176: Consider separating document processing into a dedicated pipeline

The current implementation mixes document processing tasks within the code graph pipeline. Consider:

  1. Creating a separate document processing pipeline module
  2. Adding proper telemetry and logging for document processing
  3. Implementing document-specific configuration options

This would improve:

  • Maintainability through separation of concerns
  • Ability to run pipelines independently
  • Configuration management for each pipeline type
📜 Review details


📥 Commits

Reviewing files that changed from the base of the PR and between 7aa6c0f and 800a507.

📒 Files selected for processing (7)
  • cognee/api/v1/cognify/code_graph_pipeline.py (6 hunks)
  • cognee/modules/data/extraction/extract_summary.py (1 hunks)
  • cognee/shared/data_models.py (0 hunks)
  • cognee/tasks/repo_processor/get_source_code_chunks.py (1 hunks)
  • cognee/tasks/summarization/models.py (2 hunks)
  • cognee/tasks/summarization/summarize_code.py (3 hunks)
  • evals/eval_swe_bench.py (1 hunks)
💤 Files with no reviewable changes (1)
  • cognee/shared/data_models.py
🚧 Files skipped from review as they are similar to previous changes (5)
  • cognee/tasks/summarization/models.py
  • evals/eval_swe_bench.py
  • cognee/modules/data/extraction/extract_summary.py
  • cognee/tasks/repo_processor/get_source_code_chunks.py
  • cognee/tasks/summarization/summarize_code.py
🔇 Additional comments (2)
cognee/api/v1/cognify/code_graph_pipeline.py (2)

Line range hint 10-41: LGTM: Imports are well-organized and support the new functionality

The new imports are logically grouped and necessary for supporting document processing capabilities.


Line range hint 146-149: Add safety controls for pruning operations

The pruning operations lack critical safety controls and could be dangerous in a production environment.

Comment on lines +178 to +91
if include_docs:
    async for result in run_tasks(non_code_tasks, repo_path):
        yield result

async for result in run_tasks(tasks, repo_path, "cognify_code_pipeline"):
    yield result
coderabbitai bot (Contributor) commented:

🛠️ Refactor suggestion

Improve pipeline execution and error handling

The current implementation runs document and code tasks sequentially and lacks proper error handling. Consider:

  1. Running document and code tasks in parallel for better performance
  2. Adding specific error handling for document processing
  3. Implementing status tracking for document pipeline tasks

Example implementation:

try:
    async def collect(pipeline):
        return [result async for result in pipeline]

    pipelines = []
    if include_docs:
        pipelines.append(run_tasks(non_code_tasks, repo_path))
    pipelines.append(run_tasks(tasks, repo_path, "cognify_code_pipeline"))

    # Consume the pipelines concurrently, then yield their buffered results.
    for results in await asyncio.gather(*(collect(p) for p in pipelines)):
        for result in results:
            yield result
except Exception as error:
    logger.error(f"Pipeline execution failed: {error}")
    # Add specific error handling for document vs code pipeline
    raise

@alekszievr force-pushed the COG-697-ingest-non-code-files branch from 800a507 to 6ed7730 on December 20, 2024 at 09:38
gitguardian bot commented Dec 20, 2024

⚠️ GitGuardian has uncovered 1 secret following the scan of your pull request.

Please consider investigating the findings and remediating the incidents. Failure to do so may lead to compromising the associated services or software components.

🔎 Detected hardcoded secret in your pull request

  • GitGuardian id: 9573981
  • GitGuardian status: Triggered
  • Secret: Generic Password
  • Commit: 9ce09a8
  • Filename: notebooks/hr_demo.ipynb (View secret)
🛠 Guidelines to remediate hardcoded secrets
  1. Understand the implications of revoking this secret by investigating where it is used in your code.
  2. Replace and store your secret safely. Learn here the best practices.
  3. Revoke and rotate this secret.
  4. If possible, rewrite git history. Rewriting git history is not a trivial act. You might completely break other contributing developers' workflow and you risk accidentally deleting legitimate data.


@alekszievr force-pushed the COG-697-ingest-non-code-files branch from 6ed7730 to 719d621 on December 20, 2024 at 09:39
coderabbitai bot (Contributor) left a comment

Actionable comments posted: 0

♻️ Duplicate comments (3)
cognee/api/v1/cognify/code_graph_pipeline.py (3)

55-56: ⚠️ Potential issue

Add safety controls for pruning operations

This is a duplicate of a previously raised concern about dangerous pruning operations.

The pruning operations need immediate safety controls:

  1. Environment checks
  2. Logging
  3. User permission validation
  4. Confirmation mechanism
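A minimal sketch of such a guard, reusing the prune calls shown earlier in this thread (the environment-variable name is hypothetical):

import os

async def guarded_prune(user):
    # Refuse to prune outside explicitly allowed environments.
    if os.getenv("COGNEE_ENV", "development") == "production":
        raise RuntimeError("Refusing to prune data in production")
    logger.warning("Pruning data and system metadata for user %s", user.id)
    await cognee.prune.prune_data()
    await cognee.prune.prune_system(metadata=True)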

71-84: 🛠️ Refactor suggestion

Consider parallel execution of document and code tasks

This is a duplicate of a previously raised suggestion for parallel execution.

The document and code tasks could be executed in parallel using asyncio.gather() for better performance, as they appear to be independent.


86-91: 🛠️ Refactor suggestion

Add error handling and status tracking

This is a duplicate of a previously raised concern about pipeline execution.

The pipeline execution needs:

  1. Proper error handling for both document and code pipelines
  2. Status tracking for document processing tasks
  3. Telemetry for pipeline execution stages
🧹 Nitpick comments (3)
cognee/tasks/repo_processor/__init__.py (1)

7-7: Reorder imports to follow PEP 8

Module-level imports should precede other statements (Ruff E402); moving the logger initialization below the import block satisfies this:

 import logging
 
-logger = logging.getLogger("task:repo_processor")
-
 from .enrich_dependency_graph import enrich_dependency_graph
 from .expand_dependency_graph import expand_dependency_graph
 from .get_non_code_files import get_data_list_for_user, get_non_py_files
 from .get_repo_file_dependencies import get_repo_file_dependencies
+
+logger = logging.getLogger("task:repo_processor")
🧰 Tools
🪛 Ruff (0.8.2)

7-7: Module level import not at top of file

(E402)

cognee/api/v1/cognify/code_graph_pipeline.py (2)

35-35: Move summarization import to main import block

The summarization import should be grouped with other task-related imports.

from cognee.tasks.storage import add_data_points
+from cognee.tasks.summarization import summarize_code, summarize_text

monitoring = get_base_config().monitoring_tool
if monitoring == MonitoringTool.LANGFUSE:
    from langfuse.decorators import observe

-from cognee.tasks.summarization import summarize_code, summarize_text
🧰 Tools
🪛 Ruff (0.8.2)

35-35: Module level import not at top of file

(E402)


41-41: Consider defaulting include_docs to False

Since this is a new feature that processes additional files, it might be safer to default include_docs to False and require explicit opt-in.

-async def run_code_graph_pipeline(repo_path, include_docs=True):
+async def run_code_graph_pipeline(repo_path, include_docs=False):
📜 Review details


📥 Commits

Reviewing files that changed from the base of the PR and between 800a507 and 719d621.

📒 Files selected for processing (3)
  • cognee/api/v1/cognify/code_graph_pipeline.py (2 hunks)
  • cognee/tasks/repo_processor/__init__.py (1 hunks)
  • cognee/tasks/repo_processor/get_non_code_files.py (1 hunks)
🚧 Files skipped from review as they are similar to previous changes (1)
  • cognee/tasks/repo_processor/get_non_code_files.py
🧰 Additional context used
🪛 Ruff (0.8.2)
cognee/api/v1/cognify/code_graph_pipeline.py

35-35: Module level import not at top of file

(E402)

cognee/tasks/repo_processor/__init__.py

7-7: Module level import not at top of file

(E402)

🔇 Additional comments (1)
cognee/api/v1/cognify/code_graph_pipeline.py (1)

1-4: LGTM! Clear deprecation notice

The deprecation notice is well-documented with a reference to the related issue.

Task(expand_dependency_graph, task_config={"batch_size": 50}),
Task(get_source_code_chunks, embedding_model=embedding_engine.model, task_config={"batch_size": 50}),
borisarzentar (Contributor) commented:

Where does get_source_code_chunks come from?

alekszievr (Contributor, Author) replied:

hey @borisarzentar I messed up the commits on this PR, moved the meaningful part here: #395

I am closing this one now, sorry for the confusion

@alekszievr closed this on Dec 20, 2024
@alekszievr deleted the COG-697-ingest-non-code-files branch on December 20, 2024 at 11:01