Ingest non-code files in code graph pipeline #380
Conversation
Walkthrough
This pull request enhances the code graph pipeline by introducing a new `include_docs` parameter, which controls whether non-code files are ingested and processed alongside code.
Sequence Diagram

```mermaid
sequenceDiagram
    participant User
    participant CodeGraphPipeline
    participant EmbeddingEngine
    participant KnowledgeGraph
    User->>CodeGraphPipeline: run_code_graph_pipeline(repo_path, include_docs=True)
    CodeGraphPipeline->>EmbeddingEngine: Get embedding engine
    CodeGraphPipeline->>CodeGraphPipeline: Fetch default user
    alt include_docs is True
        CodeGraphPipeline->>CodeGraphPipeline: Retrieve non-code files
        CodeGraphPipeline->>KnowledgeGraph: Extract graphs from data
        CodeGraphPipeline->>CodeGraphPipeline: Summarize text
    end
    CodeGraphPipeline-->>User: Return processed results
```
Actionable comments posted: 0
🧹 Nitpick comments (5)
examples/python/code_graph_example.py (1)

13-14: Boolean CLI parameter note.
Argparse interprets booleans in Python-specific ways. The approach here should work, but be mindful that some users might pass “true” or “false” in different formats. Consider a more robust approach if usage is widespread.
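One common pattern is an explicit string-to-bool converter passed as the argparse `type` (a minimal sketch, not the PR's code; the flag name mirrors this PR's `include_docs`):

```python
import argparse

def str2bool(value: str) -> bool:
    """Accept common textual spellings of booleans."""
    if value.lower() in ("yes", "true", "t", "1"):
        return True
    if value.lower() in ("no", "false", "f", "0"):
        return False
    raise argparse.ArgumentTypeError(f"expected a boolean value, got {value!r}")

parser = argparse.ArgumentParser()
parser.add_argument("--include_docs", type=str2bool, default=True)
print(parser.parse_args(["--include_docs", "false"]).include_docs)  # False
```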
cognee/tasks/repo_processor/get_non_code_files.py (3)

18-28: Efficient retrieval of non-Python files.
Traversing the directory to filter out “.py” files is a straightforward approach. Consider adding logging or metrics for large repositories to track performance if needed.
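For illustration, such a traversal with basic logging might look like the following sketch (function and logger names are assumptions, not the PR's actual code):

```python
import logging
import os

logger = logging.getLogger("task:repo_processor")

def list_non_py_files(repo_path: str) -> list[str]:
    """Collect every file under repo_path that is not a .py file."""
    paths = []
    for root, _dirs, files in os.walk(repo_path):
        for name in files:
            if not name.endswith(".py"):
                paths.append(os.path.join(root, name))
    logger.info("Found %d non-Python files under %s", len(paths), repo_path)
    return paths
```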
30-39: Streamlined function for returning non-Python file paths.
Locating the repository path from the DataPoint list is succinct. If multiple DataPoints represent different repos, consider clarifying or extending the logic in future versions to handle multiple repos.
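If multi-repo support ever lands, the lookup could be made explicit along these lines (illustrative only; `path` as the attribute name is an assumption, not the actual DataPoint schema):

```python
def find_repo_paths(data_points: list) -> list[str]:
    """Collect the distinct repository paths present in a DataPoint list."""
    seen: list[str] = []
    for data_point in data_points:
        path = getattr(data_point, "path", None)
        if path is not None and path not in seen:
            seen.append(path)
    return seen
```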
40-46: Retrieving data documents for user's dataset.
This method effectively aggregates all relevant data documents. Potentially, adding pagination or partial retrieval mechanisms might help avoid large memory usage with very large datasets.
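A generic pagination wrapper could bound memory here; in this sketch `fetch_page` stands in for whatever cognee query would return one page of documents (it is a hypothetical callable, not an existing API):

```python
from typing import Any, Awaitable, Callable

async def iter_data_documents(
    fetch_page: Callable[[int, int], Awaitable[list[Any]]],  # hypothetical (limit, offset) -> page
    page_size: int = 100,
):
    """Stream documents page by page instead of materializing the full list."""
    offset = 0
    while True:
        page = await fetch_page(page_size, offset)
        if not page:
            return
        for document in page:
            yield document
        offset += page_size
```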
cognee/tasks/repo_processor/__init__.py (1)

7-7: Place module-level import at top.
Per linting suggestions (E402), consider moving this import to the top.

```diff
-import logging
-logger = logging.getLogger("task:repo_processor")
 from .enrich_dependency_graph import enrich_dependency_graph
 from .expand_dependency_graph import expand_dependency_graph
 from .get_non_code_files import get_data_list_for_user, get_non_code_files
 from .get_repo_file_dependencies import get_repo_file_dependencies
+import logging
+logger = logging.getLogger("task:repo_processor")
```

🧰 Tools
🪛 Ruff (0.8.2)
7-7: Module level import not at top of file
(E402)
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (4)
- cognee/api/v1/cognify/code_graph_pipeline.py (5 hunks)
- cognee/tasks/repo_processor/__init__.py (1 hunks)
- cognee/tasks/repo_processor/get_non_code_files.py (1 hunks)
- examples/python/code_graph_example.py (1 hunks)
🧰 Additional context used
🪛 Ruff (0.8.2)
cognee/tasks/repo_processor/__init__.py
7-7: Module level import not at top of file
(E402)
🔇 Additional comments (10)
cognee/api/v1/cognify/code_graph_pipeline.py (8)
10-10: Good addition of the cognify config import.
This import is necessary for retrieving the summarization model configuration used later in the code.

23-23: Useful import of KnowledgeGraph for non-code data extraction.
This addition aligns well with the new tasks that process and extract graphs from non-code files.

29-30: Expanded graph and ingestion tasks.
Importing these new tasks (`extract_graph_from_data` and `ingest_data_with_metadata`) is essential for handling documentation. Everything looks consistent.

33-34: Import for new helper functions.
The new non-code file retrieval and dataset retrieval methods are important for the pipeline expansion.

37-37: Summarization tasks included.
The imported `summarize_text` complements `summarize_code` for non-code file summarizations.

127-130: New parameter for handling non-code files.
Updating the function signature to include `include_docs` is a clean and logical approach. The documentation for the parameter usage is concise and clear.

144-146: Retrieving config and default user.
These lines ensure the pipeline has access to the relevant config and user context. No issues observed.

155-171: Conditional pipeline tasks for non-code files.
This block effectively groups tasks required for retrieving, ingesting, classifying, summarizing, and building graphs for non-Python files. The approach is modular, making the pipeline clearer and easier to maintain; a toy sketch of the task-pipeline pattern follows below.
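As a toy model of what this grouping buys (not cognee's actual `run_tasks`; all names here are illustrative), a pipeline is just an ordered list of awaitable transforms whose intermediate results can be streamed:

```python
import asyncio
from typing import Any, Awaitable, Callable, Iterable

async def run_tasks_demo(tasks: Iterable[Callable[[Any], Awaitable[Any]]], data: Any):
    """Feed each task the previous task's result, yielding every step."""
    result = data
    for task in tasks:
        result = await task(result)
        yield result

async def double(x):
    return x * 2

async def stringify(x):
    return f"value={x}"

async def main():
    async for step in run_tasks_demo([double, stringify], 21):
        print(step)  # 42, then "value=42"

asyncio.run(main())
```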
examples/python/code_graph_example.py (2)
7-8: New parameter 'include_docs' for main.
Allowing users to control non-code file processing from the CLI is a practical extension. The asynchronous call to 'run_code_graph_pipeline' with this new parameter is straightforward.

16-16: Coordinates with new 'include_docs' parameter.
The call to main with both arguments is correct and fulfills the pipeline configuration.
```python
async def get_non_code_files(data_points: list[DataPoint]):
    for data_point in data_points:
```
This task will run every time for every batch, which can be a bit slow on a huge amount of data points. If we split the pipeline like I mentioned above, then you don't need to do this, but receive the `repo_path` as entry data into this pipeline.
@borisarzentar I made the suggested changes, could you please look at it again?
Actionable comments posted: 1
🧹 Nitpick comments (1)
cognee/api/v1/cognify/code_graph_pipeline.py (1)
Line range hint 128-146: Add docstring to document the include_docs parameter.
The function signature change is good, but it would benefit from documentation explaining the purpose and impact of the `include_docs` parameter. Example docstring:

```python
async def run_code_graph_pipeline(repo_path, include_docs=True):
    """
    Run the code graph pipeline on a repository.

    Args:
        repo_path: Path to the repository
        include_docs: If True, processes non-Python files as documentation (default: True)
    """
```
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (4)
- cognee/api/v1/cognify/code_graph_pipeline.py (5 hunks)
- cognee/tasks/repo_processor/__init__.py (1 hunks)
- cognee/tasks/repo_processor/get_non_code_files.py (1 hunks)
- examples/python/code_graph_example.py (1 hunks)
🚧 Files skipped from review as they are similar to previous changes (2)
- examples/python/code_graph_example.py
- cognee/tasks/repo_processor/get_non_code_files.py
🧰 Additional context used
🪛 Ruff (0.8.2)
cognee/tasks/repo_processor/__init__.py
7-7: Module level import not at top of file
(E402)
🔇 Additional comments (3)
cognee/tasks/repo_processor/__init__.py (1)
7-7: LGTM: Import aligns with PR objectives
The addition of the `get_data_list_for_user` and `get_non_py_files` imports is appropriate for enabling non-code file processing functionality.
🧰 Tools
🪛 Ruff (0.8.2)
7-7: Module level import not at top of file
(E402)
cognee/api/v1/cognify/code_graph_pipeline.py (2)

7-7: LGTM: Imports support non-code file processing
The new imports appropriately include all necessary modules for document processing, graph extraction, and data ingestion.
Also applies to: 11-11, 24-24, 30-31, 34-35, 38-38

156-170: Consider splitting into separate pipelines
As suggested in a previous review, consider splitting this into separate pipelines for code and documentation processing. This would provide several benefits:
- Better separation of concerns
- Possibility of parallel execution
- Easier maintenance and testing
- Independent scaling of each pipeline
```python
    async for result in run_tasks(tasks, repo_path, "cognify_code_pipeline"):
        yield result

    if include_docs:
        async for result in run_tasks(non_code_tasks, repo_path):
            yield result
```
🛠️ Refactor suggestion
Add error handling and status tracking for document pipeline
The document processing pipeline would benefit from:
- Error handling similar to the deprecated `run_pipeline`
- Status tracking for document processing tasks
- Telemetry for document pipeline execution
Example structure:

```python
try:
    async for result in run_tasks(tasks, repo_path, "cognify_code_pipeline"):
        yield result

    if include_docs:
        send_telemetry("doc_pipeline_started", user.id)
        try:
            async for result in run_tasks(non_code_tasks, repo_path):
                yield result
            send_telemetry("doc_pipeline_completed", user.id)
        except Exception as error:
            send_telemetry("doc_pipeline_errored", user.id)
            raise error
except Exception as error:
    send_telemetry("code_pipeline_errored", user.id)
    raise error
```
Force-pushed from 8f4d480 to 428b2cf
Actionable comments posted: 1
♻️ Duplicate comments (1)
cognee/api/v1/cognify/code_graph_pipeline.py (1)
156-174: ⚠️ Potential issue: Add error handling and status tracking for document pipeline
The document processing pipeline would benefit from:
- Error handling similar to the deprecated `run_pipeline`
- Status tracking for document processing tasks
- Telemetry for document pipeline execution

Example structure:

```python
try:
    send_telemetry("doc_pipeline_started", user.id)
    async for result in run_tasks(non_code_tasks, repo_path):
        yield result
    send_telemetry("doc_pipeline_completed", user.id)
except Exception as error:
    send_telemetry("doc_pipeline_errored", user.id)
    raise error
```
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (4)
- cognee/api/v1/cognify/code_graph_pipeline.py (5 hunks)
- cognee/tasks/repo_processor/__init__.py (1 hunks)
- cognee/tasks/repo_processor/get_non_code_files.py (1 hunks)
- examples/python/code_graph_example.py (1 hunks)
🚧 Files skipped from review as they are similar to previous changes (2)
- cognee/tasks/repo_processor/get_non_code_files.py
- examples/python/code_graph_example.py
🧰 Additional context used
🪛 Ruff (0.8.2)
cognee/tasks/repo_processor/__init__.py
7-7: Module level import not at top of file
(E402)
🔇 Additional comments (4)
cognee/tasks/repo_processor/__init__.py (1)
7-7: LGTM: Import statement aligns with PR objectives
The import of the `get_data_list_for_user` and `get_non_py_files` functions is correctly placed and supports the new non-code file processing functionality.
🧰 Tools
🪛 Ruff (0.8.2)
7-7: Module level import not at top of file
(E402)
cognee/api/v1/cognify/code_graph_pipeline.py (3)
7-7: LGTM: New imports support non-code file processing
The added imports provide necessary functionality for processing non-code files, including document handling, graph extraction, and summarization capabilities.
Also applies to: 11-11, 24-24, 30-31, 34-35, 38-38

Line range hint 128-147: LGTM: Function signature and initialization look good
The addition of the `include_docs` parameter with a default value of `True` maintains backward compatibility while enabling the new functionality. The initialization of config and user is properly placed.
156-169: Consider splitting document processing into a separate pipeline
As suggested in the previous review, separating this into a dedicated pipeline would allow for better:
- Parallel execution
- Independent error handling
- Separate status tracking
```python
    if include_docs:
        async for result in run_tasks(non_code_tasks, repo_path):
            yield result

    async for result in run_tasks(tasks, repo_path, "cognify_code_pipeline"):
        yield result
```
🛠️ Refactor suggestion
Consider parallel execution of document and code tasks
Since there's no apparent dependency between document and code tasks, consider running them in parallel using `asyncio.gather()` to improve performance:

```python
if include_docs:
    doc_pipeline = run_tasks(non_code_tasks, repo_path)
code_pipeline = run_tasks(tasks, repo_path, "cognify_code_pipeline")
async for result in chain(
    doc_pipeline if include_docs else [],
    code_pipeline
):
    yield result
```
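Note that the snippet above uses a `chain` helper even though the prose mentions `asyncio.gather()`; neither `itertools.chain` nor `asyncio.gather` can consume async generators directly, so an async-aware chaining helper along these lines would be needed (a minimal sketch, not an existing cognee or stdlib API):

```python
async def achain(*aiterables):
    """Yield items from several async iterables, one after another."""
    for aiterable in aiterables:
        async for item in aiterable:
            yield item
```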
Actionable comments posted: 1
🔭 Outside diff range comments (2)
cognee/api/v1/cognify/code_graph_pipeline.py (2)
Line range hint 135-154: Consider extracting initialization logic to a separate function.
The function is mixing pipeline execution with system initialization concerns. This could lead to unexpected side effects and makes testing more difficult. Consider extracting the initialization logic:

```python
async def initialize_pipeline():
    file_path = Path(__file__).parent
    data_directory_path = str(pathlib.Path(os.path.join(file_path, ".data_storage/code_graph")).resolve())
    cognee.config.data_root_directory(data_directory_path)
    cognee_directory_path = str(pathlib.Path(os.path.join(file_path, ".cognee_system/code_graph")).resolve())
    cognee.config.system_root_directory(cognee_directory_path)
    await cognee.prune.prune_data()
    await cognee.prune.prune_system(metadata=True)
    await create_db_and_tables()

async def run_code_graph_pipeline(repo_path, include_docs=True):
    await initialize_pipeline()
    cognee_config = get_cognify_config()
    user = await get_default_user()
    # ... rest of the function
```
Line range hint 146-147: Avoid hardcoding paths relative to the file location.
The paths should be configurable through environment variables or configuration files to support different deployment environments.
🧹 Nitpick comments (1)
cognee/api/v1/cognify/code_graph_pipeline.py (1)

7-7: Consider organizing imports by type for better maintainability.
The new imports could be better organized into logical groups:
- Standard library imports (e.g., `chain`)
- Third-party imports
- Local application imports (grouped by module)

Also applies to: 11-11, 24-24, 30-31, 34-35, 38-38
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (1)
- cognee/api/v1/cognify/code_graph_pipeline.py (5 hunks)
🔇 Additional comments (2)
cognee/api/v1/cognify/code_graph_pipeline.py (2)
163-176: Consider previous review suggestion about separate pipelines.
As suggested in the previous review by @borisarzentar, splitting this into separate pipelines would allow for better:
- Parallel execution
- Independent error handling
- Separate monitoring and telemetry
Additionally, consider externalizing task configurations:
- Move batch sizes to configuration
- Allow task-specific configuration overrides
Example configuration structure:

```python
task_config = {
    "default": {"batch_size": 50},
    "get_non_py_files": {"batch_size": 100},
    "extract_chunks_from_documents": {"batch_size": 25}
}
```
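A small accessor over that assumed layout could then resolve a task's effective batch size (illustrative only, not part of the PR):

```python
def batch_size_for(task_name: str, task_config: dict) -> int:
    """Return the task-specific batch size, falling back to the default."""
    return task_config.get(task_name, task_config["default"])["batch_size"]

# batch_size_for("get_non_py_files", task_config) -> 100 under the example above
```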
178-183: Add error handling and consider parallel execution.
The current implementation:
- Lacks error handling for pipeline failures
- Could benefit from parallel execution as suggested in previous reviews
- Missing telemetry for tracking pipeline progress
Consider implementing the error handling and parallel execution pattern from the previous review:
```python
try:
    if include_docs:
        doc_pipeline = run_tasks(non_code_tasks, repo_path)
    code_pipeline = run_tasks(tasks, repo_path, "cognify_code_pipeline")
    async for result in chain(
        doc_pipeline if include_docs else [],
        code_pipeline
    ):
        yield result
except Exception as error:
    send_telemetry("pipeline_errored", {"error": str(error)})
    raise error
```
Force-pushed from 7aa6c0f to f30b17d
Caution
Inline review comments failed to post. This is likely due to GitHub's limits when posting large numbers of comments.
Actionable comments posted: 3
🔭 Outside diff range comments (1)
evals/eval_swe_bench.py (1)

Line range hint 133-134: Remove dataset slicing used for testing.
The `[:1]` slice limits processing to just one item, which appears to be debugging code that shouldn't be in production.

```diff
-    swe_dataset = load_swebench_dataset(
-        dataset_name, split='test')[:1]
+    swe_dataset = load_swebench_dataset(
+        dataset_name, split='test')
```
♻️ Duplicate comments (1)
cognee/api/v1/cognify/code_graph_pipeline.py (1)
163-181: 🛠️ Refactor suggestion: Implement parallel execution and error handling for document pipeline
The document pipeline could be optimized and made more robust:
- Consider parallel execution since code and document tasks are independent
- Add proper error handling and telemetry
Example implementation:
```python
if include_docs:
    try:
        send_telemetry("doc_pipeline_started")
        doc_pipeline = run_tasks(non_code_tasks, repo_path)
        code_pipeline = run_tasks(tasks, repo_path, "cognify_code_pipeline")
        # note: asyncio.gather awaits coroutines and returns a list, so async
        # generators like these would need an async-aware chaining helper instead
        async for result in asyncio.gather(doc_pipeline, code_pipeline):
            yield result
        send_telemetry("doc_pipeline_completed")
    except Exception as error:
        send_telemetry("doc_pipeline_errored")
        raise error
```
🧹 Nitpick comments (8)
cognee/tasks/repo_processor/get_source_code_chunks.py (2)
35-72: Potential performance concern with recursive calls.
The function `_get_subchunk_token_counts` calls itself recursively for large child nodes (line 69). This might lead to deep recursion for unusually large or nested code structures, possibly affecting performance or stack usage.
74-99: Clarify overlap logic in `_get_chunk_source_code`.
The logic for determining the cutoff index (lines 92–97) works well for progressive chunking with overlap, but the variable naming and inline arithmetic might benefit from extra comments to increase maintainability.
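The general pattern being described is overlapping chunking, where each chunk repeats the tail of its predecessor so context survives the boundary. A generic sketch of the idea (illustrative parameters, not the PR's `_get_chunk_source_code`):

```python
def chunk_with_overlap(tokens: list, max_len: int, overlap: int) -> list[list]:
    """Split tokens into windows of max_len that share `overlap` items."""
    assert 0 <= overlap < max_len
    step = max_len - overlap
    chunks = []
    for start in range(0, len(tokens), step):
        chunks.append(tokens[start:start + max_len])
        if start + max_len >= len(tokens):
            break
    return chunks

# chunk_with_overlap(list(range(10)), max_len=4, overlap=1)
# -> [[0, 1, 2, 3], [3, 4, 5, 6], [6, 7, 8, 9]]
```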
cognee/tasks/repo_processor/__init__.py (1)

7-7: Reorder imports to comply with style guidelines (E402).
According to static analysis, this import (line 7) should appear near the top of the file.

```diff
+from .get_non_code_files import get_data_list_for_user, get_non_py_files
 import logging

 logger = logging.getLogger("task:repo_processor")

 from .enrich_dependency_graph import enrich_dependency_graph
 from .expand_dependency_graph import expand_dependency_graph
-from .get_non_code_files import get_data_list_for_user, get_non_py_files
 from .get_repo_file_dependencies import get_repo_file_dependencies
```

🧰 Tools
🪛 Ruff (0.8.2)
7-7: Module level import not at top of file
(E402)
cognee/shared/CodeGraphEntities.py (2)
28-29: Consider documenting the commented-out code.
The commented-out `part_of` relationship and optional `source_code` suggest a significant architectural change. Consider either removing the comment or documenting the rationale.
35-43: Review the SourceCodeChunk implementation.
The new SourceCodeChunk class introduces a linked-list structure through `previous_chunk` but lacks:
- Documentation explaining the chunking strategy
- Validation for circular references
- A clear relationship with the parent CodePart

Consider (a validation sketch follows this list):
- Adding a docstring explaining the chunking strategy
- Implementing validation for circular references
- Clarifying the relationship between chunks and their parent CodePart
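For the circular-reference point, a slow/fast pointer walk (Floyd's cycle detection) over the `previous_chunk` links would suffice; this sketch uses a stand-in dataclass rather than the PR's actual DataPoint model:

```python
from __future__ import annotations
from dataclasses import dataclass
from typing import Optional

@dataclass
class Chunk:  # stand-in for SourceCodeChunk
    source_code: str
    previous_chunk: Optional[Chunk] = None

def has_cycle(chunk: Chunk) -> bool:
    """Detect a loop in the previous_chunk chain with slow/fast pointers."""
    slow = fast = chunk
    while fast is not None and fast.previous_chunk is not None:
        slow = slow.previous_chunk
        fast = fast.previous_chunk.previous_chunk
        if slow is fast:
            return True
    return False
```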
cognee/tasks/repo_processor/get_repo_file_dependencies.py (1)
Line range hint 97-110: Optimize UUID generation and dependency handling.
The nested UUID generation and CodeFile creation for dependencies could be optimized:
- Consider caching UUIDs for frequently referenced files
- Batch create dependencies instead of inline creation

Consider extracting dependency creation to a separate method and implementing caching:

```python
def create_dependency_file(dependency: str, repo: Repository, uuid_cache: dict, file_dict: dict) -> CodeFile:
    if dependency in uuid_cache:
        return uuid_cache[dependency]
    code_file = CodeFile(
        id=uuid5(NAMESPACE_OID, dependency),
        extracted_id=dependency,
        part_of=repo,
        source_code=file_dict.get(dependency, {}).get("source_code"),
    )
    uuid_cache[dependency] = code_file
    return code_file
```

evals/eval_swe_bench.py (1)
40-46: Remove commented-out code if no longer needed.
The `render_graph` call is commented out. If this functionality is no longer needed, the code should be removed rather than left as a comment.

```diff
-    # await render_graph(None, include_labels=True, include_nodes=True)
```
cognee/api/v1/cognify/code_graph_pipeline.py (1)
164-176: Centralize task configurations.
Consider moving hardcoded batch sizes and other task configurations to a central configuration:
- Makes it easier to tune performance
- Ensures consistency across tasks
- Allows for environment-specific configurations

Example:

```python
# config.py
TASK_CONFIG = {
    "default_batch_size": 50,
    "task_specific_config": {
        "get_non_code_files": {"batch_size": 50},
        "extract_chunks_from_documents": {"batch_size": 30},
        # ... other task-specific configs
    }
}
```
🛑 Comments failed to post (3)
cognee/tasks/repo_processor/get_source_code_chunks.py (1)
128-129: 🛠️ Refactor suggestion
Type annotation discrepancy in get_source_code_chunks.
The function is annotated to return `AsyncGenerator[list[DataPoint], None]`, but it yields individual DataPoint or SourceCodeChunk objects. Update the return type to accurately match the emitted values, e.g. `AsyncGenerator[DataPoint | SourceCodeChunk, None]`.

```diff
-async def get_source_code_chunks(data_points: list[DataPoint], embedding_model="text-embedding-3-large") \
-        -> AsyncGenerator[list[DataPoint], None]:
+async def get_source_code_chunks(data_points: list[DataPoint], embedding_model="text-embedding-3-large") \
+        -> AsyncGenerator[DataPoint | SourceCodeChunk, None]:
```

Also applies to: 131-140
cognee/modules/data/extraction/extract_summary.py (1)
20-24: 🛠️ Refactor suggestion
Enhance error handling with proper logging
While the error handling is a good addition, consider these improvements:
- Use proper logging instead of print statements
- Add telemetry for tracking LLM failures
- Consider adding type hints for the mock data return
```diff
-    print(str(e))
+    logger.error(f"Failed to generate structured output: {str(e)}", exc_info=True)
+    metrics.increment("llm.structured_output.failure", tags={"error_type": type(e).__name__})
```

Committable suggestion skipped: line range outside the PR's diff.
cognee/tasks/repo_processor/get_repo_file_dependencies.py (1)
74-74: 🛠️ Refactor suggestion
Consider memory and performance optimizations
The current implementation has several potential issues:
- Fixed worker count (12) might not be optimal for all systems
- Collecting all code files in memory before yielding could be problematic for large repositories
- Lack of batch size control for the yield operation
Consider these improvements:
- Make worker count configurable based on system resources
- Implement batched processing to control memory usage
- Add progress tracking for large repositories
```diff
-    with ProcessPoolExecutor(max_workers = 12) as executor:
+    worker_count = min(os.cpu_count() or 1, MAX_WORKERS)
+    with ProcessPoolExecutor(max_workers=worker_count) as executor:

-    code_files = []
+    batch_size = 100  # or make configurable
+    code_files_batch = []
     for (file_path, metadata), dependencies in zip(py_files_dict.items(), results):
         # ... existing code ...
-        code_files.append(CodeFile(...))
+        code_files_batch.append(CodeFile(...))
+        if len(code_files_batch) >= batch_size:
+            yield code_files_batch
+            code_files_batch = []

-    yield code_files
+    if code_files_batch:  # yield remaining files
+        yield code_files_batch
```

Also applies to: 93-112
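The same batching idea in isolation, as a generic helper with illustrative names (separate from the diff above, not the PR's code):

```python
from typing import Iterable, Iterator, List, TypeVar

T = TypeVar("T")

def batched(items: Iterable[T], batch_size: int) -> Iterator[List[T]]:
    """Group items into lists of at most batch_size, yielding as they fill."""
    batch: List[T] = []
    for item in items:
        batch.append(item)
        if len(batch) >= batch_size:
            yield batch
            batch = []
    if batch:  # flush the remainder
        yield batch
```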
Force-pushed from 85f8aac to 800a507
Actionable comments posted: 1
🧹 Nitpick comments (2)
cognee/api/v1/cognify/code_graph_pipeline.py (2)
101-105: Consider removing task configurations from deprecated function.
While the batch size configurations improve performance, adding new features to deprecated functions may discourage users from migrating to `run_code_graph_pipeline`. Consider moving these improvements to the new pipeline only.
164-176: Consider separating document processing into a dedicated pipeline.
The current implementation mixes document processing tasks within the code graph pipeline. Consider:
- Creating a separate document processing pipeline module
- Adding proper telemetry and logging for document processing
- Implementing document-specific configuration options
This would improve:
- Maintainability through separation of concerns
- Ability to run pipelines independently
- Configuration management for each pipeline type
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (7)
- cognee/api/v1/cognify/code_graph_pipeline.py (6 hunks)
- cognee/modules/data/extraction/extract_summary.py (1 hunks)
- cognee/shared/data_models.py (0 hunks)
- cognee/tasks/repo_processor/get_source_code_chunks.py (1 hunks)
- cognee/tasks/summarization/models.py (2 hunks)
- cognee/tasks/summarization/summarize_code.py (3 hunks)
- evals/eval_swe_bench.py (1 hunks)
💤 Files with no reviewable changes (1)
- cognee/shared/data_models.py
🚧 Files skipped from review as they are similar to previous changes (5)
- cognee/tasks/summarization/models.py
- evals/eval_swe_bench.py
- cognee/modules/data/extraction/extract_summary.py
- cognee/tasks/repo_processor/get_source_code_chunks.py
- cognee/tasks/summarization/summarize_code.py
🔇 Additional comments (2)
cognee/api/v1/cognify/code_graph_pipeline.py (2)
Line range hint 10-41: LGTM: Imports are well-organized and support the new functionality.
The new imports are logically grouped and necessary for supporting document processing capabilities.

Line range hint 146-149: Add safety controls for pruning operations.
The pruning operations lack critical safety controls and could be dangerous in a production environment.
```python
    if include_docs:
        async for result in run_tasks(non_code_tasks, repo_path):
            yield result

    async for result in run_tasks(tasks, repo_path, "cognify_code_pipeline"):
        yield result
```
🛠️ Refactor suggestion
Improve pipeline execution and error handling
The current implementation runs document and code tasks sequentially and lacks proper error handling. Consider:
- Running document and code tasks in parallel for better performance
- Adding specific error handling for document processing
- Implementing status tracking for document pipeline tasks
Example implementation:
```python
try:
    if include_docs:
        doc_pipeline = run_tasks(non_code_tasks, repo_path)
    code_pipeline = run_tasks(tasks, repo_path, "cognify_code_pipeline")
    async for result in asyncio.gather(
        doc_pipeline if include_docs else [],
        code_pipeline
    ):
        yield result
except Exception as error:
    logger.error(f"Pipeline execution failed: {error}")
    # Add specific error handling for document vs code pipeline
    raise
```
Force-pushed from 800a507 to 6ed7730
| GitGuardian id | GitGuardian status | Secret | Commit | Filename | |
|---|---|---|---|---|---|
| 9573981 | Triggered | Generic Password | 9ce09a8 | notebooks/hr_demo.ipynb | View secret |
🛠 Guidelines to remediate hardcoded secrets
- Understand the implications of revoking this secret by investigating where it is used in your code.
- Replace and store your secret safely; learn the best practices here.
- Revoke and rotate this secret.
- If possible, rewrite git history. Rewriting git history is not a trivial act: you might completely break other contributing developers' workflows, and you risk accidentally deleting legitimate data.

To avoid such incidents in the future, consider:
- following these best practices for managing and storing secrets, including API keys and other credentials
- installing secret detection on pre-commit to catch secrets before they leave your machine and ease remediation.
Force-pushed from 6ed7730 to 719d621
Actionable comments posted: 0
♻️ Duplicate comments (3)
cognee/api/v1/cognify/code_graph_pipeline.py (3)
55-56: ⚠️ Potential issue: Add safety controls for pruning operations
This is a duplicate of a previously raised concern about dangerous pruning operations.
The pruning operations need immediate safety controls:
- Environment checks
- Logging
- User permission validation
- Confirmation mechanism
71-84: 🛠️ Refactor suggestion: Consider parallel execution of document and code tasks
This is a duplicate of a previously raised suggestion for parallel execution. The document and code tasks could be executed in parallel using `asyncio.gather()` for better performance, as they appear to be independent.
86-91: 🛠️ Refactor suggestion: Add error handling and status tracking
This is a duplicate of a previously raised concern about pipeline execution.
The pipeline execution needs:
- Proper error handling for both document and code pipelines
- Status tracking for document processing tasks
- Telemetry for pipeline execution stages
🧹 Nitpick comments (3)
cognee/tasks/repo_processor/__init__.py (1)
7-7: Reorder imports to follow PEP 8.
The import statement should be moved toward the top of the file, after the logger initialization but before the other imports, to comply with PEP 8 guidelines.

```diff
 import logging

 logger = logging.getLogger("task:repo_processor")

+from .get_non_code_files import get_data_list_for_user, get_non_py_files
 from .enrich_dependency_graph import enrich_dependency_graph
 from .expand_dependency_graph import expand_dependency_graph
-from .get_non_code_files import get_data_list_for_user, get_non_py_files
 from .get_repo_file_dependencies import get_repo_file_dependencies
```

🧰 Tools
🪛 Ruff (0.8.2)
7-7: Module level import not at top of file
(E402)
cognee/api/v1/cognify/code_graph_pipeline.py (2)
35-35: Move summarization import to main import block.
The summarization import should be grouped with other task-related imports.

```diff
 from cognee.tasks.storage import add_data_points
+from cognee.tasks.summarization import summarize_code, summarize_text

 monitoring = get_base_config().monitoring_tool
 if monitoring == MonitoringTool.LANGFUSE:
     from langfuse.decorators import observe

-from cognee.tasks.summarization import summarize_code, summarize_text
```

🧰 Tools
🪛 Ruff (0.8.2)
35-35: Module level import not at top of file
(E402)
41-41: Consider defaulting include_docs to False.
Since this is a new feature that processes additional files, it might be safer to default `include_docs` to `False` and require explicit opt-in.

```diff
-async def run_code_graph_pipeline(repo_path, include_docs=True):
+async def run_code_graph_pipeline(repo_path, include_docs=False):
```
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (3)
- cognee/api/v1/cognify/code_graph_pipeline.py (2 hunks)
- cognee/tasks/repo_processor/__init__.py (1 hunks)
- cognee/tasks/repo_processor/get_non_code_files.py (1 hunks)
🚧 Files skipped from review as they are similar to previous changes (1)
- cognee/tasks/repo_processor/get_non_code_files.py
🧰 Additional context used
🪛 Ruff (0.8.2)
cognee/api/v1/cognify/code_graph_pipeline.py
35-35: Module level import not at top of file
(E402)
cognee/tasks/repo_processor/__init__.py
7-7: Module level import not at top of file
(E402)
🔇 Additional comments (1)
cognee/api/v1/cognify/code_graph_pipeline.py (1)
1-4
: LGTM! Clear deprecation notice
The deprecation notice is well-documented with a reference to the related issue.
```python
Task(expand_dependency_graph, task_config={"batch_size": 50}),
Task(get_source_code_chunks, embedding_model=embedding_engine.model, task_config={"batch_size": 50}),
```
Where does `get_source_code_chunks` come from?
Hey @borisarzentar, I messed up the commits on this PR and moved the meaningful part here: #395. I am closing this one now, sorry for the confusion.
In this first version, I assume for simplicity that every non-py file is documentation.
Summary by CodeRabbit
- New Features
- Bug Fixes
- Documentation