Cog 813 source code chunks #383

Merged
merged 19 commits into from
Dec 26, 2024

Conversation

lxobr
Collaborator

@lxobr lxobr commented Dec 18, 2024

  • Fixed the data points flow between the first two pipeline tasks.
  • Updated CodeGraphEntity data points to have the correct metadata type and removed unnecessary embeddings.
  • Introduced a new SourceCodeChunk data point.
  • Implemented a task to chunk source code into embeddable parts.

Summary by CodeRabbit

  • New Features

    • Enhanced functionality for processing source code into manageable chunks.
    • Introduced a new custom exception for handling embedding errors.
  • Bug Fixes

    • Improved error handling and logging in various functions to enhance robustness.
  • Documentation

    • Updated argument parsing for improved flexibility in input options.
  • Refactor

    • Restructured task execution logic and class attributes for better clarity and usability.
  • Style

    • Adjusted formatting and readability in multiple files for consistency.

@lxobr lxobr requested a review from alekszievr December 18, 2024 13:13
Contributor

coderabbitai bot commented Dec 18, 2024

Walkthrough

This pull request introduces significant changes to the code graph processing pipeline, focusing on source code chunking, embedding, and summarization. The modifications span multiple files, including code_graph_pipeline.py, CodeGraphEntities.py, and various task-related modules. The changes enhance the system's ability to process and analyze source code by introducing more granular code chunking, improving error handling, and providing more flexible summarization capabilities.

Changes

File | Change Summary
cognee/api/v1/cognify/code_graph_pipeline.py | Updated imports, added embedding engine initialization, modified task configurations for source code chunks and non-code files
cognee/shared/CodeGraphEntities.py | Removed CodeRelationship class, added SourceCodeChunk class, modified metadata and attributes of existing classes
cognee/tasks/repo_processor/get_repo_file_dependencies.py | Modified yield statements to return lists of CodeFile instances
cognee/tasks/repo_processor/get_source_code_chunks.py | Added comprehensive source code chunking functionality with token-based splitting and error handling
cognee/shared/data_models.py | Removed file_name attribute from SummarizedCode class
cognee/tasks/summarization/models.py | Updated CodeSummary class to support multiple source types
cognee/tasks/summarization/summarize_code.py | Modified attribute references and summary generation logic
cognee/tasks/storage/index_data_points.py | Added logging and error handling for indexing operations
examples/python/code_graph_example.py | Enhanced --include_docs argument parsing
cognee/infrastructure/databases/vector/embeddings/LiteLLMEmbeddingEngine.py | Improved error handling and added custom embedding exception
cognee/infrastructure/databases/exceptions/EmbeddingException.py | Introduced new custom exception for embedding-related errors

Sequence Diagram

sequenceDiagram
    participant Pipeline as Code Graph Pipeline
    participant Processor as Repo Processor
    participant Chunker as Source Code Chunker
    participant Embedder as Embedding Engine
    participant Summarizer as Code Summarizer
    participant Indexer as Data Indexer

    Pipeline->>Processor: Get repository files
    Processor-->>Pipeline: Return code files
    Pipeline->>Chunker: Process source code chunks
    Chunker->>Embedder: Generate embeddings
    Embedder-->>Chunker: Return embeddings
    Chunker-->>Pipeline: Return code chunks
    Pipeline->>Summarizer: Generate code summaries
    Summarizer-->>Pipeline: Return summaries
    Pipeline->>Indexer: Index data points
    Indexer-->>Pipeline: Confirm indexing

Suggested reviewers

  • borisarzentar
  • Vasilije1990
  • 0xideas

Poem

🐰 Code chunks dancing, tokens light,
Embedding magic takes its flight!
Parsing lines with rabbit's might,
Transforming source to insight bright,
A pipeline of pure delight! 🔍✨


📜 Recent review details

Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between cf63dbc and f5fa3ec.

📒 Files selected for processing (3)
  • cognee/infrastructure/databases/exceptions/EmbeddingException.py (1 hunks)
  • cognee/infrastructure/databases/vector/embeddings/LiteLLMEmbeddingEngine.py (6 hunks)
  • cognee/tasks/storage/index_data_points.py (2 hunks)
🚧 Files skipped from review as they are similar to previous changes (2)
  • cognee/tasks/storage/index_data_points.py
  • cognee/infrastructure/databases/vector/embeddings/LiteLLMEmbeddingEngine.py
🔇 Additional comments (1)
cognee/infrastructure/databases/exceptions/EmbeddingException.py (1)

1-3: Looks Good!

The custom exception is straightforward, correctly inherits from Exception, and includes a concise docstring describing its purpose. No issues spotted.
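For readers without the diff at hand, an exception of the kind described here might look like the following minimal sketch. The docstring wording and the `embed_or_raise` helper are assumptions made for illustration; the actual file contents are not shown in this thread.

```python
class EmbeddingException(Exception):
    """Raised when an embedding request fails (illustrative docstring)."""


def embed_or_raise(text: str) -> list[float]:
    # Hypothetical helper: stands in for a real call to an embedding engine.
    if not text.strip():
        raise EmbeddingException("Cannot embed empty text")
    # Placeholder "embedding" so the sketch is runnable without a model.
    return [float(len(text))]
```

Callers can then catch `EmbeddingException` specifically instead of a broad `Exception`.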


@lxobr lxobr requested a review from borisarzentar December 18, 2024 13:13
@lxobr lxobr self-assigned this Dec 18, 2024
Collaborator Author

lxobr commented Dec 18, 2024

@borisarzentar , can you please check whether the DataPoints in CodeGraphEntities.py look alright?

Contributor

@coderabbitai coderabbitai bot left a comment

Actionable comments posted: 4

🧹 Nitpick comments (8)
cognee/tasks/repo_processor/get_source_code_chunks.py (2)

46-63: Clarify cutoff logic in _get_chunk_source_code.
The variable names (e.g., “current_cutoff”) can be confusing. Consider renaming it to something like “overlap_cutoff_index” for readability. Also verify that line 57 sets an index one behind the current iteration. If “i - 1” happens to be negative, we might skip the first subchunk.

- current_cutoff = i - 1
+ overlap_cutoff_index = max(0, i - 1)

65-90: Check concurrency or parallel chunk generation in get_source_code_chunks_from_code_part.
This function yields chunks in a sequential manner, which is likely fine for many uses. However, if a user requires parallel chunking (e.g., for large repositories), you might consider asynchronous partitioning. Assess whether sequential generation could become a bottleneck and if asynchronous patterns would help.

cognee/shared/CodeGraphEntities.py (2)

32-40: SourceCodeChunk design appears coherent, but consider linking to CodeFile for traceability.
Right now, a SourceCodeChunk belongs to a CodePart, which in turn belongs to a CodeFile. For direct referencing or analytics, you might need a direct link to the parent CodeFile. Evaluate whether hooking SourceCodeChunk directly to a CodeFile could simplify queries and indexing.


45-45: Coordinate model rebuild calls more systematically.
All classes call model_rebuild() individually. While this is correct, consider grouping the calls together or leveraging an auto-rebuild pattern if the engine supports it. This could reduce duplication and ensure consistency across all models.

cognee/tasks/repo_processor/get_repo_file_dependencies.py (1)

112-112: Yielding code_files as a single list might disrupt existing pipeline logic.
Some tasks may anticipate sequential DataPoints. Double-check any tasks that handle CodeFile objects to ensure they correctly process a list of them at once.

cognee/api/v1/cognify/code_graph_pipeline.py (3)

66-66: Document differences between run_pipeline and run_code_graph_pipeline.
Although run_pipeline is deprecated, it still has an effect. Clarify in the docstring that run_code_graph_pipeline is the recommended approach so that future maintainers understand the differences.


141-141: Revisit enrich_dependency_graph ordering.
This task precedes expanding dependencies and extracting source code chunks. If additional metadata or code relationships introduced by chunking are needed for the dependency graph, you may want to reorder these tasks.


143-144: Integrate concurrency or partial streaming strategies.
The code_graph_pipeline tasks can handle large repositories. When dealing with thousands of files, running get_source_code_chunks followed by summarize_code might be time-consuming. Consider partial updates or concurrency to handle tasks without waiting for the entire operation to complete.

📜 Review details

Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 45cb2c3 and b5847c5.

📒 Files selected for processing (4)
  • cognee/api/v1/cognify/code_graph_pipeline.py (4 hunks)
  • cognee/shared/CodeGraphEntities.py (1 hunks)
  • cognee/tasks/repo_processor/get_repo_file_dependencies.py (3 hunks)
  • cognee/tasks/repo_processor/get_source_code_chunks.py (1 hunks)
🔇 Additional comments (5)
cognee/tasks/repo_processor/get_source_code_chunks.py (2)

11-13: Consider validating or sanitizing input in _count_tokens.
If the input source code string contains unexpected characters or is extremely large, it may cause performance overhead or unexpected tokenization results. You might want to add a guard clause or logging messages for extremely large inputs to avoid potential memory strain.


92-104: Review data point yielding order and potential duplication.
This function yields the original data point first, and then yields each contained CodePart followed by newly generated SourceCodeChunks. Ensure that no duplicate DataPoints are being introduced back into the pipeline, especially when combining tasks that rely on unique DataPoint identities.

✅ Verification successful

Let me gather more information about the data point types and their relationships to better understand the potential for duplicates.


Let me check how the source code chunks are generated and if there's any deduplication mechanism in place.


Based on the gathered information, I can now provide a final response:

The yielding order is correct and duplicates are prevented by design.

The implementation is safe because:

  1. Each DataPoint has a unique UUID identifier generated at creation (via uuid4)
  2. The hierarchical yielding (CodeFile -> CodePart -> SourceCodeChunk) follows the natural containment relationship
  3. Each level represents different entity types (CodeFile, CodePart, SourceCodeChunk) that are processed differently in the pipeline

The sequential yielding is intentional to allow downstream tasks to process each level of abstraction while maintaining the relationship between parent and child entities.
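The yielding order described above can be sketched roughly as follows. This is not the PR's code: `File`, `Part`, and `Chunk` are simplified stand-ins for CodeFile, CodePart, and SourceCodeChunk, and the one-chunk-per-line split is an assumption chosen only to keep the example small.

```python
import asyncio
from dataclasses import dataclass, field
from uuid import UUID, uuid4


@dataclass
class Chunk:  # stand-in for SourceCodeChunk
    source_code: str
    id: UUID = field(default_factory=uuid4)


@dataclass
class Part:  # stand-in for CodePart
    source_code: str
    id: UUID = field(default_factory=uuid4)


@dataclass
class File:  # stand-in for CodeFile
    contains: list[Part]
    id: UUID = field(default_factory=uuid4)


async def yield_hierarchy(files: list[File]):
    for file in files:
        yield file  # parent first
        for part in file.contains:
            yield part  # then each contained part
            # Assumed chunking rule: one chunk per source line.
            for line in part.source_code.splitlines():
                yield Chunk(source_code=line)


async def collect(files: list[File]) -> list:
    return [item async for item in yield_hierarchy(files)]
```

Running `asyncio.run(collect([...]))` returns a flat list in which every item carries its own `uuid4` id, so a downstream consumer has to handle mixed types but can distinguish items by id.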

🏁 Scripts executed

The following scripts were executed for the analysis:

Script:

#!/bin/bash
# Description: Confirm that no DataPoint is processed multiple times, leading to duplicates.
# Test: Look for yield statements in get_source_code_chunks that might reintroduce duplicates.
rg -A 2 'yield data_point'
rg -A 2 'yield code_part'
rg -A 2 'yield source_code_chunk'

Length of output: 1254


Script:

#!/bin/bash
# Check the DataPoint class hierarchy and relationships
ast-grep --pattern 'class DataPoint'
ast-grep --pattern 'class CodeFile'
ast-grep --pattern 'class CodePart'
ast-grep --pattern 'class SourceCodeChunk'

# Look for any deduplication or unique identifier logic
rg -l 'def.*id' cognee/tasks/repo_processor/
rg -A 5 'class.*DataPoint'

Length of output: 18863


Script:

#!/bin/bash
# Check the implementation of get_source_code_chunks_from_code_part
rg -A 10 'def get_source_code_chunks_from_code_part'

# Check if there's any unique identifier or deduplication logic in DataPoint base class
ast-grep --pattern 'class DataPoint'
rg -A 10 'class DataPoint'

Length of output: 2196

cognee/tasks/repo_processor/get_repo_file_dependencies.py (2)

Line range hint 97-110: Confirm that depends_on references for CodeFile objects are accurate.
If some of these dependencies cannot be resolved (not found in py_files_dict), you might end up with partial or inconsistent references. That could break data relationships in subsequent tasks.


74-74: Yielding [repo] changes the downstream format.
Previously, a single DataPoint might have been yielded. Now a list containing the repo is yielded. Ensure that all downstream tasks that expect a single Repository object can handle a list instead.

cognee/api/v1/cognify/code_graph_pipeline.py (1)

10-12: Ensure the newly imported modules are fully utilized.
You have introduced imports for SourceCodeGraph and SummarizedContent. Check if the references to these objects in the pipeline are correct and valid, or if any are unused.

@lxobr lxobr marked this pull request as draft December 18, 2024 16:07
@lxobr lxobr force-pushed the COG-813-source-code-chunks branch from b5847c5 to 1c5ca84 Compare December 18, 2024 16:38

gitguardian bot commented Dec 20, 2024

⚠️ GitGuardian has uncovered 1 secret following the scan of your pull request.

Please consider investigating the findings and remediating the incidents. Failure to do so may lead to compromising the associated services or software components.

🔎 Detected hardcoded secret in your pull request
GitGuardian id | GitGuardian status | Secret | Commit | Filename
9573981 | Triggered | Generic Password | b524e94 | notebooks/hr_demo.ipynb
🛠 Guidelines to remediate hardcoded secrets
  1. Understand the implications of revoking this secret by investigating where it is used in your code.
  2. Replace and store your secret safely. Learn here the best practices.
  3. Revoke and rotate this secret.
  4. If possible, rewrite git history. Rewriting git history is not a trivial act. You might completely break other contributing developers' workflow and you risk accidentally deleting legitimate data.


@lxobr lxobr marked this pull request as ready for review December 20, 2024 14:19
Contributor

@coderabbitai coderabbitai bot left a comment

Actionable comments posted: 1

♻️ Duplicate comments (1)
cognee/tasks/repo_processor/get_source_code_chunks.py (1)

35-41: ⚠️ Potential issue

Potential risk of deep recursion in _get_subchunk_token_counts.

Due to nested structures or large modules, this function might recurse many levels deep. Although max_depth is set to 100, extremely pathological code might still cause performance or memory issues. Consider an iterative approach as recommended in past reviews for safer handling of large or heavily-nested code.
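One way to picture the iterative alternative is an explicit stack in place of recursive calls. The sketch below is only an illustration: `Node` is a hypothetical stand-in for a parsed syntax-tree node, and whitespace splitting replaces the real tokenizer.

```python
from dataclasses import dataclass, field


@dataclass
class Node:
    """Hypothetical stand-in for a parsed syntax-tree node."""
    code: str
    children: list["Node"] = field(default_factory=list)


def count_tokens(code: str) -> int:
    # Assumption: whitespace splitting stands in for a real tokenizer.
    return len(code.split())


def subchunk_token_counts(root: Node, max_tokens: int) -> list[tuple[str, int]]:
    """Collect (code, token_count) pairs without using recursion."""
    result = []
    stack = [root]
    while stack:
        node = stack.pop()
        n = count_tokens(node.code)
        if n <= max_tokens or not node.children:
            # Small enough, or a leaf that cannot be split further.
            result.append((node.code, n))
        else:
            # Push children reversed so they are popped in source order.
            stack.extend(reversed(node.children))
    return result
```

Because the pending nodes live in a plain list, nesting depth is bounded by available memory rather than by the interpreter's recursion limit.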

🧹 Nitpick comments (5)
cognee/tasks/repo_processor/get_source_code_chunks.py (3)

11-12: Consider caching or single-pass token counting instead of repeated encoding.

The function _count_tokens calls tokenizer.encode on each source code snippet. If used frequently in tight loops, encoding overhead may become a bottleneck. Consider caching or carefully scheduling these calls if performance becomes an issue.
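If repeated encoding of identical snippets does turn out to matter, memoizing the count is one low-effort option. A minimal sketch, assuming a whitespace split as a placeholder for `tokenizer.encode` (the name `count_tokens_cached` is hypothetical):

```python
from functools import lru_cache


@lru_cache(maxsize=4096)
def count_tokens_cached(source_code: str) -> int:
    # Assumption: whitespace splitting stands in for tokenizer.encode(...).
    return len(source_code.split())
```

The cache is keyed by the full source string, so it only helps when the same snippet is counted more than once, and `maxsize` should be chosen with the size of those strings in mind.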


83-108: Checking overlap ratio and chunk boundaries.

In _get_chunk_source_code, the logic for overlap-based trimming is correct, but the partial chunk boundary might lead to frequent context breaks if the overlap is large. Monitor whether this approach causes any confusion in subsequent processing or embedding steps.
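To make the trade-off concrete, here is a small sketch of greedy chunking with a token-budget overlap. It is not the PR's `_get_chunk_source_code`; the function name, the `overlap` ratio parameter, and the whitespace tokenizer are all assumptions.

```python
def chunk_with_overlap(
    parts: list[str], max_tokens: int, overlap: float = 0.25
) -> list[list[str]]:
    """Group parts into chunks of at most max_tokens, sharing a tail as overlap."""
    counts = [len(p.split()) for p in parts]  # stand-in tokenizer
    chunks = []
    start = 0
    while start < len(parts):
        total, end = 0, start
        # Always take at least one part so an oversized part cannot stall the loop.
        while end < len(parts) and (end == start or total + counts[end] <= max_tokens):
            total += counts[end]
            end += 1
        chunks.append(parts[start:end])
        if end >= len(parts):
            break
        # Step back from `end` while the shared tokens stay within overlap * total.
        next_start, shared = end, 0
        while next_start - 1 > start and shared + counts[next_start - 1] <= overlap * total:
            next_start -= 1
            shared += counts[next_start]
        start = next_start
    return chunks
```

With a large `overlap`, consecutive chunks repeat most of their content, which is the situation the comment above suggests monitoring.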


122-135: Efficient repeated chunk generation.

Where the loop repeatedly calls _get_chunk_source_code, watch out for performance overhead. In large code files, this could loop many times. If performance is acceptable for your use case, ignore. Otherwise, consider more direct or streamed chunk generation.

cognee/shared/CodeGraphEntities.py (1)

28-29: Optional source_code recommended for partial code parts?

Changing source_code to Optional helps avoid frequent null checks. However, ensure that the rest of the code generation pipeline can handle missing code gracefully.

cognee/tasks/repo_processor/get_repo_file_dependencies.py (1)

Line range hint 93-111: Consider optimizing dependency resolution.

The current implementation loads source code for all dependencies upfront, which might not be necessary if the dependent files are processed later in the pipeline.

Consider lazy loading of dependency source code:

                 depends_on=[
                     CodeFile(
                         id=uuid5(NAMESPACE_OID, dependency),
                         extracted_id=dependency,
                         part_of=repo,
-                        source_code=py_files_dict.get(dependency, {}).get("source_code"),
+                        source_code=None,  # Lazy load when needed
                     ) for dependency in dependencies
                 ] if dependencies else None,
📜 Review details

Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between b5847c5 and d2911c1.

📒 Files selected for processing (7)
  • cognee/api/v1/cognify/code_graph_pipeline.py (4 hunks)
  • cognee/shared/CodeGraphEntities.py (2 hunks)
  • cognee/shared/data_models.py (0 hunks)
  • cognee/tasks/repo_processor/get_repo_file_dependencies.py (3 hunks)
  • cognee/tasks/repo_processor/get_source_code_chunks.py (1 hunks)
  • cognee/tasks/summarization/models.py (2 hunks)
  • cognee/tasks/summarization/summarize_code.py (3 hunks)
💤 Files with no reviewable changes (1)
  • cognee/shared/data_models.py
🚧 Files skipped from review as they are similar to previous changes (1)
  • cognee/api/v1/cognify/code_graph_pipeline.py
🧰 Additional context used
📓 Learnings (1)
cognee/tasks/repo_processor/get_source_code_chunks.py (1)
Learnt from: alekszievr
PR: topoteretes/cognee#383
File: cognee/tasks/repo_processor/get_source_code_chunks.py:15-44
Timestamp: 2024-12-19T14:01:34.118Z
Learning: When large or deeply nested code leads to recursion errors in `_get_subchunk_token_counts`, an iterative approach (using a stack) or a maximum recursion depth is required to handle pathological inputs gracefully.
🔇 Additional comments (15)
cognee/tasks/repo_processor/get_source_code_chunks.py (4)

56-59: Handle potential single-child node carefully.

When a module has only one real child, you reassign module = module.children[0]. If that child is also minimal or invalid, we risk reassigning to a node that’s not guaranteed to be parseable. Ensure we don’t inadvertently skip children or cause infinite loops if child is also near-empty or missing.


72-74: Special handling for string nodes is correct but be mindful of edge cases.

By calling _get_naive_subchunk_token_counts for string nodes, you avoid further parsing. This is logical for large string literals. However, watch out for multi-line strings or docstrings that may contain code-like content. Users might store code blocks in docstrings.


137-149: Asynchronous generator pipeline clarity.

In get_source_code_chunks, you yield the original data_point, then code_part, then each chunk. This is logical but ensure the order of yields is correct for downstream consumers. Also, confirm that none of them assume a single type per iteration.


15-33: Warn about naive splitting approach for large source code.

The _get_naive_subchunk_token_counts function can generate many subchunks for very large files, potentially leading to high memory usage. Additionally, if logic upstream feeds extremely large source code strings, consider streaming or iterative chunking to avoid ballooning memory usage.

cognee/tasks/summarization/models.py (2)

1-1: Import seems fine. No issues.

The addition of Union from typing is consistent with usage below.


22-22: Flexible summarization with union of types.

Allowing summarizes to handle CodeFile, CodePart, or SourceCodeChunk is a good approach. Ensure that downstream logic checks the actual type to handle each variant correctly.
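A type check of the kind suggested here could be sketched as below. The three dataclasses are simplified stand-ins that borrow the names from this PR; their fields and the `describe` function are assumptions for illustration.

```python
from dataclasses import dataclass
from typing import Union


@dataclass
class CodeFile:  # simplified stand-in
    extracted_id: str
    source_code: str


@dataclass
class CodePart:  # simplified stand-in
    source_code: str


@dataclass
class SourceCodeChunk:  # simplified stand-in
    source_code: str


Summarizable = Union[CodeFile, CodePart, SourceCodeChunk]


def describe(node: Summarizable) -> str:
    # Branch on the concrete type so each variant is handled explicitly.
    if isinstance(node, CodeFile):
        return f"file {node.extracted_id}"
    if isinstance(node, CodePart):
        return "code part"
    if isinstance(node, SourceCodeChunk):
        return "source code chunk"
    raise TypeError(f"Unsupported node type: {type(node).__name__}")
```

The trailing `TypeError` makes an unexpected type fail loudly instead of being summarized incorrectly.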

cognee/tasks/summarization/summarize_code.py (2)

24-24: Renaming extracted_id to id ensures consistent referencing.

This dictionary key change is straightforward and beneficial if the new code references node.id.


38-39: Renamed attribute for clarity.

Swapping made_from=node for summarizes=node aligns with the new attribute in CodeSummary. This fosters better readability and consistency.

cognee/shared/CodeGraphEntities.py (5)

9-9: Index fields are empty.

Setting index_fields to an empty list on Repository means less overhead but also fewer search capabilities. Verify that it’s deliberate if you need to search on path or other fields.


22-22: Index fields removed from CodeFile.

No immediate issues, but if the removal of index_fields was not intentional, it may affect searching or referencing CodeFile by those fields.


31-31: Index fields removed from CodePart.

Similar to CodeFile, confirm if removal is intended and that searching by code content is not needed.


35-43: Introduction of SourceCodeChunk class.

Bringing in code_chunk_of, source_code, and previous_chunk is crucial for chunk chaining. Good job. But be mindful that previous_chunk references can form cyclical references if used incorrectly.
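The cyclical-reference concern can be guarded against when walking the chain. A minimal sketch, assuming a simplified `Chunk` class with only the `source_code` and `previous_chunk` fields (the `chain_back` helper is hypothetical):

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(eq=False)
class Chunk:  # simplified stand-in for SourceCodeChunk
    source_code: str
    previous_chunk: Optional["Chunk"] = None


def chain_back(chunk: Chunk) -> list[Chunk]:
    """Follow previous_chunk links to the start, failing fast on a cycle."""
    seen: set[int] = set()
    chain = []
    current: Optional[Chunk] = chunk
    while current is not None:
        if id(current) in seen:
            raise ValueError("Cycle detected in previous_chunk chain")
        seen.add(id(current))
        chain.append(current)
        current = current.previous_chunk
    return chain
```

Tracking visited objects by `id` keeps the check independent of how the chunks compare or hash.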


48-48: Model rebuild call.

Ensuring that SourceCodeChunk.model_rebuild() is invoked helps keep Pydantic model definitions updated. Confirm that these model rebuild calls are necessary in your environment.

cognee/tasks/repo_processor/get_repo_file_dependencies.py (2)

Line range hint 93-111: Memory optimization needed for large repositories.

Accumulating all CodeFile instances in memory before yielding could cause issues with large repositories. Consider yielding in batches instead.

Here's a suggested implementation using batching:

-        code_files = []
+        batch_size = 100  # Adjust based on typical file sizes
+        current_batch = []
         for (file_path, metadata), dependencies in zip(py_files_dict.items(), results):
             source_code = metadata.get("source_code")
-            code_files.append(CodeFile(
+            current_batch.append(CodeFile(
                 id=uuid5(NAMESPACE_OID, file_path),
                 source_code=source_code,
                 extracted_id=file_path,
                 part_of=repo,
                 depends_on=[
                     CodeFile(
                         id=uuid5(NAMESPACE_OID, dependency),
                         extracted_id=dependency,
                         part_of=repo,
                         source_code=py_files_dict.get(dependency, {}).get("source_code"),
                     ) for dependency in dependencies
                 ] if dependencies else None,
             ))
-
-        yield code_files
+            if len(current_batch) >= batch_size:
+                yield current_batch
+                current_batch = []
+        if current_batch:  # Yield any remaining files
+            yield current_batch
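The same batching idea can be factored into a small reusable generator. This is a generic sketch rather than part of the suggested diff; the name `in_batches` is an assumption.

```python
from typing import Iterable, Iterator, TypeVar

T = TypeVar("T")


def in_batches(items: Iterable[T], batch_size: int) -> Iterator[list[T]]:
    """Yield lists of up to batch_size items, including a final partial batch."""
    if batch_size < 1:
        raise ValueError("batch_size must be at least 1")
    batch: list[T] = []
    for item in items:
        batch.append(item)
        if len(batch) >= batch_size:
            yield batch
            batch = []
    if batch:  # yield any remaining items
        yield batch
```

Because it accepts any iterable, the input can itself be a generator, so no more than one batch needs to be held in memory at a time.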

74-74: Verify consumers of this generator function.

The change to yield [repo] instead of just repo standardizes the output format but might impact existing consumers of this generator.

✅ Verification successful

Let me analyze the findings and generate additional verification scripts to understand the impact better.

This will help us understand how the tasks process the yielded data and verify if they can handle the list format.


Let me analyze the Task implementation one more time to be completely certain.


Based on the gathered evidence, I can now provide a final response:

The change to yield [repo] is safe and aligns with the pipeline's design

The verification shows that:

  • All downstream tasks (enrich_dependency_graph, expand_dependency_graph) expect and process a list of data points
  • The pipeline tasks are designed to handle lists as shown by their function signatures: async def enrich_dependency_graph(data_points: list[DataPoint])
  • The change standardizes the output format, making both yields return lists consistently
🏁 Scripts executed

The following scripts were executed for the analysis:

Script:

#!/bin/bash
# Search for direct usage of get_repo_file_dependencies
rg -l "get_repo_file_dependencies" | while read -r file; do
  echo "=== $file ==="
  rg -A 5 "get_repo_file_dependencies" "$file"
done

Length of output: 4807


Script:

#!/bin/bash
# Check the implementation of tasks that consume the generator output
echo "=== Pipeline Task Implementation ==="
ast-grep --pattern 'class Task {
  $$$
  async def process($$$) {
    $$$
  }
  $$$
}'

echo "=== Checking enrich_dependency_graph implementation ==="
rg -A 10 "async def enrich_dependency_graph" 

echo "=== Checking expand_dependency_graph implementation ==="
rg -A 10 "def expand_dependency_graph"

Length of output: 2457


Script:

#!/bin/bash
# Check the Task class implementation
echo "=== Task Class Implementation ==="
rg -A 15 "class Task"

# Check the pipeline execution
echo "=== Pipeline Execution ==="
rg -A 10 "async def execute" 

Length of output: 4686

Contributor

@coderabbitai coderabbitai bot left a comment

Actionable comments posted: 0

🧹 Nitpick comments (5)
examples/python/code_graph_example.py (1)

14-14: Make the boolean parsing more robust
This lambda-based approach only checks for "true" or "1". Consider also supporting "false" or "0" for a symmetric user experience — or at least documenting that any other string value will default to False.

-parser.add_argument("--include_docs", type=lambda x: x.lower() in ("true", "1"), default=True, help="Whether or not to process non-code files")
+def str_to_bool(value: str) -> bool:
+    return value.lower() in ("true", "1")

+parser.add_argument(
+    "--include_docs",
+    type=str_to_bool,
+    default=True,
+    help="Whether or not to process non-code files"
+)
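A symmetric variant, as hinted at in the comment, could accept explicit false values and reject anything else. The sketch below goes beyond the suggested diff: the accepted spellings ("yes"/"no") and the `build_parser` wrapper are assumptions.

```python
import argparse


def str_to_bool(value: str) -> bool:
    normalized = value.strip().lower()
    if normalized in ("true", "1", "yes"):
        return True
    if normalized in ("false", "0", "no"):
        return False
    # argparse turns this into a usage error instead of silently defaulting.
    raise argparse.ArgumentTypeError(f"Expected a boolean value, got {value!r}")


def build_parser() -> argparse.ArgumentParser:
    parser = argparse.ArgumentParser()
    parser.add_argument(
        "--include_docs",
        type=str_to_bool,
        default=True,
        help="Whether or not to process non-code files",
    )
    return parser
```

With this version, `--include_docs maybe` exits with an error message, whereas the lambda-based parsing would quietly treat it as False.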
cognee/tasks/storage/index_data_points.py (1)

36-39: Consider adding more robust error handling
Simply printing the error notice might not be sufficient in production. Consider logging it at an appropriate level and possibly re-raising or implementing retry logic for recoverable errors.

 except (OpenAIError, BadRequestError) as e:
-    print(f"Failed to index data points for {index_name}.{field_name}: {e}")
+    logger.error(f"Failed to index data points for {index_name}.{field_name}: {e}")
+    # Potentially add retry logic or re-raise
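The "retry or re-raise" idea might be sketched as follows. `TransientIndexError` and `index_with_retry` are hypothetical names, and which errors are actually recoverable in this codebase is not established in this thread.

```python
import logging
import time

logger = logging.getLogger(__name__)


class TransientIndexError(Exception):
    """Hypothetical error type standing in for a recoverable indexing failure."""


def index_with_retry(index_fn, retries: int = 3, delay: float = 0.0):
    """Call index_fn, logging each failure and re-raising after the last attempt."""
    for attempt in range(1, retries + 1):
        try:
            return index_fn()
        except TransientIndexError as error:
            logger.error("Indexing attempt %d/%d failed: %s", attempt, retries, error)
            if attempt == retries:
                raise
            time.sleep(delay)
```

Re-raising after the final attempt keeps the failure visible to the caller instead of hiding it behind a log line.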
cognee/api/v1/cognify/code_graph_pipeline.py (3)

61-61: Reevaluate batch processing removal
Previously, enrich_dependency_graph had a batch size. Removing it can lead to memory overhead if the dataset is large. Evaluate whether you need a batch size to prevent performance bottlenecks.


63-63: Handling embedding model
Passing the embedding model is a nice improvement. If the model can change, consider injecting it dynamically to avoid coupling tasks to a single model.


70-70: Adjust tasks for doc exclusion
Switching to get_non_py_files is valid if your doc files are outside Python. Ensure that you still handle other file types that need processing (like Markdown, text, etc.).

📜 Review details

Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between d2911c1 and 762df11.

📒 Files selected for processing (3)
  • cognee/api/v1/cognify/code_graph_pipeline.py (5 hunks)
  • cognee/tasks/storage/index_data_points.py (2 hunks)
  • examples/python/code_graph_example.py (1 hunks)
🔇 Additional comments (6)
cognee/tasks/storage/index_data_points.py (1)

1-2: Imports are consistent with usage
Using these exceptions from the litellm library is appropriate if the engine can raise them. Make sure they align with the actual exceptions thrown in your environment.

✅ Verification successful

Let me search for the usage of these exceptions in the codebase to verify their alignment with the actual exceptions being thrown.


Imports are properly used and aligned with error handling

The verification shows that both BadRequestError and OpenAIError are correctly used in the error handling of the index_data_points function. These exceptions are caught specifically when calling vector_engine.index_data_points, which is appropriate as the codebase uses LiteLLM for embeddings (as seen in LiteLLMEmbeddingEngine.py). The error handling pattern is consistent with other LiteLLM exception handling in the codebase.

🏁 Scripts executed

The following scripts were executed for the analysis:


Script:

#!/bin/bash
# Search for usage of BadRequestError and OpenAIError
rg "BadRequestError|OpenAIError" -A 3

Length of output: 800


Script:

#!/bin/bash
# Look for exception handling patterns in the index_data_points.py file
cat cognee/tasks/storage/index_data_points.py

Length of output: 4031


Script:

#!/bin/bash
# Search for any litellm related error handling
rg "except.*litellm" -A 3

Length of output: 1004

cognee/api/v1/cognify/code_graph_pipeline.py (5)

82-82: Conditional docs processing looks good
Skipping the document-related tasks when include_docs is False helps performance. Confirm that users expect document processing to be skipped entirely in that case.


20-20: Check new naming convention
Renaming “get_non_code_files” to “get_non_py_files” makes sense if you only want to exclude Python files. Confirm that this does not omit other file types that might be critical.

✅ Verification successful

Let me gather information about the file naming conventions and related functions in the codebase.


Let me check the implementation of the function to understand what files it actually processes.


Function name accurately reflects its purpose

The function get_non_py_files is correctly named as it specifically filters out Python files (.py) while also excluding common build artifacts and system files through IGNORED_PATTERNS (.git, __pycache__, .pyc, .pyo, .pyd, node_modules, .egg-info). The name change from get_non_code_files to get_non_py_files is more precise and better reflects its actual implementation.

🏁 Scripts executed

The following scripts were executed for the analysis:


Script:

#!/bin/bash
# Search for the function definition and its usage
ast-grep --pattern 'def get_non_py_files'

# Search for any references to this function
rg "get_non_py_files"

# Search for the previous function name to see the change
rg "get_non_code_files"

# Look for file extension patterns or filtering logic
rg -A 5 "\..*['\"].*exclude" 

Length of output: 654


Script:

#!/bin/bash
# Get the implementation of get_non_py_files
cat cognee/tasks/repo_processor/get_non_code_files.py

Length of output: 1884


6-7: Validate that the embedding engine is appropriate
Ensure the selected engine works well for your tokenization approach. Larger embedding models might slow down indexing or exceed token limits.

✅ Verification successful

Let me gather information about the embedding engine implementation and its usage.


Let me try different commands to gather information about the embedding engine.


Based on the gathered information, I can now provide a final response:

Embedding configuration is well-defined with appropriate defaults
The system uses LiteLLM with text-embedding-3-large (3072 dimensions) by default, which is a good choice for code understanding. The implementation includes proper configuration options for customization through environment variables, including model selection, dimensions, and API settings. The embedding engine is consistently used across vector stores and graph operations.

Key configurations found:

  • Default model: text-embedding-3-large
  • Embedding dimensions: 3072
  • Configurable via environment variables: endpoint, API key, and API version
  • Used by both vector databases and graph operations
🏁 Scripts executed

The following scripts were executed for the analysis:


Script:

#!/bin/bash
# Check the embedding engine implementation
ast-grep --pattern 'def get_embedding_engine'

# Look for configuration or settings related to the embedding engine
rg -l "embedding" --type python

# Check for any token limit or model size configurations
rg -i "token.*limit|model.*size" --type python

Length of output: 190


Script:

#!/bin/bash
# Find Python files containing embedding-related code
fd -e py | xargs rg "embedding" -l

# Look for the embedding engine implementation
fd -e py | xargs rg "get_embedding_engine" -A 5

# Search for configuration files that might contain embedding settings
fd "config|settings" -e py -e yaml -e json | xargs rg -i "embedding|token|model"

Length of output: 10695


54-55: Ensure embedding engine meets chunk constraints
This code retrieves the default embedding engine, but confirm it respects any size or token constraints used by get_source_code_chunks.

✅ Verification successful

Let me gather information about the embedding engine and chunk constraints by examining the codebase.


Let me search for more specific information about the embedding engine and its relationship with token constraints.


Based on the gathered information, I can now generate the final response:

Embedding engine and chunk constraints are properly aligned
The embedding engine (text-embedding-3-large) and source code chunking are well-coordinated. The get_source_code_chunks function explicitly handles token limits (max 8192 tokens) and uses the same model's tokenizer (tiktoken.encoding_for_model("text-embedding-3-large")), ensuring chunks are compatible with the embedding engine's constraints.

🏁 Scripts executed

The following scripts were executed for the analysis:


Script:

#!/bin/bash
# Find the implementation of get_embedding_engine and get_source_code_chunks
ast-grep --pattern 'def get_embedding_engine'
ast-grep --pattern 'def get_source_code_chunks'

# Search for any configuration or constants related to chunk constraints
rg -i "chunk.*size|token.*limit|embedding.*limit" -A 3

Length of output: 26793


Script:

#!/bin/bash
# Find the implementation of get_embedding_engine and related token constraints
ast-grep --pattern 'def get_embedding_engine'
rg -A 5 "get_embedding_engine"

# Look for token limit configurations in the embedding engine
rg -i "max.*tokens|token.*limit" -A 3

Length of output: 11300


22-23: Confirm get_source_code_chunks usage
Ensure that the chunking logic in “get_source_code_chunks” matches downstream tasks’ expectations. Changes in chunk size or metadata might impact summarization steps later.

✅ Verification successful

Let me gather information about the usage of get_source_code_chunks and its relationship with downstream tasks.


Let me search for the summarization task implementation to understand how it uses the chunks.


Let me check the summarization implementation to understand how it processes the chunks.


Based on the gathered information, I can now provide a final response:

Chunking configuration is safely handled by default parameters
The get_source_code_chunks function uses well-defined default parameters (max_tokens=8192, overlap=0.25, granularity=0.1) that are compatible with the summarization pipeline. The summarization task (summarize_code) processes these chunks without any specific size requirements, as it extracts summaries from the source code content directly. The chunking logic maintains context through proper overlap handling and the chunks are properly linked through the previous_chunk reference.

🏁 Scripts executed

The following scripts were executed for the analysis:


Script:

#!/bin/bash
# Find the implementation of get_source_code_chunks
ast-grep --pattern 'def get_source_code_chunks($$$)'

# Search for imports and usage of get_source_code_chunks
rg "get_source_code_chunks" -A 3

Length of output: 2039


Script:

#!/bin/bash
# Find the implementation of summarize_code
ast-grep --pattern 'def summarize_code($$$)'

# Look for SourceCodeChunk usage in summarization
rg "SourceCodeChunk" -A 3

Length of output: 2654


Script:

#!/bin/bash
# Find the implementation of summarize_code and its usage of chunks
rg -A 10 "def summarize_code"

# Check if there are any specific chunk size or metadata requirements in summarization
rg -A 5 "max_tokens|granularity|overlap" 

Length of output: 24085

Contributor

@coderabbitai coderabbitai bot left a comment

Actionable comments posted: 1

♻️ Duplicate comments (1)
cognee/tasks/repo_processor/get_source_code_chunks.py (1)

38-84: 🛠️ Refactor suggestion

Consider replacing recursion with iteration for better stability.

While the max_depth parameter helps, an iterative approach using a stack would be more robust for handling large or deeply nested code files.
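
As an illustration of the stack-based approach, the sketch below splits a string into pieces that fit a token budget without any recursion. It is not the PR's `_get_subchunk_token_counts`: the function name `split_iteratively` is hypothetical, the splitting is by character midpoint rather than by code structure, and the token counter is passed in so that a whitespace counter can stand in for a tiktoken encoder.

```python
from typing import Callable


def split_iteratively(
    text: str, max_tokens: int, count_tokens: Callable[[str], int]
) -> list[tuple[str, int]]:
    """Split text into (piece, token_count) pairs of at most max_tokens each.

    An explicit stack replaces recursion, so nesting depth cannot exhaust
    the interpreter's call stack.
    """
    if max_tokens <= 0:
        raise ValueError("max_tokens must be positive")
    pieces: list[tuple[str, int]] = []
    stack = [text]
    while stack:
        current = stack.pop()
        if not current:
            continue
        token_count = count_tokens(current)
        # A single character cannot be split further, so accept it as-is.
        if token_count <= max_tokens or len(current) == 1:
            pieces.append((current, token_count))
            continue
        middle = len(current) // 2
        # Push the right half first so the left half is processed first.
        stack.append(current[middle:])
        stack.append(current[:middle])
    return pieces
```

Because each split strictly shortens the string, the loop always terminates, and concatenating the pieces in order reproduces the input.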

🧹 Nitpick comments (4)
cognee/tasks/repo_processor/get_source_code_chunks.py (4)

14-16: Add input validation for source_code parameter.

The function should handle edge cases where source_code might be None or empty.

 def _count_tokens(tokenizer: tiktoken.Encoding, source_code: str) -> int:
+    if not source_code:
+        return 0
     return len(tokenizer.encode(source_code))

18-36: Optimize string concatenation in token decoding loop.

Consider using list comprehension and join for better performance when concatenating decoded tokens.

-        subchunk = ''.join(
-            tokenizer.decode_single_token_bytes(token_id).decode('utf-8', errors='replace')
-            for token_id in subchunk_token_ids
-        )
+        subchunk = ''.join([
+            tokenizer.decode_single_token_bytes(token_id).decode('utf-8', errors='replace')
+            for token_id in subchunk_token_ids
+        ])

86-111: Document overlap behavior and its implications.

The overlap calculation could result in 100% of lines (except beginning and end) being part of two chunks when overlap is set to 0.5. This should be documented for clarity.

-    """Generates a chunk of source code from tokenized subchunks with overlap handling."""
+    """Generates a chunk of source code from tokenized subchunks with overlap handling.
+    
+    Note: With an overlap of 0.5, most lines will appear in two chunks, as the actual
+    overlap ratio is overlap/(1-overlap). This is intentional to ensure context
+    continuity between chunks.
+    """
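
The ratio quoted in the suggested docstring can be checked with a few lines of arithmetic. This sketch only evaluates the formula overlap/(1-overlap) as stated above; it does not inspect the chunking code itself, and the function name is hypothetical.

```python
def effective_overlap_ratio(overlap: float) -> float:
    """Evaluate overlap / (1 - overlap), the ratio quoted in the docstring."""
    if not 0 <= overlap < 1:
        raise ValueError("overlap must be in the range [0, 1)")
    return overlap / (1 - overlap)


# With the default overlap of 0.25, the ratio is one third.
default_ratio = effective_overlap_ratio(0.25)
# With an overlap of 0.5, the ratio reaches 1.0, i.e. full duplication.
half_ratio = effective_overlap_ratio(0.5)
```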

113-142: Add parameter validation for overlap and granularity.

These parameters should be validated to ensure they are within reasonable bounds.

 def get_source_code_chunks_from_code_part(
         code_file_part: CodePart,
         max_tokens: int = 8192,
         overlap: float = 0.25,
         granularity: float = 0.1,
         model_name: str = "text-embedding-3-large"
 ) -> Generator[SourceCodeChunk, None, None]:
+    if not 0 <= overlap < 1:
+        raise ValueError("Overlap must be in the range [0, 1)")
+    if not 0 < granularity <= 1:
+        raise ValueError("Granularity must be in the range (0, 1]")
📜 Review details

Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 762df11 and 35071b5.

📒 Files selected for processing (1)
  • cognee/tasks/repo_processor/get_source_code_chunks.py (1 hunks)
🧰 Additional context used
📓 Learnings (1)
cognee/tasks/repo_processor/get_source_code_chunks.py (1)
Learnt from: alekszievr
PR: topoteretes/cognee#383
File: cognee/tasks/repo_processor/get_source_code_chunks.py:15-44
Timestamp: 2024-12-19T14:01:34.118Z
Learning: When large or deeply nested code leads to recursion errors in `_get_subchunk_token_counts`, an iterative approach (using a stack) or a maximum recursion depth is required to handle pathological inputs gracefully.
🔇 Additional comments (1)
cognee/tasks/repo_processor/get_source_code_chunks.py (1)

1-11: LGTM! Well-organized imports and proper logger setup.

The imports are cleanly organized and the task-specific logger follows best practices.

cognee/tasks/repo_processor/get_source_code_chunks.py (outdated, resolved)
Co-authored-by: coderabbitai[bot] <136622811+coderabbitai[bot]@users.noreply.github.com>
Contributor

@coderabbitai coderabbitai bot left a comment

Actionable comments posted: 1

♻️ Duplicate comments (1)
cognee/tasks/repo_processor/get_source_code_chunks.py (1)

38-44: ⚠️ Potential issue

Replace recursive implementation with iterative approach.

Based on past learnings, this recursive implementation can cause stack overflow with deeply nested code.

The previous suggestion to use an iterative approach with a stack should be implemented to prevent recursion errors. See the implementation provided in the past review comments.

🧹 Nitpick comments (4)
cognee/tasks/repo_processor/get_source_code_chunks.py (4)

14-16: Add input validation and docstring.

The function is concise but could benefit from input validation and documentation.

 def _count_tokens(tokenizer: tiktoken.Encoding, source_code: str) -> int:
+    """Count the number of tokens in the source code using the provided tokenizer.
+    
+    Args:
+        tokenizer: The tiktoken Encoding instance to use for tokenization
+        source_code: The source code string to tokenize
+    
+    Returns:
+        int: The number of tokens in the source code
+    """
+    if not source_code:
+        return 0
     return len(tokenizer.encode(source_code))

18-20: Document the magic number and add input validation.

The default value of 8000 for max_subchunk_tokens should be documented and validated.

+# Maximum number of tokens per subchunk, chosen based on typical model context window sizes
+MAX_DEFAULT_SUBCHUNK_TOKENS = 8000
+
 def _get_naive_subchunk_token_counts(
-        tokenizer: tiktoken.Encoding, source_code: str, max_subchunk_tokens: int = 8000
+        tokenizer: tiktoken.Encoding, source_code: str, max_subchunk_tokens: int = MAX_DEFAULT_SUBCHUNK_TOKENS
 ) -> list[tuple[str, int]]:
     """Splits source code into subchunks of up to max_subchunk_tokens and counts tokens."""
+    if max_subchunk_tokens <= 0:
+        raise ValueError("max_subchunk_tokens must be positive")
+    if not source_code:
+        return []

113-119: Enhance logging and document parameters.

The function would benefit from more detailed logging and parameter documentation.

 def get_source_code_chunks_from_code_part(
         code_file_part: CodePart,
         max_tokens: int = 8192,
         overlap: float = 0.25,
         granularity: float = 0.1,
         model_name: str = "text-embedding-3-large"
 ) -> Generator[SourceCodeChunk, None, None]:
-    """Yields source code chunks from a CodePart object, with configurable token limits and overlap."""
+    """Yields source code chunks from a CodePart object, with configurable token limits and overlap.
+    
+    Args:
+        code_file_part: CodePart object containing source code
+        max_tokens: Maximum tokens per chunk (default: 8192)
+        overlap: Overlap ratio between chunks (default: 0.25)
+        granularity: Ratio of max_tokens to use for subchunks (default: 0.1)
+        model_name: Name of the embedding model to use for tokenization
+    
+    Yields:
+        SourceCodeChunk objects containing chunks of the source code
+    """
     if not code_file_part.source_code:
-        logger.error(f"No source code in CodeFile {code_file_part.id}")
+        logger.error(f"No source code in CodeFile {code_file_part.id} of type {type(code_file_part)}")
         return

147-147: Track TODO for embedding model support.

The TODO comment about adding support for other embedding models should be tracked.

Would you like me to create a GitHub issue to track the implementation of a max_token mapping for different embedding models?

📜 Review details

Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 35071b5 and 68a9d27.

📒 Files selected for processing (1)
  • cognee/tasks/repo_processor/get_source_code_chunks.py (1 hunks)
🧰 Additional context used
📓 Learnings (1)
cognee/tasks/repo_processor/get_source_code_chunks.py (1)
Learnt from: alekszievr
PR: topoteretes/cognee#383
File: cognee/tasks/repo_processor/get_source_code_chunks.py:15-44
Timestamp: 2024-12-19T14:01:34.118Z
Learning: When large or deeply nested code leads to recursion errors in `_get_subchunk_token_counts`, an iterative approach (using a stack) or a maximum recursion depth is required to handle pathological inputs gracefully.
🔇 Additional comments (1)
cognee/tasks/repo_processor/get_source_code_chunks.py (1)

1-11: LGTM! Well-structured imports and logger setup.

The imports are well-organized and the logger setup follows best practices with a task-specific namespace.

Contributor

@coderabbitai coderabbitai bot left a comment

Actionable comments posted: 0

🧹 Nitpick comments (4)
cognee/infrastructure/databases/vector/embeddings/LiteLLMEmbeddingEngine.py (4)

13-13: No issues with this blank line.
It might be just a formatting or readability adjustment.


23-28: Constructor enhancements are clear and consistent.
The parameter list is more readable. Consider adding type hints (e.g., api_key: Optional[str] = None) for the string parameters to fully align with best practices.

 def __init__(
     self,
-    model: Optional[str] = "text-embedding-3-large",
-    dimensions: Optional[int] = 3072,
-    api_key: str = None,
-    endpoint: str = None,
-    api_version: str = None,
+    model: Optional[str] = "text-embedding-3-large",
+    dimensions: Optional[int] = 3072,
+    api_key: Optional[str] = None,
+    endpoint: Optional[str] = None,
+    api_version: Optional[str] = None,
 ):

48-48: Superfluous blank line.
If there's no specific purpose for this empty line, consider removing it for consistency.


102-102: Ensure the logger captures the original error context as well.
Moving the logger statement before raising the exception might help preserve the original traceback. Consider enhancing the logs for better debugging.
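
One way to read this suggestion is: log first, then raise the custom exception chained to the original one. The sketch below assumes that reading. The `EmbeddingException` class here is a local stand-in for the one added in this PR, and `embed_texts` is a hypothetical function that fakes the embedding call.

```python
import logging

logger = logging.getLogger("embedding_example")


class EmbeddingException(Exception):
    """Local stand-in for the custom embedding exception."""


def embed_texts(texts: list[str]) -> list[list[float]]:
    """Return dummy vectors, converting low-level errors into EmbeddingException."""
    try:
        if not texts:
            raise ValueError("empty input")
        return [[0.0] for _ in texts]
    except ValueError as error:
        # Log before raising so the message is recorded even if a caller
        # swallows the exception further up the stack.
        logger.error("Error embedding text: %s", error)
        # "from error" keeps the original traceback as __cause__.
        raise EmbeddingException(f"Failed to embed text: {error}") from error
```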

📜 Review details

Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 68a9d27 and cf63dbc.

📒 Files selected for processing (3)
  • cognee/infrastructure/databases/exceptions/embedding_exception.py (1 hunks)
  • cognee/infrastructure/databases/vector/embeddings/LiteLLMEmbeddingEngine.py (6 hunks)
  • cognee/tasks/storage/index_data_points.py (2 hunks)
🚧 Files skipped from review as they are similar to previous changes (1)
  • cognee/tasks/storage/index_data_points.py
🔇 Additional comments (7)
cognee/infrastructure/databases/exceptions/embedding_exception.py (1)

1-3: Class definition looks good.
This custom exception is straightforward, well-documented, and appropriately named for embedding-specific errors.

cognee/infrastructure/databases/vector/embeddings/LiteLLMEmbeddingEngine.py (6)

8-8: Great job introducing the custom exception.
This import centralizes embedding error handling and improves readability and maintainability.


20-20: Explicitly declaring the mock attribute is a good practice.
Declaring the mock attribute here promotes clarity as to the data members of the class.


38-38: Smart check to ensure environment variable correctness.
Lowercasing and verifying string-based booleans is a reliable approach.


61-64: Parameter alignment looks good.
This improves the readability of the call to the async embedding API.


100-101: Raising EmbeddingException clarifies embedding errors.
Catching related library-specific exceptions and raising a custom exception helps maintain consistent error handling across the application.


76-76: Splitting the text array is correct, but watch out for large input edge cases.
Confirm the maximum text length that can be handled here, and consider splitting into more than two parts when the input is extremely large.
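
A batch can be halved repeatedly until every part fits, rather than only once. The sketch below shows that idea under stated assumptions: `InputTooLargeError` and `embed_in_batches` are hypothetical names, and `embed_batch` stands in for whatever call the engine makes to the embedding API.

```python
from typing import Callable


class InputTooLargeError(Exception):
    """Stand-in for a context-window or payload-size error (hypothetical)."""


def embed_in_batches(
    texts: list[str], embed_batch: Callable[[list[str]], list]
) -> list:
    """Embed texts in order, halving any batch that the backend rejects as too large."""
    results: list = []
    pending = [list(texts)]
    while pending:
        batch = pending.pop()
        if not batch:
            continue
        try:
            vectors = embed_batch(batch)
        except InputTooLargeError:
            if len(batch) == 1:
                # A single oversized text cannot be fixed by splitting the batch.
                raise
            middle = len(batch) // 2
            # Push the right half first so results stay in input order.
            pending.append(batch[middle:])
            pending.append(batch[:middle])
            continue
        results.extend(vectors)
    return results
```

A single text that is still too large is surfaced to the caller, where it could be truncated or chunked by tokens instead.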

@lxobr lxobr merged commit 262deee into dev Dec 26, 2024
22 of 24 checks passed
@lxobr lxobr deleted the COG-813-source-code-chunks branch December 26, 2024 12:53
borisarzentar added a commit that referenced this pull request Jan 10, 2025
* feat: Add error handling in case user is already part of database and permission already given to group

Added error handling in case permission is already given to group and user is already part of group

Feature COG-656

* feat: Add user verification for accessing data

Verify user has access to data before returning it

Feature COG-656

* feat: Add compute search to cognee

Add compute search to cognee which makes searches human readable

Feature COG-656

* feat: Add simple instruction for system prompt

Add simple instruction for system prompt

Feature COG-656

* pass pydantic model tocognify

* feat: Add unauth access error to getting data

Raise unauth access error when trying to read data without access

Feature COG-656

* refactor: Rename query compute to query completion

Rename searching type from compute to completion

Refactor COG-656

* chore: Update typo in code

Update typo in string in code

Chore COG-656

* Add mcp to cognee

* Add simple README

* Update cognee-mcp/mcpcognee/__main__.py

Co-authored-by: coderabbitai[bot] <136622811+coderabbitai[bot]@users.noreply.github.com>

* Create dockerhub.yml

* Update get_cognify_router.py

* fix: Resolve reflection issue when running cognee a second time after pruning data

When running cognee a second time after pruning data some metadata doesn't get pruned.
This makes cognee believe some tables exist that have been deleted

Fix

* fix: Add metadata reflection fix to sqlite as well

Added fix when reflecting metadata to sqlite as well

Fix

* update

* Revert "fix: Add metadata reflection fix to sqlite as well"

This reverts commit 394a0b2.

* COG-810 Implement a top-down dependency graph builder tool (#268)

* feat: parse repo to call graph

* Update/repo_processor/top_down_repo_parse.py task

* fix: minor improvements

* feat: file parsing jedi script optimisation

---------

* Add type to DataPoint metadata (#364)

* Add type to DataPoint metadata

* Add missing index_fields

* Use DataPoint UUID type in pgvector create_data_points

* Make _metadata mandatory everywhere

* Fixes

* Fixes to our demo

* feat: Add search by dataset for cognee

Added ability to search by datasets for cognee users

Feature COG-912

* feat: outsources chunking parameters to extract chunk from documents … (#289)

* feat: outsources chunking parameters to extract chunk from documents task

* fix: Remove backend lock from UI

Removed lock that prevented using multiple datasets in cognify

Fix COG-912

* COG 870 Remove duplicate edges from the code graph (#293)

* feat: turn summarize_code into generator

* feat: extract run_code_graph_pipeline, update the pipeline

* feat: minimal code graph example

* refactor: update argument

* refactor: move run_code_graph_pipeline to cognify/code_graph_pipeline

* refactor: indentation and whitespace nits

* refactor: add deprecated use comments and warnings

---------

Co-authored-by: Vasilije <[email protected]>
Co-authored-by: Igor Ilic <[email protected]>
Co-authored-by: Boris <[email protected]>

* test: Added test for getting of documents for search

Added test to verify getting documents related to datasets intended for search

Test COG-912

* Structured code summarization (#375)

* feat: turn summarize_code into generator

* feat: extract run_code_graph_pipeline, update the pipeline

* feat: minimal code graph example

* refactor: update argument

* refactor: move run_code_graph_pipeline to cognify/code_graph_pipeline

* refactor: indentation and whitespace nits

* refactor: add deprecated use comments and warnings

* Structured code summarization

* add missing prompt file

* Remove summarization_model argument from summarize_code and fix typehinting

* minor refactors

---------

Co-authored-by: lxobr <[email protected]>
Co-authored-by: Vasilije <[email protected]>
Co-authored-by: Igor Ilic <[email protected]>
Co-authored-by: Boris <[email protected]>

* fix: Resolve issue with cognify router graph model default value

Resolve issue with default value for graph model in cognify endpoint

Fix

* chore: Resolve typo in getting documents code

Resolve typo in code

chore COG-912

* Update .github/workflows/dockerhub.yml

Co-authored-by: coderabbitai[bot] <136622811+coderabbitai[bot]@users.noreply.github.com>

* Update .github/workflows/dockerhub.yml

* Update .github/workflows/dockerhub.yml

* Update .github/workflows/dockerhub.yml

* Update get_cognify_router.py

* fix: Resolve syntax issue with cognify router

Resolve syntax issue with cognify router

Fix

* feat: Add ruff pre-commit hook for linting and formatting

Added formatting and linting on pre-commit hook

Feature COG-650

* chore: Update ruff lint options in pyproject file

Update ruff lint options in pyproject file

Chore

* test: Add ruff linter github action

Added linting check with ruff in github actions

Test COG-650

* feat: deletes executor limit from get_repo_file_dependencies

* feat: implements mock feature in LiteLLM engine

* refactor: Remove changes to cognify router

Remove changes to cognify router

Refactor COG-650

* fix: fixing boolean env for github actions

* test: Add test for ruff format for cognee code

Test if code is formatted for cognee

Test COG-650

* refactor: Rename ruff gh actions

Rename ruff gh actions to be more understandable

Refactor COG-650

* chore: Remove checking of ruff lint and format on push

Remove checking of ruff lint and format on push

Chore COG-650

* feat: Add deletion of local files when deleting data

Delete local files when deleting data from cognee

Feature COG-475

* fix: changes back the max workers to 12

* feat: Adds mock summary for codegraph pipeline

* refacotr: Add current development status

Save current development status

Refactor

* Fix langfuse

* Fix langfuse

* Fix langfuse

* Add evaluation notebook

* Rename eval notebook

* chore: Add temporary state of development

Add temp development state to branch

Chore

* fix: Add poetry.lock file, make langfuse mandatory

Added langfuse as mandatory dependency, added poetry.lock file

Fix

* Fix: fixes langfuse config settings

* feat: Add deletion of local files made by cognee through data endpoint

Delete local files made by cognee when deleting data from database through endpoint

Feature COG-475

* test: Revert changes on test_pgvector

Revert changes on test_pgvector which were made to test deletion of local files

Test COG-475

* chore: deletes the old test for the codegraph pipeline

* test: Add test to verify deletion of local files

Added test that checks local files created by cognee will be deleted and those not created by cognee won't

Test COG-475

* chore: deletes unused old version of the codegraph

* chore: deletes unused imports from code_graph_pipeline

* Ingest non-code files

* Fixing review findings

* Ingest non-code files (#395)

* Ingest non-code files

* Fixing review findings

* test: Update test regarding message

Update assertion message, add veryfing of file existence

* Handle retryerrors in code summary (#396)

* Handle retryerrors in code summary

* Log instead of print

* fix: updates the acreate_structured_output

* chore: Add logging to sentry when file which should exist can't be found

Log to sentry that a file which should exist can't be found

Chore COG-475

* Fix diagram

* fix: refactor mcp

* Add Smithery CLI installation instructions and badge

* Move readme

* Update README.md

* Update README.md

* Cog 813 source code chunks (#383)

* fix: pass the list of all CodeFiles to enrichment task

* feat: introduce SourceCodeChunk, update metadata

* feat: get_source_code_chunks code graph pipeline task

* feat: integrate get_source_code_chunks task, comment out summarize_code

* Fix code summarization (#387)

* feat: update data models

* feat: naive parse long strings in source code

* fix: get_non_py_files instead of get_non_code_files

* fix: limit recursion, add comment

* handle embedding empty input error (#398)

* feat: robustly handle CodeFile source code

* refactor: sort imports

* todo: add support for other embedding models

* feat: add custom logger

* feat: add robustness to get_source_code_chunks

Co-authored-by: coderabbitai[bot] <136622811+coderabbitai[bot]@users.noreply.github.com>

* feat: improve embedding exceptions

* refactor: format indents, rename module

---------

Co-authored-by: alekszievr <[email protected]>
Co-authored-by: coderabbitai[bot] <136622811+coderabbitai[bot]@users.noreply.github.com>

* Fix diagram

* Fix instructions

* adding and fixing files

* Update README.md

* ruff format

* Fix linter issues

* Implement PR review

* Comment out profiling

* fix: add allowed extensions

* fix: adhere UnstructuredDocument.read() to Document

* feat: time code graph run and add mock support

* Fix ollama, work on visualization

* fix: Fixes faulty logging format and sets up error logging in dynamic steps example

* Overcome ContextWindowExceededError by checking token count while chunking (#413)

* fix: Fixes duplicated edges in cognify by limiting the recursion depth in add datapoints

* Adjust AudioDocument and handle None token limit

* Handle azure models as well

* Add clean logging to code graph example

* Remove setting envvars from arg

* fix: fixes create_cognee_style_network_with_logo unit test

* fix: removes accidental remained print

* Get embedding engine instead of passing it. Get it from vector engine instead of direct getter.

* Fix visualization

* Get embedding engine instead of passing it in code chunking.

* Fix poetry issues

* chore: Update version of poetry install action

* chore: Update action to trigger on pull request for any branch

* chore: Remove if in github action to allow triggering on push

* chore: Remove if condition to allow gh actions to trigger on push to PR

* chore: Update poetry version in github actions

* chore: Set fixed ubuntu version to 22.04

* chore: Update py lint to use ubuntu 22.04

* chore: update ubuntu version to 22.04

* feat: implements the first version of graph based completion in search

* chore: Update python 3.9 gh action to use 3.12 instead

* chore: Update formatting of utils.py

* Fix poetry issues

* Adjust integration tests

* fix: Fixes ruff formatting

* Handle circular import

* fix: Resolve profiler issue with partial and recursive logger imports

Resolve issue for profiler with partial and recursive logger imports

* fix: Remove logger from __init__.py file

* test: Test profiling on HEAD branch

* test: Return profiler to base branch

* Set max_tokens in config

* Adjust SWE-bench script to code graph pipeline call

* Adjust SWE-bench script to code graph pipeline call

* fix: Add fix for accessing dictionary elements that don't exits

Using get for the text key instead of direct access to handle situation if the text key doesn't exist

* feat: Add ability to change graph database configuration through cognee

* feat: adds pydantic types to graph layer models

* feat: adds basic retriever for swe bench

* Match Ruff version in config to the one in github actions

* feat: implements code retriever

* Fix: fixes unit test for codepart search

* Format with Ruff 0.9.0

* Fix: deleting incorrect repo path

* fix: resolve issue with langfuse dependency installation when integrating cognee in different packages

* version: Increase version to 0.1.21

---------

Co-authored-by: Igor Ilic <[email protected]>
Co-authored-by: Vasilije <[email protected]>
Co-authored-by: Igor Ilic <[email protected]>
Co-authored-by: Rita Aleksziev <[email protected]>
Co-authored-by: vasilije <[email protected]>
Co-authored-by: coderabbitai[bot] <136622811+coderabbitai[bot]@users.noreply.github.com>
Co-authored-by: lxobr <[email protected]>
Co-authored-by: alekszievr <[email protected]>
Co-authored-by: hajdul88 <[email protected]>
Co-authored-by: Henry Mao <[email protected]>
borisarzentar added a commit that referenced this pull request Jan 13, 2025
* Revert "fix: Add metadata reflection fix to sqlite as well"

This reverts commit 394a0b2.

* COG-810 Implement a top-down dependency graph builder tool (#268)

* feat: parse repo to call graph

* Update/repo_processor/top_down_repo_parse.py task

* fix: minor improvements

* feat: file parsing jedi script optimisation

---------

* Add type to DataPoint metadata (#364)

* Add missing index_fields

* Use DataPoint UUID type in pgvector create_data_points

* Make _metadata mandatory everywhere

* feat: Add search by dataset for cognee

Added ability to search by datasets for cognee users

Feature COG-912

* feat: outsources chunking parameters to extract chunk from documents … (#289)

* feat: outsources chunking parameters to extract chunk from documents task

* fix: Remove backend lock from UI

Removed lock that prevented using multiple datasets in cognify

Fix COG-912

* COG 870 Remove duplicate edges from the code graph (#293)

* feat: turn summarize_code into generator

* feat: extract run_code_graph_pipeline, update the pipeline

* feat: minimal code graph example

* refactor: update argument

* refactor: move run_code_graph_pipeline to cognify/code_graph_pipeline

* refactor: indentation and whitespace nits

* refactor: add deprecated use comments and warnings

---------

Co-authored-by: Vasilije <[email protected]>
Co-authored-by: Igor Ilic <[email protected]>
Co-authored-by: Boris <[email protected]>

* test: Added test for getting documents for search

Added test to verify getting documents related to datasets intended for search

Test COG-912

* Structured code summarization (#375)

* feat: turn summarize_code into generator

* feat: extract run_code_graph_pipeline, update the pipeline

* feat: minimal code graph example

* refactor: update argument

* refactor: move run_code_graph_pipeline to cognify/code_graph_pipeline

* refactor: indentation and whitespace nits

* refactor: add deprecated use comments and warnings

* Structured code summarization

* add missing prompt file

* Remove summarization_model argument from summarize_code and fix typehinting

* minor refactors

---------

Co-authored-by: lxobr <[email protected]>
Co-authored-by: Vasilije <[email protected]>
Co-authored-by: Igor Ilic <[email protected]>
Co-authored-by: Boris <[email protected]>

* fix: Resolve issue with cognify router graph model default value

Resolve issue with default value for graph model in cognify endpoint

Fix

* chore: Resolve typo in getting documents code

Resolve typo in code

chore COG-912

* Update .github/workflows/dockerhub.yml

Co-authored-by: coderabbitai[bot] <136622811+coderabbitai[bot]@users.noreply.github.com>

* Update .github/workflows/dockerhub.yml

* Update .github/workflows/dockerhub.yml

* Update .github/workflows/dockerhub.yml

* Update get_cognify_router.py

* fix: Resolve syntax issue with cognify router

Resolve syntax issue with cognify router

Fix

* feat: Add ruff pre-commit hook for linting and formatting

Added formatting and linting on pre-commit hook

Feature COG-650

* chore: Update ruff lint options in pyproject file

Update ruff lint options in pyproject file

Chore

* test: Add ruff linter github action

Added linting check with ruff in github actions

Test COG-650

* feat: deletes executor limit from get_repo_file_dependencies

* feat: implements mock feature in LiteLLM engine

* refactor: Remove changes to cognify router

Remove changes to cognify router

Refactor COG-650

* fix: fixing boolean env for github actions

* test: Add test for ruff format for cognee code

Test if code is formatted for cognee

Test COG-650

* refactor: Rename ruff gh actions

Rename ruff gh actions to be more understandable

Refactor COG-650

* chore: Remove checking of ruff lint and format on push

Remove checking of ruff lint and format on push

Chore COG-650

* feat: Add deletion of local files when deleting data

Delete local files when deleting data from cognee

Feature COG-475

* fix: changes back the max workers to 12

* feat: Adds mock summary for codegraph pipeline

* refactor: Add current development status

Save current development status

Refactor

* Fix langfuse

* Fix langfuse

* Fix langfuse

* Add evaluation notebook

* Rename eval notebook

* chore: Add temporary state of development

Add temp development state to branch

Chore

* fix: Add poetry.lock file, make langfuse mandatory

Added langfuse as mandatory dependency, added poetry.lock file

Fix

* Fix: fixes langfuse config settings

* feat: Add deletion of local files made by cognee through data endpoint

Delete local files made by cognee when deleting data from database through endpoint

Feature COG-475

* test: Revert changes on test_pgvector

Revert changes on test_pgvector which were made to test deletion of local files

Test COG-475

* chore: deletes the old test for the codegraph pipeline

* test: Add test to verify deletion of local files

Added test that checks local files created by cognee will be deleted and those not created by cognee won't

Test COG-475

* chore: deletes unused old version of the codegraph

* chore: deletes unused imports from code_graph_pipeline

* Ingest non-code files

* Fixing review findings

* Ingest non-code files (#395)

* Ingest non-code files

* Fixing review findings

* test: Update test regarding message

Update assertion message, add verification of file existence

* Handle retryerrors in code summary (#396)

* Handle retryerrors in code summary

* Log instead of print

* fix: updates the acreate_structured_output

* chore: Add logging to sentry when file which should exist can't be found

Log to sentry that a file which should exist can't be found

Chore COG-475

* Fix diagram

* fix: refactor mcp

* Add Smithery CLI installation instructions and badge

* Move readme

* Update README.md

* Update README.md

* Cog 813 source code chunks (#383)

* fix: pass the list of all CodeFiles to enrichment task

* feat: introduce SourceCodeChunk, update metadata

* feat: get_source_code_chunks code graph pipeline task

* feat: integrate get_source_code_chunks task, comment out summarize_code

* Fix code summarization (#387)

* feat: update data models

* feat: naive parse long strings in source code

* fix: get_non_py_files instead of get_non_code_files

* fix: limit recursion, add comment

* handle embedding empty input error (#398)

* feat: robustly handle CodeFile source code

* refactor: sort imports

* todo: add support for other embedding models

* feat: add custom logger

* feat: add robustness to get_source_code_chunks

Co-authored-by: coderabbitai[bot] <136622811+coderabbitai[bot]@users.noreply.github.com>

* feat: improve embedding exceptions

* refactor: format indents, rename module

---------

Co-authored-by: alekszievr <[email protected]>
Co-authored-by: coderabbitai[bot] <136622811+coderabbitai[bot]@users.noreply.github.com>

* Fix diagram

* Fix diagram

* Fix instructions

* Fix instructions

* adding and fixing files

* Update README.md

* ruff format

* Fix linter issues

* Fix linter issues

* Fix linter issues

* Fix linter issues

* Fix linter issues

* Fix linter issues

* Fix linter issues

* Fix linter issues

* Fix linter issues

* Fix linter issues

* Implement PR review

* Comment out profiling

* Comment out profiling

* Comment out profiling

* fix: add allowed extensions

* fix: adhere UnstructuredDocument.read() to Document

* feat: time code graph run and add mock support

* Fix ollama, work on visualization

* fix: Fixes faulty logging format and sets up error logging in dynamic steps example

* Overcome ContextWindowExceededError by checking token count while chunking (#413)

* fix: Fixes duplicated edges in cognify by limiting the recursion depth in add datapoints

* Adjust AudioDocument and handle None token limit

* Handle azure models as well

* Fix visualization

* Fix visualization

* Fix visualization

* Add clean logging to code graph example

* Remove setting envvars from arg

* fix: fixes create_cognee_style_network_with_logo unit test

* fix: removes accidentally left print

* Fix visualization

* Fix visualization

* Fix visualization

* Get embedding engine instead of passing it. Get it from vector engine instead of direct getter.

* Fix visualization

* Fix visualization

* Fix poetry issues

* Get embedding engine instead of passing it in code chunking.

* Fix poetry issues

* Fix poetry issues

* Fix poetry issues

* Fix poetry issues

* Fix poetry issues

* Fix poetry issues

* Fix poetry issues

* Fix poetry issues

* chore: Update version of poetry install action

* chore: Update action to trigger on pull request for any branch

* chore: Remove if in github action to allow triggering on push

* chore: Remove if condition to allow gh actions to trigger on push to PR

* chore: Update poetry version in github actions

* chore: Set fixed ubuntu version to 22.04

* chore: Update py lint to use ubuntu 22.04

* chore: update ubuntu version to 22.04

* feat: implements the first version of graph based completion in search

* chore: Update python 3.9 gh action to use 3.12 instead

* chore: Update formatting of utils.py

* Fix poetry issues

* Fix poetry issues

* Fix poetry issues

* Fix poetry issues

* Fix poetry issues

* Fix poetry issues

* Fix poetry issues

* Adjust integration tests

* fix: Fixes ruff formatting

* Handle circular import

* fix: Resolve profiler issue with partial and recursive logger imports

Resolve issue for profiler with partial and recursive logger imports

* fix: Remove logger from __init__.py file

* test: Test profiling on HEAD branch

* test: Return profiler to base branch

* Set max_tokens in config

* Adjust SWE-bench script to code graph pipeline call

* Adjust SWE-bench script to code graph pipeline call

* fix: Add fix for accessing dictionary elements that don't exist

Use get for the text key instead of direct access to handle the case where the text key doesn't exist

* feat: Add ability to change graph database configuration through cognee

* feat: adds pydantic types to graph layer models

* test: Test ubuntu 24.04

* test: change all actions to ubuntu-latest

* feat: adds basic retriever for swe bench

* Match Ruff version in config to the one in github actions

* feat: implements code retriever

* Fix: fixes unit test for codepart search

* Format with Ruff 0.9.0

* Fix: deleting incorrect repo path

* docs: Add LlamaIndex Cognee integration notebook

Added LlamaIndex Cognee integration notebook

* test: Add github action for testing llama index cognee integration notebook

* fix: resolve issue with langfuse dependency installation when integrating cognee in different packages

* version: Increase version to 0.1.21

* fix: update dependencies of the mcp server

* Update README.md

* Fix: Fixes logging setup

* feat: deletes on-the-fly embeddings, as it uses edge collections

* fix: Change nbformat on llama index integration notebook

* fix: Resolve api key issue with llama index integration notebook

* fix: Attempt to resolve issue with Ubuntu 24.04 segmentation fault

* version: Increase version to 0.1.22

---------

Co-authored-by: vasilije <[email protected]>
Co-authored-by: Igor Ilic <[email protected]>
Co-authored-by: Igor Ilic <[email protected]>
Co-authored-by: lxobr <[email protected]>
Co-authored-by: alekszievr <[email protected]>
Co-authored-by: hajdul88 <[email protected]>
Co-authored-by: Vasilije <[email protected]>
Co-authored-by: coderabbitai[bot] <136622811+coderabbitai[bot]@users.noreply.github.com>
Co-authored-by: Rita Aleksziev <[email protected]>
Co-authored-by: Henry Mao <[email protected]>
4 participants