Cog 813 source code chunks #383
Conversation
Walkthrough
This pull request introduces significant changes to the code graph processing pipeline, focusing on source code chunking, embedding, and summarization. The modifications span multiple files.
Sequence Diagram
```mermaid
sequenceDiagram
    participant Pipeline as Code Graph Pipeline
    participant Processor as Repo Processor
    participant Chunker as Source Code Chunker
    participant Embedder as Embedding Engine
    participant Summarizer as Code Summarizer
    participant Indexer as Data Indexer
    Pipeline->>Processor: Get repository files
    Processor-->>Pipeline: Return code files
    Pipeline->>Chunker: Process source code chunks
    Chunker->>Embedder: Generate embeddings
    Embedder-->>Chunker: Return embeddings
    Chunker-->>Pipeline: Return code chunks
    Pipeline->>Summarizer: Generate code summaries
    Summarizer-->>Pipeline: Return summaries
    Pipeline->>Indexer: Index data points
    Indexer-->>Pipeline: Confirm indexing
```
📜 Recent review details
Configuration used: CodeRabbit UI
📒 Files selected for processing (3)
🚧 Files skipped from review as they are similar to previous changes (2)
🔇 Additional comments (1)
cognee/infrastructure/databases/exceptions/EmbeddingException.py (1)
The custom exception is straightforward and correctly inherits from the expected base class.
@borisarzentar, can you please check whether the DataPoints in CodeGraphEntities.py look alright?
Actionable comments posted: 4
🧹 Nitpick comments (8)
cognee/tasks/repo_processor/get_source_code_chunks.py (2)
46-63: Clarify cutoff logic in _get_chunk_source_code.
The variable names (e.g., “current_cutoff”) can be confusing. Consider renaming it to something like “overlap_cutoff_index” for readability. Also verify that line 57 sets an index one behind the current iteration; if “i - 1” happens to be negative, we might skip the first subchunk.
```diff
- current_cutoff = i - 1
+ overlap_cutoff_index = max(0, i - 1)
```
65-90: Check concurrency or parallel chunk generation in get_source_code_chunks_from_code_part.
This function yields chunks sequentially, which is likely fine for many uses. However, if a user requires parallel chunking (e.g., for large repositories), you might consider asynchronous partitioning. Assess whether sequential generation could become a bottleneck and whether asynchronous patterns would help.
cognee/shared/CodeGraphEntities.py (2)
32-40: SourceCodeChunk design appears coherent, but consider linking to CodeFile for traceability.
Right now, a SourceCodeChunk belongs to a CodePart, which in turn belongs to a CodeFile. For direct referencing or analytics, you might need a direct link to the parent CodeFile. Evaluate whether hooking SourceCodeChunk directly to a CodeFile could simplify queries and indexing.
45-45: Coordinate model rebuild calls more systematically.
All classes call model_rebuild() individually. While this is correct, consider grouping the calls together or leveraging an auto-rebuild pattern if the engine supports it. This could reduce duplication and ensure consistency across all models.
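For illustration, grouping could be as simple as a single loop at the bottom of CodeGraphEntities.py; this sketch assumes the four model classes are in scope:
```python
# Hypothetical grouped rebuild: one pass over all DataPoint models
# instead of a model_rebuild() call after each class definition.
for model in (Repository, CodeFile, CodePart, SourceCodeChunk):
    model.model_rebuild()
```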
cognee/tasks/repo_processor/get_repo_file_dependencies.py (1)
112-112: Yielding code_files as a single list might disrupt existing pipeline logic.
Some tasks may anticipate sequential DataPoints. Double-check any tasks that handle CodeFile objects to ensure they correctly process a list of them at once.
cognee/api/v1/cognify/code_graph_pipeline.py (3)
66-66: Document differences between run_pipeline and run_code_graph_pipeline.
Although run_pipeline is deprecated, it still has an effect. Clarify in the docstring that run_code_graph_pipeline is the recommended approach so that future maintainers understand the differences.
141-141: Revisit enrich_dependency_graph ordering.
This task precedes expanding dependencies and extracting source code chunks. If additional metadata or code relationships introduced by chunking are needed for the dependency graph, you may want to reorder these tasks.
143-144: Integrate concurrency or partial streaming strategies.
The code_graph_pipeline tasks can handle large repositories. When dealing with thousands of files, running get_source_code_chunks followed by summarize_code might be time-consuming. Consider partial updates or concurrency to handle tasks without waiting for the entire operation to complete.
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (4)
- cognee/api/v1/cognify/code_graph_pipeline.py (4 hunks)
- cognee/shared/CodeGraphEntities.py (1 hunks)
- cognee/tasks/repo_processor/get_repo_file_dependencies.py (3 hunks)
- cognee/tasks/repo_processor/get_source_code_chunks.py (1 hunks)
🔇 Additional comments (5)
cognee/tasks/repo_processor/get_source_code_chunks.py (2)
11-13: Consider validating or sanitizing input in _count_tokens.
If the input source code string contains unexpected characters or is extremely large, it may cause performance overhead or unexpected tokenization results. You might want to add a guard clause or logging messages for extremely large inputs to avoid potential memory strain.
92-104: Review data point yielding order and potential duplication.
This function yields the original data point first, and then yields each contained CodePart followed by newly generated SourceCodeChunks. Ensure that no duplicate DataPoints are being introduced back into the pipeline, especially when combining tasks that rely on unique DataPoint identities.
✅ Verification successful
Let me gather more information about the data point types and their relationships to better understand the potential for duplicates.
Let me check how the source code chunks are generated and if there's any deduplication mechanism in place.
Based on the gathered information, I can now provide a final response:
The yielding order is correct and duplicates are prevented by design.
The implementation is safe because:
- Each DataPoint has a unique UUID identifier generated at creation (via uuid4)
- The hierarchical yielding (CodeFile -> CodePart -> SourceCodeChunk) follows the natural containment relationship
- Each level represents different entity types (CodeFile, CodePart, SourceCodeChunk) that are processed differently in the pipeline
The sequential yielding is intentional to allow downstream tasks to process each level of abstraction while maintaining the relationship between parent and child entities.
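For illustration, the intended yield order can be sketched as follows; the class definitions and chunker below are simplified stand-ins, not the PR's actual code:
```python
from dataclasses import dataclass

# Simplified stand-ins for the pipeline's DataPoint types; the real
# classes live in cognee/shared/CodeGraphEntities.py.
@dataclass
class SourceCodeChunk:
    source_code: str

@dataclass
class CodePart:
    source_code: str

@dataclass
class CodeFile:
    contains: list

def chunk_code_part(part: CodePart):
    # Placeholder chunker: emits a single chunk per part.
    yield SourceCodeChunk(part.source_code)

async def get_source_code_chunks(data_points):
    for code_file in data_points:              # CodeFile first
        yield code_file
        for code_part in code_file.contains:   # then each CodePart
            yield code_part
            for chunk in chunk_code_part(code_part):  # then its chunks
                yield chunk
```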
🏁 Scripts executed
The following scripts were executed for the analysis:
Script:
```bash
#!/bin/bash
# Description: Confirm that no DataPoint is processed multiple times, leading to duplicates.
# Test: Look for yield statements in get_source_code_chunks that might reintroduce duplicates.
rg -A 2 'yield data_point'
rg -A 2 'yield code_part'
rg -A 2 'yield source_code_chunk'
```
Length of output: 1254
Script:
```bash
#!/bin/bash
# Check the DataPoint class hierarchy and relationships
ast-grep --pattern 'class DataPoint'
ast-grep --pattern 'class CodeFile'
ast-grep --pattern 'class CodePart'
ast-grep --pattern 'class SourceCodeChunk'
# Look for any deduplication or unique identifier logic
rg -l 'def.*id' cognee/tasks/repo_processor/
rg -A 5 'class.*DataPoint'
```
Length of output: 18863
Script:
```bash
#!/bin/bash
# Check the implementation of get_source_code_chunks_from_code_part
rg -A 10 'def get_source_code_chunks_from_code_part'
# Check if there's any unique identifier or deduplication logic in DataPoint base class
ast-grep --pattern 'class DataPoint'
rg -A 10 'class DataPoint'
```
Length of output: 2196
cognee/tasks/repo_processor/get_repo_file_dependencies.py (2)
Line range hint 97-110: Confirm that depends_on references for CodeFile objects are accurate.
If some of these dependencies cannot be resolved (not found in py_files_dict), you might end up with partial or inconsistent references. That could break data relationships in subsequent tasks.
74-74: Yielding [repo] changes the downstream format.
Previously, a single DataPoint might have been yielded. Now a list containing the repo is yielded. Ensure that all downstream tasks that expect a single Repository object can handle a list instead.
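A small defensive pattern for consumers while both shapes exist could look like this (the helper name is made up for this sketch):
```python
def as_list(value):
    # Accept either a single data point (e.g. a bare Repository)
    # or an already-wrapped list of data points.
    return value if isinstance(value, list) else [value]

# e.g. inside a downstream task:
# for item in as_list(incoming):
#     handle(item)
```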
cognee/api/v1/cognify/code_graph_pipeline.py (1)
10-12: Ensure the newly imported modules are fully utilized.
You have introduced imports for SourceCodeGraph and SummarizedContent. Check if the references to these objects in the pipeline are correct and valid, or if any are unused.
b5847c5 to 1c5ca84 (Compare)
# Conflicts:
#	cognee/api/v1/cognify/code_graph_pipeline.py
| GitGuardian id | GitGuardian status | Secret | Commit | Filename | |
|---|---|---|---|---|---|
| 9573981 | Triggered | Generic Password | b524e94 | notebooks/hr_demo.ipynb | View secret |
🛠 Guidelines to remediate hardcoded secrets
- Understand the implications of revoking this secret by investigating where it is used in your code.
- Replace and store your secret safely. Learn here the best practices.
- Revoke and rotate this secret.
- If possible, rewrite git history. Rewriting git history is not a trivial act. You might completely break other contributing developers' workflow and you risk accidentally deleting legitimate data.
To avoid such incidents in the future, consider:
- following these best practices for managing and storing secrets, including API keys and other credentials
- installing secret detection on pre-commit to catch secrets before they leave your machine and ease remediation.
Actionable comments posted: 1
♻️ Duplicate comments (1)
cognee/tasks/repo_processor/get_source_code_chunks.py (1)
35-41: ⚠️ Potential issue: Potential risk of deep recursion in _get_subchunk_token_counts.
Due to nested structures or large modules, this function might recurse many levels deep. Although max_depth is set to 100, extremely pathological code might still cause performance or memory issues. Consider an iterative approach, as recommended in past reviews, for safer handling of large or heavily nested code.
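For reference, a minimal iterative sketch, assuming the chunker walks a parso parse tree and counts tokens with tiktoken; the function name and the naive fallback for oversized leaves are illustrative, not the PR's code:
```python
import parso
import tiktoken

def iter_subchunk_token_counts(
    tokenizer: tiktoken.Encoding, source_code: str, max_subchunk_tokens: int = 8000
) -> list[tuple[str, int]]:
    """Split source code into token-bounded subchunks with an explicit stack."""
    subchunks: list[tuple[str, int]] = []
    stack = [parso.parse(source_code)]
    while stack:
        node = stack.pop()
        code = node.get_code()
        token_count = len(tokenizer.encode(code))
        if token_count <= max_subchunk_tokens:
            subchunks.append((code, token_count))
        elif getattr(node, "children", None):
            # Push children in reverse so they pop in source order.
            stack.extend(reversed(node.children))
        else:
            # A leaf that is still too large: fall back to naive token slicing.
            token_ids = tokenizer.encode(code)
            for start in range(0, len(token_ids), max_subchunk_tokens):
                piece = token_ids[start:start + max_subchunk_tokens]
                subchunks.append((tokenizer.decode(piece), len(piece)))
    return subchunks
```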
🧹 Nitpick comments (5)
cognee/tasks/repo_processor/get_source_code_chunks.py (3)
11-12: Consider caching or single-pass token counting instead of repeated encoding.
The function _count_tokens calls tokenizer.encode on each source code snippet. If used frequently in tight loops, encoding overhead may become a bottleneck. Consider caching or carefully scheduling these calls if performance becomes an issue.
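If profiling confirms a bottleneck, one hedged option is memoizing counts per snippet; the module-level tokenizer and cache size here are assumptions, not the PR's code:
```python
from functools import lru_cache

import tiktoken

# tiktoken.Encoding instances are not reliably hashable, so the tokenizer
# is captured at module scope instead of being a cached argument.
_TOKENIZER = tiktoken.encoding_for_model("text-embedding-3-large")

@lru_cache(maxsize=4096)
def count_tokens_cached(source_code: str) -> int:
    return len(_TOKENIZER.encode(source_code))
```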
83-108: Check overlap ratio and chunk boundaries.
In _get_chunk_source_code, the logic for overlap-based trimming is correct, but the partial chunk boundary might lead to frequent context breaks if the overlap is large. Monitor whether this approach causes any confusion in subsequent processing or embedding steps.
122-135: Efficiency of repeated chunk generation.
Where the loop repeatedly calls _get_chunk_source_code, watch out for performance overhead. In large code files, this could loop many times. If performance is acceptable for your use case, ignore this; otherwise, consider more direct or streamed chunk generation.
cognee/shared/CodeGraphEntities.py (1)
28-29: Optional source_code recommended for partial code parts?
Changing source_code to Optional helps avoid frequent null checks. However, ensure that the rest of the code generation pipeline can handle missing code gracefully.
cognee/tasks/repo_processor/get_repo_file_dependencies.py (1)
Line range hint 93-111: Consider optimizing dependency resolution.
The current implementation loads source code for all dependencies upfront, which might not be necessary if the dependent files are processed later in the pipeline.
Consider lazy loading of dependency source code:
```diff
 depends_on=[
     CodeFile(
         id=uuid5(NAMESPACE_OID, dependency),
         extracted_id=dependency,
         part_of=repo,
-        source_code=py_files_dict.get(dependency, {}).get("source_code"),
+        source_code=None,  # Lazy load when needed
     )
     for dependency in dependencies
 ] if dependencies else None,
```
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (7)
- cognee/api/v1/cognify/code_graph_pipeline.py (4 hunks)
- cognee/shared/CodeGraphEntities.py (2 hunks)
- cognee/shared/data_models.py (0 hunks)
- cognee/tasks/repo_processor/get_repo_file_dependencies.py (3 hunks)
- cognee/tasks/repo_processor/get_source_code_chunks.py (1 hunks)
- cognee/tasks/summarization/models.py (2 hunks)
- cognee/tasks/summarization/summarize_code.py (3 hunks)
💤 Files with no reviewable changes (1)
- cognee/shared/data_models.py
🚧 Files skipped from review as they are similar to previous changes (1)
- cognee/api/v1/cognify/code_graph_pipeline.py
🧰 Additional context used
📓 Learnings (1)
cognee/tasks/repo_processor/get_source_code_chunks.py (1)
Learnt from: alekszievr
PR: topoteretes/cognee#383
File: cognee/tasks/repo_processor/get_source_code_chunks.py:15-44
Timestamp: 2024-12-19T14:01:34.118Z
Learning: When large or deeply nested code leads to recursion errors in `_get_subchunk_token_counts`, an iterative approach (using a stack) or a maximum recursion depth is required to handle pathological inputs gracefully.
🔇 Additional comments (15)
cognee/tasks/repo_processor/get_source_code_chunks.py (4)
56-59: Handle potential single-child node carefully.
When a module has only one real child, you reassign module = module.children[0]. If that child is also minimal or invalid, we risk reassigning to a node that’s not guaranteed to be parseable. Ensure we don’t inadvertently skip children or cause infinite loops if the child is also near-empty or missing.
72-74: Special handling for string nodes is correct, but be mindful of edges.
By calling _get_naive_subchunk_token_counts for string nodes, you avoid further parsing. This is logical for large string literals. However, watch out for multi-line strings or docstrings that may contain code-like content. Users might store code blocks in docstrings.
137-149: Asynchronous generator pipeline clarity.
In get_source_code_chunks, you yield the original data_point, then code_part, then each chunk. This is logical, but ensure the order of yields is correct for downstream consumers. Also, confirm that none of them assume a single type per iteration.
15-33: Warn about naive splitting approach for large source code.
The _get_naive_subchunk_token_counts function can generate many subchunks for very large files, potentially leading to high memory usage. Additionally, if logic upstream feeds extremely large source code strings, consider streaming or iterative chunking to avoid ballooning memory usage.
cognee/tasks/summarization/models.py (2)
1-1: Import seems fine. No issues.
The addition of Union from typing is consistent with usage below.
22-22: Flexible summarization with union of types.
Allowing summarizes to handle CodeFile, CodePart, or SourceCodeChunk is a good approach. Ensure that downstream logic checks the actual type to handle each variant correctly.
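A minimal sketch of such a type check, assuming the entity classes exported by cognee/shared/CodeGraphEntities.py; the string labels are placeholders:
```python
from cognee.shared.CodeGraphEntities import CodeFile, CodePart, SourceCodeChunk

def summary_target_kind(node) -> str:
    # Branch on the concrete type behind the Union before touching
    # any type-specific attributes.
    if isinstance(node, SourceCodeChunk):
        return "source_code_chunk"
    if isinstance(node, CodePart):
        return "code_part"
    if isinstance(node, CodeFile):
        return "code_file"
    raise TypeError(f"Unsupported summarization target: {type(node)!r}")
```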
cognee/tasks/summarization/summarize_code.py (2)
24-24: Renaming extracted_id to id ensures consistent referencing.
This dictionary key change is straightforward and beneficial if the new code references node.id.
38-39: Renamed attribute for clarity.
Swapping made_from=node for summarizes=node aligns with the new attribute in CodeSummary. This fosters better readability and consistency.
cognee/shared/CodeGraphEntities.py (5)
9-9: Index fields are empty.
Setting index_fields to an empty list on Repository means less overhead but also fewer search capabilities. Verify that it’s deliberate if you need to search on path or other fields.
22-22: Index fields removed from CodeFile.
No immediate issues, but if the removal of index_fields was not intentional, it may affect searching or referencing CodeFile by those fields.
31-31: Index fields removed from CodePart.
Similar to CodeFile, confirm that the removal is intended and that searching by code content is not needed.
35-43: Introduction of SourceCodeChunk class.
Bringing in code_chunk_of, source_code, and previous_chunk is crucial for chunk chaining. Good job. But be mindful that previous_chunk references can form cyclical references if used incorrectly.
48-48: Model rebuild call.
Ensuring that SourceCodeChunk.model_rebuild() is invoked helps keep Pydantic model definitions updated. Confirm that these model rebuild calls are necessary in your environment.
cognee/tasks/repo_processor/get_repo_file_dependencies.py (2)
Line range hint 93-111: Memory optimization needed for large repositories.
Accumulating all CodeFile instances in memory before yielding could cause issues with large repositories. Consider yielding in batches instead.
Here's a suggested implementation using batching:
```diff
-code_files = []
+batch_size = 100  # Adjust based on typical file sizes
+current_batch = []
 for (file_path, metadata), dependencies in zip(py_files_dict.items(), results):
     source_code = metadata.get("source_code")
-    code_files.append(CodeFile(
+    current_batch.append(CodeFile(
         id=uuid5(NAMESPACE_OID, file_path),
         source_code=source_code,
         extracted_id=file_path,
         part_of=repo,
         depends_on=[
             CodeFile(
                 id=uuid5(NAMESPACE_OID, dependency),
                 extracted_id=dependency,
                 part_of=repo,
                 source_code=py_files_dict.get(dependency, {}).get("source_code"),
             ) for dependency in dependencies
         ] if dependencies else None,
     ))
-
-yield code_files
+    if len(current_batch) >= batch_size:
+        yield current_batch
+        current_batch = []
+if current_batch:  # Yield any remaining files
+    yield current_batch
```
74-74: Verify consumers of this generator function.
The change to yield [repo] instead of just repo standardizes the output format but might impact existing consumers of this generator.
✅ Verification successful
Let me analyze the findings and generate additional verification scripts to understand the impact better.
This will help us understand how the tasks process the yielded data and verify if they can handle the list format.
Let me analyze the Task implementation one more time to be completely certain.
Based on the gathered evidence, I can now provide a final response:
The change to yield [repo] is safe and aligns with the pipeline's design.
The verification shows that:
- All downstream tasks (enrich_dependency_graph, expand_dependency_graph) expect and process a list of data points
- The pipeline tasks are designed to handle lists, as shown by their function signatures: async def enrich_dependency_graph(data_points: list[DataPoint])
- The change standardizes the output format, making both yields return lists consistently
🏁 Scripts executed
The following scripts were executed for the analysis:
Script:
```bash
#!/bin/bash
# Search for direct usage of get_repo_file_dependencies
rg -l "get_repo_file_dependencies" | while read -r file; do
  echo "=== $file ==="
  rg -A 5 "get_repo_file_dependencies" "$file"
done
```
Length of output: 4807
Script:
```bash
#!/bin/bash
# Check the implementation of tasks that consume the generator output
echo "=== Pipeline Task Implementation ==="
ast-grep --pattern 'class Task {
  $$$
  async def process($$$) {
    $$$
  }
  $$$
}'
echo "=== Checking enrich_dependency_graph implementation ==="
rg -A 10 "async def enrich_dependency_graph"
echo "=== Checking expand_dependency_graph implementation ==="
rg -A 10 "def expand_dependency_graph"
```
Length of output: 2457
Script:
```bash
#!/bin/bash
# Check the Task class implementation
echo "=== Task Class Implementation ==="
rg -A 15 "class Task"
# Check the pipeline execution
echo "=== Pipeline Execution ==="
rg -A 10 "async def execute"
```
Length of output: 4686
Actionable comments posted: 0
🧹 Nitpick comments (5)
examples/python/code_graph_example.py (1)
14-14: Make the boolean parsing more robust.
This lambda-based approach only checks for "true" or "1". Consider also supporting "false" or "0" for a symmetric user experience, or at least documenting that any other string value will default to False.
```diff
-parser.add_argument("--include_docs", type=lambda x: x.lower() in ("true", "1"), default=True, help="Whether or not to process non-code files")
+def str_to_bool(value: str) -> bool:
+    return value.lower() in ("true", "1")
+
+parser.add_argument(
+    "--include_docs",
+    type=str_to_bool,
+    default=True,
+    help="Whether or not to process non-code files"
+)
```
36-39: Consider adding more robust error handling.
Simply printing the error notice might not be sufficient in production. Consider logging it at an appropriate level and possibly re-raising or implementing retry logic for recoverable errors.
```diff
 except (OpenAIError, BadRequestError) as e:
-    print(f"Failed to index data points for {index_name}.{field_name}: {e}")
+    logger.error(f"Failed to index data points for {index_name}.{field_name}: {e}")
+    # Potentially add retry logic or re-raise
```
cognee/api/v1/cognify/code_graph_pipeline.py (3)
61-61: Reevaluate batch processing removal.
Previously, enrich_dependency_graph had a batch size. Removing it can lead to memory overhead if the dataset is large. Evaluate whether you need a batch size to prevent performance bottlenecks.
63-63: Handling embedding model.
Passing the embedding model is a nice improvement. If the model can change, consider injecting it dynamically to avoid coupling tasks to a single model.
70-70: Adjust tasks for doc exclusion.
Switching to get_non_py_files is valid if your doc files are outside Python. Ensure that you still handle other file types that need processing (like Markdown, text, etc.).
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (3)
- cognee/api/v1/cognify/code_graph_pipeline.py (5 hunks)
- cognee/tasks/storage/index_data_points.py (2 hunks)
- examples/python/code_graph_example.py (1 hunks)
🔇 Additional comments (6)
cognee/tasks/storage/index_data_points.py (1)
1-2: Imports are consistent with usage.
Using these exceptions from the litellm library is appropriate if the engine can raise them. Make sure they align with the actual exceptions thrown in your environment.
✅ Verification successful
Let me search for the usage of these exceptions in the codebase to verify their alignment with the actual exceptions being thrown.
Imports are properly used and aligned with error handling.
The verification shows that both BadRequestError and OpenAIError are correctly used in the error handling of the index_data_points function. These exceptions are caught specifically when calling vector_engine.index_data_points, which is appropriate as the codebase uses LiteLLM for embeddings (as seen in LiteLLMEmbeddingEngine.py). The error handling pattern is consistent with other LiteLLM exception handling in the codebase.
🏁 Scripts executed
The following scripts were executed for the analysis:
Error: No shell scripts found
Script:
```bash
#!/bin/bash
# Search for usage of BadRequestError and OpenAIError
rg "BadRequestError|OpenAIError" -A 3
```
Length of output: 800
Script:
```bash
#!/bin/bash
# Look for exception handling patterns in the index_data_points.py file
cat cognee/tasks/storage/index_data_points.py
```
Length of output: 4031
Script:
```bash
#!/bin/bash
# Search for any litellm related error handling
rg "except.*litellm" -A 3
```
Length of output: 1004
cognee/api/v1/cognify/code_graph_pipeline.py (5)
82-82: Conditional docs processing looks good.
Skipping the document-related tasks when include_docs is False helps performance. Confirm that user expectations are met if doc tasks are always optional.
20-20: Check new naming convention.
Renaming “get_non_code_files” to “get_non_py_files” makes sense if you only want to exclude Python files. Confirm that this does not omit other file types that might be critical.
✅ Verification successful
Let me gather information about the file naming conventions and related functions in the codebase.
Let me check the implementation of the function to understand what files it actually processes.
Function name accurately reflects its purpose.
The function get_non_py_files is correctly named, as it specifically filters out Python files (.py) while also excluding common build artifacts and system files through IGNORED_PATTERNS (.git, __pycache__, .pyc, .pyo, .pyd, node_modules, .egg-info). The name change from get_non_code_files to get_non_py_files is more precise and better reflects its actual implementation.
🏁 Scripts executed
The following scripts were executed for the analysis:
Error: No shell scripts found
Script:
```bash
#!/bin/bash
# Search for the function definition and its usage
ast-grep --pattern 'def get_non_py_files'
# Search for any references to this function
rg "get_non_py_files"
# Search for the previous function name to see the change
rg "get_non_code_files"
# Look for file extension patterns or filtering logic
rg -A 5 "\..*['\"].*exclude"
```
Length of output: 654
Script:
```bash
#!/bin/bash
# Get the implementation of get_non_py_files
cat cognee/tasks/repo_processor/get_non_code_files.py
```
Length of output: 1884
6-7: Validate that the embedding engine is appropriate.
Ensure the selected engine works well for your tokenization approach. Larger embedding models might slow down indexing or exceed token limits.
✅ Verification successful
Let me gather information about the embedding engine implementation and its usage.
Let me try different commands to gather information about the embedding engine.
Based on the gathered information, I can now provide a final response:
Embedding configuration is well-defined with appropriate defaults
The system uses LiteLLM with text-embedding-3-large (3072 dimensions) by default, which is a good choice for code understanding. The implementation includes proper configuration options for customization through environment variables, including model selection, dimensions, and API settings. The embedding engine is consistently used across vector stores and graph operations.
Key configurations found:
- Default model: text-embedding-3-large
- Embedding dimensions: 3072
- Configurable via environment variables: endpoint, API key, and API version
- Used by both vector databases and graph operations
🏁 Scripts executed
The following scripts were executed for the analysis:
Error: No shell scripts found
Script:
```bash
#!/bin/bash
# Check the embedding engine implementation
ast-grep --pattern 'def get_embedding_engine'
# Look for configuration or settings related to the embedding engine
rg -l "embedding" --type python
# Check for any token limit or model size configurations
rg -i "token.*limit|model.*size" --type python
```
Length of output: 190
Script:
```bash
#!/bin/bash
# Find Python files containing embedding-related code
fd -e py | xargs rg "embedding" -l
# Look for the embedding engine implementation
fd -e py | xargs rg "get_embedding_engine" -A 5
# Search for configuration files that might contain embedding settings
fd "config|settings" -e py -e yaml -e json | xargs rg -i "embedding|token|model"
```
Length of output: 10695
54-55: Ensure embedding engine meets chunk constraints.
This code retrieves the default embedding engine, but confirm it respects any size or token constraints used by get_source_code_chunks.
✅ Verification successful
Let me gather information about the embedding engine and chunk constraints by examining the codebase.
Let me search for more specific information about the embedding engine and its relationship with token constraints.
Based on the gathered information, I can now generate the final response:
Embedding engine and chunk constraints are properly aligned
The embedding engine (text-embedding-3-large) and source code chunking are well-coordinated. The get_source_code_chunks function explicitly handles token limits (max 8192 tokens) and uses the same model's tokenizer (tiktoken.encoding_for_model("text-embedding-3-large")), ensuring chunks are compatible with the embedding engine's constraints.
🏁 Scripts executed
The following scripts were executed for the analysis:
Error: No shell scripts found
Script:
```bash
#!/bin/bash
# Find the implementation of get_embedding_engine and get_source_code_chunks
ast-grep --pattern 'def get_embedding_engine'
ast-grep --pattern 'def get_source_code_chunks'
# Search for any configuration or constants related to chunk constraints
rg -i "chunk.*size|token.*limit|embedding.*limit" -A 3
```
Length of output: 26793
Script:
```bash
#!/bin/bash
# Find the implementation of get_embedding_engine and related token constraints
ast-grep --pattern 'def get_embedding_engine'
rg -A 5 "get_embedding_engine"
# Look for token limit configurations in the embedding engine
rg -i "max.*tokens|token.*limit" -A 3
```
Length of output: 11300
22-23: Confirm get_source_code_chunks usage.
Ensure that the chunking logic in “get_source_code_chunks” matches downstream tasks’ expectations. Changes in chunk size or metadata might impact summarization steps later.
✅ Verification successful
Let me gather information about the usage of get_source_code_chunks and its relationship with downstream tasks.
Let me search for the summarization task implementation to understand how it uses the chunks.
Let me check the summarization implementation to understand how it processes the chunks.
Based on the gathered information, I can now provide a final response:
Chunking configuration is safely handled by default parameters.
The get_source_code_chunks function uses well-defined default parameters (max_tokens=8192, overlap=0.25, granularity=0.1) that are compatible with the summarization pipeline. The summarization task (summarize_code) processes these chunks without any specific size requirements, as it extracts summaries from the source code content directly. The chunking logic maintains context through proper overlap handling, and the chunks are properly linked through the previous_chunk reference.
🏁 Scripts executed
The following scripts were executed for the analysis:
Error: No shell scripts found
Script:
```bash
#!/bin/bash
# Find the implementation of get_source_code_chunks
ast-grep --pattern 'def get_source_code_chunks($$$)'
# Search for imports and usage of get_source_code_chunks
rg "get_source_code_chunks" -A 3
```
Length of output: 2039
Script:
```bash
#!/bin/bash
# Find the implementation of summarize_code
ast-grep --pattern 'def summarize_code($$$)'
# Look for SourceCodeChunk usage in summarization
rg "SourceCodeChunk" -A 3
```
Length of output: 2654
Script:
```bash
#!/bin/bash
# Find the implementation of summarize_code and its usage of chunks
rg -A 10 "def summarize_code"
# Check if there are any specific chunk size or metadata requirements in summarization
rg -A 5 "max_tokens|granularity|overlap"
```
Length of output: 24085
Actionable comments posted: 1
♻️ Duplicate comments (1)
cognee/tasks/repo_processor/get_source_code_chunks.py (1)
38-84: 🛠️ Refactor suggestion: Consider replacing recursion with iteration for better stability.
While the max_depth parameter helps, an iterative approach using a stack would be more robust for handling large or deeply nested code files.
🧹 Nitpick comments (4)
cognee/tasks/repo_processor/get_source_code_chunks.py (4)
14-16: Add input validation for source_code parameter.
The function should handle edge cases where source_code might be None or empty.
```diff
 def _count_tokens(tokenizer: tiktoken.Encoding, source_code: str) -> int:
+    if not source_code:
+        return 0
     return len(tokenizer.encode(source_code))
```
18-36: Optimize string concatenation in token decoding loop.
Consider using list comprehension and join for better performance when concatenating decoded tokens.
```diff
-    subchunk = ''.join(
-        tokenizer.decode_single_token_bytes(token_id).decode('utf-8', errors='replace')
-        for token_id in subchunk_token_ids
-    )
+    subchunk = ''.join([
+        tokenizer.decode_single_token_bytes(token_id).decode('utf-8', errors='replace')
+        for token_id in subchunk_token_ids
+    ])
```
86-111: Document overlap behavior and its implications.
The overlap calculation could result in 100% of lines (except beginning and end) being part of two chunks when overlap is set to 0.5. This should be documented for clarity.
```diff
-    """Generates a chunk of source code from tokenized subchunks with overlap handling."""
+    """Generates a chunk of source code from tokenized subchunks with overlap handling.
+
+    Note: With an overlap of 0.5, most lines will appear in two chunks, as the actual
+    overlap ratio is overlap/(1-overlap). This is intentional to ensure context
+    continuity between chunks.
+    """
```
113-142: Add parameter validation for overlap and granularity.
These parameters should be validated to ensure they are within reasonable bounds.
```diff
 def get_source_code_chunks_from_code_part(
     code_file_part: CodePart,
     max_tokens: int = 8192,
     overlap: float = 0.25,
     granularity: float = 0.1,
     model_name: str = "text-embedding-3-large"
 ) -> Generator[SourceCodeChunk, None, None]:
+    if not 0 <= overlap < 1:
+        raise ValueError("Overlap must be between 0 and 1")
+    if not 0 < granularity <= 1:
+        raise ValueError("Granularity must be between 0 and 1")
```
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (1)
- cognee/tasks/repo_processor/get_source_code_chunks.py (1 hunks)
🧰 Additional context used
📓 Learnings (1)
cognee/tasks/repo_processor/get_source_code_chunks.py (1)
Learnt from: alekszievr
PR: topoteretes/cognee#383
File: cognee/tasks/repo_processor/get_source_code_chunks.py:15-44
Timestamp: 2024-12-19T14:01:34.118Z
Learning: When large or deeply nested code leads to recursion errors in `_get_subchunk_token_counts`, an iterative approach (using a stack) or a maximum recursion depth is required to handle pathological inputs gracefully.
🔇 Additional comments (1)
cognee/tasks/repo_processor/get_source_code_chunks.py (1)
1-11: LGTM! Well-organized imports and proper logger setup.
The imports are cleanly organized and the task-specific logger follows best practices.
Co-authored-by: coderabbitai[bot] <136622811+coderabbitai[bot]@users.noreply.github.com>
Actionable comments posted: 1
♻️ Duplicate comments (1)
cognee/tasks/repo_processor/get_source_code_chunks.py (1)
38-44: ⚠️ Potential issue: Replace recursive implementation with iterative approach.
Based on past learnings, this recursive implementation can cause stack overflow with deeply nested code.
The previous suggestion to use an iterative approach with a stack should be implemented to prevent recursion errors. See the implementation provided in the past review comments.
🧹 Nitpick comments (4)
cognee/tasks/repo_processor/get_source_code_chunks.py (4)
14-16: Add input validation and docstring.
The function is concise but could benefit from input validation and documentation.
```diff
 def _count_tokens(tokenizer: tiktoken.Encoding, source_code: str) -> int:
+    """Count the number of tokens in the source code using the provided tokenizer.
+
+    Args:
+        tokenizer: The tiktoken Encoding instance to use for tokenization
+        source_code: The source code string to tokenize
+
+    Returns:
+        int: The number of tokens in the source code
+    """
+    if not source_code:
+        return 0
     return len(tokenizer.encode(source_code))
```
18-20: Document the magic number and add input validation.
The default value of 8000 for max_subchunk_tokens should be documented and validated.
```diff
+# Maximum number of tokens per subchunk, chosen based on typical model context window sizes
+MAX_DEFAULT_SUBCHUNK_TOKENS = 8000
+
 def _get_naive_subchunk_token_counts(
-    tokenizer: tiktoken.Encoding, source_code: str, max_subchunk_tokens: int = 8000
+    tokenizer: tiktoken.Encoding, source_code: str, max_subchunk_tokens: int = MAX_DEFAULT_SUBCHUNK_TOKENS
 ) -> list[tuple[str, int]]:
     """Splits source code into subchunks of up to max_subchunk_tokens and counts tokens."""
+    if max_subchunk_tokens <= 0:
+        raise ValueError("max_subchunk_tokens must be positive")
+    if not source_code:
+        return []
```
113-119: Enhance logging and document parameters.
The function would benefit from more detailed logging and parameter documentation.
```diff
 def get_source_code_chunks_from_code_part(
     code_file_part: CodePart,
     max_tokens: int = 8192,
     overlap: float = 0.25,
     granularity: float = 0.1,
     model_name: str = "text-embedding-3-large"
 ) -> Generator[SourceCodeChunk, None, None]:
-    """Yields source code chunks from a CodePart object, with configurable token limits and overlap."""
+    """Yields source code chunks from a CodePart object, with configurable token limits and overlap.
+
+    Args:
+        code_file_part: CodePart object containing source code
+        max_tokens: Maximum tokens per chunk (default: 8192 for GPT-4)
+        overlap: Overlap ratio between chunks (default: 0.25)
+        granularity: Ratio of max_tokens to use for subchunks (default: 0.1)
+        model_name: Name of the embedding model to use for tokenization
+
+    Yields:
+        SourceCodeChunk objects containing chunks of the source code
+    """
     if not code_file_part.source_code:
-        logger.error(f"No source code in CodeFile {code_file_part.id}")
+        logger.error(f"No source code in CodeFile {code_file_part.id} of type {type(code_file_part)}")
         return
```
147-147: Track TODO for embedding model support.
The TODO comment about adding support for other embedding models should be tracked.
Would you like me to create a GitHub issue to track the implementation of a max_token mapping for different embedding models?
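As a sketch of what that mapping might look like; the exact limits are assumptions to verify against each provider's documentation, not values taken from this PR:
```python
# Hypothetical model -> max input tokens table.
EMBEDDING_MODEL_MAX_TOKENS = {
    "text-embedding-3-large": 8192,
    "text-embedding-3-small": 8192,
    "text-embedding-ada-002": 8192,
}

def max_tokens_for(model_name: str, default: int = 8192) -> int:
    # Fall back to a conservative default for unknown models.
    return EMBEDDING_MODEL_MAX_TOKENS.get(model_name, default)
```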
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (1)
- cognee/tasks/repo_processor/get_source_code_chunks.py (1 hunks)
🧰 Additional context used
📓 Learnings (1)
cognee/tasks/repo_processor/get_source_code_chunks.py (1)
Learnt from: alekszievr
PR: topoteretes/cognee#383
File: cognee/tasks/repo_processor/get_source_code_chunks.py:15-44
Timestamp: 2024-12-19T14:01:34.118Z
Learning: When large or deeply nested code leads to recursion errors in `_get_subchunk_token_counts`, an iterative approach (using a stack) or a maximum recursion depth is required to handle pathological inputs gracefully.
🔇 Additional comments (1)
cognee/tasks/repo_processor/get_source_code_chunks.py (1)
1-11: LGTM! Well-structured imports and logger setup.
The imports are well-organized and the logger setup follows best practices with a task-specific namespace.
Actionable comments posted: 0
🧹 Nitpick comments (4)
cognee/infrastructure/databases/vector/embeddings/LiteLLMEmbeddingEngine.py (4)
13-13: No issues with this blank line.
It might be just a formatting or readability adjustment.
23-28: Constructor enhancements are clear and consistent.
The parameter list is more readable. Consider adding type hints (e.g., api_key: Optional[str] = None) for the string parameters to fully align with best practices.
```diff
 def __init__(
     self,
-    model: Optional[str] = "text-embedding-3-large",
-    dimensions: Optional[int] = 3072,
-    api_key: str = None,
-    endpoint: str = None,
-    api_version: str = None,
+    model: Optional[str] = "text-embedding-3-large",
+    dimensions: Optional[int] = 3072,
+    api_key: Optional[str] = None,
+    endpoint: Optional[str] = None,
+    api_version: Optional[str] = None,
 ):
```
48-48: Superfluous blank line.
If there's no specific purpose for this empty line, consider removing it for consistency.
102-102: Ensure the logger captures the original error context as well.
Moving the logger statement before raising the exception might help preserve the original traceback. Consider enhancing the logs for better debugging.
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (3)
- cognee/infrastructure/databases/exceptions/embedding_exception.py (1 hunks)
- cognee/infrastructure/databases/vector/embeddings/LiteLLMEmbeddingEngine.py (6 hunks)
- cognee/tasks/storage/index_data_points.py (2 hunks)
🚧 Files skipped from review as they are similar to previous changes (1)
- cognee/tasks/storage/index_data_points.py
🔇 Additional comments (7)
cognee/infrastructure/databases/exceptions/embedding_exception.py (1)
1-3: Class definition looks good.
This custom exception is straightforward, well-documented, and appropriately named for embedding-specific errors.
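For context, a custom exception of this shape is typically a thin subclass; a hedged sketch (the docstring wording is illustrative):
```python
class EmbeddingException(Exception):
    """Raised when generating embeddings fails, e.g. on invalid or empty input."""
```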
cognee/infrastructure/databases/vector/embeddings/LiteLLMEmbeddingEngine.py (6)
8-8: Great job introducing the custom exception.
This import centralizes embedding error handling and improves readability and maintainability.
20-20: Explicitly declaring the mock attribute is a good practice.
Declaring the mock attribute here promotes clarity as to the data members of the class.
38-38: Smart check to ensure environment variable correctness.
Lowercasing and verifying string-based booleans is a reliable approach.
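The pattern being praised is roughly the following; the environment variable name is illustrative, not necessarily the one the engine reads:
```python
import os

# Only the explicit string "true" (case-insensitive) enables mocking;
# any other value, including unset, falls back to False.
mock_enabled = os.getenv("MOCK_EMBEDDING", "false").lower() == "true"
```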
61-64: Parameter alignment looks good.
This improves the readability of the call to the async embedding API.
100-101: Raising EmbeddingException clarifies embedding errors.
Catching related library-specific exceptions and raising a custom exception helps maintain consistent error handling across the application.
76-76: Splitting the text array is correct, but watch out for large input edge cases.
Be sure to confirm the maximum text length that can be handled here, and perhaps consider more than just splitting into two halves if extremely large.
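One hedged way to generalize the two-way split is to halve the batch recursively until each sub-batch succeeds; embed_batch stands in for the engine's actual call, and the exception type follows the module added in this PR:
```python
from cognee.infrastructure.databases.exceptions.embedding_exception import (
    EmbeddingException,
)

async def embed_with_recursive_split(embed_batch, texts: list[str]):
    """Sketch: retry by halving the batch whenever embedding fails."""
    try:
        return await embed_batch(texts)
    except EmbeddingException:
        if len(texts) <= 1:
            raise  # a single oversized text cannot be split further here
        mid = len(texts) // 2
        left = await embed_with_recursive_split(embed_batch, texts[:mid])
        right = await embed_with_recursive_split(embed_batch, texts[mid:])
        return left + right
```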
* feat: Add error handling in case user is already part of database and permission already given to group Added error handling in case permission is already given to group and user is already part of group Feature COG-656 * feat: Add user verification for accessing data Verify user has access to data before returning it Feature COG-656 * feat: Add compute search to cognee Add compute search to cognee which makes searches human readable Feature COG-656 * feat: Add simple instruction for system prompt Add simple instruction for system prompt Feature COG-656 * pass pydantic model tocognify * feat: Add unauth access error to getting data Raise unauth access error when trying to read data without access Feature COG-656 * refactor: Rename query compute to query completion Rename searching type from compute to completion Refactor COG-656 * chore: Update typo in code Update typo in string in code Chore COG-656 * Add mcp to cognee * Add simple README * Update cognee-mcp/mcpcognee/__main__.py Co-authored-by: coderabbitai[bot] <136622811+coderabbitai[bot]@users.noreply.github.com> * Create dockerhub.yml * Update get_cognify_router.py * fix: Resolve reflection issue when running cognee a second time after pruning data When running cognee a second time after pruning data some metadata doesn't get pruned. This makes cognee believe some tables exist that have been deleted Fix * fix: Add metadata reflection fix to sqlite as well Added fix when reflecting metadata to sqlite as well Fix * update * Revert "fix: Add metadata reflection fix to sqlite as well" This reverts commit 394a0b2. * COG-810 Implement a top-down dependency graph builder tool (#268) * feat: parse repo to call graph * Update/repo_processor/top_down_repo_parse.py task * fix: minor improvements * feat: file parsing jedi script optimisation --------- * Add type to DataPoint metadata (#364) * Add type to DataPoint metadata * Add missing index_fields * Use DataPoint UUID type in pgvector create_data_points * Make _metadata mandatory everywhere * Fixes * Fixes to our demo * feat: Add search by dataset for cognee Added ability to search by datasets for cognee users Feature COG-912 * feat: outsources chunking parameters to extract chunk from documents … (#289) * feat: outsources chunking parameters to extract chunk from documents task * fix: Remove backend lock from UI Removed lock that prevented using multiple datasets in cognify Fix COG-912 * COG 870 Remove duplicate edges from the code graph (#293) * feat: turn summarize_code into generator * feat: extract run_code_graph_pipeline, update the pipeline * feat: minimal code graph example * refactor: update argument * refactor: move run_code_graph_pipeline to cognify/code_graph_pipeline * refactor: indentation and whitespace nits * refactor: add deprecated use comments and warnings --------- Co-authored-by: Vasilije <[email protected]> Co-authored-by: Igor Ilic <[email protected]> Co-authored-by: Boris <[email protected]> * test: Added test for getting of documents for search Added test to verify getting documents related to datasets intended for search Test COG-912 * Structured code summarization (#375) * feat: turn summarize_code into generator * feat: extract run_code_graph_pipeline, update the pipeline * feat: minimal code graph example * refactor: update argument * refactor: move run_code_graph_pipeline to cognify/code_graph_pipeline * refactor: indentation and whitespace nits * refactor: add deprecated use comments and warnings * Structured code summarization * add missing prompt file * 
Remove summarization_model argument from summarize_code and fix typehinting * minor refactors --------- Co-authored-by: lxobr <[email protected]> Co-authored-by: Vasilije <[email protected]> Co-authored-by: Igor Ilic <[email protected]> Co-authored-by: Boris <[email protected]> * fix: Resolve issue with cognify router graph model default value Resolve issue with default value for graph model in cognify endpoint Fix * chore: Resolve typo in getting documents code Resolve typo in code chore COG-912 * Update .github/workflows/dockerhub.yml Co-authored-by: coderabbitai[bot] <136622811+coderabbitai[bot]@users.noreply.github.com> * Update .github/workflows/dockerhub.yml * Update .github/workflows/dockerhub.yml * Update .github/workflows/dockerhub.yml * Update get_cognify_router.py * fix: Resolve syntax issue with cognify router Resolve syntax issue with cognify router Fix * feat: Add ruff pre-commit hook for linting and formatting Added formatting and linting on pre-commit hook Feature COG-650 * chore: Update ruff lint options in pyproject file Update ruff lint options in pyproject file Chore * test: Add ruff linter github action Added linting check with ruff in github actions Test COG-650 * feat: deletes executor limit from get_repo_file_dependencies * feat: implements mock feature in LiteLLM engine * refactor: Remove changes to cognify router Remove changes to cognify router Refactor COG-650 * fix: fixing boolean env for github actions * test: Add test for ruff format for cognee code Test if code is formatted for cognee Test COG-650 * refactor: Rename ruff gh actions Rename ruff gh actions to be more understandable Refactor COG-650 * chore: Remove checking of ruff lint and format on push Remove checking of ruff lint and format on push Chore COG-650 * feat: Add deletion of local files when deleting data Delete local files when deleting data from cognee Feature COG-475 * fix: changes back the max workers to 12 * feat: Adds mock summary for codegraph pipeline * refacotr: Add current development status Save current development status Refactor * Fix langfuse * Fix langfuse * Fix langfuse * Add evaluation notebook * Rename eval notebook * chore: Add temporary state of development Add temp development state to branch Chore * fix: Add poetry.lock file, make langfuse mandatory Added langfuse as mandatory dependency, added poetry.lock file Fix * Fix: fixes langfuse config settings * feat: Add deletion of local files made by cognee through data endpoint Delete local files made by cognee when deleting data from database through endpoint Feature COG-475 * test: Revert changes on test_pgvector Revert changes on test_pgvector which were made to test deletion of local files Test COG-475 * chore: deletes the old test for the codegraph pipeline * test: Add test to verify deletion of local files Added test that checks local files created by cognee will be deleted and those not created by cognee won't Test COG-475 * chore: deletes unused old version of the codegraph * chore: deletes unused imports from code_graph_pipeline * Ingest non-code files * Fixing review findings * Ingest non-code files (#395) * Ingest non-code files * Fixing review findings * test: Update test regarding message Update assertion message, add veryfing of file existence * Handle retryerrors in code summary (#396) * Handle retryerrors in code summary * Log instead of print * fix: updates the acreate_structured_output * chore: Add logging to sentry when file which should exist can't be found Log to sentry that a file which should exist can't 
be found Chore COG-475 * Fix diagram * fix: refactor mcp * Add Smithery CLI installation instructions and badge * Move readme * Update README.md * Update README.md * Cog 813 source code chunks (#383) * fix: pass the list of all CodeFiles to enrichment task * feat: introduce SourceCodeChunk, update metadata * feat: get_source_code_chunks code graph pipeline task * feat: integrate get_source_code_chunks task, comment out summarize_code * Fix code summarization (#387) * feat: update data models * feat: naive parse long strings in source code * fix: get_non_py_files instead of get_non_code_files * fix: limit recursion, add comment * handle embedding empty input error (#398) * feat: robustly handle CodeFile source code * refactor: sort imports * todo: add support for other embedding models * feat: add custom logger * feat: add robustness to get_source_code_chunks Co-authored-by: coderabbitai[bot] <136622811+coderabbitai[bot]@users.noreply.github.com> * feat: improve embedding exceptions * refactor: format indents, rename module --------- Co-authored-by: alekszievr <[email protected]> Co-authored-by: coderabbitai[bot] <136622811+coderabbitai[bot]@users.noreply.github.com> * Fix diagram * Fix instructions * adding and fixing files * Update README.md * ruff format * Fix linter issues * Implement PR review * Comment out profiling * fix: add allowed extensions * fix: adhere UnstructuredDocument.read() to Document * feat: time code graph run and add mock support * Fix ollama, work on visualization * fix: Fixes faulty logging format and sets up error logging in dynamic steps example * Overcome ContextWindowExceededError by checking token count while chunking (#413) * fix: Fixes duplicated edges in cognify by limiting the recursion depth in add datapoints * Adjust AudioDocument and handle None token limit * Handle azure models as well * Add clean logging to code graph example * Remove setting envvars from arg * fix: fixes create_cognee_style_network_with_logo unit test * fix: removes accidental remained print * Get embedding engine instead of passing it. Get it from vector engine instead of direct getter. * Fix visualization * Get embedding engine instead of passing it in code chunking. 
* Fix poetry issues * chore: Update version of poetry install action * chore: Update action to trigger on pull request for any branch * chore: Remove if in github action to allow triggering on push * chore: Remove if condition to allow gh actions to trigger on push to PR * chore: Update poetry version in github actions * chore: Set fixed ubuntu version to 22.04 * chore: Update py lint to use ubuntu 22.04 * chore: update ubuntu version to 22.04 * feat: implements the first version of graph based completion in search * chore: Update python 3.9 gh action to use 3.12 instead * chore: Update formatting of utils.py * Fix poetry issues * Adjust integration tests * fix: Fixes ruff formatting * Handle circular import * fix: Resolve profiler issue with partial and recursive logger imports Resolve issue for profiler with partial and recursive logger imports * fix: Remove logger from __init__.py file * test: Test profiling on HEAD branch * test: Return profiler to base branch * Set max_tokens in config * Adjust SWE-bench script to code graph pipeline call * Adjust SWE-bench script to code graph pipeline call * fix: Add fix for accessing dictionary elements that don't exits Using get for the text key instead of direct access to handle situation if the text key doesn't exist * feat: Add ability to change graph database configuration through cognee * feat: adds pydantic types to graph layer models * feat: adds basic retriever for swe bench * Match Ruff version in config to the one in github actions * feat: implements code retreiver * Fix: fixes unit test for codepart search * Format with Ruff 0.9.0 * Fix: deleting incorrect repo path * fix: resolve issue with langfuse dependency installation when integrating cognee in different packages * version: Increase version to 0.1.21 --------- Co-authored-by: Igor Ilic <[email protected]> Co-authored-by: Vasilije <[email protected]> Co-authored-by: Igor Ilic <[email protected]> Co-authored-by: Rita Aleksziev <[email protected]> Co-authored-by: vasilije <[email protected]> Co-authored-by: coderabbitai[bot] <136622811+coderabbitai[bot]@users.noreply.github.com> Co-authored-by: lxobr <[email protected]> Co-authored-by: alekszievr <[email protected]> Co-authored-by: hajdul88 <[email protected]> Co-authored-by: Henry Mao <[email protected]>
* Revert "fix: Add metadata reflection fix to sqlite as well" This reverts commit 394a0b2. * COG-810 Implement a top-down dependency graph builder tool (#268) * feat: parse repo to call graph * Update/repo_processor/top_down_repo_parse.py task * fix: minor improvements * feat: file parsing jedi script optimisation --------- * Add type to DataPoint metadata (#364) * Add missing index_fields * Use DataPoint UUID type in pgvector create_data_points * Make _metadata mandatory everywhere * feat: Add search by dataset for cognee Added ability to search by datasets for cognee users Feature COG-912 * feat: outsources chunking parameters to extract chunk from documents … (#289) * feat: outsources chunking parameters to extract chunk from documents task * fix: Remove backend lock from UI Removed lock that prevented using multiple datasets in cognify Fix COG-912 * COG 870 Remove duplicate edges from the code graph (#293) * feat: turn summarize_code into generator * feat: extract run_code_graph_pipeline, update the pipeline * feat: minimal code graph example * refactor: update argument * refactor: move run_code_graph_pipeline to cognify/code_graph_pipeline * refactor: indentation and whitespace nits * refactor: add deprecated use comments and warnings --------- Co-authored-by: Vasilije <[email protected]> Co-authored-by: Igor Ilic <[email protected]> Co-authored-by: Boris <[email protected]> * test: Added test for getting of documents for search Added test to verify getting documents related to datasets intended for search Test COG-912 * Structured code summarization (#375) * feat: turn summarize_code into generator * feat: extract run_code_graph_pipeline, update the pipeline * feat: minimal code graph example * refactor: update argument * refactor: move run_code_graph_pipeline to cognify/code_graph_pipeline * refactor: indentation and whitespace nits * refactor: add deprecated use comments and warnings * Structured code summarization * add missing prompt file * Remove summarization_model argument from summarize_code and fix typehinting * minor refactors --------- Co-authored-by: lxobr <[email protected]> Co-authored-by: Vasilije <[email protected]> Co-authored-by: Igor Ilic <[email protected]> Co-authored-by: Boris <[email protected]> * fix: Resolve issue with cognify router graph model default value Resolve issue with default value for graph model in cognify endpoint Fix * chore: Resolve typo in getting documents code Resolve typo in code chore COG-912 * Update .github/workflows/dockerhub.yml Co-authored-by: coderabbitai[bot] <136622811+coderabbitai[bot]@users.noreply.github.com> * Update .github/workflows/dockerhub.yml * Update .github/workflows/dockerhub.yml * Update .github/workflows/dockerhub.yml * Update get_cognify_router.py * fix: Resolve syntax issue with cognify router Resolve syntax issue with cognify router Fix * feat: Add ruff pre-commit hook for linting and formatting Added formatting and linting on pre-commit hook Feature COG-650 * chore: Update ruff lint options in pyproject file Update ruff lint options in pyproject file Chore * test: Add ruff linter github action Added linting check with ruff in github actions Test COG-650 * feat: deletes executor limit from get_repo_file_dependencies * feat: implements mock feature in LiteLLM engine * refactor: Remove changes to cognify router Remove changes to cognify router Refactor COG-650 * fix: fixing boolean env for github actions * test: Add test for ruff format for cognee code Test if code is formatted for cognee Test COG-650 * refactor: 
Rename ruff gh actions Rename ruff gh actions to be more understandable Refactor COG-650 * chore: Remove checking of ruff lint and format on push Remove checking of ruff lint and format on push Chore COG-650 * feat: Add deletion of local files when deleting data Delete local files when deleting data from cognee Feature COG-475 * fix: changes back the max workers to 12 * feat: Adds mock summary for codegraph pipeline * refacotr: Add current development status Save current development status Refactor * Fix langfuse * Fix langfuse * Fix langfuse * Add evaluation notebook * Rename eval notebook * chore: Add temporary state of development Add temp development state to branch Chore * fix: Add poetry.lock file, make langfuse mandatory Added langfuse as mandatory dependency, added poetry.lock file Fix * Fix: fixes langfuse config settings * feat: Add deletion of local files made by cognee through data endpoint Delete local files made by cognee when deleting data from database through endpoint Feature COG-475 * test: Revert changes on test_pgvector Revert changes on test_pgvector which were made to test deletion of local files Test COG-475 * chore: deletes the old test for the codegraph pipeline * test: Add test to verify deletion of local files Added test that checks local files created by cognee will be deleted and those not created by cognee won't Test COG-475 * chore: deletes unused old version of the codegraph * chore: deletes unused imports from code_graph_pipeline * Ingest non-code files * Fixing review findings * Ingest non-code files (#395) * Ingest non-code files * Fixing review findings * test: Update test regarding message Update assertion message, add veryfing of file existence * Handle retryerrors in code summary (#396) * Handle retryerrors in code summary * Log instead of print * fix: updates the acreate_structured_output * chore: Add logging to sentry when file which should exist can't be found Log to sentry that a file which should exist can't be found Chore COG-475 * Fix diagram * fix: refactor mcp * Add Smithery CLI installation instructions and badge * Move readme * Update README.md * Update README.md * Cog 813 source code chunks (#383) * fix: pass the list of all CodeFiles to enrichment task * feat: introduce SourceCodeChunk, update metadata * feat: get_source_code_chunks code graph pipeline task * feat: integrate get_source_code_chunks task, comment out summarize_code * Fix code summarization (#387) * feat: update data models * feat: naive parse long strings in source code * fix: get_non_py_files instead of get_non_code_files * fix: limit recursion, add comment * handle embedding empty input error (#398) * feat: robustly handle CodeFile source code * refactor: sort imports * todo: add support for other embedding models * feat: add custom logger * feat: add robustness to get_source_code_chunks Co-authored-by: coderabbitai[bot] <136622811+coderabbitai[bot]@users.noreply.github.com> * feat: improve embedding exceptions * refactor: format indents, rename module --------- Co-authored-by: alekszievr <[email protected]> Co-authored-by: coderabbitai[bot] <136622811+coderabbitai[bot]@users.noreply.github.com> * Fix diagram * Fix diagram * Fix instructions * Fix instructions * adding and fixing files * Update README.md * ruff format * Fix linter issues * Fix linter issues * Fix linter issues * Fix linter issues * Fix linter issues * Fix linter issues * Fix linter issues * Fix linter issues * Fix linter issues * Fix linter issues * Implement PR review * Comment out profiling * Comment 
out profiling * Comment out profiling * fix: add allowed extensions * fix: adhere UnstructuredDocument.read() to Document * feat: time code graph run and add mock support * Fix ollama, work on visualization * fix: Fixes faulty logging format and sets up error logging in dynamic steps example * Overcome ContextWindowExceededError by checking token count while chunking (#413) * fix: Fixes duplicated edges in cognify by limiting the recursion depth in add datapoints * Adjust AudioDocument and handle None token limit * Handle azure models as well * Fix visualization * Fix visualization * Fix visualization * Add clean logging to code graph example * Remove setting envvars from arg * fix: fixes create_cognee_style_network_with_logo unit test * fix: removes accidental remained print * Fix visualization * Fix visualization * Fix visualization * Get embedding engine instead of passing it. Get it from vector engine instead of direct getter. * Fix visualization * Fix visualization * Fix poetry issues * Get embedding engine instead of passing it in code chunking. * Fix poetry issues * Fix poetry issues * Fix poetry issues * Fix poetry issues * Fix poetry issues * Fix poetry issues * Fix poetry issues * Fix poetry issues * chore: Update version of poetry install action * chore: Update action to trigger on pull request for any branch * chore: Remove if in github action to allow triggering on push * chore: Remove if condition to allow gh actions to trigger on push to PR * chore: Update poetry version in github actions * chore: Set fixed ubuntu version to 22.04 * chore: Update py lint to use ubuntu 22.04 * chore: update ubuntu version to 22.04 * feat: implements the first version of graph based completion in search * chore: Update python 3.9 gh action to use 3.12 instead * chore: Update formatting of utils.py * Fix poetry issues * Fix poetry issues * Fix poetry issues * Fix poetry issues * Fix poetry issues * Fix poetry issues * Fix poetry issues * Adjust integration tests * fix: Fixes ruff formatting * Handle circular import * fix: Resolve profiler issue with partial and recursive logger imports Resolve issue for profiler with partial and recursive logger imports * fix: Remove logger from __init__.py file * test: Test profiling on HEAD branch * test: Return profiler to base branch * Set max_tokens in config * Adjust SWE-bench script to code graph pipeline call * Adjust SWE-bench script to code graph pipeline call * fix: Add fix for accessing dictionary elements that don't exits Using get for the text key instead of direct access to handle situation if the text key doesn't exist * feat: Add ability to change graph database configuration through cognee * feat: adds pydantic types to graph layer models * test: Test ubuntu 24.04 * test: change all actions to ubuntu-latest * feat: adds basic retriever for swe bench * Match Ruff version in config to the one in github actions * feat: implements code retreiver * Fix: fixes unit test for codepart search * Format with Ruff 0.9.0 * Fix: deleting incorrect repo path * docs: Add LlamaIndex Cognee integration notebook Added LlamaIndex Cognee integration notebook * test: Add github action for testing llama index cognee integration notebook * fix: resolve issue with langfuse dependency installation when integrating cognee in different packages * version: Increase version to 0.1.21 * fix: update dependencies of the mcp server * Update README.md * Fix: Fixes logging setup * feat: deletes on the fly embeddings as uses edge collections * fix: Change nbformat on llama index 
integration notebook * fix: Resolve api key issue with llama index integration notebook * fix: Attempt to resolve issue with Ubuntu 24.04 segmentation fault * version: Increase version to 0.1.22 --------- Co-authored-by: vasilije <[email protected]> Co-authored-by: Igor Ilic <[email protected]> Co-authored-by: Igor Ilic <[email protected]> Co-authored-by: lxobr <[email protected]> Co-authored-by: alekszievr <[email protected]> Co-authored-by: hajdul88 <[email protected]> Co-authored-by: Vasilije <[email protected]> Co-authored-by: coderabbitai[bot] <136622811+coderabbitai[bot]@users.noreply.github.com> Co-authored-by: Rita Aleksziev <[email protected]> Co-authored-by: Henry Mao <[email protected]>
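A few of the items in the commit list above describe mechanisms concrete enough to pin down. First, "turn summarize_code into generator": instead of building and returning a full list of summaries, the task yields each one as it is produced, so downstream pipeline steps can start consuming early. A minimal sketch with the summarizer stubbed out (names are illustrative, not the real cognee API):

```python
from typing import AsyncGenerator, List


async def get_summary(source_code: str) -> str:
    # Stand-in for the real LLM call.
    return f"Summary of {len(source_code)} characters of code."


async def summarize_code(code_files: List[str]) -> AsyncGenerator[str, None]:
    # As a generator, the task streams results: each summary is yielded
    # as soon as it exists instead of materializing the whole list first.
    for source_code in code_files:
        yield await get_summary(source_code)
```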
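"Handle retry errors in code summary ... Log instead of print" (#396) points at per-file error containment: when the LLM call exhausts its retries, the failure is logged and the batch continues. A sketch assuming tenacity drives the retries; the fallback value and function names are assumptions:

```python
import logging

from tenacity import RetryError

logger = logging.getLogger(__name__)


async def summarize_with_fallback(summarize, source_code: str) -> str:
    try:
        return await summarize(source_code)
    except RetryError as error:
        # Log instead of print: the failure lands in the pipeline logs,
        # and one bad file no longer aborts the whole summarization run.
        logger.warning("Summarization retries exhausted: %s", error)
        return ""
```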
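"feat: introduce SourceCodeChunk, update metadata" is the core of PR #383: a code file's source is split into chunks that can be embedded individually, with each chunk linked to its predecessor so reading order survives in the graph. The real models live in CodeGraphEntities.py; this is only an inferred sketch with assumed field names:

```python
from typing import Optional

from pydantic import BaseModel  # the real entities extend cognee's DataPoint


class CodePart(BaseModel):
    # Hypothetical stand-in for the parent code entity.
    source_code: str


class SourceCodeChunk(BaseModel):
    # One embeddable window over a code part's source.
    source_code: str
    part_of: Optional[CodePart] = None
    # Linking each chunk to the previous one preserves reading order.
    previous_chunk: Optional["SourceCodeChunk"] = None


SourceCodeChunk.model_rebuild()  # resolve the self-reference (pydantic v2)
```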
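"handle embedding empty input error" (#398) together with "feat: improve embedding exceptions" suggests two layers of defense at the embedding boundary: sanitize empty strings before the request goes out, and wrap provider errors in one domain exception. A sketch; the engine interface and exception shape are assumptions:

```python
class EmbeddingException(Exception):
    """Raised when the embedding engine cannot embed the given input."""


async def embed_safely(engine, texts: list[str]) -> list[list[float]]:
    # Some embedding APIs reject empty strings outright, so replace
    # them with a single space before sending the request.
    texts = [text if text.strip() else " " for text in texts]
    try:
        return await engine.embed_text(texts)  # assumed engine method
    except Exception as error:
        # Re-raise under one well-known type so callers don't have to
        # catch each provider's private exception class.
        raise EmbeddingException("Failed to embed text") from error
```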
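"Overcome ContextWindowExceededError by checking token count while chunking" (#413) amounts to counting tokens as lines are packed and flushing a chunk before the limit would be crossed. A sketch using tiktoken as a stand-in tokenizer; the encoding and the 8191 default are assumptions, not cognee's actual settings:

```python
import tiktoken


def chunk_by_token_count(text: str, max_tokens: int = 8191) -> list[str]:
    """Greedily pack lines into chunks that stay under max_tokens."""
    encoder = tiktoken.get_encoding("cl100k_base")
    chunks: list[str] = []
    current_lines: list[str] = []
    current_tokens = 0

    for line in text.splitlines(keepends=True):
        line_tokens = len(encoder.encode(line))
        # Flush before the running count would exceed the limit.
        # (A single over-long line would still need its own splitting.)
        if current_lines and current_tokens + line_tokens > max_tokens:
            chunks.append("".join(current_lines))
            current_lines, current_tokens = [], 0
        current_lines.append(line)
        current_tokens += line_tokens

    if current_lines:
        chunks.append("".join(current_lines))
    return chunks
```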
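"Fixes duplicated edges in cognify by limiting the recursion depth in add data points" bounds how deep nested data points are expanded, so cyclic references stop producing the same edge more than once. A sketch of the guard over a generic nested structure (all names are mine):

```python
def collect_edges(node: dict, seen: set, depth: int = 0, max_depth: int = 5) -> list:
    """Walk nested data points, bounding recursion so cycles can't run away."""
    if depth > max_depth:
        return []
    edges = []
    for relation, child in node.get("children", {}).items():
        edge = (node["id"], relation, child["id"])
        if edge in seen:
            # An already-recorded edge is skipped: this is the dedupe.
            continue
        seen.add(edge)
        edges.append(edge)
        edges.extend(collect_edges(child, seen, depth + 1, max_depth))
    return edges
```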
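Finally, "Get embedding engine instead of passing it. Get it from vector engine instead of direct getter." describes a dependency-resolution cleanup: tasks look the engine up where it is already configured rather than threading it through arguments. Roughly, with the import path and attribute being assumptions about cognee's layout:

```python
def get_embedding_engine():
    # Resolve through the vector engine so every task shares the one
    # configured instance instead of receiving its own copy by argument.
    from cognee.infrastructure.databases.vector import get_vector_engine  # assumed path

    return get_vector_engine().embedding_engine  # assumed attribute
```

This keeps configuration in one place: swapping the embedding provider changes the vector engine's setup, and every chunking or summarization task picks up the change for free.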
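Usage note tying the sketches together: in the pipeline described by these commits, `chunk_by_token_count`-style splitting feeds `SourceCodeChunk` creation, `embed_safely`-style guards sit under the chunk embedding step, and the generator-shaped `summarize_code` with retry fallbacks runs alongside. All of this is a reconstruction from the commit messages above, not the verbatim cognee source.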
Summary by CodeRabbit

- New Features
- Bug Fixes
- Documentation
- Refactor
- Style