Code graph pipeline improvements and fixes #414
Conversation
Walkthrough

The changes introduce a comprehensive enhancement to token management across multiple modules in the Cognee project. The primary modification involves adding optional `embedding_model` and `max_tokens` parameters to document `read` methods and to the chunking utilities, so chunks can be limited by token count in addition to word count.
Actionable comments posted: 5
🔭 Outside diff range comments (5)
cognee/modules/data/processing/document_types/PdfDocument.py (1)
Fix inconsistent spacing around `=` operators in parameter assignments

In the `chunker_func` call, maintain consistent spacing around `=` operators:
- `chunk_size = chunk_size, get_text = get_text` → remove the spaces, or
- `embedding_model=embedding_model, max_tokens=max_tokens` → add spaces

🔗 Analysis chain
Line range hint `1-21`: Fix formatting issues according to Ruff standards. The file requires formatting according to the pipeline failures. Run the Ruff formatter to fix spacing around operators and other style issues.
🏁 Scripts executed
The following scripts were executed for the analysis:
```bash
#!/bin/bash
# Description: Check specific formatting issues in the file
rg -A 1 "chunk_size = " "cognee/modules/data/processing/document_types/PdfDocument.py"
```

Length of output: 226
🧰 Tools
🪛 GitHub Actions: ruff format
[error] 308 files require formatting according to Ruff standards. Run Ruff formatter to fix formatting issues.
cognee/modules/chunking/TextChunker.py (2)
Line range hint `71-81`: Improve exception handling. The bare except and print statements should be replaced with proper error handling.

```diff
- except Exception as e:
-     print(e)
+ except Exception as e:
+     logger.error(f"Failed to create DocumentChunk: {e}")
+     raise
```

🧰 Tools
🪛 GitHub Actions: ruff format
[error] 308 files require formatting according to Ruff standards. Run Ruff formatter to fix formatting issues.
Fix formatting issues in TextChunker.py
The verification confirms formatting issues that need to be addressed:
- Several lines exceed the 88-character limit (e.g., the `__init__` method signature)
- Inconsistent spacing around operators (e.g., `id = uuid5`, `self.chunk_size += chunk_data["word_count"]`)
- Inconsistent spacing after commas in parameter lists
🔗 Analysis chain
Line range hint `1-100`: Fix formatting issues. The file requires formatting according to Ruff standards.
Run the following command to fix formatting:
🏁 Scripts executed
The following scripts were executed for the analysis:
```bash
#!/bin/bash
# Description: Fix formatting issues using Ruff
ruff format cognee/modules/chunking/TextChunker.py
```

Length of output: 147

```bash
#!/bin/bash
# Check maximum line length and show lines exceeding 88 characters
awk 'length($0) > 88 { print length($0) ": " $0 }' cognee/modules/chunking/TextChunker.py

# Check for inconsistent spacing around operators and in parameter lists
grep -n "=[^ ]\|[^ ]=" cognee/modules/chunking/TextChunker.py
grep -n ", [^ ]" cognee/modules/chunking/TextChunker.py

# Check indentation consistency
grep -n "^[ ]\{1,\}[^ ]" cognee/modules/chunking/TextChunker.py
```

Length of output: 6870
🧰 Tools
🪛 GitHub Actions: ruff format
[error] 308 files require formatting according to Ruff standards. Run Ruff formatter to fix formatting issues.
cognee/modules/data/processing/document_types/UnstructuredDocument.py (1)
Fix formatting issues in method signature and TextChunker instantiation
- Line 13: Break down the long method signature into multiple lines
- Line 32: Remove extra spaces around `=` in keyword arguments and consider line wrapping
- Example fix for line 32:

```python
chunker = TextChunker(
    self,
    chunk_size=chunk_size,
    get_text=get_text,
    embedding_model=embedding_model,
    max_tokens=max_tokens,
)
```

🔗 Analysis chain
Line range hint `1-35`: Fix formatting issues. The file requires formatting according to Ruff standards.
Run the following command to fix formatting:
🏁 Scripts executed
The following scripts were executed for the analysis:
```bash
#!/bin/bash
# Description: Fix formatting issues using Ruff
ruff format cognee/modules/data/processing/document_types/UnstructuredDocument.py
```

Length of output: 178

```bash
#!/bin/bash
# Check for inconsistent indentation
ast-grep --pattern 'class $_($$$) { $$$ }'

# Check import organization
cat cognee/modules/data/processing/document_types/UnstructuredDocument.py | grep -n "^import\|^from"

# Check line lengths
awk 'length > 100 { print NR, length, $0 }' cognee/modules/data/processing/document_types/UnstructuredDocument.py
```

Length of output: 760
🧰 Tools
🪛 GitHub Actions: ruff format
[error] 308 files require formatting according to Ruff standards. Run Ruff formatter to fix formatting issues.
cognee/tasks/repo_processor/get_non_code_files.py (1)
Fix formatting issues: trailing whitespaces and long lines
The following formatting issues need to be addressed:
- Remove trailing whitespaces from lines 29-43 and 52
- Wrap lines that exceed 79 characters (lines 48, 53, and 60)
🔗 Analysis chain
Line range hint `1-70`: Fix formatting issues. The file requires formatting according to Ruff standards.
Run the following command to fix formatting:
🏁 Scripts executed
The following scripts were executed for the analysis:
```bash
#!/bin/bash
# Description: Fix formatting issues using Ruff
ruff format cognee/tasks/repo_processor/get_non_code_files.py
```

Length of output: 158

```bash
#!/bin/bash
# Check for basic formatting issues
echo "=== Checking line lengths ==="
awk 'length > 79 { print NR ":", $0 }' cognee/tasks/repo_processor/get_non_code_files.py

echo -e "\n=== Checking for trailing whitespaces ==="
grep -n '[[:space:]]$' cognee/tasks/repo_processor/get_non_code_files.py

echo -e "\n=== Checking for tabs ==="
grep -n $'\t' cognee/tasks/repo_processor/get_non_code_files.py

echo -e "\n=== Checking indentation consistency ==="
sed -n '/^[[:space:]]\+[^[:space:]]/p' cognee/tasks/repo_processor/get_non_code_files.py | uniq -w4 | wc -l
```

Length of output: 1944
🧰 Tools
🪛 GitHub Actions: ruff format
[error] 308 files require formatting according to Ruff standards. Run Ruff formatter to fix formatting issues.
🧹 Nitpick comments (10)
cognee/tasks/chunks/chunk_by_paragraph.py (1)
`29-34`: Move tokenizer initialization outside the loop for efficiency. The tokenizer is initialized within the loop, which may lead to unnecessary re-initializations and impact performance. Consider initializing the tokenizer once before the loop if `embedding_model` is provided.

Apply this diff to move the tokenizer initialization:

```diff
+ tokenizer = None
  for paragraph_id, sentence, word_count, end_type in chunk_by_sentence(data, maximum_length=paragraph_length):
      # Check if this sentence would exceed length limit
-     if embedding_model:
-         tokenizer = tiktoken.encoding_for_model(embedding_model)
+     if embedding_model and tokenizer is None:
+         tokenizer = tiktoken.encoding_for_model(embedding_model)
      if embedding_model:
          token_count = len(tokenizer.encode(sentence))
      else:
          token_count = 0
```

🧰 Tools
🪛 GitHub Actions: ruff format
[error] 308 files require formatting according to Ruff standards. Run Ruff formatter to fix formatting issues.
cognee/tasks/documents/extract_chunks_from_documents.py (2)
`1-2`: Import `Optional` from the correct module. The `Optional` type is imported but not used correctly. Ensure that `Optional` is imported from `typing`, and all type annotations use the correct types.

Apply this diff to adjust the import:

```diff
-from typing import Optional
+from typing import Optional, List
```

And update the type hint in line 7:

```diff
 async def extract_chunks_from_documents(
-    documents: list[Document],
+    documents: List[Document],
```

🧰 Tools
🪛 GitHub Actions: ruff format
[error] 308 files require formatting according to Ruff standards. Run Ruff formatter to fix formatting issues.
`6-11`: Adjust default values and type hints for new parameters. The new parameters `embedding_model` and `max_tokens` should have consistent type hints and default values matching those in other files.

Apply this diff to adjust the type hints and default values:

```diff
 async def extract_chunks_from_documents(
     documents: List[Document],
     chunk_size: int = 1024,
     chunker='text_chunker',
-    embedding_model: Optional[str] = None,
-    max_tokens: Optional[int] = None,
+    embedding_model: Optional[str] = None,
+    max_tokens: Optional[Union[int, float]] = None,
 ):
```

🧰 Tools
🪛 GitHub Actions: ruff format
[error] 308 files require formatting according to Ruff standards. Run Ruff formatter to fix formatting issues.
cognee/modules/data/processing/document_types/TextDocument.py (1)
Line range hint `12-16`: Consider making the buffer size configurable. The fixed buffer size of 1024 bytes might not be optimal for all cases. Consider making it configurable or adjusting it based on the file size.

```diff
- text = file.read(1024)
+ text = file.read(self.buffer_size)
```

Add to class initialization:

```python
def __init__(self, *args, **kwargs):
    super().__init__(*args, **kwargs)
    self.buffer_size = 4096  # Default buffer size, can be configured
```

🧰 Tools
🪛 GitHub Actions: ruff format
[error] 308 files require formatting according to Ruff standards. Run Ruff formatter to fix formatting issues.
cognee/modules/data/processing/document_types/ImageDocument.py (1)
`22-22`: Consider using a regular function instead of lambda. The lambda function could be replaced with a more explicit generator function for better readability and maintainability.

```diff
+ def _get_text(self, text):
+     yield text
+
  def read(self, chunk_size: int, chunker: str, embedding_model: Optional[str], max_tokens: Optional[int]):
      text = self.transcribe_image()
      chunker_func = ChunkerConfig.get_chunker(chunker)
-     chunker = chunker_func(self, chunk_size = chunk_size, get_text = lambda: [text], embedding_model=embedding_model, max_tokens=max_tokens)
+     chunker = chunker_func(self, chunk_size=chunk_size, get_text=lambda: self._get_text(text), embedding_model=embedding_model, max_tokens=max_tokens)
```

🧰 Tools
🪛 GitHub Actions: ruff format
[error] 308 files require formatting according to Ruff standards. Run Ruff formatter to fix formatting issues.
examples/python/code_graph_example.py (1)
`58-65`: Use more precise time measurement. For more accurate timing, consider using `time.perf_counter()` instead of `time.time()`. Also, consider adding memory usage metrics.

```diff
 if args.time:
     import time
-    start_time = time.time()
+    start_time = time.perf_counter()
     asyncio.run(main(args.repo_path, args.include_docs))
-    end_time = time.time()
+    end_time = time.perf_counter()
     print("\n" + "="*50)
     print(f"Pipeline Execution Time: {end_time - start_time:.2f} seconds")
+    import psutil
+    process = psutil.Process()
+    print(f"Memory Usage: {process.memory_info().rss / 1024 / 1024:.2f} MB")
     print("="*50 + "\n")
```

cognee/api/v1/cognify/code_graph_pipeline.py (1)
`74-74`: Consider making max_tokens configurable. The max_tokens value is hardcoded to 8192. Consider moving this to configuration to make it more maintainable and flexible.

```diff
- Task(extract_chunks_from_documents, embedding_model=embedding_engine.model, max_tokens=8192),
+ Task(extract_chunks_from_documents, embedding_model=embedding_engine.model, max_tokens=cognee_config.max_tokens),
```

🧰 Tools
🪛 GitHub Actions: ruff format
[error] 308 files require formatting according to Ruff standards. Run Ruff formatter to fix formatting issues.
cognee/modules/chunking/TextChunker.py (1)
`17-22`: Add type hints for better code clarity. Consider adding type hints for better code maintainability and IDE support.

```diff
- def __init__(self, document, get_text: callable, embedding_model: Optional[str] = None, max_tokens: Optional[int] = None, chunk_size: int = 1024):
+ def __init__(self, document: Document, get_text: Callable[[], Iterator[str]], embedding_model: Optional[str] = None, max_tokens: Optional[int] = None, chunk_size: int = 1024) -> None:
```

🧰 Tools
🪛 GitHub Actions: ruff format
[error] 308 files require formatting according to Ruff standards. Run Ruff formatter to fix formatting issues.
cognee/modules/data/processing/document_types/UnstructuredDocument.py (1)
Line range hint
13-31
: Improve file reading efficiencyThe current implementation reads the file in fixed 1024-byte chunks, which might split words. Consider using a line-based approach for text files.
while True: - text = in_memory_file.read(1024) + text = in_memory_file.readline() if len(text.strip()) == 0: break🧰 Tools
🪛 GitHub Actions: ruff format
[error] 308 files require formatting according to Ruff standards. Run Ruff formatter to fix formatting issues.
cognee/tasks/repo_processor/get_non_code_files.py (1)
`28-44`: Remove duplicate extensions and improve organization. There are several duplicate extensions in the ALLOWED_EXTENSIONS set. Also, consider organizing extensions by type for better maintainability.

```diff
 ALLOWED_EXTENSIONS = {
-    '.txt', '.md', '.csv', '.json', '.xml', '.yaml', '.yml', '.html',
-    '.css', '.js', '.ts', '.jsx', '.tsx', '.sql', '.log', '.ini',
-    '.toml', '.properties', '.sh', '.bash', '.dockerfile', '.gitignore',
-    '.gitattributes', '.makefile', '.pyproject', '.requirements',
-    '.env', '.pdf', '.doc', '.docx', '.dot', '.dotx', '.rtf',
-    '.wps', '.wpd', '.odt', '.ott', '.ottx', '.txt', '.wp',
-    '.sdw', '.sdx', '.docm', '.dotm',
-    # Additional extensions for other programming languages
-    '.java', '.c', '.cpp', '.h', '.cs', '.go', '.php', '.rb',
-    '.swift', '.pl', '.lua', '.rs', '.scala', '.kt', '.sh',
-    '.sql', '.v', '.asm', '.pas', '.d', '.ml', '.clj', '.cljs',
-    '.erl', '.ex', '.exs', '.f', '.fs', '.r', '.pyi',
-    '.pdb', '.ipynb', '.rmd', '.cabal', '.hs', '.nim',
-    '.vhdl', '.verilog', '.svelte', '.html', '.css', '.scss',
-    '.less', '.json5', '.yaml', '.yml'
+    # Documentation and text files
+    '.txt', '.md', '.log', '.rtf',
+    # Configuration files
+    '.json', '.json5', '.yaml', '.yml', '.toml', '.ini', '.properties', '.env',
+    # Web files
+    '.html', '.css', '.scss', '.less', '.js', '.ts', '.jsx', '.tsx', '.svelte',
+    # Document files
+    '.pdf', '.doc', '.docx', '.dot', '.dotx', '.odt', '.ott', '.ottx', '.wp',
+    '.wps', '.wpd', '.sdw', '.sdx', '.docm', '.dotm',
+    # Programming languages
+    '.java', '.c', '.cpp', '.h', '.cs', '.go', '.php', '.rb', '.swift',
+    '.pl', '.lua', '.rs', '.scala', '.kt', '.sql', '.v', '.asm', '.pas',
+    '.d', '.ml', '.clj', '.cljs', '.erl', '.ex', '.exs', '.f', '.fs',
+    '.r', '.pyi', '.hs', '.nim', '.vhdl', '.verilog',
+    # Notebooks and special files
+    '.ipynb', '.rmd', '.cabal',
+    # Build and system files
+    '.sh', '.bash', '.dockerfile', '.gitignore', '.gitattributes',
+    '.makefile', '.pyproject', '.requirements'
 }
```

🧰 Tools
🪛 GitHub Actions: ruff format
[error] 308 files require formatting according to Ruff standards. Run Ruff formatter to fix formatting issues.
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (11)
- cognee/api/v1/cognify/code_graph_pipeline.py (1 hunks)
- cognee/modules/chunking/TextChunker.py (2 hunks)
- cognee/modules/data/processing/document_types/Document.py (2 hunks)
- cognee/modules/data/processing/document_types/ImageDocument.py (2 hunks)
- cognee/modules/data/processing/document_types/PdfDocument.py (2 hunks)
- cognee/modules/data/processing/document_types/TextDocument.py (2 hunks)
- cognee/modules/data/processing/document_types/UnstructuredDocument.py (2 hunks)
- cognee/tasks/chunks/chunk_by_paragraph.py (5 hunks)
- cognee/tasks/documents/extract_chunks_from_documents.py (1 hunks)
- cognee/tasks/repo_processor/get_non_code_files.py (1 hunks)
- examples/python/code_graph_example.py (1 hunks)
🧰 Additional context used
🪛 GitHub Actions: ruff format
cognee/tasks/repo_processor/get_non_code_files.py
[error] 308 files require formatting according to Ruff standards. Run Ruff formatter to fix formatting issues.
cognee/api/v1/cognify/code_graph_pipeline.py
[error] 308 files require formatting according to Ruff standards. Run Ruff formatter to fix formatting issues.
cognee/modules/data/processing/document_types/ImageDocument.py
[error] 308 files require formatting according to Ruff standards. Run Ruff formatter to fix formatting issues.
cognee/modules/data/processing/document_types/Document.py
[error] 308 files require formatting according to Ruff standards. Run Ruff formatter to fix formatting issues.
cognee/modules/data/processing/document_types/PdfDocument.py
[error] 308 files require formatting according to Ruff standards. Run Ruff formatter to fix formatting issues.
cognee/modules/data/processing/document_types/TextDocument.py
[error] 308 files require formatting according to Ruff standards. Run Ruff formatter to fix formatting issues.
cognee/modules/chunking/TextChunker.py
[error] 308 files require formatting according to Ruff standards. Run Ruff formatter to fix formatting issues.
cognee/tasks/chunks/chunk_by_paragraph.py
[error] 308 files require formatting according to Ruff standards. Run Ruff formatter to fix formatting issues.
cognee/tasks/documents/extract_chunks_from_documents.py
[error] 308 files require formatting according to Ruff standards. Run Ruff formatter to fix formatting issues.
cognee/modules/data/processing/document_types/UnstructuredDocument.py
[error] 308 files require formatting according to Ruff standards. Run Ruff formatter to fix formatting issues.
🪛 GitHub Actions: test | multimedia example
cognee/tasks/documents/extract_chunks_from_documents.py
[error] 14-14: TypeError: AudioDocument.read() method does not accept 'embedding_model' as a parameter
🪛 GitHub Actions: test | multimedia notebook
cognee/tasks/documents/extract_chunks_from_documents.py
[error] 14-14: TypeError: AudioDocument.read() method does not accept 'embedding_model' as a parameter
🔇 Additional comments (6)
cognee/tasks/chunks/chunk_by_paragraph.py (1)
29-35
: 🛠️ Refactor suggestion: Handle cases where `embedding_model` is `None` but `max_tokens` is provided.

If `max_tokens` is provided without an `embedding_model`, the token count will always be zero, rendering the `max_tokens` parameter ineffective. Consider raising an error or defaulting to a standard tokenizer if `embedding_model` is `None` but `max_tokens` is specified.

Do you want me to implement a default tokenizer or raise an error when `embedding_model` is `None` but `max_tokens` is provided?

🧰 Tools
🪛 GitHub Actions: ruff format
[error] 308 files require formatting according to Ruff standards. Run Ruff formatter to fix formatting issues.
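For reference, a minimal sketch of the default-tokenizer option (the helper name and the `cl100k_base` fallback are assumptions for illustration, not the project's confirmed choice):

```python
from typing import Optional

import tiktoken


def get_tokenizer(embedding_model: Optional[str], max_tokens: Optional[int]):
    # No token budget given: token counting is unnecessary, skip tokenization.
    if max_tokens is None:
        return None
    if embedding_model is None:
        # Assumed fallback: a generic encoding so max_tokens still takes effect.
        return tiktoken.get_encoding("cl100k_base")
    # Strip provider prefixes such as "azure/" before the tiktoken lookup.
    return tiktoken.encoding_for_model(embedding_model.split("/")[-1])
```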
cognee/modules/data/processing/document_types/PdfDocument.py (1)
`12-12`: LGTM! Method signature update is consistent with interface changes. The addition of `embedding_model` and `max_tokens` parameters aligns with the PR objectives for improving the code graph pipeline. The implementation correctly passes these parameters to the chunker.

Also applies to: `21-21`
🧰 Tools
🪛 GitHub Actions: ruff format
[error] 308 files require formatting according to Ruff standards. Run Ruff formatter to fix formatting issues.
cognee/modules/data/processing/document_types/TextDocument.py (1)
`23-23`: Fix formatting issues with operator spacing. Remove spaces around the `=` operator in keyword arguments according to Ruff standards.

✅ Verification successful

Confirmed: Remove spaces around `=` in keyword arguments. The line shows inconsistent formatting where some keyword arguments have spaces around `=` while others don't. According to Ruff standards, there should be no spaces around `=` in keyword arguments.

🏁 Scripts executed

The following scripts were executed for the analysis:

```bash
#!/bin/bash
# Description: Check specific formatting issues in the file
rg -A 1 "chunk_size = " "cognee/modules/data/processing/document_types/TextDocument.py"
```

Length of output: 227
🧰 Tools
🪛 GitHub Actions: ruff format
[error] 308 files require formatting according to Ruff standards. Run Ruff formatter to fix formatting issues.
examples/python/code_graph_example.py (2)
`11-43`: LGTM! Well-structured argument parsing. The separation of argument parsing into a dedicated function improves code organization and maintainability. The type conversion using lambda functions is a clean approach.
`26-30`: Review default values for mock flags. Setting mock flags to `True` by default might not be suitable for production use. Consider setting these to `False` by default to prevent accidental use of mocks in production.

```diff
- default=True,
+ default=False,
```

Also applies to: `32-36`
cognee/api/v1/cognify/code_graph_pipeline.py (1)
Line range hint `1-85`: Fix formatting issues. The file requires formatting according to Ruff standards.
Run the following command to fix formatting:
✅ Verification successful
Fix formatting issues in code_graph_pipeline.py
The review comment is correct. The file has several formatting issues:
- Replace backslash line continuations with parentheses (lines 6, 17, 22)
- Remove trailing whitespace (line 82)
- Fix long lines exceeding line length limit (lines 45, 47, 63, 74, 75)
- Organize imports according to PEP 8
🏁 Scripts executed
The following scripts were executed for the analysis:
```bash
#!/bin/bash
# Description: Fix formatting issues using Ruff
ruff format cognee/api/v1/cognify/code_graph_pipeline.py
```

Length of output: 153

```bash
#!/bin/bash
# Check for Python style configuration files
fd -e toml -e cfg -e ini -e yapf . -x cat {}

# Check import formatting and line lengths
rg -n "^.*\\\\.*$|^.{89,}$|^import\s+|^from\s+" "cognee/api/v1/cognify/code_graph_pipeline.py"

# Look for inconsistent spacing
rg -n "=\s{2,}|\s{2,}=|\s+$" "cognee/api/v1/cognify/code_graph_pipeline.py"
```

Length of output: 14027
🧰 Tools
🪛 GitHub Actions: ruff format
[error] 308 files require formatting according to Ruff standards. Run Ruff formatter to fix formatting issues.
```diff
@@ -13,5 +14,5 @@ class Document(DataPoint):
         "type": "Document"
     }

-    def read(self, chunk_size: int, chunker = str) -> str:
+    def read(self, chunk_size: int, embedding_model: Optional[str], max_tokens: Optional[int], chunker = str) -> str:
```
Update default values for new parameters in `read` method

The new parameters `embedding_model` and `max_tokens` in the `read` method lack default values, which may cause issues in subclasses that do not override these parameters. Set default values to `None` to maintain compatibility.
Apply this diff to set default values:
```diff
 def read(
     self,
     chunk_size: int,
-    embedding_model: Optional[str],
-    max_tokens: Optional[int],
+    embedding_model: Optional[str] = None,
+    max_tokens: Optional[int] = None,
     chunker=str,
 ) -> str:
     pass
```
📝 Committable suggestion
‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.
```diff
-def read(self, chunk_size: int, embedding_model: Optional[str], max_tokens: Optional[int], chunker = str) -> str:
+def read(
+    self,
+    chunk_size: int,
+    embedding_model: Optional[str] = None,
+    max_tokens: Optional[int] = None,
+    chunker=str
+) -> str:
```
🧰 Tools
🪛 GitHub Actions: ruff format
[error] 308 files require formatting according to Ruff standards. Run Ruff formatter to fix formatting issues.
💡 Codebase verification
Update missing read methods in ImageDocument and AudioDocument

The following classes need to implement the `read` method with the updated signature `read(self, chunk_size: int, embedding_model: Optional[str], max_tokens: Optional[int], chunker = str) -> str`:

- cognee/modules/data/processing/document_types/ImageDocument.py
- cognee/modules/data/processing/document_types/AudioDocument.py
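A rough sketch of what the updated override could look like (the transcription helper name and the `yield` usage are assumptions modeled on the other document types in this PR; imports of `Document` and `ChunkerConfig` are omitted):

```python
from typing import Optional

class AudioDocument(Document):
    def read(self, chunk_size: int, chunker: str,
             embedding_model: Optional[str] = None,
             max_tokens: Optional[int] = None):
        # Transcribe the audio first (helper name assumed), then forward the
        # new token-budget parameters to the configured chunker.
        text = self.create_transcript()
        chunker_func = ChunkerConfig.get_chunker(chunker)
        chunker = chunker_func(self, chunk_size=chunk_size, get_text=lambda: [text],
                               embedding_model=embedding_model, max_tokens=max_tokens)
        yield from chunker.read()
```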
🔗 Analysis chain
Ensure all subclasses override the updated `read` method

The `read` method signature has changed. Ensure that all subclasses of `Document`, such as `AudioDocument`, `VideoDocument`, etc., override the `read` method with the new parameters to prevent `TypeError`.

The pipeline failure indicates that `AudioDocument.read()` does not accept `embedding_model` as a parameter. Please update all subclasses accordingly.
Run the following script to identify classes that need updating:
🏁 Scripts executed
The following scripts were executed for the analysis:
```bash
#!/bin/bash
# Description: Find all subclasses of Document that need to update the `read` method.

# Find files containing classes that inherit from Document but do not have updated `read` methods
fd --type f --extension py '.*' | xargs grep -l 'class .*Document(Document):' | xargs grep -L 'def read(self, chunk_size'

# Alternatively, use ast-grep to find method definitions
ast-grep --pattern 'class $CLASS_NAME(Document): {$$$ def read(self, chunk_size: int, $$$): $$$ }' --ignore 'test_*'
```

Length of output: 490
```bash
#!/bin/bash
# Find Document subclasses
echo "=== Document subclasses ==="
ast-grep --pattern 'class $NAME(Document)'

echo -e "\n=== Read method implementations ==="
# Find read method implementations
ast-grep --pattern 'def read(self, $$$)'

echo -e "\n=== Backup search with ripgrep ==="
# Backup search with ripgrep
rg "class.*\(Document\)" -A 5
```

Length of output: 3351
🧰 Tools
🪛 GitHub Actions: ruff format
[error] 308 files require formatting according to Ruff standards. Run Ruff formatter to fix formatting issues.
For now it doesn't work with our Azure LLM settings
parser.add_argument(
I would use environmental variables for the parameters instead of arguments. Also, overwriting environmental variables is not a good practice and can later cause misunderstanding.
We already use envs, and I think they are more appropriate/easier to handle in a production environment as we don't always have to structure the arguments + we don't mix envs and args.
@borisarzentar What do you think?
if args.mock_embedding:
    os.environ["MOCK_EMBEDDING"] = "true"
This I wouldn't do. It can cause misunderstanding later, we should either use envs or args.
if args.mock_code_summary:
    os.environ["MOCK_CODE_SUMMARY"] = "true"
Same here
Actionable comments posted: 0
🔭 Outside diff range comments (1)
cognee/tasks/chunks/chunk_by_paragraph.py (1)
Fix trailing whitespaces and long lines
The code has formatting issues that need to be addressed:
- Remove trailing whitespaces from function parameters and empty lines
- Break down long lines (>100 characters):
  - The `chunk_by_sentence` call
  - The condition check combining `word_count` and `token_count`
🔗 Analysis chain
Line range hint `1-96`: Fix formatting issues to pass pipeline checks. The code requires formatting according to Ruff standards. Run `ruff format` to fix the formatting issues.

🏁 Scripts executed
The following scripts were executed for the analysis:
```bash
#!/bin/bash
# Description: Verify formatting issues
ruff format --diff cognee/tasks/chunks/chunk_by_paragraph.py
```

Length of output: 157

```bash
#!/bin/bash
# Check for tabs and trailing whitespaces
echo "=== Checking for tabs ==="
grep -P '\t' cognee/tasks/chunks/chunk_by_paragraph.py || echo "No tabs found"

echo -e "\n=== Checking for trailing whitespaces ==="
grep -P ' $' cognee/tasks/chunks/chunk_by_paragraph.py || echo "No trailing whitespaces found"

echo -e "\n=== Checking for lines longer than 100 characters ==="
awk 'length($0) > 100' cognee/tasks/chunks/chunk_by_paragraph.py || echo "No long lines found"

echo -e "\n=== Checking for inconsistent indentation ==="
grep -P '^( *)[^ ]' cognee/tasks/chunks/chunk_by_paragraph.py | grep -Pv '^( {4})*[^ ]' || echo "No inconsistent indentation found"
```

Length of output: 1260
🧰 Tools
🪛 GitHub Actions: ruff format
[error] 308 files require reformatting using Ruff formatter. Run 'ruff format' to fix formatting issues.
♻️ Duplicate comments (1)
cognee/tasks/chunks/chunk_by_paragraph.py (1)
`39-39`: ⚠️ Potential issue: Ensure `max_tokens` is not `None` before comparison.

This is a duplicate of a previous review comment. The condition needs to be fixed to prevent a potential `TypeError`.

```diff
- if current_word_count > 0 and (current_word_count + word_count > paragraph_length or current_token_count + token_count > max_tokens):
+ if current_word_count > 0 and (
+     current_word_count + word_count > paragraph_length or
+     (max_tokens is not None and current_token_count + token_count > max_tokens)
+ ):
```

🧰 Tools
🪛 GitHub Actions: ruff format
[error] 308 files require reformatting using Ruff formatter. Run 'ruff format' to fix formatting issues.
🧹 Nitpick comments (3)
cognee/modules/data/processing/document_types/AudioDocument.py (1)
`22-22`: Improve code readability with proper formatting. The line is quite long and could benefit from better formatting. Consider breaking it into multiple lines for better readability.

```diff
- chunker = chunker_func(self, chunk_size = chunk_size, get_text = lambda: [text], embedding_model=embedding_model, max_tokens=max_tokens)
+ chunker = chunker_func(
+     self,
+     chunk_size=chunk_size,
+     get_text=lambda: [text],
+     embedding_model=embedding_model,
+     max_tokens=max_tokens
+ )
```

🧰 Tools
🪛 GitHub Actions: ruff format
[error] 308 files require reformatting using Ruff formatter. Run 'ruff format' to fix formatting issues.
cognee/tasks/chunks/chunk_by_paragraph.py (2)
`31-37`: Consider caching the tokenizer instance. Creating a new tokenizer for each sentence could impact performance. Consider creating it once at the beginning of the function when `embedding_model` is provided.

```diff
 def chunk_by_paragraph(...):
     current_token_count = 0
     if not max_tokens:
         max_tokens = float("inf")
+    tokenizer = None
+    if embedding_model:
+        if embedding_model.startswith("azure/"):
+            embedding_model = embedding_model.split("/")[-1]
+        tokenizer = tiktoken.encoding_for_model(embedding_model)
     for paragraph_id, sentence, word_count, end_type in chunk_by_sentence(...):
-        if embedding_model:
-            if embedding_model.startswith("azure/"):
-                embedding_model = embedding_model.split("/")[-1]
-            tokenizer = tiktoken.encoding_for_model(embedding_model)
-            token_count = len(tokenizer.encode(sentence))
-        else:
-            token_count = 0
+        token_count = len(tokenizer.encode(sentence)) if tokenizer else 0
```

🧰 Tools
🪛 GitHub Actions: ruff format
[error] 308 files require reformatting using Ruff formatter. Run 'ruff format' to fix formatting issues.
`32-33`: Make Azure model name extraction more robust. The current Azure model name extraction assumes a simple path format. Consider adding validation or using a more robust parsing method.

```diff
- if embedding_model.startswith("azure/"):
-     embedding_model = embedding_model.split("/")[-1]
+ if embedding_model.startswith("azure/"):
+     parts = embedding_model.split("/")
+     if len(parts) < 2:
+         raise ValueError("Invalid Azure model format. Expected 'azure/model_name'")
+     embedding_model = parts[-1]
```

🧰 Tools
🪛 GitHub Actions: ruff format
[error] 308 files require reformatting using Ruff formatter. Run 'ruff format' to fix formatting issues.
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (2)
- cognee/modules/data/processing/document_types/AudioDocument.py (2 hunks)
- cognee/tasks/chunks/chunk_by_paragraph.py (5 hunks)
🧰 Additional context used
🪛 GitHub Actions: ruff format
cognee/modules/data/processing/document_types/AudioDocument.py
[error] 308 files require reformatting using Ruff formatter. Run 'ruff format' to fix formatting issues.
cognee/tasks/chunks/chunk_by_paragraph.py
[error] 308 files require reformatting using Ruff formatter. Run 'ruff format' to fix formatting issues.
⏰ Context from checks skipped due to timeout of 90000ms (6)
- GitHub Check: run_dynamic_steps_example_test / test
- GitHub Check: run_multimedia_example_test / test
- GitHub Check: run_simple_example_test / test
- GitHub Check: run_notebook_test / test
- GitHub Check: run_notebook_test / test
- GitHub Check: run_notebook_test / test
🔇 Additional comments (6)
cognee/modules/data/processing/document_types/AudioDocument.py (3)
`1-1`: LGTM! Clean import addition. The `Optional` type import is appropriately added to support the new type hints.
🧰 Tools
🪛 GitHub Actions: ruff format
[error] 308 files require reformatting using Ruff formatter. Run 'ruff format' to fix formatting issues.
Line range hint `1-24`: Format code using ruff. The CI pipeline indicates formatting issues. Please run `ruff format` to fix formatting issues across the codebase.

🧰 Tools
🪛 GitHub Actions: ruff format
[error] 308 files require reformatting using Ruff formatter. Run 'ruff format' to fix formatting issues.
`16-16`: Verify Document interface implementation across all document types. The `read` method signature has been updated with new optional parameters. Let's verify this change is consistent across all document types implementing the Document interface.
✅ Verification successful
Interface implementation verified across all document types
All document types (TextDocument, PdfDocument, ImageDocument, AudioDocument) implement the read method with consistent signatures, including the new optional parameters.
🏁 Scripts executed
The following scripts were executed for the analysis:
```bash
#!/bin/bash
# Search for all Document implementations and their read method signatures
ast-grep --pattern 'class $CLASSNAME(Document): $$$ def read($$$): $$$'
```

Length of output: 7077
🧰 Tools
🪛 GitHub Actions: ruff format
[error] 308 files require reformatting using Ruff formatter. Run 'ruff format' to fix formatting issues.
cognee/tasks/chunks/chunk_by_paragraph.py (3)
`1-15`: LGTM! Well-structured function signature with proper type hints. The updated imports and function signature are well-organized and properly typed. The new parameters `embedding_model` and `max_tokens` are correctly marked as optional.

🧰 Tools
🪛 GitHub Actions: ruff format
[error] 308 files require reformatting using Ruff formatter. Run 'ruff format' to fix formatting issues.
`25-27`: LGTM! Good initialization of token counting. The initialization of `current_token_count` and the default handling of `max_tokens` is well implemented.

🧰 Tools
🪛 GitHub Actions: ruff format
[error] 308 files require reformatting using Ruff formatter. Run 'ruff format' to fix formatting issues.
Line range hint `41-96`: LGTM! Consistent token count tracking in chunk dictionaries. The token count field is consistently added to all chunk dictionary creations throughout the function.

🧰 Tools
🧰 Tools
🪛 GitHub Actions: ruff format
[error] 308 files require reformatting using Ruff formatter. Run 'ruff format' to fix formatting issues.
```diff
@@ -71,7 +71,7 @@ async def run_code_graph_pipeline(repo_path, include_docs=True):
         Task(ingest_data_with_metadata, dataset_name="repo_docs", user=user),
         Task(get_data_list_for_user, dataset_name="repo_docs", user=user),
         Task(classify_documents),
-        Task(extract_chunks_from_documents),
+        Task(extract_chunks_from_documents, embedding_model=embedding_engine.model, max_tokens=8192),
```
As the embedding_engine shouldn't change in code and should only be changed through environment configuration, I think we should remove it as a parameter to functions and instead get it dynamically with the getter when needed.
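A minimal sketch of that approach (the import path is assumed from this PR's `get_source_code_chunks` changes):

```python
# Resolve the embedding model from configuration at call time instead of
# threading it through task parameters.
from cognee.infrastructure.databases.vector import get_vector_engine

def current_embedding_model() -> str:
    vector_engine = get_vector_engine()
    return vector_engine.embedding_engine.model
```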
Actionable comments posted: 0
🧹 Nitpick comments (2)
examples/python/code_graph_example.py (2)
`23-24`: Refactor duplicate type conversion logic. The boolean conversion lambda function is repeated across multiple arguments. Consider extracting it into a helper function for better maintainability.

```diff
+def str_to_bool(value: str) -> bool:
+    return value.lower() in ("true", "1")
+
 def parse_args():
     parser = argparse.ArgumentParser()
     # ... other arguments ...
     parser.add_argument(
         "--include_docs",
-        type=lambda x: x.lower() in ("true", "1"),
+        type=str_to_bool,
         default=True,
         help="Whether or not to process non-code files"
     )
```

Also applies to: `29-30`, `35-36`, `41-42`
`62-69`: Refactor timing logic into a reusable component. The timing logic could be encapsulated in a decorator or context manager for better reusability and cleaner code.

```python
from contextlib import contextmanager
import time

@contextmanager
def timer(description: str):
    start_time = time.time()
    yield
    end_time = time.time()
    duration = end_time - start_time
    separator = "=" * 50
    print(f"\n{separator}")
    print(f"{description}: {duration:.2f} seconds")
    print(f"{separator}\n")

# Usage in main:
if args.time:
    with timer("Pipeline Execution Time"):
        asyncio.run(main(args.repo_path, args.include_docs))
else:
    asyncio.run(main(args.repo_path, args.include_docs))
```
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (1)
examples/python/code_graph_example.py
(1 hunks)
⏰ Context from checks skipped due to timeout of 90000ms (5)
- GitHub Check: run_notebook_test / test
- GitHub Check: run_notebook_test / test
- GitHub Check: run_multimedia_example_test / test
- GitHub Check: run_notebook_test / test
- GitHub Check: profiler
🔇 Additional comments (4)
examples/python/code_graph_example.py (4)
`1-11`: LGTM! Clean imports and well-structured main function. The imports are organized logically, and the main function follows good async/await patterns.
`13-45`: Consider using environment variables consistently instead of command-line arguments. Previous reviews have indicated a preference for using environment variables in production environments. Consider refactoring to use environment variables exclusively to avoid mixing configuration methods.
`29-31`: Review default values for mock flags. The mock flags default to `True`, which might not be suitable for production use. Consider defaulting to `False` to prevent accidental mocking in production environments.

Also applies to: `35-37`
`54-60`: Avoid overwriting environment variables at runtime. Setting environment variables at runtime based on command-line arguments is not recommended as it:
- Can cause side effects in other parts of the application
- Makes the configuration flow harder to understand
- May lead to race conditions in concurrent environments
Consider reading these values from environment variables at startup instead.
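A minimal sketch of the startup-time alternative (the variable names come from this diff; the `false` defaults are an assumption):

```python
import os

def str_to_bool(value: str) -> bool:
    return value.lower() in ("true", "1")

# Read configuration once at startup; never mutate os.environ afterwards.
MOCK_EMBEDDING = str_to_bool(os.getenv("MOCK_EMBEDDING", "false"))
MOCK_CODE_SUMMARY = str_to_bool(os.getenv("MOCK_CODE_SUMMARY", "false"))
```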
Actionable comments posted: 0
🧹 Nitpick comments (2)
examples/python/code_graph_example.py (2)
`40-42`: Move the time import to the top of the file. Conditional imports can make code harder to maintain and understand. It's better to import all required modules at the top of the file.

```diff
 import argparse
 import asyncio
 import logging
+import time

 from cognee.api.v1.cognify.code_graph_pipeline import run_code_graph_pipeline
 from cognee.shared.utils import setup_logging
```
`40-47`: Consider extracting timing logic into a decorator or utility function. The timing logic could be made more reusable by extracting it into a decorator or utility function. This would make it easier to time other functions in the future and reduce code duplication.

```python
def time_execution(func):
    async def wrapper(*args, **kwargs):
        start_time = time.time()
        result = await func(*args, **kwargs)
        end_time = time.time()
        print("\n" + "="*50)
        print(f"Pipeline Execution Time: {end_time - start_time:.2f} seconds")
        print("="*50 + "\n")
        return result
    return wrapper
```
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (1)
examples/python/code_graph_example.py
(1 hunks)
⏰ Context from checks skipped due to timeout of 90000ms (6)
- GitHub Check: run_notebook_test / test
- GitHub Check: run_notebook_test / test
- GitHub Check: run_notebook_test / test
- GitHub Check: run_multimedia_example_test / test
- GitHub Check: run_simple_example_test / test
- GitHub Check: profiler
🔇 Additional comments (2)
examples/python/code_graph_example.py (2)
`3-11`: LGTM! Clean imports and well-structured main function. The addition of the logging import and the async main function implementation look good.
`13-33`: Consider using environment variables instead of command-line arguments. Based on the team's established practices and previous discussions, environment variables are preferred over command-line arguments in production environments. This makes configuration more consistent and easier to manage.

Consider refactoring to use environment variables:

```diff
-def parse_args():
-    parser = argparse.ArgumentParser()
-    parser.add_argument(
-        "--repo_path",
-        type=str,
-        required=True,
-        help="Path to the repository"
-    )
-    parser.add_argument(
-        "--include_docs",
-        type=lambda x: x.lower() in ("true", "1"),
-        default=True,
-        help="Whether or not to process non-code files"
-    )
-    parser.add_argument(
-        "--time",
-        type=lambda x: x.lower() in ("true", "1"),
-        default=True,
-        help="Whether or not to time the pipeline run"
-    )
-    return parser.parse_args()
+def get_config():
+    return {
+        'repo_path': os.environ['REPO_PATH'],
+        'include_docs': os.getenv('INCLUDE_DOCS', 'true').lower() in ('true', '1'),
+        'time': os.getenv('TIME_PIPELINE', 'true').lower() in ('true', '1')
+    }
```
Actionable comments posted: 0
🔭 Outside diff range comments (2)
cognee/tasks/repo_processor/get_source_code_chunks.py (1)
Fix formatting issues in the file
The following formatting issues were found:
- Multiple lines exceed the 88-character limit (e.g., line 81: `_get_subchunk_token_counts(tokenizer, subchunk, max_subchunk_tokens, depth=depth + 1, max_depth=max_depth)`)
- Several instances of multiple consecutive blank lines (e.g., lines 13-14, 17-18, 37-38)

Run a formatter like `black` or `ruff format` to automatically fix these issues.

🔗 Analysis chain
Line range hint `1-167`: Fix formatting issues reported by Ruff. The CI pipeline reports that this file requires formatting. Please run `ruff format` to fix the formatting issues.

🏁 Scripts executed
The following scripts were executed for the analysis:
```bash
#!/bin/bash
# Verify formatting issues in the file
ruff format --diff cognee/tasks/repo_processor/get_source_code_chunks.py
```

Length of output: 169

```bash
#!/bin/bash
# Check maximum line length and show lines exceeding 88 characters (default black limit)
echo "Lines exceeding 88 characters:"
awk 'length > 88 { print NR ":" length ":" $0 }' cognee/tasks/repo_processor/get_source_code_chunks.py

# Check for inconsistent indentation (not divisible by 4)
echo -e "\nLines with indentation not divisible by 4:"
awk '/^[[:space:]]/ { if (length($0) - length(ltrim($0)) % 4 != 0) print NR ":" $0 }' cognee/tasks/repo_processor/get_source_code_chunks.py

# Show import statements order
echo -e "\nImport statements:"
grep -n "^import\|^from" cognee/tasks/repo_processor/get_source_code_chunks.py

# Check for multiple consecutive blank lines
echo -e "\nMultiple consecutive blank lines:"
awk '/^$/ { blank++ } !/^$/ { if (blank > 1) printf "Lines %d-%d: %d consecutive blank lines\n", NR-blank, NR-1, blank; blank=0 }' cognee/tasks/repo_processor/get_source_code_chunks.py
```

Length of output: 2733
🧰 Tools
🪛 GitHub Actions: ruff format
[error] Multiple Python files (308 files) require formatting with Ruff. Run 'ruff format' to fix formatting issues.
cognee/modules/chunking/TextChunker.py (1)
Line range hint `71-78`: Improve error handling in chunk generation. The current error handling simply prints the exception and continues. Consider:
- Using specific exception types instead of generic Exception
- Proper logging instead of print statements
- Deciding whether to continue or raise based on error severity
```diff
- except Exception as e:
-     print(e)
+ except ValueError as e:
+     logger.error(f"Failed to generate chunk: {e}")
+     raise
```

🧰 Tools
🪛 GitHub Actions: ruff format
[error] Multiple Python files (308 files) require formatting with Ruff. Run 'ruff format' to fix formatting issues.
🧹 Nitpick comments (5)
cognee/tasks/repo_processor/get_source_code_chunks.py (3)
`125-127`: Add error handling for model name extraction. While the dynamic model name extraction is a good improvement, consider adding error handling for cases where the model name format might be unexpected.

```diff
 vector_engine = get_vector_engine()
 embedding_model = vector_engine.embedding_engine.model
-model_name = embedding_model.split("/")[-1]
+try:
+    model_name = embedding_model.split("/")[-1]
+    tokenizer = tiktoken.encoding_for_model(model_name)
+except (IndexError, ValueError) as e:
+    logger.error(f"Failed to extract or validate model name from {embedding_model}: {e}")
+    raise ValueError(f"Invalid embedding model format: {embedding_model}")
-tokenizer = tiktoken.encoding_for_model(model_name)
```

🧰 Tools
🪛 GitHub Actions: ruff format
[error] Multiple Python files (308 files) require formatting with Ruff. Run 'ruff format' to fix formatting issues.
`147-147`: Remove or update the outdated TODO comment. The TODO comment about adding support for other embedding models seems outdated since the code now dynamically handles models through the vector engine.

```diff
- # TODO: Add support for other embedding models, with max_token mapping
```
🧰 Tools
🪛 GitHub Actions: ruff format
[error] Multiple Python files (308 files) require formatting with Ruff. Run 'ruff format' to fix formatting issues.
Line range hint `162-166`: Enhance error logging with more context. The error handling is good, but the error messages could be more specific to help with debugging.

```diff
- logger.error(f"Error processing code part: {e}")
+ logger.error(f"Error processing code part {code_part.id}: {str(e)}", exc_info=True)

- logger.error(f"Error processing data point: {e}")
+ logger.error(f"Error processing data point {data_point.id}: {str(e)}", exc_info=True)
```

🧰 Tools
🪛 GitHub Actions: ruff format
[error] Multiple Python files (308 files) require formatting with Ruff. Run 'ruff format' to fix formatting issues.
cognee/modules/chunking/TextChunker.py (2)
`1-15`: Consider organizing class attributes using type annotations. The class attributes could be better organized using Python's class variable syntax with type annotations for better code maintainability and type safety.

```diff
 class TextChunker():
-    document = None
-    max_chunk_size: int
-    chunk_index = 0
-    chunk_size = 0
-    token_count = 0
+    document: Optional[Any] = None
+    max_chunk_size: int = 0
+    chunk_index: int = 0
+    chunk_size: int = 0
+    token_count: int = 0
```

🧰 Tools
🪛 GitHub Actions: ruff format
[error] Multiple Python files (308 files) require formatting with Ruff. Run 'ruff format' to fix formatting issues.
`23-26`: Add type hints to improve code clarity. Consider adding type hints to the method parameters and return value for better code maintainability and IDE support.

```diff
- def check_word_count_and_token_count(self, word_count_before, token_count_before, chunk_data):
+ def check_word_count_and_token_count(
+     self,
+     word_count_before: int,
+     token_count_before: int,
+     chunk_data: dict[str, Any]
+ ) -> bool:
```

🧰 Tools
🪛 GitHub Actions: ruff format
[error] Multiple Python files (308 files) require formatting with Ruff. Run 'ruff format' to fix formatting issues.
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (11)
- cognee/api/v1/cognify/code_graph_pipeline.py (2 hunks)
- cognee/modules/chunking/TextChunker.py (2 hunks)
- cognee/modules/data/processing/document_types/AudioDocument.py (2 hunks)
- cognee/modules/data/processing/document_types/Document.py (2 hunks)
- cognee/modules/data/processing/document_types/ImageDocument.py (2 hunks)
- cognee/modules/data/processing/document_types/PdfDocument.py (2 hunks)
- cognee/modules/data/processing/document_types/TextDocument.py (2 hunks)
- cognee/modules/data/processing/document_types/UnstructuredDocument.py (2 hunks)
- cognee/tasks/chunks/chunk_by_paragraph.py (5 hunks)
- cognee/tasks/documents/extract_chunks_from_documents.py (1 hunks)
- cognee/tasks/repo_processor/get_source_code_chunks.py (4 hunks)
🚧 Files skipped from review as they are similar to previous changes (9)
- cognee/modules/data/processing/document_types/Document.py
- cognee/tasks/documents/extract_chunks_from_documents.py
- cognee/modules/data/processing/document_types/ImageDocument.py
- cognee/modules/data/processing/document_types/UnstructuredDocument.py
- cognee/modules/data/processing/document_types/TextDocument.py
- cognee/api/v1/cognify/code_graph_pipeline.py
- cognee/tasks/chunks/chunk_by_paragraph.py
- cognee/modules/data/processing/document_types/AudioDocument.py
- cognee/modules/data/processing/document_types/PdfDocument.py
🧰 Additional context used
🪛 GitHub Actions: ruff format
cognee/modules/chunking/TextChunker.py
[error] Multiple Python files (308 files) require formatting with Ruff. Run 'ruff format' to fix formatting issues.
cognee/tasks/repo_processor/get_source_code_chunks.py
[error] Multiple Python files (308 files) require formatting with Ruff. Run 'ruff format' to fix formatting issues.
⏰ Context from checks skipped due to timeout of 90000ms (4)
- GitHub Check: run_notebook_test / test
- GitHub Check: run_notebook_test / test
- GitHub Check: run_notebook_test / test
- GitHub Check: profiler
🔇 Additional comments (4)
cognee/tasks/repo_processor/get_source_code_chunks.py (1)
`8-8`: LGTM! Good refactoring to use the vector engine. The addition of the `get_vector_engine` import supports better encapsulation of model handling.

🧰 Tools
🪛 GitHub Actions: ruff format
[error] Multiple Python files (308 files) require formatting with Ruff. Run 'ruff format' to fix formatting issues.
cognee/modules/chunking/TextChunker.py (3)
`17-21`: LGTM! Well-structured constructor changes. The addition of the optional `max_tokens` parameter with infinity as the default is a good approach for backward compatibility.

🧰 Tools
🪛 GitHub Actions: ruff format
[error] Multiple Python files (308 files) require formatting with Ruff. Run 'ruff format' to fix formatting issues.
`32-33`: LGTM! Token counting implementation looks good. The integration of token counting with the existing word count logic is well implemented.
Also applies to: 37-40
🧰 Tools
🪛 GitHub Actions: ruff format
[error] Multiple Python files (308 files) require formatting with Ruff. Run 'ruff format' to fix formatting issues.
Line range hint `1-97`: Run ruff format to fix formatting issues. The pipeline indicates that this file (among others) requires formatting with Ruff.
Run the following to verify and fix formatting issues:
✅ Verification successful
Run ruff format to fix formatting issues.
The project uses ruff for formatting (configured in pyproject.toml), and the file shows inconsistent formatting patterns that need to be fixed:
- Inconsistent class definition style
- Inconsistent spacing around operators and in function arguments
- Potential line length issues
🏁 Scripts executed
The following scripts were executed for the analysis:
```bash
#!/bin/bash
# Show formatting issues in this file
ruff format --diff cognee/modules/chunking/TextChunker.py
```

Length of output: 154

```bash
#!/bin/bash
# Check spacing around operators and function arguments
rg -U "(\w+\s*=\s*\w+)" --type py cognee/modules/chunking/TextChunker.py

# Check class structure and spacing
ast-grep --pattern 'class $_ { $$$ }' cognee/modules/chunking/TextChunker.py

# Look for style guide or configuration
fd "\.style" --type f
fd "\.ruff" --type f
fd "setup.cfg" --type f
```

Length of output: 2134

```bash
#!/bin/bash
# Look for Python project configuration
fd "pyproject.toml" --type f

# Check spacing patterns in other Python files
rg -U "class.*:[\n\s]*(\w+\s*=\s*\w+)" --type py

# Check for common style patterns
rg -U "def.*:.*\n\s*" --type py cognee/modules/chunking/
```

Length of output: 3294

```bash
#!/bin/bash
# Check project configuration for style settings
cat pyproject.toml
```

Length of output: 3814
🧰 Tools
🪛 GitHub Actions: ruff format
[error] Multiple Python files (308 files) require formatting with Ruff. Run 'ruff format' to fix formatting issues.
Actionable comments posted: 0
🧹 Nitpick comments (2)
cognee/tasks/repo_processor/get_source_code_chunks.py (2)
`130-132`: Add error handling for model name extraction. While centralizing the model configuration is good, the model name extraction could be more robust. The current implementation assumes a specific format and might fail if the model name doesn't contain a "/".

Consider adding error handling:

```diff
 vector_engine = get_vector_engine()
 embedding_model = vector_engine.embedding_engine.model
-model_name = embedding_model.split("/")[-1]
+try:
+    model_name = embedding_model.split("/")[-1]
+except (AttributeError, IndexError) as e:
+    logger.error(f"Invalid embedding model format: {embedding_model}")
+    model_name = "text-embedding-3-large"  # fallback to default
```
`157-157`: Update the TODO comment to reflect the new design. The TODO comment about adding support for other embedding models should be updated since the embedding model handling has been centralized through the vector engine.

Consider updating the TODO comment to:

```diff
- # TODO: Add support for other embedding models, with max_token mapping
+ # TODO: Add max_token mapping for different embedding models in vector engine configuration
```
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (12)
- cognee/api/v1/cognify/code_graph_pipeline.py (2 hunks)
- cognee/modules/chunking/TextChunker.py (2 hunks)
- cognee/modules/data/processing/document_types/AudioDocument.py (2 hunks)
- cognee/modules/data/processing/document_types/Document.py (2 hunks)
- cognee/modules/data/processing/document_types/ImageDocument.py (2 hunks)
- cognee/modules/data/processing/document_types/PdfDocument.py (2 hunks)
- cognee/modules/data/processing/document_types/TextDocument.py (2 hunks)
- cognee/modules/data/processing/document_types/UnstructuredDocument.py (2 hunks)
- cognee/tasks/chunks/chunk_by_paragraph.py (5 hunks)
- cognee/tasks/documents/extract_chunks_from_documents.py (1 hunks)
- cognee/tasks/repo_processor/get_non_code_files.py (1 hunks)
- cognee/tasks/repo_processor/get_source_code_chunks.py (4 hunks)
🚧 Files skipped from review as they are similar to previous changes (11)
- cognee/tasks/documents/extract_chunks_from_documents.py
- cognee/modules/data/processing/document_types/ImageDocument.py
- cognee/modules/data/processing/document_types/AudioDocument.py
- cognee/modules/chunking/TextChunker.py
- cognee/modules/data/processing/document_types/PdfDocument.py
- cognee/api/v1/cognify/code_graph_pipeline.py
- cognee/tasks/chunks/chunk_by_paragraph.py
- cognee/modules/data/processing/document_types/Document.py
- cognee/tasks/repo_processor/get_non_code_files.py
- cognee/modules/data/processing/document_types/TextDocument.py
- cognee/modules/data/processing/document_types/UnstructuredDocument.py
⏰ Context from checks skipped due to timeout of 90000ms (13)
- GitHub Check: run_notebook_test / test
- GitHub Check: run_notebook_test / test
- GitHub Check: run_notebook_test / test
- GitHub Check: test
- GitHub Check: test
- GitHub Check: run_dynamic_steps_example_test / test
- GitHub Check: run_multimedia_example_test / test
- GitHub Check: test
- GitHub Check: test
- GitHub Check: test
- GitHub Check: test
- GitHub Check: profiler
- GitHub Check: docker-compose-test
🔇 Additional comments (3)
cognee/tasks/repo_processor/get_source_code_chunks.py (3)
8-8: LGTM: Import addition aligns with centralized vector engine approach.
The addition of the get_vector_engine import supports the refactoring to use a centralized vector engine configuration.
172-172: LGTM: Function call correctly updated.
The call to get_source_code_chunks_from_code_part has been properly updated to align with the new signature, and appropriate error handling is maintained.
130-132: Verify all callers are updated for the new signatures.
Let's ensure all callers have been updated to work with the new function signatures that no longer accept the embedding model parameter.
Also applies to: 157-157, 172-172
✅ Verification successful
All callers have been properly updated
The changes are consistent across the codebase with no remaining references to the old function signatures. The only model references found are in the embedding infrastructure files where they belong.
🏁 Scripts executed
The following scripts were executed for the analysis:
Script:
#!/bin/bash
# Search for any remaining direct references to embedding models or old function signatures
# Check for any remaining hardcoded embedding model references
rg -l "text-embedding-[0-9]-large" --type py
# Look for potential calls to get_source_code_chunks with embedding_model parameter
ast-grep --pattern 'get_source_code_chunks($$$, embedding_model=$_)'
# Look for potential calls to get_source_code_chunks_from_code_part with model_name parameter
ast-grep --pattern 'get_source_code_chunks_from_code_part($$$, model_name=$_)'
Length of output: 388
Actionable comments posted: 2
🔭 Outside diff range comments (1)
cognee/tests/unit/processing/chunks/chunk_by_paragraph_2_test.py (1)
Add test coverage for token-based chunking
The implementation includes token-based chunking features that aren't covered by tests:
- No test cases for the max_tokens parameter
- No verification of the token_count field in chunk output
- Missing tests for interaction between max_tokens and paragraph_length limits
Suggested test additions:
- Test that chunks respect token limits when max_tokens is set
- Verify correct token_count values in chunk output
- Test chunking behavior when both token and paragraph length limits are active
A minimal sketch of such tests follows below.
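The sketch below assumes chunk_by_paragraph accepts max_tokens as a keyword argument and yields dict chunks carrying a token_count key, as described in this PR's summary; the import path and test values are illustrative, not taken from the repository.

from cognee.tasks.chunks import chunk_by_paragraph

TEST_TEXT = "First paragraph sentence one. Sentence two.\n\nSecond paragraph here."

def test_chunks_respect_max_tokens():
    max_tokens = 10
    for chunk_data in chunk_by_paragraph(
        data=TEST_TEXT, paragraph_length=1024, batch_paragraphs=True, max_tokens=max_tokens
    ):
        # every emitted chunk must stay within the configured token budget
        assert chunk_data["token_count"] <= max_tokens

def test_token_count_reported():
    for chunk_data in chunk_by_paragraph(
        data=TEST_TEXT, paragraph_length=1024, batch_paragraphs=True, max_tokens=50
    ):
        # token_count should be a positive integer for non-empty chunks
        assert isinstance(chunk_data["token_count"], int)
        assert chunk_data["token_count"] > 0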
🔗 Analysis chain
Line range hint
1-51
: Verify ground truth values and test coverage.
Based on the retrieved learning from PR #205, when changes are made to the chunking implementation, the ground truth values in the tests need to be updated accordingly. Additionally, consider adding test cases for:
- The new max_tokens parameter mentioned in the AI summary
- Edge cases where token limits affect paragraph chunking
Let's verify the test coverage:
🏁 Scripts executed
The following scripts were executed for the analysis:
Script:
#!/bin/bash
# Check test coverage for chunk_by_paragraph
rg -A 5 "def test.*chunk_by_paragraph" cognee/tests/unit/processing/chunks/
Length of output: 1557
Script:
#!/bin/bash
# Check chunk_by_paragraph implementation and max_tokens parameter
ast-grep --pattern 'def chunk_by_paragraph($$$)'
# Search for max_tokens in test files
rg "max_tokens" cognee/tests/unit/processing/chunks/
Length of output: 103
Script:
#!/bin/bash
# Find implementation files
fd "chunks.py" --type f
# Search for chunk_by_paragraph in all Python files
rg -t py "def chunk_by_paragraph" -A 3
# Look for additional test files
fd "chunk.*test.py" --type f
Length of output: 825
Script:
#!/bin/bash
# Check content of the other test file
cat cognee/tests/unit/processing/chunks/chunk_by_paragraph_test.py
# Get full implementation to understand max_tokens usage
cat cognee/tasks/chunks/chunk_by_paragraph.py
Length of output: 5970
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (3)
- cognee/tasks/chunks/chunk_by_paragraph.py (5 hunks)
- cognee/tests/unit/processing/chunks/chunk_by_paragraph_2_test.py (2 hunks)
- cognee/tests/unit/processing/chunks/chunk_by_paragraph_test.py (1 hunks)
🚧 Files skipped from review as they are similar to previous changes (1)
- cognee/tasks/chunks/chunk_by_paragraph.py
🧰 Additional context used
📓 Learnings (2)
cognee/tests/unit/processing/chunks/chunk_by_paragraph_2_test.py (1)
Learnt from: 0xideas
PR: topoteretes/cognee#205
File: cognee/tests/unit/processing/chunks/chunk_by_paragraph_test.py:7-7
Timestamp: 2024-11-13T14:55:05.912Z
Learning: When changes are made to the chunking implementation in `cognee/tasks/chunks`, the ground truth values in the corresponding tests in `cognee/tests/unit/processing/chunks` need to be updated accordingly.
cognee/tests/unit/processing/chunks/chunk_by_paragraph_test.py (1)
Learnt from: 0xideas
PR: topoteretes/cognee#205
File: cognee/tests/unit/processing/chunks/chunk_by_paragraph_test.py:7-7
Timestamp: 2024-11-13T14:55:05.912Z
Learning: When changes are made to the chunking implementation in `cognee/tasks/chunks`, the ground truth values in the corresponding tests in `cognee/tests/unit/processing/chunks` need to be updated accordingly.
⏰ Context from checks skipped due to timeout of 90000ms (11)
- GitHub Check: run_notebook_test / test
- GitHub Check: run_notebook_test / test
- GitHub Check: test
- GitHub Check: run_dynamic_steps_example_test / test
- GitHub Check: test
- GitHub Check: run_notebook_test / test
- GitHub Check: test
- GitHub Check: test
- GitHub Check: test
- GitHub Check: test
- GitHub Check: docker-compose-test
🔇 Additional comments (2)
cognee/tests/unit/processing/chunks/chunk_by_paragraph_test.py (1)
52-54: LGTM! Improved readability with keyword arguments.
The conversion from positional to keyword arguments enhances code clarity and maintainability.
cognee/tests/unit/processing/chunks/chunk_by_paragraph_2_test.py (1)
49-51: LGTM! Consistent with the previous change.
The modification to use keyword arguments is consistent with the previous change and improves code readability.
for chunk_data in chunk_by_paragraph(
    data=test_text, paragraph_length=12, batch_paragraphs=False
):
💡 Codebase verification
The max_tokens parameter needs test coverage
The implementation shows that max_tokens is a critical parameter affecting chunk boundaries and output metadata, but it lacks test coverage. Neither test file includes tests for:
- Chunks respecting the token limit
- Token count accuracy in output
- Edge cases with different token limits
- Interaction between max_tokens and paragraph_length
🔗 Analysis chain
Verify test coverage for the new max_tokens parameter.
The chunk_by_paragraph function has been updated to include an optional max_tokens parameter, but it's not being tested here.
Let's verify the implementation and test coverage:
Consider adding test cases to verify the behavior with different max_tokens values. I can help generate these test cases if needed.
🏁 Scripts executed
The following scripts were executed for the analysis:
Script:
#!/bin/bash
# Description: Check implementation and test coverage of max_tokens parameter
# Check the implementation
echo "Checking chunk_by_paragraph implementation:"
rg -A 5 "def chunk_by_paragraph" cognee/tasks/chunks/
# Check for existing tests of max_tokens
echo "Checking for max_tokens tests:"
rg -l "max_tokens.*chunk_by_paragraph" cognee/tests/
Length of output: 721
Script:
#!/bin/bash
# Get the complete implementation of chunk_by_paragraph
echo "Complete chunk_by_paragraph implementation:"
cat cognee/tasks/chunks/chunk_by_paragraph.py
# Find and show all test files related to chunk_by_paragraph
echo -e "\nAll chunk_by_paragraph test files:"
fd "test.*py" cognee/tests | xargs rg -l "chunk_by_paragraph"
# Show content of the main test file
echo -e "\nContent of the main test file:"
cat cognee/tests/unit/processing/chunks/chunk_by_paragraph_test.py
Length of output: 6419
Script:
#!/bin/bash
# Check the content of the second test file
echo "Content of the second test file:"
cat cognee/tests/unit/processing/chunks/chunk_by_paragraph_2_test.py
Length of output: 2277
chunks = list(
    chunk_by_paragraph(
        data=input_text, paragraph_length=paragraph_length, batch_paragraphs=batch_paragraphs
    )
)
💡 Codebase verification
Add test coverage for the max_tokens parameter
The implementation in cognee/tasks/chunks/chunk_by_paragraph.py includes an optional max_tokens parameter, but there are no tests covering this functionality. Consider adding test cases to verify the behavior when:
- max_tokens is set to a specific value
- max_tokens is None (the default case)
- max_tokens is set to different numeric types (int/float)
A parameterized sketch follows below.
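A parameterized sketch under the same assumptions as the earlier one (keyword argument max_tokens, dict chunks with a token_count key; the import path and values are illustrative):

import pytest
from cognee.tasks.chunks import chunk_by_paragraph

@pytest.mark.parametrize("max_tokens", [None, 8, 64])
def test_chunk_by_paragraph_max_tokens_variants(max_tokens):
    chunks = list(
        chunk_by_paragraph(
            data="One paragraph. Another sentence.\n\nSecond paragraph.",
            paragraph_length=1024,
            batch_paragraphs=True,
            max_tokens=max_tokens,
        )
    )
    assert len(chunks) > 0
    if max_tokens is not None:
        # with a limit set, no chunk may exceed it
        assert all(chunk["token_count"] <= max_tokens for chunk in chunks)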
🔗 Analysis chain
LGTM! Consider adding test coverage for max_tokens parameter.
The change to use keyword arguments improves code readability. However, based on the AI summary, the chunk_by_paragraph function now supports a max_tokens parameter that isn't covered by these tests.
Let's verify the implementation of the max_tokens parameter:
🏁 Scripts executed
The following scripts were executed for the analysis:
Script:
#!/bin/bash
# Check if max_tokens parameter exists in the implementation
ast-grep --pattern 'def chunk_by_paragraph($params) {
$$$
}'
Length of output: 64
Script:
#!/bin/bash
# Search for chunk_by_paragraph function definition with correct Python syntax
ast-grep --pattern 'def chunk_by_paragraph($$$):
$$$'
# Fallback to ripgrep for function definition
rg "def chunk_by_paragraph" -A 5
# Check imports
rg "from.*chunk_by_paragraph.*import|import.*chunk_by_paragraph"
Length of output: 1022
Actionable comments posted: 1
🔭 Outside diff range comments (1)
cognee/tests/integration/documents/UnstructuredDocument_test.py (1)
Line range hint
1-102
: Restructure tests for better organization and isolation.
The current test structure combines all document type tests into a single function. Consider:
- Using a test class with proper setup/teardown
- Separating tests by document type
- Adding parameterized tests for different configurations
- Implementing proper test isolation
Example restructure:
import pytest

# plain pytest class rather than unittest.TestCase:
# pytest.mark.parametrize does not work on TestCase methods
class TestUnstructuredDocument:
    def setup_method(self):
        # Setup test files and document instances
        self.setup_test_files()

    def setup_test_files(self):
        # Initialize test file paths and document instances
        pass

    def test_pptx_document(self):
        # PPTX specific tests
        pass

    def test_docx_document(self):
        # DOCX specific tests
        pass

    @pytest.mark.parametrize("chunk_size", [512, 1024, 2048])
    def test_csv_document_with_different_chunks(self, chunk_size):
        # CSV tests with different chunk sizes
        pass

    def test_xlsx_document(self):
        # Excel specific tests
        pass
🧹 Nitpick comments (3)
cognee/tests/integration/documents/UnstructuredDocument_test.py (3)
71-77
: Enhance test coverage for PPTX document processing.
The test could be improved in several ways:
- Add test cases for the new max_tokens parameter
- Verify the total number of chunks processed
- Test edge cases with different chunk sizes
- Consider using test data constants instead of hard-coded values
Example improvement:
def test_pptx_document_read():
    # Test with different max_tokens values
    chunks = list(pptx_document.read(chunk_size=1024, chunker="text_chunker", max_tokens=100))
    assert len(chunks) > 0, "Expected at least one chunk"

    # Test without max_tokens: without a token limit, chunks are larger, so there
    # should be fewer (or equally many) of them
    chunks_no_limit = list(pptx_document.read(chunk_size=1024, chunker="text_chunker"))
    assert len(chunks_no_limit) <= len(chunks), "Expected fewer or equal chunks without token limit"

    # Verify first chunk properties
    first_chunk = chunks[0]
    assert first_chunk.word_count == 19
    assert len(first_chunk.text) == 104
    assert first_chunk.cut_type == "sentence_cut"
Line range hint
87-95
: Improve CSV test data and coverage.
The current test uses artificial data ("A A A A...") which may not represent real-world CSV scenarios. Consider:
- Using more realistic test data with actual column headers and varied content
- Verifying CSV-specific properties (column separation, handling of quotes, etc.)
- Testing with the new max_tokens parameter
- Testing with different chunk sizes to verify consistent behavior
Example improvement:
def test_csv_document_read_with_realistic_data():
    # Test with realistic CSV content (the document fixture would be built from csv_content)
    csv_content = "Name,Age,City\nJohn Doe,30,New York\nJane Smith,25,Los Angeles"
    for chunk in csv_document.read(chunk_size=1024, chunker="text_chunker"):
        assert "," in chunk.text, "Expected CSV delimiter in chunk"
        assert chunk.text.count(",") <= 2, "Expected at most three comma-separated columns per row"
Line range hint
97-102
: Add Excel-specific test scenarios.
The current test treats Excel files similarly to other document types. Consider adding tests for Excel-specific features:
- Multiple worksheet handling
- Formula handling
- Different cell types (text, numbers, dates)
- Empty cell handling
Example improvement:
def test_xlsx_document_read_comprehensive():
    # Test with max_tokens
    chunks = list(xlsx_document.read(chunk_size=1024, chunker="text_chunker", max_tokens=100))

    # Verify Excel-specific features
    found_sheets = set()
    for chunk in chunks:
        if "Sheet" in chunk.text:  # Assuming sheet names are preserved in text
            found_sheets.add(chunk.text.split("Sheet")[1].split()[0])
    assert len(found_sheets) > 0, "Expected content from at least one sheet"
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (7)
- cognee/modules/data/processing/document_types/AudioDocument.py (2 hunks)
- cognee/modules/data/processing/document_types/Document.py (2 hunks)
- cognee/modules/data/processing/document_types/ImageDocument.py (2 hunks)
- cognee/modules/data/processing/document_types/PdfDocument.py (2 hunks)
- cognee/modules/data/processing/document_types/TextDocument.py (2 hunks)
- cognee/modules/data/processing/document_types/UnstructuredDocument.py (2 hunks)
- cognee/tests/integration/documents/UnstructuredDocument_test.py (2 hunks)
🚧 Files skipped from review as they are similar to previous changes (5)
- cognee/modules/data/processing/document_types/AudioDocument.py
- cognee/modules/data/processing/document_types/TextDocument.py
- cognee/modules/data/processing/document_types/Document.py
- cognee/modules/data/processing/document_types/ImageDocument.py
- cognee/modules/data/processing/document_types/UnstructuredDocument.py
⏰ Context from checks skipped due to timeout of 90000ms (14)
- GitHub Check: run_notebook_test / test
- GitHub Check: run_multimedia_example_test / test
- GitHub Check: run_notebook_test / test
- GitHub Check: test
- GitHub Check: run_dynamic_steps_example_test / test
- GitHub Check: run_notebook_test / test
- GitHub Check: test
- GitHub Check: test
- GitHub Check: test
- GitHub Check: test
- GitHub Check: test
- GitHub Check: test
- GitHub Check: test
- GitHub Check: docker-compose-test
🔇 Additional comments (4)
cognee/modules/data/processing/document_types/PdfDocument.py (4)
1-1
: LGTM! Clean import addition.
The Optional type import follows Python typing best practices.
12-12
: LGTM! Well-structured method signature update.
The read method signature is properly updated with:
- Type-safe Optional[int] parameter
- Sensible None default value
- Backward compatibility maintained
21-23
: LGTM! Clean parameter passing.
The chunker_func call correctly propagates the max_tokens parameter while maintaining the existing structure.
Let's verify that other document types have consistent implementations:
✅ Verification successful
Consistent chunker_func implementation verified across document types
All document types (Text, Image, PDF, Audio) implement chunker_func calls consistently with identical parameter handling, including the max_tokens parameter.
🏁 Scripts executed
The following scripts were executed for the analysis:
Script:
#!/bin/bash
# Description: Check for consistency in read method signatures across document types
# Expected: All document types should have similar max_tokens parameter handling
# Search for read method definitions in document type classes
ast-grep --pattern 'class $_Document(Document):
  $$$
  def read(self, chunk_size: int, chunker: str$_) {
    $$$
  }
  $$$'
Length of output: 126
Script:
#!/bin/bash
# First, find all Python files that might contain document type implementations
echo "=== Document Type Classes ==="
ast-grep --pattern 'class $_Document(Document):'
echo -e "\n=== Read Methods ==="
# Look for read method implementations
ast-grep --pattern 'def read(self, $_)'
echo -e "\n=== Chunker Usage ==="
# Look for chunker_func usage
rg "chunker_func\(" -A 2
Length of output: 1568
Line range hint
12-27
: Verify error handling for max_tokens.
While the implementation looks good, we should ensure proper error handling when max_tokens is exceeded.
Let's check if there's consistent error handling across the codebase:
for paragraph_data in docx_document.read(chunk_size=1024, chunker="text_chunker"):
    assert 16 == paragraph_data.word_count, f" 16 != {paragraph_data.word_count = }"
    assert 145 == len(paragraph_data.text), f" 145 != {len(paragraph_data.text) = }"
    assert (
        "sentence_end" == paragraph_data.cut_type
    ), f" sentence_end != {paragraph_data.cut_type = }"
|
💡 Codebase verification
Inconsistent cut type values between document types
The codebase shows that DOCX uses "sentence_end" while PPTX uses "sentence_cut", but the chunking implementation in chunk_by_paragraph.py suggests that "sentence_cut" should be used when cutting at word boundaries. This inconsistency appears to be a bug.
- cognee/tasks/chunks/chunk_by_paragraph.py: Uses "sentence_cut" as the standard cut type
- cognee/tests/integration/documents/UnstructuredDocument_test.py: Expects "sentence_end" for DOCX
- cognee/tests/integration/documents/PowerPointDocument_test.py: Expects "sentence_cut" for PPTX
🔗 Analysis chain
Document and verify the different cut types.
The test shows that DOCX uses "sentence_end" while PPTX uses "sentence_cut", but this difference isn't documented or verified across different scenarios.
Let's check if this behavior is documented:
Consider:
- Adding docstrings explaining the different cut types
- Testing with different chunk sizes to verify cut type consistency
- Adding test cases for the max_tokens parameter
A docstring sketch for the two cut types follows below.
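One way to capture the distinction is a docstring on the chunker itself; this is a sketch whose wording is inferred from the tests above, not taken from the repository:

def chunk_by_paragraph(data, paragraph_length, batch_paragraphs=True, max_tokens=None):
    """Split text into paragraph-based chunks.

    Each yielded chunk carries a cut_type field:
    - "sentence_end": the chunk ends cleanly at a sentence boundary
    - "sentence_cut": the chunk was cut at a word boundary before the
      sentence finished, e.g. because a length or token limit was reached
    """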
🏁 Scripts executed
The following scripts were executed for the analysis:
Script:
#!/bin/bash
# Search for documentation about cut types
rg -i "sentence_(end|cut)" --type md
rg -i "cut_type.*=.*['\"]sentence_(end|cut)['\"]" --type py -A 2 -B 2
Length of output: 114
Script:
#!/bin/bash
# Search for all cut_type assignments and chunker implementations
rg "cut_type.*=.*['\"].*['\"]" --type py -A 2 -B 2
rg "class.*Chunker" --type py -A 5
rg "def.*chunk" --type py -A 5
# Search for test files related to chunking
fd ".*chunk.*test.*\.py$"
Length of output: 31805
Actionable comments posted: 0
🧹 Nitpick comments (2)
cognee/tasks/repo_processor/expand_dependency_graph.py (1)
9-11
: Standardize import spacing
Add a single blank line between the third-party imports and the standard library imports to follow PEP 8 guidelines.
  from cognee.tasks.repo_processor.extract_code_parts import extract_code_parts
-
  import logging
cognee/tasks/repo_processor/get_local_dependencies.py (1)
Line range hint
82-84
: Consider adding context to file operation errors
For better debugging, consider including the file encoding being used when logging IO errors.
  except IOError as e:
-     logger.error(f"Error opening {script_path}: {e}")
+     logger.error(f"Error opening {script_path} with default encoding: {e}")
      return []
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (4)
- cognee/tasks/repo_processor/__init__.py (0 hunks)
- cognee/tasks/repo_processor/expand_dependency_graph.py (1 hunks)
- cognee/tasks/repo_processor/extract_code_parts.py (1 hunks)
- cognee/tasks/repo_processor/get_local_dependencies.py (1 hunks)
💤 Files with no reviewable changes (1)
- cognee/tasks/repo_processor/__init__.py
🧰 Additional context used
🪛 GitHub Actions: test | profiling
cognee/tasks/repo_processor/extract_code_parts.py
[error] Circular import detected: Cannot import 'logger' from 'cognee.tasks.repo_processor' due to circular dependency
⏰ Context from checks skipped due to timeout of 90000ms (15)
- GitHub Check: run_simple_example_test / test
- GitHub Check: run_notebook_test / test
- GitHub Check: run_notebook_test / test
- GitHub Check: run_dynamic_steps_example_test / test
- GitHub Check: test
- GitHub Check: test
- GitHub Check: test
- GitHub Check: test
- GitHub Check: run_notebook_test / test
- GitHub Check: test
- GitHub Check: run_multimedia_example_test / test
- GitHub Check: test
- GitHub Check: test
- GitHub Check: test
- GitHub Check: docker-compose-test
🔇 Additional comments (3)
cognee/tasks/repo_processor/extract_code_parts.py (1)
Line range hint
4-11
: Good fix for the circular import issue!
The change from importing logger from cognee.tasks.repo_processor to using Python's standard logging module resolves the circular dependency issue indicated in the pipeline failure. This is a more maintainable approach that follows Python's logging best practices.
🪛 GitHub Actions: test | profiling
[error] Circular import detected: Cannot import 'logger' from 'cognee.tasks.repo_processor' due to circular dependency
cognee/tasks/repo_processor/expand_dependency_graph.py (1)
11-11
: Well-structured logging with appropriate log levels!
Good use of different logging levels:
- DEBUG for non-critical information (no code parts to add)
- WARNING for potential issues (empty code parts)
- ERROR for exceptions
Also applies to: 19-21, 54-56
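For reference, the per-module pattern behind these levels is the standard library logger; a minimal sketch (the messages are illustrative, not the file's actual log lines):

import logging

logger = logging.getLogger(__name__)  # one logger per module, no package-level import

logger.debug("No code parts to add")   # non-critical information
logger.warning("Empty code parts")     # potential issue
logger.error("Failed to expand node")  # exception path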
cognee/tasks/repo_processor/get_local_dependencies.py (1)
13-15
: Excellent error handling with appropriate logging levels!
The error handling has been significantly improved with proper logging:
- ERROR level for critical failures (file operations, code analysis)
- WARNING level for non-critical issues (path resolution)
Also applies to: 63-65, 82-84, 121-124, 132-135
Actionable comments posted: 1
🔭 Outside diff range comments (1)
cognee/modules/cognify/config.py (1)
Line range hint
14-19
: Update to_dict() method to include max_tokens.
The to_dict() method should be updated to include all configuration attributes, including the newly added max_tokens. This ensures consistent serialization of the configuration.
Apply this diff:
  def to_dict(self) -> dict:
      return {
          "classification_model": self.classification_model,
          "summarization_model": self.summarization_model,
+         "max_tokens": self.max_tokens,
      }
🧹 Nitpick comments (1)
cognee/modules/cognify/config.py (1)
11-11
: Document the MAX_TOKENS environment variable.
Since this configuration relies on an environment variable, it would be helpful to add a docstring explaining its purpose, expected value type, and any constraints.
Add documentation above the attribute:
+ # Maximum number of tokens allowed for document processing.
+ # Set via MAX_TOKENS environment variable.
+ # Expected value: positive integer or None
  max_tokens: Optional[int] = os.getenv("MAX_TOKENS")
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (2)
- cognee/api/v1/cognify/code_graph_pipeline.py (2 hunks)
- cognee/modules/cognify/config.py (1 hunks)
🚧 Files skipped from review as they are similar to previous changes (1)
- cognee/api/v1/cognify/code_graph_pipeline.py
⏰ Context from checks skipped due to timeout of 90000ms (15)
- GitHub Check: run_multimedia_example_test / test
- GitHub Check: run_dynamic_steps_example_test / test
- GitHub Check: run_notebook_test / test
- GitHub Check: run_notebook_test / test
- GitHub Check: run_notebook_test / test
- GitHub Check: test
- GitHub Check: test
- GitHub Check: test
- GitHub Check: test
- GitHub Check: test
- GitHub Check: test
- GitHub Check: test
- GitHub Check: test
- GitHub Check: test
- GitHub Check: docker-compose-test
🔇 Additional comments (2)
cognee/modules/cognify/config.py (2)
4-4
: LGTM!
The import of Optional from typing is correctly added to support type hinting for the new max_tokens attribute.
11-11
: Verify the usage of max_tokens across the codebase.
Let's verify how this configuration is used in document processing classes mentioned in the AI summary.
✅ Verification successful
Implementation of max_tokens is complete and consistent
The max_tokens configuration is properly propagated through the entire document processing pipeline and correctly implemented in the chunking logic to limit token counts in document chunks.
🏁 Scripts executed
The following scripts were executed for the analysis:
Script:
#!/bin/bash
# Description: Check the usage of max_tokens parameter in document processing classes
# Search for document class definitions and their read methods
echo "Checking document class implementations:"
ast-grep --pattern 'class $DOC_CLASS($_):
  $$$
  def read($$$):
    $$$'
# Search for TextChunker usage of max_tokens
echo -e "\nChecking TextChunker implementation:"
rg -A 5 'class TextChunker'
# Search for places where CognifyConfig.max_tokens is accessed
echo -e "\nChecking max_tokens usage:"
rg 'max_tokens.*=.*cognee_config\.max_tokens'
Length of output: 8151
Script:
#!/bin/bash
# Check complete implementation of TextChunker and related classes
echo "Checking TextChunker implementation:"
cat cognee/modules/chunking/TextChunker.py
echo -e "\nChecking extract_chunks_from_documents implementation:"
ast-grep --pattern 'def extract_chunks_from_documents($$$):
  $$$'
echo -e "\nChecking additional max_tokens usages:"
rg -g '!*.pyc' 'max_tokens' -A 3
Length of output: 18909
class CognifyConfig(BaseSettings):
    classification_model: object = DefaultContentPrediction
    summarization_model: object = SummarizedContent

    max_tokens: Optional[int] = os.getenv("MAX_TOKENS")
Consider type validation for MAX_TOKENS environment variable.
While the type hint correctly indicates Optional[int], the direct use of os.getenv() doesn't perform type conversion. This could lead to runtime type errors if the environment variable exists but contains a non-integer value.
Apply this diff to add type validation:
- max_tokens: Optional[int] = os.getenv("MAX_TOKENS")
+ max_tokens: Optional[int] = int(os.getenv("MAX_TOKENS")) if os.getenv("MAX_TOKENS") is not None else None
📝 Committable suggestion
‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.
max_tokens: Optional[int] = int(os.getenv("MAX_TOKENS")) if os.getenv("MAX_TOKENS") is not None else None
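An alternative sketch: since CognifyConfig extends BaseSettings, the settings layer can do the parsing itself. Assuming pydantic-settings v2 (on pydantic v1 the import would be from pydantic import BaseSettings), the field is matched case-insensitively against the MAX_TOKENS environment variable and coerced to int:

from typing import Optional
from pydantic_settings import BaseSettings

class CognifyConfig(BaseSettings):
    # read from the MAX_TOKENS env var and coerced to int by pydantic;
    # invalid values fail fast with a ValidationError at startup
    max_tokens: Optional[int] = None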
Actionable comments posted: 1
🧹 Nitpick comments (2)
evals/eval_swe_bench.py (2)
51-58
: Improve collections configuration and documentation.
The hardcoded collections list could benefit from:
- Moving collections to a configuration file or constants
- Adding validation to ensure collections exist
- Documenting the purpose and importance of each collection
- Considering the search performance impact of collection order
+# At the top of the file
+SEARCH_COLLECTIONS = {
+    "code_summary_text": "Contains code summary information",
+    "data_point_name": "Contains data point identifiers",
+    "document_chunk_text": "Contains document text chunks",
+    "entity_name": "Contains entity names",
+    "entity_type_name": "Contains entity type information",
+    "sourcecodechunk_source_code": "Contains source code chunks",
+}

 # In the function
-    collections=[
-        "code_summary_text",
-        "data_point_name",
-        "document_chunk_text",
-        "entity_name",
-        "entity_type_name",
-        "sourcecodechunk_source_code",
-    ],
+    collections=list(SEARCH_COLLECTIONS.keys()),
Line range hint
1-1
: Consider overall improvements to the evaluation script.
- Move the asyncio import to the top with other imports
- Enhance package installation with version checks and retry logic
- Add more configuration options (e.g., timeout, retry attempts, logging level)
A minimal retry sketch for the installation step is included below.
Would you like me to provide a detailed implementation for any of these improvements?
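For the retry point above, a minimal sketch; install_with_retry and the pip-via-subprocess approach are illustrative, not taken from the script:

import subprocess
import sys
import time

def install_with_retry(package: str, attempts: int = 3, delay: float = 2.0) -> None:
    """Install a package with pip, retrying transient failures."""
    for attempt in range(1, attempts + 1):
        try:
            subprocess.check_call([sys.executable, "-m", "pip", "install", package])
            return
        except subprocess.CalledProcessError:
            if attempt == attempts:
                raise
            time.sleep(delay)  # back off before the next attempt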
- async for result in pipeline:
+ async for result in run_code_graph_pipeline(repo_path, include_docs=True):
🛠️ Refactor suggestion
Enhance error handling and result utilization.
The current implementation only prints the pipeline results without proper error handling or result processing. Consider:
- Adding try-catch block for the async iteration
- Implementing proper logging instead of print statements
- Storing or processing the pipeline results for later use
- async for result in run_code_graph_pipeline(repo_path, include_docs=True):
-     print(result)
+ try:
+     results = []
+     async for result in run_code_graph_pipeline(repo_path, include_docs=True):
+         results.append(result)
+         logging.info(f"Pipeline result: {result}")
+ except Exception as e:
+     logging.error(f"Error in code graph pipeline: {str(e)}")
+     raise
Committable suggestion skipped: line range outside the PR's diff.
Actionable comments posted: 1
🧹 Nitpick comments (1)
cognee/tasks/repo_processor/get_source_code_chunks.py (1)
157-157
: Update or remove the outdated TODO comment.
The TODO comment about adding support for other embedding models (line 159) is now inconsistent with the centralized model management approach.
Either remove the comment or update it to reflect the current architecture:
- # TODO: Add support for other embedding models, with max_token mapping
+ # TODO: Add max_token mapping for different embedding models in vector engine configuration
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (1)
- cognee/tasks/repo_processor/get_source_code_chunks.py (4 hunks)
⏰ Context from checks skipped due to timeout of 90000ms (17)
- GitHub Check: run_multimedia_example_test / test
- GitHub Check: run_notebook_test / test
- GitHub Check: run_notebook_test / test
- GitHub Check: run_dynamic_steps_example_test / test
- GitHub Check: test
- GitHub Check: test
- GitHub Check: run_simple_example_test / test
- GitHub Check: test
- GitHub Check: test
- GitHub Check: test
- GitHub Check: test
- GitHub Check: run_notebook_test / test
- GitHub Check: test
- GitHub Check: test
- GitHub Check: test
- GitHub Check: profiler
- GitHub Check: docker-compose-test
🔇 Additional comments (2)
cognee/tasks/repo_processor/get_source_code_chunks.py (2)
8-8
: LGTM: Import aligns with vector engine integration.
The addition of the vector engine import supports the centralized model management approach.
172-172
: LGTM: Consistent with centralized model management.
The simplified function call aligns with the removal of the model_name parameter.
vector_engine = get_vector_engine()
embedding_model = vector_engine.embedding_engine.model
model_name = embedding_model.split("/")[-1]
🛠️ Refactor suggestion
Add error handling for vector engine initialization and model name extraction.
The current implementation makes assumptions about the vector engine and model path format without proper error handling.
Consider this safer implementation:
- vector_engine = get_vector_engine()
- embedding_model = vector_engine.embedding_engine.model
- model_name = embedding_model.split("/")[-1]
+ try:
+     vector_engine = get_vector_engine()
+     if not vector_engine or not vector_engine.embedding_engine:
+         raise ValueError("Vector engine or embedding engine not initialized")
+     embedding_model = vector_engine.embedding_engine.model
+     # Handle cases where model path doesn't contain '/'
+     model_name = embedding_model.split("/")[-1] if "/" in embedding_model else embedding_model
+ except Exception as e:
+     logger.error(f"Failed to initialize vector engine or get model name: {e}")
+     # Fallback to default model
+     model_name = "text-embedding-3-large"
📝 Committable suggestion
‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.
try:
    vector_engine = get_vector_engine()
    if not vector_engine or not vector_engine.embedding_engine:
        raise ValueError("Vector engine or embedding engine not initialized")
    embedding_model = vector_engine.embedding_engine.model
    # Handle cases where model path doesn't contain '/'
    model_name = embedding_model.split("/")[-1] if "/" in embedding_model else embedding_model
except Exception as e:
    logger.error(f"Failed to initialize vector engine or get model name: {e}")
    # Fallback to default model
    model_name = "text-embedding-3-large"
Improvements and fixes in the code graph pipeline:
Summary by CodeRabbit
Release Notes
New Features
- max_tokens parameter across multiple document processing methods to control token limits during chunking
Improvements
Configuration
- max_tokens configuration option to support token-based document processing limits
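A usage sketch of the new parameter, based on the read calls quoted in the review comments above (the document variable is illustrative):

# limit each chunk to roughly 100 tokens while reading a document
for chunk in document.read(chunk_size=1024, chunker="text_chunker", max_tokens=100):
    print(chunk.word_count, chunk.cut_type)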