Enable different chunking methods #128

Merged
merged 5 commits into from
Aug 9, 2024

@Vasilije1990 Vasilije1990 commented Aug 8, 2024

Summary by CodeRabbit

  • New Features

    • Introduced customizable chunking strategies for document types (PDF, Audio, Text).
    • Added AWS integration for handling translations and language detection.
    • Implemented dynamic registration for chunking functions.
  • Bug Fixes

    • Improved error handling in translation and language detection features.
  • Documentation

    • Enhanced documentation for new features and usage of chunking strategies.
  • Chores

    • Restructured imports for better organization.

coderabbitai bot commented Aug 8, 2024

Walkthrough

This PR adds a chunking_strategy parameter to the document classes so that text, PDF, and audio content can be chunked by a selectable method (for example by paragraph or by sentence), backed by a registry of chunking functions. It also adds a translation task built on AWS Translate, a language detection task, and optional AWS credential fields in the base config.

Changes

File(s) | Change Summary
cognee/api/v1/cognify/cognify_v2.py | Updated document constructors to include a chunking_strategy parameter set to "paragraph".
cognee/base_config.py | Added optional attributes for AWS credentials: aws_access_key_id and aws_secret_access_key.
cognee/modules/data/processing/document_types/*.py | Modified document readers (AudioDocument, PdfDocument, TextDocument) to include a chunking_strategy parameter.
cognee/tasks/chunk_translate/translate_chunk.py | Introduced an asynchronous translate_text function utilizing AWS Translate for text translation.
cognee/tasks/chunking/*.py | Implemented a chunking registry with decorators for chunking methods (chunk_by_paragraph, chunk_by_sentence).
cognee/tasks/document_language_detection.py | Created an asynchronous detect_language function for detecting language from text input.

Sequence Diagram(s)

sequenceDiagram
    participant User
    participant CognifyPipeline
    participant DocumentReader
    participant AWS_Translate
    participant LanguageDetector

    User->>CognifyPipeline: Initiate document processing
    CognifyPipeline->>DocumentReader: Create document instance
    DocumentReader->>DocumentReader: Apply chunking strategy
    DocumentReader->>AWS_Translate: Translate text
    AWS_Translate-->>DocumentReader: Return translated text
    DocumentReader->>LanguageDetector: Detect document language
    LanguageDetector-->>DocumentReader: Return language code
    DocumentReader-->>CognifyPipeline: Processed document output
    CognifyPipeline-->>User: Return results

🐰 In the garden of code, I hop with glee,
New chunks of wisdom await, just for me!
AWS whispers sweet translations at play,
While languages dance in a colorful sway.
With each brave leap, the features unfold,
A tale of progress beautifully told! 🌼✨


@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 3

Outside diff range, codebase verification and nitpick comments (3)
cognee/modules/data/processing/document_types/Document.py (1)

9-9: Add documentation for chunking_strategy.

Consider adding comments or documentation to explain the purpose and possible values of the chunking_strategy attribute. This will help other developers understand how to use this feature effectively.

cognee/tasks/chunking/chunking_registry.py (1)

3-7: Add type hints to register_chunking_function.

Consider adding type hints to the register_chunking_function decorator to improve code readability and maintainability.

from typing import Callable

def register_chunking_function(name: str) -> Callable:
    def decorator(func: Callable) -> Callable:
        chunking_registry[name] = func
        return func
    return decorator
cognee/base_config.py (1)

12-13: Ensure secure handling of AWS credentials.

Storing AWS credentials directly in the configuration file can pose security risks. Consider using environment variables or a secure vault to manage these credentials.
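
One way to follow this advice, as a minimal sketch: read the two values from the process environment and keep the secret out of string representations. The names AwsCredentials and load_aws_credentials are hypothetical and are not part of cognee; the variable names AWS_ACCESS_KEY_ID and AWS_SECRET_ACCESS_KEY are the standard ones that AWS tooling reads.

```python
import os
from dataclasses import dataclass
from typing import Mapping, Optional


@dataclass(frozen=True)
class AwsCredentials:
    """Hypothetical container for the two optional config attributes."""

    access_key_id: Optional[str]
    secret_access_key: Optional[str]

    def __repr__(self) -> str:
        # Mask the secret so it does not leak into logs or tracebacks.
        return f"AwsCredentials(access_key_id={self.access_key_id!r}, secret_access_key=***)"


def load_aws_credentials(env: Mapping[str, str] = os.environ) -> AwsCredentials:
    # Missing variables become None, matching "optional" config attributes.
    return AwsCredentials(
        access_key_id=env.get("AWS_ACCESS_KEY_ID"),
        secret_access_key=env.get("AWS_SECRET_ACCESS_KEY"),
    )
```

Passing the environment as a parameter keeps the loader easy to exercise in tests without touching the real process environment.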

Review details

Configuration used: CodeRabbit UI
Review profile: CHILL

Commits

Files that changed from the base of the PR and between d1ae1ae and e80d391.

Files selected for processing (14)
  • cognee/api/v1/cognify/cognify_v2.py (1 hunks)
  • cognee/base_config.py (1 hunks)
  • cognee/infrastructure/data/chunking/DefaultChunkEngine.py (1 hunks)
  • cognee/modules/data/processing/document_types/AudioDocument.py (2 hunks)
  • cognee/modules/data/processing/document_types/Document.py (1 hunks)
  • cognee/modules/data/processing/document_types/ImageDocument.py (1 hunks)
  • cognee/modules/data/processing/document_types/PdfDocument.py (2 hunks)
  • cognee/modules/data/processing/document_types/TextDocument.py (2 hunks)
  • cognee/tasks/chunk_translate/translate_chunk.py (1 hunks)
  • cognee/tasks/chunking/tests/chunk_by_paragraph.test.py (1 hunks)
  • cognee/tasks/chunking/chunk_by_paragraph.py (1 hunks)
  • cognee/tasks/chunking/chunk_by_sentence.py (1 hunks)
  • cognee/tasks/chunking/chunking_registry.py (1 hunks)
  • cognee/tasks/document_language_detection/document_language_detection.py (1 hunks)
Files skipped from review due to trivial changes (2)
  • cognee/infrastructure/data/chunking/DefaultChunkEngine.py
  • cognee/tasks/chunking/tests/chunk_by_paragraph.test.py
Additional comments not posted (15)
cognee/tasks/chunking/chunk_by_sentence.py (2)

8-8: Decorator addition approved.

The addition of the @register_chunking_function("sentence") decorator is a good approach to integrate the function into a chunking strategy registry.


Line range hint 8-38:
Verify the registration process.

Ensure that the register_chunking_function decorator correctly registers the function and that the chunking strategy registry is functioning as intended.

Verification successful

Registration Process Verified

The register_chunking_function decorator correctly registers the chunk_by_sentence function under the "sentence" strategy. The implementation ensures that the function is added to the chunking_registry dictionary as expected. No issues were found with the registration process.
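
The registration pattern described here can be sketched as follows. The decorator matches the one quoted earlier in this review; the two chunkers are simplified stand-ins written for illustration and are not the cognee implementations.

```python
import re
from typing import Callable, Dict, Iterator

# Maps a strategy name to the function that implements it.
chunking_registry: Dict[str, Callable[[str], Iterator[str]]] = {}


def register_chunking_function(name: str) -> Callable:
    def decorator(func: Callable) -> Callable:
        chunking_registry[name] = func
        return func  # The function itself is returned unchanged.
    return decorator


@register_chunking_function("paragraph")
def chunk_by_paragraph(text: str) -> Iterator[str]:
    # Stand-in logic: paragraphs are separated by blank lines.
    for paragraph in text.split("\n\n"):
        if paragraph.strip():
            yield paragraph.strip()


@register_chunking_function("sentence")
def chunk_by_sentence(text: str) -> Iterator[str]:
    # Stand-in logic: split after ., ! or ? followed by whitespace.
    for sentence in re.split(r"(?<=[.!?])\s+", text.strip()):
        if sentence:
            yield sentence
```

Registration happens as a side effect of importing the module that defines each chunker, so a strategy only becomes available once its module has been imported.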

Scripts executed

The following scripts were executed for the analysis:

Script:

#!/bin/bash
# Description: Verify the registration of `chunk_by_sentence` in the chunking strategy registry.

# Test: Check the registration process. Expect: The function to be registered under the "sentence" strategy.
rg --type python 'register_chunking_function\("sentence"\)'

Length of output: 92


Script:

#!/bin/bash
# Description: Verify the registration of `chunk_by_sentence` in the chunking strategy registry.

# Find all instances of the `register_chunking_function` decorator with "sentence".
rg '@register_chunking_function\("sentence"\)' -A 2

# Find the implementation of the `register_chunking_function` decorator.
rg 'def register_chunking_function' -A 10

Length of output: 883

cognee/tasks/chunking/chunk_by_paragraph.py (1)

3-5: Decorator addition approved.

The addition of @register_chunking_function("paragraph") enhances the integration of chunk_by_paragraph into the chunking framework without altering its logic.

cognee/modules/data/processing/document_types/TextDocument.py (3)

3-5: Import addition approved.

The import of get_chunking_function is necessary for the new chunking strategy functionality.


11-17: Constructor changes approved.

The addition of chunking_strategy and the initialization of chunking_function enhance the flexibility of the TextReader class.


48-48: Chunking function usage approved.

The use of self.chunking_function in the read method allows for dynamic chunking based on the specified strategy, enhancing flexibility.
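
A minimal sketch of how a reader might resolve and use the strategy, assuming a much simpler constructor than the real TextReader (which is not shown in this review). The registry contents and the "line" strategy are hypothetical placeholders.

```python
from typing import Callable, Dict, Iterator, List

# Hypothetical registry with two trivial strategies, for illustration only.
chunking_registry: Dict[str, Callable[[str], List[str]]] = {
    "paragraph": lambda text: [p.strip() for p in text.split("\n\n") if p.strip()],
    "line": lambda text: [line for line in text.splitlines() if line.strip()],
}


def get_chunking_function(name: str) -> Callable[[str], List[str]]:
    func = chunking_registry.get(name)
    if func is None:
        raise ValueError(f"Chunking function '{name}' not found in registry.")
    return func


class TextReader:
    def __init__(self, text: str, chunking_strategy: str = "paragraph"):
        self.text = text
        self.chunking_strategy = chunking_strategy
        # Resolved once at construction, so an unknown name fails early.
        self.chunking_function = get_chunking_function(chunking_strategy)

    def read(self) -> Iterator[str]:
        # The reader does not know which strategy it holds; it only calls it.
        yield from self.chunking_function(self.text)
```

For example, `list(TextReader("a\nb\n\nc", "line").read())` iterates line by line, while the default strategy splits the same text on the blank line.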

cognee/modules/data/processing/document_types/PdfDocument.py (3)

9-9: Import addition approved.

The import of get_chunking_function is necessary for the new chunking strategy functionality.


16-20: Constructor changes approved.

The addition of chunking_strategy and the initialization of chunking_function enhance the flexibility of the PdfReader class.


41-41: Chunking function usage approved.

The use of self.chunking_function in the read method allows for dynamic chunking based on the specified strategy, enhancing flexibility.

cognee/modules/data/processing/document_types/AudioDocument.py (2)

16-20: Constructor changes approved.

The addition of the chunking_strategy parameter and the initialization of chunking_function using get_chunking_function enhance flexibility.


43-43: Verify the usage of chunking_function.

Ensure that chunking_function is correctly implemented and used across different chunking strategies.

Verification successful

chunking_function is correctly implemented and used across document types

The chunking_function is consistently used across different document types (AudioDocument, PdfDocument, ImageDocument, TextDocument) by retrieving it through get_chunking_function based on the specified chunking_strategy. This indicates a well-structured approach to handling different chunking strategies.

  • cognee/modules/data/processing/document_types/AudioDocument.py
  • cognee/modules/data/processing/document_types/PdfDocument.py
  • cognee/modules/data/processing/document_types/ImageDocument.py
  • cognee/modules/data/processing/document_types/TextDocument.py
Scripts executed

The following scripts were executed for the analysis:

Script:

#!/bin/bash
# Description: Verify the implementation and usage of `chunking_function` for different strategies.

# Test: Search for the `get_chunking_function` usage. Expect: Correct implementation and usage.
rg --type python -A 5 $'get_chunking_function'

Length of output: 76


Script:

#!/bin/bash
# Description: Verify the implementation and usage of `chunking_function` for different strategies.

# Test: Search for the `get_chunking_function` usage. Expect: Correct implementation and usage.
rg --type py -A 5 'get_chunking_function'

Length of output: 4591

cognee/modules/data/processing/document_types/ImageDocument.py (2)

17-21: Constructor changes approved.

The addition of the chunking_strategy parameter and the initialization of chunking_function using get_chunking_function enhance flexibility.


21-21: Verify the usage of chunking_function.

Ensure that chunking_function is correctly implemented and used across different chunking strategies.

Verification successful

Verification of chunking_function usage is successful.

The chunking_function is correctly implemented and consistently used across different document types (TextDocument, ImageDocument, AudioDocument, PdfDocument). Each document type imports and utilizes the function in a similar manner to handle chunking strategies. No issues were found.

Scripts executed

The following scripts were executed for the analysis:

Script:

#!/bin/bash
# Description: Verify the implementation and usage of `chunking_function` for different strategies.

# Test: Search for the `get_chunking_function` usage. Expect: Correct implementation and usage.
rg --type python -A 5 $'get_chunking_function'

Length of output: 76


Script:

#!/bin/bash
# Corrected script to search for the definition and usage of `get_chunking_function`.

# Search for the definition of `get_chunking_function`.
rg 'def get_chunking_function' --type=py -A 5

# Search for the usage of `get_chunking_function`.
rg 'get_chunking_function' --type=py -A 5

Length of output: 4802

cognee/api/v1/cognify/cognify_v2.py (2)

58-61: Integration of chunking_strategy approved.

The addition of the chunking_strategy parameter to document constructors enhances flexibility for future processing.


58-61: Verify the integration of chunking_strategy.

Ensure that the chunking_strategy parameter is correctly integrated and doesn't affect existing functionality.

Comment on lines +9 to +10
def get_chunking_function(name: str):
    return chunking_registry.get(name)

Add error handling for get_chunking_function.

Consider adding error handling or logging if a requested chunking function is not found in the registry. This will help diagnose issues when an invalid function name is used.

def get_chunking_function(name: str):
    func = chunking_registry.get(name)
    if func is None:
        raise ValueError(f"Chunking function '{name}' not found in registry.")
    return func

Comment on lines +6 to +36
async def detect_language(data:str):
    """
    Detect the language of the given text and return its ISO 639-1 language code.
    If the detected language is Croatian ('hr'), it maps to Serbian ('sr').
    The text is trimmed to the first 100 characters for efficient processing.
    Parameters:
    text (str): The text for language detection.
    Returns:
    str: The ISO 639-1 language code of the detected language, or 'None' in case of an error.
    """

    # Trim the text to the first 100 characters
    from langdetect import detect, LangDetectException
    trimmed_text = data[:100]

    try:
        # Detect the language using langdetect
        detected_lang_iso639_1 = detect(trimmed_text)
        logging.info(f"Detected ISO 639-1 code: {detected_lang_iso639_1}")

        # Special case: map 'hr' (Croatian) to 'sr' (Serbian ISO 639-2)
        if detected_lang_iso639_1 == 'hr':
            yield 'sr'
        yield detected_lang_iso639_1

    except LangDetectException as e:
        logging.error(f"Language detection error: {e}")
    except Exception as e:
        logging.error(f"Unexpected error: {e}")

    yield None

Refactor detect_language to use return instead of yield.

Using yield in an async function is unconventional and may cause unexpected behavior. Consider refactoring to use return or an async-compatible structure.

-async def detect_language(data:str):
+async def detect_language(data: str) -> str:
    ...
-        if detected_lang_iso639_1 == 'hr':
-            yield 'sr'
-        yield detected_lang_iso639_1
+        return 'sr' if detected_lang_iso639_1 == 'hr' else detected_lang_iso639_1
    ...
-    yield None
+    return None
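
To make the difference concrete, here is a small sketch that mirrors the control flow of both versions. The parameter `detected` stands in for the langdetect result so the example has no third-party dependency; both function names are hypothetical.

```python
import asyncio
import inspect
from typing import Optional


async def detect_with_yield(detected: str):
    # Contains `yield`, so this is an async generator function, not a coroutine.
    if detected == "hr":
        yield "sr"
    yield detected  # Still reached after the 'hr' branch has yielded.
    yield None      # Reached on every full iteration, not only on errors.


async def detect_with_return(detected: str) -> Optional[str]:
    # The first matching branch ends the call with a single value.
    return "sr" if detected == "hr" else detected


async def main():
    returned = await detect_with_return("hr")
    yielded = [code async for code in detect_with_yield("hr")]
    try:
        # An async generator object cannot be awaited.
        await detect_with_yield("hr")
        await_failed = False
    except TypeError:
        await_failed = True
    return returned, yielded, await_failed


returned, yielded, await_failed = asyncio.run(main())
```

With the yield version a caller must iterate with `async for` and, for Croatian input, receives three values in sequence; the return version produces exactly one. Whether the surrounding pipeline expects generators is not shown in this review.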
Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change
async def detect_language(data: str) -> str:
    """
    Detect the language of the given text and return its ISO 639-1 language code.
    If the detected language is Croatian ('hr'), it maps to Serbian ('sr').
    The text is trimmed to the first 100 characters for efficient processing.
    Parameters:
    text (str): The text for language detection.
    Returns:
    str: The ISO 639-1 language code of the detected language, or 'None' in case of an error.
    """
    # Trim the text to the first 100 characters
    from langdetect import detect, LangDetectException
    trimmed_text = data[:100]
    try:
        # Detect the language using langdetect
        detected_lang_iso639_1 = detect(trimmed_text)
        logging.info(f"Detected ISO 639-1 code: {detected_lang_iso639_1}")
        # Special case: map 'hr' (Croatian) to 'sr' (Serbian ISO 639-2)
        return 'sr' if detected_lang_iso639_1 == 'hr' else detected_lang_iso639_1
    except LangDetectException as e:
        logging.error(f"Language detection error: {e}")
    except Exception as e:
        logging.error(f"Unexpected error: {e}")
    return None

Comment on lines +8 to +39
async def translate_text(data, source_language:str='sr', target_language:str='en', region_name='eu-west-1'):
    """
    Translate text from source language to target language using AWS Translate.
    Parameters:
    data (str): The text to be translated.
    source_language (str): The source language code (e.g., 'sr' for Serbian). ISO 639-2 Code https://www.loc.gov/standards/iso639-2/php/code_list.php
    target_language (str): The target language code (e.g., 'en' for English). ISO 639-2 Code https://www.loc.gov/standards/iso639-2/php/code_list.php
    region_name (str): AWS region name.
    Returns:
    str: Translated text or an error message.
    """
    import boto3
    from botocore.exceptions import BotoCoreError, ClientError

    if not data:
        yield "No text provided for translation."

    if not source_language or not target_language:
        yield "Both source and target language codes are required."

    try:
        translate = boto3.client(service_name='translate', region_name=region_name, use_ssl=True)
        result = translate.translate_text(Text=data, SourceLanguageCode=source_language, TargetLanguageCode=target_language)
        yield result.get('TranslatedText', 'No translation found.')

    except BotoCoreError as e:
        logging.info(f"BotoCoreError occurred: {e}")
        yield "Error with AWS Translate service configuration or request."

    except ClientError as e:
        logging.info(f"ClientError occurred: {e}")
        yield "Error with AWS client or network issue."

Refactor translate_text to use return instead of yield.

Using yield in an async function is unconventional and may cause unexpected behavior. Consider refactoring to use return or an async-compatible structure.

-async def translate_text(data, source_language:str='sr', target_language:str='en', region_name='eu-west-1'):
+async def translate_text(data, source_language: str = 'sr', target_language: str = 'en', region_name: str = 'eu-west-1') -> str:
    ...
-        yield "No text provided for translation."
+        return "No text provided for translation."
    ...
-        yield "Both source and target language codes are required."
+        return "Both source and target language codes are required."
    ...
-        yield result.get('TranslatedText', 'No translation found.')
+        return result.get('TranslatedText', 'No translation found.')
    ...
-        yield "Error with AWS Translate service configuration or request."
+        return "Error with AWS Translate service configuration or request."
    ...
-        yield "Error with AWS client or network issue."
+        return "Error with AWS client or network issue."
Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change
async def translate_text(data, source_language: str = 'sr', target_language: str = 'en', region_name: str = 'eu-west-1') -> str:
    """
    Translate text from source language to target language using AWS Translate.
    Parameters:
    data (str): The text to be translated.
    source_language (str): The source language code (e.g., 'sr' for Serbian). ISO 639-2 Code https://www.loc.gov/standards/iso639-2/php/code_list.php
    target_language (str): The target language code (e.g., 'en' for English). ISO 639-2 Code https://www.loc.gov/standards/iso639-2/php/code_list.php
    region_name (str): AWS region name.
    Returns:
    str: Translated text or an error message.
    """
    import boto3
    from botocore.exceptions import BotoCoreError, ClientError
    if not data:
        return "No text provided for translation."
    if not source_language or not target_language:
        return "Both source and target language codes are required."
    try:
        translate = boto3.client(service_name='translate', region_name=region_name, use_ssl=True)
        result = translate.translate_text(Text=data, SourceLanguageCode=source_language, TargetLanguageCode=target_language)
        return result.get('TranslatedText', 'No translation found.')
    except BotoCoreError as e:
        logging.info(f"BotoCoreError occurred: {e}")
        return "Error with AWS Translate service configuration or request."
    except ClientError as e:
        logging.info(f"ClientError occurred: {e}")
        return "Error with AWS client or network issue."
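
One practical effect of switching to return is that the guard clauses actually stop execution; with yield, a caller that keeps iterating would fall through to the AWS call even for empty input. The sketch below exercises the return-based flow without AWS access. The injected `client` parameter and StubTranslateClient are assumptions made for illustration (the stub only imitates the keyword arguments and response key used above), and the broad `except Exception` replaces the botocore exception types to keep the example dependency-free.

```python
import asyncio
import logging


class StubTranslateClient:
    """Hypothetical stand-in for the AWS Translate client used in this PR."""

    def translate_text(self, Text, SourceLanguageCode, TargetLanguageCode):
        return {"TranslatedText": f"[{SourceLanguageCode}->{TargetLanguageCode}] {Text}"}


async def translate_text(data, client, source_language: str = "sr", target_language: str = "en") -> str:
    if not data:
        # With return, nothing below this line runs for empty input.
        return "No text provided for translation."
    if not source_language or not target_language:
        return "Both source and target language codes are required."
    try:
        result = client.translate_text(
            Text=data,
            SourceLanguageCode=source_language,
            TargetLanguageCode=target_language,
        )
        return result.get("TranslatedText", "No translation found.")
    except Exception as error:  # Simplification: the PR catches botocore errors.
        logging.error(f"Translation failed: {error}")
        return "Error with AWS Translate service configuration or request."


translated = asyncio.run(translate_text("zdravo", StubTranslateClient()))
empty_result = asyncio.run(translate_text("", StubTranslateClient()))
```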

@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 0

Review details

Configuration used: CodeRabbit UI
Review profile: CHILL

Commits

Files that changed from the base of the PR and between e80d391 and 7c7b8a3.

Files selected for processing (2)
  • cognee/modules/data/processing/document_types/AudioDocument.py (2 hunks)
  • cognee/modules/data/processing/document_types/PdfDocument.py (2 hunks)
Files skipped from review as they are similar to previous changes (2)
  • cognee/modules/data/processing/document_types/AudioDocument.py
  • cognee/modules/data/processing/document_types/PdfDocument.py

@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 0

Outside diff range, codebase verification and nitpick comments (1)
cognee/modules/data/processing/document_types/PdfDocument.py (1)

17-22: Add Tests for All Chunking Strategies

Currently, only the "paragraph" chunking strategy is tested in PdfDocument. Ensure that all registered chunking strategies, including "sentence", are covered by tests.

  • File: cognee/modules/data/processing/document_types/__tests__/PdfDocument.test.py
  • Action: Add tests for the "sentence" chunking strategy.
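
A test that covers every registered strategy could loop over the registry rather than naming strategies one by one, so newly registered chunkers are picked up automatically. This is a sketch under stated assumptions: the registry below is a simplified stand-in, and check_all_strategies and SAMPLE are hypothetical names, not part of the cognee test suite.

```python
import re
from typing import Callable, Dict, List

# Simplified stand-in for the real registry, for illustration only.
chunking_registry: Dict[str, Callable[[str], List[str]]] = {
    "paragraph": lambda text: [p.strip() for p in text.split("\n\n") if p.strip()],
    "sentence": lambda text: [s for s in re.split(r"(?<=[.!?])\s+", text.strip()) if s],
}

SAMPLE = "First sentence. Second sentence.\n\nNew paragraph here."


def check_all_strategies(registry: Dict[str, Callable[[str], List[str]]]) -> Dict[str, int]:
    """Run every registered chunker on the sample and return chunk counts."""
    counts = {}
    for name, chunk in registry.items():
        chunks = list(chunk(SAMPLE))
        assert chunks, f"strategy '{name}' produced no chunks"
        assert all(isinstance(c, str) and c for c in chunks), f"strategy '{name}' produced an empty chunk"
        counts[name] = len(chunks)
    return counts


counts = check_all_strategies(chunking_registry)
```

In a pytest suite the same idea is usually expressed by parametrizing one test function over the registry keys.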
Analysis chain

Ensure chunking_strategy is correctly utilized.

The chunking_strategy parameter is integrated into the PdfReader class. Verify that all possible strategies are supported and tested.

Scripts executed

The following scripts were executed for the analysis:

Script:

#!/bin/bash
# Description: Verify usage and support for all chunking strategies in PdfReader.

# Test: Search for all chunking functions. Expect: All functions are defined and used.
rg --type python 'def get_chunking_function'

Length of output: 77


Script:

#!/bin/bash
# Correct the search for the `get_chunking_function` definition and references to `chunking_strategy`.

# Search for the definition of `get_chunking_function` to understand supported strategies.
rg 'def get_chunking_function' --glob '*.py' -A 10

# Search for references to `chunking_strategy` to verify its usage in the codebase.
rg 'chunking_strategy' --glob '*.py' -A 5

Length of output: 12071


Script:

#!/bin/bash
# Search for the `chunking_registry` to verify registered strategies.
rg 'chunking_registry' --glob '*.py' -A 10

# Check for tests related to chunking strategies to ensure they are covered.
rg 'chunking_strategy' --glob '*test*.py' -A 5

Length of output: 7328

Review details

Configuration used: CodeRabbit UI
Review profile: CHILL

Commits

Files that changed from the base of the PR and between 7c7b8a3 and 44b9e2b.

Files selected for processing (5)
  • cognee/api/v1/cognify/cognify_v2.py (1 hunks)
  • cognee/modules/data/processing/document_types/AudioDocument.py (3 hunks)
  • cognee/modules/data/processing/document_types/PdfDocument.py (3 hunks)
  • cognee/modules/data/processing/document_types/TextDocument.py (3 hunks)
  • cognee/modules/data/processing/document_types/__tests__/PdfDocument.test.py (1 hunks)
Files skipped from review as they are similar to previous changes (3)
  • cognee/api/v1/cognify/cognify_v2.py
  • cognee/modules/data/processing/document_types/AudioDocument.py
  • cognee/modules/data/processing/document_types/TextDocument.py
Additional comments not posted (3)
cognee/modules/data/processing/document_types/__tests__/PdfDocument.test.py (1)

6-6: Verify test coverage for new chunking strategies.

The addition of the chunking_strategy parameter is appropriate. Ensure that tests cover different strategies to validate the functionality.

cognee/modules/data/processing/document_types/PdfDocument.py (2)

42-42: Utilize dynamic chunking functions.

The use of self.chunking_function allows for dynamic chunking. Ensure that all chunking functions handle text correctly.


96-107: Integrate chunking_strategy in PdfDocument.

The chunking_strategy parameter is integrated into the PdfDocument class. Verify that the parameter is consistently passed to PdfReader.

@Vasilije1990 Vasilije1990 merged commit e494ec6 into main Aug 9, 2024
21 of 24 checks passed