Enable different chunking methods #128
Conversation
Walkthrough

The recent updates introduce significant enhancements across various modules, focusing on improving document handling and processing capabilities. Key changes include the addition of a customizable chunking strategy for document readers, along with new tasks for chunk translation via AWS Translate and document language detection.
Sequence Diagram(s)

sequenceDiagram
participant User
participant CognifyPipeline
participant DocumentReader
participant AWS_Translate
participant LanguageDetector
User->>CognifyPipeline: Initiate document processing
CognifyPipeline->>DocumentReader: Create document instance
DocumentReader->>DocumentReader: Apply chunking strategy
DocumentReader->>AWS_Translate: Translate text
AWS_Translate-->>DocumentReader: Return translated text
DocumentReader->>LanguageDetector: Detect document language
LanguageDetector-->>DocumentReader: Return language code
DocumentReader-->>CognifyPipeline: Processed document output
CognifyPipeline-->>User: Return results
Actionable comments posted: 3
Outside diff range, codebase verification and nitpick comments (3)
cognee/modules/data/processing/document_types/Document.py (1)
9-9: Add documentation for chunking_strategy.

Consider adding comments or documentation to explain the purpose and possible values of the chunking_strategy attribute. This will help other developers understand how to use this feature effectively.

cognee/tasks/chunking/chunking_registry.py (1)
3-7: Add type hints to register_chunking_function.

Consider adding type hints to the register_chunking_function decorator to improve code readability and maintainability.

from typing import Callable

def register_chunking_function(name: str) -> Callable:
    def decorator(func: Callable) -> Callable:
        chunking_registry[name] = func
        return func
    return decorator

cognee/base_config.py (1)
12-13: Ensure secure handling of AWS credentials.

Storing AWS credentials directly in the configuration file can pose security risks. Consider using environment variables or a secure vault to manage these credentials.
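One way this could look, as a hedged sketch: boto3 resolves credentials from the environment (or an attached IAM role) on its own, so nothing secret has to live in base_config.py. The make_translate_client helper below is hypothetical, not part of this PR.

import os

import boto3

def make_translate_client(region_name: str = "eu-west-1"):
    # Credentials come from AWS_ACCESS_KEY_ID / AWS_SECRET_ACCESS_KEY or an attached role;
    # the region can likewise be overridden via the environment.
    return boto3.client("translate", region_name=os.environ.get("AWS_REGION", region_name))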
Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
Files selected for processing (14)
- cognee/api/v1/cognify/cognify_v2.py (1 hunks)
- cognee/base_config.py (1 hunks)
- cognee/infrastructure/data/chunking/DefaultChunkEngine.py (1 hunks)
- cognee/modules/data/processing/document_types/AudioDocument.py (2 hunks)
- cognee/modules/data/processing/document_types/Document.py (1 hunks)
- cognee/modules/data/processing/document_types/ImageDocument.py (1 hunks)
- cognee/modules/data/processing/document_types/PdfDocument.py (2 hunks)
- cognee/modules/data/processing/document_types/TextDocument.py (2 hunks)
- cognee/tasks/chunk_translate/translate_chunk.py (1 hunks)
- cognee/tasks/chunking/tests/chunk_by_paragraph.test.py (1 hunks)
- cognee/tasks/chunking/chunk_by_paragraph.py (1 hunks)
- cognee/tasks/chunking/chunk_by_sentence.py (1 hunks)
- cognee/tasks/chunking/chunking_registry.py (1 hunks)
- cognee/tasks/document_language_detection/document_language_detection.py (1 hunks)
Files skipped from review due to trivial changes (2)
- cognee/infrastructure/data/chunking/DefaultChunkEngine.py
- cognee/tasks/chunking/tests/chunk_by_paragraph.test.py
Additional comments not posted (15)
cognee/tasks/chunking/chunk_by_sentence.py (2)
8-8: Decorator addition approved.

The addition of the @register_chunking_function("sentence") decorator is a good approach to integrate the function into a chunking strategy registry.
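For readers unfamiliar with the pattern, a self-contained sketch of how such a registry-backed decorator works; the chunking body here is a placeholder, not the actual chunk_by_sentence implementation:

from typing import Callable, Dict

chunking_registry: Dict[str, Callable] = {}

def register_chunking_function(name: str) -> Callable:
    def decorator(func: Callable) -> Callable:
        chunking_registry[name] = func  # map the strategy name to the function
        return func
    return decorator

@register_chunking_function("sentence")
def chunk_by_sentence(text: str):
    # Placeholder logic only; the real task splits on proper sentence boundaries.
    return text.split(". ")

print(chunking_registry["sentence"]("First sentence. Second sentence."))
# -> ['First sentence', 'Second sentence.']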
Line range hint 8-38: Verify the registration process.

Ensure that the register_chunking_function decorator correctly registers the function and that the chunking strategy registry is functioning as intended.

Verification successful

Registration Process Verified

The register_chunking_function decorator correctly registers the chunk_by_sentence function under the "sentence" strategy. The implementation ensures that the function is added to the chunking_registry dictionary as expected. No issues were found with the registration process.

Scripts executed

The following scripts were executed for the analysis:

Script:

#!/bin/bash
# Description: Verify the registration of `chunk_by_sentence` in the chunking strategy registry.
# Test: Check the registration process. Expect: The function to be registered under the "sentence" strategy.
rg --type python 'register_chunking_function\("sentence"\)'

Length of output: 92

Script:

#!/bin/bash
# Description: Verify the registration of `chunk_by_sentence` in the chunking strategy registry.
# Find all instances of the `register_chunking_function` decorator with "sentence".
rg '@register_chunking_function\("sentence"\)' -A 2
# Find the implementation of the `register_chunking_function` decorator.
rg 'def register_chunking_function' -A 10

Length of output: 883
cognee/tasks/chunking/chunk_by_paragraph.py (1)
3-5: Decorator addition approved.

The addition of @register_chunking_function("paragraph") enhances the integration of chunk_by_paragraph into the chunking framework without altering its logic.

cognee/modules/data/processing/document_types/TextDocument.py (3)
3-5: Import addition approved.

The import of get_chunking_function is necessary for the new chunking strategy functionality.

11-17: Constructor changes approved.

The addition of chunking_strategy and the initialization of chunking_function enhance the flexibility of the TextReader class.

48-48: Chunking function usage approved.

The use of self.chunking_function in the read method allows for dynamic chunking based on the specified strategy, enhancing flexibility.
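To make the flow concrete, a hedged sketch of the constructor/read pattern these readers follow; the file handling and signatures are illustrative (and assume a populated registry), not the actual TextReader code:

from typing import Callable, Dict, Iterator, Optional

chunking_registry: Dict[str, Callable[[str], Iterator[str]]] = {}

def get_chunking_function(name: str) -> Optional[Callable[[str], Iterator[str]]]:
    return chunking_registry.get(name)

class TextReader:
    def __init__(self, file_path: str, chunking_strategy: str = "paragraph"):
        self.file_path = file_path
        # Resolve the strategy name to a concrete callable once, in the constructor.
        self.chunking_function = get_chunking_function(chunking_strategy)

    def read(self) -> Iterator[str]:
        with open(self.file_path, encoding="utf-8") as file:
            text = file.read()
        # read() stays strategy-agnostic and simply delegates to the resolved callable.
        yield from self.chunking_function(text)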
cognee/modules/data/processing/document_types/PdfDocument.py (3)

9-9: Import addition approved.

The import of get_chunking_function is necessary for the new chunking strategy functionality.

16-20: Constructor changes approved.

The addition of chunking_strategy and the initialization of chunking_function enhance the flexibility of the PdfReader class.

41-41: Chunking function usage approved.

The use of self.chunking_function in the read method allows for dynamic chunking based on the specified strategy, enhancing flexibility.

cognee/modules/data/processing/document_types/AudioDocument.py (2)
16-20: Constructor changes approved.

The addition of the chunking_strategy parameter and the initialization of chunking_function using get_chunking_function enhance flexibility.

43-43: Verify the usage of chunking_function.

Ensure that chunking_function is correctly implemented and used across different chunking strategies.

Verification successful

chunking_function is correctly implemented and used across document types

The chunking_function is consistently used across different document types (AudioDocument, PdfDocument, ImageDocument, TextDocument) by retrieving it through get_chunking_function based on the specified chunking_strategy. This indicates a well-structured approach to handling different chunking strategies.

- cognee/modules/data/processing/document_types/AudioDocument.py
- cognee/modules/data/processing/document_types/PdfDocument.py
- cognee/modules/data/processing/document_types/ImageDocument.py
- cognee/modules/data/processing/document_types/TextDocument.py

Scripts executed

The following scripts were executed for the analysis:

Script:

#!/bin/bash
# Description: Verify the implementation and usage of `chunking_function` for different strategies.
# Test: Search for the `get_chunking_function` usage. Expect: Correct implementation and usage.
rg --type python -A 5 $'get_chunking_function'

Length of output: 76

Script:

#!/bin/bash
# Description: Verify the implementation and usage of `chunking_function` for different strategies.
# Test: Search for the `get_chunking_function` usage. Expect: Correct implementation and usage.
rg --type py -A 5 'get_chunking_function'

Length of output: 4591
cognee/modules/data/processing/document_types/ImageDocument.py (2)
17-21: Constructor changes approved.

The addition of the chunking_strategy parameter and the initialization of chunking_function using get_chunking_function enhance flexibility.

21-21: Verify the usage of chunking_function.

Ensure that chunking_function is correctly implemented and used across different chunking strategies.

Verification successful

Verification of chunking_function usage is successful.

The chunking_function is correctly implemented and consistently used across different document types (TextDocument, ImageDocument, AudioDocument, PdfDocument). Each document type imports and utilizes the function in a similar manner to handle chunking strategies. No issues were found.

Scripts executed

The following scripts were executed for the analysis:

Script:

#!/bin/bash
# Description: Verify the implementation and usage of `chunking_function` for different strategies.
# Test: Search for the `get_chunking_function` usage. Expect: Correct implementation and usage.
rg --type python -A 5 $'get_chunking_function'

Length of output: 76

Script:

#!/bin/bash
# Corrected script to search for the definition and usage of `get_chunking_function`.
# Search for the definition of `get_chunking_function`.
rg 'def get_chunking_function' --type=py -A 5
# Search for the usage of `get_chunking_function`.
rg 'get_chunking_function' --type=py -A 5

Length of output: 4802
cognee/api/v1/cognify/cognify_v2.py (2)
58-61: Integration of chunking_strategy approved.

The addition of the chunking_strategy parameter to document constructors enhances flexibility for future processing.

58-61: Verify the integration of chunking_strategy.

Ensure that the chunking_strategy parameter is correctly integrated and doesn't affect existing functionality.
def get_chunking_function(name: str):
    return chunking_registry.get(name)
Add error handling for get_chunking_function.

Consider adding error handling or logging if a requested chunking function is not found in the registry. This will help diagnose issues when an invalid function name is used.
def get_chunking_function(name: str):
    func = chunking_registry.get(name)
    if func is None:
        raise ValueError(f"Chunking function '{name}' not found in registry.")
    return func
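With that guard in place, an unknown strategy name fails fast instead of surfacing later as a call on None; a small illustrative use ("token" is a hypothetical, unregistered name):

try:
    chunker = get_chunking_function("token")
except ValueError as error:
    print(error)  # -> Chunking function 'token' not found in registry.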
async def detect_language(data:str):
    """
    Detect the language of the given text and return its ISO 639-1 language code.
    If the detected language is Croatian ('hr'), it maps to Serbian ('sr').
    The text is trimmed to the first 100 characters for efficient processing.
    Parameters:
    text (str): The text for language detection.
    Returns:
    str: The ISO 639-1 language code of the detected language, or 'None' in case of an error.
    """

    # Trim the text to the first 100 characters
    from langdetect import detect, LangDetectException
    trimmed_text = data[:100]

    try:
        # Detect the language using langdetect
        detected_lang_iso639_1 = detect(trimmed_text)
        logging.info(f"Detected ISO 639-1 code: {detected_lang_iso639_1}")

        # Special case: map 'hr' (Croatian) to 'sr' (Serbian ISO 639-2)
        if detected_lang_iso639_1 == 'hr':
            yield 'sr'
        yield detected_lang_iso639_1

    except LangDetectException as e:
        logging.error(f"Language detection error: {e}")
    except Exception as e:
        logging.error(f"Unexpected error: {e}")

    yield None
Refactor detect_language to use return instead of yield.

Using yield in an async function is unconventional and may cause unexpected behavior. Consider refactoring to use return or an async-compatible structure.
-async def detect_language(data:str):
+async def detect_language(data: str) -> str:
 ...
-        if detected_lang_iso639_1 == 'hr':
-            yield 'sr'
-        yield detected_lang_iso639_1
+        return 'sr' if detected_lang_iso639_1 == 'hr' else detected_lang_iso639_1
 ...
-    yield None
+    return None
Committable suggestion
‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.
async def detect_language(data: str) -> str:
    """
    Detect the language of the given text and return its ISO 639-1 language code.
    If the detected language is Croatian ('hr'), it maps to Serbian ('sr').
    The text is trimmed to the first 100 characters for efficient processing.
    Parameters:
    text (str): The text for language detection.
    Returns:
    str: The ISO 639-1 language code of the detected language, or 'None' in case of an error.
    """
    # Trim the text to the first 100 characters
    from langdetect import detect, LangDetectException
    trimmed_text = data[:100]

    try:
        # Detect the language using langdetect
        detected_lang_iso639_1 = detect(trimmed_text)
        logging.info(f"Detected ISO 639-1 code: {detected_lang_iso639_1}")

        # Special case: map 'hr' (Croatian) to 'sr' (Serbian ISO 639-2)
        return 'sr' if detected_lang_iso639_1 == 'hr' else detected_lang_iso639_1

    except LangDetectException as e:
        logging.error(f"Language detection error: {e}")
    except Exception as e:
        logging.error(f"Unexpected error: {e}")

    return None
async def translate_text(data, source_language:str='sr', target_language:str='en', region_name='eu-west-1'):
    """
    Translate text from source language to target language using AWS Translate.
    Parameters:
    data (str): The text to be translated.
    source_language (str): The source language code (e.g., 'sr' for Serbian). ISO 639-2 Code https://www.loc.gov/standards/iso639-2/php/code_list.php
    target_language (str): The target language code (e.g., 'en' for English). ISO 639-2 Code https://www.loc.gov/standards/iso639-2/php/code_list.php
    region_name (str): AWS region name.
    Returns:
    str: Translated text or an error message.
    """
    import boto3
    from botocore.exceptions import BotoCoreError, ClientError

    if not data:
        yield "No text provided for translation."

    if not source_language or not target_language:
        yield "Both source and target language codes are required."

    try:
        translate = boto3.client(service_name='translate', region_name=region_name, use_ssl=True)
        result = translate.translate_text(Text=data, SourceLanguageCode=source_language, TargetLanguageCode=target_language)
        yield result.get('TranslatedText', 'No translation found.')

    except BotoCoreError as e:
        logging.info(f"BotoCoreError occurred: {e}")
        yield "Error with AWS Translate service configuration or request."

    except ClientError as e:
        logging.info(f"ClientError occurred: {e}")
        yield "Error with AWS client or network issue."
Refactor translate_text to use return instead of yield.

Using yield in an async function is unconventional and may cause unexpected behavior. Consider refactoring to use return or an async-compatible structure.
-async def translate_text(data, source_language:str='sr', target_language:str='en', region_name='eu-west-1'):
+async def translate_text(data, source_language: str = 'sr', target_language: str = 'en', region_name: str = 'eu-west-1') -> str:
 ...
-        yield "No text provided for translation."
+        return "No text provided for translation."
 ...
-        yield "Both source and target language codes are required."
+        return "Both source and target language codes are required."
 ...
-        yield result.get('TranslatedText', 'No translation found.')
+        return result.get('TranslatedText', 'No translation found.')
 ...
-        yield "Error with AWS Translate service configuration or request."
+        return "Error with AWS Translate service configuration or request."
 ...
-        yield "Error with AWS client or network issue."
+        return "Error with AWS client or network issue."
Committable suggestion
‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.
async def translate_text(data, source_language: str = 'sr', target_language: str = 'en', region_name: str = 'eu-west-1') -> str:
    """
    Translate text from source language to target language using AWS Translate.
    Parameters:
    data (str): The text to be translated.
    source_language (str): The source language code (e.g., 'sr' for Serbian). ISO 639-2 Code https://www.loc.gov/standards/iso639-2/php/code_list.php
    target_language (str): The target language code (e.g., 'en' for English). ISO 639-2 Code https://www.loc.gov/standards/iso639-2/php/code_list.php
    region_name (str): AWS region name.
    Returns:
    str: Translated text or an error message.
    """
    import boto3
    from botocore.exceptions import BotoCoreError, ClientError

    if not data:
        return "No text provided for translation."

    if not source_language or not target_language:
        return "Both source and target language codes are required."

    try:
        translate = boto3.client(service_name='translate', region_name=region_name, use_ssl=True)
        result = translate.translate_text(Text=data, SourceLanguageCode=source_language, TargetLanguageCode=target_language)
        return result.get('TranslatedText', 'No translation found.')

    except BotoCoreError as e:
        logging.info(f"BotoCoreError occurred: {e}")
        return "Error with AWS Translate service configuration or request."

    except ClientError as e:
        logging.info(f"ClientError occurred: {e}")
        return "Error with AWS client or network issue."
Actionable comments posted: 0
Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
Files selected for processing (2)
- cognee/modules/data/processing/document_types/AudioDocument.py (2 hunks)
- cognee/modules/data/processing/document_types/PdfDocument.py (2 hunks)
Files skipped from review as they are similar to previous changes (2)
- cognee/modules/data/processing/document_types/AudioDocument.py
- cognee/modules/data/processing/document_types/PdfDocument.py
Actionable comments posted: 0
Outside diff range, codebase verification and nitpick comments (1)
cognee/modules/data/processing/document_types/PdfDocument.py (1)
17-22: Add Tests for All Chunking Strategies

Currently, only the "paragraph" chunking strategy is tested in PdfDocument. Ensure that all registered chunking strategies, including "sentence", are covered by tests.

- File: cognee/modules/data/processing/document_types/__tests__/PdfDocument.test.py
- Action: Add tests for the "sentence" chunking strategy (a rough sketch follows below).
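A lightweight starting point could exercise the "sentence" strategy directly via the registry; the sketch below only relies on the registration shown earlier in this review. A full PdfDocument test would mirror the existing "paragraph" test's constructor arguments, passing chunking_strategy="sentence".

from cognee.tasks.chunking.chunking_registry import get_chunking_function
from cognee.tasks.chunking.chunk_by_sentence import chunk_by_sentence  # importing the module registers the strategy

def test_sentence_strategy_is_registered():
    # Minimal check: the "sentence" strategy resolves to the registered function.
    assert get_chunking_function("sentence") is chunk_by_sentence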
Analysis chain
Ensure chunking_strategy is correctly utilized.

The chunking_strategy parameter is integrated into the PdfReader class. Verify that all possible strategies are supported and tested.

Scripts executed

The following scripts were executed for the analysis:

Script:

#!/bin/bash
# Description: Verify usage and support for all chunking strategies in PdfReader.
# Test: Search for all chunking functions. Expect: All functions are defined and used.
rg --type python 'def get_chunking_function'

Length of output: 77

Script:

#!/bin/bash
# Correct the search for the `get_chunking_function` definition and references to `chunking_strategy`.
# Search for the definition of `get_chunking_function` to understand supported strategies.
rg 'def get_chunking_function' --glob '*.py' -A 10
# Search for references to `chunking_strategy` to verify its usage in the codebase.
rg 'chunking_strategy' --glob '*.py' -A 5

Length of output: 12071

Script:

#!/bin/bash
# Search for the `chunking_registry` to verify registered strategies.
rg 'chunking_registry' --glob '*.py' -A 10
# Check for tests related to chunking strategies to ensure they are covered.
rg 'chunking_strategy' --glob '*test*.py' -A 5

Length of output: 7328
Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
Files selected for processing (5)
- cognee/api/v1/cognify/cognify_v2.py (1 hunks)
- cognee/modules/data/processing/document_types/AudioDocument.py (3 hunks)
- cognee/modules/data/processing/document_types/PdfDocument.py (3 hunks)
- cognee/modules/data/processing/document_types/TextDocument.py (3 hunks)
- cognee/modules/data/processing/document_types/tests/PdfDocument.test.py (1 hunks)
Files skipped from review as they are similar to previous changes (3)
- cognee/api/v1/cognify/cognify_v2.py
- cognee/modules/data/processing/document_types/AudioDocument.py
- cognee/modules/data/processing/document_types/TextDocument.py
Additional comments not posted (3)
cognee/modules/data/processing/document_types/__tests__/PdfDocument.test.py (1)
6-6: Verify test coverage for new chunking strategies.

The addition of the chunking_strategy parameter is appropriate. Ensure that tests cover different strategies to validate the functionality.

cognee/modules/data/processing/document_types/PdfDocument.py (2)
42-42: Utilize dynamic chunking functions.

The use of self.chunking_function allows for dynamic chunking. Ensure that all chunking functions handle text correctly.

96-107: Integrate chunking_strategy in PdfDocument.

The chunking_strategy parameter is integrated into the PdfDocument class. Verify that the parameter is consistently passed to PdfReader.
Summary by CodeRabbit
New Features
Bug Fixes
Documentation
Chores