-
Notifications
You must be signed in to change notification settings - Fork 88
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Code graph pipeline improvements and fixes #414
Changes from 5 commits
a8a83ff
399faf9
a6dfff8
4cee9a1
dbc33a6
5e79dc5
4802567
fbf8fc9
a774191
fb13a1b
0dec704
8ffef50
f4397bf
34a9267
97814e3
d1d81eb
2e37f85
5635da6
6762039
abb3ea6
cdaae16
626bc76
d7b2186
18bb282
7e10349
a11b914
5839ab0
File filter
Filter by extension
Conversations
Jump to
Diff view
Diff view
There are no files selected for viewing
Original file line number | Diff line number | Diff line change | ||||||||||||||||
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
@@ -1,3 +1,4 @@ | ||||||||||||||||||
from typing import Optional | ||||||||||||||||||
from uuid import UUID | ||||||||||||||||||
|
||||||||||||||||||
from cognee.infrastructure.engine import DataPoint | ||||||||||||||||||
|
@@ -13,5 +14,5 @@ class Document(DataPoint): | |||||||||||||||||
"type": "Document" | ||||||||||||||||||
} | ||||||||||||||||||
|
||||||||||||||||||
def read(self, chunk_size: int, chunker = str) -> str: | ||||||||||||||||||
def read(self, chunk_size: int, embedding_model: Optional[str], max_tokens: Optional[int], chunker = str) -> str: | ||||||||||||||||||
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. Update default values for new parameters in The new parameters Apply this diff to set default values: def read(
self,
chunk_size: int,
- embedding_model: Optional[str],
- max_tokens: Optional[int],
+ embedding_model: Optional[str] = None,
+ max_tokens: Optional[int] = None,
chunker=str,
) -> str:
pass 📝 Committable suggestion
Suggested change
🧰 Tools🪛 GitHub Actions: ruff format[error] 308 files require formatting according to Ruff standards. Run Ruff formatter to fix formatting issues. 💡 Codebase verification Update missing read methods in ImageDocument and AudioDocument The following classes need to implement the read method with the updated signature
🔗 Analysis chainEnsure all subclasses override the updated The The pipeline failure indicates that Run the following script to identify classes that need updating: 🏁 Scripts executedThe following scripts were executed for the analysis: Script: #!/bin/bash
# Description: Find all subclasses of Document that need to update the `read` method.
# Find files containing classes that inherit from Document but do not have updated `read` methods
fd --type f --extension py '.*' | xargs grep -l 'class .*Document(Document):' | xargs grep -L 'def read(self, chunk_size'
# Alternatively, use ast-grep to find method definitions
ast-grep --pattern 'class $CLASS_NAME(Document): {$$$ def read(self, chunk_size: int, $$$): $$$ }' --ignore 'test_*'
Length of output: 490 Script: #!/bin/bash
# Find Document subclasses
echo "=== Document subclasses ==="
ast-grep --pattern 'class $NAME(Document)'
echo -e "\n=== Read method implementations ==="
# Find read method implementations
ast-grep --pattern 'def read(self, $$$)'
echo -e "\n=== Backup search with ripgrep ==="
# Backup search with ripgrep
rg "class.*\(Document\)" -A 5
Length of output: 3351 🧰 Tools🪛 GitHub Actions: ruff format[error] 308 files require formatting according to Ruff standards. Run Ruff formatter to fix formatting issues. |
||||||||||||||||||
pass |
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -1,7 +1,15 @@ | ||
from typing import Optional | ||
|
||
from cognee.modules.data.processing.document_types.Document import Document | ||
|
||
|
||
async def extract_chunks_from_documents(documents: list[Document], chunk_size: int = 1024, chunker = 'text_chunker'): | ||
async def extract_chunks_from_documents( | ||
documents: list[Document], | ||
chunk_size: int = 1024, | ||
chunker='text_chunker', | ||
embedding_model: Optional[str] = None, | ||
max_tokens: Optional[int] = None, | ||
): | ||
for document in documents: | ||
for document_chunk in document.read(chunk_size = chunk_size, chunker = chunker): | ||
for document_chunk in document.read(chunk_size=chunk_size, chunker=chunker, embedding_model=embedding_model, max_tokens=max_tokens): | ||
alekszievr marked this conversation as resolved.
Show resolved
Hide resolved
|
||
yield document_chunk |
Original file line number | Diff line number | Diff line change |
---|---|---|
|
@@ -8,9 +8,61 @@ async def main(repo_path, include_docs): | |
async for result in run_code_graph_pipeline(repo_path, include_docs): | ||
print(result) | ||
|
||
if __name__ == "__main__": | ||
def parse_args(): | ||
parser = argparse.ArgumentParser() | ||
parser.add_argument("--repo_path", type=str, required=True, help="Path to the repository") | ||
parser.add_argument("--include_docs", type=lambda x: x.lower() in ("true", "1"), default=True, help="Whether or not to process non-code files") | ||
args = parser.parse_args() | ||
asyncio.run(main(args.repo_path, args.include_docs)) | ||
parser.add_argument( | ||
"--repo_path", | ||
type=str, | ||
required=True, | ||
help="Path to the repository" | ||
) | ||
parser.add_argument( | ||
"--include_docs", | ||
type=lambda x: x.lower() in ("true", "1"), | ||
default=True, | ||
help="Whether or not to process non-code files" | ||
) | ||
parser.add_argument( | ||
"--mock_embedding", | ||
type=lambda x: x.lower() in ("true", "1"), | ||
default=True, | ||
help="Whether or not to mock embedding and code summary" | ||
) | ||
parser.add_argument( | ||
"--mock_code_summary", | ||
type=lambda x: x.lower() in ("true", "1"), | ||
default=True, | ||
help="Whether or not to mock code summary" | ||
) | ||
parser.add_argument( | ||
"--time", | ||
type=lambda x: x.lower() in ("true", "1"), | ||
default=True, | ||
help="Whether or not to time the pipeline run" | ||
) | ||
return parser.parse_args() | ||
|
||
if __name__ == "__main__": | ||
import os | ||
|
||
args = parse_args() | ||
|
||
if args.mock_embedding: | ||
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. This I wouldn't do. It can cause misunderstanding later, we should either use envs or args. |
||
os.environ["MOCK_EMBEDDING"] = "true" | ||
print("Mocking embedding.") | ||
|
||
if args.mock_code_summary: | ||
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. Same here |
||
os.environ["MOCK_CODE_SUMMARY"] = "true" | ||
print("Mocking code summary.") | ||
|
||
if args.time: | ||
import time | ||
start_time = time.time() | ||
asyncio.run(main(args.repo_path, args.include_docs)) | ||
end_time = time.time() | ||
print("\n" + "="*50) | ||
print(f"Pipeline Execution Time: {end_time - start_time:.2f} seconds") | ||
print("="*50 + "\n") | ||
else: | ||
asyncio.run(main(args.repo_path, args.include_docs)) | ||
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
As the embedding_engine shouldn't change in code and should only be changed through environment configuration I think we should remove it as a parameter to functions and instead get it dynamically with the getter when needed