Fix pgvector search #360

Merged
merged 4 commits into dev from fix-pgvector-search on Dec 12, 2024

Conversation

dexters1
Collaborator

@dexters1 dexters1 commented Dec 12, 2024

Summary by CodeRabbit

  • New Features

    • Enhanced instantiation behavior of the PGVectorAdapter class, allowing for multiple instances.
  • Bug Fixes

    • Improved key construction in the existing_edges_map for consistent string formatting.
  • Refactor

    • Streamlined imports and class structure for better accessibility of the Vector type.

dexters1 and others added 2 commits December 12, 2024 11:02
Resolve issue regarding UUID being concatenated instead of string
…en using vector search

The issue occurs when search is called in a session that has not previously added data or created tables, because the import of the Vector column type was missing.

Fix
@dexters1 dexters1 self-assigned this Dec 12, 2024
Contributor

coderabbitai bot commented Dec 12, 2024

Important

Review skipped

Review was skipped due to path filters

⛔ Files ignored due to path filters (1)
  • poetry.lock is excluded by !**/*.lock

CodeRabbit blocks several paths by default. You can override this behavior by explicitly including those paths in the path filters. For example, including **/dist/** will override the default block on the dist directory, by removing the pattern from both the lists.

You can disable this status message by setting the reviews.review_status to false in the CodeRabbit configuration file.

Walkthrough

The changes primarily involve modifications to the PGVectorAdapter class and the retrieve_existing_edges function. The PGVectorAdapter class no longer utilizes the singleton decorator, allowing for multiple instances. Additionally, the import of the Vector class has been moved to the class level for broader accessibility. In the retrieve_existing_edges function, the key construction for the existing_edges_map dictionary has been updated to ensure that the first two components are explicitly converted to strings before concatenation.
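To illustrate why dropping a singleton decorator changes instantiation behavior, here is a minimal sketch. The `singleton` decorator and both adapter classes are hypothetical stand-ins written for this example; the project's actual decorator and the real PGVectorAdapter may be implemented differently.

```python
def singleton(cls):
    """Hypothetical singleton decorator: caches the first instance of cls."""
    instances = {}

    def get_instance(*args, **kwargs):
        # Later calls ignore their arguments and return the cached instance.
        if cls not in instances:
            instances[cls] = cls(*args, **kwargs)
        return instances[cls]

    return get_instance


@singleton
class SingletonAdapter:
    def __init__(self, connection_string):
        self.connection_string = connection_string


class PlainAdapter:
    def __init__(self, connection_string):
        self.connection_string = connection_string


# With the decorator, the second call silently reuses the first instance.
first = SingletonAdapter("postgresql://db-one")
second = SingletonAdapter("postgresql://db-two")

# Without it, each call builds an independent instance.
plain_first = PlainAdapter("postgresql://db-one")
plain_second = PlainAdapter("postgresql://db-two")
```

In this sketch the decorated class keeps pointing at the first connection string, which is the kind of behavior that goes away once the decorator is removed.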

Changes

| File Path | Change Summary |
| --- | --- |
| cognee/infrastructure/databases/vector/pgvector/PGVectorAdapter.py | Removed singleton decorator from PGVectorAdapter, moved Vector import to class level, updated vector column definition to use self.Vector(vector_size) in PGVectorDataPoint. |
| cognee/modules/graph/utils/retrieve_existing_edges.py | Modified key construction in retrieve_existing_edges to convert edge[0] and edge[1] to strings before concatenation. |
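The key-construction change can be illustrated with a short sketch. It assumes each edge is a tuple of (source ID, target ID, relationship name) where the IDs are `UUID` objects; the tuple layout and the sample values are assumptions made for this example, not taken from the PR.

```python
from uuid import UUID

source_id = UUID("12345678-1234-5678-1234-567812345678")
target_id = UUID("87654321-4321-8765-4321-876543218765")
edge = (source_id, target_id, "relates_to")

# UUID objects do not support "+", so building the key directly fails.
try:
    key = edge[0] + edge[1] + edge[2]
except TypeError:
    key = None

# Converting the IDs to strings first produces a plain string key.
existing_edges_map = {}
existing_edges_map[str(edge[0]) + str(edge[1]) + edge[2]] = True
```

The `str()` calls are what make the dictionary key a consistent string regardless of whether the IDs arrive as `UUID` objects or as strings.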

Poem

🐰 In the fields where vectors play,
A singleton hopped away,
Now multiple friends can share the day,
With edges strung in a new array.
Hooray for changes, bright and clear,
A joyful leap, let’s give a cheer! 🎉


Contributor

@coderabbitai coderabbitai bot left a comment

Actionable comments posted: 0

🧹 Outside diff range and nitpick comments (2)
cognee/modules/graph/utils/retrieve_existing_edges.py (1)

56-56: Approve explicit string conversion but suggest a more robust solution

The explicit string conversion for edge components is a good fix for type consistency. However, consider using a more robust key construction method:

-        existing_edges_map[str(edge[0]) + str(edge[1]) + edge[2]] = True
+        existing_edges_map[f"{edge[0]}:{edge[1]}:{edge[2]}"] = True

Using an f-string with a separator makes the key format more explicit and easier to debug.
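A small sketch of why a separator can matter. The helper names and the short string IDs below are hypothetical and chosen only to make the collision visible; with fixed-length UUID strings such a collision is far less likely, and the separator only helps if it cannot appear inside the components themselves.

```python
def concat_key(edge):
    # Plain concatenation: component boundaries are lost.
    return str(edge[0]) + str(edge[1]) + edge[2]


def delimited_key(edge):
    # A separator keeps the three components distinguishable.
    return f"{edge[0]}:{edge[1]}:{edge[2]}"


edge_a = ("ab", "c", "links")
edge_b = ("a", "bc", "links")

same_without_separator = concat_key(edge_a) == concat_key(edge_b)
same_with_separator = delimited_key(edge_a) == delimited_key(edge_b)
```

Here the two different edges map to the same concatenated key but to different delimited keys.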

cognee/infrastructure/databases/vector/pgvector/PGVectorAdapter.py (1)

74-74: Consider refactoring vector search operations

There's an opportunity to improve the vector-related operations:

  1. Extract common vector search logic from get_distance_from_collection_elements and search methods
  2. Implement the TODO for similarity score normalization

Example refactor for common search logic:

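from typing import List, Optional

from sqlalchemy import select

async def _execute_vector_search(
    self,
    collection_name: str,
    query_vector: List[float],
    limit: Optional[int] = None,
) -> List[tuple]:
    # Load the table object for the requested collection.
    PGVectorDataPoint = await self.get_table(collection_name)

    # The label holds a cosine distance, so ascending order puts the closest vectors first.
    query = select(
        PGVectorDataPoint,
        PGVectorDataPoint.c.vector.cosine_distance(query_vector).label("similarity"),
    ).order_by("similarity")

    # Compare against None so that an explicit limit of 0 is not silently ignored.
    if limit is not None:
        query = query.limit(limit)

    async with self.get_async_session() as session:
        results = await session.execute(query)
        return results.all()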

For similarity score normalization:

def _normalize_similarity_scores(self, results: List[tuple]) -> List[ScoredResult]:
    similarities = [row.similarity for row in results]
    normalized_scores = normalize_distances(similarities)
    
    return [
        ScoredResult(
            id=UUID(str(row.id)),
            payload=row.payload,
            score=normalized_scores[idx]
        ) for idx, row in enumerate(results)
    ]

Also applies to: 111-111
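The normalization example above calls a `normalize_distances` helper whose implementation is not shown. As a rough sketch of one thing such a helper could do, the function below applies min-max scaling; its name and behavior are assumptions for illustration, not the project's actual utility.

```python
from typing import List


def normalize_distances_sketch(distances: List[float]) -> List[float]:
    """Min-max scale distances into [0, 1] (hypothetical stand-in)."""
    if not distances:
        return []
    lowest, highest = min(distances), max(distances)
    if highest == lowest:
        # All distances are equal: avoid dividing by zero.
        return [0.0 for _ in distances]
    span = highest - lowest
    return [(distance - lowest) / span for distance in distances]


scores = normalize_distances_sketch([0.5, 1.0, 1.5])
```

With min-max scaling the smallest distance maps to 0 and the largest to 1, so the values stay distances (lower is closer) rather than becoming similarities.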

📜 Review details

Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 3935660 and 599e1d4.

📒 Files selected for processing (2)
  • cognee/infrastructure/databases/vector/pgvector/PGVectorAdapter.py (3 hunks)
  • cognee/modules/graph/utils/retrieve_existing_edges.py (1 hunks)
🔇 Additional comments (2)
cognee/infrastructure/databases/vector/pgvector/PGVectorAdapter.py (2)

74-74: Approve consistent Vector column definitions

The vector column definitions now consistently use self.Vector, which aligns with the class-level import.

Also applies to: 111-111
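The PR's actual code is not shown in this conversation, so the following is only one possible reading of the pattern: the adapter stores the column type on the instance, and every method reaches it through `self.Vector`. `FakeVector` and `AdapterSketch` are hypothetical stand-ins so the sketch runs without pgvector installed.

```python
class FakeVector:
    """Stand-in for a vector column type such as pgvector's Vector."""

    def __init__(self, dim):
        self.dim = dim


class AdapterSketch:
    """Hypothetical adapter that exposes the column type as self.Vector."""

    def __init__(self):
        # In a real adapter this is where the pgvector type would be imported
        # and stored, so it is available before any table is created or read.
        self.Vector = FakeVector

    def vector_column(self, vector_size):
        # Column definitions go through self.Vector, whichever method runs first.
        return self.Vector(vector_size)


adapter = AdapterSketch()
column = adapter.vector_column(3)
```

Because the type is attached when the adapter is constructed, a search that runs before any data has been added can still resolve it.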


41-44: Approve class-level Vector import but verify table reflection

Moving the Vector import to class level is good practice. However, let's verify that all table operations handle the Vector type correctly.

✅ Verification successful

Let me analyze the usage of Vector across different adapters to ensure there are no conflicts or inconsistencies.


Vector import and usage are correctly implemented

The verification shows that:

  • The Vector import in PGVectorAdapter is unique and doesn't conflict with other adapters
  • LanceDB uses a different Vector class from lancedb.pydantic
  • All Vector column definitions in PGVectorAdapter use self.Vector, ensuring consistent usage
  • Table operations properly handle the Vector type through dynamic table loading with the get_table method
🏁 Scripts executed

The following scripts were executed for the analysis:

Script:

#!/bin/bash
# Description: Check if Vector type is used consistently across the codebase

# Look for any other Vector imports that might conflict
rg "from pgvector.sqlalchemy import Vector"

# Check for direct Vector usage without self.Vector
rg "Vector\(" --type py

Length of output: 969


Script:

#!/bin/bash
# Check the full context of Vector usage in LanceDB adapter
ast-grep --pattern 'from $_ import Vector'

# Get more context around Vector usage in LanceDB
rg "Vector" -B 2 -A 2 cognee/infrastructure/databases/vector/lancedb/LanceDBAdapter.py

# Check if there are any table reflection or dynamic table loading code
rg "Table|MetaData" --type py cognee/infrastructure/databases/vector/pgvector/PGVectorAdapter.py

Length of output: 1856

@dexters1 dexters1 requested a review from hajdul88 December 12, 2024 12:42
Resolve issue with poetry lock

Fix
@dexters1 dexters1 merged commit ec38404 into dev Dec 12, 2024
24 checks passed
@dexters1 dexters1 deleted the fix-pgvector-search branch December 12, 2024 15:10