-
Notifications
You must be signed in to change notification settings - Fork 4.3k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Add rag enrichment and ingestion. #33413
base: master
Are you sure you want to change the base?
Conversation
e50d908
to
e966cb7
Compare
ba5ef2a
to
b759659
Compare
b759659
to
b2ad7ac
Compare
R: @damccorm |
b2ad7ac
to
589023f
Compare
Stopping reviewer notifications for this pull request: review requested by someone other than the bot, ceding control. If you'd like to restart, comment |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is awesome! Comments are all minor, I'm really excited with how this turned out!
embedding: ARRAY<FLOAT64> # Vector embedding | ||
content: STRING # Document content | ||
metadata: ARRAY<STRUCT< # Nested repeated metadata |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
metadata: ARRAY<STRUCT< # Nested repeated metadata | |
metadata: ARRAY<STRUCT> # Nested repeated metadata |
Nit
... # chunk.metadata['language'] | ||
... metadata_restriction_template="language = '{language}'" | ||
... ) | ||
>>> # When processing a chunk with metadata={'language': 'en'}, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think this could be described a bit more - specifically, I don't think its clear just looking at this object what a Chunk is, and that it will be used by the Enrichment transform/BigQueryVectorSearchEnrichmentHandler.
Maybe at the top of this class, we could mention those 2 concepts (with pydoc links)?
... metadata_restriction_template=( | ||
... "check_metadata(metadata, 'language', '{language}')" | ||
... ) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This should just be a string, right?
... metadata_restriction_template=( | |
... "check_metadata(metadata, 'language', '{language}')" | |
... ) | |
... metadata_restriction_template="check_metadata(metadata, 'language', '{language}')" |
embedding_column: Column name containing the embedding vectors. | ||
columns: List of columns to retrieve from matched vectors. | ||
neighbor_count: Number of similar vectors to return (top-k). | ||
metadata_restriction_template: Template string for filtering vectors. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think composing multiple restrictions should work fine, right? It would basically be: metadata_restriction_template="check_metadata(metadata, 'language', '{language}') AND owner = '{companyName}'"
.
If yes, it would be good to document it. We could also consider taking a list of restrictions, though it is less expressive since it only supports AND, so maybe not worth doing
""" | ||
def __init__( | ||
self, | ||
project: str, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Nit: If we're putting the table name in vector_search_parameters, from a user's perspective I think it makes sense to include the project id as well (even if we just immediately extract that into its own variable in init). Batching kwargs seem fine to keep out since its more of a beam detail than a BQ one.
- Setting up database connection/client | ||
- Writing with appropriate batching/error handling | ||
""" | ||
pass |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Nit: It is probably better to throw here like
raise NotImplementedError(type(self)) |
... ) | ||
""" | ||
@abstractmethod | ||
def create_write_transform(self) -> beam.PTransform: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can we add types to this PTransform? Even if it is beam.PTransform[Chunk, Any]
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Same comment applies below as well, and maybe elsewhere
@@ -0,0 +1,304 @@ | |||
# |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Could we call this out in CHANGES.md (both the enrichment and ingestion pieces)?
Add ingestion and enrichment components for rag pipelines.
Thank you for your contribution! Follow this checklist to help us incorporate your contribution quickly and easily:
addresses #123
), if applicable. This will automatically add a link to the pull request in the issue. If you would like the issue to automatically close on merging the pull request, commentfixes #<ISSUE NUMBER>
instead.CHANGES.md
with noteworthy changes.See the Contributor Guide for more tips on how to make review process smoother.
To check the build health, please visit https://github.com/apache/beam/blob/master/.test-infra/BUILD_STATUS.md
GitHub Actions Tests Status (on master branch)
See CI.md for more information about GitHub Actions CI or the workflows README to see a list of phrases to trigger workflows.