Add indexing function for indexing files to vector search #532
Conversation
This pull request has been linked to Shortcut Story #42043: Trigger Task Graph for Indexing (Ingestion).
Several comments in the first pass.
This is also completely missing interaction with TileDB Files. The requirement is to support loading into the TileDB Files store and performing indexing, not just indexing.
I don't think we have discussed requirements for this. This needs to be designed in collaboration with cloud, and we need to understand who owns the file ingestion code and implementation.
When you are back we need to sync on this. I believe we had discussed this, and the example POC code I provided handled all of this. The goal was to take the POC we used and re-implement its features in a production fashion. The first goal is to support file ingestion and index creation as part of one pipeline.
We should avoid nested functions, especially when they are only used once throughout the body of the method.
Functions like index_exists that act only as a "passthrough" to call another function should also be avoided: they are unnecessary, they add overhead, and they complicate the code and any potential debugging process.
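For illustration, a hypothetical sketch of the passthrough anti-pattern being described (this is not the PR's actual helper; the URI is a placeholder):

```python
import tiledb

def index_exists(uri: str) -> bool:
    # Pure passthrough: adds a stack frame and an extra name to trace,
    # with no logic of its own.
    return tiledb.object_type(uri) == "group"

# Preferred: call the underlying check directly at the use site.
if tiledb.object_type("s3://bucket/my_index") == "group":
    print("index exists")
```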
@Shelnutt2 are your comments addressed by the changes? This requires your approval to continue with merging.
embedding_class_ = getattr(embeddings_module, embedding_class)  # resolve the embedding class by name
embedding = embedding_class_(**embedding_kwargs)  # instantiate it with the user-provided kwargs

with tiledb.scope_ctx(config):
This should be done as the first stage in the graph, not local to the caller.
I am not sure I understand this; ingest_files_udf is running within the taskgraph.
I created an alternative version of the PR that uses an extra taskgraph and creates the dataset as the first node in the taskgraph: #547. Is this what you are expecting the ingestion structure to look like?
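A minimal sketch of that structure, assuming the tiledb.cloud.dag API; create_dataset_udf and ingest_files_udf are stand-ins for the PR's actual stage functions, and the URIs are placeholders:

```python
from tiledb.cloud import dag

# Stand-in stage functions; the real UDFs do the actual work.
def create_dataset_udf(index_uri):
    print(f"create dataset at {index_uri}")

def ingest_files_udf(search_uri, index_uri):
    print(f"ingest {search_uri} into {index_uri}")

graph = dag.DAG(mode=dag.Mode.BATCH)

# First node: create the dataset/group before any ingestion work runs.
create_node = graph.submit(create_dataset_udf, "tiledb://ns/my-index",
                           name="create-dataset")

# The ingestion node runs only once the dataset exists.
ingest_node = graph.submit(ingest_files_udf, "tiledb://ns/files",
                           "tiledb://ns/my-index", name="ingest-files")
ingest_node.depends_on(create_node)

graph.compute()
graph.wait()
```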
@Shelnutt2 does the alternative version match your expectations? Let me know if you still have concerns.
Applied the alternative taskgraph structure in this PR.
@Shelnutt2 this PR needs your approval to move forward. Let me know if you need any more changes.
Several comments need to be fixed. This code also needs to be aligned with the goal of handling TileDB FileStore files as a primary source.
Additionally, there are some pylint errors related to variables that aren't passed through. Please address all lint errors.
driver_image: Optional[str] = None,
extra_driver_modules: Optional[List[str]] = None,
max_tasks_per_stage: int = -1,
embeddings_generation_mode: dag.Mode = dag.Mode.LOCAL,
Everything must default to batch mode. Running this in local mode is unexpected. The goal is that, like all other verticals, we support and default to batch ingestion capabilities.
Document indexing has multiple execution steps that can be run in different modes:

- ingest_files: creates a BATCH taskgraph that runs all the indexing. This means that all processing happens within a BATCH taskgraph with access_credentials, even if the options here are set to LOCAL.
- embeddings_generation: reads the documents and creates text embeddings. This can spawn its own taskgraph.
- vector_indexing: creates a vector index from the produced embeddings. This can spawn its own taskgraph.

The default configuration at the moment is:

- ingest_files creates a BATCH taskgraph that runs all the indexing.
- embeddings_generation and vector_indexing run in LOCAL mode within a UDF of the ingest_files taskgraph. Both of these tasks can leverage the available parallelism within the single worker.

This is expected to be a good default execution configuration for cost and latency, even for sets of thousands of documents (a sketch of these knobs follows below).

Do you want all the execution steps to be executed in BATCH mode by default?
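For concreteness, a minimal call sketch of the configuration described above. The import path and the vector_indexing_mode parameter name are assumptions on my part (only embeddings_generation_mode appears in the diff shown here), and the URIs are placeholders:

```python
from tiledb.cloud import dag
# Assumed import path; the module layout in this PR may differ.
from tiledb.vector_search.object_api import ingest_files

ingest_files(
    search_uri="tiledb://ns/my-file-group",  # placeholder URI
    index_uri="tiledb://ns/my-index",        # placeholder URI
    # The outer ingest_files taskgraph always runs in BATCH mode with
    # access_credentials. These knobs control the inner steps: LOCAL keeps
    # them inside a UDF of that taskgraph, BATCH spawns separate taskgraphs.
    embeddings_generation_mode=dag.Mode.LOCAL,
    vector_indexing_mode=dag.Mode.LOCAL,  # assumed parameter name
)
```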
The requirement again, as we discussed and as spelled out in the story, is to have a robust batch-mode ingestion that can scale to millions of documents. Local mode is a bad default and does not meet our intended goal; please change it, and please be sure you actually test at scale. These issues are easy to see even running just our same test datasets.
Changed the embeddings_generation_mode and vector_indexing modes to BATCH.
Note: this setting can introduce some latency and resource overhead for small to medium size document datasets.
There were some bugs that led to execution failures for this setup; these were addressed in TileDB-Inc/TileDB-Vector-Search#351. Also added tests for BATCH execution in Cloud.
def ingest_files(
    file_dir_uri: str,
This does not work as expected. Passing in a TileDB file URI gets ignored. Please test this and add unit tests for the relevant cases. Currently this does not cover the required use cases.
As implemented at the moment, this should be the group URI, and we pick up the files from the group (applying regexp patterns if provided). What are the cases you are looking to support here?
We have some test cases for this here: https://github.com/TileDB-Inc/TileDB-Vector-Search/blob/main/apis/python/test/test_directory_reader.py
The requirements are to support TileDB FileStore files or a group of files. This has been a hard requirement from day one and is outlined in our planning document.
Renamed file_dir_uri to search_uri to be consistent with other ingestion jobs. search_uri supports FileStore URIs; if we want to index one file using its FileStore URI, we can pass it as search_uri. @Shelnutt2 is this covering your expectations of the function signature?
def ingest_files(
    file_dir_uri: str,
    index_uri: str,
    file_name: Optional[str] = None,
This, along with include/exclude, doesn't make sense. How is this supposed to work with TileDB files? There is no check of the TileDB file name, nor any parsing of the TileDB URIs. The goal, again as outlined in the requirements, is to use this with TileDB files, either standalone or from a group.
Removed the file_name option. FileStore URIs work directly using the search_uri param.
# Index update params
index_timestamp: Optional[int] = None,
workers: int = -1,
worker_resources: Optional[Dict] = None,
This is not plumbing through all the different resource parameters; is there a reason?
Do you mean the vector-indexing resources? These can be passed in index_update_kwargs, as sketched below.
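A sketch of forwarding those resources through index_update_kwargs; the key names are assumed from this discussion, not a confirmed API, and the URIs are placeholders:

```python
ingest_files(
    search_uri="tiledb://ns/my-file-group",
    index_uri="tiledb://ns/my-index",
    # Resources for the vector-indexing taskgraph workers (assumed keys).
    index_update_kwargs={
        "workers": 10,
        "worker_resources": {"cpu": "4", "memory": "16Gi"},
    },
)
```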
environment_variables=environment_variables,
load_embedding=False,
load_metadata_in_memory=False,
memory_budget=1,
Why is this set to 1? Please add inline code comments. There should be a decent number of comments explaining the purpose of values such as this one. The goal is for others to be able to read the code plus comments, understand the code, and be able to work on it.
Added a comment for this: it avoids loading vector data into memory, since we don't want to perform queries. See the sketch below.
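A sketch of how that comment could read on the call-site fragment quoted above (the surrounding arguments are unchanged):

```python
environment_variables=environment_variables,
load_embedding=False,
load_metadata_in_memory=False,
# Minimal memory budget: the index is opened only to be updated, not
# queried, so we avoid loading the vector data into memory.
memory_budget=1,
```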
mode=dag.Mode.BATCH,
)
if worker_resources is None:
    driver_resources = {"cpu": "2", "memory": "8Gi"}
Did you mean worker or driver here?
Changed (see the corrected sketch below).
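Presumably the guard and the assignment should reference the same variable; a sketch of the corrected branch, keeping the values from the PR:

```python
if worker_resources is None:
    # Default per-worker resources for the BATCH taskgraph.
    worker_resources = {"cpu": "2", "memory": "8Gi"}
```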
@Shelnutt2 what is the P0's definition of done for this PR?
@JohnMoutafis We need to add unit tests and prove things work as expected in the context of the change. Integration-style (notebook) tests are good, but they will have to integrate in later steps as well.
- Also, the majority of the cases test vector-search functionality.
Force-pushed from b67b041 to 76ae535.
This adds a one-click ingestion function for ingesting files to vector search.
Tested in cloud: