Add indexing function for indexing files to vector search #532
Conversation
This pull request has been linked to Shortcut Story #42043: Trigger Task Graph for Indexing (Ingestion).
Several comments in the first pass.
This is also completely missing interaction with TileDB Files. The requirement is to support loading into the TileDB Files store and performing indexing, not just indexing.
I don't think we have discussed requirements for this. This needs to be designed in collaboration with cloud, and we need to understand who owns the file ingestion code and implementation.
When you are back we need to sync on this. I believe we had discussed this, and the example POC code I provided handled all of this. The goal was to take the POC we used and re-implement its features in a production fashion. The first goal is to support file ingestion and index creation as part of one pipeline.
We should avoid nested functions, especially when they are only used once throughout the body of the method.
Functions like index_exists that act only as a "passthrough" to call another function should also be avoided: they are unnecessary, they add overhead, and they complicate the code and any potential debugging process.
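For illustration, a hypothetical sketch of the passthrough anti-pattern being described (this is not the PR's actual helper; the URI is a placeholder):

```python
import tiledb

def index_exists(uri: str) -> bool:
    # Pure passthrough: adds a stack frame and an extra name to trace,
    # with no logic of its own.
    return tiledb.object_type(uri) == "group"

# Preferred: call the underlying check directly at the use site.
if tiledb.object_type("s3://bucket/my_index") == "group":
    print("index exists")
```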
@Shelnutt2 are your comments addressed by the changes? This requires your approval to continue with merging.
embedding_class_ = getattr(embeddings_module, embedding_class)  # resolve the embedding class by name
embedding = embedding_class_(**embedding_kwargs)  # instantiate it with the user-provided kwargs

with tiledb.scope_ctx(config):
This should be done as the first stage in the graph, not local to the caller.
I am not sure I understand this; ingest_files_udf is running within the taskgraph.
I created an alternative version of the PR that uses an extra taskgraph and creates the dataset as the first node in the taskgraph: #547. Is this what you are expecting the ingestion structure to look like?
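A minimal sketch of that structure, assuming the tiledb.cloud.dag API; create_dataset_udf and ingest_files_udf are stand-ins for the PR's actual stage functions, and the URIs are placeholders:

```python
from tiledb.cloud import dag

# Stand-in stage functions; the real UDFs do the actual work.
def create_dataset_udf(index_uri):
    print(f"create dataset at {index_uri}")

def ingest_files_udf(search_uri, index_uri):
    print(f"ingest {search_uri} into {index_uri}")

graph = dag.DAG(mode=dag.Mode.BATCH)

# First node: create the dataset/group before any ingestion work runs.
create_node = graph.submit(create_dataset_udf, "tiledb://ns/my-index",
                           name="create-dataset")

# The ingestion node runs only once the dataset exists.
ingest_node = graph.submit(ingest_files_udf, "tiledb://ns/files",
                           "tiledb://ns/my-index", name="ingest-files")
ingest_node.depends_on(create_node)

graph.compute()
graph.wait()
```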
@Shelnutt2 does the alternative version match your expectations? Let me know if you still have concerns.
Applied the alternative taskgraph structure in this PR.
@Shelnutt2 this PR needs your approval to move forward. Let me know if you need any more changes.
Several comments need to be fixed. This code also needs to be aligned with the goal of handling TileDB FileStore files as a primary source.
Additionally, there are some pylint errors related to variables that aren't passed through. Please address all lint errors.
driver_image: Optional[str] = None,
extra_driver_modules: Optional[List[str]] = None,
max_tasks_per_stage: int = -1,
embeddings_generation_mode: dag.Mode = dag.Mode.LOCAL,
Everything must default to batch mode. Running this in local mode is unexpected. The goal is that, like all other verticals, we support and default to batch ingestion capabilities.
Document indexing has multiple execution steps that can be run in different modes:

- ingest_files: creates a BATCH taskgraph that runs all the indexing. This means that all processing happens within a BATCH taskgraph with access_credentials, even if the options here are set to LOCAL.
- embeddings_generation: reads the documents and creates text embeddings. This can spawn its own taskgraph.
- vector_indexing: creates a vector index from the produced embeddings. This can spawn its own taskgraph.

The default configuration at the moment is:

- ingest_files creates a BATCH taskgraph that runs all the indexing.
- embeddings_generation and vector_indexing run in LOCAL mode within a UDF of the ingest_files taskgraph. Both of these tasks can leverage the available parallelism within the single worker.

This is expected to be a good default execution configuration for cost and latency, even for sets of thousands of documents (a sketch of these knobs follows below).

Do you want all the execution steps to be executed in BATCH mode by default?
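For concreteness, a minimal call sketch of the configuration described above. The import path and the vector_indexing_mode parameter name are assumptions on my part (only embeddings_generation_mode appears in the diff shown here), and the URIs are placeholders:

```python
from tiledb.cloud import dag
# Assumed import path; the module layout in this PR may differ.
from tiledb.vector_search.object_api import ingest_files

ingest_files(
    search_uri="tiledb://ns/my-file-group",  # placeholder URI
    index_uri="tiledb://ns/my-index",        # placeholder URI
    # The outer ingest_files taskgraph always runs in BATCH mode with
    # access_credentials. These knobs control the inner steps: LOCAL keeps
    # them inside a UDF of that taskgraph, BATCH spawns separate taskgraphs.
    embeddings_generation_mode=dag.Mode.LOCAL,
    vector_indexing_mode=dag.Mode.LOCAL,  # assumed parameter name
)
```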
The requirement again, as we discussed and as spelled out in the story, is to have a robust batch-mode ingestion that can scale to millions of documents. Local mode is a bad default and does not meet our intended goal; please change it, and please be sure you actually test at scale. These issues are easy to see even running just our same test datasets.
Changed the embeddings_generation_mode and vector_indexing modes to BATCH.
Note: this setting can introduce some latency and resource overhead for small to medium size document datasets.
There were some bugs that led to execution failures for this setup; these were addressed in TileDB-Inc/TileDB-Vector-Search#351. Also added tests for BATCH execution in Cloud.
def ingest_files(
    file_dir_uri: str,
This does not work as expected. Passing in a TileDB file URI gets ignored. Please test this and add unit tests for the relevant cases. Currently this does not cover the required use cases.
As implemented at the moment, this should be the group URI, and we pick up the files from the group (applying regexp patterns if provided). What are the cases you are looking to support here?
We have some test cases for this here: https://github.com/TileDB-Inc/TileDB-Vector-Search/blob/main/apis/python/test/test_directory_reader.py
The requirements are to support TileDB FileStore files or a group of files. This has been a hard requirement from day one and is outlined in our planning document.
Renamed file_dir_uri to search_uri to be consistent with other ingestion jobs. search_uri supports FileStore URIs; if we want to index one file using its FileStore URI, we can pass it as search_uri. @Shelnutt2 is this covering your expectations of the function signature?
def ingest_files(
    file_dir_uri: str,
    index_uri: str,
    file_name: Optional[str] = None,
This, along with include/exclude, doesn't make sense. How is this supposed to work with TileDB files? There is no check of the TileDB file name, nor any parsing of the TileDB URIs. The goal, again as outlined in the requirements, is to use this with TileDB files, either standalone or from a group.
Removed the file_name option. FileStore URIs work directly using the search_uri param.
# Index update params
index_timestamp: Optional[int] = None,
workers: int = -1,
worker_resources: Optional[Dict] = None,
This is not plumbing through all the different resource parameters; is there a reason?
Do you mean the vector-indexing resources? These can be passed in index_update_kwargs, as sketched below.
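A sketch of forwarding those resources through index_update_kwargs; the key names are assumed from this discussion, not a confirmed API, and the URIs are placeholders:

```python
ingest_files(
    search_uri="tiledb://ns/my-file-group",
    index_uri="tiledb://ns/my-index",
    # Resources for the vector-indexing taskgraph workers (assumed keys).
    index_update_kwargs={
        "workers": 10,
        "worker_resources": {"cpu": "4", "memory": "16Gi"},
    },
)
```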
environment_variables=environment_variables,
load_embedding=False,
load_metadata_in_memory=False,
memory_budget=1,
Why is this set to 1? Please add inline code comments. There should be a decent number of comments explaining the purpose of values such as this one. The goal is for others to be able to read the code plus comments, understand the code, and be able to work on it.
Added a comment for this: it avoids loading vector data into memory, since we don't want to perform queries. See the sketch below.
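A sketch of how that comment could read on the call-site fragment quoted above (the surrounding arguments are unchanged):

```python
environment_variables=environment_variables,
load_embedding=False,
load_metadata_in_memory=False,
# Minimal memory budget: the index is opened only to be updated, not
# queried, so we avoid loading the vector data into memory.
memory_budget=1,
```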
mode=dag.Mode.BATCH,
)
if worker_resources is None:
    driver_resources = {"cpu": "2", "memory": "8Gi"}
Did you mean worker or driver here?
Changed (see the corrected sketch below).
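Presumably the guard and the assignment should reference the same variable; a sketch of the corrected branch, keeping the values from the PR:

```python
if worker_resources is None:
    # Default per-worker resources for the BATCH taskgraph.
    worker_resources = {"cpu": "2", "memory": "8Gi"}
```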
@Shelnutt2 what is the P0's definition of done for this PR?
@JohnMoutafis We need to add unit tests and prove things work as expected in the context of the change. Integration-style (notebook) tests are good, but they will have to integrate in later steps as well.
- Also, the majority of the cases test vector-search functionality.
Force-pushed from b67b041 to 76ae535.
This adds a one-click ingestion function for ingesting files to vector search.
Tested in cloud: