[FEATURE] OpenSearch Feature Brief - Multimodal Search Support for Neural Search #473
Comments
@dylan-tong-aws do you think this could go into specific plugins like ml-commons?
@opensearch-project/admin -- Can we transfer this to the neural-search repo?
@dylan-tong-aws can you elaborate on this more? I am not sure what we want here.
Added the feature of using an image as part of semantic search together with text queries under #359. Is this a meta issue for other multimodal enhancements, or can we close it, @dylan-tong-aws?
What are you proposing?
We’re adding multimodal (text and image) search support to our Neural Search experience. This capability will enable users to add multimodal search capabilities to OpenSearch-powered applications without having to build and manage custom middleware to integrate multimodal models into OpenSearch workflows.
Text and image multimodal search enables users to search image and text pairs, such as product catalog items (product image and description), based on visual and semantic similarity. This enables new search experiences that can deliver more relevant results. For instance, a user can search for “white blouse” to retrieve product images, because the machine learning (ML) model that powers this experience can associate semantics with visual characteristics. Unlike traditional methods, there is no need to manually manage and index metadata to enable comparable search capabilities. Users can also search by image to retrieve visually similar products, or search with both text and image, for example to find the products most similar to a particular catalog item based on semantic and visual similarity.
We want to enable this capability via the Neural Search experience so that OpenSearch users can infuse multimodal search capabilities—like they can for semantic search—into applications with less effort to accelerate their rate of innovation.
Which users have asked for this feature?
This feature was driven by AWS customer demand.
What problems are you trying to solve?
Text and image multimodal search will help our customers improve image search relevancy. Traditional image search is text-based search in disguise: it requires manual effort to create metadata describing images, a process that is hard to scale due to the speed and cost of labor. As a result, the performance and freshness of traditional image search are limited by economics and the ability to maintain high-quality metadata.
Multimodal search leverages multimodal embedding models that are trained to understand semantics and visual similarity, enabling the aforementioned search experiences without having to produce and maintain image metadata. Furthermore, users can perform visual similarity search. It’s not always easy to describe an image in words, and this feature gives users the option to match images by visual similarity, empowering them to discover more relevant images when visual characteristics are hard to describe.
What is the developer experience going to be?
The developer experience will be the same as the neural search experience for semantic search, except that we’re adding enhancements to allow users to provide image data via the query, index, and ingest processor APIs. Initially, the feature will be powered by an AI connector to Amazon Bedrock’s multimodal API. New connectors can be added based on user demand and community contributions.
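To make this concrete, here is a rough, non-authoritative sketch of what ingestion could look like, assuming an ingest processor that maps a text field and a binary image field to a single vector field. The processor name (`text_image_embedding`), field names, model ID, and vector dimension are illustrative assumptions, not the final API.

```python
# Illustrative sketch only: processor name, field names, model ID, and
# dimension are assumptions, not a finalized API. Assumes a local OpenSearch
# cluster with the neural-search plugin and a deployed multimodal model.
import requests

OPENSEARCH = "https://localhost:9200"
AUTH = ("admin", "admin")            # replace with real credentials
MODEL_ID = "<multimodal-model-id>"   # ID of the deployed multimodal embedding model

# 1. Ingest pipeline that turns a (text, image) pair into one embedding.
pipeline = {
    "description": "Multimodal (text + image) embedding pipeline",
    "processors": [{
        "text_image_embedding": {              # hypothetical processor name
            "model_id": MODEL_ID,
            "embedding": "vector_embedding",   # destination k-NN vector field
            "field_map": {
                "text": "product_description",
                "image": "product_image"       # base64-encoded image bytes
            }
        }
    }]
}
requests.put(f"{OPENSEARCH}/_ingest/pipeline/multimodal-pipeline",
             json=pipeline, auth=AUTH, verify=False)

# 2. Index whose mapping pairs the source fields with a k-NN vector field
#    and applies the pipeline by default.
index_body = {
    "settings": {"index.knn": True, "default_pipeline": "multimodal-pipeline"},
    "mappings": {"properties": {
        "product_description": {"type": "text"},
        "product_image": {"type": "binary"},
        "vector_embedding": {"type": "knn_vector", "dimension": 1024}
    }}
}
requests.put(f"{OPENSEARCH}/products", json=index_body, auth=AUTH, verify=False)
```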
Are there any security considerations?
We’re building this feature on the existing security controls created for semantic search. We’ll support the same granular security controls.
Are there any breaking changes to the API?
No.
What is the user experience going to be?
The end user experience will be the same as what we’ve provided for semantic search via the neural search experience. Multimodal search is powered by our vector search engine (k-NN), but users won’t have to run vector search queries. Instead, they can run queries using text, an image (binary type), or a text and image pair.
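As a rough sketch of the query side, assuming the existing `neural` query clause is extended with an optional image input (the `query_text` and `query_image` parameter names here are assumptions), a text, image, or text-and-image query could look like this:

```python
# Illustrative sketch only: the neural clause parameters (query_text,
# query_image) are assumed extensions of the existing semantic-search syntax.
import base64
import requests

OPENSEARCH = "https://localhost:9200"
AUTH = ("admin", "admin")
MODEL_ID = "<multimodal-model-id>"

# Encode the query image as base64, matching the binary field type.
with open("white_blouse.jpg", "rb") as f:
    image_b64 = base64.b64encode(f.read()).decode("utf-8")

query = {
    "query": {
        "neural": {
            "vector_embedding": {              # vector field produced at ingest
                "query_text": "white blouse",  # omit for image-only search
                "query_image": image_b64,      # omit for text-only search
                "model_id": MODEL_ID,
                "k": 10
            }
        }
    }
}
resp = requests.post(f"{OPENSEARCH}/products/_search",
                     json=query, auth=AUTH, verify=False)
for hit in resp.json()["hits"]["hits"]:
    print(hit["_score"], hit["_source"]["product_description"])
```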
Are there breaking changes to the User Experience?
No.
Why should it be built? Any reason not to?
Refer to the responses to “What are you proposing?” and “What problems are you trying to solve?” above.
Any reason why we shouldn’t build this? Some developers will want full flexibility, and they might choose to build their multimodal search (vector search) application directly on our core vector search engine (k-NN). We’ll continue to support users with this option while working on improving our framework so that we can offer a simpler solution with minimal constraints.
What will it take to execute?
We’ll be enhancing the neural search plugin APIs and creating a new AI connector for Amazon Bedrock multimodal APIs.
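For reference, here is a hedged sketch of what registering such a connector might look like through the existing ml-commons connector API. The Bedrock endpoint, model name, and request-body template are assumptions based on current connector blueprints, not a committed design.

```python
# Illustrative sketch only: the Bedrock model name, endpoint, and request-body
# template are assumptions; the final connector blueprint may differ.
import requests

OPENSEARCH = "https://localhost:9200"
AUTH = ("admin", "admin")

connector = {
    "name": "Amazon Bedrock multimodal embedding connector",
    "description": "Connector to a Bedrock text + image embedding model",
    "version": 1,
    "protocol": "aws_sigv4",
    "parameters": {"region": "us-east-1", "service_name": "bedrock"},
    "credential": {                       # use real AWS credentials or a role
        "access_key": "<aws-access-key>",
        "secret_key": "<aws-secret-key>"
    },
    "actions": [{
        "action_type": "predict",
        "method": "POST",
        "headers": {"content-type": "application/json"},
        # Hypothetical Bedrock multimodal embedding model endpoint.
        "url": "https://bedrock-runtime.us-east-1.amazonaws.com/model/amazon.titan-embed-image-v1/invoke",
        # Pass the text and/or image inputs through to the model.
        "request_body": "{\"inputText\": \"${parameters.inputText:-null}\", "
                        "\"inputImage\": \"${parameters.inputImage:-null}\"}"
    }]
}
resp = requests.post(f"{OPENSEARCH}/_plugins/_ml/connectors/_create",
                     json=connector, auth=AUTH, verify=False)
print(resp.json())  # returns the new connector_id on success
```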
Any remaining open questions?
Community feedback is welcome.