doc: initial ChatQnA example and placeholders #87

Merged 1 commit on Sep 4, 2024
2 changes: 1 addition & 1 deletion .github/CODEOWNERS
1 change: 1 addition & 0 deletions .gitignore
@@ -1 +1,2 @@
_build
.vscode
184 changes: 144 additions & 40 deletions examples/ChatQnA/ChatQnA_Guide.rst
@@ -3,89 +3,193 @@
ChatQnA Sample Guide
####################

.. note:: This guide is in its early development and is a work-in-progress with
placeholder content.

Introduction/Purpose
********************

TODO: Tom to provide.

Overview/Intro
==============

Chatbots are a widely adopted use case for leveraging the powerful chat and
reasoning capabilities of large language models (LLMs). The ChatQnA example
provides the starting point for developers to begin working in the GenAI space.
Consider it the “hello world” of GenAI applications; it can be leveraged for
solutions across a wide range of enterprise verticals, both internally and
externally.

Purpose
=======

The ChatQnA example uses retrieval augmented generation (RAG) architecture,
which is quickly becoming the industry standard for chatbot development. It
combines the benefits of a knowledge base (via a vector store) and generative
models to reduce hallucinations, maintain up-to-date information, and leverage
domain-specific knowledge.

RAG bridges the knowledge gap by dynamically fetching relevant information from
external sources, ensuring that responses generated remain factual and current.
At the core of this architecture are vector databases, which are instrumental in
enabling efficient and semantic retrieval of information. These databases store
data as vectors, allowing RAG to swiftly access the most pertinent documents or
data points based on semantic similarity.

Central to the RAG architecture is the use of a generative model, which is
responsible for generating responses to user queries. The generative model is
trained on a large corpus of customized and relevant text data and is capable of
generating human-like responses. Developers can easily swap out the generative
model or vector database with their own custom models or databases. This allows
developers to build chatbots that are tailored to their specific use cases and
requirements. By combining the generative model with the vector database, RAG
can provide accurate and contextually relevant responses specific to your users'
queries.

The ChatQnA example is designed to be a simple, yet powerful, demonstration of
the RAG architecture. It is a great starting point for developers looking to
build chatbots that can provide accurate and up-to-date information to users.

GMC is the GenAI Microservices Connector. GMC facilitates sharing of services
across GenAI applications and pipelines, and dynamic switching between the
models used in any stage of a GenAI pipeline. In the ChatQnA pipeline, for
example, it supports changing the model used in the embedder, re-ranker,
and/or the LLM.

You can use upstream vanilla Kubernetes or RHOCP, either with or without GMC;
as noted, GMC provides additional features.

The ChatQnA example provides several deployment options, including single-node
deployments on-premise or in a cloud environment using hardware such as Xeon
Scalable Processors, Gaudi servers, NVIDIA GPUs, and even AI PCs. It also
supports Kubernetes deployments with and without the GenAI Microservices
Connector (GMC), as well as cloud-native deployments using Red Hat OpenShift
Container Platform (RHOCP).


Preview
=======

To get a preview of the ChatQnA example, visit the
`AI Explore site <https://aiexplorer.intel.com/explore>`_. The **ChatQnA Solution**
provides a basic chatbot, while the **ChatQnA with Augmented Context** solution
allows you to upload your own files to quickly experiment with a RAG solution
and see how a developer-supplied corpus can provide relevant and up-to-date
responses.

Key Implementation Details
==========================

Tech Overview
*************

Embedding:
   The process of transforming user queries into numerical representations
   called embeddings.
Vector Database:
   The storage and retrieval of relevant data points using vector databases.
RAG Architecture:
   The use of the RAG architecture to combine knowledge bases and generative
   models for the development of chatbots with relevant and up-to-date query
   responses.
Large Language Models (LLMs):
   The training and utilization of LLMs for generating responses.
Deployment Options:
   Production-ready deployment options for the ChatQnA example, including
   single-node deployments and Kubernetes deployments.
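
For illustration, the small, self-contained Python example below shows what the
embedding and vector-retrieval terms above mean in practice. The
three-dimensional vectors are toy values chosen for readability; a real
embedding model produces vectors with hundreds or thousands of dimensions, and
a vector database performs this ranking at scale with approximate
nearest-neighbor search.

.. code-block:: python

   from math import sqrt

   def cosine_similarity(a: list[float], b: list[float]) -> float:
       """Semantic closeness of two embedding vectors (1.0 = same direction)."""
       dot = sum(x * y for x, y in zip(a, b))
       norm = sqrt(sum(x * x for x in a)) * sqrt(sum(y * y for y in b))
       return dot / norm if norm else 0.0

   # Toy "embeddings" for a user query and two stored documents.
   query_vec = [0.9, 0.1, 0.3]
   doc_vectors = {
       "doc_about_chatbots": [0.8, 0.2, 0.4],
       "doc_about_cooking": [0.1, 0.9, 0.2],
   }

   # Retrieval: pick the document whose vector is most similar to the query.
   best = max(doc_vectors, key=lambda name: cosine_similarity(query_vec, doc_vectors[name]))
   print(best)  # doc_about_chatbots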

How It Works
============

The ChatQnA example follows a basic flow of information in the chatbot system,
starting from the user input and going through the retrieve, re-rank, and
generate components, ultimately resulting in the bot's output.

.. figure:: /GenAIExamples/ChatQnA/assets/img/chatqna_architecture.png
   :alt: ChatQnA Architecture Diagram

   This diagram illustrates the flow of information in the chatbot system,
   starting from the user input and going through the retrieve, re-rank, and
   generate components, ultimately resulting in the bot's output.

The architecture follows a series of steps to process user queries and generate responses:

1. **Embedding**: The user query is first transformed into a numerical
   representation called an embedding. This embedding captures the semantic
   meaning of the query and allows for efficient comparison with other
   embeddings.
#. **Vector Database**: The embedding is then used to search a vector database,
   which stores relevant data points as vectors. The vector database enables
   efficient and semantic retrieval of information based on the similarity
   between the query embedding and the stored vectors. The retrieved data
   points can include documents, articles, or any other relevant information
   that can help generate accurate responses.
#. **Re-ranker**: A re-ranking model then scores the retrieved data points by
   their saliency to the query, so that only the most relevant context is
   passed on to the next stage.
#. **LLM**: The retrieved and re-ranked data points are then passed to a large
   language model (LLM) for further processing. LLMs are powerful generative
   models that have been trained on a large corpus of text data. They can
   generate human-like responses based on the input data.
#. **Generate Response**: The LLM generates a response based on the input data
   and the user query. This response is then returned to the user as the
   chatbot's answer.
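
The minimal Python sketch below mirrors these five steps end to end. It is
illustrative only: the embedder, retriever, re-ranker, and LLM are stand-in
stubs defined here for the example, not the OPEA microservices, which run as
separate services in a real ChatQnA deployment.

.. code-block:: python

   def embed(text: str) -> list[float]:
       """Step 1 (stub): a real embedding model returns a dense vector."""
       return [float(len(word)) for word in text.lower().split()][:3] or [0.0]

   def retrieve(query_vector: list[float], k: int = 2) -> list[str]:
       """Step 2 (stub): a vector database returns the k nearest documents."""
       knowledge_base = [
           "ChatQnA is built on a RAG architecture.",
           "OPEA examples are composed of microservices.",
           "Gaudi and Xeon are supported hardware targets.",
       ]
       return knowledge_base[:k]

   def rerank(query: str, documents: list[str]) -> list[str]:
       """Step 3 (stub): a re-ranking model orders documents by saliency."""
       words = query.lower().split()
       return sorted(documents, key=lambda d: -sum(w in d.lower() for w in words))

   def generate(query: str, context: list[str]) -> str:
       """Steps 4-5 (stub): an LLM writes the answer grounded in the context."""
       return f"Answer to '{query}', grounded in {len(context)} retrieved passages."

   query = "What architecture does ChatQnA use?"
   print(generate(query, rerank(query, retrieve(embed(query)))))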

Expected Output
===============

Validation Matrix and Prerequisites
***********************************

See :doc:`/GenAIExamples/supported_examples`

Architecture
************

TODO: Includes microservice level graphics.

TODO: Need to include the architecture with microservices, like the ones
Xigui/Chun made, and explain in a paragraph or two the highlights of the
architecture, including the Gateway, UI, megaservice, how models are deployed,
and how the microservices use the deployment service. The architecture can be
laid out as generally as possible, maybe calling out "for example" on variable
pieces. It will also be good to include a line or two on what the overall use
case is. For example: this ChatQnA is set up to assist in answering questions
about OPEA; the microservices are set up with a RAG and LLM pipeline to query
OPEA PDF documents.

Microservice Outline and Diagram
================================

Deployment
**********

+-------------------------------------------------------------+
| Single Node                                                 |
+==============================+==============================+
| Xeon Scalable Processors     | Gaudi Servers                |
+------------------------------+------------------------------+
| NVIDIA GPUs                  | AI PC                        |
+------------------------------+------------------------------+

+-------------------------------------------------------------+
| Kubernetes                                                   |
+==============================+==============================+
| Xeon & Gaudi with GMC        | Xeon & Gaudi without GMC     |
+------------------------------+------------------------------+
| Using Helm Charts            |                              |
+------------------------------+------------------------------+

+-------------------------------------------------------------+
| Cloud Native                                                 |
+==============================+==============================+
| Red Hat OpenShift Container  |                              |
| Platform (RHOCP)             |                              |
+------------------------------+------------------------------+

Single Node
===========

.. toctree::
   :maxdepth: 1

   deploy/xeon
   deploy/gaudi
   deploy/nvidia
   deploy/AIPC

Kubernetes
==========

* Xeon & Gaudi with GMC
* Xeon & Gaudi without GMC
* Using Helm Charts

Cloud Native
============

* Red Hat OpenShift Container Platform (RHOCP)

Troubleshooting
***************

Monitoring
**********

TODO: Evaluate performance and accuracy.

Summary and Next Steps
**********************
7 changes: 7 additions & 0 deletions examples/ChatQnA/deploy/AIPC.rst
@@ -0,0 +1,7 @@
.. _ChatQnA_deploy_aiPC:


Single Node On-Prem Deployment: AI PC
#####################################

TODO
7 changes: 7 additions & 0 deletions examples/ChatQnA/deploy/gaudi.rst
@@ -0,0 +1,7 @@
.. _ChatQnA_deploy_gaudi:


Single Node On-Prem Deployment: Gaudi Servers
#############################################

TODO
7 changes: 7 additions & 0 deletions examples/ChatQnA/deploy/nvidia.rst
@@ -0,0 +1,7 @@
.. _ChatQnA_deploy_nvidia:


Single Node On-Prem Deployment: NVIDIA GPUs
###########################################

TODO
81 changes: 56 additions & 25 deletions examples/ChatQnA/deploy/xeon.rst
@@ -1,42 +1,73 @@
.. _ChatQnA_deploy_xeon:


Single Node On-Prem Deployment: Xeon Scalable Processors
########################################################

TODO: Provide context for selecting between vLLM and TGI.

.. tabs::

   .. tab:: Deploy with Docker compose with vLLM

      TODO: This section must cover how the architecture described above can
      be implemented with vLLM mode, or the serving model chosen. Show a basic
      end-to-end use case set up with one type of DB, for example Redis, based
      on what is already covered in the ChatQnA example (others can be called
      out or referenced accordingly). Show how to use one SOTA model, such as
      Llama 3, and others with a sample configuration. The outcome must
      demonstrate a real use case showing both productivity and performance.
      For consistency, let's use the OPEA documentation for RAG use cases.

      Sample titles:

      1. Overview

         Talk a few lines about what is expected in this tutorial. For
         example, a Redis DB is used and a Llama 3 model is run to showcase an
         end-to-end use case using OPEA and vLLM.

      #. Pre-requisites

         Includes cloning the repos, pulling the necessary containers if
         available (UI, pipeline, etc.), setting the environment variables
         such as proxies, getting access to model weights, getting tokens on
         HF, LG, etc., and sanity checks if needed.

      #. Prepare (Building / Pulling) Docker images

         a) This step will involve building/pulling (maybe in the future)
            relevant Docker images with a step-by-step process, along with a
            sanity check at the end.
         #) If customization is needed, we show one case of how to do it.

      #. Use case setup

         This section will include how to get the data and other dependencies
         needed, followed by getting all the microservice environments ready.
         Use this section to also talk about how to set other models if
         needed, how to use other DBs, etc.

      #. Deploy the ChatQnA use case based on the docker_compose

         This should cover the steps involved in starting the microservices
         and megaservices, also explaining some key highlights of what is
         covered in the Docker compose file. Include sanity checks as needed.
         Each microservice/megaservice start command, along with what it does
         and its expected output, will be good to add.

      #. Interacting with the ChatQnA deployment (or navigating the ChatQnA
         workflow)

         This section covers how to use a different machine to interact with
         and validate the microservices, and walks through how to navigate
         each service; for example, uploading a local document for data prep
         and how to get answers. Customers will be interested in getting the
         output for a query, and at the same time measuring the quality of the
         model and the performance metrics (health and statistics should also
         be covered). Please check whether these details can also be curled
         from the endpoints; a first sketch of such a query appears after this
         outline. Is uploading templates available now? A custom template is
         available today.

         Show all the customization available and features.

      #. Additional Capabilities (optional)

         Use case specific features to call out.

      #. Launch the UI service

         Show the steps for how to launch the UI and a sample screenshot of a
         query and its output.
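
      As a placeholder until this walkthrough is written, the snippet below
      shows the general shape of querying a running ChatQnA megaservice from
      another machine. The host, port ``8888``, and ``/v1/chatqna`` path are
      assumptions based on typical ChatQnA docker compose defaults and should
      be verified against the compose file used in this deployment.

      .. code-block:: python

         import requests

         # Assumed defaults; replace with the host and port of your deployment.
         base_url = "http://chatqna-host:8888"

         response = requests.post(
             f"{base_url}/v1/chatqna",  # assumed megaservice endpoint path
             json={"messages": "What is OPEA?"},
             timeout=120,
         )
         response.raise_for_status()
         print(response.text)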


   .. tab:: Deploy with Docker compose with TGI