
[Feature] Azure automated deployment for OPEA applications - Infosys #629

Merged (16 commits, Dec 11, 2024)
2 changes: 2 additions & 0 deletions .github/code_spell_ignore.txt
@@ -0,0 +1,2 @@
aks
AKS
7 changes: 4 additions & 3 deletions README.md
@@ -47,21 +47,22 @@ The following steps are optional. They're only required if you want to run the w

### Use GenAI Microservices Connector (GMC) to deploy and adjust GenAIExamples

- Follow [GMC README](https://github.com/opea-project/GenAIInfra/blob/main/microservices-connector/README.md)
+ Follow [GMC README](microservices-connector/README.md)
to install GMC into your Kubernetes cluster. [GenAIExamples](https://github.com/opea-project/GenAIExamples) contains several sample GenAI example use case pipelines such as ChatQnA, DocSum, etc.
Once you have deployed GMC in your Kubernetes cluster, you can deploy any of the example pipelines by following its README file (e.g. [DocSum](https://github.com/opea-project/GenAIExamples/blob/main/DocSum/kubernetes/intel/README_gmc.md)).

### Use helm charts to deploy

To deploy GenAIExamples to Kubernetes using helm charts, you need [Helm](https://helm.sh/docs/intro/install/) installed on your machine.

- For a detailed version, see [Deploy GenAIExample/GenAIComps using helm charts](https://github.com/opea-project/GenAIInfra/tree/main/helm-charts/README.md)
+ For a detailed version, see [Deploy GenAIExample/GenAIComps using helm charts](helm-charts/README.md)

### Use terraform to deploy on cloud service providers

You can use [Terraform](https://www.terraform.io/) to create infrastructure to run OPEA applications on various cloud service provider (CSP) environments.

- - [AWS/EKS: Create managed Kubernetes cluster on AWS for OPEA](https://github.com/opea-project/GenAIInfra/blob/main/cloud-service-provider/aws/eks/terraform/README.MD)
+ - [AWS/EKS: Create managed Kubernetes cluster on AWS for OPEA](cloud-service-provider/aws/eks/terraform/README.MD)
+ - [Azure/AKS: Create managed Kubernetes cluster on Azure for OPEA](cloud-service-provider/azure/aks/terraform/README.md)

## Additional Content

84 changes: 84 additions & 0 deletions cloud-service-provider/azure/aks/terraform/README.md
@@ -0,0 +1,84 @@
# OPEA applications Azure AKS deployment guide

This guide shows how to deploy OPEA applications on Azure Kubernetes Service (AKS) using Terraform.

## Prerequisites

- Access to Azure AKS
- [Terraform](https://developer.hashicorp.com/terraform/tutorials/azure-get-started/install-cli), [Azure CLI](https://learn.microsoft.com/en-us/cli/azure/) and [Helm](https://helm.sh/docs/helm/helm_install/) installed on your local machine.
- Your Azure subscription ID at hand; you will be prompted for it during Terraform execution.

## Setup

The setup uses Terraform to create an AKS cluster with the following properties:

- 1-node AKS cluster with a 50 GB disk and a `Standard_D32d_v5` Spot (or Regular, depending on the application variables) instance (32 vCPUs and 128 GB memory)
- Cluster autoscaling up to 10 nodes

> **Reviewer:** Do we need this autoscaling to 10 nodes?
>
> **Collaborator:** I think it is a reasonable number.

- Storage Class (SC) `azurefile-csi` and Persistent Volume Claim (PVC) `model-volume` for storing the model data

Initialize the Terraform environment.

```bash
terraform init
```

## AKS cluster

By default, a 1-node cluster is created, which is suitable for running the OPEA application. See `variables.tf` and `opea-<application-name>.tfvars` if you want to tune the cluster properties, e.g., the number of nodes, instance types, or disk size.
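For example, a hypothetical override in an `opea-<application-name>.tfvars` file switching to a smaller on-demand node pool could look like this (the values are illustrative, not recommendations):

```terraform
# Illustrative overrides; variable names match variables.tf
instance_types  = ["Standard_D16d_v5"] # smaller VM size (hypothetical choice)
node_pool_type  = "Regular"            # on-demand instead of Spot
os_disk_size_gb = 100
max_count       = 4                    # lower autoscaling ceiling
```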

## Persistent Volume Claim

OPEA needs a volume in which to store the model, so we create a Kubernetes Persistent Volume Claim (PVC). OPEA requires the `ReadWriteMany` access mode, since multiple pods need access to the storage and they can be on different nodes. On AKS, only Azure Files supports `ReadWriteMany`. Thus, each OPEA application below uses the file `aks-azfs-csi-pvc.yaml` to create a PVC in its namespace.

## OPEA Applications

### ChatQnA

Use the commands below to create the AKS cluster. You will be prompted for your Azure subscription ID.

```bash
terraform plan --var-file opea-chatqna.tfvars -out opea-chatqna.plan
terraform apply "opea-chatqna.plan"
```
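If you prefer to skip the interactive prompt, Terraform also reads input variables from the environment via the `TF_VAR_` prefix; a minimal sketch (the subscription ID below is a placeholder, not a real value):

```shell
# Terraform maps TF_VAR_<name> environment variables to input variables,
# so this pre-populates var.subscription_id (placeholder value shown).
export TF_VAR_subscription_id="00000000-0000-0000-0000-000000000000"
echo "subscription set: ${TF_VAR_subscription_id}"
```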

Once the cluster is ready, the kubeconfig file to access the new cluster is updated automatically. By default, the file is `~/.kube/config`.

Now you should have access to the cluster via the `kubectl` command.

Deploy the ChatQnA application with Helm:

```bash
helm install -n chatqna --create-namespace chatqna oci://ghcr.io/opea-project/charts/chatqna \
  --set service.type=LoadBalancer \
  --set global.modelUsePVC=model-volume \
  --set global.HUGGINGFACEHUB_API_TOKEN=${HFTOKEN}
```

Create the PVC as mentioned [above](#persistent-volume-claim):

```bash
kubectl apply -f aks-azfs-csi-pvc.yaml -n chatqna
```

After a while, the OPEA application should be running. You can check the status via `kubectl`.

```bash
kubectl get pod -n chatqna
```
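A pod is fully ready when the READY column shows the same number on both sides of the slash (e.g. `1/1`). One way to filter for pods that are not yet ready is sketched below; the pod names and statuses are illustrative sample data standing in for live cluster output:

```shell
# Filter 'kubectl get pod'-style output for pods that are not fully ready.
# The $pods variable holds sample output for demonstration.
pods="NAME                      READY   STATUS    RESTARTS   AGE
chatqna-7d9c6b5d9-abcde   1/1     Running   0          5m
chatqna-tgi-0             0/1     Pending   0          5m"
not_ready=$(printf '%s\n' "$pods" | awk 'NR>1 { split($2, r, "/"); if (r[1] != r[2]) print $1 }')
echo "$not_ready"   # prints the pods still waiting, here: chatqna-tgi-0
```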

> **Reviewer:** It would be useful to state something like "ensure all pods are running or show 1/1".
>
> **Contributor (author):** Sure, will add the statement.


Ensure that all pods are running and show `1/1` in the READY column.
You can now start using the OPEA application.

```bash
# On AKS, LoadBalancer services expose an IP address rather than a hostname
OPEA_SERVICE=$(kubectl get svc -n chatqna chatqna -o jsonpath='{.status.loadBalancer.ingress[0].ip}')
curl http://${OPEA_SERVICE}:8888/v1/chatqna \
-H "Content-Type: application/json" \
-d '{"messages": "What is the revenue of Nike in 2023?"}'
```

#### Cleanup

Uninstall the application and delete the cluster with the following commands. You will be prompted for your Azure subscription ID.

```bash
helm uninstall -n chatqna chatqna
terraform destroy -var-file opea-chatqna.tfvars
```
@@ -0,0 +1,14 @@
# Copyright (C) 2024 Intel Corporation
# SPDX-License-Identifier: Apache-2.0

apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: model-volume
spec:
  accessModes:
    - ReadWriteMany
  storageClassName: azurefile-csi
  resources:
    requests:
      storage: 100Gi
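For illustration, a workload consumes this claim by mounting it by name; a minimal hypothetical pod spec might look like the following (the pod name and image are placeholders, not part of the OPEA charts):

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: model-consumer          # hypothetical example pod
spec:
  containers:
    - name: app
      image: busybox            # placeholder image
      command: ["sleep", "3600"]
      volumeMounts:
        - name: model
          mountPath: /data      # model files appear here
  volumes:
    - name: model
      persistentVolumeClaim:
        claimName: model-volume # the PVC defined above
```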
113 changes: 113 additions & 0 deletions cloud-service-provider/azure/aks/terraform/azure_main.tf
@@ -0,0 +1,113 @@
provider "kubernetes" {
  config_path = "~/.kube/config"
}

# Resource Group
resource "azurerm_resource_group" "main" {
  name     = "${var.cluster_name}-rg"
  location = var.location
}

# Virtual Network
module "vnet" {
  source              = "Azure/vnet/azurerm"
  resource_group_name = azurerm_resource_group.main.name
  vnet_name           = "${var.cluster_name}-vnet"
  vnet_location       = azurerm_resource_group.main.location

  tags = {
    environment = "dev"
  }
  depends_on = [azurerm_resource_group.main]
}

# AKS Cluster
resource "azurerm_kubernetes_cluster" "main" {
  name                                = var.cluster_name
  location                            = azurerm_resource_group.main.location
  resource_group_name                 = azurerm_resource_group.main.name
  dns_prefix                          = var.cluster_name
  kubernetes_version                  = var.cluster_version
  private_cluster_public_fqdn_enabled = true

  default_node_pool {
    name                 = "default"
    auto_scaling_enabled = true
    node_count           = var.node_count
    vm_size              = var.instance_types[0]
    min_count            = var.min_count
    max_count            = var.max_count
    vnet_subnet_id       = module.vnet.vnet_subnets[0]
    os_disk_size_gb      = var.os_disk_size_gb
  }

  identity {
    type = "SystemAssigned"
  }

  network_profile {
    network_plugin    = "azure"
    load_balancer_sku = "standard"
    service_cidr      = "10.0.4.0/24"
    dns_service_ip    = "10.0.4.10"
  }
}

# Azure Files Storage Account
resource "azurerm_storage_account" "main" {
  name                     = replace(lower("${var.cluster_name}st"), "-", "")
  resource_group_name      = azurerm_resource_group.main.name
  location                 = azurerm_resource_group.main.location
  account_tier             = "Premium"
  account_replication_type = "LRS"
  account_kind             = "FileStorage"
}

# Azure Files Share
resource "azurerm_storage_share" "main" {
  name               = "aksshare"
  storage_account_id = azurerm_storage_account.main.id
  quota              = 100
}

# Key Vault
resource "azurerm_key_vault" "main" {
  name                       = "${var.cluster_name}-kv"
  location                   = azurerm_resource_group.main.location
  resource_group_name        = azurerm_resource_group.main.name
  tenant_id                  = data.azurerm_client_config.current.tenant_id
  sku_name                   = "standard"
  soft_delete_retention_days = 7
  purge_protection_enabled   = false

  access_policy {
    tenant_id = data.azurerm_client_config.current.tenant_id
    object_id = data.azurerm_client_config.current.object_id

    key_permissions = [
      "Create",
      "Delete",
      "Get",
      "List",
    ]

    secret_permissions = [
      "Set",
      "Get",
      "Delete",
      "List",
    ]
  }
}

# Update kubeconfig
resource "null_resource" "kubectl" {
  provisioner "local-exec" {
    command = "az aks get-credentials --resource-group ${azurerm_resource_group.main.name} --name ${azurerm_kubernetes_cluster.main.name} --overwrite-existing"
  }
  depends_on = [azurerm_kubernetes_cluster.main]
}

# Data source for Azure subscription information
data "azurerm_client_config" "current" {}
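The storage account name in `azure_main.tf` is built with `replace(lower("${var.cluster_name}st"), "-", "")`, since Azure storage account names must be lowercase alphanumeric. A shell sketch of the same transformation, using a made-up cluster name:

```shell
# Mimic Terraform's replace(lower("${var.cluster_name}st"), "-", ""):
# append "st", lowercase the string, then strip hyphens.
cluster_name="Opea-Demo"   # hypothetical cluster name for illustration
account_name=$(printf '%s' "${cluster_name}st" | tr '[:upper:]' '[:lower:]' | tr -d '-')
echo "$account_name"   # prints: opeademost
```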
@@ -0,0 +1,6 @@
cluster_name = "opea"
instance_types = ["Standard_D32d_v5"]
node_pool_type = "Spot" # cheaper
os_disk_size_gb = 50
location = "eastus"
kubernetes_version = "1.30"
21 changes: 21 additions & 0 deletions cloud-service-provider/azure/aks/terraform/outputs.tf
@@ -0,0 +1,21 @@
output "cluster_endpoint" {
  description = "Endpoint for AKS control plane"
  sensitive   = true
  value       = azurerm_kubernetes_cluster.main.kube_config.0.host
}

output "oidc_issuer_url" {
  description = "The URL for the OpenID Connect issuer"
  value       = azurerm_kubernetes_cluster.main.oidc_issuer_url
}

output "location" {
  description = "Azure region"
  value       = var.location
}

output "cluster_name" {
  description = "Kubernetes Cluster Name"
  value       = azurerm_kubernetes_cluster.main.name
}
18 changes: 18 additions & 0 deletions cloud-service-provider/azure/aks/terraform/terraform.tf
@@ -0,0 +1,18 @@
terraform {
  required_providers {
    azurerm = {
      source  = "hashicorp/azurerm"
      version = "~> 4.0"
    }
    kubernetes = {
      source  = "hashicorp/kubernetes"
      version = "2.33.0"
    }
  }
}

# Azure provider configuration
provider "azurerm" {
  features {}
  subscription_id = var.subscription_id
}
83 changes: 83 additions & 0 deletions cloud-service-provider/azure/aks/terraform/variables.tf
@@ -0,0 +1,83 @@
variable "location" {
  description = "Azure region"
  type        = string
  default     = "eastus"
}

variable "cluster_name" {
  description = "AKS cluster name"
  type        = string
  # AKS cluster names cannot contain spaces; use a hyphenated default
  default     = "opea-aks-cluster"
}

variable "kubernetes_version" {
  description = "AKS cluster version"
  type        = string
  default     = "1.30"
}

variable "use_custom_node_config" {
  description = "Enable custom node configuration"
  type        = bool
  default     = true
}

variable "subscription_id" {
  description = "This is the Azure subscription id of the user"
  type        = string
}

variable "os_disk_size_gb" {
  description = "OS disk size in GB for nodes"
  type        = number
  default     = 50
}

variable "node_pool_type" {
  description = "VM spot or on-demand instance types"
  type        = string
  default     = "Regular" # Regular for on-demand, Spot for spot instances
}

variable "min_count" {
  description = "Minimum number of nodes"
  type        = number
  default     = 1
}

variable "max_count" {
  description = "Maximum number of nodes"
  type        = number
  default     = 10
}

variable "node_count" {
  description = "Desired number of nodes"
  type        = number
  default     = 1
}

variable "resource_group_name" {
  description = "Name of the resource group"
  type        = string
  default     = null
}

variable "vnet_subnet_id" {
  description = "ID of the subnet where the cluster will be deployed"
  type        = string
  default     = null
}

variable "cluster_version" {
  description = "Kubernetes version for the cluster"
  type        = string
  default     = "1.30"
}

variable "instance_types" {
  description = "Azure VM instance type"
  type        = list(string)
  default     = ["Standard_D32d_v5"]
}