HuggingFaceModel #21

Open · wants to merge 23 commits into base: main

Commits (23):
c582ac8  Draft models and tests (simple-easydev, Apr 9, 2024)
e761491  Update huggingface model (simple-easydev, Apr 10, 2024)
365927a  Done first version of HuggingFaceModel (simple-easydev, Apr 11, 2024)
d9c19b7  Fix tiny bugs (simple-easydev, Apr 11, 2024)
ab22045  Fix review feedback (simple-easydev, Apr 11, 2024)
4fc31b4  Fix missing feedback (simple-easydev, Apr 11, 2024)
d40be9d  [wip] gpu support (jjleng, Mar 26, 2024)
5af3e16  feat(gpu): run models on cuda GPUs (jjleng, Apr 5, 2024)
615cc4d  feat(gpu): make nvidia device plugin tolerate model group taints (jjleng, Apr 6, 2024)
8181314  feat(gpu): set n_gpu_layers to offload work to gpu for the llama.cpp … (jjleng, Apr 9, 2024)
91e4571  feat(gpu): larger disk for gpu nodes (jjleng, Apr 9, 2024)
28075b7  feat(gpu): make model group node disk size configurable (jjleng, Apr 10, 2024)
ac8c726  feat(gpu): be able to request a number of GPUs through config (jjleng, Apr 10, 2024)
a945de8  docs: update README with the GPU support message (jjleng, Apr 10, 2024)
62e9a62  docs: add llama2 chat template for the invoice extraction example (jjleng, Apr 10, 2024)
c842495  docs: README for the invoice extraction example (jjleng, Apr 10, 2024)
ed40b64  docs(invoice_extraction): gpu_cluster.yaml for GPU inference (jjleng, Apr 10, 2024)
0aadc74  feat: remove finalizers before tearing down a cluster (jjleng, Apr 10, 2024)
4e2bdf7  chore: bump version (jjleng, Apr 10, 2024)
6f88d8a  docs: instructions for installing the pack CLI (jjleng, Apr 11, 2024)
c1bcd37  Update the progress status logging for downloading (simple-easydev, Apr 13, 2024)
a0f0ad4  docs: add pulumi CLI as a dependency (jjleng, Apr 13, 2024)
5863ad0  Fix test case for HuggingFaceModel.upload_file_to_s3 (simple-easydev, Apr 14, 2024)
4 changes: 3 additions & 1 deletion README.md
@@ -7,7 +7,7 @@
## Paka Highlights

- **Cloud-Agnostic Resource Provisioning**: paka starts by breaking down the barriers of cloud vendor lock-in, currently supporting EKS with plans to expand to more cloud services.
- **Optimized Model Execution**: Designed for efficiency, paka runs LLM models on CPUs, with imminent support for GPUs, ensuring optimal performance. Auto-scaling of model replicas based on CPU usage, request rate, and latency.
- **Optimized Model Execution**: Designed for efficiency, paka runs LLMs on CPUs and Nvidia GPUs, ensuring optimal performance. Model replicas auto-scale based on CPU usage, request rate, and latency.
- **Scalable Batch Job Management**: paka excels in managing batch jobs that dynamically scale out and in, catering to varying workload demands without manual intervention.
- **Seamless Application Deployment**: With support for running Langchain and LlamaIndex applications as functions, paka offers scalability to zero and back up, along with rolling updates to ensure no downtime.
- **Comprehensive Monitoring and Tracing**: Embedded with built-in support for metrics collection via Prometheus and Grafana, along with tracing through Zipkin.
@@ -105,6 +105,8 @@ paka cluster down -f cluster.yaml

## Dependencies
- docker daemon
- pack cli (https://buildpacks.io/docs/for-platform-operators/how-to/integrate-ci/pack/)
- pulumi cli (https://www.pulumi.com/docs/install/)
- aws cli and credentials for the AWS deployment
```bash
# Make sure aws credentials and cli are set up. Your aws credentials should have access to the following services:
67 changes: 67 additions & 0 deletions examples/invoice_extraction/README.md
@@ -0,0 +1,67 @@
## Invoice Extraction
This code provides an example of how to build a RESTful API that converts an invoice PDF into structured data (JSON). It extracts text from the PDF and then uses LangChain and Llama2-7B to extract structured fields from that text.
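
The parsing half of that pipeline is worth sketching. Below is a minimal, hypothetical version of the invoice schema and output parser; the field names mirror the sample JSON response further down, and the example's actual schema may differ:

```python
from langchain.output_parsers import PydanticOutputParser
from pydantic import BaseModel, Field

# Hypothetical schema; field names mirror the sample response shown below.
class Invoice(BaseModel):
    number: str = Field(description="Invoice number")
    date: str = Field(description="Invoice date")
    company: str = Field(description="Issuing company")
    amount: str = Field(description="Total amount due")

invoice_parser = PydanticOutputParser(pydantic_object=Invoice)

# These instructions are interpolated into the prompt so the LLM emits
# JSON matching the schema (compare format_instructions in serve.py below).
print(invoice_parser.get_format_instructions())
```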

## Running the Example

Follow the steps below to run the example:

1. **Install the necessary dependencies:**
```bash
pip install paka

# Ensure AWS credentials and CLI are set up. Your AWS credentials should have access to the following services:
# - S3
# - ECR
# - EKS
# - EC2
aws configure

# Install pack CLI and verify it is working (https://buildpacks.io/docs/for-platform-operators/how-to/integrate-ci/pack/)
pack --version

# Install pulumi CLI and verify it is working (https://www.pulumi.com/docs/install/)
pulumi version
```

2. **Ensure the Docker daemon is running:**
```bash
docker info
```

3. **Provision the cluster:**
```bash
cd examples/invoice_extraction

# Provision the cluster and update ~/.kube/config
paka cluster up -f cluster.yaml -u

# Provision a cluster with Nvidia GPUs
paka cluster up -f gpu_cluster.yaml -u
```

4. **Deploy the App:**
```bash
# The command below will build the source and deploy it as a serverless function.
paka function deploy --name invoice-extraction --source . --entrypoint serve
```

5. **Check the status of the functions:**
```bash
paka function list
```

If everything is successful, you should see the function in the list with a status of "READY". By default, the function is exposed through a publicly accessible REST API endpoint.

6. **Test the App:**

Submit the PDF invoices by hitting the `/extract_invoice` endpoint of the deployed function.

```bash
curl -X POST -H "Content-Type: multipart/form-data" -F "file=@/path/to/invoices/invoice-2024-02-29.pdf" http://invoice-extraction.default.xxxx.sslip.io/extract_invoice
```

If the invoice extraction is successful, you should see the structured data in the response, e.g.

```json
{"number":"#25927345","date":"2024-01-31T05:07:53","company":"Akamai Technologies, Inc.","company_address":"249 Arch St. Philadelphia, PA 19106 USA","tax_id":"United States EIN: 04-3432319","customer":"John Doe","customer_address":"1 Hacker Way Menlo Park, CA 94025","amount":"$5.00"}
```
30 changes: 30 additions & 0 deletions examples/invoice_extraction/gpu_cluster.yaml
@@ -0,0 +1,30 @@
aws:
cluster:
name: invoice-extraction
region: us-west-2
namespace: default
nodeType: t2.medium
minNodes: 2
maxNodes: 4
prometheus:
enabled: true
tracing:
enabled: false
modelGroups:
- nodeType: g4dn.xlarge
minInstances: 1
maxInstances: 1
name: llama2-7b
resourceRequest:
cpu: 3600m
memory: 14Gi
awsGpu: # This would enable inference on CUDA devices
diskSize: 40
autoScaleTriggers:
- type: prometheus
metadata:
serverAddress: http://kube-prometheus-stack-prometheus.prometheus.svc.cluster.local:9090
metricName: max_qps
threshold: '5'
query: |
max(rate(istio_requests_total{destination_service_name="llama2-7b", destination_app="model-group", response_code="200"}[1m]))
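
The trigger above asks KEDA to scale the model group once the peak 1-minute request rate across replicas exceeds 5 QPS. Here is a small sketch for sanity-checking the same query by hand against the Prometheus HTTP API; the local port-forward is an assumption based on the serverAddress above:

```python
import requests

# Assumes a local port-forward to the Prometheus service named above, e.g.:
#   kubectl -n prometheus port-forward svc/kube-prometheus-stack-prometheus 9090
PROMETHEUS = "http://localhost:9090"
QUERY = (
    'max(rate(istio_requests_total{destination_service_name="llama2-7b",'
    'destination_app="model-group",response_code="200"}[1m]))'
)

resp = requests.get(f"{PROMETHEUS}/api/v1/query", params={"query": QUERY}, timeout=10)
resp.raise_for_status()
result = resp.json()["data"]["result"]
# An empty result means no matching traffic yet; otherwise value[1] is the
# current max_qps that KEDA compares against the threshold of 5.
print(result[0]["value"][1] if result else "no samples")
```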
4 binary files not shown
6 changes: 4 additions & 2 deletions examples/invoice_extraction/serve.py
@@ -49,8 +49,11 @@ def extract(pdf_path: str) -> str:
Only return the extracted JSON object; don't say anything else.
"""

# Future paka code will be able to handle this
chat_template = f"[INST] <<SYS>><</SYS>>\n\n{template} [/INST]\n"

prompt = PromptTemplate(
template=template,
template=chat_template,
input_variables=["invoice_text"],
partial_variables={
"format_instructions": invoice_parser.get_format_instructions()
@@ -60,7 +63,6 @@ def extract(pdf_path: str) -> str:
llm = LlamaCpp(
model_url=LLM_URL,
temperature=0,
max_tokens=2500,
streaming=False,
)

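
The rest of serve.py isn't shown in this diff; as a sketch of how these pieces typically compose in LangChain (the wiring below is an assumption, not the file's actual code):

```python
# Hypothetical composition of the objects shown above: the prompt carries the
# llama2 [INST] chat template, LlamaCpp runs the inference, and the pydantic
# parser turns the model's JSON output back into a typed object.
chain = prompt | llm
raw_output = chain.invoke({"invoice_text": invoice_text})
invoice = invoice_parser.parse(raw_output)
```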
6 changes: 6 additions & 0 deletions examples/website_rag/README.md
@@ -15,6 +15,12 @@ pip install paka
# - EKS
# - EC2
aws configure

# Install pack CLI and verify it is working (https://buildpacks.io/docs/for-platform-operators/how-to/integrate-ci/pack/)
pack --version

# Install pulumi CLI and verify it is working (https://www.pulumi.com/docs/install/)
pulumi version
```

### Make sure docker daemon is running
7 changes: 6 additions & 1 deletion paka/__init__.py
@@ -1 +1,6 @@
__version__ = "0.1.1"
from importlib.metadata import PackageNotFoundError, version

try:
__version__ = version(__name__)
except PackageNotFoundError:
__version__ = ""
13 changes: 13 additions & 0 deletions paka/cli/cluster.py
@@ -4,6 +4,7 @@
import typer

from paka.cli.utils import load_cluster_manager
from paka.k8s import remove_crd_finalizers
from paka.k8s import update_kubeconfig as merge_update_kubeconfig
from paka.logger import logger

@@ -64,6 +65,18 @@ def down(
"all resources and data will be permanently deleted.",
default=False,
):
# Sometimes finalizers can block CRD deletion, so we force-remove them
# TODO: better way to handle this
remove_crd_finalizers(
"scaledobjects.keda.sh",
)
remove_crd_finalizers(
"routes.serving.knative.dev",
)
remove_crd_finalizers(
"ingresses.networking.internal.knative.dev",
)

cluster_manager = load_cluster_manager(cluster_config)
cluster_manager.destroy()

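`remove_crd_finalizers` itself isn't part of this diff; as a rough sketch of the kind of logic such a helper typically implements with the official kubernetes client (the group/version/plural values and all details here are assumptions, not paka's actual code):

```python
from kubernetes import client, config

def clear_finalizers(group: str, version: str, plural: str) -> None:
    """Blank out finalizers on all custom objects of a CRD so that deletion
    can proceed. Illustrative only; error handling omitted."""
    config.load_kube_config()
    api = client.CustomObjectsApi()
    objs = api.list_cluster_custom_object(group, version, plural)
    for obj in objs.get("items", []):
        meta = obj["metadata"]
        if meta.get("finalizers"):
            api.patch_namespaced_custom_object(
                group, version, meta["namespace"], plural, meta["name"],
                body={"metadata": {"finalizers": []}},
            )

# e.g. "scaledobjects.keda.sh" -> group="keda.sh", plural="scaledobjects";
# the served version ("v1alpha1") is an assumption.
clear_finalizers("keda.sh", "v1alpha1", "scaledobjects")
```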
17 changes: 11 additions & 6 deletions paka/cluster/aws/eks.py
@@ -14,6 +14,7 @@
from paka.cluster.keda import create_keda
from paka.cluster.knative import create_knative_and_istio
from paka.cluster.namespace import create_namespace
from paka.cluster.nvidia_device_plugin import install_nvidia_device_plugin
from paka.cluster.prometheus import create_prometheus
from paka.cluster.qdrant import create_qdrant
from paka.cluster.redis import create_redis
@@ -79,10 +80,6 @@ def create_node_group_for_model_group(
node_group_name=f"{project}-{kubify_name(model_group.name)}-group",
cluster=cluster,
instance_types=[model_group.nodeType],
# Set the desired size of the node group to the minimum number of instances
# specified for the model group.
# Note: Scaling down to 0 is not supported, since cold starting time is
# too long for model group services.
scaling_config=aws.eks.NodeGroupScalingConfigArgs(
desired_size=model_group.minInstances,
min_size=model_group.minInstances,
@@ -95,8 +92,6 @@
},
node_role_arn=worker_role.arn,
subnet_ids=vpc.private_subnet_ids,
# Apply taints to ensure that only pods belonging to the same model group
# can be scheduled on this node group.
taints=[
aws.eks.NodeGroupTaintArgs(
effect="NO_SCHEDULE", key="app", value="model-group"
@@ -105,6 +100,13 @@
effect="NO_SCHEDULE", key="model", value=model_group.name
),
],
# Supported AMI types https://docs.aws.amazon.com/eks/latest/APIReference/API_Nodegroup.html#AmazonEKS-Type-Nodegroup-amiType
ami_type=("AL2_x86_64_GPU" if model_group.awsGpu else None),
disk_size=(
model_group.awsGpu.diskSize
if model_group.awsGpu
else model_group.diskSize
),
)


@@ -301,6 +303,9 @@ def create_eks_resources(kubeconfig_json: str) -> None:
enable_cloudwatch(config, k8s_provider)
create_prometheus(config, k8s_provider)
create_zipkin(config, k8s_provider)
# Install the NVIDIA device plugin for GPU support
# Even if the cluster doesn't have GPUs, this won't cause any issues
install_nvidia_device_plugin(k8s_provider)

# TODO: Set timeout to be the one used by knative
update_elb_idle_timeout(kubeconfig_json, 300)
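The `awsGpu` block from gpu_cluster.yaml is what drives both the AMI selection and the disk size in the node group code above. A sketch of how the config model plausibly looks (field names are taken from gpu_cluster.yaml; the pydantic classes themselves are an assumption):

```python
from typing import Optional
from pydantic import BaseModel

class AwsGpu(BaseModel):
    diskSize: int = 20  # GiB; gpu_cluster.yaml sets 40

class ModelGroup(BaseModel):
    name: str
    nodeType: str
    minInstances: int
    maxInstances: int
    diskSize: int = 20
    # Presence of this block selects the AL2_x86_64_GPU AMI, and its diskSize
    # overrides the node group default.
    awsGpu: Optional[AwsGpu] = None
```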
87 changes: 87 additions & 0 deletions paka/cluster/nvidia_device_plugin.py
@@ -0,0 +1,87 @@
import pulumi
import pulumi_kubernetes as k8s


def install_nvidia_device_plugin(
k8s_provider: k8s.Provider, version: str = "v0.15.0-rc.2"
) -> None:
"""
Installs the NVIDIA device plugin for GPU support in the cluster.

This function deploys the NVIDIA device plugin to the cluster using a DaemonSet.
The device plugin allows Kubernetes to discover and manage GPU resources on the nodes.

Args:
k8s_provider (k8s.Provider): The Kubernetes provider to use for deploying the device plugin.
version (str): The image tag of the k8s-device-plugin to deploy.

Returns:
None
"""

k8s.apps.v1.DaemonSet(
"nvidia-device-plugin-daemonset",
metadata=k8s.meta.v1.ObjectMetaArgs(
namespace="kube-system",
),
spec=k8s.apps.v1.DaemonSetSpecArgs(
selector=k8s.meta.v1.LabelSelectorArgs(
match_labels={
"name": "nvidia-device-plugin-ds",
},
),
update_strategy=k8s.apps.v1.DaemonSetUpdateStrategyArgs(
type="RollingUpdate",
),
template=k8s.core.v1.PodTemplateSpecArgs(
metadata=k8s.meta.v1.ObjectMetaArgs(
labels={
"name": "nvidia-device-plugin-ds",
},
),
spec=k8s.core.v1.PodSpecArgs(
tolerations=[
k8s.core.v1.TolerationArgs(
key="nvidia.com/gpu",
operator="Exists",
effect="NoSchedule",
),
k8s.core.v1.TolerationArgs(operator="Exists"),
],
priority_class_name="system-node-critical",
containers=[
k8s.core.v1.ContainerArgs(
image=f"nvcr.io/nvidia/k8s-device-plugin:{version}",
name="nvidia-device-plugin-ctr",
env=[
k8s.core.v1.EnvVarArgs(
name="FAIL_ON_INIT_ERROR",
value="false",
)
],
security_context=k8s.core.v1.SecurityContextArgs(
allow_privilege_escalation=False,
capabilities=k8s.core.v1.CapabilitiesArgs(
drop=["ALL"],
),
),
volume_mounts=[
k8s.core.v1.VolumeMountArgs(
name="device-plugin",
mount_path="/var/lib/kubelet/device-plugins",
)
],
)
],
volumes=[
k8s.core.v1.VolumeArgs(
name="device-plugin",
host_path=k8s.core.v1.HostPathVolumeSourceArgs(
path="/var/lib/kubelet/device-plugins",
),
)
],
),
),
),
opts=pulumi.ResourceOptions(provider=k8s_provider),
)
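
Once the plugin's DaemonSet is running, GPU nodes advertise the `nvidia.com/gpu` extended resource, and workloads claim GPUs through resource limits. A minimal sketch (the container name and image are placeholders, not part of this PR):

```python
import pulumi_kubernetes as k8s

# Hypothetical container spec for a GPU workload; the device plugin makes
# the nvidia.com/gpu extended resource schedulable on GPU nodes.
gpu_container = k8s.core.v1.ContainerArgs(
    name="llm-server",               # placeholder name
    image="example.com/llm:latest",  # placeholder image
    resources=k8s.core.v1.ResourceRequirementsArgs(
        limits={"nvidia.com/gpu": "1"},
    ),
)
```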