HuggingFaceModel #21

Open · wants to merge 23 commits into base: main

Commits (23):
c582ac8  Draft models and tests (simple-easydev, Apr 9, 2024)
e761491  Update huggingface model (simple-easydev, Apr 10, 2024)
365927a  Done first version of HuggingFaceModel (simple-easydev, Apr 11, 2024)
d9c19b7  Fix tiny bugs (simple-easydev, Apr 11, 2024)
ab22045  Fix review feedback (simple-easydev, Apr 11, 2024)
4fc31b4  Fix missing feedback (simple-easydev, Apr 11, 2024)
d40be9d  [wip] gpu support (jjleng, Mar 26, 2024)
5af3e16  feat(gpu): run models on cuda GPUs (jjleng, Apr 5, 2024)
615cc4d  feat(gpu): make nvidia device plugin tolerate model group taints (jjleng, Apr 6, 2024)
8181314  feat(gpu): set n_gpu_layers to offload work to gpu for the llama.cpp … (jjleng, Apr 9, 2024)
91e4571  feat(gpu): larger disk for gpu nodes (jjleng, Apr 9, 2024)
28075b7  feat(gpu): make model group node disk size configurable (jjleng, Apr 10, 2024)
ac8c726  feat(gpu): be able to request a number of GPUs through config (jjleng, Apr 10, 2024)
a945de8  docs: update README with the GPU support message (jjleng, Apr 10, 2024)
62e9a62  docs: add llama2 chat template for the invoice extraction example (jjleng, Apr 10, 2024)
c842495  docs: README for the invoice extraction example (jjleng, Apr 10, 2024)
ed40b64  docs(invoice_extraction): gpu_cluster.yaml for GPU inference (jjleng, Apr 10, 2024)
0aadc74  feat: remove finalizers before tearing down a cluster (jjleng, Apr 10, 2024)
4e2bdf7  chore: bump version (jjleng, Apr 10, 2024)
6f88d8a  docs: instructions for installing the pack CLI (jjleng, Apr 11, 2024)
c1bcd37  Update the progress status logging for downloading (simple-easydev, Apr 13, 2024)
a0f0ad4  docs: add pulumi CLI as a dependency (jjleng, Apr 13, 2024)
5863ad0  Fix test case for HuggingFaceModel.upload_file_to_s3 (simple-easydev, Apr 14, 2024)
4 changes: 3 additions & 1 deletion README.md
@@ -7,7 +7,7 @@
## Paka Highlights

- **Cloud-Agnostic Resource Provisioning**: paka starts by breaking down the barriers of cloud vendor lock-in, currently supporting EKS with plans to expand to more cloud services.
- **Optimized Model Execution**: Designed for efficiency, paka runs LLM models on CPUs, with imminent support for GPUs, ensuring optimal performance. Auto-scaling of model replicas based on CPU usage, request rate, and latency.
- **Optimized Model Execution**: Designed for efficiency, paka runs LLMs on CPUs and Nvidia GPUs, ensuring optimal performance. Model replicas auto-scale based on CPU usage, request rate, and latency.
- **Scalable Batch Job Management**: paka excels in managing batch jobs that dynamically scale out and in, catering to varying workload demands without manual intervention.
- **Seamless Application Deployment**: With support for running Langchain and LlamaIndex applications as functions, paka offers scalability to zero and back up, along with rolling updates to ensure no downtime.
- **Comprehensive Monitoring and Tracing**: Embedded with built-in support for metrics collection via Prometheus and Grafana, along with tracing through Zipkin.
@@ -105,6 +105,8 @@ paka cluster down -f cluster.yaml

## Dependencies
- docker daemon
- pack cli (https://buildpacks.io/docs/for-platform-operators/how-to/integrate-ci/pack/)
- pulumi cli (https://www.pulumi.com/docs/install/)
- aws cli and credentials for the AWS deployment
```bash
# Make sure aws credentials and cli are set up. Your aws credentials should have access to the following services:
67 changes: 67 additions & 0 deletions examples/invoice_extraction/README.md
@@ -0,0 +1,67 @@
## Invoice Extraction
This code provides an example of how to build a RESTful API that converts an invoice PDF into structured data (JSON). It extracts text from the PDF and then uses LangChain and Llama2-7B to extract structured fields from that text.
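
The parsing half of that pipeline is worth sketching. Below is a minimal, hypothetical version of the invoice schema and output parser; the field names mirror the sample JSON response further down, and the example's actual schema may differ:

```python
from langchain.output_parsers import PydanticOutputParser
from pydantic import BaseModel, Field

# Hypothetical schema; field names mirror the sample response shown below.
class Invoice(BaseModel):
    number: str = Field(description="Invoice number")
    date: str = Field(description="Invoice date")
    company: str = Field(description="Issuing company")
    amount: str = Field(description="Total amount due")

invoice_parser = PydanticOutputParser(pydantic_object=Invoice)

# These instructions are interpolated into the prompt so the LLM emits
# JSON matching the schema (compare format_instructions in serve.py below).
print(invoice_parser.get_format_instructions())
```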

## Running the Example

Follow the steps below to run the example:

1. **Install the necessary dependencies:**
```bash
pip install paka

# Ensure AWS credentials and CLI are set up. Your AWS credentials should have access to the following services:
# - S3
# - ECR
# - EKS
# - EC2
aws configure

# Install pack CLI and verify it is working (https://buildpacks.io/docs/for-platform-operators/how-to/integrate-ci/pack/)
pack --version

# Install pulumi CLI and verify it is working (https://www.pulumi.com/docs/install/)
pulumi version
```

2. **Ensure the Docker daemon is running:**
```bash
docker info
```

3. **Provision the cluster:**
```bash
cd examples/invoice_extraction

# Provision the cluster and update ~/.kube/config
paka cluster up -f cluster.yaml -u

# Provision a cluster with Nvidia GPUs
paka cluster up -f gpu_cluster.yaml -u
```

4. **Deploy the App:**
```bash
# The command below will build the source and deploy it as a serverless function.
paka function deploy --name invoice-extraction --source . --entrypoint serve
```

5. **Check the status of the functions:**
```bash
paka function list
```

If everything is successful, you should see the function in the list with a status of "READY". By default, the function is exposed through a publicly accessible REST API endpoint.

6. **Test the App:**

Submit the PDF invoices by hitting the `/extract_invoice` endpoint of the deployed function.

```bash
curl -X POST -H "Content-Type: multipart/form-data" -F "file=@/path/to/invoices/invoice-2024-02-29.pdf" http://invoice-extraction.default.xxxx.sslip.io/extract_invoice
```

If the invoice extraction is successful, you should see the structured data in the response, e.g.

```json
{"number":"#25927345","date":"2024-01-31T05:07:53","company":"Akamai Technologies, Inc.","company_address":"249 Arch St. Philadelphia, PA 19106 USA","tax_id":"United States EIN: 04-3432319","customer":"John Doe","customer_address":"1 Hacker Way Menlo Park, CA 94025","amount":"$5.00"}
```
30 changes: 30 additions & 0 deletions examples/invoice_extraction/gpu_cluster.yaml
@@ -0,0 +1,30 @@
aws:
cluster:
name: invoice-extraction
region: us-west-2
namespace: default
nodeType: t2.medium
minNodes: 2
maxNodes: 4
prometheus:
enabled: true
tracing:
enabled: false
modelGroups:
- nodeType: g4dn.xlarge
minInstances: 1
maxInstances: 1
name: llama2-7b
resourceRequest:
cpu: 3600m
memory: 14Gi
awsGpu: # This would enable inference on CUDA devices
diskSize: 40
autoScaleTriggers:
- type: prometheus
metadata:
serverAddress: http://kube-prometheus-stack-prometheus.prometheus.svc.cluster.local:9090
metricName: max_qps
threshold: '5'
query: |
max(rate(istio_requests_total{destination_service_name="llama2-7b", destination_app="model-group", response_code="200"}[1m]))
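
The trigger above asks KEDA to scale the model group once the peak 1-minute request rate across replicas exceeds 5 QPS. Here is a small sketch for sanity-checking the same query by hand against the Prometheus HTTP API; the local port-forward is an assumption based on the serverAddress above:

```python
import requests

# Assumes a local port-forward to the Prometheus service named above, e.g.:
#   kubectl -n prometheus port-forward svc/kube-prometheus-stack-prometheus 9090
PROMETHEUS = "http://localhost:9090"
QUERY = (
    'max(rate(istio_requests_total{destination_service_name="llama2-7b",'
    'destination_app="model-group",response_code="200"}[1m]))'
)

resp = requests.get(f"{PROMETHEUS}/api/v1/query", params={"query": QUERY}, timeout=10)
resp.raise_for_status()
result = resp.json()["data"]["result"]
# An empty result means no matching traffic yet; otherwise value[1] is the
# current max_qps that KEDA compares against the threshold of 5.
print(result[0]["value"][1] if result else "no samples")
```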
4 binary files not shown
6 changes: 4 additions & 2 deletions examples/invoice_extraction/serve.py
@@ -49,8 +49,11 @@ def extract(pdf_path: str) -> str:
Only return the extracted JSON object; don't say anything else.
"""

# Future paka code will be able to handle this
chat_template = f"[INST] <<SYS>><</SYS>>\n\n{template} [/INST]\n"

prompt = PromptTemplate(
template=template,
template=chat_template,
input_variables=["invoice_text"],
partial_variables={
"format_instructions": invoice_parser.get_format_instructions()
@@ -60,7 +63,6 @@ def extract(pdf_path: str) -> str:
llm = LlamaCpp(
model_url=LLM_URL,
temperature=0,
max_tokens=2500,
streaming=False,
)

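
The rest of serve.py isn't shown in this diff; as a sketch of how these pieces typically compose in LangChain (the wiring below is an assumption, not the file's actual code):

```python
# Hypothetical composition of the objects shown above: the prompt carries the
# llama2 [INST] chat template, LlamaCpp runs the inference, and the pydantic
# parser turns the model's JSON output back into a typed object.
chain = prompt | llm
raw_output = chain.invoke({"invoice_text": invoice_text})
invoice = invoice_parser.parse(raw_output)
```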
6 changes: 6 additions & 0 deletions examples/website_rag/README.md
@@ -15,6 +15,12 @@ pip install paka
# - EKS
# - EC2
aws configure

# Install pack CLI and verify it is working (https://buildpacks.io/docs/for-platform-operators/how-to/integrate-ci/pack/)
pack --version

# Install pulumi CLI and verify it is working (https://www.pulumi.com/docs/install/)
pulumi version
```

### Make sure docker daemon is running
7 changes: 6 additions & 1 deletion paka/__init__.py
@@ -1 +1,6 @@
__version__ = "0.1.1"
from importlib.metadata import PackageNotFoundError, version

try:
__version__ = version(__name__)
except PackageNotFoundError:
__version__ = ""
13 changes: 13 additions & 0 deletions paka/cli/cluster.py
@@ -4,6 +4,7 @@
import typer

from paka.cli.utils import load_cluster_manager
from paka.k8s import remove_crd_finalizers
from paka.k8s import update_kubeconfig as merge_update_kubeconfig
from paka.logger import logger

@@ -64,6 +65,18 @@ def down(
"all resources and data will be permanently deleted.",
default=False,
):
# Sometimes finalizers can block CRD deletion, so we force-remove them
# TODO: better way to handle this
remove_crd_finalizers(
"scaledobjects.keda.sh",
)
remove_crd_finalizers(
"routes.serving.knative.dev",
)
remove_crd_finalizers(
"ingresses.networking.internal.knative.dev",
)

cluster_manager = load_cluster_manager(cluster_config)
cluster_manager.destroy()

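`remove_crd_finalizers` itself isn't part of this diff; as a rough sketch of the kind of logic such a helper typically implements with the official kubernetes client (the group/version/plural values and all details here are assumptions, not paka's actual code):

```python
from kubernetes import client, config

def clear_finalizers(group: str, version: str, plural: str) -> None:
    """Blank out finalizers on all custom objects of a CRD so that deletion
    can proceed. Illustrative only; error handling omitted."""
    config.load_kube_config()
    api = client.CustomObjectsApi()
    objs = api.list_cluster_custom_object(group, version, plural)
    for obj in objs.get("items", []):
        meta = obj["metadata"]
        if meta.get("finalizers"):
            api.patch_namespaced_custom_object(
                group, version, meta["namespace"], plural, meta["name"],
                body={"metadata": {"finalizers": []}},
            )

# e.g. "scaledobjects.keda.sh" -> group="keda.sh", plural="scaledobjects";
# the served version ("v1alpha1") is an assumption.
clear_finalizers("keda.sh", "v1alpha1", "scaledobjects")
```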
17 changes: 11 additions & 6 deletions paka/cluster/aws/eks.py
@@ -14,6 +14,7 @@
from paka.cluster.keda import create_keda
from paka.cluster.knative import create_knative_and_istio
from paka.cluster.namespace import create_namespace
from paka.cluster.nvidia_device_plugin import install_nvidia_device_plugin
from paka.cluster.prometheus import create_prometheus
from paka.cluster.qdrant import create_qdrant
from paka.cluster.redis import create_redis
@@ -79,10 +80,6 @@ def create_node_group_for_model_group(
node_group_name=f"{project}-{kubify_name(model_group.name)}-group",
cluster=cluster,
instance_types=[model_group.nodeType],
# Set the desired size of the node group to the minimum number of instances
# specified for the model group.
# Note: Scaling down to 0 is not supported, since cold starting time is
# too long for model group services.
scaling_config=aws.eks.NodeGroupScalingConfigArgs(
desired_size=model_group.minInstances,
min_size=model_group.minInstances,
@@ -95,8 +92,6 @@
},
node_role_arn=worker_role.arn,
subnet_ids=vpc.private_subnet_ids,
# Apply taints to ensure that only pods belonging to the same model group
# can be scheduled on this node group.
taints=[
aws.eks.NodeGroupTaintArgs(
effect="NO_SCHEDULE", key="app", value="model-group"
@@ -105,6 +100,13 @@
effect="NO_SCHEDULE", key="model", value=model_group.name
),
],
# Supported AMI types https://docs.aws.amazon.com/eks/latest/APIReference/API_Nodegroup.html#AmazonEKS-Type-Nodegroup-amiType
ami_type=("AL2_x86_64_GPU" if model_group.awsGpu else None),
disk_size=(
model_group.awsGpu.diskSize
if model_group.awsGpu
else model_group.diskSize
),
)


@@ -301,6 +303,9 @@ def create_eks_resources(kubeconfig_json: str) -> None:
enable_cloudwatch(config, k8s_provider)
create_prometheus(config, k8s_provider)
create_zipkin(config, k8s_provider)
# Install the NVIDIA device plugin for GPU support
# Even if the cluster doesn't have GPUs, this won't cause any issues
install_nvidia_device_plugin(k8s_provider)

# TODO: Set timeout to be the one used by knative
update_elb_idle_timeout(kubeconfig_json, 300)
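The `awsGpu` block from gpu_cluster.yaml is what drives both the AMI selection and the disk size in the node group code above. A sketch of how the config model plausibly looks (field names are taken from gpu_cluster.yaml; the pydantic classes themselves are an assumption):

```python
from typing import Optional
from pydantic import BaseModel

class AwsGpu(BaseModel):
    diskSize: int = 20  # GiB; gpu_cluster.yaml sets 40

class ModelGroup(BaseModel):
    name: str
    nodeType: str
    minInstances: int
    maxInstances: int
    diskSize: int = 20
    # Presence of this block selects the AL2_x86_64_GPU AMI, and its diskSize
    # overrides the node group default.
    awsGpu: Optional[AwsGpu] = None
```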
87 changes: 87 additions & 0 deletions paka/cluster/nvidia_device_plugin.py
@@ -0,0 +1,87 @@
import pulumi
import pulumi_kubernetes as k8s


def install_nvidia_device_plugin(
k8s_provider: k8s.Provider, version: str = "v0.15.0-rc.2"
) -> None:
"""
Installs the NVIDIA device plugin for GPU support in the cluster.

This function deploys the NVIDIA device plugin to the cluster using a DaemonSet.
The device plugin allows Kubernetes to discover and manage GPU resources on the nodes.

Args:
k8s_provider (k8s.Provider): The Kubernetes provider to use for deploying the device plugin.
version (str): The image tag of the k8s-device-plugin to deploy.

Returns:
None
"""

k8s.apps.v1.DaemonSet(
"nvidia-device-plugin-daemonset",
metadata=k8s.meta.v1.ObjectMetaArgs(
namespace="kube-system",
),
spec=k8s.apps.v1.DaemonSetSpecArgs(
selector=k8s.meta.v1.LabelSelectorArgs(
match_labels={
"name": "nvidia-device-plugin-ds",
},
),
update_strategy=k8s.apps.v1.DaemonSetUpdateStrategyArgs(
type="RollingUpdate",
),
template=k8s.core.v1.PodTemplateSpecArgs(
metadata=k8s.meta.v1.ObjectMetaArgs(
labels={
"name": "nvidia-device-plugin-ds",
},
),
spec=k8s.core.v1.PodSpecArgs(
tolerations=[
k8s.core.v1.TolerationArgs(
key="nvidia.com/gpu",
operator="Exists",
effect="NoSchedule",
),
k8s.core.v1.TolerationArgs(operator="Exists"),
],
priority_class_name="system-node-critical",
containers=[
k8s.core.v1.ContainerArgs(
image=f"nvcr.io/nvidia/k8s-device-plugin:{version}",
name="nvidia-device-plugin-ctr",
env=[
k8s.core.v1.EnvVarArgs(
name="FAIL_ON_INIT_ERROR",
value="false",
)
],
security_context=k8s.core.v1.SecurityContextArgs(
allow_privilege_escalation=False,
capabilities=k8s.core.v1.CapabilitiesArgs(
drop=["ALL"],
),
),
volume_mounts=[
k8s.core.v1.VolumeMountArgs(
name="device-plugin",
mount_path="/var/lib/kubelet/device-plugins",
)
],
)
],
volumes=[
k8s.core.v1.VolumeArgs(
name="device-plugin",
host_path=k8s.core.v1.HostPathVolumeSourceArgs(
path="/var/lib/kubelet/device-plugins",
),
)
],
),
),
),
opts=pulumi.ResourceOptions(provider=k8s_provider),
)
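
Once the plugin's DaemonSet is running, GPU nodes advertise the `nvidia.com/gpu` extended resource, and workloads claim GPUs through resource limits. A minimal sketch (the container name and image are placeholders, not part of this PR):

```python
import pulumi_kubernetes as k8s

# Hypothetical container spec for a GPU workload; the device plugin makes
# the nvidia.com/gpu extended resource schedulable on GPU nodes.
gpu_container = k8s.core.v1.ContainerArgs(
    name="llm-server",               # placeholder name
    image="example.com/llm:latest",  # placeholder image
    resources=k8s.core.v1.ResourceRequirementsArgs(
        limits={"nvidia.com/gpu": "1"},
    ),
)
```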