[LLM] Add Qwen2-VL multimodal example #3961

Merged 1 commit on Sep 19, 2024.

llm/qwen/README.md (+29 additions, −0 deletions)

@@ -67,6 +67,35 @@ curl http://$ENDPOINT/v1/chat/completions \
}' | jq -r '.choices[0].message.content'
```

## Running Multimodal Qwen2-VL

1. Start serving Qwen2-VL:

```console
sky launch -c qwen2-vl qwen2-vl-7b.yaml
```
2. Send a multimodal request to the endpoint for completion (a variant for local images is sketched after these steps):
```bash
ENDPOINT=$(sky status --endpoint 8000 qwen2-vl)

curl http://$ENDPOINT/v1/chat/completions \
    -H 'Content-Type: application/json' \
    -H 'Authorization: Bearer token' \
    --data '{
      "model": "Qwen/Qwen2-VL-7B-Instruct",
      "messages": [
        {
          "role": "user",
          "content": [
            {"type": "text", "text": "Convert this logo to ASCII art"},
            {"type": "image_url", "image_url": {"url": "https://pbs.twimg.com/profile_images/1584596138635632640/HWexMoH5_400x400.jpg"}}
          ]
        }
      ],
      "max_tokens": 1024
    }' | jq .
```
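
Local images can be sent through the same API by embedding them as a base64 `data:` URL, which vLLM's OpenAI-compatible server also accepts. A minimal sketch, assuming a local `logo.jpg` and GNU coreutils `base64` (on macOS, use `base64 -i logo.jpg` instead):

```bash
ENDPOINT=$(sky status --endpoint 8000 qwen2-vl)

# Encode the image as a single-line base64 string (GNU coreutils).
IMAGE_B64=$(base64 -w0 logo.jpg)

curl http://$ENDPOINT/v1/chat/completions \
    -H 'Content-Type: application/json' \
    -H 'Authorization: Bearer token' \
    --data '{
      "model": "Qwen/Qwen2-VL-7B-Instruct",
      "messages": [
        {
          "role": "user",
          "content": [
            {"type": "text", "text": "Describe this image"},
            {"type": "image_url", "image_url": {"url": "data:image/jpeg;base64,'"$IMAGE_B64"'"}}
          ]
        }
      ],
      "max_tokens": 1024
    }' | jq -r '.choices[0].message.content'
```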

## Scale up the service with SkyServe

1. With [SkyPilot Serving](https://skypilot.readthedocs.io/en/latest/serving/sky-serve.html), a serving library built on top of SkyPilot, scaling up the Qwen service is as simple as running:
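
   A sketch of that command, assuming SkyServe's standard `sky serve up` CLI and the YAML added in this PR:

   ```console
   sky serve up -n qwen2-vl qwen2-vl-7b.yaml
   ```
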
llm/qwen/qwen2-vl-7b.yaml (+36 additions, −0 deletions)

@@ -0,0 +1,36 @@
envs:
  MODEL_NAME: Qwen/Qwen2-VL-7B-Instruct

service:
  # Specifying the path to the endpoint to check the readiness of the replicas.
  readiness_probe:
    path: /v1/chat/completions
    post_data:
      model: $MODEL_NAME
      messages:
        - role: user
          content: Hello! What is your name?
      max_tokens: 1
    initial_delay_seconds: 1200
  # How many replicas to manage.
  replicas: 2

resources:
  accelerators: {L4, A10g, A10, L40, A40, A100, A100-80GB}
  disk_tier: best
  ports: 8000

setup: |
  # Install a newer transformers version for qwen2_vl support.
  pip install git+https://github.com/huggingface/transformers@21fac7abba2a37fae86106f87fcf9974fd1e3830
  pip install vllm==0.6.1.post2
  pip install vllm-flash-attn

run: |
  export PATH=$PATH:/sbin
  vllm serve $MODEL_NAME \
    --host 0.0.0.0 \
    --tensor-parallel-size $SKYPILOT_NUM_GPUS_PER_NODE \
    --max-model-len 2048 | tee ~/openai_api_server.log
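
For reference, the readiness probe in the `service` section above amounts to a request like the following (a sketch; `$ENDPOINT` is obtained as in the README steps, and `max_tokens: 1` keeps the probe cheap):

```bash
ENDPOINT=$(sky status --endpoint 8000 qwen2-vl)

curl http://$ENDPOINT/v1/chat/completions \
    -H 'Content-Type: application/json' \
    --data '{
      "model": "Qwen/Qwen2-VL-7B-Instruct",
      "messages": [{"role": "user", "content": "Hello! What is your name?"}],
      "max_tokens": 1
    }'
```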