Releases · philsupertramp/factory
v0.1.1
HART support
Current configuration on the cuda device:
- hart-LLM: float16
- Qwen2-VL: int4 (bitsandbytes quantization)

This configuration requires around 3.7 GB of VRAM once the models are loaded, plus an additional ~3 GB during inference (due to Qwen2 text embeddings), peaking at around 7.7 GB of VRAM.
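As a rough rule of thumb, weight memory is parameter count times bytes per parameter, which is why the int4 quantization keeps the footprint low. A minimal sketch of that arithmetic (the parameter counts below are illustrative placeholders, not the actual hart-LLM or Qwen2-VL sizes):

```python
def weight_memory_gb(n_params: float, bits_per_param: int) -> float:
    """Approximate weight memory: parameters * bits per parameter / 8, in GB (1e9 bytes)."""
    return n_params * bits_per_param / 8 / 1e9

# Illustrative placeholder sizes -- not the real model parameter counts.
print(weight_memory_gb(1e9, 16))  # a 1B-parameter model in float16 -> 2.0 GB
print(weight_memory_gb(2e9, 4))   # a 2B-parameter model in int4    -> 1.0 GB
```

Activation memory during inference (such as the Qwen2 text embeddings mentioned above) comes on top of these weight figures.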
Examples
We did not cherry-pick any results; the generated images are the first ones received when prompting the model.
```shell
curl -X 'POST' \
  'http://localhost:8001/models/hart' \
  -H 'accept: application/json' \
  -H 'Content-Type: application/json' \
  -d '{
    "inputs": "An astronaut riding a horse on the moon, oil painting by Van Gogh.",
    "parameters": {}
  }'
```
```shell
curl -X 'POST' \
  'http://localhost:8001/models/hart' \
  -H 'accept: application/json' \
  -H 'Content-Type: application/json' \
  -d '{
    "inputs": "A panda that has been cybernetically enhanced.",
    "parameters": {}
  }'
```
Full Changelog: v0.1.0...v0.1.1
v0.1.0
First release
This release exists only to mark a point in time from which to build.
It contains the MVP version of the service.
See #6 for more.
What's Changed
- Bump aiohttp from 3.9.3 to 3.9.4 by @dependabot in #1
- v1 by @philipp-zettl in #6
- Bump certifi from 2024.6.2 to 2024.7.4 by @dependabot in #9
- Bump jupyterlab from 4.2.1 to 4.2.5 by @dependabot in #10
New Contributors
- @dependabot made their first contribution in #1
- @philipp-zettl made their first contribution in #6
Full Changelog: https://github.com/philipp-zettl/factory/commits/v0.1.1