## HART support

Current configuration on the `cuda` device:

- hart-LLM: float16
- Qwen2-VL: int4 (bitsandbytes quantization)

This configuration requires around 3.7 GB of VRAM once the models are loaded, plus an additional ~3 GB during inference (due to the Qwen2 text embeddings), peaking at around 7.7 GB of VRAM.
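For reference, the int4 setup above can be expressed as a `BitsAndBytesConfig` when loading the Qwen2-VL weights through `transformers`. This is a sketch only; the checkpoint name and compute dtype are assumptions, not taken from this release:

```python
# Sketch: loading Qwen2-VL in 4-bit via bitsandbytes (checkpoint name assumed).
import torch
from transformers import BitsAndBytesConfig, Qwen2VLForConditionalGeneration

bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,                      # int4 weights, as in the config above
    bnb_4bit_compute_dtype=torch.float16,   # assumed compute dtype
)

model = Qwen2VLForConditionalGeneration.from_pretrained(
    "Qwen/Qwen2-VL-2B-Instruct",            # hypothetical checkpoint choice
    quantization_config=bnb_config,
    device_map="cuda",
)
```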
## Examples

We did not cherry-pick any results; the generated images are the first ones returned when prompting the model.
```shell
curl -X 'POST' \
  'http://localhost:8001/models/hart' \
  -H 'accept: application/json' \
  -H 'Content-Type: application/json' \
  -d '{
  "inputs": "An astronaut riding a horse on the moon, oil painting by Van Gogh.",
  "parameters": {}
}'
```
```shell
curl -X 'POST' \
  'http://localhost:8001/models/hart' \
  -H 'accept: application/json' \
  -H 'Content-Type: application/json' \
  -d '{
  "inputs": "A panda that has been cybernetically enhanced.",
  "parameters": {}
}'
```
Full Changelog: v0.1.0...v0.1.1