Support More Pre-Converted VLM Models #5
Comments
RKLLM only supports the LLM portion of it, but you can use happyme531's MiniCPM-V-2_6 implementation here: https://huggingface.co/happyme531/MiniCPM-V-2_6-rkllm The MiniCPM-V-2_6 LLM (instead of just Qwen, if you want to swap that out) is here: https://huggingface.co/c01zaut/MiniCPM-V-2_6-rk3588-1.1.2
With the release of 1.1.4, there is now a full VLM pipeline for this platform. I am going to be doing some testing this week to get things working.
That’s fantastic! Looking forward to your new version~
@ZackPu - multimodal conversion completed. I have some tweaks to make to the pipeline, and then I will push it to my toolkit repo for converting on x86_64 for use on the RK3588. The recent update to 1.1.4 now supports Qwen2VL, in addition to MiniCPM-V-2_6. The pipeline is a lot more complex, though, since the embedder + mmproj need to be run as separate models in RKNN (basically modified ONNX) format. It's similar to happyme531's implementation.
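To make the "embedder + mmproj as separate models" flow above concrete, here is a minimal, hedged sketch of how a two-model VLM pipeline splices the vision encoder's projected features into the LLM's input embedding sequence at image-placeholder positions. All names (IMAGE_TOKEN, the toy embedding table) are illustrative assumptions, not the RKLLM or RKNN API.

```python
# Illustrative sketch only: a two-model VLM pipeline runs the image encoder
# + mmproj as one model, then feeds its output embeddings into the LLM by
# replacing image-placeholder tokens in the prompt's embedding sequence.
# IMAGE_TOKEN and the embedding table are hypothetical, not a real API.

IMAGE_TOKEN = -1  # hypothetical placeholder id marking where image features go

def splice_vision_embeddings(token_ids, text_embed, vision_embeds):
    """Replace each IMAGE_TOKEN position with the next vision embedding.

    token_ids:     prompt token ids, with IMAGE_TOKEN placeholders
    text_embed:    toy embedding table for text tokens (id -> vector)
    vision_embeds: vectors produced by the image encoder + mmproj model
    """
    out, vi = [], 0
    for tid in token_ids:
        if tid == IMAGE_TOKEN:
            out.append(vision_embeds[vi])  # vision feature takes this slot
            vi += 1
        else:
            out.append(text_embed[tid])    # ordinary text embedding lookup
    assert vi == len(vision_embeds), "placeholder/feature count mismatch"
    return out

# Toy usage: two text tokens surrounding one image slot
table = {1: [0.1, 0.1], 2: [0.2, 0.2]}
seq = splice_vision_embeddings([1, IMAGE_TOKEN, 2], table, [[9.0, 9.0]])
```

On real hardware, `vision_embeds` would come from the RKNN image-encoder model and `seq` would be handed to the RKLLM runtime as the fused input sequence; this sketch only shows the splicing step.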
Thanks! Maybe I can use your pipeline to build my deploy env? My VLM model: SigLIP + projector + fine-tuned Llama2 7B
@ZackPu - https://github.com/c0zaut/rkllm-mm-export This script converts Qwen2VL into an image encoder in RKNN format and an LLM in RKLLM format. The SigLIP encoder + mmproj is configured with fixed shapes and exported to ONNX, then to RKNN. You'll definitely need to tweak it for your purposes, and it doesn't have a deployment component yet, but it will at least let you experiment with converting a custom VLM. Since your LLM is a fine-tuned Llama2 7B, you should have no issue converting it.
Also, you may want to check out this new model, which has a bunch of tool-calling and web-search features: https://huggingface.co/Infinigence/Megrez-3B-Omni/ For it to work properly with RKLLM, you will need to set the eos token to <|turn_end|> on line 214 of tokenizer_config.json, and change the eos token array to a single value, id 120005, in both generation_config.json and config.json. That model is also Llama-based, uses Whisper as the audio encoder and a SigLIP image encoder for vision, and is only 3B for the LLM component. I just tested the LLM component and it runs at about 7 tok/s.
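The config edits described above can be scripted. This is a hedged sketch using only the Python standard library; the field names (`eos_token`, `eos_token_id`) follow common Hugging Face config conventions, but check them against the actual Megrez-3B-Omni files before relying on this.

```python
# Sketch of the Megrez-3B-Omni edits described above: set the eos token
# string in tokenizer_config.json, and collapse the eos token array to the
# single id 120005 in generation_config.json and config.json.
# Field names are assumed from typical HF configs; verify before use.
import json

def patch_eos(tokenizer_cfg_path, gen_cfg_path, model_cfg_path):
    # tokenizer_config.json: replace the eos token string
    with open(tokenizer_cfg_path) as f:
        tok = json.load(f)
    tok["eos_token"] = "<|turn_end|>"
    with open(tokenizer_cfg_path, "w") as f:
        json.dump(tok, f, indent=2)

    # generation_config.json and config.json: array -> single id 120005
    for path in (gen_cfg_path, model_cfg_path):
        with open(path) as f:
            cfg = json.load(f)
        cfg["eos_token_id"] = 120005  # was an array of ids
        with open(path, "w") as f:
            json.dump(cfg, f, indent=2)
```

Run it once against the three files in the downloaded model directory before invoking the RKLLM conversion.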
Would it be possible to provide more pre-converted Vision-Language Models (VLMs) on Hugging Face? I can't use this model on the RK3588 board: https://huggingface.co/openvla/openvla-7b