Quantization in glm4-9b failed #489

Open
Orion-zhen opened this issue Jun 5, 2024 · 4 comments

@Orion-zhen
Contributor

When I run convert.py with the command CUDA_VISIBLE_DEVICES=1 python convert.py -i /home/orion/ai/Models/glm4-9b -o ./tmp-file -cf /home/orion/ai/Models/glm4-9b-4-exl2 -r 256, it fails with the error TypeError: Value for eos_token_id is not of expected type <class 'int'>.

It seems the glm4 architecture isn't supported yet.

Steps to reproduce: Just download the glm4-9b model and run convert.py as the README describes.

Full console log:

 !! Warning, unknown architecture: ChatGLMModel
 !! Loading as LlamaForCausalLM
Traceback (most recent call last):
  File "/home/orion/repo/exllamav2/convert.py", line 71, in <module>
    config.prepare()
  File "/home/orion/repo/exllamav2/exllamav2/config.py", line 187, in prepare
    self.eos_token_id = read(read_config, int, "eos_token_id", None)  # 2
  File "/home/orion/repo/exllamav2/exllamav2/config.py", line 40, in read
    raise TypeError(f"Value for {key} is not of expected type {expected_type}")
TypeError: Value for eos_token_id is not of expected type <class 'int'>
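
For anyone who wants to experiment before official support lands: the immediate failure is just that GLM4's config.json stores eos_token_id as a list of token IDs, while exllamav2's config reader insists on a single int. Below is a minimal sketch of a more tolerant read, assuming a list-valued eos_token_id in config.json; the helper name read_lenient and the first-element coercion are illustrative assumptions, not exllamav2's actual API.

```python
import json

def read_lenient(cfg: dict, key: str, default=None):
    """Return cfg[key]; if the value is a list of ints, fall back to the first one."""
    value = cfg.get(key, default)
    if isinstance(value, list) and value and all(isinstance(v, int) for v in value):
        # GLM4 ships several stop tokens; keeping only the first loses the
        # alternates but is enough to let a conversion script proceed.
        return value[0]
    return value

# Path taken from the command in the issue.
with open("/home/orion/ai/Models/glm4-9b/config.json") as f:
    cfg = json.load(f)

# Mirrors the failing read of eos_token_id in config.py's prepare().
eos_token_id = read_lenient(cfg, "eos_token_id", None)
```

A proper fix would keep the whole list and stop generation on any of the IDs, which is part of what turboderp describes below.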
@turboderp
Member

Yeah, the architecture isn't supported. There are a bunch of little things that would have to be updated, like the EOS token suddenly being a list, scaled attention layers and such. It's not high on the list of priorities at the moment. Not sure if the model is any good, or if it's any good without the multimodal capabilities, which wouldn't be supported anyway.

@Orion-zhen
Contributor Author

According to THUDM, glm4-9b can outperform llama3-8b, so it might be worth a try. BTW, would it be possible to add multimodal support to exllamav2 in the future? It seems multimodal LLMs (VLMs) could be the next trend.

@turboderp
Member

Multimodal is possible, of course, as is GLM4 in general, along with diffusion models, TTS, you name it. I just have to prioritize. But contributions are always welcome.

@Trapper4888

Trapper4888 commented Oct 2, 2024

It's getting old, so there will probably be a better GLM model soon, maybe easier to integrate into exl2, but it is supposed to be SOTA on a (low-)hallucination benchmark:
https://huggingface.co/spaces/vectara/leaderboard

It would be cool for the devs of exotic architectures to contribute to their integration into inference engines, but I'm not blaming them either. First, I'm not contributing much myself; plus, things move so fast, and it's good that people do what they are best at. If there were very high pressure to have a GLM exl2, someone would eventually do it.

edit: Seems like it's getting HF-ized: huggingface/transformers#33823
