
Hi, is there a bug in Video-LLaVA-main/videollava/model/multimodal_encoder/builder.py? #89

Open
sunwhw opened this issue Jan 27, 2024 · 8 comments


sunwhw commented Jan 27, 2024

I want to fine-tune based on the native LLaMA and LanguageBind.
In principle, if the model is downloaded locally, the code takes the first "if" branch (because is_absolute_path_exists is True), but this leads to a size-mismatch error.

But if I manually switch to the second branch, it says the image tower's and video tower's hidden dims are different.
My configuration files are all pulled from Hugging Face, so there should be no configuration errors. What causes such strange behavior?

LinB203 (Member) commented Jan 27, 2024

What is your "image tower"? The assertion enforces the encoder's output dimension to be 1024; 768 looks like the hidden dimension of a base-size image encoder.

@DemiLulu commented:

I have the same problem on my local machine, but it works on https://colab.research.google.com/.
The error is:
RuntimeError: Error(s) in loading state_dict for CLIPVisionModel:
size mismatch for vision_model.embeddings.class_embedding: copying a param with shape torch.Size([1024]) from checkpoint, the shape in current model is torch.Size([768]).
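The 1024-vs-768 mismatch in the traceback above can be checked without loading any weights, just by comparing config values. A minimal sketch (the helper name and the dict literals are illustrative, not part of the repo; the real values live in the LLaVA `config.json` and the tower's own `config.json`):

```python
# Hypothetical helper: compare the projector's expected width
# (mm_hidden_size in the LLaVA config) against the vision encoder's
# own hidden_size before attempting to load the state dict.
def towers_match(llava_cfg: dict, vision_cfg: dict) -> bool:
    return llava_cfg["mm_hidden_size"] == vision_cfg["hidden_size"]

# CLIP ViT-Large (and LanguageBind_Image) report 1024; CLIP ViT-Base reports 768.
llava_cfg = {"mm_hidden_size": 1024}
print(towers_match(llava_cfg, {"hidden_size": 1024}))  # True  -> loads cleanly
print(towers_match(llava_cfg, {"hidden_size": 768}))   # False -> the size-mismatch error above
```

If this returns False, the wrong encoder variant (e.g. a base checkpoint) was downloaded locally.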

@jiangtaoo2333 commented:

Same issue.

LinB203 (Member) commented Jan 29, 2024

Hi everyone, what is your "image_tower"? Is there minimal runnable code to help me reproduce the error?

@DemiLulu commented:

config file:

"intermediate_size": 11008,
"max_position_embeddings": 4096,
"mm_hidden_size": 1024,
"mm_image_tower": "/home/demi/model_lib/LanguageBind_Image",
"mm_projector_type": "mlp2x_gelu",
"mm_use_x_patch_token": false,
"mm_use_x_start_end": false,
"mm_video_tower": "/home/demi/model_lib/LanguageBind_Video_merge",
"mm_vision_select_feature": "patch",
"mm_vision_select_layer": -2,
"model_type": "llava",
"num_attention_heads": 32,


LinB203 (Member) commented Jan 30, 2024

If you want to run the model locally, you can refer to this issue:
#57 (comment)

sunwhw (Author) commented Feb 5, 2024

I solved it! I changed the code like this:

def build_image_tower(image_tower_cfg, **kwargs):
    image_tower = getattr(image_tower_cfg, 'mm_image_tower', getattr(image_tower_cfg, 'image_tower', None))
    # The original path-existence check sent local checkpoints into the CLIP branch:
    # if os.path.exists(image_tower) or image_tower.startswith("openai") or image_tower.startswith("laion"):
    #     return CLIPVisionTower(image_tower, args=image_tower_cfg, **kwargs)
    if image_tower.startswith("openai") or image_tower.startswith("laion"):
        return CLIPVisionTower(image_tower, args=image_tower_cfg, **kwargs)
    if image_tower.endswith('LanguageBind_Image'):
        return LanguageBindImageTower(image_tower, args=image_tower_cfg, cache_dir='./cache_dir', **kwargs)
    if 'mae' in image_tower:
        print('using MAE vision tower')
        return MAEVisionTower(image_tower, args=image_tower_cfg, cache_dir='./cache_dir', **kwargs)
    raise ValueError(f'Unknown image tower: {image_tower}')

In fact, if you run locally, you should take the second "if".
I haven't changed anything else, but the "mismatch" error disappeared, so it's still strange; anyway, it works now!
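Why the reordering helps can be seen in a minimal sketch of the dispatch (the return strings are just labels for illustration; the real function returns tower objects): with the path-existence check removed, a local path like `.../LanguageBind_Image` falls through to the suffix check instead of being routed into the generic CLIP branch.

```python
# Sketch of the revised branch order: prefix/suffix checks come first,
# so a local LanguageBind directory no longer hits the CLIP branch.
def pick_tower(image_tower: str) -> str:
    if image_tower.startswith(("openai", "laion")):
        return "CLIPVisionTower"
    if image_tower.endswith("LanguageBind_Image"):
        return "LanguageBindImageTower"
    if "mae" in image_tower:
        return "MAEVisionTower"
    raise ValueError(f"Unknown image tower: {image_tower}")

print(pick_tower("/home/demi/model_lib/LanguageBind_Image"))
# -> LanguageBindImageTower (the old path check would have picked CLIPVisionTower)
```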

LinB203 (Member) commented Feb 5, 2024

> I solved it! I changed the code like this: […] In fact, if you run locally, you should take the second "if". I haven't changed anything else, but the "mismatch" error disappeared; anyway, it works now!

Great! Congrats
