
Hi, is there a bug in Video-LLaVA-main/videollava/model/multimodal_encoder/builder.py? #89

Open
sunwhw opened this issue Jan 27, 2024 · 8 comments


sunwhw commented Jan 27, 2024

I want to fine-tune based on the native LLaMA and LanguageBind.
In principle, if the model is downloaded locally, the code takes the first "if" branch (because is_absolute_path_exists is True), but this leads to a size-mismatch error.

But if I manually switch to the second branch, it says the image tower's and video tower's hidden dims are different.
My configuration files are all pulled from Hugging Face, so there should be no configuration errors. What causes such strange behavior?

LinB203 (Member) commented Jan 27, 2024

What is your "image tower"? The assertion enforces the encoder's output dimension to be 1024; 768 looks like the hidden dimension of a base-size image encoder.

@DemiLulu commented:

I have the same problem on my local machine, but it works on https://colab.research.google.com/.
The error is:
RuntimeError: Error(s) in loading state_dict for CLIPVisionModel:
size mismatch for vision_model.embeddings.class_embedding: copying a param with shape torch.Size([1024]) from checkpoint, the shape in current model is torch.Size([768]).
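The 1024-vs-768 mismatch in the traceback above can be checked without loading any weights, just by comparing config values. A minimal sketch (the helper name and the dict literals are illustrative, not part of the repo; the real values live in the LLaVA `config.json` and the tower's own `config.json`):

```python
# Hypothetical helper: compare the projector's expected width
# (mm_hidden_size in the LLaVA config) against the vision encoder's
# own hidden_size before attempting to load the state dict.
def towers_match(llava_cfg: dict, vision_cfg: dict) -> bool:
    return llava_cfg["mm_hidden_size"] == vision_cfg["hidden_size"]

# CLIP ViT-Large (and LanguageBind_Image) report 1024; CLIP ViT-Base reports 768.
llava_cfg = {"mm_hidden_size": 1024}
print(towers_match(llava_cfg, {"hidden_size": 1024}))  # True  -> loads cleanly
print(towers_match(llava_cfg, {"hidden_size": 768}))   # False -> the size-mismatch error above
```

If this returns False, the wrong encoder variant (e.g. a base checkpoint) was downloaded locally.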

@jiangtaoo2333 commented:

Same issue.

LinB203 (Member) commented Jan 29, 2024

Hi everyone, what is your "image_tower"? Is there minimal runnable code to help me reproduce the error?

@DemiLulu commented:

config file:

"intermediate_size": 11008,
"max_position_embeddings": 4096,
"mm_hidden_size": 1024,
"mm_image_tower": "/home/demi/model_lib/LanguageBind_Image",
"mm_projector_type": "mlp2x_gelu",
"mm_use_x_patch_token": false,
"mm_use_x_start_end": false,
"mm_video_tower": "/home/demi/model_lib/LanguageBind_Video_merge",
"mm_vision_select_feature": "patch",
"mm_vision_select_layer": -2,
"model_type": "llava",
"num_attention_heads": 32,


LinB203 (Member) commented Jan 30, 2024

If you want to run the model locally, you can refer to this issue:
#57 (comment)

sunwhw (Author) commented Feb 5, 2024

I solved it! I changed the code like this:

def build_image_tower(image_tower_cfg, **kwargs):
    image_tower = getattr(image_tower_cfg, 'mm_image_tower', getattr(image_tower_cfg, 'image_tower', None))
    # The original path-existence check sent local checkpoints into the CLIP branch:
    # if os.path.exists(image_tower) or image_tower.startswith("openai") or image_tower.startswith("laion"):
    #     return CLIPVisionTower(image_tower, args=image_tower_cfg, **kwargs)
    if image_tower.startswith("openai") or image_tower.startswith("laion"):
        return CLIPVisionTower(image_tower, args=image_tower_cfg, **kwargs)
    if image_tower.endswith('LanguageBind_Image'):
        return LanguageBindImageTower(image_tower, args=image_tower_cfg, cache_dir='./cache_dir', **kwargs)
    if 'mae' in image_tower:
        print('using MAE vision tower')
        return MAEVisionTower(image_tower, args=image_tower_cfg, cache_dir='./cache_dir', **kwargs)
    raise ValueError(f'Unknown image tower: {image_tower}')

In fact, if you run locally, you should take the second "if".
I haven't changed anything else, but the "mismatch" error disappeared, so it's still strange; anyway, it works now!
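Why the reordering helps can be seen in a minimal sketch of the dispatch (the return strings are just labels for illustration; the real function returns tower objects): with the path-existence check removed, a local path like `.../LanguageBind_Image` falls through to the suffix check instead of being routed into the generic CLIP branch.

```python
# Sketch of the revised branch order: prefix/suffix checks come first,
# so a local LanguageBind directory no longer hits the CLIP branch.
def pick_tower(image_tower: str) -> str:
    if image_tower.startswith(("openai", "laion")):
        return "CLIPVisionTower"
    if image_tower.endswith("LanguageBind_Image"):
        return "LanguageBindImageTower"
    if "mae" in image_tower:
        return "MAEVisionTower"
    raise ValueError(f"Unknown image tower: {image_tower}")

print(pick_tower("/home/demi/model_lib/LanguageBind_Image"))
# -> LanguageBindImageTower (the old path check would have picked CLIPVisionTower)
```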

LinB203 (Member) commented Feb 5, 2024

> I solved it! I changed the code like this: […] In fact, if you run locally, you should take the second "if". I haven't changed anything else, but the "mismatch" error disappeared; anyway, it works now!

Great! Congrats
