Can't hear the audio #112

Open
sjghh opened this issue Oct 25, 2024 · 14 comments

Comments

@sjghh

sjghh commented Oct 25, 2024

import sys
sys.path.append('./')
from videollama2 import model_init, mm_infer
from videollama2.utils import disable_torch_init


def inference():
    disable_torch_init()

    # Video Inference
    modal = 'video'
    modal_path = '/data/video-llama2-av/VideoLLaMA2-audio_visual/assets/00001.mp4'
    instruct = 'What exactly did the person in the video say?'

    model_path = '/data/video-llama2-av/av-weight/VideoLLaMA2.1-7B-AV'
    model, processor, tokenizer = model_init(model_path)
    output = mm_infer(processor[modal](modal_path), instruct, model=model, tokenizer=tokenizer, do_sample=False, modal=modal)

    print(output)


if __name__ == "__main__":
    inference()

The output is: The person in the video spoke a few words, but they were not audible.
I input a video with sound, but it seems the model didn't pick it up. Is it because the audio branch isn't functioning properly? Also, I changed "mm_audio_tower" in VideoLLaMA2.1-7B-AV/config.json to the provided BEATs_iter3_plus_AS2M_finetuned_on_AS2M_cpt2.pt. Is this the correct place to make the change? Thanks for your reply!

@xinyifei99
Collaborator

xinyifei99 commented Oct 25, 2024

Thanks for your attention! Currently, our audio branch focuses mainly on understanding audio events and does not yet include speech recognition, so the model cannot identify what a speaker is saying. Also, to run inference for audio-visual tasks, you should clone the repository and switch to the audio_visual branch (https://github.com/DAMO-NLP-SG/VideoLLaMA2/tree/audio_visual).

@sjghh
Author

sjghh commented Oct 26, 2024

Thank you for your response. I have a few more questions.

First question: I have some video data that I want to fine-tune, and in va_joint.sh, I use --data_path ${DATA_DIR}/stage3_video_audio.json,${DATA_DIR}/stage2_audio_subset_new.json,${DATA_DIR}/stage2_video_subset.json . How should I design this? My understanding is that stage3_video_audio.json and stage2_audio_subset_new.json use the same set of videos, while ${DATA_DIR}/stage2_video_subset.json uses the audio from the videos.

Second question: I want to further train using VideoLLaMA2.1-7B-AV. How should I modify va_joint.sh? Additionally, what should I pay attention to during this process? Is it possible to see the prompts you used in your paper?

Looking forward to your response, and thank you again!

@xinyifei99
Collaborator

xinyifei99 commented Oct 26, 2024

For the first question, stage3_video_audio.json represents the newly added audio-video data in the joint training stage, stage2_video_subset.json represents the video subset used in the two-stage training of video, and stage2_audio_subset_new.json represents the audio subset used in the two-stage training of audio.
For the second question, the stage3_video_audio.json and stage2_video_subset.json files, which store video data, mainly use the following two data formats:
[image: the two video data formats]
The stage2_audio_subset_new.json file, which stores audio data, uses the following data format:
[image: the audio data format]

@xinyifei99 xinyifei99 reopened this Oct 26, 2024
@sjghh
Author

sjghh commented Oct 27, 2024

Thank you again for your response. Can I use only stage3_video_audio.json for the fine-tuning of the model? If so, should I simply provide the .json file for joint training in line 45 of va_joint.sh like this: --data_path ${DATA_DIR}/stage3_video_audio.json? Additionally, I would like to train on VideoLLaMA2.1-7B-AV. Should I change line 43 from --model_path DAMO-NLP-SG/VideoLLaMA2.1-7B-16F to VideoLLaMA2.1-7B-AV?

Thank you for taking the time to answer my question amidst your busy schedule!

@xinyifei99
Collaborator

You can fine-tune the model using only stage3_video_audio.json like this --data_path ${DATA_DIR}/stage3_video_audio.json; you can also use --model_path DAMO-NLP-SG/VideoLLaMA2.1-7B-AV to continue training VideoLLaMA2.1-7B-AV.

@sjghh
Author

sjghh commented Oct 27, 2024

Thank you very much for your response. When executing bash scripts/custom/va_joint.sh, I encountered the following error:
Traceback (most recent call last):
  File "/data/VideoLLaMA2-audio_visual/videollama2/train.py", line 683, in <module>
    train()
  File "/data/VideoLLaMA2-audio_visual/videollama2/train.py", line 590, in train
    model.get_model().initialize_audio_modules(
  File "/data/VideoLLaMA2-audio_visual/./videollama2/model/videollama2_arch.py", line 126, in initialize_audio_modules
    self.config.mm_hidden_size_a = audio_tower_cfg.encoder_embed_dim
UnboundLocalError: local variable 'audio_tower_cfg' referenced before assignment
Is this error caused because my audio_tower didn't load correctly? I have already implemented the inference for Video-Llama2. The va_joint.sh I used is as follows:
#!/bin/bash

# Environment Variables
ARG_WORLD_SIZE=${1:-1}
ARG_NPROC_PER_NODE=${2:-8}
ARG_MASTER_ADDR="127.0.0.1"
ARG_MASTER_PORT=16666
ARG_RANK=0

# Multiple conditions
if [ ! -n "$WORLD_SIZE" ] || [ ! -n "$NPROC_PER_NODE" ]; then
    WORLD_SIZE=$ARG_WORLD_SIZE
    NPROC_PER_NODE=$ARG_NPROC_PER_NODE
fi
if [ ! -n "$MASTER_ADDR" ] || [ ! -n "$MASTER_PORT" ] || [ ! -n "$RANK" ]; then
    MASTER_ADDR=$ARG_MASTER_ADDR
    MASTER_PORT=$ARG_MASTER_PORT
    RANK=$ARG_RANK
fi

echo "WORLD_SIZE: $WORLD_SIZE"
echo "NPROC_PER_NODE: $NPROC_PER_NODE"

# Training Arguments
GLOBAL_BATCH_SIZE=128
LOCAL_BATCH_SIZE=4
GRADIENT_ACCUMULATION_STEPS=$[$GLOBAL_BATCH_SIZE/($WORLD_SIZE*$NPROC_PER_NODE*$LOCAL_BATCH_SIZE)]

# Log Arguments
export TRANSFORMERS_OFFLINE=1
export WANDB_PROJECT=audio_visual_stage3_qwen2
RUN_NAME=audio_visual_stage3_qwen2
DATA_DIR=/data/VideoLLaMA2-audio_visual/datasets
OUTP_DIR=work_dirs

torchrun --nnodes $WORLD_SIZE \
    --nproc_per_node $NPROC_PER_NODE \
    --master_addr=$MASTER_ADDR \
    --master_port=$MASTER_PORT \
    --node_rank $RANK \
    videollama2/train.py \
    --deepspeed scripts/zero2.json \
    --model_type videollama2_qwen2 \
    --model_path /data/video-llama2-av/av-weight/VideoLLaMA2.1-7B-AV \
    --data_folder ${DATA_DIR} \
    --data_path ${DATA_DIR}/custom.json \
    --vision_tower /data/video-llama2-av/av-weight/siglip-so400m-patch14-384 \
    --audio_tower /data/video-llama2-av/av-weight/VideoLLaMA2.1-7B-AV/audio_tower.bin \
    --pretrain_mm_mlp_adapter_a /data/video-llama2-av/av-weight/VideoLLaMA2.1-7B-AV/mm_projector_a.bin \
    --mm_projector_type stc_connector_v35 \
    --mm_projector_a_type mlp2x_gelu \
    --va True \
    --tune_audio_tower True \
    --tune_adapter_llm True \
    --tune_mm_mlp_adapter_a True \
    --mm_vision_select_layer -2 \
    --image_aspect_ratio pad \
    --num_frames 16 \
    --bf16 True \
    --tf32 True \
    --fp16 False \
    --output_dir $OUTP_DIR/${WANDB_PROJECT}/VideoLLaMA2.1-7B-AV \
    --num_train_epochs 2 \
    --per_device_train_batch_size $LOCAL_BATCH_SIZE \
    --per_device_eval_batch_size 4 \
    --gradient_accumulation_steps $GRADIENT_ACCUMULATION_STEPS \
    --evaluation_strategy "no" \
    --save_strategy "steps" \
    --save_steps 2000 \
    --save_total_limit 2 \
    --learning_rate 2e-5 \
    --weight_decay 0. \
    --warmup_ratio 0.03 \
    --lr_scheduler_type "cosine" \
    --logging_steps 1 \
    --model_max_length 2048 \
    --gradient_checkpointing True \
    --dataloader_num_workers 4 \
    --lazy_preprocess True \
    --report_to tensorboard \
    --run_name $RUN_NAME

I have made the following changes to VideoLLaMA2.1-7B-AV/config.json:
"mm_audio_tower": "/data/video-llama2-av/av-weight/BEATs_iter3_plus_AS2M_finetuned_on_AS2M_cpt2.pt",
"mm_vision_tower": "/data/video-llama2-av/av-weight/siglip-so400m-patch14-384",
"_name_or_path": "/data/video-llama2-av/av-weight/VideoLLaMA2.1-7B-16F".
Thank you again for your help!

@Zzitang

Zzitang commented Oct 27, 2024

Replying to @sjghh's report of "UnboundLocalError: local variable 'audio_tower_cfg' referenced before assignment" when executing bash scripts/custom/va_joint.sh:

I solved this by adding self. before every occurrence of audio_tower_cfg (so that it becomes self.audio_tower_cfg) in videollama2/model/videollama2_arch.py.
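
For readers unfamiliar with this error, here is a minimal sketch of the general pattern behind an UnboundLocalError and of why switching to an instance attribute can avoid it. The classes, the condition, and the value 768 are hypothetical placeholders, not the actual code in videollama2_arch.py; the fix also assumes the attribute is set on the instance somewhere before it is read, otherwise it would raise AttributeError instead.

```python
class BrokenAudioModel:
    """Hypothetical stand-in that mimics the failing pattern."""

    def __init__(self):
        # Assumption: the tower is already present, so the branch below is skipped.
        self.audio_tower = object()

    def initialize_audio_modules(self):
        if self.audio_tower is None:
            # The local name is only bound inside this branch.
            audio_tower_cfg = {"encoder_embed_dim": 768}
        # When the branch is skipped, the local name was never bound.
        return audio_tower_cfg["encoder_embed_dim"]


class FixedAudioModel:
    """Hypothetical variant that keeps the config on the instance."""

    def __init__(self):
        self.audio_tower = object()
        # Assumption: the config is stored on the instance when the tower is built.
        self.audio_tower_cfg = {"encoder_embed_dim": 768}

    def initialize_audio_modules(self):
        if self.audio_tower is None:
            self.audio_tower_cfg = {"encoder_embed_dim": 768}
        # The attribute lookup works whether or not the branch ran.
        return self.audio_tower_cfg["encoder_embed_dim"]


if __name__ == "__main__":
    try:
        BrokenAudioModel().initialize_audio_modules()
    except UnboundLocalError as err:
        print("broken:", type(err).__name__)
    print("fixed:", FixedAudioModel().initialize_audio_modules())
```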

@sjghh
Author

sjghh commented Oct 27, 2024

Thank you for your response. I would like to ask what size GPU you used to get it running. I used 8 A100-40G GPUs, but I keep getting the following error:

[2024-10-27 17:04:36,988] torch.distributed.elastic.multiprocessing.api: [ERROR] failed (exitcode: -9) local_rank: 6 (pid: 2829472) of binary: /opt/conda/envs/Videollama2/bin/python
Traceback (most recent call last):
  File "/opt/conda/envs/Videollama2/bin/torchrun", line 8, in <module>
    sys.exit(main())
  File "/opt/conda/envs/Videollama2/lib/python3.9/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 347, in wrapper
    return f(*args, **kwargs)
  File "/opt/conda/envs/Videollama2/lib/python3.9/site-packages/torch/distributed/run.py", line 812, in main
    run(args)
  File "/opt/conda/envs/Videollama2/lib/python3.9/site-packages/torch/distributed/run.py", line 803, in run
    elastic_launch(
  File "/opt/conda/envs/Videollama2/lib/python3.9/site-packages/torch/distributed/launcher/api.py", line 135, in __call__
    return launch_agent(self._config, self._entrypoint, list(args))
  File "/opt/conda/envs/Videollama2/lib/python3.9/site-packages/torch/distributed/launcher/api.py", line 268, in launch_agent
    raise ChildFailedError(
torch.distributed.elastic.multiprocessing.errors.ChildFailedError:
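
A note on reading the log line above: a negative exit code from a child process conventionally means the process was terminated by the signal with that number, so "exitcode: -9" points to SIGKILL. On Linux this is often (though not always) the kernel's out-of-memory killer reacting to exhausted host RAM, which is a different failure from a CUDA out-of-memory error on the GPU. A minimal sketch for decoding such codes, assuming a POSIX system; the helper name describe_exit_code is made up for illustration:

```python
import signal


def describe_exit_code(exitcode: int) -> str:
    """Translate a child-process exit code into a readable description."""
    if exitcode < 0:
        # Negative values follow the convention "-N means killed by signal N".
        return f"killed by {signal.Signals(-exitcode).name}"
    return f"exited with status {exitcode}"


if __name__ == "__main__":
    print(describe_exit_code(-9))  # killed by SIGKILL (on POSIX systems)
```

If the decoded signal is SIGKILL, checking host memory usage during model loading (for example with free -h or the kernel log) may help tell the two cases apart.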

In addition, I made the following adjustments:

GLOBAL_BATCH_SIZE=32
LOCAL_BATCH_SIZE=1
--num_frames 8
--bf16 False
--tf32 True
--fp16 True \

But it still cannot train properly.
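
For reference, the sketch below reproduces in Python the arithmetic that va_joint.sh uses for GRADIENT_ACCUMULATION_STEPS, to show what the adjusted batch sizes imply. It mirrors the script's formula GLOBAL_BATCH_SIZE / (WORLD_SIZE * NPROC_PER_NODE * LOCAL_BATCH_SIZE), where shell integer arithmetic truncates the result. The function name is hypothetical, and the 6-GPU case is an assumed configuration with the global batch size left at 128.

```python
def grad_accum_steps(global_bs: int, world_size: int, nproc_per_node: int, local_bs: int):
    """Return (accumulation steps, effective global batch size)."""
    samples_per_step = world_size * nproc_per_node * local_bs
    # Integer division, matching the truncating shell arithmetic in the script.
    steps = global_bs // samples_per_step
    return steps, steps * samples_per_step


if __name__ == "__main__":
    # Script defaults: 1 node, 8 GPUs, local batch size 4.
    print(grad_accum_steps(128, 1, 8, 4))  # (4, 128)
    # Adjusted values from this comment: global 32, local 1, 8 GPUs.
    print(grad_accum_steps(32, 1, 8, 1))   # (4, 32)
    # Assumed 6-GPU setup: the division is not exact, so the
    # effective batch size drops from 128 to 120.
    print(grad_accum_steps(128, 1, 6, 2))  # (10, 120)
```

Lowering the local batch size reduces per-GPU activation memory, but the accumulation steps only change the effective batch size, not the memory needed to hold the model, gradients, and optimizer state.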

@ffcarina

I encountered the same issue. Does further fine-tuning of the VideoLLaMA2.1-7B-AV model require a larger GPU? I modified the va_joint.sh script to fine-tune the AV model, but kept getting OOM errors. However, I was able to fine-tune the VideoLLaMA2-7B model on the same GPU before.
Could you kindly provide an official script for further fine-tuning the VideoLLaMA2.1-7B-AV model?
Thank you very much. Looking forward to your response.

@Huskyii24

Has your problem been solved? I'm also having issues with OOM

@ffcarina

No... Without an official response and unsure how to solve it, I have temporarily put it aside.

@Huskyii24

Replying to @sjghh's comment about executing bash scripts/custom/va_joint.sh:

I used 6 A100-80G GPUs and set the local batch size to 2, and it worked.

@NBSHUN

NBSHUN commented Jan 9, 2025

Replying to @sjghh's report of "UnboundLocalError: local variable 'audio_tower_cfg' referenced before assignment" when executing bash scripts/custom/va_joint.sh:

How did you solve this problem?

@NBSHUN

NBSHUN commented Jan 9, 2025

Replying to @ffcarina: "I modified the va_joint.sh script to fine-tune the AV model, but kept getting OOM errors."

Hello, how did you modify the va_joint.sh script to fine-tune the AV model?
Thank you very much. Looking forward to your response.

7 participants