fix lita bugs (NVIDIA#9810) (NVIDIA#9828)
Signed-off-by: slyne deng <[email protected]>
Co-authored-by: Slyne Deng <[email protected]>
Co-authored-by: slyne deng <[email protected]>
Signed-off-by: kchike <[email protected]>
3 people authored and kchike committed Aug 8, 2024
1 parent 38912e4 commit 47e462f
Showing 4 changed files with 25 additions and 19 deletions.
@@ -395,8 +395,8 @@ def replace_media_embeddings(self, input_ids, inputs_embeds, media):
 t_token_start, t_token_end = start, start + T
 s_token_start, s_token_end = start + T, start + T + M
 assert s_token_end == end + 1, "Token replacement error"
-inputs_embeds[idx, t_token_start:t_token_end] = temporal_tokens[idx]
-inputs_embeds[idx, s_token_start:s_token_end] = spatial_tokens[idx]
+inputs_embeds[idx, t_token_start:t_token_end] = t_tokens[idx]
+inputs_embeds[idx, s_token_start:s_token_end] = s_tokens[idx]
 elif self.visual_token_format == 'im_vid_start_end': # v1.5 lita
 if not self.use_media_start_end:
 # replace the media start and media end embedding with
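The rename above suggests the enclosing method defines the temporal and spatial visual tokens as `t_tokens` and `s_tokens`, so the old references to `temporal_tokens` and `spatial_tokens` would have raised a `NameError` in this branch. A minimal sketch of the slice-replacement pattern, with made-up shapes and offsets (the real method computes `start`/`end` from the media token positions in `input_ids`):

```python
import torch

# Illustrative sizes only; the real values come from the video encoder.
B, L, H = 2, 16, 8      # batch, sequence length, hidden size
T, M = 4, 6             # number of temporal and spatial visual tokens

inputs_embeds = torch.zeros(B, L, H)
t_tokens = torch.randn(B, T, H)   # temporal visual tokens
s_tokens = torch.randn(B, M, H)   # spatial visual tokens

for idx in range(B):
    start, end = 3, 3 + T + M - 1           # hypothetical media span
    t_token_start, t_token_end = start, start + T
    s_token_start, s_token_end = start + T, start + T + M
    assert s_token_end == end + 1, "Token replacement error"
    # Overwrite the placeholder embeddings with the visual tokens.
    inputs_embeds[idx, t_token_start:t_token_end] = t_tokens[idx]
    inputs_embeds[idx, s_token_start:s_token_end] = s_tokens[idx]
```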
@@ -126,7 +126,8 @@
 
 event_prompts = [
 "What is the action performed in this video?",
-"Can you highlight the action performed in this video?" "What is the main event or action captured in this video?",
+"Can you highlight the action performed in this video?",
+"What is the main event or action captured in this video?",
 "Could you summarize the sequence of events depicted in this video?",
 ]
 
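The bug fixed here is Python's implicit concatenation of adjacent string literals: a missing comma fused two prompts into a single list element. A quick illustration:

```python
# Adjacent string literals are concatenated at compile time, so a missing
# comma silently merges two prompts into one list element.
broken = [
    "Can you highlight the action performed in this video?"  # <- missing comma
    "What is the main event or action captured in this video?",
]
fixed = [
    "Can you highlight the action performed in this video?",
    "What is the main event or action captured in this video?",
]
assert len(broken) == 1 and len(fixed) == 2
```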
@@ -120,7 +120,7 @@ def repl(match):
 return time_to_string(value) + f"<!|t{value}t|!>"
 
 value = re.sub(r"<([\d.]{1,20})s>", repl, value)
-value = re.sub(r"\s([\d.]{1,20})s[\s|\.|,|>]", repl, value)
+value = re.sub(r"\s([\d.]{1,20})s[\s\.,>]", repl, value)
 value = re.sub(r"\s([\d.]{1,20}) seconds", repl, value)
 value = re.sub(r"\s([\d.]{1,20}) second", repl, value)
 
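Inside a regex character class, `|` is a literal character rather than alternation, so the old class `[\s|\.|,|>]` also accepted `|` as a delimiter after a timestamp such as `12.5s`. A small check of the difference:

```python
import re

old = re.compile(r"\s([\d.]{1,20})s[\s|\.|,|>]")
new = re.compile(r"\s([\d.]{1,20})s[\s\.,>]")

text = " 12.5s| next"          # '|' should not count as a delimiter
print(bool(old.search(text)))  # True  -- old class accepts the literal '|'
print(bool(new.search(text)))  # False -- fixed class matches only \s . , >
```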
35 changes: 20 additions & 15 deletions tutorials/multimodal/LITA Tutorial.ipynb
@@ -12,17 +12,22 @@
 "metadata": {},
 "source": [
 "### Note:\n",
-"Currently, this notebook must be run in a NeMo container (> 24.04). An example command to launch the container:\n",
+"Currently, this notebook can be run in a NeMo container (>= 24.07). An example command to launch the container:\n",
 "\n",
 "```\n",
-"docker run --gpus all -it --rm -v <your_nemo_dir>:/opt/NeMo --shm-size=8g -p 8888:8888 --ulimit memlock=-1 --ulimit stack=67108864 <your_nemo_container>\n",
+"docker run --gpus all -it --rm -v $PWD:/ws --shm-size=8g -p 8888:8888 --ulimit memlock=-1 --ulimit stack=67108864 <your_nemo_container>\n",
 "```\n",
 "For inference and fine-tuning, you need to increase the shared memory size to avoid OOM issues. For example,\n",
 "```\n",
-"docker run --gpus all -it --rm -v <your_nemo_dir>:/opt/NeMo -v $PWD:/ws --shm-size=128g -p 8888:8888 --ulimit memlock=-1 --ulimit stack=67108864 nvcr.io/nvidia/nemo:dev\n",
+"docker run --gpus all -it --rm -v $PWD:/ws --shm-size=128g -p 8888:8888 --ulimit memlock=-1 --ulimit stack=67108864 nvcr.io/nvidia/nemo:dev\n",
 "```\n",
 "\n",
-"By `-v $PWD:/ws`, we can mount the current local directory to `/ws/` in docker container. We may use this local directory to put the `NeMo` source code, checkpoints and dataset we will generate."
+"By passing `-v $PWD:/ws`, we mount the current local directory to `/ws/` in the docker container. We can use this local directory to hold the `NeMo` source code, checkpoints, and the datasets we will generate.\n",
+"\n",
+"If you want to use a NeMo container (> 24.04 and < 24.07; not recommended), you need to manually mount the latest NeMo:\n",
+"```\n",
+"docker run --gpus all -it --rm -v <your_nemo_dir>:/opt/NeMo -v $PWD:/ws --shm-size=128g -p 8888:8888 --ulimit memlock=-1 --ulimit stack=67108864 <your_nemo_container>\n",
+"```"
 ]
 },
 {
@@ -66,7 +71,7 @@
 "source": [
 "### Tokenizer conversion\n",
 "Here we show how to add 100 time tokens and some NeMo extra tokens to a Hugging Face tokenizer.\n",
-"For the definition of nemo extra tokens, please refer to `NeMo/nemo/collections/multimodal/data/neva/conversation.py`.\n"
+"For the definition of the NeMo extra tokens, please refer to `/opt/NeMo/nemo/collections/multimodal/data/neva/conversation.py`.\n"
 ]
 },
 {
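The conversion cell itself is not part of this diff; a rough, hypothetical sketch of adding 100 time tokens to a Hugging Face tokenizer (the `<t0>`..`<t99>` naming, the base model, the extra-token list, and the output path are assumptions here, not the tutorial's exact code):

```python
from transformers import AutoTokenizer

# Hypothetical base tokenizer; substitute the model the tutorial converts.
tokenizer = AutoTokenizer.from_pretrained("lmsys/vicuna-7b-v1.5")

# 100 time tokens <t0> .. <t99>, plus illustrative extra tokens in the style
# of NeMo's neva conversation module (the real list lives in conversation.py).
time_tokens = [f"<t{i}>" for i in range(100)]
extra_tokens = ["<extra_id_0>", "<extra_id_1>"]  # illustrative only

num_added = tokenizer.add_tokens(time_tokens + extra_tokens, special_tokens=True)
print(f"added {num_added} tokens; vocab size is now {len(tokenizer)}")
tokenizer.save_pretrained("/ws/checkpoints/tokenizer_with_time_tokens")
```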
@@ -136,7 +141,7 @@
 "metadata": {},
 "source": [
 "### Checkpoint Conversion\n",
-"Since VILA and LITA shared a similar model structure as LLaVA, we'll leverage `NeMo/examples/multimodal/multimodal_llm/neva/convert_llava_to_neva.py` for converting the checkpoint. Since VILA and LITA depends on LLaVA, we need to clone LLaVA first.\n"
+"Since VILA and LITA share a model structure similar to LLaVA's, we'll leverage `/opt/NeMo/examples/multimodal/multimodal_llm/neva/convert_llava_to_neva.py` to convert the checkpoint. Since VILA and LITA depend on LLaVA, we need to clone LLaVA first.\n"
 ]
 },
 {
@@ -323,17 +328,17 @@
 "pip install moviepy\n",
 "\n",
 "#download videos, this may take a while\n",
-"python NeMo/scripts/multimodal_dataset_conversion/prepare_youmakeup.py -i YouMakeup/data/train/train_steps.json -o /ws/dataset -d True\n",
+"python /opt/NeMo/scripts/multimodal_dataset_conversion/prepare_youmakeup.py -i YouMakeup/data/train/train_steps.json -o /ws/dataset -d True\n",
 "\n",
 "#chunk videos into clips of 120 seconds each\n",
-"python NeMo/scripts/multimodal_dataset_conversion/prepare_youmakeup.py -i YouMakeup/data/train/train_steps.json -o /ws/dataset -l 12\n",
+"python /opt/NeMo/scripts/multimodal_dataset_conversion/prepare_youmakeup.py -i YouMakeup/data/train/train_steps.json -o /ws/dataset -l 12\n",
 "\n",
 "#create evaluation dataset\n",
-"python NeMo/scripts/multimodal_dataset_conversion/prepare_youmakeup.py -i YouMakeup/data/valid/valid_steps.json -o /ws/dataset/valid/ -d True\n",
-"python NeMo/scripts/multimodal_dataset_conversion/prepare_youmakeup.py -i YouMakeup/data/train/valid_steps.json -o /ws/dataset/valid/ -l 120\n",
+"python /opt/NeMo/scripts/multimodal_dataset_conversion/prepare_youmakeup.py -i YouMakeup/data/valid/valid_steps.json -o /ws/dataset/valid/ -d True\n",
+"python /opt/NeMo/scripts/multimodal_dataset_conversion/prepare_youmakeup.py -i YouMakeup/data/train/valid_steps.json -o /ws/dataset/valid/ -l 120\n",
 "\n",
 "#create QA style validation/evaluation or test dataset\n",
-"python3 NeMo/scripts/multimodal_dataset_conversion/convert_dvc_dataset_for_evaluation.py --input /ws/dataset/valid/train.json --output_file=/ws/dataset/valid/rtl_eval.json"
+"python3 /opt/NeMo/scripts/multimodal_dataset_conversion/convert_dvc_dataset_for_evaluation.py --input /ws/dataset/valid/train.json --output_file=/ws/dataset/valid/rtl_eval.json"
 ]
 },
 {
@@ -364,14 +369,14 @@
 "source": [
 "%%bash\n",
 "# generate custom caption dataset and multiply the dataset by three times\n",
-"python NeMo/scripts/multimodal_dataset_conversion/convert_dvc_dataset_for_training.py \\\n",
+"python /opt/NeMo/scripts/multimodal_dataset_conversion/convert_dvc_dataset_for_training.py \\\n",
 " --input_dvc_dataset /ws/dataset/train.json \\\n",
 " --video_path_prefix /ws/dataset/videos/ \\\n",
 " --subtask custom_caption --data_multiplier 3 \\\n",
 " --output_file /ws/dataset/vc_train.json\n",
 "\n",
 "# generate event localization dataset and multiply the dataset by three times\n",
-"python NeMo/scripts/multimodal_dataset_conversion/convert_dvc_dataset_for_training.py \\\n",
+"python /opt/NeMo/scripts/multimodal_dataset_conversion/convert_dvc_dataset_for_training.py \\\n",
 " --input_dvc_dataset /ws/dataset/train.json \\\n",
 " --video_path_prefix /ws/dataset/videos/ \\\n",
 " --subtask event_localization --data_multiplier 3 \\\n",
@@ -598,7 +603,7 @@
 "outputs": [],
 "source": [
 "%%bash\n",
-"python3 NeMo/examples/multimodal/multimodal_llm/neva/eval/eval_video_rtl.py \\\n",
+"python3 /opt/NeMo/examples/multimodal/multimodal_llm/neva/eval/eval_video_rtl.py \\\n",
 " --input_file=/ws/dataset/valid/split_output/nemo_infer_output_total.json \\\n",
 " --output_dir=/ws/dataset/valid/split_output/ --save_mid_result"
 ]
@@ -607,7 +612,7 @@
 "cell_type": "markdown",
 "metadata": {},
 "source": [
-"You many also refer to `NeMo/examples/multimodal/multimodal_llm/neva/eval/eval_vqa.py` to check how to use external LLM API to do the video question answering task evaluation."
+"You may also refer to `/opt/NeMo/examples/multimodal/multimodal_llm/neva/eval/eval_vqa.py` to see how to use an external LLM API to evaluate the video question answering task."
 ]
 }
 ],
