fix lita bugs (NVIDIA#9810) (NVIDIA#9828)
Signed-off-by: slyne deng <[email protected]>
Co-authored-by: Slyne Deng <[email protected]>
Co-authored-by: slyne deng <[email protected]>
Signed-off-by: kchike <[email protected]>
3 people authored and kchike committed Aug 8, 2024
1 parent 38912e4 commit 47e462f
Showing 4 changed files with 25 additions and 19 deletions.
@@ -395,8 +395,8 @@ def replace_media_embeddings(self, input_ids, inputs_embeds, media):
 t_token_start, t_token_end = start, start + T
 s_token_start, s_token_end = start + T, start + T + M
 assert s_token_end == end + 1, "Token replacement error"
-inputs_embeds[idx, t_token_start:t_token_end] = temporal_tokens[idx]
-inputs_embeds[idx, s_token_start:s_token_end] = spatial_tokens[idx]
+inputs_embeds[idx, t_token_start:t_token_end] = t_tokens[idx]
+inputs_embeds[idx, s_token_start:s_token_end] = s_tokens[idx]
 elif self.visual_token_format == 'im_vid_start_end': # v1.5 lita
 if not self.use_media_start_end:
 # replace the media start and media end embedding with
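The rename above suggests the enclosing method defines the temporal and spatial visual tokens as `t_tokens` and `s_tokens`, so the old references to `temporal_tokens` and `spatial_tokens` would have raised a `NameError` in this branch. A minimal sketch of the slice-replacement pattern, with made-up shapes and offsets (the real method computes `start`/`end` from the media token positions in `input_ids`):

```python
import torch

# Illustrative sizes only; the real values come from the video encoder.
B, L, H = 2, 16, 8      # batch, sequence length, hidden size
T, M = 4, 6             # number of temporal and spatial visual tokens

inputs_embeds = torch.zeros(B, L, H)
t_tokens = torch.randn(B, T, H)   # temporal visual tokens
s_tokens = torch.randn(B, M, H)   # spatial visual tokens

for idx in range(B):
    start, end = 3, 3 + T + M - 1           # hypothetical media span
    t_token_start, t_token_end = start, start + T
    s_token_start, s_token_end = start + T, start + T + M
    assert s_token_end == end + 1, "Token replacement error"
    # Overwrite the placeholder embeddings with the visual tokens.
    inputs_embeds[idx, t_token_start:t_token_end] = t_tokens[idx]
    inputs_embeds[idx, s_token_start:s_token_end] = s_tokens[idx]
```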
@@ -126,7 +126,8 @@
 
 event_prompts = [
 "What is the action performed in this video?",
-"Can you highlight the action performed in this video?" "What is the main event or action captured in this video?",
+"Can you highlight the action performed in this video?",
+"What is the main event or action captured in this video?",
 "Could you summarize the sequence of events depicted in this video?",
 ]
 
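The bug fixed here is Python's implicit concatenation of adjacent string literals: a missing comma fused two prompts into a single list element. A quick illustration:

```python
# Adjacent string literals are concatenated at compile time, so a missing
# comma silently merges two prompts into one list element.
broken = [
    "Can you highlight the action performed in this video?"  # <- missing comma
    "What is the main event or action captured in this video?",
]
fixed = [
    "Can you highlight the action performed in this video?",
    "What is the main event or action captured in this video?",
]
assert len(broken) == 1 and len(fixed) == 2
```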
@@ -120,7 +120,7 @@ def repl(match):
 return time_to_string(value) + f"<!|t{value}t|!>"
 
 value = re.sub(r"<([\d.]{1,20})s>", repl, value)
-value = re.sub(r"\s([\d.]{1,20})s[\s|\.|,|>]", repl, value)
+value = re.sub(r"\s([\d.]{1,20})s[\s\.,>]", repl, value)
 value = re.sub(r"\s([\d.]{1,20}) seconds", repl, value)
 value = re.sub(r"\s([\d.]{1,20}) second", repl, value)
 
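Inside a regex character class, `|` is a literal character rather than alternation, so the old class `[\s|\.|,|>]` also accepted `|` as a delimiter after a timestamp such as `12.5s`. A small check of the difference:

```python
import re

old = re.compile(r"\s([\d.]{1,20})s[\s|\.|,|>]")
new = re.compile(r"\s([\d.]{1,20})s[\s\.,>]")

text = " 12.5s| next"          # '|' should not count as a delimiter
print(bool(old.search(text)))  # True  -- old class accepts the literal '|'
print(bool(new.search(text)))  # False -- fixed class matches only \s . , >
```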
35 changes: 20 additions & 15 deletions tutorials/multimodal/LITA Tutorial.ipynb
@@ -12,17 +12,22 @@
 "metadata": {},
 "source": [
 "### Note:\n",
-"Currently, this notebook must be run in a NeMo container (> 24.04). An example command to launch the container:\n",
+"Currently, this notebook can be run in a NeMo container (>= 24.07). An example command to launch the container:\n",
 "\n",
 "```\n",
-"docker run --gpus all -it --rm -v <your_nemo_dir>:/opt/NeMo --shm-size=8g -p 8888:8888 --ulimit memlock=-1 --ulimit stack=67108864 <your_nemo_container>\n",
+"docker run --gpus all -it --rm -v $PWD:/ws --shm-size=8g -p 8888:8888 --ulimit memlock=-1 --ulimit stack=67108864 <your_nemo_container>\n",
 "```\n",
 "For inference and fine-tuning, you need to increase the shared memory size to avoid OOM issues. For example,\n",
 "```\n",
-"docker run --gpus all -it --rm -v <your_nemo_dir>:/opt/NeMo -v $PWD:/ws --shm-size=128g -p 8888:8888 --ulimit memlock=-1 --ulimit stack=67108864 nvcr.io/nvidia/nemo:dev\n",
+"docker run --gpus all -it --rm -v $PWD:/ws --shm-size=128g -p 8888:8888 --ulimit memlock=-1 --ulimit stack=67108864 nvcr.io/nvidia/nemo:dev\n",
 "```\n",
 "\n",
-"By `-v $PWD:/ws`, we can mount the current local directory to `/ws/` in docker container. We may use this local directory to put the `NeMo` source code, checkpoints and dataset we will generate."
+"By passing `-v $PWD:/ws`, we mount the current local directory to `/ws/` in the docker container. We can use this local directory to hold the `NeMo` source code, checkpoints, and the datasets we will generate.\n",
+"\n",
+"If you want to use a NeMo container (> 24.04 and < 24.07; not recommended), you need to manually mount the latest NeMo:\n",
+"```\n",
+"docker run --gpus all -it --rm -v <your_nemo_dir>:/opt/NeMo -v $PWD:/ws --shm-size=128g -p 8888:8888 --ulimit memlock=-1 --ulimit stack=67108864 <your_nemo_container>\n",
+"```"
 ]
 },
 {
@@ -66,7 +71,7 @@
 "source": [
 "### Tokenizer conversion\n",
 "Here we show how to add 100 time tokens and some NeMo extra tokens to a Hugging Face tokenizer.\n",
-"For the definition of nemo extra tokens, please refer to `NeMo/nemo/collections/multimodal/data/neva/conversation.py`.\n"
+"For the definition of the NeMo extra tokens, please refer to `/opt/NeMo/nemo/collections/multimodal/data/neva/conversation.py`.\n"
 ]
 },
 {
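The conversion cell itself is not part of this diff; a rough, hypothetical sketch of adding 100 time tokens to a Hugging Face tokenizer (the `<t0>`..`<t99>` naming, the base model, the extra-token list, and the output path are assumptions here, not the tutorial's exact code):

```python
from transformers import AutoTokenizer

# Hypothetical base tokenizer; substitute the model the tutorial converts.
tokenizer = AutoTokenizer.from_pretrained("lmsys/vicuna-7b-v1.5")

# 100 time tokens <t0> .. <t99>, plus illustrative extra tokens in the style
# of NeMo's neva conversation module (the real list lives in conversation.py).
time_tokens = [f"<t{i}>" for i in range(100)]
extra_tokens = ["<extra_id_0>", "<extra_id_1>"]  # illustrative only

num_added = tokenizer.add_tokens(time_tokens + extra_tokens, special_tokens=True)
print(f"added {num_added} tokens; vocab size is now {len(tokenizer)}")
tokenizer.save_pretrained("/ws/checkpoints/tokenizer_with_time_tokens")
```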
@@ -136,7 +141,7 @@
 "metadata": {},
 "source": [
 "### Checkpoint Conversion\n",
-"Since VILA and LITA shared a similar model structure as LLaVA, we'll leverage `NeMo/examples/multimodal/multimodal_llm/neva/convert_llava_to_neva.py` for converting the checkpoint. Since VILA and LITA depends on LLaVA, we need to clone LLaVA first.\n"
+"Since VILA and LITA share a model structure similar to LLaVA's, we'll leverage `/opt/NeMo/examples/multimodal/multimodal_llm/neva/convert_llava_to_neva.py` to convert the checkpoint. Since VILA and LITA depend on LLaVA, we need to clone LLaVA first.\n"
 ]
 },
 {
@@ -323,17 +328,17 @@
 "pip install moviepy\n",
 "\n",
 "#download videos, this may take a while\n",
-"python NeMo/scripts/multimodal_dataset_conversion/prepare_youmakeup.py -i YouMakeup/data/train/train_steps.json -o /ws/dataset -d True\n",
+"python /opt/NeMo/scripts/multimodal_dataset_conversion/prepare_youmakeup.py -i YouMakeup/data/train/train_steps.json -o /ws/dataset -d True\n",
 "\n",
 "#chunk videos into clips of 120 seconds each\n",
-"python NeMo/scripts/multimodal_dataset_conversion/prepare_youmakeup.py -i YouMakeup/data/train/train_steps.json -o /ws/dataset -l 12\n",
+"python /opt/NeMo/scripts/multimodal_dataset_conversion/prepare_youmakeup.py -i YouMakeup/data/train/train_steps.json -o /ws/dataset -l 12\n",
 "\n",
 "#create evaluation dataset\n",
-"python NeMo/scripts/multimodal_dataset_conversion/prepare_youmakeup.py -i YouMakeup/data/valid/valid_steps.json -o /ws/dataset/valid/ -d True\n",
-"python NeMo/scripts/multimodal_dataset_conversion/prepare_youmakeup.py -i YouMakeup/data/train/valid_steps.json -o /ws/dataset/valid/ -l 120\n",
+"python /opt/NeMo/scripts/multimodal_dataset_conversion/prepare_youmakeup.py -i YouMakeup/data/valid/valid_steps.json -o /ws/dataset/valid/ -d True\n",
+"python /opt/NeMo/scripts/multimodal_dataset_conversion/prepare_youmakeup.py -i YouMakeup/data/train/valid_steps.json -o /ws/dataset/valid/ -l 120\n",
 "\n",
 "#create QA style validation/evaluation or test dataset\n",
-"python3 NeMo/scripts/multimodal_dataset_conversion/convert_dvc_dataset_for_evaluation.py --input /ws/dataset/valid/train.json --output_file=/ws/dataset/valid/rtl_eval.json"
+"python3 /opt/NeMo/scripts/multimodal_dataset_conversion/convert_dvc_dataset_for_evaluation.py --input /ws/dataset/valid/train.json --output_file=/ws/dataset/valid/rtl_eval.json"
 ]
 },
 {
@@ -364,14 +369,14 @@
 "source": [
 "%%bash\n",
 "# generate custom caption dataset and multiply the dataset by three times\n",
-"python NeMo/scripts/multimodal_dataset_conversion/convert_dvc_dataset_for_training.py \\\n",
+"python /opt/NeMo/scripts/multimodal_dataset_conversion/convert_dvc_dataset_for_training.py \\\n",
 " --input_dvc_dataset /ws/dataset/train.json \\\n",
 " --video_path_prefix /ws/dataset/videos/ \\\n",
 " --subtask custom_caption --data_multiplier 3 \\\n",
 " --output_file /ws/dataset/vc_train.json\n",
 "\n",
 "# generate event localization dataset and multiply the dataset by three times\n",
-"python NeMo/scripts/multimodal_dataset_conversion/convert_dvc_dataset_for_training.py \\\n",
+"python /opt/NeMo/scripts/multimodal_dataset_conversion/convert_dvc_dataset_for_training.py \\\n",
 " --input_dvc_dataset /ws/dataset/train.json \\\n",
 " --video_path_prefix /ws/dataset/videos/ \\\n",
 " --subtask event_localization --data_multiplier 3 \\\n",
@@ -598,7 +603,7 @@
 "outputs": [],
 "source": [
 "%%bash\n",
-"python3 NeMo/examples/multimodal/multimodal_llm/neva/eval/eval_video_rtl.py \\\n",
+"python3 /opt/NeMo/examples/multimodal/multimodal_llm/neva/eval/eval_video_rtl.py \\\n",
 " --input_file=/ws/dataset/valid/split_output/nemo_infer_output_total.json \\\n",
 " --output_dir=/ws/dataset/valid/split_output/ --save_mid_result"
 ]
@@ -607,7 +612,7 @@
 "cell_type": "markdown",
 "metadata": {},
 "source": [
-"You many also refer to `NeMo/examples/multimodal/multimodal_llm/neva/eval/eval_vqa.py` to check how to use external LLM API to do the video question answering task evaluation."
+"You may also refer to `/opt/NeMo/examples/multimodal/multimodal_llm/neva/eval/eval_vqa.py` to see how to use an external LLM API to evaluate the video question answering task."
 ]
 }
 ],
