[Question] Pretrain preprocess #1767

Open
leo-young opened this issue Nov 14, 2024 · 0 comments
leo-young commented Nov 14, 2024

Question

When I try to reproduce LLaVA v1.5 on Llama 3, I find that at the pretraining stage the preprocess function being used is preprocess_v1, not preprocess_plain, even though the official v1.5 pretrain.sh training script sets --version to plain.

I tried to debug the code and found this block:

    if model_args.version == "v0":
        if tokenizer.pad_token is None:
            smart_tokenizer_and_embedding_resize(
                special_tokens_dict=dict(pad_token="[PAD]"),
                tokenizer=tokenizer,
                model=model,
            )
    elif model_args.version == "v0.5":
        tokenizer.pad_token = tokenizer.unk_token
    else:
        # tokenizer.pad_token = tokenizer.unk_token
        tokenizer.pad_token = tokenizer.eos_token
        if model_args.version in conversation_lib.conv_templates:
            print("a")
            conversation_lib.default_conversation = conversation_lib.conv_templates[model_args.version]
        else:
            conversation_lib.default_conversation = conversation_lib.conv_templates["vicuna_v1"]
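
For context, which tokenization path actually runs is decided later, inside preprocess(), by inspecting the process-wide conversation_lib.default_conversation rather than model_args.version. Below is a rough sketch of that dispatch as I read it in llava/train/train.py (paraphrased, not a verbatim copy; preprocess_plain / preprocess_v1 are the repo's own functions):

    # Paraphrased sketch of the dispatch in llava/train/train.py -- not verbatim.
    def preprocess(sources, tokenizer, has_image=False):
        conv = conversation_lib.default_conversation   # module-level global set in train()
        if conv.sep_style == conversation_lib.SeparatorStyle.PLAIN:
            return preprocess_plain(sources, tokenizer)  # the branch --version plain should hit
        if conv.version.startswith("v1"):
            return preprocess_v1(sources, tokenizer, has_image=has_image)
        ...  # other templates (llama_2, mpt, ...)

So if default_conversation were still the plain template at data-loading time, the first branch should be taken.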

So with --version plain, this setup code should set default_conversation to the plain template. But when trainer.train() starts and the dataset's __getitem__ runs:

    def __getitem__(self, i) -> Dict[str, torch.Tensor]:
        sources = self.list_data_dict[i]
        if isinstance(i, int):
            sources = [sources]
        assert len(sources) == 1, "Don't know why it is wrapped to a list"  # FIXME
        if 'image' in sources[0]:
            image_file = self.list_data_dict[i]['image']
            image_folder = self.data_args.image_folder
            processor = self.data_args.image_processor
            image = Image.open(os.path.join(image_folder, image_file)).convert('RGB')


            if self.data_args.image_aspect_ratio == 'pad':
                def expand2square(pil_img, background_color):
                    width, height = pil_img.size
                    if width == height:
                        return pil_img
                    elif width > height:
                        result = Image.new(pil_img.mode, (width, width), background_color)
                        result.paste(pil_img, (0, (width - height) // 2))
                        return result
                    else:
                        result = Image.new(pil_img.mode, (height, height), background_color)
                        result.paste(pil_img, ((height - width) // 2, 0))
                        return result

                image = expand2square(image, tuple(int(x * 255) for x in processor.image_mean))
                image = processor.preprocess(image, return_tensors='pt')['pixel_values'][0]
            else:
                image = processor.preprocess(image, return_tensors='pt')['pixel_values'][0]
            sources = preprocess_multimodal(
                copy.deepcopy([e["conversations"] for e in sources]),
                self.data_args)
        else:
            sources = copy.deepcopy([e["conversations"] for e in sources])
        data_dict = preprocess(
            sources,
            self.tokenizer,
            has_image=('image' in self.list_data_dict[i]))
        if isinstance(i, int):
            data_dict = dict(input_ids=data_dict["input_ids"][0],
                             labels=data_dict["labels"][0])

        # image exist in the data
        if 'image' in self.list_data_dict[i]:
            data_dict['image'] = image
        elif self.data_args.is_multimodal:
            # image does not exist in the data, but the model is multimodal
            crop_size = self.data_args.image_processor.crop_size
            data_dict['image'] = torch.zeros(3, crop_size['height'], crop_size['width'])
        return data_dict

when execution reaches the dataset's __getitem__, conversation_lib.default_conversation is the v1 template, so preprocess ends up calling preprocess_v1.
Has anyone encountered the same issue?
Does the official LLaVA use preprocess_v1 during the pretraining stage?
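
One way I can think of to narrow this down (a temporary debugging sketch, not part of the repo) is to log the active template right after the version block in train() and again inside LazySupervisedDataset.__getitem__, to see where it switches from plain to v1:

    # Hypothetical debug helper -- temporary prints; import path as used in train.py.
    from llava import conversation as conversation_lib

    def log_active_template(where: str) -> None:
        conv = conversation_lib.default_conversation
        print(f"[{where}] version={getattr(conv, 'version', '?')} sep_style={conv.sep_style}")

    # e.g. call log_active_template("after version setup") right after the setup block above,
    # and log_active_template("__getitem__") at the top of LazySupervisedDataset.__getitem__.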

Below is my training script:

    --deepspeed .scripts/zero2.json
    --model_name_or_path models/Llama-3.2-1B-Instruct
    --vision_tower models/clip-vit-large-patch14-336
    --version plain
    --data_path ./playground/data/LLaVA-Pretrain/blip_laion_cc_sbu_558k.json
    --image_folder ./playground/data/LLaVA-Pretrain/images
    --mm_projector_type mlp2x_gelu
    --tune_mm_mlp_adapter True
    --mm_vision_select_layer -2
    --mm_use_im_start_end False
    --mm_use_im_patch_token False
    --output_dir ./checkpoints/llava-v1.5-1b-pretrain
    --num_train_epochs 1
    --per_device_train_batch_size 2
    --per_device_eval_batch_size 4
    --gradient_accumulation_steps 1
    --evaluation_strategy "no"
    --save_strategy "steps"
    --save_steps 24000
    --save_total_limit 1
    --learning_rate 1e-3
    --weight_decay 0.
    --warmup_ratio 0.03
    --lr_scheduler_type "cosine"
    --logging_steps 1
    --model_max_length 2048
    --gradient_checkpointing True
    --dataloader_num_workers 4
    --lazy_preprocess True