
Will pretraining be supported? #10

Open
Wangman1 opened this issue Oct 11, 2024 · 2 comments

Comments

@Wangman1

Hi, is there any plan to support pretraining for qwen2-vl?

@zhangfaen
Owner

Below is from the Qwen2-VL tech report. As you can see, the pre-training of Qwen2-VL is very complicated and requires huge amounts of data, so we won't cover pre-training in this repo. I recommend reading their report, which is very detailed and helpful.

The model is pre-trained on a diverse dataset that includes image-text pairs, optical character recognition
(OCR) data, interleaved image-text articles, visual question answering datasets, video dialogues, and image
knowledge datasets. Our data sources primarily comprise cleaned web pages, open-source datasets, and
synthetic data. The cutoff date for our data knowledge is June 2023. This diverse data composition is
instrumental in developing a robust multimodal understanding capability.
During the initial pre-training phase, Qwen2-VL is exposed to a corpus of around 600 billion tokens. The
LLM component of Qwen2-VL is initialized using the parameters from Qwen2 (Yang et al., 2024), while
the vision encoder of Qwen2-VL is initialized with the ViT derived from DFN. However, the fixed position
embedding in the original DFN’s ViT (Fang et al., 2023) is replaced by RoPE-2D. This pre-training phase
primarily focuses on learning image-text relationships, textual content recognition within images through OCR, and image classification tasks. Such foundational training is instrumental in enabling the model to
develop a robust understanding of core visual-textual correlations and alignments.
The second pre-training phase marks a significant progression, involving an additional 800 billion tokens of
image-related data. This stage introduces a higher volume of mixed image-text content, facilitating a more
nuanced understanding of the interplay between visual and textual information. The incorporation of visual
question answering datasets refines the model’s capacity to respond to image-related queries. Moreover,
the inclusion of multitasking datasets is pivotal in developing the model’s ability to navigate diverse tasks
concurrently, a skill of paramount importance when dealing with complex, real-world datasets. Concurrently,
purely textual data continues to play a crucial role in maintaining and advancing the model’s linguistic
proficiency.
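For anyone curious about the RoPE-2D mentioned above, here is a rough sketch of the idea (my own illustration, not code from the report or this repo): split each attention head's channels in half and apply ordinary 1D rotary embedding to one half using the patch's row index and to the other half using its column index, instead of a single fixed 1D position embedding. The helper names and the interleaved pairing convention below are my assumptions.

```python
# Sketch of the RoPE-2D idea, assuming the usual interleaved-pair
# rotary convention. Not the report's or this repo's implementation.
import torch

def rope_1d(x: torch.Tensor, pos: torch.Tensor, base: float = 10000.0) -> torch.Tensor:
    """Standard 1D rotary embedding on x of shape (..., seq, dim), dim even.

    Each adjacent channel pair (x[2i], x[2i+1]) is rotated by an angle
    pos * base^(-2i/dim).
    """
    dim = x.shape[-1]
    freqs = base ** (-torch.arange(0, dim, 2, dtype=torch.float32) / dim)
    angles = pos.float()[:, None] * freqs[None, :]      # (seq, dim/2)
    cos, sin = angles.cos(), angles.sin()
    x1, x2 = x[..., 0::2], x[..., 1::2]
    out = torch.empty_like(x)
    out[..., 0::2] = x1 * cos - x2 * sin
    out[..., 1::2] = x1 * sin + x2 * cos
    return out

def rope_2d(x: torch.Tensor, rows: torch.Tensor, cols: torch.Tensor) -> torch.Tensor:
    """RoPE-2D: rotate the first half of the channels by the patch's row
    index and the second half by its column index."""
    half = x.shape[-1] // 2
    return torch.cat(
        [rope_1d(x[..., :half], rows), rope_1d(x[..., half:], cols)], dim=-1
    )

# Usage: queries for a 4x4 grid of image patches, head dim 64.
h = w = 4
q = torch.randn(h * w, 64)
rows = torch.arange(h).repeat_interleave(w)   # 0,0,0,0,1,1,1,1,...
cols = torch.arange(w).repeat(h)              # 0,1,2,3,0,1,2,3,...
q_rot = rope_2d(q, rows, cols)
```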

@Wangman1
Author

Got it, thank you for your reply!
