
Data engine #37

Open: zxrys wants to merge 22 commits into main

Conversation

@zxrys commented Dec 2, 2024

Add a data engine that enables users to build their own DPO dataset.

@zxrys (Author) commented Dec 2, 2024

@yiranyyu @Haoye17

@yiranyyu (Collaborator) left a comment

Great work! This PR adds support for efficiently and automatically generating high-quality preference-learning datasets with RLAIF-V models or with other reward and instruction models.

Still, some of the modifications should be revised further before the PR can be merged.

data_engine/README.md (3 outdated review threads, resolved)
Please refer to the `run_engine.sh` script.

You will need to provide the path and name of both the reward model and the instruction model. Currently, the following models are supported: llava-1.5-7b, RLAIF-V-7B, OmniLMM-12B, and RLAIF-V-12B. We are considering adding more models in the future.

If the model you wish to use is not listed, you may need to implement the corresponding code yourself:
- For model loading, add code to `RLAIF-V/builder`.
- For answer sampling, refer to `RLAIF-V/llava/llava15_sample_data.py` to see how the data is formatted (don't forget to pass `raw_images`), then call your sampling function from `RLAIF-V/data_engine/answer_sampler.py`.
- For log-probability calculation, adjust the data-formatting code in `RLAIF-V/data_engine/logps_calculator.py` and the `get_multimodal_sample_logps` function in `RLAIF-V/muffin/eval/muffin_inference_logp.py`.
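
For illustration, here is a minimal sketch of how a new model's answer sampler could be hooked in. The registry, function names, and the "my-new-model" entry are hypothetical and are not the actual RLAIF-V API; the real integration points are the files listed above.

```python
from typing import Callable, Dict, List

# Hypothetical registry mapping a model name to an answer-sampling function.
SAMPLER_REGISTRY: Dict[str, Callable[[List[dict]], List[dict]]] = {}

def register_sampler(model_name: str):
    """Register an answer-sampling function under a model name."""
    def _wrap(fn: Callable[[List[dict]], List[dict]]):
        SAMPLER_REGISTRY[model_name] = fn
        return fn
    return _wrap

@register_sampler("my-new-model")
def sample_with_my_model(batch: List[dict]) -> List[dict]:
    # Format the inputs the way llava15_sample_data.py does for llava-1.5,
    # and remember to pass `raw_images` through to the model.
    return [{"question": item["question"], "answer": "<sampled answer>"} for item in batch]

def sample_answers(model_name: str, batch: List[dict]) -> List[dict]:
    """Dispatch to the sampler registered for `model_name`; in the PR this call would sit in answer_sampler.py."""
    if model_name not in SAMPLER_REGISTRY:
        raise ValueError(f"No sampler registered for {model_name}")
    return SAMPLER_REGISTRY[model_name](batch)

# Toy usage with a single question and no image.
print(sample_answers("my-new-model", [{"question": "What is in the image?", "raw_images": None}]))
```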
Collaborator:

Split this into several different subsections:

- Generate Rollouts
- Reward collection
- Customize your reward model
- Customize your instruction model


You can specify `--work_dir` to store intermediate files; the final output is also written under this directory (in a subdirectory of it).

If you encounter an error during generation, you can resume by passing the stage after the last completed one via the `--continue_from_stage` parameter (0, 1, or 2); a value of 0 starts from scratch. For example, if stages 0 and 1 completed but stage 2 failed, fix the issue and set `--continue_from_stage 2` to continue from that point. See `data_engine.py` for details on what each stage does.
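
As a rough sketch of how such a staged pipeline and resume flag could be wired up (the `--work_dir` and `--continue_from_stage` flags come from this PR, but the stage names and bodies below are placeholder assumptions, not the actual `data_engine.py`):

```python
import argparse
import os

def stage_0(work_dir: str):
    # Placeholder: sample candidate answers with the instruction model.
    print("stage 0: sampling answers into", work_dir)

def stage_1(work_dir: str):
    # Placeholder: compute log probabilities / rewards with the reward model.
    print("stage 1: computing rewards into", work_dir)

def stage_2(work_dir: str):
    # Placeholder: rank answers and assemble the chosen/rejected DPO pairs.
    print("stage 2: building DPO pairs into", work_dir)

STAGES = [stage_0, stage_1, stage_2]

def main():
    parser = argparse.ArgumentParser()
    parser.add_argument("--work_dir", required=True,
                        help="directory for intermediate files and the final output")
    parser.add_argument("--continue_from_stage", type=int, default=0, choices=[0, 1, 2],
                        help="first stage to run; 0 starts from scratch")
    args = parser.parse_args()

    os.makedirs(args.work_dir, exist_ok=True)
    # Rerun everything from the requested stage onward, skipping completed stages.
    for stage in STAGES[args.continue_from_stage:]:
        stage(args.work_dir)

if __name__ == "__main__":
    main()
```

For example, if stages 0 and 1 succeeded but stage 2 failed, rerunning with `--continue_from_stage 2` picks up at stage 2 without redoing the earlier stages.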
Collaborator:

Let's just split this into three separate scripts, and try to avoid any ambiguous information.

pyproject.toml (outdated)
Collaborator:

Do all of these versions really need to change? Could this affect the reproducibility of other results?

Author:

As I recall, training failed with transformers 4.35 (something about bfloat16 not being supported), and upgrading to 4.37 fixed it. Most of the other packages were bumped together with transformers to avoid dependency-version conflicts. It could indeed affect reproducibility; we may need to discuss this.

omnilmm/train/train_utils.py (outdated, resolved)
omnilmm/model/omnilmm.py (outdated, resolved)
muffin/eval/muffin_inference_logp.py (resolved)
muffin/data/datasets.py (resolved)
@yiranyyu (Collaborator) left a comment

Last step: refine the README to improve its readability.


Thank you for choosing RLAIF-V. Best wishes for your project!
Generates rewards using the DPO framework to rank answers. Higher-ranked answers are marked as "chosen," while lower-ranked answers are marked as "rejected."
Collaborator:

Use RLAIF-V self-feedback guidance with DPO-trained models.

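For context on how a DPO-trained model can serve as a reward model here: the standard DPO implicit reward is r(x, y) = β · (log π_θ(y|x) − log π_ref(y|x)), so candidate answers can be ranked by this score and the top and bottom ones kept as chosen and rejected. The snippet below is only a hedged sketch of that idea; the function names and β value are illustrative and are not the PR's `logps_calculator.py` API.

```python
from typing import List, Tuple

def dpo_implicit_reward(policy_logp: float, ref_logp: float, beta: float = 0.1) -> float:
    """Standard DPO implicit reward: beta * (log pi_theta(y|x) - log pi_ref(y|x))."""
    return beta * (policy_logp - ref_logp)

def build_preference_pair(answers: List[str],
                          policy_logps: List[float],
                          ref_logps: List[float]) -> Tuple[str, str]:
    """Rank sampled answers by implicit reward and return (chosen, rejected)."""
    rewards = [dpo_implicit_reward(p, r) for p, r in zip(policy_logps, ref_logps)]
    ranked = sorted(zip(rewards, answers), key=lambda pair: pair[0], reverse=True)
    return ranked[0][1], ranked[-1][1]

# Toy usage with made-up summed log probabilities for three sampled answers.
print(build_preference_pair(["A", "B", "C"], [-12.3, -10.1, -15.8], [-11.9, -12.0, -14.2]))
```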
#### Process Method

Detailed in the corresponding research paper.
Collaborator:

Use the RLAIF-V divide-and-conquer strategy to collect AI feedback.


#### Processing Method

See the paper for the detailed procedure.
Collaborator:

Update the Chinese version to match the English revisions.

@@ -72,7 +72,7 @@

#### Processing Method

Generate rewards with the DPO framework to rank the answers; higher-scoring answers are marked as Chosen, lower-scoring answers as Rejected.
Use RLAIF-V self-feedback guidance together with the DPO-trained model.
Collaborator:

Use the RLAIF-V self-feedback signal constructed from the DPO-aligned model.
