Data engine #37
base: main
Conversation
# Conflicts:
#   chat.py
#   muffin/eval/muffin_inference_logp.py

# Conflicts:
#   pyproject.toml
Great work! This PR adds support for automatically and efficiently generating high-quality preference learning datasets with RLAIF-V models or other reward models and instruction models.
Still, some of the modifications should be revised further before the PR can be merged.
data_engine/README.md
Outdated
Please refer to the `run_engine.sh` script.

You will need to provide the path and name for both the reward model and the instruction model. Currently, we support the following models: llava-1.5-7b, RLAIF-V-7B, OmniLMM-12B, and RLAIF-V-12B. We are considering adding more models in the future.
If the model you wish to use is not listed, you may need to implement the corresponding code yourself: for model loading, add code to `RLAIF-V/builder`; for answer sampling, refer to `RLAIF-V/llava/llava15_sample_data.py` to see how data is formatted (don't forget to pass `raw_images`) and call it in `RLAIF-V/data_engine/answer_sampler.py`; for log probability calculation, change the data formatting part in `RLAIF-V/data_engine/logps_calculator.py` and the `get_multimodal_sample_logps` function in `RLAIF-V/muffin/eval/muffin_inference_logp.py`.
Split this into several different subsections:
Generate Rollouts
Reward collection
Customize your reward model
Customize your instruction model
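To make the extension steps in the quoted README text concrete, here is a minimal sketch of what a custom sampler added to `RLAIF-V/data_engine/answer_sampler.py` could look like. The function name `sample_answers_for_new_model`, its argument list, and the `images=` keyword of `generate` are illustrative assumptions modeled on LLaVA-style models, not the repository's actual API.

```python
# Hypothetical sketch only: the real answer_sampler.py API may differ.
# It illustrates the steps described above: format each sample the way
# llava/llava15_sample_data.py does (keeping raw_images) and sample answers.
import torch


def sample_answers_for_new_model(model, tokenizer, image_processor, dataset,
                                 num_samples=10, temperature=0.7):
    """Generate num_samples candidate answers per question (illustrative only)."""
    results = []
    for item in dataset:
        prompt = item["question"]
        raw_image = item["image"]
        # Preprocess the image with the model's image processor, but keep the
        # raw image around so it can be passed through as `raw_images`.
        image_tensor = image_processor(raw_image, return_tensors="pt")["pixel_values"]
        input_ids = tokenizer(prompt, return_tensors="pt").input_ids

        answers = []
        for _ in range(num_samples):
            with torch.inference_mode():
                output_ids = model.generate(
                    input_ids,
                    images=image_tensor,   # LLaVA-style generate kwarg (assumed)
                    do_sample=True,
                    temperature=temperature,
                    max_new_tokens=512,
                )
            answers.append(tokenizer.decode(output_ids[0], skip_special_tokens=True))

        results.append({"question": prompt, "raw_images": raw_image, "answers": answers})
    return results
```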
data_engine/README.md
Outdated
You can specify a `--work_dir` to store intermediate files and the final output under this directory (the final output will actually be placed in a subdirectory within it).

If you encounter errors during generation, you can resume from the stage after the last completed one using the `--continue_from_stage` parameter (0, 1, or 2). A value of 0 starts from scratch. For example, if you've completed stages 0 and 1 but hit an error during stage 2, you can fix the issue and set `--continue_from_stage 2` to continue from that point. See `data_engine.py` for details on what each stage does.
Let's just split this into three separate scripts; try not to leave any ambiguous information.
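Whether the pipeline stays as one script with `--continue_from_stage` or is split into three scripts as suggested, the resumption behaviour described above amounts to a simple stage dispatcher. The sketch below only illustrates that behaviour; the stage function names are assumptions and this is not the actual contents of `data_engine.py`.

```python
# Illustrative sketch of the three-stage pipeline and the --continue_from_stage
# behaviour described above; not the actual data_engine.py implementation.
import argparse


def sample_answers(work_dir):      # stage 0: roll out candidate answers
    ...


def collect_rewards(work_dir):     # stage 1: score answers with the reward model
    ...


def construct_pairs(work_dir):     # stage 2: build chosen/rejected DPO pairs
    ...


def main():
    parser = argparse.ArgumentParser()
    parser.add_argument("--work_dir", required=True)
    parser.add_argument("--continue_from_stage", type=int, default=0, choices=[0, 1, 2])
    args = parser.parse_args()

    stages = [sample_answers, collect_rewards, construct_pairs]
    # Skip stages that already finished; e.g. --continue_from_stage 2 reruns
    # only the final pair-construction step after stages 0 and 1 succeeded.
    for idx, stage in enumerate(stages):
        if idx >= args.continue_from_stage:
            stage(args.work_dir)


if __name__ == "__main__":
    main()
```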
pyproject.toml
Outdated
Do all of these versions really need to change? Could that affect the reproducibility of other results?
As I recall, training failed with an error when the transformers version was 4.35 (something about Bfloat not being supported), and upgrading to 4.37 fixed it. The other packages were bumped along with transformers mainly to avoid dependency conflicts. It could indeed affect reproducibility, so we may need to discuss it.
Last step: refine the README to improve readability.
data_engine/README.md
Outdated
Thank you for choosing RLAIF-V. Best wishes for your project!

Generates rewards using the DPO framework to rank answers. Higher-ranked answers are marked as "chosen," while lower-ranked answers are marked as "rejected."
Use RLAIF-V self-feedback guidance with DPO-trained models.
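For context on the "rewards using the DPO framework" wording, a common formulation scores each candidate answer by the scaled log-probability ratio between the DPO-trained policy and its reference model. The snippet below is a generic illustration of that idea under those assumptions, not RLAIF-V's exact scoring code.

```python
# Generic illustration of a DPO-style implicit reward
# (policy vs. reference log-probability ratio); not RLAIF-V's exact code.
import torch
import torch.nn.functional as F


def sequence_logp(model, input_ids, labels):
    """Sum of per-token log-probabilities of `labels` under `model` (teacher forcing)."""
    with torch.inference_mode():
        logits = model(input_ids=input_ids).logits            # [batch, seq, vocab]
    logps = F.log_softmax(logits[:, :-1], dim=-1)              # predict token t+1 from t
    token_logps = torch.gather(logps, 2, labels[:, 1:].unsqueeze(-1)).squeeze(-1)
    return token_logps.sum(-1)                                 # [batch]


def dpo_reward(policy, reference, input_ids, labels, beta=0.1):
    """Implicit DPO reward: beta * (log p_policy - log p_reference)."""
    return beta * (sequence_logp(policy, input_ids, labels)
                   - sequence_logp(reference, input_ids, labels))
```

Ranking candidate answers by this implicit reward and pairing a top-scoring answer ("chosen") with a low-scoring one ("rejected") yields the DPO training pairs described above.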
data_engine/README.md
Outdated
#### Process Method

Detailed in the corresponding research paper.
Use RLAIF-V divide-and-conquer strategy to collect AI feedback.
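As a rough illustration of the divide-and-conquer idea (break a response into atomic claims, score each claim, then aggregate), here is a hedged sketch; `split_into_claims` and `score_claim` are hypothetical helpers, and the actual procedure is the one described in the RLAIF-V paper.

```python
# Hedged sketch of divide-and-conquer feedback collection: split a long
# response into atomic claims, score each claim with a labeler/reward model,
# then aggregate the per-claim scores into one response-level reward.
# split_into_claims and score_claim are hypothetical helpers.


def collect_feedback(question, response, split_into_claims, score_claim):
    claims = split_into_claims(response)      # e.g. one factual statement per claim
    if not claims:
        return 0.0
    scores = [score_claim(question, claim) for claim in claims]
    # Aggregate per-claim scores (here a simple mean, purely for illustration).
    return sum(scores) / len(scores)
```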
data_engine/README_zh.md
Outdated
#### Processing Method

The detailed procedure is described in the paper.
Update the Chinese README following the English changes.
data_engine/README_zh.md
Outdated
@@ -72,7 +72,7 @@

#### Processing Method

- Generates rewards using the DPO framework to rank answers. High-scoring answers are marked as Chosen, low-scoring answers as Rejected.
+ Use RLAIF-V self-feedback guidance together with DPO-trained models.
Use the self-feedback signal that RLAIF-V constructs based on the DPO-aligned model.
Add data engine to enable users to build their own DPO dataset.