Factual accuracy remains a prominent challenge for ChatGPT and its replicas, such as the Llama family. It's widely accepted that enhancing both the volume and quality of factual information during the fine-tuning process consistently results in improved factual correctness. However, relying on human experts to create multi-turn question and answer content can be costly and time-consuming.
To address this issue, we introduce a cost-effective method called RedGPT, which generates a vast amount of high-quality multi-turn Q&A content. We have developed a dataset called RedGPT-Dataset to showcase our approach. Our initial release, RedGPT-Dataset-V1-CN, offers 50,000 Chinese multi-turn dialogues.
Our objective is to automatically generate a vast amount of high-quality multi-turn question-answer (QA) dialogues to improve the factual correctness of GPT models. The method has three steps.
- Compile a collection of high-quality factual documents, which we call references. These may include e-books, Wikipedia articles, and niche website pages. The documents should encompass a wide range of topics, such as people, organizations, technology, medicine, law, humanities, economics, home, cars, travel, cuisine, fashion, sports, education, pets, and more.
- Use existing Large Language Models (LLMs), e.g., paid APIs, to produce multi-turn QAs. The reference serves as input, and the prompt is along the lines of "Please generate a multi-turn dialogue based on this article." The API then returns a multi-turn dialogue. This step transforms reference material of the kind used for pre-training into multi-turn dialogues suitable for fine-tuning. Post-processing may be required.
- After amassing a considerable number of reference-dialogue pairs from step 2, we use the reference-prompt pairs as inputs and the dialogues as targets to fine-tune a GPT model. This model may be based on a pre-trained model such as LLaMA or BLOOM. We call the resulting fine-tuned model a Reference-Enlightened-Dialogue GPT (RedGPT). We can then use RedGPT to generate massive amounts of dialogues from references.
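Steps 1 and 2 can be sketched roughly as follows. This is a minimal illustration, not the pipeline we actually ran: the prompt layout, the function names `build_prompt` and `generate_pairs`, and the `fake_llm` stand-in are all assumptions made for the example. In practice `llm` would wrap whatever API you choose.

```python
from typing import Callable, Dict, Iterable, List

# Prompt wording taken from step 2; the "Article:" layout is an assumption.
PROMPT_TEMPLATE = (
    "Please generate a multi-turn dialogue based on this article.\n\n"
    "Article:\n{reference}"
)


def build_prompt(reference: str) -> str:
    """Wrap one reference document in the dialogue-generation prompt."""
    return PROMPT_TEMPLATE.format(reference=reference.strip())


def generate_pairs(
    references: Iterable[str],
    llm: Callable[[str], str],
) -> List[Dict[str, str]]:
    """Call the LLM once per reference and collect reference-dialogue pairs."""
    pairs = []
    for reference in references:
        dialogue = llm(build_prompt(reference)).strip()
        if dialogue:  # minimal post-processing: drop empty generations
            pairs.append({"reference": reference, "dialogue": dialogue})
    return pairs


def fake_llm(prompt: str) -> str:
    """Hypothetical stand-in for a real API call; returns a canned dialogue."""
    return "Human: What is this article about?\nAssistant: It is a short demo."


if __name__ == "__main__":
    demo = generate_pairs(["The Merck Manual was first published in 1899."], fake_llm)
    print(demo[0]["dialogue"])
```

Passing the LLM in as a plain callable keeps the sketch independent of any particular provider and makes it easy to test with a stub.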
When reproducing our method, it is essential to consider the following points for optimal results:
- Prioritize the quality and scope of the references. To guarantee high quality, ensure that the references originate from reliable sources like Wikipedia, authoritative niche websites, or e-books. In addition, carefully clean the content by removing elements such as HTML markup and special characters.
- When leveraging existing LLMs, craft the prompts carefully so that the generated multi-turn dialogues meet expectations. This may require trial and error to find the most effective prompt structure.
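As an illustration of the cleaning mentioned in the first point, here is a rough sketch using only the Python standard library. The function name `clean_reference` and the exact rules are assumptions. Regex-based tag stripping is a heuristic that can also remove legitimate text containing `<` and `>`, so a real HTML parser may be a better fit for messy pages.

```python
import html
import re
import unicodedata

_SCRIPT_RE = re.compile(r"<(script|style)\b.*?</\1>", re.IGNORECASE | re.DOTALL)
_TAG_RE = re.compile(r"<[^>]+>")
_WS_RE = re.compile(r"\s+")


def clean_reference(raw: str) -> str:
    """Strip HTML remnants and invisible characters from a reference text."""
    text = _SCRIPT_RE.sub(" ", raw)   # drop script/style blocks with their content
    text = _TAG_RE.sub(" ", text)     # drop remaining tags
    text = html.unescape(text)        # decode entities such as &amp; and &nbsp;
    # Remove control/format characters (Unicode categories "C*"), keep whitespace.
    text = "".join(
        ch for ch in text
        if ch.isspace() or not unicodedata.category(ch).startswith("C")
    )
    return _WS_RE.sub(" ", text).strip()  # collapse runs of whitespace


if __name__ == "__main__":
    sample = "<p>Merck &amp; Co.</p>\n<script>x()</script> First&nbsp;published  in 1899.\u200b"
    print(clean_reference(sample))
```

The sketch deliberately avoids Unicode normalization such as NFKC, since that would rewrite full-width punctuation in Chinese text.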
Following the completion of steps 1 and 2, we successfully obtained a vast collection of multi-turn dialogues. We are releasing a part of this dataset, which we have named "RedGPT-Dataset-V1-CN". This dataset consists of 50,000 Chinese reference-dialogue pairs, with each dialogue generated by LLMs, drawing from the respective reference.
- Reference: The Merck Manual of Diagnosis and Therapy, referred to as The Merck Manual, is the world's best-selling medical textbook, and the oldest continuously published English language medical textbook. First published in 1899, the current print edition of the book, the 20th Edition, was published in 2018. In 2014, Merck decided to move The Merck Manual to digital-only, online publication, available in both professional and consumer versions; this decision was reversed in 2017, with the publication of the 20th edition the following year. The Merck Manual of Diagnosis and Therapy is one of several medical textbooks, collectively known as The Merck Manuals, which are published by Merck Publishing, a subsidiary of the pharmaceutical company Merck Co., Inc. in the United States and Canada, and MSD (as The MSD Manuals) in other countries in the world. Merck also formerly published The Merck Index, An Encyclopedia of Chemicals, Drugs, and Biologicals.
- Dialogue:
- Human: What is The Merck Manual of Diagnosis and Therapy?
- Assistant: The Merck Manual is the world's best-selling medical textbook and the oldest continuously published English language medical textbook.
- Human: When was it first published?
- Assistant: It was first published in 1899.
- Human: Is it still being published today?
- Assistant: Yes, the current print edition of the book, the 20th Edition, was published in 2018.
- Human: Did Merck Publishing stop printing the book at any point?
- Assistant: In 2014, Merck decided to move The Merck Manual to digital-only, online publication, available in both professional and consumer versions. However, this decision was reversed in 2017, with the publication of the 20th edition the following year.
- Human: Who publishes The Merck Manuals?
- Assistant: It is published by Merck Publishing, a subsidiary of the pharmaceutical company Merck Co., Inc. in the United States and Canada, and MSD (as The MSD Manuals) in other countries in the world.
- Human: Did Merck also publish any other medical textbooks?
- Assistant: Yes, Merck also formerly published The Merck Index, An Encyclopedia of Chemicals, Drugs, and Biologicals.
- Reference: 3D volumes of neurons. Convolutional Neural Networks take advantage of the fact that the input consists of images and they constrain the architecture in a more sensible way. In particular, unlike a regular Neural Network, the layers of a ConvNet have neurons arranged in 3 dimensions: width, height, depth. (Note that the word depth here refers to the third dimension of an activation volume, not to the depth of a full Neural Network, which can refer to the total number of layers in a network.) For example, the input images in CIFAR-10 are an input volume of activations, and the volume has dimensions 32x32x3 (width, height, depth respectively). As we will soon see, the neurons in a layer will only be connected to a small region of the layer before it, instead of all of the neurons in a fully-connected manner. Moreover, the final output layer would for CIFAR-10 have dimensions 1x1x10, because by the end of the ConvNet architecture we will reduce the full image into a single vector of class scores, arranged along the depth dimension.
- Dialogue:
- Human: Can you explain what Convolutional Neural Networks (ConvNet) are?
- Assistant: Convolutional Neural Networks are a type of neural network that are commonly used for image classification. They have neurons arranged in three dimensions: width, height, and depth.
- Human: How are the layers of a ConvNet different from a regular Neural Network?
- Assistant: In a ConvNet, the layers have neurons arranged in three dimensions: width, height, and depth. This is in contrast to a regular Neural Network, where the neurons are only arranged in two dimensions: input and output.
- Human: What is the input volume in CIFAR-10?
- Assistant: The input images in CIFAR-10 are an input volume of activations, and the volume has dimensions 32x32x3 (width, height, depth respectively).
- Human: Can you explain how the neurons in a ConvNet layer are connected?
- Assistant: The neurons in a ConvNet layer are only connected to a small region of the layer before it, instead of all of the neurons in a fully-connected manner. This is because neurons that are close together in an image are more likely to be related to each other than neurons that are farther apart.
- Reference: An individual retirement account (IRA) is an account used to save for retirement. A Roth IRA is a special type of tax-advantaged individual retirement account to which you can contribute after-tax dollars. The primary benefit of a Roth IRA is that your contributions and the earnings on those contributions can grow tax-free and be withdrawn tax-free after age 59½, assuming the account has been open for at least five years. In other words, you pay taxes on money going into your Roth IRA, and then all future withdrawals are tax-free. Roth IRAs are similar to traditional IRAs, with the biggest distinction being how the two are taxed. Roth IRAs are funded with after-tax dollars, which means that the contributions are not tax-deductible, but once you start withdrawing funds, the money is tax-free.
- Dialogue:
- Human: Hi, can you explain to me what a Roth IRA is?
- Assistant: Sure! A Roth IRA is a type of individual retirement account that allows you to contribute after-tax dollars. The primary benefit of a Roth IRA is that your contributions and earnings on those contributions can grow tax-free and be withdrawn tax-free after age 59 1/2, assuming the account has been open for at least five years.
- Human: How is a Roth IRA different from a traditional IRA?
- Assistant: The biggest distinction between the two is how they are taxed. Traditional IRAs are funded with pre-tax dollars, which means the contributions are tax-deductible, but you'll have to pay taxes on the money when you withdraw it. Roth IRAs are funded with after-tax dollars, so contributions are not tax-deductible, but all future withdrawals are tax-free.
- Human: Are there any limitations on who can open a Roth IRA?
- Assistant: Yes, there are eligibility requirements. You must have earned income, and your modified adjusted gross income must be below a certain threshold. For example, in 2021, single filers must earn less than $140,000 to contribute to a Roth IRA. The contribution limit is also $6,000 per year, or $7,000 if you are age 50 or older.
- Human: That's helpful, thanks! How long does a Roth IRA need to be open before I can withdraw money tax-free?
- Assistant: The account needs to be open for at least five years before you can withdraw money tax-free. After that, you can start withdrawing money penalty- and tax-free after age 59 1/2.
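The dialogues above use a simple "Human:" / "Assistant:" turn layout. A post-processing check along the following lines can filter out malformed generations. This is a sketch under the assumption that each turn starts on its own line with one of those two labels. The released files may be laid out differently, and `parse_dialogue` and `is_well_formed` are hypothetical helper names.

```python
import re
from typing import List, Tuple

# Optional leading "- " bullet, then a speaker label, then the utterance.
_TURN_RE = re.compile(r"^\s*(?:-\s*)?(Human|Assistant)\s*:\s*(.*)$")


def parse_dialogue(text: str) -> List[Tuple[str, str]]:
    """Split dialogue text into (speaker, utterance) turns."""
    turns: List[Tuple[str, str]] = []
    for line in text.splitlines():
        match = _TURN_RE.match(line)
        if match:
            turns.append((match.group(1), match.group(2).strip()))
        elif turns and line.strip():
            # Unlabelled line: treat it as a continuation of the previous turn.
            speaker, content = turns[-1]
            turns[-1] = (speaker, f"{content} {line.strip()}".strip())
    return turns


def is_well_formed(turns: List[Tuple[str, str]]) -> bool:
    """True if turns alternate Human/Assistant, start with Human, end with Assistant."""
    if not turns or len(turns) % 2 != 0:
        return False
    for index, (speaker, content) in enumerate(turns):
        expected = "Human" if index % 2 == 0 else "Assistant"
        if speaker != expected or not content:
            return False
    return True


if __name__ == "__main__":
    sample = (
        "- Human: When was it first published?\n"
        "- Assistant: It was first published in 1899."
    )
    print(is_well_formed(parse_dialogue(sample)))
```

A check like this only validates structure. It says nothing about whether the answers are faithful to the reference.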
- The data we have released, "RedGPT-Dataset-V1-CN," is in Chinese. We plan to release English data in the near future. You can follow our method to generate English data or translate our existing dataset into English.
- Please note that "RedGPT-Dataset-V1-CN" has not undergone manual verification, so its accuracy and safety cannot be strictly guaranteed. Users are responsible for any results produced with this data.
We used certain existing LLMs to generate a large number of multi-turn QAs (of which 50,000 are made public). We fine-tuned GPT on all reference-dialogue pairs to create a model called RedGPT, which is a single-task model specialized for generating dialogues based on references.
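To make the shape of the fine-tuning data concrete, here is a small sketch that turns reference-dialogue pairs into input/target records and serializes them as JSON Lines. The field names `input` and `target`, the instruction wording, and the JSONL format are assumptions for illustration. They do not describe our actual training setup.

```python
import json
from typing import Dict, Iterable

# Instruction wording borrowed from the prompt in step 2 (an assumption here).
INSTRUCTION = "Please generate a multi-turn dialogue based on this article."


def to_training_example(pair: Dict[str, str]) -> Dict[str, str]:
    """Map one reference-dialogue pair to a fine-tuning record."""
    return {
        "input": f"{INSTRUCTION}\n\n{pair['reference']}",  # prompt + reference
        "target": pair["dialogue"],                        # dialogue to learn
    }


def to_jsonl(pairs: Iterable[Dict[str, str]]) -> str:
    """Serialize pairs as JSON Lines, keeping Chinese text unescaped."""
    return "\n".join(
        json.dumps(to_training_example(pair), ensure_ascii=False)
        for pair in pairs
    )


if __name__ == "__main__":
    demo = [{"reference": "参考文本", "dialogue": "Human: 你好\nAssistant: 你好"}]
    print(to_jsonl(demo))
```

Because the model sees the reference in its input and the dialogue as its target, it learns the single task of generating dialogues from references.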
- Q1: Which websites, e-books, and LLMs do you use? For example, do you use GPT-3.5, GPT-4, or something else?
- A1: Due to potential legal risks, we are unable to answer these questions.
- Q2: The RedGPT-Dataset-V1-CN contains 50,000 samples. Is this the entirety of your data collection, or will you release additional data?
- A2: Our overall data collection for training RedGPT exceeds 50,000 samples. The released dataset is only a subset. We might make more data available in the future.
- Q3: Do you plan to release the QA data generated by RedGPT?
- A3: We will not release the data generated by RedGPT because of legal risks. However, we may consider releasing RedGPT's checkpoints in the future.
- Q4: The RedGPT-Dataset-V1-CN is a Chinese dataset. Are there plans to release an English dataset?
- A4: Yes, we intend to release RedGPT-Dataset-V1-EN, which will be an English dataset containing 50,000 samples.
- Dongjie Yang, [email protected]
- Ruifeng Yuan, [email protected]
- Yuantao Fan, [email protected]
- Yifei Yang, [email protected]
- Zili Wang, [email protected]
- Shusen Wang, [email protected]
Should your work take inspiration from or make use of our method or data, we kindly request that you acknowledge and cite our GitHub repository as a reference.
@misc{redgpt,
author = {Dongjie Yang and Ruifeng Yuan and Yuantao Fan and Yifei Yang and Zili Wang and Shusen Wang},
title = {RedGPT: Reference-Enlightened-Dialogue by GPT and for GPT},
year = {2023},
publisher = {GitHub},
journal = {GitHub repository},
howpublished = {\url{https://github.com/ziliwangnlp/RedGPT}},
}