Fix few issues with the dataset #32

Closed
wants to merge 22 commits into from

Contributor

@gururise gururise commented Mar 17, 2023

Because the training dataset was generated with GPT-3, I noticed several issues when going through it. I have manually fixed the following:

  • Resolved empty outputs
  • Added a few CoT examples
  • Fixed a few empty code examples
  • Removed instructions asking to generate images
  • Resolved N/A outputs
  • Made empty inputs consistent (some used N/A, others used None)
  • Fixed a few wrong answers

Hoping this slightly curated dataset will help produce better training results.
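
The kinds of problems listed above can be found semi-automatically before a manual pass. The following is a minimal sketch, not the method used in this PR: the placeholder list is an assumption drawn from the examples quoted in this thread, and the function names are hypothetical.

```python
# Placeholder strings treated as "no real content". This set is an assumption
# based on values mentioned in this thread, not an exhaustive inventory.
PLACEHOLDERS = {"", "n/a", "none", "no output", "<no output>"}


def is_placeholder(value):
    """True if the value is empty or one of the known placeholder strings."""
    return value.strip().lower() in PLACEHOLDERS


def audit(records):
    """Return (index, problem) pairs for records that need manual review."""
    findings = []
    for i, rec in enumerate(records):
        if is_placeholder(rec.get("output", "")):
            findings.append((i, "empty or placeholder output"))
        # A missing input should be the empty string, not "N/A" or "None".
        inp = rec.get("input", "")
        if inp != "" and is_placeholder(inp):
            findings.append((i, "non-canonical empty input"))
    return findings


def normalize_inputs(records):
    """Return copies of the records with placeholder inputs rewritten to ""."""
    cleaned = []
    for rec in records:
        inp = rec.get("input", "")
        cleaned.append({**rec, "input": "" if is_placeholder(inp) else inp})
    return cleaned
```

Wrong answers and nonsensical instructions still need a human reader; a scan like this only surfaces the mechanical cases.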

@tloen
Owner

tloen commented Mar 17, 2023

Very interesting — I hadn't realized there were so many holes in the data. Fixing them could improve the model quality significantly. Out of curiosity, how many examples did you view and was there any method to your approach?

alpaca_data.json Outdated
@@ -121612,7 +121567,7 @@
{
"instruction": "Calculate the area of the following shape in square centimeters.",
"input": "rectangle of size 4 cm x 5 cm",
- "output": ""
+ "output": "The area the rectangle is 20 square centimeters."
Owner

note to self: typo

alpaca_data.json Outdated
"input": "",
- "output": ""
+ "output": "Over the last 5 years, there have been advancements in deep learning, increased used of AI in the industry, a greater emphasis on explainability, development of new hardware and increased importance placed on ethical and social considerations of AI."
Owner

note to self: typo (used)

{
"instruction": "Create a graphic or logo that visually represents the word \"courage\".",
"input": "",
"output": "<No Output>"
Owner

we should probably standardize an output for these instead of removing all of them completely. i liked the one above:

As a large language model, I am unable to generate visual data.
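
A standardized response like this could be applied with a small script. This is only a sketch under stated assumptions: the keyword list and the placeholder set are hypothetical, and keyword matching is crude (for example, "draw" also appears in "draw a conclusion"), so the rewrite is limited to records whose output is already a placeholder.

```python
# The refusal text is the one quoted in the comment above.
VISUAL_REFUSAL = "As a large language model, I am unable to generate visual data."

# Hypothetical keyword list; a real pass would need manual review of matches.
VISUAL_KEYWORDS = ("graphic", "logo", "draw", "sketch", "illustration")

# Placeholder outputs seen in this thread (assumed, not exhaustive).
PLACEHOLDER_OUTPUTS = {"", "n/a", "no output", "<no output>"}


def standardize_visual(record):
    """Return the record with a standard refusal if it is an unanswered visual task."""
    instruction = record["instruction"].lower()
    output = record["output"].strip().lower()
    looks_visual = any(word in instruction for word in VISUAL_KEYWORDS)
    if looks_visual and output in PLACEHOLDER_OUTPUTS:
        return {**record, "output": VISUAL_REFUSAL}
    return record
```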

alpaca_data.json Outdated
Comment on lines 248908 to 248815
"instruction": "Given a list of materials, come up with a plan to build a table.",
"input": "Materials: Wood, paint, glue, nails, screws",
"output": "No Output"
"instruction": "Come up with a motto for a robot company.",
"input": "",
"output": "Machines that make a difference."
Owner

Might be worth running this one through GPT again.

alpaca_data.json Outdated
Comment on lines 250993 to 250995
"instruction": "Is the following statement true or false? People in China make the most of their allowances.",
- "input": "None",
+ "input": "",
"output": "False"
Owner

Huh.

Contributor Author

lol. yeah, there are a few of these floating around. Seem to make no sense.

@gururise
Contributor Author

gururise commented Mar 17, 2023

  • Fixed a few more issues.
  • Put the "visualization" tasks back with a standard response: "As a large language model, I am unable to generate visual data."

I noticed there are several tasks that expect the LLM to use data from URLs, many of which don't even exist. I've substituted equivalent data when available.

@gururise
Contributor Author

gururise commented Mar 17, 2023

Very interesting — I hadn't realized there were so many holes in the data. Fixing them could improve the model quality significantly. Out of curiosity, how many examples did you view and was there any method to your approach?

I only gave it a cursory look and fixed the very obvious issues (e.g., inconsistent empty inputs, obviously wrong answers, blank outputs). I probably went through a few hundred examples manually.

I think I got most of the low-hanging fruit via searching for empty inputs and blank outputs. I did notice there are many instructions asking the LLM to reference online data to answer a question. These should probably be addressed in some manner.
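
Instructions that depend on online data could be flagged for review with a simple pattern match. A minimal sketch, assuming the records use the `instruction` and `input` fields shown in the diffs above; the regular expression is a rough heuristic and the function name is hypothetical.

```python
import re

# Rough heuristic for URLs; it will miss bare domains and unusual schemes.
URL_PATTERN = re.compile(r"https?://\S+|www\.\S+", re.IGNORECASE)


def references_url(record):
    """True if the instruction or input of a record contains something URL-like."""
    text = f"{record.get('instruction', '')} {record.get('input', '')}"
    return URL_PATTERN.search(text) is not None


def flag_url_records(records):
    """Return the indices of records that mention a URL."""
    return [i for i, rec in enumerate(records) if references_url(rec)]
```

Flagged records would still need a human decision: insert the referenced text, rewrite the instruction, or drop the example.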

@niclimcy

niclimcy commented Mar 17, 2023

I’m not sure if this is the right place to ask, but I was thinking of crowdsourcing updates to each response in the training dataset, with functions to review and approve each line.

@chris-aeviator

chris-aeviator commented Mar 17, 2023

Could contribute a simple system to accept/decline/upsert the entries

(Imagine each card in this kanban board being one instruction -> answer pair.)

image

Instead of a category, it would be a free-form text field with the data from the original dataset that a reviewer can edit.

image

@zkenda

zkenda commented Mar 17, 2023

Instead of providing generic answers like "As a large language model, I am unable to...", we could introduce a standardized set of tools that could improve the accuracy of certain types of responses, such as calculations, image generation, or code compilation. The model would propose tools and use their output instead of relying solely on the language model's internal capabilities (which could be a big limitation considering the model size).

One can still detect the tool usage and replace it with a generic answer if necessary.
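
As a rough illustration of that idea, the sketch below detects tool calls in a model output and either substitutes the tool's result or falls back to a generic answer. The `<tool name="...">...</tool>` markup, the function name, and the `tools` mapping are all hypothetical; nothing in this thread specifies a format.

```python
import re

# Hypothetical tool-call markup: <tool name="calculator">4 * 5</tool>
TOOL_CALL = re.compile(r'<tool name="(\w+)">(.*?)</tool>', re.DOTALL)


def resolve_tool_calls(text, tools, fallback):
    """Replace each tool call with its tool's output.

    `tools` maps a tool name to a function taking the call's argument string
    and returning a string. If any requested tool is unavailable, the whole
    response is replaced with `fallback` (the generic answer).
    """
    calls = TOOL_CALL.findall(text)
    if any(name not in tools for name, _ in calls):
        return fallback
    return TOOL_CALL.sub(lambda m: tools[m.group(1)](m.group(2)), text)
```

Text without any tool call passes through unchanged, so the same function can be applied to every output.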

@AndriyMulyar

To assist with this, I made an embedding space explorer (running the data through a transformer) for visualizing the instructions and outputs.

Training Data Instructions Latent Space: https://atlas.nomic.ai/map/alpaca_instructions
Training Data Outputs: https://atlas.nomic.ai/map/alpaca_outputs

For example, here is a link to a bunch of bad data points in the outputs: https://atlas.nomic.ai/map/d2139cc3-bc1c-441c-8d6f-3e6ffbbc2eda/838019ff-8fe2-42ba-809a-d86d2b98cd50/-18.11668742841587/-11.348087116836096/-20.88850316347706/-17.680468640801223/774455612

@gururise
Contributor Author

gururise commented Mar 17, 2023

The original Stanford dataset is full of mistakes and holes. Another large issue I found was that many of the instructions hallucinated references to article URLs.

I made a best-effort first pass through the dataset to clean it up:

  • Resolved empty outputs
  • Resolved empty inputs (no input, n/a, etc.) for consistency
  • Added several CoT examples (from Google's FLAN paper)
  • Fixed a few empty code examples
  • Instructions to generate audio or images now default to a message stating that, as an LLM, the model cannot do this
  • Resolved N/A outputs
  • Fixed a few wrong answers
  • Did my best to either insert actual text for URLs referring to articles, or replace them with an alternate instruction
  • Removed several instructions asking the LLM to pull data from the internet
  • Removed extraneous escape/ctrl characters in some answers

The patched dataset is much more consistent and no longer assumes the LLM can access the internet or view/generate visual data. It also now has a few CoT training examples.
Would be interested to see how training goes on this updated dataset.
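
Of the steps above, removing control characters is the easiest to automate. A minimal sketch, assuming that newlines and tabs should be kept; the function name is hypothetical and this is not necessarily how the cleanup in this PR was done.

```python
import unicodedata


def strip_control_chars(text):
    """Drop Unicode control characters (category Cc) except newline and tab."""
    return "".join(
        ch for ch in text
        if ch in "\n\t" or unicodedata.category(ch) != "Cc"
    )
```

Note that this removes only the control bytes themselves: for an ANSI escape sequence such as `\x1b[0m`, the ESC character is dropped but the printable remainder `[0m` stays and would need separate handling.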

@tloen
Owner

tloen commented Mar 17, 2023

I spent some time thinking about how to crowdsource dataset cleaning with minimal tooling. One way to do this is to create a separate repo with the following structure:

  • stanford_dataset.jsonl: a copy of the Alpaca dataset augmented with an id field for identification across versions
  • reviews: a folder of human-submitted data reviews
  • clean.py: a script or web interface that randomly samples unreviewed data points from stanford_dataset.jsonl for reviewing, then writes the edited or approved example to a new jsonl file in reviews
  • combine.py: a script that applies all the changes in reviews to the original dataset, and outputs a new cleaned_dataset.jsonl.

I suppose the utility of such an approach would depend on how many bad data points remain. In the meantime, I'll review the changes made so far and save a new "cleaned" dataset alongside the existing one.
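
The sampling and combining steps described above could look roughly like the following. This is a sketch of the proposal, not code from the repository: it assumes every record carries the `id` field mentioned above, that a review is simply the full edited or approved record, and that a later review of the same `id` overrides an earlier one. All function names are hypothetical.

```python
import json
import random
from pathlib import Path


def read_jsonl(path):
    """Load one JSON object per non-empty line."""
    with Path(path).open(encoding="utf-8") as fh:
        return [json.loads(line) for line in fh if line.strip()]


def write_jsonl(path, records):
    """Write one JSON object per line."""
    with Path(path).open("w", encoding="utf-8") as fh:
        for rec in records:
            fh.write(json.dumps(rec, ensure_ascii=False) + "\n")


def sample_unreviewed(dataset, reviewed_ids, k, seed=None):
    """Pick up to k records whose id has not been reviewed yet (clean.py step)."""
    pending = [rec for rec in dataset if rec["id"] not in reviewed_ids]
    return random.Random(seed).sample(pending, min(k, len(pending)))


def combine(dataset, reviews):
    """Apply reviews to the dataset, preserving order (combine.py step).

    Reviews are applied in list order, so the last review of an id wins.
    """
    latest = {rev["id"]: rev for rev in reviews}
    return [latest.get(rec["id"], rec) for rec in dataset]
```

A `combine.py` built on this would read `stanford_dataset.jsonl` plus every file in `reviews`, call `combine`, and write the result to `cleaned_dataset.jsonl`.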

@teknium1

Would the dataset benefit from multiple prompt:response chains rather than just single prompt>response? i.e. Question:Answer:FollowupQ:FollowupA

@tloen
Owner

tloen commented Mar 17, 2023

Would the dataset benefit from multiple prompt:response chains rather than just single prompt>response? i.e. Question:Answer:FollowupQ:FollowupA

That's a lot of work to build. I'd hold out for that 22k dataset that LAION used to train SFT-1.

@tloen
Owner

tloen commented Mar 17, 2023

Folded into f704404. Thanks for your work!

@tloen tloen closed this Mar 17, 2023
@spAnser

spAnser commented Mar 17, 2023

Looks like this just closed as I was typing, but there is a typo not too far into the file, and I'm not sure whether it's intentional.

construciton instead of construction

https://github.com/tloen/alpaca-lora/blob/main/alpaca_data_cleaned.json#L23

@tloen
Owner

tloen commented Mar 17, 2023

8aecde8

@tloen
Owner

tloen commented Mar 17, 2023

Although honestly we might want to leave typos in the instructions.

@spAnser

spAnser commented Mar 17, 2023

Yeah it might be worth it idk.

@teknium1

for prompts it seems a good idea to keep typos

@underlines

underlines commented Mar 19, 2023

People should really support LAION's open-assistant.io project, because every person helping there will improve a fully curated, crowdsourced, open-source instruction fine-tuning dataset, which in turn can be used for Alpaca fine-tuning.

@samching samching mentioned this pull request Mar 20, 2023
@gururise
Contributor Author

FYI, the dataset cleaning is ongoing. The latest cleaned dataset can be accessed here.

@wassname

wassname commented Mar 22, 2023

Instead of providing generic answers like "As a large language model, I am unable to..." we could introduce a standardized set of tools

Good idea. Meta is already working on it with Toolformer, and there are a few other efforts too, for example getting a model to control a web browser. They help, but not as much as you would expect at the moment (red is baseline, blue is with a calculator). Since it's a WIP, I would guess it's outside the scope of this repo for now.

image

10 participants