Bad dataset #65

abacaj · 2023-03-19T06:10:07Z

If anyone is curious, here is my run on the Alpaca dataset using another decoder model (codegen-16B-nl). It appears the dataset isn't diverse: it contains multiple closely related answers. I believe a model trained on this dataset will not generalize well to new data.

The loss in the original Alpaca training script follows a pattern similar to the one used in OPT-IML, computing the loss based on the label.

My run on codegen-16B-nl

Another user's run on LLaMA 7B

Some more discussion: https://twitter.com/abacaj/status/1637310768780648448

Wingie · 2023-03-19T08:17:05Z

Hmm, I wonder if we could get better results using a higher-quality dataset like https://github.com/allenai/natural-instructions/tree/master/splits/default

kooshi · 2023-03-19T13:06:04Z

Yeah, I'm currently playing with LAION OIG small-chip2, but I'm really looking forward to the result of their Open-Assistant project, which will create a fully human-generated, refined, and curated instruction dataset.

baleksey · 2023-03-19T15:50:31Z

The dataset is the problem, I guess. Here are just two examples of inputs/outputs that teach the model nothing but hallucination.

{
    "instruction": "Retrieve the last three transactions from the customer's account.",
    "input": "Customer ID: 12333",
    "output": "Transaction 1: USD"
},

{
    "instruction": "Transcribe the recording into text.",
    "input": "Recording about the history of Waterloo",
    "output": "The Battle of Waterloo, fought on June 18, 1815, was the last major battle of the Napoleonic Wars. It marked the end of Napoleon Bonaparte's rule. The battle was fought between Napoleon Bonaparte's French forces and an allied force of British, Dutch, Belgian and Prussian troops, led by the Duke of Wellington. It was fought near the town of Waterloo in the former Austrian Netherlands, now present-day Belgium. The battle was a defeat for Napoleon's forces and ended Napoleon's rule over France....
}

I'm sure there are a lot of such bad instructions in there, which cause more problems than proper learning.

gururise · 2023-03-19T16:25:57Z

There is definitely a lot of bad data in the dataset. I went through and manually cleaned hundreds of issues resulting in the current "cleaned" dataset, but there are likely many more issues I missed.

I came across and fixed many issues referencing data on the internet that would likely just cause hallucinations, for example:

{
   "instruction": "Describe what is shown in this painting in one sentence",
   "input": "https://works.bepress.com/jing-chen/3/download/",
   "output": "This painting depicts a mother embracing her child in a loving embrace, surrounded by a peaceful and tranquil atmosphere."
}

There were hundreds of similar examples in the original dataset.
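
A minimal sketch of how records like the one above could be flagged automatically, assuming each record is a dict with "instruction", "input", and "output" keys as in the examples in this thread. The helper name `references_url` and the regular expression are illustrative assumptions; the cleaning described above was done by hand, so this is not the method that produced the cleaned dataset.

```python
import re

# Assumed pattern: matches http(s) links anywhere in a string.
URL_PATTERN = re.compile(r"https?://\S+", re.IGNORECASE)


def references_url(record):
    """Return True if the instruction or input of a record contains a web link."""
    text = f"{record.get('instruction', '')} {record.get('input', '')}"
    return URL_PATTERN.search(text) is not None


# Two records taken from this thread: one points at a URL, one does not.
records = [
    {
        "instruction": "Describe what is shown in this painting in one sentence",
        "input": "https://works.bepress.com/jing-chen/3/download/",
        "output": "This painting depicts a mother embracing her child in a loving "
                  "embrace, surrounded by a peaceful and tranquil atmosphere.",
    },
    {
        "instruction": "Name 5 US cities with populations over 1 million.",
        "input": "",
        "output": "The 5 US cities with populations over 1 million are: New York "
                  "City, Los Angeles, Chicago, Houston, and Phoenix.",
    },
]

# Split into records to review by hand and records that pass this check.
flagged = [r for r in records if references_url(r)]
kept = [r for r in records if not references_url(r)]
print(f"flagged: {len(flagged)}, kept: {len(kept)}")
```

A check like this only surfaces candidates for review; a flagged record may still be fine if the output does not depend on the linked content.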

@abacaj Are the training results you posted of the original dataset or the cleaned dataset?

abacaj · 2023-03-19T16:46:44Z

The runs posted are from the original. Unfortunately, I only discovered this dataset after posting on Twitter. But I can probably do a smaller model run to see if it improves the loss. Personally, I don't know whether removing what appear to be bad samples will help.

To me it seems like there are very similar answers in the labels of the dataset, which are used to compute the loss, and this could lead to overfitting or reduce the ability to generalize. Here are 3 examples I found. Maybe it's not an issue, but it appears to contribute to the flat loss, i.e. the dataset is not diverse.

{
    "instruction": "Generate a list of 5 American cities with populations higher than 1 million.",
    "input": "",
    "output": "New York City, Los Angeles, Chicago, Houston, Phoenix."
},

{
    "instruction": "Name 5 US cities with populations over 1 million.",
    "input": "",
    "output": "The 5 US cities with populations over 1 million are: New York City, Los Angeles, Chicago, Houston, and Phoenix."
},

{
    "instruction": "Generate a list of four US cities with over one million population.",
    "input": "",
    "output": "New York City, Los Angeles, Chicago, Houston."
},
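
One rough way to quantify how close these answers are is token-set (Jaccard) overlap between outputs. The sketch below uses the three outputs above; the helper names and the 0.3 cutoff are assumptions chosen for illustration, not something taken from the training runs discussed here, and a real pass over the dataset would need a tuned threshold and a faster method than all-pairs comparison.

```python
import re
from itertools import combinations


def tokens(text):
    """Lowercase the text and return its set of alphanumeric word tokens."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))


def jaccard(a, b):
    """Jaccard similarity of two token sets (1.0 means identical sets)."""
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)


# The three outputs quoted above.
outputs = [
    "New York City, Los Angeles, Chicago, Houston, Phoenix.",
    "The 5 US cities with populations over 1 million are: New York City, "
    "Los Angeles, Chicago, Houston, and Phoenix.",
    "New York City, Los Angeles, Chicago, Houston.",
]

THRESHOLD = 0.3  # assumed cutoff; would need tuning on the real dataset

# Compare every pair of outputs and keep the pairs at or above the cutoff.
similar_pairs = []
for i, j in combinations(range(len(outputs)), 2):
    score = jaccard(tokens(outputs[i]), tokens(outputs[j]))
    if score >= THRESHOLD:
        similar_pairs.append((i, j, score))

for i, j, score in similar_pairs:
    print(f"outputs {i} and {j} overlap: {score:.2f}")
```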

tloen · 2023-03-19T19:28:14Z

Following our discussion on Twitter, here is a screenshot of my current alpaca-lora training run (losses are a bit higher because I'm masking out the instruction in the loss):

I'm starting to drift towards the idea that we should probably abandon the Alpaca dataset entirely once we get a suitable SFT dataset from the Open-Assistant project, or at least diversify the seed prompts in the original repo.

abacaj · 2023-03-19T20:08:14Z

Looks better. We could probably improve quality by filtering duplicate instruction/answer pairs out of the dataset and keeping only the best one from each group.

I'm curious how you did the masking, because I did something similar in my run by setting the labels to IGNORE_INDEX up to the length of the instruction prompt.
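
For readers unfamiliar with this kind of masking, here is a minimal sketch of the idea. It assumes a PyTorch-style cross-entropy loss that skips positions whose label is -100; the function name `build_labels` and the token ids are made up for illustration, and neither run's actual code is shown in this thread.

```python
IGNORE_INDEX = -100  # default ignore_index of PyTorch's CrossEntropyLoss


def build_labels(prompt_ids, response_ids):
    """Concatenate prompt and response ids and mask the prompt part of the labels.

    Positions labeled IGNORE_INDEX contribute nothing to the loss, so only the
    response tokens are trained on.
    """
    input_ids = list(prompt_ids) + list(response_ids)
    labels = [IGNORE_INDEX] * len(prompt_ids) + list(response_ids)
    return input_ids, labels


# Made-up token ids standing in for a tokenized prompt and answer.
prompt_ids = [101, 7, 42, 9]
response_ids = [55, 56, 2]

input_ids, labels = build_labels(prompt_ids, response_ids)
print(input_ids)
print(labels)
```

Because fewer positions are averaged over and the prompt tokens no longer count toward the loss, a masked run reports a different loss value than an unmasked one, so the two curves are not directly comparable.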

I just realized your loss still flatlines somewhat, like my previous run. I think the validation loss will show that it is overfitting.

samching · 2023-03-20T01:46:26Z

Maybe tangentially related, but @tloen curious why you might want to leave typos in the dataset (per #32 (comment))

teknium1 · 2023-03-20T04:55:34Z

Not my place to respond, but I would say leaving typos in the prompt teaches the model to interpret a typo as the word it was meant to be, and to respond accordingly.

abacaj · 2023-03-20T05:22:13Z

Makes sense to me as well for the prompt; the output in the dataset should aim to be correct.

teknium1 · 2023-03-20T10:04:42Z

I agree with that for sure.

Wingie · 2023-03-20T11:02:38Z

LAION's dataset can be found here: https://github.com/LAION-AI/Anh/tree/main/data in case anyone wants to try training on it!

samching · 2023-03-20T21:01:48Z

Interesting. It looks like 100K lines of User: | Assistant: input/output pairs, pulled from different dataset sources. I wonder if this represents the latest from these efforts?

gururise · 2023-03-21T18:23:32Z

I started a new effort to try and clean up the current alpaca dataset
https://github.com/gururise/AlpacaDataCleaned

conceptofmind · 2023-03-22T15:26:37Z

I am working on putting together a FLAN dataset as well to upload to the HF hub.

I'm training 7B and 13B LLaMA models on OIG in bf16, without LoRA. I will have those out soon.

claysauruswrecks · 2023-03-24T04:25:52Z

> > Maybe tangentially related, but @tloen curious why you might want to leave typos in the dataset (per #32 (comment))
>
> Not my place to respond, but I would say leaving typos in the prompt makes it understand the typo should be thought of as what it is meant to be, and respond accordingly

My intuition is we should keep the training data scoped and focused. Correct all typos in training data that does not cover the skill of correcting wrong spellings. Create more training prompts (there are some already) specifically focused on the transition through these steps:

Identifying misspelled input
Correcting the spelling from context
Understanding the corrected input

claysauruswrecks · 2023-03-24T07:19:16Z

I've opened #152 to start the process of vendoring datasets in other repos.

I went through all the history for alpaca_data_cleaned.json in this repo to make sure the big fixes were in the vendored submodule.

Next, I will go through and improve the training prompts in @gururise's repo.

conceptofmind · 2023-03-25T02:31:03Z

Uploaded these so far for Flan:
https://huggingface.co/datasets/conceptofmind/flan_niv2_zsopt
https://huggingface.co/datasets/conceptofmind/flan_cot_fsopt
https://huggingface.co/datasets/conceptofmind/flan_cot_zsopt
https://huggingface.co/datasets/conceptofmind/flan_cot_submix
