-
Notifications
You must be signed in to change notification settings - Fork 100
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Some update to tr10 config #20
base: master
Are you sure you want to change the base?
Conversation
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Some thoughts on the config, I'll update the readme at the end so that we leave a trail with the conclusions we come up with.
@@ -46,7 +46,7 @@ GLOBAL_BATCH_SIZE=2048 | |||
|
|||
NLAYERS=40 | |||
NHIDDEN=5120 | |||
NHEADS=32 |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I don't why we chose 32? We seem to have updated the NHIDDEN
value to be 5120 because it was divisible by 128, and 5120 // 128 = 40
.
cc @VictorSanh @stas00 @mryab (People who were involved in the original post)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
FWIW, 530B training used:
NLAYERS=105
NHIDDEN=20480
NHEADS=128
So the same proportion as 32 and 5120
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Also, @TevenLeScao shared elsewhere a research paper showing that many heads were found to be quite redundant anyway.
I'm not sure if there is a research showing size of the head vs. number of the heads performance.
--hidden-dropout 0.0 \ | ||
--attention-dropout 0.0 \ |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
https://arxiv.org/abs/2010.11934 showed strong performance loss when using dropout (table 4). Though it was enc/dec architecture, there's probably no reason that it would benefit our dec only arch. We are currently evaluating this on 1B3 scale. https://huggingface.co/bigscience/tr3o-1B3-pile-no-dropout-logs
@@ -57,13 +57,14 @@ OPTIMIZER_ARGS=" \ | |||
--adam-beta1 0.9 \ | |||
--adam-beta2 0.95 \ | |||
--adam-eps 1e-8 \ | |||
--lr 6e-5 \ | |||
--lr 1e-4 \ |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
GPT3 paper suggest a higher learning rate. Is there a reason why we would use 6e-5
?
--min-lr 6e-6 \ | ||
--lr-decay-style cosine \ | ||
--lr-decay-samples 126_953_125 \ |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
you removed this one w/o any commentary?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The original tr1-13B said:
We need lr-decay in samples, so tokens2samples = 260B / 2048 = 126_953_125
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I was looking at setting it by default to the entire number of samples we have
https://github.com/bigscience-workshop/Megatron-DeepSpeed/blob/fd1e1da967c74e598acfc011031474663ef5845e/megatron/training.py#L341
We have been using this in arch/scaling.
However I've just re-read the GPT3 paper and they do it for 260B ... so not sure here. cc @TevenLeScao
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thank you for the note, Thomas - it's crucial that we leave a note trail, otherwise we have no idea why some config was added or removed.
@@ -165,7 +166,7 @@ export CMD=" \ | |||
--load $CHECKPOINT_PATH \ | |||
--data-path $DATA_PATH \ | |||
--data-impl mmap \ | |||
--split 900,100,0 \ | |||
--split 950,50,0 \ |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
currently using a small dataset, so I had to give valid a larger chunk. But for the real training this needs to be restored to the above split.
This PR is for sorting out the tr10-104B config.