
The training notes of GLM-130B

Basic Information about GLM-130B

  • 130B: 70 layers, 12288 hidden size, 32768 FFN hidden size, 150000 vocab size
    • MP = 4, PP = 8
  • GLM + Rotary Positional Embedding + GeGLU + DeepNorm
  • FP32 softmax with QKV scaling (no PB-Relax)
  • Shrink embedding gradient with $\alpha=0.1$ (a sketch follows this list)
  • Global batch size: 4224
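
The embedding gradient shrink mentioned in the list above rescales only the gradient that reaches the word embedding while leaving its forward output unchanged. A minimal PyTorch sketch, assuming the trick is applied right after the embedding lookup (the function name and call site are ours, not the LargeScale code):

```python
import torch


def shrink_embedding_gradient(embedding_output: torch.Tensor, alpha: float = 0.1) -> torch.Tensor:
    """Scale the gradient flowing into the word embedding by `alpha`.

    The forward value is unchanged (alpha * x + (1 - alpha) * x == x), but only
    the first term stays in the autograd graph, so the embedding weights receive
    alpha * grad in the backward pass.
    """
    return embedding_output * alpha + embedding_output.detach() * (1.0 - alpha)
```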

Environment

  • PyTorch 1.11 / CUDA 11.3
  • LargeScale@400893da37bb5cbe22c29e41c02a052369cc72ce
  • DeepSpeed 0.6.1
  • apex@master

Speed Testing (with Different Batch Sizes)

  • 96 nodes, BSZ = 176 * 24 = 4224
    • glm-130B-2022.05.05-19:34:16: 134 TFLOPS, 88.5 s/iter, 48 samples/s
  • 96 nodes, BSZ = 256 * 24 = 6144
    • glm-130B-2022.05.05-19:43:13: 141 TFLOPS, 122.5 s/iter, 50 samples/s
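
A quick arithmetic check on these throughput figures; the topology of 8 GPUs per node (96 × 8 = 768 GPUs = MP × PP × 24 data-parallel replicas) is our assumption:

```python
# Sanity check of the logged speed numbers (assumed topology: 8 GPUs per node).
mp, pp, dp = 4, 8, 24                 # model-, pipeline-, data-parallel sizes
gpus = mp * pp * dp                   # 768 GPUs across 96 nodes
global_batch = 176 * dp               # 4224 samples per iteration
iter_time_s = 88.5                    # measured seconds per iteration

print(gpus)                           # 768
print(global_batch / iter_time_s)     # ~47.7 samples/s, matching the logged 48 samples/s
```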

2022-05-06 04:00 Training starts

  • glm-130B-2022.05.05-19:53:15

2022-05-07 20:14 Node failure

n30041 and n30157 break down; we change the saving interval to 100 steps (the original 500 steps is too long) and restart from step 4000

  • glm-130B-2022.05.07-13:44:59

2022-05-10 00:00 Increase alpha for embedding gradient shrink, as we think the original alpha (0.1) is too small

Add --shrink-embedding-gradient-steps 6000 500 to warm up alpha to 1 starting from step 6000, over 500 steps (sketched below)

  • glm-130B-2022.05.09-16:02:04
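
A minimal sketch of the alpha schedule described above, reading the flag's two arguments as a start step and a ramp length (our paraphrase, not the actual LargeScale implementation):

```python
def embedding_shrink_alpha(step: int, base_alpha: float = 0.1,
                           ramp_start: int = 6000, ramp_steps: int = 500) -> float:
    """Hold alpha at `base_alpha`, then ramp it linearly to 1.0 over `ramp_steps`
    steps once training passes `ramp_start` (mirroring the "6000 500" above)."""
    if step <= ramp_start:
        return base_alpha
    progress = min(1.0, (step - ramp_start) / ramp_steps)
    return base_alpha + (1.0 - base_alpha) * progress
```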

2022-05-11 12:13 Node failure

n30115 breaks down; restart from step 7300

  • glm-130B-2022.05.11-05:55:32

2022-05-20 00:03 Node failure

n30066 breaks down; restart from step 15400

  • glm-130B-2022.05.19-19:56:19

Switch to another node pool and restart from step 15600

  • glm-130B-2022.05.20-01:58:57

2022-05-21 12:40 Replace node

We find that the training throughput is only 127 TFLOPS, lower than before; suspecting that the replacement node n30076 has some unknown problem, we kick it out from step 16600; nothing changes

2022-05-22 19:27 Node failure

n30126 loses connection

  • glm-130B-2022.05.22-14:15:41

2022-05-26 04:30 Node failure

n30039 reports missing GPUs

  • glm-130B-2022.05.25-22:23:12

2022-05-28 11:50 Change Multi-task Instruction Pre-training (MIP) data (abolished)

Restart from step 22800; change the MIP data to the correct version (English & Chinese)

  • glm-130B-2022.05.28-03:52:26
  • events.out.tfevents.1653709957.9droa42ltcad5-0.1858.0 (abolished)

2022-05-28 16:50 Change MIP data

The new MIP data (English & Chinese) leads to a NaN loss at step 22900; we find too much noise in the Chinese multi-task data and switch to the vanilla T0 training datasets

  • glm-130B-2022.05.28-09:18:12
  • events.out.tfevents.1653729502.9droa42ltcad5-0.5648.0 (abolished)

2022-05-28 20:50 Add warmup (abolished)


The vanilla T0 datasets still lead to divergence; suspecting that the changed task ratio causes the instability, we add the argument --warmup-samples-after-loading 2112000 (4224 samples/step × 500 steps) to warm up for 500 steps from step 22800

  • glm-130B-2022.05.28-12:57:24
  • events.out.tfevents.1653742654.9droa42ltcad5-0.7942.0 (abolished)

2022-05-29 01:30 Diverges again; switch to self-supervised pre-training only (abolished)


  • Diverges after the warm-up; suspecting that the distribution shift is still too large, we try restarting with self-supervised pre-training only and reshuffled data, loading from step 22800
  • glm-130B-2022.05.28-18:05:33
  • events.out.tfevents.1653761143.9droa42ltcad5-0.9744.0 (abolished)
  • global_step23200_text
  • Configuration file

2022-05-29 Smoothing distribution shift (abolished)


Self-supervised pre-training alone seems stable; we try to smooth the distribution shift by warming up the ratio of correct T0 data from step 22800

  • glm-130B-2022.05.29-05:17:06
  • events.out.tfevents.1653801436.9droa42ltcad5-0.13868.0 (abolished)

2022-05-29 22:40 Smoothing data distribution shift & warmup learning rate

  • Diverges; suspecting that the learning rate also requires warm-up in this process


  • Restart from step 22800; warm up the correct MIP data ratio and the learning rate over 2000 steps; warm up the embedding gradient shrink alpha from 0.2 to 1 over 6000 steps (see the sketch after this list)
  • glm-130B-2022.05.29-17:35:45
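
A minimal sketch of what warming up the MIP data ratio can look like at the sampler level; the 5% target ratio and the function name are placeholders, not values taken from our training configuration:

```python
import random


def sample_source(step_since_restart: int, warmup_steps: int = 2000,
                  target_mip_ratio: float = 0.05) -> str:
    """Choose the data source for one sample, ramping the MIP share from 0
    to `target_mip_ratio` over `warmup_steps` steps after the restart."""
    ratio = target_mip_ratio * min(1.0, step_since_restart / warmup_steps)
    return "mip" if random.random() < ratio else "self_supervised"
```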

2022-05-30 14:00 Node and file system failure

We find the warm-up steps for the embedding gradient shrink to be wrong (26850 steps instead of 6000); we change the warm-up implementation to count the absolute number of samples and restart from global_step23200

We then discover that the restart is stuck in data loading, which turns out to be an error of the Lustre file system. As a result we cannot read the 2.3 TB text corpora, and since the engineers cannot recover the data, we have to copy the data from the backup disk to the file system again (which takes a few days)

  • glm-130B-2022.05.31-02:18:24

2022-05-31 20:00 Add DeepStruct data to MIP

  • Keeping the original warm-up process; adding DeepStruct data to the MIP portion; restart from step 23500

2022-06-01 22:22 Replace MIP data with a cleaner version

We find one noisy prompt each in the task data for T0 (qqp) and DeepStruct; we remove them and restart from step 24500

  • glm-130B-2022.06.01-14:24:33

2022-06-02 12:00 Node failure

  • n30145 reports a CPU error; restarting from step 25000 and removing the warm-up process as it has finished
  • glm-130B-2022.06.02-04:35:05

2022-06-03 09:30 Start to print multi-task loss

From step 25800, we print the multi-task loss

  • glm-130B-2022.06.03-01:40:12

2022-06-03 15:00 Reduce learning rate and print gpt/bert loss

The loss decreases slowly, which we think might be due to a too-large learning rate; from step 26000, we halve the learning rate

  • glm-130B-2022.06.03-07:26:16

2022-06-06 17:00 Node cluster maintenance

The node cluster needs an upgrade from 9 am to 5 pm

  • glm-130B-2022.06.06-10:00:39

PS: we observe a significant improvement in the file system's read speed; loading the checkpoint now takes only 1 minute

2022-06-08 08:00 Node failure

  • glm-130B-2022.06.08-00:00:37

2022-06-09 13:30 Unexpected termination of the training

Restarting from step 23100; suspecting a network communication problem

  • glm-130B-2022.06.09-05:27:54

2022-06-12 10:00 Loss explodes

From step 33700, the training loss explodes: the loss scale drops drastically around step 33710, and the loss blows up at step 33740

  • tensorboard record: glm-130B-33700


  • Restarting from step 33600 and reducing the embedding gradient shrink alpha from 1.0 to 0.5
  • glm-130B-2022.06.12-02:20:49

2022-06-14 03:00 Loss explodes

At step 35250, the loss explodes again, with almost the same behavior as at step 33700; it breaks down without any warning

tensorboard record: glm-130B-35250

  • Restarting from step 35200 and reducing the embedding gradient shrink alpha from 0.5 to 0.1
  • glm-130B-2022.06.14-02:28:21

2022-06-19 00:10 Node failure

n30085 breaks down; restarting from step 39600

  • glm-130B-2022.06.18-17:49:53

2022-06-20 09:10 Loss explodes


  • tensorboard record: glm-130B-40800
  • --skip-train-iteration-range 40701-40900
  • Restarting from step 40700 and skipping the noisy data in steps 40701-40900 (see the sketch after this list)
  • glm-130B-2022.06.20-03:36:13
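
A minimal sketch of the idea behind --skip-train-iteration-range: keep consuming batches in the flagged range so the data order stays aligned, but run no training step on them. This is our reading of the flag, not the LargeScale code, and run_train_step is a hypothetical helper:

```python
def parse_skip_ranges(ranges):
    """Parse ranges given as "start-end" strings, e.g. ["40701-40900"]."""
    return [tuple(int(x) for x in r.split("-")) for r in ranges]


def should_skip(iteration: int, skip_ranges) -> bool:
    """True if this iteration falls inside a range whose data should be dropped."""
    return any(start <= iteration <= end for start, end in skip_ranges)


def train(data_iterator, start_iter, end_iter, skip_ranges, run_train_step):
    for iteration in range(start_iter, end_iter):
        batch = next(data_iterator)            # always advance: keeps the data order fixed
        if should_skip(iteration, skip_ranges):
            continue                           # drop the noisy batch, no forward/backward
        run_train_step(batch)                  # hypothetical training step
```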

2022-06-22 10:40 Gradient spikes


  • The gradient norm spikes and seems to recover automatically, but the training loss changes drastically
  • --skip-train-iteration-range 42401-42600
  • Restarting from step 42400 and skipping the data in steps 42401-42600
  • glm-130B-2022.06.22-02:38:20

2022-06-22 21:00 Gradient spikes


  • The gradient norm experiences a spike again, but the loss-scale seems stable. We think it might recover automatically.
  • Rethinking the repeated gradient spikes of recent days, we speculate that they might be due to a too-slow learning rate decay in the late stage of pre-training; reducing the minimum lr from 8e-6 to 4e-6 (see the sketch after this list)
  • --min-lr 4e-6
  • Restarting from step 42700
  • glm-130B-2022.06.22-13:03:53
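
To illustrate what lowering --min-lr changes, here is a minimal sketch of a cosine decay clamped at a minimum learning rate; the peak lr, warm-up length, and decay horizon below are placeholders rather than the exact GLM-130B settings:

```python
import math


def lr_schedule(step: int, max_lr: float = 8e-5, min_lr: float = 4e-6,
                warmup_steps: int = 500, decay_steps: int = 200_000) -> float:
    """Cosine decay from max_lr down to min_lr after a linear warm-up.

    Lowering min_lr only lowers the floor that the decay is clamped to, so the
    late-stage learning rate keeps shrinking instead of flattening out early.
    """
    if step < warmup_steps:
        return max_lr * step / warmup_steps
    progress = min(1.0, (step - warmup_steps) / (decay_steps - warmup_steps))
    return min_lr + (max_lr - min_lr) * 0.5 * (1.0 + math.cos(math.pi * progress))
```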

2022-06-26 16:00 Node failure

  • Unexpected NVLink Error; restarting training
  • glm-130B-2022.06.26-13:13:51

2022-06-29 00:00 Recover position_id

  • Restarting training from step 48100; using another, more consistent positional encoding (the original one implements [MASK] and [gMASK] differently; see the sketch below)
  • glm-130B-2022.06.29-13:53:21
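
For context, GLM's positional encoding is two-dimensional: one channel for the position in the original text and one for the position inside a generated span. A minimal sketch of the consistent treatment referred to above, in which [MASK] and [gMASK] spans are handled identically (the pre-fix behavior is not reproduced here):

```python
def build_position_ids(context_len: int, mask_index: int, span_len: int):
    """GLM-style 2D position ids for a context followed by a generated span.

    Dim 1: context tokens keep their own positions; every generated token
           reuses the position of the mask token it expands.
    Dim 2: 0 for context tokens; 1..span_len inside the generated span.
    """
    position_ids = list(range(context_len)) + [mask_index] * span_len
    block_position_ids = [0] * context_len + list(range(1, span_len + 1))
    return position_ids, block_position_ids


# Example: a 5-token context whose mask token sits at index 3 and expands to 4 tokens.
print(build_position_ids(5, 3, 4))
# ([0, 1, 2, 3, 4, 3, 3, 3, 3], [0, 0, 0, 0, 0, 1, 2, 3, 4])
```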