- @StellaAthena created the Common LLM Settings spreadsheet, which can be a super-useful resource when you're about to embark on a new LLM training - it tells you the settings with which many known LLM trainings were done.
- A few years back I started compiling information on which dtype the models were trained in - it contains only a handful of models, but if you're doing research on dtypes it can still be useful. I was using this information to try to write a model pretraining dtype auto-detection tool, and here is a related float16 vs bfloat16 numerical properties comparison (a small illustrative sketch follows this list).
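
To make the float16 vs bfloat16 comparison concrete, here is a minimal Python/PyTorch sketch. It is not the auto-detection tool mentioned above; `guess_training_dtype` is a hypothetical helper illustrating one crude heuristic (weights exceeding float16's representable range imply the model wasn't trained in float16).

```python
import torch

def compare_half_precision_dtypes():
    """Print the key numerical properties of float16 vs bfloat16."""
    for dtype in (torch.float16, torch.bfloat16):
        info = torch.finfo(dtype)
        # bfloat16 trades mantissa bits (precision) for exponent bits (range):
        # roughly float32's dynamic range, but only ~3 decimal digits of precision.
        print(f"{dtype}: max={info.max:.3e}  smallest normal={info.tiny:.3e}  eps={info.eps:.3e}")

def guess_training_dtype(state_dict):
    """Hypothetical heuristic: guess the pretraining dtype from weight magnitudes.

    If any weight exceeds float16's max representable value (~65504), the model
    was almost certainly not trained in float16; bfloat16 or float32 are the
    likely candidates. A rough illustration, not a reliable detector.
    """
    fp16_max = torch.finfo(torch.float16).max
    abs_max = max(t.abs().max().item()
                  for t in state_dict.values() if torch.is_floating_point(t))
    return "bfloat16 or float32" if abs_max > fp16_max else "float16 (or any dtype)"

if __name__ == "__main__":
    compare_half_precision_dtypes()
```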
Logbooks and chronicles of LLM/VLM training are one of the best sources to learn from about dealing with training instabilities and choosing good hyperparameters.
If you know of a public LLM/VLM training logbook that is not on this list please kindly let me know or add it via a PR. Thank you!
The listing is in no particular order other than being grouped by year.
- BigScience pre-BLOOM 108B training experiments (2021): chronicles | the full spec and discussions (backup: 1 | 2)
- BigScience BLOOM-176B (2022): chronicles-prequel | chronicles | the full spec and discussions (backup: 1 | 2 | 3)
- THUDM GLM-130B (2022): en logbook | Mandarin version (backup: 1 | 2)
- HuggingFace IDEFICS-80B multimodal (Flamingo repro) (2023): Learning log | Training Chronicles (backup: 1 | 2)
- BloombergGPT 50B LLM (2023) - section C in BloombergGPT: A Large Language Model for Finance
- MegaScale: Scaling Large Language Model Training to More Than 10,000 GPUs (2024) - the paper covers various training issues and their resolution - albeit on proprietary models, it's just as instructional/useful.
- Imbue's From bare metal to a 70B model: infrastructure set-up and scripts (2024) - a very detailed technical post on how they set up a 512-node IB-fat-tree cluster and got it working, and on the many training-related issues they had to overcome while training a proprietary 70B-param model. They also open-sourced the cluster tooling they created in the process.