- @StellaAthena created the Common LLM Settings spreadsheet, which can be a super-useful resource when you're about to embark on a new LLM training - it tells you the settings with which many known LLM trainings were done.
- A few years back I started compiling information on which dtype the models were trained in - it contains only a handful of models, but if you're doing research on dtypes it can still be useful. I was using this information to try to write a model pretraining dtype auto-detection tool, and here is a related float16 vs bfloat16 numerical properties comparison (a small illustrative sketch follows this list).
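
To make the float16 vs bfloat16 comparison concrete, here is a minimal Python/PyTorch sketch. It is not the auto-detection tool mentioned above; `guess_training_dtype` is a hypothetical helper illustrating one crude heuristic (weights exceeding float16's representable range imply the model wasn't trained in float16).

```python
import torch

def compare_half_precision_dtypes():
    """Print the key numerical properties of float16 vs bfloat16."""
    for dtype in (torch.float16, torch.bfloat16):
        info = torch.finfo(dtype)
        # bfloat16 trades mantissa bits (precision) for exponent bits (range):
        # roughly float32's dynamic range, but only ~3 decimal digits of precision.
        print(f"{dtype}: max={info.max:.3e}  smallest normal={info.tiny:.3e}  eps={info.eps:.3e}")

def guess_training_dtype(state_dict):
    """Hypothetical heuristic: guess the pretraining dtype from weight magnitudes.

    If any weight exceeds float16's max representable value (~65504), the model
    was almost certainly not trained in float16; bfloat16 or float32 are the
    likely candidates. A rough illustration, not a reliable detector.
    """
    fp16_max = torch.finfo(torch.float16).max
    abs_max = max(t.abs().max().item()
                  for t in state_dict.values() if torch.is_floating_point(t))
    return "bfloat16 or float32" if abs_max > fp16_max else "float16 (or any dtype)"

if __name__ == "__main__":
    compare_half_precision_dtypes()
```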
Logbooks and chronicles of LLM/VLM training are one of the best sources to learn from about dealing with training instabilities and choosing good hyperparameters.
If you know of a public LLM/VLM training logbook that is not on this list please kindly let me know or add it via a PR. Thank you!
The listing is in no particular order other than being grouped by year.
- BigScience pre-BLOOM 108B training experiments (2021): chronicles | the full spec and discussions (backup: 1 | 2)
- BigScience BLOOM-176B (2022): chronicles-prequel | chronicles | the full spec and discussions (backup: 1 | 2 | 3)
- THUDM GLM-130B (2022): en logbook | Mandarin version (backup: 1 | 2)
- HuggingFace IDEFICS-80B multimodal (Flamingo repro) (2023): Learning log | Training Chronicles (backup: 1 | 2)
- BloombergGPT 50B LLM (2023) - section C in BloombergGPT: A Large Language Model for Finance
- MegaScale: Scaling Large Language Model Training to More Than 10,000 GPUs (2024) - the paper covers various training issues and their resolution - albeit on proprietary models, it's just as instructional/useful.
- Imbue's From bare metal to a 70B model: infrastructure set-up and scripts (2024) - a very detailed technical post on how they set up a 512-node IB-fat-tree cluster and got it working, and on the many training-related issues they had to overcome while training a proprietary 70B-param model. They also open-sourced the cluster tooling they created in the process.