[BUG] Create zero equivalency unit test #2790
For the (4)th item (training and comparing the loss curve) you can probably use HF Trainer + a pytorch example program, so everything is already done for you - e.g. see the example of how I train opt-1.3b from scratch in huggingface/transformers#21312 - so you just need to add a set of ds_config files. I can help set this one up; it should be very trivial to do, as it's really just a matter of adding one config per setup. And also run the same 3 setups w/o deepspeed - just using DDP - so that would be another baseline for each of the dtypes. To summarize:
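As a rough sketch of what "adding a set of ds_config files" could look like, here is one hypothetical way to generate one DeepSpeed config per dtype so the same HF Trainer script can be launched against each. The file names and the ZeRO stage chosen here are illustrative assumptions, not the actual configs used:

```python
import json

# Assumed base config; stage and "auto" batch settings are illustrative.
BASE = {
    "zero_optimization": {"stage": 3},
    "train_micro_batch_size_per_gpu": "auto",
    "gradient_accumulation_steps": "auto",
}

def make_config(dtype):
    """Return a DeepSpeed config dict enabling the given dtype."""
    cfg = json.loads(json.dumps(BASE))  # cheap deep copy
    if dtype == "fp16":
        cfg["fp16"] = {"enabled": True}
    elif dtype == "bf16":
        cfg["bf16"] = {"enabled": True}
    # fp32: no fp16/bf16 section -> full-precision training
    return cfg

configs = {d: make_config(d) for d in ("fp16", "bf16", "fp32")}
for name, cfg in configs.items():
    with open(f"ds_config_{name}.json", "w") as f:
        json.dump(cfg, f, indent=2)
```

Each generated file would then be passed to the Trainer via its `--deepspeed ds_config_<dtype>.json` argument, one launch per dtype.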
Actually, I forgot I developed a whole tool to do grid search / matrix-of-options runs:
The last line demonstrates an example of 6 variations it will run using the same base script: fp16/bf16/fp32 vs tf32 (on/off) = 3*2 = 6 variations. This may or may not be easier to use - not sure, but we have plenty of working out-of-the-box choices. I think you just need to find a resource allocation for that and we can set up these jobs very quickly. Then tensorboard all the results into the same base directory.
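The matrix expansion described above can be sketched as a simple cross product; the flag spellings and base script name below are hypothetical, not the tool's actual CLI:

```python
from itertools import product

# dtype x tf32 cross product -> the 6 variations mentioned above
dtypes = ["--fp16", "--bf16", "--fp32"]   # illustrative flag names
tf32_modes = ["--tf32 0", "--tf32 1"]

variations = [f"{d} {t}" for d, t in product(dtypes, tf32_modes)]

# one launch command per variation, all sharing the same base script
commands = [f"python run_clm.py {v}" for v in variations]  # hypothetical script
for cmd in commands:
    print(cmd)
```

Six commands come out of one base invocation, which is exactly the 3*2 arithmetic in the comment.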
Hi team, any updates?
Hey folks, is this still an active issue? I'm observing some differences in training between zero2 and zero3 using Llama models with the fixed rotary embedding cache init (#4932 (comment)). |
As you can see from the discussion you linked to, there will be no equivalency in that particular case of Llama-2 due to how the buffers are created. I urge you to file an issue with HF Transformers and ask them to distribute the correct buffers with the model weights rather than leaving them to be recalculated at model init time.
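A toy illustration of why buffers rebuilt at init time break bit-for-bit equivalency: if a cache like the rotary-embedding inverse frequencies is recomputed in whatever dtype the setup happens to use, the fp16 copy differs from the fp32 one. The formula follows the usual RoPE inv_freq definition; the `dim` and `base` values here are illustrative:

```python
import numpy as np

dim, base = 128, 10000.0

# inv_freq as typically computed for rotary embeddings (illustrative values)
exponents = np.arange(0, dim, 2) / dim
inv_freq_fp32 = (1.0 / base ** exponents).astype(np.float32)

# the same buffer after a round trip through half precision,
# as would happen if one setup rebuilds it in fp16
inv_freq_fp16 = inv_freq_fp32.astype(np.float16).astype(np.float32)

max_err = float(np.abs(inv_freq_fp32 - inv_freq_fp16).max())
identical = bool((inv_freq_fp32 == inv_freq_fp16).all())
```

Two runs that recreate this buffer under different dtypes therefore start from slightly different states, so their loss curves cannot be expected to match exactly; shipping the buffer with the checkpoint avoids the recomputation entirely.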
We have scripts to compare DeepSpeed's results with PyTorch.
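The core of such a comparison can be sketched as a tolerance check over per-step losses from the two runs; the tolerance values and the toy numbers below are assumptions, not what the actual scripts use:

```python
import math

def losses_match(ds_losses, ref_losses, rtol=1e-5, atol=1e-6):
    """True if every step's loss agrees within the given tolerances."""
    if len(ds_losses) != len(ref_losses):
        return False
    return all(
        math.isclose(a, b, rel_tol=rtol, abs_tol=atol)
        for a, b in zip(ds_losses, ref_losses)
    )

# Toy usage with fabricated loss values:
ref = [2.3025, 2.2987, 2.2940]        # plain PyTorch/DDP baseline
same = [2.3025, 2.2987, 2.2940]       # equivalent DeepSpeed run
drifted = [2.3025, 2.2987, 2.3100]    # run that diverged at step 3
```

For fp16/bf16 runs the tolerances would need to be loosened, since reduced precision makes exact agreement with an fp32 baseline unrealistic.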
Starting point: #966
Test matrix
@stas00