
EMoE Language Evaluation #3

Open

caichaoxiang opened this issue Nov 19, 2024 · 1 comment

@caichaoxiang
Hello, during EMoE's language training and testing process, when I run the test after training, the following is displayed:


['cola']
Namespace(adaptive_experts=False, add_expert_size=0, aux_loss_weight=0.01, cache_dir='./.cache', capacity_factor=1.5, checkpointing_steps=None, disable_peft=False, expert_repeat=1, gate_noise=1.0, gate_type='top', gradient_accumulation_steps=1, hub_model_id=None, hub_token=None, ignore_mismatched_sizes=False, include_training=False, is_gshard_loss=False, key_gate=False, learning_rates=[2e-05, 3e-05, 5e-05], load_model=None, lr_scheduler_type=<SchedulerType.LINEAR: 'linear'>, max_expert_num=8, max_length=128, max_train_steps=None, model_name_or_path='/MyData/bert-large-cased', moe_drop=0.1, moe_layers=[10, 11], normalize_one_score_gate=False, num_experts=16, num_train_epochs=10, num_warmup_steps=0, one_score=False, one_score_gate_update_momentum=0.0, output_dir='test', pad_to_max_length=False, per_device_eval_batch_size=32, per_device_train_batch_size=64, push_to_hub=False, random_cluster=False, random_init_gate=False, report_to='tensorboard', resume_from_checkpoint=None, save_model=False, seeds=[0, 1, 2], source_dir='/MyData/bert-large-cased_save/cola', task_name='cola', to_MoE=False, top_k=4, train_file=None, use_fp16=True, use_slow_tokenizer=False, validation_file=None, weight_decay=0.0, with_tracking=True) learn_gate_random_False_repeat16
test
No best results found

What is the problem?

As far as I can remember, I only changed the following in search_glue_no_trainer.py, line 544:
------------------------------------------------
data_collator = DataCollatorWithPadding(tokenizer, pad_to_multiple_of=(8 if accelerator.use_fp16 else None))
------------------------------------------------

There was an error (*** AttributeError: 'Accelerator' object has no attribute 'use_fp16'), so I changed it to:
------------------------------------------------
try:
    pad_to_multiple_of = 8 if accelerator.use_fp16 else None
except AttributeError:
    # Recent accelerate versions no longer expose Accelerator.use_fp16
    pad_to_multiple_of = None
data_collator = DataCollatorWithPadding(tokenizer, pad_to_multiple_of=pad_to_multiple_of)
------------------------------------------------
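A note in case it helps others: in newer accelerate releases the fp16 flag moved to the mixed_precision attribute (a string such as "no", "fp16", or "bf16"), so instead of the try/except one can probe both attributes. A sketch of that version-tolerant variant (not part of the EMoE code):
------------------------------------------------
# Works on both old and new accelerate: old versions expose use_fp16,
# newer ones expose mixed_precision; getattr falls back safely on each.
use_fp16 = (
    getattr(accelerator, "mixed_precision", "no") == "fp16"
    or getattr(accelerator, "use_fp16", False)
)
data_collator = DataCollatorWithPadding(tokenizer, pad_to_multiple_of=(8 if use_fp16 else None))
------------------------------------------------
Either way, this change only affects padding alignment, so it is unlikely to be the cause of the missing results below.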

@QAQdev (Collaborator) commented Nov 20, 2024

During training, EMoE runs a grid search over seeds (0, 1, 2) and learning rates (2e-5, 3e-5, 5e-5), and each combination produces a result. When the grid search ends, a txt file is saved to the output dir. You may see something like this:
[screenshot: results txt file written to the output directory]
The filename of this txt file contains the best learning rate found during training. In test_glue_no_trainer.py you should see the following lines of code, which extract the best lr from the filename of the txt file.
So to find the bug, I think you need to check whether the txt file is saved successfully during training.

[screenshot: snippet from test_glue_no_trainer.py that parses the best lr out of the txt filename]
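The screenshot is not reproduced here, but the logic it describes amounts to globbing the output dir for the results txt and pulling the lr out of its name. A minimal sketch, assuming a filename that embeds the lr like "..._2e-05.txt" (the exact pattern is whatever EMoE's training script writes; everything below is hypothetical):
------------------------------------------------
import glob
import os
import re

# Hypothetical sketch of the lookup described above: find the results txt
# saved by training and parse the best learning rate from its filename.
# The real pattern lives in test_glue_no_trainer.py and may differ.
output_dir = "test"  # matches output_dir in the args dump above
txt_files = glob.glob(os.path.join(output_dir, "*.txt"))
if not txt_files:
    # This is the "No best results found" case: training never wrote the file.
    raise FileNotFoundError(f"no results txt in {output_dir}")

match = re.search(r"(\d+(?:\.\d+)?e-\d+)", os.path.basename(txt_files[0]))
if match is None:
    raise ValueError(f"no lr found in filename {txt_files[0]}")
best_lr = float(match.group(1))  # e.g. "2e-05" -> 2e-05
------------------------------------------------
If that glob comes back empty, you get exactly the "No best results found" symptom reported above, which is why checking that training actually wrote the txt file is the right first step.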
