
EMoE Language Evaluation #3

Open

caichaoxiang opened this issue Nov 19, 2024 · 1 comment

@caichaoxiang
Hello, during EMoE's language training and testing process, when I run the test after training, the following is displayed:


['cola']
Namespace(adaptive_experts=False, add_expert_size=0, aux_loss_weight=0.01, cache_dir='./.cache', capacity_factor=1.5, checkpointing_steps=None, disable_peft=False, expert_repeat=1, gate_noise=1.0, gate_type='top', gradient_accumulation_steps=1, hub_model_id=None, hub_token=None, ignore_mismatched_sizes=False, include_training=False, is_gshard_loss=False, key_gate=False, learning_rates=[2e-05, 3e-05, 5e-05], load_model=None, lr_scheduler_type=<SchedulerType.LINEAR: 'linear'>, max_expert_num=8, max_length=128, max_train_steps=None, model_name_or_path='/MyData/bert-large-cased', moe_drop=0.1, moe_layers=[10, 11], normalize_one_score_gate=False, num_experts=16, num_train_epochs=10, num_warmup_steps=0, one_score=False, one_score_gate_update_momentum=0.0, output_dir='test', pad_to_max_length=False, per_device_eval_batch_size=32, per_device_train_batch_size=64, push_to_hub=False, random_cluster=False, random_init_gate=False, report_to='tensorboard', resume_from_checkpoint=None, save_model=False, seeds=[0, 1, 2], source_dir='/MyData/bert-large-cased_save/cola', task_name='cola', to_MoE=False, top_k=4, train_file=None, use_fp16=True, use_slow_tokenizer=False, validation_file=None, weight_decay=0.0, with_tracking=True) learn_gate_random_False_repeat16
test
No best results found

What is the problem?

As far as I can remember, I only changed the following in search_glue_no_trainer.py, line 544:
------------------------------------------------
data_collator = DataCollatorWithPadding(tokenizer, pad_to_multiple_of=(8 if accelerator.use_fp16 else None))
------------------------------------------------

There was an error (*** AttributeError: 'Accelerator' object has no attribute 'use_fp16'), so I changed it to:
------------------------------------------------
try:
    pad_to_multiple_of = 8 if accelerator.use_fp16 else None
except AttributeError:
    # Recent accelerate versions no longer expose Accelerator.use_fp16
    pad_to_multiple_of = None
data_collator = DataCollatorWithPadding(tokenizer, pad_to_multiple_of=pad_to_multiple_of)
------------------------------------------------
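A note in case it helps others: in newer accelerate releases the fp16 flag moved to the mixed_precision attribute (a string such as "no", "fp16", or "bf16"), so instead of the try/except one can probe both attributes. A sketch of that version-tolerant variant (not part of the EMoE code):
------------------------------------------------
# Works on both old and new accelerate: old versions expose use_fp16,
# newer ones expose mixed_precision; getattr falls back safely on each.
use_fp16 = (
    getattr(accelerator, "mixed_precision", "no") == "fp16"
    or getattr(accelerator, "use_fp16", False)
)
data_collator = DataCollatorWithPadding(tokenizer, pad_to_multiple_of=(8 if use_fp16 else None))
------------------------------------------------
Either way, this change only affects padding alignment, so it is unlikely to be the cause of the missing results below.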

@QAQdev (Collaborator) commented Nov 20, 2024

During training, EMoE runs a grid search over seeds (0, 1, 2) and learning rates (2e-5, 3e-5, 5e-5), and each combination produces a result. When the grid search ends, a txt file is saved to the output dir. You may see something like this:
[screenshot: results txt file written to the output directory]
The filename of this txt file contains the best learning rate found during training. In test_glue_no_trainer.py you should see the following lines of code, which extract the best lr from the filename of the txt file.
So to find the bug, I think you need to check whether the txt file is saved successfully during training.

[screenshot: snippet from test_glue_no_trainer.py that parses the best lr out of the txt filename]
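The screenshot is not reproduced here, but the logic it describes amounts to globbing the output dir for the results txt and pulling the lr out of its name. A minimal sketch, assuming a filename that embeds the lr like "..._2e-05.txt" (the exact pattern is whatever EMoE's training script writes; everything below is hypothetical):
------------------------------------------------
import glob
import os
import re

# Hypothetical sketch of the lookup described above: find the results txt
# saved by training and parse the best learning rate from its filename.
# The real pattern lives in test_glue_no_trainer.py and may differ.
output_dir = "test"  # matches output_dir in the args dump above
txt_files = glob.glob(os.path.join(output_dir, "*.txt"))
if not txt_files:
    # This is the "No best results found" case: training never wrote the file.
    raise FileNotFoundError(f"no results txt in {output_dir}")

match = re.search(r"(\d+(?:\.\d+)?e-\d+)", os.path.basename(txt_files[0]))
if match is None:
    raise ValueError(f"no lr found in filename {txt_files[0]}")
best_lr = float(match.group(1))  # e.g. "2e-05" -> 2e-05
------------------------------------------------
If that glob comes back empty, you get exactly the "No best results found" symptom reported above, which is why checking that training actually wrote the txt file is the right first step.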
