
Encounter RuntimeError while running with Apex #60

Closed
subercui opened this issue Mar 31, 2020 · 9 comments · Fixed by #142
Labels
bug Something isn't working

Comments

@subercui
Collaborator

Running with Apex via allennlp train configs/contrastive.jsonnet -s tmp --include-package t2t -o '{"trainer": {"opt_level": "O1"}}' raises the following exception:

Traceback (most recent call last):
  File "/h/haotian/.conda/envs/t2tCLR/bin/allennlp", line 11, in <module>
    load_entry_point('allennlp', 'console_scripts', 'allennlp')()
  File "/scratch/ssd001/home/haotian/Code/allennlp/allennlp/__main__.py", line 18, in run
    main(prog="allennlp")
  File "/scratch/ssd001/home/haotian/Code/allennlp/allennlp/commands/__init__.py", line 93, in main
    args.func(args)
  File "/scratch/ssd001/home/haotian/Code/allennlp/allennlp/commands/train.py", line 143, in train_model_from_args
    dry_run=args.dry_run,
  File "/scratch/ssd001/home/haotian/Code/allennlp/allennlp/commands/train.py", line 202, in train_model_from_file
    dry_run=dry_run,
  File "/scratch/ssd001/home/haotian/Code/allennlp/allennlp/commands/train.py", line 265, in train_model
    dry_run=dry_run,
  File "/scratch/ssd001/home/haotian/Code/allennlp/allennlp/commands/train.py", line 462, in _train_worker
    metrics = train_loop.run()
  File "/scratch/ssd001/home/haotian/Code/allennlp/allennlp/commands/train.py", line 521, in run
    return self.trainer.train()
  File "/scratch/ssd001/home/haotian/Code/allennlp/allennlp/training/trainer.py", line 687, in train
    train_metrics = self._train_epoch(epoch)
  File "/scratch/ssd001/home/haotian/Code/allennlp/allennlp/training/trainer.py", line 465, in _train_epoch
    batch_outputs = self.batch_outputs(batch, for_training=True)
  File "/scratch/ssd001/home/haotian/Code/allennlp/allennlp/training/trainer.py", line 380, in batch_outputs
    output_dict = self._pytorch_model(**batch)
  File "/h/haotian/.conda/envs/t2tCLR/lib/python3.7/site-packages/torch/nn/modules/module.py", line 532, in __call__
    result = self.forward(*input, **kwargs)
  File "/scratch/ssd001/home/haotian/Code/t2t/t2t/models/contrastive_text_encoder.py", line 122, in forward
    contrastive_loss = self._loss(embeddings, labels)
  File "/h/haotian/.conda/envs/t2tCLR/lib/python3.7/site-packages/torch/nn/modules/module.py", line 532, in __call__
    result = self.forward(*input, **kwargs)
  File "/h/haotian/.conda/envs/t2tCLR/lib/python3.7/site-packages/pytorch_metric_learning/losses/base_metric_loss_function.py", line 53, in forward
    loss = self.compute_loss(embeddings, labels, indices_tuple)
  File "/h/haotian/.conda/envs/t2tCLR/lib/python3.7/site-packages/pytorch_metric_learning/losses/generic_pair_loss.py", line 40, in compute_loss
    return self.loss_method(mat, labels, indices_tuple)
  File "/h/haotian/.conda/envs/t2tCLR/lib/python3.7/site-packages/pytorch_metric_learning/losses/generic_pair_loss.py", line 59, in pair_based_loss
    return self._compute_loss(pos_pair, neg_pair, indices_tuple)
  File "/h/haotian/.conda/envs/t2tCLR/lib/python3.7/site-packages/pytorch_metric_learning/losses/ntxent_loss.py", line 20, in _compute_loss
    max_val = torch.max(pos_pairs, torch.max(neg_pairs, dim=1, keepdim=True)[0])
RuntimeError: Expected object of scalar type Half but got scalar type Float for argument #2 'other' in call to _th_max
  0%|                                                                                                   | 0/1 [00:01<?, ?it/s]
@JohnGiorgi
Owner

JohnGiorgi commented Mar 31, 2020

Thanks @subercui, this error arises because of the PyTorch Metric Learning library. I opened an issue on Apex here but got no response :( Maybe you can open an issue on PyTorch Metric Learning?

@subercui
Collaborator Author

Thanks! I'll have a look.

@JohnGiorgi
Owner

JohnGiorgi commented Apr 4, 2020

I found a manual solution that works. Install PyTorch Metric Learning from source and change:

torch.max(neg_pairs, dim=1, keepdim=True)[0]

to

torch.max(neg_pairs, dim=1, keepdim=True)[0].half()

in NTXentLoss. Still, I think it makes sense to raise this issue on the PyTorch Metric Learning GitHub.
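A more dtype-agnostic version of that patch (my own sketch, not the library's actual code; pos_pairs and neg_pairs here are illustrative stand-ins for the tensors inside NTXentLoss._compute_loss) would cast to whatever dtype amp chose rather than hard-coding .half(), so the same line also works in full precision:

```python
import torch

# Illustrative tensors standing in for the loss internals: under amp O1,
# pos_pairs can arrive as float16 while neg_pairs is still float32.
pos_pairs = torch.randn(4, 1, dtype=torch.float16)
neg_pairs = torch.randn(4, 4, dtype=torch.float32)

# Cast the float32 row-max down to pos_pairs' dtype before comparing,
# instead of hard-coding .half(); in fp32 runs this cast is a no-op.
neg_max = torch.max(neg_pairs, dim=1, keepdim=True)[0].to(pos_pairs.dtype)
max_val = torch.max(pos_pairs, neg_max)
print(max_val.dtype)  # torch.float16
```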

@KevinMusgrave

I think this happens because I create infinity values using Python's float('inf'). I could add an optional half_precision flag to all loss functions and, if it's True, cast all values created with float() to PyTorch's half().

@JohnGiorgi
Owner

JohnGiorgi commented Jun 10, 2020

Ah, I think you are right. There's a discussion on this HF Transformers PR where they end up writing an assert for a similar scenario:

masked_bias = self.masked_bias.to(w.dtype)
assert masked_bias.item() != -float("inf"), "Make sure `self.masked_bias` is not `-inf` in fp16 mode"
w = torch.where(mask, w, masked_bias)

What about replacing float('inf') with a very large value instead (see here)? That way, amp can handle it automatically and there's no need for the user to specify half_precision (update: upon closer inspection of that issue, I'm not sure this will actually work).
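One concrete caveat with the large-constant approach (a sketch of my own, not from that PR): a float32 constant like -1e9 silently overflows to -inf once amp casts it to float16, reintroducing exactly the value the replacement was meant to avoid, so the constant has to fit in fp16's range:

```python
import torch

# A float32 "large negative" constant overflows when cast to float16,
# because fp16's range is only about +/-65504.
big_neg = torch.tensor(-1e9)
print(big_neg.half())  # tensor(-inf, dtype=torch.float16)

# The most negative representable float16 value survives the cast intact.
safe_neg = torch.tensor(torch.finfo(torch.float16).min)
print(safe_neg.half())  # tensor(-65504., dtype=torch.float16)
```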

@KevinMusgrave

KevinMusgrave commented Jun 10, 2020

At least for NTXentLoss, setting it to a large negative value (instead of float('-inf')) would be fine, because the purpose is to make particular entries 0 when passed to torch.exp. I'll have to check whether it makes sense for the other places where I use float().
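To illustrate that point (a sketch with made-up similarity values, not the library's code): any fill value negative enough to underflow under torch.exp behaves identically to -inf, and fp16's most negative value (about -65504) is already far past that threshold:

```python
import torch

# Pairwise similarities for three pairs; the middle one should be masked out.
sims = torch.tensor([1.0, 2.0, 3.0])
keep = torch.tensor([True, False, True])

# Fill with float16's most negative value instead of float('-inf'); it is
# representable under amp, and exp() still underflows it to exactly 0.
neg_fill = torch.finfo(torch.float16).min  # ~= -65504.0
masked = torch.where(keep, sims, torch.full_like(sims, neg_fill))
print(torch.exp(masked))  # masked entry becomes 0, same as with -inf
```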

@JohnGiorgi
Owner

Awesome, thanks for weighing in!

@JohnGiorgi JohnGiorgi added the bug Something isn't working label Jul 13, 2020
@JohnGiorgi JohnGiorgi pinned this issue Jul 26, 2020
@KevinMusgrave

v0.9.90.dev0 supports half precision:

pip install pytorch-metric-learning==0.9.90.dev0

@JohnGiorgi
Owner

@KevinMusgrave Awesome! Thanks a lot.
