Boosting not working #5

fulcus · 2022-11-13T16:53:17Z

Training for 50 epochs on CIFAR-10 with

OMP_NUM_THREADS=1 python -m torch.distributed.launch --nproc_per_node=1 train.py --num_workers 4 --batch_size 128 --epochs 50

and then boosting with

OMP_NUM_THREADS=1 python -m torch.distributed.launch --nproc_per_node=1 boost.py --num_workers 4 --batch_size 128 --dataset CIFAR-10 --resume ./save/checkpoint-50.pth

Throws the following error:

Traceback (most recent call last):                                                                                                                                          
  File "boost.py", line 333, in <module>                                                                                                                                    
    main(args)                                                                                                                                                              
  File "boost.py", line 273, in main                                                                                                                                        
    train_stats, pseudo_labels = boost_one_epoch(                                                                                                                           
  File "/home/gonzales/euroffice-clustering/euroffice_clustering/clustering/models/TCL/engine.py", line 165, in boost_one_epoch                                             
    z_i, z_j, c_j = model(x_w, x_s, return_ci=False)                                                                                                                        
  File "/home/gonzales/.cache/pypoetry/virtualenvs/euroffice-clustering-cj7YVFwh-py3.8/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1130, in _call_impl    
    return forward_call(*input, **kwargs)                                                                                                                                   
  File "/home/gonzales/.cache/pypoetry/virtualenvs/euroffice-clustering-cj7YVFwh-py3.8/lib/python3.8/site-packages/torch/nn/parallel/distributed.py", line 994, in forward  
    if torch.is_grad_enabled() and self.reducer._rebuild_buckets():                                                                                                         
RuntimeError: Expected to have finished reduction in the prior iteration before starting a new one. This error indicates that your module has parameters that were not used in producing loss. You can enable unused parameter detection by passing the keyword argument `find_unused_parameters=True` to `torch.nn.parallel.DistributedDataParallel`, and by making sure all `forward` function outputs participate in calculating loss.                                                                                                 
If you already have done the above, then the distributed data parallel module wasn't able to locate the output tensors in the return value of your module's `forward` function. Please include the loss function and the structure of the return value of `forward` of your module when reporting this issue (e.g. list, dict, iterable).               
Parameter indices which did not receive grad for rank 0: 118 119 120 121 122 123 124 125
In addition, you can set the environment variable TORCH_DISTRIBUTED_DEBUG to either INFO or DETAIL to print out information about which particular parameters did not receive gradient on this rank as part of this error

This occurs in the middle of the 2nd boosting epoch.
Full log: tcl_cifar_exception.txt

This first occurred while I was boosting on a custom dataset, so I tried on CIFAR to see if it was caused by the model itself or the dataset. I think #3 (comment) was referring to this too.

fulcus · 2022-11-13T17:00:38Z

It is worth adding that I modified the original code as described in XLearning-SCU/2022-IJCV-TCL#3

Yunfan-Li · 2022-11-14T07:30:39Z

Hi, it seems that in that batch no confident predictions were selected as pseudo labels. At line 367 of loss.py, the cluster loss is set to zero in that case. The cluster projector therefore receives no gradient, which leads to the above error. You could manually check the loss value.
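
To illustrate the failure mode described above, here is a minimal sketch of threshold-based pseudo-label selection. It is not the code from loss.py: the function name `select_pseudo_labels` and the threshold value are hypothetical, and the probabilities are made-up inputs. It only shows how a batch can end up with an empty selection, which is the case where the cluster loss would fall back to zero.

```python
from typing import List, Tuple


def select_pseudo_labels(
    probs: List[List[float]], threshold: float
) -> List[Tuple[int, int]]:
    """Return (sample_index, cluster_index) for every confident row.

    `probs` holds one row of cluster probabilities per sample.
    A row is confident when its largest probability reaches `threshold`.
    Hypothetical helper for illustration, not part of the TCL repository.
    """
    selected = []
    for idx, row in enumerate(probs):
        confidence = max(row)
        if confidence >= threshold:
            # Use the most probable cluster as the pseudo label.
            selected.append((idx, row.index(confidence)))
    return selected


# One confident sample: the batch contributes a pseudo label.
mixed_batch = [[0.01, 0.99], [0.60, 0.40]]
print(select_pseudo_labels(mixed_batch, threshold=0.9))

# No confident sample: the selection is empty, so a loss computed only
# over selected samples has nothing to work with for this batch.
uncertain_batch = [[0.55, 0.45], [0.60, 0.40]]
print(select_pseudo_labels(uncertain_batch, threshold=0.9))
```

Logging the size of this selection per batch is one way to confirm whether the error coincides with batches that contain no confident predictions.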

fulcus · 2022-11-14T15:46:16Z

Thank you for the quick response!

On CIFAR it might be caused by the low number of epochs (50 vs. the paper's 1000), but on my custom dataset I trained for over 1000 epochs and hit the same issue.
So my questions are:

Why are there no confident predictions?
Is there a way to tell whether my training yielded good enough predictions to get some pseudo-labels before running boost.py?

Yunfan-Li · 2022-11-15T01:37:06Z

What is the target cluster number of your custom dataset? As pointed out in the paper, when the cluster number is large, a sharper temperature in cluster-level loss is recommended.
If there is no ground-truth label for evaluation, you may start the boosting stage once a reasonable percentage of samples have confident predictions (e.g., 20% of samples having >0.9 confidence).
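
The rule of thumb above can be checked before running boost.py with a short sketch like the following. It assumes you have already collected the model's per-sample cluster probabilities into a list of rows; the helper names `confident_fraction` and `ready_for_boosting` are hypothetical, and the defaults of 0.9 and 20% are taken from the example figures in this comment rather than from the repository.

```python
from typing import List


def confident_fraction(probs: List[List[float]], threshold: float = 0.9) -> float:
    """Fraction of samples whose largest cluster probability exceeds `threshold`."""
    if not probs:
        return 0.0
    confident = sum(1 for row in probs if max(row) > threshold)
    return confident / len(probs)


def ready_for_boosting(
    probs: List[List[float]],
    threshold: float = 0.9,
    min_fraction: float = 0.2,
) -> bool:
    """Heuristic check: are enough samples confident to start boosting?"""
    return confident_fraction(probs, threshold) >= min_fraction


# Made-up predictions for five samples and two clusters.
predictions = [
    [0.95, 0.05],
    [0.60, 0.40],
    [0.50, 0.50],
    [0.08, 0.92],
    [0.70, 0.30],
]
print(confident_fraction(predictions))   # 2 of 5 samples exceed 0.9
print(ready_for_boosting(predictions))
```

If the fraction stays well below the chosen minimum, continuing the first training stage (or revisiting the temperature, as suggested above) is probably more useful than starting the boosting stage.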

IKOL111 · 2023-03-26T03:00:34Z

> Thank you for the quick response!
>
> On CIFAR it might be caused by the low number of epochs (50 vs. the paper's 1000), but on my custom dataset I trained for over 1000 epochs and hit the same issue. So my questions are:
>
> Why are there no confident predictions?
>
> Is there a way to tell whether my training yielded good enough predictions to get some pseudo-labels before running boost.py?

Hello, I have encountered the same problem. Have you solved it?

fulcus · 2023-03-26T17:13:58Z

Hi, sorry, but I haven't worked on it much since. After learning that the problem was low confidence combined with a high number of clusters, I simply reduced the number of clusters so that each one contains more samples. It worked pretty well.

IKOL111 · 2023-03-28T06:04:12Z

> Hi, sorry, but I haven't worked on it much since. After learning that the problem was low confidence combined with a high number of clusters, I simply reduced the number of clusters so that each one contains more samples. It worked pretty well.

Thank you for your reply. I ran into this problem after switching to my own dataset, which has only 10 categories. I tried modifying the confidence parameter in InstanceLossBoost, but that raised an error.
