[23:30:42.709323] number of params (M): 22.15
[23:30:42.709344] base lr: 1.000e-04
[23:30:42.709374] effective batch size: 128
[23:30:42.853795] Resume checkpoint ./save/cifar/checkpoint-50.pth
[23:30:42.968253] With optim!
[23:30:42.968960] Start training for 200 epochs
[23:30:45.798757] Epoch: [51] [ 0/468] eta: 0:22:03 pseudo_num: 34.0000 (34.0000) pseudo_cluster: 8.0000 (8.0000) time: 2.8288 data: 2.5033 max mem: 6391
[23:30:58.074426] Epoch: [51] [ 20/468] eta: 0:05:22 pseudo_num: 365.0000 (361.1905) pseudo_cluster: 10.0000 (9.9048) time: 0.6137 data: 0.2567 max mem: 6391
[23:31:10.550871] Epoch: [51] [ 40/468] eta: 0:04:47 pseudo_num: 985.0000 (671.1951) pseudo_cluster: 10.0000 (9.9512) time: 0.6238 data: 0.2685 max mem: 6391
[23:31:22.829548] Epoch: [51] [ 60/468] eta: 0:04:26 pseudo_num: 1606.0000 (986.4426) pseudo_cluster: 10.0000 (9.9672) time: 0.6138 data: 0.2682 max mem: 6391
[23:31:35.072337] Epoch: [51] [ 80/468] eta: 0:04:09 pseudo_num: 2246.0000 (1301.8519) pseudo_cluster: 10.0000 (9.9753) time: 0.6121 data: 0.2571 max mem: 6391
[23:31:47.057277] Epoch: [51] [100/468] eta: 0:03:53 pseudo_num: 2879.0000 (1618.2376) pseudo_cluster: 10.0000 (9.9802) time: 0.5992 data: 0.2294 max mem: 6391
[23:31:58.783524] Epoch: [51] [120/468] eta: 0:03:38 pseudo_num: 3503.0000 (1932.0165) pseudo_cluster: 10.0000 (9.9835) time: 0.5863 data: 0.2291 max mem: 6391
[23:32:10.246928] Epoch: [51] [140/468] eta: 0:03:23 pseudo_num: 4139.0000 (2246.8369) pseudo_cluster: 10.0000 (9.9858) time: 0.5731 data: 0.2178 max mem: 6391
[23:32:22.071636] Epoch: [51] [160/468] eta: 0:03:09 pseudo_num: 4774.0000 (2562.6087) pseudo_cluster: 10.0000 (9.9876) time: 0.5912 data: 0.2388 max mem: 6391
[23:32:34.284801] Epoch: [51] [180/468] eta: 0:02:57 pseudo_num: 5384.0000 (2876.8950) pseudo_cluster: 10.0000 (9.9890) time: 0.6106 data: 0.2487 max mem: 6391
[23:32:46.465618] Epoch: [51] [200/468] eta: 0:02:44 pseudo_num: 6022.0000 (3191.6368) pseudo_cluster: 10.0000 (9.9900) time: 0.6090 data: 0.2504 max mem: 6391
[23:32:58.409180] Epoch: [51] [220/468] eta: 0:02:31 pseudo_num: 6625.0000 (3504.1584) pseudo_cluster: 10.0000 (9.9910) time: 0.5971 data: 0.2292 max mem: 6391
[23:33:09.789549] Epoch: [51] [240/468] eta: 0:02:18 pseudo_num: 7262.0000 (3817.4398) pseudo_cluster: 10.0000 (9.9917) time: 0.5690 data: 0.2030 max mem: 6391
[23:33:21.668820] Epoch: [51] [260/468] eta: 0:02:06 pseudo_num: 7901.0000 (4131.2452) pseudo_cluster: 10.0000 (9.9923) time: 0.5939 data: 0.2484 max mem: 6391
[23:33:33.396374] Epoch: [51] [280/468] eta: 0:01:54 pseudo_num: 8506.0000 (4444.8505) pseudo_cluster: 10.0000 (9.9929) time: 0.5863 data: 0.2318 max mem: 6391
[23:33:45.460396] Epoch: [51] [300/468] eta: 0:01:41 pseudo_num: 9182.0000 (4760.3289) pseudo_cluster: 10.0000 (9.9934) time: 0.6032 data: 0.2329 max mem: 6391
[23:33:57.607401] Epoch: [51] [320/468] eta: 0:01:29 pseudo_num: 9828.0000 (5076.7040) pseudo_cluster: 10.0000 (9.9938) time: 0.6073 data: 0.2419 max mem: 6391
[23:34:09.618853] Epoch: [51] [340/468] eta: 0:01:17 pseudo_num: 10460.0000 (5393.3607) pseudo_cluster: 10.0000 (9.9941) time: 0.6005 data: 0.2397 max mem: 6391
[23:34:21.053351] Epoch: [51] [360/468] eta: 0:01:05 pseudo_num: 11112.0000 (5710.7258) pseudo_cluster: 10.0000 (9.9945) time: 0.5717 data: 0.2101 max mem: 6391
[23:34:32.607061] Epoch: [51] [380/468] eta: 0:00:53 pseudo_num: 11734.0000 (6027.9108) pseudo_cluster: 10.0000 (9.9948) time: 0.5776 data: 0.2136 max mem: 6391
[23:34:44.442817] Epoch: [51] [400/468] eta: 0:00:40 pseudo_num: 12345.0000 (6343.9451) pseudo_cluster: 10.0000 (9.9950) time: 0.5917 data: 0.2347 max mem: 6391
[23:34:56.127646] Epoch: [51] [420/468] eta: 0:00:28 pseudo_num: 12966.0000 (6659.6651) pseudo_cluster: 10.0000 (9.9952) time: 0.5842 data: 0.2197 max mem: 6391
[23:35:08.160455] Epoch: [51] [440/468] eta: 0:00:16 pseudo_num: 13609.0000 (6975.2290) pseudo_cluster: 10.0000 (9.9955) time: 0.6016 data: 0.2392 max mem: 6391
[23:35:20.498896] Epoch: [51] [460/468] eta: 0:00:04 pseudo_num: 14201.0000 (7289.3406) pseudo_cluster: 10.0000 (9.9957) time: 0.6169 data: 0.2561 max mem: 6391
[23:35:24.346152] Epoch: [51] [467/468] eta: 0:00:00 pseudo_num: 14416.0000 (7399.1474) pseudo_cluster: 10.0000 (9.9957) time: 0.5701 data: 0.2317 max mem: 6391
[23:35:24.412457] Epoch: [51] Total time: 0:04:41 (0.6014 s / it)
[23:35:24.443135] Averaged stats: pseudo_num: 14416.0000 (7399.1474) pseudo_cluster: 10.0000 (9.9957)
[23:35:28.697632] Epoch: [52] [ 0/468] eta: 0:33:10 pseudo_num: 14728.0000 (14728.0000) pseudo_cluster: 10.0000 (10.0000) loss_ins: 3.9020 (3.9020) loss_clu: 0.0069 (0.0069) time: 4.2533 data: 2.5101 max mem: 9859
[23:35:48.144208] Epoch: [52] [ 20/468] eta: 0:08:25 pseudo_num: 14757.0000 (14759.6667) pseudo_cluster: 10.0000 (10.0000) loss_ins: 3.8441 (3.8491) loss_clu: 0.0122 (0.0186) time: 0.9723 data: 0.0027 max mem: 9859
[23:36:07.576468] Epoch: [52] [ 40/468] eta: 0:07:30 pseudo_num: 14854.0000 (14807.3415) pseudo_cluster: 10.0000 (10.0000) loss_ins: 3.8614 (3.8524) loss_clu: 0.0113 (0.0202) time: 0.9716 data: 0.0022 max mem: 9859
[23:36:27.041225] Epoch: [52] [ 60/468] eta: 0:06:58 pseudo_num: 14951.0000 (14854.9180) pseudo_cluster: 10.0000 (10.0000) loss_ins: 3.8744 (3.8620) loss_clu: 0.0463 (0.0341) time: 0.9732 data: 0.0022 max mem: 9859
[23:36:46.550312] Epoch: [52] [ 80/468] eta: 0:06:33 pseudo_num: 15008.0000 (14892.3333) pseudo_cluster: 10.0000 (10.0000) loss_ins: 3.9696 (3.8980) loss_clu: 0.1252 (0.1246) time: 0.9754 data: 0.0022 max mem: 9859
[23:37:06.036749] Epoch: [52] [100/468] eta: 0:06:10 pseudo_num: 14736.0000 (14865.5743) pseudo_cluster: 10.0000 (10.0000) loss_ins: 4.4048 (4.0071) loss_clu: 0.7196 (0.2886) time: 0.9743 data: 0.0025 max mem: 9859
Traceback (most recent call last):
  File "boost.py", line 333, in <module>
    main(args)
  File "boost.py", line 273, in main
    train_stats, pseudo_labels = boost_one_epoch(
  File "/home/gonzales/euroffice-clustering/euroffice_clustering/clustering/models/TCL/engine.py", line 165, in boost_one_epoch
    z_i, z_j, c_j = model(x_w, x_s, return_ci=False)
  File "/home/gonzales/.cache/pypoetry/virtualenvs/euroffice-clustering-cj7YVFwh-py3.8/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1130, in _call_impl
    return forward_call(*input, **kwargs)
  File "/home/gonzales/.cache/pypoetry/virtualenvs/euroffice-clustering-cj7YVFwh-py3.8/lib/python3.8/site-packages/torch/nn/parallel/distributed.py", line 994, in forward
    if torch.is_grad_enabled() and self.reducer._rebuild_buckets():
RuntimeError: Expected to have finished reduction in the prior iteration before starting a new one. This error indicates that your module has parameters that were not used in producing loss. You can enable unused parameter detection by passing the keyword argument `find_unused_parameters=True` to `torch.nn.parallel.DistributedDataParallel`, and by making sure all `forward` function outputs participate in calculating loss.
If you already have done the above, then the distributed data parallel module wasn't able to locate the output tensors in the return value of your module's `forward` function. Please include the loss function and the structure of the return value of `forward` of your module when reporting this issue (e.g. list, dict, iterable).
Parameter indices which did not receive grad for rank 0: 118 119 120 121 122 123 124 125
In addition, you can set the environment variable TORCH_DISTRIBUTED_DEBUG to either INFO or DETAIL to print out information about which particular parameters did not receive gradient on this rank as part of this error
ERROR:torch.distributed.elastic.multiprocessing.api:failed (exitcode: 1) local_rank: 0 (pid: 78400) of binary: /home/gonzales/.cache/pypoetry/virtualenvs/euroffice-clustering-cj7YVFwh-py3.8/bin/python
Traceback (most recent call last):
  File "/usr/lib/python3.8/runpy.py", line 194, in _run_module_as_main
    return _run_code(code, main_globals, None,
  File "/usr/lib/python3.8/runpy.py", line 87, in _run_code
    exec(code, run_globals)
  File "/home/gonzales/.cache/pypoetry/virtualenvs/euroffice-clustering-cj7YVFwh-py3.8/lib/python3.8/site-packages/torch/distributed/launch.py", line 193, in <module>
    main()
  File "/home/gonzales/.cache/pypoetry/virtualenvs/euroffice-clustering-cj7YVFwh-py3.8/lib/python3.8/site-packages/torch/distributed/launch.py", line 189, in main
    launch(args)
  File "/home/gonzales/.cache/pypoetry/virtualenvs/euroffice-clustering-cj7YVFwh-py3.8/lib/python3.8/site-packages/torch/distributed/launch.py", line 174, in launch
    run(args)
  File "/home/gonzales/.cache/pypoetry/virtualenvs/euroffice-clustering-cj7YVFwh-py3.8/lib/python3.8/site-packages/torch/distributed/run.py", line 752, in run
    elastic_launch(
  File "/home/gonzales/.cache/pypoetry/virtualenvs/euroffice-clustering-cj7YVFwh-py3.8/lib/python3.8/site-packages/torch/distributed/launcher/api.py", line 131, in __call__
    return launch_agent(self._config, self._entrypoint, list(args))
  File "/home/gonzales/.cache/pypoetry/virtualenvs/euroffice-clustering-cj7YVFwh-py3.8/lib/python3.8/site-packages/torch/distributed/launcher/api.py", line 245, in launch_agent
    raise ChildFailedError(
torch.distributed.elastic.multiprocessing.errors.ChildFailedError:
============================================================
boost.py FAILED
------------------------------------------------------------
Failures:
------------------------------------------------------------
Root Cause (first observed failure):
[0]:
  time      : 2022-11-12_23:37:25
  host      : tesla
  rank      : 0 (local_rank: 0)
  exitcode  : 1 (pid: 78400)
  error_file:
  traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html
============================================================
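
The failing call in the application traceback is `z_i, z_j, c_j = model(x_w, x_s, return_ci=False)`, so during this boost phase at least one head of the model produces nothing that feeds the loss, and the DDP reducer (which reports parameter indices 118-125 without gradients on rank 0) aborts at the start of the next iteration. The snippet below is only a sketch of the first workaround the error message itself suggests, constructing DistributedDataParallel with `find_unused_parameters=True`; it is not the project's actual boost.py code, and the names `model` and `local_rank` are assumed placeholders.

# Sketch only, not the actual boost.py setup: illustrates the
# find_unused_parameters=True workaround named in the RuntimeError above.
# `model` and `local_rank` stand in for whatever boost.py really uses.
import torch
from torch.nn.parallel import DistributedDataParallel as DDP

def wrap_for_boost(model: torch.nn.Module, local_rank: int) -> DDP:
    """Wrap the model so the DDP reducer tolerates parameters that receive
    no gradient in a given iteration (e.g. a head skipped when
    return_ci=False)."""
    model = model.cuda(local_rank)
    return DDP(
        model,
        device_ids=[local_rank],
        # Makes DDP traverse the autograd graph every step and mark
        # parameters that did not contribute to the loss as ready, instead
        # of raising "Expected to have finished reduction ...". This adds
        # some overhead per iteration.
        find_unused_parameters=True,
    )

The alternative, as the error message also says, is to keep the default and make every parameter participate: either ensure all `forward` outputs enter the loss, or freeze the unused head's parameters with `requires_grad_(False)` before the model is wrapped in DDP. Re-running with TORCH_DISTRIBUTED_DEBUG=DETAIL, as the message notes, reports which parameters (indices 118-125 here) received no gradient, which makes it easier to choose between the two fixes.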