Helmet model training process using CPU "killed" #134

tiansiyuan · 2024-01-06T22:10:43Z

Describe the bug

In notebook terminal:

$ python train.py --device cpu
github: skipping check (not a git repository)
YOLOv5 🚀 9d6a4aa torch 1.8.1+cpu CPU

Namespace(adam=False, artifact_alias='latest', batch_size=32, bbox_interval=-1, bucket='', cache_images=False, cfg='models/yolov5s_hat.yaml', data='data/hat.yaml', device='cpu', entity=None, epochs=50, evolve=False, exist_ok=False, global_rank=-1, hyp='data/hyp.scratch.yaml', image_weights=False, img_size=[640, 640], label_smoothing=0.0, linear_lr=False, local_rank=-1, multi_scale=False, name='exp', noautoanchor=False, nosave=False, notest=False, project='runs/train', quad=False, rect=False, resume=False, save_dir='runs/train/exp', save_period=-1, single_cls=False, sync_bn=False, total_batch_size=32, upload_dataset=False, weights='yolov5s.pt', workers=8, world_size=1)
tensorboard: Start with 'tensorboard --logdir runs/train', view at http://localhost:6006/
hyperparameters: lr0=0.01, lrf=0.2, momentum=0.937, weight_decay=0.0005, warmup_epochs=3.0, warmup_momentum=0.8, warmup_bias_lr=0.1, box=0.05, cls=0.5, cls_pw=1.0, obj=1.0, obj_pw=1.0, iou_t=0.2, anchor_t=4.0, fl_gamma=0.0, hsv_h=0.015, hsv_s=0.7, hsv_v=0.4, degrees=0.0, translate=0.1, scale=0.5, shear=0.0, perspective=0.0, flipud=0.0, fliplr=0.5, mosaic=1.0, mixup=0.0
wandb: Install Weights & Biases for YOLOv5 logging with 'pip install wandb' (recommended)

                 from  n    params  module                                  arguments                     
  0                -1  1      3520  models.common.Focus                     [3, 32, 3]                    
  1                -1  1     18560  models.common.Conv                      [32, 64, 3, 2]                
  2                -1  1     18816  models.common.C3                        [64, 64, 1]                   
  3                -1  1     73984  models.common.Conv                      [64, 128, 3, 2]               
  4                -1  1    156928  models.common.C3                        [128, 128, 3]                 
  5                -1  1    295424  models.common.Conv                      [128, 256, 3, 2]              
  6                -1  1    625152  models.common.C3                        [256, 256, 3]                 
  7                -1  1   1180672  models.common.Conv                      [256, 512, 3, 2]              
  8                -1  1    656896  models.common.SPP                       [512, 512, [5, 9, 13]]        
  9                -1  1   1182720  models.common.C3                        [512, 512, 1, False]          
 10                -1  1    131584  models.common.Conv                      [512, 256, 1, 1]              
 11                -1  1         0  torch.nn.modules.upsampling.Upsample    [None, 2, 'nearest']          
 12           [-1, 6]  1         0  models.common.Concat                    [1]                           
 13                -1  1    361984  models.common.C3                        [512, 256, 1, False]          
 14                -1  1     33024  models.common.Conv                      [256, 128, 1, 1]              
 15                -1  1         0  torch.nn.modules.upsampling.Upsample    [None, 2, 'nearest']          
 16           [-1, 4]  1         0  models.common.Concat                    [1]                           
 17                -1  1     90880  models.common.C3                        [256, 128, 1, False]          
 18                -1  1    147712  models.common.Conv                      [128, 128, 3, 2]              
 19          [-1, 14]  1         0  models.common.Concat                    [1]                           
 20                -1  1    296448  models.common.C3                        [256, 256, 1, False]          
 21                -1  1    590336  models.common.Conv                      [256, 256, 3, 2]              
 22          [-1, 10]  1         0  models.common.Concat                    [1]                           
 23                -1  1   1182720  models.common.C3                        [512, 512, 1, False]          
 24      [17, 20, 23]  1     18879  models.yolo.Detect                      [2, [[10, 13, 16, 30, 33, 23], [30, 61, 62, 45, 59, 119], [116, 90, 156, 198, 373, 326]], [128, 256, 512]]
Model Summary: 283 layers, 7066239 parameters, 7066239 gradients, 16.5 GFLOPS

Transferred 308/362 items from yolov5s.pt
Scaled weight_decay = 0.0005
Optimizer groups: 62 .bias, 62 conv.weight, 59 other
train: Scanning 'VOCdevkit/labels/train' images and labels... 5912 found, 0 missing, 13 empty, 0 corrupted:  78%|██████████████████▋     | 5912/7578 [00:02<00:00, 3088.14it/s]/opt/conda/lib/python3.8/site-packages/PIL/TiffImagePlugin.py:845: UserWarning: Corrupt EXIF data.  Expecting to read 4 bytes but only got 0. 
  warnings.warn(str(msg))
train: Scanning 'VOCdevkit/labels/train' images and labels... 7578 found, 0 missing, 13 empty, 0 corrupted: 100%|████████████████████████| 7578/7578 [00:02<00:00, 2789.12it/s]
train: New cache created: VOCdevkit/labels/train.cache
val: Scanning 'VOCdevkit/labels/val' images and labels... 5297 found, 0 missing, 8 empty, 0 corrupted: 100%|██████████████████████████████| 5297/5297 [00:06<00:00, 831.43it/s]
val: New cache created: VOCdevkit/labels/val.cache
Plotting labels... 

autoanchor: Analyzing anchors... anchors/target = 4.25, Best Possible Recall (BPR) = 0.9999
Image sizes 640 train, 640 test
Using 8 dataloader workers
Logging results to runs/train/exp
Starting training for 50 epochs...

     Epoch   gpu_mem       box       obj       cls     total    labels  img_size
      0/49        0G   0.08902   0.08044   0.01565    0.1851       678       640: 100%|██████████████████████████████████████████████████████| 237/237 [34:19<00:00,  8.69s/it]
               Class      Images      Labels           P           R      mAP@.5  mAP@.5:.95:  88%|██████████████████████████████████████▋     | 73/83 [05:31<00:59,  5.96s/it]
Killed

Reproduction steps

Open a terminal in the notebook.
Run python train.py --device cpu

...

Expected behavior

Training completes successfully.

Additional context

No response

tiansiyuan · 2024-01-10T06:51:59Z

This is most probably caused by a timeout.

A more specific error message would be helpful.
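
One way to get a more specific message is to launch the training script from a small wrapper and inspect how it ended. A bare "Killed" line is what a shell typically prints when a child process is terminated by SIGKILL, so the training script itself gets no chance to print a traceback. The sketch below is a minimal illustration, assuming a POSIX system; `describe_exit` and `run_and_report` are hypothetical helper names and are not part of YOLOv5. It reports which signal ended the process, but it cannot tell who sent it (for example a platform timeout or the kernel's out-of-memory killer).

```python
import signal
import subprocess
import sys
from typing import List


def describe_exit(returncode: int) -> str:
    """Turn a subprocess return code into a readable message."""
    if returncode == 0:
        return "exited normally"
    if returncode < 0:
        # On POSIX, a negative return code -N means the child
        # was terminated by signal N.
        try:
            name = signal.Signals(-returncode).name
        except ValueError:
            name = f"signal {-returncode}"
        return f"terminated by {name}"
    return f"exited with status {returncode}"


def run_and_report(cmd: List[str]) -> str:
    """Run cmd, wait for it to finish, and describe how it ended."""
    result = subprocess.run(cmd)
    return describe_exit(result.returncode)


# Example (assumes train.py is in the current directory):
# print(run_and_report([sys.executable, "train.py", "--device", "cpu"]))
```

If the wrapper reports SIGKILL, the next place to look is outside the script, such as the notebook platform's resource limits or the system log.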

tiansiyuan added the bug label Jan 6, 2024
