Fix HUB session with DDP training #13103
Codecov Report
Attention: Patch coverage is
Additional details and impacted files
@@ Coverage Diff @@
## main #13103 +/- ##
===========================================
- Coverage 70.40% 35.53% -34.88%
===========================================
Files 124 124
Lines 15905 15886 -19
===========================================
- Hits 11198 5645 -5553
- Misses 4707 10241 +5534
Flags with carried forward coverage won't be shown. View full report in Codecov by Sentry. |
@glenn-jocher I'm not really familiar with the whole workflow between HUB and ultralytics, but I figured we can load a model directly from HUB, so I had to keep some of the hub-session code in ultralytics/ultralytics/engine/model.py Lines 136 to 140 in 654c37f
https://github.com/ultralytics/ultralytics/blob/654c37f09bc3b1e9d182a6f4ea315616bf14c643/ultralytics/engine/model.py#L180-185 |
@glenn-jocher I also tested on my local multi-GPU machine and it seems to work properly, i.e. it tries to create a hub-session in DDP training. That's as far as I'm able to test, since I don't have a HUB environment or account (I'm guessing it's the same as wandb, which needs an account). |
Got it, thanks @Laughing-q! @Burhan-Q can you test this PR for DDP training from HUB and then also from Ultralytics to HUB? |
@glenn-jocher @Laughing-q @sergiuwaxmann this worked with a model created from HUB. Here's the post-training printout of the training arguments:
model.session.train_args
>>> {
'batch': -1,
'cache': 'ram',
'data': 'coco128.yaml',
'device': [0, 1], # DDP enabled
'epochs': 10,
'imgsz': 640,
'patience': 50,
'time': None
}
Local log
from ultralytics import YOLO, hub
hub.login(API_KEY)
>>> Ultralytics HUB: New authentication successful ✅
>>> True
model = YOLO('https://hub.ultralytics.com/models/oljNmlCqCllzTUL5Jwwj')
results = model.train()
Ultralytics YOLOv8.2.20 🚀 Python-3.10.12 torch-2.2.0+cu121 CUDA:0 (NVIDIA A100-SXM4-80GB, 81051MiB)
CUDA:1 (NVIDIA A100-SXM4-80GB, 81051MiB)
engine/trainer: task=detect, mode=train, model=yolov8s.pt, data=coco128.yaml, epochs=10, time=None, patience=50, batch=-1, imgsz=640, save=True, save_period=-1, cache=ram, device=[0, 1], workers=8, project=None, name=train, exist_ok=False, pretrained=True, optimizer=auto, verbose=True, seed=0, deterministic=True, single_cls=False, rect=False, cos_lr=False, close_mosaic=10, resume=False, amp=True, fraction=1.0, profile=False, freeze=None, multi_scale=False, overlap_mask=True, mask_ratio=4, dropout=0.0, val=True, split=val, save_json=False, save_hybrid=False, conf=None, iou=0.7, max_det=300, half=False, dnn=False, plots=True, source=None, vid_stride=1, stream_buffer=False, visualize=False, augment=False, agnostic_nms=False, classes=None, retina_masks=False, embed=None, show=False, save_frames=False, save_txt=False, save_conf=False, save_crop=False, show_labels=True, show_conf=True, show_boxes=True, line_width=None, format=torchscript, keras=False, optimize=False, int8=False, dynamic=False, simplify=False, opset=None, workspace=4, nms=False, lr0=0.01, lrf=0.01, momentum=0.937, weight_decay=0.0005, warmup_epochs=3.0, warmup_momentum=0.8, warmup_bias_lr=0.1, box=7.5, cls=0.5, dfl=1.5, pose=12.0, kobj=1.0, label_smoothing=0.0, nbs=64, hsv_h=0.015, hsv_s=0.7, hsv_v=0.4, degrees=0.0, translate=0.1, scale=0.5, shear=0.0, perspective=0.0, flipud=0.0, fliplr=0.5, bgr=0.0, mosaic=1.0, mixup=0.0, copy_paste=0.0, auto_augment=randaugment, erasing=0.4, crop_fraction=1.0, cfg=None, tracker=botsort.yaml, save_dir=ultralytics/runs/detect/train
from n params module arguments
0 -1 1 928 ultralytics.nn.modules.conv.Conv [3, 32, 3, 2]
1 -1 1 18560 ultralytics.nn.modules.conv.Conv [32, 64, 3, 2]
2 -1 1 29056 ultralytics.nn.modules.block.C2f [64, 64, 1, True]
3 -1 1 73984 ultralytics.nn.modules.conv.Conv [64, 128, 3, 2]
4 -1 2 197632 ultralytics.nn.modules.block.C2f [128, 128, 2, True]
5 -1 1 295424 ultralytics.nn.modules.conv.Conv [128, 256, 3, 2]
6 -1 2 788480 ultralytics.nn.modules.block.C2f [256, 256, 2, True]
7 -1 1 1180672 ultralytics.nn.modules.conv.Conv [256, 512, 3, 2]
8 -1 1 1838080 ultralytics.nn.modules.block.C2f [512, 512, 1, True]
9 -1 1 656896 ultralytics.nn.modules.block.SPPF [512, 512, 5]
10 -1 1 0 torch.nn.modules.upsampling.Upsample [None, 2, 'nearest']
11 [-1, 6] 1 0 ultralytics.nn.modules.conv.Concat [1]
12 -1 1 591360 ultralytics.nn.modules.block.C2f [768, 256, 1]
13 -1 1 0 torch.nn.modules.upsampling.Upsample [None, 2, 'nearest']
14 [-1, 4] 1 0 ultralytics.nn.modules.conv.Concat [1]
15 -1 1 148224 ultralytics.nn.modules.block.C2f [384, 128, 1]
16 -1 1 147712 ultralytics.nn.modules.conv.Conv [128, 128, 3, 2]
17 [-1, 12] 1 0 ultralytics.nn.modules.conv.Concat [1]
18 -1 1 493056 ultralytics.nn.modules.block.C2f [384, 256, 1]
19 -1 1 590336 ultralytics.nn.modules.conv.Conv [256, 256, 3, 2]
20 [-1, 9] 1 0 ultralytics.nn.modules.conv.Concat [1]
21 -1 1 1969152 ultralytics.nn.modules.block.C2f [768, 512, 1]
22 [15, 18, 21] 1 2147008 ultralytics.nn.modules.head.Detect [80, [128, 256, 512]]
Model summary: 225 layers, 11166560 parameters, 11166544 gradients, 28.8 GFLOPs
Transferred 355/355 items from pretrained weights
WARNING ⚠️ 'batch=-1' for AutoBatch is incompatible with Multi-GPU training, setting default 'batch=16'
DDP: debug command /home/burhan/ultra_repo/.ultra/bin/python -m torch.distributed.run --nproc_per_node 2 --master_port 54685 /home/burhan/.config/Ultralytics/DDP/_temp_o135mjif139927424095984.py
WARNING:__main__:
*****************************************
Setting OMP_NUM_THREADS environment variable for each process to be 1 in default, to avoid your system being overloaded, please further tune the variable for optimal performance in your application as needed.
*****************************************
Ultralytics YOLOv8.2.20 🚀 Python-3.10.12 torch-2.2.0+cu121 CUDA:0 (NVIDIA A100-SXM4-80GB, 81051MiB)
CUDA:1 (NVIDIA A100-SXM4-80GB, 81051MiB)
Ultralytics HUB: View model at https://hub.ultralytics.com/models/bqwlsS8e96a9fzpRATuW 🚀
Transferred 355/355 items from pretrained weights
Freezing layer 'model.22.dfl.conv.weight'
AMP: running Automatic Mixed Precision (AMP) checks with YOLOv8n...
AMP: checks passed ✅
train: Scanning /home/shared/datasets/coco128/labels/train2017.cache... 126 images, 2 backgrounds, 0 corrupt: 100%|ββββββββββ| 128/128 [00:00<?, ?it/s]
train: Caching images (0.1GB RAM): 100%|ββββββββββ| 128/128 [00:00<00:00, 1948.18it/s]
val: Scanning /home/shared/datasets/coco128/labels/train2017.cache... 126 images, 2 backgrounds, 0 corrupt: 100%|ββββββββββ| 128/128 [00:00<?, ?it/s]
val: Caching images (0.1GB RAM): 100%|ββββββββββ| 128/128 [00:00<00:00, 984.03it/s]
Plotting labels to ultralytics/runs/detect/train/labels.jpg...
optimizer: 'optimizer=auto' found, ignoring 'lr0=0.01' and 'momentum=0.937' and determining best 'optimizer', 'lr0' and 'momentum' automatically...
optimizer: AdamW(lr=0.000714, momentum=0.9) with parameter groups 57 weight(decay=0.0), 64 weight(decay=0.0005), 63 bias(decay=0.0)
Image sizes 640 train, 640 val
Using 16 dataloader workers
Logging results to ultralytics/runs/detect/train
Starting training for 10 epochs...
Closing dataloader mosaic
Epoch GPU_mem box_loss cls_loss dfl_loss Instances Size
1/10 2.37G 1.215 1.451 1.245 42 640: 100%|ββββββββββ| 8/8 [00:02<00:00, 3.45it/s]
Class Images Instances Box(P R mAP50 mAP50-95): 100%|ββββββββββ| 8/8 [00:01<00:00, 6.85it/s]
all 128 929 0.757 0.684 0.76 0.588
Epoch GPU_mem box_loss cls_loss dfl_loss Instances Size
2/10 2.43G 1.21 1.482 1.245 48 640: 100%|ββββββββββ| 8/8 [00:00<00:00, 8.90it/s]
Class Images Instances Box(P R mAP50 mAP50-95): 100%|ββββββββββ| 8/8 [00:00<00:00, 15.82it/s]
all 128 929 0.748 0.665 0.764 0.58
Epoch GPU_mem box_loss cls_loss dfl_loss Instances Size
3/10 2.44G 1.122 1.112 1.147 48 640: 100%|ββββββββββ| 8/8 [00:00<00:00, 9.47it/s]
Class Images Instances Box(P R mAP50 mAP50-95): 100%|ββββββββββ| 8/8 [00:00<00:00, 15.54it/s]
all 128 929 0.711 0.692 0.777 0.591
Epoch GPU_mem box_loss cls_loss dfl_loss Instances Size
4/10 2.44G 0.9762 0.9867 1.098 62 640: 100%|ββββββββββ| 8/8 [00:00<00:00, 9.60it/s]
Class Images Instances Box(P R mAP50 mAP50-95): 100%|ββββββββββ| 8/8 [00:00<00:00, 15.94it/s]
all 128 929 0.771 0.71 0.797 0.623
Epoch GPU_mem box_loss cls_loss dfl_loss Instances Size
5/10 2.49G 0.9425 0.9654 1.071 68 640: 100%|ββββββββββ| 8/8 [00:00<00:00, 9.25it/s]
Class Images Instances Box(P R mAP50 mAP50-95): 100%|ββββββββββ| 8/8 [00:00<00:00, 16.00it/s]
all 128 929 0.762 0.734 0.802 0.621
Epoch GPU_mem box_loss cls_loss dfl_loss Instances Size
6/10 2.44G 1.026 0.899 1.084 31 640: 100%|ββββββββββ| 8/8 [00:00<00:00, 9.77it/s]
Class Images Instances Box(P R mAP50 mAP50-95): 100%|ββββββββββ| 8/8 [00:00<00:00, 15.68it/s]
all 128 929 0.826 0.733 0.809 0.629
Epoch GPU_mem box_loss cls_loss dfl_loss Instances Size
7/10 2.48G 0.8806 0.8252 1.058 45 640: 100%|ββββββββββ| 8/8 [00:00<00:00, 9.26it/s]
Class Images Instances Box(P R mAP50 mAP50-95): 100%|ββββββββββ| 8/8 [00:00<00:00, 15.98it/s]
all 128 929 0.799 0.768 0.818 0.638
Epoch GPU_mem box_loss cls_loss dfl_loss Instances Size
8/10 2.47G 0.8754 0.818 1.025 49 640: 100%|ββββββββββ| 8/8 [00:00<00:00, 9.14it/s]
Class Images Instances Box(P R mAP50 mAP50-95): 100%|ββββββββββ| 8/8 [00:00<00:00, 15.87it/s]
all 128 929 0.844 0.722 0.827 0.649
Epoch GPU_mem box_loss cls_loss dfl_loss Instances Size
9/10 2.49G 0.9863 0.8815 1.152 36 640: 100%|ββββββββββ| 8/8 [00:00<00:00, 9.62it/s]
Class Images Instances Box(P R mAP50 mAP50-95): 100%|ββββββββββ| 8/8 [00:00<00:00, 15.99it/s]
all 128 929 0.883 0.72 0.836 0.663
Epoch GPU_mem box_loss cls_loss dfl_loss Instances Size
10/10 2.45G 0.98 0.8182 1.072 39 640: 100%|ββββββββββ| 8/8 [00:00<00:00, 10.12it/s]
Class Images Instances Box(P R mAP50 mAP50-95): 100%|ββββββββββ| 8/8 [00:00<00:00, 15.65it/s]
all 128 929 0.867 0.736 0.841 0.666
10 epochs completed in 0.006 hours.
Optimizer stripped from ultralytics/runs/detect/train/weights/last.pt, 22.6MB
Optimizer stripped from ultralytics/runs/detect/train/weights/best.pt, 22.6MB
Validating ultralytics/runs/detect/train/weights/best.pt...
Ultralytics YOLOv8.2.20 🚀 Python-3.10.12 torch-2.2.0+cu121 CUDA:0 (NVIDIA A100-SXM4-80GB, 81051MiB)
CUDA:1 (NVIDIA A100-SXM4-80GB, 81051MiB)
Model summary (fused): 168 layers, 11156544 parameters, 0 gradients, 28.6 GFLOPs
Class Images Instances Box(P R mAP50 mAP50-95): 100%|ββββββββββ| 8/8 [00:02<00:00, 3.27it/s]
all 128 929 0.869 0.738 0.841 0.666
person 128 254 0.964 0.632 0.853 0.652
bicycle 128 6 0.801 0.333 0.537 0.312
car 128 46 1 0.306 0.591 0.311
motorcycle 128 5 0.917 1 0.995 0.89
airplane 128 6 0.964 1 0.995 0.907
bus 128 7 1 0.791 0.995 0.883
train 128 3 0.962 1 0.995 0.808
truck 128 12 0.961 0.5 0.69 0.434
boat 128 6 0.878 0.667 0.791 0.522
traffic light 128 14 1 0.249 0.356 0.285
stop sign 128 2 0.898 1 0.995 0.848
bench 128 9 1 0.733 0.833 0.603
bird 128 16 1 0.881 0.995 0.707
cat 128 4 0.926 1 0.995 0.891
dog 128 9 0.891 0.908 0.984 0.819
horse 128 2 0.896 1 0.995 0.75
elephant 128 17 1 0.923 0.955 0.795
bear 128 1 0.695 1 0.995 0.895
zebra 128 4 0.936 1 0.995 0.959
giraffe 128 9 1 0.956 0.995 0.85
backpack 128 6 1 0.703 0.837 0.594
umbrella 128 18 0.821 0.889 0.95 0.704
handbag 128 19 0.821 0.421 0.598 0.44
tie 128 7 1 0.802 0.858 0.618
suitcase 128 4 1 0.813 0.995 0.616
frisbee 128 5 0.89 0.8 0.804 0.684
skis 128 1 0.781 1 0.995 0.796
snowboard 128 7 0.569 0.714 0.763 0.579
sports ball 128 6 1 0.554 0.67 0.442
kite 128 10 0.89 0.3 0.583 0.296
baseball bat 128 4 0.668 0.25 0.565 0.43
baseball glove 128 7 0.935 0.429 0.439 0.3
skateboard 128 5 0.623 1 0.938 0.59
tennis racket 128 7 0.745 0.571 0.607 0.401
bottle 128 18 1 0.353 0.752 0.499
wine glass 128 16 1 0.485 0.703 0.501
cup 128 36 0.849 0.806 0.857 0.58
fork 128 6 0.802 0.333 0.644 0.478
knife 128 16 0.856 0.625 0.79 0.584
spoon 128 22 0.9 0.411 0.636 0.484
bowl 128 28 0.9 0.821 0.872 0.707
banana 128 1 0.732 1 0.995 0.995
sandwich 128 2 0.863 1 0.995 0.995
orange 128 4 0.556 0.75 0.702 0.524
broccoli 128 11 0.616 0.295 0.558 0.382
carrot 128 24 0.838 0.647 0.874 0.647
hot dog 128 2 0.876 1 0.995 0.995
pizza 128 5 0.967 1 0.995 0.904
donut 128 14 0.695 1 0.936 0.862
cake 128 4 0.944 1 0.995 0.905
chair 128 35 0.765 0.559 0.742 0.545
couch 128 6 0.816 0.749 0.852 0.708
potted plant 128 14 0.873 0.929 0.955 0.781
bed 128 3 0.908 1 0.995 0.94
dining table 128 13 1 0.757 0.854 0.738
toilet 128 2 0.965 1 0.995 0.896
tv 128 2 0.898 1 0.995 0.799
laptop 128 3 0.932 1 0.995 0.907
mouse 128 2 0.588 0.5 0.545 0.413
remote 128 8 1 0.638 0.861 0.662
cell phone 128 8 1 0.542 0.63 0.442
microwave 128 3 0.74 1 0.995 0.952
oven 128 5 0.76 0.6 0.665 0.479
sink 128 6 0.901 0.5 0.783 0.658
refrigerator 128 5 0.935 1 0.995 0.8
book 128 29 0.708 0.241 0.625 0.418
clock 128 9 0.89 0.9 0.973 0.818
vase 128 2 0.631 1 0.995 0.995
scissors 128 1 1 0 0.995 0.219
teddy bear 128 21 0.838 0.81 0.862 0.625
toothbrush 128 5 0.734 1 0.995 0.836
Speed: 0.1ms preprocess, 3.3ms inference, 0.0ms loss, 3.1ms postprocess per image
Results saved to ultralytics/runs/detect/train
Ultralytics HUB: Syncing final model...
100%|ββββββββββ| 21.5M/21.5M [00:01<00:00, 12.3MB/s]
Ultralytics HUB: Done ✅
Ultralytics HUB: View model at https://hub.ultralytics.com/models/bqwlsS8e96a9fzpRATuW 🚀 |
Note: This PR is related to ultralytics/hub#695 and ultralytics/hub#606 |
I attempted launching a local training while logged into my HUB account (both with and without DDP), but HUB logging doesn't appear to work in either case on this branch.
from ultralytics import YOLO, hub
hub.login(API_KEY)
>>> Ultralytics HUB: New authentication successful ✅
>>> True
model = YOLO("yolov8s-pose.pt")
results = model.train(data="coco8-pose.yaml", epochs=10, device=6)
No model was uploaded after training completed. |
Does this work properly in single-GPU mode on the main branch? |
@Laughing-q yes it does work when I switch to |
I tested out a modification to the
if SETTINGS["hub"] and self.hub_session is None:
    # Create a model in HUB
    try:
        from ultralytics.hub.session import HUBTrainingSession
        session = HUBTrainingSession(self.args.model)
        self.hub_session = session if session.client.authenticated else self.hub_session
        if self.hub_session:
            self.hub_session.create_model(self.args)
            # Check model was created
            if not self.hub_session.model:
                self.hub_session = None
    except (PermissionError, ModuleNotFoundError):
        # Ignore PermissionError and ModuleNotFoundError which indicates hub-sdk not installed
        pass
With these changes, I suddenly get lots of these in the training log
Train log
from ultralytics import YOLO, hub
hub.login(API_KEY)
>>> Ultralytics HUB: New authentication successful ✅
>>> True
model = YOLO("yolov8s-seg.pt")
result = model.train(data="coco8-seg.yaml", epochs=10, device=3)
Ultralytics YOLOv8.2.20 🚀 Python-3.10.12 torch-2.2.0+cu121 CUDA:3 (NVIDIA A100-SXM4-80GB, 81051MiB)
engine/trainer: task=segment, mode=train, model=yolov8s-seg.pt, data=coco8-seg.yaml, epochs=10, time=None, patience=100, batch=16, imgsz=640, save=True, save_period=-1, cache=False, device=3, workers=8, project=None, name=train6, exist_ok=False, pretrained=True, optimizer=auto, verbose=True, seed=0, deterministic=True, single_cls=False, rect=False, cos_lr=False, close_mosaic=10, resume=False, amp=True, fraction=1.0, profile=False, freeze=None, multi_scale=False, overlap_mask=True, mask_ratio=4, dropout=0.0, val=True, split=val, save_json=False, save_hybrid=False, conf=None, iou=0.7, max_det=300, half=False, dnn=False, plots=True, source=None, vid_stride=1, stream_buffer=False, visualize=False, augment=False, agnostic_nms=False, classes=None, retina_masks=False, embed=None, show=False, save_frames=False, save_txt=False, save_conf=False, save_crop=False, show_labels=True, show_conf=True, show_boxes=True, line_width=None, format=torchscript, keras=False, optimize=False, int8=False, dynamic=False, simplify=False, opset=None, workspace=4, nms=False, lr0=0.01, lrf=0.01, momentum=0.937, weight_decay=0.0005, warmup_epochs=3.0, warmup_momentum=0.8, warmup_bias_lr=0.1, box=7.5, cls=0.5, dfl=1.5, pose=12.0, kobj=1.0, label_smoothing=0.0, nbs=64, hsv_h=0.015, hsv_s=0.7, hsv_v=0.4, degrees=0.0, translate=0.1, scale=0.5, shear=0.0, perspective=0.0, flipud=0.0, fliplr=0.5, bgr=0.0, mosaic=1.0, mixup=0.0, copy_paste=0.0, auto_augment=randaugment, erasing=0.4, crop_fraction=1.0, cfg=None, tracker=botsort.yaml, save_dir=/home/burhan/tests/ultralytics/runs/segment/train6
from n params module arguments
0 -1 1 928 ultralytics.nn.modules.conv.Conv [3, 32, 3, 2]
1 -1 1 18560 ultralytics.nn.modules.conv.Conv [32, 64, 3, 2]
2 -1 1 29056 ultralytics.nn.modules.block.C2f [64, 64, 1, True]
3 -1 1 73984 ultralytics.nn.modules.conv.Conv [64, 128, 3, 2]
4 -1 2 197632 ultralytics.nn.modules.block.C2f [128, 128, 2, True]
5 -1 1 295424 ultralytics.nn.modules.conv.Conv [128, 256, 3, 2]
6 -1 2 788480 ultralytics.nn.modules.block.C2f [256, 256, 2, True]
7 -1 1 1180672 ultralytics.nn.modules.conv.Conv [256, 512, 3, 2]
8 -1 1 1838080 ultralytics.nn.modules.block.C2f [512, 512, 1, True]
9 -1 1 656896 ultralytics.nn.modules.block.SPPF [512, 512, 5]
10 -1 1 0 torch.nn.modules.upsampling.Upsample [None, 2, 'nearest']
11 [-1, 6] 1 0 ultralytics.nn.modules.conv.Concat [1]
12 -1 1 591360 ultralytics.nn.modules.block.C2f [768, 256, 1]
13 -1 1 0 torch.nn.modules.upsampling.Upsample [None, 2, 'nearest']
14 [-1, 4] 1 0 ultralytics.nn.modules.conv.Concat [1]
15 -1 1 148224 ultralytics.nn.modules.block.C2f [384, 128, 1]
16 -1 1 147712 ultralytics.nn.modules.conv.Conv [128, 128, 3, 2]
17 [-1, 12] 1 0 ultralytics.nn.modules.conv.Concat [1]
18 -1 1 493056 ultralytics.nn.modules.block.C2f [384, 256, 1]
19 -1 1 590336 ultralytics.nn.modules.conv.Conv [256, 256, 3, 2]
20 [-1, 9] 1 0 ultralytics.nn.modules.conv.Concat [1]
21 -1 1 1969152 ultralytics.nn.modules.block.C2f [768, 512, 1]
22 [15, 18, 21] 1 2801504 ultralytics.nn.modules.head.Segment [80, 32, 128, [128, 256, 512]]
YOLOv8s-seg summary: 261 layers, 11821056 parameters, 11821040 gradients, 42.9 GFLOPs
Transferred 417/417 items from pretrained weights
2024-05-24 09:13:41,106 - hub_sdk.helpers.logger - ERROR - Unknown error occurred.
2024-05-24 09:13:41,109 - hub_sdk.helpers.logger - ERROR - Received no response from the server while creating the model.
Freezing layer 'model.22.dfl.conv.weight'
AMP: running Automatic Mixed Precision (AMP) checks with YOLOv8n...
AMP: checks passed ✅
train: Scanning /home/shared/datasets/coco8-seg/labels/train.cache... 4 images, 0 backgrounds, 0 corrupt: 100%|ββββββββββ| 4/4 [00:00<?, ?it/s]
val: Scanning /home/shared/datasets/coco8-seg/labels/val.cache... 4 images, 0 backgrounds, 0 corrupt: 100%|ββββββββββ| 4/4 [00:00<?, ?it/s]
Plotting labels to /home/burhan/tests/ultralytics/runs/segment/train6/labels.jpg...
optimizer: 'optimizer=auto' found, ignoring 'lr0=0.01' and 'momentum=0.937' and determining best 'optimizer', 'lr0' and 'momentum' automatically...
optimizer: AdamW(lr=0.000119, momentum=0.9) with parameter groups 66 weight(decay=0.0), 77 weight(decay=0.0005), 76 bias(decay=0.0)
Image sizes 640 train, 640 val
Using 8 dataloader workers
Logging results to /home/burhan/tests/ultralytics/runs/segment/train6
Starting training for 10 epochs...
Closing dataloader mosaic
Epoch GPU_mem box_loss seg_loss cls_loss dfl_loss Instances Size
1/10 1.52G 0.9476 2.702 1.978 1.283 13 640: 100%|ββββββββββ| 1/1 [00:00<00:00, 1.31it/s]
Class Images Instances Box(P R mAP50 mAP50-95) Mask(P R mAP50 mAP50-95): 100%|ββββββββββ| 1/1 [00:00<00:00, 4.19it/s]
all 4 17 0.822 0.898 0.94 0.679 0.822 0.898 0.939 0.592
Epoch GPU_mem box_loss seg_loss cls_loss dfl_loss Instances Size
2/10 1.54G 0.9262 2.695 2.596 1.247 13 640: 100%|ββββββββββ| 1/1 [00:00<00:00, 9.71it/s]
Class Images Instances Box(P R mAP50 mAP50-95) Mask(P R mAP50 mAP50-95): 100%|ββββββββββ| 1/1 [00:00<00:00, 14.64it/s]
all 4 17 0.832 0.905 0.941 0.68 0.832 0.905 0.941 0.6
Epoch GPU_mem box_loss seg_loss cls_loss dfl_loss Instances Size
3/10 1.57G 0.8526 2.626 2.046 1.234 13 640: 100%|ββββββββββ| 1/1 [00:00<00:00, 8.52it/s]
Class Images Instances Box(P R mAP50 mAP50-95) Mask(P R mAP50 mAP50-95): 100%|ββββββββββ| 1/1 [00:00<00:00, 17.86it/s]
all 4 17 0.838 0.913 0.942 0.68 0.838 0.913 0.935 0.599
Epoch GPU_mem box_loss seg_loss cls_loss dfl_loss Instances Size
4/10 1.57G 1.143 3.088 2.44 1.395 13 640: 100%|ββββββββββ| 1/1 [00:00<00:00, 8.65it/s]
Class Images Instances Box(P R mAP50 mAP50-95) Mask(P R mAP50 mAP50-95): 100%|ββββββββββ| 1/1 [00:00<00:00, 23.26it/s]
all 4 17 0.834 0.913 0.94 0.672 0.834 0.913 0.939 0.601
Epoch GPU_mem box_loss seg_loss cls_loss dfl_loss Instances Size
5/10 1.55G 1.009 2.831 2.98 1.271 13 640: 100%|ββββββββββ| 1/1 [00:00<00:00, 3.74it/s]
Class Images Instances Box(P R mAP50 mAP50-95) Mask(P R mAP50 mAP50-95): 0%| | 0/1 [00:00<?, ?it/s]2024-05-24 09:13:50,859 - hub_sdk.helpers.logger - ERROR - Unknown error occurred.
Ultralytics HUB: Received no response from the request. If this issue persists please visit https://github.com/ultralytics/hub/issues for assistance.
Class Images Instances Box(P R mAP50 mAP50-95) Mask(P R mAP50 mAP50-95): 100%|ββββββββββ| 1/1 [00:00<00:00, 17.41it/s]
all 4 17 0.836 0.912 0.941 0.683 0.836 0.912 0.932 0.599
Epoch GPU_mem box_loss seg_loss cls_loss dfl_loss Instances Size
6/10 1.59G 1.268 3.083 2.17 1.654 13 640: 100%|ββββββββββ| 1/1 [00:00<00:00, 5.34it/s]
Class Images Instances Box(P R mAP50 mAP50-95) Mask(P R mAP50 mAP50-95): 100%|ββββββββββ| 1/1 [00:00<00:00, 22.36it/s]
all 4 17 0.844 0.912 0.942 0.684 0.844 0.912 0.934 0.599
Epoch GPU_mem box_loss seg_loss cls_loss dfl_loss Instances Size
7/10 1.67G 0.7353 2.692 1.904 1.155 13 640: 100%|ββββββββββ| 1/1 [00:00<00:00, 9.19it/s]
Class Images Instances Box(P R mAP50 mAP50-95) Mask(P R mAP50 mAP50-95): 100%|ββββββββββ| 1/1 [00:00<00:00, 22.59it/s]
all 4 17 0.889 0.911 0.942 0.684 0.889 0.911 0.934 0.602
2024-05-24 09:13:52,195 - hub_sdk.helpers.logger - ERROR - Unknown error occurred.
Ultralytics HUB: Received no response from the request. If this issue persists please visit https://github.com/ultralytics/hub/issues for assistance.
Epoch GPU_mem box_loss seg_loss cls_loss dfl_loss Instances Size
8/10 1.67G 1.062 2.927 2.253 1.209 13 640: 100%|ββββββββββ| 1/1 [00:00<00:00, 7.85it/s]
Class Images Instances Box(P R mAP50 mAP50-95) Mask(P R mAP50 mAP50-95): 100%|ββββββββββ| 1/1 [00:00<00:00, 22.27it/s]
all 4 17 0.846 0.913 0.942 0.674 0.846 0.913 0.934 0.587
Epoch GPU_mem box_loss seg_loss cls_loss dfl_loss Instances Size
9/10 1.65G 1.024 2.655 2.2 1.32 13 640: 100%|ββββββββββ| 1/1 [00:00<00:00, 7.58it/s]
Class Images Instances Box(P R mAP50 mAP50-95) Mask(P R mAP50 mAP50-95): 100%|ββββββββββ| 1/1 [00:00<00:00, 23.05it/s]
all 4 17 0.873 0.915 0.942 0.675 0.873 0.915 0.934 0.587
Epoch GPU_mem box_loss seg_loss cls_loss dfl_loss Instances Size
10/10 1.65G 0.6717 1.964 1.453 1.056 13 640: 100%|ββββββββββ| 1/1 [00:00<00:00, 7.38it/s]
Class Images Instances Box(P R mAP50 mAP50-95) Mask(P R mAP50 mAP50-95): 100%|ββββββββββ| 1/1 [00:00<00:00, 22.08it/s]
all 4 17 0.863 0.925 0.942 0.687 0.863 0.925 0.934 0.603
2024-05-24 09:13:54,152 - hub_sdk.helpers.logger - ERROR - Unknown error occurred.
Ultralytics HUB: Received no response from the request. If this issue persists please visit https://github.com/ultralytics/hub/issues for assistance.
10 epochs completed in 0.002 hours.
2024-05-24 09:13:54,537 - hub_sdk.helpers.logger - ERROR - Unknown error occurred.
Ultralytics HUB: Received no response from the request. If this issue persists please visit https://github.com/ultralytics/hub/issues for assistance.
Optimizer stripped from /home/burhan/tests/ultralytics/runs/segment/train6/weights/last.pt, 23.9MB
Optimizer stripped from /home/burhan/tests/ultralytics/runs/segment/train6/weights/best.pt, 23.9MB
Validating /home/burhan/tests/ultralytics/runs/segment/train6/weights/best.pt...
Ultralytics YOLOv8.2.20 🚀 Python-3.10.12 torch-2.2.0+cu121 CUDA:3 (NVIDIA A100-SXM4-80GB, 81051MiB)
YOLOv8s-seg summary (fused): 195 layers, 11810560 parameters, 0 gradients, 42.6 GFLOPs
Class Images Instances Box(P R mAP50 mAP50-95) Mask(P R mAP50 mAP50-95): 100%|ββββββββββ| 1/1 [00:00<00:00, 26.21it/s]
2024-05-24 09:13:55,403 - hub_sdk.helpers.logger - ERROR - Unknown error occurred.
Ultralytics HUB: Received no response from the request. If this issue persists please visit https://github.com/ultralytics/hub/issues for assistance.
all 4 17 0.863 0.924 0.942 0.671 0.863 0.924 0.934 0.587
person 4 10 0.844 0.547 0.678 0.339 0.844 0.547 0.629 0.308
dog 4 1 0.737 1 0.995 0.895 0.737 1 0.995 0.895
horse 4 2 0.903 1 0.995 0.65 0.903 1 0.995 0.226
elephant 4 2 0.946 1 0.995 0.448 0.946 1 0.995 0.4
umbrella 4 1 0.75 1 0.995 0.895 0.75 1 0.995 0.895
potted plant 4 1 1 1 0.995 0.796 1 1 0.995 0.796
2024-05-24 09:13:57,840 - hub_sdk.helpers.logger - ERROR - Unknown error occurred.
Ultralytics HUB: Received no response from the request. If this issue persists please visit https://github.com/ultralytics/hub/issues for assistance.
2024-05-24 09:13:58,935 - hub_sdk.helpers.logger - ERROR - Unknown error occurred.
Ultralytics HUB: Received no response from the request. If this issue persists please visit https://github.com/ultralytics/hub/issues for assistance.
Speed: 0.1ms preprocess, 2.2ms inference, 0.0ms loss, 0.8ms postprocess per image
Results saved to /home/burhan/tests/ultralytics/runs/segment/train6
2024-05-24 09:14:00,430 - hub_sdk.helpers.logger - ERROR - Unknown error occurred.
Ultralytics HUB: Received no response from the request. If this issue persists please visit https://github.com/ultralytics/hub/issues for assistance.
Ultralytics HUB: Syncing final model...
2024-05-24 09:14:01,795 - hub_sdk.helpers.logger - ERROR - Unknown error occurred.
Ultralytics HUB: Received no response from the request. If this issue persists please visit https://github.com/ultralytics/hub/issues for assistance.
2024-05-24 09:14:02,201 - hub_sdk.helpers.logger - ERROR - Unknown error occurred.
Ultralytics HUB: Received no response from the request. If this issue persists please visit https://github.com/ultralytics/hub/issues for assistance.
59%|ββββββ | 13.5M/22.8M [00:00<00:00, 15.2MB/s]2024-05-24 09:14:04,112 - hub_sdk.helpers.logger - ERROR - Unknown error occurred.
Ultralytics HUB: Received no response from the request. If this issue persists please visit https://github.com/ultralytics/hub/issues for assistance.
100%|ββββββββββ| 22.8M/22.8M [00:01<00:00, 14.2MB/s]
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
File "/home/burhan/ultra_repo/ultralytics/ultralytics/engine/model.py", line 660, in train
self.trainer.train()
File "/home/burhan/ultra_repo/ultralytics/ultralytics/engine/trainer.py", line 205, in train
self._do_train(world_size)
File "/home/burhan/ultra_repo/ultralytics/ultralytics/engine/trainer.py", line 468, in _do_train
self.run_callbacks("on_train_end")
File "/home/burhan/ultra_repo/ultralytics/ultralytics/engine/trainer.py", line 165, in run_callbacks
callback(self)
File "/home/burhan/ultra_repo/ultralytics/ultralytics/utils/callbacks/hub.py", line 69, in on_train_end
LOGGER.info(f"{PREFIX}Done ✅\n" f"{PREFIX}View model at {session.model_url} 🚀")
AttributeError: 'HUBTrainingSession' object has no attribute 'model_url'
>>> 2024-05-24 09:14:08,415 - hub_sdk.helpers.logger - ERROR - Unknown error occurred.
Ultralytics HUB: Received no response from the request. If this issue persists please visit https://github.com/ultralytics/hub/issues for assistance. |
@glenn-jocher @Burhan-Q @sergiuwaxmann I think I found a bug in the hub-sdk package that affects HUB logging for local training. ultralytics/ultralytics/hub/session.py Line 96 in 654c37f
This means HUB logging fails when we pass an argument that overrides cache (True or False), i.e. launching a local training on the main branch with the following script won't produce any logging on HUB:
from ultralytics import YOLO, hub
hub.login(API_KEY)
model = YOLO("yolov8s-pose.pt")
results = model.train(data="coco8-pose.yaml", epochs=10, cache=True)
Meanwhile it throws this HUB error log:
2024-05-24 22:06:46,208 - hub_sdk.helpers.logger - ERROR - Unknown error occurred.
2024-05-24 22:06:46,210 - hub_sdk.helpers.logger - ERROR - Received no response from the server while creating the model
Based on this case, and the fact that I updated the parameter of passing ultralytics/ultralytics/engine/trainer.py Line 785 in 570f894
By default, cache equals False in self.args, hence no HUB logging. It seems to me the root cause is the cache issue in the hub-sdk package.
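To make the failure mode described above concrete, here is a minimal sketch of one possible workaround: coercing the cache value to a string before it is sent anywhere. The helper name `cache_as_str` and the `"ram"` default are assumptions for illustration only; the thread does not confirm which value types the HUB backend accepts, nor that this is the fix that was adopted.

```python
def cache_as_str(train_args: dict, default: str = "ram") -> str:
    """Return the cache setting as a string, whatever type the caller passed.

    Hypothetical helper: the default of "ram" and the idea that a string is
    acceptable downstream are assumptions, not confirmed hub-sdk behaviour.
    """
    return str(train_args.get("cache", default))


if __name__ == "__main__":
    # Booleans, strings and a missing key all end up as plain strings.
    for args in ({}, {"cache": True}, {"cache": False}, {"cache": "disk"}):
        print(args, "->", repr(cache_as_str(args)))
```

The point of the sketch is only that a bool override and a string override would then share a single type at the boundary.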
|
I could easily fix the issue in this PR by excluding |
@Laughing-q @Burhan-Q @sergiuwaxmann Strange. I think this PR might need some extra study; if we rush a solution we might just end up with more bugs. What we need is a solution that will log to HUB correctly when training from both:
We should really strive to implement the logging identically to W&B, which works correctly with our callbacks in all scenarios. I don't have much time today but I will look into this this weekend. |
@glenn-jocher I resolved the conflicts and eliminated the FYI I used to consider updating |
@Laughing-q wow this is great, much simpler! |
ultralytics 8.2.41
fix HUB session with DDP training
Signed-off-by: Glenn Jocher <[email protected]>
@Laughing-q I'm getting errors when training DDP from HUB to local. I also see dataset download issues: the download appears to happen twice, so I think we need to autodownload datasets only on RANK -1, 0, but that might be a separate issue. |
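A minimal sketch of the rank gate suggested above, assuming the launcher (for example `torch.distributed.run`) exports a `RANK` environment variable and that an unset variable means a non-distributed run, written as -1. The function names are hypothetical and this is not the code that was merged.

```python
import os


def current_rank() -> int:
    """Read the process rank from the environment; -1 means not distributed."""
    return int(os.getenv("RANK", -1))


def should_download(rank=None) -> bool:
    """Allow dataset auto-download only on rank -1 (single process) or rank 0."""
    rank = current_rank() if rank is None else rank
    return rank in {-1, 0}


if __name__ == "__main__":
    for r in (-1, 0, 1, 2):
        print(f"rank {r}: download allowed = {should_download(r)}")
```

A gate like this only stops the extra downloads; the other ranks still need some way to wait until the data is actually on disk.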
Signed-off-by: Glenn Jocher <[email protected]> Co-authored-by: UltralyticsAssistant <[email protected]>
@Laughing-q I've been testing this some more and something strange is happening on dataset download from HUB (both single and multi-GPU), where coco8 is not unzipping correctly to ../datasets/coco8, it's unzipping to ../datasets/coco8/coco8. I'm going to merge this PR and try to figure out what's happening on the dataset unzip issue. |
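The cause of the nested `coco8/coco8` directory is not established in this thread. As an illustration only, the sketch below shows one common way such nesting arises (an archive that already contains a top-level folder being extracted into a folder of the same name) and one way to avoid it. The helper `unzip_dataset` is hypothetical and is not the Ultralytics unzip routine.

```python
import tempfile
import zipfile
from pathlib import Path


def unzip_dataset(zip_path: Path, datasets_root: Path) -> Path:
    """Extract an archive so the data always ends up at datasets_root/<name>.

    If the archive already wraps everything in a single top-level folder
    called <name>, extract into datasets_root; otherwise extract into
    datasets_root/<name>. Either way there is no <name>/<name> nesting.
    """
    name = zip_path.stem
    with zipfile.ZipFile(zip_path) as zf:
        top_level = {m.split("/", 1)[0] for m in zf.namelist() if m.strip("/")}
        target = datasets_root if top_level == {name} else datasets_root / name
        target.mkdir(parents=True, exist_ok=True)
        zf.extractall(target)
    return datasets_root / name


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        tmp = Path(tmp)
        archive = tmp / "coco8.zip"
        with zipfile.ZipFile(archive, "w") as zf:
            zf.writestr("coco8/images/a.txt", "x")  # archive with its own top folder
        out = unzip_dataset(archive, tmp / "datasets")
        print(sorted(p.relative_to(tmp).as_posix() for p in out.rglob("*")))
```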
ultralytics 8.2.41
fix HUB session with DDP training
Signed-off-by: Glenn Jocher <[email protected]>
Co-authored-by: Glenn Jocher <[email protected]>
Co-authored-by: Burhan <[email protected]>
Co-authored-by: Ultralytics Assistant <[email protected]>
Co-authored-by: UltralyticsAssistant <[email protected]>
🛠️ PR Summary
Made with ❤️ by Ultralytics Actions
🌟 Summary
Improved distributed training and HUB session handling in the Ultralytics training workflow. 🛠️
📊 Key Changes
torch_distributed_zero_first decorator to ensure all distributed processes wait for the main process to complete certain tasks.
torch_distributed_zero_first to prevent multiple auto-downloads of the dataset in distributed settings.
🎯 Purpose & Impact
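For readers unfamiliar with the zero-first pattern named under Key Changes, the sketch below shows the idea without depending on PyTorch. The `zero_first` context manager and the injected `barrier` callable are hypothetical stand-ins; in a real DDP job the barrier would typically be `torch.distributed.barrier`, and the library's own `torch_distributed_zero_first` may differ in detail.

```python
import threading
from contextlib import contextmanager


@contextmanager
def zero_first(rank: int, barrier):
    """Let rank 0 (or a non-distributed run, rank -1) execute the body first."""
    distributed = rank != -1
    if distributed and rank != 0:
        barrier()  # other ranks block here until rank 0 reaches its barrier
    yield
    if distributed and rank == 0:
        barrier()  # rank 0 arrives only after finishing the body, releasing the rest


def demo() -> list:
    """Simulate two ranks with threads and record the order the bodies run in."""
    sync = threading.Barrier(2)
    order = []

    def worker(rank: int) -> None:
        with zero_first(rank, sync.wait):
            order.append(rank)  # stands in for "download the dataset"

    threads = [threading.Thread(target=worker, args=(r,)) for r in (1, 0)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return order


if __name__ == "__main__":
    print(demo())
```

Because rank 1 cannot pass its barrier until rank 0 has finished the body, work such as a dataset download runs once on rank 0 and the other ranks then find the result already in place.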