Our ToL hierarchical GUI region detection model is built on mmdetection. We fine-tuned DINO with a customized configuration on the Android Screen Hierarchical Layout (ASHL) dataset and ran inference on the Screen Point-and-Read (ScreenPR) benchmark. This guide covers environment setup, training, and inference.
Prepare the mmdetection environment from our cloned source code.
- Step 1: Install MMEngine and MMCV using MIM
pip install -U openmim
mim install mmengine
mim install "mmcv>=2.0.0"
- Step 2: Install MMDetection from our source repository
cd <the root of repo tol_gui_region_detection>
pip install -v -e . -r requirements/tracking.txt
- Step 3: Install the extra components needed to sync results to wandb.ai:
pip install future tensorboard
pip install wandb
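After the three installation steps, it can help to confirm that the packages are visible to the active Python environment. The following is a minimal sketch using only the standard library; the distribution names in `REQUIRED` are assumptions based on the install commands above and may need adjusting for your setup.

```python
from importlib import metadata

# Assumed distribution names, taken from the install commands above.
REQUIRED = ["mmengine", "mmcv", "mmdet", "wandb", "tensorboard"]


def installed_versions(packages):
    """Map each package name to its installed version, or None if missing."""
    versions = {}
    for name in packages:
        try:
            versions[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            versions[name] = None
    return versions


if __name__ == "__main__":
    # Print one line per package so missing installs are easy to spot.
    for name, version in installed_versions(REQUIRED).items():
        print(f"{name:12s} {version or 'MISSING'}")
```

Any package reported as `MISSING` points back to the step that should have installed it.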
- Step 1 [Optional]: Prepare the training data in COCO style using the migration script configs/dino/convert_mobile_segement_to_multilabel_coco.py. This assumes the training data has been placed in the ../data/screendata folder. Because the generated files configs/dino/data/train/annotation_multilabel_coco.json and configs/dino/data/val/annotation_multilabel_coco.json are already included in our source code, you can skip this step unless you need a configuration different from ours.
cd configs/dino/
python convert_mobile_segement_to_multilabel_coco.py
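For readers unfamiliar with the COCO detection layout that the migration script targets, here is a rough sketch of how such an annotation file is structured. It is not the repository's conversion code: the input format, the helper `to_coco`, and the category names `global` and `local` are hypothetical and chosen only for illustration. The one fixed point is that COCO stores boxes as `[x, y, width, height]`.

```python
import json


def to_coco(images, categories):
    """Build a COCO-style detection dict (illustrative input format).

    images: list of dicts with 'file_name', 'width', 'height', and 'boxes',
            where each box is (x1, y1, x2, y2, category_name).
    categories: ordered list of category names.
    """
    cat_ids = {name: i + 1 for i, name in enumerate(categories)}
    coco = {
        "images": [],
        "annotations": [],
        "categories": [{"id": i, "name": n} for n, i in cat_ids.items()],
    }
    ann_id = 1
    for img_id, img in enumerate(images, start=1):
        coco["images"].append({
            "id": img_id,
            "file_name": img["file_name"],
            "width": img["width"],
            "height": img["height"],
        })
        for x1, y1, x2, y2, name in img["boxes"]:
            w, h = x2 - x1, y2 - y1  # corner format -> COCO width/height
            coco["annotations"].append({
                "id": ann_id,
                "image_id": img_id,
                "category_id": cat_ids[name],
                "bbox": [x1, y1, w, h],
                "area": w * h,
                "iscrowd": 0,
            })
            ann_id += 1
    return coco


# Hypothetical screenshot with two nested regions.
example = to_coco(
    [{"file_name": "screen_0.png", "width": 1080, "height": 1920,
      "boxes": [(0, 0, 1080, 200, "global"), (40, 60, 500, 160, "local")]}],
    ["global", "local"],
)
print(json.dumps(example["annotations"][1]))
```

The category list written to the JSON should match the class names declared in the model configuration, otherwise label indices will not line up at evaluation time.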
- Step 2: Use ./tools/dist_train_custom_multi_bbox.sh to train the model on multiple GPUs with the ResNet-50 backbone. The model configuration file is configs/dino/dino-4scale_r50_8xb2-90e_mobile_multi_bbox.py. We used 4 * A6000 GPUs; edit dist_train_custom_multi_bbox.sh to match your own machine settings.
Run the following script to train on 4 * A6000:
# distributed training
./tools/dist_train_custom_multi_bbox.sh configs/dino/dino-4scale_r50_8xb2-90e_mobile_multi_bbox.py 4
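The `8xb2` token in the configuration name follows the mmdetection naming convention of 8 GPUs with 2 samples per GPU, while the command above launches 4 processes. If the per-GPU batch size is left unchanged, the total batch size is halved, and a common response is the linear scaling rule for the learning rate. The sketch below illustrates that arithmetic only; the `base_lr` value is a placeholder, and whether this repository's configuration applies any such scaling is not stated here.

```python
import re


def parse_gpu_batch(config_name):
    """Extract (num_gpus, samples_per_gpu) from an '<N>xb<M>' token in the name."""
    match = re.search(r"(\d+)xb(\d+)", config_name)
    if match is None:
        raise ValueError(f"no '<N>xb<M>' token in {config_name!r}")
    return int(match.group(1)), int(match.group(2))


def scaled_lr(base_lr, base_batch_size, num_gpus, samples_per_gpu):
    """Linear scaling rule: lr changes in proportion to the total batch size."""
    return base_lr * (num_gpus * samples_per_gpu) / base_batch_size


name = "dino-4scale_r50_8xb2-90e_mobile_multi_bbox.py"
gpus, per_gpu = parse_gpu_batch(name)   # (8, 2) for this file name
reference_batch = gpus * per_gpu        # total batch size the name implies

# base_lr=1e-4 is a placeholder, not a value read from the repository config.
lr_on_4_gpus = scaled_lr(1e-4, reference_batch, num_gpus=4, samples_per_gpu=per_gpu)
print(reference_batch, lr_on_4_gpus)
```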
The results logged on wandb.ai after 90 epochs are as follows:
Average Precision (AP) @[ IoU=0.50:0.95 | area= all | maxDets=100 ] = 0.941
Average Precision (AP) @[ IoU=0.50 | area= all | maxDets=1000 ] = 0.962
Average Precision (AP) @[ IoU=0.75 | area= all | maxDets=1000 ] = 0.947
Average Precision (AP) @[ IoU=0.50:0.95 | area= small | maxDets=1000 ] = 0.702
Average Precision (AP) @[ IoU=0.50:0.95 | area=medium | maxDets=1000 ] = 0.897
Average Precision (AP) @[ IoU=0.50:0.95 | area= large | maxDets=1000 ] = 0.943
Average Recall (AR) @[ IoU=0.50:0.95 | area= all | maxDets=100 ] = 0.959
Average Recall (AR) @[ IoU=0.50:0.95 | area= all | maxDets=300 ] = 0.961
Average Recall (AR) @[ IoU=0.50:0.95 | area= all | maxDets=1000 ] = 0.961
Average Recall (AR) @[ IoU=0.50:0.95 | area= small | maxDets=1000 ] = 0.814
Average Recall (AR) @[ IoU=0.50:0.95 | area=medium | maxDets=1000 ] = 0.916
Average Recall (AR) @[ IoU=0.50:0.95 | area= large | maxDets=1000 ] = 0.963
mmengine - INFO - bbox_mAP_copypaste: 0.941 0.962 0.947 0.702 0.897 0.943
mmengine - INFO - Epoch(val) [90][11/11] coco/bbox_mAP: 0.9410 coco/bbox_mAP_50: 0.9620 coco/bbox_mAP_75: 0.9470 coco/bbox_mAP_s: 0.7020 coco/bbox_mAP_m: 0.8970 coco/bbox_mAP_l: 0.9430 data_time: 0.0137 time: 0.2778
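To compare runs, the final validation line can be turned into a dictionary of metrics. This is a small sketch that parses the `coco/...` key-value pairs from the log line shown above; the helper name `parse_val_metrics` is made up for this example and is not part of mmengine.

```python
import re


def parse_val_metrics(log_line):
    """Collect 'coco/<metric>: <value>' pairs from a validation log line."""
    pairs = re.findall(r"(coco/\w+):\s*([0-9.]+)", log_line)
    return {key: float(value) for key, value in pairs}


line = (
    "mmengine - INFO - Epoch(val) [90][11/11] coco/bbox_mAP: 0.9410 "
    "coco/bbox_mAP_50: 0.9620 coco/bbox_mAP_75: 0.9470 coco/bbox_mAP_s: 0.7020 "
    "coco/bbox_mAP_m: 0.8970 coco/bbox_mAP_l: 0.9430 data_time: 0.0137 time: 0.2778"
)
metrics = parse_val_metrics(line)
print(metrics["coco/bbox_mAP"], metrics["coco/bbox_mAP_s"])
```

The gap between `coco/bbox_mAP_s` and the other entries makes it easy to see that small regions are the hardest case in this run.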
Run test.py on the test data with the following command; the visualization results are saved in the folder dino-4scale_r50_8xb2-90e_mobile_multi_bbox_imgs/.
python tools/test.py configs/dino/dino-4scale_r50_8xb2-90e_mobile_multi_bbox.py ./work_dirs/dino-4scale_r50_8xb2-90e_mobile_multi_bbox/epoch_90.pth --show-dir dino-4scale_r50_8xb2-90e_mobile_multi_bbox_imgs/
- Step 3 [Optional]: Use Swin-L as the backbone and train for 12 epochs with the configuration file configs/dino/dino-5scale_swin-l_8xb2-36e_mobile_multi_bbox.py. In our runs, its loss curve was much worse than that of the ResNet-50 backbone.
python tools/train.py configs/dino/dino-5scale_swin-l_8xb2-36e_mobile_multi_bbox.py --train_batch_size 2 --val_batch_size 2 --lr 0.001 --epoch 12 # 12 out of memory during 16
# distributed training
CUDA_VISIBLE_DEVICES=0,1,2,3 ./tools/dist_train_custom_multi_bbox.sh configs/dino/dino-5scale_swin-l_8xb2-36e_mobile_multi_bbox.py 4
- Step 1: Data preparation
Put the ScreenPR dataset under the src folder of the Screen-Point-and-Read GitHub repository, so that its path relative to the root of this project is ../../../data/mobile_pc_web_osworld.
- Step 2: Using our trained ToL model
The pretrained ToL weights are shared under "DINO weights trained by 90 epoch". Save the file to ./work_dirs/dino-4scale_r50_8xb2-90e_mobile_multi_bbox/epoch_90.pth and run the following script to start inference. An output folder named output_dino-4scale_r50_8xb2-90e_mobile_multi_bbox_mobile_pc_web_osworld is generated under the same parent folder, ../../../data/.
python inference_test_screendata.py --input_folder ../../../data/mobile_pc_web_osworld --model_config configs/dino/dino-4scale_r50_8xb2-90e_mobile_multi_bbox.py --checkpoint ./work_dirs/dino-4scale_r50_8xb2-90e_mobile_multi_bbox/epoch_90.pth
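The output folder name described above combines the configuration file stem with the input folder name. The sketch below reproduces that naming so downstream scripts can locate the results; it mirrors the description in this guide rather than the internals of inference_test_screendata.py, and the helper `expected_output_dir` is hypothetical.

```python
from pathlib import Path


def expected_output_dir(input_folder, model_config):
    """Compose output_<config stem>_<input folder name> next to the input folder."""
    input_path = Path(input_folder)
    config_stem = Path(model_config).stem  # file name without the .py suffix
    return input_path.parent / f"output_{config_stem}_{input_path.name}"


out_dir = expected_output_dir(
    "../../../data/mobile_pc_web_osworld",
    "configs/dino/dino-4scale_r50_8xb2-90e_mobile_multi_bbox.py",
)
print(out_dir.as_posix())
```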
- Step 3: Using the original DINO model
Download the original DINO weights, save them to ./work_dirs/dino-4scale_r50_improved_8xb2-12e_coco/dino-4scale_r50_improved_8xb2-12e_coco_20230818_162607-6f47a913.pth, and run the following script to start inference.
python inference_test_screendata_by_dino_original.py --input_folder ../../../data/mobile_pc_web_osworld
- mmdetection preparation
- Customize Datasets
- Dataset customization
- Prepare dataset
- Finetune model
- Train Object Detector with MMDetection and W&B
- Logging analysis
- Inferencer on mmdetection DINO
- Deal with the issue "data['category_id'] = self.cat_ids[label] IndexError: list index out of range #4243"